By and large, you won't ever actually be interested in very many documents.
What's returned in the TopDocs structure is the internal document ID and
score, in score order. But retrieval by document ID is quite efficient; it's
not a search. I'm quite sure this won't be a problem.
Adding 10,000 documents a day means that in 588 years you'll exceed a 31-bit
number. I don't think you really need to worry about that either. And that's
the worst-case, assuming the ints are signed. And I believe that they're
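For anyone who wants to check the 588-year figure, here is a small sketch of the arithmetic. It assumes 365-day years and a flat 10,000 documents a day; the class and method names are made up for illustration. Java ints are signed, so the ceiling used is Integer.MAX_VALUE (2^31 - 1).

```java
// Back-of-the-envelope check of the document ID headroom discussed above.
// Assumptions: a constant ingest rate and 365-day years.
class DocIdHeadroom {
    /** Whole years until docsPerDay additions would exceed maxDocs. */
    static long yearsUntilExhausted(long maxDocs, long docsPerDay) {
        long days = maxDocs / docsPerDay; // whole days of ingest that fit
        return days / 365;                // whole 365-day years
    }

    public static void main(String[] args) {
        long signedLimit = Integer.MAX_VALUE; // 2^31 - 1 = 2,147,483,647
        System.out.println(yearsUntilExhausted(signedLimit, 10_000)); // prints 588
    }
}
```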
What you will have to worry about is the time to get the top N
highest-scoring documents. That is, IndexSearcher.search() will be your
limiting factor long before you reach these numbers. By that time, though,
you'll have moved to Solr or some other distributed search mechanism.
Performance is influenced by the complexity of the queries and the structure
and size of your index. The time spent retrieving the top few matches is
completely dwarfed by the search time for an index of any size.
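To make the "top N" idea concrete, here is a rough sketch of selecting the N best-scoring document IDs with a bounded heap. It is only an illustration of the general technique, not Lucene's actual collector code, and the TopN class and the scores array are hypothetical. The point is that every matching document still has to be scored, but only N entries are ever kept.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Illustration only: keeps the n best scores seen so far in a min-heap.
class TopN {
    /** Returns the indices (standing in for doc IDs) of the n highest scores, best first. */
    static List<Integer> topN(float[] scores, int n) {
        // Min-heap ordered by score: the weakest of the current top n is at the head.
        PriorityQueue<Integer> heap =
            new PriorityQueue<>(Comparator.comparingDouble((Integer id) -> scores[id]));
        for (int id = 0; id < scores.length; id++) {
            if (heap.size() < n) {
                heap.add(id);
            } else if (n > 0 && scores[id] > scores[heap.peek()]) {
                heap.poll(); // evict the current weakest entry
                heap.add(id);
            }
        }
        List<Integer> result = new ArrayList<>(heap);
        result.sort((x, y) -> Float.compare(scores[y], scores[x])); // best first
        return result;
    }
}
```

For example, with scores {0.2f, 0.9f, 0.5f, 0.7f} and n = 2, the result is the IDs 1 and 3, in that order.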
All this may be irrelevant if you really want to retrieve a very large
number of documents rather than, say, the top 100. But the use case would
have to be very interesting for it to be a requirement to return, say,
100,000 documents to a user.
But do be aware that you're not retrieving the *original* text with
IndexSearcher. Typically, the relevant data is indexed but not stored. These
two concepts are confusing when you start using Lucene, especially since
they're specified in the same call. Indexing a field splits it up into
tokens and normalizes them (e.g. lowercases, stems, puts in synonyms, etc.).
The indexed data is the part that's searched. You can also store the input
verbatim, but the stored part is just a copy that's never searched but is
available for retrieval.
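A toy model may help keep the two concepts apart. The sketch below is emphatically not how Lucene is implemented; ToyIndex and its methods are invented for illustration, and the "analysis" is just lowercasing and splitting on non-letters. It shows that searching consults only the indexed tokens, while doc(id) is a plain lookup that returns the verbatim copy only if one was stored.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy model only: NOT Lucene's internal design.
class ToyIndex {
    // "Indexed" side: token -> IDs of documents containing it. Searched, never returned.
    private final Map<String, List<Integer>> postings = new HashMap<>();
    // "Stored" side: verbatim copy by document ID, or null if the text was not stored.
    private final List<String> stored = new ArrayList<>();

    /** Adds a document and returns its internal ID (its position in insertion order). */
    int add(String text, boolean store) {
        int docId = stored.size();
        stored.add(store ? text : null);
        // Crude analysis: lowercase, then split on anything that is not a-z.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.isEmpty()) continue;
            List<Integer> ids = postings.computeIfAbsent(token, k -> new ArrayList<>());
            // Record each document at most once per token.
            if (ids.isEmpty() || ids.get(ids.size() - 1) != docId) ids.add(docId);
        }
        return docId;
    }

    /** Search: looks only at the indexed tokens and returns matching document IDs. */
    List<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptyList());
    }

    /** Retrieval: a direct lookup by ID, with no searching involved. */
    String doc(int docId) {
        return stored.get(docId);
    }
}
```

Adding "This is a Test comment" with store = true and "Another test" with store = false, a search for "test" finds both documents, but only the first one can be handed back as text; doc() on the second returns null.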
Which brings up one of the central decisions you need to make. Are you,
indeed, going to store all the data for retrieval in your index or just
index the relevant text to be searched along with some locator information
to the original document? You mention Cassandra, which leads me to speculate
that it's the latter.
On Sun, Jun 20, 2010 at 4:04 PM, Victor Kabdebon
As I told you, I am quite new to Lucene, so there are many things that
might be wrong.
I'm using Lucene to make a search service for a website that gets a large
amount of new information daily. This information is directly available
as text in a Cassandra database.
There might be as many as 10,000 new documents added daily, and yes, my
concern is whether it is possible to retrieve more documents than the integer max
I also don't really see how IndexSearcher.doc() works, because it seems
like we give this method an ID and it is going to search in the
indexed documents. So what exactly is going to do this
*Or are you concerned about retrieving all documents
containing term "XY" if the number of documents matching is large?*
I'm also concerned about this problem, yes.
Could you explain to me a little how it works, and how Lucene lets one
retrieve a very large number of documents even though it uses int?
Thank you for your answers,
2010/6/20 Simon Willnauer <email@example.com>
Hi, maybe I don't understand your question correctly. Are you asking
if you could run into problems if you retrieve more documents than
integer max value? Or are you concerned about retrieving all documents
containing term "XY" if the number of documents matching is large? If
you are afraid of loading all documents matched from a stored field I
guess you are doing something wrong.
What are you using Lucene for?
On Sun, Jun 20, 2010 at 8:00 PM, Victor Kabdebon
I am new to Apache Lucene and it seems to fit perfectly my needs for my
However I'm a little concerned about something (pardon me if it's a
recurrent question, I've searched the archives but I didn't find something
So here is my case :
I have indexed a few files (about 10) and I'm trying to search for something
trivial in them: the word "test". So after opening everything etc. (assuming
that works too), I do this:
    Term test = new Term("text_comment", "test");
    Query query = new TermQuery(test);
    TopDocs top = searcher.search(query, 10);
I want to recover the first document (I have 2 documents in TopDocs), I do :
I searched a little bit in javadoc and I saw that this method uses
I'm a little bit concerned about this... At the moment, I have 10 documents
so that's ok, but if I want to index let's say 20 files documents, how will
the IndexSearcher.doc(int) be able to retrieve documents ?
Same problem if 100.000 files have the word "test" in "text_comment"
still be able to get these 100.000 documents or is it going to be a problem
Thank you very much.