FAQ
Hello everybody,

I am new to Apache Lucene, and it seems to fit the needs of my
application perfectly.
However, I'm a little concerned about something (pardon me if this is a
recurring question; I searched the archives but didn't find anything
about it).

Here is my case:

I have indexed a few files (about 10) and I'm trying to run a trivial search
on them: the word "test". After opening everything (assuming that part
works), I do this:

    Term test = new Term("text_comment", "test");
    Query query = new TermQuery(test);
    TopDocs top = searcher.search(query, 10);

I want to retrieve the first document (I have 2 documents in TopDocs), so I do:

    searcher.doc(top.scoreDocs[0].doc)

I looked through the javadoc and saw that this method takes an "int" as
its parameter.
That worries me a little. At the moment I have 10 documents, so that's fine,
but if I index, say, 20 documents, how will IndexSearcher.doc(int) be able
to retrieve them?
Likewise, if 100,000 files contain the word "test" in "text_comment", will I
still be able to get those 100,000 documents, or is that going to be a
problem?

Thank you very much.


  • Simon Willnauer at Jun 20, 2010 at 7:14 pm
    Hi, maybe I don't understand your question correctly. Are you asking
    if you could run into problems if you retrieve more documents than
    integer max value? Or are you concerned about retrieving all documents
    containing term "XY" if the number of documents matching is large? If
    you are afraid of loading all documents matched from a stored field I
    guess you are doing something wrong.
    What are you using Lucene for?

    simon

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Victor Kabdebon at Jun 20, 2010 at 8:05 pm
    Hello Simon,

    As I said, I am quite new to Lucene, so there may be many things I have
    wrong.
    I'm using Lucene to build a search service for a website that receives a
    large amount of information daily. This information is directly available
    as text in a Cassandra database.
    There might be as many as 10,000 new documents added daily, and yes, my
    concern is whether it is possible to retrieve more documents than the
    integer max value.
    I also don't really see how IndexSearcher.doc() works: it seems we give
    this method an ID and it then searches the indexed documents. So what
    exactly does IndexSearcher.doc(int) do?

    > Or are you concerned about retrieving all documents
    > containing term "XY" if the number of documents matching is large?

    Yes, I'm also concerned about this problem.

    Could you explain to me a little how it works, and how Lucene lets one
    retrieve a very large number of documents even though it uses int?

    Thank you for your answers,
    Victor

  • Erick Erickson at Jun 21, 2010 at 1:46 am
    By and large, you won't ever actually be interested in very many documents.
    What's returned in the TopDocs structure is the internal document ID and
    score, in score order. Retrieval by document ID is quite efficient; it's
    not a search. I'm quite sure this won't be a problem.

    Adding 10,000 documents a day means that in 588 years you'll exceed a 31-bit
    number. I don't think you really need to worry about that either. Java ints
    are signed, so 31 bits (about 2.1 billion documents) is the ceiling for a
    single index rather than just a worst case.
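
The 588-year figure can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the per-index ceiling is roughly Integer.MAX_VALUE documents (Java's int is a signed 32-bit type) and a constant daily rate; the class and method names are made up for illustration and are not part of Lucene.

```java
// Back-of-the-envelope check of the figure quoted above; not Lucene code.
final class DocIdHeadroom {
    // Assumption: document IDs are Java ints, so a single index tops out
    // at roughly Integer.MAX_VALUE (2^31 - 1) documents.
    static final long MAX_DOCS = Integer.MAX_VALUE;

    /** Whole years until the limit is reached at a constant daily rate. */
    static long yearsUntilLimit(long docsPerDay) {
        long days = MAX_DOCS / docsPerDay;
        return days / 365;
    }

    public static void main(String[] args) {
        System.out.println(yearsUntilLimit(10_000));     // 588
        System.out.println(yearsUntilLimit(1_000_000));  // 5
    }
}
```

At a hypothetical 1,000,000 documents per day the same calculation gives about five years, which is where splitting the data across several indexes starts to matter.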

    What you will have to worry about is the time to get the top N
    highest-scoring documents. That is, IndexSearcher.search() will be your
    limiting factor long before you reach these numbers. By that time, though,
    you'll have moved to SOLR or some other distributed search mechanism.

    Performance is influenced by the complexity of the queries and the structure
    and size of your index. The time spent retrieving the top few matches is
    completely dwarfed by the search time for an index of any size.

    All this may be irrelevant if you really want to retrieve a very large
    number of documents rather than, say, the top 100. But the use case would
    have to be very interesting for it to be a requirement to return, say,
    100,000 documents to a user.
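
To see why asking for the top N is cheap in memory even when many documents match, here is a toy sketch of top-N selection with a bounded min-heap. It is an illustration only, under the assumption that every document already has a score; TopNSketch and Hit are hypothetical names, and this is not presented as Lucene's actual collector code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class TopNSketch {
    /** A hit: internal document ID plus its score. */
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    /** Returns at most n hits, best score first; scores[i] belongs to doc i. */
    static List<Hit> topN(float[] scores, int n) {
        Comparator<Hit> byScore = Comparator.comparingDouble((Hit h) -> h.score);
        // Min-heap: the weakest hit kept so far sits at the head.
        PriorityQueue<Hit> heap = new PriorityQueue<>(byScore);
        for (int doc = 0; doc < scores.length; doc++) {
            if (scores[doc] <= 0f) continue;      // treat non-positive as "no match"
            heap.add(new Hit(doc, scores[doc]));
            if (heap.size() > n) heap.poll();     // evict the weakest, keep size <= n
        }
        List<Hit> result = new ArrayList<>(heap);
        result.sort(byScore.reversed());          // best first, like score order
        return result;
    }

    public static void main(String[] args) {
        for (Hit h : topN(new float[] {0.2f, 0f, 0.9f, 0.5f, 0.7f}, 2)) {
            System.out.println(h.doc + " " + h.score);
        }
    }
}
```

The heap never holds more than n entries, so memory depends on n rather than on the number of matches; the cost that grows with the index is the scoring pass itself.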

    But do be aware that you're not retrieving the *original* text with
    IndexSearcher. Typically, the relevant data is indexed but not stored. These
    two concepts are confusing when you start using Lucene, especially since
    they're specified in the same call. Indexing a field splits it up into
    tokens and normalizes it (e.g. lowercases, stems, puts in synonyms, etc). The
    indexed data is the part that's searched. You can also store the input
    verbatim, but the stored part is just a copy that's never searched but is
    available for retrieval.
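
The indexed/stored distinction can be sketched with a toy in-memory index. This is a simplified illustration, not how Lucene is implemented: ToyIndex and its methods are hypothetical, tokenization is just lowercasing and splitting on non-word characters, and the stored side holds only a key into an external store rather than the text itself.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class ToyIndex {
    // "Indexed" side: term -> IDs of the documents containing it. Searched, never returned.
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // "Stored" side: doc ID -> value kept verbatim. Returned, never searched.
    private final List<String> storedKeys = new ArrayList<>();

    /** Indexes the text, stores only the external key, and returns the new doc ID. */
    int add(String text, String externalKey) {
        int docId = storedKeys.size();
        storedKeys.add(externalKey);
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    /** Term lookup against the indexed side; yields doc IDs only. */
    Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    /** Fetch by doc ID: a direct lookup on the stored side, not a search. */
    String storedKey(int docId) {
        return storedKeys.get(docId);
    }

    public static void main(String[] args) {
        ToyIndex idx = new ToyIndex();
        idx.add("This is a Test comment", "row-1");
        idx.add("Nothing here", "row-2");
        for (int docId : idx.search("test")) {
            System.out.println(docId + " -> " + idx.storedKey(docId));
        }
    }
}
```

The original text is gone once add() returns; only the tokens and the key survive. The key would then be used to fetch the full record from the external database.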

    Which brings up one of the central decisions you need to make. Are you,
    indeed, going to store all the data for retrieval in your index or just
    index the relevant text to be searched along with some locator information
    to the original document? You mention Cassandra, which leads me to speculate
    that it's the latter.

    HTH
    Erick


  • Victor Kabdebon at Jun 21, 2010 at 7:30 am
    Hi Erick,

    Thank you very much for your explanations. 588 years is a rather long way
    off, so you're right, I probably don't need to worry about that problem for
    now.
    To answer your final question: no, indeed, I won't need to store a lot of
    data, just some keys so I can find the data in Cassandra later on.

    If you don't mind, please let me ask you another question:

    Is it really worthwhile to begin with Lucene rather than going directly to
    Solr (or Nutch)? What I mean is: is it about as difficult to implement
    search with Solr and stay with it as to first implement search with
    Lucene and then, when the project becomes very big, move to a new system?
    My goal is to have something that can evolve over time, even if I have
    1 million documents added daily.

    Thank you,
    Victor

  • Erick Erickson at Jun 21, 2010 at 2:44 pm
    They're quite different beasts to use. SOLR will have you up and running
    with some configuration very quickly, and if you're comfortable with servlet
    containers, it'll be even faster. It has a DataImportHandler (DIH) which
    will index data from a database (again, with some configuration, but not
    necessarily programming). SOLR has, out of the box, support for sharding,
    replication, etc.

    Lucene is a pure Java library that you have to write infrastructure for.
    An understanding of Lucene, which SOLR uses under the covers, can be
    quite valuable.

    But from what you've described, I suspect you'll be better off starting
    with SOLR. You can add custom bits to SOLR if you need to, but it'll almost
    certainly be some time before you do, if you ever do. And it won't be as
    likely to be throw-away work as it would be if you started with Lucene and
    then migrated to SOLR.

    Nutch is a web-crawler/indexer, so from what you've described Nutch isn't
    a good match for what you're trying to do.

    HTH
    Erick



Discussion Overview
group: java-user
category: lucene
posted: Jun 20, '10 at 6:01p
active: Jun 21, '10 at 2:44p
posts: 6
users: 3
website: lucene.apache.org
