How to handle more than Integer.MAX_VALUE documents?
Hi,

Lucene uses an integer as the document ID, so does that mean we cannot have
more than 2^31-1 documents within one collection? Even if we use
MultiSearcher, the document ID is still an integer, so it seems this is
still a problem?

We have been using Lucene for some time and our document count is growing
rather rapidly. Maybe this is a much-discussed issue already, but I did not
find the thread; any pointer would be really appreciated.

Thanks very much for your help, Lisheng

  • Lance Norskog at Nov 2, 2010 at 12:58 am
    2 billion is a hard limit. Usually people split an index into multiple
    indexes long before this, and use the parallel multi reader (I think) to
    read from all of the sub-indexes.
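
    A minimal sketch of the split-index approach (editor's illustration,
    assuming the Lucene 3.0-era API; the shard paths are hypothetical). Note
    that MultiReader still exposes int doc IDs, so it does not lift the
    2^31-1 limit by itself; sharding just keeps each index well below it:

        import java.io.File;

        import org.apache.lucene.index.IndexReader;
        import org.apache.lucene.index.MultiReader;
        import org.apache.lucene.search.IndexSearcher;
        import org.apache.lucene.store.FSDirectory;

        public class ShardedSearch {
            public static void main(String[] args) throws Exception {
                // Open each sub-index on its own; every shard is an ordinary index.
                IndexReader shard1 = IndexReader.open(FSDirectory.open(new File("/data/shard1")));
                IndexReader shard2 = IndexReader.open(FSDirectory.open(new File("/data/shard2")));

                // MultiReader presents the sub-indexes as one logical index.
                IndexReader all = new MultiReader(shard1, shard2);
                IndexSearcher searcher = new IndexSearcher(all);
                // ... run queries against searcher here ...
                searcher.close();
                all.close(); // also closes the sub-readers by default
            }
        }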

  • Simon Willnauer at Nov 2, 2010 at 7:03 pm

    This is really the limit of a segment. I think you can write your own
    collector and collect documents with (absolute) doc IDs higher than
    Integer.MAX_VALUE. Yet if you reach the limit of Integer.MAX_VALUE
    documents, I think you should really rethink the way your search works
    and apply some sharding techniques. I haven't been up to that many docs
    in a single index, but I think it should work to have multiple segments
    with Integer.MAX_VALUE documents each, since we search at the segment
    level, provided your collector supports it.

    simon
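
    A sketch of such a collector (editor's illustration, not from the
    thread, assuming the Lucene 3.x Collector API): it keeps its own
    long-valued segment base instead of the int docBase Lucene passes in,
    so hits past Integer.MAX_VALUE stay addressable. In practice you would
    also have to drive the per-segment search yourself, since IndexSearcher
    keeps int doc bases internally as well:

        import java.io.IOException;

        import org.apache.lucene.index.IndexReader;
        import org.apache.lucene.search.Collector;
        import org.apache.lucene.search.Scorer;

        public class LongDocIdCollector extends Collector {
            private long segmentBase; // long-valued start of the current segment
            private long nextBase;    // running total of maxDoc() over all segments
            private long hitCount;    // long, so it cannot overflow at 2^31-1

            @Override
            public void setScorer(Scorer scorer) throws IOException {
                // scores are not needed for this sketch
            }

            @Override
            public void setNextReader(IndexReader reader, int docBase) throws IOException {
                // Ignore the int docBase Lucene hands us (it would overflow
                // past 2^31-1) and maintain our own long-valued base instead.
                segmentBase = nextBase;
                nextBase += reader.maxDoc();
            }

            @Override
            public void collect(int doc) throws IOException {
                long globalDoc = segmentBase + doc; // stays correct past Integer.MAX_VALUE
                hitCount++;
                // ... record globalDoc, e.g. in a hit queue ...
            }

            @Override
            public boolean acceptsDocsOutOfOrder() {
                return true;
            }
        }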
  • Lance Norskog at Nov 3, 2010 at 2:00 am
    You would have to control your MergePolicy so it doesn't collapse
    everything back to one segment.
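
    A sketch of capping segment size this way (editor's illustration,
    assuming the Lucene 3.0-era IndexWriter API; the index path and the
    100M-document cap are illustrative values, not recommendations):

        import java.io.File;

        import org.apache.lucene.analysis.standard.StandardAnalyzer;
        import org.apache.lucene.index.IndexWriter;
        import org.apache.lucene.store.FSDirectory;
        import org.apache.lucene.util.Version;

        public class CappedSegments {
            public static void main(String[] args) throws Exception {
                IndexWriter writer = new IndexWriter(
                    FSDirectory.open(new File("/data/index")),
                    new StandardAnalyzer(Version.LUCENE_30),
                    IndexWriter.MaxFieldLength.UNLIMITED);

                // With the default LogMergePolicy, segments that reach this
                // many documents are never merged further, so the index
                // cannot collapse back into one giant segment.
                writer.setMaxMergeDocs(100000000);
                writer.close();
            }
        }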

  • Zhang, Lisheng at Nov 3, 2010 at 6:09 am
    Hi,

    Thanks very much for your help!

    Your point is well taken, and it may cover most use cases, but it seems
    to me that in principle the limit is not just per segment: suppose
    within one index we have 3 segments, each with close to 2^31-1 docs;
    if I then need to loop through most docs in all three segments, we
    would still have a problem?

    The use case is a rare one: if a user searches for a word that appears
    in most docs and we use pagination, and the user just wants the last
    few pages (the lowest-ranked hits), then we have to call search with a
    very large nDocs (possibly beyond Integer.MAX_VALUE).

    Best regards, Lisheng
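
    (Editor's illustration of the constraint Lisheng describes: the search
    call's result window is int-typed; query, searcher, and nDocs are
    assumed to exist.)

        TopDocs top = searcher.search(query, nDocs); // nDocs is an int parameter
        ScoreDoc[] hits = top.scoreDocs;             // and hit doc IDs are ints too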

  • Simon Willnauer at Nov 3, 2010 at 6:27 am

    On Wed, Nov 3, 2010 at 3:00 AM, Lance Norskog wrote:
    You would have to control your MergePolicy so it doesn't collapse
    everything back to one segment.
    maxMergeDocs is an int too, though!

    simon
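
    (Editor's note: the setter Simon refers to is indeed int-typed; reusing
    the writer from the merge-policy sketch above, the cap itself therefore
    tops out at 2^31-1.)

        writer.setMaxMergeDocs(Integer.MAX_VALUE); // the argument is an int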
  • Zhang, Lisheng at Nov 3, 2010 at 3:56 pm
    Thanks very much, I got it.

