I'm converting a Lucene 2.3.2 deployment to 2.4.1 (with a view to going to
2.9.4).

Many of our indexes hold 5M+ documents, but only a small subset of these is
relevant to any one user. Since a DocIdSet backed by a BitSet or OpenBitSet is
rather inefficient in terms of memory use for such sparse sets, what is the
recommended DocIdSet implementation to use in this scenario?

Seems like SortedVIntList can be used to store the info, but it has no methods
to build the list in the first place, requiring an array or bitset in the
constructor.

I had used Nutch's DocSet and HashDocSet implementations in my 2.3.2 deployment,
but want to move away from that Nutch dependency, so wondered if Lucene had a
way to do this?

Thanks

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
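
To put rough numbers on the memory concern in the question: a bitset needs one bit per document in the index no matter how many bits are set, while a sorted list of ids grows only with the number of matches. The sketch below just does that arithmetic. The 5,000,000 figure comes from the question; the 10,000 matches per user is a made-up illustration, the class and method names are hypothetical, and JVM object overhead is ignored.

```java
public class DocSetFootprint {
    // Bytes for a bitset with one bit per document, stored in 64-bit words.
    static long bitSetBytes(int maxDoc) {
        long words = (maxDoc + 63L) / 64L;
        return words * 8L;
    }

    // Bytes for a plain sorted int[] holding only the matching doc ids.
    static long sortedIntArrayBytes(int matches) {
        return matches * 4L;
    }

    public static void main(String[] args) {
        int maxDoc = 5000000;  // index size mentioned in the question
        int matches = 10000;   // assumed per-user subset, for illustration only
        System.out.println("bitset bytes:       " + bitSetBytes(maxDoc));
        System.out.println("sorted int[] bytes: " + sortedIntArrayBytes(matches));
    }
}
```

Under these assumptions the bitset costs 625,000 bytes per user against 40,000 bytes for the sorted array, which is why a sparse representation looks attractive here.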

  • Michael McCandless at Apr 5, 2011 at 10:20 am
    Can we simply factor out (poach!) those useful-sounding classes from
    Nutch into Lucene?

    Mike

    http://blog.mikemccandless.com
  • Jason Rutherglen at Apr 5, 2011 at 2:54 pm
    I think Solr has a HashDocSet implementation?
  • Michael McCandless at Apr 5, 2011 at 3:06 pm
    This (HashDocSet, and any other impls that handle the sparse case
    well) could be useful to have in Lucene's core.

    For example, for certain MultiTermQuerys we have this
    CONSTANT_SCORE_AUTO_REWRITE, which has iffy smelling heuristics to try
    to determine the best cutover point from
    ConstantScoreQuery(BooleanQuery(<OR of Terms>)) to FILTER_REWRITE,
    because FILTER_REWRITE is costly in the sparse case.

    Mike

    http://blog.mikemccandless.com
  • Yonik Seeley at Apr 5, 2011 at 3:02 pm

    On Tue, Apr 5, 2011 at 2:24 AM, Antony Bowesman wrote:
    > Seems like SortedVIntList can be used to store the info, but it has no
    > methods to build the list in the first place, requiring an array or bitset
    > in the constructor.

    It has a constructor that takes DocIdSetIterator - so you can pass an
    iterator obtained from anywhere else (a Scorer actually is a
    DocIdSetIterator, and you can get a DocIdSet from a Filter), or
    implement your own. It's a simple iterator interface.


    -Yonik
    http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
    25-26, San Francisco
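
The last reply suggests feeding an iterator of doc ids into a compact sorted list. As a rough illustration of why that saves memory for sparse sets, here is a standalone sketch that consumes increasing ids from an iterator and stores only the gaps between them as variable-length ints. It is not Lucene's SortedVIntList code: `CompactDocIdList` and its methods are hypothetical names, and `java.util.PrimitiveIterator.OfInt` stands in for DocIdSetIterator so that the example runs without Lucene on the classpath.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.PrimitiveIterator;
import java.util.stream.IntStream;

public class CompactDocIdList {
    private final byte[] bytes; // gap-encoded doc ids, 7 payload bits per byte
    private final int size;     // number of doc ids stored

    // Consumes strictly increasing, non-negative doc ids from the iterator.
    public CompactDocIdList(PrimitiveIterator.OfInt sortedDocIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        int count = 0;
        while (sortedDocIds.hasNext()) {
            int doc = sortedDocIds.nextInt();
            if (doc < 0 || (count > 0 && doc <= previous)) {
                throw new IllegalArgumentException("doc ids must be non-negative and strictly increasing");
            }
            int gap = doc - previous;
            // Low 7 bits first; the high bit marks "more bytes follow".
            while ((gap & ~0x7F) != 0) {
                out.write((gap & 0x7F) | 0x80);
                gap >>>= 7;
            }
            out.write(gap);
            previous = doc;
            count++;
        }
        this.bytes = out.toByteArray();
        this.size = count;
    }

    public int size() {
        return size;
    }

    public int sizeInBytes() {
        return bytes.length;
    }

    // Decodes the gaps back into absolute doc ids.
    public int[] toArray() {
        int[] docs = new int[size];
        int pos = 0;
        int previous = 0;
        for (int i = 0; i < size; i++) {
            int gap = 0;
            int shift = 0;
            byte b;
            do {
                b = bytes[pos++];
                gap |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            previous += gap;
            docs[i] = previous;
        }
        return docs;
    }

    public static void main(String[] args) {
        CompactDocIdList list =
                new CompactDocIdList(IntStream.of(3, 130, 131, 4000000).iterator());
        System.out.println(list.size() + " ids in " + list.sizeInBytes() + " bytes");
        System.out.println(Arrays.toString(list.toArray()));
    }
}
```

The trade-off in a layout like this is that the list can only be walked sequentially, so it suits iteration but not random membership checks, where a hash-based set such as the HashDocSet mentioned earlier in the thread may be a better fit.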

Discussion Overview
group: java-user
categories: lucene
posted: Apr 5, '11 at 6:25a
active: Apr 5, '11 at 3:06p
posts: 5
users: 4
website: lucene.apache.org