field sorted searches with unbounded hit count
For the searches I want to run on my index I want to return all matching
documents (as opposed to N top hits).



My first naïve approach was just to use Searcher.search(query, filter,
Integer.MAX_VALUE, sort) – that is, pass Integer.MAX_VALUE for the number of
possible docs to return. That unfortunately seems to have huge heap
requirements in org.apache.lucene.util.PriorityQueue.heap as the max docID
in my index gets large. Multiply that per search heap requirement by a
handful of concurrent threads and I OOME my server.



When I don’t need to do any sorting it’s pretty easy to just use my own
collector to gather the doc ids. Of course, depending on the number of hits
I might still need a good amount of heap, but at least it’s a factor of the
number of matches (not the index size).
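A collector along these lines can be quite small. Here is a sketch against the 3.x `Collector` API (the class name `AllDocsCollector` is made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

// Gathers every matching docID into a growable list, so heap use is
// proportional to the number of hits rather than to the index size.
public class AllDocsCollector extends Collector {
  private final List<Integer> docIds = new ArrayList<Integer>();
  private int docBase;

  @Override public void setScorer(Scorer scorer) { /* scores not needed */ }

  @Override public void setNextReader(IndexReader reader, int docBase) {
    this.docBase = docBase; // remember the current segment's offset
  }

  @Override public void collect(int doc) {
    docIds.add(docBase + doc); // convert to a top-level docID
  }

  @Override public boolean acceptsDocsOutOfOrder() {
    return true; // order doesn't matter when we only gather IDs
  }

  public List<Integer> getDocIds() { return docIds; }
}
```

Usage would be `searcher.search(query, filter, new AllDocsCollector())`, then read the IDs back with `getDocIds()`.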



I’m struggling to figure out how to do the same search but with sorting.
I’m looking for a method like Searcher.search(Query, Filter, Sort,
Collector), but perhaps that isn’t a reasonable thing to have; please
enlighten me if so :-)



I’m using 3.0.3 lucene-core at the moment but I don’t see that this aspect
is any different in 3.2.0.



Hopefully this made sense, any help you can provide is appreciated.


  • Ian Lea at Jun 23, 2011 at 10:13 am
One possibility would be to execute the search first just to get the
number of hits - see TotalHitCountCollector in recent versions of
lucene, not sure when it was added - and use the hit count from that
as the max docs to return. The counting-only search would typically
be very quick, certainly much quicker than sorting a large number of
hits.


    --
    Ian.
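Ian's two-pass idea can be sketched as follows, assuming Lucene 3.1+ for TotalHitCountCollector (the helper class and method names are made up):

```java
import java.io.IOException;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.TopFieldDocs;
import org.apache.lucene.search.TotalHitCountCollector;

// Two-pass "return everything, sorted": first count the hits cheaply (no
// priority queue is allocated), then run the sorted search asking for
// exactly that many results. Both passes go through the same IndexSearcher,
// so they see the same point-in-time view of the index.
public final class MatchAllSorted {
  public static TopFieldDocs searchAllSorted(IndexSearcher searcher,
      Query query, Filter filter, Sort sort) throws IOException {
    // Pass 1: counting-only search.
    TotalHitCountCollector counter = new TotalHitCountCollector();
    searcher.search(query, filter, counter);
    int hits = counter.getTotalHits();

    // Pass 2: sorted search with the queue sized to the real hit count
    // (numHits must be at least 1, so clamp for the no-match case).
    return searcher.search(query, filter, Math.max(hits, 1), sort);
  }
}
```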

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Tim Eck at Jun 23, 2011 at 8:42 pm
Thanks for the idea Ian. I still need to think about it, but the race between running the total count search and then the sorted search worries me. I have very specific visibility guarantees I must provide on this data (with respect to concurrent updates). It'd be a bummer to have to block all concurrent updates to get these two searches to operate on an unchanging index.

    I don't want to accuse anyone of bad code but always preallocating a potentially large array in org.apache.lucene.util.PriorityQueue seems non-ideal for the search I want to run. I'll have to dig into some more lucene code :-)

    FYI: TotalHitCountCollector looks like it was added in 3.1.0



    -----Original Message-----
    From: Ian Lea
    Sent: Thursday, June 23, 2011 3:12 AM
    To: java-user@lucene.apache.org
    Subject: Re: field sorted searches with unbounded hit count

  • Simon Willnauer at Jun 24, 2011 at 5:16 am

On Thu, Jun 23, 2011 at 10:41 PM, Tim Eck wrote:
> Thanks for the idea Ian. I still need to think about it, but the race between running the total count search and then the sorted search worries me. I have very specific visibility guarantees I must provide on this data (with respect to concurrent updates). It'd be a bummer to have to block all concurrent updates to get these two searches to operate on an unchanging index.

if you use the same IndexReader / Searcher for both queries nothing
changes. How frequently do you open your index?

> I don't want to accuse anyone of bad code but always preallocating a potentially large array in org.apache.lucene.util.PriorityQueue seems non-ideal for the search I want to run. I'll have to dig into some more lucene code :-)

the common use case for this is a fixed-size queue (top-k retrieval)
and allocating memory takes time, so this is a very specialized class
for exactly this. You can still write your own collector to make this
more efficient for you.

    simon
  • Tim Eck at Jun 24, 2011 at 6:14 pm

> if you use the same IndexReader / Searcher for both queries nothing changes. How frequently do you open your index?

I'm currently using the "real-time" readers from IndexWriter.getReader() and never closing my IndexWriter. I was (perhaps wrongly) assuming that those readers can observe mutations that have occurred after creating them. If my assumption is wrong then I guess I don't have a race, and I'll try the approach of using a hit-count-only query first and then the real sorted search.

With regard to a collector -- it isn't immediately clear to me how I would go about using/writing my own collector if I want to use an arbitrary org.apache.lucene.search.Sort. There is no IndexSearcher.search() method that takes both a Sort and a Collector as far as I can tell.

    p.s. Thanks Simon and Toke for the responses!
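For what it's worth, the 3.x API does let you pair a Sort with the Collector-based search() overload via TopFieldCollector, which is the collector the sorting search methods use internally. A minimal sketch (the wrapper class name is made up; note that create() still takes numHits up front, so this alone does not avoid the preallocation concern):

```java
import java.io.IOException;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.TopFieldCollector;

// Runs a sorted search through the Collector-taking search() method by
// creating the sorting collector explicitly.
public final class SortedCollectorSearch {
  public static TopDocs search(IndexSearcher searcher, Query query,
                               Filter filter, Sort sort, int numHits)
      throws IOException {
    TopFieldCollector collector = TopFieldCollector.create(
        sort,
        numHits,
        true,   // fillFields: populate FieldDoc.fields in the results
        false,  // trackDocScores: scores not needed here
        false,  // trackMaxScore
        false); // docsScoredInOrder: false is always safe
    searcher.search(query, filter, collector);
    return collector.topDocs();
  }
}
```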

  • Michael McCandless at Jun 26, 2011 at 1:00 pm

    On Fri, Jun 24, 2011 at 2:14 PM, Tim Eck wrote:

> I'm currently using the "real-time" readers from IndexWriter.getReader() and never closing my IndexWriter. I was (perhaps wrongly) assuming that those readers can observe mutations that have occurred after creating them.

Actually, the NRT reader (from IW.getReader() or, in newer releases,
IR.open(IW)) is still a point-in-time reader, i.e., it will only reflect
changes done in IW before it was opened.

    Mike McCandless

    http://blog.mikemccandless.com
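Mike's point can be demonstrated with a small self-contained program against the 3.0-era API (RAMDirectory and the constructors that were later deprecated); the class name is made up:

```java
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.store.RAMDirectory;

// Demonstrates the point-in-time behaviour of NRT readers: a reader from
// IndexWriter.getReader() sees only the documents added before it was opened.
public class NrtSnapshotDemo {

  // Returns how many docs a snapshot reader sees when one more document
  // is added to the writer after the reader was opened.
  public static int visibleToSnapshot() throws Exception {
    RAMDirectory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(),
        IndexWriter.MaxFieldLength.UNLIMITED);
    Document doc = new Document();
    doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));

    writer.addDocument(doc);
    IndexReader snapshot = writer.getReader(); // point-in-time view: one doc
    writer.addDocument(doc);                   // added AFTER the reader opened

    IndexSearcher searcher = new IndexSearcher(snapshot);
    int visible = searcher.search(new MatchAllDocsQuery(), 10).totalHits;

    searcher.close();
    snapshot.close(); // NRT readers must be closed by the caller
    writer.close();
    return visible;   // the later add is invisible to the snapshot
  }

  public static void main(String[] args) throws Exception {
    System.out.println(visibleToSnapshot());
  }
}
```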

  • Tim Eck at Jun 30, 2011 at 4:55 pm
Thanks for the confirmation Mike, two-pass search it is. I appreciate the
knowledge on this list very much!

  • Toke Eskildsen at Jun 24, 2011 at 6:39 am

    On Thu, 2011-06-23 at 22:41 +0200, Tim Eck wrote:
> I don't want to accuse anyone of bad code but always preallocating a
> potentially large array in org.apache.lucene.util.PriorityQueue seems
> non-ideal for the search I want to run.

The current implementation of IndexSearcher uses threaded search where
each slice collects docIDs independently, then adds them to a shared
PriorityQueue one at a time. With this architecture, making the
PriorityQueue size-optimized would either require multiple resizings
(more GC activity, slightly more processing) or waiting for all search
threads to finish before constructing the queue (longer response time).

The current implementation works really well when requesting small
result sets. It is not so fine for larger sets (partly because of memory
allocation, partly because the standard heap-based priority queue has
horrible locality, making it perform rather badly when it cannot be
contained in the cache) and - as you have observed - really bad for the
full document set. Finding a better general solution that covers all
three cases is a real challenge, a very interesting one I might add.
Of course one can always special-case, but using a Collector seems like
the way to go there.
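The collect-then-sort alternative Toke alludes to can be illustrated in plain Java (all names made up): gather {sortValue, docId} pairs into a growable list during collection, then sort once at the end, so heap use tracks the number of matches and the single sort has good locality.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Sorts collected hits once, after collection, instead of maintaining a
// preallocated priority queue during the search.
public final class CollectThenSort {
  // Each element is {sortValue, docId}.
  public static int[] sortedDocIds(List<long[]> hits) {
    Collections.sort(hits, new Comparator<long[]>() {
      public int compare(long[] a, long[] b) {
        if (a[0] != b[0]) return a[0] < b[0] ? -1 : 1;    // by sort value
        return a[1] < b[1] ? -1 : a[1] == b[1] ? 0 : 1;   // tie-break on docId
      }
    });
    int[] out = new int[hits.size()];
    for (int i = 0; i < out.length; i++) out[i] = (int) hits.get(i)[1];
    return out;
  }
}
```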



Discussion Overview
group: java-user
categories: lucene
posted: Jun 22, '11 at 9:15p
active: Jun 30, '11 at 4:55p
posts: 9
users: 6
website: lucene.apache.org
