Fastest Method for Searching (need all results)
My index contains approximately 5 million documents. During a
search, I need to grab the value of a field for every document in
the result set. I am currently using a HitCollector to search.
Below is my code:

searcher.search(query, new HitCollector() {
    public void collect(int doc, float score) {
        if (searcher.doc(doc).get("SYM") != null) {
            addSymbolsToHash(searcher.doc(doc).get("SYM").split("ENDOFSYM"));
        }
    }
});

This is fairly fast for small and medium-sized result sets. However,
it gets slow as the result set grows. I read this on HitCollector's
API page:

"For good search performance, implementations of this method should
not call Searcher.doc(int) or Reader.document(int) on every document
number encountered. Doing so can slow searches by an order of
magnitude or more."

Along with this implementation, I've also tried using FieldCache.
This fared better with large result sets, but worse with small
and medium-sized ones. Anyone have ideas about what the best
approach might be?

Thanks a lot,
Ryan
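
A minimal, self-contained sketch of what the addSymbolsToHash helper above presumably does. The HashSet-based accumulation is an assumption; the "SYM" field and "ENDOFSYM" delimiter come from the code in the question:

```java
import java.util.HashSet;
import java.util.Set;

public class SymbolCollector {
    // Assumed implementation: accumulate unique symbols across all hits.
    private final Set<String> symbols = new HashSet<String>();

    // Mirrors the addSymbolsToHash(...) call in the collector above.
    public void addSymbolsToHash(String[] syms) {
        for (String s : syms) {
            symbols.add(s);
        }
    }

    public Set<String> getSymbols() {
        return symbols;
    }

    public static void main(String[] args) {
        SymbolCollector c = new SymbolCollector();
        // A stored "SYM" field value: symbols joined by the ENDOFSYM marker.
        c.addSymbolsToHash("BRCA1ENDOFSYMTP53".split("ENDOFSYM"));
        c.addSymbolsToHash("TP53".split("ENDOFSYM"));
        System.out.println(c.getSymbols().size()); // prints 2
    }
}
```

Note that because the set only grows with distinct values, the per-hit cost here is small; the expensive part in the question is the searcher.doc(doc) call, not the hashing.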

Search Discussions

  • Otis Gospodnetic at Jul 21, 2006 at 6:58 pm
    I haven't had the chance to use this new feature yet, but have you tried selective field loading, so that you can load only that one field from your index and not all of them?

    Otis

    ----- Original Message ----
    From: Ryan O'Hara <ohara@genome.chop.edu>
    To: java-user@lucene.apache.org
    Sent: Friday, July 21, 2006 2:43:41 PM
    Subject: Fastest Method for Searching (need all results)




    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Mark Miller at Jul 21, 2006 at 7:19 pm
    Provides a new api, IndexReader.document(int doc, String[] fields). A document containing
    only the specified fields is created. The other fields of the document are not loaded, although
    unfortunately uncompressed strings still have to be scanned because the length information
    in the index is for UTF-8 encoded chars and not bytes. This is useful for applications that
    need quick access to a small subset of the fields. It can be used in conjunction with or
    for some uses instead of ParallelReader.

    Does this mean that you must be compressing the fields to really take advantage of this? Or does 'scanned' not imply a load?

    - mark


  • Ryan O'Hara at Jul 21, 2006 at 7:26 pm

    I have not tried selective field loading, but it sounds like a good
    idea. What class is it in? Any more information would be
    appreciated. Thanks again.

    Ryan

  • Mark Miller at Jul 21, 2006 at 6:59 pm

    Perhaps I am speaking too quickly, but I would start by not grabbing
    the value of the field for every document in the result set. Will
    someone actually see that value, or use it for a couple million
    hits? Could be, I suppose... but if not, then axe it. Grab the first
    few thousand (or MUCH less), and if they need more, head back in and
    grab more.


    - mark

  • Ryan O'Hara at Jul 21, 2006 at 7:11 pm

    I need all values of a certain field from each document. More
    specifically, I need a compilation of all symbols in the result set.

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Eks dev at Jul 21, 2006 at 8:12 pm
    Have you tried collecting only the doc IDs, to see whether the speed problem is there, or fetching only the field values? If you have dense results, it can easily be split() or addSymbolsToHash() that takes the time.

    I see three possibilities for what could be slow: getting the doc IDs, fetching the field value, or doing something with that value.

    It would be interesting to know what you find here.

    Yeah, I know, it sounds too naive, but sometimes repeating the obvious helps.
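
    The isolation test suggested above can be approximated outside Lucene: time just the split() and hash work on synthetic field values, then compare against the full search. A rough sketch (the value layout is made up; timings are illustrative only):

```java
import java.util.HashSet;
import java.util.Set;

public class SplitCost {
    public static void main(String[] args) {
        // Synthetic "SYM" field values, as Lucene would hand them back.
        int docs = 100000;
        String value = "ABCENDOFSYMDEFENDOFSYMGHI";

        Set<String> symbols = new HashSet<String>();
        long t0 = System.nanoTime();
        for (int i = 0; i < docs; i++) {
            for (String s : value.split("ENDOFSYM")) {
                symbols.add(s);
            }
        }
        long t1 = System.nanoTime();

        System.out.println("unique symbols: " + symbols.size()); // 3
        System.out.println("split+hash ms: " + (t1 - t0) / 1000000);
    }
}
```

    If this loop is fast on its own, the bottleneck is the per-document field fetch, which points back at the Searcher.doc(int) warning from the HitCollector javadoc.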

  • Ryan O'Hara at Aug 2, 2006 at 8:35 pm
    eks dev,

    The best way of looping through all results that I have come
    across is using a HitCollector and grabbing the field values via
    FieldCache. This works under two conditions: 1) The FieldCache
    arrays are initialized only once, since creating these arrays
    involves serious overhead, especially if you have millions of
    documents in your index. I use Tomcat as my application server,
    so I accomplished this by creating a Listener class that
    implements ServletContextListener. This way, when Tomcat
    restarts, the contextInitialized method in the Listener class is
    executed, initializing the arrays only once. These arrays are
    then accessible to all users across all sessions. 2) You have
    enough RAM to store the arrays. If you are dealing with millions
    of documents, you can easily use up hundreds of megabytes of RAM,
    so keep this in mind. Just thought I would let you know how I
    made out. Thanks again.

    Ryan
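
    The FieldCache approach described here relies on FieldCache.DEFAULT.getStrings(reader, "SYM") returning a String[] indexed by document number, built once per reader. collect() then becomes a plain array lookup instead of a stored-document fetch. A pure-Java simulation of that pattern (the Lucene calls are elided; the array contents are synthetic):

```java
import java.util.HashSet;
import java.util.Set;

public class FieldCachePattern {
    public static void main(String[] args) {
        // Stand-in for FieldCache.DEFAULT.getStrings(reader, "SYM"):
        // one entry per document number; null where the field is absent.
        String[] symCache = {"AENDOFSYMB", null, "C", "B"};

        // Stand-in for the doc ids delivered to HitCollector.collect().
        int[] hits = {0, 2, 3};

        Set<String> symbols = new HashSet<String>();
        for (int doc : hits) {
            String v = symCache[doc]; // array lookup, not searcher.doc(doc)
            if (v != null) {
                for (String s : v.split("ENDOFSYM")) {
                    symbols.add(s);
                }
            }
        }
        System.out.println(symbols.size()); // prints 3
    }
}
```

    At roughly 5 million documents, the cache array alone holds five million String references plus the strings themselves, which is consistent with the warning above about hundreds of megabytes of RAM.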


Discussion Overview
group: java-user
categories: lucene
posted: Jul 21, '06 at 6:43p
active: Aug 2, '06 at 8:35p
posts: 8
users: 4
website: lucene.apache.org
