Hi folks,

I need to collect some global information from my first 1000 search results
in order to build some search-refining components containing only
relevant values (those which correspond to at least one of the first 1000
hits). For example, the results are products and there is a store filter
component that shows only the stores that sell a product among the first
1000 hits. So even if the user sees just the first 20, I would have to
inspect the first 1000. I've read that Hits maintains a cache of about 100 or
200 hits. Is this configurable? If I could set this cache to 1000, I would
then use Hits to browse the search results. Otherwise, I should use
HitCollector. What's your advice?

TIA
Cheers,
Carlos


  • Erick Erickson at May 24, 2007 at 1:20 pm
    I know of no way to alter the Hits behavior; I recommend using
    a TopDocs/TopDocCollector.

    But be aware that if you load the document for each hit, you may incur
    a significant penalty, although lazy loading helped me a lot; see
    FieldSelector.
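
A minimal sketch of what a top-N collector does conceptually, for anyone weighing Hits against a collector: it keeps only the n best-scoring hits in a bounded min-heap, so asking for 1000 costs memory for 1000 entries no matter how many documents match. `TopNCollector` and `ScoredHit` are hypothetical stand-ins written in plain Java, not Lucene's TopDocCollector or ScoreDoc.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical stand-in for a scored hit: an internal doc id plus its score. */
final class ScoredHit {
    final int doc;
    final float score;

    ScoredHit(int doc, float score) {
        this.doc = doc;
        this.score = score;
    }
}

/** Keeps the n best hits seen so far; a model of the idea, not Lucene's class. */
final class TopNCollector {
    private final int n;
    // Min-heap on score: the weakest retained hit sits at the head and is evicted first.
    private final PriorityQueue<ScoredHit> heap =
        new PriorityQueue<>(Comparator.comparingDouble((ScoredHit h) -> h.score));
    private int totalHits = 0;

    TopNCollector(int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        this.n = n;
    }

    /** Called once per matching document, in any order. */
    void collect(int doc, float score) {
        totalHits++;
        if (heap.size() < n) {
            heap.add(new ScoredHit(doc, score));
        } else if (score > heap.peek().score) {
            heap.poll();
            heap.add(new ScoredHit(doc, score));
        }
    }

    int totalHits() {
        return totalHits;
    }

    /** Returns the retained hits, best score first. */
    List<ScoredHit> topHits() {
        List<ScoredHit> out = new ArrayList<>(heap);
        out.sort((a, b) -> Float.compare(b.score, a.score));
        return out;
    }
}
```

With n = 1000, topHits() yields the doc ids to inspect for the store filter; producing that list does not require loading any stored document.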
  • Carlos Pita at May 24, 2007 at 2:52 pm
    Hi Erick,

    Thank you for your prompt answer. What do you mean by loading the document?
    Accessing one of the stored fields? In that case I'm afraid I would need to
    do it. For example, in the aforementioned case of product results, I
    have to look at each product's store_id, which is stored along with the document. Is
    this a performance killer? Maybe I should keep some tables in memory, for
    example an array mapping from id to store_id in O(1). I will do some
    benchmarking first anyway.

    Cheers,
    Carlos
  • Erick Erickson at May 24, 2007 at 4:35 pm
    You're on the right track. That said, access to anything that's
    indexed (stored or not) should be pretty quick. Things that are
    stored but not indexed are costlier. This might drive your
    decision on what to index vs. store.

    Loading the document means anything like IndexReader.document() or
    Hits.doc().

    Part of the difference is that if you load the document, you get
    all the fields, whether you need them or not.

    Also, you can use your own TermEnum/TermDocs lookup for
    this kind of thing if the terms you're interested in are indexed.

    I wrote a mail some time ago detailing my experience, in my
    situation and with my peculiar data set, that you may want to read;
    see:

    Lucene 2.1, using FieldSelector speeds up my app by a factor of 10+,

    As I mentioned in that message, I suspect that my improvement was
    *highly* dependent upon how the index is structured.

    All that said, your notion of benchmarking is a very good one. It led
    me to using FieldSelector in the first place.

    Best
    Erick
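
One possible reading of the TermEnum/TermDocs suggestion, sketched in plain Java rather than against the Lucene API: walk the terms of the store_id field and, for each term, check whether any of its documents is among the hits. The `postings` map is a hypothetical in-memory stand-in for what a term and document enumeration would yield, and `PostingsFacets` is an illustrative name.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class PostingsFacets {
    /**
     * postings: for each store_id term, the doc ids that contain it
     * (hypothetical stand-in for enumerating an indexed field).
     * hits: one bit set per doc id among the first N results.
     * Returns the store ids that have at least one doc among the hits.
     */
    static Set<String> storesWithHits(Map<String, int[]> postings, BitSet hits) {
        Set<String> stores = new TreeSet<>();
        for (Map.Entry<String, int[]> e : postings.entrySet()) {
            for (int doc : e.getValue()) {
                if (hits.get(doc)) {
                    stores.add(e.getKey());
                    break; // one matching doc is enough to show this store
                }
            }
        }
        return stores;
    }
}
```

The work here grows with the number of postings in the field rather than with the number of hits, so it tends to suit fields with few distinct values; benchmarking on the real index is the only way to know.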
  • Carlos Pita at May 24, 2007 at 4:50 pm
    Hi Erick,

    I don't think that FieldSelector would be that valuable in my case because I
    just need to access a few fields, and those are all fields that are in fact
    stored (and indexed too). I was thinking of keeping this extra information
    in memory, precisely into an array mapping doc ids to the data structure. I
    see that this is done for ScoreDocComparator in a Lucene in Action example.
    I'm still not sure how to achieve something similar with a HitCollector. I
    mean, I could instantiate a maxDoc()-sized array and index it by the document
    ids that are passed to the collector. But that said, I don't know how to
    keep this array synchronized with the index. I've opened a new thread for
    this subject, "maxDoc and arrays".

    Thank you again.
    Cheers,
    Carlos
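
A rough sketch of the array idea described above, assuming an int store id per document. `HitCallback` is a local stand-in that only mirrors the collect(doc, score) callback shape of a hit collector, and `StoreFacetCollector` is a hypothetical name; neither is a Lucene type.

```java
import java.util.Set;
import java.util.TreeSet;

/** Local stand-in mirroring the collect(doc, score) callback shape of a hit collector. */
interface HitCallback {
    void collect(int doc, float score);
}

/** Gathers the distinct store ids of the collected docs via an O(1) array lookup. */
final class StoreFacetCollector implements HitCallback {
    // Indexed by internal doc id; its length is assumed to equal maxDoc().
    private final int[] storeIdByDoc;
    private final Set<Integer> stores = new TreeSet<>();

    StoreFacetCollector(int[] storeIdByDoc) {
        this.storeIdByDoc = storeIdByDoc;
    }

    @Override
    public void collect(int doc, float score) {
        stores.add(storeIdByDoc[doc]);
    }

    Set<Integer> stores() {
        return stores;
    }
}
```

Note that a collector is called for every matching document, not in score order, so this version reports the stores of all matches. To restrict it to the first 1000, the hits would need to be ranked first and only the surviving doc ids looked up in the array.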
  • Chris Hostetter at May 24, 2007 at 7:29 pm
    : just need to access a few fields, and those are all fields that are in fact
    : stored (and indexed too). I was thinking of keeping this extra information
    : in memory, precisely into an array mapping doc ids to the data structure. I

    If the fields you need are indexed and single-valued (and untokenized), the
    FieldCache will do this for you.

    It is also what is used for sorting, so using it instead of your own
    hand-rolled cache will save you some space, not to mention simplifying the work
    you have to do.

    : ids that are passed to the collector. But that said, I don't know how to
    : keep this array synchronized with the index. I've opened a new thread for
    : this subject, "maxDoc and arrays".



    -Hoss
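
To illustrate why the single-valued requirement matters, here is a small plain-Java model of a doc-id-indexed value array built from term postings. In Lucene 2.x the real thing is reached through calls such as FieldCache.DEFAULT.getStrings(reader, field); the `UninvertedField` class and the `postings` map below are hypothetical stand-ins, not that implementation.

```java
import java.util.Map;

final class UninvertedField {
    /**
     * Builds an array of field values indexed by doc id.
     * postings maps each term of one field to the doc ids containing it
     * (a hypothetical in-memory stand-in for walking an index's terms).
     */
    static String[] uninvert(Map<String, int[]> postings, int maxDoc) {
        String[] valueByDoc = new String[maxDoc];
        for (Map.Entry<String, int[]> e : postings.entrySet()) {
            for (int doc : e.getValue()) {
                // With more than one term per doc, a later term would overwrite
                // an earlier one here, hence the single-valued requirement.
                valueByDoc[doc] = e.getKey();
            }
        }
        return valueByDoc;
    }
}
```

Documents without a value for the field keep a null slot, and tokenized fields break the model for the same reason multi-valued ones do: each token becomes its own term.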


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Otis Gospodnetic at May 24, 2007 at 6:41 pm
    Carlos,
    It sounds like you'll have to build logic that knows when the index has been reopened and repopulates your cache. Take a look at Solr; it does this type of thing.

    Otis
    . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
    Simpy -- http://www.simpy.com/ - Tag - Search - Share
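
One way the "repopulate on reopen" logic could look, as a sketch under the assumption that reopening the index produces a new reader object: key the cached data on the reader instance, so a new reader misses the cache and triggers a rebuild. `PerReaderCache` is a hypothetical name and `R` stands in for whatever reader type is in use; this is not how Solr does it.

```java
import java.util.Map;
import java.util.WeakHashMap;
import java.util.function.Function;

/** Caches one derived value per reader instance; R stands in for an index reader type. */
final class PerReaderCache<R, V> {
    // Weak keys: an entry becomes collectable once its reader is no longer referenced.
    private final Map<R, V> cache = new WeakHashMap<>();
    private final Function<R, V> loader;

    PerReaderCache(Function<R, V> loader) {
        this.loader = loader;
    }

    /** Returns the cached value for this reader, building it on first use. */
    synchronized V get(R reader) {
        return cache.computeIfAbsent(reader, loader);
    }
}
```

The cached value should not hold a strong reference back to its reader, otherwise the weak key never gets cleared. The loader would be whatever builds the doc-id-to-store_id array for a given reader.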

    ----- Original Message ----
    From: Carlos Pita <carlosjosepita@gmail.com>
    To: java-user@lucene.apache.org
    Sent: Thursday, May 24, 2007 12:50:04 PM
    Subject: Re: HitCollector or Hits


Discussion Overview
group: java-user @
categories: lucene
posted: May 24, '07 at 3:31a
active: May 24, '07 at 7:29p
posts: 7
users: 4
website: lucene.apache.org
