FAQ
I have a custom query object whose scorer uses AllTermDocs to get all
non-deleted documents. AllTermDocs returns docIds relative to the
segment, but I need the absolute (index-wide) docId to access external data.
What's the best way to get the unique, non-deleted docId?

Thanks,
Peter

  • Peter Keegan at Nov 16, 2009 at 7:06 pm
    I forgot to mention that this is with V2.9.1
  • Peter Keegan at Nov 16, 2009 at 7:51 pm
    The same thing is occurring in my custom sort comparator. The ScoreDocs
    passed to the 'compare' method have docIds that seem to be relative to the
    segment. Is there any way to translate these into index-wide docIds?

    Peter
  • Michael McCandless at Nov 16, 2009 at 10:16 pm
Can you remap your external data to be per segment? Presumably that
would make reopens faster for your app.

    For your custom sort comparator, are you using FieldComparator? If
    so, Lucene calls setNextReader to tell you the reader & docBase.

    Failing these, Lucene currently visits the readers in index order.
    So, you could accumulate the docBase by adding up the reader.maxDoc()
    for each reader you've seen. However, this may change in future
    Lucene releases.

You could also, externally, build your own map from SegmentReader ->
docBase, by calling IndexReader.getSequentialSubReaders() and stepping
through adding up the maxDoc. Then, in your search, you can look up
the SegmentReader you're working on to get the docBase?

    Mike
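Mike's map suggestion boils down to simple arithmetic: a segment's docBase is the sum of maxDoc() over all earlier segments, and an index-wide docId is docBase plus the segment-relative docId. A minimal sketch in plain Java (the per-segment maxDoc values are invented for illustration; real code would read them from the SegmentReaders returned by IndexReader.getSequentialSubReaders()):

```java
// Sketch of the docBase arithmetic behind Mike's suggestion.
// The maxDoc values are made up; Lucene would supply them per segment.
public class DocBaseDemo {
    public static void main(String[] args) {
        int[] maxDocs = {100, 250, 75};        // maxDoc() of each segment, in index order
        int[] docBases = new int[maxDocs.length];
        int base = 0;
        for (int i = 0; i < maxDocs.length; i++) {
            docBases[i] = base;                // docBase of segment i
            base += maxDocs[i];
        }
        // A doc with id 10 inside segment 2 is 100 + 250 + 10 = 360 index-wide.
        System.out.println(docBases[2] + 10);  // prints 360
    }
}
```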
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Peter Keegan at Nov 16, 2009 at 11:39 pm
> Can you remap your external data to be per segment?
That would provide the tightest integration but would require a major
redesign. Currently, the external data is in a single file created by
reading a stored field after the Lucene index has been committed. Creating
this file is very fast with 2.9 (considering the cost of reading all those
stored fields).
> For your custom sort comparator, are you using FieldComparator?
I'm using the deprecated FieldSortedHitQueue. I started looking into
replacing it with FieldComparator, but it was much more involved than I had
expected, so I postponed. Also, this would only be a partial solution to a
query with a custom scorer and custom sorter.
> Failing these, Lucene currently visits the readers in index order.
> So, you could accumulate the docBase by adding up the reader.maxDoc()
> for each reader you've seen. However, this may change in future
> Lucene releases.
This would work for the Scorer but not the Sorter, right?
> You could also, externally, build your own map from SegmentReader ->
> docBase, by calling IndexReader.getSequentialSubReaders() and stepping
> through adding up the maxDoc. Then, in your search, you can look up
> the SegmentReader you're working on to get the docBase?
I think this would work for both Scorer and Sorter, right? This seems like
the best solution right now.

Thanks for the good suggestions!

Peter
  • Michael McCandless at Nov 17, 2009 at 10:50 am

    On Mon, Nov 16, 2009 at 6:38 PM, Peter Keegan wrote:

>> Can you remap your external data to be per segment?
> That would provide the tightest integration but would require a major
> redesign. Currently, the external data is in a single file created by
> reading a stored field after the Lucene index has been committed. Creating
> this file is very fast with 2.9 (considering the cost of reading all those
> stored fields).
OK. Though if you update a few docs and open a new reader, you have
to fully recreate the file? (Or, your app may simply never need to do
that...)
>> For your custom sort comparator, are you using FieldComparator?
> I'm using the deprecated FieldSortedHitQueue. I started looking into
> replacing it with FieldComparator, but it was much more involved than I had
> expected, so I postponed. Also, this would only be a partial solution to a
> query with a custom scorer and custom sorter.
You are using FSHQ directly, yourself? (Ie, not via TopFieldDocCollector)?

FSHQ expects you to init it with the top-level reader, and then insert
using top docIDs.
>> Failing these, Lucene currently visits the readers in index order.
>> So, you could accumulate the docBase by adding up the reader.maxDoc()
>> for each reader you've seen. However, this may change in future
>> Lucene releases.
> This would work for the Scorer but not the Sorter, right?
I don't fully understand the question -- the sorter is simply a
Collector impl, and Collector.setNextReader tells you the docBase when
the search advances to the next reader.
>> You could also, externally, build your own map from SegmentReader ->
>> docBase, by calling IndexReader.getSequentialSubReaders() and stepping
>> through adding up the maxDoc. Then, in your search, you can look up
>> the SegmentReader you're working on to get the docBase?
> I think this would work for both Scorer and Sorter, right?
> This seems like the best solution right now.
This is a generic solution, but just make sure you don't do the
map lookup for every doc collected, if you can help it, else that'll
slow down your search.

    Mike
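The Collector contract Mike describes can be mimicked without Lucene: setNextReader hands the collector each new segment's docBase, collect receives segment-relative docIds, and adding the two recovers the index-wide id. The class below is a hedged stand-in for that bookkeeping, not the real org.apache.lucene.search.Collector API:

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the Lucene 2.9 Collector pattern: not the real API, just the
// docBase bookkeeping it implies.
public class CollectorSketch {
    static class TopLevelCollector {
        private int docBase;
        final List<Integer> hits = new ArrayList<Integer>();

        // Lucene calls this as the search advances to each segment.
        void setNextReader(int docBase) { this.docBase = docBase; }

        // Lucene passes segment-relative docIds; adding docBase recovers
        // the index-wide id.
        void collect(int segmentDoc) { hits.add(docBase + segmentDoc); }
    }

    public static void main(String[] args) {
        TopLevelCollector c = new TopLevelCollector();
        c.setNextReader(0);    // first segment starts at docBase 0
        c.collect(5);
        c.setNextReader(100);  // second segment: previous segment's maxDoc was 100
        c.collect(5);
        System.out.println(c.hits);  // prints [5, 105]
    }
}
```

The same doc number 5 collects as two different top-level ids, which is exactly the segment-relative behavior Peter is seeing.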

  • Peter Keegan at Nov 17, 2009 at 1:58 pm
    The external data is just an array of fixed-length records, one for each
    Lucene document. Indexes are updated at regular intervals in one jvm. A
    searcher jvm opens the index and reads all the fixed-length records into
    RAM. Given an index-wide docId, the custom scorer can quickly access the
    corresponding fixed-length external data.

    Could you explain a bit more about how mapping the external data to be per
    segment would work? As I said, rebuilding the whole file isn't a big deal
    and the single file keeps the Searcher's use of it simple.

    With or without a SegmentReader->docBase map (which does sound like a huge
    performance hit), I still don't see how the custom scorer gets the segment
number. Btw, the custom scorer usually becomes part of a ConjunctionScorer
(if that matters).
> FSHQ expects you to init it with the top-level reader, and then insert
> using top docIDs.
    For sorting, I'm using FSHQ directly with a custom collector that inserts
    docs to the FSHQ. But the custom collector is passed the segment-relative
    docId and the custom comparator needs the index-wide docId. The custom
    collector extends HitCollector. I'm missing where this type of collector
    finds the docBase.

    Thanks,
    Peter
  • Peter Keegan at Nov 17, 2009 at 3:24 pm

> This is a generic solution, but just make sure you don't do the
> map lookup for every doc collected, if you can help it, else that'll
> slow down your search.
What I just learned is that a Scorer is created for each segment (lights
on!). So, couldn't I just do the subreader->docBase map lookup once when the
custom scorer is created? No need to access the map for every doc this way.

    Peter
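Peter's realization can be sketched the same way: since Lucene 2.9 asks the query's weight for a new scorer per segment, the scorer can resolve its segment's docBase once at construction and never touch the map again. The names below are illustrative, not Lucene's API:

```java
// Hedged sketch: a per-segment scorer caches its docBase once, at construction,
// so converting to index-wide ids costs one addition per doc, no map lookup.
public class ScorerSketch {
    static class SegmentScorer {
        private final int docBase;  // resolved once, when the scorer is created

        SegmentScorer(int docBase) { this.docBase = docBase; }

        // Convert a segment-relative docId to the index-wide id used to
        // address the external fixed-length records.
        int externalId(int segmentDoc) { return docBase + segmentDoc; }
    }

    public static void main(String[] args) {
        // Hypothetical segment whose docBase (sum of earlier maxDocs) is 350.
        SegmentScorer scorer = new SegmentScorer(350);
        System.out.println(scorer.externalId(10));  // prints 360
    }
}
```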
  • Michael McCandless at Nov 17, 2009 at 4:47 pm

    On Tue, Nov 17, 2009 at 10:23 AM, Peter Keegan wrote:
>> This is a generic solution, but just make sure you don't do the
>> map lookup for every doc collected, if you can help it, else that'll
>> slow down your search.
> What I just learned is that a Scorer is created for each segment (lights
> on!). So, couldn't I just do the subreader->docBase map lookup once when
> the custom scorer is created? No need to access the map for every doc
> this way.
Right, that should work.

    Mike

  • Michael McCandless at Nov 17, 2009 at 4:52 pm

    On Tue, Nov 17, 2009 at 8:58 AM, Peter Keegan wrote:
> The external data is just an array of fixed-length records, one for each
> Lucene document. Indexes are updated at regular intervals in one jvm. A
> searcher jvm opens the index and reads all the fixed-length records into
> RAM. Given an index-wide docId, the custom scorer can quickly access the
> corresponding fixed-length external data.
>
> Could you explain a bit more about how mapping the external data to be per
> segment would work? As I said, rebuilding the whole file isn't a big deal
> and the single file keeps the Searcher's use of it simple.
Well, you could use IndexReader.getSequentialSubReaders(), then step
through that array of SegmentReaders, making a separate external file
for each?

This way, when you reopen your readers, you would only need to make a
new external file for those segments that are new.

But if re-creating the entire file on each reopen isn't a problem for
you then there's no need to change this :)
> With or without a SegmentReader->docBase map (which does sound like a huge
> performance hit), I still don't see how the custom scorer gets the segment
> number. Btw, the custom scorer usually becomes part of a ConjunctionScorer
> (if that matters).
Looks like you already answered this (Lucene asks the Query's weight
for a new scorer one segment at a time).
>> FSHQ expects you to init it with the top-level reader, and then insert
>> using top docIDs.
> For sorting, I'm using FSHQ directly with a custom collector that inserts
> docs to the FSHQ. But the custom collector is passed the segment-relative
> docId and the custom comparator needs the index-wide docId. The custom
> collector extends HitCollector. I'm missing where this type of collector
> finds the docBase.
Hmm -- if you are extending HitCollector and passing that to search(),
then the docIDs fed to it should already be top-level docIDs, not
segment relative.

    Mike

  • Peter Keegan at Nov 17, 2009 at 5:37 pm

> But if re-creating the entire file on each reopen isn't a problem for
> you then there's no need to change this :)
It's actually created after IndexWriter.commit(), but same idea. If we
needed real-time indexing, or if disk I/O gets excessive, I'd go with
separate files per segment.
> Hmm -- if you are extending HitCollector and passing that to search(),
> then the docIDs fed to it should already be top-level docIDs, not
> segment relative.
I just assumed the same was true for the collector, but you're right. The
incorrect sorting I see must be due to something else.

    Thanks,
    Peter

Discussion Overview
group: java-user @ lucene.apache.org
categories: lucene
posted: Nov 16, '09 at 6:39p
active: Nov 17, '09 at 5:37p
posts: 11
users: 2