Hi Michael,

I have updated my lucene-1458 checkout, and I discovered that there were big
changes in the StandardCodec interface.
I updated my own codecs to this new interface, but I encounter a
problem. My codecs create DocsAndPositionsEnum subclasses that
give access to more information than simply the doc, freq and position
(I have other information encoded into the Prox file).
In the code, to be able to use the additional interface that my
classes provide, I was casting the DocsAndPositionsEnum object returned
by IndexReader#termPositionsEnum() into the correct subclass. While this
approach worked in the previous flex branch, it does not work
anymore with the last committed changes. In certain cases,
IndexReader#termPositionsEnum() does not return the DocsAndPositionsEnum
created by the StandardPostingsReader, but a MultiDocsAndPositionsEnum.
However, I am not able either to subclass MultiDocsAndPositionsEnum
or to wrap it in a decorator, because it is declared as 'private static
final' in DirectoryReader.

Are these classes (MultiTermEnum, MultiDocsAndPositionsEnum, etc.)
hidden intentionally? Or is there another way to extend
StandardCodec without having to deal with these classes?

Cheers
--
Renaud Delbru

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


  • Uwe Schindler at Feb 9, 2010 at 12:17 pm
    Hi Renaud,

    In flex the correct way to add additional posting data to these classes would be the usage of custom attributes, registered in the attributes() AttributeSource.

    Due to some limitations, there is currently no working support in MultiReaders to have a "view" on the underlying Enums, but we are working on that.

    In general, what you would do (once it works):
    Define an interface for your extensions based on the Attribute interface, and also provide the implementation class. Then call yourEnum.attributes().addAttribute(YourInterface.class) in the ctor of your enum, store a local reference to the attribute, and fill it during iteration. Any consumer of the enum can check for the existence of the same attribute using TermPositions.attributes().hasAttribute/getAttribute/addAttribute and then read the attribute during iteration. There is no need to change the enum class API at all.

    It works the same way as the TokenStreams since 2.9/3.0.
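    The pattern described above can be sketched without Lucene itself. The sketch below is illustrative only: SimpleAttributeSource, TupleAttribute and MyPositionsEnum are hypothetical stand-ins for Lucene's AttributeSource, a custom Attribute interface, and a DocsAndPositionsEnum subclass, not real Lucene classes.

    ```java
    import java.util.LinkedHashMap;
    import java.util.Map;

    // Minimal stand-in for Lucene's AttributeSource: attributes are looked up
    // by interface class, so producer and consumer share the same instance.
    class SimpleAttributeSource {
        private final Map<Class<?>, Object> attributes = new LinkedHashMap<>();

        @SuppressWarnings("unchecked")
        <T> T addAttribute(Class<T> clazz, T impl) {
            return (T) attributes.computeIfAbsent(clazz, c -> impl);
        }

        boolean hasAttribute(Class<?> clazz) {
            return attributes.containsKey(clazz);
        }

        @SuppressWarnings("unchecked")
        <T> T getAttribute(Class<T> clazz) {
            return (T) attributes.get(clazz);
        }
    }

    // Hypothetical extension attribute, analogous to Renaud's TupleAttribute.
    interface TupleAttribute { int getTupleId(); void setTupleId(int id); }

    class TupleAttributeImpl implements TupleAttribute {
        private int tupleId;
        public int getTupleId() { return tupleId; }
        public void setTupleId(int id) { this.tupleId = id; }
    }

    // The enum registers the attribute in its ctor and fills it on iteration;
    // the consumer only needs the shared attribute source, not the enum subclass.
    class MyPositionsEnum {
        final SimpleAttributeSource attributes = new SimpleAttributeSource();
        private final TupleAttribute tupleAtt;

        MyPositionsEnum() {
            tupleAtt = attributes.addAttribute(TupleAttribute.class, new TupleAttributeImpl());
        }

        void nextPosition(int decodedTupleId) {
            tupleAtt.setTupleId(decodedTupleId); // fill while decoding the prox file
        }
    }

    public class AttributeDemo {
        public static void main(String[] args) {
            MyPositionsEnum e = new MyPositionsEnum();
            e.nextPosition(42);
            // Consumer side: check for the attribute, then read it per position.
            if (e.attributes.hasAttribute(TupleAttribute.class)) {
                System.out.println(e.attributes.getAttribute(TupleAttribute.class).getTupleId());
            }
        }
    }
    ```

    The point is that no cast to the concrete enum subclass is needed: the consumer only interacts with the attribute interface.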

    Uwe

    -----
    Uwe Schindler
    H.-H.-Meier-Allee 63, D-28213 Bremen
    http://www.thetaphi.de
    eMail: uwe@thetaphi.de

    -----Original Message-----
    From: Renaud Delbru
    Sent: Tuesday, February 09, 2010 1:05 PM
    To: java-user
    Cc: Michael McCandless
    Subject: Flex & Docs/AndPositionsEnum

  • Renaud Delbru at Feb 9, 2010 at 12:44 pm
    Hi Uwe,
    On 09/02/10 12:16, Uwe Schindler wrote:
    In flex the correct way to add additional posting data to these classes would be the usage of custom attributes, registered in the attributes() AttributeSource.
    Ok, I have changed my code to use the AttributeSource interface.
    Due to some limitations, there is currently no working support in MultiReaders to have a "view" on the underlying Enums, but we are working on that.
    But I still have the same problem: it seems that
    MultiDocsAndPositionsEnum does not have access to the underlying
    attributes added to my DocsAndPositionsEnum subclass. I get the
    following exception (IllegalArgumentException):
    "This AttributeSource does not have the attribute
    'org.sindice.siren.analysis.attributes.TupleAttribute'."

    Is this related to your previous comment, i.e., that MultiReaders do not
    have a view on the underlying Enums ?
    In general what you do (if it works in future):
    Define an interface for your extensions based on the Attribute interface and also provide the implementation class. Then call YourEnums.attributes().addAttribute(YourInterface.class) in the ctor of your enum, store a local reference to the attribute and fill this on iteration. Any consumer of this Enum can check using TermPositions.attributes().hasAttribute/getAttribute/addAttribute for the existence of the the same and then read the attributes during iteration. There is no need to change the Enum class API at all.
    Ok, it works like a charm except the problem related to MultiReaders.

    Thanks
    --
    Renaud Delbru

  • Uwe Schindler at Feb 9, 2010 at 1:05 pm
    Hi Renaud,

    On 09/02/10 12:16, Uwe Schindler wrote:
    In flex the correct way to add additional posting data to these
    classes would be the usage of custom attributes, registered in the
    attributes() AttributeSource.

    Ok, I have changed my codes to use the AttributeSource interface.
    Due to some limitations, there is currently no working support in
    MultiReaders to have a "view" on the underlying Enums, but we are
    working on that.
    But, I have still the same problem, it seems that
    MultiDocsAndPositionsEnum does not have access to the underlying
    attributes added to my DocsAndPositionsEnum subclass. I got the
    following exception (IllegalArgumentException):
    "This AttributeSource does not have the attribute
    'org.sindice.siren.analysis.attributes.TupleAttribute'."

    Is this related to your previous comment, i.e., that MultiReaders do
    not
    have a view on the underlying Enums ?
    Exactly. MultiEnums have their own attributes at the moment; there is no "proxy" view onto the underlying ones. For this to work, proxy AttributeImpls would be needed, and there is no support for that at the moment.

    See https://issues.apache.org/jira/browse/LUCENE-2154

    The underlying problem is that when a consumer gets/adds an Attribute, all subreaders must use the same attribute, or the MultiReader/DirectoryReader must proxy the attributes. For this to work we would need dynamic proxies, or you would also have to implement proxy impls: Attribute, AttributeImpl, AttributeProxyImpl.

    We have made no progress on that at the moment, so I am sorry: we have no working support for attributes in MultiReaders (which all DirectoryReaders are, because an index can consist of more than one segment).
    In general what you do (if it works in future):
    Define an interface for your extensions based on the Attribute
    interface and also provide the implementation class. Then call
    YourEnums.attributes().addAttribute(YourInterface.class) in the ctor of
    your enum, store a local reference to the attribute and fill this on
    iteration. Any consumer of this Enum can check using
    TermPositions.attributes().hasAttribute/getAttribute/addAttribute for
    the existence of the the same and then read the attributes during
    iteration. There is no need to change the Enum class API at all.
    Ok, it works like a charm except the problem related to MultiReaders.
    See above.

    But attributes are the way to go for these extended posting/prox lists.

    Uwe


  • Michael McCandless at Feb 9, 2010 at 1:36 pm
    Renaud,

    It's great that you're testing the flex APIs... things are still "in
    flux" as you've seen. There's another big patch pending on
    LUCENE-2111...

    Out of curiosity... in what circumstances do you see a Multi*Enum appearing?

    Lucene's core always searches "by segment". Are you doing something
    external (directly using the flex APIs against a
    Multi/DirectoryReader)?

    Mike
  • Marc Schwarz at Feb 9, 2010 at 2:00 pm
    Hi,

    I am trying to implement synonyms, but I don't know exactly how to do
    it (Lucene 3.0).

    Is anybody out there who has some small code snippets or a good link ?

    Thanks & Greetings,
    Marc




  • Ian Lea at Feb 9, 2010 at 4:04 pm
    Lucene in Action, second edition, has synonym stuff that I think will
    work with Lucene 3.0.

    Source code available from http://www.manning.com/hatcher3/


    --
    Ian.

  • Simon Willnauer at Feb 9, 2010 at 8:22 pm
    Maybe I am missing something, but what is wrong with the
    SynonymTokenFilter in contrib/wordnet?

    simon
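    The core idea behind any such synonym filter can be sketched without the Lucene TokenStream machinery: for each incoming token, emit the token itself, then inject its synonyms at the same position (position increment 0), so phrase and proximity queries treat them as interchangeable. The class and method names below (SynonymDemo, expand) are illustrative, not the contrib/wordnet API.

    ```java
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    public class SynonymDemo {
        // A token plus its position increment, the two pieces a synonym
        // filter actually manipulates.
        record Token(String term, int positionIncrement) {}

        static List<Token> expand(List<String> input, Map<String, List<String>> synonyms) {
            List<Token> out = new ArrayList<>();
            for (String term : input) {
                out.add(new Token(term, 1)); // original token advances the position
                for (String syn : synonyms.getOrDefault(term, List.of())) {
                    out.add(new Token(syn, 0)); // synonym stacked at the same position
                }
            }
            return out;
        }

        public static void main(String[] args) {
            Map<String, List<String>> syns = Map.of("quick", List.of("fast", "speedy"));
            for (Token t : expand(List.of("the", "quick", "fox"), syns)) {
                System.out.println(t.term() + " +" + t.positionIncrement());
            }
        }
    }
    ```

    In real Lucene code this logic lives in a TokenFilter that sets PositionIncrementAttribute to 0 on injected terms.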
  • Renaud Delbru at Feb 9, 2010 at 2:09 pm
    Hi Michael,
    On 09/02/10 13:35, Michael McCandless wrote:
    It's great that you're testing the flex APIs... things are still "in
    flux" as you've seen. There's another big patch pending on
    LUCENE-2111...
    So, does this mean that the codec interface is likely to change? Do I
    need to be prepared to change all my code again ;o)?
    Out of curiosity... in what circumstances do you see a Multi*Enum appearing?

    Lucene's core always searches "by segment". Are you doing something
    external (directly using the flex APIs against a
    Multi/DirectoryReader)?
    I am using the flex API with the high-level Lucene interface
    (IndexWriter and IndexReader).
    I create a RAMDirectory, register my codec in the IndexWriter,
    and index 64 documents. Then, I use IndexReader.termPositionsEnum to
    get my own DocsAndPositionsEnum in order to check that all the information
    that has been stored in the new index data structure is correctly
    retrieved.
    In that case, I get the previous errors (a MultiDocsAndPositionsEnum is
    returned). However, when I index only one or two documents, the
    original DocsAndPositionsEnum is returned.

    Hope that helps,
    cheers
    --
    Renaud Delbru

  • Michael McCandless at Feb 9, 2010 at 4:04 pm

    On Tue, Feb 9, 2010 at 9:08 AM, Renaud Delbru wrote:
    Hi Michael,
    On 09/02/10 13:35, Michael McCandless wrote:

    It's great that you're testing the flex APIs... things are still "in
    flux" as you've seen.  There's another big patch pending on
    LUCENE-2111...
    So, does it mean that the codec interface is likely to change ? Do I need to
    be prepared to change again all my code ;o) ?
    This particular patch doesn't change the Codecs API -- it "only"
    factors out the Multi* APIs from MultiReader. Likely you won't need
    to change your codec... but try applying the patch and see :)

    However: if you consume the flex API directly, on top of multi readers
    (something you shouldn't do, for performance reasons), you will have
    to use MultiField's static methods to get the enums.
    Out of curiosity... in what circumstances do you see a Multi*Enum
    appearing?

    Lucene's core always searches "by segment".  Are you doing something
    external (directly using the flex APIs against a
    Multi/DirectoryReader)?
    I am using the flex API with the high level Lucene interface (IndexWriter
    and IndexReader).
    I am creating a RamDirectory, register my codec into the IndexWriter, and
    index 64 documents. Then, I use the IndexReader.termPositionsEnum to get my
    own DocsAndPositionsEnum in order to check if all the information that have
    been stored in the new index data structure are correctly retrieved.
    In that case, I got the previous errors (a MultiDocsAndPositionsEnum is
    returned). However, when I am indexing only one or two documents, the
    original DocsAndPositionsEnum is returned.
    Got it, so you're directly consuming the flex API in your test.
    Whenever the index has > 1 segment, you'll get a multi enum.

    Mike

  • Renaud Delbru at Feb 9, 2010 at 4:36 pm

    On 09/02/10 16:04, Michael McCandless wrote:
    On Tue, Feb 9, 2010 at 9:08 AM, Renaud Delbru wrote:
    So, does it mean that the codec interface is likely to change ? Do I need to
    be prepared to change again all my code ;o) ?
    This particular patch doesn't change the Codecs API -- it "only"
    factors out the Multi* APIs from MultiReader. Likely you won't need
    to change your codec... but try applying the patch and see :)
    Ok, good news ;o).
    However: if you consume the flex API directly, on top of multi readers
    (something you shouldn't do, for performance reasons), you will have
    to use MultiField's static methods to get the enums.
    In my previous example (registering my codec in the IndexWriter, and
    then using the IndexReader), am I consuming the flex API directly on
    top of the multi-readers? If so, how can I avoid that?

    Cheers
    --
    Renaud Delbru

  • Michael McCandless at Feb 9, 2010 at 4:52 pm

    On Tue, Feb 9, 2010 at 11:35 AM, Renaud Delbru wrote:

    This particular patch doesn't change the Codecs API -- it "only"
    factors out the Multi* APIs from MultiReader.  Likely you won't need
    to change your codec... but try applying the patch and see :)
    Ok, good news ;o).
    Flex is still in flux, though :)
    However: if you consume the flex API directly, on top of multi readers
    (something you shouldn't do, for performance reasons), you will have
    to use MultiField's static methods to get the enums.
    In my previous example (registering my codec in IndexWriter, and then use
    IndexReader), do I consume the flex API directly on top of the multi-readers
    directly ? If yes, how to avoid that ?
    You should (when possible/reasonable) instead use
    ReaderUtil.gatherSubReaders, then iterate through those sub readers
    asking each for its flex fields.

    But if this is only for testing purposes and Multi*Enum is more
    convenient, then (once attrs work correctly) Multi*Enum is
    perfectly fine.
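    The per-segment approach Mike describes can be sketched as follows. The mock reader classes and the hand-rolled gatherSubReaders below are simplified stand-ins for IndexReader/SegmentReader and ReaderUtil.gatherSubReaders, not the real Lucene API.

    ```java
    import java.util.ArrayList;
    import java.util.List;

    // Simplified mock of the reader hierarchy: a directory reader is a
    // composite, a segment reader is a leaf that can hand out its flex
    // fields (and hence your codec's own enum subclass) directly.
    abstract class MockReader {
        // Returns sub-readers, or null for a leaf
        // (mirrors IndexReader.getSequentialSubReaders()).
        abstract MockReader[] getSequentialSubReaders();
    }

    class MockSegmentReader extends MockReader {
        final String segmentName;
        MockSegmentReader(String name) { this.segmentName = name; }
        MockReader[] getSequentialSubReaders() { return null; }
    }

    class MockDirectoryReader extends MockReader {
        final MockReader[] subs;
        MockDirectoryReader(MockReader... subs) { this.subs = subs; }
        MockReader[] getSequentialSubReaders() { return subs; }
    }

    public class PerSegmentDemo {
        // Stand-in for ReaderUtil.gatherSubReaders: recursively collect leaves.
        static void gatherSubReaders(List<MockSegmentReader> out, MockReader r) {
            MockReader[] subs = r.getSequentialSubReaders();
            if (subs == null) {
                out.add((MockSegmentReader) r);   // leaf: a single segment
            } else {
                for (MockReader sub : subs) gatherSubReaders(out, sub);
            }
        }

        public static void main(String[] args) {
            MockReader top = new MockDirectoryReader(
                new MockSegmentReader("_0"),
                new MockDirectoryReader(new MockSegmentReader("_1"),
                                        new MockSegmentReader("_2")));
            List<MockSegmentReader> leaves = new ArrayList<>();
            gatherSubReaders(leaves, top);
            // Consume the enum per segment instead of through a Multi*Enum:
            for (MockSegmentReader leaf : leaves) {
                System.out.println(leaf.segmentName); // ask each leaf for its fields/enums
            }
        }
    }
    ```

    Because each leaf hands back the codec's own enum, the cast (or attribute lookup) works without any Multi* wrapper in the way.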

    Mike

  • Marvin Humphrey at Feb 9, 2010 at 6:13 pm

    On Tue, Feb 09, 2010 at 11:51:31AM -0500, Michael McCandless wrote:

    You should (when possible/reasonable) instead use
    ReaderUtil.gatherSubReaders, then iterate through those sub readers
    asking each for its flex fields.

    But if this is only for testing purposes, and Multi*Enum is more
    convenient (and, once attrs work correctly), then Multi*Enum is
    perfectly fine.
    Mike, FWIW, I've removed the ability to iterate over posting data at anything
    other than the segment level from KS. There's still a priority-queue-based
    aggregator for iterating over all terms in a multi-segment index, but not for
    anything lower.

    Forcing pluggable index formats to support the extra level of indirection
    necessary for iterating postings from a high level both introduces
    inefficiency and constrains their development. Consider what would happen if
    we tried indexing terms within a flat positions space and returned an array of
    positions instead of one position at a time. The instant you return objects
    or aggregates rather than primitives, you force support for offsets down into
    the low-level decoder.

    It's not really necessary to iterate aggregated postings across multiple
    segments, so IMO it's best to shunt users like Renaud towards the segment
    level.

    Marvin Humphrey


  • Michael McCandless at Feb 9, 2010 at 8:47 pm

    On Tue, Feb 9, 2010 at 1:12 PM, Marvin Humphrey wrote:
    On Tue, Feb 09, 2010 at 11:51:31AM -0500, Michael McCandless wrote:

    You should (when possible/reasonable) instead use
    ReaderUtil.gatherSubReaders, then iterate through those sub readers
    asking each for its flex fields.
    But if this is only for testing purposes, and Multi*Enum is more
    convenient (and, once attrs work correctly), then Multi*Enum is
    perfectly fine.
    Mike, FWIW, I've removed the ability to iterate over posting data at
    anything other than the segment level from KS. There's still a
    priority-queue-based aggregator for iterating over all terms in a
    multi-segment index, but not for anything lower.
    Interesting... and segment merging just does its own private
    concatenation/mapping-around-deletes of the doc/positions?

    I'm torn on the Multi*Enum.... it's easy to get one "by accident"
    (because you're interacting with multi reader) and as a result take a
    silent performance hit. And often the caller can easily change to
    operate per segment instead.

    But, then, it's very convenient when you need it and don't care about
    performance. EG in Renaud's usage, a test case that is trying to
    assert that all indexed docs look right, why should you be forced to
    operate per segment? He shouldn't have to bother with the details of
    which field/term/doc was indexed into which segment.

    Or, I guess we could argue that this test really should create a
    TermQuery and walk the matching docs... instead of using the low level
    flex enum APIs. Because searching impl already knows how to step
    through the segments.

    Anyway, my current patch on LUCENE-2111 reflects my torn-ness: it
    makes it just a bit harder to get Multi*Enum on a multi-reader. If
    you call MultiReader.fields(), it throws
    UnsupportedOperationException, and you must instead use
    MultiFields.getXXXEnum to explicitly create the enum.
    Forcing pluggable index formats to support the extra level of indirection
    necessary for iterating postings from a high level both introduces
    inefficiency and constrains their development. Consider what would happen if
    we tried indexed terms within a flat positions space and returned an array of
    positions instead of one position at a time. The instant you return objects
    or aggregates rather than primitives, you force support for offsets down into
    the low-level decoder.
    I don't understand this example -- can you give more detail? Eg,
    what's a "flat positions space"? And "force support for offsets".
    And we don't return "objects or aggregates" with Multi*Enum now...

    In flex right now the codec is unaware that it's being "consumed" by a
    Multi*Enum. It still returns primitives. If instead we returned an
    int[] for positions (hmm -- may be a good reason to make positions be
    an Attribute, Uwe), I think it would still be OK?
    It's not really necessary to iterate aggregated postings across multiple
    segments, so IMO it's best to shunt users like Renaud towards the segment
    level.
    Still torn... I think it's convenience vs performance. But I
    want convenience to be an explicit choice. We shouldn't default our
    APIs to a silent perf hit...

    Mike

  • Marvin Humphrey at Feb 9, 2010 at 9:44 pm

    On Tue, Feb 09, 2010 at 03:47:19PM -0500, Michael McCandless wrote:

    Interesting... and segment merging just does its own private
    concatenation/mapping-around-deletes of the doc/positions?
    I think the answer is yes, but I'm not sure I understand the question
    completely since I'm not sure why you'd ask that in this context.
    what's a "flat positions space"?
    It's something Google once used. Instead of positions starting with 0 at each
    document, they just keep going.

    doc 1: "Three Blind Mice" - positions 0, 1, 2
    doc 2: "Peter Peter Pumpkin Eater" - positions 3, 4, 5, 6
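    In such a flat space, a multi-segment view has to shift each segment's local positions by the sum of the previous segments' "positions max" values, exactly as doc IDs are shifted today by summing maxDoc(). A sketch (all names hypothetical, using Marvin's two-document example):

    ```java
    import java.util.Arrays;

    // Sketch of remapping per-segment positions into one flat, index-wide
    // positions space, analogous to how doc IDs are remapped with maxDoc().
    public class FlatPositionsDemo {
        // Each segment reports a "positions max" (how many positions it consumed);
        // a segment's base is the sum of all previous segments' maxima.
        static int[] toGlobal(int[] positionsMax, int[][] segmentPositions) {
            int total = 0;
            for (int[] seg : segmentPositions) total += seg.length;
            int[] global = new int[total];
            int base = 0, i = 0;
            for (int seg = 0; seg < segmentPositions.length; seg++) {
                for (int pos : segmentPositions[seg]) global[i++] = base + pos;
                base += positionsMax[seg]; // advance the base, just like a doc base
            }
            return global;
        }

        public static void main(String[] args) {
            // "Three Blind Mice" -> 3 positions, "Peter Peter Pumpkin Eater" -> 4
            int[] positionsMax = {3, 4};
            int[][] local = {{0, 1, 2}, {0, 1, 2, 3}};
            System.out.println(Arrays.toString(toGlobal(positionsMax, local)));
            // prints [0, 1, 2, 3, 4, 5, 6]
        }
    }
    ```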
    And we don't return "objects or aggregates" with Multi*Enum now...
    Yeah, this is different. In KS right now, we use a generic PostingList, which
    conveys different information depending on what class of Posting it contains.
    In flex right now the codec is unware that it's being "consumed" by a
    Multi*Enum.
    Right, but in KinoSearch's case PostingList had to be aware of that because
    the Posting object could be consumed at either the segment level or the index
    level -- so it needed a setDocBase(offset) method which adjusted the doc num in
    the Posting. It was messy.

    The change I made was to eliminate PolyPostingList and PolyPostingListReader,
    which made it possible to remove the setDocBase() method from SegPostingList.
    It still returns primitives. If instead we returned an int[] for positions
    (hmm -- may be a good reason to make positions be an Attribute, Uwe), I
    think it would still be OK?
    In the flat positions space example, it would be necessary to add an offset to
    each of the positions in that array. Each segment would have a "positions
    max" analogous to maxDoc(); these would be summed to obtain the positions
    offset the same way we add up maxDoc() now to obtain the doc id offset.

    That example may not be a deal breaker for you, but I'm not willing to
    guarantee that Lucy will always return primitives from these enums, now and
    forever, one per method call.
    Still torn... I think it's convenience vs performance.
    But convenience for the posting format plugin developer matters too, right?
    Are you confident that a generic aggregator can support all possible codecs,
    or will plugin developers be forced to ensure that aggregation works because
    you've guaranteed to users like Renaud that it will?

    Marvin Humphrey


  • Uwe Schindler at Feb 10, 2010 at 9:47 am

    And we don't return "objects or aggregates" with Multi*Enum now...
    Yeah, this is different. In KS right now, we use a generic
    PostingList, which
    conveys different information depending on what class of Posting it
    contains.
    In flex right now the codec is unware that it's being "consumed" by a
    Multi*Enum.
    Right, but in KinoSearch's case PostingList had to be aware of that
    because
    the Posting object could be consumed at either the segment level or the
    index
    level -- so it needed a setDocBase(offset) method which adjusted the
    doc num in
    the Posting. It was messy.
    The doc base adaptation is done in the MultiDocsEnum in Lucene.
    The change I made was to eliminate PolyPostingList and
    PolyPostingListReader,
    which made it possible to remove the setDocBase() method from
    SegPostingList.
    It still returns primitives. If instead we returned an int[] for positions
    (hmm -- may be a good reason to make positions be an Attribute, Uwe), I
    think it would still be OK?
    Positions as attributes would be good. For positions we need a new Attribute (not PositionIncrement), but e.g. for offsets and payloads we can use the standard attributes from the analysis, which is really cool. This would also make it possible to add all custom attributes from the analysis phase to the posting list and make them visible in the TermDocs enum. In my opinion, there should be no DocsEnum, DocsAndPositionsEnum, and so on; just one enum class, which only differs in the provided attributes. So if you want the payloads, ask for a standard DocsEnum and pass the requested attribute classes as parameters:
    IndexReader.termDocsEnum(Bits skipDocs, String field, BytesRef term, Class<? extends Attribute>... atts)

    If somebody wants offsets and payloads:
    reader.termDocsEnum(skipDocs, "field", term, OffsetAttribute.class, PayloadAttribute.class);

    But before we can implement this for MultiEnums we need the proxy attributes, or we need to copy them around (and the MultiEnums get their own AttributeSource). For this to work I will add an AttributeSource.copyTo(AttributeSource), which is on my todo list but still missing. For some TokenStreams this method may also be convenient (e.g. concatenating TokenStreams).

    On the other hand: with proxy attributes, concatenating TokenStreams would be easy (and very performant!), too.
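    The copyTo semantics Uwe proposes (the method did not exist yet at the time) can be sketched as follows. The key point is that consumers keep references to attribute instances, so copying must update the target's existing instances in place rather than replace them. Attribute state is modeled here as an int[] cell keyed by attribute name; this is purely illustrative, not Lucene's API.

    ```java
    import java.util.LinkedHashMap;
    import java.util.Map;

    public class CopyToDemo {
        // For every attribute the source holds, ensure the target has one of
        // the same type (create it if missing, as addAttribute would) and
        // copy the state into it without replacing the instance.
        static void copyTo(Map<String, int[]> source, Map<String, int[]> target) {
            for (Map.Entry<String, int[]> e : source.entrySet()) {
                target.computeIfAbsent(e.getKey(), k -> new int[1])[0] = e.getValue()[0];
            }
        }

        public static void main(String[] args) {
            Map<String, int[]> subEnumAtts = new LinkedHashMap<>();
            subEnumAtts.put("PayloadAttribute", new int[]{7}); // toy payload state
            Map<String, int[]> multiEnumAtts = new LinkedHashMap<>();
            // On each advance, the MultiEnum would copy the current
            // sub-enum's attribute state up into its own AttributeSource:
            copyTo(subEnumAtts, multiEnumAtts);
            System.out.println(multiEnumAtts.get("PayloadAttribute")[0]); // prints 7
        }
    }
    ```

    Because the target's instance survives the copy, a consumer that called getAttribute once on the MultiEnum still sees fresh state after every advance.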
    You should (when possible/reasonable) instead use
    ReaderUtil.gatherSubReaders, then iterate through those sub readers
    asking each for its flex fields.

    But if this is only for testing purposes, and Multi*Enum is more
    convenient, then (once attrs work correctly) Multi*Enum is
    perfectly fine.
    Mike, FWIW, I've removed the ability to iterate over posting data at
    anything
    other than the segment level from KS. There's still a priority-queue-
    based
    aggregator for iterating over all terms in a multi-segment index, but
    not for
    anything lower.
    I am not sure if this is very good in Lucene as it would break lots of apps. E.g. simple autocompletes use PrefixTerm(s)Enums, but must use the top-level reader or they have to emulate merging of all TermsEnums themselves. A second problem (currently) is rewrites (e.g. Fuzzy) to BooleanQuery for MTQs. They operate on the top-level reader.

    So I propose "simple" and not so performant Enums for MultiReaders. In my opinion, it would also be possible without ProxyAttributes, if we simply copy them around. It’s a performance problem, but if somebody needs speed, segment-level enums should be used (and search does this by the way).

    Uwe


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Renaud Delbru at Feb 10, 2010 at 12:59 pm

    On 10/02/10 09:47, Uwe Schindler wrote:
    Positions as attributes would be good. For positions we need a new Attribute (not PositionIncrement), but e.g. for offsets and payloads we can use the standard attributes from the analysis, which is really cool. This would also make it possible to add all custom attributes from the analysis phase to the posting list and make them visible in the TermDocs enum. In my opinion, there should be no DocsEnum, DocsAndPositionsEnum and so on; just one class, which only differs in the provided attributes. So if you want the payloads, ask for a standard DocsEnum and pass the requested attribute classes as parameters:
    IndexReader.termDocsEnum(Bits skipDocs, String field, BytesRef term, Class<? extends Attribute>... atts)

    If somebody wants offsets and payloads:
    reader.termDocsEnum(skipDocs, "field", term, OffsetAttribute.class, PayloadAttribute.class);
    I kind of like this idea. This interface to iterate over the postings
    looks more flexible, and imho it will be easy to use this interface with
    any "home-brewed" codec.
    Read optimisations based on the user's needs, such as the current
    termDocsEnum and termPositionsEnum (where one reads only the freq
    file and the other also reads the prox file), will be done under
    the hood by the respective PostingReader. Given the set of Attribute
    classes received, the PostingReader knows what it needs to read and what
    it does not need to read. So, there is also a simplification of the
    interface for the user, who does not have to take care of choosing the
    right enum.
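Renaud's idea above -- a posting reader deciding what to decode from the set of requested attribute classes -- can be sketched outside Lucene in plain Java. Everything here (`SketchPostingsReader`, the marker attribute interfaces) is invented for illustration and is not the flex API:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class SketchPostingsReader {
    // Marker "attribute" interfaces standing in for flex Attribute classes.
    public interface PositionsAttr {}
    public interface PayloadAttr {}

    private final Set<Class<?>> requested;

    public SketchPostingsReader(Class<?>... atts) {
        this.requested = new HashSet<>(Arrays.asList(atts));
    }

    // Decide which files the reader would open: the freq file is always
    // needed; the prox file only when positions or payloads were requested.
    public boolean needsProxFile() {
        return requested.contains(PositionsAttr.class)
            || requested.contains(PayloadAttr.class);
    }

    public static void main(String[] args) {
        SketchPostingsReader docsOnly = new SketchPostingsReader();
        SketchPostingsReader withPos = new SketchPostingsReader(PositionsAttr.class);
        System.out.println(docsOnly.needsProxFile()); // false
        System.out.println(withPos.needsProxFile());  // true
    }
}
```

In a real codec this decision would live inside the PostingsReader; the point is only that the requested attribute classes carry enough information to choose what to read.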
    I am not sure if this is very good in Lucene as it would break lots of apps. E.g. simple autocompletes use a PrefixTerm(s)Enums, but must use the top-level reader or they have to emulate merging of all TermsEnums themselves. A second problem (currently) is rewrites (e.g. Fuzzy) to BooleanQuery for MTQs. They operate on the top level reader.

    So I propose "simple" and not so performant Enums for MultiReaders. In my opinion, it would also be possible without ProxyAttributes, if we simply copy them around. It’s a performance problem, but if somebody needs speed, segment-level enums should be used (and search does this by the way).
    Could you provide pointers to search code that uses the segment-level
    enum?
    As I explained in my last answer to Michael, the TermScorer uses the
    DocsEnum interface, and therefore does not know whether it manipulates a
    segment-level enum or a Multi*Enum. What search code (or query operators)
    in Lucene uses segment-level enums?

    Cheers
    --
    Renaud Delbru

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Uwe Schindler at Feb 10, 2010 at 1:15 pm

    Could you provide pointers to search code that uses the segment-level
    enum ?
    As I explained in my last answer to Michael, the TermScorer is using
    the
    DocsEnum interface, and therefore do not know if it manipulates
    segment-level enum or a Multi*Enums. What search (or query operators)
    in
    Lucene is using segment-level enums ?
    All of them; only rewrites are currently done on the top-level reader. IndexSearcher since 2.9 creates Scorers separately for each segment and merges the results in its collector. Because of that we have a modified Collector interface that has setNextReader() methods and so on.

    So you can assume that every Scorer uses a SegmentReader, but legacy code may behave differently (e.g. if somebody instantiates a TermScorer and passes the top-level reader to it). Also Solr is not yet completely free of global readers (as far as I know).
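The per-segment search Uwe describes -- scorers created per segment, with a collector that is told each reader's doc base -- can be mimicked in a self-contained sketch. The class and method names below are invented for illustration (Lucene's real Collector.setNextReader additionally receives the sub-reader itself):

```java
import java.util.ArrayList;
import java.util.List;

public class PerSegmentSearchSketch {
    // A toy collector: like Lucene's Collector, it is told the doc base of
    // each new sub-reader and adds it to every segment-local hit.
    static class SketchCollector {
        final List<Integer> hits = new ArrayList<>();
        int docBase;

        void setNextReader(int docBase) { this.docBase = docBase; }
        void collect(int doc)           { hits.add(docBase + doc); }
    }

    // Each int[] holds the segment-local doc IDs a per-segment scorer
    // matched; maxDocs[i] is that segment's maxDoc(), used to advance
    // the doc base between segments.
    public static List<Integer> search(int[][] segmentHits, int[] maxDocs) {
        SketchCollector collector = new SketchCollector();
        int docBase = 0;
        for (int i = 0; i < segmentHits.length; i++) {
            collector.setNextReader(docBase);
            for (int doc : segmentHits[i]) {
                collector.collect(doc);
            }
            docBase += maxDocs[i];
        }
        return collector.hits;
    }

    public static void main(String[] args) {
        // Segment 0 (maxDoc 3) matched docs 0 and 2; segment 1 (maxDoc 4)
        // matched its local doc 1, which maps to global doc 4.
        System.out.println(search(new int[][]{{0, 2}, {1}}, new int[]{3, 4})); // [0, 2, 4]
    }
}
```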


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Renaud Delbru at Feb 10, 2010 at 2:48 pm

    On 10/02/10 13:15, Uwe Schindler wrote:
    Could you provide pointers to search code that uses the segment-level
    enum ?
    As I explained in my last answer to Michael, the TermScorer is using
    the
    DocsEnum interface, and therefore do not know if it manipulates
    segment-level enum or a Multi*Enums. What search (or query operators)
    in
    Lucene is using segment-level enums ?
    All of them; only rewrites are currently done on the top-level reader. IndexSearcher since 2.9 creates Scorers separately for each segment and merges the results in its collector. Because of that we have a modified Collector interface that has setNextReader() methods and so on.
    Ok, so for example, in TermQuery$TermWeight#scorer(reader,
    scoreDocsInOrder, topScorer), the reader passed as parameter is one of
    the sub-readers? Is that right?

    If this is the case, now I understand why Michael was saying that the
    way I am testing the postings (using termPositionsEnum on the top-level
    reader) was not really the proper way to test it, and that the correct
    way will be instead to use directly a TermQuery.

    Thanks for the clarification.
    --
    Renaud Delbru

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Michael McCandless at Feb 10, 2010 at 5:20 pm

    On Wed, Feb 10, 2010 at 9:47 AM, Renaud Delbru wrote:
    On 10/02/10 13:15, Uwe Schindler wrote:

    Could you provide pointers to search code that uses the segment-level
    enum ?
    As I explained in my last answer to Michael, the TermScorer is using
    the
    DocsEnum interface, and therefore do not know if it manipulates
    segment-level enum or a Multi*Enums. What search (or query operators)
    in
    Lucene is using segment-level enums ?
    All of them, only rewrites are currently done on the top-level reader.
    IndexSearcher since 2.9 creates Scorers separately for each segment and
    merges the results in its collector. Because of that we have a modified
    Collector interface that has setNextReader() methods and so on.
    Ok, so for example, in TermQuery$TermWeight#scorer(reader, scoreDocsInOrder,
    topScorer), the reader passed as parameter is one of the subscorer ? Is that
    right ?
    Right, it will be a SegmentReader.

    But, you're right -- the scorer method will also accept a
    Multi/DirectoryReader, and iterate a Multi*Enum in that case. It's
    just less performant, so, Lucene doesn't do that when it creates
    scorers.

    Mike

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Michael McCandless at Feb 10, 2010 at 11:58 am

    On Tue, Feb 9, 2010 at 4:44 PM, Marvin Humphrey wrote:

    Interesting... and segment merging just does its own private
    concatenation/mapping-around-deletes of the doc/positions?
    I think the answer is yes, but I'm not sure I understand the
    question completely since I'm not sure why you'd ask that in this
    context.
    Segment merging is one place that "legitimately" needs to append
    docs/positions enum of multiple sub readers... but obviously it can
    just do this itself (and it must, since it renumbers the docIDs).
    what's a "flat positions space"?
    It's something Google once used. Instead of positions starting with
    0 at each document, they just keep going.

    doc 1: "Three Blind Mice" - positions 0, 1, 2
    doc 2: "Peter Peter Pumpkin Eater" - positions 3, 4, 5, 6
    And we don't return "objects or aggregates" with Multi*Enum now...
    Yeah, this is different. In KS right now, we use a generic
    PostingList, which conveys different information depending on what
    class of Posting it contains.
    OK
    In flex right now the codec is unaware that it's being "consumed" by
    a Multi*Enum.
    Right, but in KinoSearch's case PostingList had to be aware of that
    because the Posting object could be consumed at either the segment
    level or the index level -- so it needed a setDocBase(offset) method
    which adjusted the doc num in the Posting. It was messy.

    The change I made was to eliminate PolyPostingList and
    PolyPostingListReader, which made it possible to remove the
    setDocBase() method from SegPostingList.
    But why didn't you have the Multi*Enums layer add the offset (so that
    the codec need not know who's consuming it)? Performance?
    It still returns primitives. If instead we returned an int[] for
    positions (hmm -- may be a good reason to make positions be an
    Attribute, Uwe), I think it would still be OK?
    In the flat positions space example, it would be necessary to add an
    offset to each of the positions in that array. Each segment would
    have a "positions max" analogous to maxDoc(); these would be summed
    to obtain the positions offset the same way we add up maxDoc() now
    to obtain the doc id offset.
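The summing Marvin describes can be made concrete with a small standalone sketch (the positionsMax values below are invented; nothing here is Lucy/KS code). A global position in the flat space is the segment-local position plus the sum of the "positions max" of all earlier segments, exactly as doc bases are summed from maxDoc():

```java
public class FlatPositionsSketch {
    // Global position = segment-local position + sum of positionsMax of all
    // earlier segments.
    public static int globalPosition(int[] positionsMax, int segment, int localPos) {
        int offset = 0;
        for (int i = 0; i < segment; i++) {
            offset += positionsMax[i];
        }
        return offset + localPos;
    }

    public static void main(String[] args) {
        // Suppose segment 0 holds "Three Blind Mice" (positions 0..2, so
        // positionsMax 3); then the first token of segment 1 lands at 3.
        int[] positionsMax = {3, 4};
        System.out.println(globalPosition(positionsMax, 1, 0)); // 3
        System.out.println(globalPosition(positionsMax, 0, 2)); // 2
    }
}
```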
    OK, but [so far] we don't have that problem with the flex APIs -- the
    codec is not aware that there's a multi enum layer consuming it.
    That example may not be a deal breaker for you, but I'm not willing
    to guarantee that Lucy will always return primitives from these
    enums, now and forever, one per method call.
    But it'd be a major API change down the road to change this, for
    Lucy/KS? Ie this example seems not to apply to Lucene, and even for
    KS/Lucy seems contrived -- neither Lucene nor KS/Lucy would/could up
    and make such a major API change to the enums, once "committed".

    Also, this is why we're adding Attribute* to all the postings enums,
    with flex -- any codec & consumer can use their own private
    attributes. The attrs pass through Multi*Enum.
    Still torn... I think it's convenience vs performance.
    But convenience for the posting format plugin developer matters too,
    right?
    Right but the existence of Multi*Enums isn't affecting the codec dev
    (so far, I think).
    Are you confident that a generic aggregator can support all possible
    codecs, or will plugin developers be forced to ensure that
    aggregation works because you've guaranteed to users like Renaud
    that it will?
    Well... pretty confident. So far, at least? We have an existence
    proof :) The codec API really should not (and, should not have to)
    bake in details of who's consuming it.

    Mike

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Marvin Humphrey at Feb 10, 2010 at 1:28 pm

    On Wed, Feb 10, 2010 at 06:58:01AM -0500, Michael McCandless wrote:
    But why didn't you have the Multi*Enums layer add the offset (so that
    the codec need not know who's consuming it)? Performance?
    That would have involved something like this within the aggregator:

    posting.setDocID(posting.getDocID() + docBase);

    The problem is that that's the docID the SegPostingList is using for its
    deltas. If the SegPostingList skips during a call to advance(), it needs to
    reset that docID to what the skip data says -- but if the aggregator layer
    doesn't tell it that it needs to account for a docBase, the new docID will
    lose the offset. This can't be solved at the aggregator level either --
    the aggregator doesn't know when skipping is occurring, so it can't intervene
    on an as-needed basis.

    The fix was to make SegPostingList aware of a docBase, so that on skipping it
    could add it to the docID in the skip data and land at the right docID from
    the perspective of the consumer. Messy.
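The failure mode can be reproduced in miniature: a sub-enum whose advance() assigns its docID directly from absolute, segment-local skip data discards any offset an aggregator patched in beforehand. All names below are invented stand-ins, not KS classes:

```java
public class SkipOffsetSketch {
    // A toy segment-local enum: advance() jumps via "skip data" that stores
    // absolute segment-local doc IDs, assigning them directly.
    static class SubEnum {
        int doc = 5;                            // currently on local doc 5
        final int[] skipDocs = {0, 100, 200};   // pretend skip-list entries

        void advance(int target) {
            for (int skip : skipDocs) {
                if (skip >= target) { doc = skip; return; } // direct assignment
            }
        }
    }

    public static int advanceAfterPatch(int docBase, int target) {
        SubEnum sub = new SubEnum();
        sub.doc += docBase;   // aggregator patches the offset into the posting...
        sub.advance(target);  // ...but the skip overwrites doc wholesale
        return sub.doc;       // the docBase is gone
    }

    public static void main(String[] args) {
        // With docBase 1000, the consumer wants segment-local doc 100 seen as
        // global doc 1100 -- but the sub-enum lands on plain 100.
        System.out.println(advanceAfterPatch(1000, 100)); // 100, not 1100
    }
}
```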

    I suppose another possibility would have been to have the aggregator keep its
    own Posting and copy all data over from the SegPostingList's Posting on each
    iteration then add its offset. However, that would have been a lot less
    efficient, and it still wouldn't have worked for the "flat positions space"
    example because the generic aggregator would not have known about the needs of
    the specific codec.
    That example may not be a deal breaker for you, but I'm not willing
    to guarantee that Lucy will always return primitives from these
    enums, now and forever, one per method call.
    But it'd be a major API change down the road to change this, for
    Lucy/KS?
    I suppose so. It's either foreclose on the possibility of aggregating (Lucy),
    or foreclose on the possibility of using properties that cannot be aggregated
    (Lucene).
    Also, this is why we're adding Attribute* to all the postings enums,
    with flex -- any codec & consumer can use their own private
    attributes. The attrs pass through Multi*Enum.
    Hmm. Does that mean that the consumer needs to refresh the attributes with
    each iteration? Because what happens when you switch sub-enums within the
    Multi*Enum? Don't those attributes go stale, as they belong to a sub-enum
    that has finished?

    Marvin Humphrey


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Michael McCandless at Feb 10, 2010 at 5:33 pm

    On Wed, Feb 10, 2010 at 8:27 AM, Marvin Humphrey wrote:

    But why didn't you have the Multi*Enums layer add the offset (so
    that the codec need not know who's consuming it)? Performance?
    That would have involved something like this within the aggregator:

    posting.setDocID(posting.getDocID() + docBase);

    The problem is that that's the docID the SegPostingList is using for
    its deltas. If the SegPostingList skips during a call to advance(),
    it needs to reset that docID to what the skip data says -- but
    if the aggregator layer doesn't tell it that it needs to account for
    a docBase, the new docID will lose the offset. Can't solve that
    problem at the aggregator level either -- the aggregator doesn't
    know when skipping is occurring, so it can't intervene on an
    as-needed basis.
    In Lucene, skipping is done through the aggregator, so it knows that
    it's skipping, and in fact skips whole segments at a time until it
    gets to the segment that may contain the doc.
    The fix was to make SegPostingList aware of a docBase, so that on
    skipping it could add it to the docID in the skip data and land at
    the right docID from the perspective of the consumer. Messy.
    OK
    I suppose another possibility would have been to have the aggregator
    keep its own Posting and copy all data over from the
    SegPostingList's Posting on each iteration then add its offset.
    I think this is what Lucene does (?). EG the aggregator holds its own
    "int doc" which it must copy to (adding the offset) from the
    underlying sub enum.
    However, that would have been a lot less efficient, and it still
    wouldn't have worked for the "flat positions space" example because
    the generic aggregator would not have known about the needs of the
    specific codec.
    But aggregator could also add the positions offset on each
    nextPosition() call, in Lucene. Like that use case could be made to
    work, if Lucene had used a flat position space.
    That example may not be a deal breaker for you, but I'm not
    willing to guarantee that Lucy will always return primitives from
    these enums, now and forever, one per method call.
    But it'd be a major API change down the road to change this, for
    Lucy/KS?
    I suppose so. It's either foreclose on the possibility of aggregating (Lucy),
    or foreclose on the possibility of using properties that cannot be aggregated
    (Lucene).
    Right, though... if this even happens in practice for some future app,
    that app can choose to avoid Multi*Enum. Lucene internally doesn't
    use Multi*Enum (except during merging, which your codec can
    override, as of flex).
    Also, this is why we're adding Attribute* to all the postings enums,
    with flex -- any codec & consumer can use their own private
    attributes. The attrs pass through Multi*Enum.
    Hmm. Does that mean that the consumer needs to refresh the attributes with
    each iteration? Because what happens when you switch sub-enums within the
    Multi*Enum? Don't those attributes go stale, as they belong to a sub-enum
    that has finished?
    Switching sub-enums is indeed tricky (we're iterating on this in
    LUCENE-2154). Our current plan is to pass an attr source (maps attr
    interface to an actual instance that implements it) to each sub-enum,
    meaning, all codecs being aggregated must be able to use the same attr
    impl.

    So the consumer gets a single instance of TupleAttribute, next()'s through
    the enum, calling TupleAttribute.get() each time, regardless of
    whether it's an aggregated or non-aggregated enum.
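That plan -- one shared attribute instance handed to every sub-enum, so the consumer's reference never goes stale across segment boundaries -- can be sketched without the flex classes. TupleAttribute here is the hypothetical attribute from the thread, implemented as a plain mutable holder:

```java
import java.util.ArrayList;
import java.util.List;

public class SharedAttrSketch {
    // Stand-in for an Attribute impl: one mutable instance shared by all
    // sub-enums, so it never goes stale when sub-enums are switched.
    static class TupleAttribute {
        private String value;
        void set(String v) { value = v; }
        String get()       { return value; }
    }

    // A sub-enum fills the *shared* attribute instance on each step.
    static class SubEnum {
        final String[] tuples;
        final TupleAttribute attr;
        int upto = -1;

        SubEnum(String[] tuples, TupleAttribute attr) {
            this.tuples = tuples;
            this.attr = attr;
        }

        boolean next() {
            if (++upto >= tuples.length) return false;
            attr.set(tuples[upto]);
            return true;
        }
    }

    public static List<String> consumeAll(String[][] segments) {
        TupleAttribute attr = new TupleAttribute();  // single shared instance
        List<String> seen = new ArrayList<>();
        for (String[] seg : segments) {
            SubEnum sub = new SubEnum(seg, attr);    // same attr for each sub-enum
            while (sub.next()) {
                seen.add(attr.get());  // consumer never re-fetches the attribute
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(consumeAll(new String[][]{{"a", "b"}, {"c"}}));
    }
}
```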

    Mike

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Marvin Humphrey at Feb 10, 2010 at 7:42 pm

    On Wed, Feb 10, 2010 at 12:33:27PM -0500, Michael McCandless wrote:

    In Lucene, skipping is done through the aggregator,
    I had a look at MultiDocsEnum in the flex branch. It doesn't know when a
    sub-enum is reading skip data.
    I suppose another possibility would have been to have the aggregator
    keep its own Posting and copy all data over from the
    SegPostingList's Posting on each iteration then add its offset.
    I think this is what Lucene does (?). EG the aggregator holds its own
    "int doc" which it must copy to (adding the offset) from the
    underlying sub enum.
    That's fine for a *primitive* type. Modifying an int returned by a sub-enum
    doesn't affect the sub-enum. :)

    The problem arises when there's an opaque *object* conveying data to the
    consumer. The aggregator knows everything there is to know about an int, but
    it doesn't know what it needs to do to prepare an opaque object owned by the
    sub-enum for consumption at the aggregate level.
    However, that would have been a lot less efficient, and it still
    wouldn't have worked for the "flat positions space" example because
    the generic aggregator would not have known about the needs of the
    specific codec.
    But aggregator could also add the positions offset on each
    nextPosition() call, in Lucene. Like that use case could be made to
    work, if Lucene had used a flat position space.
    A generic aggregator wouldn't know that it needed to do that. The postings
    codec developer would be forced to write aggregation code in addition to
    segment-level code.

    Marvin Humphrey


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Michael McCandless at Feb 11, 2010 at 1:30 pm

    On Wed, Feb 10, 2010 at 2:42 PM, Marvin Humphrey wrote:
    On Wed, Feb 10, 2010 at 12:33:27PM -0500, Michael McCandless wrote:

    In Lucene, skipping is done through the aggregator,
    I had a look at MultiDocsEnum in the flex branch. It doesn't know when a
    sub-enum is reading skip data.
    I'm confused -- the MultiDocsEnum's advance method impl is the only
    place where we invoke advance on the sub readers. Oh you're saying we
    don't know if the underlying enum actually skipped vs just scanned?

    Isn't the skip data also based on deltas? So even if real skipping
    happened, Lucy/KS would not "lose" the offset that the aggregator had
    previously added? Or maybe I'm lost on what the issue is here...
    I suppose another possibility would have been to have the aggregator
    keep its own Posting and copy all data over from the
    SegPostingList's Posting on each iteration then add its offset.
    I think this is what Lucene does (?). EG the aggregator holds its own
    "int doc" which it must copy to (adding the offset) from the
    underlying sub enum.
    That's fine for a *primitive* type. Modifying an int returned by a sub-enum
    doesn't affect the sub-enum. :)

    The problem arises when there's an opaque *object* conveying data to the
    consumer. The aggregator knows everything there is to know about an int, but
    it doesn't know what it needs to do to prepare an opaque object owned by the
    sub-enum for consumption at the aggregate level.
    OK.
    However, that would have been a lot less efficient, and it still
    wouldn't have worked for the "flat positions space" example because
    the generic aggregator would not have known about the needs of the
    specific codec.
    But aggregator could also add the positions offset on each
    nextPosition() call, in Lucene. Like that use case could be made to
    work, if Lucene had used a flat position space.
    A generic aggregator wouldn't know that it needed to do that. The postings
    codec developer would be forced to write aggregation code in addition to
    segment-level code.
    Right, if position were not primitive but contained within an opaque
    (to the aggregator) object. And, you were doing the flat positions
    space.

    I guess... this restriction still seems academic... ie, not a real
    issue in Lucene. We use primitives in Lucene for doc/position, which
    we can remap as needed. We then require that opaque stuff (using
    attributes) "survive", unchanged, when passed through the aggregator.
    Either that, or you enumerate segment by segment in the code. I don't [yet]
    see this as an issue for Lucene...

    Mike

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Marvin Humphrey at Feb 11, 2010 at 5:17 pm

    On Thu, Feb 11, 2010 at 08:30:14AM -0500, Michael McCandless wrote:
    Oh you're saying we don't know if the underlying enum actually skipped vs
    just scanned?
    Yep.
    Isn't the skip data also based on deltas?
    Yes, but that's internal to the skip reader, in both Lucene and Lucy/KS. When
    it comes time to skip, the skip reader's doc id is assigned directly, in both
    libraries. From StandardPostingsReaderImpl.java:

    doc = skipper.getDoc();

    Trying to apply the skip reader's doc id information as a delta would get
    quite complicated. (A delta against... what?) I'm not sure that's even
    possible.
    So even if real skipping happened, Lucy/KS would not "lose" the offset that
    the aggregator had previously added? Or maybe I'm lost on what the issue is
    here...
    It would indeed "lose" the offset, because the skip reader's doc id
    information gets assigned directly rather than applied as a delta.

    And since the aggregator layer is not aware of when this occurs, it cannot
    intervene to re-apply the offset.

    Having driven down this dead-end, turned around and come back, I've become
    persuaded that requiring the segment-level postings iterator to be aware of
    its consumer is not a good idea.
    A generic aggregator wouldn't know that it needed to do that. The postings
    codec developer would be forced to write aggregation code in addition to
    segment-level code.
    Right, if position were not primitive but contained within an opaque
    (to the aggregator) object. And, you were doing the flat positions
    space.

    I guess... this restriction still seems academic... ie, not a real
    issue in Lucene.
    Not for the standard posting formats that Lucene offers. But the point of
    flex is to provide an extension framework, I thought.

    Well, whatever. It's just another place where Lucy and Lucene will part ways.

    Marvin Humphrey


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Renaud Delbru at Feb 10, 2010 at 12:46 pm
    Hi Michael,
    On 09/02/10 20:47, Michael McCandless wrote:
    But, then, it's very convenient when you need it and don't care about
    performance. EG in Renaud's usage, a test case that is trying to
    assert that all indexed docs look right, why should you be forced to
    operate per segment? He shouldn't have to bother with the details of
    which field/term/doc was indexed into which segment.

    Or, I guess we could argue that this test really should create a
    TermQuery and walk the matching docs... instead of using the low level
    flex enum APIs. Because searching impl already knows how to step
    through the segments.
    In fact, I care about performance, but I was using the
    IndexReader.termPositionsEnum to mimic the implementation of the
    different query scorers (e.g., TermScorer).
    I have already reimplemented many of the original Lucene Scorers to use
    my particular index structure. From what I have seen, the main low level
    scorers (e.g., TermScorer, PhraseScorer) are using the DocsEnum
    interface, and not a segment-level enum. From what I understand, these
    scorers are not aware whether they are using a segment-level enum or a
    Multi*Enum. So, is there a loss of performance in this case? Or am I
    missing something?

    I'll try to clarify my usage of the Flex API; maybe it will highlight
    certain aspects for you.
    In the ideal world, what I would like to do is the following:
    1) write my own codec,
    2) register my codec in the IndexWriter, and tell it to use this codec
    for one or more fields (similar to the PerFieldCodecWrapper),
    3) write query operators that are compatible with my codec,
    4) at search time, use these query operators with the fields that use my
    codec.

    If, by error, I use query operators that are not compatible
    with a field (and its related codec), an exception is thrown telling me
    that I cannot use these query operators with this field.

    So, in my current use case, I don't think it is necessary to be aware of
    the fact that I am manipulating multiple segments or only one segment.
    I think this should be hidden.

    But what you were suggesting is to create my own "MultiReader" that is
    optimised for my codec. Is that right? A MultiReader that just iterates
    over the subreaders, checks if they are using my codec (and therefore the
    associated fields), and uses them to iterate over my own postings?
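The wrapper Renaud sketches in prose could look roughly like this standalone mock (`codecName()` and the reader classes are invented stand-ins; a real version would inspect each sub-reader's flex fields instead):

```java
import java.util.ArrayList;
import java.util.List;

public class CodecFilterSketch {
    // Minimal stand-in for a segment reader that knows which codec wrote it.
    static class MockSegmentReader {
        final String codec;
        MockSegmentReader(String codec) { this.codec = codec; }
        String codecName() { return codec; }
    }

    // Keep only the sub-readers whose fields were written with the wanted
    // codec; queries incompatible with other fields would be rejected here.
    public static List<MockSegmentReader> subReadersFor(
            List<MockSegmentReader> subs, String wantedCodec) {
        List<MockSegmentReader> matching = new ArrayList<>();
        for (MockSegmentReader r : subs) {
            if (wantedCodec.equals(r.codecName())) {
                matching.add(r);
            }
        }
        return matching;
    }

    public static void main(String[] args) {
        List<MockSegmentReader> subs = List.of(
            new MockSegmentReader("Standard"),
            new MockSegmentReader("MyCodec"));
        System.out.println(subReadersFor(subs, "MyCodec").size()); // 1
    }
}
```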
    --
    Renaud Delbru

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org

Discussion Overview
group: java-user
categories: lucene
posted: Feb 9, '10 at 12:05p
active: Feb 11, '10 at 5:17p
posts: 27
users: 7
website: lucene.apache.org
