Hi,

I currently use multiple Fieldable instances to index the sentences of a
document.
When there is a single Fieldable instance, the token offsets generated
by DocumentWriter are correct.
The problem appears when there are two or more Fieldable instances. In
the DocumentWriter$FieldData#invertField method, if the field is
tokenized, instead of updating the offset attribute with
stringValue.length() (which is what happens if the field is not
tokenized, line 1458), you update the offset attribute with the end
offset of the last token (line 1503: offset = offsetEnd+1;).
As a consequence, if a token at the end of the value has been filtered
out (for example a stopword, a dot, a space, etc.), the offset attribute
is updated with the end offset of the last token that was not filtered.
The stored offset is then too small (it is shifted back), and the
offsets of all subsequent Fieldable instances are shifted back with it.
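
To make the drift concrete, here is a minimal standalone sketch. It is not the Lucene source: the whitespace tokenizer, the stop-word set, and the class and method names are all made up for illustration. Only the `offset = offsetEnd+1` update mirrors the line quoted above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class OffsetDriftSketch {
    // Hypothetical stop-word list, standing in for whatever the analyzer filters.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("is", "the", "a"));

    /** Returns {start, end} pairs, relative to value, for the tokens that survive filtering. */
    static List<int[]> tokenize(String value) {
        List<int[]> spans = new ArrayList<>();
        int i = 0;
        while (i < value.length()) {
            while (i < value.length() && value.charAt(i) == ' ') {
                i++; // skip separators
            }
            int start = i;
            while (i < value.length() && value.charAt(i) != ' ') {
                i++; // consume one word
            }
            if (i > start && !STOP_WORDS.contains(value.substring(start, i))) {
                spans.add(new int[] {start, i});
            }
        }
        return spans;
    }

    /** Base offset for the next instance under the "offset = offsetEnd+1" rule. */
    static int nextBaseFromLastToken(int base, String value) {
        int offsetEnd = base - 1;
        for (int[] span : tokenize(value)) {
            offsetEnd = base + span[1]; // only surviving tokens are ever seen
        }
        return offsetEnd + 1;
    }
}
```

Under these assumptions, `nextBaseFromLastToken(0, "quick fox")` and `nextBaseFromLastToken(0, "quick fox is")` both return 10, although the second value is three characters longer. Every token of the following instance would therefore be recorded three characters too early.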

Is this a bug? Or is it the desired behavior (and if so, why)?

Regards.

--
Renaud Delbru,
E.C.S., Ph.D. Student,
Semantic Information Systems and
Language Engineering Group (SmILE),
Digital Enterprise Research Institute,
National University of Ireland, Galway.
http://smile.deri.ie/

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


  • Michael McCandless at Mar 5, 2008 at 9:16 am
    This is how Lucene has worked for quite some time (since 1.9).

    When there are multiple fields with the same name in one Document,
    each field's offset starts from the last offset (offset of the last
    token) seen in the previous field. If tokens are skipped at the end
    there's no way IndexWriter can know (because tokenStream doesn't
    return them). It's as if we need the ability to query a tokenStream
    for its "final" offset or something.

    One workaround might be to insert an "end marker" token, with the
    true end offset, which is a term you would never search on?
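
    A rough sketch of that workaround, written against plain Java types rather than the real Lucene token classes. The `Tok` class, the marker term and both method names are hypothetical; the only assumption carried over from the thread is that the indexer advances the base offset to one past the end offset of the last token it sees.

```java
import java.util.ArrayList;
import java.util.List;

class EndMarkerSketch {
    // Hypothetical marker term; pick something no query would ever contain.
    static final String END_MARKER = "__end_of_value__";

    /** A token with offsets relative to the start of its field value. */
    static final class Tok {
        final String term;
        final int start;
        final int end;

        Tok(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** Appends a zero-width marker token positioned at the true end of the value. */
    static List<Tok> withEndMarker(List<Tok> surviving, int valueLength) {
        List<Tok> out = new ArrayList<>(surviving);
        out.add(new Tok(END_MARKER, valueLength, valueLength));
        return out;
    }

    /** The "offset = offsetEnd+1" rule: the next base is one past the last token seen. */
    static int nextBase(int base, List<Tok> tokens) {
        int offsetEnd = base - 1;
        for (Tok t : tokens) {
            offsetEnd = base + t.end;
        }
        return offsetEnd + 1;
    }
}
```

    For the 12-character value "quick fox is" with "is" filtered out, the surviving tokens end at 9, so the next base would be 10 without the marker and 13 with it. The cost is one extra term per field instance in the index.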

    Mike

  • Renaud Delbru at Mar 5, 2008 at 9:53 am
    Do you know if there will be side-effects if we replace in
    DocumentWriter$FieldData#invertField
    offset = offsetEnd+1;
    by
    offset = stringValue.length();

    I still do not understand the reason for choosing this way of
    incrementing the start offset.

    Regards.

  • Michael McCandless at Mar 5, 2008 at 11:10 am
    Well, first off, sometimes the thing being indexed isn't a string, so
    you have no stringValue to get its length. It could be a Reader or a
    TokenStream.

    Second off, it's conceivable that an analyzer computes its own
    "interesting" offsets that are not in fact simple indices into the
    stringValue, though I would expect that to be the exception not the
    rule.

    I can't think of any other harm ... so if neither of these apply in
    your situation then it should be OK?
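
    As an illustration of the first caveat, here is a small sketch of a length-based rule that falls back to the last-token rule when there is no string to measure. The class and method names are hypothetical, and it assumes the proposed change is meant to advance the running offset by the value's length. It also assumes the analyzer's offsets are plain indices into the string, which is exactly the second caveat.

```java
class LengthBasedOffsetSketch {
    /**
     * Computes the base offset for the next field instance.
     *
     * @param base         base offset of the current instance
     * @param stringValue  the field's text, or null when the value came from a
     *                     Reader or a pre-built token stream
     * @param lastTokenEnd end offset (relative to this value) of the last token
     *                     seen, or -1 if no token was produced
     */
    static int nextBase(int base, String stringValue, int lastTokenEnd) {
        if (stringValue != null) {
            // Length-based rule: filtered trailing characters are still counted.
            return base + stringValue.length();
        }
        // No string available: fall back to "offset = offsetEnd+1".
        return base + lastTokenEnd + 1;
    }
}
```

    With the value "quick fox is" and a last surviving token ending at 9, the length-based branch gives 12, while the fallback for a Reader-backed value gives 10.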

    I do agree this seems like a bug. EG, if you use Highlighter on a
    multi-valued field indexed with stored field & term vectors and say
    the first field ended with a stop word that was filtered out, then
    your offsets will be off and the wrong parts will be highlighted in
    all but the first field (I think?). I think we really need some way
    for the tokenStream to "declare" its final offset at the end.

    Mike

  • Toph at Jun 30, 2008 at 11:14 pm
    Interesting discussion... glad I'm not the only one with this challenge.


    Michael McCandless-2 wrote:
    EG, if you use Highlighter on a
    multi-valued field indexed with stored field & term vectors and say
    the first field ended with a stop word that was filtered out, then
    your offsets will be off and the wrong parts will be highlighted

    I found this post while attempting exactly this, and I can confirm
    that yes, the offsets are incorrect for all but the first instance of the
    field in the document, so they are useless for highlighting. I tried
    concatenating all instances of the field, but of course if an instance
    ended with punctuation or a stop word, those characters were not
    added to the offset. I'll try the suggested workaround of adding a false
    term at the end of each field, but a better API would be for "offset" to
    become a pair of ints, the first being the index of the Field in
    getFields(name) and the second being the offset within that instance of
    the field.

    Christopher
  • Michael McCandless at Jul 2, 2008 at 12:33 pm
    This would actually be a fairly large change: it's a change to the
    index format and all APIs that handle offsets during indexing &
    searching/retrieving.

    We could alternatively extend TokenStream so you could query it for
    the final offset, then fix indexing to use that value instead of the
    endOffset of the last token that it saw.
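
    A minimal sketch of that idea, using a made-up `OffsetStream` interface rather than the actual TokenStream API. The `finalOffset()` method and everything else here is hypothetical; keeping the `+1` gap between instances is also an assumption, carried over from the existing rule.

```java
class FinalOffsetSketch {
    /** Hypothetical stand-in for a token stream; not the Lucene TokenStream API. */
    interface OffsetStream {
        boolean next();     // advance to the next surviving token, false when exhausted
        int endOffset();    // end offset of the current token, relative to this value
        int finalOffset();  // where the underlying input really ended
    }

    /** Builds a stream from the end offsets of the surviving tokens and the input length. */
    static OffsetStream of(final int[] tokenEnds, final int inputLength) {
        return new OffsetStream() {
            private int i = -1;

            public boolean next() {
                return ++i < tokenEnds.length;
            }

            public int endOffset() {
                return tokenEnds[i];
            }

            public int finalOffset() {
                return inputLength;
            }
        };
    }

    /** Consumes the stream, then advances the base from the declared final offset. */
    static int nextBase(int base, OffsetStream stream) {
        while (stream.next()) {
            // A real indexer would record base + the token's offsets here.
        }
        return base + stream.finalOffset() + 1;
    }
}
```

    Because the stream itself reports where its input ended, this also covers values that come from a Reader or a custom analyzer, where no stringValue is available to measure.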

    Mike

  • Toph at Jul 2, 2008 at 2:20 pm

    Michael McCandless-2 wrote:
    This would actually be a fairly large change: it's a change to the
    index format and all APIs that handle offsets during indexing &
    searching/retrieving.

    For now I just changed the offset calculation in DocumentWriter as
    suggested here by the OP:

    replace DocumentWriter$FieldData#invertField offset = offsetEnd+1; by
    offset = stringValue.length();

    It has side effects, as previously mentioned on this list, e.g. if the
    tokenstream is not backed by a stringValue or the Analyzer does not
    calculate offsets in the normal way. But for my purposes it works.

    This issue was also discussed previously here:
    http://lucene.markmail.org/search/?q=offset%20documentwriter#query:offset%20documentwriter+page:1+mid:l6jbfmfisyg5zyre+state:results

    Michael McCandless-2 wrote:
    We could alternatively extend TokenStream so you could query it for
    the final offset, then fix indexing to use that value instead of the
    endOffset of the last token that it saw.

    Querying the tokenstream for the final offset would be good, but then
    would the change be put into the DocumentWriter directly or be
    available as an option?

    Chris
  • Michael McCandless at Jul 2, 2008 at 5:22 pm

    Toph wrote:
    Querying the tokenstream for the final offset would be good, but then
    would the change be put into the DocumentWriter directly or be
    available as an option?

    I would put the change into DocumentsWriter directly (ie running by
    default) with an option to enable the old (buggy) behavior for those
    apps that have workarounds and want to get back to the back-compatible
    behavior.

    Mike


Discussion Overview
group: java-user
category: lucene
posted: Mar 4, '08 at 6:05p
active: Jul 2, '08 at 5:22p
posts: 8
users: 3
website: lucene.apache.org
