Hi,
I am storing custom values in the Tokens provided by a Tokenizer, but
when I retrieve them from the index the values don't match. I've looked
in the LIA book, but it is not current: it says that term vectors
aren't stored. I'm using Lucene Nightly 146, but the same thing
happened with older versions. Looking at the internals, DocumentWriter
seems to keep track of the end offset that was placed into the index and
modifies the token values (with +1), but I'm not sure whether I should be
concerned about it.
No existing analyzers are used when adding the document, so all the
offsets are generated manually.
Any suggestions on how the token offsets should be stored?

Is this valid?
Token, start, end
aaa, 0, 3
bbb, 4, 7
ccc, 8, 11

Thanks,
Shahan
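
On the "Is this valid?" table: in Lucene, a Token's start offset is the index of its first character and its end offset is one past its last character, so the listed values are what a whitespace split would produce. Below is a minimal sketch in plain Java (no Lucene classes). It assumes the source text was "aaa bbb ccc" with single spaces, which the post does not state, and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: computes [start, end) character offsets for whitespace-separated terms. */
public class OffsetSketch {

    /** Returns one {start, end} pair per term; end is one past the term's last character. */
    public static List<int[]> offsets(String text) {
        List<int[]> result = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            // Skip whitespace between terms.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume the term itself.
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                result.add(new int[] {start, i});
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "aaa bbb ccc"; // assumed source text, not given in the post
        for (int[] o : offsets(text)) {
            System.out.println(text.substring(o[0], o[1]) + ", " + o[0] + ", " + o[1]);
        }
    }
}
```

Under that assumption the sketch prints the same three rows as the table, so the start/end convention in the question looks consistent.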

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

  • Ard Schrijvers at Jul 16, 2007 at 8:14 am
    Hello,

    > I am storing custom values in the Tokens provided by a Tokenizer but
    > when retrieving them from the index the values don't match.

    What do you mean by retrieving? Do you mean retrieving terms, or do you mean searching for words that you know should be in the index and not finding a match?

    In the latter case, you must make sure that you use the same analyzer for searching as you used for indexing.
    > Any suggestions of how the token offsets should be stored?

    Look at other classes that implement TokenStream. Also take a look at setPositionIncrement when you are putting in your own terms.

    Regards, Ard
  • Shahan Khatchadourian at Jul 16, 2007 at 3:33 pm
    Thank you for the reply, Ard.

    The tokens exist in the index and are returned accurately, except for
    the offsets. In this case I am not dealing with positions, so the
    term vector is specified as 'with_offsets'. I have left the term
    position increment at its default. Looking at the existing TokenStreams,
    they don't keep track of the current position: they always
    generate start offsets beginning at 0 for the current stream, and
    a 'proper' offset is then generated from the +1 that DocumentWriter
    applies to the previous token when indexing. Nor are there any test
    cases for offsets. I found a bug, opened a while ago, that deals
    with this issue (as well as a related one):
    https://issues.apache.org/jira/browse/LUCENE-579

    I am retrieving a text token's offset values using
    TermPositionVector.getOffsets(), which returns TermVectorOffsetInfo[].
    The offset values that were placed into the token during indexing
    are not the ones being returned; they have been shifted.

    Thanks,
    Shahan
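
One way to picture the shift described in this message is sketched below. It is a plain-Java model, not Lucene's code. It assumes the writer keeps a running base offset per field name, adds that base to each token's start and end offsets, and after each token stream advances the base by the last token's end offset plus 1. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative model only; not taken from Lucene's DocumentWriter. */
public class OffsetShiftModel {

    /**
     * Each element of streams holds the {start, end} offsets of one token stream,
     * counted from 0 within that stream.
     * Returns the offsets as the assumed writer would record them.
     */
    public static List<int[]> recordedOffsets(List<int[][]> streams) {
        List<int[]> recorded = new ArrayList<>();
        int base = 0; // running offset carried across streams of the same field (assumption)
        for (int[][] stream : streams) {
            int lastEnd = -1;
            for (int[] token : stream) {
                recorded.add(new int[] {base + token[0], base + token[1]});
                lastEnd = token[1];
            }
            if (lastEnd >= 0) {
                base += lastEnd + 1; // the "+1" applied after the stream's last token
            }
        }
        return recorded;
    }

    public static void main(String[] args) {
        int[][] value = { {0, 3}, {4, 7}, {8, 11} }; // aaa, bbb, ccc
        List<int[][]> streams = new ArrayList<>();
        streams.add(value); // first value of the field
        streams.add(value); // second value, offsets again starting at 0
        for (int[] o : recordedOffsets(streams)) {
            System.out.println(o[0] + ", " + o[1]);
        }
    }
}
```

In this model the first stream's offsets come back unchanged and the second stream's are shifted by 12. If the real writer behaves this way, a tokenizer that already emits document-wide offsets would see them shifted a second time, which may be worth checking against what getOffsets() returns.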

  • Ard Schrijvers at Jul 16, 2007 at 3:41 pm
    Hello,

    The issue is about Lucene 1.9. Can you test it with Lucene 2.2? Perhaps the issue has already been addressed and fixed.

    Regards, Ard
  • Shahan Khatchadourian at Jul 16, 2007 at 3:46 pm
    The issue continues to exist with nightly 146 from Jul 10, 2007.

    http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/146/

Discussion Overview
group: java-user
categories: lucene
posted: Jul 13, '07 at 4:43p
active: Jul 16, '07 at 3:46p
posts: 5
users: 2
website: lucene.apache.org
