CJKTokenizer generates tokens with incorrect offsets
----------------------------------------------------

Key: LUCENE-2207
URL: https://issues.apache.org/jira/browse/LUCENE-2207
Project: Lucene - Java
Issue Type: Bug
Components: contrib/analyzers
Reporter: Koji Sekiguchi


If I index a Japanese *multi-valued* document with CJKTokenizer and highlight a term with FastVectorHighlighter, the output snippets highlight the wrong string. I'll attach a program that reproduces the problem soon.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


  • Koji Sekiguchi (JIRA) at Jan 13, 2010 at 4:59 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Koji Sekiguchi updated LUCENE-2207:
    -----------------------------------

    Attachment: TestCJKOffset.java

    Attached the program that reproduces the problem. The program doesn't use FastVectorHighlighter; instead, it prints the offsets from TermVectorOffsetInfo. You'll see the following results:

    {code}
    === WhitespaceAnalyzer ===
    あい(0,2)
    うえお(3,6)
    === CJKAnalyzer ===
    あい(0,2)
    うえ(4,6)
    えお(5,7)
    === BasicNGramAnalyzer ===
    あい(0,2)
    うえ(3,5)
    えお(4,6)
    {code}

    For anyone seeing garbled characters, here is the same output with each character written as a 'Cn' symbol:

    {code}
    === WhitespaceAnalyzer ===
    C1C2(0,2)
    C3C4C5(3,6)
    === CJKAnalyzer ===
    C1C2(0,2)
    C3C4(4,6)
    C4C5(5,7)
    === BasicNGramAnalyzer ===
    C1C2(0,2)
    C3C4(3,5)
    C4C5(4,6)
    {code}

    As you can see, the start offset of 'C3' is 3 in WhitespaceAnalyzer and in BasicNGramAnalyzer, but 4 in CJKAnalyzer, which is incorrect! (BasicNGramAnalyzer is an analyzer that uses BasicNGramTokenizer, which comes from the FastVectorHighlighter test code and works as a 2-gram tokenizer for ASCII as well as CJK.)
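One way to read these numbers is sketched below. It is a simplified model, not Lucene code, and it rests on two assumptions that the output above does not show: that the test document has two values (C1C2 and C3C4C5), and that the indexer places each later value at the previous value's base offset plus the final offset the tokenizer reported for it, plus a gap of 1. The names `MultiValuedOffsets`, `OFFSET_GAP`, and `baseOffsets` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of offsets in a multi-valued field. NOT Lucene code.
class MultiValuedOffsets {
    // Assumed gap inserted between consecutive values (hypothetical constant).
    static final int OFFSET_GAP = 1;

    // Given the final offset the tokenizer reported for each value,
    // return the base offset at which each value's tokens start.
    static List<Integer> baseOffsets(int[] reportedFinalOffsets) {
        List<Integer> bases = new ArrayList<>();
        int base = 0;
        for (int finalOffset : reportedFinalOffsets) {
            bases.add(base);
            base += finalOffset + OFFSET_GAP;
        }
        return bases;
    }

    public static void main(String[] args) {
        // Values of length 2 and 3, final offsets reported correctly.
        System.out.println(baseOffsets(new int[] {2, 3})); // prints [0, 3]
        // Same values, but each final offset reported one too high.
        System.out.println(baseOffsets(new int[] {3, 4})); // prints [0, 4]
    }
}
```

Under this model, a final offset that is one too high for the first value is enough to shift every token of the second value by one, which matches the (4,6) and (5,7) pairs in the CJKAnalyzer output.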

    I'll look into it tomorrow or later, but volunteers are welcome!
  • Robert Muir (JIRA) at Jan 13, 2010 at 5:11 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799831#action_12799831 ]

    Robert Muir commented on LUCENE-2207:
    -------------------------------------

    Hi Koji, this looks like a bug in the CJK offset calculations, probably involving end().

    Personally, I find the offset logic a little complex. It currently both adds to and subtracts from the offset, and I think there is an off-by-one error somewhere in it.

    I will play around and see if I can simplify this logic, but please don't wait on me if you already have an idea how to fix it!

  • Robert Muir (JIRA) at Jan 13, 2010 at 5:49 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Robert Muir updated LUCENE-2207:
    --------------------------------

    Attachment: LUCENE-2207.patch

    OK, I found the bug. The problem is that incrementToken() unconditionally increments the offset before it starts its main loop:

    line 165:
    {code}
    offset++;
    {code}

    So when incrementToken() has no more text to return and returns false, it needs to undo that increment by decrementing the offset.

    Again, I think we should refactor this offset logic to be simpler in the future, but for the short term this fixes the bug and all tests pass.
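The pattern described above can be illustrated with a toy tokenizer. This is a minimal sketch, not the CJKTokenizer source and not the attached patch: the class `PreIncrementTokenizer`, its `applyFix` flag, and the `drain` helper are hypothetical, and the tokenizer simply emits one token per character.

```java
// Toy tokenizer that emits one token per character. NOT CJKTokenizer.
class PreIncrementTokenizer {
    private final String input;
    private final boolean applyFix;
    private int offset = 0; // characters consumed so far

    PreIncrementTokenizer(String input, boolean applyFix) {
        this.input = input;
        this.applyFix = applyFix;
    }

    // Returns true if a character was consumed, false at end of input.
    boolean incrementToken() {
        offset++; // unconditional increment before any work is done
        if (offset > input.length()) {
            if (applyFix) {
                offset--; // nothing was consumed, so undo the increment
            }
            return false;
        }
        return true;
    }

    int finalOffset() {
        return offset;
    }

    // Consume the whole input and return the final offset.
    static int drain(String input, boolean applyFix) {
        PreIncrementTokenizer t = new PreIncrementTokenizer(input, applyFix);
        while (t.incrementToken()) {
            // tokens are ignored; only the final offset matters here
        }
        return t.finalOffset();
    }

    public static void main(String[] args) {
        System.out.println(drain("ab", false)); // prints 3, one past the input length
        System.out.println(drain("ab", true));  // prints 2
    }
}
```

Without the decrement, the final offset ends up one past the input length, which is the kind of off-by-one that a consumer of the final offset would then carry forward.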

    Koji, can you review?
  • Robert Muir (JIRA) at Jan 13, 2010 at 6:01 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Robert Muir updated LUCENE-2207:
    --------------------------------

    Attachment: LUCENE-2207.patch

    I added a test case for end() to my patch; it fails on trunk but passes with the fix.
  • Robert Muir (JIRA) at Jan 13, 2010 at 6:17 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Robert Muir updated LUCENE-2207:
    --------------------------------

    Attachment: LUCENE-2207.patch

    Hello, this tokenizer has more serious offset/end() problems than I originally thought.

    Attached is my previous patch and test case plus three more test cases, one of which still fails.
  • Koji Sekiguchi (JIRA) at Jan 17, 2010 at 3:52 am
    [ https://issues.apache.org/jira/browse/LUCENE-2207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Koji Sekiguchi updated LUCENE-2207:
    -----------------------------------

    Attachment: LUCENE-2207.patch

    Hi Robert, thank you for looking into this so quickly!

    {quote}
    ok i found the bug. the problem is incrementToken() unconditionally increments the offset before it starts its main loop:

    line 165:

    offset++;
    {quote}

    Indeed.

    In the attached patch, I added one more offset-- line and two more test cases. All tests pass.
  • Koji Sekiguchi (JIRA) at Jan 17, 2010 at 4:56 am
    [ https://issues.apache.org/jira/browse/LUCENE-2207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801312#action_12801312 ]

    Koji Sekiguchi edited comment on LUCENE-2207 at 1/17/10 4:54 AM:
    -----------------------------------------------------------------

    Hi Robert, thank you for looking into this so quickly!

    {quote}
    ok i found the bug. the problem is incrementToken() unconditionally increments the offset before it starts its main loop:

    line 165:

    offset++;
    {quote}

    Indeed.

    In the attached patch, I added one more offset-- line and two more test cases. All tests pass, and this patch fixes the original problem that was found in Solr with FastVectorHighlighter.

  • Robert Muir (JIRA) at Jan 17, 2010 at 7:26 am
    [ https://issues.apache.org/jira/browse/LUCENE-2207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801336#action_12801336 ]

    Robert Muir commented on LUCENE-2207:
    -------------------------------------

    bq. In attached patch, I added one more offset-- line and two more testcases. All tests pass and this patch fixes the original problem that was found in Solr with FastVectorHighlighter.

    Nice, the fix looks good to me.
  • Robert Muir (JIRA) at Jan 17, 2010 at 8:52 am
    [ https://issues.apache.org/jira/browse/LUCENE-2207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801346#action_12801346 ]

    Robert Muir commented on LUCENE-2207:
    -------------------------------------

    Koji, while testing end() offsets with all the other tokenizers, I noticed that CJKTokenizer does not call correctOffset() in end():

    {code}
    final int finalOffset = offset;
    {code}

    I think instead this should be
    {code}
    final int finalOffset = correctOffset(offset);
    {code}

    in case there is a CharFilter present.
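To see why the correction matters, here is a toy model of a filter that removes characters before tokenization. It is only a sketch under stated assumptions: `PrefixStrippingFilter` is a hypothetical class, not a Lucene CharFilter, and it handles just the simplest case of dropping a fixed-length prefix.

```java
// Toy filter that drops a fixed-length prefix from the text. NOT a Lucene CharFilter.
class PrefixStrippingFilter {
    private final int stripped;    // number of characters removed from the front
    private final String filtered; // text that the tokenizer actually sees

    PrefixStrippingFilter(String original, int prefixLength) {
        this.stripped = prefixLength;
        this.filtered = original.substring(prefixLength);
    }

    String text() {
        return filtered;
    }

    // Map an offset in the filtered text back to the original text.
    int correctOffset(int filteredOffset) {
        return filteredOffset + stripped;
    }

    public static void main(String[] args) {
        PrefixStrippingFilter filter = new PrefixStrippingFilter("<b>ab", 3);
        int rawFinal = filter.text().length();       // final offset in the filtered text
        System.out.println(rawFinal);                       // prints 2
        System.out.println(filter.correctOffset(rawFinal)); // prints 5, the original length
    }
}
```

A final offset taken from the filtered text (2 here) points into the middle of the original text, while the corrected value (5) points at its end.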

  • Koji Sekiguchi (JIRA) at Jan 17, 2010 at 10:28 am
    [ https://issues.apache.org/jira/browse/LUCENE-2207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801363#action_12801363 ]

    Koji Sekiguchi commented on LUCENE-2207:
    ----------------------------------------

    {quote}
    I think instead this should be

    final int finalOffset = correctOffset(offset);
    {quote}

    Agreed, thank you for pointing this out!
    I think this is ready to commit. Robert, can you do it? It would also be great if it could go into the 2.9 branch so that Solr can use the fix.
  • Robert Muir (JIRA) at Jan 17, 2010 at 10:32 am
    [ https://issues.apache.org/jira/browse/LUCENE-2207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801366#action_12801366 ]

    Robert Muir commented on LUCENE-2207:
    -------------------------------------

    Koji, sure, I can take care of it.

    I also added LUCENE-2219 to find these bugs in other tokenizers.

    In the future I also want to explore whether we can somehow use a fake CharFilter in BaseTokenStreamTest to ensure that correctOffset() is called when setting offsets in both incrementToken() and end(). I don't know yet how it would work.
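One possible shape for such a check is sketched below, purely as an illustration of the idea and not as how BaseTokenStreamTest works or would work. It assumes a fake correction that shifts every offset by a constant larger than any test input, so an offset that skipped the correction is easy to spot. `CorrectionCheck`, `SHIFT`, `FAKE_CORRECT`, and `allCorrected` are hypothetical names.

```java
import java.util.function.IntUnaryOperator;

// Illustration only: detect offsets that bypassed a fake correction step.
class CorrectionCheck {
    // Assumed to be larger than the length of any test input.
    static final int SHIFT = 1000;

    // Fake correction: shift every offset by SHIFT.
    static final IntUnaryOperator FAKE_CORRECT = off -> off + SHIFT;

    // True if every reported offset is at least SHIFT, i.e. it was corrected.
    static boolean allCorrected(int... reportedOffsets) {
        for (int off : reportedOffsets) {
            if (off < SHIFT) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int start = FAKE_CORRECT.applyAsInt(0);
        int end = FAKE_CORRECT.applyAsInt(2);
        System.out.println(allCorrected(start, end, end)); // prints true
        System.out.println(allCorrected(start, end, 2));   // prints false: final offset was not corrected
    }
}
```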

  • Robert Muir (JIRA) at Jan 17, 2010 at 10:32 am
    [ https://issues.apache.org/jira/browse/LUCENE-2207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Robert Muir reassigned LUCENE-2207:
    -----------------------------------

    Assignee: Robert Muir
  • Robert Muir (JIRA) at Jan 17, 2010 at 10:34 am
    [ https://issues.apache.org/jira/browse/LUCENE-2207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Robert Muir updated LUCENE-2207:
    --------------------------------

    Lucene Fields: [New, Patch Available] (was: [New])
    Affects Version/s: 2.9.1, 3.0
    Fix Version/s: 3.1, 3.0.1, 2.9.2
  • Robert Muir (JIRA) at Jan 17, 2010 at 9:47 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Robert Muir resolved LUCENE-2207.
    ---------------------------------

    Resolution: Fixed

    Thanks, Koji!
