position increment bug: smartcn
-------------------------------

Key: LUCENE-2014
URL: https://issues.apache.org/jira/browse/LUCENE-2014
Project: Lucene - Java
Issue Type: Bug
Components: contrib/analyzers
Reporter: Robert Muir
Attachments: LUCENE-2014.patch

If I use LUCENE_VERSION >= 2.9 with the smart Chinese analyzer (smartcn), it crashes IndexWriter on any reasonable amount of Chinese text.

It's especially annoying because it happens in 2.9.1 RC as well.

This is because the position increments for tokens after stopwords are bogus.

Here's an example (from the test case), where the position increment should be 2, but is instead 91975314!

{code}
public void testChineseStopWords2() throws Exception {
  Analyzer ca = new SmartChineseAnalyzer(Version.LUCENE_CURRENT); /* will load stopwords */
  String sentence = "Title:San"; // : is a stopword
  String result[] = { "titl", "san" };
  int startOffsets[] = { 0, 6 };
  int endOffsets[] = { 5, 9 };
  int posIncr[] = { 1, 2 };
  assertAnalyzesTo(ca, sentence, result, startOffsets, endOffsets, posIncr);
}
{code}

junit.framework.AssertionFailedError: posIncrement 1 expected:<2> but was:<91975314>
at junit.framework.Assert.fail(Assert.java:47)
at junit.framework.Assert.failNotEquals(Assert.java:280)
at junit.framework.Assert.assertEquals(Assert.java:64)
at junit.framework.Assert.assertEquals(Assert.java:198)
at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:83)
...






--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

  • Robert Muir (JIRA) at Oct 29, 2009 at 8:05 am
    [ https://issues.apache.org/jira/browse/LUCENE-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Robert Muir updated LUCENE-2014:
    --------------------------------

    Attachment: LUCENE-2014.patch

    This patch contains only a test case demonstrating the problem.
  • Uwe Schindler (JIRA) at Oct 29, 2009 at 8:09 am
    [ https://issues.apache.org/jira/browse/LUCENE-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771322#action_12771322 ]

    Uwe Schindler commented on LUCENE-2014:
    ---------------------------------------

    Maybe we should now use BaseTokenStreamTestCase (which no longer uses the old/new TokenStream API) to test all Version constants. That is easy in 3.0 (because Version is an enum now), where you can iterate using for (Version v : Version.values()).

    I proposed this already for Highlighter (see other issue).

    Is it a problem in StopWordFilter?
  • Robert Muir (JIRA) at Oct 29, 2009 at 8:11 am
    [ https://issues.apache.org/jira/browse/LUCENE-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771323#action_12771323 ]

    Robert Muir commented on LUCENE-2014:
    -------------------------------------

    {quote}
    Is it a problem in StopWordFilter?
    {quote}

    I don't really know where it is, to tell you the truth... I spent a little time trying to create an English test case for StopFilter, but couldn't reproduce it there.

    smartcn doesn't even touch position increment attributes, so it's really weird...
  • Uwe Schindler (JIRA) at Oct 29, 2009 at 8:23 am
    [ https://issues.apache.org/jira/browse/LUCENE-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771327#action_12771327 ]

    Uwe Schindler commented on LUCENE-2014:
    ---------------------------------------

    I do not see the problem, from StopFilter:
    {code:java}
    @Override
    public final boolean incrementToken() throws IOException {
      // return the first non-stop word found
      int skippedPositions = 0;
      while (input.incrementToken()) {
        if (!stopWords.contains(termAtt.termBuffer(), 0, termAtt.termLength())) {
          if (enablePositionIncrements) {
            posIncrAtt.setPositionIncrement(posIncrAtt.getPositionIncrement() + skippedPositions);
          }
          return true;
        }
        skippedPositions += posIncrAtt.getPositionIncrement();
      }
      // reached EOS -- return false
      return false;
    }
    {code}

    The problem can only be that the input filter returned some big posIncr for the stop word. The code seems very clear to me. Let's debug :-)
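To make the arithmetic in the quoted loop concrete, here is a minimal, self-contained sketch in plain Java. It is a simplified model, not Lucene code: the class name `StopFilterModel`, the `filter` method, and the array-based inputs are all hypothetical. The large input increments in the second call are assumed values, picked only because they add up to the figure in the failure message.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Simplified model of the StopFilter loop quoted above; not Lucene code. */
final class StopFilterModel {

    /**
     * Returns the position increment of every surviving token, assuming the
     * upstream stage reports inputPosIncr[i] for tokens[i].
     */
    static List<Integer> filter(String[] tokens, int[] inputPosIncr, Set<String> stopWords) {
        List<Integer> out = new ArrayList<>();
        int skippedPositions = 0;
        for (int i = 0; i < tokens.length; i++) {
            if (!stopWords.contains(tokens[i])) {
                // Surviving token: its own increment plus everything skipped before it.
                out.add(inputPosIncr[i] + skippedPositions);
                skippedPositions = 0; // the real filter starts each call at 0
            } else {
                // Stop word: remember how many positions it occupied.
                skippedPositions += inputPosIncr[i];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stops = new HashSet<>(Arrays.asList(":"));
        String[] tokens = { "titl", ":", "san" };
        // Well-behaved upstream: every token arrives with increment 1.
        System.out.println(filter(tokens, new int[] { 1, 1, 1 }, stops));
        // Assumed misbehaving upstream: large increments on the last two tokens.
        System.out.println(filter(tokens, new int[] { 1, 45987657, 45987657 }, stops));
    }
}
```

With well-behaved input the token after the stopword gets an increment of 2. In this model a huge result only appears when the upstream stage hands over huge increments, which is the point made above.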
  • Robert Muir (JIRA) at Oct 29, 2009 at 8:27 am
    [ https://issues.apache.org/jira/browse/LUCENE-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771329#action_12771329 ]

    Robert Muir commented on LUCENE-2014:
    -------------------------------------

    Uwe, check this out.

    smartcn doesn't use PositionIncrementAttribute, but its tokenizer does call clearAttributes() as it should.

    But if I modify WordTokenFilter to set the position increment to 1:
    posIncAtt = addAttribute(PositionIncrementAttribute.class);
    ...
    posIncAtt.setPositionIncrement(1);

    then the test passes... basically an uninitialized-variable problem... but smartcn shouldn't have to do this, right?
  • Robert Muir (JIRA) at Oct 29, 2009 at 8:31 am
    [ https://issues.apache.org/jira/browse/LUCENE-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771332#action_12771332 ]

    Robert Muir commented on LUCENE-2014:
    -------------------------------------

    Duh, this is the problem, Uwe.

    WordTokenFilter acts like a source of tokens, even though it is not a tokenizer.

    This is because smartcn's tokenizer just breaks out sentences... WordTokenFilter breaks these into words.

    So I think WordTokenFilter must call clearAttributes()... ?
  • Uwe Schindler (JIRA) at Oct 29, 2009 at 8:31 am
    [ https://issues.apache.org/jira/browse/LUCENE-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771331#action_12771331 ]

    Uwe Schindler commented on LUCENE-2014:
    ---------------------------------------

    Hm hm
    But the StopFilter also adds the attribute and therefore the clearAttributes call should clear it.

    I'll look into it.
  • Robert Muir (JIRA) at Oct 29, 2009 at 8:35 am
    [ https://issues.apache.org/jira/browse/LUCENE-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Robert Muir updated LUCENE-2014:
    --------------------------------

    Attachment: LUCENE-2014.patch

    This patch adds clearAttributes() to the Chinese WordTokenFilter, which fixes the issue.
  • Uwe Schindler (JIRA) at Oct 29, 2009 at 8:39 am
    [ https://issues.apache.org/jira/browse/LUCENE-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771335#action_12771335 ]

    Uwe Schindler commented on LUCENE-2014:
    ---------------------------------------

    This is the problem, you are right. I thought about that, too.

    The question is, why does the posIncr get such strange values even when the filter is the source of tokens? Nobody else modifies it?
  • Robert Muir (JIRA) at Oct 29, 2009 at 8:43 am
    [ https://issues.apache.org/jira/browse/LUCENE-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771336#action_12771336 ]

    Robert Muir commented on LUCENE-2014:
    -------------------------------------

    Uwe, yeah, the only thing modifying it should be StopFilter... so I can see the values being "kinda strange", but not as weird as what I see.

    I worry about this clearAttributes() solution though. Perhaps WordTokenFilter should use the captureState/restoreState API, like ThaiWordFilter does (a very similar analyzer).
    If I use captureState/restoreState, this should not be a problem, right?

    And this way things like custom attributes would be preserved?
  • Uwe Schindler (JIRA) at Oct 29, 2009 at 8:59 am
    [ https://issues.apache.org/jira/browse/LUCENE-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771338#action_12771338 ]

    Uwe Schindler commented on LUCENE-2014:
    ---------------------------------------

    Hihi, I know where the strange values come from: it is the test in BaseTokenStreamTestCase itself that does it, to check for missing clearAttributes() calls (see assertTokenStreamContents). It sets all attributes to bogus values before calling incrementToken(). If you do not clear the attributes, the bogus values stay there.

    But the question is, why does IndexWriter fail (how does it fail?). Normally it should not be affected, as the posIncr stays 1.
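The interaction described above can be sketched with a small plain-Java model. It is only an illustration, not Lucene code: the class `BogusIncrementModel`, its method, and the `wordFilterClears` switch are hypothetical. The sentinel 45987657 is an assumption, chosen because doubling it gives the 91975314 seen in the failure. The model also assumes that the first word arrives together with a fresh sentence, for which the sentence tokenizer clears the attributes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Simplified model of the test harness plus the smartcn chain; not Lucene code. */
final class BogusIncrementModel {

    /** Assumed sentinel: chosen because doubling it gives the reported 91975314. */
    static final int BOGUS = 45987657;

    static List<Integer> observedIncrements(List<String> words, Set<String> stopWords,
                                            boolean wordFilterClears) {
        List<Integer> observed = new ArrayList<>();
        int posIncr = 1; // stands in for the shared position increment attribute
        int next = 0;
        while (next < words.size()) {
            posIncr = BOGUS; // harness poisons the attribute before each call
            int skipped = 0;
            boolean emitted = false;
            while (next < words.size() && !emitted) {
                // Word source: the first word comes with a fresh sentence (cleared
                // upstream); later words come from a buffer and are cleared only
                // if the word filter does so itself.
                if (next == 0 || wordFilterClears) {
                    posIncr = 1;
                }
                String word = words.get(next++);
                if (stopWords.contains(word)) {
                    skipped += posIncr; // stop filter: remember skipped positions
                } else {
                    posIncr += skipped; // stop filter: add them to the survivor
                    emitted = true;
                }
            }
            if (emitted) {
                observed.add(posIncr);
            }
        }
        return observed;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("titl", ":", "san");
        Set<String> stops = new HashSet<>(Arrays.asList(":"));
        System.out.println(observedIncrements(words, stops, false));
        System.out.println(observedIncrements(words, stops, true));
    }
}
```

In this model the first token passes with an increment of 1 either way. The second token only goes wrong when the word filter leaves the poisoned value in place, which matches the assertion failing at index 1.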
  • Robert Muir (JIRA) at Oct 29, 2009 at 8:59 am
    [ https://issues.apache.org/jira/browse/LUCENE-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771339#action_12771339 ]

    Robert Muir commented on LUCENE-2014:
    -------------------------------------

    bq. Maybe we should now use BaseTokenStreamTestCase (which no longer uses the old/new TokenStream API) to test all Version constants. That is easy in 3.0 (because Version is an enum now), where you can iterate using for (Version v : Version.values()).

    This might be a good idea, although the behavior of the analyzer could change depending upon Version. Maybe it is best to actually test the different possibilities explicitly?

    I think after this one is resolved, I will open a task as a first step to improve the tests of these analyzers so they test posInc as well, because I don't see it tested for similar cases like Thai.

  • Robert Muir (JIRA) at Oct 29, 2009 at 9:01 am
    [ https://issues.apache.org/jira/browse/LUCENE-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771341#action_12771341 ]

    Robert Muir commented on LUCENE-2014:
    -------------------------------------

    bq. But the question is, why does IndexWriter fail (how does it fail?). Normally it should not be affected, as the posIncr stays 1.

    Oh, the IndexWriter fails because of integer overflow with any large document: lots of posIncrs get added up, overflow, and produce a negative posIncr, and the negative posIncr causes an exception.
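
    A small sketch of the overflow itself, using plain Java int arithmetic rather than any IndexWriter code. The class and method names are hypothetical, and the increments are arbitrary large values chosen only to trigger the wrap-around.

    ```java
    final class PositionOverflowSketch {
        /** Sums increments with plain int addition, which silently wraps on overflow. */
        static int wrappedSum(int[] increments) {
            int position = 0;
            for (int inc : increments) {
                position += inc;
            }
            return position;
        }

        /** Returns the index of the first increment whose addition overflows, or -1 if none does. */
        static int firstOverflowIndex(int[] increments) {
            int position = 0;
            for (int i = 0; i < increments.length; i++) {
                try {
                    position = Math.addExact(position, increments[i]);
                } catch (ArithmeticException e) {
                    return i;
                }
            }
            return -1;
        }

        public static void main(String[] args) {
            int[] increments = {1_000_000_000, 1_000_000_000, 1_000_000_000};
            System.out.println(wrappedSum(increments));         // negative: the sum wrapped around
            System.out.println(firstOverflowIndex(increments)); // index of the addition that overflowed
        }
    }
    ```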
  • Uwe Schindler (JIRA) at Oct 29, 2009 at 9:05 am
    [ https://issues.apache.org/jira/browse/LUCENE-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771342#action_12771342 ]

    Uwe Schindler commented on LUCENE-2014:
    ---------------------------------------

    Ah, I understand: because nobody resets the posInc back to 1, it grows in a 2^n manner. The stop filter updates it to 2 because of the stop word. After that nobody resets it to 1, so it becomes 2, 4, 8, ... and bang, if more stopwords occur.
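
    The doubling can be reproduced with a simplified model. This is not Lucene code: the class, the method, and the single shared `posIncAttr` variable are hypothetical stand-ins for a reused position increment attribute, a producer that may or may not reset it, and a stop filter that adds the skipped increments to the next surviving token.

    ```java
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    final class StalePosIncSketch {
        /**
         * Returns the position increments of the tokens that survive stopword removal.
         * When resetBeforeEachToken is false, the shared value is never set back to 1.
         */
        static List<Integer> run(String[] words, Set<String> stopWords, boolean resetBeforeEachToken) {
            List<Integer> emitted = new ArrayList<>();
            int posIncAttr = 1; // shared value, reused for every token
            int skipped = 0;    // increments of stopwords dropped since the last kept token
            for (String word : words) {
                if (resetBeforeEachToken) {
                    posIncAttr = 1; // what a producer does when it clears its attributes
                }
                if (stopWords.contains(word)) {
                    skipped += posIncAttr; // the dropped token still occupies positions
                    continue;
                }
                posIncAttr += skipped;
                skipped = 0;
                emitted.add(posIncAttr);
            }
            return emitted;
        }

        public static void main(String[] args) {
            String[] words = {"a", ",", "b", ",", "c", ",", "d"};
            Set<String> stop = new HashSet<>(Arrays.asList(","));
            System.out.println(run(words, stop, false)); // [1, 2, 4, 8]
            System.out.println(run(words, stop, true));  // [1, 2, 2, 2]
        }
    }
    ```

    Without the reset, each stopword adds the stale value to the skipped count, so the increment doubles at every stopword; with the reset, it stays at 2.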
  • Robert Muir (JIRA) at Oct 29, 2009 at 9:05 am
    [ https://issues.apache.org/jira/browse/LUCENE-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771341#action_12771341 ]

    Robert Muir edited comment on LUCENE-2014 at 10/29/09 9:03 AM:
    ---------------------------------------------------------------

    bq. But the question is, why does IndexWriter fail (how does it fail?). Normally it should not be affected, as the posIncr stays 1.

    Oh, the IndexWriter fails because of integer overflow with any large document: lots of posIncrs get added up, overflow, and produce a negative posIncr, and the negative posIncr causes an exception.

    <edit>

    Uwe, I think this happens especially because of the way smartcn works.
    smartcn creates individual tokens for each piece of punctuation (including things like whitespace) and puts these in the stopword list.
    So if you have a Chinese document with lots of spaces ... you can imagine how it adds up and overflows.
  • Robert Muir (JIRA) at Oct 29, 2009 at 9:09 am
    [ https://issues.apache.org/jira/browse/LUCENE-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771343#action_12771343 ]

    Robert Muir commented on LUCENE-2014:
    -------------------------------------

    Uwe, exactly. So the only remaining question is: do you think I should change this filter to use the captureState/restoreState API instead of clearAttributes?

    I guess the only advantage would be that it would preserve any custom attributes or payloads that someone might add after the SentenceTokenizer but before the WordTokenFilter, propagating them down to the individual words.

  • Uwe Schindler (JIRA) at Oct 29, 2009 at 9:09 am
    [ https://issues.apache.org/jira/browse/LUCENE-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771344#action_12771344 ]

    Uwe Schindler commented on LUCENE-2014:
    ---------------------------------------

    bq. i worry about this clearAttributes solution though, perhaps WordTokenFilter should use captureState/restoreState api, like the ThaiWordFilter does (very similar analyzer).
    bq. If i use capture/restoreState this should not be a problem right?

    I think the filter is fine as it is at the moment. The only problem is the missing clearAttributes when you produce more than one token out of one big one (the sentence). There is no need for captureState, because the tokens are new ones. If somebody adds custom attributes, they would be cleared, but wouldn't that be correct?
  • Robert Muir (JIRA) at Oct 29, 2009 at 9:11 am
    [ https://issues.apache.org/jira/browse/LUCENE-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771345#action_12771345 ]

    Robert Muir commented on LUCENE-2014:
    -------------------------------------

    bq. I think the filter is fine how it is at the moment. The problem is only the missing clearAttributes when you produce more than one token out of one big one (the sentence). No need for captureState, because the tokens are new ones. If somebody adds custom attributes, they would have cleared, but would that be not correct?

    Not really sure, that's why I asked you :)

    I guess for now it's good enough to fix it so it does not crash IndexWriter.

    I will commit soon.

  • Robert Muir (JIRA) at Oct 29, 2009 at 9:11 am
    [ https://issues.apache.org/jira/browse/LUCENE-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Robert Muir updated LUCENE-2014:
    --------------------------------

    Lucene Fields: [New, Patch Available] (was: [New])
    Fix Version/s: 3.0
    Assignee: Robert Muir
  • Uwe Schindler (JIRA) at Oct 29, 2009 at 9:13 am
    [ https://issues.apache.org/jira/browse/LUCENE-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771347#action_12771347 ]

    Uwe Schindler commented on LUCENE-2014:
    ---------------------------------------

    bq. I will commit soon.

    +1

    And how about 2.9.1?
  • Uwe Schindler (JIRA) at Oct 29, 2009 at 9:13 am
    [ https://issues.apache.org/jira/browse/LUCENE-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771344#action_12771344 ]

    Uwe Schindler edited comment on LUCENE-2014 at 10/29/09 9:11 AM:
    -----------------------------------------------------------------

    bq. i worry about this clearAttributes solution though, perhaps WordTokenFilter should use captureState/restoreState api, like the ThaiWordFilter does (very similar analyzer).
    bq. If i use capture/restoreState this should not be a problem right?

    I think the filter is fine as it is at the moment. The only problem is the missing clearAttributes when you produce more than one token out of one big one (the sentence). There is no need for captureState, because the tokens are new ones. If somebody adds custom attributes, they would be cleared, but wouldn't that be correct?

    bq. I guess the only advantage would be that it would preserve any customAttributes or payloads that someone might add after the SentenceTokenizer, but before the WordTokenFilter propagating them downto the individual words.

    Does it make sense to insert a filter between the two? The transition from sentence tokens to word tokens creates totally different tokens, so how should a payload or other custom attribute work correctly here? Normally such payload filters should be inserted after the WordTokenFilter. The problem with capture/restoreState is the additional copy cost for nothing (the *long* sentence token is copied again and again and always reset to the text of the word).
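
    The trade-off between the two approaches can be sketched with a toy model. This is not the Lucene attribute API: `TokenState`, `splitWithClear` and `splitWithRestore` are hypothetical names, and splitting on a space stands in for the real word segmentation.

    ```java
    import java.util.ArrayList;
    import java.util.List;

    final class ClearVsRestoreSketch {
        /** Hypothetical stand-in for the attribute state of one token. */
        static final class TokenState {
            String term = "";
            int posInc = 1;
            String custom = null; // e.g. a payload set by an upstream filter

            TokenState copy() {
                TokenState c = new TokenState();
                c.term = term;
                c.posInc = posInc;
                c.custom = custom;
                return c;
            }
        }

        /** Clear-style: every word starts from default state, so upstream custom values are dropped. */
        static List<TokenState> splitWithClear(TokenState sentence) {
            List<TokenState> words = new ArrayList<>();
            for (String w : sentence.term.split(" ")) {
                TokenState t = new TokenState();
                t.term = w;
                words.add(t);
            }
            return words;
        }

        /** Restore-style: every word starts as a full copy of the sentence state, then gets its own text. */
        static List<TokenState> splitWithRestore(TokenState sentence) {
            List<TokenState> words = new ArrayList<>();
            for (String w : sentence.term.split(" ")) {
                TokenState t = sentence.copy(); // copies the whole sentence state once per word
                t.term = w;
                words.add(t);
            }
            return words;
        }

        public static void main(String[] args) {
            TokenState sentence = new TokenState();
            sentence.term = "a b c";
            sentence.custom = "payload";
            System.out.println(splitWithClear(sentence).get(0).custom);   // null
            System.out.println(splitWithRestore(sentence).get(0).custom); // payload
        }
    }
    ```

    In this model the restore-style split keeps the custom value on every word at the price of one full state copy per word, while the clear-style split is cheaper and always yields default increments.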

  • Robert Muir (JIRA) at Oct 29, 2009 at 9:17 am
    [ https://issues.apache.org/jira/browse/LUCENE-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771350#action_12771350 ]

    Robert Muir commented on LUCENE-2014:
    -------------------------------------

    bq. And how about 2.9.1?

    I will upload and test a patch against the 2.9 branch (can you commit it for me?)
  • Uwe Schindler (JIRA) at Oct 29, 2009 at 9:19 am
    [ https://issues.apache.org/jira/browse/LUCENE-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771351#action_12771351 ]

    Uwe Schindler commented on LUCENE-2014:
    ---------------------------------------

    ok.
  • Uwe Schindler (JIRA) at Oct 29, 2009 at 9:25 am
    [ https://issues.apache.org/jira/browse/LUCENE-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771355#action_12771355 ]

    Uwe Schindler commented on LUCENE-2014:
    ---------------------------------------

    No prob. I also forgot about the bogus values set by BaseTokenStreamTestCase... But there is no good way to test/document this.
    position increment bug: smartcn
    -------------------------------

    Key: LUCENE-2014
    URL: https://issues.apache.org/jira/browse/LUCENE-2014
    Project: Lucene - Java
    Issue Type: Bug
    Components: contrib/analyzers
    Reporter: Robert Muir
    Assignee: Robert Muir
    Fix For: 3.0

    Attachments: LUCENE-2014.patch, LUCENE-2014.patch


  • Robert Muir (JIRA) at Oct 29, 2009 at 9:25 am
    [ https://issues.apache.org/jira/browse/LUCENE-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771353#action_12771353 ]

    Robert Muir commented on LUCENE-2014:
    -------------------------------------

    Committed revision 830871 to trunk.

    I will test this against 2.9 and upload a patch.

    Thanks for your help Uwe.
  • Michael McCandless (JIRA) at Oct 29, 2009 at 9:37 am
    [ https://issues.apache.org/jira/browse/LUCENE-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771360#action_12771360 ]

    Michael McCandless commented on LUCENE-2014:
    --------------------------------------------

    Guys, how serious is this issue? Should we respin 2.9.1?
  • Robert Muir (JIRA) at Oct 29, 2009 at 9:37 am
    [ https://issues.apache.org/jira/browse/LUCENE-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Robert Muir updated LUCENE-2014:
    --------------------------------

    Attachment: LUCENE-2014_branch.patch

    patch against 2.9 branch
  • Uwe Schindler (JIRA) at Oct 29, 2009 at 9:39 am
    [ https://issues.apache.org/jira/browse/LUCENE-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771362#action_12771362 ]

    Uwe Schindler commented on LUCENE-2014:
    ---------------------------------------

    I merged your changes into 2.9. I can commit, no need for a patch!
  • Robert Muir (JIRA) at Oct 29, 2009 at 9:39 am
    [ https://issues.apache.org/jira/browse/LUCENE-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771363#action_12771363 ]

    Robert Muir commented on LUCENE-2014:
    -------------------------------------

    Mike, it's up to you.

    I was just analyzing some not-ridiculously-large Chinese texts from Gutenberg, when I hit the issue.

    The problem is that smartcn emits punctuation as individual tokens and then filters them out with StopFilter (its stopword list is all punctuation).
    This means it makes much heavier use of StopFilter than other analyzers do.
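
To make the mechanism concrete, here is a minimal sketch of how a stop filter is expected to fold removed tokens into the position increment of the next surviving token. It is plain Java, not Lucene's StopFilter code: the class `StopFilterSketch` and the method `positionIncrements` are hypothetical names, and the sketch assumes every incoming token has an increment of 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopFilterSketch {
    /**
     * Returns the position increment of each token that survives stopword
     * removal. Assumption: every input token starts with an increment of 1.
     */
    public static List<Integer> positionIncrements(List<String> tokens, Set<String> stopWords) {
        List<Integer> increments = new ArrayList<>();
        int skipped = 0; // stopwords dropped since the last emitted token
        for (String token : tokens) {
            if (stopWords.contains(token)) {
                skipped++; // remember the gap instead of emitting the token
                continue;
            }
            increments.add(1 + skipped); // the gap is added to the next real token
            skipped = 0;
        }
        return increments;
    }

    public static void main(String[] args) {
        // Tokens as they might look for "Title:San" after stemming and lowercasing.
        List<String> tokens = Arrays.asList("titl", ":", "san");
        Set<String> stopWords = new HashSet<>(Arrays.asList(":"));
        System.out.println(positionIncrements(tokens, stopWords)); // prints [1, 2]
    }
}
```

The result matches the `posIncr` array the test case in the issue description expects. An analyzer that drops a punctuation token after nearly every clause exercises this path constantly, which is why a wrong increment shows up so quickly with smartcn.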

  • Uwe Schindler (JIRA) at Oct 29, 2009 at 9:45 am
    [ https://issues.apache.org/jira/browse/LUCENE-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771364#action_12771364 ]

    Uwe Schindler commented on LUCENE-2014:
    ---------------------------------------

    I also merged BaseTokenStreamTestCase back, because the bogus-values setter was missing in 2.9. Now the tests produce the same results.

    Will commit soon.
  • Robert Muir (JIRA) at Oct 29, 2009 at 9:49 am
    [ https://issues.apache.org/jira/browse/LUCENE-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771365#action_12771365 ]

    Robert Muir commented on LUCENE-2014:
    -------------------------------------

    bq. I also merged BaseTokenStreamTestCase back, because the bogus-values setter was missing in 2.9. Now the tests produce the same results.

    Good deal. I didn't test the bug with the JUnit test against 2.9, but my IndexWriter threw the exception when I used Version.LUCENE_29, so I knew it was affected.

  • Robert Muir (JIRA) at Oct 29, 2009 at 9:51 am
    [ https://issues.apache.org/jira/browse/LUCENE-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771365#action_12771365 ]

    Robert Muir edited comment on LUCENE-2014 at 10/29/09 9:50 AM:
    ---------------------------------------------------------------

    bq. I also merged BaseTokenStreamTestCase back, because the bogus-values setter was missing in 2.9. Now the tests produce the same results.

    Good deal. I didn't test the bug with the JUnit test against 2.9, but my IndexWriter threw the exception when I used Version.LUCENE_29 (with the RC2 jars), so I knew it was affected.



  • Uwe Schindler (JIRA) at Oct 29, 2009 at 9:57 am
    [ https://issues.apache.org/jira/browse/LUCENE-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771367#action_12771367 ]

    Uwe Schindler commented on LUCENE-2014:
    ---------------------------------------

    Committed in 2.9, revision: 830876

    I think you can close the issue. We should ask Mike to create a new RC; then we also have the other bug fixed in 2.9 (the one I resolved yesterday). Mike then only has to move the CHANGES entries down to 2.9.1 in contrib/CHANGES.txt.

    The other problem still in 2.9 is the default for posincr in StopFilter if the version is < 2.9, which is now always false for the StandardAnalyzer no-arg ctor and others.
  • Robert Muir (JIRA) at Oct 29, 2009 at 9:59 am
    [ https://issues.apache.org/jira/browse/LUCENE-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Robert Muir resolved LUCENE-2014.
    ---------------------------------

    Resolution: Fixed

    thanks again Uwe
  • Robert Muir (JIRA) at Oct 29, 2009 at 4:51 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771473#action_12771473 ]

    Robert Muir commented on LUCENE-2014:
    -------------------------------------

    {quote}
    Does it make sense to insert a filter between the two? The transition from sentence tokens to word tokens creates totally different tokens, so how should a payload or other custom attribute work correctly here? Normally such payload filters should be inserted after the WordFilter. The problem with capture/restore state is the additional copy cost for nothing (the long sentence token is copied again and again and always reset to the text word).
    {quote}

    I could imagine a use case where a person wants to keep the sentence information intact (perhaps to improve retrieval accuracy, or maybe just to restrict phrase queries to match within sentences).
    But I guess that, to some extent, the Chinese PhraseQuery already works pretty intelligently with >= Version.LUCENE_29, because punctuation is a stopword and the position increments are adjusted.

    I agree about the expensive cost though, so it is best to leave it for now. But this is the way the Thai analyzer works.
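
As a rough illustration of the point about phrase queries, the sketch below shows why a position gap left by a removed punctuation token keeps an exact phrase from matching across it. This is plain Java rather than Lucene's PhraseQuery: `PhrasePositionSketch`, `positions` and `phraseMatches` are hypothetical names, and the matching rule (second term exactly one position after the first, i.e. no slop) is a simplifying assumption.

```java
import java.util.Arrays;
import java.util.List;

public class PhrasePositionSketch {
    /** Converts position increments into absolute positions; the first token with increment 1 lands at 0. */
    public static int[] positions(int[] increments) {
        int[] pos = new int[increments.length];
        int current = -1;
        for (int i = 0; i < increments.length; i++) {
            current += increments[i];
            pos[i] = current;
        }
        return pos;
    }

    /** True if 'second' occurs exactly one position after 'first' (exact two-term phrase, no slop). */
    public static boolean phraseMatches(List<String> terms, int[] pos, String first, String second) {
        for (int i = 0; i < terms.size(); i++) {
            if (!terms.get(i).equals(first)) {
                continue;
            }
            for (int j = 0; j < terms.size(); j++) {
                if (terms.get(j).equals(second) && pos[j] == pos[i] + 1) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("titl", "san");

        // Increments {1, 2}: the removed ':' leaves a hole between the two terms.
        int[] withGap = positions(new int[] {1, 2});
        System.out.println(phraseMatches(terms, withGap, "titl", "san")); // prints false

        // Increments {1, 1}: the hole is ignored and the terms look adjacent.
        int[] noGap = positions(new int[] {1, 1});
        System.out.println(phraseMatches(terms, noGap, "titl", "san")); // prints true
    }
}
```

With adjusted increments, punctuation acts as a soft boundary for exact phrases, which covers part of the sentence-restriction use case without a separate sentence-level filter.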
