CJKAnalyzer not matching multibyte character followed by non-multibyte character
--------------------------------------------------------------------------------

Key: LUCENE-2673
URL: https://issues.apache.org/jira/browse/LUCENE-2673
Project: Lucene - Java
Issue Type: Bug
Components: contrib/analyzers
Affects Versions: 3.0.1
Reporter: Kevin Hayen


Here is a listing of text indexed in a field, followed by various search terms that did or did not match the document.

[QES様文字化けテスト]
QES -> retrievable
QES様 -> not retrievable
QES様文字化けテスト -> retrievable

[SOA基盤]
SOA -> retrievable
SOA基 -> not retrievable
SOA基盤 -> retrievable

[日経BP]
日経 -> retrievable
日経B -> not retrievable
日経BP -> retrievable

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

  • Koji Sekiguchi (JIRA) at Sep 27, 2010 at 2:46 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915323#action_12915323 ]

    Koji Sekiguchi commented on LUCENE-2673:
    ----------------------------------------

    I think CJKAnalyzer works as expected.

    {quote}
    QES様 -> not retrievable
    SOA基 -> not retrievable
    {quote}

    Because CJK characters are tokenized as 2-grams (overlapping pairs of adjacent characters), "様" and "基" on their own are not tokens in the index.

    {quote}
    日経B -> not retrievable
    {quote}

    Because non-CJK characters are tokenized at whitespace, "B" on its own is not a token in the index.
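
The explanation above can be illustrated with a small, self-contained sketch. This is not Lucene's CJKTokenizer code: the class `CjkBigramSketch` and its methods `isCjk`, `tokenize` and `matches` are hypothetical names made up for this example. It assumes that a run of Han, Hiragana or Katakana characters is split into overlapping 2-grams (a lone character becomes a single-character token), that a run of other letters or digits becomes one lowercased token, and that a query "matches" when every query token occurs among the indexed tokens (token positions are ignored).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;

// Simplified model of the tokenization described in the comment above.
// Hypothetical illustration only; this is NOT Lucene's CJKTokenizer.
final class CjkBigramSketch {

    // Assumption: only Han, Hiragana and Katakana count as "CJK" here.
    static boolean isCjk(int cp) {
        Character.UnicodeScript s = Character.UnicodeScript.of(cp);
        return s == Character.UnicodeScript.HAN
            || s == Character.UnicodeScript.HIRAGANA
            || s == Character.UnicodeScript.KATAKANA;
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            if (isCjk(cp)) {
                // Collect the whole CJK run as individual characters.
                List<String> run = new ArrayList<>();
                while (i < text.length() && isCjk(text.codePointAt(i))) {
                    int c = text.codePointAt(i);
                    run.add(new String(Character.toChars(c)));
                    i += Character.charCount(c);
                }
                if (run.size() == 1) {
                    // A lone CJK character has no neighbour to pair with.
                    tokens.add(run.get(0));
                } else {
                    // Emit overlapping 2-grams: AB, BC, CD, ...
                    for (int k = 0; k + 1 < run.size(); k++) {
                        tokens.add(run.get(k) + run.get(k + 1));
                    }
                }
            } else if (Character.isLetterOrDigit(cp)) {
                // A run of non-CJK letters or digits is one lowercased token.
                StringBuilder word = new StringBuilder();
                while (i < text.length()) {
                    int c = text.codePointAt(i);
                    if (isCjk(c) || !Character.isLetterOrDigit(c)) {
                        break;
                    }
                    word.appendCodePoint(c);
                    i += Character.charCount(c);
                }
                tokens.add(word.toString().toLowerCase(Locale.ROOT));
            } else {
                // Whitespace and punctuation only separate tokens.
                i += Character.charCount(cp);
            }
        }
        return tokens;
    }

    // Assumption: a query matches when all of its tokens were indexed.
    static boolean matches(String indexedText, String query) {
        return new HashSet<>(tokenize(indexedText)).containsAll(tokenize(query));
    }

    public static void main(String[] args) {
        System.out.println(tokenize("QES様文字化けテスト"));
        System.out.println(tokenize("QES様"));
        System.out.println(matches("QES様文字化けテスト", "QES様"));
    }
}
```

Under these assumptions, the indexed text "QES様文字化けテスト" yields the tokens qes, 様文, 文字, 字化, 化け, けテ, テス, スト, while the query "QES様" yields qes and 様. The single-character token 様 was never indexed, so the query does not match. The same reasoning applies to "SOA基" (token 基) and "日経B" (token b, where the index holds bp).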
  • Kevin Hayen (JIRA) at Oct 5, 2010 at 2:38 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917994#action_12917994 ]

    Kevin Hayen commented on LUCENE-2673:
    -------------------------------------

    That is the current behavior. However, after checking with our Japanese office, I have confirmed that Western and Asian characters are commonly placed side by side, so the current behavior does not match what users will expect.

Posted: Sep 27, 2010 at 2:24 pm
Active: Oct 5, 2010 at 2:38 pm