Bug in CJKTokenizer
org.apache.lucene.analysis.cjk.CJKTokenizer is in the "contrib" portion of lucene, so I'm not sure if this is the right place to mention this or not. I was doing some detailed analysis of how this tokenizer worked and noticed the following behavior (which I would classify as a bug).



If you pass the word "construccion" to the tokenizer, it returns a single token: "construccion". That seems correct. If you pass the word "construcción" to this tokenizer, it generates three tokens: "construcci", "ó", and "n". This happens because the accented "o" is not treated as a Latin-1 character. Splitting the word seems like a bug and violates the "does a decent job for most European languages" statement.
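
For reference, here is a minimal sketch that reproduces the behavior. It assumes the contrib CJKTokenizer and the Lucene 2.3-era TokenStream API (Token next() / termText()); the class name is made up for illustration and the method names differ in later releases.

import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.cjk.CJKTokenizer;

public class CJKTokenizerDemo {
    public static void main(String[] args) throws Exception {
        // Feed the accented word to the tokenizer and print every token it emits.
        CJKTokenizer tokenizer = new CJKTokenizer(new StringReader("construcción"));
        for (Token token = tokenizer.next(); token != null; token = tokenizer.next()) {
            System.out.println(token.termText()); // prints "construcci", "ó", "n"
        }
    }
}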



The fix seems straightforward. I replaced the following two lines (in the CJKTokenizer class):



if ((ub == Character.UnicodeBlock.BASIC_LATIN)
    || (ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS))


With



if ((ub == Character.UnicodeBlock.BASIC_LATIN)                    // chars 0x00-0x7f
    || (ub == Character.UnicodeBlock.LATIN_1_SUPPLEMENT)          // chars 0x80-0xff
    || (ub == Character.UnicodeBlock.LATIN_EXTENDED_A)            // chars 0x100-0x17f
    || (ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS))
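
To see why the extra blocks matter, a quick check with the standard java.lang Character.UnicodeBlock API (no Lucene involved) shows which block each character falls into; the accented "ó" sits in LATIN_1_SUPPLEMENT rather than BASIC_LATIN, so the original check splits on it:

public class UnicodeBlockCheck {
    public static void main(String[] args) {
        // Character.UnicodeBlock.of(char) returns the Unicode block a character belongs to.
        System.out.println(Character.UnicodeBlock.of('c'));  // BASIC_LATIN
        System.out.println(Character.UnicodeBlock.of('ó'));  // LATIN_1_SUPPLEMENT
        System.out.println(Character.UnicodeBlock.of('ā'));  // LATIN_EXTENDED_A
    }
}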


Am I missing something or does this seem like a reasonable thing to want to do?


  • Steven A Rowe at Jul 18, 2008 at 9:32 pm
    Hi Scott,

    I think this sounds reasonable, but why not also add LATIN_EXTENDED_B and LATIN_EXTENDED_ADDITIONAL? AFAICT, among other things, these cover some eastern European languages and Vietnamese, respectively.
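
    As a sketch (not a patch against the actual class), a check covering all of the Latin blocks mentioned in this thread might look like the following; the helper name is made up for illustration:

    public class LatinBlockCheck {
        // Hypothetical helper mirroring the block test discussed in this thread.
        static boolean isLatinOrFullwidth(char c) {
            Character.UnicodeBlock ub = Character.UnicodeBlock.of(c);
            return ub == Character.UnicodeBlock.BASIC_LATIN                 // U+0000-U+007F
                || ub == Character.UnicodeBlock.LATIN_1_SUPPLEMENT          // U+0080-U+00FF
                || ub == Character.UnicodeBlock.LATIN_EXTENDED_A            // U+0100-U+017F
                || ub == Character.UnicodeBlock.LATIN_EXTENDED_B            // U+0180-U+024F
                || ub == Character.UnicodeBlock.LATIN_EXTENDED_ADDITIONAL   // U+1E00-U+1EFF
                || ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS;
        }

        public static void main(String[] args) {
            System.out.println(isLatinOrFullwidth('ó'));   // true  (Latin-1 Supplement)
            System.out.println(isLatinOrFullwidth('ệ'));   // true  (Latin Extended Additional, used in Vietnamese)
            System.out.println(isLatinOrFullwidth('中'));  // false (CJK Unified Ideographs)
        }
    }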

    Steve
  • Scott Smith at Jul 18, 2008 at 9:48 pm
    I'm certainly not a language expert, so you may be correct. I do see references to some eastern European languages in the descriptions of those two blocks, so perhaps they should be added as well.


Discussion Overview

group: java-user
category: lucene
posted: Jul 18, '08 at 9:04p
active: Jul 18, '08 at 9:48p
posts: 3
users: 2 (Scott Smith: 2 posts, Steven A Rowe: 1 post)
website: lucene.apache.org
