org.apache.lucene.analysis.cjk.CJKTokenizer is in the "contrib" portion of Lucene, so I'm not sure whether this is the right place to mention this. I was doing some detailed analysis of how this tokenizer works and noticed the following behavior (which I would classify as a bug).

If you pass the word "construccion" to the tokenizer, it returns a single token: "construccion". That seems correct. If you pass the word "construcción" to this tokenizer, it generates three tokens: "construcci", "ó", and "n". This happens because the accented "ó" (U+00F3) belongs to the Latin-1 Supplement Unicode block rather than Basic Latin, so the tokenizer does not treat it as a Latin character. Splitting the word seems like a bug and violates the "does a decent job for most European languages" statement.
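To see why the split happens, it may help to print the Unicode block the JDK reports for each character of the word. This is only a minimal sketch: the class name BlockLookup and its helper blockOf are made up for illustration and are not part of Lucene. Only Character.UnicodeBlock.of is a real JDK call.

```java
// Hypothetical helper class for illustration; not part of Lucene.
class BlockLookup {
    // Returns the Unicode block the JDK assigns to a single char.
    static Character.UnicodeBlock blockOf(char c) {
        return Character.UnicodeBlock.of(c);
    }

    public static void main(String[] args) {
        // "construcción", with the accented letter written as an escape
        // so the source file encoding does not matter.
        String word = "construcci\u00f3n";
        for (char c : word.toCharArray()) {
            System.out.println(c + " -> " + blockOf(c));
        }
    }
}
```

Every character should be reported as BASIC_LATIN except the "ó", which is reported as LATIN_1_SUPPLEMENT, so a test that only accepts BASIC_LATIN sees it as a different kind of character.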

The fix seems straightforward. In the CJKTokenizer class, I replaced the following 2 lines:

if ((ub == Character.UnicodeBlock.BASIC_LATIN)
    || (ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS))

with these 4 lines:

if ((ub == Character.UnicodeBlock.BASIC_LATIN)                      // chars 0x00-0x7f
    || (ub == Character.UnicodeBlock.LATIN_1_SUPPLEMENT)            // chars 0x80-0xff
    || (ub == Character.UnicodeBlock.LATIN_EXTENDED_A)              // chars 0x100-0x17f
    || (ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS))
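The effect of the change can be checked in isolation by pulling the two conditions out into standalone predicates. This is a sketch under assumptions: LatinCheck, oldCheck, and newCheck are hypothetical names used only here, and the surrounding CJKTokenizer logic (buffering, offsets, any further per-character tests) is deliberately left out.

```java
// Hypothetical standalone comparison of the two conditions; not Lucene code.
class LatinCheck {
    // The original condition: only Basic Latin and half/full-width forms.
    static boolean oldCheck(char c) {
        Character.UnicodeBlock ub = Character.UnicodeBlock.of(c);
        return ub == Character.UnicodeBlock.BASIC_LATIN
            || ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS;
    }

    // The proposed condition: also accepts Latin-1 Supplement and Latin Extended-A.
    static boolean newCheck(char c) {
        Character.UnicodeBlock ub = Character.UnicodeBlock.of(c);
        return ub == Character.UnicodeBlock.BASIC_LATIN
            || ub == Character.UnicodeBlock.LATIN_1_SUPPLEMENT
            || ub == Character.UnicodeBlock.LATIN_EXTENDED_A
            || ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS;
    }

    public static void main(String[] args) {
        char accented = '\u00f3'; // the accented o in "construcción"
        System.out.println("old: " + oldCheck(accented));
        System.out.println("new: " + newCheck(accented));
    }
}
```

One thing to keep in mind is that the Latin-1 Supplement block also contains non-letter characters (for example the multiplication sign, U+00D7), so whether those end up inside tokens depends on whatever other checks the tokenizer applies to characters in this branch.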

Am I missing something or does this seem like a reasonable thing to want to do?

Discussion Overview
group: java-user @
posted: Jul 18, '08 at 9:04p
active: Jul 18, '08 at 9:48p

2 users in discussion

Scott Smith: 2 posts
Steven A Rowe: 1 post


