Hi all.

We discovered that fullwidth letters are not treated as <LETTER> and fullwidth
digits are not treated as <DIGIT>.

This in itself is probably easy to fix (including a filter for normalising
these back to their normal-width versions), but while sanity-checking the
blocks in StandardTokenizer.jj I found some suspicious parts. There is no
comment explaining the anomalies, so I'd like to check whether they are by
design.
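
To make the normalising idea concrete, here is a minimal sketch of the folding
step, not an actual Lucene filter. The class and method names are made up for
illustration. It relies only on the Unicode layout: the fullwidth ASCII
variants U+FF01 to U+FF5E sit at a fixed offset of 0xFEE0 from U+0021 to
U+007E.

```java
// Hypothetical helper, not part of Lucene: folds fullwidth ASCII variants
// (U+FF01..U+FF5E) back to their normal-width counterparts (U+0021..U+007E).
final class FullwidthFolding {
    private static final int FULLWIDTH_START = 0xFF01; // FULLWIDTH EXCLAMATION MARK
    private static final int FULLWIDTH_END = 0xFF5E;   // FULLWIDTH TILDE
    private static final int OFFSET = 0xFEE0;          // 0xFF01 - 0x0021

    private FullwidthFolding() {}

    /** Returns a copy of the input with fullwidth ASCII variants folded to ASCII. */
    static String fold(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c >= FULLWIDTH_START && c <= FULLWIDTH_END) {
                // Same character, shifted down into the ASCII range.
                sb.append((char) (c - OFFSET));
            } else {
                // Everything else passes through unchanged.
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

For example, FullwidthFolding.fold("\uFF21\uFF22\uFF23\uFF11\uFF12\uFF13")
returns "ABC123". In a real filter the same loop would presumably run over the
token's term text. java.text.Normalizer with NFKC would also fold these
characters, but it changes many other compatibility characters too.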

Line 87:

The halfwidth Katakana "letters" (as Unicode calls them) are in <CJ> as
expected, so I'm wondering if these halfwidth Hangul "letters" should
actually be in <KOREAN> instead of <LETTER>.
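
One way to sanity-check where a character "should" go is to ask the JDK for
its Unicode script. The sketch below does that. The mapping from script to
block name is only my assumption about the intended design, not what the
grammar currently does, and it needs Character.UnicodeScript (Java 7 or
later). The class and method names are hypothetical.

```java
// Hypothetical helper: suggests a grammar block for a code point based on
// its Unicode script. The script-to-block mapping is an assumption.
final class HalfwidthScripts {
    private HalfwidthScripts() {}

    static String suggestedBlock(int codePoint) {
        switch (Character.UnicodeScript.of(codePoint)) {
            case HANGUL:
                return "KOREAN";
            case KATAKANA:
            case HIRAGANA:
            case HAN:
                return "CJ";
            default:
                // Fall back to the general category for everything else.
                if (Character.isLetter(codePoint)) {
                    return "LETTER";
                }
                return Character.isDigit(codePoint) ? "DIGIT" : "OTHER";
        }
    }
}
```

With this, suggestedBlock(0xFFA1) (HALFWIDTH HANGUL LETTER KIYEOK) gives
"KOREAN" and suggestedBlock(0xFF71) (HALFWIDTH KATAKANA LETTER A) gives "CJ",
which is the split I would have expected.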

Line 92:

This block appears to duplicate the ranges in the next three lines and
suspiciously also includes a range which belongs to <KOREAN>, making me
wonder what happens when a range is in two blocks.
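
A rough way to find such overlaps mechanically is sketched below. The ranges
you feed it would have to be copied from the grammar by hand. The ones in the
usage note are made-up placeholders, apart from U+AC00 to U+D7AF, which is
the Unicode Hangul Syllables block. This only detects overlaps. It says
nothing about how the generated tokeniser resolves them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical checker: reports every pair of inclusive ranges that overlap.
final class RangeOverlap {
    /** An inclusive character range tagged with the block it was taken from. */
    static final class Range {
        final String block;
        final int lo;
        final int hi;

        Range(String block, int lo, int hi) {
            this.block = block;
            this.lo = lo;
            this.hi = hi;
        }
    }

    private RangeOverlap() {}

    /** Returns one line per overlapping pair, e.g. "A/B: U+0041-U+005A". */
    static List<String> findOverlaps(List<Range> ranges) {
        List<String> overlaps = new ArrayList<>();
        for (int i = 0; i < ranges.size(); i++) {
            for (int j = i + 1; j < ranges.size(); j++) {
                Range a = ranges.get(i);
                Range b = ranges.get(j);
                // Two inclusive ranges overlap when each starts before the other ends.
                if (a.lo <= b.hi && b.lo <= a.hi) {
                    int lo = Math.max(a.lo, b.lo);
                    int hi = Math.min(a.hi, b.hi);
                    overlaps.add(String.format("%s/%s: U+%04X-U+%04X",
                            a.block, b.block, lo, hi));
                }
            }
        }
        return overlaps;
    }
}
```

Given a placeholder LETTER range of U+3040 to U+D7AF and a KOREAN range of
U+AC00 to U+D7AF, findOverlaps reports "LETTER/KOREAN: U+AC00-U+D7AF".
Duplicates within a single block show up the same way, since the check does
not care which block a range came from.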

In case anyone is wondering, the JFlex version of the tokeniser on Lucene
trunk has the same ranges.


Posted to java-user by Daniel Noll on Jan 7, '08 at 12:48a.