Hey Bernd,
On Fri, Feb 25, 2011 at 11:23 AM, Bernd Fehling wrote:
> Dear list,
> a very basic question about lucene, which version of
> unicode can be handled (indexed and searched) with lucene?
If you ask what the indexer / query can handle, then it is really
whatever UTF-8 can represent. Strings passed to the writer / reader are
converted to UTF-8 internally (rough picture), and on trunk we are
indexing bytes only (UTF-8 bytes by default). So the question is really
what your platform supports in terms of utilities / operations on
characters and strings. Since Lucene 3.0 we are on Java 1.5 and have
the possibility to respect code points above the BMP. Lucene 2.9 still
has Java 1.4 as a system requirement, which prevented us from moving
forward to Unicode 4.0: in Java 1.4 the methods in Character.java
operate on UTF-16 code units, while Java 1.5 added variants that
operate on full UTF-32 code points.
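To make the code unit vs. code point distinction concrete, here is a
minimal sketch using only JDK 1.5 classes (nothing Lucene-specific;
U+10400 is just an arbitrary supplementary character picked for
illustration):

public class CodePointDemo {
    public static void main(String[] args) {
        // U+10400 lies above the BMP and is stored in a String
        // as a surrogate pair of two chars.
        String s = "\uD801\uDC00";
        System.out.println(s.length());                       // 2 UTF-16 code units
        System.out.println(s.codePointCount(0, s.length()));  // 1 code point
        int cp = s.codePointAt(0);                            // 0x10400
        // The int overloads added in Java 1.5 see the full code point...
        System.out.println(Character.isLetter(cp));           // true
        // ...while the char-based 1.4-style methods see only a lone surrogate.
        System.out.println(Character.isLetter(s.charAt(0)));  // false
    }
}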
Since 3.0 was essentially a Java-generics / move-to-Java-1.5 release,
those code point APIs are not yet in use in the latest released
version. Lucene 3.1 holds a largely converted Analyzer / TokenFilter /
Tokenizer codebase (I think there are one or two that still have
problems, I should check... Robert, did we fix all the NGram stuff?).
Long story short: the 2.9 / 3.0 Tokenizers and TokenFilters only
support characters within the BMP (<= 0xFFFF). 3.1 (to be released
soon, I hope) fixes most of these problems and includes ICU-based
analysis for full Unicode 5 support.
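As a quick sketch of the "4 byte utf-8" case from your question (again
plain JDK, no Lucene classes): any code point above the BMP takes four
bytes in UTF-8 and a surrogate pair in UTF-16, which is exactly what a
char-by-char tokenizer cannot treat as a single character.

import java.io.UnsupportedEncodingException;

public class Utf8WidthDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String bmp  = "\u00E9";        // U+00E9, inside the BMP
        String supp = "\uD801\uDC00";  // U+10400, above the BMP
        System.out.println(bmp.getBytes("UTF-8").length);   // 2 bytes in UTF-8
        System.out.println(supp.getBytes("UTF-8").length);  // 4 bytes in UTF-8
        System.out.println(supp.length());                  // 2 chars: a surrogate pair
    }
}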
hope that helps
simon
> It looks like lucene can only handle the very old Unicode 2.0
> but not the newer 3.1 version (4 byte utf-8 unicode).
> Is that true?
> Regards,
> Bernd
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
---------------------------------------------------------------------