Olly Betts wrote:
A quick answer as I have almost no spare time this week...
On Tue, Feb 26, 2008 at 01:27:36AM -0800, Rick Olson wrote:
chun yu wrote:
I am wondering whether version 1.0.5 supports Chinese/Japanese
indexing.
There's nothing specific to Chinese or Japanese currently, although we
do support all of Unicode in the character classification code, so
Chinese and Japanese characters should be correctly identified as part
of words.
Exact matches work, which in my personal tests account for a majority of
searches; more on that later.
Or how can I implement support for indexing Chinese?
The usual approaches are based on n-gram matching. Someone posted a
link to some code they'd written (and I think were using with Xapian)
on the list, but I've not had a chance to study it yet.
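To make the n-gram idea concrete, here is a minimal sketch in Python. It is not the code mentioned above and not part of Xapian; the function name cjk_bigrams is made up for illustration, and it assumes the input is already a run of CJK text with no Latin words mixed in. Each pair of adjacent characters becomes one index term, so no word segmentation is needed.

```python
def cjk_bigrams(text):
    """Split a run of CJK text into overlapping two-character terms.

    Hypothetical helper for illustration only; whitespace is dropped
    because it does not mark word breaks in these languages.
    """
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        # A lone character is kept as a single term; empty input gives no terms.
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


terms = cjk_bigrams("中文搜索")
print(terms)
```

A real implementation would probably also have to decide how to treat mixed CJK and Latin text, which this sketch ignores.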
senna supports it well, in this particular case. More later, yet again :)
I haven't yet successfully used Xapian for indexing any character from
the CJK set in a production environment, but from my experience so far
it's not so convenient to use it for such a thing (no stemming support
that I can see, and significance of spaces in many cases!).
My understanding is that stemming isn't really meaningful for Chinese.
I'm not aware of a suitably licensed Japanese stemming algorithm.
[snip]
Spaces are only significant to TermGenerator and QueryParser.
[/snip]
Case in point, TermGenerator & QueryParser can be quite significant (as
you know) in this case.
The best approach to addressing this might be to have variants of these
designed specifically for languages which don't generally use whitespace
to signify word breaks. The important thing is that they work together so
if both use n-grams, everything should work.
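As a rough illustration of why the two sides have to agree, the sketch below uses a toy in-memory inverted index in Python rather than Xapian itself. The names ngrams, build_index and search are hypothetical, and the sample documents are invented. The only point is that indexing and query parsing call the same term-splitting function, so a query produces terms that actually exist in the index.

```python
from collections import defaultdict


def ngrams(text, n=2):
    """Overlapping character n-grams, ignoring whitespace (illustrative only)."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) <= n:
        return ["".join(chars)] if chars else []
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def build_index(docs):
    """Map each n-gram term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in ngrams(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every n-gram of the query."""
    terms = ngrams(query)  # same splitter as build_index, which is the key point
    if not terms:
        return set()
    result = None
    for term in terms:
        postings = index.get(term, set())
        result = set(postings) if result is None else result & postings
    return result


# Invented sample documents, purely for demonstration.
docs = {1: "我喜欢中文搜索", 2: "日本語の検索", 3: "中文分词很难"}
index = build_index(docs)
print(search(index, "中文"))
print(search(index, "検索"))
```

One limitation of this particular sketch: a single-character query yields a one-character term that the bigram index never contains, so it matches nothing. A fuller design would need to handle that case on both sides as well.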
Cheers,
Olly
Stemming in its proper form is not so meaningful to Chinese,
particularly, in my own limited experience (I will ask tomorrow just to
see if such a concept exists :p). Things get hairy in Japanese, and a
bit in Korean (and in all honesty, even in Chinese). The Japanese
language, bless its heart, is actually relatively simple if handled
100% properly. Unfortunately, much like the English language itself,
there are multiple permutations of word handling, and it goes
pretty deep. Xapian, as a matter of [unfortunate?] fact, does not
handle the Japanese language for beans when it comes down to the
nitty-gritty.
I don't mean to sound negative at all with my previous statement; I'd go
so far as to say that Xapian, at its core, should probably avoid
catering to CJK sets at all due to their inherent complexity (but I'd
not whine if proper support were implemented in an elegant way, it'd
just surprise me if it were possible).
I do make some attempt to make sure I'm not propagating FUD with my
statements, so please call me out if I'm in the wrong direction :)
Kind Regards,
Rick