Hi, all

I am studying the Xapian project.

I am wondering whether version 1.0.5 supports Chinese/Japanese indexing.

If so, could you please point me to the code in the project that implements it?

Or how can I implement support for indexing Chinese?


Thanks a lot!

  • Rick Olson at Feb 26, 2008 at 9:27 am

    Hello,

    For indexing of Chinese/Japanese/Korean data, I have to suggest a
    product called Senna (http://qwik.jp/senna/). It is also free as in
    free (if that's what floats your boat), but is not Xapian specifically.
    I haven't yet successfully used Xapian for indexing any character from
    the CJK set in a production environment, but from my experience so far
    it's not so convenient to use it for such a thing (no stemming support
    that I can see, and significance of spaces in many cases!).

    Technically, however, indexing has _functioned_ in about 90% of the
    cases I've tested that involve Asian characters; it's just not catered
    to that use, is all.

    Perhaps one of the core developers can shed some light on this
    situation, but I do believe I am correct with my personal tests. There
    are other solutions as well which cater to Asian character sets.

    Regards,
    Rick
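The "significance of spaces" point above can be illustrated with a minimal sketch. It assumes a term generator that simply splits text on non-word characters, which is a simplification and not Xapian's actual code; `word_terms` is a hypothetical helper introduced only for this example.

```python
import re
import unicodedata


def word_terms(text):
    """Split text into runs of word characters (hypothetical helper).

    This mimics, very roughly, a term generator that relies on spaces
    and punctuation to find word boundaries.
    """
    return re.findall(r"\w+", text)


# CJK ideographs are classified as letters (category "Lo"),
# so they count as word characters.
print(unicodedata.category("中"))

doc_terms = word_terms("Xapian 支持中文索引吗")

# The unbroken Chinese run becomes one long term, so only a query for
# the whole run matches; a query for a word inside it does not.
print(doc_terms)
print("支持中文索引吗" in doc_terms)
print("中文" in doc_terms)
```

Under this assumption, exact matches on a whole run work, but a search for a shorter word contained in the run finds nothing.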
  • Olly Betts at Feb 26, 2008 at 9:48 am
    A quick answer as I have almost no spare time this week...
    On Tue, Feb 26, 2008 at 01:27:36AM -0800, Rick Olson wrote:
    > chun yu wrote:
    > > I am wondering whether version 1.0.5 supports Chinese/Japanese
    > > indexing.

    There's nothing specific to Chinese or Japanese currently, although we
    do support all of Unicode in the character classification code, so
    Chinese and Japanese characters should be correctly identified as part
    of words.

    > > Or how can I implement support for indexing Chinese?

    The usual approaches are based on n-gram matching. Someone posted a
    link to some code they'd written (and I think were using with Xapian)
    on the list, but I've not had a chance to study it yet.

    > I haven't yet successfully used Xapian for indexing any character from
    > the CJK set in a production environment, but from my experience so far
    > it's not so convenient to use it for such a thing (no stemming support
    > that I can see, and significance of spaces in many cases!).

    My understanding is that stemming isn't really meaningful for Chinese.
    I'm not aware of a suitably licensed Japanese stemming algorithm.

    Spaces are only significant to TermGenerator and QueryParser. The best
    approach to addressing this might be to have variants of these designed
    specifically for languages which don't generally use whitespace to
    signify word breaks. The important thing is that they work together so
    if both use n-grams, everything should work.

    Cheers,
    Olly
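The suggestion that indexing and query parsing should both use n-grams can be sketched roughly as follows. This is a minimal illustration in plain Python, not Xapian code; `is_cjk`, `ngrams`, and `make_terms` are hypothetical helpers, and the code point ranges are an assumed, incomplete approximation of "CJK".

```python
from itertools import groupby


def is_cjk(ch):
    """Rough CJK test (assumed ranges: Hiragana/Katakana, CJK Unified
    Ideographs, Hangul Syllables). Real coverage would need more ranges."""
    cp = ord(ch)
    return (0x3040 <= cp <= 0x30FF
            or 0x4E00 <= cp <= 0x9FFF
            or 0xAC00 <= cp <= 0xD7A3)


def char_class(ch):
    """Classify a character as CJK, ordinary word character, or separator."""
    if is_cjk(ch):
        return "cjk"
    if ch.isalnum():
        return "word"
    return "sep"


def ngrams(run, n=2):
    """Return overlapping n-grams of a run; short runs are kept whole."""
    if len(run) <= n:
        return [run]
    return [run[i:i + n] for i in range(len(run) - n + 1)]


def make_terms(text, n=2):
    """Produce terms: n-grams for CJK runs, lowercased words otherwise."""
    terms = []
    for kind, group in groupby(text, key=char_class):
        run = "".join(group)
        if kind == "cjk":
            terms.extend(ngrams(run, n))
        elif kind == "word":
            terms.append(run.lower())
    return terms


# The same function is used for the document and for the query.
doc_terms = make_terms("Xapian 支持中文索引")
query_terms = make_terms("中文索引")
print(doc_terms)
print(query_terms)
print(set(query_terms) <= set(doc_terms))
```

Because the same function produces terms on both sides, a query for any stretch of two or more characters inside an indexed CJK run yields bigrams that are also in the index. In a real index each term could also be stored with its position, for example through `Document.add_posting` in the Xapian bindings, so that phrase matching can check that the bigrams are adjacent.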
  • Rick Olson at Feb 26, 2008 at 10:15 am

    Olly Betts wrote:
    > A quick answer as I have almost no spare time this week...
    > On Tue, Feb 26, 2008 at 01:27:36AM -0800, Rick Olson wrote:
    > > chun yu wrote:
    > > > I am wondering whether version 1.0.5 supports Chinese/Japanese
    > > > indexing.
    > There's nothing specific to Chinese or Japanese currently, although we
    > do support all of Unicode in the character classification code, so
    > Chinese and Japanese characters should be correctly identified as part
    > of words.

    Exact matches work, which in my personal tests account for a majority of
    searches. More later.

    > > > Or how can I implement support for indexing Chinese?
    > The usual approaches are based on n-gram matching. Someone posted a
    > link to some code they'd written (and I think were using with Xapian)
    > on the list, but I've not had a chance to study it yet.

    Senna supports it well, in this particular case. More later, yet again :)

    > > I haven't yet successfully used Xapian for indexing any character from
    > > the CJK set in a production environment, but from my experience so far
    > > it's not so convenient to use it for such a thing (no stemming support
    > > that I can see, and significance of spaces in many cases!).
    > My understanding is that stemming isn't really meaningful for Chinese.
    > I'm not aware of a suitably licensed Japanese stemming algorithm.
    > [snip]
    > Spaces are only significant to TermGenerator and QueryParser.
    > [/snip]

    Case in point, TermGenerator & QueryParser can be quite significant (as
    you know) in this case.

    > The best
    > approach to addressing this might be to have variants of these designed
    > specifically for languages which don't generally use whitespace to
    > signify word breaks. The important thing is that they work together so
    > if both use n-grams, everything should work.
    >
    > Cheers,
    > Olly

    Stemming in its proper form is not so meaningful to Chinese,
    particularly, in my own limited experience (I will ask tomorrow just to
    see if such a concept exists :p). Things get hairy in Japanese, and a
    bit in Korean (and in all honesty, even in Chinese). The Japanese
    language, bless its heart, is actually relatively simple if handled
    100% properly. Unfortunately, much like the English language itself,
    there are multiple permutations of various word-handlings, and it goes
    pretty deep. Xapian, as a matter of [unfortunate?] fact, does not
    handle the Japanese language for beans when it comes down to the
    nitty-gritty.

    I don't mean to sound negative at all with my previous statement; I'd go
    so far as to say that Xapian, at its core, should probably avoid
    catering to CJK sets at all due to their inherent complexity (but I'd
    not whine if proper support were implemented in an elegant way, it'd
    just surprise me if it were possible).

    I do make some attempt to make sure I'm not propagating FUD with my
    statements, so please call me out if I'm in the wrong direction :)

    Kind Regards,
    Rick
  • Fabrice Colin at Feb 26, 2008 at 12:31 pm

    On Tue, 26 Feb 2008 09:48:29 +0000, Olly Betts wrote:
    > On Tue, Feb 26, 2008 at 01:27:36AM -0800, Rick Olson wrote:
    > > chun yu wrote:
    > > > I am wondering whether version 1.0.5 supports Chinese/Japanese
    > > > indexing.
    > There's nothing specific to Chinese or Japanese currently, although we
    > do support all of Unicode in the character classification code, so
    > Chinese and Japanese characters should be correctly identified as part
    > of words.
    > > > Or how can I implement support for indexing Chinese?
    > The usual approaches are based on n-gram matching. Someone posted a
    > link to some code they'd written (and I think were using with Xapian)
    > on the list, but I've not had a chance to study it yet.

    Yung-chung Lin wrote a CJKV n-gram tokenizer. The source is here:
    http://svn.berlios.de/wsvn/dijon/trunk/cjkv/?rev=0&sc=1
    It's not tied to Xapian in particular. It needs libunicode 0.4 or glib.

    I make use of it in Pinot, to generate terms when indexing CJKV documents,
    and at search time to pre-process CJKV queries before feeding them to the
    QueryParser.

    Fabrice
  • Jean-Francois Dockes at Mar 6, 2008 at 8:16 am

    Fabrice Colin writes:
    > Yung-chung Lin wrote a CJKV n-gram tokenizer. The source is here:
    > http://svn.berlios.de/wsvn/dijon/trunk/cjkv/?rev=0&sc=1
    > It's not tied to Xapian in particular. It needs libunicode 0.4 or glib.
    >
    > I make use of it in Pinot, to generate terms when indexing CJKV documents,
    > and at search time to pre-process CJKV queries before feeding them to the
    > QueryParser.

    Just for the record, Recoll also has limited ngram-based CJK support, not
    based on Yung-Chung Lin's code (which was the initial inspiration).

    It's relatively primitive, but the few users tell me that it is at least
    better than nothing and "good enough" in many cases.

    J.F. Dockes

Discussion Overview
group: xapian-discuss @
categories: xapian
posted: Feb 26, '08 at 8:02a
active: Mar 6, '08 at 8:16a
posts: 6
users: 5
website: xapian.org
irc: #xapian
