Hi, guys,
I found Analyzers for Japanese, Korean and Chinese, but not stemmers;
the Snowball stemmers only include European languages. Does stemming
not make sense for ideograph-based languages (i.e., no stemming is
needed for Japanese, Korean and Chinese)?

Also for spell checking, does the default Lucene SpellChecker work for
Japanese, Korean and Chinese? Does edit distance make sense for these
languages?

What other gotchas can you guys think of when making Lucene work with
foreign languages, besides analyzers, stemmers and spell checking? Thanks
in advance for your help.


  • Mathieu Lecarme at Jul 25, 2007 at 7:00 am

    On Tuesday, July 24, 2007 at 13:01 -0700, Shaw, James wrote:
    Hi, guys,
    I found Analyzers for Japanese, Korean and Chinese, but not stemmers;
    the Snowball stemmers only include European languages. Does stemming
    not make sense for ideograph-based languages (i.e., no stemming is
    needed for Japanese, Korean and Chinese)?

    No.

    Also for spell checking, does the default Lucene SpellChecker work for
    Japanese, Korean and Chinese? Does edit distance make sense for these
    languages?

    Japanese uses groups of ideograms, and Levenshtein distance doesn't make
    sense for words of only a few characters, but I'm not a CJK expert.

    M.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Maximilian Hütter at Jul 25, 2007 at 9:23 am

    Mathieu Lecarme wrote:
    On Tuesday, July 24, 2007 at 13:01 -0700, Shaw, James wrote:
    Hi, guys,
    I found Analyzers for Japanese, Korean and Chinese, but not stemmers;
    the Snowball stemmers only include European languages. Does stemming
    not make sense for ideograph-based languages (i.e., no stemming is
    needed for Japanese, Korean and Chinese)?

    No.

    This is not quite correct. Chinese doesn't need any stemming, but
    Japanese is not completely ideograph-based and could use stemming. I
    doubt anyone has done this, apart from some commercial software for the
    Japanese market. I don't know about Korean.

    Also for spell checking, does the default Lucene SpellChecker work for
    Japanese, Korean and Chinese? Does edit distance make sense for these
    languages?

    Japanese uses groups of ideograms, and Levenshtein distance doesn't make
    sense for words of only a few characters, but I'm not a CJK expert.

    M.

    Edit distance only seems to work for languages written in Latin-based
    characters. Spell checking Chinese, Japanese (and Korean?) is more or
    less pointless, as they are entered using input methods, which should
    produce "correct" words.

    Best regards,

    Max


    --
    Maximilian Hütter
    blue elephant systems GmbH
    Wollgrasweg 49
    D-70599 Stuttgart

    Tel : (+49) 0711 - 45 10 17 578
    Fax : (+49) 0711 - 45 10 17 573
    e-mail : max.huetter@blue-elephant-systems.com
    Registered office : Stuttgart, Amtsgericht Stuttgart, HRB 24106
    Managing directors: Joachim Hörnle, Thomas Gentsch, Holger Dietrich
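
The point made in both replies, that edit distance is not very informative for words of only a few characters, can be illustrated with a small sketch. This is plain Java and does not use Lucene's SpellChecker; the class name EditDistanceSketch, its methods, and the example words are assumptions chosen for illustration only. It computes Levenshtein distance over Unicode code points and then divides by the length of the longer word.

```java
class EditDistanceSketch {

    // Levenshtein distance by dynamic programming, computed over
    // Unicode code points so that each character counts exactly once.
    static int distance(String a, String b) {
        int[] s = a.codePoints().toArray();
        int[] t = b.codePoints().toArray();
        int[] prev = new int[t.length + 1];
        int[] curr = new int[t.length + 1];
        for (int j = 0; j <= t.length; j++) {
            prev[j] = j; // turning "" into the first j characters of t takes j insertions
        }
        for (int i = 1; i <= s.length; i++) {
            curr[0] = i; // turning the first i characters of s into "" takes i deletions
            for (int j = 1; j <= t.length; j++) {
                int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insertion, deletion
                        prev[j - 1] + cost);                    // match or substitution
            }
            int[] tmp = prev; // reuse the two rows instead of a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[t.length];
    }

    // Distance divided by the length (in code points) of the longer word.
    // 0.0 means identical; larger values mean a larger share of the word changed.
    static double relativeDistance(String a, String b) {
        int longer = Math.max(a.codePointCount(0, a.length()),
                              b.codePointCount(0, b.length()));
        return longer == 0 ? 0.0 : (double) distance(a, b) / longer;
    }

    public static void main(String[] args) {
        // Illustrative word pairs only (an assumption, not from the thread).
        System.out.println(distance("analyzer", "analyser"));         // 1
        System.out.println(relativeDistance("analyzer", "analyser")); // 0.125
        System.out.println(distance("東京", "東北"));                   // 1
        System.out.println(relativeDistance("東京", "東北"));           // 0.5
    }
}
```

In both pairs the raw distance is 1, but for the two-character words a single edit replaces half of the word and produces a different, valid word, so a "distance 1" suggestion carries much less signal than it does for an eight-letter Latin-script word.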


Discussion Overview
group: java-user
categories: lucene
posted: Jul 24, '07 at 8:01p
active: Jul 25, '07 at 9:23a
posts: 3
users: 3
website: lucene.apache.org
