Hello,

What's a good source for dictionaries (for spelling correction) and/or
thesauri (for synonyms) that can be used with Lucene for non-English
languages such as French, Chinese, Korean, etc.?

For example, the wordnet contrib module is based on the data set
provided by the Princeton-based WordNet system, but I'm wondering where
Lucene users go for similarly reliable sources for other languages?

Thanks!

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


  • Hong-Thai Nguyen at Jan 6, 2011 at 5:39 pm
    Hi,

    I'm not sure these non-English spellcheckers, analyzers and related resources are a good idea in real usage. English grammar is quite simple and can be captured by Porter's rules, but other languages are very different. For example, Porter's rules do not work well for French grammar, nor for Asian languages. The language libraries provided with Lucene are practical, but are they satisfactory in real usage? Each language needs its own evaluation effort.

    Even once you have settled on a specific language, it is still difficult to decide what level of quality is satisfactory. A plain Lucene analyzer may be enough for general search, but for sophisticated content where high-quality results are required, you may need a morpho-syntactic or semantic analyzer. Such analyzers are expensive, both in cost and in processing time.

    Developing a spellchecker, analyzer, stemmer, etc. for a language, together with its dictionaries, is still difficult and beyond the scope of most developers. And search providers keep these advanced features private.

    For each language, you can find a library (mostly from a research context) for spellchecking and analysis. You then have to integrate it into Lucene yourself.
    The Apache OpenNLP project is a nice effort to collect NLP work on many languages from around the world and integrate it into a common platform: http://incubator.apache.org/opennlp/

    Princeton WordNet is a collection of concept definitions, not a dictionary. There are WordNets for some other languages: http://www.globalwordnet.org/gwa/wordnet_table.htm

    I'm wondering how useful WordNet is in a search context. You may be able to use synsets (synonyms) as a suggestion dictionary. But stopword lists, stemmers and analyzer dictionaries depend on their associated modules.
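
To make the "synsets as a suggestion dictionary" idea concrete, here is a minimal Python sketch of query-term expansion. The function name `expand_query` and the tiny synonym table are hypothetical and only for illustration; a real table would be loaded from a WordNet for the target language, and nothing here is Lucene API.

```python
def expand_query(terms, synsets):
    """Return the query terms followed by their known synonyms.

    `synsets` is assumed to map a lowercase word to a list of synonyms.
    Order is preserved and duplicates are dropped.
    """
    expanded = []
    for term in terms:
        # The original term always comes first, then its suggestions.
        for candidate in [term] + synsets.get(term.lower(), []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


# Hypothetical synonym table, for illustration only.
SAMPLE_SYNSETS = {
    "fast": ["quick", "rapid"],
    "car": ["automobile"],
}

suggestions = expand_query(["fast", "car"], SAMPLE_SYNSETS)
```

Whether the expanded terms are shown to the user as suggestions or added to the query directly is a design choice; the latter can hurt precision if the synonyms are too loose.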

    Best,

    -------------------
    Hong-Thai
    -----Original Message-----
    From: Pulkit Singhal
    Sent: Thursday, January 6, 2011 17:54
    To: java-user@lucene.apache.org
    Subject: Where to find non-English dictionaries, thesaurus, synonyms

  • Robert Muir at Jan 7, 2011 at 9:27 pm

    On Thu, Jan 6, 2011 at 11:53 AM, Pulkit Singhal wrote:
    > Hello,
    >
    > What's a good source to get dictionaries (for spellcorrections) and/or
    > thesaurus (for synonyms) that can be used with Lucene for non-English
    > languages such as French, Chinese, Korean etc?

    If you can't find a wordlist of correctly spelled words somewhere
    else, you can always try
    http://wiki.services.openoffice.org/wiki/Dictionaries, grab the
    OpenOffice spellchecker dictionary for that language, and use the
    hunspell "unmunch" command (sort of like morphological generation) to
    generate a list of words you could then use with PlainTextDictionary.
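
As a rough sketch of the last step: PlainTextDictionary reads a plain file with one word per line, so the generated list can be tidied up before it is indexed. The Python below assumes the unmunch output is already available as lines of text. The function names and the filtering rules (drop blank lines, drop entries containing digits, de-duplicate) are illustrative assumptions, not part of Lucene or hunspell, and writing UTF-8 is also an assumption to check against your Lucene version.

```python
def clean_wordlist(lines):
    """Return a sorted list of unique words, one per input line."""
    seen = set()
    for line in lines:
        word = line.strip()
        # Skip blank lines and entries containing digits (assumed to be noise).
        if not word or any(ch.isdigit() for ch in word):
            continue
        seen.add(word)
    return sorted(seen)


def write_wordlist(words, path):
    """Write the words in a one-word-per-line layout (UTF-8 assumed)."""
    with open(path, "w", encoding="utf-8") as out:
        for word in words:
            out.write(word + "\n")


# Made-up sample standing in for real unmunch output.
sample_lines = ["maison\n", "maisons\n", "\n", "maison\n", "chat\n", "2e\n"]
cleaned = clean_wordlist(sample_lines)
```

The resulting file is what you would then hand to PlainTextDictionary when building the spellchecker index.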

    > For example, the wordnet contrib module is based on the data set
    > provided by the Princeton based wordnet system but I'm wondering where
    > the Lucene users go for similar reliable source for other languages?

    In this case I would also investigate the OpenOffice thesaurus data,
    if you can't find anything else.
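
For anyone exploring the thesaurus data, here is a minimal Python sketch that turns it into a word-to-synonyms map. It assumes the MyThes-style `.dat` layout that OpenOffice thesauri commonly use: a first line naming the encoding, then entries of the form `word|N` followed by N lines of `(pos)|syn1|syn2|...`. Check the actual file for your language before relying on this. The function name `parse_thesaurus` and the sample entries are made up for illustration.

```python
def parse_thesaurus(lines):
    """Parse MyThes-style thesaurus lines into {word: [synonyms]}."""
    it = iter(lines)
    next(it, None)  # first line names the character encoding; skipped here
    synonyms = {}
    for header in it:
        header = header.strip()
        if not header:
            continue
        # Entry header: "word|number_of_meaning_lines"
        word, _, count = header.rpartition("|")
        entry = synonyms.setdefault(word, [])
        for _ in range(int(count)):
            meaning = next(it).strip()
            # First field is the part of speech, e.g. "(adj)"; the rest are synonyms.
            for syn in meaning.split("|")[1:]:
                if syn and syn not in entry:
                    entry.append(syn)
    return synonyms


# Made-up sample in the assumed layout.
sample_dat = [
    "UTF-8",
    "simple|2",
    "(adj)|elementary|uncomplicated",
    "(noun)|herb",
    "rapide|1",
    "(adj)|vite|prompt",
]
thesaurus = parse_thesaurus(sample_dat)
```

This sketch merges all meanings of a word into one flat list; keeping them separate per part of speech may be preferable if the synonyms are used for query expansion.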

  • Paul Libbrecht at Jan 7, 2011 at 9:36 pm
    Somehow, I had the impression that the TrebleCLEF and EuroMatrix European projects are meant to gather these kinds of information sources.

    But honestly, it's not as homogeneous as in OpenOffice.
    Mozilla also has dictionaries.
    Wiktionary can also be helpful.

    paul



Discussion Overview
group: java-user
categories: lucene
posted: Jan 6, '11 at 4:54p
active: Jan 7, '11 at 9:36p
posts: 4
users: 4
website: lucene.apache.org
