How international languages are supported in Lucene
Would someone tell me how Lucene supports indexing and searching documents
that contain international languages? What do I need to do in addition to
using the StandardAnalyzer?



Thanks.


  • Grant Ingersoll at Jun 5, 2008 at 3:53 pm
    Hi Michael,

    That's a pretty open-ended question and, I'm assuming, by
    "international languages" you mean non-English :-). You might get
    some mileage out of http://wiki.apache.org/lucene-java/IndexingOtherLanguages
    but it is a bit out of date (namely the sandbox references).
    Lucene indexes non-English languages just like it does English. You
    need to figure out which Analyzer you need (have a look at the contrib/
    analyzers code/javadocs for many existing languages) and then pretty
    much everything else is the same. Namely, the same principles apply
    (what to store, index, etc.) as they do in English.

    Did you have something specific in mind, e.g. how to handle Chinese
    or some other particular language? Lastly, if you do have a language in
    mind, try searching the mail archives for the name of that language.
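The core of Grant's advice is that tokenization has to be language-aware while the rest of indexing stays the same. A rough, JDK-only illustration of that idea (this is not the Lucene API; the class and method names below are made up for the sketch) is the locale-aware word segmentation that `java.text.BreakIterator` performs, which is the kind of work an Analyzer's tokenizer does:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class LocaleTokenizeSketch {
    // Collect the letter-bearing word tokens of a text for a given locale.
    static List<String> tokenize(String text, Locale locale) {
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE;
                start = end, end = words.next()) {
            String candidate = text.substring(start, end);
            // BreakIterator also reports boundaries around spaces and
            // punctuation; keep only segments that start with a letter.
            if (Character.isLetter(candidate.codePointAt(0))) {
                tokens.add(candidate);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("Le chat est arrivé", Locale.FRENCH);
        if (!tokens.equals(List.of("Le", "chat", "est", "arrivé"))) {
            throw new AssertionError(tokens);
        }
        System.out.println(tokens);
    }
}
```

A real Lucene Analyzer layers stop-word removal and stemming on top of this segmentation step, which is exactly the part that differs per language.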

    HTH,
    Grant
    On Jun 5, 2008, at 11:32 AM, Michael Siu wrote:

    Would someone tell me how Lucene supports indexing and searching
    documents that contain international languages? What do I need to do
    in addition to using the StandardAnalyzer?



    Thanks.





    --------------------------
    Grant Ingersoll
    http://www.lucidimagination.com

    Lucene Helpful Hints:
    http://wiki.apache.org/lucene-java/BasicsOfPerformance
    http://wiki.apache.org/lucene-java/LuceneFAQ








    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Michael Siu at Jun 5, 2008 at 4:04 pm
    Grant,

    Thanks for the timely reply. :-)

    No, we do not have a specific language in mind. Basically, our document
    source could potentially contain any language in the world. Supporting
    English, Spanish, Italian, French, Chinese, Russian and Japanese would be
    the minimum set.

    Do you mean we will need a different analyzer for each language? Does
    that mean we will need to know the language of a document before we
    can index it?

    Thanks again.



  • Erick Erickson at Jun 5, 2008 at 4:51 pm
    See below
    On Thu, Jun 5, 2008 at 12:04 PM, Michael Siu wrote:

    Grant,

    Thanks for the timely reply. :-)

    No, we do not have a specific language in mind. Basically, our document
    source could potentially contain any language in the world. Supporting
    English, Spanish, Italian, French, Chinese, Russian and Japanese would be
    the minimum set.

    Do you mean we will need a different analyzer for each language? Does
    that mean we will need to know the language of a document before we
    can index it?
    Yes and yes. Try searching the mail archives for things like
    multi-language and you'll find this topic discussed ad nauseam.

    But basically, consider why this must be so, especially when
    stemming: languages are so variable that you'd get wildly
    different (and inappropriate) results if you tried to analyze them
    all with the same analyzer, especially when you get different
    character encodings in the documents.


    Best
    Erick
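Erick's two yeses imply a dispatch step: determine (or be told) the language, then pick the matching analyzer. A minimal sketch of that dispatch, keyed by ISO 639 language code (the class here is illustrative, not a Lucene API; the analyzer names are the ones found in contrib/analyzers at the time, but treat the mapping as an assumption):

```java
import java.util.Map;

public class AnalyzerPicker {
    // Illustrative language-to-analyzer table; in real code the values
    // would be Analyzer instances from Lucene's contrib/analyzers.
    static final Map<String, String> BY_LANGUAGE = Map.of(
            "en", "StandardAnalyzer",
            "fr", "FrenchAnalyzer",
            "ru", "RussianAnalyzer",
            "zh", "CJKAnalyzer",
            "ja", "CJKAnalyzer");

    static String analyzerFor(String isoLanguage) {
        // Fall back to StandardAnalyzer when the language is unknown,
        // mirroring Daniel's "naive analyser works well enough" point below.
        return BY_LANGUAGE.getOrDefault(isoLanguage, "StandardAnalyzer");
    }

    public static void main(String[] args) {
        if (!analyzerFor("ru").equals("RussianAnalyzer")) throw new AssertionError();
        if (!analyzerFor("xx").equals("StandardAnalyzer")) throw new AssertionError();
        System.out.println(analyzerFor("fr"));
    }
}
```

The same analyzer choice must then be used both at index time and at query time for a given document's fields, which is why the language has to be known up front.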

  • Michael Siu at Jun 5, 2008 at 5:20 pm
    Thanks Erick.

  • Daniel Noll at Jun 5, 2008 at 11:36 pm

    But basically, consider why this must be so, especially when
    stemming: languages are so variable that you'd get wildly
    different (and inappropriate) results if you tried to analyze them
    all with the same analyzer, especially when you get different
    character encodings in the documents.
    Well... technically encoding is out of scope for Lucene, since we're
    passing in a Reader.

    I have to say, though, that analysing with the most naive analyser
    possible (the default one with no stop words and no stemming) works
    well enough.

    Language detection isn't at a point where it's reliable enough to
    pick an analyser automatically.

    Daniel
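Daniel's point is that the bytes-to-characters step happens before Lucene is involved: the caller decodes the document, typically by wrapping the raw stream in an `InputStreamReader` with an explicit charset, and Lucene only ever sees the resulting `Reader`. A small JDK-only demonstration (the helper class is made up for the sketch):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetReader {
    // Decode bytes to characters up front; this is the caller's job,
    // not Lucene's, because the indexing API takes a Reader.
    static String readAll(byte[] bytes) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(bytes),
                        StandardCharsets.UTF_8))) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        String original = "日本語のテキスト";
        // Round-trip: encode with a known charset, decode with the same one.
        String decoded = readAll(original.getBytes(StandardCharsets.UTF_8));
        if (!original.equals(decoded)) throw new AssertionError(decoded);
        System.out.println(decoded);
    }
}
```

Pick the wrong charset at this stage and the analyzer downstream receives mojibake, which is why encoding mistakes look like analysis bugs but aren't.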

  • Otis Gospodnetic at Jun 9, 2008 at 9:50 pm
    Hi Daniel,

    What makes you say that about language detection? Wouldn't that depend on the language detection approach or tool one uses, and on the type and amount of content one trains the language detector on? And what is the threshold for "reliable enough" that you have in mind?


    Thanks,
    Otis --
    Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

  • Daniel Noll at Jun 10, 2008 at 12:09 am

    On Tuesday 10 June 2008 07:49:29 Otis Gospodnetic wrote:
    Hi Daniel,

    What makes you say that about language detection? Wouldn't that depend on
    the language detection approach or tool one uses, and on the type and amount
    of content one trains the language detector on? And what is the threshold for
    "reliable enough" that you have in mind?
    I can't come up with a number, of course, but I can say for certain that ICU's
    detector is unusable for detecting languages. It's barely good enough to
    correctly identify the charset; if you create a simple test in one charset, it
    often detects it as another. If you then re-encode the text in that charset,
    it detects it as being yet another, and so forth.

    If you know of any better [open source] libraries for the same purpose, I'd
    love to hear of them.

    Additionally, anything the developer or user has to train I consider
    unreliable also. If a detector has to be trained, it should be trained by
    the ones who are distributing it. Not everyone has a corpus of every
    language in the world in order to train such a thing. :-/

    Daniel
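Part of why charset detection is inherently guesswork, as Daniel describes, is that many byte sequences decode without error under several charsets, each yielding different text. A self-contained demonstration using only charsets the JDK guarantees:

```java
import java.nio.charset.StandardCharsets;

public class CharsetAmbiguity {
    public static void main(String[] args) {
        byte[] bytes = { (byte) 0xE9, (byte) 0xE9 };

        // Both decodes "succeed" without error, producing different text,
        // so inspecting the bytes alone cannot say which was intended.
        String latin1 = new String(bytes, StandardCharsets.ISO_8859_1); // "éé"
        String utf16 = new String(bytes, StandardCharsets.UTF_16BE);    // one private-use char

        if (!latin1.equals("éé")) throw new AssertionError(latin1);
        if (latin1.equals(utf16)) throw new AssertionError();

        // UTF-8, by contrast, rejects the sequence: the String constructor
        // substitutes U+FFFD for the malformed input.
        String utf8 = new String(bytes, StandardCharsets.UTF_8);
        if (utf8.indexOf('\uFFFD') < 0) throw new AssertionError(utf8);
        System.out.println(latin1);
    }
}
```

Detectors therefore fall back on statistical guesses about which decoding "looks like" real text, which is exactly where the misidentifications Daniel describes come from.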

  • Otis Gospodnetic at Jun 10, 2008 at 12:48 am
    Aha, I see. I wasn't referring to character-encoding-based lang ID. That is probably good enough if you need to know if the text is English or if it's Chinese or Japanese or Korean or Russian Cyrillic or Arabic or...
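The coarse, script-level identification Otis is referring to (Latin vs. Cyrillic vs. CJK vs. Arabic, as opposed to telling English from Spanish) can be sketched with the JDK's `Character.UnicodeScript` alone; the class and method names here are made up for the illustration:

```java
import java.util.HashMap;
import java.util.Map;

public class DominantScript {
    // Count the Unicode script of each letter and report the most
    // frequent one: a crude but cheap separator for broad script families.
    static String dominantScript(String text) {
        Map<Character.UnicodeScript, Integer> counts = new HashMap<>();
        text.codePoints()
            .filter(Character::isLetter)
            .forEach(cp -> counts.merge(Character.UnicodeScript.of(cp), 1, Integer::sum));
        return counts.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(e -> e.getKey().name())
                .orElse("UNKNOWN");
    }

    public static void main(String[] args) {
        if (!dominantScript("Привет, мир").equals("CYRILLIC")) throw new AssertionError();
        if (!dominantScript("hello world").equals("LATIN")) throw new AssertionError();
        if (!dominantScript("日本語").equals("HAN")) throw new AssertionError();
        System.out.println(dominantScript("Привет, мир"));
    }
}
```

Distinguishing languages within a script (English vs. French, or Chinese vs. Japanese, which share Han characters) is where the trained, statistical detectors Daniel is skeptical of become necessary.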

    I think there is a bit missing in your statement about training. You can't just train on, say, "English". There are all kinds of "English" out there. The distribution of terms in the WSJ is probably a lot different from what you might find in some corpus of medical texts or botany texts or contemporary art.

    As for not everyone having access to a corpus, I've come to love Wikipedia for exactly this thing. :) For an NLP class I took recently we used Croatian Wikipedia while developing an unsupervised morphological analyzer for highly inflected languages (Croatian being one of them - 7 cases, a pile of case/number/gender-dependent suffixes...). The analyzer ended up matching the state-of-the-art results. :)


    Otis
    --
    Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


Discussion Overview
group: java-user @ lucene.apache.org
categories: lucene
posted: Jun 5, '08 at 3:33p
active: Jun 10, '08 at 12:48a
posts: 10
users: 5
