FAQ
Thanks Erick.

-----Original Message-----
From: Erick Erickson
Sent: Thursday, June 05, 2008 9:51 AM
To: java-user@lucene.apache.org
Subject: Re: How international languages are supported in Lucene

See below
On Thu, Jun 5, 2008 at 12:04 PM, Michael Siu wrote:

Grant,

Thanks for the timely reply. :-)

No, we do not have a specific language in mind. Basically, our document
source could potentially contain any language in the world. Supporting
English, Spanish, Italian, French, Chinese, Russian and Japanese would be
the minimum set.

Do you mean we will need different analyzer for each language? Then is that
means we will need to know the language type of a document before we can
index it?
yes and yes. Try searching the mail archives for things like
multi-language and you'll find this topic discussed ad-nauseum

But basically consider why this must be so, especially when
stemming. Languages are so variable that you'd get wildly
different (and inappropriate) results if you tried to analyze them
with the same analyzer. Especially when you get different
language encodings in the document.


Best
Erick

Thanks again.



-----Original Message-----
From: Grant Ingersoll
Sent: Thursday, June 05, 2008 8:53 AM
To: java-user@lucene.apache.org
Subject: Re: How international languages are supported in Lucene

Hi Michael,

That's a pretty open ended question and, I'm assuming, by
"international languages" you mean non-English :-). You might get
some mileage out of
http://wiki.apache.org/lucene-java/IndexingOtherLanguages
but it is a bit out of date (namely the sandbox references).
Lucene indexes non-English languages just like it does English. You
need to figure out what Analyzer you need (have a look in the contrib/
Analyzers code/javadocs for many existing languages) and then pretty
much everything else is the same. Namely, the same principals apply
(what to store, index, etc.), as they do in English.

Did you have something specific in mind? i.e. how to handle Chinese
or some specific language? Lastly, if you do have a language in mind,
try searching the mail archives for the name of that language.

HTH,
Grant
On Jun 5, 2008, at 11:32 AM, Michael Siu wrote:

Would someone tell me how Lucene supports indexing and searching
documents
that contain international languages? What do I need to do in
additions to
using the StandardAnalyzer?



Thanks.





--------------------------
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ








---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Search Discussions

Discussion Posts

Previous

Follow ups

Related Discussions

Discussion Navigation
viewthread | post
posts ‹ prev | 5 of 10 | next ›
Discussion Overview
groupjava-user @
categorieslucene
postedJun 5, '08 at 3:33p
activeJun 10, '08 at 12:48a
posts10
users5
websitelucene.apache.org

People

Translate

site design / logo © 2022 Grokbase