Best practices for multiple languages?
What is the "best practice" to support multiple languages, i.e. Lucene Documents that have content/fields in multiple languages?
Should
a) each language be indexed in a separate index/directory, or should
b) the Documents (in a single directory) hold the various localized fields?

We will most often be searching "language dependent", which (at least performance-wise) mandates one directory per language...

Any (Lucene-specific) white papers on this topic?

Thx in advance
Clemens


  • Otis Gospodnetic at Jan 18, 2011 at 7:15 pm
    Hi Clemens,

    If you will be searching individual languages, go with language-specific
    indices. Wunder likes to give an example of "die" in German vs. English. :)

    Otis
    ----
    Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
    Lucene ecosystem search :: http://search-lucene.com/


  • Shai Erera at Jan 18, 2011 at 7:28 pm
    Hi

    There are two types of multi-language docs:
    1) Docs in different languages -- every document is one language
    2) Each document has fields in different languages

    I've dealt with both, and there are different solutions to each. Which of
    them is yours?

    Shai
  • Clemens Wyss at Jan 19, 2011 at 7:09 am

    1) Docs in different languages -- every document is one language
    2) Each document has fields in different languages
    We mainly have 1)-type models.

    Clemens
  • Bill Janssen at Jan 19, 2011 at 6:22 pm

    Clemens Wyss wrote:

    1) Docs in different languages -- every document is one language
    2) Each document has fields in different languages
    We mainly have 1)-type models.
    I've recently done this for UpLib. I run a language-guesser over the
    document to identify the primary language when the document comes into
    my repository, and save that language as part of the metadata for my
    document. When UpLib indexes the document into Lucene, it uses that
    language as a key into a table of available Analyzers, and uses the
    selected Analyzer for the document's text. (I'm actually doing this on
    a per-paragraph level now, but the principle is the same.)

    The tricky part is the query parser. My extended query parser allows
    a pseudo-field "_query_language" to specify that the query itself is in
    a particular language, in which case the appropriate Analyzer is used
    for the query.
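
    A minimal sketch of that language-to-Analyzer table (the language codes and
    analyzer choices below are illustrative, not UpLib's actual code; see the
    link below for the real thing):

        import java.util.HashMap;
        import java.util.Map;

        import org.apache.lucene.analysis.Analyzer;
        import org.apache.lucene.analysis.de.GermanAnalyzer;
        import org.apache.lucene.analysis.fr.FrenchAnalyzer;
        import org.apache.lucene.analysis.standard.StandardAnalyzer;
        import org.apache.lucene.util.Version;

        public class AnalyzerTable {
            // One analyzer per language the guesser can report.
            // (Contrib analyzers; constructor signatures vary across Lucene versions.)
            private static final Map<String, Analyzer> BY_LANGUAGE =
                new HashMap<String, Analyzer>();
            static {
                BY_LANGUAGE.put("de", new GermanAnalyzer());
                BY_LANGUAGE.put("fr", new FrenchAnalyzer());
                // ... one entry per supported language
            }

            // Look up the analyzer for the guessed language, falling back to
            // StandardAnalyzer when no language-specific one is available.
            public static Analyzer forLanguage(String lang) {
                Analyzer a = BY_LANGUAGE.get(lang);
                return a != null ? a : new StandardAnalyzer(Version.LUCENE_29);
            }
        }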

    You can read the code for all this at
    <http://uplib.parc.com/hg/uplib/file/e29e36f751f7/python/uplib/indexing.py>.

    Bill

  • Paul Libbrecht at Jan 19, 2011 at 6:36 pm
    So you are only indexing "analyzed" and querying "analyzed". Is that correct?
    Wouldn't it be better to prefer precise matches (e.g. a field analyzed with StandardAnalyzer) but also allow matches that are stemmed?

    paul


  • Bill Janssen at Jan 19, 2011 at 7:57 pm

    Paul Libbrecht wrote:

    So you are only indexing "analyzed" and querying "analyzed". Is that correct?
    Yes, that's correct. I fall back to StandardAnalyzer if no
    language-specific analyzer is available.
    Wouldn't it be better to prefer precise matches (e.g. a field analyzed
    with StandardAnalyzer) but also allow matches that are stemmed?
    StandardAnalyzer isn't quite precise, is it? StandardFilter makes some
    English-centric alterations to things.

    Perhaps the approach you suggest would be slightly better, but I'd have
    to see numbers on that from some reasonable corpus to be convinced it
    would be worth it.

    Bill
  • Paul Libbrecht at Jan 19, 2011 at 10:21 pm

    On 19 Jan 2011, at 20:56, Bill Janssen wrote:

    Paul Libbrecht wrote:
    So you are only indexing "analyzed" and querying "analyzed". Is that correct?
    Yes, that's correct. I fall back to StandardAnalyzer if no
    language-specific analyzer is available.
    Wouldn't it be better to prefer precise matches (e.g. a field analyzed
    with StandardAnalyzer) but also allow matches that are stemmed?
    StandardAnalyzer isn't quite precise, is it? StandardFilter makes some
    English-centric alterations to things.
    From here:
    http://lucene.apache.org/java/2_9_1/api/core/org/apache/lucene/analysis/standard/StandardTokenizer.html

    I can only conclude that it handles the variety of characters correctly but does no stemming.
    The default constructor of StandardAnalyzer comes with a bunch of stop words, but they are easy to disable.


    I think it's quite precise, and certainly a lot more precise than removing the -aux from chevaux!
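
    A quick way to see this (a throwaway snippet, not from the thread) is to
    print what StandardAnalyzer emits for an English possessive and a French
    plural:

        import java.io.StringReader;

        import org.apache.lucene.analysis.Analyzer;
        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.standard.StandardAnalyzer;
        import org.apache.lucene.analysis.tokenattributes.TermAttribute;
        import org.apache.lucene.util.Version;

        public class StandardAnalyzerDemo {
            public static void main(String[] args) throws Exception {
                Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_29);
                TokenStream ts = analyzer.tokenStream("f", new StringReader("John's chevaux"));
                TermAttribute term = ts.addAttribute(TermAttribute.class);
                while (ts.incrementToken()) {
                    System.out.println(term.term());
                }
                // Prints "john" and "chevaux": the possessive 's is stripped
                // (English-centric), but chevaux is left unstemmed.
            }
        }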
    Perhaps the approach you suggest would be slightly better, but I'd have
    to see numbers on that from some reasonable corpus to be convinced it
    would be worth it.
    I am not sure I have those numbers.
    I made several changes of this sort, and the precision and recall measures improved, in particular in the presence of language-indication failures, which happened to be very common in our authoring environment.

    paul
  • Trejkaz at Jan 19, 2011 at 11:29 pm

    On Thu, Jan 20, 2011 at 9:08 AM, Paul Libbrecht wrote:
    Wouldn't it be better to prefer precise matches (e.g. a field analyzed
    with StandardAnalyzer) but also allow matches that are stemmed?
    StandardAnalyzer isn't quite precise, is it? StandardFilter makes some
    English-centric alterations to things.
    From here:
    http://lucene.apache.org/java/2_9_1/api/core/org/apache/lucene/analysis/standard/StandardTokenizer.html

    I can only conclude that it handles the variety of characters correctly but does no stemming.
    Doesn't StandardAnalyzer also run this?
    http://lucene.apache.org/java/2_9_1/api/core/org/apache/lucene/analysis/standard/StandardFilter.html

    This thing definitely performs English-specific filtering for "'s" and "'S".

    Daniel

  • Bill Janssen at Jan 19, 2011 at 11:30 pm

    Paul Libbrecht wrote:

    I made several changes of this sort, and the precision and recall
    measures improved, in particular in the presence of language-indication
    failures, which happened to be very common in our authoring environment.
    There are two kinds of failures: no language, or wrong language.

    For no language, I fall back to StandardAnalyzer, so I should have
    results similar to yours. For wrong language, well, I'm using off-the-shelf
    trigram-based language guessers, and they're pretty good these days.
    Wouldn't it be better to prefer precise matches (e.g. a field analyzed
    with StandardAnalyzer) but also allow matches that are stemmed?
    Yes, I think it might improve things, but again, by how much? Stemming is
    better than no stemming, in terms of recall. But this approach would also
    improve precision.

    Bill

  • Dominique Bejean at Jan 20, 2011 at 9:46 am
    Hi,

    During a recent Solr project we needed to index documents in many
    languages. The natural solution with Lucene and Solr is to define one
    field per language. Each field is configured in the schema.xml file to
    use language-specific processing (tokenizing, stop words, stemmer,
    ...). This is really not easy to manage if you have many languages,
    and it means that 1) the search interface needs to know which
    language you are searching in, and 2) the search interface can't search
    all languages at the same time.

    So, I decided that the only solution was to index all languages in only
    one field.

    Obviously, each language needs to be processed specifically. For this, I
    developed an analyzer that redirects content to the correct
    tokenizer, filters, and stemmer according to its language.
    This analyzer is also used at query time. If the user specifies the
    language of the query, the query is processed by the appropriate
    tokenizer, filters, and stemmer; otherwise the query is processed by a
    default tokenizer, filters, and stemmer.

    With this solution:

    1. I only need one field (or two if I want both stemmed and unstemmed
    processing)
    2. The user can search all documents regardless of their language
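
    A rough sketch of such a multiplexing analyzer (illustrative only; it
    assumes the caller tells the analyzer the current language before each
    document or query is processed):

        import java.io.Reader;
        import java.util.HashMap;
        import java.util.Map;

        import org.apache.lucene.analysis.Analyzer;
        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.standard.StandardAnalyzer;
        import org.apache.lucene.util.Version;

        // Routes analysis to a per-language delegate so that all languages can
        // share a single field. Not thread-safe as written; real code would
        // scope the current language per request.
        public class MultiplexingAnalyzer extends Analyzer {
            private final Map<String, Analyzer> delegates = new HashMap<String, Analyzer>();
            private final Analyzer fallback = new StandardAnalyzer(Version.LUCENE_29);
            private String language; // e.g. "en", "fr"; set per document/query

            public void setLanguage(String language) {
                this.language = language;
            }

            public void addDelegate(String language, Analyzer analyzer) {
                delegates.put(language, analyzer);
            }

            @Override
            public TokenStream tokenStream(String fieldName, Reader reader) {
                Analyzer delegate = delegates.get(language);
                return (delegate != null ? delegate : fallback).tokenStream(fieldName, reader);
            }
        }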

    I hope this helps.

    Dominique
    www.zoonix.fr
    www.crawl-anywhere.com



  • Bill Janssen at Jan 20, 2011 at 5:31 pm

    Dominique Bejean wrote:

    Hi,

    During a recent Solr project we needed to index documents in many
    languages. The natural solution with Lucene and Solr is to define one
    field per language. Each field is configured in the schema.xml file
    to use language-specific processing (tokenizing, stop words,
    stemmer, ...). This is really not easy to manage if you have many
    languages, and it means that 1) the search interface needs to know
    which language you are searching in, and 2) the search interface can't
    search all languages at the same time.

    So, I decided that the only solution was to index all languages in
    only one field.

    Obviously, each language needs to be processed specifically. For this,
    I developed an analyzer that redirects content to the correct
    tokenizer, filters, and stemmer according to its language. This
    analyzer is also used at query time. If the user specifies the
    language of the query, the query is processed by the appropriate
    tokenizer, filters, and stemmer; otherwise the query is processed by a
    default tokenizer, filters, and stemmer.
    I'm not sure how much this helps. My query processing is the same as
    yours, but I only index the document with a single analyzer, based on
    the language determination. With your approach, multiple analyses are
    all mixed together in a single field, so I'd expect a lower precision
    score, due to words that accidentally stem to the same root in multiple
    different languages.

    Bill
  • Paul Libbrecht at Jan 20, 2011 at 9:56 pm
    Isn't this approach somewhat bad for term frequency?

    Words that appear in several languages would be a lot more frequent (hence less significant).

    I still prefer the split-field method with proper query expansion.
    This way, term frequency is evaluated on the corpus of one language.

    Dominique, in your case, at least on the web, you have:
    - the user's preferred language (if defined in a profile)
    - the list of languages the browser says it accepts
    And that can easily be limited to around 8, so you cover any language the user expects to search.
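
    A rough sketch of that expansion (the "content_" + language field naming
    and the analyzerFor() lookup are hypothetical):

        import org.apache.lucene.analysis.Analyzer;
        import org.apache.lucene.analysis.standard.StandardAnalyzer;
        import org.apache.lucene.queryParser.QueryParser;
        import org.apache.lucene.search.BooleanClause;
        import org.apache.lucene.search.BooleanQuery;
        import org.apache.lucene.search.Query;
        import org.apache.lucene.util.Version;

        public class LanguageQueryExpansion {
            // Expands one user query across per-language fields
            // (content_en, content_fr, ...), each parsed with its own analyzer,
            // so term statistics stay per-language.
            public static Query expand(String userInput, Iterable<String> languages)
                    throws Exception {
                BooleanQuery expanded = new BooleanQuery();
                for (String lang : languages) {
                    QueryParser parser =
                        new QueryParser(Version.LUCENE_29, "content_" + lang, analyzerFor(lang));
                    expanded.add(parser.parse(userInput), BooleanClause.Occur.SHOULD);
                }
                return expanded;
            }

            private static Analyzer analyzerFor(String lang) {
                // Stand-in: a real implementation would return a language-specific
                // analyzer here, as sketched earlier in the thread.
                return new StandardAnalyzer(Version.LUCENE_29);
            }
        }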

    paul


  • Vinaya Kumar Thimmappa at Jan 19, 2011 at 7:00 am
    I think we should be using Lucene with the Snowball JARs, which means one
    index for all languages (of course, index size is always a concern).
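
    For instance (a minimal sketch; SnowballAnalyzer lives in Lucene's
    contrib/snowball JAR, and constructor signatures vary across versions):

        import org.apache.lucene.analysis.Analyzer;
        import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
        import org.apache.lucene.util.Version;

        public class SnowballAnalyzers {
            // "English", "German", "French", ... are the stemmer names
            // shipped with the Snowball contrib JAR.
            public static Analyzer english() {
                return new SnowballAnalyzer(Version.LUCENE_29, "English");
            }

            public static Analyzer german() {
                return new SnowballAnalyzer(Version.LUCENE_29, "German");
            }
        }

    Note that each document still has to be routed to the right stemmer, which
    is what the rest of this thread is about.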

    Hope this helps.
    -vinaya
  • Paul Libbrecht at Jan 19, 2011 at 7:44 am
    But for this, you need a skillfully designed:
    - set of fields
    - multiplexing analyzer
    - query expansion
    In one of my projects we do not split languages by fields, and it's a pain... I keep running into issues in one direction or the other:
    - the "die" example that Otis mentioned is a good one: a stop word in German, an essential verb in English
    - I recently had issues with the word Fourier (as in Fourier series): in English it stays fourier, in French it becomes fouri. So if the resource is contributed in French, the indexed value is fouri and English searchers won't find it; if the resource is contributed in English, French searchers won't find it.
    So my last lesson: always have a whitespace-tokenized, lowercased, unstemmed field at hand as well, and prefer it over the others in your query expansion.
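
    A minimal sketch of that fallback field's analyzer (the class name is
    hypothetical):

        import java.io.Reader;

        import org.apache.lucene.analysis.Analyzer;
        import org.apache.lucene.analysis.LowerCaseFilter;
        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.WhitespaceTokenizer;

        // Lowercases whitespace-separated tokens and does nothing else: no
        // stemming and no stop words, so "fourier" stays "fourier" in every
        // language.
        public class LowercaseWhitespaceAnalyzer extends Analyzer {
            @Override
            public TokenStream tokenStream(String fieldName, Reader reader) {
                return new LowerCaseFilter(new WhitespaceTokenizer(reader));
            }
        }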

    A wiki page should probably be made.

    paul


  • Shai Erera at Jan 19, 2011 at 8:24 am
    If you index documents, each in a different language, but all of a document's
    fields are in the same language, then what you can do is the following:

    Create separate indexes per language
    -------------------------------------------------------
    This will work and is not too hard to set up. It requires some maintenance
    code (e.g. directing a search request to the relevant index), but nothing too
    complicated. The advantage of this approach is that you don't risk
    running into issues like searching for "die" when the language is German, yet
    finding English documents indexed with that word. So your searches
    are "language safe". A disadvantage is that if you ever need to do a
    cross-language operation, like searching two languages at once, you have to
    federate the search, which is less convenient. Also, maintenance becomes a
    slight pain, because you need to e.g. optimize multiple indexes and make sure
    they don't all try to optimize at once, causing a sudden burst of IO.
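
    A sketch of that routing code (the one-directory-per-language layout and
    names are illustrative):

        import java.io.File;
        import java.util.HashMap;
        import java.util.Map;

        import org.apache.lucene.search.IndexSearcher;
        import org.apache.lucene.store.Directory;
        import org.apache.lucene.store.FSDirectory;

        public class PerLanguageIndexes {
            private final Map<String, IndexSearcher> searchers =
                new HashMap<String, IndexSearcher>();

            // Opens one read-only searcher per language directory,
            // e.g. indexes/de, indexes/en, indexes/fr.
            public PerLanguageIndexes(File baseDir, String[] languages) throws Exception {
                for (String lang : languages) {
                    Directory dir = FSDirectory.open(new File(baseDir, lang));
                    searchers.put(lang, new IndexSearcher(dir, true));
                }
            }

            // Directs a search request to the index of the requested language.
            public IndexSearcher searcherFor(String lang) {
                return searchers.get(lang);
            }
        }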

    Create one index
    -------------------------
    Here, you'd use the IndexWriter.addDocument(doc, analyzer) method and pass the
    proper Analyzer for the doc's language. That way, all your documents are
    located in the same index, so administration is really simple, and they
    don't step on each other's toes - each document is analyzed exactly as it
    should be. You might get into weird situations like the "die" example (fetching
    a document in the incorrect language), but that's easily solvable by indexing
    a "language" field for each document and using it as a Filter during the
    search. You can cache that Filter so that its posting list isn't traversed
    for every query, but only once.
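
    A sketch of both halves of this approach (the field name and language codes
    are illustrative):

        import org.apache.lucene.analysis.Analyzer;
        import org.apache.lucene.document.Document;
        import org.apache.lucene.document.Field;
        import org.apache.lucene.index.IndexWriter;
        import org.apache.lucene.index.Term;
        import org.apache.lucene.search.CachingWrapperFilter;
        import org.apache.lucene.search.Filter;
        import org.apache.lucene.search.IndexSearcher;
        import org.apache.lucene.search.Query;
        import org.apache.lucene.search.QueryWrapperFilter;
        import org.apache.lucene.search.TermQuery;

        public class SingleIndexPerLanguage {
            // Index side: tag each document with its language and analyze it
            // with the analyzer matching that language.
            static void index(IndexWriter writer, Document doc, String lang,
                              Analyzer analyzer) throws Exception {
                doc.add(new Field("language", lang, Field.Store.NO,
                                  Field.Index.NOT_ANALYZED));
                writer.addDocument(doc, analyzer);
            }

            // Search side: restrict results to one language with a filter whose
            // bits are computed once and then cached.
            static final Filter GERMAN_ONLY = new CachingWrapperFilter(
                new QueryWrapperFilter(new TermQuery(new Term("language", "de"))));

            static void search(IndexSearcher searcher, Query query) throws Exception {
                searcher.search(query, GERMAN_ONLY, 10);
            }
        }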

    We use the second approach and we're required to support 32 languages. While
    in most deployments the number never exceeds 3-4 languages, I know of some
    that handle > 10. If you're careful enough, it just works.

    Hope this helps.

    Shai
  • Luca Rondanini at Jan 19, 2011 at 5:59 pm
    Why not just use the StandardAnalyzer? It works pretty well even with
    Asian languages!


  • Paul Libbrecht at Jan 19, 2011 at 6:17 pm
    Because it does not find "junks" when you search for "junk",
    or chevaux when you search for cheval.

    paul


