Hi,

I saw that there are many posts on the mailing list about indexing in multiple languages, so I will try not to post a duplicate question. In my case, I want to index RSS feeds, where one feed contains several items in different languages plus some data common to all the items (date, source, ...). After reading the different posts, I think I will create one document per item, index them all in the same index using a language-specific analyzer each time, and store a lang field for language-specific searches. But I'm wondering how I should handle the common fields. It seems I have two options:

1: Store the common data in each item. What happens if duplicate information is entered? Is it duplicated in the index?

2: Create a separate document for the common data. In this case I will need to link this data to all the underlying items by storing some ids. The issue is that I would need to search the index twice if the search is done only by date, because I would also need to retrieve the items' contents.

Thanks in advance for your help.

Mélanie
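
Option 1 from the question can be modelled in a few lines of plain Java. This is only a sketch of the data layout, not Lucene code: the class name, the helper methods, and the field names (`date`, `source`, `lang`, `content`) are assumptions made for illustration. Each item document carries its own copy of the feed-level values, so a search by date alone can return item contents in a single pass.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of "option 1" (common data copied into every item document).
// Hypothetical names throughout; this does NOT use the Lucene API.
class DenormalizedFeedSketch {

    // Build one flat field map per feed item, copying the feed-level fields into each.
    static List<Map<String, String>> toItemDocuments(Map<String, String> feedFields,
                                                     List<Map<String, String>> items) {
        List<Map<String, String>> docs = new ArrayList<>();
        for (Map<String, String> item : items) {
            Map<String, String> doc = new LinkedHashMap<>(feedFields); // common data first
            doc.putAll(item);                                          // then lang, content, ...
            docs.add(doc);
        }
        return docs;
    }

    // One pass over the documents answers "contents of all items for this date".
    static List<String> contentsForDate(List<Map<String, String>> docs, String date) {
        List<String> contents = new ArrayList<>();
        for (Map<String, String> doc : docs) {
            if (date.equals(doc.get("date"))) {
                contents.add(doc.get("content"));
            }
        }
        return contents;
    }

    // Small helper: fields("k1", "v1", "k2", "v2") builds an ordered field map.
    static Map<String, String> fields(String... keyValues) {
        Map<String, String> map = new LinkedHashMap<>();
        for (int i = 0; i + 1 < keyValues.length; i += 2) {
            map.put(keyValues[i], keyValues[i + 1]);
        }
        return map;
    }

    public static void main(String[] args) {
        Map<String, String> feed = fields("date", "2007-03-22", "source", "example-feed");
        List<Map<String, String>> items = Arrays.asList(
                fields("lang", "fr", "content", "bonjour"),
                fields("lang", "en", "content", "hello"));
        List<Map<String, String>> docs = toItemDocuments(feed, items);
        System.out.println(contentsForDate(docs, "2007-03-22"));
    }
}
```

The trade-off is the one raised in the question: the common values are repeated once per item, in exchange for not needing a second lookup to go from a feed-level match to the item contents.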

  • Aslam bari at Mar 22, 2007 at 6:44 am
    Hi,
    Have a look at my resume attached to this mail. If it suits you, let me know.
    Thanks...


    ----- Original Message ----
    From: Melanie Langlois <Melanie.Langlois@tradingscreen.com>
    To: java-user@lucene.apache.org
    Sent: Thursday, 22 March, 2007 11:33:03 AM
    Subject: indexing rss feeds in multiple languages


  • Aslam bari at Mar 22, 2007 at 6:58 am
    Oops!
    Sorry,
    My last message was sent here by mistake. It was meant for someone else; it was just a silly mistake.

    Sorry, people.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org



  • Doron Cohen at Mar 22, 2007 at 7:01 am
    If the language is also known at search time, PerFieldAnalyzerWrapper seems
    like a nice third option: a single document per feed, with a separate field
    for each language and additional field(s) for the common data; use
    PerFieldAnalyzerWrapper at both indexing and search time, and use a
    FieldSelector at search time to retrieve only the relevant field(s) for
    matched documents.
    (I have never done this myself, though.)
    - Doron
  • Melanie Langlois at Mar 22, 2007 at 7:39 am
    Well, thanks, that sounds like the best option to me. Does anybody use PerFieldAnalyzerWrapper? I'm just curious to know whether there is any impact on performance when using different analyzers.

    Mélanie

  • Antony Bowesman at Mar 22, 2007 at 8:30 am

    Melanie Langlois wrote:
    > Well, thanks, that sounds like the best option to me. Does anybody use
    > PerFieldAnalyzerWrapper? I'm just curious to know whether there is any
    > impact on performance when using different analyzers.

    I've not done any specific comparisons between using a single Analyzer and
    multiple Analyzers with PFAW, but our indexes typically have 20-25 fields, each
    of which can have a different analyzer depending on language or field type,
    although in practice about 8-10 fields may use the non-default analyzer.

    Performance is pretty good in any case and there's not been any noticeable
    degradation when tweaking analyzers.
    Antony
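
The per-field idea discussed in this thread can be illustrated without Lucene. The sketch below is a plain-Java model, not the library's implementation: it mimics the shape of Lucene's PerFieldAnalyzerWrapper (a default analyzer plus `addAnalyzer(fieldName, analyzer)` overrides), but the class name, the toy stop-word analyzers, and the field names `content_en`, `content_fr`, and `source` are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Plain-Java model of per-field analyzer dispatch. Hypothetical names;
// this is NOT Lucene's PerFieldAnalyzerWrapper, only the same idea.
class PerFieldAnalysisSketch {
    private final Function<String, List<String>> defaultAnalyzer;
    private final Map<String, Function<String, List<String>>> perField = new HashMap<>();

    PerFieldAnalysisSketch(Function<String, List<String>> defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    // Register an analyzer that overrides the default for one field.
    void addAnalyzer(String field, Function<String, List<String>> analyzer) {
        perField.put(field, analyzer);
    }

    // Dispatch on the field name; the same call serves indexing and query parsing.
    List<String> analyze(String field, String text) {
        return perField.getOrDefault(field, defaultAnalyzer).apply(text);
    }

    // Toy analyzer: lowercase, split on non-letters, drop the given stop words.
    static Function<String, List<String>> stopWordAnalyzer(String... stopWords) {
        Set<String> stops = new HashSet<>(Arrays.asList(stopWords));
        return text -> {
            List<String> tokens = new ArrayList<>();
            for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
                if (!token.isEmpty() && !stops.contains(token)) {
                    tokens.add(token);
                }
            }
            return tokens;
        };
    }

    public static void main(String[] args) {
        PerFieldAnalysisSketch analyzers = new PerFieldAnalysisSketch(stopWordAnalyzer());
        analyzers.addAnalyzer("content_en", stopWordAnalyzer("the", "of"));
        analyzers.addAnalyzer("content_fr", stopWordAnalyzer("le", "la", "de"));
        System.out.println(analyzers.analyze("content_fr", "Le prix de la action"));
        System.out.println(analyzers.analyze("source", "The Feed"));
    }
}
```

In this model the dispatch is a single map lookup per field, and using the same wrapper for indexing and for queries keeps the tokens on both sides consistent, which is why the language needs to be known at search time.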






Discussion Overview
group: java-user
categories: lucene
posted: Mar 22, '07 at 6:07a
active: Mar 22, '07 at 8:30a
posts: 6
users: 4
website: lucene.apache.org
