FAQ
Hi,

I am not a French speaker, but here are some questions regarding
French analyzer:

Is there any analyzer that can do this? Analyze accentuated letters to
non accentuated corresponding letters (é,è,ê,ë -> e), so that

search "fenêtre" (=window) found all docs with "fenêtre" or "fenetre"
and
search "fenetre" found the same result, all docs with "fenêtre" or "fenetre"

Current analyzers, Snowball-French and FrenchAnalyzer don't have this feature.

--
Chris Lu
-------------------------
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Search Discussions

  • Erick Erickson at Jul 30, 2007 at 6:18 pm
    Gosh, I sure hope not, because that would mean that we rolled our
    own for no good reason. We wound up just collapsing
    the input stream by substituting plain old 'e' for all the accented
    variants before indexing and before searching. Be *really* careful
    what character set you're using.

    Actually, we would have still had to roll our own because the
    character mapping was...er...wonky <G>....

    You have to store the data raw for display purposes if you want the
    accents to show though...

    Best
    Erick
    On 7/30/07, Chris Lu wrote:

    Hi,

    I am not a French speaker, but here are some questions regarding
    French analyzer:

    Is there any analyzer that can do this? Analyze accentuated letters to
    non accentuated corresponding letters (é,è,ê,ë -> e), so that

    search "fenêtre" (=window) found all docs with "fenêtre" or "fenetre"
    and
    search "fenetre" found the same result, all docs with "fenêtre" or
    "fenetre"

    Current analyzers, Snowball-French and FrenchAnalyzer don't have this
    feature.

    --
    Chris Lu
    -------------------------
    Instant Scalable Full-Text Search On Any Database/Application
    site: http://www.dbsight.net
    demo: http://search.dbsight.com
    Lucene Database Search in 3 minutes:

    http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Chris Lu at Jul 30, 2007 at 8:37 pm
    Hi, Erick,

    I added ISOLatin1AccentFilter to FrenchAnalyzer following Samir's tip,
    and it works great! And I think it's the right way to go. Problems
    like "You have to store the data raw for display purposes if you want
    the accents to show though" will go away since Analyzer already have
    the original text and analyzed token mechanism built-in. And it's
    pretty easy to do!

    However, is there any special case that you have? Not really knowing
    French, I only tested one word, "fenêtre", and it's analyzed into
    "fenetre".

    --
    Chris Lu
    -------------------------
    Instant Scalable Full-Text Search On Any Database/Application
    site: http://www.dbsight.net
    demo: http://search.dbsight.com
    Lucene Database Search in 3 minutes:
    http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes

    On 7/30/07, Erick Erickson wrote:
    Gosh, I sure hope not, because that would mean that we rolled our
    own for no good reason. We wound up just collapsing
    the input stream by substituting plain old 'e' for all the accented
    variants before indexing and before searching. Be *really* careful
    what character set you're using.

    Actually, we would have still had to roll our own because the
    character mapping was...er...wonky <G>....

    You have to store the data raw for display purposes if you want the
    accents to show though...

    Best
    Erick
    On 7/30/07, Chris Lu wrote:

    Hi,

    I am not a French speaker, but here are some questions regarding
    French analyzer:

    Is there any analyzer that can do this? Analyze accentuated letters to
    non accentuated corresponding letters (é,è,ê,ë -> e), so that

    search "fenêtre" (=window) found all docs with "fenêtre" or "fenetre"
    and
    search "fenetre" found the same result, all docs with "fenêtre" or
    "fenetre"

    Current analyzers, Snowball-French and FrenchAnalyzer don't have this
    feature.

    --
    Chris Lu
    -------------------------
    Instant Scalable Full-Text Search On Any Database/Application
    site: http://www.dbsight.net
    demo: http://search.dbsight.com
    Lucene Database Search in 3 minutes:

    http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Renaud Waldura at Jul 30, 2007 at 9:26 pm
    Being a French speaker, I will mention the following special cases:

    - "plus ça change" -> "plus ca change"
    - "œuf" -> "oeuf"
    - "lætitia" -> "laetitia"

    But I just looked, and it looks like ISOLatin1AccentFilter handles these.
    Better test to be sure...

    --Renaud


    -----Original Message-----
    From: Chris Lu
    Sent: Monday, July 30, 2007 1:36 PM
    To: java-user@lucene.apache.org
    Subject: Re: a question for french analyzer

    Hi, Erick,

    I added ISOLatin1AccentFilter to FrenchAnalyzer following Samir's tip, and
    it works great! And I think it's the right way to go. Problems like "You
    have to store the data raw for display purposes if you want the accents to
    show though" will go away since Analyzer already have the original text and
    analyzed token mechanism built-in. And it's pretty easy to do!

    However, is there any special case that you have? Not really knowing French,
    I only tested one word, "fenêtre", and it's analyzed into "fenetre".

    --
    Chris Lu
    -------------------------
    Instant Scalable Full-Text Search On Any Database/Application
    site: http://www.dbsight.net
    demo: http://search.dbsight.com
    Lucene Database Search in 3 minutes:
    http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_m
    inutes

    On 7/30/07, Erick Erickson wrote:
    Gosh, I sure hope not, because that would mean that we rolled our
    own for no good reason. We wound up just collapsing
    the input stream by substituting plain old 'e' for all the accented
    variants before indexing and before searching. Be *really* careful
    what character set you're using.

    Actually, we would have still had to roll our own because the
    character mapping was...er...wonky <G>....

    You have to store the data raw for display purposes if you want the
    accents to show though...

    Best
    Erick
    On 7/30/07, Chris Lu wrote:

    Hi,

    I am not a French speaker, but here are some questions regarding
    French analyzer:

    Is there any analyzer that can do this? Analyze accentuated letters to
    non accentuated corresponding letters (é,è,ê,ë -> e), so that

    search "fenêtre" (=window) found all docs with "fenêtre" or "fenetre"
    and
    search "fenetre" found the same result, all docs with "fenêtre" or
    "fenetre"

    Current analyzers, Snowball-French and FrenchAnalyzer don't have this
    feature.

    --
    Chris Lu
    -------------------------
    Instant Scalable Full-Text Search On Any Database/Application
    site: http://www.dbsight.net
    demo: http://search.dbsight.com
    Lucene Database Search in 3 minutes:
    http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_m
    inutes
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org




    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Erick Erickson at Jul 31, 2007 at 12:14 am
    <<<However, is there any special case that you have?>>>

    Yes, the character set we use is, as I remember,
    MARC-8. Which I don't think is the ISOLatin....,
    but since I didn't know about that filter when we had our problem,
    I didn't even look. Oh well, smarter/braver/lazier next time <G>...

    Which is why I love this list, I find things like this and look
    smarter next time something similar comes up <G>.

    Thanks
    Erick
    On 7/30/07, Chris Lu wrote:

    Hi, Erick,

    I added ISOLatin1AccentFilter to FrenchAnalyzer following Samir's tip,
    and it works great! And I think it's the right way to go. Problems
    like "You have to store the data raw for display purposes if you want
    the accents to show though" will go away since Analyzer already have
    the original text and analyzed token mechanism built-in. And it's
    pretty easy to do!

    However, is there any special case that you have? Not really knowing
    French, I only tested one word, "fenêtre", and it's analyzed into
    "fenetre".

    --
    Chris Lu
    -------------------------
    Instant Scalable Full-Text Search On Any Database/Application
    site: http://www.dbsight.net
    demo: http://search.dbsight.com
    Lucene Database Search in 3 minutes:

    http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes

    On 7/30/07, Erick Erickson wrote:
    Gosh, I sure hope not, because that would mean that we rolled our
    own for no good reason. We wound up just collapsing
    the input stream by substituting plain old 'e' for all the accented
    variants before indexing and before searching. Be *really* careful
    what character set you're using.

    Actually, we would have still had to roll our own because the
    character mapping was...er...wonky <G>....

    You have to store the data raw for display purposes if you want the
    accents to show though...

    Best
    Erick
    On 7/30/07, Chris Lu wrote:

    Hi,

    I am not a French speaker, but here are some questions regarding
    French analyzer:

    Is there any analyzer that can do this? Analyze accentuated letters to
    non accentuated corresponding letters (é,è,ê,ë -> e), so that

    search "fenêtre" (=window) found all docs with "fenêtre" or "fenetre"
    and
    search "fenetre" found the same result, all docs with "fenêtre" or
    "fenetre"

    Current analyzers, Snowball-French and FrenchAnalyzer don't have this
    feature.

    --
    Chris Lu
    -------------------------
    Instant Scalable Full-Text Search On Any Database/Application
    site: http://www.dbsight.net
    demo: http://search.dbsight.com
    Lucene Database Search in 3 minutes:
    http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Samir Abdou at Jul 30, 2007 at 6:19 pm
    Hi,

    Take a look to the class ISOLatin1AccentFilter ! Add this to your analyzer
    and it should work !

    Hope this will help,
    Samir

    -----Message d'origine-----
    De : Chris Lu
    Envoyé : lundi, 30. juillet 2007 20:06
    À : java-user@lucene.apache.org
    Objet : a question for french analyzer

    Hi,

    I am not a French speaker, but here are some questions regarding
    French analyzer:

    Is there any analyzer that can do this? Analyze accentuated letters to
    non accentuated corresponding letters (é,è,ê,ë -> e), so that

    search "fenêtre" (=window) found all docs with "fenêtre" or "fenetre"
    and
    search "fenetre" found the same result, all docs with "fenêtre" or "fenetre"

    Current analyzers, Snowball-French and FrenchAnalyzer don't have this
    feature.

    --
    Chris Lu
    -------------------------
    Instant Scalable Full-Text Search On Any Database/Application
    site: http://www.dbsight.net
    demo: http://search.dbsight.com
    Lucene Database Search in 3 minutes:
    http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_m
    inutes

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org



    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Chris Lu at Jul 30, 2007 at 8:32 pm
    Hi, Samir,

    Thanks a lot for this tip! It works great!

    --
    Chris Lu
    -------------------------
    Instant Scalable Full-Text Search On Any Database/Application
    site: http://www.dbsight.net
    demo: http://search.dbsight.com
    Lucene Database Search in 3 minutes:
    http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
    On 7/30/07, Samir Abdou wrote:
    Hi,

    Take a look to the class ISOLatin1AccentFilter ! Add this to your analyzer
    and it should work !

    Hope this will help,
    Samir

    -----Message d'origine-----
    De: Chris Lu
    Envoyé: lundi, 30. juillet 2007 20:06
    À: java-user@lucene.apache.org
    Objet: a question for french analyzer

    Hi,

    I am not a French speaker, but here are some questions regarding
    French analyzer:

    Is there any analyzer that can do this? Analyze accentuated letters to
    non accentuated corresponding letters (é,è,ê,ë -> e), so that

    search "fenêtre" (=window) found all docs with "fenêtre" or "fenetre"
    and
    search "fenetre" found the same result, all docs with "fenêtre" or "fenetre"

    Current analyzers, Snowball-French and FrenchAnalyzer don't have this
    feature.

    --
    Chris Lu
    -------------------------
    Instant Scalable Full-Text Search On Any Database/Application
    site: http://www.dbsight.net
    demo: http://search.dbsight.com
    Lucene Database Search in 3 minutes:
    http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_m
    inutes

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org



    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupjava-user @
categorieslucene
postedJul 30, '07 at 6:06p
activeJul 31, '07 at 12:14a
posts7
users4
websitelucene.apache.org

People

Translate

site design / logo © 2022 Grokbase