I am trying to apply a German*Filter and/or German*Analyzer to my index. My index contains wine names such as "Petite Arvine" (I know, that's French ;) ). Whenever one of the German*Filter or German*Analyzer classes is in play, the terms for "Petite Arvine" are reduced to
"Petit"
and
"Arvin"
Why so? Where have the e's gone?

Thanks for your help
Clemens

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


  • Erick Erickson at Apr 12, 2011 at 2:41 pm
    I don't quite get why the German analyzer would do this, but
    all the Filters I see are stemmers and I expect they'd
    reduce the words as you indicate.

    What version of Lucene are you using?

    Best
    Erick
  • Robert Muir at Apr 12, 2011 at 3:03 pm

    On Tue, Apr 12, 2011 at 8:46 AM, Clemens Wyss wrote:
    Why so? Where have the e's gone?
    The e is being stemmed because it is a German suffix. All of the German
    stemming algorithms remove a final -e, as do all of the French stemming
    algorithms.

    So I don't understand your problem.
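To illustrate the point above: a stemmer strips suffixes by rule and does not know which language a word comes from. The following is a deliberately simplified plain-Java sketch, not the algorithm used by Lucene's German stemmer. The class name `FinalEStripper` and the minimum-length threshold are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a single stemming rule (drop a final "e").
// This is NOT Lucene's German stemmer; it only shows the observed effect.
class FinalEStripper {

    // Assumed threshold: leave very short words untouched.
    private static final int MIN_LENGTH = 4;

    // Removes one trailing "e" from sufficiently long terms.
    static String stem(String term) {
        if (term.length() >= MIN_LENGTH && term.endsWith("e")) {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }

    // Splits on whitespace and applies the rule to every token.
    static List<String> stemAll(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            out.add(stem(token));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(stemAll("Petite Arvine")); // prints [Petit, Arvin]
    }
}
```

The rule fires on "Petite" and "Arvine" exactly as it would on a German word ending in -e, which is why French names are affected too.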

  • Clemens Wyss at Apr 13, 2011 at 7:51 am
    What I really want to do is ignore German stop words such as "der", "die", "das", "ein", ...
  • Simon Willnauer at Apr 13, 2011 at 8:51 am

    On Wed, Apr 13, 2011 at 9:51 AM, Clemens Wyss wrote:
    What I really want to do is ignore German stop words such as "der", "die", "das", "ein", ...
    GermanAnalyzer takes a stemExclusionSet. If you put those terms into
    this set, the stemmer will not touch them. This should be in 3.1, I
    think.

    public GermanAnalyzer(Version matchVersion, Set<?> stopwords, Set<?> stemExclusionSet)

    simon
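A sketch of how an exclusion set interacts with a stemmer, written in plain Java rather than with Lucene classes. It assumes that entries in the set are compared against whole terms, not against suffixes. `ExclusionAwareStemmer` and its toy final-e rule are hypothetical and are not the library's implementation.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy stemmer that skips any term listed in an exclusion set.
// Hypothetical illustration only; not Lucene code.
class ExclusionAwareStemmer {

    private final Set<String> exclusions;

    ExclusionAwareStemmer(Set<String> exclusions) {
        this.exclusions = exclusions;
    }

    // Returns the term unchanged if the WHOLE term is excluded,
    // otherwise applies the toy rule of dropping a final "e".
    String stem(String term) {
        if (exclusions.contains(term)) {
            return term;
        }
        if (term.length() > 3 && term.endsWith("e")) {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }

    public static void main(String[] args) {
        Set<String> wholeTerms = new HashSet<>(Arrays.asList("Petite", "Arvine"));
        Set<String> suffixOnly = new HashSet<>(Arrays.asList("e"));

        // Whole terms in the set are protected from stemming.
        System.out.println(new ExclusionAwareStemmer(wholeTerms).stem("Petite")); // prints Petite
        // An entry "e" never equals a whole term, so it protects nothing.
        System.out.println(new ExclusionAwareStemmer(suffixOnly).stem("Petite")); // prints Petit
    }
}
```

Under that assumption, the set has to list every term that should survive unstemmed, so it suits a short list of known names better than an open-ended vocabulary.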
  • Simon Willnauer at Apr 13, 2011 at 9:17 am

    On Wed, Apr 13, 2011 at 11:03 AM, Clemens Wyss wrote:
    I tried:
    Set<String> stemsToBeIgnored = new HashSet<String>(Arrays.asList("e"));
    GermanAnalyzer ga = new GermanAnalyzer(Version.LUCENE_31, GermanAnalyzer.getDefaultStopSet(), stemsToBeIgnored);
    But the e's are still "removed"...

    Try Arrays.asList("der", "die", "das", "ein") instead.

    Or have I misunderstood you?

    simon
  • Robert Muir at Apr 13, 2011 at 10:19 am
    If you only want to ignore German stop words, you don't need to use the
    German analyzer with German stemming. You can just use
    StandardAnalyzer with your own stop-word set!
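What that suggestion amounts to can be modelled in plain Java: split the text, lowercase each token (StandardAnalyzer lowercases its terms), drop the stop words, and apply no stemming. This is a sketch of the behaviour only. It does not use Lucene classes, its whitespace tokenizer is far simpler than the one in StandardAnalyzer, and `StopWordsOnly` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java model of "stop words only, no stemming".
// Hypothetical illustration; not Lucene's StandardAnalyzer.
class StopWordsOnly {

    // Custom stop-word set, using the examples from the thread.
    static final Set<String> GERMAN_STOP_WORDS =
            new HashSet<>(Arrays.asList("der", "die", "das", "ein"));

    // Tokenizes on whitespace, lowercases, and removes stop words.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (!lower.isEmpty() && !GERMAN_STOP_WORDS.contains(lower)) {
                terms.add(lower);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Die Petite Arvine")); // prints [petite, arvine]
    }
}
```

The final e's survive because no stemming step is present, while the article "Die" is dropped by the stop-word set.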
  • Clemens Wyss at Apr 13, 2011 at 11:34 am
    This is what I was looking for, thanks
  • Clemens Wyss at Apr 15, 2011 at 6:49 am
    Does the StandardAnalyzer lowercase its terms?
  • Simon Willnauer at Apr 15, 2011 at 6:56 am

    On Fri, Apr 15, 2011 at 8:48 AM, Clemens Wyss wrote:
    Does the StandardAnalyzer lowercase its terms?
    yes!

    simon
  • Clemens Wyss at Apr 18, 2011 at 7:09 am
    What is the best way to avoid the lowercasing while still being able to exclude stop words?
  • Erick Erickson at Apr 18, 2011 at 11:36 am
    You can easily string together your own tokenizer and any number of filters
    to create an analyzer that does exactly what you need. "Lucene in Action"
    shows an example of creating your own analyzer by assembling
    the standard parts.

    Best
    Erick
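The idea of chaining a tokenizer with any number of filters can be sketched in plain Java. Nothing below is Lucene API: `TokenFilterChain`, `whitespaceTokenizer` and `stopFilter` are hypothetical stand-ins for the corresponding building blocks. The chain simply leaves out a lowercasing step, so terms keep their original case, while stop words are still matched case-insensitively.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;
import java.util.function.UnaryOperator;

// Hypothetical model of an analyzer built from a tokenizer plus filters.
class TokenFilterChain {

    private final Function<String, List<String>> tokenizer;
    private final List<UnaryOperator<List<String>>> filters;

    TokenFilterChain(Function<String, List<String>> tokenizer,
                     List<UnaryOperator<List<String>>> filters) {
        this.tokenizer = tokenizer;
        this.filters = filters;
    }

    // Runs the tokenizer, then each filter in order.
    List<String> analyze(String text) {
        List<String> tokens = tokenizer.apply(text);
        for (UnaryOperator<List<String>> filter : filters) {
            tokens = filter.apply(tokens);
        }
        return tokens;
    }

    // Very simple tokenizer: split on runs of whitespace.
    static List<String> whitespaceTokenizer(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(token);
            }
        }
        return out;
    }

    // Drops stop words, comparing case-insensitively but emitting
    // surviving tokens in their original case.
    static UnaryOperator<List<String>> stopFilter(Set<String> lowercaseStopWords) {
        return tokens -> {
            List<String> kept = new ArrayList<>();
            for (String token : tokens) {
                if (!lowercaseStopWords.contains(token.toLowerCase(Locale.ROOT))) {
                    kept.add(token);
                }
            }
            return kept;
        };
    }

    // Tokenizer + stop filter, with no lowercasing and no stemming.
    static TokenFilterChain caseKeepingGerman() {
        Set<String> stop = new HashSet<>(Arrays.asList("der", "die", "das", "ein"));
        return new TokenFilterChain(TokenFilterChain::whitespaceTokenizer,
                Collections.singletonList(stopFilter(stop)));
    }

    public static void main(String[] args) {
        System.out.println(caseKeepingGerman().analyze("Die Petite Arvine")); // prints [Petite, Arvine]
    }
}
```

Because each step is a separate filter, adding or removing lowercasing or stemming is a matter of changing the list passed to the constructor.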

Discussion Overview
group: java-user @
categories: lucene
posted: Apr 12, '11 at 1:43p
active: Apr 18, '11 at 11:36a
posts: 12
users: 4
website: lucene.apache.org
