Hi,


Can anyone confirm whether my understanding of the stemming process in Lucene is correct? From my testing with the SnowballAnalyzer, if I pass the word "flashing", it is trimmed to the root word "flash", and that root word ("flash") is what gets searched, not the original word "flashing".

Is there an API in Lucene, or a third-party API, that does the reverse: if I pass the word "flash", it searches for "flashing", "flashed", "flashes", etc.?


Regards,
Jay Malaluan


  • Mathieu at Dec 16, 2008 at 12:33 pm
    You stem both the search query and the text while indexing, so only
    "flash" is indexed when "flashing" is read.
    If you don't want to pollute your index with truncated words, you can use
    a second index, just as is done for spell checking:
    http://blog.garambrogne.net/index.php?post/2008/03/07/A-lexicon-approach-for-Lucene-index

    M.

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
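
The second-index idea above can be sketched in plain Java. This is only an illustration under stated assumptions: `StemLexicon` is a hypothetical class, not a Lucene API, and the stemmer is passed in as a function so that any real stemmer could be plugged in. While indexing, each token is stemmed for the main index and its original form is remembered under that stem.

```java
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.function.UnaryOperator;

// Hypothetical helper: remembers which surface forms produced each stem.
class StemLexicon {
    private final Map<String, SortedSet<String>> formsByStem = new TreeMap<>();
    private final UnaryOperator<String> stemmer;

    StemLexicon(UnaryOperator<String> stemmer) {
        this.stemmer = stemmer;
    }

    // Call once per token while indexing; the main index would store the returned stem.
    String add(String token) {
        String stem = stemmer.apply(token);
        formsByStem.computeIfAbsent(stem, k -> new TreeSet<>()).add(token);
        return stem;
    }

    // Stems the query word the same way, then lists every form seen for that stem.
    SortedSet<String> formsFor(String queryWord) {
        return formsByStem.getOrDefault(stemmer.apply(queryWord), new TreeSet<>());
    }
}
```

With a stand-in stemmer such as `w -> w.replaceAll("(ing|ed|es)$", "")`, adding "flashing", "flashed", "flashes" and "flash" and then calling `formsFor("flash")` returns all four forms, while the main index still holds only "flash".
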
  • Erick Erickson at Dec 16, 2008 at 2:14 pm
    Why do you want to do this? The reason I ask is that you're
    making each clause very complex.....

    For a single term, it's not very complex, but for something like
    ((A AND B) OR (C AND D)) NOT X

    expanding A, B, C, D and X to possibly many terms is...er...ugly.

    You could think about ngrams, although I confess I've only seen
    this on the lists, haven't worked with it myself.

    If your goal is to be able to search exact match words (i.e. you
    need to find "flash" when exactly "flash" was indexed, not "flashing")
    there are better strategies....

    So a bit more explanation of the problem could perhaps generate more
    helpful responses.

    Best
    Erick
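
To make the complexity concern concrete, here is a minimal sketch that builds a plain query string in the same shape as Erick's example, with each term replaced by an OR group of its variants. `ExpandedQuery` and the variants map are hypothetical names used only for illustration; nothing here calls Lucene.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical helper: shows how term expansion inflates a boolean query.
class ExpandedQuery {
    // Turns one term into a parenthesised OR group of its variants.
    static String orGroup(String term, Map<String, List<String>> variants) {
        List<String> forms = variants.getOrDefault(term, Collections.singletonList(term));
        return forms.size() == 1 ? forms.get(0) : "(" + String.join(" OR ", forms) + ")";
    }

    // Builds ((A AND B) OR (C AND D)) NOT X with every term expanded.
    static String build(String a, String b, String c, String d, String x,
                        Map<String, List<String>> variants) {
        return String.format("((%s AND %s) OR (%s AND %s)) NOT %s",
                orGroup(a, variants), orGroup(b, variants),
                orGroup(c, variants), orGroup(d, variants), orGroup(x, variants));
    }
}
```

Expanding only "flash" into four forms already turns the first clause into `((flash OR flashing OR flashed OR flashes) AND camera)`; expanding all five terms multiplies the number of term clauses accordingly.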

  • Jay Joel Malaluan at Dec 16, 2008 at 11:42 pm
    Hi Erick,

    Well, a client asked whether it's possible to expand such simple words, and whether Lucene has an API for this logic. Everything I have read says that stemming in Lucene works the other way around: for example, "flashing" is trimmed to the root word "flash" when searched.


    Regards,
    Jay Malaluan

  • Erick Erickson at Dec 17, 2008 at 12:35 am
    I'd ask the client why stemming wouldn't work <G>. I've spent faaaar too
    much time in my life doing useless things "because the client asked".
    Really, ask for the use cases where that is really required and that
    stemming wouldn't cover.

    But you're right that Lucene doesn't have such a facility or API. How
    could it? Stemming is easy (yeah, right). But at least it's algorithmic.
    Going from one word to all the possible variants (or even worse, all
    those variants the client thinks should be there) usually requires some
    kind of "expansion map". Particularly when you want to go from, say,
    Steve to Stephen.

    Best
    Erick
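
An "expansion map" of the kind Erick mentions could look roughly like the sketch below. It assumes the groups are curated by hand; `ExpansionMap` is a hypothetical class, not something Lucene ships.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical hand-maintained map from a word to every variant that should match it.
class ExpansionMap {
    private final Map<String, Set<String>> groups = new HashMap<>();

    // Registers a group of words that should all match one another.
    void addGroup(String... words) {
        Set<String> group = new LinkedHashSet<>(Arrays.asList(words));
        for (String word : words) {
            groups.put(word.toLowerCase(Locale.ROOT), group);
        }
    }

    // Returns the word's whole group, or just the word itself if it is unknown.
    Set<String> expand(String word) {
        Set<String> group = groups.get(word.toLowerCase(Locale.ROOT));
        return group != null ? group : new LinkedHashSet<>(Arrays.asList(word));
    }
}
```

After `addGroup("Steve", "Stephen", "Steven")`, a lookup of "steve" yields all three names. No suffix rule could derive that, which is why the map has to be supplied from outside.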

  • Jokin Cuadrado at Dec 17, 2008 at 10:01 am
    Well, you could use the QueryParser's wildcard searches (flash*), but
    they don't use stemming logic; they just return all the words that
    start with that string.

    You must be aware that the QueryParser rewrites the query with every
    term that matches the wildcard, so if your prefix is short it's easy to
    exceed the maximum clause count and get a BooleanQuery.TooManyClauses
    exception.

    More info at:
    http://lucene.apache.org/java/2_4_0/queryparsersyntax.html#Wildcard%20Searches
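
The rewrite behaviour described above can be imitated with a sorted set of terms standing in for the index's term dictionary. This is a sketch only: `PrefixExpansion` is a hypothetical class, and it throws an `IllegalStateException` where Lucene would throw its own exception once the clause limit (1024 by default) is exceeded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;

// Hypothetical stand-in for how a prefix query is rewritten into one clause per term.
class PrefixExpansion {
    // Collects every indexed term starting with the prefix, failing past maxClauses.
    static List<String> expand(SortedSet<String> indexedTerms, String prefix, int maxClauses) {
        List<String> matches = new ArrayList<>();
        // tailSet starts at the first term >= prefix; matching terms are contiguous from there.
        for (String term : indexedTerms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            if (matches.size() == maxClauses) {
                throw new IllegalStateException(
                        "prefix '" + prefix + "' matches more than " + maxClauses + " terms");
            }
            matches.add(term);
        }
        return matches;
    }
}
```

Against the terms flag, flash, flashed, flashes, flashing, flashlight and flask, the prefix "flash" returns five terms, including "flashlight", which is not a form of "flash" at all. The shorter prefix "fla" with a limit of 3 fails, which mirrors the short-prefix problem.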

  • Chris Hostetter at Dec 20, 2008 at 7:12 pm
    There are two fundamental approaches (that I know of) to stemming:
    reduction and expansion.

    Reduction can be either algorithmic or morphological/dictionary based.
    Expansion essentially has to be morphologically based.

    The SnowballAnalyzer uses the Snowball stemming algorithm, which is a
    reduction approach to stemming. If you want an expansion-based approach,
    you have to have a dictionary. Lucene doesn't provide one of these, but
    that doesn't mean you can't use one if you find one. It's been a while
    since I looked, but the only ones I've ever seen were only commercially
    available.

    http://en.wikipedia.org/wiki/Stemming




    -Hoss
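
The distinction above can be illustrated with a toy example. Everything in it is an assumption made for illustration: the suffix stripper is far cruder than Snowball, and the three-entry dictionary is a stand-in for a real morphological dictionary.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical illustration of algorithmic reduction, dictionary reduction and expansion.
class StemmingApproaches {
    // Algorithmic reduction: strip a few common English suffixes by rule.
    static String reduceByRule(String word) {
        for (String suffix : new String[] {"ing", "ed", "es", "s"}) {
            if (word.endsWith(suffix) && word.length() > suffix.length() + 2) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // Dictionary reduction: look the word up, fall back to the word itself.
    static String reduceByDictionary(String word, Map<String, String> lemmaOf) {
        return lemmaOf.getOrDefault(word, word);
    }

    // Expansion: invert the same dictionary to list every known form of a lemma.
    static SortedSet<String> expand(String lemma, Map<String, String> lemmaOf) {
        SortedSet<String> forms = new TreeSet<>();
        forms.add(lemma);
        for (Map.Entry<String, String> entry : lemmaOf.entrySet()) {
            if (entry.getValue().equals(lemma)) {
                forms.add(entry.getKey());
            }
        }
        return forms;
    }

    public static void main(String[] args) {
        Map<String, String> lemmaOf = new HashMap<>();
        lemmaOf.put("ran", "run");
        lemmaOf.put("running", "run");
        lemmaOf.put("runs", "run");
        System.out.println(reduceByRule("flashing"));           // flash
        System.out.println(reduceByRule("ran"));                // ran (no rule applies)
        System.out.println(reduceByDictionary("ran", lemmaOf)); // run
        System.out.println(expand("run", lemmaOf));             // [ran, run, running, runs]
    }
}
```

The rule-based reducer needs no data but cannot relate "ran" to "run". Both the dictionary reducer and the expander depend entirely on the dictionary they are given, which is the piece Lucene does not supply.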


  • Otis Gospodnetic at Dec 22, 2008 at 3:18 am
    If Hoss is referring to synonym expansion, allow me to point out that the freely downloadable code from Lucene in Action (first edition) includes code for that, if you'd like to have a look, OP.

    Otis
    --
    Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



Discussion Overview
group: java-user
categories: lucene
posted: Dec 16, '08 at 12:19p
active: Dec 22, '08 at 3:18a
posts: 8
users: 6
website: lucene.apache.org
