Hello all,

I am using Lucene 2.4.1 with the Snowball analyzer for my indexing,
and I am facing some issues with stemming.

Raining stemmed to Rain
cats stemmed to cat
but
Harder is not stemmed to hard
Stronger is not stemmed to Strong.

Even the Keyword and Standard analyzers do the same. My opinion is that the
stemming process is meant to get the base word. Here it is not doing so.

Any idea?

Regards
Ganesh




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


  • Matthew Hall at May 8, 2009 at 12:58 pm

    Ganesh wrote:
    > My opinion is Stemming process is to get the base word. Here it is not
    > doing so.

    Unfortunately, this is where your problem lies: stemming doesn't do this.
    It reduces words that are almost lexically equivalent to a similar
    root word, thus cat = cats.

    From Wikipedia: "Stemming is the process for reducing inflected (or
    sometimes derived) words to their stem, base or root form – generally a
    written word form. The stem need not be identical to the morphological
    root of the word; it is usually sufficient that related words map to the
    same stem, even if this stem is not in itself a valid root. The algorithm
    has been a long-standing problem in computer science; the first paper on
    the subject was published in 1968. The process of stemming, often called
    conflation, is useful in search engines for query expansion or indexing
    and other natural language processing problems."

    But the words hard and harder mean different things (in the opinion of
    those who developed the Snowball algorithm), and as such shouldn't be
    stemmed down to a single word.
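
To make the behaviour above concrete, here is a minimal sketch of one rule from the Snowball English (Porter2) stemmer as I understand it: the suffix "er" is deleted only when it lies inside the region called R2. The class and method names (`ErSuffixSketch`, `regionStart`, `stripEr`) are made up for illustration, the code treats "y" as a plain vowel, and it skips every other step of the real stemmer, so it is not a replacement for Lucene's SnowballAnalyzer.

```java
import java.util.Locale;

/**
 * Illustrative fragment only (hypothetical names): computes the R1/R2
 * regions in the style of the Snowball English stemmer and applies a single
 * rule, "delete 'er' if it lies in R2". The real stemmer has many more steps.
 */
class ErSuffixSketch {

    // Simplification: 'y' is always treated as a vowel here.
    private static boolean isVowel(char c) {
        return "aeiouy".indexOf(c) >= 0;
    }

    /**
     * Returns the index just after the first non-vowel that follows a vowel,
     * looking only at characters from position 'from' onward. Returns the
     * word length when no such position exists (the region is empty).
     */
    static int regionStart(String word, int from) {
        for (int i = from + 1; i < word.length(); i++) {
            if (!isVowel(word.charAt(i)) && isVowel(word.charAt(i - 1))) {
                return i + 1;
            }
        }
        return word.length();
    }

    /** Lowercases the input and removes a trailing "er" only if it starts inside R2. */
    static String stripEr(String input) {
        String word = input.toLowerCase(Locale.ROOT);
        int r1 = regionStart(word, 0);   // start of R1
        int r2 = regionStart(word, r1);  // start of R2, computed within R1
        int suffixStart = word.length() - 2;
        if (word.endsWith("er") && suffixStart >= r2) {
            return word.substring(0, suffixStart);
        }
        return word;
    }

    public static void main(String[] args) {
        System.out.println(stripEr("Harder"));    // prints "harder"   (R2 is empty)
        System.out.println(stripEr("Stronger"));  // prints "stronger" (R2 is empty)
        System.out.println(stripEr("airliner"));  // prints "airlin"   (R2 is "er")
    }
}
```

Under this rule "harder" and "stronger" are simply too short for "er" to fall in R2, which is consistent with the output Ganesh reported.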

    Now, I find it arguable whether hard and harder are really not close
    enough to stem to the same root, but to get this effect you will need to
    either change the Snowball algorithm or process your words into a more
    base form before they go into the stemmer, which is a hairy road
    indeed ^^

    Hope this helps.

    Matt

    --
    Matthew Hall
    Software Engineer
    Mouse Genome Informatics
    mhall@informatics.jax.org
    (207) 288-6012



  • Hannu Väisänen at May 11, 2009 at 11:48 am

    On Fri, May 08, 2009 at 08:57:59AM -0400, Matthew Hall wrote:
    > process your
    > words into a more base form before they go into the stemmer

    Malaga (http://home.arcor.de/bjoern-beutel/malaga/) can be used to
    make a program that converts words to a base form.
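
As a rough sketch of the "convert to a base form before stemming" idea discussed above, the simplest variant is a hand-maintained lookup table applied to each token before it reaches the stemmer. Everything here is an assumption for illustration: the class name `BaseFormSketch`, the method `toBaseForm`, and the four table entries are hypothetical, and a tool such as Malaga would derive base forms from a grammar rather than a fixed map.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/**
 * Hypothetical pre-stemming normalizer: maps a few comparative and
 * superlative forms to their base word and leaves all other tokens alone.
 */
class BaseFormSketch {

    // Illustrative entries only; a real table would be much larger.
    private static final Map<String, String> BASE_FORMS = new HashMap<>();
    static {
        BASE_FORMS.put("harder", "hard");
        BASE_FORMS.put("hardest", "hard");
        BASE_FORMS.put("stronger", "strong");
        BASE_FORMS.put("strongest", "strong");
    }

    /** Lowercases the token and returns its base form if the table has one. */
    static String toBaseForm(String token) {
        String lower = token.toLowerCase(Locale.ROOT);
        return BASE_FORMS.getOrDefault(lower, lower);
    }

    public static void main(String[] args) {
        System.out.println(toBaseForm("Harder"));   // prints "hard"
        System.out.println(toBaseForm("Stronger")); // prints "strong"
        System.out.println(toBaseForm("cats"));     // prints "cats" (left for the stemmer)
    }
}
```

In Lucene this kind of lookup would presumably live in a custom TokenFilter placed ahead of the Snowball filter in the analyzer chain, and the same analyzer would have to be used at query time; that wiring is not shown here.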


Discussion Overview
group: java-user
categories: lucene
posted: May 8, '09 at 12:25p
active: May 11, '09 at 11:48a
posts: 3
users: 3
website: lucene.apache.org
