FAQ
Hi,

Right now I'm using Lucene with a basic Whitespace Anayzer but I'm having problems with stemming. Does anyone have a recommendation for other text analyzers that handle stemming and also keep capitalization, stop words, and punctuation?

Thanks,
Larry


Larry A. Hendrix, Graduate Student
Computer Science Department
University of Wisconsin-Madison
1300 University Ave Rm 6749
Madison, WI 53711
Office: (608) 263-7624
lhendrix@cs.wisc.edu
Grambling State University Alum

Search Discussions

  • Christopher Condit at May 19, 2010 at 12:38 am
    Hi Larry-
    Right now I'm using Lucene with a basic Whitespace Anayzer but I'm having
    problems with stemming. Does anyone have a recommendation for other
    text analyzers that handle stemming and also keep capitalization, stop words,
    and punctuation?
    Have you tried the SnowballFilter? You could make your own analyzer combining a WhitespaceFilter and a SnowballFilter that should have the desired effect..
    See: http://lucene.apache.org/java/3_0_1/api/contrib-snowball/org/apache/lucene/analysis/snowball/SnowballFilter.html

    Good luck,
    -Chris

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Erick Erickson at May 19, 2010 at 12:47 am
    You can construct your own analyzer by creating
    it from a pre-existing Tokenizer
    (e.g. WhiteSpaceTokenizer) and any number
    of TokenfFilters (e.g. TokenFilter). You can
    string any number of TokenFilters together
    to get many different effects.

    But I have to ask, why you want to keep capitalization?
    and punctuation? Do you really want to fail to match
    text indexed with "Erickson, Erick" with the query
    "erick erickson"? That's often a source of frustration
    instead of goodness.

    HTH
    Erick
    On Tue, May 18, 2010 at 2:05 PM, Larry Hendrix wrote:

    Hi,

    Right now I'm using Lucene with a basic Whitespace Anayzer but I'm having
    problems with stemming. Does anyone have a recommendation for other text
    analyzers that handle stemming and also keep capitalization, stop words, and
    punctuation?

    Thanks,
    Larry


    Larry A. Hendrix, Graduate Student
    Computer Science Department
    University of Wisconsin-Madison
    1300 University Ave Rm 6749
    Madison, WI 53711
    Office: (608) 263-7624
    lhendrix@cs.wisc.edu
    Grambling State University Alum
  • Larry Hendrix at May 20, 2010 at 2:03 am
    Thanks for the advice. I want to keep the capitalization because in our application we are mining specific contact and company names from news articles. About 99% of the time if we match a contact or company and it's capitalized we avoid false matches.

    --Larry
    On May 18, 2010, at 7:46 PM, Erick Erickson wrote:

    You can construct your own analyzer by creating
    it from a pre-existing Tokenizer
    (e.g. WhiteSpaceTokenizer) and any number
    of TokenfFilters (e.g. TokenFilter). You can
    string any number of TokenFilters together
    to get many different effects.

    But I have to ask, why you want to keep capitalization?
    and punctuation? Do you really want to fail to match
    text indexed with "Erickson, Erick" with the query
    "erick erickson"? That's often a source of frustration
    instead of goodness.

    HTH
    Erick
    On Tue, May 18, 2010 at 2:05 PM, Larry Hendrix wrote:

    Hi,

    Right now I'm using Lucene with a basic Whitespace Anayzer but I'm having
    problems with stemming. Does anyone have a recommendation for other text
    analyzers that handle stemming and also keep capitalization, stop words, and
    punctuation?

    Thanks,
    Larry


    Larry A. Hendrix, Graduate Student
    Computer Science Department
    University of Wisconsin-Madison
    1300 University Ave Rm 6749
    Madison, WI 53711
    Office: (608) 263-7624
    lhendrix@cs.wisc.edu
    Grambling State University Alum
    Larry A. Hendrix, Graduate Student
    Computer Science Department
    University of Wisconsin-Madison
    1300 University Ave Rm 6749
    Madison, WI 53711
    Office: (608) 263-7624
    lhendrix@cs.wisc.edu
    Grambling State University Alum

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupjava-user @
categorieslucene
postedMay 18, '10 at 6:05p
activeMay 20, '10 at 2:03a
posts4
users3
websitelucene.apache.org

People

Translate

site design / logo © 2022 Grokbase