I just want to see if it's safe to use two different analyzers for the
following situation:

I have an index in which I want to preserve case, so that I can do
case-sensitive searches with my WhitespaceAnalyzer. However, I also want to
do case-insensitive searches.

What I did was create a custom Analyzer that chains a LowerCaseFilter after
a WhitespaceTokenizer, and use that analyzer when I want case-insensitive
searching, and the default WhitespaceAnalyzer when I want case-sensitive
searching.

Is this the right approach, or should I change something?

Thanks,
Max
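[Editor's note: the case-insensitive chain Max describes (whitespace tokenization followed by lowercasing) boils down to the following. This is a library-free sketch of what Lucene's WhitespaceTokenizer plus LowerCaseFilter would produce; the class and method names below are illustrative, not Lucene's actual stream-based API.]

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Illustrative model of the two analysis chains discussed in the thread.
// Lucene's real WhitespaceTokenizer/LowerCaseFilter are stream-based;
// here we just produce the resulting token lists.
public class AnalyzerSketch {

    // Case-sensitive chain: whitespace tokenization only.
    public static List<String> caseSensitive(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // Case-insensitive chain: whitespace tokenization, then lowercasing.
    public static List<String> caseInsensitive(String text) {
        List<String> out = new ArrayList<>();
        for (String tok : caseSensitive(text)) {
            out.add(tok.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(caseSensitive("New York"));   // [New, York]
        System.out.println(caseInsensitive("New York")); // [new, york]
    }
}
```

The catch, as the replies point out, is that the index only contains whichever token form was produced at index time, so switching analyzers at query time alone is not enough.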


  • Shai Erera at Aug 12, 2009 at 3:31 am
    You should also make sure the data is indexed twice: once with the original
    case and once without. It's like putting a TokenFilter after WhitespaceTokenizer
    which returns two tokens, the lowercased and the original, both in the same
    position (set posIncr to 0).
    On Wed, Aug 12, 2009 at 6:20 AM, Max Lynch wrote:

    > I just want to see if it's safe to use two different analyzers for the
    > following situation:
    >
    > I have an index that I want to preserve case with so I can do case-sensitive
    > searches with my WhitespaceAnalyzer. However, I also want to do case
    > insensitive searches.
    >
    > What I did was create a custom Analyzer that creates a LowerCaseFilter with
    > a WhitespaceTokenizer, and use that analyzer to search when I want case
    > insensitive searching, and use the default WhitespaceAnalyzer when I want
    > case sensitive.
    >
    > Is this the right approach, or should I change something?
    >
    > Thanks,
    > Max
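[Editor's note: Shai's suggestion can be sketched as a filter that emits each original token and, at the same position (position increment 0), its lowercased form. The real thing would subclass Lucene's TokenFilter; the code below is a library-free model of the resulting token/posIncr stream, with illustrative names.]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Model of a token-stream entry: the term text plus its position increment.
public class DualCaseFilterSketch {

    public static final class Token {
        public final String term;
        public final int posIncr;
        public Token(String term, int posIncr) { this.term = term; this.posIncr = posIncr; }
        @Override public String toString() { return term + "/" + posIncr; }
    }

    // For each whitespace-separated token, emit the original (posIncr 1)
    // and, if it differs, the lowercased form at the same position (posIncr 0).
    public static List<Token> analyze(String text) {
        List<Token> out = new ArrayList<>();
        for (String tok : text.trim().split("\\s+")) {
            out.add(new Token(tok, 1));
            String lower = tok.toLowerCase(Locale.ROOT);
            if (!lower.equals(tok)) {
                out.add(new Token(lower, 0));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Foo bar")); // [Foo/1, foo/0, bar/1]
    }
}
```

With both forms indexed at the same position, a mixed-case query matches only the original term, while a lowercased query matches every document regardless of case.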
  • Max Lynch at Dec 30, 2009 at 5:55 pm

    > > I just want to see if it's safe to use two different analyzers for the
    > > following situation:
    > >
    > > I have an index that I want to preserve case with so I can do case-sensitive
    > > searches with my WhitespaceAnalyzer. However, I also want to do case
    > > insensitive searches.
    >
    > you should also make sure the data is indexed twice, once w/ the original
    > case and once w/o. It's like putting a TokenFilter after WhitespaceTokenizer
    > which returns two tokens - lowercased and the original, both in the same
    > position (set posIncr to 0).
    I finally got around to really needing this, and I'm just a little confused
    by the implementation. Should I physically use two different indexes (one
    with StandardAnalyzer, one with WhitespaceAnalyzer?), two separate fields (I
    don't think that's possible?), or could you explain your idea a little
    more? Should I implement my own WhitespaceTokenizer with the TokenFilter?

    Thanks.
  • Erick Erickson at Dec 30, 2009 at 10:21 pm
    See PerFieldAnalyzerWrapper for an easy way to implement two fields
    in the same document processed with different analyzers. So basically
    you're copying the input to two fields that handle things slightly
    differently.

    As far as re-implementing stuff, no real re-implementing is necessary;
    just create your Analyzers from pre-existing parts, it's much simpler
    than it sounds. Just derive a class from Analyzer and override
    tokenStream (or, possibly, reusableTokenStream). Then you have to
    send your input to both fields (see above).

    SynonymAnalyzer in Lucene in Action has an example, and I'm sure
    if you look in the mail archives you'll find other examples.

    Alternatively, if one of the "regular" analyzers works for you *except*
    for lower-casing, just use that one for your mixed-case field and
    lower-case your input and send it to your lower-case field.

    Be careful to do the same steps when querying <G>.

    Also, TeeSinkTokenFilter might give you some joy, but I confess I haven't
    looked at it very thoroughly.

    HTH
    Erick
    On Wed, Dec 30, 2009 at 12:55 PM, Max Lynch wrote:

    > > > I just want to see if it's safe to use two different analyzers for the
    > > > following situation:
    > > >
    > > > I have an index that I want to preserve case with so I can do case-sensitive
    > > > searches with my WhitespaceAnalyzer. However, I also want to do case
    > > > insensitive searches.
    > >
    > > you should also make sure the data is indexed twice, once w/ the original
    > > case and once w/o. It's like putting a TokenFilter after WhitespaceTokenizer
    > > which returns two tokens - lowercased and the original, both in the same
    > > position (set posIncr to 0).
    >
    > I finally got around to really needing this, and I'm just a little confused
    > by the implementation. Should I physically use two different indexes (one
    > with StandardAnalyzer, one with WhitespaceAnalyzer?), two separate fields (I
    > don't think that's possible?), or could you explain your idea a little
    > more? Should I implement my own WhitespaceTokenizer with the TokenFilter?
    >
    > Thanks.
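[Editor's note: the per-field idea Erick describes can be modeled as a dispatch from field name to analysis chain; Lucene's PerFieldAnalyzerWrapper does essentially this, wrapping a default analyzer plus per-field overrides. The field names and helper names below are illustrative, not from the thread or from Lucene.]

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Library-free model of PerFieldAnalyzerWrapper: choose an analysis
// function based on the field name, falling back to a default.
public class PerFieldSketch {

    static final Function<String, List<String>> WHITESPACE =
        text -> Arrays.asList(text.trim().split("\\s+"));

    static final Function<String, List<String>> LOWERCASING =
        text -> {
            String[] toks = text.trim().split("\\s+");
            for (int i = 0; i < toks.length; i++) toks[i] = toks[i].toLowerCase(Locale.ROOT);
            return Arrays.asList(toks);
        };

    // Hypothetical field names: "exact" keeps case, "lower" folds it.
    static final Map<String, Function<String, List<String>>> PER_FIELD =
        Map.of("lower", LOWERCASING);

    public static List<String> analyze(String field, String text) {
        return PER_FIELD.getOrDefault(field, WHITESPACE).apply(text);
    }

    public static void main(String[] args) {
        System.out.println(analyze("exact", "New York")); // [New, York]
        System.out.println(analyze("lower", "New York")); // [new, york]
    }
}
```

At index time you add the same text to both fields; at query time you pick the field (and therefore the analysis) that matches the kind of search you want.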
  • Max Lynch at Dec 30, 2009 at 11:36 pm

    > Alternatively, if one of the "regular" analyzers works for you *except*
    > for lower-casing, just use that one for your mixed-case field and
    > lower-case your input and send it to your lower-case field.
    >
    > Be careful to do the same steps when querying <G>.
    Thanks Erick, I didn't think about this. It seems the simplest solution
    for now.

    -max
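[Editor's note: the approach Max settled on, lowercasing the text himself into a second field and applying the same transformation to queries against that field, can be sketched as follows. The field names and the substring-based "match" are illustrative stand-ins for real Lucene indexing and querying.]

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Sketch of indexing the same text into two fields: one mixed-case,
// one pre-lowercased. The key point is that the query must be
// transformed the same way as the field it targets.
public class TwoFieldSketch {

    public static Map<String, String> buildDocument(String text) {
        Map<String, String> doc = new HashMap<>();
        doc.put("contents", text);                              // case-sensitive field
        doc.put("contents_lc", text.toLowerCase(Locale.ROOT));  // case-insensitive field
        return doc;
    }

    // Apply the same lowercasing to the query that was applied at index time.
    public static boolean matches(Map<String, String> doc, String field, String query) {
        String q = field.equals("contents_lc") ? query.toLowerCase(Locale.ROOT) : query;
        return doc.get(field).contains(q);
    }

    public static void main(String[] args) {
        Map<String, String> doc = buildDocument("Quick Fox");
        System.out.println(matches(doc, "contents", "quick"));     // false
        System.out.println(matches(doc, "contents_lc", "Quick"));  // true
    }
}
```

Forgetting the query-time step is the classic pitfall here, which is what Erick's "be careful to do the same steps when querying" warns about.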

Discussion Overview
group: java-user @ lucene.apache.org
categories: lucene
posted: Aug 12, '09 at 3:20a
active: Dec 30, '09 at 11:36p
posts: 5
users: 3
