Hi Otis,

On both the indexing side and when creating the query parser, I'm using the
StandardAnalyzer class. It seems like that would be symmetrical with respect
to case sensitivity, so either the analyzer is not related to the problem or
it's a bug... I suspect the former. I'll start looking at the source next. Thanks,

Landon
-----Original Message-----
From: Otis Gospodnetic
Sent: Thursday, May 09, 2002 11:24 AM
To: Lucene Users List
Subject: Re: QueryParser question - case-sensitivity


Wouldn't that be the Analyzer that you are using?
I don't have the source handy to check it for you, but look for
toLowerCase or some such, and you'll find who's messing with your
queries.
Replace that piece, and you'll keep your upper cases.

Otis


  • Dave Peixotto at May 9, 2002 at 7:37 pm
    Looks like the StandardAnalyzer uses the LowerCaseFilter as one of its
    filters. This is the one that is converting everything to lower case. If
    you replace the StandardAnalyzer with a different Analyzer you should be OK.

    Dave
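
To make the effect Dave describes concrete, here is a minimal sketch in plain Java rather than Lucene itself. The class LowercaseChainSketch and both of its methods are hypothetical stand-ins: one splits on whitespace only, the other adds a lowercasing step in the role that LowerCaseFilter plays in the chain. It assumes the id contains no whitespace, and it does not try to reproduce StandardAnalyzer's real tokenization rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Plain-Java stand-in for an analysis chain; none of these names are Lucene classes. */
final class LowercaseChainSketch {

    /** Splits on whitespace only and leaves the case of each token untouched. */
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    /** Same split, followed by a lowercasing step (the role a lowercase filter plays). */
    static List<String> lowercasedTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : whitespaceTokens(text)) {
            tokens.add(token.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String id = "templatedata/f2container/data/Course1102043194747042";
        System.out.println(whitespaceTokens(id));  // case preserved
        System.out.println(lowercasedTokens(id));  // "Course" becomes "course"
    }
}
```

Only the second chain changes the term, which is why swapping out the lowercasing analyzer for the id field keeps the upper-case C.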
  • Doug Cutting at May 9, 2002 at 7:58 pm
    [I'm resending this from a different account, since my first attempt is
    bogged down somewhere. A second copy will probably show up tomorrow, but in
    the interests of solving this problem sooner, I'm resending it. Sorry for
    the duplication.]

    Define an Analyzer that does not lowercase the id field, e.g., something
    like:

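    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class MyAnalyzer extends Analyzer {
      private Analyzer standard = new StandardAnalyzer();

      public TokenStream tokenStream(String field, final Reader reader) {
        if ("id".equals(field)) {
          // Split on whitespace only, so the id keeps its case.
          return new WhitespaceTokenizer(reader);
        } else {
          return standard.tokenStream(field, reader);
        }
      }
    }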

    Then pass this into QueryParser.

    Doug
    -----Original Message-----
    From: Landon Cox
    Sent: Thursday, May 09, 2002 9:52 AM
    To: dcutting@grandcentral.com
    Subject: QueryParser question - case-sensitivity



    I have a QueryParser/Query question. One of these classes (not sure which) is
    apparently converting my term values into lowercase even though Term's
    values are by default case-sensitive. I've got non-word text, ids, that
    are case-sensitive and stored/indexed that way, but the query parser is not
    respecting my case-sensitive search criterion.

    For example, I create a query string:

    id:"templatedata/f2container/data/Course1102043194747042"

    and pass this to the QueryParser.parse() method. When I dump the Query with
    toString() I get:

    +id:templatedata/f2container/data/course1102043194747042

    Naturally, this query fails as I'm expecting a hit on the id with the
    uppercase C. If I create and index an id in all lower case, then the query
    succeeds. Case-sensitivity is important to maintain for querying this
    element, especially for using it once the hit occurs.

    How do I coerce QueryParser/Query to not 'tolower' my query string? Or is
    there an alternate, more direct method that takes my query string
    with no modification?


  • Landon Cox at May 9, 2002 at 9:22 pm
    OK, this is the solution, and it seems to have worked like a charm. I took
    Doug's fragment as a starting point, but enhanced it to be general purpose.
    Instead of the keyword field name being hardwired into the tokenStream
    method, the derived Analyzer class, in this case DCRAnalyzer, accepts a
    Hashtable of keyword field names. As long as the keyword's value in the hash
    is != null, the code below will work, so you can initialize the keyword's
    value with any object you care about.

    If you create another class derived from Hashtable that wires your app's
    keyword field names into it, an instance of that class can be passed into
    DCRAnalyzer so that all the application-specific keyword knowledge
    remains contained in one app class, while this code stays general.

    For my app, the XML 'id' attribute of every tag falls into this category of
    keyword fields to pass through unscathed. I'm sure I'll add others over
    time, which is why a hash seemed convenient and fast. Anyway, this all tested
    out as expected, and now the analyzer has the 'smarts' needed for
    case-sensitivity on different field names.

    /*
     * DCRAnalyzer.java
     *
     * Created on May 9, 2002, 1:14 PM
     */

    package <<<yourpackagenamegoeshere>>>;

    import java.io.*;
    import java.util.*;

    import org.apache.lucene.analysis.*;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class DCRAnalyzer extends Analyzer
    {
        /** Creates a new instance of DCRAnalyzer */
        public DCRAnalyzer( Hashtable keywordFieldNames )
        {
            m_keywordNames = keywordFieldNames;
        }

        public TokenStream tokenStream( String field, final Reader reader )
        {
            // See if field is a designated keyword name; if so, don't run it
            // through the standard analyzer.
            if ( m_keywordNames.get(field) != null )
            {
                return new WhitespaceTokenizer(reader);
            } else {
                return m_standard.tokenStream(field, reader);
            }
        }

        // Field names whose values must bypass StandardAnalyzer.
        private Hashtable m_keywordNames = new Hashtable();
        private Analyzer m_standard = new StandardAnalyzer();
    }

    Not much code, but it nicely did the trick and you could easily extend it to
    support numerous analyzers mapped to fieldnames, not just these two. Thanks
    for the various bits of advice from Doug, DaveP, and Otis.

    Landon Cox
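
The extension Landon mentions, several analyzers mapped to field names, could look roughly like the sketch below. It is written in plain Java so it runs without Lucene: PerFieldAnalysisSketch, register, analyze, verbatim and lowercaseWords are all hypothetical names, and a Function from text to tokens stands in for an Analyzer. Treat it as an illustration of the dispatch idea only, not as Lucene API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Dispatches to a per-field "analyzer", falling back to a default for unknown fields. */
final class PerFieldAnalysisSketch {

    private final Map<String, Function<String, List<String>>> byField = new HashMap<>();
    private final Function<String, List<String>> fallback;

    PerFieldAnalysisSketch(Function<String, List<String>> fallback) {
        this.fallback = fallback;
    }

    /** Associates a field name with its own analysis function. */
    PerFieldAnalysisSketch register(String field, Function<String, List<String>> analyzer) {
        byField.put(field, analyzer);
        return this;
    }

    /** Picks the function registered for the field, or the fallback if there is none. */
    List<String> analyze(String field, String text) {
        return byField.getOrDefault(field, fallback).apply(text);
    }

    /** Keyword-style handling: the whole value is one token, case untouched. */
    static List<String> verbatim(String text) {
        return Collections.singletonList(text);
    }

    /** Text-style handling: split on whitespace and lowercase each word. */
    static List<String> lowercaseWords(String text) {
        List<String> words = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                words.add(word.toLowerCase(Locale.ROOT));
            }
        }
        return words;
    }

    public static void main(String[] args) {
        PerFieldAnalysisSketch analysis =
            new PerFieldAnalysisSketch(PerFieldAnalysisSketch::lowercaseWords)
                .register("id", PerFieldAnalysisSketch::verbatim);
        System.out.println(analysis.analyze("id", "data/Course1102043194747042"));
        System.out.println(analysis.analyze("body", "Some Body Text"));
    }
}
```

Compared with the Hashtable of keyword names, the map carries the analysis behavior itself, so adding a third kind of field means registering one more entry rather than adding another branch.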


  • Landon Cox at May 9, 2002 at 8:06 pm
    Hi DaveP and Otis,

    After looking further, my take:

    It would be symmetrical except the ID field (term) I'm looking for was
    indexed as a Keyword since it's a file path that I don't want tokenized. I
    think what's happening is that since it's not being tokenized, even though
    I'm using StandardAnalyzer on both indexing and querying, when indexed it's
    not going through the lower case filter of StandardAnalyzer and therefore is
    stored fully respecting case-sensitivity.

    On the flipside, the query doesn't really know the same thing (term names
    mapped to field types - in this case a keyword) and is running all queries
    through StandardAnalyzer without regard to term name and type (as it was
    designed.)

    So, I think you're right - it comes down to the analyzer, but more directly,
    I think it comes down to the fact that the Keyword value is unmolested when
    indexed but the query term value after going through QueryParser.parse() is
    lower-case due to the LowerCaseFilter that StandardAnalyzer uses.

    For a keyword field, the docs say:

    Keyword
    public static final Field Keyword(String name, String value)
    Constructs a String-valued Field that is not tokenized, but is indexed and
    stored. Useful for non-text fields, e.g. date or url.

    If you look at StandardAnalyzer source, the tokenStream method runs it
    through LowerCaseFilter as spec'd. But since a Keyword is not tokenized,
    it's stored/indexed respecting case.

    Does that jibe with your knowledge of the source and behavior of the
    classes?

    It does look like I need to make a query analyzer that's a little more
    "aware" of my field names (and types) for querying purposes...that analyzer
    would match the behavior on the indexing side such that it knows what fields
    are Keywords and therefore whether to pass them through unchanged or not.

    Thanks for the feedback.

    Landon

    PS. Late break: Just read the mail from Doug after writing this analysis.
    Think it confirmed what was going on. Thank you, Doug.
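
The mismatch described in this post can be reproduced with a toy model. The sketch below is plain Java, not Lucene: a Set stands in for the index, and the class KeywordMismatchSketch with its three methods is hypothetical. It assumes, as the post does, that the keyword value is stored verbatim at index time while the query term is lowercased before lookup.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Toy model of the index/query asymmetry; nothing here is Lucene API. */
final class KeywordMismatchSketch {

    private final Set<String> indexedIds = new HashSet<>();

    /** Index side: a keyword-style value is stored as-is, with no analysis. */
    void indexKeyword(String value) {
        indexedIds.add(value);
    }

    /** Query side as described in the thread: the term is lowercased first. */
    boolean searchWithLowercasing(String queryTerm) {
        return indexedIds.contains(queryTerm.toLowerCase(Locale.ROOT));
    }

    /** Query side with the fix: the id term is looked up unchanged. */
    boolean searchVerbatim(String queryTerm) {
        return indexedIds.contains(queryTerm);
    }

    public static void main(String[] args) {
        String id = "templatedata/f2container/data/Course1102043194747042";
        KeywordMismatchSketch index = new KeywordMismatchSketch();
        index.indexKeyword(id);
        System.out.println(index.searchWithLowercasing(id)); // false: "course..." was never indexed
        System.out.println(index.searchVerbatim(id));        // true: same bytes on both sides
    }
}
```

The lookup only succeeds when both sides apply the same transformation to the id, which is the job of a field-aware analyzer on the query side.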

