FAQ
Wildcard queries are case sensitive, while other queries depend on the
analyzer used for the field searched. The standard analyzer lowercases, so
lowercased terms are indexed. Thus your "SPINAL CORD" query is lowercased
and matches the indexed terms "spinal" and "cord". However, since prefixes
should not be stemmed they are not run through an analyzer and are hence
case sensitive. Your index contains no terms starting with "SPI" or "COR",
since all terms were lowercased when indexed.
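The mismatch described above can be sketched outside Lucene. The following is an illustrative simulation only (the class and method names are invented, not the Lucene API): the "index" holds the lowercased terms a lowercasing analyzer would have produced, ordinary query terms are lowercased to match, and a prefix is compared verbatim.

```java
import java.util.List;

// Illustrative sketch, not the Lucene API: the "index" holds only the
// lowercased terms that a lowercasing analyzer would have produced.
class CaseDemo {
    static final List<String> indexedTerms = List.of("spinal", "cord");

    // Ordinary query terms pass through the analyzer, which lowercases
    // them, so they line up with what was indexed.
    static boolean termMatches(String queryTerm) {
        return indexedTerms.contains(queryTerm.toLowerCase());
    }

    // Prefix (wildcard) terms skip analysis, so case is compared as-is.
    static boolean prefixMatches(String prefix) {
        return indexedTerms.stream().anyMatch(t -> t.startsWith(prefix));
    }

    public static void main(String[] args) {
        System.out.println(termMatches("SPINAL"));  // true
        System.out.println(prefixMatches("SPI"));   // false: index is lowercased
        System.out.println(prefixMatches("spi"));   // true
    }
}
```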

This question is frequent enough that we should probably fix it. Perhaps a
method should be added to Analyzer:
public boolean isLowercased(String fieldName);
When this is true, the query parser could lowercase prefix and range query
terms. Fellow Lucene developers, what do you think of that?

Doug
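A minimal sketch of how the proposed flag might be used by the query parser. All names here (LowercasingAnalyzer, PrefixAwareParser, normalizePrefix) are hypothetical and not part of any released Lucene API:

```java
// Hedged sketch of the proposal above; class and method names are
// hypothetical, not part of any released Lucene API.
class LowercasingAnalyzer {
    // Reports whether terms in this field were lowercased at index time.
    public boolean isLowercased(String fieldName) {
        return true; // a StandardAnalyzer-like analyzer lowercases everything
    }
}

class PrefixAwareParser {
    private final LowercasingAnalyzer analyzer;

    PrefixAwareParser(LowercasingAnalyzer analyzer) {
        this.analyzer = analyzer;
    }

    // Prefix and range terms bypass analysis, so lowercase them here
    // whenever the analyzer says the indexed terms were lowercased.
    String normalizePrefix(String fieldName, String prefix) {
        return analyzer.isLowercased(fieldName) ? prefix.toLowerCase() : prefix;
    }

    public static void main(String[] args) {
        PrefixAwareParser p = new PrefixAwareParser(new LowercasingAnalyzer());
        System.out.println(p.normalizePrefix("body", "SPI*")); // spi*
    }
}
```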
-----Original Message-----
From: Aruna Raghavan
Sent: Monday, January 21, 2002 2:05 PM
To: Lucene Users List
Subject: Case Sensitivity


Hi All,
I have noticed that I cannot search using capital letters for some reason.
If I try a search on "SPINAL CORD" with a query like SPI* AND COR*, I get no
results back. If I use lowercase (spi* AND cor*), however, I get the results
back. I am using a standard analyzer. Does anyone know why?
Thanks!

  • Brian Goetz at Jan 21, 2002 at 11:21 pm

    > Wildcard queries are case sensitive, while other queries depend on the
    > analyzer used for the field searched. The standard analyzer lowercases, so
    > lowercased terms are indexed. Thus your "SPINAL CORD" query is lowercased
    > and matches the indexed terms "spinal" and "cord". However, since prefixes
    > should not be stemmed they are not run through an analyzer and are hence
    > case sensitive. Your index contains no terms starting with "SPI" or "COR",
    > since all terms were lowercased when indexed.
    >
    > This question is frequent enough that we should probably fix it. Perhaps a
    > method should be added to Analyzer:
    >     public boolean isLowercased(String fieldName);
    > When this is true, the query parser could lowercase prefix and range query
    > terms. Fellow Lucene developers, what do you think of that?

    Something should be done, but I'm not sure this is the best way to do this.
    Perhaps extend Analyzer to work in two modes: "tokenization-only" and
    "tokenization + term normalization".



  • Michal Plechawski at Jan 22, 2002 at 8:13 am
    Hi,

    I have never written to the list before, but I am in fact doing some
    development using Lucene.

    I think that Brian's idea is more flexible and extensible. In my
    application I need three or more kinds of analyzers: for counting tf-idf
    statistics, for indexing (computing more, e.g. summaries), for document
    classification (computing a document-to-class assignment stored outside
    the index), and for some minor things.

    My experience shows that in complex Lucene applications there is a
    substantial need for many different Analyzers or, better, many faces of
    the same Analyzer at the same time. Something should be done here.

    Another question: why did you put document deletion into IndexReader? I
    guess the main reason was the implementation, but from the API point of
    view it is horrible. I have an 'Index' abstraction in my code with both
    add and remove operations; switching between IndexReader and IndexWriter
    is not something I like, and I am now forced to add a cache for
    performance. I think one of the reasons is inconsistent document-id
    support: delete assumes that documents can be uniquely identified, while
    IndexWriter has nothing of the kind. Adding ids to documents would be
    very helpful for us developers, but it may be very hard to implement.

    Last thing: did you ever think about adding transactions to Lucene?
    Maybe very simple exclusive-write transactions, e.g. reads are neither
    transacted nor isolated, writes are exclusive (I guess they are in 1.2;
    I use 1.0), and one may commit or roll back all changes made during the
    last session. Would it be hard?

    With all these issues addressed, Lucene would be mature enough to be
    used as an indexing engine in mission-critical applications.

    Regards,
    Michal



    ----- Original Message -----
    From: "Brian Goetz" <brian@quiotix.com>
    To: "Lucene Users List" <lucene-user@jakarta.apache.org>
    Sent: Tuesday, January 22, 2002 12:12 AM
    Subject: Re: Case Sensitivity

    > > Wildcard queries are case sensitive, while other queries depend on
    > > the analyzer used for the field searched. The standard analyzer
    > > lowercases, so lowercased terms are indexed. Thus your "SPINAL CORD"
    > > query is lowercased and matches the indexed terms "spinal" and
    > > "cord". However, since prefixes should not be stemmed they are not
    > > run through an analyzer and are hence case sensitive. Your index
    > > contains no terms starting with "SPI" or "COR", since all terms were
    > > lowercased when indexed.
    > >
    > > This question is frequent enough that we should probably fix it.
    > > Perhaps a method should be added to Analyzer:
    > >     public boolean isLowercased(String fieldName);
    > > When this is true, the query parser could lowercase prefix and range
    > > query terms. Fellow Lucene developers, what do you think of that?
    >
    > Something should be done, but I'm not sure this is the best way to do
    > this. Perhaps extend Analyzer to work in two modes: "tokenization-only"
    > and "tokenization + term normalization".



  • Doug Cutting at Jan 24, 2002 at 7:36 pm

    > From: Michal Plechawski
    >
    > I think that Brian's idea is more flexible and extensible. In my
    > application I need three or more kinds of analyzers: for counting
    > tf-idf statistics, for indexing (computing more, e.g. summaries), for
    > document classification (computing a document-to-class assignment
    > stored outside the index), and for some minor things.
    > My experience shows that in complex Lucene applications there is a
    > substantial need for many different Analyzers or, better, many faces
    > of the same Analyzer at the same time. Something should be done here.

    Currently it is easy to use different analyzers for different purposes,
    no? I'm not sure how Brian's proposal (bi-modal analyzers: tokenize only
    & tokenize+normalize) addresses your needs.

    > Another question: why did you put document deletion into IndexReader?
    > I guess the main reason was the implementation, but from the API point
    > of view it is horrible.

    Yes, sorry. I wonder if it would have been better to instead call
    IndexWriter IndexAdder or something, to make clear that it can only add
    documents. Perhaps someday this can be fixed.

    > Last thing: did you ever think about adding transactions to Lucene?
    > Maybe very simple exclusive-write transactions, e.g. reads are neither
    > transacted nor isolated, writes are exclusive (I guess they are in
    > 1.2; I use 1.0), and one may commit or roll back all changes made
    > during the last session. Would it be hard?

    That is in fact what is done in 1.2.

    Doug

  • Michal Plechawski at Jan 25, 2002 at 9:52 am

    > Currently it is easy to use different analyzers for different
    > purposes, no? I'm not sure how Brian's proposal (bi-modal analyzers:
    > tokenize only & tokenize+normalize) addresses your needs.

    OK, maybe I made my point a bit misleadingly. Brian's proposal as I see
    it was to _group_ two tokenizers that differ in a single thing. The
    query parser would use TWO analyzers: one for things that need
    normalization and another for things that do not. It is extremely
    important that these two analyzers be compatible (i.e. differ only in
    normalization), especially for applications juggling many types of
    analyzers (e.g. multilingual ones). It must not happen that the
    normalized analyzer is English while the unnormalized one is German, for
    example; the Lucene API should support this pairing, perhaps with an
    Analyzers class that has two parts, normalized() and unnormalized().
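A rough sketch of that pairing idea. Everything here is hypothetical API, but it shows how building both faces from one shared configuration rules out mismatched pairs such as a normalized English analyzer alongside an unnormalized German one:

```java
import java.util.Locale;

// Hypothetical sketch of a paired-analyzer holder: both faces come from
// one shared configuration, so they can never disagree on language.
class Analyzers {
    private final Locale locale; // shared configuration for both faces

    Analyzers(String languageTag) {
        this.locale = Locale.forLanguageTag(languageTag);
    }

    // Normalizing face: language-aware case folding.
    String normalized(String term) {
        return term.toLowerCase(locale);
    }

    // Non-normalizing face: same configuration, no case folding.
    String unnormalized(String term) {
        return term;
    }

    public static void main(String[] args) {
        Analyzers en = new Analyzers("en");
        System.out.println(en.normalized("SPINAL"));   // spinal
        System.out.println(en.unnormalized("SPINAL")); // SPINAL
    }
}
```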
    > Yes, sorry. I wonder if it would have been better to instead call
    > IndexWriter IndexAdder or something, to make clear that it can only
    > add documents. Perhaps someday this can be fixed.

    I agree it would be better to call it IndexAdder. I guess it is a major
    architectural change to add the ability to:
    1) identify a document with a numeric unique id,
    2) check that this id is unique, and
    3) delete the document with a given id by calling an IndexWriter method.
    OK, we can live without this, but document uniqueness and identification
    would be very helpful for "mission-critical" applications of Lucene,
    where document repetitions are unacceptable and the index changes quite
    often.
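The three points can be sketched with a small in-memory facade. This is illustrative only, not a Lucene class; it is the kind of single 'Index' abstraction with add and delete that the thread keeps asking for:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative in-memory facade, not a Lucene class: one object that
// both adds and deletes, keyed by a numeric unique document id.
class UniqueIndex {
    private final Map<Long, String> docs = new LinkedHashMap<>();

    // (1) each document carries a numeric unique id;
    // (2) adding rejects a duplicate id outright.
    void add(long id, String content) {
        if (docs.putIfAbsent(id, content) != null) {
            throw new IllegalArgumentException("duplicate document id: " + id);
        }
    }

    // (3) deletion by id, on the same object that adds.
    boolean delete(long id) {
        return docs.remove(id) != null;
    }

    int size() {
        return docs.size();
    }

    public static void main(String[] args) {
        UniqueIndex index = new UniqueIndex();
        index.add(1L, "spinal cord");
        index.delete(1L);
        System.out.println(index.size()); // 0
    }
}
```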
    > That is in fact what is done in 1.2.

    Thanks, I didn't know.

    Regards,
    Michal


  • Brian Goetz at Jan 25, 2002 at 6:11 pm

    > OK, maybe I made my point a bit misleadingly. Brian's proposal as I
    > see it was to _group_ two tokenizers that differ in a single thing.

    I don't think that's what I was proposing... I was recognizing that
    sometimes the analysis process is a composite one, and I was advocating
    that the composition be made explicit, since there are some cases where
    only tokenization, but not normalization, is desired.


  • Michal Plechawski at Jan 25, 2002 at 6:21 pm
    That's one way to make the analysis composition explicit. Another is to
    make the Analyzer interface return two token streams, normalizedStream()
    and unnormalizedStream(). I won't argue which is better.

    BTW: many thanks for adding the possibility of analyzing different
    fields with different token streams in 1.2; that was a real problem in
    1.0.

    Michal

    ----- Original Message -----
    From: "Brian Goetz" <brian@quiotix.com>
    To: "Lucene Users List" <lucene-user@jakarta.apache.org>
    Sent: Friday, January 25, 2002 11:24 AM
    Subject: Re: Case Sensitivity - and more

    > > OK, maybe I made my point a bit misleadingly. Brian's proposal as I
    > > see it was to _group_ two tokenizers that differ in a single thing.
    >
    > I don't think that's what I was proposing... I was recognizing that
    > sometimes the analysis process is a composite one, and I was advocating
    > that the composition be made explicit, since there are some cases where
    > only tokenization, but not normalization, is desired.
  • Michal Plechawski at Jan 25, 2002 at 10:41 am

    > > ...one may commit or roll back all changes made during the last
    > > session. Would it be hard?
    >
    > That is in fact what is done in 1.2.

    OK, I have looked over the Lucene 1.2 API and I see no rollback()
    method. Does it work like this: if I do not close the IndexWriter, the
    changes are not saved? How does it behave when there are so many changes
    that it must merge segments along the way? Is that rollbackable?
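One way to picture the session semantics being asked about. This is illustrative only, not the Lucene 1.2 implementation: writes are buffered and become visible only on commit, while dropping the session discards them.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative semantics only -- not the Lucene 1.2 implementation:
// buffered writes become visible only on commit; discarding the
// session acts as a rollback.
class WriteSession {
    private final List<String> committed;
    private final List<String> pending = new ArrayList<>();

    WriteSession(List<String> committed) {
        this.committed = committed;
    }

    void add(String doc) {
        pending.add(doc);
    }

    void commit() {      // like closing the writer
        committed.addAll(pending);
        pending.clear();
    }

    void rollback() {    // like abandoning the session
        pending.clear();
    }

    public static void main(String[] args) {
        List<String> index = new ArrayList<>();
        WriteSession session = new WriteSession(index);
        session.add("doc1");
        session.rollback();        // discarded: index stays empty
        session.add("doc2");
        session.commit();          // now visible
        System.out.println(index); // [doc2]
    }
}
```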

    Regards,
    Michal



Discussion Overview
group: java-user @ lucene.apache.org
category: lucene
posted: Jan 21, 2002 at 11:01pm
active: Jan 25, 2002 at 6:21pm
posts: 8
users: 3