FAQ
Hi,

I know that case-insensitive searching is normally done by creating an
all-lower-case version of the documents, and turning the search terms
into lower case whenever this field is searched, but this approach has
its disadvantages.

Let's say, for example, you want to find "Dell" (with a capital "D"),
near "computers" (with or without capitals, i.e. in any case). The
problem is that you would need to use a SpanQuery to find terms near
each other; but if the case-sensitivity required is different for each
term, then they will be in different fields, making SpanQuery
impossible to use.

There might be ways to work around this, but my question is: will
case-insensitivity ever be added to Lucene as a per-Term option? If not,
can anyone tell me where I should start looking in order to make this
change myself?

Thanks!

-JB



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


  • Erick Erickson at Jun 25, 2008 at 1:03 pm
    Well, it depends on what you mean by "per term". There's already
    PerFieldAnalyzerWrapper for each field, but I don't think that's what
    you want.

    How do you expect a per-term analyzer to behave? I'm having a hard
    time thinking of a use case that's general. You could always
    roll your own analyzer that doesn't change case for your particular
    list of words.

    But the problem is your users. In your example, suppose a user
    typed in "dell computers". Would that match "Dell computers"?
    Does your analyzer automatically upper-case some words? If it
    does, that's the same as lower casing them all. If it doesn't,
    how do you explain that to your users?

    All in all, I'm having a tough time imagining how this would work.
    It's easy enough to say "let's assume", but I suspect that
    whatever solution satisfied your example will have its own problems
    that are far worse than just lower-casing things.

    Best
    Erick

  • John Byrne at Jun 25, 2008 at 2:00 pm
    What I had in mind was actually very simple: when you create a Term
    (programmatically) you normally set the text and the field. I would also
    like to be able to set the case sensitivity to true or false for that
    specific Term object.

    I imagined (and maybe I am oversimplifying it!) that somewhere in the
    API there must be a string comparison using 'String.equals()' that
    determines if a document contains the term or not, and that this use of
    'equals()' has permanently locked Lucene into case-sensitive searching.
    The values being compared could first be lower-cased (or
    'equalsIgnoreCase()' could be used) depending on the value of a boolean
    flag in the Term object.

    If that option was there, there would be no need to ever change the case
    in the analyzer - you'd be able to control case-sensitivity regardless
    of the field used.

    Of course, I realize that there is currently no way to take advantage of
    such a feature in the QueryParser. It could only be done
    programmatically. But I don't think that's a reason not to do it, since
    the API already has features that aren't implemented in the QueryParser
    (like SpanQuery). In a perfect world, the parser would support all the
    features, but for the time being anyone who wants to take advantage of
    the newer features has to find an alternative anyway.

    The problem that it would solve for me is, as I mentioned, that I could
    mix case-sensitive Terms with case-insensitive Terms when using
    SpanQuery. I currently have no way to do that.

    Regards,
    -John

  • Erick Erickson at Jun 25, 2008 at 4:02 pm
    I suppose something like that might work, but I still think that presenting
    a user with matches that are sometimes case-sensitive and sometimes
    aren't would be... er... fraught.


    If you can programmatically restrict your query construction and you're
    *sure* this is what your users expect, you can make it work. Just index
    each term twice, once lowercased and once in its native case, with a
    position increment of 0 between them. Then you can simply construct your
    terms however you want and fire the result at the search. In fact, you
    only need to double-index the terms you want to do case-sensitive
    searches on. This will increase the size of your index less than you
    think...

    Best
    Erick
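
Erick's double-indexing suggestion can be sketched in plain Java without any Lucene dependency. This is only an illustration under stated assumptions: the class DualCaseTokens, its Tok type and the whitespace tokenizer are hypothetical stand-ins, not Lucene classes. In a real index the same effect would presumably come from a custom token filter that sets the position increment of the extra token to 0.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class DualCaseTokens {
    /** Hypothetical token type: term text plus its position increment. */
    public static final class Tok {
        public final String text;
        public final int posInc;

        Tok(String text, int posInc) {
            this.text = text;
            this.posInc = posInc;
        }
    }

    /**
     * Emits every whitespace-separated word in its original case. When the
     * lower-cased form differs, it is emitted as well with increment 0, so
     * both forms occupy the same position.
     */
    public static List<Tok> analyze(String input) {
        List<Tok> out = new ArrayList<>();
        for (String word : input.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // blank input yields no tokens
            }
            out.add(new Tok(word, 1));
            String lower = word.toLowerCase(Locale.ROOT);
            if (!lower.equals(word)) {
                out.add(new Tok(lower, 0)); // stacked on the original token
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (Tok t : analyze("Dell computers")) {
            System.out.println(t.text + " +" + t.posInc);
        }
    }
}
```

Words that are already lower case are stored once, which is why the index grows less than a full second copy would suggest.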
  • Chris Hostetter at Jun 25, 2008 at 9:40 pm
    : I imagined (and maybe I am oversimplifying it!) that somewhere in the API
    : there must be a string comparison using 'String.equals()' that determines if a
    : document contains the term or not, and that this use of 'equals()' has permanently
    : locked Lucene into case-sensitive searching. The values being compared could
    : first be lower-cased (or 'equalsIgnoreCase()' could be used) depending on the
    : value of a boolean flag in the Term object.

    You are oversimplifying it a bit ... string comparisons are done in the
    internals, but not to compare a query's terms to a document's terms ...
    the index is inverted, so there is a single enumeration of all indexed
    terms (regardless of which documents they are in), each of which maintains
    pointers to the docs that contain it. querying involves seeking along that
    enumeration to find the indexed term that corresponds to the query term.

    the enumeration is in lexicographical order, so "Dell" is nowhere near
    "dell" in the enumeration. even if we added a boolean property to Terms
    indicating that it's a case-insensitive Term, the "seeking" along that
    enumeration would be ... less optimal ... than it can be now.
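
The point about the sorted enumeration can be illustrated with a toy term dictionary. This is a sketch under assumptions, not Lucene's actual data structure: a java.util.TreeMap stands in for the term enumeration, and the class TermDictionaryDemo and its sample terms are made up for the example. With plain string ordering, upper-case ASCII letters sort before lower-case ones, so "Dell" and "dell" end up far apart.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class TermDictionaryDemo {
    /** Toy inverted index: sorted term -> ids of the documents containing it. */
    public static TreeMap<String, List<Integer>> build() {
        TreeMap<String, List<Integer>> terms = new TreeMap<>();
        terms.put("Dell", List.of(1));
        terms.put("Delta", List.of(2));
        terms.put("computers", List.of(1, 3));
        terms.put("dell", List.of(3));
        terms.put("zebra", List.of(2));
        return terms;
    }

    /** Exact match: one seek into the sorted dictionary. */
    public static List<Integer> exact(TreeMap<String, List<Integer>> terms, String term) {
        return terms.getOrDefault(term, List.of());
    }

    /** Case-insensitive match with no normalized copy: every term is visited. */
    public static List<String> scanIgnoreCase(TreeMap<String, List<Integer>> terms, String term) {
        List<String> hits = new ArrayList<>();
        for (String indexed : terms.keySet()) {
            if (indexed.equalsIgnoreCase(term)) {
                hits.add(indexed);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeMap<String, List<Integer>> terms = build();
        System.out.println(terms.keySet());               // sorted order of the terms
        System.out.println(exact(terms, "Dell"));         // documents for the exact term
        System.out.println(scanIgnoreCase(terms, "dell")); // all case variants found by scanning
    }
}
```

The exact lookup touches one entry, while the case-insensitive lookup has to walk the whole dictionary (or seek once per possible case variant), which is the cost a per-Term flag would hide.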

    : > > Let's say, for example, you want to find "Dell" (with a capital "D"), near
    : > > "computers" (with or without capitals, i.e. in any case). The problem is that
    : > > you would need to use a SpanQuery to find terms near each other; but if the
    : > > case-sensitivity required is different for each term, then they will be in
    : > > different fields, making SpanQuery impossible to use.

    i assume by this statement that you are suggesting that you want your
    users to be able to say "find me $foo near $bar where $foo must be in the
    case i specified but $bar can be in any case". is that correct?

    in that case Erick's point about indexing both the original case and
    some normalized casing at the same term position is the best way to go --
    the only downside this has compared to separate fields is that it can
    introduce some bias in your tf/idf values ... but that can be eliminated
    by prefixing all of your "normalized" terms with some unicode character
    that your tokenizer would normally strip off.


    -Hoss
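
A minimal sketch of the combined idea: both forms of every word are stored at the same position, the normalized form carries a marker prefix so its statistics stay separate from the original-case terms, and a proximity check mixes the two. Everything here is an assumption made for illustration: the class MarkedCaseIndex, the marker character U+0001, the single-document postings map and the near() check are stand-ins, not Lucene's SpanQuery implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class MarkedCaseIndex {
    /** Hypothetical marker; any character the tokenizer never emits would do. */
    public static final String MARK = "\u0001";

    // term -> positions of that term within one document
    private final Map<String, List<Integer>> postings = new HashMap<>();

    public MarkedCaseIndex(String text) {
        int pos = 0;
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            add(word, pos);                                  // original case
            add(MARK + word.toLowerCase(Locale.ROOT), pos);  // normalized, same position
            pos++;
        }
    }

    private void add(String term, int pos) {
        postings.computeIfAbsent(term, k -> new ArrayList<>()).add(pos);
    }

    /** Index term for a case-sensitive lookup. */
    public static String exactCase(String word) {
        return word;
    }

    /** Index term for a case-insensitive lookup. */
    public static String anyCase(String word) {
        return MARK + word.toLowerCase(Locale.ROOT);
    }

    /** True if some occurrence of a lies within slop positions of some occurrence of b. */
    public boolean near(String a, String b, int slop) {
        for (int pa : postings.getOrDefault(a, List.of())) {
            for (int pb : postings.getOrDefault(b, List.of())) {
                if (pa != pb && Math.abs(pa - pb) <= slop) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        MarkedCaseIndex doc = new MarkedCaseIndex("We buy Dell COMPUTERS in bulk");
        System.out.println(doc.near(exactCase("Dell"), anyCase("computers"), 1)); // true
        System.out.println(doc.near(exactCase("dell"), anyCase("computers"), 1)); // false
        System.out.println(doc.near(anyCase("dell"), anyCase("computers"), 1));   // true
    }
}
```

Because both forms live in one field at the same positions, a case-sensitive term and a case-insensitive term can take part in the same proximity query.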


  • John Byrne at Jun 26, 2008 at 8:20 am

    Chris Hostetter wrote:
    > the enumeration is in lexicographical order, so "Dell" is nowhere near
    > "dell" in the enumeration. even if we added a boolean property to Terms
    > indicating that it's a case-insensitive Term, the "seeking" along that
    > enumeration would be ... less optimal ... than it can be now.

    Ah, now I understand!

    > : > > Let's say, for example, you want to find "Dell" (with a capital "D"), near
    > : > > "computers" (with or without capitals, i.e. in any case). The problem is that
    > : > > you would need to use a SpanQuery to find terms near each other; but if the
    > : > > case-sensitivity required is different for each term, then they will be in
    > : > > different fields, making SpanQuery impossible to use.
    >
    > i assume by this statement that you are suggesting that you want your
    > users to be able to say "find me $foo near $bar where $foo must be in the
    > case i specified but $bar can be in any case". is that correct?

    Yes, that's exactly what I meant.

    > in that case Erick's point about indexing both the original case and
    > some normalized casing at the same term position is the best way to go --
    > the only downside this has compared to separate fields is that it can
    > introduce some bias in your tf/idf values ... but that can be eliminated
    > by prefixing all of your "normalized" terms with some unicode character
    > that your tokenizer would normally strip off.

    From Erick's reply:

    > I suppose something like that might work, but I still think that presenting
    > a user with matches that are sometimes case-sensitive and sometimes
    > aren't would be... er... fraught.

    The user would, of course, choose which terms are case-sensitive when
    they query, using a modifier in the query language. (I would have to
    implement that). It's something my users have asked to be able to do -
    in their view, fields are something that should be used for different
    content, and case-sensitivity should be an option on *any* field. But
    what you have suggested should allow it to work that way, by adding both
    versions of the term at the same position.

    Thanks guys!

    -John
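
The query-language modifier John mentions might look something like the following sketch. The syntax is purely hypothetical (a leading "=" marks a word as case-sensitive), as are the class CaseModifierParser and the marker character; it also assumes the index stores normalized terms with a marker prefix, as Chris suggested, alongside the original-case terms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class CaseModifierParser {
    /** Hypothetical marker assumed to prefix normalized terms in the index. */
    public static final String MARK = "\u0001";

    /**
     * Maps each query word to the index term it should look up:
     * "=Word" keeps its case, any other word is normalized and marked.
     */
    public static List<String> toIndexTerms(String query) {
        List<String> terms = new ArrayList<>();
        for (String word : query.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            if (word.startsWith("=") && word.length() > 1) {
                terms.add(word.substring(1));                     // case-sensitive
            } else {
                terms.add(MARK + word.toLowerCase(Locale.ROOT));  // any case
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        List<String> terms = toIndexTerms("=Dell Computers");
        System.out.println(terms.equals(List.of("Dell", MARK + "computers"))); // true
    }
}
```

The resulting terms would then be handed to whatever proximity query is built programmatically, since the stock QueryParser has no such modifier.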


Discussion Overview
group: java-user @
categories: lucene
posted: Jun 25, '08 at 9:41a
active: Jun 26, '08 at 8:20a
posts: 6
users: 3
website: lucene.apache.org
