Time to give a little something back to the Lucene community, even if
it's just a little knowledge for the maintainers...

Back on 17-Apr-2007 (for those searching the archives), I expressed a
need to match on queries using an intermix of case-sensitive with
case-insensitive terms.

The example that I cited was the word LET, which is an acronym when it
appears in uppercase and an extremely common word otherwise, such that
it shows up as a false positive in a large number of documents,
especially when one tries this query: "company LET"~10


Erick Erickson had a fantastic idea: prefix a token with a dollar
sign to signify that it is an acronym, and do a little translating
back and forth.

Chris "Hoss" Hostetter suggested a "customized bastard stepchild of
StopFilter and LowercaseFiltter", doing processing based on whether
something was an acronym or not.


However, in producing a concrete example for the sake of discussion, I
neglected to indicate that it was the mixing of case-sensitive and
case-insensitive matching in a single Lucene field that I was after,
not acronyms in general.

It turns out there was no way to know an acronym list up front, and
worse yet, I'd be searching for people's names... some of them
foreign, and some of which also happened to be stop words in English.



Thanks to both Erick's and Hoss's input, I was actually able to
develop a working hybrid solution! And I'd like to share a bit of the
technical part on the off chance such a thing would be valuable to
other people.

I can now search for things like "+company +$LET" and get a
case-insensitive match on 'company' while doing a case-sensitive
match on 'LET', ignoring other casings of 'let'.

Warning: what follows is a high-level technical walk-through of how
to bastardize your Lucene .jar file to make the above possible.
Coding skills required.


STEP ZERO: Go read the Lucene Tutorial by Steven J. Owens at
http://darksleep.com/lucene/ -- this is the best walk-through of the
Lucene classes that I've encountered yet. It starts you off assuming
zero knowledge on your part, skips implementation-specific details,
and addresses the responsibilities and relationships of the Lucene
classes in such a way that there are no forward references in the
discussion.

In this tutorial he stresses not once, not twice, but three times that
the same Analyzer that is used to build an index -must- also be used
when performing a Query. There is great detail explaining why this is
so.

However, in order to get our magic to work, we need to violate this
rule in a very clever way.


STEP ONE: Building an index that has both case-sensitive and
case-insensitive tokens in it.

As a document is ingested and turned into a stream of tokens, we want
to do something different than the StandardAnalyzer. For each token
encountered, we want to emit two tokens into the index, both at the
same physical position: one is the case-sensitive token, the other is
the case-insensitive token.

We accomplish this by building our own class derived from Analyzer,
though we don't do the LowerCaseFilter or StopFilter steps. Instead,
we call a new custom filter that we'll write.

For each incoming token, the filter emits the original text unchanged
except for a dollar sign prefixed on the front, and it also emits a
lower-cased version, which is used for the case-insensitive matching.

While it's common to think of filters as tossing out tokens, they can
also be used to inject extra ones. A reasonable way of doing this can
be found on page 130 of the rather dated book Lucene In Action (ISBN
1-932394-28-1) using the synonym example as a template.
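
For the curious, here's a rough sketch of the index-side pieces,
written against the Lucene 2.x Token/next() API. The class names are
placeholders rather than my exact code, but the shape is the same:
return the dollar-prefixed, case-preserved copy first, and buffer a
lower-cased twin to emit next with a position increment of zero.

    import java.io.IOException;
    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    /** Index-time analyzer: no LowerCaseFilter or StopFilter, just the
        token-doubling filter below. */
    public class DualCaseIndexAnalyzer extends Analyzer {
        public TokenStream tokenStream(String fieldName, Reader reader) {
            return new DualCaseFilter(new StandardTokenizer(reader));
        }
    }

    /** For every incoming token, emit "$"+original (case preserved) and
        the lower-cased text at the same physical position. */
    class DualCaseFilter extends TokenFilter {
        private Token pending;   // lower-cased twin waiting to be emitted

        public DualCaseFilter(TokenStream input) {
            super(input);
        }

        public Token next() throws IOException {
            if (pending != null) {             // second half of the pair
                Token t = pending;
                pending = null;
                return t;
            }
            Token orig = input.next();
            if (orig == null) return null;     // end of stream

            // case-sensitive copy, flagged with a leading dollar sign
            Token sensitive = new Token("$" + orig.termText(),
                                        orig.startOffset(), orig.endOffset());

            // case-insensitive copy, stacked at the same position
            pending = new Token(orig.termText().toLowerCase(),
                                orig.startOffset(), orig.endOffset());
            pending.setPositionIncrement(0);

            return sensitive;
        }
    }

This analyzer gets handed to IndexWriter at ingest time exactly the
way StandardAnalyzer normally would be.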

Using Luke, it's possible to verify that two tokens do indeed make it
into the index.


STEP TWO: Being able to query with dollar signs.

This step is where things get complicated. It turns out that
StandardAnalyzer, which uses the StandardTokenizer, throws away dollar
signs. So, it doesn't matter how many you type in your query, they
all vanish, never giving you the opportunity to do anything with them
downstream.

By luck, this wasn't a problem in step one. Any dollar signs in the
incoming text are thrown away, and the dollar sign is prepended only
after we construct the custom token. Which, as luck would have it, is
exactly the format we want anyhow.

It also turns out, though, that we can't use the special analyzer we
just built in step one, because it spews two tokens for every one
provided. If a document contained only "let" and we queried for
"+LET", then re-using that analyzer would produce a query that looked
like "+let +$LET", and since the latter term doesn't appear in the
document, we wouldn't get a hit when we should.

Consequently, we've got three problems to solve before this all ties
together nicely.

First, we need to build a new Analyzer for queries. It needs to take
a special tokenizer that can handle dollar signs ...more on that in a
minute...

Second, we need to run its output through a new custom filter that
conditionally converts a term to lowercase or not.

This is easy, because if the token's termText() starts with a dollar
sign, we leave it alone. Otherwise, we lowercase the token.

Follow that... If we type "$LET" it searches for "$LET", but if we
type "LET" it searches for "let". Anything that isn't flagged with
the dollar sign is converted to lowercase. This nicely fits the
syntax Erick suggested, plus it works for everything without the need
to know the acronyms up front. "Let" becomes "let", and "$Let" stays
"$Let" which is different from "$LET".

The third problem is the ugly one... making the tokenizer handle
dollar signs. When Hoss was talking about customized, bastard
stepchildren, he wasn't kidding.

We're actually going to have to go inside Lucene and twiddle the
grammars for the Lucene query parser and tokenizer. This requires a
little bit of knowledge of JavaCC, because you're not modifying the
generated .java source, but the .jj files that are used to generate
it.

In the QueryParser.jj file, we need to let the tokenizer know a term
can start with a dollar sign, so "$" has to be added to the end of the
TERM_CHAR line (just like - and + are).

Next we copy the StandardTokenizer.jj file to our own dollar aware
version, making a new class (replace StandardTokenizer with the class
name of your choice throughout the file). Now we teach it that an
optional dollar sign can appear before certain tokens. Here's how.

In front of the expressions for ALPHANUM, APOSTROPHE, ACRONYM,
COMPANY, EMAIL, HOST, and NUM we add (["$"])?, which is the JavaCC
way of doing regexes. The dollar sign now means that a new
case-sensitive token is starting.

Of course, you'll have to modify Lucene's build.xml file to add your
special class right after the StandardTokenizer.jj rule, the jjdoc
rule, and the clean-javacc rule. Then rebuild with 'ant javacc'
followed by 'ant' to make a new .jar file, which you'll then use with
your software.


Bringing it all together, it's now possible to use your new
query-side analyzer with the QueryParser. Calling .parse() with
dollar-sign-prefixed terms will search for exact-case matches, while
omitting the prefix works like the regular old Lucene we all know and
love.
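
The hook-up itself is just ordinary QueryParser usage; something along
these lines, where "contents" stands in for whatever field you indexed
and the analyzer is the query-side one sketched earlier:

    import org.apache.lucene.queryParser.ParseException;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;

    public class MixedCaseSearchExample {
        public static Query buildQuery(String userInput) throws ParseException {
            QueryParser parser =
                new QueryParser("contents", new DualCaseQueryAnalyzer());
            // e.g. "+company +$LET" (exact case) or "+company +LET"
            // (lowercased to 'let' behind the scenes)
            return parser.parse(userInput);
        }
    }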

The down side...? The index has twice as many tokens.


Given that all of the Lucene internal code is new to me, I can't
promise that I've done things in the most optimal fashion or that I
haven't introduced some subtle problem.

But from what I can tell using hand crafted sample documents,
ingesting, and inspecting them with Luke, while stepping through
source looking at generated queries, it all seems to be working
perfectly.

I'd love to see a formal syntax like this officially enter the Lucene
standard query language someday.

If someone can point me at how to do this without twiddling Lucene's
code directly, I'd be happy to contribute the modification.

-Walt Stoneburner
http://www.wwco.com/~wls/blog/

  • Yonik Seeley at May 11, 2007 at 9:19 pm

    On 5/11/07, Walt Stoneburner wrote:
    In this tutorial he stresses not once, not twice, but three times that
    the same Analyzer that is used to build an index -must- also be used
    when performing a Query. There is great detail explaining why this is
    so.

    However, in order to get our magic to work, we need to violate this
    rule in a very clever way.
    Yeah, "compatible" analyzer would be a better way to put it. Using
    the same analyzer for anything that produces multiple tokens at the
    same position is normally wrong.
    Solr allows specification of a "query" analyzer and an "index"
    analyzer for these cases.
    STEP ONE: Building an index that has both case-sensitive and
    case-insensitive tokens in it.
    Yep, your approach sounds fine, and will work in phrase queries (which
    the two-field solution currently can't handle). The greater
    difficulty lies in making it generic (working for many analyzers,
    etc).
    This step is where things get complicated. It turns out that
    StandardAnalyzer, which uses the StandardTokenizer, throws away dollar
    signs. So, it doesn't matter how many you type in your query, they
    all vanish, never giving you the opportunity to do anything with them
    downstream.
    This points out the difficulty of doing this in a *generic* way.
    Better than a "$" would be a flag on the Token IMO. Not currently
    really supported by lucene, but you could perhaps subclass Token.

    Bringing it all together, it's now possible to use your new
    query-side analyzer with the QueryParser. Calling .parse() with
    dollar-sign-prefixed terms will search for exact-case matches, while
    omitting the prefix works like the regular old Lucene we all know and
    love.

    The down side...? The index has twice as many tokens.
    I've also considered case-insensitive support at the Term-Enum level.
    It would make lookups slower, but the index wouldn't be much bigger (it would
    be slightly bigger because one would index everything w/o lowercasing).
    I'd love to see a formal syntax like this officially enter the Lucene
    standard query language someday.

    If someone can point me at how to do this without twiddling
    Lucene's code directly, I'd be happy to contribute the modification.
    If you picked a token prefix/postfix that would pass through
    the QueryParser w/o a syntax error, the necessary manipulation could
    all be done in the Analyzer/TokenFilter. Much easier, but perhaps not
    as nice a syntax.

    -Yonik

  • Walt Stoneburner at May 12, 2007 at 5:18 pm

    Yonik Seeley adds some wonderful observations:
    Yeah, "compatible" analyzer would be a better way to put it. Using
    the same analyzer for anything that produces multiple tokens at the
    same position is normally wrong.
    I came to the same conclusion the moment I realized that my query
    string was being transformed into tokens. What I wasn't 100% sure of,
    and what you've just put my mind at ease about, was whether using two
    different analyzers would break anything or not.

    Seems not only does it not, but it's also necessary.

    Solr allows specification of a "query" analyzer and an "index"
    analyzer for these cases.
    I've had a very passing glance at Solr. It looked like an awesome
    replacement for anyone considering a Google appliance.

    The scalability and additional features of Solr really appealed to me,
    but it felt more like a complete solution sealed in a pretty bow than
    an API library I could utilize with my code.

    Is it worth a second look at Solr? And, if so, is it possible to
    easily capitalize on the Lucene modifications there, without bringing
    in the other extras?

    Yep, your [two token] approach sounds fine, and will work in phrase
    queries (which the two-field solution currently can't handle).
    The greater difficulty lies in making it generic (working for many
    analyzers, etc).
    Since the index writer is still writing the same tokens the other
    analyzers do, the standard query mechanisms still work. The extra
    dollar-sign-prefixed-case-preserved-tokens simply get ignored.

    It would be nice, as you point out, to be able to utilize them with
    other analyzers, though even with 20/20 hindsight, I'm not sure how
    that'd be accomplished without some basic architectural trade-offs
    that would inadvertently impact performance for those not interested
    in the feature.
    This points out the difficulty of doing this in a *generic* way.
    Better than a "$" would be a flag on the Token IMO. Not currently
    really supported by lucene, but you could perhaps subclass Token.
    In fact, this is how I originally wanted to go about approaching the
    problem. I was surprised there wasn't a flag field available to
    tinker with.

    I'll have to look into the subclassing solution, though I wonder
    whether it'd be my subclass that would always get called to create a
    token. I haven't gone through the Lucene engine as a whole yet, only
    the parts immediately applicable to my short-term problem and
    solution.

    Still this is a good idea. I'll look into it.

    My biggest concern is the fact that I can't just have my own class /
    subclass for the parsing. At the moment, I'm stuck replicating the
    Lucene .jj grammar, modifying it, and maintaining my own personal
    copy. Should the base copy of Lucene get updated, say when a new
    version is released, I now have to hope the implementation didn't
    change, manually copy the code, and patch it again, as opposed to
    subclassing some method or augmenting some data structure.

    The way I make myself feel better about this is to keep telling myself
    that Lucene is an API, and these classes are mere examples, and that I
    should be writing my own parser. It may not be true, but it helps me
    sleep at night.

    I've also considered case-insensitive support at the Term-Enum level.
    It would make lookups slower, but the index wouldn't be much bigger (it would
    be slightly bigger because one would index everything w/o lowercasing).
    I'm fairly certain I'm not the only person who's tried to mix
    case-sensitive searching into a query. What I'd love to have seen is
    a LuceneConfig.java file that said:

    // While not for everyone, due to size and performance overhead if not being used,
    // set the following line to true to enable mixed-case searching in a single field.
    boolean ENABLE_MIXEDCASE_SEARCHING = false;


    Better yet, now that I'm thinking of it, this should actually be an
    attribute of the Field itself.

    That makes far more sense.

    If you picked a token prefix/postfix that would pass through
    the QueryParser w/o a syntax error, the necessary manipulation could
    all be done in the Analyzer/TokenFilter. Much easier, but perhaps not
    as nice a syntax.
    The problem I ran into was that the lexical analyzer seemed to strip
    any symbolic characters that were not part of legitimate Lucene query
    syntax. Thus my dollar-sign solution wouldn't have worked without the
    modifications to the parser anyhow. I also wasn't too hot on further
    overloading the meaning of an existing symbol.


    Thanks for the feedback. I hope that at least knowing it IS possible
    to do helps some poor soul.

    -Walt Stoneburner,
    http://www.wwco.com/~wls/blog/

  • Mark Miller at May 12, 2007 at 7:59 pm

    I'd love to see a formal syntax like this officially enter the Lucene
    standard query language someday.
    I doubt that this is something that is ever going to really happen.
    There are a couple of approaches to the problem, and there are other
    similar problems (like allowing stemmed and unstemmed searches), so I
    think you are going to be stuck rolling your own for this type of
    thing. The problem with it being in the syntax is that it requires you
    to use certain analyzers and indexing schemes...something not
    guaranteed...and if you were to add a case-sensitive approach to the
    core of Lucene it would slow down the common case for a somewhat rare
    case.

    I think the brilliance of Lucene's approach is that it provides
    building blocks for your own search solution. I wouldn't consider it
    out of the ordinary to write your own parser or modify Lucene's
    parser...I think a lot of people do. Your solution is a good one, and
    I think it's the right approach with Lucene...there are a lot of cool
    things you can do with different analyzers and token positions...but
    adding more support than what is there would tie down the sweet
    open-endedness. Maybe a sandbox package with all of the necessary
    analyzers and the modified query parser would be more likely...even
    still, it seems a lot of stuff does not even make the sandbox. These
    guys seem quite intent on keeping Lucene lean and mean, and I am quite
    thankful for it.

    As far as Solr goes, a lot of the code can just be stolen, or you
    might check out embedded Solr; I think there is also something else
    similar to embedded Solr that might not require a web server...

    - Mark

