Hi there. I've got an index of company names, and it's split up into
separate indexes by state.

I have a simple command line interface for testing. I'm getting some odd
results, though, with certain logic of wildcard searches. It seems like
the order in which I put the fields of the query alters the results
drastically when I AND them together.

Here are some examples:

***************************
This one makes sense

Query> name:amb*
State> california
name:amb*
[email protected]
amb*
2819 total matching documents

***************************
This is the REALLY confusing one. We know there's a company named AMB
Property Corporation. Why do I get NO hits?

Query> name:"amb prop*"
State> california
name:"amb prop*"
[email protected]
"amb prop"
0 total matching documents

***************************
Ok, so I get some results with this (I know the * isn't necessary at the
end of property, but bear with me for the next example where it goes all
screwy)

Query> name:amb property*
State> california
name:amb property*
[email protected]
amb name:amb property*:property*
56 total matching documents

***************************
south san francisco is an exact match to the city. Why does this find 0
results??!

Query> name:amb property* AND city:south san francisco
State> california
name:amb property* AND city:south san francisco
[email protected]
amb +name:amb property* AND city:south san francisco:property* +city:south
name:
amb property* AND city:south san francisco:san name:amb property* AND
city:south
san francisco:francisco
0 total matching documents

****************************
Do this and suddenly I get matches

Query> name:amb propert* and city:"south san fran*"
State> california
name:amb propert* and city:"south san fran*"
[email protected]
amb name:amb propert* and city:"south san fran*":propert* city:"south san
fran"
56 total matching documents

*****************************
And look, this gets matches too:

Query> name:"amb propert*" and city:"south san*"
State> california
name:"amb propert*" and city:"south san*"
[email protected]
"amb propert" city:"south san"
10732 total matching documents

*****************************
Yet do this and we're back to 0 results:

Query> name:"amb propert*" and city:"south san fran*"
State> california
name:"amb propert*" and city:"south san fran*"
[email protected]
"amb propert" city:"south san fran"
0 total matching documents

******************************
Now flip the query around and it works:

Query> city:"south san fran*" and name:amb propert*
State> california
city:"south san fran*" and name:amb propert*
[email protected]
city:"south san fran" amb city:"south san fran*" and name:amb
propert*:propert*
56 total matching documents

*******************************
Finally, using the prefix of the metaphone name with quotes around it
produces no results:

Query> metaph_name:"ambprp*"
State> california
metaph_name:"ambprp*"
[email protected]
metaph_name:ambprp
0 total matching documents

*******************************
But take away the quotes and it works:

Query> metaph_name:ambprp*
State> california
metaph_name:ambprp*
[email protected]
metaph_name:ambprp*
6 total matching documents


********************************
But quotes don't seem to matter in this complex wildcard:

Query> metaph_name:ambprp* and city:"sou* or san or fra*"
State> california
metaph_name:ambprp* and city:"sou* or san or fra*"
[email protected]
metaph_name:ambprp* city:"sou san fra"
6 total matching documents


So... Can someone help me nail down the logic for these things so we can
construct some good queries?

Thanks!


  • Dan Quaroni at Sep 23, 2003 at 11:04 am
    BTW, this is with lucene 1.2

    Thanks!
  • Erik Hatcher at Sep 23, 2003 at 12:37 pm
    Ah, this is a fun one.... lots of fiddly issues with how queries work
    and how QueryParser works. I'll take a stab at some of these inline
    below....
    On Monday, September 22, 2003, at 08:26 PM, Dan Quaroni wrote:
    I have a simple command line interface for testing.
    Interesting interface. Looks like something that if made generic
    enough would be handy to have at least in the sandbox.
    I'm getting some odd
    results, though, with certain logic of wildcard searches.
    not all your queries are truly "WildcardQuery"'s though. look at the
    class it constructed to get a better idea of what is happening.
    It seems like the order in which I put the fields of the query alters
    the results drastically when I AND them together.
    Not quite the right explanation of what is happening. More below....
    ***************************
    This one makes sense

    Query> name:amb*
    State> california
    name:amb*
    [email protected]
    amb*
    2819 total matching documents
    Right.... QueryParser does a little optimization here and anything with
    a simple trailing * turns into a PrefixQuery, meaning all name fields
    that begin with "amb".
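To make the prefix rewrite concrete, here is a minimal Python sketch of what "all name fields that begin with amb" means against a sorted list of indexed terms. It is a toy model, not Lucene code: `prefix_terms` and the sample `name_terms` are made up for illustration, and the only assumption is that the index keeps its terms in sorted order.

```python
from bisect import bisect_left


def prefix_terms(sorted_terms, prefix):
    """Return every term in a sorted term list that starts with prefix."""
    start = bisect_left(sorted_terms, prefix)  # first term >= prefix
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # terms sharing a prefix are contiguous in sorted order
        matches.append(term)
    return matches


# Hypothetical terms from the "name" field of a small index.
name_terms = sorted(["amb", "ambassador", "amber", "ambulance", "amc", "property"])

print(prefix_terms(name_terms, "amb"))
# ['amb', 'ambassador', 'amber', 'ambulance']
```

A query like name:amb* then behaves like an OR over every term this lookup returns, which is why it can match thousands of documents.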
    ***************************
    This is the REALLY confusing one. We know there's a company named AMB
    Property Corporation. Why do I get NO hits?

    Query> name:"amb prop*"
    State> california
    name:"amb prop*"
    [email protected]
    "amb prop"
    0 total matching documents
    Notice you're now in PhraseQuery land. Wildcards don't work like you
    seem to expect here. What is really happening here is a query for
    documents that have "amb" and "prop" terms side by side in that order.
    The asterisk got axed by the analyzer. If you said "name:amb
    name:prop*" you'd get some hits I believe, as it would turn into a
    boolean query with a term and wildcard queries either OR'd or AND'd
    together. PhraseQuery does not support wildcards. A custom subclass
    of QueryParser could do some interesting things here and expand
    wildcard-like terms like this in a phrase into PhrasePrefixQuery, but
    that is probably overkill here (although maybe not). Look at the test
    case for PhrasePrefixQuery for some hints.
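The phrase behaviour described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Lucene's implementation: `analyze` stands in for an analyzer that drops punctuation (including the asterisk), `phrase_match` requires exact adjacent terms, and `phrase_prefix_match` is a hypothetical helper showing the effect that expanding the last term would have.

```python
import re


def analyze(text):
    """Toy analyzer: lowercase, keep only runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def phrase_match(doc_tokens, phrase_tokens):
    """True if phrase_tokens occur as adjacent, in-order tokens."""
    n = len(phrase_tokens)
    return any(doc_tokens[i:i + n] == phrase_tokens
               for i in range(len(doc_tokens) - n + 1))


def phrase_prefix_match(doc_tokens, phrase_tokens):
    """Like phrase_match, but the last token only has to be a prefix."""
    n = len(phrase_tokens)
    head, last = phrase_tokens[:-1], phrase_tokens[-1]
    for i in range(len(doc_tokens) - n + 1):
        window = doc_tokens[i:i + n]
        if window[:-1] == head and window[-1].startswith(last):
            return True
    return False


doc = analyze("AMB Property Corporation")
query = analyze('"amb prop*"')  # the quotes and the asterisk are dropped

print(query)                            # ['amb', 'prop']
print(phrase_match(doc, query))         # False: "prop" is not an indexed term
print(phrase_prefix_match(doc, query))  # True: "property" starts with "prop"
```

The exact phrase fails because no document contains the literal term "prop" next to "amb"; only a prefix-aware variant finds "AMB Property Corporation".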
    Ok, so I get some results with this (I know the * isn't necessary at
    the end of property, but bear with me for the next example where it
    goes all screwy)

    Query> name:amb property*
    State> california
    name:amb property*
    [email protected]
    amb name:amb property*:property*
    56 total matching documents
    your default field for QueryParser is "property*"? Odd field name, or
    is the output fishy? I'm a bit confused by the "property*:" there.
    I'm assuming you're outputting the Query.toString here.

    See above for a different way to phrase the query.
    ***************************
    south san francisco is an exact match to the city. Why does this find
    0 results??!

    Query> name:amb property* AND city:south san francisco
    State> california
    name:amb property* AND city:south san francisco
    [email protected]
    amb +name:amb property* AND city:south san francisco:property*
    +city:south
    name:
    amb property* AND city:south san francisco:san name:amb property* AND
    city:south
    san francisco:francisco
    0 total matching documents
    with all the AND's going on, this makes sense because "san" and
    "francisco" end up as separate term queries. you'd have to say
    city:"south san francisco" to turn it into a PhraseQuery.
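A small Python sketch of why the unquoted words end up as separate term queries: a field prefix attaches only to the term it touches, and every other bare word falls back to the default field. This is a deliberately simplified model, not the real QueryParser grammar. `bind_fields` is a hypothetical helper, the default field name "contents" is an assumption, and quoted phrases are not handled.

```python
def bind_fields(query, default_field):
    """Toy model: a 'field:' prefix applies only to the term it touches."""
    clauses = []
    for token in query.split():
        if token == "AND":  # only the uppercase form is treated as an operator
            continue
        field, sep, term = token.partition(":")
        if sep:
            clauses.append((field, term))
        else:
            clauses.append((default_field, token))
    return clauses


for clause in bind_fields("name:amb property* AND city:south san francisco",
                          "contents"):
    print(clause)
# ('name', 'amb')
# ('contents', 'property*')
# ('city', 'south')
# ('contents', 'san')
# ('contents', 'francisco')
```

Only "amb" and "south" are searched in the fields that were named; "property*", "san" and "francisco" are looked up in the default field.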
    ****************************
    Do this and suddenly I get matches

    Query> name:amb propert* and city:"south san fran*"
    State> california
    name:amb propert* and city:"south san fran*"
    [email protected]
    amb name:amb propert* and city:"south
    san fran*":propert* city:"south san fran"
    56 total matching documents
    you're getting hits on the wildcard match at least, and probably on
    name field "amb" as well. again, phrase queries don't support
    wildcards like you've done here with "south san fran*" so you're not
    matching anything with that.
    *****************************
    And look, this gets matches too:

    Query> name:"amb propert*" and city:"south san*"
    State> california
    name:"amb propert*" and city:"south san*"
    [email protected]
    "amb propert" city:"south san"
    10732 total matching documents
    my guess here is you're getting hits on "south san" as a phrase query.
    are there that many in that area?
    *****************************
    Yet do this and we're back to 0 results:

    Query> name:"amb propert*" and city:"south san fran*"
    State> california
    name:"amb propert*" and city:"south san fran*"
    [email protected]
    "amb propert" city:"south san fran"
    0 total matching documents
    you're getting zero hits from "amb propert*" since * is getting
    stripped by the analyzer and there is no "amb propert" phrase match,
    and with the AND (which should be all uppercase, right?) definitely not
    getting hits.
    ******************************
    Now flip the query around and it works:

    Query> city:"south san fran*" and name:amb propert*
    State> california
    city:"south san fran*" and name:amb propert*
    [email protected]
    city:"south san fran" amb city:"south san fran*" and name:amb
    propert*:propert*
    56 total matching documents
    You didn't quite flip it around, you took off some quotes too, which
    removed a PhraseQuery and you're getting your hits from name:amb here
    as well as probably the wildcard of propert*. I'm still confused by
    the output of propert*: here - are you using the CVS version of Lucene?
    the toString looks ok there, maybe there was a bug in that method in
    earlier code?
    *******************************
    Finally, using the prefix of the metaphone name with quotes around it
    produces no results:

    Query> metaph_name:"ambprp*"
    State> california
    metaph_name:"ambprp*"
    [email protected]
    metaph_name:ambprp
    0 total matching documents
    Notice this is a TermQuery - that's the clue... the asterisk is taken
    literally there, so no matches.
    *******************************
    But take away the quotes and it works:

    Query> metaph_name:ambprp*
    State> california
    metaph_name:ambprp*
    [email protected]
    metaph_name:ambprp*
    6 total matching documents
    Now you kicked it into an optimized wildcard query, which turns into a
    prefix query, hence the matches.
    ********************************
    But quotes don't seem to matter in this complex wildcard:

    Query> metaph_name:ambprp* and city:"sou* or san or fra*"
    State> california
    metaph_name:ambprp* and city:"sou* or san or fra*"
    [email protected]
    metaph_name:ambprp* city:"sou san fra"
    6 total matching documents
    your clue here is that the toString output has the asterisks removed,
    so your analyzer stripped them. again quotes mean phrase query.
    phrase queries don't support wildcards.
    So... Can someone help me nail down the logic for these things so we
    can construct some good queries?
    I hope my above analysis helps. I may not be perfectly right on
    everything, but should be relatively close at identifying the issues.
    Fixing it is more up to how you want to deal with it. Perhaps a custom
    QueryParser is more what you're after.

    Erik
  • Terry Steichen at Sep 23, 2003 at 2:08 pm
    Erik's analysis is comprehensive and useful. I think this example reflects
    a common (and understandable) oversight - that wildcards do *not* work with
    a phrase. Got caught on that many times myself. Also there may be
    confusion about the format -> field:(term1 term2), in that the examples
    provided don't seem to make use of parentheses. Finally, as I recall, there
    were some bug(s) with some wildcard patterns in 1.2.

    Regards,

    Terry

    ----- Original Message -----
    From: "Erik Hatcher" <[email protected]>
    To: "Lucene Users List" <[email protected]>
    Sent: Monday, September 22, 2003 10:33 PM
    Subject: Re: Confusion over wildcard search logic

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: [email protected]
    For additional commands, e-mail: [email protected]
  • Dan Quaroni at Sep 23, 2003 at 2:09 pm
    Yeah, thanks a lot for your help! I'm using the release version of Lucene
    version 1.2.
    not all your queries are truly "WildcardQuery"'s though. look at the
    class it constructed to get a better idea of what is happening.
    Yeah, I printed the queries out to see what was going on and noticed that.
    I used wildcard in the subject as a sort of general description of what I
    was after rather than its technical meaning to lucene, but I was curious
    about why I was getting the query types I was getting.
    your default field for QueryParser is "property*"? Odd field name, or
    is the output fishy? I'm a bit confused by the "property*:" there.
    I'm assuming you're outputting the Query.toString here.
    That's just the Query.toString, yeah. I didn't set any default fields to
    property* so whatever it's spitting out comes from the QueryParser.
    Query> name:amb propert* and city:"south san fran*"
    you're getting hits on the wildcard match at least, and probably on
    name field "amb" as well. again, phrase queries don't support
    wildcards like you've done here with "south san fran*" so you're not
    matching anything with that.
    Ok... What's the correct procedure for doing a multi-word wildcard where I
    want it to begin with "south san fran" but not get anything else that
    contains "south" or "san"? Just AND together the south, san, and fran?

    Although this might produce good results, my understanding was that booleans
    retrieve all matches and store them in memory then resolve the booleans. If
    I use the term "san" to search California, I'm going to need a lot of memory
    to store all of the temporary results...!

    Or is that only true when doing booleans on different fields? If so, I
    think we have our solution! :)
    I'm still confused by the output of propert*: here - are you using
    the CVS version of Lucene?
    The honest to goodness 1.2 release. :)
    I hope my above analysis helps. I may not be perfectly right on
    everything, but should be relatively close at identifying the issues.
    Fixing it is more up to how you want to deal with it. Perhaps a custom
    QueryParser is more what you're after.
    Well, i think the real key piece of information was to drop the quotes to
    avoid the phrase queries. Thanks!
  • Erik Hatcher at Sep 23, 2003 at 2:45 pm

    On Tuesday, September 23, 2003, at 10:09 AM, Dan Quaroni wrote:
    Yeah, thanks a lot for your help! I'm using the release version of
    Lucene version 1.2.
    Perhaps give the latest codebase a try too, just to see if any fixes
    (particularly in that WildcardQuery.toString) are there.
    you're getting hits on the wildcard match at least, and probably on
    name field "amb" as well. again, phrase queries don't support
    wildcards like you've done here with "south san fran*" so you're not
    matching anything with that.
    Ok... What's the correct procedure for doing a multi-word wildcard
    where I want it to begin with "south san fran" but not get anything
    else that contains "south" or "san"? Just AND together the south, san,
    and fran?
    As far as I know there isn't a way to do this with QueryParser
    currently. The real way to do this with the existing API is to use
    PhrasePrefixQuery and do some manual setup before using it (like you'll
    see in the current test case and Javadocs for it) by enumerating all
    the terms that start with "fran" and passing that to a
    PhrasePrefixQuery (isn't this class misnamed? What does this have to
    do with "prefix"?) along with "south" and "san".
    Although this might produce good results, my understanding was that
    booleans retrieve all matches and store them in memory then resolve the
    booleans. If I use the term "san" to search California, I'm going to
    need a lot of memory to store all of the temporary results...!
    +south +san +fran* ought to do the trick. i wouldn't worry about
    memory too much until you've seen it to be a problem. i think you'll
    be fine (but don't currently have the understanding or data to back
    that up).

    Erik
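The matching semantics of +south +san +fran* can be sketched as an intersection of document-id sets, with the prefix clause first expanded to the union of every term it covers. This is a toy Python model, not a description of how Lucene evaluates a BooleanQuery internally; `search_required` and the `city_postings` data are invented for illustration.

```python
def search_required(postings, required_terms):
    """Ids of documents containing every required term; 'x*' is a prefix clause."""
    doc_sets = []
    for term in required_terms:
        if term.endswith("*"):
            prefix = term[:-1]
            docs = set()
            for indexed_term, ids in postings.items():
                if indexed_term.startswith(prefix):
                    docs |= ids  # a prefix clause is the union of its terms
        else:
            docs = set(postings.get(term, set()))
        doc_sets.append(docs)
    # Intersect starting from the smallest set so intermediate results stay small.
    doc_sets.sort(key=len)
    result = set(doc_sets[0]) if doc_sets else set()
    for docs in doc_sets[1:]:
        result &= docs
    return result


# Hypothetical postings for the "city" field: term -> set of document ids.
city_postings = {
    "south": {1, 2, 5},
    "san": {1, 2, 3, 4},
    "francisco": {1, 3},
    "franklin": {5},
    "jose": {4},
    "lake": {2},
}

print(sorted(search_required(city_postings, ["south", "san", "fran*"])))
# [1]
```

Note that this kind of query only requires the three terms to be present somewhere in the field. Unlike a phrase, it does not check that they are adjacent or in order.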
  • Dan Quaroni at Sep 23, 2003 at 2:24 pm
    Your email prompted me to re-read the query parser documentation. There are
    only two examples using parentheses, which seem to be the answer to my
    questions. They are:

    (jakarta OR apache) AND website

    And

    title:(+return +"pink panther")


    These leave a lot unanswered, though. I mean, for example, what would
    happen if the query were:

    title:(+return +pink panther)
    or
    title:(return*pink panther)

    I.e. are the + or booleans required between each word inside the
    parentheses?

    I guess the answer is that I need to just play with it and find out, but as
    others have mentioned, the documentation is lacking in some respects and I'd
    say this is one of them... Maybe I'll submit some answers when I figure them
    out. :)
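As a rough answer to the question about + inside parentheses, here is a toy Python model of required versus optional clauses within one field group. It assumes the usual boolean semantics: clauses marked + must all match, unmarked clauses are optional once any required clause exists, and with no required clauses at least one optional clause has to match. `parse_group`, `matches` and the sample titles are hypothetical, and quoted phrases and wildcards are left out.

```python
def parse_group(group):
    """Turn '+return +pink panther' into [(term, required), ...]."""
    return [(tok.lstrip("+"), tok.startswith("+")) for tok in group.split()]


def matches(doc_terms, clauses):
    """Toy boolean matching on exact terms within a single field."""
    required = [term for term, req in clauses if req]
    optional = [term for term, req in clauses if not req]
    if required:
        # Optional terms would only influence ranking, not whether a doc matches.
        return all(term in doc_terms for term in required)
    return any(term in doc_terms for term in optional)


# Hypothetical analyzed titles: document id -> set of terms.
titles = {
    1: {"return", "of", "the", "pink", "panther"},
    2: {"return", "pink", "floyd"},
    3: {"pink", "panther"},
}

mixed = parse_group("+return +pink panther")
loose = parse_group("return pink panther")

print([doc_id for doc_id, terms in titles.items() if matches(terms, mixed)])
# [1, 2]
print([doc_id for doc_id, terms in titles.items() if matches(terms, loose)])
# [1, 2, 3]
```

In this model title:(+return +pink panther) leaves "panther" optional, so the + is needed on each word that must be present.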
  • Otis Gospodnetic at Sep 23, 2003 at 2:25 pm
    Hello,
    I guess the answer is that I need to just play with it and find out,
    but as others have mentioned, the documentation is lacking in some
    respects and I'd say this is one of them... Maybe I'll submit some
    answers when I figure them out. :)
    Thank you, always appreciated.

    Otis


  • Erik Hatcher at Sep 23, 2003 at 2:38 pm
    Better yet, submit some JUnit test cases that show how this stuff
    works, if the ones in Lucene's codebase aren't comprehensive enough.
    This is an excellent way to "play" with an API and get a good
    understanding of it and documenting it at the same time.

    Erik

  • Dan Quaroni at Sep 23, 2003 at 2:55 pm

    Perhaps give the latest codebase a try too, just to see if any fixes
    (particularly in that WildcardQuery.toString) are there.
    It's our intention to put this into a production environment soon, so we
    were waiting on 1.3 to go final before attempting to use it.
    i wouldn't worry about
    memory too much until you've seen it to be a problem. i think you'll
    be fine (but don't currently have the understanding or data to back
    that up).
    The reason I split up the indexes by state was that I was running out of
    memory (and searches were very slow) with the whole world of companies in
    one index with all kinds of boolean joining. With it split out, it seems to
    do pretty well.

    Well, after some extremely brief experimentation (Maybe I shoulda done it
    before writing the email, huh?) I discovered this:

    **********************
    This worked pretty well and got me some good results - the company that I
    was looking for came back second (Which is pretty good given how general I
    made the query)

    Query> name:(amb proper*)
    State> california
    name:(amb proper*)
    [email protected]
    amb proper*
    31988 total matching documents

    *************
    This one matched a ton of documents, however the company I was looking for
    came up first in the list, though with a pretty abysmal score of 0.23769014

    Query> name:(amb prop*) and city:(south san fran*)
    State> california
    name:(amb prop*) and city:(south san fran*)
    [email protected]
    (amb prop*) (city:south city:san city:fran*)
    721977 total matching documents

    ****************
    The previous query took 1552 millis. I was able to reduce that to 285
    millis just by adding the +'s you suggested:

    Query> name:(amb prop*) and city:(+south +san +fran*)
    State> california
    name:(amb prop*) and city:(+south +san +fran*)
    [email protected]
    (amb prop*) (+city:south +city:san +city:fran*)
    45011 total matching documents


    Incidentally, I say everything that I do with great awe at the power of Lucene
    and respect for those who have made it possible. Please don't take anything
    I say as a gripe - I'm just learning how things work and that's a necessary
    step to take for any new software package of this type. You just have to
    learn the ins and outs and little quirks to be able to take full advantage
    of it.

    Thanks!

Discussion Overview
group: java-user
categories: lucene
posted: Sep 23, '03 at 12:26a
active: Sep 23, '03 at 2:55p
posts: 10
users: 4
website: lucene.apache.org
