Got an interesting question about Lucene's behavior, as recently I was
handed something that looks like this:
( +MEDICAL CAT^2 ) OR ( +ANIMAL CAT^-2 )

The intention of the query is to say "if medical is found, then rank cat
[scans] high, but if animal is found then rank cat [a feline] low."

Problem is, my understanding of Lucene tells me that there really are no AND /
OR set operations, and that instead Lucene has a REQUIRED, NOT_REQUIRED,
SHOULD setting for each term. As such, one might be able to "simulate"
certain kinds of AND/OR expressions, but evaluating subexpressions
independently isn't happening.

Is my understanding correct, or will Lucene do this kind of magic? The
actual query I was given to construct programmatically was significantly more
complex.

-wls


  • Erick Erickson at Jan 5, 2009 at 7:47 pm
    As you say, your "real" queries are more complex, but your
    example seems like a simple boost to me joined by an OR clause.

    MEDICAL:CAT^10 OR ANIMAL:CAT

    which you can construct in a BooleanQuery as two
    "SHOULD" clauses.

    The sense of this is that a hit must contain "CAT" in either
    the MEDICAL or the ANIMAL fields, but occurrences in the
    MEDICAL field will tend to sort to the top....

    Remember too that Lucene query logic isn't strictly Boolean...

    Best
    Erick
    On Mon, Jan 5, 2009 at 2:33 PM, Walt Stoneburner wrote:

  • Walt Stoneburner at Jan 7, 2009 at 12:06 am
    Erick,

    Thanks for taking a moment to address my question. I suspect the
    confusion expressed in the answer came from a slight transcription error
    that added extra punctuation.

    In your reply, the query was expressed using fields (note the extra
    colons, which change the query's meaning entirely):
    MEDICAL:CAT^10 OR ANIMAL:CAT

    I'm actually using the defaults and no custom fields, which without the
    colons makes MEDICAL and ANIMAL terms instead. Here's the original query:
    ( +MEDICAL CAT^2 ) OR ( +ANIMAL CAT^-2 )

    Luke, which I was using to analyze this, has a problem with the numerical
    value of negative two. So, let's rewrite the query using a different,
    parseable number, like zero:
    ( +MEDICAL CAT^2 ) OR ( +ANIMAL CAT^0 )

    The question I'm trying to phrase is: Is there a way to make the rank of
    a SHOULD term conditional?

    In the example, I'm trying to express "If the term MEDICAL is found, the
    term CAT ranks high; if the term ANIMAL is found, the term CAT ranks low."

    In your reply, you also stated, "Remember too that Lucene query logic
    isn't strictly Boolean..." This is my understanding as well, so I don't see
    how this could work at all.

    The users I'm dealing with are looking at the query as one might an
    expression in the C or Java language, where it either does the left half or
    the right half. My understanding is that the expression as a whole gets
    reduced to something else entirely.

    And that's where things get weird.

    According to Luke, I get two SHOULD clauses, each with a MUST and a
    SHOULD. As I understood things, a SHOULD *term* merely affects the ranking
    of the results, it doesn't affect what gets brought back. So I'm trying to
    understand what a SHOULD *clause* does in this case. More importantly, what
    does it logically mean to: "should have a must?" That's like saying I have
    an optional mandatory term.

    Or, is Lucene _really_ doing two separate sub-expressions? Looking at the
    data structures generated, it's flying counter to my understanding of what
    has to be happening under the hood.

    Perhaps Lucene really can do this after all?

    And, if not, is there a programmatic way to do this directly with the API?

    Is it even possible to express this construct as a single expression or
    data structure for the API:
    1. +( MEDICAL ANIMAL ) You must have either MEDICAL and/or ANIMAL.
    2. If MEDICAL present, then CAT ranks high, else, if ANIMAL present,
    then CAT ranks low, otherwise the presence of the term CAT has no influence
    on rank.

    Many thanks,
    -wls
  • Chris Hostetter at Jan 15, 2009 at 10:55 pm
    : The question I'm trying to phrase is: Is there a way to make the rank of
    : a SHOULD term conditional?
    :
    : In the example, I'm trying to express "If the term MEDICAL is found, the
    : term CAT ranks high; if the term ANIMAL is found, the term CAT ranks low."

    except that there is an ambiguous situation here: what if a document
    contains both MEDICAL and ANIMAL ?

    you'll probably want a query something like this...

    (+MEDICAL -ANIMAL CAT^10) (+ANIMAL -MEDICAL CAT^0.1) (-ANIMAL -MEDICAL CAT)

    : According to Luke, I get two SHOULD clauses, each with a MUST and a
    : SHOULD. As I understood things, a SHOULD *term* merely affects the ranking
    : of the results, it doesn't affect what gets brought back. So I'm trying to
    : understand what a SHOULD *clause* does in this case. More importantly, what
    : does it logically mean to: "should have a must?" That's like saying I have
    : an optional mandatory term.

    not exactly ... Lucene queries "build up" result sets (hence you can't
    have a purely negative query). when a boolean query doesn't contain any
    MUST clauses, then at least one SHOULD clause must match a document for
    that document to make it into the result set.

    So when your outermost BooleanQuery contains two SHOULD clauses that means
    you need one or the other to match -- if both match, your score gets even
    higher.
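
    The matching rule described above can be sketched in plain Java. This is
    a toy model, not Lucene code: the class BoolModel, its Occur enum, and the
    helpers term(), doc() and example() are hypothetical names made up for
    illustration, a "document" is just a set of terms, and scoring is ignored
    entirely. It only shows why a SHOULD clause that itself contains a MUST is
    not a contradiction.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: mimics the matching rules described in the thread,
// it does not use or reproduce Lucene's real classes.
class BoolModel {
    enum Occur { MUST, SHOULD, MUST_NOT }

    interface Query { boolean matches(Set<String> doc); }

    // A single term matches when the document contains it.
    static Query term(String t) { return doc -> doc.contains(t); }

    static final class Bool implements Query {
        private final List<Query> queries = new ArrayList<>();
        private final List<Occur> occurs = new ArrayList<>();

        Bool add(Query q, Occur o) { queries.add(q); occurs.add(o); return this; }

        @Override
        public boolean matches(Set<String> doc) {
            boolean hasMust = false;
            boolean anyShould = false;
            for (int i = 0; i < queries.size(); i++) {
                boolean hit = queries.get(i).matches(doc);
                switch (occurs.get(i)) {
                    case MUST:
                        hasMust = true;
                        if (!hit) return false;   // a required clause failed
                        break;
                    case MUST_NOT:
                        if (hit) return false;    // a prohibited clause matched
                        break;
                    default:                      // SHOULD
                        anyShould |= hit;
                }
            }
            // With no MUST clauses, at least one SHOULD clause has to match.
            // A purely negative query therefore matches nothing.
            return hasMust || anyShould;
        }
    }

    static Set<String> doc(String... terms) { return new HashSet<>(Arrays.asList(terms)); }

    // Models ( +MEDICAL CAT ) ( +ANIMAL CAT ): two SHOULD clauses, each with a MUST and a SHOULD.
    static Query example() {
        Query left = new Bool().add(term("MEDICAL"), Occur.MUST).add(term("CAT"), Occur.SHOULD);
        Query right = new Bool().add(term("ANIMAL"), Occur.MUST).add(term("CAT"), Occur.SHOULD);
        return new Bool().add(left, Occur.SHOULD).add(right, Occur.SHOULD);
    }

    public static void main(String[] args) {
        Query q = example();
        System.out.println(q.matches(doc("MEDICAL")));       // left sub-query matches
        System.out.println(q.matches(doc("ANIMAL", "CAT"))); // right sub-query matches
        System.out.println(q.matches(doc("CAT")));           // neither sub-query matches
    }
}
```

    In this model the inner MUST only decides whether its own sub-query
    matches; the outer SHOULD then decides whether that sub-query's match is
    enough to admit the document.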

    : Is it even possible to express this construct as a single expression or
    : data structure for the API:
    : 1. +( MEDICAL ANIMAL ) You must have either MEDICAL and/or ANIMAL.
    : 2. If MEDICAL present, then CAT ranks high, else, if ANIMAL present,
    : then CAT ranks low, otherwise the presence of the term CAT has no influence
    : on rank.

    ...ah, see when you elaborate on the details, it becomes easier to spell
    out the query structure...

    (+MEDICAL CAT^10) (+ANIMAL -MEDICAL CAT^0.1)

    in order for one of the main clauses to match, either MEDICAL or ANIMAL
    must match. if MEDICAL matches, CAT scores high; we only care about ANIMAL
    matching if MEDICAL doesn't match -- in which case CAT ranks low.
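
    A rough way to see the effect of that structure is a toy scorer in plain
    Java. Everything here is an assumption for illustration: BoostModel,
    term(), doc() and conditionalCat() are hypothetical names, a matching
    term simply contributes its boost, and a boolean query sums its matching
    MUST and SHOULD clauses. Lucene's real scoring involves much more than
    this, so only the relative ordering is meant to carry over.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy scorer: a matching term contributes its boost and a boolean query sums
// its matching MUST and SHOULD clauses. This is NOT Lucene's real scoring formula.
class BoostModel {
    enum Occur { MUST, SHOULD, MUST_NOT }

    // NaN marks "document does not match", so a boost of zero stays distinguishable.
    static final double NO_MATCH = Double.NaN;

    interface Query { double score(Set<String> doc); }

    static Query term(String t, double boost) {
        return doc -> doc.contains(t) ? boost : NO_MATCH;
    }

    static final class Bool implements Query {
        private final List<Query> queries = new ArrayList<>();
        private final List<Occur> occurs = new ArrayList<>();

        Bool add(Query q, Occur o) { queries.add(q); occurs.add(o); return this; }

        @Override
        public double score(Set<String> doc) {
            boolean hasMust = false;
            boolean anyShould = false;
            double total = 0.0;
            for (int i = 0; i < queries.size(); i++) {
                double s = queries.get(i).score(doc);
                boolean hit = !Double.isNaN(s);
                Occur o = occurs.get(i);
                if (o == Occur.MUST_NOT) {
                    if (hit) return NO_MATCH;      // prohibited clause matched
                } else if (o == Occur.MUST) {
                    hasMust = true;
                    if (!hit) return NO_MATCH;     // required clause failed
                    total += s;
                } else if (hit) {                  // SHOULD clause that matched
                    anyShould = true;
                    total += s;
                }
            }
            return (hasMust || anyShould) ? total : NO_MATCH;
        }
    }

    static Set<String> doc(String... terms) { return new HashSet<>(Arrays.asList(terms)); }

    // Models (+MEDICAL CAT^10) (+ANIMAL -MEDICAL CAT^0.1)
    static Query conditionalCat() {
        Query left = new Bool()
            .add(term("MEDICAL", 1.0), Occur.MUST)
            .add(term("CAT", 10.0), Occur.SHOULD);
        Query right = new Bool()
            .add(term("ANIMAL", 1.0), Occur.MUST)
            .add(term("MEDICAL", 1.0), Occur.MUST_NOT)
            .add(term("CAT", 0.1), Occur.SHOULD);
        return new Bool().add(left, Occur.SHOULD).add(right, Occur.SHOULD);
    }

    public static void main(String[] args) {
        Query q = conditionalCat();
        System.out.println("MEDICAL CAT        -> " + q.score(doc("MEDICAL", "CAT")));
        System.out.println("ANIMAL CAT         -> " + q.score(doc("ANIMAL", "CAT")));
        System.out.println("MEDICAL ANIMAL CAT -> " + q.score(doc("MEDICAL", "ANIMAL", "CAT")));
        System.out.println("CAT                -> " + q.score(doc("CAT")));
    }
}
```

    In this model a document containing both MEDICAL and ANIMAL is scored by
    the first clause only, because -MEDICAL excludes it from the second, and
    a document containing only CAT does not match at all.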




    -Hoss


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
