FAQ
Hi,

Let's say we have an index having few documents indexed using
StopAnalyzer.ENGLISH_STOP_WORDS_SET. The user issues two queries:
1) foo:bar
2) baz:"there is"

Let's assume that the first query yields some results because there
are documents matching that query.

The second query contains two stopwords ("there" and "is") and yields
0 results. The reason for this is because when baz:"there is" is
parsed, it ends up as a void query as both "there" and "is" are
stopwords (technically speaking, this is converted to an empty
BooleanQuery having no clauses). So far so good.

However, any of the following combined queries

+foo:bar +baz:"there is"
foo:bar AND baz:"there is"

behave exactly the same way as query +foo:bar, that is, brings back
some results. The second AND part which is supposed to yield no
results is completely ignored.

One might argue that when ANDing both conditions have to be met, that
is, documents having foo=bar and baz being empty have to be retrieved,
as when issued seperately, baz:"there is" yields 0 results.

It seem contradictory as an atomic query component has different
impact on the overall query depending on the context. Is there any
logical explanation for this? Can this be addressed in any way,
preferably without writing own QueryAnalyzer?

If this makes any difference, observed behaviour happens under Lucene v3.0.2.

Regards,
Mindaugas

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Search Discussions

  • Mindaugas Žakšauskas at May 23, 2011 at 11:08 am
    Not much luck so far :(

    Just in case if anyone wants to earn some virtual dosh, I have added
    some 50 bonus points to this question on StackOverflow:

    http://stackoverflow.com/questions/6044061/lucene-query-parsing-behaviour-joining-query-parts-with-and

    I also promise to post a solution here if anything satisfactory turns up.

    m.

    2011/5/17 Mindaugas Žakšauskas <mindas@gmail.com>:
    Hi,

    Let's say we have an index having few documents indexed using
    StopAnalyzer.ENGLISH_STOP_WORDS_SET. The user issues two queries:
    1) foo:bar
    2) baz:"there is"

    Let's assume that the first query yields some results because there
    are documents matching that query.

    The second query contains two stopwords ("there" and "is") and yields
    0 results. The reason for this is because when baz:"there is" is
    parsed, it ends up as a void query as both "there" and "is" are
    stopwords (technically speaking, this is converted to an empty
    BooleanQuery having no clauses). So far so good.

    However, any of the following combined queries

    +foo:bar +baz:"there is"
    foo:bar AND baz:"there is"

    behave exactly the same way as query +foo:bar, that is, brings back
    some results. The second AND part which is supposed to yield no
    results is completely ignored.

    One might argue that when ANDing both conditions have to be met, that
    is, documents having foo=bar and baz being empty have to be retrieved,
    as when issued seperately, baz:"there is" yields 0 results.

    It seem contradictory as an atomic query component has different
    impact on the overall query depending on the context. Is there any
    logical explanation for this? Can this be addressed in any way,
    preferably without writing own QueryAnalyzer?

    If this makes any difference, observed behaviour happens under Lucene v3.0.2.

    Regards,
    Mindaugas
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Erick Erickson at May 23, 2011 at 2:13 pm
    Hmmm, somehow I missed this days ago....

    Anyway, the Lucene query parsing process isn't quite Boolean logic.
    I encourage you to think in terms of "required", "optional", and
    "prohibited".

    Both queries are equivalent, to see this try attaching &debugQuery=on
    to your URL and look at the "parsed query" in the debug info....

    Anyway, to your qestion.
    +foo:bar +baz:"there is"

    reads that "bar" must appear in the field "foo". So far so good.
    But it's also required that baz contain the empty clause, which
    is different than saying baz must be empty. One can argue that
    any field contains, by definition, nothing.

    But imagine the impact of what you're requesting. If all stop words
    get removed, then no query would ever match yours. Which
    would be very counter-intuitive IMO. Your users have no clue
    that you've removed stopwords, so they'll sit there saying "Look, I
    KNOW that "bar" was in foo and I KNOW that "there is" was in
    baz, why the heck didn't this cursed system find my doc?

    Anyway, I don't think you really want this behavior in the
    stopword removal case. If you can post some use-cases where this
    would be desirable, maybe we can noodle about a solution....

    Best
    Erick


    2011/5/23 Mindaugas Žakšauskas <mindas@gmail.com>:
    Not much luck so far :(

    Just in case if anyone wants to earn some virtual dosh, I have added
    some 50 bonus points to this question on StackOverflow:

    http://stackoverflow.com/questions/6H044061/lucene-query-parsing-behaviour-joining-query-parts-with-and

    I also promise to post a solution here if anything satisfactory turns up.

    m.

    2011/5/17 Mindaugas Žakšauskas <mindas@gmail.com>:
    Hi,

    Let's say we have an index having few documents indexed using
    StopAnalyzer.ENGLISH_STOP_WORDS_SET. The user issues two queries:
    1) foo:bar
    2) baz:"there is"

    Let's assume that the first query yields some results because there
    are documents matching that query.

    The second query contains two stopwords ("there" and "is") and yields
    0 results. The reason for this is because when baz:"there is" is
    parsed, it ends up as a void query as both "there" and "is" are
    stopwords (technically speaking, this is converted to an empty
    BooleanQuery having no clauses). So far so good.

    However, any of the following combined queries

    +foo:bar +baz:"there is"
    foo:bar AND baz:"there is"

    behave exactly the same way as query +foo:bar, that is, brings back
    some results. The second AND part which is supposed to yield no
    results is completely ignored.

    One might argue that when ANDing both conditions have to be met, that
    is, documents having foo=bar and baz being empty have to be retrieved,
    as when issued seperately, baz:"there is" yields 0 results.

    It seem contradictory as an atomic query component has different
    impact on the overall query depending on the context. Is there any
    logical explanation for this? Can this be addressed in any way,
    preferably without writing own QueryAnalyzer?

    If this makes any difference, observed behaviour happens under Lucene v3.0.2.

    Regards,
    Mindaugas
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Mindaugas Žakšauskas at May 23, 2011 at 2:37 pm
    Hi Erick,

    I think answer to this question depends which hat you put on.

    If you put search engine hat (or do similar things in, i.e. Google),
    the results will be the same as what Lucene does at the moment. And
    that's fair enough - getting more results in search engine world is
    almost always better than getting less. Even if a bunch of slightly
    irrelevant results is returned, nobody cares.

    But if you put a database hat, the world view suddenly changes. I am
    sure there are plenty of people who use Lucene in situations where
    they need exact matches and any excess results are not desirable.

    The root of the evil here is coming from the fact that stopwords are
    not indexed and reasonable defaults have to be assumed in different
    situations. Thinking of this, to return all data for stopword-only
    query would probably be least expected and I don't disagree on your
    argument about the mixed case, too.

    This probably leaves me with a single option which is not to use
    stopwords at all, allowing me to get the best of the both worlds. Does
    anyone have any experience on how much of increased index size
    (roughly) can I expect?

    Regards,
    Mindaugas
    On Mon, May 23, 2011 at 3:13 PM, Erick Erickson wrote:
    Hmmm, somehow I missed this days ago....

    Anyway, the Lucene query parsing process isn't quite Boolean logic.
    I encourage you to think in terms of "required", "optional", and
    "prohibited".

    Both queries are equivalent, to see this try attaching &debugQuery=on
    to your URL and look at the "parsed query" in the debug info....

    Anyway, to your qestion.
    +foo:bar +baz:"there is"

    reads that "bar" must appear in the field "foo". So far so good.
    But it's also required that baz contain the empty clause, which
    is different than saying baz must be empty. One can argue that
    any field contains, by definition, nothing.

    But imagine the impact of what you're requesting. If all stop words
    get removed, then no query would ever match yours. Which
    would be very counter-intuitive IMO. Your users have no clue
    that you've removed stopwords, so they'll sit there saying "Look, I
    KNOW that "bar" was in foo and I KNOW that "there is" was in
    baz, why the heck didn't this cursed system find my doc?

    Anyway, I don't think you really want this behavior in the
    stopword removal case. If you can post some use-cases where this
    would be desirable, maybe we can noodle about a solution....

    Best
    Erick


    2011/5/23 Mindaugas Žakšauskas <mindas@gmail.com>:
    Not much luck so far :(

    Just in case if anyone wants to earn some virtual dosh, I have added
    some 50 bonus points to this question on StackOverflow:

    http://stackoverflow.com/questions/6H044061/lucene-query-parsing-behaviour-joining-query-parts-with-and

    I also promise to post a solution here if anything satisfactory turns up.

    m.

    2011/5/17 Mindaugas Žakšauskas <mindas@gmail.com>:
    Hi,

    Let's say we have an index having few documents indexed using
    StopAnalyzer.ENGLISH_STOP_WORDS_SET. The user issues two queries:
    1) foo:bar
    2) baz:"there is"

    Let's assume that the first query yields some results because there
    are documents matching that query.

    The second query contains two stopwords ("there" and "is") and yields
    0 results. The reason for this is because when baz:"there is" is
    parsed, it ends up as a void query as both "there" and "is" are
    stopwords (technically speaking, this is converted to an empty
    BooleanQuery having no clauses). So far so good.

    However, any of the following combined queries

    +foo:bar +baz:"there is"
    foo:bar AND baz:"there is"

    behave exactly the same way as query +foo:bar, that is, brings back
    some results. The second AND part which is supposed to yield no
    results is completely ignored.

    One might argue that when ANDing both conditions have to be met, that
    is, documents having foo=bar and baz being empty have to be retrieved,
    as when issued seperately, baz:"there is" yields 0 results.

    It seem contradictory as an atomic query component has different
    impact on the overall query depending on the context. Is there any
    logical explanation for this? Can this be addressed in any way,
    preferably without writing own QueryAnalyzer?

    If this makes any difference, observed behaviour happens under Lucene v3.0.2.

    Regards,
    Mindaugas
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Erick Erickson at May 29, 2011 at 2:25 am
    Sorry, I was at Lucene Revolution, and out of circulation for a while.

    I agree with your point that "it depends" (what correct behavior is).
    Seems like a lot of Solr/Lucene has that answer...

    But as to stop words, the trend lately is to NOT remove them
    anyway. The whole stop word issue comes from a long time
    ago when disk space was much more dear.

    As to how much it increases the index size, I don't have
    a clue off the top of my head...

    Best
    Erick

    2011/5/23 Mindaugas Žakšauskas <mindas@gmail.com>:
    Hi Erick,

    I think answer to this question depends which hat you put on.

    If you put search engine hat (or do similar things in, i.e. Google),
    the results will be the same as what Lucene does at the moment. And
    that's fair enough - getting more results in search engine world is
    almost always better than getting less. Even if a bunch of slightly
    irrelevant results is returned, nobody cares.

    But if you put a database hat, the world view suddenly changes. I am
    sure there are plenty of people who use Lucene in situations where
    they need exact matches and any excess results are not desirable.

    The root of the evil here is coming from the fact that stopwords are
    not indexed and reasonable defaults have to be assumed in different
    situations. Thinking of this, to return all data for stopword-only
    query would probably be least expected and I don't disagree on your
    argument about the mixed case, too.

    This probably leaves me with a single option which is not to use
    stopwords at all, allowing me to get the best of the both worlds. Does
    anyone have any experience on how much of increased index size
    (roughly) can I expect?

    Regards,
    Mindaugas
    On Mon, May 23, 2011 at 3:13 PM, Erick Erickson wrote:
    Hmmm, somehow I missed this days ago....

    Anyway, the Lucene query parsing process isn't quite Boolean logic.
    I encourage you to think in terms of "required", "optional", and
    "prohibited".

    Both queries are equivalent, to see this try attaching &debugQuery=on
    to your URL and look at the "parsed query" in the debug info....

    Anyway, to your qestion.
    +foo:bar +baz:"there is"

    reads that "bar" must appear in the field "foo". So far so good.
    But it's also required that baz contain the empty clause, which
    is different than saying baz must be empty. One can argue that
    any field contains, by definition, nothing.

    But imagine the impact of what you're requesting. If all stop words
    get removed, then no query would ever match yours. Which
    would be very counter-intuitive IMO. Your users have no clue
    that you've removed stopwords, so they'll sit there saying "Look, I
    KNOW that "bar" was in foo and I KNOW that "there is" was in
    baz, why the heck didn't this cursed system find my doc?

    Anyway, I don't think you really want this behavior in the
    stopword removal case. If you can post some use-cases where this
    would be desirable, maybe we can noodle about a solution....

    Best
    Erick


    2011/5/23 Mindaugas Žakšauskas <mindas@gmail.com>:
    Not much luck so far :(

    Just in case if anyone wants to earn some virtual dosh, I have added
    some 50 bonus points to this question on StackOverflow:

    http://stackoverflow.com/questions/6H044061/lucene-query-parsing-behaviour-joining-query-parts-with-and

    I also promise to post a solution here if anything satisfactory turns up.

    m.

    2011/5/17 Mindaugas Žakšauskas <mindas@gmail.com>:
    Hi,

    Let's say we have an index having few documents indexed using
    StopAnalyzer.ENGLISH_STOP_WORDS_SET. The user issues two queries:
    1) foo:bar
    2) baz:"there is"

    Let's assume that the first query yields some results because there
    are documents matching that query.

    The second query contains two stopwords ("there" and "is") and yields
    0 results. The reason for this is because when baz:"there is" is
    parsed, it ends up as a void query as both "there" and "is" are
    stopwords (technically speaking, this is converted to an empty
    BooleanQuery having no clauses). So far so good.

    However, any of the following combined queries

    +foo:bar +baz:"there is"
    foo:bar AND baz:"there is"

    behave exactly the same way as query +foo:bar, that is, brings back
    some results. The second AND part which is supposed to yield no
    results is completely ignored.

    One might argue that when ANDing both conditions have to be met, that
    is, documents having foo=bar and baz being empty have to be retrieved,
    as when issued seperately, baz:"there is" yields 0 results.

    It seem contradictory as an atomic query component has different
    impact on the overall query depending on the context. Is there any
    logical explanation for this? Can this be addressed in any way,
    preferably without writing own QueryAnalyzer?

    If this makes any difference, observed behaviour happens under Lucene v3.0.2.

    Regards,
    Mindaugas
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupjava-user @
categorieslucene
postedMay 17, '11 at 2:06p
activeMay 29, '11 at 2:25a
posts5
users2
websitelucene.apache.org

People

Translate

site design / logo © 2021 Grokbase