Hi,

I am using PhraseQuery with explicitly set term positions and slop=0, in
order to skip stop words. The field in my index is indexed with TermVector
positions.

When I run a query with the stop words skipped, for example "internet for
research" (translated into the PhraseQuery "internet ? research"), I get
results that have a non-stop word in the position where the stop word should
be (e.g. "internet related research"), as well as results with a stop word there.

Is this expected behavior? If so, is there any way to do what I want, which
is for the query to match only results like "internet [stop-word] research"?

Thanks,
-- Avi
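
To make the reported behavior concrete, here is a minimal sketch in plain Java. It does not call Lucene. It only imitates a slop=0 phrase match that compares the positions the query names and nothing else. In Lucene itself the query side would be built with `PhraseQuery.add(Term, int position)`. The class and method names below are hypothetical, and the array-with-`null` model of the index assumes that stop words are dropped at index time while their positions are preserved.

```java
public class PhrasePositionDemo {

    /**
     * Returns true if, for some start offset, every query term is found at
     * start + its relative position. Slots the query does not name are never
     * inspected, so any token (or no token) may occupy them.
     *
     * indexed[i] is the term stored at position i, or null when a stop word
     * was removed there (assumption: positions are preserved).
     */
    public static boolean matchesAtPositions(String[] indexed, String[] terms, int[] positions) {
        int span = 0;
        for (int p : positions) {
            span = Math.max(span, p);
        }
        for (int start = 0; start + span < indexed.length; start++) {
            boolean ok = true;
            for (int i = 0; i < terms.length && ok; i++) {
                ok = terms[i].equals(indexed[start + positions[i]]);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Query "internet ? research": internet at 0, research at 2.
        String[] terms = {"internet", "research"};
        int[] positions = {0, 2};

        // "internet for research" with the stop word "for" removed.
        String[] withStopWord = {"internet", null, "research"};
        // "internet related research": position 1 holds a real term.
        String[] withOtherWord = {"internet", "related", "research"};

        System.out.println(matchesAtPositions(withStopWord, terms, positions));
        System.out.println(matchesAtPositions(withOtherWord, terms, positions));
    }
}
```

Both documents match in this model, because position 1 is never looked at. That is consistent with the results described above.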

  • Erick Erickson at Jan 19, 2010 at 1:29 pm
    How big is your index? Because the simplest thing would be
    to just not remove stopwords at index or query time. Perhaps
    in a duplicate field depending upon your needs.

    Erick
  • Avi Rosenschein at Jan 19, 2010 at 1:39 pm
    Index is pretty large (50GB, divided into 8 shards). I'm afraid I would
    start running into memory issues by adding the stop words (though it is
    definitely something I would like to test at some point).

    My question was more to try to understand if this was known behavior in
    Lucene, since I can't really think of a situation where this would be
    desired (maybe if the user was knowingly searching for "a
    [one-word-wildcard] b"; but a better way to do that would be with slop, not
    with term positions). Wouldn't it be better to have the ExactPhraseScorer
    not allow unmatched holes (i.e. terms in the document that are not matched
    in the query)?

    -- Avi
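
The stricter matching proposed above (no unmatched holes) can be sketched in plain Java as follows. This is not a Lucene API and not how ExactPhraseScorer works. It is a hypothetical illustration that assumes stop words are removed at index time with their positions preserved, so an empty position between two matched terms implies that a stop word stood there. The class and method names are made up for the example.

```java
public class StrictGapDemo {

    /**
     * Like a positional slop=0 match, but additionally requires that every
     * slot inside the phrase window that the query does NOT name is empty
     * in the index (null), i.e. that a stop word was removed there.
     */
    public static boolean matchesWithEmptyGaps(String[] indexed, String[] terms, int[] positions) {
        int span = 0;
        for (int p : positions) {
            span = Math.max(span, p);
        }
        // Mark which relative offsets the query names.
        boolean[] named = new boolean[span + 1];
        for (int p : positions) {
            named[p] = true;
        }
        for (int start = 0; start + span < indexed.length; start++) {
            boolean ok = true;
            // Every query term must sit at its requested position.
            for (int i = 0; i < terms.length && ok; i++) {
                ok = terms[i].equals(indexed[start + positions[i]]);
            }
            // Every unnamed slot in the window must be an empty position.
            for (int off = 0; off <= span && ok; off++) {
                if (!named[off]) {
                    ok = indexed[start + off] == null;
                }
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] terms = {"internet", "research"};
        int[] positions = {0, 2};

        String[] withStopWord = {"internet", null, "research"};
        String[] withOtherWord = {"internet", "related", "research"};

        System.out.println(matchesWithEmptyGaps(withStopWord, terms, positions));
        System.out.println(matchesWithEmptyGaps(withOtherWord, terms, positions));
    }
}
```

Under these assumptions the first document matches and the second does not, which is the behavior asked for: "internet [stop-word] research" only.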

Discussion Overview
group: java-user @
categories: lucene
posted: Jan 19, '10 at 11:51a
active: Jan 19, '10 at 1:39p
posts: 3
users: 2
website: lucene.apache.org
