Hi again,

As the subject suggests, I'm trying to implement a layer of
proximity weighting over Lucene. This has greatly improved search
relevance, but at the same time it has reduced performance by a
substantial amount (see footer).

I am using a hand rolled query of the following form (implemented with
SpanNearQuery, not a sloppy PhraseQuery):
a b c => +(a AND b AND c) OR "a b"~5 OR "b c"~5

The obvious solution, "a b c"~5, does not work for my case, because I
would like to allow for the possibility that a and b are near each other in
one field, while c is in another field.
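
A minimal sketch of how the hand-rolled form above could be generated for any number of terms. It only renders the notation used in this post as a string; the function name is hypothetical, it does not build SpanNearQuery objects, and it assumes the terms are plain words that need no escaping.

```python
def proximity_query(terms, slop=5):
    """Render: +(a AND b AND c) OR "a b"~5 OR "b c"~5 for the given terms."""
    if not terms:
        raise ValueError("need at least one term")
    # Every term is required to match somewhere.
    required = "+(" + " AND ".join(terms) + ")"
    # One proximity clause per adjacent pair of terms.
    pairs = ['"%s %s"~%d' % (left, right, slop)
             for left, right in zip(terms, terms[1:])]
    return " OR ".join([required] + pairs)


print(proximity_query(["a", "b", "c"]))
# +(a AND b AND c) OR "a b"~5 OR "b c"~5
```

A query of n terms produces n - 1 proximity clauses, so "paris hilton goes to jail" carries four of them on top of the required conjunction.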

So, is there something I'm missing that would make this perform well? Would
reordering the clauses or rewriting the query help? If there's no solution in
existing Lucene, would anyone be interested in investigating options with
me?

-Kyle


Somewhat arbitrary benchmarks.
--------------
Before:
$ ./bench.rb "paris hilton"
0.022000 0.000000 0.022000 ( 0.021000)
$ ./bench.rb "paris hilton goes to jail"
0.024000 0.000000 0.024000 ( 0.024000)

After:
$> ./bench.rb "paris hilton"
0.103000 0.000000 0.103000 ( 0.103000)
$> ./bench.rb "paris hilton goes to jail"
1.514000 0.000000 1.514000 ( 1.513000)


  • Chris Hostetter at Oct 5, 2007 at 5:55 pm
    : I am using a hand rolled query of the following form (implemented with
    : SpanNearQuery, not a sloppy PhraseQuery):
    : a b c => +(a AND b AND c) OR "a b"~5 OR "b c"~5
    :
    : The obvious solution, "a b c"~5, is not applicable for my issues, because I
    : would like to allow for the possibility that a and b are near each other in
    : one field, while c is in another field.

    Hmmm.. can you give some more concrete examples of what you mean by this?
    both in terms of the use case you are trying to satisfy, and in terms of
    how your current code works ... you don't have to post code or give away
    trade secrets, just describe it as a black box (ie: what is the input?
    how do you know when to use fieldA vs fieldC? how do you decide when to
    make a span query vs an OR query?)

    based on what you've described so far, it's hard to understand what it is
    you are doing -- which is important to understand in order to help you make it
    better/faster.

    : Somewhat arbitrary benchmarks.

    they do seem fairly arbitrary, especially since there are no units on the
    numbers, and no indication as to what "before" and "after" refer to...


    : --------------
    : Before:
    : $ ./bench.rb "paris hilton"
    : 0.022000 0.000000 0.022000 ( 0.021000)
    : $ ./bench.rb "paris hilton goes to jail"
    : 0.024000 0.000000 0.024000 ( 0.024000)
    :
    : After:
    : $> ./bench.rb "paris hilton"
    : 0.103000 0.000000 0.103000 ( 0.103000)
    : $> ./bench.rb "paris hilton goes to jail"
    : 1.514000 0.000000 1.514000 ( 1.513000)




    -Hoss


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Mike Klaas at Oct 5, 2007 at 6:23 pm

    On 5-Oct-07, at 10:54 AM, Chris Hostetter wrote:

    > : I am using a hand rolled query of the following form (implemented with
    > : SpanNearQuery, not a sloppy PhraseQuery):
    > : a b c => +(a AND b AND c) OR "a b"~5 OR "b c"~5
    > :
    > : The obvious solution, "a b c"~5, is not applicable for my issues, because I
    > : would like to allow for the possibility that a and b are near each other in
    > : one field, while c is in another field.
    >
    > Hmmm.. can you give some more concrete examples of what you mean by this?
    > both in terms of the use case you are trying to satisfy, and in terms of
    > how your current code works ... you don't have to post code or give away
    > trade secrets, just describe it as a black box (ie: what is the input?
    > how do you know when to use fieldA vs fieldC? how do you decide when to
    > make a span query vs an OR query?)
    >
    > based on what you've described so far, it's hard to understand what it is
    > you are doing -- which is important to understand in order to help you make it
    > better/faster.

    I understand the OP to want a PhraseQuery that has an intention
    (rather than side-effect) of doing proximity-based scoring.

    "phrase query here"~1000 is the current hack that performs fine for N
    < 3 query terms, but fails currently for N >= 3 since it requires
    that all the terms be present. For larger queries, this effectively
    nullifies the usefulness of the phrase query approach.

    It doesn't seem to me that writing a variant of PhraseQuery that has
    the desired functionality would be _too_ hard, but I haven't looked
    into it in depth.
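
One way to picture the variant described above is a scorer that rewards nearby query terms without requiring all of them to be present. The following is a toy sketch in plain Python over a token list, not Lucene code; the function name and the 1/(1+gap) weighting are assumptions made for illustration only.

```python
def proximity_boost(doc_tokens, query_terms, slop=5):
    """Toy proximity score: each adjacent pair of query terms that occurs
    within `slop` intervening positions (in either order) adds 1/(1+gap)."""
    # Record every position at which each token occurs.
    positions = {}
    for i, tok in enumerate(doc_tokens):
        positions.setdefault(tok, []).append(i)

    score = 0.0
    for left, right in zip(query_terms, query_terms[1:]):
        if left not in positions or right not in positions:
            continue  # a missing term only forfeits this pair's boost
        dist = min(abs(p - q) for p in positions[left] for q in positions[right])
        gap = max(0, dist - 1)  # number of tokens between the two terms
        if gap <= slop:
            score += 1.0 / (1 + gap)
    return score


query = ["paris", "hilton", "jail"]
full = proximity_boost("paris hilton goes to jail".split(), query)
partial = proximity_boost("paris hilton at home".split(), query)
print(full, partial)
```

The second document lacks "jail" but still earns the boost for the "paris hilton" pair, which is the behaviour a sloppy phrase over all terms cannot provide.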

    -Mike



  • Chris Hostetter at Oct 5, 2007 at 6:29 pm
    : > : would like to allow for the possibility that a and b are near each other in
    : > : one field, while c is in another field.

    : I understand the OP to want a PhraseQuery that has an intention (rather than
    : side-effect) of doing proximity-based scoring.
    :
    : "phrase query here"~1000 is the current hack that performs fine for N < 3
    : query terms, but fails currently for N >= 3 since it requires that all the
    : terms be present. For larger queries, this effectively nullifies the
    : usefulness of the phrase query approach.

    that's what i thought at first too, and it is a problem i'd eventually like
    to tackle ... it was the part about "c" being in a different field from
    "a" and "b" that confused me ... i don't know exactly what is being
    suggested here.




    -Hoss


  • Mike Klaas at Oct 5, 2007 at 6:33 pm

    On 5-Oct-07, at 11:27 AM, Chris Hostetter wrote:
    > that's what i thought first too, and it is a problem i'd eventually like
    > to tackle ... it was the part about "c" being in a different field from
    > "a" and "b" that confused me ... i don't know what that exactly is being
    > suggested here.

    I'm thinking of the dismax model: you still want each keyword to
    match (though possibly in different fields). I don't really think
    that is appropriate to throw into a single query class.
    Having separate match/boost clauses is better.

    -Mike

  • Kyle Maxwell at Oct 5, 2007 at 9:12 pm

    > Hmmm.. can you give some more concrete examples of what you mean by this?
    > both in terms of the use case you are trying to satisfy, and in terms of
    > how your current code works ... you don't have to post code or give away
    > trade secrets, just describe it as a black box (ie: what is the input?,
    > how do you know when to use fieldA vs fieldC,how do you decide when to
    > make a span query vs an OR query?

    I have a title field, and a genre field. A user enters the query:
    harry potter books

    If I could intelligently rewrite queries, this would be better formulated
    as:
    title:"harry potter"~5 genre:books


    Instead, since I don't have that knowledge, I should perhaps rewrite several
    guesses, and take the dismax. These guesses are equivalent to passing the
    following query through the MultiFieldQueryParser:

    ("harry potter"~5 AND books) OR (harry AND "potter books"~5)
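
A minimal sketch of how such guesses could be enumerated: each adjacent pair of words becomes a sloppy phrase and the remaining words are ANDed in as single terms. Only the three-word case is shown in this post, so the generalization to longer queries, the function name, and the default slop are assumptions.

```python
def pair_guesses(terms, slop=5):
    """Build one guess per adjacent word pair and OR the guesses together."""
    if len(terms) < 2:
        return " AND ".join(terms)  # nothing to pair up
    guesses = []
    for i in range(len(terms) - 1):
        # Treat words i and i+1 as a sloppy phrase; keep the rest as plain terms.
        phrase = '"%s %s"~%d' % (terms[i], terms[i + 1], slop)
        clauses = terms[:i] + [phrase] + terms[i + 2:]
        guesses.append("(" + " AND ".join(clauses) + ")")
    return " OR ".join(guesses)


print(pair_guesses(["harry", "potter", "books"]))
# ("harry potter"~5 AND books) OR (harry AND "potter books"~5)
```

The number of guesses grows with the number of words, and each guess is then expanded across every searched field, which is consistent with the longer query slowing down far more than the two-word one.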

    This is rather slow. In the before/after benchmarks, the numbers are in
    seconds for one query, before and after this transformation has been made.


    Hope that clears things up

    -Kyle
  • Chris Hostetter at Oct 7, 2007 at 10:20 pm
    : If I could intelligently rewrite queries, this would be better formulated
    : as:
    : title:"harry potter"~5 genre:books
    :
    : Instead, since I don't have that knowledge, I should perhaps rewrite several
    : guesses, and take the dismax. These guesses are equivalent to passing the

    right. okay. the brute force approach of trying all possible
    permutations is really the only thing you can do unless you can think
    of ways to translate the "intelligence" that you would use to rewrite
    the query into code. One start: test each "word" against each field and
    see if the idf is unusually high, if it is then maybe it's a good idea to
    pull that word out of the phrase and use it to query that specific field
    ... maybe you only do this on words at the beginning and end of the input?

    the problem becomes a lot simpler when you write code specific to your
    domain .. if you know you are dealing with "products" and you have a
    "type" field that only ever contains 1 of 50 values which frequently
    appear in search input (ie: books, couch, dvd) then testing that field
    first makes a lot of sense ... the problem becomes much harder when you
    want it to work on any generic index under the sun without knowing
    anything about the user behavior.
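
The domain-specific case could look roughly like the sketch below: check the first and last words of the input against the known values of the small field, and route a match to that field. The value set, the field names, and the function name are hypothetical, and the output is only a query string, not a parsed Lucene query.

```python
KNOWN_TYPES = {"books", "couch", "dvd"}  # hypothetical values of the small field


def route_edge_term(query, known_types=KNOWN_TYPES,
                    text_field="title", type_field="type", slop=5):
    """Pull a known type value off either end of the input, if present."""
    words = query.lower().split()
    if not words:
        raise ValueError("empty query")
    type_clause = None
    # Only look at the edges, and never consume the only word.
    if len(words) > 1 and words[-1] in known_types:
        type_clause, words = type_field + ":" + words[-1], words[:-1]
    elif len(words) > 1 and words[0] in known_types:
        type_clause, words = type_field + ":" + words[0], words[1:]
    # The remaining words become a sloppy phrase on the text field.
    text = '"%s"~%d' % (" ".join(words), slop) if len(words) > 1 else words[0]
    result = text_field + ":" + text
    return result + " " + type_clause if type_clause else result


print(route_edge_term("harry potter books", type_field="genre"))
# title:"harry potter"~5 genre:books
```

This replaces the brute-force set of guesses with a single query whenever an edge word is a known value, and falls back to a plain phrase on the text field otherwise.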

    : This is rather slow. In the before/after, the numbers are in seconds,
    : for one query, before and after this transformation has been made.

    oh, well yeah ... no surprise there. you can't compare benchmarks between
    two queries that do completely different things -- a "simple" query is
    probably always going to be faster than a more complex query that matches a
    different set of documents, or does a "better" job of scoring the same
    set as the simple query. it's an apples and oranges thing.


    -Hoss



Discussion Overview
group: java-user @
categories: lucene
posted: Oct 4, '07 at 12:17a
active: Oct 7, '07 at 10:20p
posts: 7
users: 3
website: lucene.apache.org
