FAQ
Hi,

I am trying to implement an alternative scoring mechanism in Lucene.
A query of multiple terms is represented as a BooleanQuery with one or more Occur.SHOULD clauses.

The scoring for a document is very simple:
- Assign a score for each queryterm.
! If a document does not contain a queryterm this score can be larger or smaller than 0 !
- Sum these scores.

I run into problems in case a document does not contain a query term. See the following sample results with query "a" or "b".

Document1: x b p j x s n f e p t h q y t b q d x
Calculated score: -0.9631268
Score should be:
-2.7171288 = Sum of:
-1.754002 = score for "a", term frequency 0
-0.9631268 = score for "b", term frequency 2

Document2: r b g a p i h i u x w p f m p m j c m d
Calculated score: -2.6189766
Score should be:
-2.6189766 = Sum of:
-1.3109605 = score for "a", term frequency 1;
-1.3080161 = score for "b", term frequency 1;

The calculated score for document2 is as I have in mind. However, the score for documment1 is incorrect. The BooleanScorer seems to ignore the score for term "a"; as it doesn't exist in the document.

Any ideas how to solve this?

Regards,
Dolf






---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Search Discussions

  • Chris Hostetter at Jun 9, 2006 at 7:28 am
    : ! If a document does not contain a queryterm this score can be larger
    : or smaller than 0 !

    if a document doesn't contain a term, then the scorer for that query will
    never even try to score that document -- regardless of what your
    Similarity class looks like.

    if you really want this kind of behavior, you'll need to roll your own
    TermQuery/TermScorer classes and change next and skipTo to allways advance
    ot the next doc -- regardless of wether or not it matches (you can check
    for that in the score function and act accordingly)



    -Hoss


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Trieschnigg, R.B. \(Dolf\) at Jun 9, 2006 at 11:48 am

    : ! If a document does not contain a queryterm this score
    can be larger
    : or smaller than 0 !

    if a document doesn't contain a term, then the scorer for
    that query will never even try to score that document --
    regardless of what your Similarity class looks like.

    if you really want this kind of behavior, you'll need to roll
    your own TermQuery/TermScorer classes and change next and
    skipTo to allways advance ot the next doc -- regardless of
    wether or not it matches (you can check for that in the score
    function and act accordingly)
    That sounds like a reasonable approach. However, I still require the searching process to be optimized for retrieving the first n hits. (I made my own implementation outside the Lucene search-architecture which was unbelievably slow).

    For example: a query containing two terms: "fast", "car", having document frequencies 300.000 and 20.000 in the index respectively. In a worst case scenario this would require 320.000 document scores to be calculated. I am not really sure how lucene optimizes its search, but I guess it does that by first processing the documents having the highest term frequencies (and thus highest combined score) with these query terms, and pruning the search if the n hits have been found and it's certain that no document can be found which will give a higher score.

    If I would change the next function in my own scorer to process all document ids, I am afraid I will wreck Lucene's optimization method (as I am then not serving the documents in descending term frequency order).

    Perhaps someone can tell if Lucene indeed requires <scorer>.next() to return the documents for a term in descending term frequency order?

    Regards,
    Dolf

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Chris Hostetter at Jun 9, 2006 at 7:09 pm
    : For example: a query containing two terms: "fast", "car", having
    : document frequencies 300.000 and 20.000 in the index respectively. In a
    : worst case scenario this would require 320.000 document scores to be
    : calculated. I am not really sure how lucene optimizes its search, but I
    : guess it does that by first processing the documents having the highest
    : term frequencies (and thus highest combined score) with these query
    : terms, and pruning the search if the n hits have been found and it's
    : certain that no document can be found which will give a higher score.

    Nope. Lucene scores all "matching" documents in the index in increasing
    order of docId -- it can optimize the process using "skipTo" in Scorers
    when it knows that it's not possible for for a document to "match" the
    overall query, so it "skips ahead" to the first doc that can match.

    ie: if you have a boolean query like "+title:cat +title:dog body:snake" it
    knows that unless something matches title:cat and title:dog then there is
    not point in checking wether it matches body:snake -- let alone scoring
    hte doc at all. so BooleanScorer uses skipTo on the individual Scorers
    for title:cat and title:dog to keep skipping ahead untill it finds a doc
    matching both, then it checks if it matches body:snake, and if it does
    *then* it scores things.

    : If I would change the next function in my own scorer to process all
    : document ids, I am afraid I will wreck Lucene's optimization method (as
    : I am then not serving the documents in descending term frequency order).

    it would certianly eliminate lucenes ability to skip ahead (allthough
    not in the way you imagined) ... but based on the way you've described how
    you want scoring to work, it has to score every doc no matter what --
    you've said that even if it doesn't contain the term at all it may get a
    score value which needs to be factored in to the overall score.



    -Hoss


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Otis Gospodnetic at Jun 10, 2006 at 1:41 pm
    Chris,

    Somebody recently asked me about how Lucene processes queries. Other than working on required clauses in a BooleanQuery first, and skipping if there are no matching Docs for them, there are no other query optimization strategies/tricks, are there?

    Otis

    ----- Original Message ----
    From: Chris Hostetter <hossman_lucene@fucit.org>
    To: java-user@lucene.apache.org
    Sent: Friday, June 9, 2006 3:08:35 PM
    Subject: RE: Different scoring mechanism



    : For example: a query containing two terms: "fast", "car", having
    : document frequencies 300.000 and 20.000 in the index respectively. In a
    : worst case scenario this would require 320.000 document scores to be
    : calculated. I am not really sure how lucene optimizes its search, but I
    : guess it does that by first processing the documents having the highest
    : term frequencies (and thus highest combined score) with these query
    : terms, and pruning the search if the n hits have been found and it's
    : certain that no document can be found which will give a higher score.

    Nope. Lucene scores all "matching" documents in the index in increasing
    order of docId -- it can optimize the process using "skipTo" in Scorers
    when it knows that it's not possible for for a document to "match" the
    overall query, so it "skips ahead" to the first doc that can match.

    ie: if you have a boolean query like "+title:cat +title:dog body:snake" it
    knows that unless something matches title:cat and title:dog then there is
    not point in checking wether it matches body:snake -- let alone scoring
    hte doc at all. so BooleanScorer uses skipTo on the individual Scorers
    for title:cat and title:dog to keep skipping ahead untill it finds a doc
    matching both, then it checks if it matches body:snake, and if it does
    *then* it scores things.

    : If I would change the next function in my own scorer to process all
    : document ids, I am afraid I will wreck Lucene's optimization method (as
    : I am then not serving the documents in descending term frequency order).

    it would certianly eliminate lucenes ability to skip ahead (allthough
    not in the way you imagined) ... but based on the way you've described how
    you want scoring to work, it has to score every doc no matter what --
    you've said that even if it doesn't contain the term at all it may get a
    score value which needs to be factored in to the overall score.



    -Hoss


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org





    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Yonik Seeley at Jun 10, 2006 at 2:24 pm

    On 6/10/06, Otis Gospodnetic wrote:
    Other than working on required clauses in a BooleanQuery first, and skipping if there are no matching Docs for them, there are no other query optimization strategies/tricks, are there?

    I think that's pretty much it, depending on what you consider an
    optimization/trick.
    Efficient seeking to a specific term, efficient skipping to a
    particular document number for that term, keeping track of the term
    with the lowest docid with heaps or priority queues, and keeping track
    of the highest scoring docs with a priority queue.

    Higher level optimizations that do query transformations are left as
    an exercise to the application :-)

    -Yonik
    http://incubator.apache.org/solr Solr, the open-source Lucene search server

    Otis

    ----- Original Message ----
    From: Chris Hostetter <hossman_lucene@fucit.org>
    To: java-user@lucene.apache.org
    Sent: Friday, June 9, 2006 3:08:35 PM
    Subject: RE: Different scoring mechanism



    : For example: a query containing two terms: "fast", "car", having
    : document frequencies 300.000 and 20.000 in the index respectively. In a
    : worst case scenario this would require 320.000 document scores to be
    : calculated. I am not really sure how lucene optimizes its search, but I
    : guess it does that by first processing the documents having the highest
    : term frequencies (and thus highest combined score) with these query
    : terms, and pruning the search if the n hits have been found and it's
    : certain that no document can be found which will give a higher score.

    Nope. Lucene scores all "matching" documents in the index in increasing
    order of docId -- it can optimize the process using "skipTo" in Scorers
    when it knows that it's not possible for for a document to "match" the
    overall query, so it "skips ahead" to the first doc that can match.

    ie: if you have a boolean query like "+title:cat +title:dog body:snake" it
    knows that unless something matches title:cat and title:dog then there is
    not point in checking wether it matches body:snake -- let alone scoring
    hte doc at all. so BooleanScorer uses skipTo on the individual Scorers
    for title:cat and title:dog to keep skipping ahead untill it finds a doc
    matching both, then it checks if it matches body:snake, and if it does
    *then* it scores things.

    : If I would change the next function in my own scorer to process all
    : document ids, I am afraid I will wreck Lucene's optimization method (as
    : I am then not serving the documents in descending term frequency order).

    it would certianly eliminate lucenes ability to skip ahead (allthough
    not in the way you imagined) ... but based on the way you've described how
    you want scoring to work, it has to score every doc no matter what --
    you've said that even if it doesn't contain the term at all it may get a
    score value which needs to be factored in to the overall score.



    -Hoss
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Chris Hostetter at Jun 10, 2006 at 8:50 pm
    : Higher level optimizations that do query transformations are left as
    : an exercise to the application :-)

    Word!

    For example, Yonik helped me speed up a fairly hairy query i had a while
    back by realizing that the way i was progromatically generating a query,
    one deeply nested clause was actually constant even though the wrapper
    around it changed depending on the input, and it was mainly a "match all
    docs, except things that amtch a, b, c" type query -- so we converted it
    to a Filter, wrapped it in in a CachingWrapperFilter, then wrapped that in
    a FilteredQuery -- majorly speed things up.

    in general, the best optimization I've seen is to identify the parts of
    your query that aren't really about scoring as much as they are about
    filtering -- and convert them to Filters which can be cached. usually
    they can be applied using the normal search(Query,Filter) methods -- but
    even if they are conditions that you have nested deep down like the one i
    described, don't be afraid to go crazy :)


    -Hoss


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupjava-user @
categorieslucene
postedJun 7, '06 at 1:46p
activeJun 10, '06 at 8:50p
posts7
users4
websitelucene.apache.org

People

Translate

site design / logo © 2022 Grokbase