FAQ
Hi,

I am trying to implement an alternative scoring mechanism in Lucene.
A query of multiple terms is represented as a BooleanQuery with one or more Occur.SHOULD clauses.

The scoring for a document is very simple:
- Assign a score for each queryterm.
! If a document does not contain a queryterm this score can be larger or smaller than 0 !
- Sum these scores.

I run into problems in case a document does not contain a query term. See the following sample results with query "a" or "b".

Document1: x b p j x s n f e p t h q y t b q d x
Calculated score: -0.9631268
Score should be:
-2.7171288 = Sum of:
-1.754002 = score for "a", term frequency 0
-0.9631268 = score for "b", term frequency 2

Document2: r b g a p i h i u x w p f m p m j c m d
Calculated score: -2.6189766
Score should be:
-2.6189766 = Sum of:
-1.3109605 = score for "a", term frequency 1;
-1.3080161 = score for "b", term frequency 1;

The calculated score for document2 is as I have in mind. However, the score for documment1 is incorrect. The BooleanScorer seems to ignore the score for term "a"; as it doesn't exist in the document.

Any ideas how to solve this?

Regards,
Dolf

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org

Search Discussions

•  at Jun 9, 2006 at 7:28 am ⇧
: ! If a document does not contain a queryterm this score can be larger
: or smaller than 0 !

if a document doesn't contain a term, then the scorer for that query will
never even try to score that document -- regardless of what your
Similarity class looks like.

if you really want this kind of behavior, you'll need to roll your own
TermQuery/TermScorer classes and change next and skipTo to allways advance
ot the next doc -- regardless of wether or not it matches (you can check
for that in the score function and act accordingly)

-Hoss

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
•  at Jun 9, 2006 at 11:48 am ⇧

: ! If a document does not contain a queryterm this score
can be larger
: or smaller than 0 !

if a document doesn't contain a term, then the scorer for
that query will never even try to score that document --
regardless of what your Similarity class looks like.

if you really want this kind of behavior, you'll need to roll
your own TermQuery/TermScorer classes and change next and
skipTo to allways advance ot the next doc -- regardless of
wether or not it matches (you can check for that in the score
function and act accordingly)
That sounds like a reasonable approach. However, I still require the searching process to be optimized for retrieving the first n hits. (I made my own implementation outside the Lucene search-architecture which was unbelievably slow).

For example: a query containing two terms: "fast", "car", having document frequencies 300.000 and 20.000 in the index respectively. In a worst case scenario this would require 320.000 document scores to be calculated. I am not really sure how lucene optimizes its search, but I guess it does that by first processing the documents having the highest term frequencies (and thus highest combined score) with these query terms, and pruning the search if the n hits have been found and it's certain that no document can be found which will give a higher score.

If I would change the next function in my own scorer to process all document ids, I am afraid I will wreck Lucene's optimization method (as I am then not serving the documents in descending term frequency order).

Perhaps someone can tell if Lucene indeed requires <scorer>.next() to return the documents for a term in descending term frequency order?

Regards,
Dolf

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
•  at Jun 9, 2006 at 7:09 pm ⇧
: For example: a query containing two terms: "fast", "car", having
: document frequencies 300.000 and 20.000 in the index respectively. In a
: worst case scenario this would require 320.000 document scores to be
: calculated. I am not really sure how lucene optimizes its search, but I
: guess it does that by first processing the documents having the highest
: term frequencies (and thus highest combined score) with these query
: terms, and pruning the search if the n hits have been found and it's
: certain that no document can be found which will give a higher score.

Nope. Lucene scores all "matching" documents in the index in increasing
order of docId -- it can optimize the process using "skipTo" in Scorers
when it knows that it's not possible for for a document to "match" the
overall query, so it "skips ahead" to the first doc that can match.

ie: if you have a boolean query like "+title:cat +title:dog body:snake" it
knows that unless something matches title:cat and title:dog then there is
not point in checking wether it matches body:snake -- let alone scoring
hte doc at all. so BooleanScorer uses skipTo on the individual Scorers
for title:cat and title:dog to keep skipping ahead untill it finds a doc
matching both, then it checks if it matches body:snake, and if it does
*then* it scores things.

: If I would change the next function in my own scorer to process all
: document ids, I am afraid I will wreck Lucene's optimization method (as
: I am then not serving the documents in descending term frequency order).

it would certianly eliminate lucenes ability to skip ahead (allthough
not in the way you imagined) ... but based on the way you've described how
you want scoring to work, it has to score every doc no matter what --
you've said that even if it doesn't contain the term at all it may get a
score value which needs to be factored in to the overall score.

-Hoss

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
•  at Jun 10, 2006 at 1:41 pm ⇧
Chris,

Somebody recently asked me about how Lucene processes queries. Other than working on required clauses in a BooleanQuery first, and skipping if there are no matching Docs for them, there are no other query optimization strategies/tricks, are there?

Otis

----- Original Message ----
From: Chris Hostetter <hossman_lucene@fucit.org>
To: java-user@lucene.apache.org
Sent: Friday, June 9, 2006 3:08:35 PM
Subject: RE: Different scoring mechanism

: For example: a query containing two terms: "fast", "car", having
: document frequencies 300.000 and 20.000 in the index respectively. In a
: worst case scenario this would require 320.000 document scores to be
: calculated. I am not really sure how lucene optimizes its search, but I
: guess it does that by first processing the documents having the highest
: term frequencies (and thus highest combined score) with these query
: terms, and pruning the search if the n hits have been found and it's
: certain that no document can be found which will give a higher score.

Nope. Lucene scores all "matching" documents in the index in increasing
order of docId -- it can optimize the process using "skipTo" in Scorers
when it knows that it's not possible for for a document to "match" the
overall query, so it "skips ahead" to the first doc that can match.

ie: if you have a boolean query like "+title:cat +title:dog body:snake" it
knows that unless something matches title:cat and title:dog then there is
not point in checking wether it matches body:snake -- let alone scoring
hte doc at all. so BooleanScorer uses skipTo on the individual Scorers
for title:cat and title:dog to keep skipping ahead untill it finds a doc
matching both, then it checks if it matches body:snake, and if it does
*then* it scores things.

: If I would change the next function in my own scorer to process all
: document ids, I am afraid I will wreck Lucene's optimization method (as
: I am then not serving the documents in descending term frequency order).

it would certianly eliminate lucenes ability to skip ahead (allthough
not in the way you imagined) ... but based on the way you've described how
you want scoring to work, it has to score every doc no matter what --
you've said that even if it doesn't contain the term at all it may get a
score value which needs to be factored in to the overall score.

-Hoss

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
•  at Jun 10, 2006 at 2:24 pm ⇧

On 6/10/06, Otis Gospodnetic wrote:
Other than working on required clauses in a BooleanQuery first, and skipping if there are no matching Docs for them, there are no other query optimization strategies/tricks, are there?

I think that's pretty much it, depending on what you consider an
optimization/trick.
Efficient seeking to a specific term, efficient skipping to a
particular document number for that term, keeping track of the term
with the lowest docid with heaps or priority queues, and keeping track
of the highest scoring docs with a priority queue.

Higher level optimizations that do query transformations are left as
an exercise to the application :-)

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

Otis

----- Original Message ----
From: Chris Hostetter <hossman_lucene@fucit.org>
To: java-user@lucene.apache.org
Sent: Friday, June 9, 2006 3:08:35 PM
Subject: RE: Different scoring mechanism

: For example: a query containing two terms: "fast", "car", having
: document frequencies 300.000 and 20.000 in the index respectively. In a
: worst case scenario this would require 320.000 document scores to be
: calculated. I am not really sure how lucene optimizes its search, but I
: guess it does that by first processing the documents having the highest
: term frequencies (and thus highest combined score) with these query
: terms, and pruning the search if the n hits have been found and it's
: certain that no document can be found which will give a higher score.

Nope. Lucene scores all "matching" documents in the index in increasing
order of docId -- it can optimize the process using "skipTo" in Scorers
when it knows that it's not possible for for a document to "match" the
overall query, so it "skips ahead" to the first doc that can match.

ie: if you have a boolean query like "+title:cat +title:dog body:snake" it
knows that unless something matches title:cat and title:dog then there is
not point in checking wether it matches body:snake -- let alone scoring
hte doc at all. so BooleanScorer uses skipTo on the individual Scorers
for title:cat and title:dog to keep skipping ahead untill it finds a doc
matching both, then it checks if it matches body:snake, and if it does
*then* it scores things.

: If I would change the next function in my own scorer to process all
: document ids, I am afraid I will wreck Lucene's optimization method (as
: I am then not serving the documents in descending term frequency order).

it would certianly eliminate lucenes ability to skip ahead (allthough
not in the way you imagined) ... but based on the way you've described how
you want scoring to work, it has to score every doc no matter what --
you've said that even if it doesn't contain the term at all it may get a
score value which needs to be factored in to the overall score.

-Hoss
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
•  at Jun 10, 2006 at 8:50 pm ⇧
: Higher level optimizations that do query transformations are left as
: an exercise to the application :-)

Word!

For example, Yonik helped me speed up a fairly hairy query i had a while
back by realizing that the way i was progromatically generating a query,
one deeply nested clause was actually constant even though the wrapper
around it changed depending on the input, and it was mainly a "match all
docs, except things that amtch a, b, c" type query -- so we converted it
to a Filter, wrapped it in in a CachingWrapperFilter, then wrapped that in
a FilteredQuery -- majorly speed things up.

in general, the best optimization I've seen is to identify the parts of
filtering -- and convert them to Filters which can be cached. usually
they can be applied using the normal search(Query,Filter) methods -- but
even if they are conditions that you have nested deep down like the one i
described, don't be afraid to go crazy :)

-Hoss

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org

Related Discussions

Discussion Overview
 group java-user categories lucene posted Jun 7, '06 at 1:46p active Jun 10, '06 at 8:50p posts 7 users 4 website lucene.apache.org

4 users in discussion

Content

People

Support

Translate

site design / logo © 2022 Grokbase