Sorry, the image wasn't sent:
http://wwwhome.cs.utwente.nl/~trieschn/bm25.PNG

-----Original Message-----
From: Trieschnigg, R.B. (Dolf)
Sent: Friday, 17 February 2006 10:54
To: [email protected]
Subject: RE: BM25 Similarity implementation
I would like to implement the Okapi BM25 weighting function using my own
Similarity implementation. Unfortunately, BM25 requires the document length
in the score calculation, which is not provided by the Scorer.
How do you want to measure document length? If the number of tokens is an
acceptable measure, then the norm contains 1/sqrt(numTokens) by default. You
can modify your Similarity.lengthNorm() implementation to not perform the
sqrt, or square the norm.
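A minimal sketch of that suggestion (the class name is ours, assuming the
Lucene 1.9-era Similarity API, where DefaultSimilarity.lengthNorm() returns
1/sqrt(numTokens)):

    import org.apache.lucene.search.DefaultSimilarity;

    public class LengthPreservingSimilarity extends DefaultSimilarity {
        // Drop the sqrt: store 1/numTokens instead of 1/sqrt(numTokens),
        // so the document length can be recovered as 1/norm at search
        // time. Norms are quantized to a single byte at index time, so
        // the recovered length is only approximate.
        public float lengthNorm(String fieldName, int numTokens) {
            return 1f / numTokens;
        }
    }

The same Similarity has to be set on both the IndexWriter
(IndexWriter.setSimilarity()) and the Searcher (Searcher.setSimilarity()),
otherwise the stored norms and the scoring code will disagree.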
I assume the number of tokens will be a good estimate. I've included an image
with the algorithm (my ASCII art isn't that good).
Legend of the figure:
- k1, k3 and b are constants
- tf is the within-document term frequency
- df is the document frequency
- N is the collection size
- r is the number of relevant documents containing a particular term
  (assumed to be 0 without relevance information)
- R is the number of items known to be relevant to a specific topic
  (assumed to be 0 without relevance information)
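In case the image doesn't come through for everyone: the legend matches the
standard Robertson/Spärck Jones BM25 formulation, so the figure presumably
shows something along the lines of (dl, avdl and qtf below come from that
standard form, not from the figure itself):

    score(d, q) = \sum_{t \in q}
        \log \frac{(r + 0.5)/(R - r + 0.5)}
                  {(df - r + 0.5)/(N - df - R + r + 0.5)}
        \cdot \frac{(k_1 + 1)\,tf}{K + tf}
        \cdot \frac{(k_3 + 1)\,qtf}{k_3 + qtf},
    \qquad K = k_1 \left( (1 - b) + b \cdot \frac{dl}{avdl} \right)

where dl is the document length, avdl is the average document length in the
collection, and qtf is the within-query term frequency.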
As far as I understand, Lucene multiplies the squared weight by the result of
Similarity.lengthNorm(), but BM25 requires the document length itself in the
calculation of the document term weighting (as far as I know, the influence
of the normalization cannot be extracted as a constant multiplier, since the
length appears inside the denominator K + tf). Am I missing something here?
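One possible way around it, sketched under the assumption that the
LengthPreservingSimilarity above was used at index time, would be to read the
norms array directly in a custom Scorer and decode the length back:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Similarity;

    public class DocLengths {
        // Decodes the per-document norm for a field back into an
        // approximate token count, assuming 1/numTokens was stored
        // by the LengthPreservingSimilarity sketch above.
        public static float docLength(IndexReader reader, String field,
                                      int docId) throws IOException {
            byte[] norms = reader.norms(field);  // one byte per document
            float norm = Similarity.decodeNorm(norms[docId]);
            return 1f / norm;
        }
    }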
Dolf