BM25 Similarity implementation
Hi,

I would like to implement the Okapi BM25 weighting function using my own Similarity implementation. Unfortunately, BM25 requires the document length in the score calculation, and the Scorer does not provide it.

Does anyone know a solution to this problem?

I've looked for Similarity implementations other than the default one used by Lucene, but could not find any. Any suggestions?

Thanks.
Dolf

  • Doug Cutting at Feb 16, 2006 at 6:04 pm

    Trieschnigg, R.B. (Dolf) wrote:
    I would like to implement the Okapi BM25 weighting function using my
    own Similarity implementation. Unfortunately, BM25 requires the
    document length in the score calculation, and the Scorer does not
    provide it.
    How do you want to measure document length? If the number of tokens is
    an acceptable measure, then the norm contains 1/sqrt(numTokens) by
    default. You can modify your Similarity.lengthNorm() implementation to
    not perform the sqrt, or square the norm.

    Doug
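
    A minimal sketch of this idea, assuming the Lucene 1.9-era Similarity
    API (lengthNorm() and the static decodeNorm() both exist there); the
    class name is illustrative:

        import org.apache.lucene.search.DefaultSimilarity;
        import org.apache.lucene.search.Similarity;

        // Stores 1/numTokens in the norm instead of the default
        // 1/sqrt(numTokens), so the document length is recoverable at
        // search time (up to the lossy byte encoding of norms).
        public class LengthSimilarity extends DefaultSimilarity {
            public float lengthNorm(String fieldName, int numTokens) {
                return numTokens > 0 ? 1.0f / numTokens : 0.0f;
            }

            // Recovers the approximate token count from a stored norm.
            public static float numTokens(byte norm) {
                float decoded = Similarity.decodeNorm(norm);
                return decoded > 0 ? 1.0f / decoded : 0.0f;
            }
        }

    The same instance has to be used on both sides, via
    IndexWriter.setSimilarity() at indexing time and
    Searcher.setSimilarity() at search time.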

  • Trieschnigg, R.B. (Dolf) at Feb 17, 2006 at 9:54 am

    I would like to implement the Okapi BM25 weighting function using my
    own Similarity implementation. Unfortunately, BM25 requires the
    document length in the score calculation, and the Scorer does not
    provide it.
    How do you want to measure document length? If the number of tokens is
    an acceptable measure, then the norm contains 1/sqrt(numTokens) by
    default. You can modify your Similarity.lengthNorm() implementation to
    not perform the sqrt, or square the norm.
    I assume the number of tokens will be a good estimate.

    I've included an image with the algorithm (my ASCII art isn't that good).
    Legend of the figure:
    - k1, k3 and b are constants
    - tf is the within-document term frequency
    - df is the document frequency
    - N is the collection size
    - r is the number of relevant documents containing a particular term (assumed to be 0 when no relevance information is available)
    - R is the number of documents known to be relevant to a specific topic (assumed to be 0 when no relevance information is available)
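
    For reference, the standard Okapi BM25 formulation that matches this
    legend (writing dl for the document length, avdl for the average
    document length, and qtf for the query term frequency; these three
    symbols are not in the legend) is, in LaTeX:

        \sum_{t \in q}
          \log \frac{(r + 0.5)/(R - r + 0.5)}
                    {(df - r + 0.5)/(N - df - R + r + 0.5)}
          \cdot \frac{(k_1 + 1)\, tf}{K + tf}
          \cdot \frac{(k_3 + 1)\, qtf}{k_3 + qtf},
        \qquad K = k_1 \Big( (1 - b) + b \, \frac{dl}{avdl} \Big)

    The document length dl enters only through K, inside the same
    denominator as tf.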

    As far as I understand, Lucene multiplies the squared weight by the
    result of Similarity.lengthNorm(), but BM25 requires the document
    length for the calculation of the document term weight (as far as I
    know, the influence of the normalization cannot be extracted as a
    constant multiplier).

    Am I missing something here?

    Dolf
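
    For concreteness, classic Lucene's per-term score (ignoring boosts
    and the query norm) has the separable form

        \mathit{tf}(f) \cdot \mathit{idf}(t)^2 \cdot \mathit{norm}(dl)

    so length normalization appears only as a multiplier, whereas BM25's
    term weight

        \frac{(k_1 + 1)\, tf}{k_1 \big( (1 - b) + b\, \frac{dl}{avdl} \big) + tf}

    keeps dl in the same denominator as tf, so it cannot be factored out
    as a constant multiplier.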
  • Trieschnigg, R.B. (Dolf) at Feb 17, 2006 at 10:52 am
    Sorry, the image wasn't sent:
    http://wwwhome.cs.utwente.nl/~trieschn/bm25.PNG
