Hi,

After reading the code, I found that the similarity measure in Lucene is not
the same as the commonly used cosine coefficient measure. I don't know whether
this is correct. I also wonder whether I can use the cosine coefficient measure
in Lucene, or perhaps Dice's coefficient, Jaccard's coefficient, or the overlap
coefficient.

  • Sebastian Marius Kirsch at Apr 28, 2006 at 6:58 pm

    On Fri, Apr 28, 2006 at 01:54:51PM +0800, jason wrote:
    > After reading the code, I found that the similarity measure in Lucene is
    > not the same as the commonly used cosine coefficient measure. I don't know
    > whether this is correct. I also wonder whether I can use the cosine
    > coefficient measure in Lucene, or perhaps Dice's coefficient, Jaccard's
    > coefficient, or the overlap coefficient.

    No one seems to have answered this yet, so I guess I'll have a go.

    I wrote down the following a while ago; I'm omitting boosts and coords
    here, since you don't have to use them. It assumes that you are using
    DefaultSimilarity and not a custom similarity implementation. You will
    have to pick through the LaTeX code; it's rather difficult to render
    formulas in ASCII.


    Lucene uses a modified vector-space model; the main scoring formula is
    \begin{equation}
    \label{eq:lucenescore}
    \score(\qu, \doc) = \frac{\sum_{\term\in\qu} \sqrt{\tf(\term, \doc)} \cdot
    \idf(\term)^2}{\sqrt{\sum_{\term\in\qu} \idf(\term)^2}
    \sqrt{\vphantom{\sum_{\term\in\qu} \idf(\term)^2}\sum_{\term\in\doc} \tf(\term, \doc)}}
    \end{equation}
    where
    \[ \idf(\term) = \log\frac{|\Doc|}{\docfreq(\term) + 1} + 1 \]
    Scores are normalized to fall in a range of 0.0 to 1.0.
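
A minimal Python sketch of the scoring formula above may help when picking through the LaTeX. It is an illustration under stated assumptions, not Lucene's actual code: the names `idf`, `lucene_score`, `doc_freq`, and `num_docs` are hypothetical, boosts and coords are left out as in the formula, and the value returned is the raw score before any normalization to the 0.0 to 1.0 range.

```python
import math
from collections import Counter

def idf(term, doc_freq, num_docs):
    # idf(t) = log(|D| / (docfreq(t) + 1)) + 1, using the natural logarithm
    return math.log(num_docs / (doc_freq.get(term, 0) + 1)) + 1.0

def lucene_score(query_terms, doc_tokens, doc_freq, num_docs):
    # Hypothetical helper: evaluates the main scoring formula for one document.
    tf = Counter(doc_tokens)
    query = set(query_terms)
    if not query or not doc_tokens:
        return 0.0
    # Numerator: sum over query terms of sqrt(tf(t, d)) * idf(t)^2
    numerator = sum(math.sqrt(tf[t]) * idf(t, doc_freq, num_docs) ** 2
                    for t in query)
    # Query normalization: sqrt of the sum of idf(t)^2 over query terms
    query_norm = math.sqrt(sum(idf(t, doc_freq, num_docs) ** 2 for t in query))
    # Document normalization: sqrt of the total term count of the document
    doc_norm = math.sqrt(sum(tf.values()))
    return numerator / (query_norm * doc_norm)

# Made-up example: 4 documents in the collection, document frequencies assumed.
doc_freq = {"lucene": 1, "search": 3}
doc = ["lucene", "search", "search", "index"]
print(lucene_score(["lucene", "search"], doc, doc_freq, num_docs=4))
```

Note that the document normalization only needs the document's own term counts, which is what the discussion further down relies on.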

    This weighting scheme is easily related to the standard vector-space
    model: use \(\sqrt{\tf(\term, \doc)}\) instead of \(\tf(\term, \doc)\) as
    the term weight and define \(\tf(\term,\qu)\equiv 1\) for every term that
    occurs in the query. Then
    \begin{align*}
    \score(\qu, \doc) &= \cos\angle(\vec{\qu}, \vec{\doc}) =
    \frac{\vec{\qu}\cdot\vec{\doc}}{\|\vec{\qu}\|\cdot \|\vec{\doc}\|}\\
    &= \frac{\sum_{\term\in\Term} \left(\sqrt{\tf(\term, \qu)}
    \idf(\term)\right)\left(\sqrt{\tf(\term, \doc)}
    \idf(\term)\right)}{ \sqrt{\sum_{\term\in\Term}
    \left(\sqrt{\tf(\term, \qu)} \idf(\term)\right)^2}
    \sqrt{\sum_{\term\in\Term} \left(\sqrt{\tf(\term, \doc)}
    \idf(\term)\right)^2}}\\
    &= \frac{\sum_{\term\in\qu} \sqrt{\tf(\term, \doc)}
    \idf(\term)^2}{\sqrt{\sum_{\term\in\qu} \idf(\term)^2}
    \sqrt{\sum_{\term\in\doc} \tf(\term, \doc) \idf(\term)^2}}
    \end{align*}
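
The last step of this derivation can be checked numerically. The sketch below is only an illustration with made-up idf values; `cosine_full` and `cosine_reduced` are hypothetical names. It computes the cosine once from full weight vectors over the vocabulary and once from the reduced closed form, assuming tf(t, q) = 1 for query terms and 0 otherwise.

```python
import math

def cosine_full(query_terms, doc_tf, idf_values):
    # Build explicit weight vectors over the whole vocabulary.
    vocab = sorted(idf_values)
    q_vec = [idf_values[t] if t in query_terms else 0.0 for t in vocab]
    d_vec = [math.sqrt(doc_tf.get(t, 0)) * idf_values[t] for t in vocab]
    dot = sum(q * d for q, d in zip(q_vec, d_vec))
    q_norm = math.sqrt(sum(q * q for q in q_vec))
    d_norm = math.sqrt(sum(d * d for d in d_vec))
    return dot / (q_norm * d_norm)

def cosine_reduced(query_terms, doc_tf, idf_values):
    # Closed form from the last line of the derivation.
    num = sum(math.sqrt(doc_tf.get(t, 0)) * idf_values[t] ** 2
              for t in query_terms)
    q_norm = math.sqrt(sum(idf_values[t] ** 2 for t in query_terms))
    d_norm = math.sqrt(sum(n * idf_values[t] ** 2 for t, n in doc_tf.items()))
    return num / (q_norm * d_norm)

# Made-up idf values and term counts, chosen only for illustration.
idf_values = {"lucene": 1.7, "search": 1.0, "index": 1.3, "score": 2.0}
query = {"lucene", "search"}
doc_tf = {"lucene": 1, "search": 2, "index": 1}

full = cosine_full(query, doc_tf, idf_values)
reduced = cosine_reduced(query, doc_tf, idf_values)
print(full, reduced)
```

Both functions should agree up to floating-point rounding, and the result stays between 0 and 1 because it is a true cosine of two non-negative vectors.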
    By omitting the factor \(\idf(\term)^2\) from the sum
    \(\sqrt{\sum_{\term\in\doc} \tf(\term, \doc) \idf(\term)^2}\) in the
    denominator, one arrives at the main scoring formula in
    equation~(\ref{eq:lucenescore}). Omitting the inverse document
    frequency from the document normalization factor allows one to
    precompute this factor and store it in the index; otherwise, since
    \(\idf\) depends on the whole collection, it would be necessary to
    recompute the normalization factors every time a document is added to
    or deleted from the index.
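
To make the precomputation argument concrete, here is a small sketch with a made-up three-document collection. The helper names `idf`, `lucene_doc_norm`, and `cosine_doc_norm` are hypothetical. It computes both document normalization factors for one document, adds another document to the collection, and computes them again.

```python
import math
from collections import Counter

def idf(term, docs):
    # idf(t) = log(|D| / (docfreq(t) + 1)) + 1 over the current collection
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / (df + 1)) + 1.0

def lucene_doc_norm(doc):
    # sqrt of the sum of tf(t, d): depends only on the document itself
    return math.sqrt(len(doc))

def cosine_doc_norm(doc, docs):
    # sqrt of the sum of tf(t, d) * idf(t)^2: depends on the whole collection
    tf = Counter(doc)
    return math.sqrt(sum(n * idf(t, docs) ** 2 for t, n in tf.items()))

docs = [["lucene", "search"], ["search", "index", "index"]]
target = docs[0]

before = (lucene_doc_norm(target), cosine_doc_norm(target, docs))
docs.append(["lucene", "score"])  # a new document changes the idf values
after = (lucene_doc_norm(target), cosine_doc_norm(target, docs))

print(before, after)
```

The first component is the same before and after the insertion, so it can be stored once at indexing time; the second changes because the document frequencies and the collection size changed.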

    --
    Sebastian Kirsch [http://www.sebastian-kirsch.org/]


Discussion Overview
group: java-user @
categories: lucene
posted: Apr 28, '06 at 5:55a
active: Apr 28, '06 at 6:58p
posts: 2
users: 2
website: lucene.apache.org
