Hi,

After reading the code, I found that the similarity measure in Lucene is not
the same as the commonly used cosine coefficient measure. I don't know whether
that is correct. I also wonder whether I can use the cosine coefficient
measure in Lucene, or perhaps Dice's coefficient, Jaccard's coefficient, or
the overlap coefficient.


Apr 28, 2006 at 6:58 pm

On Fri, Apr 28, 2006 at 01:54:51PM +0800, jason wrote:
> After reading the code, I found that the similarity measure in Lucene is not
> the same as the commonly used cosine coefficient measure. I don't know whether
> that is correct. I also wonder whether I can use the cosine coefficient
> measure in Lucene, or perhaps Dice's coefficient, Jaccard's coefficient, or
> the overlap coefficient.

No one seems to have answered this yet, so I guess I'll have a go.

I wrote down the following a while ago; I'm omitting boosts and coords
here, since you don't have to use them. It assumes that you are using
DefaultSimilarity and not a custom similarity implementation. You will
have to pick through the LaTeX code; it's rather difficult to render
formulas in ASCII.

Lucene uses a modified vector-space model; the main scoring formula is

\begin{equation}
\label{eq:lucenescore}
\score(\qu, \doc) = \frac{\sum_{\term\in\qu} \sqrt{\tf(\term, \doc)} \cdot
\idf(\term)^2}{\sqrt{\sum_{\term\in\qu} \idf(\term)^2}
\sqrt{\vphantom{\sum_{\term\in\qu} \idf(\term)^2}\sum_{\term\in\doc} \tf(\term, \doc)}}
\end{equation}

where
$\idf(\term) = \log\frac{|\Doc|}{\docfreq(\term) + 1} + 1$.
Scores are normalized to fall in a range of 0.0 to 1.0.
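
To make the formula concrete, here is a minimal Python sketch of it. This is not Lucene code: the function names and the plain dict of document frequencies are made up for illustration, boosts and coords are left out as above, and the final rescaling of scores into the 0.0 to 1.0 range is not applied.

```python
import math
from collections import Counter


def idf(term, doc_freq, num_docs):
    """idf(t) = log(|D| / (docfreq(t) + 1)) + 1, as defined above."""
    return math.log(num_docs / (doc_freq.get(term, 0) + 1)) + 1.0


def lucene_score(query_terms, doc_tokens, doc_freq, num_docs):
    """Score one document against a query using the main scoring formula.

    query_terms: iterable of query terms (duplicates are ignored)
    doc_tokens:  list of the document's tokens, with repetitions
    doc_freq:    hypothetical dict mapping term -> number of documents containing it
    num_docs:    total number of documents, |D|
    """
    tf = Counter(doc_tokens)
    query = set(query_terms)
    if not tf or not query:
        return 0.0
    # Numerator: sum over query terms of sqrt(tf(t, d)) * idf(t)^2.
    numerator = sum(
        math.sqrt(tf[t]) * idf(t, doc_freq, num_docs) ** 2 for t in query
    )
    # Query normalization: sqrt of the sum of idf(t)^2 over the query terms.
    query_norm = math.sqrt(sum(idf(t, doc_freq, num_docs) ** 2 for t in query))
    # Document normalization: sqrt of the sum of tf(t, d), i.e. sqrt(document length).
    doc_norm = math.sqrt(sum(tf.values()))
    return numerator / (query_norm * doc_norm)
```

A query term that does not occur in the document contributes nothing to the numerator, but it still enters the query normalization factor.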

This weighting scheme is easily related to the standard vector-space
model by using $\sqrt{\tf(\term, \doc)}$ instead of $\tf(\term, \doc)$
and defining $\tf(\term,\qu)\equiv 1$ for every term in the query (and 0
for all other terms). Then
\begin{align*}
\score(\qu, \doc) &= \cos\angle(\vec{\qu}, \vec{\doc}) =
\frac{\vec{\qu}\cdot\vec{\doc}}{\|\vec{\qu}\|\cdot \|\vec{\doc}\|}\\
&= \frac{\sum_{\term\in\Term} \left(\sqrt{\tf(\term, \qu)}
\idf(\term)\right)\left(\sqrt{\tf(\term, \doc)}
\idf(\term)\right)}{ \sqrt{\sum_{\term\in\Term}
\left(\sqrt{\tf(\term, \qu)} \idf(\term)\right)^2}
\sqrt{\sum_{\term\in\Term} \left(\sqrt{\tf(\term, \doc)}
\idf(\term)\right)^2}}\\
&= \frac{\sum_{\term\in\qu} \sqrt{\tf(\term, \doc)}
\idf(\term)^2}{\sqrt{\sum_{\term\in\qu} \idf(\term)^2}
\sqrt{\sum_{\term\in\doc} \tf(\term, \doc) \idf(\term)^2}}
\end{align*}
By omitting the factor $\idf(\term)^2$ from the expression
$\sqrt{\sum_{\term\in\doc} \tf(\term, \doc) \idf(\term)^2}$ in the
denominator, one arrives at the main scoring formula in
equation~(\ref{eq:lucenescore}). Omitting the inverse document
frequency from the document normalization factor allows one to
precompute this factor and store it in the index; otherwise it would
be necessary to recompute the normalization factors every time a
document is added to or deleted from the index.
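
The point about precomputation can be seen in a small Python sketch. Again, this is only an illustration under my own assumptions: the helper names and the toy document frequencies are hypothetical and nothing here is a Lucene API.

```python
import math
from collections import Counter


def idf(term, doc_freq, num_docs):
    """idf(t) = log(|D| / (docfreq(t) + 1)) + 1."""
    return math.log(num_docs / (doc_freq.get(term, 0) + 1)) + 1.0


def doc_norm_cosine(doc_tokens, doc_freq, num_docs):
    """Full cosine document norm: sqrt(sum over t in d of tf(t, d) * idf(t)^2).

    Depends on the whole collection through idf.
    """
    tf = Counter(doc_tokens)
    return math.sqrt(
        sum(n * idf(t, doc_freq, num_docs) ** 2 for t, n in tf.items())
    )


def doc_norm_lucene(doc_tokens):
    """Norm with idf omitted: sqrt(sum over t in d of tf(t, d)).

    Depends only on the document itself (its length in tokens).
    """
    return math.sqrt(len(doc_tokens))


doc = ["lucene", "score", "lucene"]

# Toy collection statistics before and after adding one more document
# that contains the term "lucene".
cosine_before = doc_norm_cosine(doc, {"lucene": 1, "score": 1}, num_docs=2)
cosine_after = doc_norm_cosine(doc, {"lucene": 2, "score": 1}, num_docs=3)

print(cosine_before, cosine_after)  # the idf-based norm changes with the index
print(doc_norm_lucene(doc))         # this norm is unaffected by other documents
```

The idf-based norm of an unchanged document shifts as soon as the collection statistics change, while the idf-free norm can be computed once at indexing time.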

--
Sebastian Kirsch [http://www.sebastian-kirsch.org/]

Discussion Overview
 group java-user categories lucene posted Apr 28, '06 at 5:55a active Apr 28, '06 at 6:58p posts 2 users 2 website lucene.apache.org