I would like to know whether there is a simple way to make Lucene rank
results by the plain cosine similarity between the term-frequency vectors
of the documents and of the query. In practice, the score sc_i of document
i should be given by:

sc_i = (D_i*Q)/(|D_i|*|Q|)

where D_i = vector of the term frequencies of document i;
Q = vector of the term frequencies of the Query;
* = scalar product;
|.| = norm of the vector (the square root of the sum of the squares
of the entries of the vector).
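
To make the formula concrete, here is a minimal sketch in plain Java of
the score I have in mind. It does not use any Lucene API: the class name
CosineScore, its method names, and the whitespace tokenization are only
assumptions for illustration, and the term-frequency vectors are kept as
simple maps from term to count.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: cosine similarity between
// two term-frequency vectors stored as term -> count maps.
class CosineScore {

    // Builds a term-frequency vector from lowercased, whitespace-separated
    // tokens (a deliberately naive tokenizer, assumed for illustration).
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    // |v|: square root of the sum of the squares of the entries.
    static double norm(Map<String, Integer> v) {
        double sumOfSquares = 0.0;
        for (int f : v.values()) {
            sumOfSquares += (double) f * f;
        }
        return Math.sqrt(sumOfSquares);
    }

    // D * Q: scalar product, summed over the terms present in both vectors.
    static double dot(Map<String, Integer> d, Map<String, Integer> q) {
        double sum = 0.0;
        for (Map.Entry<String, Integer> e : d.entrySet()) {
            Integer other = q.get(e.getKey());
            if (other != null) {
                sum += (double) e.getValue() * other;
            }
        }
        return sum;
    }

    // sc_i = (D_i * Q) / (|D_i| * |Q|); returns 0 if either vector is empty.
    static double cosine(Map<String, Integer> d, Map<String, Integer> q) {
        double denominator = norm(d) * norm(q);
        return denominator == 0.0 ? 0.0 : dot(d, q) / denominator;
    }

    public static void main(String[] args) {
        Map<String, Integer> d =
            termFrequencies("lucene ranks documents and lucene scores documents");
        Map<String, Integer> q = termFrequencies("lucene documents");
        System.out.println(cosine(d, q));
    }
}
```

The part I am missing in Lucene corresponds to norm(d) above, i.e. the
per-document quantity |D_i|.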

I wasn't able to find a way to evaluate |D_i|.

Thank you


Ing. Claudio Gennaro, PhD
ISTI (Information Science and Technology Institute)
Consiglio Nazionale delle Ricerche Area della Ricerca di Pisa (room I14)
Via G. Moruzzi 1
56124 Pisa - ITALY
phone: +39 050 315 3077
mobile: +39 328 92 16 734
fax: +39 050 315 3464 or +39 050 315 2810
e-mail: claudio.gennaro@isti.cnr.it

Group: java-user
Posted: Aug 13, '09 at 4:30p