Hello *,

I have been trying to find an *efficient* (in terms of performance) way
to get the Cosine Similarity between two Lucene Documents.

I have seen that this can be done with:

1. Converting the document into a query, submitting the query, and getting
   the results and their scores. This is TOO SLOW if you want it for all
   documents in a corpus.
2. The MoreLikeThis class, but this is not what I really want.

What I want is the following:
I have 3 different fields (zones) in my index (corpus) for each document.
Each zone has its own boost (weight).

What I need is to get the distance between all pairs of documents in my
index, using the different term weights (from each field's boost).

In other words, I need to calculate the Similarity formula
for all pairs of documents in the index.
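
To make the request concrete, here is a minimal sketch of a field-boosted cosine similarity for one pair of documents. It assumes the term frequencies of each field have already been read out of the index into plain maps, and it weights each term as tf * field boost. That weighting is an assumption made for illustration; it is not Lucene's own scoring formula (no idf or length norms). The class name `BoostedCosine` and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
final class BoostedCosine {

    /**
     * Merges per-field term frequencies into one weighted vector.
     * Assumed weighting: weight = term frequency * field boost.
     */
    static Map<String, Double> weightedVector(Map<String, Map<String, Integer>> fieldTermFreqs,
                                              Map<String, Double> fieldBoosts) {
        Map<String, Double> vector = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> field : fieldTermFreqs.entrySet()) {
            // Fields without an explicit boost default to 1.0.
            double boost = fieldBoosts.getOrDefault(field.getKey(), 1.0);
            for (Map.Entry<String, Integer> tf : field.getValue().entrySet()) {
                // Prefix with the field name so the same term in two zones stays distinct.
                String key = field.getKey() + ":" + tf.getKey();
                vector.merge(key, boost * tf.getValue(), Double::sum);
            }
        }
        return vector;
    }

    /** Cosine similarity of two sparse vectors; returns 0.0 if either is empty. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        // Iterate over the smaller map to keep the dot product cheap.
        Map<String, Double> small = a.size() <= b.size() ? a : b;
        Map<String, Double> large = (small == a) ? b : a;
        double dot = 0.0;
        for (Map.Entry<String, Double> e : small.entrySet()) {
            Double other = large.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    /** Euclidean length of a sparse vector. */
    static double norm(Map<String, Double> v) {
        double sumSquares = 0.0;
        for (double w : v.values()) {
            sumSquares += w * w;
        }
        return Math.sqrt(sumSquares);
    }
}
```

Building the weighted vector once per document and caching its norm would avoid repeating that work for every pair.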

Does anyone have in mind any project or code that does this?
It would take some time to develop this myself.

thanks a lot in advance,
--
Asterios Katsifodimos
High Performance Computing systems Lab
Department of Computer Science, University of Cyprus
http://grid.ucy.ac.cy


  • Karl Wettin at Jul 15, 2008 at 8:00 am
    I'm not sure what it is you want to do. If you want to measure the
    distance between two documents, then the easiest way is to extract the
    feature vectors (document TermFreqVector) from those two documents and
    measure the distance using something like the Tanimoto coefficient.
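
A minimal sketch of the Tanimoto coefficient on two term-frequency vectors, using the real-valued form T(a, b) = a·b / (|a|² + |b|² − a·b). It assumes the vectors are already available as term-to-frequency maps; with the Lucene API of that era these could presumably be filled from a `TermFreqVector` via `getTerms()` and `getTermFrequencies()`, but that step is left out here. The class name `TanimotoSketch` is hypothetical.

```java
import java.util.Map;

// Hypothetical helper, not part of Lucene.
final class TanimotoSketch {

    /**
     * Tanimoto similarity of two sparse term-frequency vectors.
     * Returns 1.0 for identical non-empty vectors and 0.0 for disjoint ones;
     * two empty vectors are treated as similarity 0.0 by convention here.
     */
    static double tanimoto(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        double squaresA = 0.0;
        double squaresB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            double fa = e.getValue();
            squaresA += fa * fa;
            Integer fb = b.get(e.getKey());
            if (fb != null) {
                dot += fa * fb;   // only shared terms contribute to the dot product
            }
        }
        for (int f : b.values()) {
            squaresB += (double) f * f;
        }
        double denominator = squaresA + squaresB - dot;
        return denominator == 0.0 ? 0.0 : dot / denominator;
    }

    /** Distance derived from the similarity: 0.0 means identical vectors. */
    static double distance(Map<String, Integer> a, Map<String, Integer> b) {
        return 1.0 - tanimoto(a, b);
    }
}
```

For example, vectors {x=1, y=1} and {x=1} give a dot product of 1 and squared lengths of 2 and 1, so the similarity is 1 / (2 + 1 − 1) = 0.5.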

    You want to do this for all the pairs in the index? Consider Hadoop.
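
To show why the all-pairs case grows so quickly, here is a single-machine sketch that fills a symmetric similarity matrix by visiting each unordered pair once. The class `AllPairs` and the pluggable `similarity` function are hypothetical, and the dense `double[n][n]` matrix is only realistic for small n: n documents give n(n − 1)/2 pairs, which is roughly 5 * 10^11 for a million documents.

```java
import java.util.List;
import java.util.function.ToDoubleBiFunction;

// Hypothetical helper, not part of Lucene or Hadoop.
final class AllPairs {

    /** Number of unordered document pairs for n documents. */
    static long pairCount(long n) {
        return n * (n - 1) / 2;
    }

    /**
     * Computes a symmetric similarity matrix, evaluating the similarity
     * function once per unordered pair (upper triangle) and mirroring it.
     */
    static <V> double[][] similarityMatrix(List<V> docs, ToDoubleBiFunction<V, V> similarity) {
        int n = docs.size();
        double[][] matrix = new double[n][n];
        for (int i = 0; i < n; i++) {
            matrix[i][i] = similarity.applyAsDouble(docs.get(i), docs.get(i));
            for (int j = i + 1; j < n; j++) {
                double s = similarity.applyAsDouble(docs.get(i), docs.get(j));
                matrix[i][j] = s;
                matrix[j][i] = s;   // assumes the similarity is symmetric
            }
        }
        return matrix;
    }
}
```

The outer loop rows are independent of each other, which is what makes the job easy to split across workers.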

    You might get more and better help by asking in Mahout: http://lucene.apache.org/mahout/

    karl



Discussion Overview
group: java-user
categories: lucene
posted: Jul 14, '08 at 12:26p
active: Jul 15, '08 at 8:00a
posts: 2
users: 2
website: lucene.apache.org
