FAQ
Hi Herb,

thank you for your insights.

>>
but by most accepted definitions, the tf/idf model in Lucene is a probabilistic model.
>>

Can you send some pointers to help me understand that? Are all TF/IDF-variants
probabilistic models? If so, what makes any model a non-probabilistic one?
If you claim that TF/IDF is probabilistic, then the plain cosine (an extreme
form of TF/IDF, with IDF for all terms being considered constant) of VSM would
also be a probabilistic model.

>>
it's got strange normalizations though that doesn't allow comparisons of rank values across queries.
>>

Lucene's internal ranking sometimes returns values > 1.0, these are then normalized to 1.0,
adjusting other rankings accordingly. While I have nothing to say against this - it's a hack,
but useful - it makes comparing the rank values across queries really difficult. It's like
using different scales whenever you measure something different, and then you do not tell
anyone about it.

>>
it isn't terribly hard to make a normalized probabilistic model that allows comparing of document scores across queries and assign a meaning to the score. i've done it.
>>

Stop bragging, send us your Similarity implementation :)

Regards,

Karsten


-----Ursprüngliche Nachricht-----
Von: Chong, Herb
Gesendet: Mittwoch, 3. Dezember 2003 23:01
An: Lucene Users List
Betreff: RE: Probabilistic Model in Lucene - possible?


i think i am missing the original question, but by most accepted definitions, the tf/idf model in Lucene is a probabilistic model. it's got strange normalizations though that doesn't allow comparisons of rank values across queries.

it isn't terribly hard to make a normalized probabilistic model that allows comparing of document scores across queries and assign a meaning to the score. i've done it. however, that means abandoning idf and keeping actual term frequencies for each document and document size. once you normalize this way, you can intermingle document scores from different queries and different corpora and make statements about the absolute value of the score. it also leads directly into the discussion we had earlier about interterm correlations and how to handle them properly since the full interterm probabilistic model has as a special case the traditional tf/idf model. interjecting Boolean conditions and boost makes the model much more complicated.

Herb....

-----Original Message-----
From: Karsten Konrad
Sent: Wednesday, December 03, 2003 4:51 PM
To: Lucene Users List
Subject: AW: Probabilistic Model in Lucene - possible?

>>
I would highly appreciate it if the experts here (especially Karsten or
Chong) look at my idea and tell me if this would be possible.
>>

Sorry, I have no idea about how to use a probabilistic approach with
Lucene, but if anyone does so, I would like to know, too.

I am currently puzzled by a related question: I would like to know if there are any approaches to get a confidence value for relevance
rather than a ranking. I.e., it would be nice to have a ranking
weight whose value has some kind of semantics such that we could
compare results from different queries. Can probabilistic approches
do anything like this?

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Search Discussions

Discussion Posts

Previous

Follow ups

Related Discussions

Discussion Navigation
viewthread | post
posts ‹ prev | 2 of 3 | next ›
Discussion Overview
groupjava-user @
categorieslucene
postedDec 3, '03 at 9:51p
activeDec 4, '03 at 11:45a
posts3
users2
websitelucene.apache.org

2 users in discussion

Karsten Konrad: 2 posts Ambiesense: 1 post

People

Translate

site design / logo © 2022 Grokbase