Hello Karsten and Herb,

I would like to join Karsten in the club of slightly confused people
regarding what a probabilistic model is and what a vector space model
(VSM) is... :-) So far I have considered TF/IDF and cosine similarity to be
something that belongs to the VSM. The standard similarity formula for the
probabilistic model also looks different from TF/IDF...

Clarification would be appreciated.

Cheers,
Ralf
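
For readers following the thread, here is a minimal sketch of what Ralf
means by TF/IDF with cosine similarity in the vector space model. It uses
textbook weighting (raw term frequency times log(N/df)), which is an
assumption for illustration and not Lucene's actual scoring formula. The
function names and the toy corpus are made up.

```python
import math
from collections import Counter


def idf_table(docs):
    """Map each term to log(N / df), where df counts documents containing it."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}


def tfidf_vector(tokens, idf):
    """Sparse TF/IDF vector; terms unknown to the corpus get weight 0."""
    tf = Counter(tokens)
    return {term: tf[term] * idf.get(term, 0.0) for term in tf}


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus, already tokenized.
docs = [
    "lucene ranks documents by term weight".split(),
    "probabilistic models estimate relevance".split(),
    "lucene uses tf idf weights".split(),
]
idf = idf_table(docs)
query = tfidf_vector("lucene tf idf".split(), idf)
scores = [cosine(query, tfidf_vector(doc, idf)) for doc in docs]
```

The cosine is bounded by 1.0 because both vectors are length-normalized,
but it measures an angle between weight vectors and is not in itself an
estimate of the probability of relevance.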

Hi Herb,

Thank you for your insights.

> but by most accepted definitions, the tf/idf model in Lucene is a
> probabilistic model.

Can you send some pointers to help me understand that? Are all TF/IDF
variants probabilistic models? If so, what makes any model a
non-probabilistic one?

If you claim that TF/IDF is probabilistic, then the plain cosine of the
VSM (an extreme form of TF/IDF, with the IDF of all terms considered
constant) would also be a probabilistic model.

> it's got strange normalizations though that doesn't allow comparisons of
> rank values across queries.

Lucene's internal ranking sometimes returns values > 1.0; these are then
normalized to 1.0, and the other rankings are adjusted accordingly. I have
nothing against this (it's a hack, but a useful one), but it makes
comparing rank values across queries really difficult. It's like using a
different scale every time you measure something different, and then not
telling anyone about it.
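
The rescaling Karsten describes can be sketched roughly as follows. This
only illustrates the behaviour as described in this message; it is not
Lucene source code, and `normalize_hits` and the raw scores are invented
for the example.

```python
def normalize_hits(raw_scores):
    """Rescale so the best hit becomes 1.0, but only if it exceeds 1.0."""
    top = max(raw_scores, default=0.0)
    if top <= 1.0:
        return list(raw_scores)
    return [score / top for score in raw_scores]


# Two hypothetical queries whose raw scores differ by a factor of two.
raw_a = [4.0, 2.0, 1.0]
raw_b = [2.0, 1.0, 0.5]
shown_a = normalize_hits(raw_a)
shown_b = normalize_hits(raw_b)

# A query whose best raw score is below 1.0 is left untouched.
raw_c = [0.6, 0.3]
shown_c = normalize_hits(raw_c)
```

Both `shown_a` and `shown_b` come out as `[1.0, 0.5, 0.25]`, so the
reported values no longer reveal that the first query matched twice as
strongly. The divisor changes from query to query, which is the "different
scale" problem.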

> it isn't terribly hard to make a normalized probabilistic model that
> allows comparing of document scores across queries and assign a meaning
> to the score. i've done it.

Stop bragging, send us your Similarity implementation :)

Regards,

Karsten


-----Original Message-----
From: Chong, Herb
Sent: Wednesday, December 3, 2003 23:01
To: Lucene Users List
Subject: RE: Probabilistic Model in Lucene - possible?


i think i am missing the original question, but by most accepted
definitions, the tf/idf model in Lucene is a probabilistic model. it's
got strange normalizations, though, that don't allow comparisons of rank
values across queries.

it isn't terribly hard to make a normalized probabilistic model that
allows comparing of document scores across queries and assign a meaning
to the score. i've done it. however, that means abandoning idf and
keeping actual term frequencies for each document and document size.
once you normalize this way, you can intermingle document scores from
different queries and different corpora and make statements about the
absolute value of the score. it also leads directly into the discussion
we had earlier about interterm correlations and how to handle them
properly since the full interterm probabilistic model has as a special
case the traditional tf/idf model. interjecting Boolean conditions and
boost makes the model much more complicated.
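
Herb does not post his implementation, so the following is only one
possible reading of "abandoning idf and keeping actual term frequencies
for each document and document size", not his method. The assumption here
is that the score is the chance that a randomly drawn query term equals a
randomly drawn document token, i.e. the average of tf / document length
over the query terms. The function name and sample data are hypothetical.

```python
from collections import Counter


def match_probability(query_terms, doc_tokens):
    """Average relative frequency of the query terms in the document.

    Uses only term frequencies and document length; no idf and no
    corpus-wide statistics (an assumption made for this sketch).
    """
    if not query_terms or not doc_tokens:
        return 0.0
    tf = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    per_term = [tf[term] / doc_len for term in query_terms]
    return sum(per_term) / len(per_term)


# Hypothetical document, already tokenized.
doc = "lucene scores lucene documents".split()
one_term = match_probability(["lucene"], doc)
two_terms = match_probability(["lucene", "ranking"], doc)
```

Here `one_term` is 0.5 and `two_terms` is 0.25. Because the value depends
only on the query and the document itself, it always lies in [0, 1] and
means the same thing for any query or corpus. The price is that rare and
common terms count equally, which is what idf would otherwise correct for.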

Herb....

-----Original Message-----
From: Karsten Konrad
Sent: Wednesday, December 03, 2003 4:51 PM
To: Lucene Users List
Subject: AW: Probabilistic Model in Lucene - possible?

> I would highly appreciate it if the experts here (especially Karsten or
> Chong) look at my idea and tell me if this would be possible.

Sorry, I have no idea about how to use a probabilistic approach with
Lucene, but if anyone does so, I would like to know, too.

I am currently puzzled by a related question: I would like to know if
there are any approaches to get a confidence value for relevance
rather than a ranking. I.e., it would be nice to have a ranking
weight whose value has some kind of semantics such that we could
compare results from different queries. Can probabilistic approaches
do anything like this?

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org



Discussion Overview
group: java-user @
categories: lucene
posted: Dec 3, '03 at 9:51p
active: Dec 4, '03 at 11:45a
posts: 3
users: 2
website: lucene.apache.org

2 users in discussion

Karsten Konrad: 2 posts
Ambiesense: 1 post
