Hi,

>>
I would highly appreciate it if the experts here (especially Karsten or
Chong) look at my idea and tell me if this would be possible.
>>

Sorry, I have no idea about how to use a probabilistic approach with
Lucene, but if anyone does so, I would like to know, too.

I am currently puzzled by a related question: I would like to know
if there are any approaches to get a confidence value for relevance
rather than a ranking. I.e., it would be nice to have a ranking
weight whose value has some kind of semantics such that we could
compare results from different queries. Can probabilistic approaches
do anything like this?

Any help appreciated,

Karsten



-----Original Message-----
From: ambiesense@gmx.de
Sent: Wednesday, December 3, 2003 15:13
To: lucene-user@jakarta.apache.org
Subject: Probabilistic Model in Lucene - possible?


Hello group,

from the very inspiring conversations with Karsten I know that Lucene is based on a vector space model. I am just wondering if it would be possible to turn this into a probabilistic model approach. Of course I know that I cannot change the underlying indexing and searching principles. However, it would be possible to change the index term weight to either 1.0 (relevant) or 0.0 (non-relevant). For the similarity I would need to implement another similarity algorithm.
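
One textbook model that matches the binary term weights described above is the binary independence model with Robertson/Sparck Jones term weights. The following is only a minimal sketch of that idea in Python, not Lucene code and not necessarily what was intended in this message: the function names, the toy collection, and the 0.5 smoothing constants are assumptions made for illustration.

```python
import math

def rsj_weight(num_docs, doc_freq):
    """Robertson/Sparck Jones weight without relevance information.
    Note: the weight turns negative for terms in more than half the documents."""
    return math.log((num_docs - doc_freq + 0.5) / (doc_freq + 0.5))

def bim_score(query_terms, doc_terms, doc_freqs, num_docs):
    """Sum the weights of the query terms that occur in the document.
    Term presence is binary: a term either occurs (1.0) or it does not (0.0)."""
    present = set(doc_terms)
    return sum(rsj_weight(num_docs, doc_freqs[t])
               for t in set(query_terms)
               if t in present and t in doc_freqs)

# Toy collection, made up for illustration only.
docs = {
    "d1": ["lucene", "vector", "space", "model"],
    "d2": ["probabilistic", "model", "relevance"],
    "d3": ["lucene", "index", "term", "weight"],
    "d4": ["cooking", "recipes"],
    "d5": ["travel", "guide"],
}

# Document frequency: in how many documents each term occurs.
doc_freqs = {}
for terms in docs.values():
    for t in set(terms):
        doc_freqs[t] = doc_freqs.get(t, 0) + 1

query = ["probabilistic", "model"]
scores = {name: bim_score(query, terms, doc_freqs, len(docs))
          for name, terms in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this sketch the within-document term frequency plays no role at all; only the presence of a term and its document frequency in the collection matter.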

I would highly appreciate it if the experts here (especially Karsten or
Chong) look at my idea and tell me if this would be possible. If yes, how much effort would need to go into that? I am sure there are many other issues which I have not considered...

Kind Regards,
Ralf





---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org




  • Karsten Konrad at Dec 4, 2003 at 10:10 am
    Hi Herb,

    thank you for your insights.

    >>
    but by most accepted definitions, the tf/idf model in Lucene is a probabilistic model.
    >>

    Can you send some pointers to help me understand that? Are all TF/IDF-variants
    probabilistic models? If so, what makes any model a non-probabilistic one?
    If you claim that TF/IDF is probabilistic, then the plain cosine (an extreme
    form of TF/IDF, with IDF for all terms being considered constant) of VSM would
    also be a probabilistic model.

    >>
    it's got strange normalizations though that don't allow comparisons of rank values across queries.
    >>

    Lucene's internal ranking sometimes returns values > 1.0; these are then normalized to 1.0,
    and the other rankings are adjusted accordingly. While I have nothing to say against this - it's a hack,
    but a useful one - it makes comparing the rank values across queries really difficult. It's like
    using a different scale whenever you measure something different, and then not telling
    anyone about it.

    >>
    it isn't terribly hard to make a normalized probabilistic model that allows comparing of document scores across queries and assign a meaning to the score. i've done it.
    >>

    Stop bragging, send us your Similarity implementation :)

    Regards,

    Karsten


    -----Original Message-----
    From: Chong, Herb
    Sent: Wednesday, December 3, 2003 23:01
    To: Lucene Users List
    Subject: RE: Probabilistic Model in Lucene - possible?


    i think i am missing the original question, but by most accepted definitions, the tf/idf model in Lucene is a probabilistic model. it's got strange normalizations though that don't allow comparisons of rank values across queries.

    it isn't terribly hard to make a normalized probabilistic model that allows comparing of document scores across queries and assign a meaning to the score. i've done it. however, that means abandoning idf and keeping actual term frequencies for each document and document size. once you normalize this way, you can intermingle document scores from different queries and different corpora and make statements about the absolute value of the score. it also leads directly into the discussion we had earlier about interterm correlations and how to handle them properly since the full interterm probabilistic model has as a special case the traditional tf/idf model. interjecting Boolean conditions and boost makes the model much more complicated.
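
The message above gives no formula, so the following Python sketch is only one possible reading of "abandoning idf and keeping actual term frequencies for each document and document size", and it is not the implementation mentioned there. It assumes a smoothed estimate of P(term | document) built from the term frequency and the document length alone, and takes the geometric mean over the query terms so that the score does not shrink just because a query is longer. The names, the smoothing constant `alpha`, and the `vocab_size` parameter are all hypothetical.

```python
import math
from collections import Counter

def term_probability(term, doc_counts, doc_len, vocab_size, alpha=0.1):
    """Additively smoothed estimate of P(term | document).
    Uses only the term frequency and the document length, no idf."""
    return (doc_counts[term] + alpha) / (doc_len + alpha * vocab_size)

def normalized_score(query_terms, doc_terms, vocab_size, alpha=0.1):
    """Geometric mean of P(term | document) over the query terms.
    The result lies in (0, 1] and does not depend on collection statistics."""
    if not query_terms:
        raise ValueError("query must contain at least one term")
    counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    log_sum = sum(
        math.log(term_probability(t, counts, doc_len, vocab_size, alpha))
        for t in query_terms
    )
    return math.exp(log_sum / len(query_terms))

# Toy document and vocabulary size, made up for illustration.
doc = ["lucene", "scoring", "lucene", "model"]
one_term = normalized_score(["lucene"], doc, vocab_size=10)
two_terms = normalized_score(["lucene", "model"], doc, vocab_size=10)
no_match = normalized_score(["cooking"], doc, vocab_size=10)
```

Because no collection-wide statistic such as idf enters the score, the same document and query give the same value regardless of which index the document lives in. Whether that value is a meaningful relevance confidence is a separate question that this sketch does not settle.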

    Herb....

    -----Original Message-----
    From: Karsten Konrad
    Sent: Wednesday, December 03, 2003 4:51 PM
    To: Lucene Users List
    Subject: AW: Probabilistic Model in Lucene - possible?

    >>
    I would highly appreciate it if the experts here (especially Karsten or
    Chong) look at my idea and tell me if this would be possible.
    >>

    Sorry, I have no idea about how to use a probabilistic approach with
    Lucene, but if anyone does so, I would like to know, too.

    I am currently puzzled by a related question: I would like to know if there are any approaches to get a confidence value for relevance
    rather than a ranking. I.e., it would be nice to have a ranking
    weight whose value has some kind of semantics such that we could
    compare results from different queries. Can probabilistic approaches
    do anything like this?

  • Ambiesense at Dec 4, 2003 at 11:45 am
    Hello Karsten and Herb,

    I would like to join Karsten in the club of slightly confused people
    regarding what a probabilistic model is and what a vector space model is... :-) So far I have considered
    TF/IDF and cosine similarity as something that belongs to the VSM. The
    standard similarity formula for the probabilistic model also looks different from
    TF/IDF...

    Clarification would be appreciated.

    Cheers,
    Ralf


Discussion Overview
group: java-user
categories: lucene
posted: Dec 3, '03 at 9:51p
active: Dec 4, '03 at 11:45a
posts: 3
users: 2
website: lucene.apache.org

2 users in discussion

Karsten Konrad: 2 posts
Ambiesense: 1 post
