Hello group,

From the very inspiring conversations with Karsten I know that Lucene is
based on a Vector Space Model. I am just wondering if it would be possible to
turn this into a probabilistic model approach. Of course I know that I
cannot change the underlying indexing and searching principles. However, it would
be possible to change the index term weight to either 1.0 (relevant) or 0.0
(non-relevant). For the similarity I would need to implement another
similarity algorithm.

I would highly appreciate it if the experts here (especially Karsten or
Chong) look at my idea and tell me if this would be possible. If yes, how much
effort would need to go into that? I am sure there are many other issues which
I have not considered...

Kind Regards,
Ralf
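
A minimal sketch of the binary-weight idea above, written in plain Python rather than against the Lucene API. The function names, the vocabulary and the sample documents are made up for illustration, and cosine similarity is only one possible choice for the "other similarity algorithm"; the post does not say which one Ralf has in mind.

```python
import math

def binary_weights(tokens, vocabulary):
    """Map a token list to a 1.0 (present) / 0.0 (absent) weight per vocabulary term."""
    present = set(tokens)
    return [1.0 if term in present else 0.0 for term in vocabulary]

def binary_cosine(query_vec, doc_vec):
    """Cosine similarity; on binary vectors this equals |q and d| / sqrt(|q| * |d|)."""
    overlap = sum(q * d for q, d in zip(query_vec, doc_vec))
    # For 0/1 weights the squared length of a vector is just the sum of its weights.
    norm = math.sqrt(sum(query_vec) * sum(doc_vec))
    return overlap / norm if norm else 0.0

# Hypothetical vocabulary and documents, for illustration only.
vocabulary = ["index", "lucene", "model", "probabilistic", "vector"]
doc_a = binary_weights(["lucene", "vector", "model", "vector"], vocabulary)
doc_b = binary_weights(["probabilistic", "model"], vocabulary)
query = binary_weights(["probabilistic", "model"], vocabulary)

print(binary_cosine(query, doc_a))
print(binary_cosine(query, doc_b))
```

Note that repeated terms (such as "vector" in doc_a) no longer influence the score once the weights are binary, which is exactly the information the proposal gives up.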


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


  • Chong, Herb at Dec 3, 2003 at 10:01 pm
    i think i am missing the original question, but by most accepted definitions, the tf/idf model in Lucene is a probabilistic model. it has strange normalizations, though, that don't allow comparisons of rank values across queries.

    it isn't terribly hard to make a normalized probabilistic model that allows comparing document scores across queries and assigns a meaning to the score. i've done it. however, that means abandoning idf and keeping actual term frequencies for each document, along with the document size. once you normalize this way, you can intermingle document scores from different queries and different corpora and make statements about the absolute value of the score. it also leads directly into the discussion we had earlier about interterm correlations and how to handle them properly, since the full interterm probabilistic model has the traditional tf/idf model as a special case. introducing Boolean conditions and boosts makes the model much more complicated.

    Herb....
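
Herb does not describe his model in this post, so the following is only one possible reading of "abandoning idf and keeping actual term frequencies and document size", not his implementation. The sketch scores a document by the probability that a token drawn at random from the document equals a term drawn at random from the query. The function name and the sample data are assumptions made for illustration.

```python
from collections import Counter

def match_probability(query_terms, doc_tokens):
    """Probability that a random query term equals a random document token.

    Uses only per-document term frequencies and the document length;
    no collection-wide statistic such as idf is involved.
    """
    if not query_terms or not doc_tokens:
        return 0.0
    tf = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    return sum(tf[term] / doc_len for term in query_terms) / len(query_terms)

# Hypothetical document, for illustration only.
doc = ["lucene", "scores", "documents", "lucene"]
print(match_probability(["lucene"], doc))          # 0.5
print(match_probability(["lucene", "solr"], doc))  # 0.25
```

Because the value depends only on the query and the document, it always lies between 0 and 1 and keeps the same meaning across queries and corpora, which is the property asked for in this thread.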

    -----Original Message-----
    From: Karsten Konrad
    Sent: Wednesday, December 03, 2003 4:51 PM
    To: Lucene Users List
    Subject: AW: Probabilistic Model in Lucene - possible?

    >>
    I would highly appreciate it if the experts here (especially Karsten or
    Chong) look at my idea and tell me if this would be possible.
    >>

    Sorry, I have no idea about how to use a probabilistic approach with
    Lucene, but if anyone does so, I would like to know, too.

    I am currently puzzled by a related question: I would like to know
    if there are any approaches to get a confidence value for relevance
    rather than a ranking. I.e., it would be nice to have a ranking
    weight whose value has some kind of semantics such that we could
    compare results from different queries. Can probabilistic approaches
    do anything like this?

  • Chong, Herb at Dec 4, 2003 at 1:53 pm
    not all tf/idf variants are probabilistic models, but a great many are if the term weights are probabilities. if we just take straight, unmodified Term Frequency in a document, Inverse Document Frequency in the corpus, and the Term Frequency in the query as 1, you are in fact comparing the statistical properties of the query against the statistical properties of the document. they are probabilities you are comparing. i can't think of many papers that come right out and say it, but if you look at an individual term weight and can interpret it as a genuine probability, the vector space model based on the weights is a probabilistic model. the derivation that shows it is relatively straightforward, if you have the right general model to start with. once you start throwing in ad hoc normalizations, things get out of whack and it's no longer a probabilistic model.

    the implementations that i have done are with a former company, and that means secret and protected by various intellectual property rights. however, i can sketch here the general approach one has to take and an outline of the derivation that unifies probabilistic models with vector space models and at the same time incorporates pairwise interterm correlation. in fact, the pairwise interterm correlations are a fundamental assumption. once you do all this, you can show that the traditional vector space model is a special case of a pairwise interterm correlation model. for those who are interested in advanced matrix algebra and some basic statistics, it should be very interesting. if only i had a published paper, i would post it. unfortunately, what i have is very obtuse because it's protected. the only paper that started out was submitted to SIGIR but rejected by all but one referee. that one thought this was a tremendous unification of the two methods, but academic journals being what they are, when 4 out of 5 referees can't understand the paper, it doesn't get published. i may dust it off and enlarge it into a much longer paper for the Journal of IR, but once again, unless you are comfortable with probability theory and matrix theory, you are not going to follow it.

    so, who is game for a tutorial on the derivation?

    Herb...
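
The claim that unmodified tf and idf can be read as probabilities can be made concrete with a short sketch. It assumes the textbook definitions, with tf divided by document length and idf = log(N / df); it is not Herb's derivation, and the function names and numbers are illustrative only.

```python
import math

def term_probability(tf, doc_len):
    """P(term | document): chance that a random token of the document is the term."""
    return tf / doc_len

def document_probability(df, num_docs):
    """P(document contains term): chance that a random document has the term."""
    return df / num_docs

def idf(df, num_docs):
    """Textbook idf, written as the negative log of a probability."""
    return -math.log(document_probability(df, num_docs))

# Hypothetical counts: the term occurs 3 times in a 60-token document
# and appears in 10 of 1000 documents.
p_term = term_probability(3, 60)
weight = p_term * idf(10, 1000)
print(p_term, idf(10, 1000), weight)
```

Seen this way, idf is the information content of the event "a random document contains the term". Any extra damping or offset applied to tf or idf breaks that direct reading, which matches the remark about ad hoc normalizations.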

    -----Original Message-----
    From: Karsten Konrad
    Sent: Thursday, December 04, 2003 5:09 AM
    To: Lucene Users List
    Subject: AW: Probabilistic Model in Lucene - possible?



    Hi Herb,

    thank you for your insights.

    >>
    but by most accepted definitions, the tf/idf model in Lucene is a probabilistic model.
    >>

    Can you send some pointers to help me understand that? Are all TF/IDF-variants
    probabilistic models? If so, what makes any model a non-probabilistic one?
    If you claim that TF/IDF is probabilistic, then the plain cosine (an extreme
    form of TF/IDF, with IDF for all terms being considered constant) of VSM would
    also be a probabilistic model.

    >>
    it's got strange normalizations though that doesn't allow comparisons of rank values across queries.
    >>

    Lucene's internal ranking sometimes returns values > 1.0; these are then normalized to 1.0,
    adjusting other rankings accordingly. While I have nothing to say against this - it's a hack,
    but useful - it makes comparing the rank values across queries really difficult. It's like
    using different scales whenever you measure something different, and then you do not tell
    anyone about it.

    >>
    it isn't terribly hard to make a normalized probabilistic model that allows comparing of document scores across queries and assign a meaning to the score. i've done it.
    >>

    Stop bragging, send us your Similarity implementation :)

    Regards,

    Karsten
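
The effect Karsten describes can be reproduced with a few lines. The sketch follows his description only (rescale when the top score exceeds 1.0, adjust the rest accordingly); `normalize_hits` is a hypothetical stand-in written for this note, not Lucene code, and the raw scores are invented.

```python
def normalize_hits(raw_scores):
    """Rescale a result list so that a top score above 1.0 becomes exactly 1.0.

    Lists whose top score is already <= 1.0 are returned unchanged.
    """
    if not raw_scores:
        return []
    top = max(raw_scores)
    factor = 1.0 / top if top > 1.0 else 1.0
    return [score * factor for score in raw_scores]

# Two hypothetical queries with very different raw score levels.
query_1 = normalize_hits([4.0, 2.0, 1.0])
query_2 = normalize_hits([2.0, 1.0, 0.5])
print(query_1)  # [1.0, 0.5, 0.25]
print(query_2)  # [1.0, 0.5, 0.25]
```

Both lists look identical after normalization even though the first query matched twice as strongly, and since the scaling factor is not reported, the reported values cannot be compared across queries.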


  • Adam Saltiel at Dec 5, 2003 at 8:45 am
    Herb,
    Any one game ... ?
    No takers? I would be very interested, but it may be beyond what can be
    posted on a mailing list. I'd be equally interested in any references you
    may have.
    As we are on this subject, how do LSI and the similar CNG (context
    network graph) fit into the model used by Lucene? Could Lucene be
    massaged to implement different mathematical models of search and
    retrieval? If so, how modular are the core functions?

    Adam Saltiel

  • Shengli Wu at Dec 5, 2003 at 1:24 pm
    Dear all,

    I am interested in implementing a probabilistic model in Lucene as well.
    I checked the book titled "model information retrieval" by Ricardo
    Baeza-Yates and Berthier Ribeiro-Neto, and it seems to me that the
    implementation is not very complicated if we use Lucene's IndexReader
    class. Almost all the parameters needed are there: the total number of
    documents in the index (collection) and the number of documents containing a
    particular term. That is it. We probably still need to find a satisfactory
    method of defining the weights of terms in the documents as well as in the query.

    Cheers,

    Shengli
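
A sketch of how the two statistics mentioned above could feed a probabilistic term weight. In Lucene they would come from `IndexReader.numDocs()` and `IndexReader.docFreq(Term)`; here a toy in-memory index with invented data stands in for the reader. The weight is the smoothed Robertson/Sparck Jones form without relevance information, log((N - n + 0.5) / (n + 0.5)), which is an assumption of this note and not necessarily the exact formula Shengli has in mind.

```python
import math

# Toy inverted index: term -> ids of the documents containing it (made-up data).
POSTINGS = {
    "lucene": {0, 1, 2, 3},
    "probabilistic": {2},
    "model": {1, 2},
}
NUM_DOCS = 8  # total number of documents in the hypothetical collection

def doc_freq(term):
    """Number of documents containing the term."""
    return len(POSTINGS.get(term, ()))

def term_weight(term):
    """Smoothed probabilistic weight computed from N and n only."""
    n = doc_freq(term)
    return math.log((NUM_DOCS - n + 0.5) / (n + 0.5))

def score(query_terms, doc_id):
    """Sum the weights of the distinct query terms that occur in the document."""
    return sum(
        term_weight(term)
        for term in set(query_terms)
        if doc_id in POSTINGS.get(term, ())
    )

query = ["probabilistic", "model"]
for doc_id in (0, 1, 2):
    print(doc_id, score(query, doc_id))
```

A term that occurs in half of the collection ("lucene" here) gets weight 0, and terms in more than half of the documents get a negative weight, so a real implementation would need to decide how to handle very frequent terms.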



  • Ambiesense at Dec 5, 2003 at 8:38 pm
    I guess you mean "Modern Information Retrieval" ... I would be a little bit
    careful, since this book looks at things through theoretical glasses. It might
    turn out to be more difficult than expected. However, I would like to discuss
    this further. How could the values you are writing about be obtained? Any first ideas?

    Cheers,
    Ralf
    Deal all,

    I am interested in implement a probabilistic model in Lucene as well.
    I checked the book titled "model information retrieval" authored by
    Ricardo
    Baeza-Yates and Berthier Ribeiro-Neto, it seems to me that the
    implementation is not very complicated when we use Lucene's IndexReader
    class, almost all the parameters needed are there: the total number of
    document in the index (collection), the number of documents having a
    particular term, that is it. Probably we need to find out a satisfied
    method
    of defining the weights of terms in the documents as well as in the query.

    Cheers,

    Shengli



    Adam Saltiel <adam.saltiel@btinternet.com> said:
    Herb,
    Any one game ... ?
    No takers? I would be very interested, but maybe beyond what can be
    posted in a mail list. I'd be equally interested in any references you
    may have.
    As we are on this subject how does LSI and the similar CNG (context
    network graph) fit into the model used by lucene. Could lucene be
    massaged to implement different mathematical models of search and
    retrieval, if so how modular are the core functions?

    Adam Saltiel

    -----Original Message-----
    From: Chong, Herb
    Sent: Thursday, December 04, 2003 1:53 PM
    To: Lucene Users List
    Subject: RE: Probabilistic Model in Lucene - possible?

    not all tf/idf variants are probabilistic models, but a great many are if
    the term weights are probabilities. if we just take straight,
    unmodified
    Term Frequency in a document, Inverse Document Frequency in the corpus,
    and the Term Frequency in the query as 1, you are in fact comparing the
    statistical properties of the query against the statistical properties of
    the query. they are probabilities you are comparing. i can't think of many
    papers that come right out and say it, but if you look at an
    individual
    term weight and can interpret it as a genuine probability, the vector
    space model based on the weights is a probabilistic model. the
    derivation
    is relatively straight forward to show it, if you have the right general
    model to start with. once you start throwing in ad hoc normalizations,
    then things get out of whack and it's not longer a probabilistic model.
    the implementations that i have done are with a former company and that
    means secret and protected by various intellectual property rights.
    however, i can sketch here the general approach one has to take and an
    outline of the derivation that unifies probabilistic models with vector
    space models and at the same time incorporate pairwise interterm
    correlation. in fact, the pairwise interterm correlations are a
    fundamental assumption. once you do all this, you can show that the
    traditional vector space model is a special case of a pairwise interterm
    correlation model. for those that are interested in advanced matrix
    algebra and some basic statistics, it should be very interesting. if only
    i had a published paper, i would post it. unfortunately, what i have is
    very obtuse because it's protected. the only paper that started out was
    submitted to SIGIR but rejected by all but one referee. that one thought
    this was a tremendous unification of the two methods, but academic
    journals being what they are, when 4 out of 5 referees can't
    understand
    the paper, it doesn't get published. i may brush it off and enlarge into a
    much longer paper for the Journal of IR, but once again, unless you are
    comfortable with probability theory and matrix theory, you are not going
    to follow it.

    so, who is game for a tutorial on the derivation?

    Herb...

    -----Original Message-----
    From: Karsten Konrad
    Sent: Thursday, December 04, 2003 5:09 AM
    To: Lucene Users List
    Subject: AW: Probabilistic Model in Lucene - possible?



    Hi Herb,

    thank you for your insights.
    but by most accepted definitions, the tf/idf model in Lucene is a
    probabilistic model.
    Can you send some pointers to help me understand that? Are all TF/IDF-
    variants
    probabilistic models? If so, what makes any model a non-probabilistic one?
    If you claim that TF/IDF is probabilistic, then the plain cosine (an
    extreme
    form of TF/IDF, with IDF for all terms being considered constant) of VSM
    would
    also be a probabilistic model.
    it's got strange normalizations though that doesn't allow comparisons of
    rank values across queries.
    Lucene's internal ranking sometimes returns values > 1.0, these are then
    normalized to 1.0,
    adjusting other rankings accordingly. While I have nothing to say against
    this - it's a hack,
    but useful - it makes comparing the rank values across queries really
    difficult. It's like
    using different scales whenever you measure something different, and then
    you do not tell
    anyone about it.
    it isn't terribly hard to make a normalized probabilistic model that
    allows comparing of document scores across queries and assign a
    meaning to
    the score. i've done it.
    Stop bragging, send us your Similarity implementation :)

    Regards,

    Karsten


    -----Ursprüngliche Nachricht-----
    Von: Chong, Herb
    Gesendet: Mittwoch, 3. Dezember 2003 23:01
    An: Lucene Users List
    Betreff: RE: Probabilistic Model in Lucene - possible?


    i think i am missing the original question, but by most accepted
    definitions, the tf/idf model in Lucene is a probabilistic model. it's got
    strange normalizations though that don't allow comparisons of rank
    values across queries.

    it isn't terribly hard to make a normalized probabilistic model that
    allows comparing document scores across queries and assigning a meaning
    to the score. i've done it. however, that means abandoning idf and keeping
    actual term frequencies for each document and document size. once you
    normalize this way, you can intermingle document scores from different
    queries and different corpora and make statements about the absolute value
    of the score. it also leads directly into the discussion we had earlier
    about interterm correlations and how to handle them properly, since the
    full interterm probabilistic model has the traditional tf/idf model as a
    special case. interjecting Boolean conditions and boost makes the model
    much more complicated.

    Herb....
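
As a rough illustration of scoring without idf, the sketch below uses only per-document term frequencies and document length, so the score does not depend on the rest of the corpus. This is an assumption made for illustration, not Herb's model (which the thread does not disclose); the function name `relative_frequency_score` and the sample text are hypothetical.

```python
from collections import Counter


def relative_frequency_score(query_terms, doc_tokens):
    """Average relative frequency of the query terms in one document.

    Uses only tf and document length (no idf), so the value lies in
    [0, 1] and does not change when other documents are added or removed.
    """
    if not query_terms or not doc_tokens:
        return 0.0
    counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    total = sum(counts[term] / doc_len for term in query_terms)
    return total / len(query_terms)


doc = "lucene is a search library lucene scores documents".split()
score = relative_frequency_score(["lucene", "search"], doc)
```

Since nothing in the formula refers to collection statistics, scores from different queries or different corpora sit on the same scale, which is the property the message above is after.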

    -----Original Message-----
    From: Karsten Konrad
    Sent: Wednesday, December 03, 2003 4:51 PM
    To: Lucene Users List
    Subject: RE: Probabilistic Model in Lucene - possible?

    > I would highly appreciate it if the experts here (especially Karsten or
    > Chong) look at my idea and tell me if this would be possible.

    Sorry, I have no idea how to use a probabilistic approach with
    Lucene, but if anyone does so, I would like to know, too.

    I am currently puzzled by a related question: I would like to know if
    there are any approaches to get a confidence value for relevance
    rather than a ranking. I.e., it would be nice to have a ranking
    weight whose value has some kind of semantics such that we could
    compare results from different queries. Can probabilistic approaches
    do anything like this?

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
    For additional commands, e-mail: lucene-user-help@jakarta.apache.org


  • Chong, Herb at Dec 4, 2003 at 1:57 pm
    oh, i forgot to mention, the interterm correlation model has been tested on TREC and delivers up to 0.10 improved precision at 0 recall. however, i used a single-pass search engine without automatic relevance feedback and with no linguistic analysis of the documents or queries, so i suffered in overall performance. the submission i made did perform the best of the single-pass methods though, and middle of the pack among the multipass ones.

    Herb...

    -----Original Message-----
    From: Chong, Herb
    Sent: Thursday, December 04, 2003 8:53 AM
    To: Lucene Users List
    Subject: RE: Probabilistic Model in Lucene - possible?

    the implementations that i have done were for a former company, which means they are secret and protected by various intellectual property rights. however, i can sketch here the general approach one has to take, along with an outline of the derivation that unifies probabilistic models with vector space models and at the same time incorporates pairwise interterm correlation. in fact, the pairwise interterm correlations are a fundamental assumption. once you do all this, you can show that the traditional vector space model is a special case of a pairwise interterm correlation model. for those who are interested in advanced matrix algebra and some basic statistics, it should be very interesting.

    if only i had a published paper, i would post it. unfortunately, what i have is very obtuse because it's protected. the only paper that got started was submitted to SIGIR but rejected by all but one referee. that one thought this was a tremendous unification of the two methods, but academic journals being what they are, when 4 out of 5 referees can't understand the paper, it doesn't get published. i may dust it off and expand it into a much longer paper for the Journal of IR, but once again, unless you are comfortable with probability theory and matrix theory, you are not going to follow it.
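
One common way to see how a vector space score can be a special case of a model with pairwise term correlations is the bilinear form q^T C d, where C holds term-to-term correlations. The sketch below assumes that formulation; it is not Herb's derivation, which the thread does not give, and the names `correlation_score` and `dot` and all numbers are hypothetical.

```python
def dot(query, doc):
    """Plain vector space score: inner product of term-weight vectors."""
    return sum(q * d for q, d in zip(query, doc))


def correlation_score(query, doc, corr):
    """Bilinear form q^T C d, where corr[i][j] relates term i to term j."""
    return sum(
        query[i] * corr[i][j] * doc[j]
        for i in range(len(query))
        for j in range(len(doc))
    )


query = [1.0, 0.0, 1.0]   # weights for terms 0, 1, 2 in the query
doc = [2.0, 3.0, 0.0]     # weights for the same terms in a document

# With no cross-term correlation, C is the identity matrix.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Assumed correlation of 0.5 between terms 1 and 2.
correlated = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.5], [0.0, 0.5, 1.0]]

independent_score = correlation_score(query, doc, identity)
correlated_score = correlation_score(query, doc, correlated)
```

With the identity matrix the score equals the ordinary dot product; with the off-diagonal entry, query term 2 also gets credit for the document's term 1.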

  • Chong, Herb at Dec 5, 2003 at 1:33 pm
    anyone interested, contact me offline. i'll email an outline of the derivation to whoever contacts me by the end of next week, and we can discuss it in private emails. i guarantee you will learn something interesting about search engines.

    Herb....

    -----Original Message-----
    From: Adam Saltiel
    Sent: Friday, December 05, 2003 3:46 AM
    To: 'Lucene Users List'
    Subject: RE: Probabilistic Model in Lucene - possible?


    Herb,
    Anyone game ...?
    No takers? I would be very interested, but maybe this is beyond what can be
    posted on a mailing list. I'd be equally interested in any references you
    may have.


Discussion Overview
group: java-user @
category: lucene
posted: Dec 3, '03 at 2:13p
active: Dec 5, '03 at 8:38p
posts: 8
users: 4
website: lucene.apache.org
