Chong, Herb

not all tf/idf variants are probabilistic models, but a great many are if
the term weights are probabilities. if we just take straight, unmodified
Term Frequency in a document, Inverse Document Frequency in the corpus,
and the Term Frequency in the query as 1, you are in fact comparing the
statistical properties of the query against the statistical properties of
the document. they are probabilities you are comparing. i can't think of
many papers that come right out and say it, but if you look at an
individual term weight and can interpret it as a genuine probability, the
vector space model based on the weights is a probabilistic model. the
derivation showing it is relatively straightforward, if you have the right
general model to start with. once you start throwing in ad hoc
normalizations, then things get out of whack and it's no longer a
probabilistic model.
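
To make the "straight, unmodified" case concrete, here is a minimal sketch of
a tf/idf dot product with the query Term Frequency fixed at 1. The function
names, the toy corpus, and the log(N / df) form of idf are assumptions made
for illustration; this is not Lucene's scoring code and not the derivation
mentioned above.

```python
import math
from collections import Counter

def idf(term, docs):
    """Inverse document frequency, assumed here to be log(N / df)."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0

def tfidf_score(query_terms, doc, docs):
    """Dot product of query and document vectors, query TF fixed at 1."""
    tf = Counter(doc)
    # Each distinct query term contributes tf(term, doc) * idf(term).
    return sum(tf[t] * idf(t, docs) for t in set(query_terms))

# Toy corpus (hypothetical): each document is a list of tokens.
docs = [
    ["lucene", "ranking", "model"],
    ["lucene", "lucene", "index"],
    ["vector", "space", "model"],
]
scores = [tfidf_score(["lucene", "index"], d, docs) for d in docs]
```

No length or query normalization is applied here, which is exactly the kind
of ad hoc adjustment the paragraph above says breaks the probabilistic
reading.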

the implementations that i have done were with a former company, which
means they are secret and protected by various intellectual property
rights. however, i can sketch here the general approach one has to take
and an outline of the derivation that unifies probabilistic models with
vector space models and at the same time incorporates pairwise interterm
correlation. in fact, the pairwise interterm correlations are a
fundamental assumption. once you do all this, you can show that the
traditional vector space model is a special case of a pairwise interterm
correlation model. for those who are interested in advanced matrix
algebra and some basic statistics, it should be very interesting. if only
i had a published paper, i would post it. unfortunately, what i have is
very obtuse because it's protected. the only paper that started out was
submitted to SIGIR but rejected by all but one referee. that one thought
this was a tremendous unification of the two methods, but academic
journals being what they are, when 4 out of 5 referees can't understand
the paper, it doesn't get published. i may brush it off and enlarge it
into a much longer paper for the Journal of IR, but once again, unless you
are comfortable with probability theory and matrix theory, you are not
going to follow it.

so, who is game for a tutorial on the derivation?

Herb...

Karsten Konrad

Hi Herb,

thank you for your insights.

but by most accepted definitions, the tf/idf model in Lucene is a
probabilistic model.

Can you send some pointers to help me understand that? Are all TF/IDF
variants probabilistic models? If so, what makes any model a
non-probabilistic one? If you claim that TF/IDF is probabilistic, then the
plain cosine (an extreme form of TF/IDF, with IDF for all terms being
considered constant) of VSM would also be a probabilistic model.

it's got strange normalizations though that don't allow comparisons of
rank values across queries.

Lucene's internal ranking sometimes returns values > 1.0; these are then
normalized to 1.0, adjusting other rankings accordingly. While I have
nothing to say against this (it's a hack, but useful), it makes comparing
the rank values across queries really difficult. It's like using different
scales whenever you measure something different, and then you do not tell
anyone about it.
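
The effect described here can be sketched in a few lines. The function
below is a hypothetical stand-in for the behaviour as described in this
message (rescale only when the top raw score exceeds 1.0); it is not copied
from Lucene, and the raw scores are made-up numbers.

```python
def normalize_hits(raw_scores):
    """Rescale so the top hit is 1.0, but only if it exceeds 1.0."""
    top = max(raw_scores, default=0.0)
    if top <= 1.0:
        # Scores already within range are reported unchanged.
        return list(raw_scores)
    # Otherwise every score is divided by the top score.
    return [s / top for s in raw_scores]

# Two queries with made-up raw scores.
query_a = normalize_hits([2.0, 1.0, 0.5])  # rescaled by 1 / 2.0
query_b = normalize_hits([0.8, 0.4, 0.2])  # left as-is
```

A raw score of 1.0 is reported as 0.5 for the first query, while a raw
score of 0.8 is reported as 0.8 for the second, so the reported values sit
on different scales and the caller is never told which one was used.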

it isn't terribly hard to make a normalized probabilistic model that
allows comparing document scores across queries and assigning a meaning to
the score. i've done it.

Stop bragging, send us your Similarity implementation :)

Regards,

Karsten

Chong, Herb

i think i am missing the original question, but by most accepted
definitions, the tf/idf model in Lucene is a probabilistic model. it's got
strange normalizations though that don't allow comparisons of rank values
across queries.

it isn't terribly hard to make a normalized probabilistic model that
allows comparing document scores across queries and assigning a meaning to
the score. i've done it. however, that means abandoning idf and keeping
actual term frequencies for each document and document size. once you
normalize this way, you can intermingle document scores from different
queries and different corpora and make statements about the absolute value
of the score. it also leads directly into the discussion we had earlier
about interterm correlations and how to handle them properly, since the
full interterm probabilistic model has as a special case the traditional
tf/idf model. interjecting Boolean conditions and boost makes the model
much more complicated.
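
One possible reading of "abandoning idf and keeping actual term frequencies
for each document and document size" is sketched below. It is an
illustrative assumption only, not the proprietary model referred to in this
thread: each query term is weighted by tf / document length, and the score
is the mean of those probabilities. The function names and the sample
document are hypothetical.

```python
from collections import Counter

def term_probability(term, doc):
    """Probability that a token drawn at random from doc equals term."""
    if not doc:
        return 0.0
    return Counter(doc)[term] / len(doc)

def absolute_score(query_terms, doc):
    """Mean per-term probability over the distinct query terms, in [0, 1]."""
    terms = set(query_terms)
    if not terms:
        return 0.0
    return sum(term_probability(t, doc) for t in terms) / len(terms)

# Hypothetical document of four tokens.
doc = ["lucene", "lucene", "index", "model"]
score = absolute_score(["lucene", "index"], doc)  # (0.5 + 0.25) / 2
```

Because nothing in the score depends on corpus-wide statistics, two scores
computed against different queries or different corpora are on the same
scale, which is the property described above. Interterm correlations,
Boolean conditions, and boosts are deliberately left out.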

Herb....

Karsten Konrad

I would highly appreciate it if the experts here (especially Karsten or
Chong) look at my idea and tell me if this would be possible.

Sorry, I have no idea about how to use a probabilistic approach with
Lucene, but if anyone does so, I would like to know, too.

I am currently puzzled by a related question: I would like to know if
there are any approaches to get a confidence value for relevance rather
than a ranking. I.e., it would be nice to have a ranking weight whose
value has some kind of semantics such that we could compare results from
different queries. Can probabilistic approaches do anything like this?

