The score normalization is actually more important for purposes of
review. It actually is possible that both D1 and D2 properly match to
F1. Some customers have repeats of the same film (e.g. Spiderman 2 and
Spiderman 2 in HD). When the system goes through and records the
potential matches, our review team needs to be able to determine
whether these matches are correct or not. In order to do so quickly
and accurately, they want the ability to judge things like "This is
why it chose this as a match, the title was a poor score, but the
actors and year were very highly scored". They might mark that as a
good match or as a bad match. Things that are recorded as bad matches
will be avoided on subsequent runs (e.g. the engine will match the
highest scoring item that is not in the blacklist for this source

Having all the customer feeds and the classified feed in one big index
hasn't really caused me any problems so far. The filter queries seem
to work fine for restricting a particular query to one set or the
On 5/31/07, Doron Cohen wrote:
I have no particular experience with matching
problems so the following might be off target...

Anyhow, if I understand correctly, problem is that,
currently, given a set of customer film descriptions
{D1, D2, ... , Dn}, a set of n queries are created
and each query can match at most one film in the
classification index, and it is impossible for both D1
and D2 to match the same classified film F. But if
both q1 and q2 matched a certain film F, some score
normalization is required in order to select: F matches
D1, or F matches D2.

If so, I am not sure that deep tuning of Lucene
scores is the best way to go.

How about the following alternative:
(1) Create an auxiliary index containing all (and
only) the customer documents. So each document in
the aux index represents a customer film description.
(2) For each customer document D, create the query Q,
and run it on the combined index (using a MultiReader
over the auxiliary index (reader) and the classification
index (reader). The customer document D (from the aux
index) is supposed to be the top result. Ignore all
other scores from the aux index. Use this (max) score
of D to normalize all the scores of film docs from the
classification index.


"Daniel Einspanjer" <deinspanjer@gmail.com> wrote on 30/05/2007 17:18:10:
This may be a five year old explaining to a four year old why the sky
is blue, but I'll share some of the stuff I've picked up. :)

My application isn't so much a search engine as a matching engine. I
take a large list of movie documents from a customer like a movie
channel or a cable provider and match that list against the movies our
company has classified. I wrote a query parser on top of the native
query parser that understands interpolated terms such as
+title_fuzzy_multivalued:"${LongTitle}" and it will pull the LongTitle
field from the customer movie and plug it into that term.

The huge problem I ran into was one of scoring. Since this is
matching not searching, and since the interpolation causes the query
from item A to be different from item B and likely wildly different
from the queries used in a different customer's matching, I really
needed a good score that could be compared across the board.

The solution I opted for was what I call perfect score normalization.
Basically, I index both the customer feed and the classified feed.
When the user of my system is adding a new feed to the system, they
define field alignments, e.g. they map the customer's LongTitle field
to the title field of the classified feed. Then, they define the
appropriate indices to use for each field alignment, e.g. they might
index the title fields using the title_strict_string_multivalue and
the title_fuzzy_terms_multivalue indices.

Now that I have these common indices, when I perform the matching run,
I interpolate the query using the values for the source item and get
both the best match from the classified feed (using a Solr filter
query to restrict the result set to only items with the classified
feed id) and the match for the customer item (using a filter query on
that item's ID). Now that I have these two scores, they are
comparable in the sense that the score of the customer item is "as
good as it gets". I divide the match score by the reference item
score and if the value is greater than one for some reason, I subtract
the amount above one from one to penalize it for being "too good".

This strategy required a few tweaks in the Similarity class. I have
actor name phrase queries with a word slop of two so that I can match
First Last to Last, First. I made my tf(float) function return 0 or 1
so that the scores for those two items look the same. tf also matters
in the case of multiple hits of a term within a field such as title.
If I am matching a movie with the title "Caesar Came Saw and
Conquered", I don't want the title "Caesar Came, Caesar Saw, Caesar
Conquered" to have a higher score just because the word Caesar is

I customize the idf() function to always return a 1 for year fields
because it could do funny things to a score if the source item had a
year 1984 and my query term was year_year:[${year -1} TO ${year +1}]
and there was only one item with a year of 1983. The 1983 would
actually score higher than the 1984.

I'm currently looking at whether overriding queryNorm() to always
return 1 is a good thing or not. I saw reference in a recent thread
that doing that might cause ^ boosts in terms or clauses to not work
right so I need to go back and study that again.

The other big thing that I'm doing is that the user doesn't define the
query in one big lump. They break it down into scoring sections. all
the title related terms are in one section and all the year related
terms in a different one. The user defines weights that each of these
sections should contribute to my "weighted score". I run individual
queries for each of these scoring sections against the source and
target items and record those normalized scores then multiply them by
their weights and add them up to get my weighted score.
This strategy is working pretty well, but it is slow because of all
the extra queries. I know that I can eliminate them by getting access
to the Explanation object and parsing out the scores I want there, but
that is what I am in the middle of researching how to do now. :)

Anyway.. some of this might be useful to you or maybe it is all
babble. You are either welcome or asked for forgiveness respectively.

On 5/30/07, Grant Ingersoll wrote:
Hi Walt,

One question that comes to mind, is what are you looking to do? Are
you not happy with the current scoring or you just trying to better
understand scoring? The calls to Similarity.tf(), etc. are call
backs from within the scoring algorithm (have a look at TermScorer in
the code) and provide a means for an application to change the score,
but in many cases there really isn't too much incentive to do so.


On May 30, 2007, at 4:45 PM, Walt Stoneburner wrote:

Hopefully I'm not opening myself up to public ridicule with what may
be a very stupid question, but...

At the moment, I'm trying to wrap my head around some of
the math that
happens when Lucene does scoring. Let's put aside the big equation
for a moment and focus on a simple method, such as tf(). [term

I understand that tf(freq) is supposed to return larger values when
freq is large, and smaller values when freq is small. Though here's
what making me scratch my head today:

a) Where does freq come from? (Not what is it, but who computes it
and how?)

Reason I ask is:

b) How do I know what "large" and "small" is, as I don't
really have a
relative scale of what the max and min values are? Should I just
assume a linear scale of 1.0 to 0.0 will be passed to the method?

But then that begs the question...

c) What values should I be passing out of a function like this?
Should I normalize my outgoing scores to some scale, or do I simply
just need to provide numbers that "have the right shaped curve".

I wish the documentation shed a smidgen bit more light in
those areas.

I look at things like idf() which returns 1+log(ratio) and then has
that value squared. Clearly that isn't on a scale of 1.0 to 0.0.

I feel like there may be some mathematical trickery going on and that
maybe the actual score values themselves don't matter inside the
ranking code, so long as their relative values to one another.

This then makes me ponder how the normalization process is done
between queries, allowing for a mix'n'match of results as these
numbers spill to the outside world. Obviously normalization has to
happen at that point for the mixing query results magic to work.

Is there a math wizard in the group who can talk to me like I'm
four years old?


To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Grant Ingersoll
Center for Natural Language Processing

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/

To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Search Discussions

Discussion Posts


Follow ups

Related Discussions

Discussion Navigation
viewthread | post
posts ‹ prev | 6 of 8 | next ›
Discussion Overview
groupjava-user @
postedMay 30, '07 at 8:45p
activeJun 1, '07 at 12:59a



site design / logo © 2022 Grokbase