Marvin, thanks for your ideas... I gave some quick comments below.
On Fri, Nov 20, 2009 at 12:34 AM, Marvin Humphrey wrote:
On Thu, Nov 19, 2009 at 10:47:14PM -0500, Robert Muir wrote:
1. Doing some kind of "basic relevancy test" for lucene release
candidates?
Marvin mentioned at apachecon how maybe relevancy tests could have caught
bugs similar to the 2.9.0 scorer bug, (ok maybe not that specific one), but
I agree.
For Lucy, ideally I want us to be benchmarking relevancy just like we
benchmark performance. Two major features of Lucy are pluggable index
components and the ability to do rapid prototyping in the host language. We
should have some criteria by which we judge the crazy ideas people come up
with and we should be able to single out and reward the good ones.
I think one difference here though is that with performance, you can
microbenchmark, although it's tricky as you know.
With relevance it's not really safe to do that in my opinion...
But for "big" ideas, we definitely can.
I think in general the more test collections we have, the more accurate the
picture becomes...
We don't need a perfect collection immediately, maybe lots of imperfect ones
will do the trick for the time being?
One thing we don't do by default in Lucene is take proximity of term matches
into account. I'd like to measure how much relevance improves when we augment
basic boolean queries with span queries. That would make it easier to make a
sensible engineering tradeoff between responsiveness and relevance. And maybe
if the improvements are significant enough, somebody will build an optimized
scoring class tree which exploits that information.
Yes, this is a really good idea to evaluate, although my gut instinct tells
me the gains will be language-dependent to some extent.
It's the same challenge faced in MT when related words can be very far apart in a
sentence in some languages, and the output comes out as gobbledygook because
some phrase-based MT doesn't use "enough" context for that language.
If span queries help, I think it would be even better to produce an idea of
what kind of 'span context' you need for some different languages/document
types for good results.
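One way to put a number on "span context" is to record, for each relevant document, the width of the smallest window that covers every query term. The sketch below is only an illustration of that measurement: the function name `min_cover_window` and the input shape (one list of token positions per query term) are assumptions, and it is not how Lucene's span queries are implemented.

```python
import heapq


def min_cover_window(position_lists):
    """Width (in tokens) of the smallest window holding one hit per term.

    position_lists: one list of token positions per query term
    (hypothetical input shape). Returns None if any term is missing.
    """
    if not position_lists or any(not p for p in position_lists):
        return None
    lists = [sorted(p) for p in position_lists]
    # Heap entries are (position, term index, index into that term's list).
    heap = [(l[0], i, 0) for i, l in enumerate(lists)]
    heapq.heapify(heap)
    current_max = max(l[0] for l in lists)
    best = None
    while True:
        lo, i, j = heapq.heappop(heap)
        width = current_max - lo + 1
        if best is None or width < best:
            best = width
        if j + 1 == len(lists[i]):
            # The leftmost term has no later occurrence, so no
            # narrower window can exist.
            return best
        nxt = lists[i][j + 1]
        current_max = max(current_max, nxt)
        heapq.heappush(heap, (nxt, i, j + 1))


# Two terms: the closest pair is at positions 20 and 21.
window = min_cover_window([[1, 20], [5, 21]])
```

Collecting these widths per language or document type would give a rough distribution of how much slop a span query needs before it starts matching relevant documents.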
2. Evaluating scoring algorithms? For example, there is an ASL BM25 impl I
have been using now for some time, with good results.
Code-wise, changing up Similarity implementations and settings is cake.
I still don't understand exactly how we judge ranking with these collections,
though. My understanding is that the collections mark each document as either
1 or 0, meaning "relevant or not relevant". That's fine for precision and
recall (http://en.wikipedia.org/wiki/Precision_and_recall)... but how do we
use these materials to judge relative ranking of successful matches when the
scores in the answer book are all exactly 1.0?
oh you mean more fine-grained measurement where the top 100 docs are all
relevant, but some more relevant than others?
I guess you are right, but I think in general using precision/recall is a
good indicator that things are being ranked correctly across the board?
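A small sketch of why binary judgments can still say something about ranking: set-based precision and recall ignore order, but average precision (precision taken at each rank where a relevant document appears) rewards putting the relevant documents first. The document ids and the two rankings below are made up for illustration, and the code assumes a ranking contains no duplicate ids.

```python
def precision_recall(ranked, relevant):
    """Set-based precision and recall; the order of `ranked` is ignored."""
    retrieved = set(ranked)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def average_precision(ranked, relevant):
    """Mean of precision@k over the ranks k that hold a relevant doc."""
    hits = 0
    total = 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0


# Hypothetical judgments: only d1 and d2 are marked relevant (1), the rest 0.
relevant = {"d1", "d2"}
good = ["d1", "d2", "d3", "d4"]  # relevant docs ranked first
poor = ["d3", "d4", "d1", "d2"]  # same docs, relevant ones ranked last

same_pr = precision_recall(good, relevant) == precision_recall(poor, relevant)
ap_good = average_precision(good, relevant)
ap_poor = average_precision(poor, relevant)
```

Both rankings get identical precision and recall, yet average precision separates them. What binary judgments cannot express is that one relevant document is better than another; that needs graded judgments.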
I think when I say 'scoring' I am usually referring to algorithms that are
very different from what Lucene provides via Similarity.
For example, Lucene does not provide 'average document length' in Similarity
to support a lot of the more modern formulas like BM25.
(Side note: I have measured nice gains on Hindi and Persian by using BM25 w/
avg doc length calculated from norms, which you read in anyway.)
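To show where average document length enters the formula, here is a minimal sketch of a per-term BM25 score. It is not the ASL implementation mentioned above: the function name, the defaults k1=1.2 and b=0.75, and the non-negative idf variant log(1 + (N - df + 0.5) / (df + 0.5)) are all assumptions chosen for illustration, and the average length is computed from raw lengths rather than from norms.

```python
import math


def bm25_term_score(tf, df, doc_len, avg_doc_len, num_docs, k1=1.2, b=0.75):
    """BM25 contribution of a single query term to one document.

    tf: term frequency in the document; df: documents containing the term.
    k1 and b are common default values, assumed here.
    """
    idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
    # Length normalization: documents longer than average are penalized.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)


# Toy collection of three documents (made-up lengths).
doc_lengths = [50, 100, 150]
avg_len = sum(doc_lengths) / len(doc_lengths)

# Same term frequency, different document lengths.
short = bm25_term_score(tf=2, df=1, doc_len=50, avg_doc_len=avg_len, num_docs=3)
long_ = bm25_term_score(tf=2, df=1, doc_len=150, avg_doc_len=avg_len, num_docs=3)
```

With equal term frequency the shorter document scores higher, and setting b=0 removes the length effect entirely. The `doc_len / avg_doc_len` ratio is the piece that needs a collection-wide statistic.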
1. Move away from this current format / lucene benchmark package?
ORP should definitely do that, if for no other reason than that there seems to
be consensus that other search engines need to be supported.
If it makes sense for Lucene or Lucy to extend the ORP benchmark code, we can
choose to do so. If we can share at least some of the benchmarking framework
code across projects, that would be great. I'm not sure whether that will work
out well in practice, but it's worth a try for Lucy at least -- in a worst
case scenario, we fork and port.
Yeah, I hate the current format. I only wanted to get things off the ground
quickly since Lucene's benchmark pkg can already deal with it.
And even Lucene's "quality" pkg in the benchmark pkg is so limited:
for example, it only runs topic queries, but most of the time you want to also
be able to run topic+description or topic+description+narrative.
You also have to change code (at query time) to swap in a new analyzer, or
whatever... you can change analyzer in the .alg file at index time though.
I think it is more geared at benchmarking performance.
For example I do not even know if we can change Similarity easily using a
.alg file for indexing...
But it is something for now.
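As a sketch of the topic / topic+description / topic+description+narrative choice, a harness could take the list of topic fields as a parameter instead of hard-coding one. Everything here is hypothetical: the dict layout, the field names (with "title" standing in for what is called the topic above), and the sample text are assumptions, not the benchmark package's API.

```python
def build_query_text(topic, fields=("title",)):
    """Concatenate the chosen topic fields into one query string.

    topic: dict with optional "title", "description", "narrative" keys
    (hypothetical layout). Missing or empty fields are skipped.
    """
    parts = [topic[f].strip() for f in fields if topic.get(f)]
    return " ".join(parts)


# Made-up topic for illustration.
topic = {
    "title": "river flooding",
    "description": "Find reports of flooding caused by rivers.",
    "narrative": "",
}

t_only = build_query_text(topic)
t_and_d = build_query_text(topic, fields=("title", "description"))
t_d_n = build_query_text(topic, fields=("title", "description", "narrative"))
```

Making the field list (and likewise the analyzer or Similarity) a run-time setting would allow the same topics to be re-run under different configurations without code changes.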
2. Setup easy framework inside openrelevance itself to do the relevance
tests?
I would kind of like to see Open Relevance become the project that sets the
standard for scientifically rigorous benchmarking, including but of course not
limited to competitive performance benchmarking.
There's some overlap between the tasks of benchmarking performance and
benchmarking relevance.
I hadn't really thought about performance as well, but you are right, we
should keep this in mind. That's a good idea; it's part of the whole picture.
For example, such a thing could possibly integrate the stuff Nicola spoke of
to produce output in a more digestible format.
Otherwise, each project will have to implement a framework itself (like
lucene-java's).
Each project will still need to implement its own client to interface with
the Open Relevance benchmarking API, but I think a lot can be shared.
OK, let's start thinking of some ways to improve this. Currently you have to
apply ORP-2 to fix the README instructions,
but then anyone should be able to run the current stuff if they want, and
maybe have ideas for improvements.
Finally, I still think the whole crowdsourcing/building our own relevance
stuff is a fantastic idea and should still be pursued.
It's just that personally I am better able to help with the above, and I
think it could complement that separate effort.
Agreed. But we should move forward with the corpuses we've got today.
Move forward doing more tests/producing worthwhile relevance experiments for
projects like Lucene?
Do you think this might create more developer interest for ORP,
or is the problem that everyone is already familiar with relevance testing,
but just has no time?
Marvin Humphrey
--
Robert Muir
rcmuir@gmail.com