So, what comparisons can we set up using these collections?
I think we can be creative. For example, I used one of these tonight to test
LUCENE-1812, Andrzej's index pruning tool. Results showed that it works as
he advertised at ApacheCon...
Also, we should be careful about the English ones I linked to (or,
preferably, find bigger ones), because they are smallish collections.
I seem to recall you suggesting at ApacheCon that they would be handy when
judging Analyzer mods.
Yeah, I definitely don't think any results should be gospel for analyzers or
scoring or anything else, but then again I think we could detect if some
change is completely broken or silly (bugs, etc.).
These collections are all binary assertions -- relevant/not-relevant for a
given query -- right? Am I correct in presuming that such corpora can't be
used to judge scoring and ranking algorithms, or Similarity implementations?
I think most of them are binary... but I think I disagree with your second
statement; these kinds of collections are used to compare scoring/ranking
algorithms all the time!
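
For instance, binary judgments are enough to compute average precision per
query, and MAP across queries, which is the standard way two rankers get
compared. A rough Python sketch (the doc IDs and judgments are made up,
just to show the arithmetic):

    # Rough sketch: average precision (AP) for one query, given only
    # binary relevance judgments. Averaging AP across queries gives MAP.
    def average_precision(ranked_docs, relevant):
        hits = 0
        precision_sum = 0.0
        for rank, doc_id in enumerate(ranked_docs, start=1):
            if doc_id in relevant:
                hits += 1
                precision_sum += hits / rank
        return precision_sum / len(relevant) if relevant else 0.0

    # Made-up example: d2 and d5 are the relevant docs for this query.
    print(average_precision(["d1", "d2", "d3", "d4", "d5"], {"d2", "d5"}))
    # (1/2 + 2/5) / 2 = 0.45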
Also, if you have some ideas on how to perhaps create some Ant tasks to make
downloading/running these through the Lucene benchmark package easier, that
would be great too.
Hmm, that approach is specific to Lucene Java. It's not handy for either of
the projects I work on (Lucy, KinoSearch).
You raise a good point here. Really, at the end of the day, you just want to
produce a .txt file that you throw at the trec_eval command-line program or
something similar. Doing it in a lucene-java-specific way doesn't allow us
to easily evaluate things even in Solr, which, for example, has analysis
components that affect relevance!
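
For the record, the run file trec_eval reads is just six whitespace-separated
columns per retrieved doc: query_id Q0 doc_id rank score run_tag. A rough
Python sketch of writing one ('results' is a made-up stand-in for whatever
engine -- lucene-java, Solr, Lucy -- produced the hits):

    # Rough sketch: dump a run file in the six-column trec_eval format.
    def write_trec_run(results, run_tag, path):
        with open(path, "w") as out:
            for query_id, hits in sorted(results.items()):
                for rank, (doc_id, score) in enumerate(hits, start=1):
                    out.write("%s Q0 %s %d %f %s\n"
                              % (query_id, doc_id, rank, score, run_tag))

    write_trec_run({"51": [("DOC-1", 12.7), ("DOC-2", 9.3)]},
                   "myrun", "myrun.txt")

Then 'trec_eval qrels.txt myrun.txt' prints MAP, P@10, and friends, no
matter which engine wrote the file.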
I guess one approach could be to create scripts and stuff here that download
and munge these collections into a consistent format, and then Lucy,
lucene-java, Solr, whatever would have an easier time running them.
At some point, I'd planned to write a loose port of the Lucene benchmarking
suite so that Lucy (at least) could exploit it... The benchmarking code has
gotten so elaborate and complex now, though -- I wonder how easy it will be
to port.
This is a bit frustrating, because many collections claim to be "TREC" format,
but they are all formatted slightly differently...
Sounds like we need one module per corpus to explode it into a common format.
Is Ant the best approach here? Maybe we start off with a scripting language
instead?
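
Something along these lines, maybe -- a rough Python sketch of a per-corpus
module that explodes TREC-ish SGML into (docno, text) pairs. The tag
handling beyond <DOC>/<DOCNO> is an assumption; each corpus would get its
own small variant:

    import re

    DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
    DOCNO_RE = re.compile(r"<DOCNO>\s*(\S+)\s*</DOCNO>")

    def read_trec_docs(path):
        # Yield (docno, text) pairs, skipping malformed docs
        # instead of dying on one corpus's quirks.
        with open(path, encoding="utf-8", errors="replace") as f:
            data = f.read()
        for m in DOC_RE.finditer(data):
            body = m.group(1)
            docno = DOCNO_RE.search(body)
            if docno is None:
                continue
            text = re.sub(r"<[^>]+>", " ", body)  # crude de-tagging
            yield docno.group(1), text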
In reference to both your comments above: I don't really modify the Lucene
benchmarking code much to run my tests; sometimes I change the analyzer or
scoring, but that's it.
Instead, I use sed and perl and whatnot to reformat things into the format
the benchmark package wants... so I guess this is already what I am doing.
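
For example, that munging step could be a small script targeting one
consistent output, say a one-doc-per-line form like what the benchmark
package's line-doc reader consumes (title TAB date TAB body -- double-check
the field order your version expects). A rough sketch, assuming the
hypothetical read_trec_docs() from above:

    def to_line_docs(src_path, dest_path):
        # Flatten each doc onto one line so any engine's driver
        # can slurp the corpus the same way.
        with open(dest_path, "w", encoding="utf-8") as out:
            for docno, text in read_trec_docs(src_path):
                body = " ".join(text.split())  # collapse whitespace
                out.write("%s\t\t%s\n" % (docno, body))  # date left empty

    to_line_docs("corpus.sgml", "corpus.lines.txt")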