[ https://issues.apache.org/jira/browse/ORP-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777926#action_12777926 ]

Simon Willnauer commented on ORP-1:

bq. Can you get to the SVN via a browser: https://svn.apache.org/repos/asf/lucene/openrelevance/
Yes I can!
bq. Your user name is simonw, right?

bq. Are you on the EU mirror? Can you try the US one?
I'm on the EU mirror. I will try the commit on the US one once I have a stable Internet connection (sitting on a train right now).

Use existing collections for relevance testing

Key: ORP-1
URL: https://issues.apache.org/jira/browse/ORP-1
Project: Open Relevance Project
Issue Type: New Feature
Components: Collections, Judgments, Queries
Reporter: Robert Muir
Assignee: Simon Willnauer
Attachments: ORP-1.patch

I created a list of existing collections with queries and judgements on the wiki here: http://cwiki.apache.org/ORP/existingcollections.html
These can be downloaded from the internet.
(Please add more if you know of any.)
I've written source code (Ant and Java) to download these collections and reformat them to the TREC format that the Lucene benchmark expects.
Each collection has its own Ant script to download the collection and Java code to reformat it, although I have some shared code at the top level.
The resulting output for each collection is a "corpus.gz" file, a queries file, and a judgements file.
The corpus.gz is a gzipped file that can be indexed with Ant via the Lucene benchmark package (using TrecContentSource).
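For readers outside the Java toolchain, here is a minimal Python sketch of what producing such a corpus.gz could look like. It is not the code from the patch. The `<DOC>`/`<DOCNO>`/`<TEXT>` tag set is the generic TREC-style layout and is an assumption here; the exact tags TrecContentSource expects should be checked against the Lucene benchmark documentation. The function names and sample documents are hypothetical.

```python
import gzip
import io


def format_trec_doc(docno, text):
    """Render one document in a generic TREC-style SGML layout (assumed tag set)."""
    return (
        "<DOC>\n"
        f"<DOCNO>{docno}</DOCNO>\n"
        "<TEXT>\n"
        f"{text.strip()}\n"
        "</TEXT>\n"
        "</DOC>\n"
    )


def write_corpus(docs, fileobj):
    """Write (docno, text) pairs to fileobj as gzip-compressed TREC-style records."""
    with gzip.GzipFile(fileobj=fileobj, mode="wb") as gz:
        for docno, text in docs:
            gz.write(format_trec_doc(docno, text).encode("utf-8"))


# Hypothetical sample documents, not taken from the Tempo collection.
sample_docs = [
    ("DOC-0001", "Contoh dokumen pertama."),
    ("DOC-0002", "Contoh dokumen kedua."),
]

# An in-memory buffer stands in for a real corpus.gz file on disk.
buffer = io.BytesIO()
write_corpus(sample_docs, buffer)
corpus_text = gzip.decompress(buffer.getvalue()).decode("utf-8")
print(corpus_text)
```

To write a real file, pass `open("corpus.gz", "wb")` instead of the in-memory buffer.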
Once the index is created, the command-line tool QueryDriver under the lucene benchmark quality/trec package can be used to run the evaluation.
It will print some summary output to stdout, but will also create a submission file that can be fed to trec_eval.
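As a rough illustration of the files involved at this step, the sketch below formats a line in the layout trec_eval reads for submissions (`query-id Q0 docno rank score run-tag`) and parses judgments in the qrels layout (`query-id iteration docno relevance`). It is a hand-written Python sketch, not QueryDriver's code. The function names, the run tag, and the sample data are hypothetical, and precision at k is included only as a small sanity check.

```python
def format_run_line(query_id, docno, rank, score, run_tag="example-run"):
    """Format one result in the six-column layout trec_eval reads."""
    return f"{query_id} Q0 {docno} {rank} {score:.4f} {run_tag}"


def parse_qrels(lines):
    """Parse 'query-id iteration docno relevance' lines into {query_id: {docno: rel}}."""
    judgments = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        query_id, _iteration, docno, relevance = parts
        judgments.setdefault(query_id, {})[docno] = int(relevance)
    return judgments


def precision_at_k(ranked_docnos, relevant, k):
    """Fraction of the top k results judged relevant (relevance > 0)."""
    hits = sum(1 for docno in ranked_docnos[:k] if relevant.get(docno, 0) > 0)
    return hits / k


# Hypothetical judgments and ranking for a single query.
qrels = parse_qrels([
    "1 0 DOC-0001 1",
    "1 0 DOC-0002 0",
    "1 0 DOC-0003 1",
])
ranking = ["DOC-0003", "DOC-0002", "DOC-0001"]

run_line = format_run_line("1", ranking[0], 1, 2.5)
p_at_2 = precision_at_k(ranking, qrels["1"], 2)
print(run_line)
print(p_at_2)
```

trec_eval computes many more measures than this. The snippet is only meant to make the two file layouts concrete.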
For starters, the patch will only support one collection, the Indonesian "Tempo" collection (around 23,000 docs).
We can simply add subdirectories for additional collections (it does a contrib-crawl-like thing).
Once I finish wrapping up some documentation (such as description of the formats, some javadocs, and an example), I'll upload the patch.
These formats are actually documented in the lucene-java benchmark package already, but I think it would be nice to add this for non-java users.

group openrelevance-dev @
posted Nov 13, '09 at 1:45a
active Nov 19, '09 at 5:54a


