[OpenRelevance-dev] [jira] Created: (ORP-1) Use existing collections for relevance testing

Robert Muir (JIRA)
Nov 13, 2009 at 1:45 am
Use existing collections for relevance testing
----------------------------------------------

Key: ORP-1
URL: https://issues.apache.org/jira/browse/ORP-1
Project: Open Relevance Project
Issue Type: New Feature
Components: Collections, Judgments, Queries
Reporter: Robert Muir


I created a list of existing collections with queries and judgements on the wiki here: http://cwiki.apache.org/ORP/existingcollections.html
These can be downloaded from the internet.
(Please add more if you know of any.)

I've written source code (Ant and Java) to download these collections and reformat them into the TREC format that the Lucene benchmark package expects.
Each collection has its own Ant script to download the collection and Java code to reformat it, although I have some shared code at the top level.
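
As an illustration of the reformatting step, here is a minimal Python sketch that wraps documents in TREC-style markup and gzips the result. It is not the Java code from the patch. The `DOC`/`DOCNO`/`TEXT` tag layout is the classic TREC SGML convention and is an assumption here; the exact tags that TrecContentSource expects may differ. The function names `to_trec_doc` and `build_corpus` are hypothetical.

```python
import gzip


def to_trec_doc(doc_id, text):
    """Render one document in a simplified TREC-style SGML layout.

    The tag set is an assumption for illustration; no escaping is done,
    so text containing markup such as </DOC> would need extra handling.
    """
    return (
        "<DOC>\n"
        f"<DOCNO>{doc_id}</DOCNO>\n"
        "<TEXT>\n"
        f"{text}\n"
        "</TEXT>\n"
        "</DOC>\n"
    )


def build_corpus(docs):
    """Concatenate (doc_id, text) pairs and return them as gzipped UTF-8 bytes."""
    body = "".join(to_trec_doc(doc_id, text) for doc_id, text in docs)
    return gzip.compress(body.encode("utf-8"))
```

Writing the returned bytes to a file opened in binary mode would give a gzipped corpus file of the kind described here.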

The resulting output for each collection is a "corpus.gz" file, a queries file, and a judgements file.
The corpus.gz is a gzipped file that can be indexed with Ant via the Lucene benchmark package (using TrecContentSource).
Once the index is created, the command-line tool QueryDriver under the Lucene benchmark quality/trec package can be used to run the evaluation.
It will print some summary output to stdout, and will also create a submission file that can be fed to trec_eval.
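
For readers unfamiliar with the files involved, the sketch below parses a judgements (qrels) file and a submission file and computes precision at k. It assumes the standard TREC layouts: four whitespace-separated columns for qrels (`topic iteration docno relevance`) and six for a submission (`topic Q0 docno rank score run_tag`). It is only a rough stand-in for one of the numbers trec_eval reports, not a reimplementation of trec_eval or QueryDriver, and the function names are hypothetical.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Map topic id -> set of relevant docnos from 'topic iter docno rel' lines."""
    relevant = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        topic, _iteration, docno, rel = parts
        if int(rel) > 0:
            relevant[topic].add(docno)
    return relevant


def parse_run(lines):
    """Map topic id -> docnos ordered by rank from 'topic Q0 docno rank score tag' lines."""
    ranked = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        topic, _q0, docno, rank, _score, _tag = parts
        ranked[topic].append((int(rank), docno))
    return {topic: [d for _, d in sorted(pairs)] for topic, pairs in ranked.items()}


def precision_at_k(ranked_docs, relevant_docs, k):
    """Fraction of the top k retrieved documents that are judged relevant."""
    hits = sum(1 for docno in ranked_docs[:k] if docno in relevant_docs)
    return hits / k
```

Per-topic values can then be averaged over all topics in the queries file to get a single summary figure.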

For starters, the patch will support only one collection: the Indonesian "Tempo" collection (around 23,000 docs).
We can simply add subdirectories for additional collections (it does a contrib-crawl like thing).

Once I finish wrapping up some documentation (such as description of the formats, some javadocs, and an example), I'll upload the patch.
These formats are actually documented in the lucene-java benchmark package already, but I think it would be nice to add this for non-java users.


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

23 responses

  • Robert Muir (JIRA) at Nov 13, 2009 at 3:32 am
    [ https://issues.apache.org/jira/browse/ORP-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Robert Muir updated ORP-1:
    --------------------------

    Attachment: ORP-1.patch

    Here's the patch that supports the Tempo collection (initially).

    There's a README.txt in the root that describes how to use this with the Lucene benchmark package.

    (You will need to use the Lucene trunk for this; I just committed 2 minor patches so this will work.)

  • Simon Willnauer (JIRA) at Nov 13, 2009 at 11:11 am
    [ https://issues.apache.org/jira/browse/ORP-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777453#action_12777453 ]

    Simon Willnauer commented on ORP-1:
    -----------------------------------

    Good stuff, Robert! I guess we should split it up and have one issue for the basic stuff like base Ant scripts, LICENSE files, etc., and another one for the first collection code.

    thoughts?
  • Simon Willnauer (JIRA) at Nov 13, 2009 at 11:11 am
    [ https://issues.apache.org/jira/browse/ORP-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Simon Willnauer reassigned ORP-1:
    ---------------------------------

    Assignee: Simon Willnauer
  • Robert Muir (JIRA) at Nov 13, 2009 at 11:19 am
    [ https://issues.apache.org/jira/browse/ORP-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777456#action_12777456 ]

    Robert Muir commented on ORP-1:
    -------------------------------

    Simon, it's kind of hard to split the issues when there is nothing in openrelevance svn!

    Though this might look large/overkill for one collection, maybe I should have done two to illustrate better? The patch is mostly the "basic stuff" you speak of: license files, build.xml's, ...

  • Marvin Humphrey (JIRA) at Nov 13, 2009 at 3:24 pm
    [ https://issues.apache.org/jira/browse/ORP-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777526#action_12777526 ]

    Marvin Humphrey commented on ORP-1:
    -----------------------------------
    bq. maybe I should have done two
    IMHO: Commit this one, do a second one, plan to refactor out common code.
    Three JIRA issues.
  • Robert Muir (JIRA) at Nov 13, 2009 at 3:34 pm
    [ https://issues.apache.org/jira/browse/ORP-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777528#action_12777528 ]

    Robert Muir commented on ORP-1:
    -------------------------------
    bq. IMHO: Commit this one, do a second one, plan to refactor out common code.
    Marvin, I would like this too. I would like to do a Persian one next, just to make sure everything is groovy with Unicode.

  • Simon Willnauer (JIRA) at Nov 13, 2009 at 3:46 pm
    [ https://issues.apache.org/jira/browse/ORP-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777531#action_12777531 ]

    Simon Willnauer commented on ORP-1:
    -----------------------------------

    bq. IMHO: Commit this one, do a second one, plan to refactor out common code.
    You are right, let's get it going! I will commit this soon.

    simon
  • Simon Willnauer (JIRA) at Nov 13, 2009 at 4:04 pm
    [ https://issues.apache.org/jira/browse/ORP-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777536#action_12777536 ]

    Simon Willnauer commented on ORP-1:
    -----------------------------------

    BAAAH! I have SVN problems:
    svn: Server sent unexpected return value (403 Forbidden) in response to CHECKOUT request for '/repos/asf/!svn/ver/783110/lucene/openrelevance/trunk'

    That is what I get when I want to commit it... Ideas?

    simon
  • Robert Muir (JIRA) at Nov 13, 2009 at 4:08 pm
    [ https://issues.apache.org/jira/browse/ORP-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777538#action_12777538 ]

    Robert Muir commented on ORP-1:
    -------------------------------
    bq. That is what I get when I want to commit it... Ideas?
    Are you using https?
  • Simon Willnauer (JIRA) at Nov 13, 2009 at 7:21 pm
    [ https://issues.apache.org/jira/browse/ORP-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777615#action_12777615 ]

    Simon Willnauer commented on ORP-1:
    -----------------------------------

    bq. are you using https?

    Yep!
  • Robert Muir (JIRA) at Nov 13, 2009 at 8:43 pm
    [ https://issues.apache.org/jira/browse/ORP-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777646#action_12777646 ]

    Robert Muir commented on ORP-1:
    -------------------------------
    bq. Yep!
    Then my next guess would be that it is a legitimate permissions problem in svn.
  • Grant Ingersoll (JIRA) at Nov 14, 2009 at 1:45 pm
    [ https://issues.apache.org/jira/browse/ORP-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777922#action_12777922 ]

    Grant Ingersoll commented on ORP-1:
    -----------------------------------

    Can you get to the SVN via a browser: https://svn.apache.org/repos/asf/lucene/openrelevance/

    Your user name is simonw, right? Are you on the EU mirror? Can you try the US one?
  • Simon Willnauer (JIRA) at Nov 14, 2009 at 1:51 pm
    [ https://issues.apache.org/jira/browse/ORP-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777926#action_12777926 ]

    Simon Willnauer commented on ORP-1:
    -----------------------------------

    bq. Can you get to the SVN via a browser: https://svn.apache.org/repos/asf/lucene/openrelevance/
    Yes I can!
    bg. Your user name is simonw, right?
    right

    bq. Are you on the EU mirror? Can you try the US one?
    I'm on EU - I will try to do the commit on the US once I have a stable INet connection (sitting in a train right now.)

  • Simon Willnauer (JIRA) at Nov 14, 2009 at 11:55 pm
    [ https://issues.apache.org/jira/browse/ORP-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777926#action_12777926 ]

    Simon Willnauer edited comment on ORP-1 at 11/14/09 11:54 PM:
    --------------------------------------------------------------

    {quote} Can you get to the SVN via a browser: https://svn.apache.org/repos/asf/lucene/openrelevance/ {quote}
    Yes I can!
    {quote} Your user name is simonw, right?{quote}
    right

    {quote}Are you on the EU mirror? Can you try the US one?{quote}
    I'm on EU - I will try to do the commit on the US once I have a stable INet connection (sitting in a train right now.)


    was (Author: simonw):
    bq. Can you get to the SVN via a browser: https://svn.apache.org/repos/asf/lucene/openrelevance/
    Yes I can!
    bg. Your user name is simonw, right?
    right

    bq. Are you on the EU mirror? Can you try the US one?
    I'm on EU - I will try to do the commit on the US once I have a stable INet connection (sitting in a train right now.)

  • Simon Willnauer (JIRA) at Nov 15, 2009 at 12:07 am
    [ https://issues.apache.org/jira/browse/ORP-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778023#action_12778023 ]

    Simon Willnauer commented on ORP-1:
    -----------------------------------

    Grant, I tried it on US and EU. I always get the same stupid error.
    I googled a bit and found that the URL in the authz file might be slightly wrong (upper/lower case issues). Are you able to check this?

    simon
  • Grant Ingersoll at Nov 18, 2009 at 9:41 pm
    Simon,

    Any luck on this?

    Do you want me to try the patch?

    -Grant
  • Simon Willnauer at Nov 18, 2009 at 9:52 pm
    No luck here! I guess it is not on my side: I tried on 3 machines
    (Linux and Windows) and I always get the same error:
    svn: Commit failed (details follow):
    svn: Server sent unexpected return value (403 Forbidden) in response
    to CHECKOUT request for
    '/repos/asf/!svn/ver/783110/lucene/openrelevance/trunk'

    No idea why it mentions CHECKOUT when I try to commit, though.
    Can you look at the authz files for SVN? It would be good if we could solve
    this issue somehow :)

    simon
  • Grant Ingersoll at Nov 18, 2009 at 10:00 pm
    Try it now. There was an oddity in the auth file for the path (I forgot the leading slash). I was able to commit before changing it, but then again, I have full /lucene permissions.

    -Grant
    --------------------------
    Grant Ingersoll
    http://www.lucidimagination.com/

    Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
    http://www.lucidimagination.com/search
  • Simon Willnauer at Nov 18, 2009 at 10:03 pm
    Works! Thanks!
  • Simon Willnauer (JIRA) at Nov 18, 2009 at 10:05 pm
    [ https://issues.apache.org/jira/browse/ORP-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Simon Willnauer resolved ORP-1.
    -------------------------------

    Resolution: Fixed

    Committed in revision 881953

    Thank you, Robert! We eventually fixed the SVN issue :)
  • Robert Muir (JIRA) at Nov 18, 2009 at 11:56 pm
    [ https://issues.apache.org/jira/browse/ORP-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779741#action_12779741 ]

    Robert Muir commented on ORP-1:
    -------------------------------

    hey this is good to see!

    I will try to do a Persian one tonight, then we can rewrite/refactor/redesign everything.

  • Robert Muir (JIRA) at Nov 19, 2009 at 5:21 am
    [ https://issues.apache.org/jira/browse/ORP-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779832#action_12779832 ]

    Robert Muir commented on ORP-1:
    -------------------------------

    Simon, when you get a chance, can you set eol style to native in svn? Here is the list of files that need it:
    M LICENSE.txt
    M common-build.xml
    M src\java\org\apache\or\util\TrecQrel.java
    M src\java\org\apache\or\util\TrecDocumentWriter.java
    M src\java\org\apache\or\util\TrecTopicWriter.java
    M src\java\org\apache\or\util\TrecDocument.java
    M src\java\org\apache\or\util\TrecTopic.java
    M src\java\org\apache\or\util\TrecQrelWriter.java
    M FILEFORMATS.txt
    M build.xml
    M collections\tempo\src\java\org\apache\or\collections\tempo\TempoQrelConverter.java
    M collections\tempo\src\java\org\apache\or\collections\tempo\TempoCorpusConverter.java
    M collections\tempo\src\java\org\apache\or\collections\tempo\TempoTopicConverter.java
    M collections\tempo\build.xml
    M collections\collections-build.xml
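
    One way to act on a list like this, sketched in Java under stated assumptions: the class EolStyleSketch and its method are hypothetical and are not part of ORP-1.patch. The only real interface relied on is the Subversion command line "svn propset svn:eol-style native <paths>". The sketch turns status lines of the "M <path>" form into that command's argument list and does not run svn itself.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: builds an "svn propset svn:eol-style native" command from status lines. */
final class EolStyleSketch {

    /** Keeps the path from each "M <path>" line and appends it to the propset command. */
    static List<String> propsetCommand(List<String> statusLines) {
        List<String> cmd = new ArrayList<>(List.of("svn", "propset", "svn:eol-style", "native"));
        for (String line : statusLines) {
            String trimmed = line.trim();
            // Only lines flagged "M" are taken; anything else is skipped.
            if (trimmed.startsWith("M ")) {
                cmd.add(trimmed.substring(1).trim());
            }
        }
        return cmd;
    }

    public static void main(String[] args) {
        List<String> status = List.of("M LICENSE.txt", "M build.xml");
        // Print the command for review instead of executing it.
        System.out.println(String.join(" ", propsetCommand(status)));
    }
}
```

    Printing the command rather than running it leaves room to check the file list before the properties are changed and committed.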

  • Simon Willnauer at Nov 19, 2009 at 5:54 am
    BAH! That's what I missed :). I was so focused on the SVN commit
    issue that I forgot to set them again when I applied the patch again
    and again :)

    thanks for the reminder
