Add TREC9 filtering (OHSUMED) collection
----------------------------------------

Key: ORP-6
URL: https://issues.apache.org/jira/browse/ORP-6
Project: Open Relevance Project
Issue Type: New Feature
Components: Collections
Reporter: Andrzej Bialecki




--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


  • Andrzej Bialecki (JIRA) at Feb 8, 2010 at 2:17 pm
    [ https://issues.apache.org/jira/browse/ORP-6?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Andrzej Bialecki updated ORP-6:
    --------------------------------

    Attachment: ohsumed.patch

    This patch adds support for creating collections from TREC9 / OHSUMED corpus, queries and qrels.
  • Robert Muir (JIRA) at Feb 8, 2010 at 3:01 pm
    [ https://issues.apache.org/jira/browse/ORP-6?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830958#action_12830958 ]

    Robert Muir commented on ORP-6:
    -------------------------------

    +1 (built and ran evaluation with training corpus/qrels)

    Andrzej, wanna commit this?

  • Andrzej Bialecki (JIRA) at Feb 8, 2010 at 3:19 pm
    [ https://issues.apache.org/jira/browse/ORP-6?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830967#action_12830967 ]

    Andrzej Bialecki commented on ORP-6:
    -------------------------------------

    Sure, why not. But there are some points that I'm not sure about yet:

    * I created separate corpora and qrels for the test and train parts of the original collection.

    * the Mesh and OHSU topics are very different - e.g. from my experience Mesh topics converted to Lucene queries must include the description, because quite often the most relevant docs don't contain the Mesh term itself. This however makes for very long queries ...

    * AFAIU the definition of the filtering track is that qrels are NOT ranked, they just list relevant docs in random order. For calculation of metrics that depend on position (such as NDCG) this needs to be taken into account, e.g. by first sorting the qrels by relevance and calculating an Ideal DCG@N, where N is the number of available qrels.

    I could add these remarks to the README.
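The third point above can be illustrated with a minimal sketch of NDCG for a single topic, in which the ideal DCG is computed from the judgments sorted by relevance rather than in the order the qrels file lists them. The function names, the dict-based qrels layout, and the log2(rank + 1) discount are assumptions chosen for illustration; this is not code from ohsumed.patch or from trec_eval.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: the gain at 1-based rank i is divided by log2(i + 1)."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))


def ndcg(ranked_doc_ids, qrels, n=None):
    """NDCG@n for one topic.

    ranked_doc_ids: doc ids in the order the search engine returned them.
    qrels: dict mapping doc id -> graded relevance (e.g. 1 or 2);
           unjudged docs are treated as relevance 0 (an assumption).
    n: cutoff; defaults to the number of available qrels for the topic.
    """
    if n is None:
        n = len(qrels)
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids[:n]]
    # The qrels are unordered, so sort by relevance to get the ideal ranking.
    ideal_gains = sorted(qrels.values(), reverse=True)[:n]
    ideal_dcg = dcg(ideal_gains)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0
```

With qrels such as `{"a": 1, "b": 2, "c": 1}` (made-up ids), the ranking `["b", "a", "c"]` scores 1.0, while `["a", "b", "c"]` scores lower. Skipping the sort would take the file order as the ideal, which can push the score of a good ranking above 1.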
  • Robert Muir (JIRA) at Feb 8, 2010 at 3:31 pm
    [ https://issues.apache.org/jira/browse/ORP-6?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830977#action_12830977 ]

    Robert Muir commented on ORP-6:
    -------------------------------
    bq. I created separate corpora and qrels for the test and train parts of the original collection.

    I am not familiar with this collection, but from your README and the original file naming it appears this is the right thing to do?

    bq. the Mesh and OHSU topics are very different - e.g. from my experience Mesh topics converted to Lucene queries must include the description, because quite often the most relevant docs don't contain the Mesh term itself. This however makes for very long queries ...

    OK, I think the best way to handle this is to instead make it easier to run T, T+D, T+D+N, etc. queries from the benchmark package. I'll open an issue with an initial patch for you to look over (but I don't think this is an ORP problem, just a problem that the benchmark package is really only set up to run Title queries right now).

    bq. AFAIU the definition of the filtering track is that qrels are NOT ranked, they just list relevant docs in random order. For calculation of metrics that depend on position (such as NDCG) this needs to be taken into account, e.g. by first sorting the qrels by relevance and calculating an Ideal DCG@N, where N is the number of available qrels.

    I thought DCG etc. were only based on the '2' versus '1' value in the qrels? I am only vaguely familiar with these, so I could be wrong.
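As a rough illustration of the T, T+D, T+D+N idea, the sketch below builds a query string from a chosen combination of topic fields. The `Topic` class, the `build_query` helper, and the sample topic text are all hypothetical. This is not the Lucene benchmark package API, and it does no escaping of query syntax characters.

```python
from dataclasses import dataclass


@dataclass
class Topic:
    """A TREC-style topic with Title, Description and Narrative fields (hypothetical layout)."""
    title: str
    description: str = ""
    narrative: str = ""


# Maps the usual one-letter field codes to Topic attribute names.
FIELDS = {"T": "title", "D": "description", "N": "narrative"}


def build_query(topic, spec="T"):
    """Concatenate the topic fields named in spec, e.g. 'T', 'T+D' or 'T+D+N'.

    Empty fields are skipped, so a topic without a narrative yields the
    same text for 'T+D' and 'T+D+N'.
    """
    parts = []
    for code in spec.split("+"):
        text = getattr(topic, FIELDS[code]).strip()
        if text:
            parts.append(text)
    return " ".join(parts)


# Invented example topic, not taken from the OHSUMED data.
topic = Topic(title="heart disease",
              description="risk factors for heart disease in adults")
title_only = build_query(topic, "T")
title_and_description = build_query(topic, "T+D")
```

Here `title_only` is just the title, and `title_and_description` appends the description, which is where the long queries mentioned for the Mesh topics would come from.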
  • Robert Muir (JIRA) at Feb 8, 2010 at 3:43 pm
    [ https://issues.apache.org/jira/browse/ORP-6?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830984#action_12830984 ]

    Robert Muir commented on ORP-6:
    -------------------------------
    bq. For calculation of metrics that depend on position (such as NDCG) this needs to be taken into account, e.g. by first sorting the qrels by relevance and calculating an Ideal DCG@N, where N is the number of available qrels.

    Andrzej, I looked at a patch to trec_eval to support NDCG and it appears to do this sort itself: http://cio.nist.gov/esd/emaildir/lists/ireval/msg00037.html

    I guess the latest version does not support this metric. Are people using this patch, or is there some other NDCG calculator that does not do this sort?
  • Andrzej Bialecki (JIRA) at Feb 8, 2010 at 3:57 pm
    [ https://issues.apache.org/jira/browse/ORP-6?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830986#action_12830986 ]

    Andrzej Bialecki commented on ORP-6:
    -------------------------------------

    bq. OK, I think the best way to handle this is to instead make it easier to run T, T+D, T+D+N, etc queries from the benchmark package.

    That would be cool - yes, it's a Lucene benchmark issue.

    bq. I thought DCG etc were only based on the '2' versus '1' value in the qrels? I am only vaguely familiar with these so I could be wrong?

    Discounted Cumulative Gain (http://en.wikipedia.org/wiki/Discounted_Cumulative_Gain), unlike plain Cumulative Gain, discounts the importance of a result by its position in the list of results (its rank).
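For reference, one common formulation of the two measures is given below. The notation is an assumption made here: rel_i is the graded relevance (e.g. the '1' or '2' from the qrels) of the result at rank i, and p is the cutoff. Other variants of the discount exist.

```latex
\mathrm{CG}_p = \sum_{i=1}^{p} rel_i
\qquad
\mathrm{DCG}_p = \sum_{i=1}^{p} \frac{rel_i}{\log_2(i + 1)}
```

So the '2' versus '1' values supply the gains, but in DCG each gain is also divided by a term that grows with rank, which is why the order of the list matters.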

    bq. I guess the latest version does not support this metric, are people using this patch or is there some other NDCG calculator that does not do this sort???

    No, I stumbled upon this issue when implementing NDCG myself for another project.

    Ok, I'll add these remarks and commit. Thanks!
  • Robert Muir (JIRA) at Feb 8, 2010 at 4:26 pm
    [ https://issues.apache.org/jira/browse/ORP-6?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830994#action_12830994 ]

    Robert Muir commented on ORP-6:
    -------------------------------

    Thanks Andrzej for your work here. I opened LUCENE-2254 for the lucene benchmark issue.
