Author: ab
Date: Mon Feb 8 16:12:14 2010
New Revision: 907712

URL: http://svn.apache.org/viewvc?rev=907712&view=rev
Log:
Add TREC-9 / OHSUMED collection.

Added:
lucene/openrelevance/trunk/collections/ohsumed/
lucene/openrelevance/trunk/collections/ohsumed/LICENSE.txt (with props)
lucene/openrelevance/trunk/collections/ohsumed/OHSU_TREC9_README.txt (with props)
lucene/openrelevance/trunk/collections/ohsumed/README.txt (with props)
lucene/openrelevance/trunk/collections/ohsumed/build.xml (with props)
lucene/openrelevance/trunk/collections/ohsumed/src/
lucene/openrelevance/trunk/collections/ohsumed/src/java/
lucene/openrelevance/trunk/collections/ohsumed/src/java/org/
lucene/openrelevance/trunk/collections/ohsumed/src/java/org/apache/
lucene/openrelevance/trunk/collections/ohsumed/src/java/org/apache/or/
lucene/openrelevance/trunk/collections/ohsumed/src/java/org/apache/or/collections/
lucene/openrelevance/trunk/collections/ohsumed/src/java/org/apache/or/collections/ohsumed/
lucene/openrelevance/trunk/collections/ohsumed/src/java/org/apache/or/collections/ohsumed/OhsumedCorpusConverter.java (with props)
lucene/openrelevance/trunk/collections/ohsumed/src/java/org/apache/or/collections/ohsumed/OhsumedQrelConverter.java (with props)
lucene/openrelevance/trunk/collections/ohsumed/src/java/org/apache/or/collections/ohsumed/OhsumedTopicConverter.java (with props)

Added: lucene/openrelevance/trunk/collections/ohsumed/LICENSE.txt
URL: http://svn.apache.org/viewvc/lucene/openrelevance/trunk/collections/ohsumed/LICENSE.txt?rev=907712&view=auto
==============================================================================
--- lucene/openrelevance/trunk/collections/ohsumed/LICENSE.txt (added)
+++ lucene/openrelevance/trunk/collections/ohsumed/LICENSE.txt Mon Feb 8 16:12:14 2010
@@ -0,0 +1,9 @@
+There is no explicit licensing information at the TREC site; the statement below was
+taken from the original OHSUMED corpus available here:
+
+ http://ir.ohsu.edu/ohsumed/ohsumed.html
+
+The National Library of Medicine has agreed to make the MEDLINE references in the
+test database available for experimentation, restricted to the following conditions:
+1. The data will not be used in any non-experimental clinical, library, or other setting.
+2. Any human users of the data will explicitly be told that the data is incomplete and out-of-date.

Propchange: lucene/openrelevance/trunk/collections/ohsumed/LICENSE.txt
------------------------------------------------------------------------------
svn:eol-style = native

Added: lucene/openrelevance/trunk/collections/ohsumed/OHSU_TREC9_README.txt
URL: http://svn.apache.org/viewvc/lucene/openrelevance/trunk/collections/ohsumed/OHSU_TREC9_README.txt?rev=907712&view=auto
==============================================================================
--- lucene/openrelevance/trunk/collections/ohsumed/OHSU_TREC9_README.txt (added)
+++ lucene/openrelevance/trunk/collections/ohsumed/OHSU_TREC9_README.txt Mon Feb 8 16:12:14 2010
@@ -0,0 +1,212 @@
+This README file describes all the data files associated with the
+OHSUMED document collection as it was used for the TREC-9
+Filtering Track. Please see "The TREC-9 Filtering Track Final
+Report" by Stephen Robertson and David A. Hull in the TREC-9
+proceedings for a description of the tasks performed in the track.
+
+(A) Description of the OHSUMED document collection (files: ohsumed.*)
+
+The OHSUMED test collection is a set of 348,566 references from
+MEDLINE, the on-line medical information database, consisting of
+titles and/or abstracts from 270 medical journals over a five-year
+period (1987-1991). The available fields are title, abstract, MeSH
+indexing terms, author, source, and publication type. The National
+Library of Medicine has agreed to make the MEDLINE references in the
+test database available for experimentation, restricted to the
+following conditions:
+
+1. The data will not be used in any non-experimental clinical,
+library, or other setting.
+2. Any human users of the data will explicitly be told that the data
+is incomplete and out-of-date.
+
+The OHSUMED document collection was obtained by William Hersh
+(hersh@OHSU.EDU) and colleagues for the experiments described in the
+papers below:
+
+Hersh WR, Buckley C, Leone TJ, Hickam DH, OHSUMED: An interactive
+retrieval evaluation and new large test collection for research,
+Proceedings of the 17th Annual ACM SIGIR Conference, 1994, 192-201.
+
+Hersh WR, Hickam DH, Use of a multi-application computer workstation
+in a clinical setting, Bulletin of the Medical Library Association,
+1994, 82: 382-389.
+
+Here are the field definitions:
+
+ .I sequential identifier
+ (important note: documents should be processed in this order)
+ .U MEDLINE identifier (UI)
+ (<DOCNO> used for relevance judgements)
+ .M Human-assigned MeSH terms (MH)
+ .T Title (TI)
+ .P Publication type (PT)
+ .W Abstract (AB)
+ .A Author (AU)
+ .S Source (SO)
+
+Note: some abstracts are truncated at 250 words and some references
+have no abstracts at all (titles only). We do not have access to the
+full text of the documents.
+
+(B) Description of the topic statements (files: query.*)
+
+There were three different sets of filtering topics for the
+TREC-9 Filtering track:
+(1) a subset of 63 of the original query set developed by Hersh et al.
+ for their IR experiments (OHSUMED),
+(2) a set of 4904 MeSH terms and their definitions (MSH), and
+(3) a subset of 500 of the MeSH terms (MSH-SMP).
+
+The existing OHSUMED topics describe actual information needs, but the
+relevance judgements probably do not have the same coverage provided
+by the TREC pooling process. The MeSH terms do not directly represent
+information needs, rather they are controlled indexing terms. However,
+the assessment should be more or less complete and there are a lot of
+them, so this provides an unusual opportunity to work with a very
+large topic sample.
+
+The topic statements are provided in the standard TREC format and
+consist of <title> and <desc> (= description) fields only. The meaning
+of these fields is slightly different for each query type.
+
+(1) OHSUMED topics (files: query.ohsu.*)
+
+<title> = patient description
+<desc> = information request
+
+The test collection was built as part of a study assessing the use of
+MEDLINE by physicians in a clinical setting (Hersh and Hickam, above).
+Novice physicians using MEDLINE generated 106 queries. Only a subset
+of these queries were used in the TREC-9 Filtering Track. Before
+they searched, they were asked to provide a statement of information
+about their patient as well as their information need.
+
+(2) MeSH topics (files: query.mesh.*)
+
+<title> = MeSH concept name
+<desc> = MeSH scope note, a definition of the concept (source: MeSH 2000)
+
+The National Library of Medicine has authorized us to use a subset of
+the MeSH 2000 scope notes for Filtering Track experiments with the
+OHSUMED collection. If you wish to use the MeSH scope notes for any
+other purpose, please visit the NLM Web Site,
+ http://www.nlm.nih.gov/mesh/
+sign the attached Memorandum of Understanding, and download the full
+MeSH 2000 database directly from the source.
+
+The subset of the MeSH topics used for the MSH-SMP runs is defined
+by the file "sample.map". The perl script mesh-sample.prl will
+produce a file containing only the 500 topics in the subset
+from the file containing the full set of 4904 topics.
+
+(3) Use of MeSH term field (.M) during filtering
+
+TREC-9 filtering track participants were allowed to use
+the MeSH term field (.M) during the filtering of the
+OHSU topic set provided the use of the field was noted in
+the run description. The entire MeSH term field was *not*
+allowed to be accessed during the filtering of the MeSH topic set.
+Information on the presence or absence of the specific MeSH term
+represented in the filtering topic is contained in the relevant
+document files described below (simulating human judgement).
+
+(C) Description of the relevance judgements (files: qrels.*)
+
+The format of the relevance judgements is slightly different for the
+two topic sets.
+
+(1) OHSUMED relevance judgements (files: qrels.ohsu.*)
+
+Format: <topic-ID> \t <DOCNO> \t <Relevant> \n
+
+<DOCNO> - MEDLINE identifier (.U/UI)
+<Relevant> - 1 = possibly relevant, 2 = definitely relevant
+
+Each query was replicated by four searchers, two physicians
+experienced in searching and two medical librarians. The results were
+assessed for relevance by a different group of physicians, using a
+three point scale: definitely, possibly, or not relevant. The list of
+documents explicitly judged to be not relevant is not provided here.
+Over 10% of the query-document pairs were judged in duplicate to
+assess inter-observer reliability. For evaluation, all documents
+judged here as either possibly or definitely relevant were
+considered relevant. TREC-9 systems were allowed to distinguish
+between these two categories during the learning process if desired.
+
+(2) MeSH relevance judgments (files: qrels.mesh.*)
+
+Format: <topic-ID> \t <DOCNO> \n
+
+<DOCNO> - MEDLINE identifier (.U/UI)
+
+A document is considered "relevant" to a MeSH "topic" if the MeSH
+concept name is listed in the MeSH term field (.M) of the document.
+Please note that the MeSH concepts form a hierarchy. It is common
+practice to index a document *only* by the most specific MeSH concept
+that is relevant.
+
+(D) Description of the ohsu-trec directories
+
+Here we describe the contents of the three sub-directories of
+ohsu-trec.
+
+(1) pre-test - directory of material for preliminary system testing
+
+ ohsumed.87@ - MEDLINE references from 1987 (note: this is a symbolic
+ link to the actual document file located in trec9-train)
+
+ query.ohsu.test.1-43 - set of 43 OHSU test topics
+ query.mesh.test.1-119 - set of 119 MeSH test topics
+
+ qrels.ohsu.test.87 - relevance judgements for OHSU test topics (1987)
+ qrels.mesh.test.87 - relevance judgements for MeSH test topics (1987)
+
+This directory is intended for people interested in doing some
+preliminary testing of their filtering system on this domain.
+Important note: the test topics available here are *not* an unbiased
+sample of the TREC-9 topics. In particular, they are the ones that
+were specifically rejected from the official runs for a variety of
+reasons (usually because they had too many or too few relevance
+judgments). Therefore, they should not be used for optimizing system
+parameters, just for general tests to make sure that the system is
+functioning properly.
+
+(2) trec9-train - directory of TREC-9 training material
+
+ ohsumed.87 - MEDLINE references from 1987
+
+ query.ohsu.1-63 - set of 63 TREC-9 OHSU topics
+ query.mesh.1-4904 - set of 4904 TREC-9 MeSH topics
+
+ qrels.ohsu.adapt.87 - training qrels / OHSU / adaptive filtering (1987)
+ qrels.ohsu.batch.87 - training qrels / OHSU / batch filtering (1987)
+ qrels.mesh.adapt.87 - training qrels / MeSH / adaptive filtering (1987)
+ qrels.mesh.batch.87 - training qrels / MeSH / batch filtering (1987)
+
+This directory contains all the training material for the TREC-9
+filtering task. Routing systems should use the same data as the batch
+filtering systems. The 1987 OHSUMED documents are intended for
+training purposes only. The batch filtering qrels files contain all
+the evaluated documents for the 1987 collection. The OHSU qrels for
+adaptive filtering contain two documents judged definitely relevant
+for each topic. The MeSH qrels for adaptive filtering contain four
+documents assigned to each topic. In both cases, the training samples
+extracted for adaptive filtering were selected by random sampling.
+TREC-9 participants were allowed to use the 1987 OHSUMED collection
+for generating collection summary statistics (such as IDF) or other
+purposes (for adaptive filtering runs such use had to be declared).
+
+(3) trec9-test - directory of TREC-9 test material
+
+ ohsumed.88-91 - MEDLINE references from 1988-1991
+
+ qrels.ohsu.88-91 - relevance judgements for OHSU topics
+ qrels.mesh.88-91 - relevance judgements for MeSH topics
+
+This directory contains the documents and relevance judgements used
+to run the official TREC-9 Filtering Track experiments. TREC-9
+participants were allowed to use the relevance judgement for a
+document only after that document was retrieved. Relevance
+judgements from documents not retrieved were never accessed
+(except for the final evaluation).

Propchange: lucene/openrelevance/trunk/collections/ohsumed/OHSU_TREC9_README.txt
------------------------------------------------------------------------------
svn:eol-style = native
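
The field layout described in OHSU_TREC9_README.txt above (".I <n>" on one line, every other marker on its own line followed by its value) can be sketched as a small stand-alone parser. This is only an illustration under that layout assumption: the class name OhsumedRecordSketch, the method parse and the sample record are invented here and are not part of this commit.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: splits OHSUMED-style text into one field map per record. */
final class OhsumedRecordSketch {

  /** Each map goes from a field marker (".I", ".U", ".T", ...) to that field's text. */
  static List<Map<String, String>> parse(String text) {
    List<Map<String, String>> records = new ArrayList<>();
    Map<String, String> current = null;
    String field = null;
    for (String line : text.split("\\R")) {
      String t = line.trim();
      if (line.startsWith(".I ")) {
        // the sequential identifier opens a new record and carries its value inline
        current = new LinkedHashMap<>();
        records.add(current);
        current.put(".I", line.substring(3).trim());
        field = null;
      } else if (t.length() == 2 && t.charAt(0) == '.'
          && Character.isUpperCase(t.charAt(1))) {
        // a marker on its own line; its value follows on the next line(s)
        field = t;
      } else if (current != null && field != null) {
        // value line; multi-line values are joined with newlines
        current.merge(field, line, (a, b) -> a + "\n" + b);
      }
    }
    return records;
  }

  public static void main(String[] args) {
    // made-up record, not taken from the real collection
    String sample = ".I 1\n.U\n00000001\n.T\nA made-up title.\n.W\nA made-up abstract.\n";
    for (Map<String, String> r : parse(sample)) {
      System.out.println(r.get(".U") + " -> " + r.get(".T"));
    }
  }
}
```

Keeping the fields separate, rather than gluing them into one body, leaves the choice of which fields to index to the caller.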

Added: lucene/openrelevance/trunk/collections/ohsumed/README.txt
URL: http://svn.apache.org/viewvc/lucene/openrelevance/trunk/collections/ohsumed/README.txt?rev=907712&view=auto
==============================================================================
--- lucene/openrelevance/trunk/collections/ohsumed/README.txt (added)
+++ lucene/openrelevance/trunk/collections/ohsumed/README.txt Mon Feb 8 16:12:14 2010
@@ -0,0 +1,22 @@
+This is the version of the OHSUMED corpus used in the TREC-9 Filtering Track. The corpus can
+be obtained from this page:
+
+ http://trec.nist.gov/data/t9_filtering.html
+
+Please see the original OHSU_TREC9_README.txt for detailed information about the corpus.
+
+The build process builds two corpora from this collection: one that uses the trec9-train/
+data, and one that uses the trec9-test/ data.
+
+There are two types of topics (queries) in this collection, and they are significantly
+different. The MeSH topics contain just the MeSH concept in the title, which quite often
+doesn't occur in the relevant documents - instead these documents match terms
+from the topic's "description" field. The OHSU topics often use colloquial and
+inconsistent abbreviations such as "60 yo" for "60 year old" (but often also
+"60 y o" or "60 yr old"). In this case as well, the matching terms appear only in
+the description field of the topic and not in the title.
+
+The description of the TREC filtering track stresses that qrels are NOT ranked by
+relevance; they simply list relevant documents in random order. Therefore any
+metrics that assume ranked retrieval either require a preprocessing step
+(such as sorting qrels by relevance + docId) or may be inapplicable to this corpus.

Propchange: lucene/openrelevance/trunk/collections/ohsumed/README.txt
------------------------------------------------------------------------------
svn:eol-style = native
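
The preprocessing step that README.txt mentions (sorting qrels by relevance and docId) might look roughly like the sketch below. It is not part of this commit: QrelOrderSketch and Judgement are hypothetical names, and grouping by topic id first is an assumption about what a ranked metric would need.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative only: gives unordered qrels a deterministic, relevance-first order. */
final class QrelOrderSketch {

  /** One judgement: topic id, document id, relevance grade. */
  static final class Judgement {
    final String topic;
    final String docno;
    final int rel;

    Judgement(String topic, String docno, int rel) {
      this.topic = topic;
      this.docno = docno;
      this.rel = rel;
    }
  }

  /** Returns a sorted copy: by topic (string order), then grade descending, then docno. */
  static List<Judgement> sorted(List<Judgement> in) {
    List<Judgement> out = new ArrayList<>(in);
    out.sort(Comparator.comparing((Judgement j) -> j.topic)
        .thenComparing(Comparator.comparingInt((Judgement j) -> j.rel).reversed())
        .thenComparing((Judgement j) -> j.docno));
    return out;
  }

  public static void main(String[] args) {
    List<Judgement> in = new ArrayList<>();
    in.add(new Judgement("T1", "00000002", 1)); // made-up ids
    in.add(new Judgement("T1", "00000001", 2));
    for (Judgement j : sorted(in)) {
      System.out.println(j.topic + "\t" + j.docno + "\t" + j.rel);
    }
  }
}
```

Topic ids are compared as strings here, so the order is stable but not numeric. The docno tie-break only makes the output reproducible; it carries no relevance information.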

Added: lucene/openrelevance/trunk/collections/ohsumed/build.xml
URL: http://svn.apache.org/viewvc/lucene/openrelevance/trunk/collections/ohsumed/build.xml?rev=907712&view=auto
==============================================================================
--- lucene/openrelevance/trunk/collections/ohsumed/build.xml (added)
+++ lucene/openrelevance/trunk/collections/ohsumed/build.xml Mon Feb 8 16:12:14 2010
@@ -0,0 +1,97 @@
+<?xml version="1.0"?>
+
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+ -->
+
+<project name="ohsumed" default="dist">
+
+ <import file="../collections-build.xml"/>
+
+ <property name="t9" location="${build.dir}/download/t9.filtering.tar.gz"/>
+ <available file="${t9}" property="t9.exists"/>
+
+ <target name="fetch" unless="t9.exists">
+ <mkdir dir="${build.dir}/download"/>
+ <get src="http://trec.nist.gov/data/filtering/t9.filtering.tar.gz"
+ dest="${t9}"/>
+ </target>
+
+ <target name="extract" depends="fetch">
+ <untar src="${t9}" dest="${build.dir}/extracted" compression="gzip">
+ <patternset>
+ <include name="ohsu-trec/trec9-*/*"/>
+ </patternset>
+ </untar>
+ </target>
+
+ <target name="dist" depends="compile,extract">
+ <mkdir dir="${dist.dir}"/>
+ <java classname="org.apache.or.collections.ohsumed.OhsumedCorpusConverter">
+ <arg value="${build.dir}/extracted/ohsu-trec/trec9-train/ohsumed.87"/>
+ <arg value="${dist.dir}/train-corpus.gz"/>
+ <classpath refid="classpath"/>
+ </java>
+ <java classname="org.apache.or.collections.ohsumed.OhsumedCorpusConverter">
+ <arg value="${build.dir}/extracted/ohsu-trec/trec9-test/ohsumed.88-91"/>
+ <arg value="${dist.dir}/test-corpus.gz"/>
+ <classpath refid="classpath"/>
+ </java>
+ <java classname="org.apache.or.collections.ohsumed.OhsumedTopicConverter">
+ <arg value="${build.dir}/extracted/ohsu-trec/trec9-train/query.ohsu.1-63"/>
+ <arg value="${dist.dir}/queries-ohsu.txt"/>
+ <classpath refid="classpath"/>
+ </java>
+ <java classname="org.apache.or.collections.ohsumed.OhsumedTopicConverter">
+ <arg value="${build.dir}/extracted/ohsu-trec/trec9-train/query.mesh.1-4904"/>
+ <arg value="${dist.dir}/queries-mesh.txt"/>
+ <classpath refid="classpath"/>
+ </java>
+ <concat destfile="${dist.dir}/queries.txt" fixlastline="yes"
+ encoding="UTF-8" outputencoding="UTF-8">
+ <filelist dir="${dist.dir}" files="queries-ohsu.txt,queries-mesh.txt"/>
+ </concat>
+ <java classname="org.apache.or.collections.ohsumed.OhsumedQrelConverter">
+ <arg value="${build.dir}/extracted/ohsu-trec/trec9-train/qrels.mesh.batch.87"/>
+ <arg value="${dist.dir}/train-judgements-mesh.txt"/>
+ <classpath refid="classpath"/>
+ </java>
+ <java classname="org.apache.or.collections.ohsumed.OhsumedQrelConverter">
+ <arg value="${build.dir}/extracted/ohsu-trec/trec9-train/qrels.ohsu.batch.87"/>
+ <arg value="${dist.dir}/train-judgements-ohsu.txt"/>
+ <classpath refid="classpath"/>
+ </java>
+ <concat destfile="${dist.dir}/train-judgements.txt" fixlastline="yes"
+ encoding="UTF-8" outputencoding="UTF-8">
+ <filelist dir="${dist.dir}" files="train-judgements-ohsu.txt,train-judgements-mesh.txt"/>
+ </concat>
+ <java classname="org.apache.or.collections.ohsumed.OhsumedQrelConverter">
+ <arg value="${build.dir}/extracted/ohsu-trec/trec9-test/qrels.ohsu.88-91"/>
+ <arg value="${dist.dir}/test-judgements-ohsu.txt"/>
+ <classpath refid="classpath"/>
+ </java>
+ <java classname="org.apache.or.collections.ohsumed.OhsumedQrelConverter">
+ <arg value="${build.dir}/extracted/ohsu-trec/trec9-test/qrels.mesh.88-91"/>
+ <arg value="${dist.dir}/test-judgements-mesh.txt"/>
+ <classpath refid="classpath"/>
+ </java>
+ <concat destfile="${dist.dir}/test-judgements.txt" fixlastline="yes"
+ encoding="UTF-8" outputencoding="UTF-8">
+ <filelist dir="${dist.dir}" files="test-judgements-ohsu.txt,test-judgements-mesh.txt"/>
+ </concat>
+ </target>
+
+</project>

Propchange: lucene/openrelevance/trunk/collections/ohsumed/build.xml
------------------------------------------------------------------------------
svn:eol-style = native

Added: lucene/openrelevance/trunk/collections/ohsumed/src/java/org/apache/or/collections/ohsumed/OhsumedCorpusConverter.java
URL: http://svn.apache.org/viewvc/lucene/openrelevance/trunk/collections/ohsumed/src/java/org/apache/or/collections/ohsumed/OhsumedCorpusConverter.java?rev=907712&view=auto
==============================================================================
--- lucene/openrelevance/trunk/collections/ohsumed/src/java/org/apache/or/collections/ohsumed/OhsumedCorpusConverter.java (added)
+++ lucene/openrelevance/trunk/collections/ohsumed/src/java/org/apache/or/collections/ohsumed/OhsumedCorpusConverter.java Mon Feb 8 16:12:14 2010
@@ -0,0 +1,138 @@
+package org.apache.or.collections.ohsumed;
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+import java.io.BufferedReader;
+import java.io.File;
+import java.io.FileInputStream;
+import java.io.InputStreamReader;
+import java.util.Date;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.Map;
+
+import org.apache.or.util.TrecDocument;
+import org.apache.or.util.TrecDocumentWriter;
+
+public class OhsumedCorpusConverter {
+
+ private static final String OHSU_SEQID = ".I "; // the only single-line field
+ private static final String OHSU_DOCID = ".U";
+ private static final String OHSU_SUBJECT = ".S";
+ private static final String OHSU_MESH = ".M";
+ private static final String OHSU_TITLE = ".T";
+ private static final String OHSU_TYPE = ".P";
+ private static final String OHSU_BODY = ".W";
+ private static final String OHSU_AUTHORS = ".A";
+
+ private static final HashSet<String> multiLine = new HashSet<String>();
+ static {
+ multiLine.add(OHSU_DOCID);
+ multiLine.add(OHSU_SUBJECT);
+ multiLine.add(OHSU_MESH);
+ multiLine.add(OHSU_TITLE);
+ multiLine.add(OHSU_TYPE);
+ multiLine.add(OHSU_BODY);
+ multiLine.add(OHSU_AUTHORS);
+ }
+
+ private static TrecDocument doc = new TrecDocument();
+ private static Date date = new Date(); // this corpus does not have a date, use a fake one.
+
+ public static void main(String[] args) throws Exception {
+ if (args.length < 2) {
+ System.err.println("Usage: OhsumedCorpusConverter <inputFile> <outputFile>");
+ System.err.println("\tinputFile\tone of the ohsumed.87 or ohsumed.88-91 files");
+ System.err.println("\toutputFile\toutput to store the converted corpus. NOTE: will be silently overwritten if exists!");
+ System.exit(-1);
+ }
+ BufferedReader in = new BufferedReader(new InputStreamReader(
+ new FileInputStream(args[0]), "UTF-8"));
+ TrecDocumentWriter writer = new TrecDocumentWriter(new File(args[1]));
+
+ String line = null;
+ String did = null;
+ StringBuilder body = new StringBuilder();
+ HashMap<String, StringBuilder> fields = new HashMap<String, StringBuilder>();
+ String curField = null;
+ while ((line = in.readLine()) != null) {
+ if (line.startsWith(OHSU_SEQID)) { // new document
+ if (!fields.isEmpty()) {
+ writeDocument(fields, writer);
+ fields.clear();
+ }
+ fields.put(OHSU_SEQID, new StringBuilder(line.substring(OHSU_SEQID.length())));
+ } else {
+ if (line.length() > 1 && line.charAt(0) == '.' && Character.isUpperCase(line.charAt(1))) { // field id, for multi-line fields
+ line = line.trim();
+ if (multiLine.contains(line)) {
+ curField = line;
+ } else {
+ System.err.println("Invalid field name: " + line + ", skipping ...");
+ curField = null;
+ }
+ continue;
+ } else {
+ // value of the current field
+ StringBuilder sb = fields.get(curField);
+ if (sb == null) {
+ sb = new StringBuilder();
+ fields.put(curField, sb);
+ } else {
+ sb.append('\n');
+ }
+ sb.append(line);
+ }
+ }
+ }
+ if (!fields.isEmpty()) {
+ writeDocument(fields, writer);
+ }
+ in.close();
+ writer.close();
+ }
+
+ // for now glue title + body + authors - this is primitive, but probably
+ // better than ignoring everything except the body ...
+ private static void writeDocument(Map<String, StringBuilder> fields, TrecDocumentWriter writer) throws Exception {
+ // Note: some documents have an empty body
+ StringBuilder body = fields.get(OHSU_BODY);
+ StringBuilder title = fields.get(OHSU_TITLE);
+ if (title != null) {
+ if (body != null) title.append('\n').append(body);
+ body = title;
+ } else if (body == null) body = new StringBuilder(); // neither title nor abstract present
+ StringBuilder authors = fields.get(OHSU_AUTHORS);
+ if (authors != null) {
+ body.append('\n').append(authors);
+ }
+ StringBuilder mesh = fields.get(OHSU_MESH);
+ if (mesh != null) {
+ body.append('\n').append(mesh);
+ }
+ doc.setBody(body);
+ doc.setDate(date);
+ StringBuilder docName = fields.get(OHSU_DOCID);
+ if (docName == null) {
+ System.err.println("-Empty docid - skipping ...");
+ return;
+ }
+ doc.setDocname(docName);
+ writer.write(doc);
+ }
+
+}

Propchange: lucene/openrelevance/trunk/collections/ohsumed/src/java/org/apache/or/collections/ohsumed/OhsumedCorpusConverter.java
------------------------------------------------------------------------------
svn:eol-style = native

Added: lucene/openrelevance/trunk/collections/ohsumed/src/java/org/apache/or/collections/ohsumed/OhsumedQrelConverter.java
URL: http://svn.apache.org/viewvc/lucene/openrelevance/trunk/collections/ohsumed/src/java/org/apache/or/collections/ohsumed/OhsumedQrelConverter.java?rev=907712&view=auto
==============================================================================
--- lucene/openrelevance/trunk/collections/ohsumed/src/java/org/apache/or/collections/ohsumed/OhsumedQrelConverter.java (added)
+++ lucene/openrelevance/trunk/collections/ohsumed/src/java/org/apache/or/collections/ohsumed/OhsumedQrelConverter.java Mon Feb 8 16:12:14 2010
@@ -0,0 +1,60 @@
+package org.apache.or.collections.ohsumed;
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+import java.io.BufferedReader;
+import java.io.File;
+import java.io.FileInputStream;
+import java.io.InputStreamReader;
+
+import org.apache.or.util.TrecQrel;
+import org.apache.or.util.TrecQrelWriter;
+
+public class OhsumedQrelConverter {
+
+ public static void main(String[] args) throws Exception {
+ if (args.length < 2) {
+ System.err.println("Usage: OhsumedQrelConverter <inputQrels> <outputQrels>");
+ System.err.println("\tinputQrels\tone of the qrels.mesh.* or qrels.ohsu.* files from OHSUMED");
+ System.err.println("\toutputQrels\toutput file (will be silently overwritten if exists!)");
+ System.exit(-1);
+ }
+ BufferedReader in = new BufferedReader(new InputStreamReader(
+ new FileInputStream(args[0]), "UTF-8"));
+ TrecQrelWriter writer = new TrecQrelWriter(new File(args[1]));
+ TrecQrel qrel = new TrecQrel();
+
+ String line = null;
+ while ((line = in.readLine()) != null) {
+ String[] fields = line.split("\\s+");
+ if (fields.length < 2) {
+ System.err.println("-invalid line, skipping: " + line);
+ continue;
+ }
+ qrel.setDocno(fields[1]);
+ qrel.setIter("0");
+ qrel.setQid(fields[0]);
+ if (fields.length > 2) {
+ qrel.setRel(Integer.parseInt(fields[2]));
+ } else {
+ qrel.setRel(1);
+ }
+ writer.write(qrel);
+ }
+ in.close();
+ writer.close();
+ }
+}

Propchange: lucene/openrelevance/trunk/collections/ohsumed/src/java/org/apache/or/collections/ohsumed/OhsumedQrelConverter.java
------------------------------------------------------------------------------
svn:eol-style = native
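
The two qrels layouts handled by OhsumedQrelConverter above (three columns for OHSU, two for MeSH) and the evaluation rule from OHSU_TREC9_README.txt (possibly and definitely relevant both count as relevant) can be tried in isolation with a sketch like this one. QrelLineSketch and its methods are hypothetical names, not code from this commit, and the ids in the example are made up.

```java
/** Illustrative only: parses one qrels line in either the OHSU or the MeSH layout. */
final class QrelLineSketch {

  /** Returns {topic, docno, grade}, or null when the line has fewer than two columns. */
  static String[] parse(String line) {
    String[] f = line.trim().split("\\s+");
    if (f.length < 2) {
      return null;
    }
    // MeSH qrels have no grade column; every listed document counts as grade 1
    String grade = f.length > 2 ? f[2] : "1";
    return new String[] { f[0], f[1], grade };
  }

  /** Binary view used for evaluation: grade 1 (possibly) and 2 (definitely) both count. */
  static boolean isRelevant(String[] parsed) {
    return Integer.parseInt(parsed[2]) >= 1;
  }

  public static void main(String[] args) {
    String[] ohsu = parse("T1\t00000001\t2"); // made-up ids
    String[] mesh = parse("T2\t00000002");
    System.out.println(String.join(" ", ohsu) + " relevant=" + isRelevant(ohsu));
    System.out.println(String.join(" ", mesh) + " relevant=" + isRelevant(mesh));
  }
}
```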

Added: lucene/openrelevance/trunk/collections/ohsumed/src/java/org/apache/or/collections/ohsumed/OhsumedTopicConverter.java
URL: http://svn.apache.org/viewvc/lucene/openrelevance/trunk/collections/ohsumed/src/java/org/apache/or/collections/ohsumed/OhsumedTopicConverter.java?rev=907712&view=auto
==============================================================================
--- lucene/openrelevance/trunk/collections/ohsumed/src/java/org/apache/or/collections/ohsumed/OhsumedTopicConverter.java (added)
+++ lucene/openrelevance/trunk/collections/ohsumed/src/java/org/apache/or/collections/ohsumed/OhsumedTopicConverter.java Mon Feb 8 16:12:14 2010
@@ -0,0 +1,80 @@
+package org.apache.or.collections.ohsumed;
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+import java.io.BufferedReader;
+import java.io.File;
+import java.io.FileInputStream;
+import java.io.InputStreamReader;
+
+import org.apache.or.util.TrecTopic;
+import org.apache.or.util.TrecTopicWriter;
+
+public class OhsumedTopicConverter {
+
+ public static void main(String[] args) throws Exception {
+ if (args.length < 2) {
+ System.err.println("Usage: OhsumedTopicConverter <inputTopics> <outputTopics>");
+ System.err.println("\tinputTopics\tone of the query.mesh.* or query.ohsu.* files from OHSUMED");
+ System.err.println("\toutputTopics\toutput file (will be silently overwritten if exists!)");
+ System.exit(-1);
+ }
+ BufferedReader in = new BufferedReader(new InputStreamReader(
+ new FileInputStream(args[0]), "UTF-8"));
+ TrecTopicWriter writer = new TrecTopicWriter(new File(args[1]));
+ TrecTopic topic = new TrecTopic();
+ topic.setNarrative(""); // no narratives
+
+ String line = null;
+ boolean description = false;
+ while ((line = in.readLine()) != null) {
+ String lineT = line.trim();
+ if (lineT.equals("") || lineT.equals("</top>")) {
+ continue;
+ }
+ if (lineT.equals("<top>")) { // output existing doc & reset
+ if (topic.getNumber() != null && !topic.getNumber().equals("")) {
+ writer.write(topic);
+ }
+ topic.setNumber("");
+ topic.setDescription("");
+ topic.setTitle("");
+ continue;
+ }
+ if (lineT.startsWith("<num> Number: ")) {
+ topic.setNumber(lineT.substring(14));
+ } else if (lineT.startsWith("<title> ")) {
+ topic.setTitle(lineT.substring(8));
+ } else if (lineT.equals("<desc> Description:")) {
+ description = true;
+ continue;
+ } else if (description) {
+ topic.setDescription(line);
+ description = false;
+ } else {
+ System.err.println("Unrecognized line, skipping: '" + line + "'");
+ continue;
+ }
+ }
+ // output last topic if present
+ if (topic.getNumber() != null && !topic.getNumber().equals("")) {
+ writer.write(topic);
+ }
+ in.close();
+ writer.close();
+ }
+}

Propchange: lucene/openrelevance/trunk/collections/ohsumed/src/java/org/apache/or/collections/ohsumed/OhsumedTopicConverter.java
------------------------------------------------------------------------------
svn:eol-style = native
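
The topic layout that OhsumedTopicConverter above expects (<top>, "<num> Number:", <title>, then "<desc> Description:" with the text on the following line) can be sketched as a stand-alone parser. TopicParseSketch is a hypothetical name and the sample topic is made up. Unlike the committed converter, which keeps a single description line, this sketch joins all description lines with spaces; that is a deliberate variation, not the behaviour of the code in this commit.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: reads TREC-style topics with <num>, <title> and <desc> fields. */
final class TopicParseSketch {

  /** Returns one {number, title, description} triple per <top> block. */
  static List<String[]> parse(String text) {
    List<String[]> topics = new ArrayList<>();
    String[] cur = null;
    boolean inDesc = false;
    for (String raw : text.split("\\R")) {
      String line = raw.trim();
      if (line.equals("<top>")) {
        cur = new String[] { "", "", "" };
        topics.add(cur);
        inDesc = false;
      } else if (cur == null || line.isEmpty() || line.equals("</top>")) {
        continue; // nothing to record outside a topic, on blank lines, or at the end tag
      } else if (line.startsWith("<num> Number:")) {
        cur[0] = line.substring("<num> Number:".length()).trim();
      } else if (line.startsWith("<title>")) {
        cur[1] = line.substring("<title>".length()).trim();
      } else if (line.startsWith("<desc>")) {
        inDesc = true; // the description text starts on the next line
      } else if (inDesc) {
        cur[2] = cur[2].isEmpty() ? line : cur[2] + " " + line;
      }
    }
    return topics;
  }

  public static void main(String[] args) {
    String sample = "<top>\n<num> Number: T1\n<title> 60 yo male\n"
        + "<desc> Description:\nA made-up information request.\n</top>\n";
    for (String[] t : parse(sample)) {
      System.out.println(t[0] + " | " + t[1] + " | " + t[2]);
    }
  }
}
```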
