Author: simonw
Date: Wed Nov 18 22:02:45 2009
New Revision: 881953

URL: http://svn.apache.org/viewvc?rev=881953&view=rev
Log:
ORP-1: Use Existing collection for relevance testing. initial revision: Lucene benchmark converter code for tempo collection, license and basic ant scripts

Added:
lucene/openrelevance/trunk/FILEFORMATS.txt
lucene/openrelevance/trunk/LICENSE.txt
lucene/openrelevance/trunk/build.xml
lucene/openrelevance/trunk/collections/
lucene/openrelevance/trunk/collections/collections-build.xml
lucene/openrelevance/trunk/collections/tempo/
lucene/openrelevance/trunk/collections/tempo/build.xml
lucene/openrelevance/trunk/collections/tempo/src/
lucene/openrelevance/trunk/collections/tempo/src/java/
lucene/openrelevance/trunk/collections/tempo/src/java/org/
lucene/openrelevance/trunk/collections/tempo/src/java/org/apache/
lucene/openrelevance/trunk/collections/tempo/src/java/org/apache/or/
lucene/openrelevance/trunk/collections/tempo/src/java/org/apache/or/collections/
lucene/openrelevance/trunk/collections/tempo/src/java/org/apache/or/collections/tempo/
lucene/openrelevance/trunk/collections/tempo/src/java/org/apache/or/collections/tempo/TempoCorpusConverter.java
lucene/openrelevance/trunk/collections/tempo/src/java/org/apache/or/collections/tempo/TempoQrelConverter.java
lucene/openrelevance/trunk/collections/tempo/src/java/org/apache/or/collections/tempo/TempoTopicConverter.java
lucene/openrelevance/trunk/common-build.xml
lucene/openrelevance/trunk/src/
lucene/openrelevance/trunk/src/java/
lucene/openrelevance/trunk/src/java/org/
lucene/openrelevance/trunk/src/java/org/apache/
lucene/openrelevance/trunk/src/java/org/apache/or/
lucene/openrelevance/trunk/src/java/org/apache/or/util/
lucene/openrelevance/trunk/src/java/org/apache/or/util/TrecDocument.java
lucene/openrelevance/trunk/src/java/org/apache/or/util/TrecDocumentWriter.java
lucene/openrelevance/trunk/src/java/org/apache/or/util/TrecQrel.java
lucene/openrelevance/trunk/src/java/org/apache/or/util/TrecQrelWriter.java
lucene/openrelevance/trunk/src/java/org/apache/or/util/TrecTopic.java
lucene/openrelevance/trunk/src/java/org/apache/or/util/TrecTopicWriter.java
Modified:
lucene/openrelevance/trunk/README.txt

Added: lucene/openrelevance/trunk/FILEFORMATS.txt
URL: http://svn.apache.org/viewvc/lucene/openrelevance/trunk/FILEFORMATS.txt?rev=881953&view=auto
==============================================================================
--- lucene/openrelevance/trunk/FILEFORMATS.txt (added)
+++ lucene/openrelevance/trunk/FILEFORMATS.txt Wed Nov 18 22:02:45 2009
@@ -0,0 +1,38 @@
+Corpus: gzipped file, formatted like this:
+<DOC>
+<DOCNO>XXX-1</DOCNO>
+<DOCHDR>
+Date: Tue, 09 Dec 2003 22:39:08 GMT
+</DOCHDR>
+blah blah blah
+yackedy smackedy
+</DOC>
+<DOC>
+...
+
+
+ Note: The date is "EEE, dd MMM yyyy kk:mm:ss z" in SimpleDateFormat
+ (or completely blank)
+
+
+
+Queries: text file, formatted like this:
+<top>
+<num> Number: nnn
+<title>yackedy smackedy
+<desc> foo bar foo bar
+<narr> blah blah blah
+
+</top>
+<top>
+...
+
+
+
+Judgements: text file, tab-separated, formatted like this:
+Query# Iteration# DOC# Judgement
+
+Query# corresponds to the Number: nnn in queries.txt
+Iteration# is not used; the tempo converter writes 0.
+DOC# corresponds to the DOCNO from the corpus.
+Judgement is some numeric value (such as 0 or 1) indicating relevance.
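
A minimal sketch of the date note in FILEFORMATS.txt, assuming the header date is written in GMT with English day and month names. The class name DocHeaderDate is hypothetical and not part of this commit. It round-trips the example Date value through the SimpleDateFormat pattern quoted above.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper, not part of the commit: round-trips a DOCHDR date.
public class DocHeaderDate {
  // Pattern quoted in FILEFORMATS.txt.
  static final String PATTERN = "EEE, dd MMM yyyy kk:mm:ss z";

  // SimpleDateFormat is not thread-safe, so build a fresh one per call.
  static SimpleDateFormat newFormat() {
    SimpleDateFormat format = new SimpleDateFormat(PATTERN, Locale.US);
    format.setTimeZone(TimeZone.getTimeZone("GMT")); // assumption: dates are GMT
    return format;
  }

  static Date parse(String value) {
    try {
      return newFormat().parse(value);
    } catch (ParseException e) {
      throw new IllegalArgumentException("Not a DOCHDR date: " + value, e);
    }
  }

  static String format(Date date) {
    return newFormat().format(date);
  }

  public static void main(String[] args) {
    Date date = parse("Tue, 09 Dec 2003 22:39:08 GMT");
    System.out.println("Date: " + format(date));
  }
}
```

Note that 'kk' is the 1-24 hour field in SimpleDateFormat, so a time in the first hour after midnight is formatted with hour 24 rather than 00.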

Added: lucene/openrelevance/trunk/LICENSE.txt
URL: http://svn.apache.org/viewvc/lucene/openrelevance/trunk/LICENSE.txt?rev=881953&view=auto
==============================================================================
--- lucene/openrelevance/trunk/LICENSE.txt (added)
+++ lucene/openrelevance/trunk/LICENSE.txt Wed Nov 18 22:02:45 2009
@@ -0,0 +1,202 @@
+
+ Apache License
+ Version 2.0, January 2004
+ http://www.apache.org/licenses/
+
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+ 1. Definitions.
+
+ "License" shall mean the terms and conditions for use, reproduction,
+ and distribution as defined by Sections 1 through 9 of this document.
+
+ "Licensor" shall mean the copyright owner or entity authorized by
+ the copyright owner that is granting the License.
+
+ "Legal Entity" shall mean the union of the acting entity and all
+ other entities that control, are controlled by, or are under common
+ control with that entity. For the purposes of this definition,
+ "control" means (i) the power, direct or indirect, to cause the
+ direction or management of such entity, whether by contract or
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
+ outstanding shares, or (iii) beneficial ownership of such entity.
+
+ "You" (or "Your") shall mean an individual or Legal Entity
+ exercising permissions granted by this License.
+
+ "Source" form shall mean the preferred form for making modifications,
+ including but not limited to software source code, documentation
+ source, and configuration files.
+
+ "Object" form shall mean any form resulting from mechanical
+ transformation or translation of a Source form, including but
+ not limited to compiled object code, generated documentation,
+ and conversions to other media types.
+
+ "Work" shall mean the work of authorship, whether in Source or
+ Object form, made available under the License, as indicated by a
+ copyright notice that is included in or attached to the work
+ (an example is provided in the Appendix below).
+
+ "Derivative Works" shall mean any work, whether in Source or Object
+ form, that is based on (or derived from) the Work and for which the
+ editorial revisions, annotations, elaborations, or other modifications
+ represent, as a whole, an original work of authorship. For the purposes
+ of this License, Derivative Works shall not include works that remain
+ separable from, or merely link (or bind by name) to the interfaces of,
+ the Work and Derivative Works thereof.
+
+ "Contribution" shall mean any work of authorship, including
+ the original version of the Work and any modifications or additions
+ to that Work or Derivative Works thereof, that is intentionally
+ submitted to Licensor for inclusion in the Work by the copyright owner
+ or by an individual or Legal Entity authorized to submit on behalf of
+ the copyright owner. For the purposes of this definition, "submitted"
+ means any form of electronic, verbal, or written communication sent
+ to the Licensor or its representatives, including but not limited to
+ communication on electronic mailing lists, source code control systems,
+ and issue tracking systems that are managed by, or on behalf of, the
+ Licensor for the purpose of discussing and improving the Work, but
+ excluding communication that is conspicuously marked or otherwise
+ designated in writing by the copyright owner as "Not a Contribution."
+
+ "Contributor" shall mean Licensor and any individual or Legal Entity
+ on behalf of whom a Contribution has been received by Licensor and
+ subsequently incorporated within the Work.
+
+ 2. Grant of Copyright License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ copyright license to reproduce, prepare Derivative Works of,
+ publicly display, publicly perform, sublicense, and distribute the
+ Work and such Derivative Works in Source or Object form.
+
+ 3. Grant of Patent License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ (except as stated in this section) patent license to make, have made,
+ use, offer to sell, sell, import, and otherwise transfer the Work,
+ where such license applies only to those patent claims licensable
+ by such Contributor that are necessarily infringed by their
+ Contribution(s) alone or by combination of their Contribution(s)
+ with the Work to which such Contribution(s) was submitted. If You
+ institute patent litigation against any entity (including a
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
+ or a Contribution incorporated within the Work constitutes direct
+ or contributory patent infringement, then any patent licenses
+ granted to You under this License for that Work shall terminate
+ as of the date such litigation is filed.
+
+ 4. Redistribution. You may reproduce and distribute copies of the
+ Work or Derivative Works thereof in any medium, with or without
+ modifications, and in Source or Object form, provided that You
+ meet the following conditions:
+
+ (a) You must give any other recipients of the Work or
+ Derivative Works a copy of this License; and
+
+ (b) You must cause any modified files to carry prominent notices
+ stating that You changed the files; and
+
+ (c) You must retain, in the Source form of any Derivative Works
+ that You distribute, all copyright, patent, trademark, and
+ attribution notices from the Source form of the Work,
+ excluding those notices that do not pertain to any part of
+ the Derivative Works; and
+
+ (d) If the Work includes a "NOTICE" text file as part of its
+ distribution, then any Derivative Works that You distribute must
+ include a readable copy of the attribution notices contained
+ within such NOTICE file, excluding those notices that do not
+ pertain to any part of the Derivative Works, in at least one
+ of the following places: within a NOTICE text file distributed
+ as part of the Derivative Works; within the Source form or
+ documentation, if provided along with the Derivative Works; or,
+ within a display generated by the Derivative Works, if and
+ wherever such third-party notices normally appear. The contents
+ of the NOTICE file are for informational purposes only and
+ do not modify the License. You may add Your own attribution
+ notices within Derivative Works that You distribute, alongside
+ or as an addendum to the NOTICE text from the Work, provided
+ that such additional attribution notices cannot be construed
+ as modifying the License.
+
+ You may add Your own copyright statement to Your modifications and
+ may provide additional or different license terms and conditions
+ for use, reproduction, or distribution of Your modifications, or
+ for any such Derivative Works as a whole, provided Your use,
+ reproduction, and distribution of the Work otherwise complies with
+ the conditions stated in this License.
+
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
+ any Contribution intentionally submitted for inclusion in the Work
+ by You to the Licensor shall be under the terms and conditions of
+ this License, without any additional terms or conditions.
+ Notwithstanding the above, nothing herein shall supersede or modify
+ the terms of any separate license agreement you may have executed
+ with Licensor regarding such Contributions.
+
+ 6. Trademarks. This License does not grant permission to use the trade
+ names, trademarks, service marks, or product names of the Licensor,
+ except as required for reasonable and customary use in describing the
+ origin of the Work and reproducing the content of the NOTICE file.
+
+ 7. Disclaimer of Warranty. Unless required by applicable law or
+ agreed to in writing, Licensor provides the Work (and each
+ Contributor provides its Contributions) on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+ implied, including, without limitation, any warranties or conditions
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+ PARTICULAR PURPOSE. You are solely responsible for determining the
+ appropriateness of using or redistributing the Work and assume any
+ risks associated with Your exercise of permissions under this License.
+
+ 8. Limitation of Liability. In no event and under no legal theory,
+ whether in tort (including negligence), contract, or otherwise,
+ unless required by applicable law (such as deliberate and grossly
+ negligent acts) or agreed to in writing, shall any Contributor be
+ liable to You for damages, including any direct, indirect, special,
+ incidental, or consequential damages of any character arising as a
+ result of this License or out of the use or inability to use the
+ Work (including but not limited to damages for loss of goodwill,
+ work stoppage, computer failure or malfunction, or any and all
+ other commercial damages or losses), even if such Contributor
+ has been advised of the possibility of such damages.
+
+ 9. Accepting Warranty or Additional Liability. While redistributing
+ the Work or Derivative Works thereof, You may choose to offer,
+ and charge a fee for, acceptance of support, warranty, indemnity,
+ or other liability obligations and/or rights consistent with this
+ License. However, in accepting such obligations, You may act only
+ on Your own behalf and on Your sole responsibility, not on behalf
+ of any other Contributor, and only if You agree to indemnify,
+ defend, and hold each Contributor harmless for any liability
+ incurred by, or claims asserted against, such Contributor by reason
+ of your accepting any such warranty or additional liability.
+
+ END OF TERMS AND CONDITIONS
+
+ APPENDIX: How to apply the Apache License to your work.
+
+ To apply the Apache License to your work, attach the following
+ boilerplate notice, with the fields enclosed by brackets "[]"
+ replaced with your own identifying information. (Don't include
+ the brackets!) The text should be enclosed in the appropriate
+ comment syntax for the file format. We also recommend that a
+ file or class name and description of purpose be included on the
+ same "printed page" as the copyright notice for easier
+ identification within third-party archives.
+
+ Copyright [yyyy] [name of copyright owner]
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.

Modified: lucene/openrelevance/trunk/README.txt
URL: http://svn.apache.org/viewvc/lucene/openrelevance/trunk/README.txt?rev=881953&r1=881952&r2=881953&view=diff
==============================================================================
--- lucene/openrelevance/trunk/README.txt (original)
+++ lucene/openrelevance/trunk/README.txt Wed Nov 18 22:02:45 2009
@@ -1 +1,45 @@
-Placeholder for OpenRelevanceProject README.
+Quick Instructions:
+
+One step: run 'ant'
+ This should download collections and place them in the dist/collections directory
+
+Under a given collection 'example' there would be the following structure:
+ queries.txt: Trec topics file
+ judgements.txt: Relevance judgements file
+ corpus.gz: Gzipped corpus.
+
+See FILEFORMATS.txt for more information on the structure of these files.
+
+
+How to use with the Lucene Java benchmark package:
+
+Step 1: create a contrib/benchmark/conf/openrelevance.alg
+
+### START OF FILE: just an example
+content.source=org.apache.lucene.benchmark.byTask.feeds.TrecContentSource
+content.source.log.step=2500
+doc.term.vector=false
+content.source.forever=false
+directory=FSDirectory
+doc.stored=true
+doc.tokenized=true
+content.source.excludeIteration=true
+ResetSystemErase
+CreateIndex
+{ AddDoc } : *
+CloseIndex
+### END OF FILE
+
+Step 2: place corpus.gz in the contrib/benchmark/work/trec folder.
+ Alternatively, configure a different location in the .alg file.
+
+Step 3: from contrib/benchmark, run 'ant run-task -Dtask.alg=conf/openrelevance.alg'
+ This will create an index in contrib/benchmark/work/index
+
+Step 4:
+ java -cp lucene-core-3.0-dev.jar;lucene-benchmark-3.0-dev.jar org.apache.lucene.benchmark.quality.trec.QueryDriver queries.txt judgements.txt submission.txt contrib/benchmark/work/index
+
+ This prints per-query information, followed by a summary.
+
+ You can also take the resulting submission.txt, along with judgements.txt,
+ and run trec_eval to get "official" calculations.
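
Before indexing, it can help to confirm that the generated corpus.gz is readable. The following is a minimal sketch, not part of this commit: the class name CorpusDocCounter is hypothetical, and it assumes the corpus is UTF-8 text in which each document starts with a line containing only <DOC>, as described in FILEFORMATS.txt.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

// Hypothetical sanity check, not part of the commit.
public class CorpusDocCounter {
  /** Counts lines that consist solely of the <DOC> opening tag. */
  static int countDocs(Reader reader) {
    try (BufferedReader in = new BufferedReader(reader)) {
      int count = 0;
      String line;
      while ((line = in.readLine()) != null) {
        if (line.trim().equals("<DOC>")) {
          count++;
        }
      }
      return count;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Opens a gzipped corpus (assumed UTF-8) and counts its documents. */
  static int countDocsInGzip(String path) {
    try {
      return countDocs(new InputStreamReader(
          new GZIPInputStream(new FileInputStream(path)), StandardCharsets.UTF_8));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    // Example argument: dist/collections/tempo/corpus.gz
    System.out.println(countDocsInGzip(args[0]) + " documents");
  }
}
```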

Added: lucene/openrelevance/trunk/build.xml
URL: http://svn.apache.org/viewvc/lucene/openrelevance/trunk/build.xml?rev=881953&view=auto
==============================================================================
--- lucene/openrelevance/trunk/build.xml (added)
+++ lucene/openrelevance/trunk/build.xml Wed Nov 18 22:02:45 2009
@@ -0,0 +1,45 @@
+<?xml version="1.0"?>
+
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+ -->
+
+<project name="openrelevance" default="build-collections" basedir=".">
+
+ <import file="common-build.xml"/>
+
+ <path id="classpath">
+ <pathelement location="${build.dir}/classes/java"/>
+ </path>
+
+ <macrodef name="collections-crawl">
+ <attribute name="target" default=""/>
+ <attribute name="failonerror" default="true"/>
+ <sequential>
+ <subant target="@{target}" failonerror="@{failonerror}">
+ <fileset dir="."
+ includes="collections/*/build.xml"
+ />
+ </subant>
+ </sequential>
+ </macrodef>
+
+ <target name="build-collections" depends="compile"
+ description="Builds all collections">
+ <collections-crawl target="dist"/>
+ </target>
+
+</project>

Added: lucene/openrelevance/trunk/collections/collections-build.xml
URL: http://svn.apache.org/viewvc/lucene/openrelevance/trunk/collections/collections-build.xml?rev=881953&view=auto
==============================================================================
--- lucene/openrelevance/trunk/collections/collections-build.xml (added)
+++ lucene/openrelevance/trunk/collections/collections-build.xml Wed Nov 18 22:02:45 2009
@@ -0,0 +1,32 @@
+<?xml version="1.0"?>
+
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+ -->
+
+<project name="collections-build">
+ <echo>Building ${ant.project.name}...</echo>
+
+ <property name="build.dir" location="../../build/collections/${ant.project.name}"/>
+ <property name="dist.dir" location="../../dist/collections/${ant.project.name}"/>
+
+ <import file="../common-build.xml"/>
+
+ <path id="classpath">
+ <pathelement path="../../build/classes/java"/>
+ <pathelement path="${build.dir}/classes/java"/>
+ </path>
+</project>

Added: lucene/openrelevance/trunk/collections/tempo/build.xml
URL: http://svn.apache.org/viewvc/lucene/openrelevance/trunk/collections/tempo/build.xml?rev=881953&view=auto
==============================================================================
--- lucene/openrelevance/trunk/collections/tempo/build.xml (added)
+++ lucene/openrelevance/trunk/collections/tempo/build.xml Wed Nov 18 22:02:45 2009
@@ -0,0 +1,60 @@
+<?xml version="1.0"?>
+
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+ -->
+
+<project name="tempo" default="default">
+
+ <import file="../collections-build.xml"/>
+
+ <property name="tempo.zip" location="${build.dir}/download/tempo.zip"/>
+ <available file="${tempo.zip}" property="tempo.exists"/>
+
+ <target name="fetch" unless="tempo.exists">
+ <mkdir dir="${build.dir}/download"/>
+ <get src="http://ilps.science.uva.nl/datafiles/bahasaindonesia/tempo.zip"
+ dest="${tempo.zip}"/>
+ </target>
+
+ <target name="extract" depends="fetch">
+ <unzip src="${tempo.zip}" dest="${build.dir}/extracted">
+ <patternset>
+ <include name="tempo/collection/*"/>
+ </patternset>
+ </unzip>
+ </target>
+
+ <target name="dist" depends="compile,extract">
+ <mkdir dir="${dist.dir}"/>
+ <java classname="org.apache.or.collections.tempo.TempoCorpusConverter">
+ <arg value="${build.dir}/extracted/tempo/collection/tempo"/>
+ <arg value="${dist.dir}/corpus.gz"/>
+ <classpath refid="classpath"/>
+ </java>
+ <java classname="org.apache.or.collections.tempo.TempoTopicConverter">
+ <arg value="${build.dir}/extracted/tempo/collection/tempo.qry"/>
+ <arg value="${dist.dir}/queries.txt"/>
+ <classpath refid="classpath"/>
+ </java>
+ <java classname="org.apache.or.collections.tempo.TempoQrelConverter">
+ <arg value="${build.dir}/extracted/tempo/collection/tempo.qrel"/>
+ <arg value="${dist.dir}/judgements.txt"/>
+ <classpath refid="classpath"/>
+ </java>
+ </target>
+
+</project>

Added: lucene/openrelevance/trunk/collections/tempo/src/java/org/apache/or/collections/tempo/TempoCorpusConverter.java
URL: http://svn.apache.org/viewvc/lucene/openrelevance/trunk/collections/tempo/src/java/org/apache/or/collections/tempo/TempoCorpusConverter.java?rev=881953&view=auto
==============================================================================
--- lucene/openrelevance/trunk/collections/tempo/src/java/org/apache/or/collections/tempo/TempoCorpusConverter.java (added)
+++ lucene/openrelevance/trunk/collections/tempo/src/java/org/apache/or/collections/tempo/TempoCorpusConverter.java Wed Nov 18 22:02:45 2009
@@ -0,0 +1,76 @@
+package org.apache.or.collections.tempo;
+
+import java.io.BufferedReader;
+import java.io.File;
+import java.io.FileInputStream;
+import java.io.InputStreamReader;
+import java.util.Date;
+import java.util.regex.Pattern;
+
+import org.apache.or.util.TrecDocument;
+import org.apache.or.util.TrecDocumentWriter;
+
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+/**
+ * Converts the tempo corpus into a standard format.
+ */
+public class TempoCorpusConverter {
+ static Pattern didPattern = Pattern.compile("^<DOC>\\s*.*$");
+
+ public static void main(String args[]) throws Exception {
+ BufferedReader in = new BufferedReader(new InputStreamReader(
+ new FileInputStream(args[0]), "UTF-8"));
+ TrecDocumentWriter writer = new TrecDocumentWriter(new File(args[1]));
+ TrecDocument doc = new TrecDocument();
+
+ String line = null;
+ String did = null;
+ Date date = new Date(); // this corpus does not have a date, use a fake one.
+ StringBuilder body = new StringBuilder();
+
+ while ((line = in.readLine()) != null) {
+ if (didPattern.matcher(line).matches()) {
+ if (did != null) {
+ doc.setDocname(did);
+ doc.setBody(body);
+ doc.setDate(date);
+ writer.write(doc);
+ }
+ body.setLength(0);
+ did = in.readLine().replace("<DOCID>", "").replace("</DOCID>", "")
+ .trim();
+ body.append(in.readLine().replace("<TITLE>", "")
+ .replace("</TITLE>", "").trim());
+ body.append('\n');
+ in.readLine(); // "<TEXT>"
+ } else {
+ if (!line.equals("</TEXT>") && !line.equals("</DOC>")) {
+ body.append(line);
+ body.append('\n');
+ }
+ }
+ }
+ // the last document
+ doc.setDocname(did);
+ doc.setBody(body);
+ doc.setDate(date);
+ writer.write(doc);
+ writer.close();
+ }
+}

Added: lucene/openrelevance/trunk/collections/tempo/src/java/org/apache/or/collections/tempo/TempoQrelConverter.java
URL: http://svn.apache.org/viewvc/lucene/openrelevance/trunk/collections/tempo/src/java/org/apache/or/collections/tempo/TempoQrelConverter.java?rev=881953&view=auto
==============================================================================
--- lucene/openrelevance/trunk/collections/tempo/src/java/org/apache/or/collections/tempo/TempoQrelConverter.java (added)
+++ lucene/openrelevance/trunk/collections/tempo/src/java/org/apache/or/collections/tempo/TempoQrelConverter.java Wed Nov 18 22:02:45 2009
@@ -0,0 +1,49 @@
+package org.apache.or.collections.tempo;
+
+import java.io.BufferedReader;
+import java.io.File;
+import java.io.FileInputStream;
+import java.io.InputStreamReader;
+
+import org.apache.or.util.TrecQrel;
+import org.apache.or.util.TrecQrelWriter;
+
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+/**
+ * Converts the tempo relevance judgements into a standard format.
+ */
+public class TempoQrelConverter {
+ public static void main(String args[]) throws Exception {
+ BufferedReader in = new BufferedReader(new InputStreamReader(
+ new FileInputStream(args[0]), "UTF-8"));
+ TrecQrelWriter writer = new TrecQrelWriter(new File(args[1]));
+ TrecQrel qrel = new TrecQrel();
+
+ String line = null;
+ while ((line = in.readLine()) != null) {
+ String values[] = line.trim().split("\\s+");
+ qrel.setDocno(values[1]);
+ qrel.setIter("0");
+ qrel.setQid(values[0]);
+ qrel.setRel(1);
+ writer.write(qrel);
+ }
+ writer.close();
+ }
+}
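
The mapping performed by TempoQrelConverter can be sketched without the writer classes. This is a minimal sketch under assumptions: the class name QrelLineSketch is hypothetical, the tab-separated output layout is taken from FILEFORMATS.txt rather than from TrecQrelWriter itself, and each input line is assumed to hold a query number followed by a document number.

```java
// Hypothetical illustration, not part of the commit.
public class QrelLineSketch {
  /**
   * Turns one "query doc" input line into a judgement line:
   * Query#, Iteration#, DOC#, Judgement, separated by tabs.
   */
  static String toJudgementLine(String tempoLine) {
    String[] values = tempoLine.trim().split("\\s+");
    if (values.length < 2) {
      throw new IllegalArgumentException("Expected '<query> <doc>': " + tempoLine);
    }
    String qid = values[0];
    String docno = values[1];
    // Iteration is fixed at 0 and every listed pair is treated as relevant (1),
    // mirroring setIter("0") and setRel(1) in the converter.
    return qid + "\t0\t" + docno + "\t1";
  }

  public static void main(String[] args) {
    // "doc-42" is a made-up document number for illustration.
    System.out.println(toJudgementLine("1 doc-42"));
  }
}
```

Because the input only lists relevant pairs, the output contains no explicit 0 judgements; documents not listed for a query are simply absent.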

Added: lucene/openrelevance/trunk/collections/tempo/src/java/org/apache/or/collections/tempo/TempoTopicConverter.java
URL: http://svn.apache.org/viewvc/lucene/openrelevance/trunk/collections/tempo/src/java/org/apache/or/collections/tempo/TempoTopicConverter.java?rev=881953&view=auto
==============================================================================
--- lucene/openrelevance/trunk/collections/tempo/src/java/org/apache/or/collections/tempo/TempoTopicConverter.java (added)
+++ lucene/openrelevance/trunk/collections/tempo/src/java/org/apache/or/collections/tempo/TempoTopicConverter.java Wed Nov 18 22:02:45 2009
@@ -0,0 +1,53 @@
+package org.apache.or.collections.tempo;
+
+import java.io.BufferedReader;
+import java.io.File;
+import java.io.FileInputStream;
+import java.io.InputStreamReader;
+
+import org.apache.or.util.TrecTopic;
+import org.apache.or.util.TrecTopicWriter;
+
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+/**
+ * Converts the tempo topics into a standard format.
+ * TODO: There is a separate file with more detail.
+ * This file could be used to fully populate the topics.
+ */
+public class TempoTopicConverter {
+ public static void main(String args[]) throws Exception {
+ BufferedReader in = new BufferedReader(new InputStreamReader(
+ new FileInputStream(args[0]), "UTF-8"));
+ TrecTopicWriter writer = new TrecTopicWriter(new File(args[1]));
+ TrecTopic topic = new TrecTopic();
+
+ String line = null;
+ while ((line = in.readLine()) != null) {
+ if (line.startsWith(".Q")) {
+ topic.setNumber(line.replace(".Q ", "").trim());
+ topic.setTitle(in.readLine());
+ topic.setDescription(" ");
+ topic.setNarrative(" ");
+ in.readLine(); // blank line
+ writer.write(topic);
+ }
+ }
+ writer.close();
+ }
+}
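
For orientation, the topic block that queries.txt is expected to contain can be sketched as follows. This is a minimal sketch under assumptions: the class name TopicBlockSketch is hypothetical, and the layout follows the Queries example in FILEFORMATS.txt; the exact whitespace written by TrecTopicWriter may differ.

```java
// Hypothetical illustration, not part of the commit.
public class TopicBlockSketch {
  /** Renders one topic in the <top> layout described in FILEFORMATS.txt. */
  static String format(String number, String title,
                       String description, String narrative) {
    StringBuilder sb = new StringBuilder();
    sb.append("<top>\n");
    sb.append("<num> Number: ").append(number).append('\n');
    sb.append("<title>").append(title).append('\n');
    sb.append("<desc> ").append(description).append('\n');
    sb.append("<narr> ").append(narrative).append('\n');
    sb.append('\n');
    sb.append("</top>\n");
    return sb.toString();
  }

  public static void main(String[] args) {
    // The converter fills description and narrative with a single space,
    // because the input file only provides a number and a title.
    System.out.print(format("1", "yackedy smackedy", " ", " "));
  }
}
```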

Added: lucene/openrelevance/trunk/common-build.xml
URL: http://svn.apache.org/viewvc/lucene/openrelevance/trunk/common-build.xml?rev=881953&view=auto
==============================================================================
--- lucene/openrelevance/trunk/common-build.xml (added)
+++ lucene/openrelevance/trunk/common-build.xml Wed Nov 18 22:02:45 2009
@@ -0,0 +1,68 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+ -->
+
+<project name="common">
+
+ <property name="src.dir" location="src/java"/>
+ <property name="build.dir" location="build"/>
+ <property name="dist.dir" location="dist"/>
+
+ <property name="javac.deprecation" value="off"/>
+ <property name="javac.debug" value="on"/>
+ <property name="javac.source" value="1.5"/>
+ <property name="javac.target" value="1.5"/>
+
+ <property name="build.encoding" value="utf-8"/>
+
+ <target name="compile">
+ <compile
+ srcdir="src/java"
+ destdir="${build.dir}/classes/java">
+ <classpath refid="classpath"/>
+ </compile>
+ </target>
+
+ <target name="clean">
+ <delete dir="${build.dir}"/>
+ <delete dir="${dist.dir}"/>
+ </target>
+
+ <macrodef name="compile">
+ <attribute name="srcdir"/>
+ <attribute name="destdir"/>
+ <attribute name="javac.source" default="${javac.source}"/>
+ <attribute name="javac.target" default="${javac.target}"/>
+ <element name="nested" implicit="yes" optional="yes"/>
+
+ <sequential>
+ <mkdir dir="@{destdir}"/>
+ <javac
+ encoding="${build.encoding}"
+ srcdir="@{srcdir}"
+ destdir="@{destdir}"
+ deprecation="${javac.deprecation}"
+ debug="${javac.debug}"
+ source="@{javac.source}"
+ target="@{javac.target}">
+ <nested/>
+ <compilerarg line="-Xmaxwarns 10000000"/>
+ <compilerarg line="-Xmaxerrs 10000000"/>
+ </javac>
+ </sequential>
+ </macrodef>
+
+</project>

Added: lucene/openrelevance/trunk/src/java/org/apache/or/util/TrecDocument.java
URL: http://svn.apache.org/viewvc/lucene/openrelevance/trunk/src/java/org/apache/or/util/TrecDocument.java?rev=881953&view=auto
==============================================================================
--- lucene/openrelevance/trunk/src/java/org/apache/or/util/TrecDocument.java (added)
+++ lucene/openrelevance/trunk/src/java/org/apache/or/util/TrecDocument.java Wed Nov 18 22:02:45 2009
@@ -0,0 +1,80 @@
+package org.apache.or.util;
+
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+import java.util.Date;
+
+/**
+ * Represents a single TREC document with a name, a body, and a date.
+ */
+public class TrecDocument {
+ private CharSequence docname;
+ private CharSequence body;
+ private Date date;
+
+ public TrecDocument(CharSequence docname, CharSequence body, Date date) {
+ this.docname = docname;
+ this.body = body;
+ this.date = date;
+ }
+
+ public TrecDocument() {
+ }
+
+ /**
+ * @return the docname
+ */
+ public CharSequence getDocname() {
+ return docname;
+ }
+
+ /**
+ * @param docname the docname to set
+ */
+ public void setDocname(CharSequence docname) {
+ this.docname = docname;
+ }
+
+ /**
+ * @return the body
+ */
+ public CharSequence getBody() {
+ return body;
+ }
+
+ /**
+ * @param body the body to set
+ */
+ public void setBody(CharSequence body) {
+ this.body = body;
+ }
+
+ /**
+ * @return the date
+ */
+ public Date getDate() {
+ return date;
+ }
+
+ /**
+ * @param date the date to set
+ */
+ public void setDate(Date date) {
+ this.date = date;
+ }
+}

Added: lucene/openrelevance/trunk/src/java/org/apache/or/util/TrecDocumentWriter.java
URL: http://svn.apache.org/viewvc/lucene/openrelevance/trunk/src/java/org/apache/or/util/TrecDocumentWriter.java?rev=881953&view=auto
==============================================================================
--- lucene/openrelevance/trunk/src/java/org/apache/or/util/TrecDocumentWriter.java (added)
+++ lucene/openrelevance/trunk/src/java/org/apache/or/util/TrecDocumentWriter.java Wed Nov 18 22:02:45 2009
@@ -0,0 +1,56 @@
+package org.apache.or.util;
+
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+import java.io.BufferedWriter;
+import java.io.File;
+import java.io.FileOutputStream;
+import java.io.IOException;
+import java.io.OutputStreamWriter;
+import java.io.Writer;
+import java.util.zip.GZIPOutputStream;
+
+/**
+ * Writes {@link TrecDocument}s to a gzipped corpus file.
+ */
+public class TrecDocumentWriter {
+ private Writer writer;
+
+ public TrecDocumentWriter(File file) throws IOException {
+ this.writer = new BufferedWriter(new OutputStreamWriter(
+ new GZIPOutputStream(new FileOutputStream(file)), "UTF-8"));
+ }
+
+ public void close() throws IOException {
+ writer.close();
+ }
+
+ public void write(TrecDocument doc) throws IOException {
+ writer.write("<DOC>\n");
+ writer.write("<DOCNO>");
+ writer.write(doc.getDocname().toString());
+ writer.write("</DOCNO>\n");
+ writer.write("<DOCHDR>\n");
+ writer.write("Date: ");
+ // TODO: write doc.getDate() here in the date format the TREC header expects
+ writer.write("\n");
+ writer.write("</DOCHDR>\n");
+ writer.write(doc.getBody().toString());
+ writer.write("\n</DOC>\n");
+ }
+}
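
Note on the TODO in write() above: the "Date: " header is currently emitted with no value, so doc.getDate() never reaches the output. One way the missing step might look is sketched below. The date pattern is an assumption (an HTTP-style pattern in GMT); the commit does not say which format the header is supposed to use. The class and method names are hypothetical.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper, not part of the commit: builds the "Date: " header line.
final class TrecDateHeaderSketch {

  // Assumed pattern; the format actually required by the reader is not stated.
  private static final String PATTERN = "EEE, dd MMM yyyy HH:mm:ss z";

  static String formatDateHeader(Date date) {
    if (date == null) {
      return "Date: \n"; // same output as the committed code
    }
    // SimpleDateFormat is not thread-safe, so a new instance is created per call.
    SimpleDateFormat format = new SimpleDateFormat(PATTERN, Locale.US);
    format.setTimeZone(TimeZone.getTimeZone("GMT"));
    return "Date: " + format.format(date) + "\n";
  }

  public static void main(String[] args) {
    System.out.print(formatDateHeader(new Date()));
  }
}
```

Pinning both the locale and the time zone keeps the output independent of the machine that runs the converter.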

Added: lucene/openrelevance/trunk/src/java/org/apache/or/util/TrecQrel.java
URL: http://svn.apache.org/viewvc/lucene/openrelevance/trunk/src/java/org/apache/or/util/TrecQrel.java?rev=881953&view=auto
==============================================================================
--- lucene/openrelevance/trunk/src/java/org/apache/or/util/TrecQrel.java (added)
+++ lucene/openrelevance/trunk/src/java/org/apache/or/util/TrecQrel.java Wed Nov 18 22:02:45 2009
@@ -0,0 +1,96 @@
+package org.apache.or.util;
+
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+/**
+ * A relevance judgment, containing query ID, iteration, doc number,
+ * and relevance score
+ */
+public class TrecQrel {
+ private CharSequence qid;
+ private CharSequence iter;
+ private CharSequence docno;
+ private int rel;
+
+ public TrecQrel(CharSequence qid, CharSequence iter, CharSequence docno,
+ int rel) {
+ this.qid = qid;
+ this.iter = iter;
+ this.docno = docno;
+ this.rel = rel;
+ }
+
+ public TrecQrel() {
+ }
+
+ /**
+ * @return the qid
+ */
+ public CharSequence getQid() {
+ return qid;
+ }
+
+ /**
+ * @param qid the qid to set
+ */
+ public void setQid(CharSequence qid) {
+ this.qid = qid;
+ }
+
+ /**
+ * @return the iter
+ */
+ public CharSequence getIter() {
+ return iter;
+ }
+
+ /**
+ * @param iter the iter to set
+ */
+ public void setIter(CharSequence iter) {
+ this.iter = iter;
+ }
+
+ /**
+ * @return the docno
+ */
+ public CharSequence getDocno() {
+ return docno;
+ }
+
+ /**
+ * @param docno the docno to set
+ */
+ public void setDocno(CharSequence docno) {
+ this.docno = docno;
+ }
+
+ /**
+ * @return the rel
+ */
+ public int getRel() {
+ return rel;
+ }
+
+ /**
+ * @param rel the rel to set
+ */
+ public void setRel(int rel) {
+ this.rel = rel;
+ }
+}

Added: lucene/openrelevance/trunk/src/java/org/apache/or/util/TrecQrelWriter.java
URL: http://svn.apache.org/viewvc/lucene/openrelevance/trunk/src/java/org/apache/or/util/TrecQrelWriter.java?rev=881953&view=auto
==============================================================================
--- lucene/openrelevance/trunk/src/java/org/apache/or/util/TrecQrelWriter.java (added)
+++ lucene/openrelevance/trunk/src/java/org/apache/or/util/TrecQrelWriter.java Wed Nov 18 22:02:45 2009
@@ -0,0 +1,52 @@
+package org.apache.or.util;
+
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+import java.io.BufferedWriter;
+import java.io.File;
+import java.io.FileOutputStream;
+import java.io.IOException;
+import java.io.OutputStreamWriter;
+import java.io.Writer;
+
+/**
+ * Writes {@link TrecQrel}s to a judgments file.
+ */
+public class TrecQrelWriter {
+ private Writer writer;
+
+ public TrecQrelWriter(File file) throws IOException {
+ this.writer = new BufferedWriter(new OutputStreamWriter(
+ new FileOutputStream(file), "UTF-8"));
+ }
+
+ public void close() throws IOException {
+ writer.close();
+ }
+
+ public void write(TrecQrel qrel) throws IOException {
+ writer.write(qrel.getQid().toString());
+ writer.write("\t");
+ writer.write(qrel.getIter().toString());
+ writer.write("\t");
+ writer.write(qrel.getDocno().toString());
+ writer.write("\t");
+ writer.write(Integer.toString(qrel.getRel()));
+ writer.write("\n");
+ }
+}
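
Note on write() above: each judgment becomes one line with four tab-separated fields in the order query ID, iteration, document number, relevance. The sketch below builds that line as a string so the layout is easy to check. The class and method names are hypothetical and not part of this commit.

```java
// Hypothetical helper, not part of the commit: formats one qrel line.
final class QrelLineSketch {

  /** Mirrors the field order used by TrecQrelWriter.write: qid, iter, docno, rel. */
  static String formatQrel(CharSequence qid, CharSequence iter, CharSequence docno, int rel) {
    return new StringBuilder()
        .append(qid).append('\t')
        .append(iter).append('\t')
        .append(docno).append('\t')
        .append(rel).append('\n')
        .toString();
  }

  public static void main(String[] args) {
    System.out.print(formatQrel("12", "0", "doc-7", 1));
  }
}
```

The sample values ("12", "0", "doc-7") are made up for illustration.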

Added: lucene/openrelevance/trunk/src/java/org/apache/or/util/TrecTopic.java
URL: http://svn.apache.org/viewvc/lucene/openrelevance/trunk/src/java/org/apache/or/util/TrecTopic.java?rev=881953&view=auto
==============================================================================
--- lucene/openrelevance/trunk/src/java/org/apache/or/util/TrecTopic.java (added)
+++ lucene/openrelevance/trunk/src/java/org/apache/or/util/TrecTopic.java Wed Nov 18 22:02:45 2009
@@ -0,0 +1,96 @@
+package org.apache.or.util;
+
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+/**
+ * A Topic (query) with number, title, description, and narrative
+ */
+public class TrecTopic {
+ private CharSequence number;
+ private CharSequence title;
+ private CharSequence description;
+ private CharSequence narrative;
+
+ public TrecTopic(CharSequence number, CharSequence title,
+ CharSequence description, CharSequence narrative) {
+ this.number = number;
+ this.title = title;
+ this.description = description;
+ this.narrative = narrative;
+ }
+
+ public TrecTopic() {
+ }
+
+ /**
+ * @return the number
+ */
+ public CharSequence getNumber() {
+ return number;
+ }
+
+ /**
+ * @param number the number to set
+ */
+ public void setNumber(CharSequence number) {
+ this.number = number;
+ }
+
+ /**
+ * @return the title
+ */
+ public CharSequence getTitle() {
+ return title;
+ }
+
+ /**
+ * @param title the title to set
+ */
+ public void setTitle(CharSequence title) {
+ this.title = title;
+ }
+
+ /**
+ * @return the description
+ */
+ public CharSequence getDescription() {
+ return description;
+ }
+
+ /**
+ * @param description the description to set
+ */
+ public void setDescription(CharSequence description) {
+ this.description = description;
+ }
+
+ /**
+ * @return the narrative
+ */
+ public CharSequence getNarrative() {
+ return narrative;
+ }
+
+ /**
+ * @param narrative the narrative to set
+ */
+ public void setNarrative(CharSequence narrative) {
+ this.narrative = narrative;
+ }
+
+}

Added: lucene/openrelevance/trunk/src/java/org/apache/or/util/TrecTopicWriter.java
URL: http://svn.apache.org/viewvc/lucene/openrelevance/trunk/src/java/org/apache/or/util/TrecTopicWriter.java?rev=881953&view=auto
==============================================================================
--- lucene/openrelevance/trunk/src/java/org/apache/or/util/TrecTopicWriter.java (added)
+++ lucene/openrelevance/trunk/src/java/org/apache/or/util/TrecTopicWriter.java Wed Nov 18 22:02:45 2009
@@ -0,0 +1,58 @@
+package org.apache.or.util;
+
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+import java.io.BufferedWriter;
+import java.io.File;
+import java.io.FileOutputStream;
+import java.io.IOException;
+import java.io.OutputStreamWriter;
+import java.io.Writer;
+
+/**
+ * Writes {@link TrecTopic}s to a topics file.
+ */
+public class TrecTopicWriter {
+ private Writer writer;
+
+ public TrecTopicWriter(File file) throws IOException {
+ this.writer = new BufferedWriter(new OutputStreamWriter(
+ new FileOutputStream(file), "UTF-8"));
+ }
+
+ public void close() throws IOException {
+ writer.close();
+ }
+
+ public void write(TrecTopic topic) throws IOException {
+ writer.write("<top>\n");
+ writer.write("<num> Number: ");
+ writer.write(topic.getNumber().toString());
+ writer.write("\n");
+ writer.write("<title>");
+ writer.write(topic.getTitle().toString());
+ writer.write("\n");
+ writer.write("<desc>");
+ writer.write(topic.getDescription().toString());
+ writer.write("\n");
+ writer.write("<narr>");
+ writer.write(topic.getNarrative().toString());
+ writer.write("\n\n");
+ writer.write("</top>\n");
+ }
+}
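
Note on write() above: it calls toString() on every getter result, so a TrecTopic with an unset field would throw a NullPointerException. The sketch below produces the same text layout from plain values and substitutes a single space for a null field. That substitution is an assumption, chosen to match the " " placeholders the tempo topic converter sets for description and narrative. The class and method names are hypothetical and not part of this commit.

```java
// Hypothetical helper, not part of the commit: renders one <top> block.
final class TopicBlockSketch {

  private static String orBlank(CharSequence value) {
    return value == null ? " " : value.toString(); // assumed fallback
  }

  /** Produces the same layout as TrecTopicWriter.write. */
  static String formatTopic(CharSequence number, CharSequence title,
      CharSequence description, CharSequence narrative) {
    return "<top>\n"
        + "<num> Number: " + orBlank(number) + "\n"
        + "<title>" + orBlank(title) + "\n"
        + "<desc>" + orBlank(description) + "\n"
        + "<narr>" + orBlank(narrative) + "\n\n"
        + "</top>\n";
  }

  public static void main(String[] args) {
    System.out.print(formatTopic("7", "tempo query", null, null));
  }
}
```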

Discussion Overview
group: openrelevance-dev @
categories: lucene
posted: Nov 18, '09 at 10:03p
active: Nov 18, '09 at 10:03p
posts: 1
users: 1
website: lucene.apache.org...
