Greetings!

When I stumbled across this project, I read the background material and
the notes about the difficulties with obtaining the materials from NIST
(licensing issues).

I tracked down the NIST (TIPSTER disks) materials, which were described
as follows:
The documents in the test collection are varied in style, size and
subject domain. The first disk contains material from the Wall Street
Journal,
<http://www.ldc.upenn.edu/Catalog/desc/addenda/LDC93T3B_WSJsample>
(1986, 1987, 1988, 1989), the AP Newswire
<http://www.ldc.upenn.edu/Catalog/desc/addenda/LDC93T3B_APsample>
(1989), the Federal Register
<http://www.ldc.upenn.edu/Catalog/desc/addenda/LDC93T3B_FRsample>
(1989), information from Computer Select
<http://www.ldc.upenn.edu/Catalog/desc/addenda/LDC93T3B_CSsample>
disks (Ziff-Davis Publishing) and short abstracts from the Department
of Energy
<http://www.ldc.upenn.edu/Catalog/desc/addenda/LDC93T3B_DOEsample>.
The second disk contains information from the same sources, but from
different years. The third disk contains more information from the
Computer Select disks, plus material from the San Jose Mercury News
<http://www.ldc.upenn.edu/Catalog/desc/addenda/LDC93T3D_SJMercurysample>
(1991), more AP newswire (1990) and about 250 megabytes of formatted
U.S. Patents
<http://www.ldc.upenn.edu/Catalog/desc/addenda/LDC93T3D_USPatent>. The
format of all the documents is relatively clean and easy to use, with
SGML-like tags separating documents and document fields. There is no
part-of-speech tagging or breakdown into individual sentences or
paragraphs as the purpose of this collection is to test retrieval
against real-world data.
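
For anyone who has not seen these disks, the markup looks roughly like
this (illustrative; <DOC> and <DOCNO> are the standard TREC tags, while
the remaining fields vary by source):

    <DOC>
    <DOCNO> WSJ870324-0001 </DOCNO>
    <HL> John Blair Is Near Accord To Sell Unit </HL>
    <TEXT>
    ...
    </TEXT>
    </DOC>

and a few lines of Python are enough to split a file into documents:

    import re

    # Each file holds many documents; <DOC> ... </DOC> brackets each
    # one, and <DOCNO> carries the document id.
    DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
    DOCNO_RE = re.compile(r"<DOCNO>\s*(\S+)\s*</DOCNO>")

    def iter_docs(path):
        with open(path, encoding="latin-1") as f:
            text = f.read()
        for match in DOC_RE.finditer(text):
            body = match.group(1)
            yield DOCNO_RE.search(body).group(1), body
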
But are those really representative of all the documents that are
encountered in a modern searching context?

Considering the prevalence of email, for example, as compared to the
Wall Street Journal (1986-1989), I suspect email archives should be a
major part of any such corpus.

Thinking along those lines made me realize that the Apache Foundation
already has:

1) Email list archives
2) Source code
3) Program documentation
4) Wikis
5) Webpages

all of which Apache participants have the expertise to judge for
relevance. (Unlike some of the TREC collections, such as the tobacco
settlement documents, which would require legal expertise.)

There are other text collections that could be used, but it occurred to
me that starting close to home might avoid some of the licensing issues
that were troublesome in the past.

Apologies if this has been discussed before, but I was unable to find
email archives for this project.

Hope everyone is having a great day!

Patrick

--
Patrick Durusau
patrick@durusau.net
Chair, V1 - US TAG to JTC 1/SC 34
Convener, JTC 1/SC 34/WG 3 (Topic Maps)
Editor, OpenDocument Format TC (OASIS), Project Editor ISO/IEC 26300
Co-Editor, ISO/IEC 13250-1, 13250-5 (Topic Maps)

Another Word For It (blog): http://tm.durusau.net
Homepage: http://www.durusau.net
Twitter: patrickDurusau


  • Grant Ingersoll at Jun 9, 2011 at 12:10 pm

    On Jun 8, 2011, at 5:09 AM, Patrick Durusau wrote:

    <snip>
    But are those really representative of all the documents that are encountered in a modern searching context?
    No. Well, the newswire ones probably still resemble current newswire, and the Fed Register probably still resembles that kind of material, albeit w/ updated language.
    Considering the prevalence of email, for example, as compared to the Wall Street Journal (1986-1989), I suspect email archives should be a major part of any such corpus.

    Thinking along those lines made me realize that the Apache Foundation already has:

    1) Email list archives
    Yep, I have these posted up on S3. http://asf-mail-archives.s3-website-us-east-1.amazonaws.com/
    2) Source code
    3) Program documentation
    4) Wikis
    5) Webpages

    <snip>

    There are other text collections that could be used but it occurred to me that starting close to home might avoid some of the licensing issues that were troublesome in the past.
    Definitely. What we need is a way of gathering judgments as well as collecting queries, etc.
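
    (TREC-style qrels would be one simple starting point: each judgment
    is a single line of topic id, iteration, document id, and relevance
    value. A sketch of a loader, just to make the format concrete:)

        from collections import defaultdict

        def load_qrels(path):
            """Read TREC-style qrels: 'topic iteration docno relevance' per line."""
            judgments = defaultdict(dict)  # topic -> {docno: relevance}
            with open(path) as f:
                for line in f:
                    topic, _iteration, docno, relevance = line.split()
                    judgments[topic][docno] = int(relevance)
            return judgments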

    I think we should also take the public NIST ones and host all of them here as well, along w/ judgments and queries so that it all just works seamlessly.
    Apologies if this has been discussed before but I was unable to find email archives for this project.
    http://www.lucidimagination.com/search/?q=#/p:openrelevance
    --------------------------
    Grant Ingersoll
  • Patrick Durusau at Jun 9, 2011 at 12:29 pm
    Grant,

    Thanks for the quick reply and pointers!

    On the mailing list archive question:
    On 6/9/2011 8:10 AM, Grant Ingersoll wrote:
    On Jun 8, 2011, at 5:09 AM, Patrick Durusau wrote:
    Apologies if this has been discussed before but I was unable to find email archives for this project.
    http://www.lucidimagination.com/search/?q=#/p:openrelevance
    Thinking I had overlooked a pointer to the mailing list archives for
    openrelevance-dev, I looked at the Lucene mailing list page:

    http://lucene.apache.org/mail.html

    which lists other subproject mailing lists, but not the openrelevance lists.

    The mailing list page for openrelevance (ORP),
    http://lucene.apache.org/openrelevance/mail.html, has no pointers to
    archives.

    Is this the right place to ask about adding pointers to the ORP mail
    archives?

    I would not have thought to look at the Lucid site and I suspect others
    could make the same mistake.

    Hope you are having a great day!

    Patrick
  • Patrick Durusau at Jun 9, 2011 at 11:23 pm
    Grant,

    On your comments about the Fed Register and hosting of a collection:

    <snip>
    But are those really representative of all the documents that are encountered in a modern searching context?
    No. Well, the newswire ones probably still emulate current newswire and the Fed Register probably still simulates that kind of stuff, albeit w/ updated language.

    1) The 1989 Federal Register used on the NIST disks isn't online. Is the
    idea to re-use the relevance judgments from prior work? If not, then the
    availability of the 1989 Federal Register is a moot point.

    2) Other than its presence in the NIST collection, is there some other
    reason for choosing the Federal Register as a resource?

    The Federal Register is composed of presidential documents, notices of
    new regulations (from all departments), and announcements of various
    sorts. The regulation part feeds into the Code of Federal Regulations;
    both the Register and the CFR are available in annual bulk XML:

    Federal Register: http://www.gpo.gov/fdsys/bulkdata/FR

    Code of Federal Regulations: http://www.gpo.gov/fdsys/bulkdata/CFR
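
    (Pulling down a year for local indexing should be a one-liner; the
    exact file name below is hypothetical, so check the directory
    listing on the GPO site first:)

        import urllib.request

        # Hypothetical archive name; browse http://www.gpo.gov/fdsys/bulkdata/FR
        # to see how the yearly files are actually laid out.
        url = "http://www.gpo.gov/fdsys/bulkdata/FR/2010/FR-2010.zip"
        urllib.request.urlretrieve(url, "FR-2010.zip")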

    Judging "relevance" against specialized materials is doable but is going
    to depend on the number of people that can be attracted to contribute
    the "relevance" judgments that then underlie analysis of the corpus.

    The Federal Register caught my eye because I was familiar with it a very
    long time ago.

    <snip>
    There are other text collections that could be used but it occurred to me that starting close to home might avoid some of the licensing issues that were troublesome in the past.
    Definitely. What we need is a way of gathering judgments as well as collecting queries, etc.

    I think we should also take the public NIST ones and host all of them here as well, along w/ judgments and queries so that it all just works seamlessly.
    Well, but the NIST, Brown, and other corpus efforts, in addition to
    being hobbled by last-century licensing agreements and dated data, were
    products of a time when gathering that much electronic data was a
    non-trivial task. So it was important to deliver each collection as a set.

    I see no reason why we could not publish a checksum against, say, a
    particular year of the Federal Register (if you want to include it) and
    allow the data to stay where it is.
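
    (A minimal sketch of what I mean, assuming SHA-256 is an acceptable
    choice; the checksum pins down exactly which bytes a set of
    judgments refers to, wherever the data happens to be hosted:)

        import hashlib

        def sha256sum(path, chunk_size=1 << 20):
            """Stream the file so a large corpus never has to fit in memory."""
            digest = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(chunk_size), b""):
                    digest.update(chunk)
            return digest.hexdigest()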

    Unless by "...it just works seamlessly" you are envisioning some hosted
    dataset + queries + relevance measures sort of setup.

    I am sure that is possible, but it introduces a layer of complexity on
    top of identifying datasets, creating relevance measures for those
    datasets, and collecting the queries that serve as a baseline for
    judgments of relevance.

    Hope you are having a great day!

    Patrick
