Still catching up on background and have some questions:

Grant mentioned the ASF email archives the other day in response to a
question I asked about using those as a corpus.

Reading the background document I see:

> We have started a preliminary crawl of Creative Commons content using
> Nutch. This is currently hosted on a private machine, but we would
> like to bring this "in house" to the ASF and have the ASF host both
> the crawling and the dissemination of the data. This, obviously, will
> need to be supported by the ASF infrastructure, as it is potentially
> quite burdensome in terms of disk space and bandwidth.

Is that still an operational assumption for at least one corpus?

I ask because designing for and focusing on a single corpus, perhaps
not the largest one possible (a subset of the email archives, say),
could serve as a shakedown run to create processes and test
assumptions, not to mention demonstrate the viability of the project
to others.

Hope everyone is having a great weekend!


PS: I know use of the TREC corpus was investigated. I know there are
other corpus research projects. Has there been an effort to survey
those for existing corpora with better licensing terms, or for likely
alliances? I am thinking there may be projects that would offer better
terms in order to have the imprimatur of being part of an ASF umbrella
corpus.

Patrick Durusau
Chair, V1 - US TAG to JTC 1/SC 34
Convener, JTC 1/SC 34/WG 3 (Topic Maps)
Editor, OpenDocument Format TC (OASIS), Project Editor ISO/IEC 26300
Co-Editor, ISO/IEC 13250-1, 13250-5 (Topic Maps)

Another Word For It (blog): http://tm.durusau.net
Homepage: http://www.durusau.net
Twitter: patrickDurusau

group: openrelevance-dev @
posted: Jun 12, '11 at 3:30p