Hello,

I've written an extension to the Internet Archive's open source "Heritrix"
crawler that lets it write into HDFS in SequenceFile format. The key
is the URL and the value is the HTTP response with some additional
metadata. It's quite simple to use: just drop a few jar files into
the Heritrix lib/ directory and you're good to go. Here's a link to the
download page: http://www.zvents.com/labs/hdfs_writer_processor . If
you're interested, give it a whirl and feel free to send me
feedback.

- Doug Judd
doug@zvents.com
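
To make the record layout described in the post concrete (URL as key, HTTP response as value), here is a minimal Python sketch of length-prefixed key/value records. It is an illustration only: it is not Hadoop's actual SequenceFile on-disk format and not the extension's code, and the function names (write_record, read_records, demo) and the example URLs are hypothetical.

```python
import io
import struct


def write_record(stream, url, response):
    """Append one key/value record: two 4-byte big-endian lengths, then the bytes."""
    key = url.encode("utf-8")
    stream.write(struct.pack(">II", len(key), len(response)))
    stream.write(key)
    stream.write(response)


def read_records(stream):
    """Yield (url, response) pairs until the stream is exhausted."""
    while True:
        header = stream.read(8)
        if not header:
            return  # clean end of stream
        if len(header) < 8:
            raise ValueError("truncated record header")
        key_len, value_len = struct.unpack(">II", header)
        key = stream.read(key_len)
        value = stream.read(value_len)
        if len(key) < key_len or len(value) < value_len:
            raise ValueError("truncated record body")
        yield key.decode("utf-8"), value


def demo():
    # Hypothetical crawl results; a real value would also carry crawl metadata.
    buf = io.BytesIO()
    write_record(buf, "http://example.com/", b"HTTP/1.1 200 OK\r\n\r\n<html></html>")
    write_record(buf, "http://example.com/a", b"HTTP/1.1 404 Not Found\r\n\r\n")
    buf.seek(0)
    return list(read_records(buf))


if __name__ == "__main__":
    for url, response in demo():
        print(url, len(response))
```

Keeping the URL as the key means a downstream job can look up or group responses by URL without parsing the response body.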

Posted to common-dev (category: hadoop) on Jan 26, '07 at 1:24a