I've developed an extension to Heritrix (the Internet Archive's open source
crawler) that allows it to write directly into HDFS. It looks like the
developers over there are interested in including it in their project.
I've designed it to write SequenceFiles, using the URL as the key and the
HTTP response as the value. I have a couple of questions that I could use
a little help with:

1. I can't seem to set the replication factor on a SequenceFile. There's no
way to pass it in, and when I call the createWriter factory and then call
FileSystem.setReplication, the file still seems to use the default value. Is
there any way to do this, or should I file an enhancement request?

2. It appears that the Configuration class looks for the conf/ directory on
the CLASSPATH. This makes it difficult to integrate with Heritrix. For
now, I've modified the Heritrix launch script by hardcoding the Hadoop
configuration directory into the CLASSPATH. It seems like a better approach
would be to provide a text box on the Heritrix settings page that allows
the user to enter the path to the Hadoop configuration directory.
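
To make the second point concrete, here is a rough sketch of how a
user-supplied directory could be made visible at runtime instead of being
hardcoded in the launch script. It is only an illustration under stated
assumptions: it assumes Configuration resolves its resource files through
the thread context class loader, and that the site file is named
hadoop-site.xml. ConfDirLoader and withConfDir are hypothetical names, not
part of Heritrix or Hadoop, and the sketch uses only the Java standard
library.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: exposes a configuration directory as a class loader
// resource root, so code that looks files up on the classpath can find them.
final class ConfDirLoader {

    // Returns a class loader that resolves resources from confDir first
    // delegating to parent, then falling back to confDir itself.
    static ClassLoader withConfDir(Path confDir, ClassLoader parent) throws IOException {
        if (!Files.isDirectory(confDir)) {
            throw new IOException("Not a directory: " + confDir);
        }
        // For an existing directory, Path.toUri() yields a URI ending in '/',
        // which URLClassLoader requires to treat the URL as a directory.
        URL[] urls = { confDir.toUri().toURL() };
        return new URLClassLoader(urls, parent);
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the path a user would type into a settings page.
        Path confDir = Files.createTempDirectory("hadoop-conf");
        // Assumed file name; the content here is a placeholder.
        Files.write(confDir.resolve("hadoop-site.xml"),
                "<configuration/>".getBytes(StandardCharsets.UTF_8));

        Thread current = Thread.currentThread();
        ClassLoader loader = withConfDir(confDir, current.getContextClassLoader());
        // Install the loader before any configuration object is created.
        current.setContextClassLoader(loader);

        System.out.println("Resolved: " + loader.getResource("hadoop-site.xml"));
    }
}
```

The crawler would call withConfDir with whatever path the user entered and
install the resulting loader before creating any Hadoop objects. Whether
Configuration actually honors the context class loader is the assumption
that would need checking first.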

- Doug Judd

Group: common-dev
Posted: Jan 16, '07 at 4:24a
