Hello,

I've developed an extension to Heritrix (the Internet Archive's open source
crawler) that allows it to write directly into HDFS. It looks like the
developers over there are interested in including it in their project.
I've designed it to write SequenceFiles, using the URL as the key and the
HTTP response as the value. I've got a couple of questions that I could use
a little help with:

1. I can't seem to set the replication factor on a SequenceFile. There's no
way to pass it in, and when I call the createWriter factory and then call
FileSystem.setReplication, the file still seems to use the default value. Is
there any way to do this, or should I file an enhancement request?

2. It appears that the Configuration class looks for the conf/ directory on
the CLASSPATH. This makes it difficult to integrate with Heritrix. For
now, I've modified the Heritrix launch script by hardcoding the Hadoop
configuration directory into the CLASSPATH. It seems like a better approach
would be to provide a text box on the Heritrix settings page that allows
the user to enter the path to the Hadoop configuration directory.
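
To make the settings-page idea in question 2 a little more concrete, here is a minimal sketch using only the JDK. It takes a user-supplied directory and wraps it in a class loader, so that resources in that directory can be found without editing the launch script. The class name ConfDirClassLoaderSketch, the method withConfDir, and the resource name hadoop-site.xml are illustrative assumptions, not part of Heritrix or Hadoop. Whether Configuration actually consults the thread context class loader should be checked against the Hadoop version in use.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: not part of Heritrix or Hadoop.
final class ConfDirClassLoaderSketch {

    /**
     * Builds a class loader that can see resources in confDir in addition
     * to everything visible through the parent loader.
     */
    static ClassLoader withConfDir(String confDir, ClassLoader parent) {
        Path dir = Paths.get(confDir).toAbsolutePath();
        if (!Files.isDirectory(dir)) {
            throw new IllegalArgumentException("Not a directory: " + dir);
        }
        try {
            // For an existing directory, toUri() yields a URI ending in "/",
            // which URLClassLoader needs in order to treat the URL as a
            // directory rather than a JAR file.
            URL url = dir.toUri().toURL();
            return new URLClassLoader(new URL[] { url }, parent);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Cannot convert to URL: " + dir, e);
        }
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.err.println("usage: ConfDirClassLoaderSketch <conf-dir>");
            return;
        }
        // args[0] stands in for the value typed into the settings page.
        Thread current = Thread.currentThread();
        ClassLoader loader = withConfDir(args[0], current.getContextClassLoader());
        // Assumption: code that resolves resources through the context class
        // loader on this thread will now also search the supplied directory.
        current.setContextClassLoader(loader);
        // Prints the URL of the file if it exists in the directory, else null.
        System.out.println(loader.getResource("hadoop-site.xml"));
    }
}
```

The context class loader is set per thread, so in a crawler with many worker threads the loader would need to be installed on each thread that constructs a Configuration, or be set before those threads are created.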

- Doug Judd
doug@zvents.com
http://www.zvents.com/

Posted to common-dev (hadoop) on Jan 16, '07 at 4:24a