That's something sometimes referred to as "in-situ map reduce," and it
is not something Hadoop and its associated tools generally do.
We'd have to solve problems like handling a node crashing mid-run
(it was the only one that had its data! Now what?), etc. The usual way
to solve this kind of issue in the wild is to set up a process that
moves your local log files into Hadoop, perhaps with metadata about
where they came from (directories named after hosts? metadata files?
lots of options here), and then runs jobs over those files in HDFS.
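As a minimal sketch of that approach: a small script run periodically on
each node could push its local logs into a per-host HDFS directory. The
paths and the /logs/<hostname>/ layout here are illustrative assumptions,
not an established convention, and the script assumes the `hadoop`
command is on the PATH.

```shell
#!/bin/sh
# Hypothetical per-node log shipper: uploads local log files into HDFS
# under a directory named after this host, then removes the local copy.
LOCAL_DIR=/var/log/myapp          # assumed local log directory
HDFS_DIR=/logs/$(hostname)        # host name records where files came from

# Create the target directory; ignore the error if it already exists.
hadoop fs -mkdir "$HDFS_DIR" 2>/dev/null

for f in "$LOCAL_DIR"/*.log; do
  [ -e "$f" ] || continue         # no logs present, nothing to do
  # -put fails if the target already exists, so re-runs won't clobber
  # previously uploaded files; only delete locally after a successful put.
  hadoop fs -put "$f" "$HDFS_DIR/$(basename "$f")" && rm "$f"
done
```

A Pig script could then read everything at once with a glob such as
LOAD '/logs/*/*.log', with the originating host recoverable from the
file path if the loader exposes it.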
I am not sure how active it is right now, but you might want to look
into a subproject of Hadoop called Chukwa for handling this type of
log collection.
On Sat, Jun 18, 2011 at 12:02 PM, Dylan Scott wrote:
I was wondering what a good approach would be to the following: on each node
in a Hadoop cluster I have the same directory with different log files in
it (in the local filesystem, not HDFS). I'd like to load these files such
that each node in the cluster maps over the files in its copy of
the directory. Are there existing LoadFuncs that would support this?