I have a Hadoop 0.20.2 map-only job with thousands of input tasks.
I'm using the org.apache.nutch.tools.arc.ArcInputFormat input format,
so each task corresponds to a single file in HDFS.
Most of the way into the job it hits a task that causes the input
format to OOM. After 4 attempts it fails the job.
Now this is obviously not great, but for the purposes of my job I'd be
happy to just throw this input file away; it's only one of thousands
and I don't need exact results.
The trouble is I can't work out which file this task corresponds to.
The closest I can find is that the job history file lists a STATE_STRING, e.g.
STATE_STRING="hdfs://ip-10-115-29-44\.ec2\.internal:9000/user/hadoop/arc_files\.aa/2009/09/17/0/1253240925734_0\.arc\.gz:0+100425468"
but this is _only_ recorded for the successfully completed tasks. For the failed
one I'm actually interested in, there is nothing useful:
MapAttempt TASK_TYPE="MAP" TASKID="task_201112030459_0011_m_004130"
HOSTNAME="ip-10-218-57-227\.ec2\.internal" ERROR="Error: null" .
I grepped through all the Hadoop logs and couldn't find anything that
relates this task to the files in its split.
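Failing that, the only workaround I can think of is process of elimination: pull the path out of every STATE_STRING in the job history file and subtract those from the full list of input files. Below is a minimal sketch of that idea, not something I have run against a real history file. It assumes every successful map task logs a STATE_STRING of the form path:offset+length with dots escaped as "\." (as in the sample above), and that the input list uses the same fully qualified hdfs:// form. The class and method names are my own, and the second input path in main is a made-up placeholder.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Hadoop or Nutch.
class HistorySplits {
    // Matches STATE_STRING="<path>:<offset>+<length>" as seen in the sample line.
    private static final Pattern STATE =
        Pattern.compile("STATE_STRING=\"(.*?):(\\d+)\\+(\\d+)\"");

    /** Returns the unescaped split path from one history line, or null if there is none. */
    static String splitPath(String historyLine) {
        Matcher m = STATE.matcher(historyLine);
        if (!m.find()) {
            return null;
        }
        // The history file escapes dots as "\."; undo that.
        return m.group(1).replace("\\.", ".");
    }

    /** Inputs that never show up in a STATE_STRING of the given history lines. */
    static Set<String> unaccounted(List<String> allInputs, List<String> historyLines) {
        Set<String> missing = new TreeSet<String>(allInputs);
        for (String line : historyLines) {
            String path = splitPath(line);
            if (path != null) {
                missing.remove(path);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // The STATE_STRING from the sample history line above.
        String ok = "STATE_STRING=\"hdfs://ip-10-115-29-44\\.ec2\\.internal:9000"
            + "/user/hadoop/arc_files\\.aa/2009/09/17/0/1253240925734_0\\.arc\\.gz:0+100425468\"";
        // The failed attempt carries no STATE_STRING at all.
        String failed = "MapAttempt TASK_TYPE=\"MAP\" ERROR=\"Error: null\" .";

        List<String> inputs = Arrays.asList(
            splitPath(ok),
            // Made-up placeholder standing in for the unknown bad file.
            "hdfs://example-namenode:9000/placeholder/bad_file.arc.gz");

        // Prints whichever inputs no successful task accounted for.
        System.out.println(unaccounted(inputs, Arrays.asList(ok, failed)));
    }
}
```

In practice the history lines would be read from the job history file and the input list from something like hadoop fs -lsr, whose paths would probably need the hdfs://host:port prefix added before they compare equal. It also only narrows things down to one file if exactly one task failed.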
Any ideas where this info might be recorded?