Hi folks,

I have a Hadoop 0.20.2 map-only job with thousands of input tasks;
I'm using the org.apache.nutch.tools.arc.ArcInputFormat input format,
so each task corresponds to a single file in HDFS.

Most of the way into the job it hits a task that causes the input
format to OOM, and after 4 attempts it fails the job.
Now this is obviously not great, but for the purposes of my job I'd be
happy to just throw this input file away; it's only one of thousands
and I don't need exact results.

The trouble is I can't work out which file this task corresponds to.

The closest I can find is that the job history file lists a STATE_STRING, e.g.
STATE_STRING="hdfs://ip-10-115-29-44\.ec2\.internal:9000/user/hadoop/arc_files\.aa/2009/09/17/0/1253240925734_0\.arc\.gz:0+100425468"

but this is _only_ recorded for the successfully completed tasks; for the
failed one I'm actually interested in there is no such entry, just:
MapAttempt TASK_TYPE="MAP" TASKID="task_201112030459_0011_m_004130"
TASK_ATTEMPT_ID="attempt_201112030459_0011_m_004130_0"
TASK_STATUS="FAILED" FINISH_TIME="1322901661261"
HOSTNAME="ip-10-218-57-227\.ec2\.internal" ERROR="Error: null" .

I grepped through all the Hadoop logs and couldn't find anything that
relates this task to the files in its split.
Any ideas where this info might be recorded?

Cheers,
Mat


  • Bejoy Ks at Dec 4, 2011 at 9:13 am
    Hi Mat
    I'm not aware of a built-in mechanism in Hadoop that logs the input
    split (i.e. the file name) each mapper is processing. To get at that
    you may have to do some custom logging: just log the input file name
    at the start of the map method. The full file path in HDFS can be
    obtained from the input split as follows:

    // get the file split being processed (assumes a file-based input format)
    FileSplit fileSplit = (FileSplit) context.getInputSplit();
    // log the full HDFS path of the file being processed
    log.debug(fileSplit.getPath());

    This works with the new MapReduce API. With the old MapReduce API you
    can get the same information from the JobConf, as in
    job.get("map.input.file");
    a line you can include in your mapper's configure method.
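
    For instance, here is a minimal sketch of the old-API approach (the
    class name and key/value types are illustrative, not taken from your
    job, and "map.input.file" is only populated for file-based splits):

    import java.io.IOException;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class LoggingMapper extends MapReduceBase
        implements Mapper<Text, BytesWritable, Text, Text> {

      public void configure(JobConf job) {
        // runs once per task attempt, before any map() calls;
        // "map.input.file" holds the HDFS path of this task's input file,
        // and stderr shows up in the task attempt's logs
        System.err.println("input file: " + job.get("map.input.file"));
      }

      public void map(Text key, BytesWritable value,
                      OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        // ... your actual map logic ...
      }
    }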

    Hope it helps!...

    Regards
    Bejoy.K.S

    On Sun, Dec 4, 2011 at 4:05 AM, Mat Kelcey wrote:
  • Praveen Sripati at Dec 4, 2011 at 12:21 pm
    Mat,

    There is actually no need to know which input file caused the task, and
    ultimately the job, to fail.

    Set 'mapreduce.map.failures.maxpercent' and
    'mapreduce.reduce.failures.maxpercent' to the failure tolerance you want,
    and the job will complete irrespective of some task failures.
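
    For example, a minimal sketch at job-setup time (a sketch only; these
    are the property names given above, and exact names can differ between
    Hadoop versions, so check the ones your release recognizes):

    import org.apache.hadoop.mapred.JobConf;

    JobConf conf = new JobConf(MyJob.class); // MyJob is illustrative
    // tolerate up to 1% failed map/reduce tasks before failing the job
    conf.set("mapreduce.map.failures.maxpercent", "1");
    conf.set("mapreduce.reduce.failures.maxpercent", "1");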

    Again, this is one of the hidden features of Hadoop, even though it was
    introduced back in 2007 (HADOOP-1144).

    If you would like to really nail down the problem, you could use the
    IsolationRunner (per the tutorial below: keep the failed task's files
    around via keep.failed.task.files and re-run that task in isolation on
    the node where it failed). Here is more information on it:

    http://hadoop.apache.org/common/docs/r0.20.205.0/mapred_tutorial.html#IsolationRunner

    Regards,
    Praveen
    On Sun, Dec 4, 2011 at 2:42 PM, Bejoy Ks wrote:
  • Mat Kelcey at Dec 4, 2011 at 3:16 pm

    Set 'mapreduce.map.failures.maxpercent' and
    'mapreduce.reduce.failures.maxpercent' to the failure tolerance you want,
    and the job will complete irrespective of some task failures.
    Thanks!
    I crawled the javadoc looking for something like this; I wonder why it's hidden?
    Mat
  • Mat Kelcey at Dec 4, 2011 at 3:13 pm
    Good idea; I guess I'll just have to run it up again, though.
    Thanks
    On 4 December 2011 01:12, Bejoy Ks wrote:
  • Harsh J at Dec 4, 2011 at 3:37 pm
    Mat,

    Perhaps you can simply set a percentage of failure toleration for your job.

    Doable via http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setMaxMapTaskFailuresPercent(int) and http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobConf.html#setMaxReduceTaskFailuresPercent(int)

    If you set it to 10%, your job still passes even if up to 10% of the total map or reduce tasks fail. I think this fits your use case.
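
    For example, a minimal sketch using those setters (MyJob here is
    illustrative, not from your code):

    import org.apache.hadoop.mapred.JobConf;

    JobConf conf = new JobConf(MyJob.class);
    // let the job succeed even if up to 10% of its map (or reduce) tasks fail
    conf.setMaxMapTaskFailuresPercent(10);
    conf.setMaxReduceTaskFailuresPercent(10);
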
    On 04-Dec-2011, at 4:05 AM, Mat Kelcey wrote:
  • Praveen Sripati at Dec 4, 2011 at 4:36 pm
    Mat,

    I could not find the properties in the documentation, which is why I
    described this feature as hidden. As Harsh mentioned, there is an API.

    There was a blog entry from Cloudera on 'Automatically Documenting
    Apache Hadoop Configuration'. It would be great if that were contributed
    to Apache and made part of the build process. I suggested it before, but
    there was no response.

    http://www.cloudera.com/blog/2011/08/automatically-documenting-apache-hadoop-configuration/

    Regards,
    Praveen
    On Sun, Dec 4, 2011 at 9:07 PM, Harsh J wrote:
