FAQ
Hello,
I have been getting
Too many fetch failures (in the map operation)
and
shuffle error (in the reduce operation)

and am unable to complete any job on the cluster.

I have 5 slaves in the cluster. So I have the following values in the hadoop-site.xml file:
<name>mapred.map.tasks</name>
<value>53</value>
<!-- 53 = a prime close to 5*10 -->

<name>mapred.reduce.tasks</name>
<value>7</value>
<!-- 7 = a prime close to the number of slaves (5) -->
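For reference, the "prime close to N" heuristic that these comments (and the config descriptions quoted later in the thread) refer to can be sketched in a few lines of Python; the helper names are illustrative, not part of Hadoop:

def is_prime(n):
    # Trial division; plenty for task counts this small.
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    f = 3
    while f * f <= n:
        if n % f == 0:
            return False
        f += 2
    return True

def prime_close_to(n):
    # Search outward from n and return the first prime found
    # (ties resolve downward: 47 and 53 are both 3 away from 50).
    for d in range(n + 2):
        if is_prime(n - d):
            return n - d
        if is_prime(n + d):
            return n + d

print(prime_close_to(5 * 10))  # 47 (53, chosen above, is equally close)
print(prime_close_to(5))       # 5; any small prime near the host count will do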

Please let me know what the suggested fix for this would be.

The Hadoop version I am using is hadoop-0.16.3, installed on Ubuntu.

Thanks!
--Sayali




  • Amar Kamat at Jun 20, 2008 at 4:57 am

    Sayali Kulkarni wrote:
    Hello,
    I have been getting
    Too many fetch failures (in the map operation)
    and
    shuffle error (in the reduce operation)
    Can you post the reducer logs? How many nodes are there in the cluster?
    Are you seeing this for all the maps and reducers? Are the reducers
    progressing at all? Are all the maps that the reducer is failing to fetch
    on a remote machine? Are all the failed maps/reducers from the same
    machine? Can you provide some more details?
    Amar
    [rest of quoted message trimmed]
  • Sayali Kulkarni at Jun 20, 2008 at 5:47 am
    Can you post the reducer logs? How many nodes are there in the cluster?
    There are 6 nodes in the cluster - 1 master and 5 slaves.
    I tried to reduce the number of nodes, and found that the problem is solved only if there is a single node in the cluster. So I can deduce that the problem lies somewhere in the configuration.

    Configuration file:
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

    <!-- Put site-specific property overrides in this file. -->

    <configuration>

    <property>
    <name>hadoop.tmp.dir</name>
    <value>/extra/HADOOP/hadoop-0.16.3/tmp/dir/hadoop-${user.name}</value>
    <description>A base for other temporary directories.</description>
    </property>

    <property>
    <name>fs.default.name</name>
    <value>hdfs://10.105.41.25:54310</value>
    <description>The name of the default file system. A URI whose
    scheme and authority determine the FileSystem implementation. The
    uri's scheme determines the config property (fs.SCHEME.impl) naming
    the FileSystem implementation class. The uri's authority is used to
    determine the host, port, etc. for a filesystem.</description>
    </property>

    <property>
    <name>mapred.job.tracker</name>
    <value>10.105.41.25:54311</value>
    <description>The host and port that the MapReduce job tracker runs
    at. If "local", then jobs are run in-process as a single map
    and reduce task.
    </description>
    </property>

    <property>
    <name>dfs.replication</name>
    <value>2</value>
    <description>Default block replication.
    The actual number of replications can be specified when the file is created.
    The default is used if replication is not specified in create time.
    </description>
    </property>


    <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1048M</value>
    </property>

    <property>
    <name>mapred.local.dir</name>
    <value>/extra/HADOOP/hadoop-0.16.3/tmp/mapred</value>
    </property>

    <property>
    <name>mapred.map.tasks</name>
    <value>53</value>
    <description>The default number of map tasks per job. Typically set
    to a prime several times greater than number of available hosts.
    Ignored when mapred.job.tracker is "local".
    </description>
    </property>

    <property>
    <name>mapred.reduce.tasks</name>
    <value>7</value>
    <description>The default number of reduce tasks per job. Typically set
    to a prime close to the number of available hosts. Ignored when
    mapred.job.tracker is "local".
    </description>
    </property>

    </configuration>


    ============
    This is the output that I get when running the tasks with 2 nodes in the cluster:

    08/06/20 11:07:45 INFO mapred.FileInputFormat: Total input paths to process : 1
    08/06/20 11:07:45 INFO mapred.JobClient: Running job: job_200806201106_0001
    08/06/20 11:07:46 INFO mapred.JobClient: map 0% reduce 0%
    08/06/20 11:07:53 INFO mapred.JobClient: map 8% reduce 0%
    08/06/20 11:07:55 INFO mapred.JobClient: map 17% reduce 0%
    08/06/20 11:07:57 INFO mapred.JobClient: map 26% reduce 0%
    08/06/20 11:08:00 INFO mapred.JobClient: map 34% reduce 0%
    08/06/20 11:08:01 INFO mapred.JobClient: map 43% reduce 0%
    08/06/20 11:08:04 INFO mapred.JobClient: map 47% reduce 0%
    08/06/20 11:08:05 INFO mapred.JobClient: map 52% reduce 0%
    08/06/20 11:08:08 INFO mapred.JobClient: map 60% reduce 0%
    08/06/20 11:08:09 INFO mapred.JobClient: map 69% reduce 0%
    08/06/20 11:08:10 INFO mapred.JobClient: map 73% reduce 0%
    08/06/20 11:08:12 INFO mapred.JobClient: map 78% reduce 0%
    08/06/20 11:08:13 INFO mapred.JobClient: map 82% reduce 0%
    08/06/20 11:08:15 INFO mapred.JobClient: map 91% reduce 1%
    08/06/20 11:08:16 INFO mapred.JobClient: map 95% reduce 1%
    08/06/20 11:08:18 INFO mapred.JobClient: map 99% reduce 3%
    08/06/20 11:08:23 INFO mapred.JobClient: map 100% reduce 3%
    08/06/20 11:08:25 INFO mapred.JobClient: map 100% reduce 7%
    08/06/20 11:08:28 INFO mapred.JobClient: map 100% reduce 10%
    08/06/20 11:08:30 INFO mapred.JobClient: map 100% reduce 11%
    08/06/20 11:08:33 INFO mapred.JobClient: map 100% reduce 12%
    08/06/20 11:08:35 INFO mapred.JobClient: map 100% reduce 14%
    08/06/20 11:08:38 INFO mapred.JobClient: map 100% reduce 15%
    08/06/20 11:09:54 INFO mapred.JobClient: map 100% reduce 13%
    08/06/20 11:09:54 INFO mapred.JobClient: Task Id : task_200806201106_0001_r_000002_0, Status : FAILED
    Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
    08/06/20 11:09:56 INFO mapred.JobClient: map 100% reduce 9%
    08/06/20 11:09:56 INFO mapred.JobClient: Task Id : task_200806201106_0001_r_000003_0, Status : FAILED
    Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
    08/06/20 11:09:56 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000011_0, Status : FAILED
    Too many fetch-failures
    08/06/20 11:09:57 INFO mapred.JobClient: map 95% reduce 9%
    08/06/20 11:09:59 INFO mapred.JobClient: map 100% reduce 9%
    08/06/20 11:10:04 INFO mapred.JobClient: map 100% reduce 10%
    08/06/20 11:10:07 INFO mapred.JobClient: map 100% reduce 11%
    08/06/20 11:10:09 INFO mapred.JobClient: map 100% reduce 13%
    08/06/20 11:10:12 INFO mapred.JobClient: map 100% reduce 14%
    08/06/20 11:10:14 INFO mapred.JobClient: map 100% reduce 15%
    08/06/20 11:10:17 INFO mapred.JobClient: map 100% reduce 16%
    08/06/20 11:10:24 INFO mapred.JobClient: map 100% reduce 13%
    08/06/20 11:10:24 INFO mapred.JobClient: Task Id : task_200806201106_0001_r_000000_0, Status : FAILED
    Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
    08/06/20 11:10:29 INFO mapred.JobClient: map 100% reduce 11%
    08/06/20 11:10:29 INFO mapred.JobClient: Task Id : task_200806201106_0001_r_000001_0, Status : FAILED
    Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
    08/06/20 11:10:29 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000003_0, Status : FAILED
    Too many fetch-failures
    08/06/20 11:10:32 INFO mapred.JobClient: map 100% reduce 12%
    08/06/20 11:10:37 INFO mapred.JobClient: map 100% reduce 13%
    08/06/20 11:10:42 INFO mapred.JobClient: map 100% reduce 14%
    08/06/20 11:10:47 INFO mapred.JobClient: map 100% reduce 16%
    08/06/20 11:10:52 INFO mapred.JobClient: map 95% reduce 16%
    08/06/20 11:10:52 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000020_0, Status : FAILED
    Too many fetch-failures
    08/06/20 11:10:54 INFO mapred.JobClient: map 100% reduce 16%
    08/06/20 11:11:02 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000017_0, Status : FAILED
    Too many fetch-failures
    08/06/20 11:11:09 INFO mapred.JobClient: map 100% reduce 17%
    08/06/20 11:11:24 INFO mapred.JobClient: map 95% reduce 17%
    08/06/20 11:11:24 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000007_0, Status : FAILED
    Too many fetch-failures
    08/06/20 11:11:27 INFO mapred.JobClient: map 100% reduce 17%
    08/06/20 11:11:32 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000012_0, Status : FAILED
    Too many fetch-failures
    08/06/20 11:11:34 INFO mapred.JobClient: map 95% reduce 17%
    08/06/20 11:11:34 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000019_0, Status : FAILED
    Too many fetch-failures
    08/06/20 11:11:39 INFO mapred.JobClient: map 91% reduce 18%
    08/06/20 11:11:39 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000002_0, Status : FAILED
    Too many fetch-failures
    08/06/20 11:11:41 INFO mapred.JobClient: map 95% reduce 18%
    08/06/20 11:11:42 INFO mapred.JobClient: map 100% reduce 19%
    08/06/20 11:11:42 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000006_0, Status : FAILED
    Too many fetch-failures
    08/06/20 11:11:44 INFO mapred.JobClient: map 100% reduce 17%
    08/06/20 11:11:44 INFO mapred.JobClient: Task Id : task_200806201106_0001_r_000003_1, Status : FAILED
    Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
    08/06/20 11:11:51 INFO mapred.JobClient: map 100% reduce 18%
    08/06/20 11:11:54 INFO mapred.JobClient: map 100% reduce 19%
    08/06/20 11:11:59 INFO mapred.JobClient: map 95% reduce 19%
    08/06/20 11:11:59 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000010_0, Status : FAILED
    Too many fetch-failures
    08/06/20 11:12:02 INFO mapred.JobClient: map 100% reduce 19%
    08/06/20 11:12:07 INFO mapred.JobClient: map 100% reduce 20%
    08/06/20 11:12:08 INFO mapred.JobClient: map 100% reduce 33%
    08/06/20 11:12:09 INFO mapred.JobClient: map 100% reduce 47%
    08/06/20 11:12:11 INFO mapred.JobClient: map 100% reduce 60%
    08/06/20 11:12:16 INFO mapred.JobClient: map 100% reduce 62%
    08/06/20 11:12:24 INFO mapred.JobClient: map 100% reduce 63%
    08/06/20 11:12:26 INFO mapred.JobClient: map 100% reduce 64%
    08/06/20 11:12:31 INFO mapred.JobClient: map 100% reduce 65%
    08/06/20 11:12:31 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000019_1, Status : FAILED
    Too many fetch-failures
    08/06/20 11:12:36 INFO mapred.JobClient: map 100% reduce 66%
    08/06/20 11:12:38 INFO mapred.JobClient: map 100% reduce 67%
    08/06/20 11:12:39 INFO mapred.JobClient: map 100% reduce 80%

    ===============
    Are you seeing this for all the maps and reducers?
    Yes, this happens on all the maps and reducers. I tried to keep just 2 nodes in the cluster but the problem still exists.
    Are the reducers progressing at all?
    The reducers continue to execute up to a certain point, but after that they just do not proceed at all. They stop at an average of 16%.
    Are all the maps that the reducer is failing to fetch on a remote machine?
    Yes.
    Are all the failed maps/reducers from the same machine?
    All the maps and reducers are failing anyway.

    Thanks for the help in advance,

    Regards,
    Sayali


  • Amar Kamat at Jun 20, 2008 at 6:44 am
    Yeah. With 2 nodes the reducers will go up to 16% because each reducer
    is able to fetch map outputs from the same machine (locally) but fails
    to copy them from the remote machine. A common reason in such cases is
    *restricted machine access* (a firewall, etc.). The web server on a
    machine/node hosts the map outputs, which the reducers on the other
    machine are not able to access. There will be a URL associated with each
    map that the reducer tries to fetch (check the reducer logs for this
    URL). Just try accessing it manually from the reducer's machine/node.
    Most likely this experiment will also fail. Let us know if this is not
    the case.
    Amar
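
    A minimal way to script Amar's manual check, as a sketch: from the reducer's machine, test whether the tasktracker's map-output web server is reachable at all. The hostname below is a placeholder, and 50060 is the default tasktracker.http.port in this Hadoop generation; substitute the values from your own reducer logs.

    import socket

    host, port = "slave1", 50060  # placeholder tasktracker host; default HTTP port
    try:
        conn = socket.create_connection((host, port), timeout=5)
        conn.close()
        print("tasktracker web server reachable; map-output fetches should work")
    except OSError as err:
        print("cannot connect - likely a firewall or /etc/hosts problem:", err)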
    Sayali Kulkarni wrote:
    [quoted text trimmed]
  • Tarandeep Singh at Jun 30, 2008 at 9:08 pm
    I am getting this error as well.
    As Sayali mentioned in her mail, I updated the /etc/hosts file with the
    slave machines' IP addresses, but I am still getting this error.

    Amar, which is the URL that you were talking about in your mail -
    "There will be a URL associated with each map that the reducer tries to
    fetch (check the reducer logs for this URL)"?

    Please tell me where I should look for it... I will try to access it
    manually to see if this error is due to a firewall.

    Thanks,
    Taran
    On Thu, Jun 19, 2008 at 11:43 PM, Amar Kamat wrote:
    [quoted text trimmed]
  • Amar Kamat at Jul 1, 2008 at 5:17 am

    Tarandeep Singh wrote:
    [quoted text trimmed]
    One thing you can do is to check whether all the maps that failed while
    being fetched ran on a remote host. Look at the web UI to find out where
    each map task finished, and look at the reduce task logs to find out
    which map fetches failed.

    I am not sure if the reduce task logs have it. Try this:
    port = tasktracker.http.port (this is set through the conf)
    tthost = the tasktracker hostname (the destination tasktracker from
    which the map output needs to be fetched)
    jobid = the complete job id, "job_..."
    mapid = the task attempt id, "attempt_...", that successfully completed
    the map
    reduce-partition-id = the partition number of the reduce task;
    task_..._r_$i_$j has int-value($i) as its reduce-partition-id.

    url =
    http://'$tthost':'$port'/mapOutput?job='$jobid'&map='$mapid'&reduce='$reduce-partition-id'
    where each '$var' is what you have to substitute.
    Amar
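
    As a sketch, Amar's recipe can be scripted: assemble the mapOutput URL from the pieces above and try to GET it from the reducer's machine. Every concrete value below is a placeholder (the ids are copied from the log excerpt earlier in this thread); the URL shape is the one Amar gives above.

    import urllib.request

    tthost = "slave1"                            # destination tasktracker (placeholder)
    port = 50060                                 # tasktracker.http.port
    jobid = "job_200806201106_0001"              # complete job id
    mapid = "task_200806201106_0001_m_000011_0"  # a completed map attempt id
    reduce_partition = 2                         # int-value($i) from task_..._r_$i_$j

    url = ("http://%s:%d/mapOutput?job=%s&map=%s&reduce=%d"
           % (tthost, port, jobid, mapid, reduce_partition))
    print("fetching", url)
    try:
        resp = urllib.request.urlopen(url, timeout=10)
        print("HTTP", resp.getcode(), "- the fetch works from this machine")
    except Exception as err:
        print("fetch failed, matching the firewall hypothesis:", err)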

  • Sayali Kulkarni at Jun 21, 2008 at 5:54 am
    Hi!

    My problem of "Too many fetch failures" as well as "shuffle error" was resolved when I added the list of all the slave machines to the /etc/hosts file.

    Earlier, every slave's /etc/hosts file had entries only for the master and the machine itself. I have now updated all the /etc/hosts files to include the IP addresses and names of all the machines in the cluster, and my problem is resolved.
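
    For illustration, the resulting /etc/hosts on every node looks something like this; the master address is the one from the config earlier in the thread, while the slave addresses and hostnames are made up:

    127.0.0.1      localhost
    10.105.41.25   master
    10.105.41.26   slave1
    10.105.41.27   slave2
    10.105.41.28   slave3
    10.105.41.29   slave4
    10.105.41.30   slave5

    The important part is that every node resolves every other node's hostname to the right address, so each reducer can reach the web server that serves map outputs on every other node.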

    One question still:
    I currently have just 5-6 nodes. But when Hadoop is deployed on a larger cluster, say 1000+ nodes, is it expected that every time a new machine is added to the cluster, an entry is added to the /etc/hosts of all the (1000+) machines in the cluster?


    Regards,
    Sayali

    Sayali Kulkarni wrote:
    Can you post the reducer logs. How many nodes are there in the cluster?
    There are 6 nodes in the cluster - 1 master and 5 slaves
    I tried to reduce the number of nodes, and found that the problem is solved only if there is a single node in the cluster. So I can deduce that the problem is there in some configuration.

    Configuration file:
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

    <!-- Put site-specific property overrides in this file. -->

    <configuration>

    <property>
    <name>hadoop.tmp.dir</name>
    <value>/extra/HADOOP/hadoop-0.16.3/tmp/dir/hadoop-${user.name}</value>
    <description>A base for other temporary directories.</description>
    </property>

    <property>
    <name>fs.default.name</name>
    <value>hdfs://10.105.41.25:54310</value>
    <description>The name of the default file system. A URI whose
    scheme and authority determine the FileSystem implementation. The
    uri's scheme determines the config property (fs.SCHEME.impl) naming
    the FileSystem implementation class. The uri's authority is used to
    determine the host, port, etc. for a filesystem.</description>
    </property>

    <property>
    <name>mapred.job.tracker</name>
    <value>10.105.41.25:54311</value>
    <description>The host and port that the MapReduce job tracker runs
    at. If "local", then jobs are run in-process as a single map
    and reduce task.
    </description>
    </property>

    <property>
    <name>dfs.replication</name>
    <value>2</value>
    <description>Default block replication.
    The actual number of replications can be specified when the file is created.
    The default is used if replication is not specified in create time.
    </description>
    </property>


    <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1048M</value>
    </property>

    <property>
    <name>mapred.local.dir</name>
    <value>/extra/HADOOP/hadoop-0.16.3/tmp/mapred</value>
    </property>

    <property>
    <name>mapred.map.tasks</name>
    <value>53</value>
    <description>The default number of map tasks per job. Typically set
    to a prime several times greater than number of available hosts.
    Ignored when mapred.job.tracker is "local".
    </description>
    </property>

    <property>
    <name>mapred.reduce.tasks</name>
    <value>7</value>
    <description>The default number of reduce tasks per job. Typically set
    to a prime close to the number of available hosts. Ignored when
    mapred.job.tracker is "local".
    </description>
    </property>

    </configuration>


    ============
    This is the output that I get when running the tasks with 2 nodes in the cluster:

    08/06/20 11:07:45 INFO mapred.FileInputFormat: Total input paths to process : 1
    08/06/20 11:07:45 INFO mapred.JobClient: Running job: job_200806201106_0001
    08/06/20 11:07:46 INFO mapred.JobClient: map 0% reduce 0%
    08/06/20 11:07:53 INFO mapred.JobClient: map 8% reduce 0%
    08/06/20 11:07:55 INFO mapred.JobClient: map 17% reduce 0%
    08/06/20 11:07:57 INFO mapred.JobClient: map 26% reduce 0%
    08/06/20 11:08:00 INFO mapred.JobClient: map 34% reduce 0%
    08/06/20 11:08:01 INFO mapred.JobClient: map 43% reduce 0%
    08/06/20 11:08:04 INFO mapred.JobClient: map 47% reduce 0%
    08/06/20 11:08:05 INFO mapred.JobClient: map 52% reduce 0%
    08/06/20 11:08:08 INFO mapred.JobClient: map 60% reduce 0%
    08/06/20 11:08:09 INFO mapred.JobClient: map 69% reduce 0%
    08/06/20 11:08:10 INFO mapred.JobClient: map 73% reduce 0%
    08/06/20 11:08:12 INFO mapred.JobClient: map 78% reduce 0%
    08/06/20 11:08:13 INFO mapred.JobClient: map 82% reduce 0%
    08/06/20 11:08:15 INFO mapred.JobClient: map 91% reduce 1%
    08/06/20 11:08:16 INFO mapred.JobClient: map 95% reduce 1%
    08/06/20 11:08:18 INFO mapred.JobClient: map 99% reduce 3%
    08/06/20 11:08:23 INFO mapred.JobClient: map 100% reduce 3%
    08/06/20 11:08:25 INFO mapred.JobClient: map 100% reduce 7%
    08/06/20 11:08:28 INFO mapred.JobClient: map 100% reduce 10%
    08/06/20 11:08:30 INFO mapred.JobClient: map 100% reduce 11%
    08/06/20 11:08:33 INFO mapred.JobClient: map 100% reduce 12%
    08/06/20 11:08:35 INFO mapred.JobClient: map 100% reduce 14%
    08/06/20 11:08:38 INFO mapred.JobClient: map 100% reduce 15%
    08/06/20 11:09:54 INFO mapred.JobClient: map 100% reduce 13%
    08/06/20 11:09:54 INFO mapred.JobClient: Task Id : task_200806201106_0001_r_000002_0, Status : FAILED
    Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
    08/06/20 11:09:56 INFO mapred.JobClient: map 100% reduce 9%
    08/06/20 11:09:56 INFO mapred.JobClient: Task Id : task_200806201106_0001_r_000003_0, Status : FAILED
    Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
    08/06/20 11:09:56 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000011_0, Status : FAILED
    Too many fetch-failures
    08/06/20 11:09:57 INFO mapred.JobClient: map 95% reduce 9%
    08/06/20 11:09:59 INFO mapred.JobClient: map 100% reduce 9%
    08/06/20 11:10:04 INFO mapred.JobClient: map 100% reduce 10%
    08/06/20 11:10:07 INFO mapred.JobClient: map 100% reduce 11%
    08/06/20 11:10:09 INFO mapred.JobClient: map 100% reduce 13%
    08/06/20 11:10:12 INFO mapred.JobClient: map 100% reduce 14%
    08/06/20 11:10:14 INFO mapred.JobClient: map 100% reduce 15%
    08/06/20 11:10:17 INFO mapred.JobClient: map 100% reduce 16%
    08/06/20 11:10:24 INFO mapred.JobClient: map 100% reduce 13%
    08/06/20 11:10:24 INFO mapred.JobClient: Task Id : task_200806201106_0001_r_000000_0, Status : FAILED
    Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
    08/06/20 11:10:29 INFO mapred.JobClient: map 100% reduce 11%
    08/06/20 11:10:29 INFO mapred.JobClient: Task Id : task_200806201106_0001_r_000001_0, Status : FAILED
    Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
    08/06/20 11:10:29 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000003_0, Status : FAILED
    Too many fetch-failures
    08/06/20 11:10:32 INFO mapred.JobClient: map 100% reduce 12%
    08/06/20 11:10:37 INFO mapred.JobClient: map 100% reduce 13%
    08/06/20 11:10:42 INFO mapred.JobClient: map 100% reduce 14%
    08/06/20 11:10:47 INFO mapred.JobClient: map 100% reduce 16%
    08/06/20 11:10:52 INFO mapred.JobClient: map 95% reduce 16%
    08/06/20 11:10:52 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000020_0, Status : FAILED
    Too many fetch-failures
    08/06/20 11:10:54 INFO mapred.JobClient: map 100% reduce 16%
    08/06/20 11:11:02 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000017_0, Status : FAILED
    Too many fetch-failures
    08/06/20 11:11:09 INFO mapred.JobClient: map 100% reduce 17%
    08/06/20 11:11:24 INFO mapred.JobClient: map 95% reduce 17%
    08/06/20 11:11:24 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000007_0, Status : FAILED
    Too many fetch-failures
    08/06/20 11:11:27 INFO mapred.JobClient: map 100% reduce 17%
    08/06/20 11:11:32 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000012_0, Status : FAILED
    Too many fetch-failures
    08/06/20 11:11:34 INFO mapred.JobClient: map 95% reduce 17%
    08/06/20 11:11:34 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000019_0, Status : FAILED
    Too many fetch-failures
    08/06/20 11:11:39 INFO mapred.JobClient: map 91% reduce 18%
    08/06/20 11:11:39 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000002_0, Status : FAILED
    Too many fetch-failures
    08/06/20 11:11:41 INFO mapred.JobClient: map 95% reduce 18%
    08/06/20 11:11:42 INFO mapred.JobClient: map 100% reduce 19%
    08/06/20 11:11:42 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000006_0, Status : FAILED
    Too many fetch-failures
    08/06/20 11:11:44 INFO mapred.JobClient: map 100% reduce 17%
    08/06/20 11:11:44 INFO mapred.JobClient: Task Id : task_200806201106_0001_r_000003_1, Status : FAILED
    Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
    08/06/20 11:11:51 INFO mapred.JobClient: map 100% reduce 18%
    08/06/20 11:11:54 INFO mapred.JobClient: map 100% reduce 19%
    08/06/20 11:11:59 INFO mapred.JobClient: map 95% reduce 19%
    08/06/20 11:11:59 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000010_0, Status : FAILED
    Too many fetch-failures
    08/06/20 11:12:02 INFO mapred.JobClient: map 100% reduce 19%
    08/06/20 11:12:07 INFO mapred.JobClient: map 100% reduce 20%
    08/06/20 11:12:08 INFO mapred.JobClient: map 100% reduce 33%
    08/06/20 11:12:09 INFO mapred.JobClient: map 100% reduce 47%
    08/06/20 11:12:11 INFO mapred.JobClient: map 100% reduce 60%
    08/06/20 11:12:16 INFO mapred.JobClient: map 100% reduce 62%
    08/06/20 11:12:24 INFO mapred.JobClient: map 100% reduce 63%
    08/06/20 11:12:26 INFO mapred.JobClient: map 100% reduce 64%
    08/06/20 11:12:31 INFO mapred.JobClient: map 100% reduce 65%
    08/06/20 11:12:31 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000019_1, Status : FAILED
    Too many fetch-failures
    08/06/20 11:12:36 INFO mapred.JobClient: map 100% reduce 66%
    08/06/20 11:12:38 INFO mapred.JobClient: map 100% reduce 67%
    08/06/20 11:12:39 INFO mapred.JobClient: map 100% reduce 80%

    ===============
    Are you seeing this for all the maps and reducers?
    Yes, this happens for all the maps and reducers. I tried keeping just 2 nodes in the cluster, but the problem still exists.
    Are the reducers progressing at all?
    The reducers continue to execute up to a certain point, but after that they do not proceed at all. They stall at an average of 16%.
    Are all the maps that the reducer is failing from a remote machine?
    Yes.
    Are all the failed maps/reducers from the same machine?
    All the maps and reducers are failing anyway.
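
    Since the failing fetches all involve map output on remote machines, one
    quick check is whether every node can resolve every other node's hostname.
    A minimal sketch, assuming the nodes are named master and slave1..slave5
    (placeholder names), to be run on each node:

    # report any cluster hostname the local resolver cannot look up
    for h in master slave1 slave2 slave3 slave4 slave5; do
        getent hosts $h || echo "cannot resolve $h"
    done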

    Thanks for the help in advance,

    Regards,
    Sayali


  • Daniel Blaisdell at Jun 21, 2008 at 1:46 pm
    After a certain threshold, your annoyance level will push you to
    configure a DNS server. :)

    -Daniel
    On Sat, Jun 21, 2008 at 1:53 AM, Sayali Kulkarni wrote:

    Hi!

    My problem of "Too many fetch failures" as well as the "shuffle error" was
    resolved when I added the list of all the slave machines to the /etc/hosts
    file.

    Earlier, on every slave I just had the entries for the master and the
    machine itself in the /etc/hosts file. I have now updated the /etc/hosts
    file on every machine to include the IP addresses and hostnames of all the
    machines in the cluster, and my problem is resolved.
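
    For reference, a minimal sketch of such an /etc/hosts file (the master
    address is the one from the configuration above; the slave addresses and
    hostnames are placeholders):

    127.0.0.1      localhost
    # master
    10.105.41.25   master
    # every machine in the cluster, listed on every machine
    10.105.41.26   slave1
    10.105.41.27   slave2
    10.105.41.28   slave3
    10.105.41.29   slave4
    10.105.41.30   slave5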

    One question still,
    I currently have just 5-6 nodes. But when Hadoop is deployed on a larger
    cluster, say of 1000+ nodes, is it expected that every time a new machine is
    added to the cluster, you add an entry in the /etc/hosts of all the (1000+)
    machines in the cluster?


    Regards,
    Sayali

  • Allen Wittenauer at Jun 23, 2008 at 1:42 pm

    On 6/21/08 1:53 AM, "Sayali Kulkarni" wrote:
    One question still,
    I currently have just 5-6 nodes. But when Hadoop is deployed on a larger
    cluster, say of 1000+ nodes, is it expected that every time a new machine is
    added to the cluster, you add an entry in the /etc/hosts of all the (1000+)
    machines in the cluster?
    No.

    Any competent system administrator will say that the installation should
    be using DNS or perhaps some other distributed naming service by then.

    Heck, even at five nodes I would have deployed DNS. :)
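
    Once DNS is in place, a quick sanity check (a sketch; the node name and
    address below are placeholders) is to confirm that forward and reverse
    lookups agree for every node:

    $ host slave1.example.com        # forward: name to address
    slave1.example.com has address 10.105.41.26
    $ host 10.105.41.26              # reverse: address to name
    26.41.105.10.in-addr.arpa domain name pointer slave1.example.com.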

    To get an idea of how Yahoo! runs its large installations, take a look
    at my presentation on the Hadoop wiki (
    http://wiki.apache.org/hadoop/HadoopPresentations ).
  • Shengkai Zhu at Jul 11, 2008 at 6:24 am
    This is also how I fixed this problem.
  • Brainstorm at Jul 19, 2008 at 10:05 am
    Got this problem too, and fixed it just 5 minutes ago... there were
    wrong IP entries on the nodes referring to the frontend, and they were
    slowing down the reduce phase *a lot*... in numbers:

    Wrong hosts file, using the wordcount example: 3 hrs, 45 mins, 41 sec (4
    minutes of map; the rest, reduce)
    Right hosts file, using the wordcount example: 6 mins, 26 sec

    Moral of the story: AVOID static hosts files, always use DNS.

    PS: The static hosts files were replicated by rocksclusters to all compute
    nodes at install (kickstart) time, but were not refreshed afterwards when
    running "rocks sync dns" or "rocks sync config".
