Hi,

I'm using 0.17.2.1 and saw a reduce hang in the shuffle phase because
of an unresponsive node. According to the reduce log (sorry, I didn't
keep it around), the task was stuck copying map output from a dead
node (I cannot ssh to it). At that point, all maps had already
finished. I'm wondering why this slowness does not make the reduce
task fail, mark the corresponding map as failed (even though it had
finished), and rerun that map on another node so that the reduce can proceed.

Thanks,
Rong-En Fan


  • Rong-en Fan at Sep 19, 2008 at 1:43 am
    Replying to myself: I'm using streaming and the task timeout was set to 0,
    which is why the stuck reduce was never failed.
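
The timeout mentioned here is `mapred.task.timeout`, which is given in milliseconds; a value of 0 disables it, so a task that stops making progress is never killed. As a minimal sketch (the helper name `task_timeout_jobconf` is hypothetical, not part of Hadoop), the conversion from minutes to the streaming `-jobconf` argument could look like this:

```python
def task_timeout_jobconf(minutes):
    """Build the streaming -jobconf arguments for a task timeout in minutes.

    Hypothetical helper for illustration only. mapred.task.timeout is
    expressed in milliseconds, and 0 means "no timeout".
    """
    if minutes < 0:
        raise ValueError("timeout must be non-negative")
    millis = int(minutes * 60 * 1000)  # minutes -> milliseconds
    return ["-jobconf", "mapred.task.timeout=%d" % millis]


if __name__ == "__main__":
    # Print the argument pair for a 10 minute timeout.
    print(" ".join(task_timeout_jobconf(10)))
```

The returned pair is meant to be appended to whatever streaming command line is already in use.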
  • Rong-en Fan at Sep 19, 2008 at 4:43 am
    This time, I set the task timeout to 10 minutes via

    -jobconf mapred.task.timeout=600000

    However, I still see this "hang" in the shuffle stage, and many
    messages like the ones below appear in the log:

    2008-09-19 12:34:02,289 INFO org.apache.hadoop.mapred.ReduceTask: task_200809190308_0007_r_000001_1 Need 6 map output(s)
    2008-09-19 12:34:02,290 INFO org.apache.hadoop.mapred.ReduceTask: task_200809190308_0007_r_000001_1: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
    2008-09-19 12:34:02,290 INFO org.apache.hadoop.mapred.ReduceTask: task_200809190308_0007_r_000001_1 Got 6 known map output location(s); scheduling...
    2008-09-19 12:34:02,290 INFO org.apache.hadoop.mapred.ReduceTask: task_200809190308_0007_r_000001_1 Scheduled 0 of 6 known outputs (6 slow hosts and 0 dup hosts)

    When fetching map output from the one bad node (it actually has a dead
    disk), the HTTP daemon returns a 500 Internal Server Error.

    It seems to me that the reducer is stuck in an infinite loop... I'm wondering
    whether this behavior is fixed in 0.18.x, or whether there are configuration
    parameters that I should tune?

    Thanks,
    Rong-En Fan
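
The stall described in this post shows up in the reduce log as rounds where nothing is scheduled because every known map-output location is on a "slow" host. A rough sketch of how such rounds could be picked out of a log follows; the function name `find_stalled_rounds` and the rule "0 scheduled and all known outputs on slow hosts" are assumptions made for illustration, based only on the message format quoted in the post.

```python
import re

# Matches the "Scheduled X of Y known outputs (S slow hosts and D dup hosts)"
# message from the ReduceTask log, preceded by the task attempt id.
SCHEDULED = re.compile(
    r"(\S+) Scheduled (\d+) of (\d+) known outputs "
    r"\((\d+) slow hosts and (\d+) dup hosts\)"
)


def find_stalled_rounds(log_text):
    """Return (task, known, slow) for each round that scheduled no fetches.

    Illustrative heuristic: a round counts as stalled when there are known
    outputs, none were scheduled, and all of them are on slow hosts.
    """
    # Collapse whitespace so entries wrapped across lines (as in mail) still match.
    flat = " ".join(log_text.split())
    stalled = []
    for m in SCHEDULED.finditer(flat):
        task = m.group(1)
        scheduled, known, slow, _dup = (int(g) for g in m.groups()[1:])
        if known > 0 and scheduled == 0 and slow == known:
            stalled.append((task, known, slow))
    return stalled


# Log entry copied from the post, wrapped the way the mail client wrapped it.
SAMPLE = """
2008-09-19 12:34:02,290 INFO org.apache.hadoop.mapred.ReduceTask:
task_200809190308_0007_r_000001_1 Scheduled 0 of 6 known outputs (6
slow hosts and 0 dup hosts)
"""

if __name__ == "__main__":
    for task, known, slow in find_stalled_rounds(SAMPLE):
        print("%s: 0 of %d outputs scheduled, %d slow hosts" % (task, known, slow))
```

Counting how many consecutive rounds come back stalled gives a rough idea of how long the reducer has been waiting on the bad node.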

Discussion Overview
group: common-user @
categories: hadoop
posted: Sep 18, '08 at 7:35p
active: Sep 19, '08 at 4:43a
posts: 3
users: 1
website: hadoop.apache.org...
irc: #hadoop

1 user in discussion

Rong-en Fan: 3 posts
