FAQ
Hi!

I'm experiencing hung reducers, with the following symptoms:
Task Logs: 'task_200807230647_0008_r_000009_1'


stdout logs



stderr logs



syslog logs

red.ReduceTask: task_200807230647_0008_r_000009_1 Got 0 known map output
location(s); scheduling... 2008-07-24 07:56:11,064 INFO
org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
07:56:16,073 INFO org.apache.hadoop.mapred.ReduceTask:
task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask:
task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
map-outputs from tasktracker and 0 map-outputs from previous failures
2008-07-24 07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask:
task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
scheduling... 2008-07-24 07:56:16,074 INFO
org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
07:56:21,083 INFO org.apache.hadoop.mapred.ReduceTask:
task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask:
task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
map-outputs from tasktracker and 0 map-outputs from previous failures
2008-07-24 07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask:
task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
scheduling... 2008-07-24 07:56:21,084 INFO
org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
07:56:26,093 INFO org.apache.hadoop.mapred.ReduceTask:
task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask:
task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
map-outputs from tasktracker and 0 map-outputs from previous failures
2008-07-24 07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask:
task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
scheduling... 2008-07-24 07:56:26,094 INFO
org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
07:56:31,103 INFO org.apache.hadoop.mapred.ReduceTask:
task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask:
task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
map-outputs from tasktracker and 0 map-outputs from previous failures
2008-07-24 07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask:
task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
scheduling... 2008-07-24 07:56:31,104 INFO
org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
07:56:36,113 INFO org.apache.hadoop.mapred.ReduceTask:
task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
07:56:36,114 INFO org.apache.hadoop.mapred.ReduceTask:
task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
map-outputs from tasktracker and 0 map-outputs from previous failures
2008-07-24 07:56:36,114 INFO org.apache.hadoop.mapred.ReduceTask:
task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
scheduling... 2008-07-24 07:56:36,114 INFO
org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
07:56:41,123 INFO org.apache.hadoop.mapred.ReduceTask:
task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
07:56:41,126 INFO org.apache.hadoop.mapred.ReduceTask:
task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
map-outputs from tasktracker and 0 map-outputs from previous failures
2008-07-24 07:56:41,126 INFO org.apache.hadoop.mapred.ReduceTask:
task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
scheduling... 2008-07-24 07:56:41,126 INFO
org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)

Notice how it needs 6 map outputs, all map tasks have finished, and it still
just hangs there.

The second speculative copy of that reducer task needs 14 map outputs with the
same messages :(

Other observations:

killing the reduce tasks via job -killtask ends up with restarting the job on
the same node, and curiously the new job gets jammed at the same position
(6/14 maps needed).

The only remedy to this problem seems to be a complete restart of the cluster
and reprocessing. That gets really boring with jobs that took a day to
process first :(

Andreas

Search Discussions

  • Devaraj Das at Jul 24, 2008 at 1:21 pm
    Could you try to kill the tasktracker hosting the task the next time when it
    happens? I just want to isolate the problem - whether it is a problem in the
    TT-JT communication or in the Task-TT communication. From your description
    it looks like the problem is between the JT-TT communication. But pls run
    the experiment when it happens again and let us know what happens.

    Thanks,
    Devaraj

    On 7/24/08 1:42 PM, "Andreas Kostyrka" wrote:

    Hi!

    I'm experiencing hung reducers, with the following symptoms:
    Task Logs: 'task_200807230647_0008_r_000009_1'


    stdout logs



    stderr logs



    syslog logs

    red.ReduceTask: task_200807230647_0008_r_000009_1 Got 0 known map output
    location(s); scheduling... 2008-07-24 07:56:11,064 INFO
    org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
    Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
    07:56:16,073 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
    07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
    map-outputs from tasktracker and 0 map-outputs from previous failures
    2008-07-24 07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
    scheduling... 2008-07-24 07:56:16,074 INFO
    org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
    Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
    07:56:21,083 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
    07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
    map-outputs from tasktracker and 0 map-outputs from previous failures
    2008-07-24 07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
    scheduling... 2008-07-24 07:56:21,084 INFO
    org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
    Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
    07:56:26,093 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
    07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
    map-outputs from tasktracker and 0 map-outputs from previous failures
    2008-07-24 07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
    scheduling... 2008-07-24 07:56:26,094 INFO
    org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
    Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
    07:56:31,103 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
    07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
    map-outputs from tasktracker and 0 map-outputs from previous failures
    2008-07-24 07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
    scheduling... 2008-07-24 07:56:31,104 INFO
    org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
    Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
    07:56:36,113 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
    07:56:36,114 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
    map-outputs from tasktracker and 0 map-outputs from previous failures
    2008-07-24 07:56:36,114 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
    scheduling... 2008-07-24 07:56:36,114 INFO
    org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
    Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
    07:56:41,123 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
    07:56:41,126 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
    map-outputs from tasktracker and 0 map-outputs from previous failures
    2008-07-24 07:56:41,126 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
    scheduling... 2008-07-24 07:56:41,126 INFO
    org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
    Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)

    Notice how it needs 6 map outputs, all map tasks have finished, and it still
    just hangs there.

    The second speculative copy of that reducer task needs 14 map outputs with the
    same messages :(

    Other observations:

    killing the reduce tasks via job -killtask ends up with restarting the job on
    the same node, and curiously the new job gets jammed at the same position
    (6/14 maps needed).

    The only remedy to this problem seems to be a complete restart of the cluster
    and reprocessing. That gets really boring with jobs that took a day to
    process first :(

    Andreas
  • Andreas Kostyrka at Jul 24, 2008 at 6:40 pm

    On Thursday 24 July 2008 15:19:22 Devaraj Das wrote:
    Could you try to kill the tasktracker hosting the task the next time when
    it happens? I just want to isolate the problem - whether it is a problem in
    the TT-JT communication or in the Task-TT communication. From your
    description it looks like the problem is between the JT-TT communication.
    But pls run the experiment when it happens again and let us know what
    happens.
    Well, I did restart the tasktracker where the reduce job was running, but that
    lead only to a situation where the jobtracker did not restart the job, showed
    it as still running, and was not able to kill the reduce task via hadoop
    job -kill-task nor -fail-task.

    I hope to avoid a repeat, I'll be relapsing out cluster to 0.15 today. A peer
    at another startup confirmed the whole batch of problems I've been
    experiencing, and for him 0.15 works for production.

    <rant-mode>
    No question, 0.17 is way better than 0.16, on the other hand I wonder how 0.16
    could get released? (I'm using streaming.jar, and with 0.16.x I've introduced
    reducing to our workloads, and before 0.16 failed >80% of the jobs with
    reducers not being able to get their output. 0.17.0 improved that to a point
    where one can, with some pain, e.g. restarting the cluster daily, not storing
    anything important on HDFS, only temporary data, ..., use it somehow for
    production, at least for small jobs.) So one wonders how 0.16 got released?
    Or was it meant only as developer-only bug fixing series?
    </rant-mode>

    Sorry, this has been driving me up the walls into an asylum till I compared
    notes with a collegue, and decided that I'm not crazy ;)

    Andreas
    Thanks,
    Devaraj
    On 7/24/08 1:42 PM, "Andreas Kostyrka" wrote:
    Hi!

    I'm experiencing hung reducers, with the following symptoms:
    Task Logs: 'task_200807230647_0008_r_000009_1'


    stdout logs



    stderr logs



    syslog logs

    red.ReduceTask: task_200807230647_0008_r_000009_1 Got 0 known map output
    location(s); scheduling... 2008-07-24 07:56:11,064 INFO
    org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
    Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
    07:56:16,073 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
    07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
    map-outputs from tasktracker and 0 map-outputs from previous failures
    2008-07-24 07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
    scheduling... 2008-07-24 07:56:16,074 INFO
    org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
    Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
    07:56:21,083 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
    07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
    map-outputs from tasktracker and 0 map-outputs from previous failures
    2008-07-24 07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
    scheduling... 2008-07-24 07:56:21,084 INFO
    org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
    Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
    07:56:26,093 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
    07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
    map-outputs from tasktracker and 0 map-outputs from previous failures
    2008-07-24 07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
    scheduling... 2008-07-24 07:56:26,094 INFO
    org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
    Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
    07:56:31,103 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
    07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
    map-outputs from tasktracker and 0 map-outputs from previous failures
    2008-07-24 07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
    scheduling... 2008-07-24 07:56:31,104 INFO
    org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
    Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
    07:56:36,113 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
    07:56:36,114 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
    map-outputs from tasktracker and 0 map-outputs from previous failures
    2008-07-24 07:56:36,114 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
    scheduling... 2008-07-24 07:56:36,114 INFO
    org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
    Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
    07:56:41,123 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
    07:56:41,126 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
    map-outputs from tasktracker and 0 map-outputs from previous failures
    2008-07-24 07:56:41,126 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
    scheduling... 2008-07-24 07:56:41,126 INFO
    org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
    Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
    Notice how it needs 6 map outputs, all map tasks have finished, and it
    still just hangs there.

    The second speculative copy of that reducer task needs 14 map outputs
    with the same messages :(

    Other observations:

    killing the reduce tasks via job -killtask ends up with restarting the
    job on the same node, and curiously the new job gets jammed at the same
    position (6/14 maps needed).

    The only remedy to this problem seems to be a complete restart of the
    cluster and reprocessing. That gets really boring with jobs that took a
    day to process first :(

    Andreas
  • Devaraj Das at Jul 24, 2008 at 7:41 pm

    On 7/25/08 12:09 AM, "Andreas Kostyrka" wrote:
    On Thursday 24 July 2008 15:19:22 Devaraj Das wrote:
    Could you try to kill the tasktracker hosting the task the next time when
    it happens? I just want to isolate the problem - whether it is a problem in
    the TT-JT communication or in the Task-TT communication. From your
    description it looks like the problem is between the JT-TT communication.
    But pls run the experiment when it happens again and let us know what
    happens.
    Well, I did restart the tasktracker where the reduce job was running, but that
    lead only to a situation where the jobtracker did not restart the job, showed
    it as still running, and was not able to kill the reduce task via hadoop
    job -kill-task nor -fail-task.
    The reduce task would eventually be reexecuted (after some timeout,
    defaulting to 10 minutes, the tasktracker would be assumed as lost and all
    reducers that were running on that node would be reexecuted).
    I hope to avoid a repeat, I'll be relapsing out cluster to 0.15 today. A peer
    at another startup confirmed the whole batch of problems I've been
    experiencing, and for him 0.15 works for production.

    <rant-mode>
    No question, 0.17 is way better than 0.16, on the other hand I wonder how 0.16
    could get released? (I'm using streaming.jar, and with 0.16.x I've introduced
    reducing to our workloads, and before 0.16 failed >80% of the jobs with
    reducers not being able to get their output. 0.17.0 improved that to a point
    where one can, with some pain, e.g. restarting the cluster daily, not storing
    anything important on HDFS, only temporary data, ..., use it somehow for
    production, at least for small jobs.) So one wonders how 0.16 got released?
    Or was it meant only as developer-only bug fixing series?
    </rant-mode>
    Pls raise jiras for the specific problems.
    Sorry, this has been driving me up the walls into an asylum till I compared
    notes with a collegue, and decided that I'm not crazy ;)

    Andreas
    Thanks,
    Devaraj
    On 7/24/08 1:42 PM, "Andreas Kostyrka" wrote:
    Hi!

    I'm experiencing hung reducers, with the following symptoms:
    Task Logs: 'task_200807230647_0008_r_000009_1'


    stdout logs



    stderr logs



    syslog logs

    red.ReduceTask: task_200807230647_0008_r_000009_1 Got 0 known map output
    location(s); scheduling... 2008-07-24 07:56:11,064 INFO
    org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
    Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
    07:56:16,073 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
    07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
    map-outputs from tasktracker and 0 map-outputs from previous failures
    2008-07-24 07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
    scheduling... 2008-07-24 07:56:16,074 INFO
    org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
    Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
    07:56:21,083 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
    07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
    map-outputs from tasktracker and 0 map-outputs from previous failures
    2008-07-24 07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
    scheduling... 2008-07-24 07:56:21,084 INFO
    org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
    Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
    07:56:26,093 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
    07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
    map-outputs from tasktracker and 0 map-outputs from previous failures
    2008-07-24 07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
    scheduling... 2008-07-24 07:56:26,094 INFO
    org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
    Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
    07:56:31,103 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
    07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
    map-outputs from tasktracker and 0 map-outputs from previous failures
    2008-07-24 07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
    scheduling... 2008-07-24 07:56:31,104 INFO
    org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
    Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
    07:56:36,113 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
    07:56:36,114 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
    map-outputs from tasktracker and 0 map-outputs from previous failures
    2008-07-24 07:56:36,114 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
    scheduling... 2008-07-24 07:56:36,114 INFO
    org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
    Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
    07:56:41,123 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
    07:56:41,126 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
    map-outputs from tasktracker and 0 map-outputs from previous failures
    2008-07-24 07:56:41,126 INFO org.apache.hadoop.mapred.ReduceTask:
    task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
    scheduling... 2008-07-24 07:56:41,126 INFO
    org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
    Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
    Notice how it needs 6 map outputs, all map tasks have finished, and it
    still just hangs there.

    The second speculative copy of that reducer task needs 14 map outputs
    with the same messages :(

    Other observations:

    killing the reduce tasks via job -killtask ends up with restarting the
    job on the same node, and curiously the new job gets jammed at the same
    position (6/14 maps needed).

    The only remedy to this problem seems to be a complete restart of the
    cluster and reprocessing. That gets really boring with jobs that took a
    day to process first :(

    Andreas
  • Andreas Kostyrka at Jul 24, 2008 at 8:03 pm

    On Thursday 24 July 2008 21:40:22 Devaraj Das wrote:
    On 7/25/08 12:09 AM, "Andreas Kostyrka" wrote:
    On Thursday 24 July 2008 15:19:22 Devaraj Das wrote:
    Could you try to kill the tasktracker hosting the task the next time
    when it happens? I just want to isolate the problem - whether it is a
    problem in the TT-JT communication or in the Task-TT communication. From
    your description it looks like the problem is between the JT-TT
    communication. But pls run the experiment when it happens again and let
    us know what happens.
    Well, I did restart the tasktracker where the reduce job was running, but
    that lead only to a situation where the jobtracker did not restart the
    job, showed it as still running, and was not able to kill the reduce task
    via hadoop job -kill-task nor -fail-task.
    The reduce task would eventually be reexecuted (after some timeout,
    defaulting to 10 minutes, the tasktracker would be assumed as lost and all
    reducers that were running on that node would be reexecuted).
    I hope to avoid a repeat, I'll be relapsing out cluster to 0.15 today. A
    peer at another startup confirmed the whole batch of problems I've been
    experiencing, and for him 0.15 works for production.

    <rant-mode>
    No question, 0.17 is way better than 0.16, on the other hand I wonder how
    0.16 could get released? (I'm using streaming.jar, and with 0.16.x I've
    introduced reducing to our workloads, and before 0.16 failed >80% of the
    jobs with reducers not being able to get their output. 0.17.0 improved
    that to a point where one can, with some pain, e.g. restarting the
    cluster daily, not storing anything important on HDFS, only temporary
    data, ..., use it somehow for production, at least for small jobs.) So
    one wonders how 0.16 got released? Or was it meant only as developer-only
    bug fixing series?
    </rant-mode>
    Pls raise jiras for the specific problems.
    I know, that's why I bracketed it as rantmode. OTOH, many of these issues had
    either this creepy feeling where you wondered if you did something wrong or
    were issues where I had to react relatively quickly, which usually destroys
    the faulty state. (I know, as a developer having reproduced a bug is golden.
    As an admin asked about processing lag, it's rather to opposite)

    Plus fixing the issue in the next release or even via a patch means that I
    have a non-working cluster till then. Now I that means I would need to start
    debugging the cluster utility software instead of our apps. ;(

    Andreas

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupcommon-user @
categorieshadoop
postedJul 24, '08 at 8:13a
activeJul 24, '08 at 8:03p
posts5
users2
websitehadoop.apache.org...
irc#hadoop

2 users in discussion

Andreas Kostyrka: 3 posts Devaraj Das: 2 posts

People

Translate

site design / logo © 2023 Grokbase