Hi,

We sometimes see reducers fail just as all the mappers are finishing; the
mappers all finish at roughly the same time. The reducers dump only the
following exception:

java.lang.Throwable: Child Error
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 137.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)

The reducers' own log output also gives no clue; this is the last part of
the log:

2011-12-26 19:35:19,116 INFO org.apache.hadoop.io.compress.CodecPool: Got
brand-new decompressor
2011-12-26 19:35:19,117 INFO org.apache.hadoop.io.compress.CodecPool: Got
brand-new decompressor
2011-12-26 19:35:19,117 INFO org.apache.hadoop.io.compress.CodecPool: Got
brand-new decompressor
2011-12-26 19:35:19,120 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201112261420_0006_r_000009_0 Thread started: Thread for merging on-
disk files
2011-12-26 19:35:19,120 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201112261420_0006_r_000009_0 Thread waiting: Thread for merging on-
disk files
2011-12-26 19:35:19,121 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201112261420_0006_r_000009_0 Thread started: Thread for merging in
memory files
2011-12-26 19:35:19,122 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201112261420_0006_r_000009_0 Need another 50 map output(s) where 0 is
already in progress
2011-12-26 19:35:19,122 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201112261420_0006_r_000009_0 Thread started: Thread for polling Map
Completion Events
2011-12-26 19:35:19,122 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201112261420_0006_r_000009_0 Scheduled 0 outputs (0 slow hosts and0
dup hosts)
2011-12-26 19:35:24,124 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201112261420_0006_r_000009_0 Scheduled 2 outputs (0 slow hosts and0
dup hosts)
2011-12-26 19:35:25,805 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201112261420_0006_r_000009_0 Scheduled 1 outputs (0 slow hosts and0
dup hosts)
2011-12-26 19:36:21,578 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201112261420_0006_r_000009_0 Need another 47 map output(s) where 0 is
already in progress
2011-12-26 19:36:21,593 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201112261420_0006_r_000009_0 Scheduled 1 outputs (0 slow hosts and0
dup hosts)
2011-12-26 19:36:42,412 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201112261420_0006_r_000009_0 Scheduled 1 outputs (0 slow hosts and0
dup hosts)

Is there any advice?
Thanks

  • Harsh J at Dec 26, 2011 at 8:15 pm
    Exit code 137 would mean a SIGKILL. Are you positive it's reported as a failed task and not as a killed task? If not, it may just be a result of speculative reducer execution at work and nothing to worry about.
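
    (For reference: a shell reports a process killed by signal N with exit status 128 + N, so 137 = 128 + 9, i.e. SIGKILL. A minimal shell sketch, independent of Hadoop, that reproduces the status:)

    sleep 60 &   # stand-in for the task child process
    kill -9 $!   # deliver SIGKILL, as the kernel OOM killer would
    wait $!      # reap it; the shell reports 128 + signal number
    echo $?      # prints 137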
  • Markus Jelsma at Dec 26, 2011 at 8:21 pm
    Hi,

    They were indeed reported as failed and speculative execution is disabled
    because this particular job is not side-effect free. We've seen this before
    but it's rare.

    Thanks
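
    (Side note: for jobs with side effects like this one, speculative execution can be disabled per job through Hadoop's generic options; a sketch, with placeholder jar, class, and path names:)

    hadoop jar myjob.jar MyJob \
      -D mapred.map.tasks.speculative.execution=false \
      -D mapred.reduce.tasks.speculative.execution=false \
      input output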
  • Harsh J at Dec 26, 2011 at 8:32 pm
    Markus,

    Thanks for clarifying! What heap size does this task launch with and what's the machine RAM and the TaskTracker slots? Can you rule out the Linux OOM killer taking part here?
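
    (One way to check for the OOM killer is to grep the kernel log on the affected TaskTracker node; if it fired, lines like "java invoked oom-killer" or "Out of memory: Kill process ... (java)" show up:)

    dmesg | grep -i -E 'oom|out of memory|killed process'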
  • Markus Jelsma at Dec 26, 2011 at 8:59 pm
    Spot on!

    The machines have 16GB RAM, with 1050MB for each of nine mappers, 340MB for
    each of two reducers, and only 60MB each for the datanode and tasktracker
    daemons. I had already accounted for Java processes taking more physical RAM
    than -Xmx; it seems an edge case caused the mappers to take a bit more RAM
    than usual, and when the reducers started (all at the same time) there was
    not enough left.

    Thanks!
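
    (Worked out, assuming those heap figures are per process, the configured -Xmx values alone add up to:

    9 mappers  x 1050 MB = 9450 MB
    2 reducers x  340 MB =  680 MB
    2 daemons  x   60 MB =  120 MB
    total                ~ 10.3 GB on a 16 GB box

    That leaves under 6 GB for everything outside -Xmx: per-JVM native overhead such as permgen, thread stacks, and direct buffers, plus the OS and page cache. With eleven task JVMs, an extra one to two hundred MB each at peak can be enough to trip the kernel OOM killer.)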
  • Harsh J at Dec 26, 2011 at 9:28 pm
    Markus,

    Good to know you fixed it now :)

    Also consider raising reduce slowstart to a more suitable value, like 80-85% -- this helps heap consumption and keeps Reduce slots free for other work. The property name is "mapred.reduce.slowstart.completed.maps" and it carries a float value (the fraction of maps that must complete before reducers launch).

    ( P.s. You may also want to watch this general take on the topic: http://www.cloudera.com/videos/hadoop-world-2011-presentation-video-hadoop-troubleshooting-101 )
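
    (For reference, the value can also be set per job via the generic options; in 0.20/1.x the default is 0.05, i.e. reducers launch once 5% of the maps are done. Placeholder jar and class names:)

    hadoop jar myjob.jar MyJob \
      -D mapred.reduce.slowstart.completed.maps=0.80 \
      input output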
  • Markus Jelsma at Dec 26, 2011 at 10:51 pm
    Hi,

    > Also consider raising reduce slowstart to a more suitable value, like
    > 80-85% [...]

    Is that going to help if one has a non-zero value set for reusing JVM
    instances? We have few concurrent jobs running because the process flow is
    linear, and we prefer the benefit of copying map output while the maps are
    still running. For this particular job, however, that does not apply: all
    mappers finish almost simultaneously, so changing the value would make no
    difference, right? 80% finishes within a matter of seconds.

    For the record, it's an Apache Nutch fetcher job configured to terminate
    after 1 hour. It downloads URLs, hence the disabling of speculative
    execution.

    > ( P.s. You may also want to watch this general take on the topic:
    > http://www.cloudera.com/videos/hadoop-world-2011-presentation-video-hadoop-troubleshooting-101 )

    Something interesting for tomorrow :)

    Many thanks for sharing thoughts!
    Cheers
  • Arun C Murthy at Dec 27, 2011 at 7:33 am
    I wouldn't use JVM reuse at this point. It's had a number of issues over time, and I've kept it switched off for a long while now.

    Arun
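
    (For reference, JVM reuse is controlled by mapred.job.reuse.jvm.num.tasks: 1, the default, means no reuse and -1 means unlimited reuse. Forcing it off for a single job looks roughly like this, with placeholder names:)

    hadoop jar myjob.jar MyJob \
      -D mapred.job.reuse.jvm.num.tasks=1 \
      input output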
  • Markus Jelsma at Dec 27, 2011 at 11:14 am
    So it seems. We had good results with 20.203 and a large value configured
    for reuse. What changed?

    Thanks

Discussion Overview
group: mapreduce-user
categories: hadoop
posted: Dec 26, '11 at 7:45p
active: Dec 27, '11 at 11:14a
posts: 9
users: 3
website: hadoop.apache.org...
irc: #hadoop
