Shuffle In Memory OutOfMemoryError
Hi all,

We are seeing the following error in our reducers of a particular job:

Error: java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)


After enough reducers fail the entire job fails. This error occurs
regardless of whether mapred.compress.map.output is true. We were able
to avoid the issue by reducing mapred.job.shuffle.input.buffer.percent
to 20%. Shouldn't the framework, via ShuffleRamManager.canFitInMemory
and ShuffleRamManager.reserve, correctly detect the memory
available for allocation? I would think that with poor configuration
settings (and default settings in particular) the job may not be as
efficient, but wouldn't die.
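
For anyone who wants to try the same workaround, here is a minimal sketch of
setting the property per job through the old JobConf API (this assumes the job
is configured programmatically; the same property can also be set in
mapred-site.xml):

import org.apache.hadoop.mapred.JobConf;

public class ShuffleBufferWorkaround {
  public static void main(String[] args) {
    JobConf conf = new JobConf(ShuffleBufferWorkaround.class);
    // Default is 0.70; dropping it to 0.20 is what avoided the OOME for us,
    // at the cost of spilling more map output to disk during the shuffle.
    conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.20f);
    // ... set the mapper, reducer, input/output paths and submit as usual ...
  }
}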

Here is some more context in the logs, I have attached the full
reducer log here: http://gist.github.com/323746


2010-03-06 07:54:49,621 INFO org.apache.hadoop.mapred.ReduceTask:
Shuffling 4191933 bytes (435311 raw bytes) into RAM from
attempt_201003060739_0002_m_000061_0
2010-03-06 07:54:50,222 INFO org.apache.hadoop.mapred.ReduceTask: Task
attempt_201003060739_0002_r_000000_0: Failed fetch #1 from
attempt_201003060739_0002_m_000202_0
2010-03-06 07:54:50,223 WARN org.apache.hadoop.mapred.ReduceTask:
attempt_201003060739_0002_r_000000_0 adding host
hd37.dfs.returnpath.net to penalty box, next contact in 4 seconds
2010-03-06 07:54:50,223 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201003060739_0002_r_000000_0: Got 1 map-outputs from previous
failures
2010-03-06 07:54:50,223 FATAL org.apache.hadoop.mapred.TaskRunner:
attempt_201003060739_0002_r_000000_0 : Map output copy failure :
java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)


We tried this both in 0.20.1 and 0.20.2. We had hoped MAPREDUCE-1182
would address the issue in 0.20.2, but it did not. Does anyone have
any comments or suggestions? Is this a bug I should file a JIRA for?

Jacob Rideout
Return Path


  • Ted Yu at Mar 7, 2010 at 6:37 am
    I think there is a mismatch (in ReduceTask.java) between:
    this.numCopiers = conf.getInt("mapred.reduce.parallel.copies", 5);
    and:
    maxSingleShuffleLimit = (long)(maxSize *
    MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION);
    where MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION is 0.25f

    because
    copiers = new ArrayList<MapOutputCopier>(numCopiers);
    so the total memory allocated for in-mem shuffle is 1.25 * maxSize

    A JIRA should be filed to correlate the constant 5 above and
    MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION.
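
    To make that arithmetic concrete, here is a small back-of-the-envelope
    sketch (the heap size is only an illustrative number; 0.70 and 5 are the
    0.20.x defaults for mapred.job.shuffle.input.buffer.percent and
    mapred.reduce.parallel.copies):

    public class ShuffleMemoryEstimate {
      public static void main(String[] args) {
        long heapBytes = 512L * 1024 * 1024;   // example reducer heap (-Xmx512m)
        float inputBufferPercent = 0.70f;      // mapred.job.shuffle.input.buffer.percent
        int parallelCopies = 5;                // mapred.reduce.parallel.copies
        float segmentFraction = 0.25f;         // MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION

        long maxSize = (long) (heapBytes * inputBufferPercent);
        long maxSingleShuffleLimit = (long) (maxSize * segmentFraction);

        // Worst case: every copier holds a segment just under the per-segment
        // limit at the same time, i.e. 5 * 0.25 = 1.25 * maxSize (~87% of heap).
        long worstCase = parallelCopies * maxSingleShuffleLimit;
        System.out.printf("maxSize=%d perSegment=%d worstCase=%d (%.0f%% of heap)%n",
            maxSize, maxSingleShuffleLimit, worstCase, 100.0 * worstCase / heapBytes);
      }
    }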

    Cheers
    On Sat, Mar 6, 2010 at 8:31 AM, Jacob R Rideout wrote:

    Hi all,

    We are seeing the following error in our reducers of a particular job:

    Error: java.lang.OutOfMemoryError: Java heap space
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)


    After enough reducers fail the entire job fails. This error occurs
    regardless of whether mapred.compress.map.output is true. We were able
    to avoid the issue by reducing mapred.job.shuffle.input.buffer.percent
    to 20%. Shouldn't the framework, via ShuffleRamManager.canFitInMemory
    and ShuffleRamManager.reserve, correctly detect the memory
    available for allocation? I would think that with poor configuration
    settings (and default settings in particular) the job may not be as
    efficient, but wouldn't die.

    Here is some more context in the logs, I have attached the full
    reducer log here: http://gist.github.com/323746


    2010-03-06 07:54:49,621 INFO org.apache.hadoop.mapred.ReduceTask:
    Shuffling 4191933 bytes (435311 raw bytes) into RAM from
    attempt_201003060739_0002_m_000061_0
    2010-03-06 07:54:50,222 INFO org.apache.hadoop.mapred.ReduceTask: Task
    attempt_201003060739_0002_r_000000_0: Failed fetch #1 from
    attempt_201003060739_0002_m_000202_0
    2010-03-06 07:54:50,223 WARN org.apache.hadoop.mapred.ReduceTask:
    attempt_201003060739_0002_r_000000_0 adding host
    hd37.dfs.returnpath.net to penalty box, next contact in 4 seconds
    2010-03-06 07:54:50,223 INFO org.apache.hadoop.mapred.ReduceTask:
    attempt_201003060739_0002_r_000000_0: Got 1 map-outputs from previous
    failures
    2010-03-06 07:54:50,223 FATAL org.apache.hadoop.mapred.TaskRunner:
    attempt_201003060739_0002_r_000000_0 : Map output copy failure :
    java.lang.OutOfMemoryError: Java heap space
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)


    We tried this both in 0.20.1 and 0.20.2. We had hoped MAPREDUCE-1182
    would address the issue in 0.20.2, but it did not. Does anyone have
    any comments or suggestions? Is this a bug I should file a JIRA for?

    Jacob Rideout
    Return Path
  • Jacob R Rideout at Mar 7, 2010 at 8:22 pm
    Ted,

    Thank you. I filed MAPREDUCE-1571 to cover this issue. I might have
    some time to write a patch later this week.

    Jacob Rideout
    On Sat, Mar 6, 2010 at 11:37 PM, Ted Yu wrote:
    I think there is a mismatch (in ReduceTask.java) between:
    this.numCopiers = conf.getInt("mapred.reduce.parallel.copies", 5);
    and:
    maxSingleShuffleLimit = (long)(maxSize *
    MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION);
    where MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION is 0.25f

    because
    copiers = new ArrayList<MapOutputCopier>(numCopiers);
    so the total memory allocated for in-mem shuffle is 1.25 * maxSize

    A JIRA should be filed to correlate the constant 5 above and
    MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION.

    Cheers
    On Sat, Mar 6, 2010 at 8:31 AM, Jacob R Rideout wrote:

    Hi all,

    We are seeing the following error in our reducers of a particular job:

    Error: java.lang.OutOfMemoryError: Java heap space
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)


    After enough reducers fail the entire job fails. This error occurs
    regardless of whether mapred.compress.map.output is true. We were able
    to avoid the issue by reducing mapred.job.shuffle.input.buffer.percent
    to 20%. Shouldn't the framework, via ShuffleRamManager.canFitInMemory
    and ShuffleRamManager.reserve, correctly detect the memory
    available for allocation? I would think that with poor configuration
    settings (and default settings in particular) the job may not be as
    efficient, but wouldn't die.

    Here is some more context in the logs, I have attached the full
    reducer log here: http://gist.github.com/323746


    2010-03-06 07:54:49,621 INFO org.apache.hadoop.mapred.ReduceTask:
    Shuffling 4191933 bytes (435311 raw bytes) into RAM from
    attempt_201003060739_0002_m_000061_0
    2010-03-06 07:54:50,222 INFO org.apache.hadoop.mapred.ReduceTask: Task
    attempt_201003060739_0002_r_000000_0: Failed fetch #1 from
    attempt_201003060739_0002_m_000202_0
    2010-03-06 07:54:50,223 WARN org.apache.hadoop.mapred.ReduceTask:
    attempt_201003060739_0002_r_000000_0 adding host
    hd37.dfs.returnpath.net to penalty box, next contact in 4 seconds
    2010-03-06 07:54:50,223 INFO org.apache.hadoop.mapred.ReduceTask:
    attempt_201003060739_0002_r_000000_0: Got 1 map-outputs from previous
    failures
    2010-03-06 07:54:50,223 FATAL org.apache.hadoop.mapred.TaskRunner:
    attempt_201003060739_0002_r_000000_0 : Map output copy failure :
    java.lang.OutOfMemoryError: Java heap space
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)


    We tried this both in 0.20.1 and 0.20.2. We had hoped MAPREDUCE-1182
    would address the issue in 0.20.2, but it did not. Does anyone have
    any comments or suggestions? Is this a bug I should file a JIRA for?

    Jacob Rideout
    Return Path
  • Andy Sautins at Mar 7, 2010 at 9:04 pm
    Ted,

    I'm trying to follow the logic in your mail and I'm not sure I'm following. If you would mind helping me understand I would appreciate it.

    Looking at the code maxSingleShuffleLimit is only used in determining if the copy _can_ fit into memory:

    boolean canFitInMemory(long requestedSize) {
      return (requestedSize < Integer.MAX_VALUE &&
              requestedSize < maxSingleShuffleLimit);
    }

    It also looks like the RamManager.reserve should wait until memory is available, so it shouldn't hit a memory limit for that reason.

    What does seem a little strange to me is the following ( ReduceTask.java starting at 2730 ):

    // Inform the ram-manager
    ramManager.closeInMemoryFile(mapOutputLength);
    ramManager.unreserve(mapOutputLength);

    // Discard the map-output
    try {
      mapOutput.discard();
    } catch (IOException ignored) {
      LOG.info("Failed to discard map-output from " +
               mapOutputLoc.getTaskAttemptId(), ignored);
    }
    mapOutput = null;

    So to me that looks like the ramManager unreserves the memory before the mapOutput is discarded. Shouldn't the mapOutput be discarded _before_ the ramManager unreserves the memory? If the memory is unreserved before the actual underlying data references are removed then it seems like another thread can try to allocate memory ( ReduceTask.java:2730 ) before the previous memory is disposed ( mapOutput.discard() ).
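
    In other words, what I would naively have expected is something along these
    lines (just a sketch of the reordering, reusing the names from the snippet
    above; not a tested change or a patch):

    // Hypothetical reordering only: drop the in-memory data first, then tell
    // the RamManager the space is free.
    try {
      mapOutput.discard();                     // release the byte[] reference first
    } catch (IOException ignored) {
      LOG.info("Failed to discard map-output from " +
               mapOutputLoc.getTaskAttemptId(), ignored);
    }
    mapOutput = null;

    // Only now report the memory as available to the other copier threads.
    ramManager.closeInMemoryFile(mapOutputLength);
    ramManager.unreserve(mapOutputLength);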

    Not sure that makes sense. One thing to note is that the particular job that is failing does have a good number ( 200k+ ) of map outputs. The large number of small map outputs may be why we are triggering a problem.

    Thanks again for your thoughts.

    Andy


    -----Original Message-----
    From: Jacob R Rideout
    Sent: Sunday, March 07, 2010 1:21 PM
    To: common-user@hadoop.apache.org
    Cc: Andy Sautins; Ted Yu
    Subject: Re: Shuffle In Memory OutOfMemoryError

    Ted,

    Thank you. I filed MAPREDUCE-1571 to cover this issue. I might have
    some time to write a patch later this week.

    Jacob Rideout
    On Sat, Mar 6, 2010 at 11:37 PM, Ted Yu wrote:
    I think there is a mismatch (in ReduceTask.java) between:
    this.numCopiers = conf.getInt("mapred.reduce.parallel.copies", 5);
    and:
    maxSingleShuffleLimit = (long)(maxSize *
    MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION);
    where MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION is 0.25f

    because
    copiers = new ArrayList<MapOutputCopier>(numCopiers);
    so the total memory allocated for in-mem shuffle is 1.25 * maxSize

    A JIRA should be filed to correlate the constant 5 above and
    MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION.

    Cheers
    On Sat, Mar 6, 2010 at 8:31 AM, Jacob R Rideout wrote:

    Hi all,

    We are seeing the following error in our reducers of a particular job:

    Error: java.lang.OutOfMemoryError: Java heap space
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)


    After enough reducers fail the entire job fails. This error occurs
    regardless of whether mapred.compress.map.output is true. We were able
    to avoid the issue by reducing mapred.job.shuffle.input.buffer.percent
    to 20%. Shouldn't the framework, via ShuffleRamManager.canFitInMemory
    and ShuffleRamManager.reserve, correctly detect the memory
    available for allocation? I would think that with poor configuration
    settings (and default settings in particular) the job may not be as
    efficient, but wouldn't die.

    Here is some more context in the logs, I have attached the full
    reducer log here: http://gist.github.com/323746


    2010-03-06 07:54:49,621 INFO org.apache.hadoop.mapred.ReduceTask:
    Shuffling 4191933 bytes (435311 raw bytes) into RAM from
    attempt_201003060739_0002_m_000061_0
    2010-03-06 07:54:50,222 INFO org.apache.hadoop.mapred.ReduceTask: Task
    attempt_201003060739_0002_r_000000_0: Failed fetch #1 from
    attempt_201003060739_0002_m_000202_0
    2010-03-06 07:54:50,223 WARN org.apache.hadoop.mapred.ReduceTask:
    attempt_201003060739_0002_r_000000_0 adding host
    hd37.dfs.returnpath.net to penalty box, next contact in 4 seconds
    2010-03-06 07:54:50,223 INFO org.apache.hadoop.mapred.ReduceTask:
    attempt_201003060739_0002_r_000000_0: Got 1 map-outputs from previous
    failures
    2010-03-06 07:54:50,223 FATAL org.apache.hadoop.mapred.TaskRunner:
    attempt_201003060739_0002_r_000000_0 : Map output copy failure :
    java.lang.OutOfMemoryError: Java heap space
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)


    We tried this both in 0.20.1 and 0.20.2. We had hoped MAPREDUCE-1182
    would address the issue in 0.20.2, but it did not. Does anyone have
    any comments or suggestions? Is this a bug I should file a JIRA for?

    Jacob Rideout
    Return Path
  • Ted Yu at Mar 7, 2010 at 10:39 pm
    My observation is based on this call chain:
    MapOutputCopier.run() calling copyOutput() calling getMapOutput() calling
    ramManager.canFitInMemory(decompressedLength)

    Basically ramManager.canFitInMemory() makes its decision without considering the
    number of MapOutputCopiers that are running. Thus 1.25 * 0.7 of total heap
    may be used in shuffling if default parameters were used.
    Of course, you should check the value for mapred.reduce.parallel.copies to
    see if it is 5. If it is 4 or lower, my reasoning wouldn't apply.
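
    For illustration only, a hypothetical sketch of what a copier-aware check
    could look like (this is not the actual ShuffleRamManager code and not a
    proposed patch; it simply divides the in-memory budget by the number of
    copiers instead of using a fixed 0.25 fraction):

    public class CopierAwareLimit {
      private final long maxSize;    // heap * mapred.job.shuffle.input.buffer.percent
      private final int numCopiers;  // mapred.reduce.parallel.copies

      public CopierAwareLimit(long maxSize, int numCopiers) {
        this.maxSize = maxSize;
        this.numCopiers = numCopiers;
      }

      // Bound each copier to its share of the budget so that numCopiers
      // concurrent segments can never total more than maxSize.
      boolean canFitInMemory(long requestedSize) {
        long perCopierLimit = maxSize / numCopiers;
        return requestedSize < Integer.MAX_VALUE && requestedSize < perCopierLimit;
      }
    }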

    About ramManager.unreserve() call, ReduceTask.java from hadoop 0.20.2 only
    has 2731 lines. So I have to guess the location of the code snippet you
    provided.
    I found this around line 1535:
    } catch (IOException ioe) {
      LOG.info("Failed to shuffle from " +
               mapOutputLoc.getTaskAttemptId(), ioe);

      // Inform the ram-manager
      ramManager.closeInMemoryFile(mapOutputLength);
      ramManager.unreserve(mapOutputLength);

      // Discard the map-output
      try {
        mapOutput.discard();
      } catch (IOException ignored) {
        LOG.info("Failed to discard map-output from " +
                 mapOutputLoc.getTaskAttemptId(), ignored);
      }
    Please confirm the line number.

    If we're looking at the same code, I am afraid I don't see how we can
    improve it. First, I assume IOException shouldn't happen that often. Second,
    mapOutput.discard() just sets:
    data = null;
    for in memory case. Even if we call mapOutput.discard() before
    ramManager.unreserve(), we don't know when GC would kick in and make more
    memory available.
    Of course, given the large number of map outputs in your system, it became
    more likely that the root cause from my reasoning made OOME happen sooner.

    Thanks

    On Sun, Mar 7, 2010 at 1:03 PM, Andy Sautins wrote:


    Ted,

    I'm trying to follow the logic in your mail and I'm not sure I'm
    following. If you would mind helping me understand I would appreciate it.

    Looking at the code maxSingleShuffleLimit is only used in determining if
    the copy _can_ fit into memory:

    boolean canFitInMemory(long requestedSize) {
    return (requestedSize < Integer.MAX_VALUE &&
    requestedSize < maxSingleShuffleLimit);
    }

    It also looks like the RamManager.reserve should wait until memory is
    available, so it shouldn't hit a memory limit for that reason.

    What does seem a little strange to me is the following ( ReduceTask.java
    starting at 2730 ):

    // Inform the ram-manager
    ramManager.closeInMemoryFile(mapOutputLength);
    ramManager.unreserve(mapOutputLength);

    // Discard the map-output
    try {
    mapOutput.discard();
    } catch (IOException ignored) {
    LOG.info("Failed to discard map-output from " +
    mapOutputLoc.getTaskAttemptId(), ignored);
    }
    mapOutput = null;

    So to me that looks like the ramManager unreserves the memory before the
    mapOutput is discarded. Shouldn't the mapOutput be discarded _before_ the
    ramManager unreserves the memory? If the memory is unreserved before the
    actual underlying data references are removed then it seems like another
    thread can try to allocate memory ( ReduceTask.java:2730 ) before the
    previous memory is disposed ( mapOutput.discard() ).

    Not sure that makes sense. One thing to note is that the particular job
    that is failing does have a good number ( 200k+ ) of map outputs. The large
    number of small map outputs may be why we are triggering a problem.

    Thanks again for your thoughts.

    Andy


    -----Original Message-----
    From: Jacob R Rideout
    Sent: Sunday, March 07, 2010 1:21 PM
    To: common-user@hadoop.apache.org
    Cc: Andy Sautins; Ted Yu
    Subject: Re: Shuffle In Memory OutOfMemoryError

    Ted,

    Thank you. I filed MAPREDUCE-1571 to cover this issue. I might have
    some time to write a patch later this week.

    Jacob Rideout
    On Sat, Mar 6, 2010 at 11:37 PM, Ted Yu wrote:
    I think there is a mismatch (in ReduceTask.java) between:
    this.numCopiers = conf.getInt("mapred.reduce.parallel.copies", 5);
    and:
    maxSingleShuffleLimit = (long)(maxSize *
    MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION);
    where MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION is 0.25f

    because
    copiers = new ArrayList<MapOutputCopier>(numCopiers);
    so the total memory allocated for in-mem shuffle is 1.25 * maxSize

    A JIRA should be filed to correlate the constant 5 above and
    MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION.

    Cheers

    On Sat, Mar 6, 2010 at 8:31 AM, Jacob R Rideout <apache@jacobrideout.net
    wrote:
    Hi all,

    We are seeing the following error in our reducers of a particular job:

    Error: java.lang.OutOfMemoryError: Java heap space
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)

    After enough reducers fail the entire job fails. This error occurs
    regardless of whether mapred.compress.map.output is true. We were able
    to avoid the issue by reducing mapred.job.shuffle.input.buffer.percent
    to 20%. Shouldn't the framework, via ShuffleRamManager.canFitInMemory
    and ShuffleRamManager.reserve, correctly detect the memory
    available for allocation? I would think that with poor configuration
    settings (and default settings in particular) the job may not be as
    efficient, but wouldn't die.

    Here is some more context in the logs, I have attached the full
    reducer log here: http://gist.github.com/323746


    2010-03-06 07:54:49,621 INFO org.apache.hadoop.mapred.ReduceTask:
    Shuffling 4191933 bytes (435311 raw bytes) into RAM from
    attempt_201003060739_0002_m_000061_0
    2010-03-06 07:54:50,222 INFO org.apache.hadoop.mapred.ReduceTask: Task
    attempt_201003060739_0002_r_000000_0: Failed fetch #1 from
    attempt_201003060739_0002_m_000202_0
    2010-03-06 07:54:50,223 WARN org.apache.hadoop.mapred.ReduceTask:
    attempt_201003060739_0002_r_000000_0 adding host
    hd37.dfs.returnpath.net to penalty box, next contact in 4 seconds
    2010-03-06 07:54:50,223 INFO org.apache.hadoop.mapred.ReduceTask:
    attempt_201003060739_0002_r_000000_0: Got 1 map-outputs from previous
    failures
    2010-03-06 07:54:50,223 FATAL org.apache.hadoop.mapred.TaskRunner:
    attempt_201003060739_0002_r_000000_0 : Map output copy failure :
    java.lang.OutOfMemoryError: Java heap space
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)

    We tried this both in 0.20.1 and 0.20.2. We had hoped MAPREDUCE-1182
    would address the issue in 0.20.2, but it did not. Does anyone have
    any comments or suggestions? Is this a bug I should file a JIRA for?

    Jacob Rideout
    Return Path
  • Andy Sautins at Mar 7, 2010 at 11:58 pm
    Thanks Ted. Very helpful. You are correct that I misunderstood the code at ReduceTask.java:1535. I missed the fact that it's in an IOException catch block. My mistake. That's what I get for being in a rush.

    For what it's worth I did re-run the job with mapred.reduce.parallel.copies set with values from 5 all the way down to 1. All failed with the same error:

    Error: java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)


    So from that it does seem like something else might be going on, yes? I need to do some more research.

    I appreciate your insights.

    Andy

    -----Original Message-----
    From: Ted Yu
    Sent: Sunday, March 07, 2010 3:38 PM
    To: common-user@hadoop.apache.org
    Subject: Re: Shuffle In Memory OutOfMemoryError

    My observation is based on this call chain:
    MapOutputCopier.run() calling copyOutput() calling getMapOutput() calling
    ramManager.canFitInMemory(decompressedLength)

    Basically ramManager.canFitInMemory() makes its decision without considering the
    number of MapOutputCopiers that are running. Thus 1.25 * 0.7 of total heap
    may be used in shuffling if default parameters were used.
    Of course, you should check the value for mapred.reduce.parallel.copies to
    see if it is 5. If it is 4 or lower, my reasoning wouldn't apply.

    About ramManager.unreserve() call, ReduceTask.java from hadoop 0.20.2 only
    has 2731 lines. So I have to guess the location of the code snippet you
    provided.
    I found this around line 1535:
    } catch (IOException ioe) {
    LOG.info("Failed to shuffle from " +
    mapOutputLoc.getTaskAttemptId(),
    ioe);

    // Inform the ram-manager
    ramManager.closeInMemoryFile(mapOutputLength);
    ramManager.unreserve(mapOutputLength);

    // Discard the map-output
    try {
    mapOutput.discard();
    } catch (IOException ignored) {
    LOG.info("Failed to discard map-output from " +
    mapOutputLoc.getTaskAttemptId(), ignored);
    }
    Please confirm the line number.

    If we're looking at the same code, I am afraid I don't see how we can
    improve it. First, I assume IOException shouldn't happen that often. Second,
    mapOutput.discard() just sets:
    data = null;
    for in memory case. Even if we call mapOutput.discard() before
    ramManager.unreserve(), we don't know when GC would kick in and make more
    memory available.
    Of course, given the large number of map outputs in your system, it became
    more likely that the root cause from my reasoning made OOME happen sooner.

    Thanks

    On Sun, Mar 7, 2010 at 1:03 PM, Andy Sautins wrote:


    Ted,

    I'm trying to follow the logic in your mail and I'm not sure I'm
    following. If you would mind helping me understand I would appreciate it.

    Looking at the code maxSingleShuffleLimit is only used in determining if
    the copy _can_ fit into memory:

    boolean canFitInMemory(long requestedSize) {
    return (requestedSize < Integer.MAX_VALUE &&
    requestedSize < maxSingleShuffleLimit);
    }

    It also looks like the RamManager.reserve should wait until memory is
    available, so it shouldn't hit a memory limit for that reason.

    What does seem a little strange to me is the following ( ReduceTask.java
    starting at 2730 ):

    // Inform the ram-manager
    ramManager.closeInMemoryFile(mapOutputLength);
    ramManager.unreserve(mapOutputLength);

    // Discard the map-output
    try {
    mapOutput.discard();
    } catch (IOException ignored) {
    LOG.info("Failed to discard map-output from " +
    mapOutputLoc.getTaskAttemptId(), ignored);
    }
    mapOutput = null;

    So to me that looks like the ramManager unreserves the memory before the
    mapOutput is discarded. Shouldn't the mapOutput be discarded _before_ the
    ramManager unreserves the memory? If the memory is unreserved before the
    actual underlying data references are removed then it seems like another
    thread can try to allocate memory ( ReduceTask.java:2730 ) before the
    previous memory is disposed ( mapOutput.discard() ).

    Not sure that makes sense. One thing to note is that the particular job
    that is failing does have a good number ( 200k+ ) of map outputs. The large
    number of small map outputs may be why we are triggering a problem.

    Thanks again for your thoughts.

    Andy


    -----Original Message-----
    From: Jacob R Rideout
    Sent: Sunday, March 07, 2010 1:21 PM
    To: common-user@hadoop.apache.org
    Cc: Andy Sautins; Ted Yu
    Subject: Re: Shuffle In Memory OutOfMemoryError

    Ted,

    Thank you. I filed MAPREDUCE-1571 to cover this issue. I might have
    some time to write a patch later this week.

    Jacob Rideout
    On Sat, Mar 6, 2010 at 11:37 PM, Ted Yu wrote:
    I think there is a mismatch (in ReduceTask.java) between:
    this.numCopiers = conf.getInt("mapred.reduce.parallel.copies", 5);
    and:
    maxSingleShuffleLimit = (long)(maxSize *
    MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION);
    where MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION is 0.25f

    because
    copiers = new ArrayList<MapOutputCopier>(numCopiers);
    so the total memory allocated for in-mem shuffle is 1.25 * maxSize

    A JIRA should be filed to correlate the constant 5 above and
    MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION.

    Cheers

    On Sat, Mar 6, 2010 at 8:31 AM, Jacob R Rideout <apache@jacobrideout.net
    wrote:
    Hi all,

    We are seeing the following error in our reducers of a particular job:

    Error: java.lang.OutOfMemoryError: Java heap space
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)

    After enough reducers fail the entire job fails. This error occurs
    regardless of whether mapred.compress.map.output is true. We were able
    to avoid the issue by reducing mapred.job.shuffle.input.buffer.percent
    to 20%. Shouldn't the framework, via ShuffleRamManager.canFitInMemory
    and ShuffleRamManager.reserve, correctly detect the memory
    available for allocation? I would think that with poor configuration
    settings (and default settings in particular) the job may not be as
    efficient, but wouldn't die.

    Here is some more context in the logs, I have attached the full
    reducer log here: http://gist.github.com/323746


    2010-03-06 07:54:49,621 INFO org.apache.hadoop.mapred.ReduceTask:
    Shuffling 4191933 bytes (435311 raw bytes) into RAM from
    attempt_201003060739_0002_m_000061_0
    2010-03-06 07:54:50,222 INFO org.apache.hadoop.mapred.ReduceTask: Task
    attempt_201003060739_0002_r_000000_0: Failed fetch #1 from
    attempt_201003060739_0002_m_000202_0
    2010-03-06 07:54:50,223 WARN org.apache.hadoop.mapred.ReduceTask:
    attempt_201003060739_0002_r_000000_0 adding host
    hd37.dfs.returnpath.net to penalty box, next contact in 4 seconds
    2010-03-06 07:54:50,223 INFO org.apache.hadoop.mapred.ReduceTask:
    attempt_201003060739_0002_r_000000_0: Got 1 map-outputs from previous
    failures
    2010-03-06 07:54:50,223 FATAL org.apache.hadoop.mapred.TaskRunner:
    attempt_201003060739_0002_r_000000_0 : Map output copy failure :
    java.lang.OutOfMemoryError: Java heap space
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)

    We tried this both in 0.20.1 and 0.20.2. We had hoped MAPREDUCE-1182
    would address the issue in 0.20.2, but it did not. Does anyone have
    any comments or suggestions? Is this a bug I should file a JIRA for?

    Jacob Rideout
    Return Path
  • Ted Yu at Mar 8, 2010 at 3:40 am
    Lowering mapred.job.shuffle.input.buffer.percent would be the option to
    choose.

    Maybe GC wasn't releasing memory fast enough for in memory shuffling.
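
    Plugging 20% into the same rough worst-case estimate shows why it leaves
    headroom (illustrative numbers only, assuming 5 parallel copies and the
    0.25 per-segment fraction):

    public class LowerBufferCheck {
      public static void main(String[] args) {
        double copies = 5, segmentFraction = 0.25;
        // Fraction of the reducer heap that in-flight segments could occupy:
        System.out.println("at 0.70: " + copies * segmentFraction * 0.70); // ~0.875 of heap
        System.out.println("at 0.20: " + copies * segmentFraction * 0.20); // ~0.25 of heap
      }
    }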
    On Sun, Mar 7, 2010 at 3:57 PM, Andy Sautins wrote:


    Thanks Ted. Very helpful. You are correct that I misunderstood the code
    at ReduceTask.java:1535. I missed the fact that it's in an IOException catch
    block. My mistake. That's what I get for being in a rush.

    For what it's worth I did re-run the job with
    mapred.reduce.parallel.copies set with values from 5 all the way down to 1.
    All failed with the same error:

    Error: java.lang.OutOfMemoryError: Java heap space
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)


    So from that it does seem like something else might be going on, yes? I
    need to do some more research.

    I appreciate your insights.

    Andy

    -----Original Message-----
    From: Ted Yu
    Sent: Sunday, March 07, 2010 3:38 PM
    To: common-user@hadoop.apache.org
    Subject: Re: Shuffle In Memory OutOfMemoryError

    My observation is based on this call chain:
    MapOutputCopier.run() calling copyOutput() calling getMapOutput() calling
    ramManager.canFitInMemory(decompressedLength)

    Basically ramManager.canFitInMemory() makes its decision without considering
    the
    number of MapOutputCopiers that are running. Thus 1.25 * 0.7 of total heap
    may be used in shuffling if default parameters were used.
    Of course, you should check the value for mapred.reduce.parallel.copies to
    see if it is 5. If it is 4 or lower, my reasoning wouldn't apply.

    About ramManager.unreserve() call, ReduceTask.java from hadoop 0.20.2 only
    has 2731 lines. So I have to guess the location of the code snippet you
    provided.
    I found this around line 1535:
    } catch (IOException ioe) {
    LOG.info("Failed to shuffle from " +
    mapOutputLoc.getTaskAttemptId(),
    ioe);

    // Inform the ram-manager
    ramManager.closeInMemoryFile(mapOutputLength);
    ramManager.unreserve(mapOutputLength);

    // Discard the map-output
    try {
    mapOutput.discard();
    } catch (IOException ignored) {
    LOG.info("Failed to discard map-output from " +
    mapOutputLoc.getTaskAttemptId(), ignored);
    }
    Please confirm the line number.

    If we're looking at the same code, I am afraid I don't see how we can
    improve it. First, I assume IOException shouldn't happen that often.
    Second,
    mapOutput.discard() just sets:
    data = null;
    for in memory case. Even if we call mapOutput.discard() before
    ramManager.unreserve(), we don't know when GC would kick in and make more
    memory available.
    Of course, given the large number of map outputs in your system, it became
    more likely that the root cause from my reasoning made OOME happen sooner.

    Thanks

    On Sun, Mar 7, 2010 at 1:03 PM, Andy Sautins <andy.sautins@returnpath.net
    wrote:
    Ted,

    I'm trying to follow the logic in your mail and I'm not sure I'm
    following. If you would mind helping me understand I would appreciate it.
    Looking at the code maxSingleShuffleLimit is only used in determining if
    the copy _can_ fit into memory:

    boolean canFitInMemory(long requestedSize) {
    return (requestedSize < Integer.MAX_VALUE &&
    requestedSize < maxSingleShuffleLimit);
    }

    It also looks like the RamManager.reserve should wait until memory is
    available, so it shouldn't hit a memory limit for that reason.

    What does seem a little strange to me is the following (
    ReduceTask.java
    starting at 2730 ):

    // Inform the ram-manager
    ramManager.closeInMemoryFile(mapOutputLength);
    ramManager.unreserve(mapOutputLength);

    // Discard the map-output
    try {
    mapOutput.discard();
    } catch (IOException ignored) {
    LOG.info("Failed to discard map-output from " +
    mapOutputLoc.getTaskAttemptId(), ignored);
    }
    mapOutput = null;

    So to me that looks like the ramManager unreserves the memory before the
    mapOutput is discarded. Shouldn't the mapOutput be discarded _before_ the
    ramManager unreserves the memory? If the memory is unreserved before the
    actual underlying data references are removed then it seems like another
    thread can try to allocate memory ( ReduceTask.java:2730 ) before the
    previous memory is disposed ( mapOutput.discard() ).

    Not sure that makes sense. One thing to note is that the particular job
    that is failing does have a good number ( 200k+ ) of map outputs. The large
    number of small map outputs may be why we are triggering a problem.

    Thanks again for your thoughts.

    Andy


    -----Original Message-----
    From: Jacob R Rideout
    Sent: Sunday, March 07, 2010 1:21 PM
    To: common-user@hadoop.apache.org
    Cc: Andy Sautins; Ted Yu
    Subject: Re: Shuffle In Memory OutOfMemoryError

    Ted,

    Thank you. I filed MAPREDUCE-1571 to cover this issue. I might have
    some time to write a patch later this week.

    Jacob Rideout
    On Sat, Mar 6, 2010 at 11:37 PM, Ted Yu wrote:
    I think there is a mismatch (in ReduceTask.java) between:
    this.numCopiers = conf.getInt("mapred.reduce.parallel.copies", 5);
    and:
    maxSingleShuffleLimit = (long)(maxSize *
    MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION);
    where MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION is 0.25f

    because
    copiers = new ArrayList<MapOutputCopier>(numCopiers);
    so the total memory allocated for in-mem shuffle is 1.25 * maxSize

    A JIRA should be filed to correlate the constant 5 above and
    MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION.

    Cheers

    On Sat, Mar 6, 2010 at 8:31 AM, Jacob R Rideout <
    apache@jacobrideout.net
    wrote:
    Hi all,

    We are seeing the following error in our reducers of a particular job:

    Error: java.lang.OutOfMemoryError: Java heap space
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)

    After enough reducers fail the entire job fails. This error occurs
    regardless of whether mapred.compress.map.output is true. We were able
    to avoid the issue by reducing mapred.job.shuffle.input.buffer.percent
    to 20%. Shouldn't the framework, via ShuffleRamManager.canFitInMemory
    and ShuffleRamManager.reserve, correctly detect the memory
    available for allocation? I would think that with poor configuration
    settings (and default settings in particular) the job may not be as
    efficient, but wouldn't die.

    Here is some more context in the logs, I have attached the full
    reducer log here: http://gist.github.com/323746


    2010-03-06 07:54:49,621 INFO org.apache.hadoop.mapred.ReduceTask:
    Shuffling 4191933 bytes (435311 raw bytes) into RAM from
    attempt_201003060739_0002_m_000061_0
    2010-03-06 07:54:50,222 INFO org.apache.hadoop.mapred.ReduceTask: Task
    attempt_201003060739_0002_r_000000_0: Failed fetch #1 from
    attempt_201003060739_0002_m_000202_0
    2010-03-06 07:54:50,223 WARN org.apache.hadoop.mapred.ReduceTask:
    attempt_201003060739_0002_r_000000_0 adding host
    hd37.dfs.returnpath.net to penalty box, next contact in 4 seconds
    2010-03-06 07:54:50,223 INFO org.apache.hadoop.mapred.ReduceTask:
    attempt_201003060739_0002_r_000000_0: Got 1 map-outputs from previous
    failures
    2010-03-06 07:54:50,223 FATAL org.apache.hadoop.mapred.TaskRunner:
    attempt_201003060739_0002_r_000000_0 : Map output copy failure :
    java.lang.OutOfMemoryError: Java heap space
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)

    We tried this both in 0.20.1 and 0.20.2. We had hoped MAPREDUCE-1182
    would address the issue in 0.20.2, but it did not. Does anyone have
    any comments or suggestions? Is this a bug I should file a JIRA for?

    Jacob Rideout
    Return Path
  • Ted Yu at Mar 9, 2010 at 7:56 pm
    This issue has been resolved in
    http://issues.apache.org/jira/browse/MAPREDUCE-1182

    Please apply the patch
    M1182-1v20.patch<http://issues.apache.org/jira/secure/attachment/12424116/M1182-1v20.patch>
    On Sun, Mar 7, 2010 at 3:57 PM, Andy Sautins wrote:


    Thanks Ted. Very helpful. You are correct that I misunderstood the code
    at ReduceTask.java:1535. I missed the fact that it's in an IOException catch
    block. My mistake. That's what I get for being in a rush.

    For what it's worth I did re-run the job with
    mapred.reduce.parallel.copies set with values from 5 all the way down to 1.
    All failed with the same error:

    Error: java.lang.OutOfMemoryError: Java heap space
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)


    So from that it does seem like something else might be going on, yes? I
    need to do some more research.

    I appreciate your insights.

    Andy

    -----Original Message-----
    From: Ted Yu
    Sent: Sunday, March 07, 2010 3:38 PM
    To: common-user@hadoop.apache.org
    Subject: Re: Shuffle In Memory OutOfMemoryError

    My observation is based on this call chain:
    MapOutputCopier.run() calling copyOutput() calling getMapOutput() calling
    ramManager.canFitInMemory(decompressedLength)

    Basically ramManager.canFitInMemory() makes its decision without considering
    the
    number of MapOutputCopiers that are running. Thus 1.25 * 0.7 of total heap
    may be used in shuffling if default parameters were used.
    Of course, you should check the value for mapred.reduce.parallel.copies to
    see if it is 5. If it is 4 or lower, my reasoning wouldn't apply.

    About ramManager.unreserve() call, ReduceTask.java from hadoop 0.20.2 only
    has 2731 lines. So I have to guess the location of the code snippet you
    provided.
    I found this around line 1535:
    } catch (IOException ioe) {
    LOG.info("Failed to shuffle from " +
    mapOutputLoc.getTaskAttemptId(),
    ioe);

    // Inform the ram-manager
    ramManager.closeInMemoryFile(mapOutputLength);
    ramManager.unreserve(mapOutputLength);

    // Discard the map-output
    try {
    mapOutput.discard();
    } catch (IOException ignored) {
    LOG.info("Failed to discard map-output from " +
    mapOutputLoc.getTaskAttemptId(), ignored);
    }
    Please confirm the line number.

    If we're looking at the same code, I am afraid I don't see how we can
    improve it. First, I assume IOException shouldn't happen that often.
    Second,
    mapOutput.discard() just sets:
    data = null;
    for in memory case. Even if we call mapOutput.discard() before
    ramManager.unreserve(), we don't know when GC would kick in and make more
    memory available.
    Of course, given the large number of map outputs in your system, it became
    more likely that the root cause from my reasoning made OOME happen sooner.

    Thanks

    On Sun, Mar 7, 2010 at 1:03 PM, Andy Sautins <andy.sautins@returnpath.net
    wrote:
    Ted,

    I'm trying to follow the logic in your mail and I'm not sure I'm
    following. If you would mind helping me understand I would appreciate it.
    Looking at the code maxSingleShuffleLimit is only used in determining if
    the copy _can_ fit into memory:

    boolean canFitInMemory(long requestedSize) {
    return (requestedSize < Integer.MAX_VALUE &&
    requestedSize < maxSingleShuffleLimit);
    }

    It also looks like the RamManager.reserve should wait until memory is
    available, so it shouldn't hit a memory limit for that reason.

    What does seem a little strange to me is the following (
    ReduceTask.java
    starting at 2730 ):

    // Inform the ram-manager
    ramManager.closeInMemoryFile(mapOutputLength);
    ramManager.unreserve(mapOutputLength);

    // Discard the map-output
    try {
    mapOutput.discard();
    } catch (IOException ignored) {
    LOG.info("Failed to discard map-output from " +
    mapOutputLoc.getTaskAttemptId(), ignored);
    }
    mapOutput = null;

    So to me that looks like the ramManager unreserves the memory before the
    mapOutput is discarded. Shouldn't the mapOutput be discarded _before_ the
    ramManager unreserves the memory? If the memory is unreserved before the
    actual underlying data references are removed then it seems like another
    thread can try to allocate memory ( ReduceTask.java:2730 ) before the
    previous memory is disposed ( mapOutput.discard() ).

    Not sure that makes sense. One thing to note is that the particular job
    that is failing does have a good number ( 200k+ ) of map outputs. The large
    number of small map outputs may be why we are triggering a problem.

    Thanks again for your thoughts.

    Andy


    -----Original Message-----
    From: Jacob R Rideout
    Sent: Sunday, March 07, 2010 1:21 PM
    To: common-user@hadoop.apache.org
    Cc: Andy Sautins; Ted Yu
    Subject: Re: Shuffle In Memory OutOfMemoryError

    Ted,

    Thank you. I filed MAPREDUCE-1571 to cover this issue. I might have
    some time to write a patch later this week.

    Jacob Rideout
    On Sat, Mar 6, 2010 at 11:37 PM, Ted Yu wrote:
    I think there is a mismatch (in ReduceTask.java) between:
    this.numCopiers = conf.getInt("mapred.reduce.parallel.copies", 5);
    and:
    maxSingleShuffleLimit = (long)(maxSize *
    MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION);
    where MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION is 0.25f

    because
    copiers = new ArrayList<MapOutputCopier>(numCopiers);
    so the total memory allocated for in-mem shuffle is 1.25 * maxSize

    A JIRA should be filed to correlate the constant 5 above and
    MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION.

    Cheers

    On Sat, Mar 6, 2010 at 8:31 AM, Jacob R Rideout <
    apache@jacobrideout.net
    wrote:
    Hi all,

    We are seeing the following error in our reducers of a particular job:

    Error: java.lang.OutOfMemoryError: Java heap space
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)

    After enough reducers fail the entire job fails. This error occurs
    regardless of whether mapred.compress.map.output is true. We were able
    to avoid the issue by reducing mapred.job.shuffle.input.buffer.percent
    to 20%. Shouldn't the framework, via ShuffleRamManager.canFitInMemory
    and ShuffleRamManager.reserve, correctly detect the memory
    available for allocation? I would think that with poor configuration
    settings (and default settings in particular) the job may not be as
    efficient, but wouldn't die.

    Here is some more context in the logs, I have attached the full
    reducer log here: http://gist.github.com/323746


    2010-03-06 07:54:49,621 INFO org.apache.hadoop.mapred.ReduceTask:
    Shuffling 4191933 bytes (435311 raw bytes) into RAM from
    attempt_201003060739_0002_m_000061_0
    2010-03-06 07:54:50,222 INFO org.apache.hadoop.mapred.ReduceTask: Task
    attempt_201003060739_0002_r_000000_0: Failed fetch #1 from
    attempt_201003060739_0002_m_000202_0
    2010-03-06 07:54:50,223 WARN org.apache.hadoop.mapred.ReduceTask:
    attempt_201003060739_0002_r_000000_0 adding host
    hd37.dfs.returnpath.net to penalty box, next contact in 4 seconds
    2010-03-06 07:54:50,223 INFO org.apache.hadoop.mapred.ReduceTask:
    attempt_201003060739_0002_r_000000_0: Got 1 map-outputs from previous
    failures
    2010-03-06 07:54:50,223 FATAL org.apache.hadoop.mapred.TaskRunner:
    attempt_201003060739_0002_r_000000_0 : Map output copy failure :
    java.lang.OutOfMemoryError: Java heap space
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)

    We tried this both in 0.20.1 and 0.20.2. We had hoped MAPREDUCE-1182
    would address the issue in 0.20.2, but it did not. Does anyone have
    any comments or suggestions? Is this a bug I should file a JIRA for?

    Jacob Rideout
    Return Path
  • Andy Sautins at Mar 9, 2010 at 10:24 pm
    Thanks Ted. My understanding is that MAPREDUCE-1182 is included in the 0.20.2 release. We upgraded our cluster to 0.20.2 this weekend and re-ran the same job scenarios. Running with mapred.reduce.parallel.copies set to 1, we continue to have the same Java heap space error.



    -----Original Message-----
    From: Ted Yu
    Sent: Tuesday, March 09, 2010 12:56 PM
    To: common-user@hadoop.apache.org
    Subject: Re: Shuffle In Memory OutOfMemoryError

    This issue has been resolved in
    http://issues.apache.org/jira/browse/MAPREDUCE-1182

    Please apply the patch
    M1182-1v20.patch<http://issues.apache.org/jira/secure/attachment/12424116/M1182-1v20.patch>
    On Sun, Mar 7, 2010 at 3:57 PM, Andy Sautins wrote:


    Thanks Ted. Very helpful. You are correct that I misunderstood the code
    at ReduceTask.java:1535. I missed the fact that it's in an IOException catch
    block. My mistake. That's what I get for being in a rush.

    For what it's worth I did re-run the job with
    mapred.reduce.parallel.copies set with values from 5 all the way down to 1.
    All failed with the same error:

    Error: java.lang.OutOfMemoryError: Java heap space
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)


    So from that it does seem like something else might be going on, yes? I
    need to do some more research.

    I appreciate your insights.

    Andy

    -----Original Message-----
    From: Ted Yu
    Sent: Sunday, March 07, 2010 3:38 PM
    To: common-user@hadoop.apache.org
    Subject: Re: Shuffle In Memory OutOfMemoryError

    My observation is based on this call chain:
    MapOutputCopier.run() calling copyOutput() calling getMapOutput() calling
    ramManager.canFitInMemory(decompressedLength)

    Basically ramManager.canFitInMemory() makes its decision without considering
    the
    number of MapOutputCopiers that are running. Thus 1.25 * 0.7 of total heap
    may be used in shuffling if default parameters were used.
    Of course, you should check the value for mapred.reduce.parallel.copies to
    see if it is 5. If it is 4 or lower, my reasoning wouldn't apply.

    About ramManager.unreserve() call, ReduceTask.java from hadoop 0.20.2 only
    has 2731 lines. So I have to guess the location of the code snippet you
    provided.
    I found this around line 1535:
    } catch (IOException ioe) {
    LOG.info("Failed to shuffle from " +
    mapOutputLoc.getTaskAttemptId(),
    ioe);

    // Inform the ram-manager
    ramManager.closeInMemoryFile(mapOutputLength);
    ramManager.unreserve(mapOutputLength);

    // Discard the map-output
    try {
    mapOutput.discard();
    } catch (IOException ignored) {
    LOG.info("Failed to discard map-output from " +
    mapOutputLoc.getTaskAttemptId(), ignored);
    }
    Please confirm the line number.

    If we're looking at the same code, I am afraid I don't see how we can
    improve it. First, I assume IOException shouldn't happen that often.
    Second,
    mapOutput.discard() just sets:
    data = null;
    for in memory case. Even if we call mapOutput.discard() before
    ramManager.unreserve(), we don't know when GC would kick in and make more
    memory available.
    Of course, given the large number of map outputs in your system, it became
    more likely that the root cause from my reasoning made OOME happen sooner.

    Thanks

    On Sun, Mar 7, 2010 at 1:03 PM, Andy Sautins <andy.sautins@returnpath.net
    wrote:
    Ted,

    I'm trying to follow the logic in your mail and I'm not sure I'm
    following. If you would mind helping me understand I would appreciate it.
    Looking at the code maxSingleShuffleLimit is only used in determining if
    the copy _can_ fit into memory:

    boolean canFitInMemory(long requestedSize) {
    return (requestedSize < Integer.MAX_VALUE &&
    requestedSize < maxSingleShuffleLimit);
    }

    It also looks like the RamManager.reserve should wait until memory is
    available, so it shouldn't hit a memory limit for that reason.

    What does seem a little strange to me is the following (
    ReduceTask.java
    starting at 2730 ):

    // Inform the ram-manager
    ramManager.closeInMemoryFile(mapOutputLength);
    ramManager.unreserve(mapOutputLength);

    // Discard the map-output
    try {
    mapOutput.discard();
    } catch (IOException ignored) {
    LOG.info("Failed to discard map-output from " +
    mapOutputLoc.getTaskAttemptId(), ignored);
    }
    mapOutput = null;

    So to me that looks like the ramManager unreserves the memory before the
    mapOutput is discarded. Shouldn't the mapOutput be discarded _before_ the
    ramManager unreserves the memory? If the memory is unreserved before the
    actual underlying data references are removed then it seems like another
    thread can try to allocate memory ( ReduceTask.java:2730 ) before the
    previous memory is disposed ( mapOutput.discard() ).

    Not sure that makes sense. One thing to note is that the particular job
    that is failing does have a good number ( 200k+ ) of map outputs. The large
    number of small map outputs may be why we are triggering a problem.

    Thanks again for your thoughts.

    Andy


    -----Original Message-----
    From: Jacob R Rideout
    Sent: Sunday, March 07, 2010 1:21 PM
    To: common-user@hadoop.apache.org
    Cc: Andy Sautins; Ted Yu
    Subject: Re: Shuffle In Memory OutOfMemoryError

    Ted,

    Thank you. I filed MAPREDUCE-1571 to cover this issue. I might have
    some time to write a patch later this week.

    Jacob Rideout
    On Sat, Mar 6, 2010 at 11:37 PM, Ted Yu wrote:
    I think there is a mismatch (in ReduceTask.java) between:
    this.numCopiers = conf.getInt("mapred.reduce.parallel.copies", 5);
    and:
    maxSingleShuffleLimit = (long)(maxSize *
    MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION);
    where MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION is 0.25f

    because
    copiers = new ArrayList<MapOutputCopier>(numCopiers);
    so the total memory allocated for in-mem shuffle is 1.25 * maxSize

    A JIRA should be filed to correlate the constant 5 above and
    MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION.

    Cheers

    On Sat, Mar 6, 2010 at 8:31 AM, Jacob R Rideout <
    apache@jacobrideout.net
    wrote:
    Hi all,

    We are seeing the following error in our reducers of a particular job:

    Error: java.lang.OutOfMemoryError: Java heap space
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)

    After enough reducers fail the entire job fails. This error occurs
    regardless of whether mapred.compress.map.output is true. We were able
    to avoid the issue by reducing mapred.job.shuffle.input.buffer.percent
    to 20%. Shouldn't the framework via ShuffleRamManager.canFitInMemory
    and.ShuffleRamManager.reserve correctly detect the the memory
    available for allocation? I would think that with poor configuration
    settings (and default settings in particular) the job may not be as
    efficient, but wouldn't die.

    Here is some more context in the logs, I have attached the full
    reducer log here: http://gist.github.com/323746


    2010-03-06 07:54:49,621 INFO org.apache.hadoop.mapred.ReduceTask:
    Shuffling 4191933 bytes (435311 raw bytes) into RAM from
    attempt_201003060739_0002_m_000061_0
    2010-03-06 07:54:50,222 INFO org.apache.hadoop.mapred.ReduceTask: Task
    attempt_201003060739_0002_r_000000_0: Failed fetch #1 from
    attempt_201003060739_0002_m_000202_0
    2010-03-06 07:54:50,223 WARN org.apache.hadoop.mapred.ReduceTask:
    attempt_201003060739_0002_r_000000_0 adding host
    hd37.dfs.returnpath.net to penalty box, next contact in 4 seconds
    2010-03-06 07:54:50,223 INFO org.apache.hadoop.mapred.ReduceTask:
    attempt_201003060739_0002_r_000000_0: Got 1 map-outputs from previous
    failures
    2010-03-06 07:54:50,223 FATAL org.apache.hadoop.mapred.TaskRunner:
    attempt_201003060739_0002_r_000000_0 : Map output copy failure :
    java.lang.OutOfMemoryError: Java heap space
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)

    We tried this both in 0.20.1 and 0.20.2. We had hoped MAPREDUCE-1182
    would address the issue in 0.20.2, but it did not. Does anyone have
    any comments or suggestions? Is this a bug I should file a JIRA for?

    Jacob Rideout
    Return Path
  • Ted Yu at Mar 9, 2010 at 10:34 pm
    Andy:
    You need to manually apply the patch.

    Cheers
    On Tue, Mar 9, 2010 at 2:23 PM, Andy Sautins wrote:


    Thanks Ted. My understanding is that MAPREDUCE-1182 is included in the
    0.20.2 release. We upgraded our cluster to 0.20.2 this weekend and re-ran
    the same job scenarios. Running with mapred.reduce.parallel.copies set to 1,
    we continue to see the same Java heap space error.



    -----Original Message-----
    From: Ted Yu
    Sent: Tuesday, March 09, 2010 12:56 PM
    To: common-user@hadoop.apache.org
    Subject: Re: Shuffle In Memory OutOfMemoryError

    This issue has been resolved in
    http://issues.apache.org/jira/browse/MAPREDUCE-1182

    Please apply the patch M1182-1v20.patch
    (http://issues.apache.org/jira/secure/attachment/12424116/M1182-1v20.patch).

    On Sun, Mar 7, 2010 at 3:57 PM, Andy Sautins <andy.sautins@returnpath.net> wrote:
    Thanks Ted. Very helpful. You are correct that I misunderstood the code
    at ReduceTask.java:1535. I missed the fact that it's in an IOException catch
    block. My mistake. That's what I get for being in a rush.

    For what it's worth, I did re-run the job with
    mapred.reduce.parallel.copies set to values from 5 all the way down to 1.
    All failed with the same error:

    Error: java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)

    So from that it does seem like something else might be going on, yes? I
    need to do some more research.

    I appreciate your insights.

    Andy

    -----Original Message-----
    From: Ted Yu
    Sent: Sunday, March 07, 2010 3:38 PM
    To: common-user@hadoop.apache.org
    Subject: Re: Shuffle In Memory OutOfMemoryError

    My observation is based on this call chain:
    MapOutputCopier.run() calling copyOutput() calling getMapOutput() calling
    ramManager.canFitInMemory(decompressedLength)

    Basically ramManager.canFitInMemory() makes its decision without considering
    the number of MapOutputCopiers that are running. Thus up to 1.25 * 0.7 of the
    total heap may be used for shuffling if the default parameters are used.
    Of course, you should check the value for mapred.reduce.parallel.copies to
    see if it is 5. If it is 4 or lower, my reasoning wouldn't apply.

    About the ramManager.unreserve() call: ReduceTask.java from Hadoop 0.20.2 only
    has 2731 lines, so I had to guess the location of the code snippet you
    provided. I found this around line 1535:
    } catch (IOException ioe) {
      LOG.info("Failed to shuffle from " +
               mapOutputLoc.getTaskAttemptId(), ioe);

      // Inform the ram-manager
      ramManager.closeInMemoryFile(mapOutputLength);
      ramManager.unreserve(mapOutputLength);

      // Discard the map-output
      try {
        mapOutput.discard();
      } catch (IOException ignored) {
        LOG.info("Failed to discard map-output from " +
                 mapOutputLoc.getTaskAttemptId(), ignored);
      }
    Please confirm the line number.

    If we're looking at the same code, I'm afraid I don't see how we can
    improve it. First, I assume the IOException shouldn't happen that often.
    Second, for the in-memory case mapOutput.discard() just sets data = null.
    Even if we called mapOutput.discard() before ramManager.unreserve(), we
    don't know when GC would kick in and make more memory available.
    Of course, given the large number of map outputs in your system, the root
    cause from my reasoning becomes more likely to trigger an OOME sooner.
    Thanks
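
    As a back-of-the-envelope companion to the numbers above, here is a small
    self-contained sketch (not Hadoop code, just the arithmetic Ted describes)
    using the values quoted in this thread: a shuffle input buffer of 0.70 of the
    heap, a single-shuffle segment fraction of 0.25, and 5 parallel copies.

    public class ShuffleBudgetSketch {
      public static void main(String[] args) {
        long heap = Runtime.getRuntime().maxMemory();   // e.g. ~640 MB with -Xmx640m

        double shuffleInputBufferPercent = 0.70;        // mapred.job.shuffle.input.buffer.percent
        double singleShuffleSegmentFraction = 0.25;     // MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION
        int parallelCopies = 5;                         // mapred.reduce.parallel.copies

        long maxSize = (long) (heap * shuffleInputBufferPercent);
        long maxSingleShuffleLimit = (long) (maxSize * singleShuffleSegmentFraction);

        // Each copier passes canFitInMemory() on its own, so in the worst case all
        // of them hold a segment at once: 5 * 0.25 = 1.25 * maxSize, i.e. about
        // 1.25 * 0.70 = 87.5% of the heap for shuffle buffers alone.
        long worstCase = parallelCopies * maxSingleShuffleLimit;

        System.out.printf("heap=%dMB maxSize=%dMB singleLimit=%dMB worstCase=%dMB%n",
            heap >> 20, maxSize >> 20, maxSingleShuffleLimit >> 20, worstCase >> 20);
      }
    }

    With a 640 MB reduce heap (the setting mentioned later in the thread), that
    worst case works out to roughly 560 MB before counting the rest of the task's
    live objects.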
  • Andy Sautins at Mar 9, 2010 at 10:42 pm
    Ah. My mistake. We will apply the patch manually to 0.20.2 and re-run. Just out of curiosity, why do the release notes for 0.20.2 indicate that MAPREDUCE-1182 is included in the release if the patch needs to be applied manually? Is there an additional part of the patch not included in the release?

    Thanks for your help.

    Andy

  • Christopher Douglas at Mar 10, 2010 at 12:21 am
    No, MR-1182 is included in 0.20.2

    What heap size have you set for your reduce tasks? -C

    Sent from my iPhone
  • Andy Sautins at Mar 10, 2010 at 1:02 am
    Thanks Christopher.

    The heap size for reduce tasks is configured to be 640M (mapred.child.java.opts set to -Xmx640m).

    Andy
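
    For reference, here is a minimal sketch of how the knobs discussed in this
    thread would be set from job code, assuming the classic
    org.apache.hadoop.mapred.JobConf API from the 0.20 line. The values are only
    the ones quoted in the thread (the 20% buffer that avoided the OOME for Jacob,
    the single-copier experiment, and the 640 MB reduce heap), not tuning
    recommendations.

    import org.apache.hadoop.mapred.JobConf;

    public class ShuffleTuningSketch {
      public static void main(String[] args) {
        JobConf conf = new JobConf();

        // Workaround reported earlier in the thread: shrink the in-memory shuffle
        // buffer from the 0.70 default to 20% of the reduce heap.
        conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.20f);

        // Experiment Andy ran: a single parallel fetch per reducer.
        conf.setInt("mapred.reduce.parallel.copies", 1);

        // Reduce-task heap as configured on this cluster.
        conf.set("mapred.child.java.opts", "-Xmx640m");
      }
    }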

  • Ted Yu at Mar 10, 2010 at 3:44 am
    I just downloaded the hadoop-0.20.2 tarball from the Cloudera mirror.
    This is what I see in ReduceTask (line 999):

    public synchronized boolean reserve(int requestedSize, InputStream in)
        throws InterruptedException {
      // Wait till the request can be fulfilled...
      while ((size + requestedSize) > maxSize) {

    I don't see the fix from MR-1182.

    That's why I suggested to Andy that he manually apply MR-1182.

    Cheers
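
    Since the disagreement hinges on which ReduceTask a task JVM actually loads,
    here is a small sketch (plain JDK APIs only; the class name is the one quoted
    throughout this thread) that prints the jar a class was loaded from. Comparing
    that path against the 0.20.2 tarball is one way to confirm whether a manually
    applied patch is really on the task classpath.

    public class WhichReduceTask {
      public static void main(String[] args) throws Exception {
        // Resolve ReduceTask by name so this compiles without Hadoop on the
        // compile-time classpath; run it with the same classpath as the tasks.
        Class<?> c = Class.forName("org.apache.hadoop.mapred.ReduceTask");
        System.out.println(c.getProtectionDomain().getCodeSource().getLocation());
      }
    }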

    A JIRA should be filed to correlate the constant 5 above and
    MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION.

    Cheers

    On Sat, Mar 6, 2010 at 8:31 AM, Jacob R Rideout <
    apache@jacobrideout.net
    wrote:
    Hi all,

    We are seeing the following error in our reducers of a particular
    job:
    Error: java.lang.OutOfMemoryError: Java heap space
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier
    $MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier
    $MapOutputCopier.getMapOutput(ReduceTask.java:1408)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier
    $MapOutputCopier.copyOutput(ReduceTask.java:1261)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run
    (ReduceTask.java:1195)

    After enough reducers fail the entire job fails. This error
    occurs
    regardless of whether mapred.compress.map.output is true. We were
    able
    to avoid the issue by reducing
    mapred.job.shuffle.input.buffer.percent
    to 20%. Shouldn't the framework via
    ShuffleRamManager.canFitInMemory
    and.ShuffleRamManager.reserve correctly detect the the memory
    available for allocation? I would think that with poor
    configuration
    settings (and default settings in particular) the job may not
    be as
    efficient, but wouldn't die.

    Here is some more context in the logs, I have attached the full
    reducer log here: http://gist.github.com/323746


    2010-03-06 07:54:49,621 INFO org.apache.hadoop.mapred.ReduceTask:
    Shuffling 4191933 bytes (435311 raw bytes) into RAM from
    attempt_201003060739_0002_m_000061_0
    2010-03-06 07:54:50,222 INFO org.apache.hadoop.mapred.ReduceTask:
    Task
    attempt_201003060739_0002_r_000000_0: Failed fetch #1 from
    attempt_201003060739_0002_m_000202_0
    2010-03-06 07:54:50,223 WARN org.apache.hadoop.mapred.ReduceTask:
    attempt_201003060739_0002_r_000000_0 adding host
    hd37.dfs.returnpath.net to penalty box, next contact in 4 seconds
    2010-03-06 07:54:50,223 INFO org.apache.hadoop.mapred.ReduceTask:
    attempt_201003060739_0002_r_000000_0: Got 1 map-outputs from
    previous
    failures
    2010-03-06 07:54:50,223 FATAL
    org.apache.hadoop.mapred.TaskRunner:
    attempt_201003060739_0002_r_000000_0 : Map output copy failure :
    java.lang.OutOfMemoryError: Java heap space
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier
    $MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier
    $MapOutputCopier.getMapOutput(ReduceTask.java:1408)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier
    $MapOutputCopier.copyOutput(ReduceTask.java:1261)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run
    (ReduceTask.java:1195)

    We tried this both in 0.20.1 and 0.20.2. We had hoped
    MAPREDUCE-1182
    would address the issue in 0.20.2, but it did not. Does anyone
    have
    any comments or suggestions? Is this a bug I should file a JIRA
    for?

    Jacob Rideout
    Return Path
  • Christopher Douglas at Mar 10, 2010 at 7:54 am
    That section of code is unmodified in MR-1182. See the patches/svn
    log. -C

    Sent from my iPhone
  • Ted Yu at Mar 10, 2010 at 1:27 pm
    I verified that size and maxSize are long. This means MR-1182 didn't resolve
    Andy's issue.

    According to Andy:
    At the beginning of the job there are 209,754 pending map tasks and 32
    pending reduce tasks

    My guess is that GC wasn't reclaiming memory fast enough, leading to OOME
    because of the large number of in-memory shuffle candidates.

    My suggestion for Andy would be to:
    1. add -verbose:gc as a JVM parameter
    2. modify reserve() slightly to calculate the maximum outstanding
    numPendingRequests and print the maximum.

    Based on the output from the above two items, we can discuss a solution.
    My intuition is to place an upper bound on numPendingRequests beyond which
    canFitInMemory() returns false.
    My two cents.
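
    For concreteness, here is a minimal, self-contained sketch of the accounting
    and instrumentation described above. It is not the actual ShuffleRamManager
    inside ReduceTask.java; the names maxPendingRequests and maxObservedPending
    are invented for illustration, and the pending-request bound is the idea
    proposed above, not existing behavior.

    public class RamAccountingSketch {
      private final long maxSize;                // heap * shuffle input buffer percent
      private final long maxSingleShuffleLimit;  // maxSize * single-segment fraction
      private final int maxPendingRequests;      // hypothetical cap on queued reserves

      private long size = 0;                     // bytes currently reserved
      private int numPendingRequests = 0;        // reserve() calls currently blocked
      private int maxObservedPending = 0;        // high-water mark to print at task end

      public RamAccountingSketch(long maxSize, float singleShuffleFraction,
                                 int maxPendingRequests) {
        this.maxSize = maxSize;
        this.maxSingleShuffleLimit = (long) (maxSize * singleShuffleFraction);
        this.maxPendingRequests = maxPendingRequests;
      }

      // Like canFitInMemory(), but also refuses once too many reserve() calls
      // are already waiting, so further fetches would go to disk instead.
      public synchronized boolean canFitInMemory(long requestedSize) {
        return requestedSize < Integer.MAX_VALUE
            && requestedSize < maxSingleShuffleLimit
            && numPendingRequests < maxPendingRequests;
      }

      // Blocks until the requested bytes fit under maxSize, while tracking how
      // many copier threads were waiting at once.
      public synchronized void reserve(long requestedSize) throws InterruptedException {
        numPendingRequests++;
        maxObservedPending = Math.max(maxObservedPending, numPendingRequests);
        try {
          while (size + requestedSize > maxSize) {
            wait();
          }
          size += requestedSize;
        } finally {
          numPendingRequests--;
        }
      }

      public synchronized void unreserve(long requestedSize) {
        size -= requestedSize;
        notifyAll();
      }

      public synchronized int maxObservedPending() {
        return maxObservedPending;
      }
    }

    Logging maxObservedPending() when the copy phase finishes would give the
    number asked for in item 2.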

  • Chris Douglas at Mar 10, 2010 at 11:36 pm
    I don't think this OOM is a framework bug per se, and given the
    rewrite/refactoring of the shuffle in MAPREDUCE-318 (in 0.21), tuning
    the 0.20 shuffle semantics is likely not worthwhile (though data
    informing improvements to trunk would be excellent). Most likely (and
    tautologically), ReduceTask simply requires more memory than is
    available and the job failure can be avoided by either 0) increasing
    the heap size or 1) lowering mapred.job.shuffle.input.buffer.percent. Most
    of the tasks we run have a heap of 1GB. For a reduce fetching >200k
    map outputs, that's a reasonable, even stingy amount of space. -C
  • Ted Yu at Mar 11, 2010 at 3:49 am
    Thanks to Andy for the log he provided.

    You can see from the log below that size increased steadily from 341535057
    to 408181692, approaching maxSize. Then OOME:

    2010-03-10 18:38:32,936 INFO org.apache.hadoop.mapred.ReduceTask: reserve:
    pos=start requestedSize=3893000 size=341535057 numPendingRequests=0
    maxSize=417601952
    2010-03-10 18:38:32,936 INFO org.apache.hadoop.mapred.ReduceTask: reserve:
    pos=end requestedSize=3893000 size=345428057 numPendingRequests=0
    maxSize=417601952
    ...
    2010-03-10 18:38:35,950 INFO org.apache.hadoop.mapred.ReduceTask: reserve:
    pos=end requestedSize=635753 size=408181692 numPendingRequests=0
    maxSize=417601952
    2010-03-10 18:38:36,603 INFO org.apache.hadoop.mapred.ReduceTask: Task
    attempt_201003101826_0001_r_000004_0: Failed fetch #1 from
    attempt_201003101826_0001_m_000875_0

    2010-03-10 18:38:36,603 WARN org.apache.hadoop.mapred.ReduceTask:
    attempt_201003101826_0001_r_000004_0 adding host hd17.dfs.returnpath.net to
    penalty box, next contact in 4 seconds
    2010-03-10 18:38:36,604 INFO org.apache.hadoop.mapred.ReduceTask:
    attempt_201003101826_0001_r_000004_0: Got 1 map-outputs from previous
    failures
    2010-03-10 18:38:36,605 FATAL org.apache.hadoop.mapred.TaskRunner:
    attempt_201003101826_0001_r_000004_0 : Map output copy failure :
    java.lang.OutOfMemoryError: Java heap space
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1513)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1413)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1266)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1200)
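
    A quick back-of-the-envelope check of those numbers (a sketch only, restating
    the figures from the log above against the -Xmx640m reduce heap Andy reported):

    public class ShuffleHeadroom {
      public static void main(String[] args) {
        long xmx = 640L * 1024 * 1024;   // -Xmx640m configured for the reduce tasks
        long maxSize = 417601952L;       // maxSize from the reserve: log lines above
        long reserved = 408181692L;      // size just before the OOME
        // Runtime.maxMemory() is usually a bit below -Xmx, so the true fraction
        // of usable heap taken by the shuffle buffer is somewhat higher.
        System.out.printf("shuffle buffer limit = %.0f%% of -Xmx%n",
            100.0 * maxSize / xmx);
        System.out.printf("nominal headroom above reserved bytes = %d MB%n",
            (xmx - reserved) / (1024 * 1024));
      }
    }

    Roughly 60% of the configured heap can be held as reserved map outputs, and
    only a couple of hundred megabytes nominally remain for copier buffers, merge
    overhead and anything GC has not yet reclaimed, which is consistent with the
    earlier guess that GC isn't keeping up.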

    Looking at the calls to unreserve() in ReduceTask, two were for IOException
    and the other was for a sanity check (line 1557), meaning they wouldn't be
    called in the normal execution path.

    I see one call in IFile.InMemoryReader's close() method:
    // Inform the RamManager
    ramManager.unreserve(bufferSize);

    And InMemoryReader is used in createInMemorySegments():
    Reader<K, V> reader =
    new InMemoryReader<K, V>(ramManager, mo.mapAttemptId,
    mo.data, 0, mo.data.length);

  • Ted Yu at Mar 11, 2010 at 3:55 am
    I pressed the send key a bit early.

    I will have to dig a bit deeper.
    Hopefully someone can find reader.close() call after which I will look for
    another possible root cause :-)

    On Wed, Mar 10, 2010 at 7:48 PM, Ted Yu wrote:

    Thanks to Andy for the log he provided.

    You can see from the log below that size increased steadily from 341535057
    to 408181692, approaching maxSize. Then OOME:


    2010-03-10 18:38:32,936 INFO org.apache.hadoop.mapred.ReduceTask: reserve:
    pos=start requestedSize=3893000 size=341535057 numPendingRequests=0
    maxSize=417601952
    2010-03-10 18:38:32,936 INFO org.apache.hadoop.mapred.ReduceTask: reserve:
    pos=end requestedSize=3893000 size=345428057 numPendingRequests=0
    maxSize=417601952
    ...
    2010-03-10 18:38:35,950 INFO org.apache.hadoop.mapred.ReduceTask: reserve:
    pos=end requestedSize=635753 size=408181692 numPendingRequests=0
    maxSize=417601952
    2010-03-10 18:38:36,603 INFO org.apache.hadoop.mapred.ReduceTask: Task
    attempt_201003101826_0001_r_000004_0: Failed fetch #1 from
    attempt_201003101826_0001_m_000875_0

    2010-03-10 18:38:36,603 WARN org.apache.hadoop.mapred.ReduceTask:
    attempt_201003101826_0001_r_000004_0 adding host hd17.dfs.returnpath.net to penalty box, next contact in 4 seconds
    2010-03-10 18:38:36,604 INFO org.apache.hadoop.mapred.ReduceTask:
    attempt_201003101826_0001_r_000004_0: Got 1 map-outputs from previous
    failures
    2010-03-10 18:38:36,605 FATAL org.apache.hadoop.mapred.TaskRunner:
    attempt_201003101826_0001_r_000004_0 : Map output copy failure :
    java.lang.OutOfMemoryError: Java heap space
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1513)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1413)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1266)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1200)

    Looking at the calls to unreserve() in ReduceTask, two were for IOException
    and the other was for a sanity check (line 1557), meaning they wouldn't be
    called in the normal execution path.

    I see one call in IFile.InMemoryReader close() method:
    // Inform the RamManager
    ramManager.unreserve(bufferSize);

    And InMemoryReader is used in createInMemorySegments():
    Reader<K, V> reader =
    new InMemoryReader<K, V>(ramManager, mo.mapAttemptId,
    mo.data, 0, mo.data.length);

    But I don't see reader.close() in ReduceTask file.
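
    If InMemoryReader.close() really is the only normal-path unreserve(), the
    pattern being hinted at is simply to guarantee that every reader built in
    createInMemorySegments() gets closed. A sketch with simplified stand-in
    types; the real classes live in IFile and ReduceTask:

    import java.io.Closeable;

    // Simplified stand-ins; illustrative names, not the real IFile classes.
    class SketchShuffleRamManager {
        synchronized void unreserve(long bytes) { /* return bytes to the shuffle budget */ }
    }

    class SketchInMemoryReader implements Closeable {
        private final SketchShuffleRamManager ramManager;
        private final int bufferSize;

        SketchInMemoryReader(SketchShuffleRamManager ramManager, byte[] data) {
            this.ramManager = ramManager;
            this.bufferSize = data.length;
        }

        @Override
        public void close() {
            ramManager.unreserve(bufferSize);   // the call that could not be found in ReduceTask
        }
    }

    public class CloseInMemoryReaders {
        public static void main(String[] args) {
            SketchShuffleRamManager ramManager = new SketchShuffleRamManager();
            SketchInMemoryReader reader =
                new SketchInMemoryReader(ramManager, new byte[64 * 1024]);
            try {
                // ... merge the in-memory segment ...
            } finally {
                reader.close();                  // releases the reservation even on error
            }
        }
    }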
    On Wed, Mar 10, 2010 at 3:34 PM, Chris Douglas wrote:

    I don't think this OOM is a framework bug per se, and given the
    rewrite/refactoring of the shuffle in MAPREDUCE-318 (in 0.21), tuning the
    0.20 shuffle semantics is likely not worthwhile (though data informing
    improvements to trunk would be excellent). Most likely (and tautologically),
    ReduceTask simply requires more memory than is available and the job failure
    can be avoided by either 0) increasing the heap size or 1) lowering
    mapred.shuffle.input.buffer.percent. Most of the tasks we run have a heap of
    1GB. For a reduce fetching >200k map outputs, that's a reasonable, even
    stingy amount of space. -C
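
    In 0.20 JobConf terms, the two mitigations look roughly like the
    following; the full property name is mapred.job.shuffle.input.buffer.percent,
    and the particular values are only examples:

    import org.apache.hadoop.mapred.JobConf;

    public class ShuffleTuning {
        public static void main(String[] args) {
            JobConf conf = new JobConf();
            // 0) give the reduce JVMs more heap than the 640 MB used in this thread
            conf.set("mapred.child.java.opts", "-Xmx1024m");
            // 1) or keep the heap and shrink the in-memory shuffle buffer
            conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.20f);
        }
    }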


    On Mar 10, 2010, at 5:26 AM, Ted Yu wrote:

    I verified that size and maxSize are long. This means MR-1182 didn't
    resolve
    Andy's issue.

    According to Andy:
    At the beginning of the job there are 209,754 pending map tasks and 32
    pending reduce tasks

    My guess is that GC wasn't reclaiming memory fast enough, leading to OOME
    because of the large number of in-memory shuffle candidates.

    My suggestion for Andy would be to:
    1. add -verbose:gc as a JVM parameter
    2. modify reserve() slightly to calculate the maximum outstanding
    numPendingRequests and print the maximum.

    Based on the output from the above two items, we can discuss a solution.
    My intuition is to place an upper bound on numPendingRequests beyond which
    canFitInMemory() returns false.

    My two cents.
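
    The first item is just a JVM flag, e.g. appending -verbose:gc to
    mapred.child.java.opts. A rough model of the second and third items,
    counting outstanding reservations and refusing in-memory shuffle past a
    cap, is sketched below; maxPendingRequests is an invented knob and the
    class is a simplified stand-in for ShuffleRamManager, not a patch:

    public class CappedShuffleRamManager {
        private final long maxSize;
        private final long maxSingleShuffleLimit;
        private final int maxPendingRequests;   // hypothetical upper bound (item 3)
        private int numPendingRequests = 0;
        private int maxObservedPending = 0;     // the maximum suggested for printing (item 2)
        private long size = 0;

        CappedShuffleRamManager(long maxSize, float singleShuffleFraction,
                                int maxPendingRequests) {
            this.maxSize = maxSize;
            this.maxSingleShuffleLimit = (long) (maxSize * singleShuffleFraction);
            this.maxPendingRequests = maxPendingRequests;
        }

        synchronized boolean canFitInMemory(long requestedSize) {
            return requestedSize < Integer.MAX_VALUE
                && requestedSize < maxSingleShuffleLimit
                && numPendingRequests < maxPendingRequests;   // the proposed extra condition
        }

        synchronized void reserve(long requestedSize) throws InterruptedException {
            numPendingRequests++;
            maxObservedPending = Math.max(maxObservedPending, numPendingRequests);
            while (size + requestedSize > maxSize) {
                wait();                       // block until unreserve() frees space
            }
            size += requestedSize;
            numPendingRequests--;
        }

        synchronized void unreserve(long requestedSize) {
            size -= requestedSize;
            notifyAll();
        }

        synchronized int maxObservedPending() {
            return maxObservedPending;
        }
    }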

    On Tue, Mar 9, 2010 at 11:51 PM, Christopher Douglas
    wrote:

    That section of code is unmodified in MR-1182. See the patches/svn log.
    -C

    Sent from my iPhone


    On Mar 9, 2010, at 7:44 PM, "Ted Yu" wrote:

    I just downloaded hadoop-0.20.2 tar ball from cloudera mirror.
    This is what I see in ReduceTask (line 999):
    public synchronized boolean reserve(int requestedSize, InputStream
    in)

    throws InterruptedException {
    // Wait till the request can be fulfilled...
    while ((size + requestedSize) > maxSize) {

    I don't see the fix from MR-1182.

    That's why I suggested to Andy that he manually apply MR-1182.

    Cheers

    On Tue, Mar 9, 2010 at 5:01 PM, Andy Sautins <
    andy.sautins@returnpath.net
    wrote:

    Thanks Christopher.
    The heap size for reduce tasks is configured to be 640M (
    mapred.child.java.opts set to -Xmx640m ).

    Andy

    -----Original Message-----
    From: Christopher Douglas
    Sent: Tuesday, March 09, 2010 5:19 PM
    To: common-user@hadoop.apache.org
    Subject: Re: Shuffle In Memory OutOfMemoryError

    No, MR-1182 is included in 0.20.2

    What heap size have you set for your reduce tasks? -C

    Sent from my iPhone

    On Mar 9, 2010, at 2:34 PM, "Ted Yu" wrote:

    Andy:
    You need to manually apply the patch.

    Cheers

    On Tue, Mar 9, 2010 at 2:23 PM, Andy Sautins <

    andy.sautins@returnpath.net wrote:
    Thanks Ted. My understanding is that MAPREDUCE-1182 is included
    in the
    0.20.2 release. We upgraded our cluster to 0.20.2 this weekend and
    re-ran
    the same job scenarios. Running with mapred.reduce.parallel.copies set to 1,
    we continue to have the same Java heap space error.



    -----Original Message-----
    From: Ted Yu
    Sent: Tuesday, March 09, 2010 12:56 PM
    To: common-user@hadoop.apache.org
    Subject: Re: Shuffle In Memory OutOfMemoryError

    This issue has been resolved in
    http://issues.apache.org/jira/browse/MAPREDUCE-1182

    Please apply the patch
    M1182-1v20.patch<

    http://issues.apache.org/jira/secure/attachment/12424116/M1182-1v20.patch

    On Sun, Mar 7, 2010 at 3:57 PM, Andy Sautins <
    andy.sautins@returnpath.net
    wrote:

    Thanks Ted. Very helpful. You are correct that I misunderstood
    the

    code
    at ReduceTask.java:1535. I missed the fact that it's in an IOException catch
    block. My mistake. That's what I get for being in a rush.
    For what it's worth I did re-run the job with
    mapred.reduce.parallel.copies set with values from 5 all the way
    down to

    1.
    All failed with the same error:
    Error: java.lang.OutOfMemoryError: Java heap space
    at

    org.apache.hadoop.mapred.ReduceTask$ReduceCopier
    $MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)

    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier
    $MapOutputCopier.getMapOutput(ReduceTask.java:1408)

    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier
    $MapOutputCopier.copyOutput(ReduceTask.java:1261)

    at

    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run
    (ReduceTask.java:1195)

    So from that it does seem like something else might be going on,
    yes?

    I
    need to do some more research.
    I appreciate your insights.

    Andy

    -----Original Message-----
    From: Ted Yu
    Sent: Sunday, March 07, 2010 3:38 PM
    To: common-user@hadoop.apache.org
    Subject: Re: Shuffle In Memory OutOfMemoryError

    My observation is based on this call chain:
    MapOutputCopier.run() calling copyOutput() calling getMapOutput()
    calling
    ramManager.canFitInMemory(decompressedLength)

    Basically ramManager.canFitInMemory() makes its decision without considering
    the number of MapOutputCopiers that are running. Thus 1.25 * 0.7 of the total
    heap may be used in shuffling if default parameters were used.
    Of course, you should check the value for
    mapred.reduce.parallel.copies

    to
    see if it is 5. If it is 4 or lower, my reasoning wouldn't apply.
    About ramManager.unreserve() call, ReduceTask.java from hadoop
    0.20.2

    only
    has 2731 lines. So I have to guess the location of the code
    snippet you
    provided.
    I found this around line 1535:
    } catch (IOException ioe) {
    LOG.info("Failed to shuffle from " +
    mapOutputLoc.getTaskAttemptId(),
    ioe);

    // Inform the ram-manager
    ramManager.closeInMemoryFile(mapOutputLength);
    ramManager.unreserve(mapOutputLength);

    // Discard the map-output
    try {
    mapOutput.discard();
    } catch (IOException ignored) {
    LOG.info("Failed to discard map-output from " +
    mapOutputLoc.getTaskAttemptId(), ignored);
    }
    Please confirm the line number.

    If we're looking at the same code, I am afraid I don't see how we
    can
    improve it. First, I assume IOException shouldn't happen that
    often.
    Second,
    mapOutput.discard() just sets:
    data = null;
    for in memory case. Even if we call mapOutput.discard() before
    ramManager.unreserve(), we don't know when GC would kick in and
    make more
    memory available.
    Of course, given the large number of map outputs in your system, it

    became
    more likely that the root cause from my reasoning made OOME happen
    sooner.
    Thanks


    On Sun, Mar 7, 2010 at 1:03 PM, Andy Sautins <
    andy.sautins@returnpath.net wrote:
    Ted,
    I'm trying to follow the logic in your mail and I'm not sure I'm
    following. If you would mind helping me understand I would
    appreciate

    it.
    Looking at the code maxSingleShuffleLimit is only used in
    determining

    if
    the copy _can_ fit into memory:
    boolean canFitInMemory(long requestedSize) {
    return (requestedSize < Integer.MAX_VALUE &&
    requestedSize < maxSingleShuffleLimit);
    }

    It also looks like the RamManager.reserve should wait until
    memory

    is
    available so it should hit a memory limit for that reason.
    What does seem a little strange to me is the following (

    ReduceTask.java
    starting at 2730 ):
    // Inform the ram-manager
    ramManager.closeInMemoryFile(mapOutputLength);
    ramManager.unreserve(mapOutputLength);

    // Discard the map-output
    try {
    mapOutput.discard();
    } catch (IOException ignored) {
    LOG.info("Failed to discard map-output from " +
    mapOutputLoc.getTaskAttemptId(), ignored);
    }
    mapOutput = null;

    So to me that looks like the ramManager unreserves the memory
    before

    the
    mapOutput is discarded. Shouldn't the mapOutput be discarded
    _before_

    the
    ramManager unreserves the memory? If the memory is unreserved
    before

    the
    actual underlying data references are removed then it seems like
    another
    thread can try to allocate memory ( ReduceTask.java:2730 ) before
    the
    previous memory is disposed ( mapOutput.discard() ).

    Not sure that makes sense. One thing to note is that the
    particular

    job
    that is failing does have a good number ( 200k+ ) of map
    outputs. The

    large
    number of small map outputs may be why we are triggering a
    problem.

    Thanks again for your thoughts.

    Andy


    -----Original Message-----
    From: Jacob R Rideout
    Sent: Sunday, March 07, 2010 1:21 PM
    To: common-user@hadoop.apache.org
    Cc: Andy Sautins; Ted Yu
    Subject: Re: Shuffle In Memory OutOfMemoryError

    Ted,

    Thank you. I filed MAPREDUCE-1571 to cover this issue. I might
    have
    some time to write a patch later this week.

    Jacob Rideout

    On Sat, Mar 6, 2010 at 11:37 PM, Ted Yu <yuzhihong@gmail.com>
    wrote:

    I think there is mismatch (in ReduceTask.java) between:
    this.numCopiers = conf.getInt("mapred.reduce.parallel.copies",

    5);
    and:
    maxSingleShuffleLimit = (long)(maxSize *
    MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION);
    where MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION is 0.25f

    because
    copiers = new ArrayList<MapOutputCopier>(numCopiers);
    so the total memory allocated for in-mem shuffle is 1.25 *
    maxSize

    A JIRA should be filed to correlate the constant 5 above and
    MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION.

    Cheers

    On Sat, Mar 6, 2010 at 8:31 AM, Jacob R Rideout <

    apache@jacobrideout.net
    wrote:
    Hi all,
    We are seeing the following error in our reducers of a
    particular

    job:
    Error: java.lang.OutOfMemoryError: Java heap space
    at

    org.apache.hadoop.mapred.ReduceTask$ReduceCopier
    $MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)

    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier
    $MapOutputCopier.getMapOutput(ReduceTask.java:1408)

    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier
    $MapOutputCopier.copyOutput(ReduceTask.java:1261)

    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run
    (ReduceTask.java:1195)

    After enough reducers fail the entire job fails. This error
    occurs
    regardless of whether mapred.compress.map.output is true. We
    were

    able
    to avoid the issue by reducing
    mapred.job.shuffle.input.buffer.percent
    to 20%. Shouldn't the framework via
    ShuffleRamManager.canFitInMemory
    and ShuffleRamManager.reserve correctly detect the memory
    available for allocation? I would think that with poor
    configuration
    settings (and default settings in particular) the job may not
    be as
    efficient, but wouldn't die.

    Here is some more context in the logs, I have attached the full
    reducer log here: http://gist.github.com/323746


    2010-03-06 07:54:49,621 INFO
    org.apache.hadoop.mapred.ReduceTask:
    Shuffling 4191933 bytes (435311 raw bytes) into RAM from
    attempt_201003060739_0002_m_000061_0
    2010-03-06 07:54:50,222 INFO
    org.apache.hadoop.mapred.ReduceTask:

    Task
    attempt_201003060739_0002_r_000000_0: Failed fetch #1 from
    attempt_201003060739_0002_m_000202_0
    2010-03-06 07:54:50,223 WARN
    org.apache.hadoop.mapred.ReduceTask:
    attempt_201003060739_0002_r_000000_0 adding host
    hd37.dfs.returnpath.net to penalty box, next contact in 4
    seconds
    2010-03-06 07:54:50,223 INFO
    org.apache.hadoop.mapred.ReduceTask:
    attempt_201003060739_0002_r_000000_0: Got 1 map-outputs from

    previous
    failures
    2010-03-06 07:54:50,223 FATAL
    org.apache.hadoop.mapred.TaskRunner:
    attempt_201003060739_0002_r_000000_0 : Map output copy failure :
    java.lang.OutOfMemoryError: Java heap space
    at

    org.apache.hadoop.mapred.ReduceTask$ReduceCopier
    $MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)

    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier
    $MapOutputCopier.getMapOutput(ReduceTask.java:1408)

    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier
    $MapOutputCopier.copyOutput(ReduceTask.java:1261)

    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run
    (ReduceTask.java:1195)

    We tried this both in 0.20.1 and 0.20.2. We had hoped
    MAPREDUCE-1182
    would address the issue in 0.20.2, but it did not. Does anyone
    have
    any comments or suggestions? Is this a bug I should file a JIRA
    for?

    Jacob Rideout
    Return Path

  • Bo Shi at May 8, 2010 at 3:08 am
    Hey Ted, any further insights on this? We're encountering a similar
    issue (on CD2). I'll be applying MAPREDUCE-1182 to see if that
    resolves our case but it sounds like that JIRA didn't completely
    eliminate the problem for some folks.
    On Wed, Mar 10, 2010 at 11:54 PM, Ted Yu wrote:
    I pressed the send key a bit early.

    I will have to dig a bit deeper.
    Hopefully someone can find reader.close() call after which I will look for
    another possible root cause :-)

    On Wed, Mar 10, 2010 at 7:48 PM, Ted Yu wrote:

    Thanks to Andy for the log he provided.

    You can see from the log below that size increased steadily from 341535057
    to 408181692, approaching maxSize. Then OOME:


    2010-03-10 18:38:32,936 INFO org.apache.hadoop.mapred.ReduceTask: reserve:
    pos=start requestedSize=3893000 size=341535057 numPendingRequests=0
    maxSize=417601952
    2010-03-10 18:38:32,936 INFO org.apache.hadoop.mapred.ReduceTask: reserve:
    pos=end requestedSize=3893000 size=345428057 numPendingRequests=0
    maxSize=417601952
    ...
    2010-03-10 18:38:35,950 INFO org.apache.hadoop.mapred.ReduceTask: reserve:
    pos=end requestedSize=635753 size=408181692 numPendingRequests=0
    maxSize=417601952
    2010-03-10 18:38:36,603 INFO org.apache.hadoop.mapred.ReduceTask: Task
    attempt_201003101826_0001_r_000004_0: Failed fetch #1 from
    attempt_201003101826_0001_m_000875_0

    2010-03-10 18:38:36,603 WARN org.apache.hadoop.mapred.ReduceTask:
    attempt_201003101826_0001_r_000004_0 adding host hd17.dfs.returnpath.net to penalty box, next contact in 4 seconds
    2010-03-10 18:38:36,604 INFO org.apache.hadoop.mapred.ReduceTask:
    attempt_201003101826_0001_r_000004_0: Got 1 map-outputs from previous
    failures
    2010-03-10 18:38:36,605 FATAL org.apache.hadoop.mapred.TaskRunner:
    attempt_201003101826_0001_r_000004_0 : Map output copy failure :
    java.lang.OutOfMemoryError: Java heap space
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1513)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1413)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1266)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1200)

    Looking at the call to unreserve() in ReduceTask, two were for IOException
    and the other was for Sanity check (line 1557). Meaning they wouldn't be
    called in normal execution path.

    I see one call in IFile.InMemoryReader close() method:
    // Inform the RamManager
    ramManager.unreserve(bufferSize);

    And InMemoryReader is used in createInMemorySegments():
    Reader<K, V> reader =
    new InMemoryReader<K, V>(ramManager, mo.mapAttemptId,
    mo.data, 0, mo.data.length);

    But I don't see reader.close() in ReduceTask file.
    On Wed, Mar 10, 2010 at 3:34 PM, Chris Douglas wrote:

    I don't think this OOM is a framework bug per se, and given the
    rewrite/refactoring of the shuffle in MAPREDUCE-318 (in 0.21), tuning the
    0.20 shuffle semantics is likely not worthwhile (though data informing
    improvements to trunk would be excellent). Most likely (and tautologically),
    ReduceTask simply requires more memory than is available and the job failure
    can be avoided by either 0) increasing the heap size or 1) lowering
    mapred.shuffle.input.buffer.percent. Most of the tasks we run have a heap of
    1GB. For a reduce fetching >200k map outputs, that's a reasonable, even
    stingy amount of space. -C


    On Mar 10, 2010, at 5:26 AM, Ted Yu wrote:

    I verified that size and maxSize are long. This means MR-1182 didn't
    resolve
    Andy's issue.

    According to Andy:
    At the beginning of the job there are 209,754 pending map tasks and 32
    pending reduce tasks

    My guess is that GC wasn't reclaiming memory fast enough, leading to OOME
    because of large number of in-memory shuffle candidates.

    My suggestion for Andy would be to:
    1. add -verbose:gc as a JVM parameter
    2. modify reserve() slightly to calculate the maximum outstanding
    numPendingRequests and print the maximum.

    Based on the output from the above two items, we can discuss a solution.
    My intuition is to place an upper bound on numPendingRequests beyond which
    canFitInMemory() returns false.

    My two cents.

    On Tue, Mar 9, 2010 at 11:51 PM, Christopher Douglas
    wrote:

    That section of code is unmodified in MR-1182. See the patches/svn log.
    -C

    Sent from my iPhone


    On Mar 9, 2010, at 7:44 PM, "Ted Yu" wrote:

    I just downloaded hadoop-0.20.2 tar ball from cloudera mirror.
    This is what I see in ReduceTask (line 999):
    public synchronized boolean reserve(int requestedSize, InputStream
    in)

    throws InterruptedException {
    // Wait till the request can be fulfilled...
    while ((size + requestedSize) > maxSize) {

    I don't see the fix from MR-1182.

    That's why I suggested to Andy that he manually apply MR-1182.

    Cheers

    On Tue, Mar 9, 2010 at 5:01 PM, Andy Sautins <
    andy.sautins@returnpath.net
    wrote:

    Thanks Christopher.
    The heap size for reduce tasks is configured to be 640M (
    mapred.child.java.opts set to -Xmx640m ).

    Andy

    -----Original Message-----
    From: Christopher Douglas
    Sent: Tuesday, March 09, 2010 5:19 PM
    To: common-user@hadoop.apache.org
    Subject: Re: Shuffle In Memory OutOfMemoryError

    No, MR-1182 is included in 0.20.2

    What heap size have you set for your reduce tasks? -C

    Sent from my iPhone

    On Mar 9, 2010, at 2:34 PM, "Ted Yu" wrote:

    Andy:
    You need to manually apply the patch.

    Cheers

    On Tue, Mar 9, 2010 at 2:23 PM, Andy Sautins <

    andy.sautins@returnpath.net wrote:
    Thanks Ted.  My understanding is that MAPREDUCE-1182 is included
    in the
    0.20.2 release.  We upgraded our cluster to 0.20.2 this weekend and
    re-ran
    the same job scenarios.  Running with mapred.reduce.parallel.copies
    set to 1
    and continue to have the same Java heap space error.



    -----Original Message-----
    From: Ted Yu
    Sent: Tuesday, March 09, 2010 12:56 PM
    To: common-user@hadoop.apache.org
    Subject: Re: Shuffle In Memory OutOfMemoryError

    This issue has been resolved in
    http://issues.apache.org/jira/browse/MAPREDUCE-1182

    Please apply the patch
    M1182-1v20.patch<

    http://issues.apache.org/jira/secure/attachment/12424116/M1182-1v20.patch

    On Sun, Mar 7, 2010 at 3:57 PM, Andy Sautins <
    andy.sautins@returnpath.net
    wrote:

    Thanks Ted.  Very helpful.  You are correct that I misunderstood
    the

    code
    at ReduceTask.java:1535.  I missed the fact that it's in a
    IOException

    catch
    block.  My mistake.  That's what I get for being in a rush.
    For what it's worth I did re-run the job with
    mapred.reduce.parallel.copies set with values from 5 all the way
    down to

    1.
    All failed with the same error:
    Error: java.lang.OutOfMemoryError: Java heap space
    at

    org.apache.hadoop.mapred.ReduceTask$ReduceCopier
    $MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)

    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier
    $MapOutputCopier.getMapOutput(ReduceTask.java:1408)

    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier
    $MapOutputCopier.copyOutput(ReduceTask.java:1261)

    at

    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run
    (ReduceTask.java:1195)

    So from that it does seem like something else might be going on,
    yes?

    I
    need to do some more research.
    I appreciate your insights.

    Andy

    -----Original Message-----
    From: Ted Yu
    Sent: Sunday, March 07, 2010 3:38 PM
    To: common-user@hadoop.apache.org
    Subject: Re: Shuffle In Memory OutOfMemoryError

    My observation is based on this call chain:
    MapOutputCopier.run() calling copyOutput() calling getMapOutput()
    calling
    ramManager.canFitInMemory(decompressedLength)

    Basically ramManager.canFitInMemory() makes decision without
    considering
    the
    number of MapOutputCopiers that are running. Thus 1.25 * 0.7 of
    total

    heap
    may be used in shuffling if default parameters were used.
    Of course, you should check the value for
    mapred.reduce.parallel.copies

    to
    see if it is 5. If it is 4 or lower, my reasoning wouldn't apply.
    About ramManager.unreserve() call, ReduceTask.java from hadoop
    0.20.2

    only
    has 2731 lines. So I have to guess the location of the code
    snippet you
    provided.
    I found this around line 1535:
    } catch (IOException ioe) {
    LOG.info("Failed to shuffle from " +
    mapOutputLoc.getTaskAttemptId(),
    ioe);

    // Inform the ram-manager
    ramManager.closeInMemoryFile(mapOutputLength);
    ramManager.unreserve(mapOutputLength);

    // Discard the map-output
    try {
    mapOutput.discard();
    } catch (IOException ignored) {
    LOG.info("Failed to discard map-output from " +
    mapOutputLoc.getTaskAttemptId(), ignored);
    }
    Please confirm the line number.

    If we're looking at the same code, I am afraid I don't see how we
    can
    improve it. First, I assume IOException shouldn't happen that
    often.
    Second,
    mapOutput.discard() just sets:
    data = null;
    for in memory case. Even if we call mapOutput.discard() before
    ramManager.unreserve(), we don't know when GC would kick in and
    make more
    memory available.
    Of course, given the large number of map outputs in your system, it

    became
    more likely that the root cause from my reasoning made OOME happen
    sooner.
    Thanks


    On Sun, Mar 7, 2010 at 1:03 PM, Andy Sautins <
    andy.sautins@returnpath.net wrote:
    Ted,
    I'm trying to follow the logic in your mail and I'm not sure I'm
    following.  If you would mind helping me understand I would
    appreciate

    it.
    Looking at the code maxSingleShuffleLimit is only used in
    determining

    if
    the copy _can_ fit into memory:
    boolean canFitInMemory(long requestedSize) {
    return (requestedSize < Integer.MAX_VALUE &&
    requestedSize < maxSingleShuffleLimit);
    }

    It also looks like the RamManager.reserve should wait until
    memory

    is
    available so it should hit a memory limit for that reason.
    What does seem a little strange to me is the following (

    ReduceTask.java
    starting at 2730 ):
    // Inform the ram-manager
    ramManager.closeInMemoryFile(mapOutputLength);
    ramManager.unreserve(mapOutputLength);

    // Discard the map-output
    try {
    mapOutput.discard();
    } catch (IOException ignored) {
    LOG.info("Failed to discard map-output from " +
    mapOutputLoc.getTaskAttemptId(), ignored);
    }
    mapOutput = null;

    So to me that looks like the ramManager unreserves the memory
    before

    the
    mapOutput is discarded.  Shouldn't the mapOutput be discarded
    _before_

    the
    ramManager unreserves the memory?  If the memory is unreserved
    before

    the
    actual underlying data references are removed then it seems like
    another
    thread can try to allocate memory ( ReduceTask.java:2730 ) before
    the
    previous memory is disposed ( mapOutput.discard() ).

    Not sure that makes sense.  One thing to note is that the
    particular

    job
    that is failing does have a good number ( 200k+ ) of map
    outputs.  The

    large
    number of small map outputs may be why we are triggering a
    problem.

    Thanks again for your thoughts.

    Andy


    -----Original Message-----
    From: Jacob R Rideout
    Sent: Sunday, March 07, 2010 1:21 PM
    To: common-user@hadoop.apache.org
    Cc: Andy Sautins; Ted Yu
    Subject: Re: Shuffle In Memory OutOfMemoryError

    Ted,

    Thank you. I filed MAPREDUCE-1571 to cover this issue. I might
    have
    some time to write a patch later this week.

    Jacob Rideout

    On Sat, Mar 6, 2010 at 11:37 PM, Ted Yu <yuzhihong@gmail.com>
    wrote:

    I think there is mismatch (in ReduceTask.java) between:
    this.numCopiers = conf.getInt("mapred.reduce.parallel.copies",

    5);
    and:
    maxSingleShuffleLimit = (long)(maxSize *
    MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION);
    where MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION is 0.25f

    because
    copiers = new ArrayList<MapOutputCopier>(numCopiers);
    so the total memory allocated for in-mem shuffle is 1.25 *
    maxSize

    A JIRA should be filed to correlate the constant 5 above and
    MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION.

    Cheers

    On Sat, Mar 6, 2010 at 8:31 AM, Jacob R Rideout <

    apache@jacobrideout.net
    wrote:
    Hi all,
    We are seeing the following error in our reducers of a
    particular

    job:
    Error: java.lang.OutOfMemoryError: Java heap space
    at

    org.apache.hadoop.mapred.ReduceTask$ReduceCopier
    $MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)

    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier
    $MapOutputCopier.getMapOutput(ReduceTask.java:1408)

    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier
    $MapOutputCopier.copyOutput(ReduceTask.java:1261)

    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run
    (ReduceTask.java:1195)

    After enough reducers fail the entire job fails. This error
    occurs
    regardless of whether mapred.compress.map.output is true. We
    were

    able
    to avoid the issue by reducing
    mapred.job.shuffle.input.buffer.percent
    to 20%. Shouldn't the framework via
    ShuffleRamManager.canFitInMemory
    and ShuffleRamManager.reserve correctly detect the memory
    available for allocation? I would think that with poor
    configuration
    settings (and default settings in particular) the job may not
    be as
    efficient, but wouldn't die.

    Here is some more context in the logs, I have attached the full
    reducer log here: http://gist.github.com/323746


    2010-03-06 07:54:49,621 INFO
    org.apache.hadoop.mapred.ReduceTask:
    Shuffling 4191933 bytes (435311 raw bytes) into RAM from
    attempt_201003060739_0002_m_000061_0
    2010-03-06 07:54:50,222 INFO
    org.apache.hadoop.mapred.ReduceTask:

    Task
    attempt_201003060739_0002_r_000000_0: Failed fetch #1 from
    attempt_201003060739_0002_m_000202_0
    2010-03-06 07:54:50,223 WARN
    org.apache.hadoop.mapred.ReduceTask:
    attempt_201003060739_0002_r_000000_0 adding host
    hd37.dfs.returnpath.net to penalty box, next contact in 4
    seconds
    2010-03-06 07:54:50,223 INFO
    org.apache.hadoop.mapred.ReduceTask:
    attempt_201003060739_0002_r_000000_0: Got 1 map-outputs from

    previous
    failures
    2010-03-06 07:54:50,223 FATAL
    org.apache.hadoop.mapred.TaskRunner:
    attempt_201003060739_0002_r_000000_0 : Map output copy failure :
    java.lang.OutOfMemoryError: Java heap space
    at

    org.apache.hadoop.mapred.ReduceTask$ReduceCopier
    $MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)

    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier
    $MapOutputCopier.getMapOutput(ReduceTask.java:1408)

    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier
    $MapOutputCopier.copyOutput(ReduceTask.java:1261)

    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run
    (ReduceTask.java:1195)

    We tried this both in 0.20.1 and 0.20.2. We had hoped
    MAPREDUCE-1182
    would address the issue in 0.20.2, but it did not. Does anyone
    have
    any comments or suggestions? Is this a bug I should file a JIRA
    for?

    Jacob Rideout
    Return Path

  • Ted Yu at May 8, 2010 at 3:25 am
    You need to lower mapred.job.shuffle.input.buffer.percent to 20% or 25%.
    I didn't have time recently to find the root cause in 0.20.2.

    I was told that the shuffle has been rewritten in 0.21.
    You may give it a try.
    On Fri, May 7, 2010 at 8:08 PM, Bo Shi wrote:

    Hey Ted, any further insights on this? We're encountering a similar
    issue (on CD2). I'll be applying MAPREDUCE-1182 to see if that
    resolves our case but it sounds like that JIRA didn't completely
    eliminate the problem for some folks.
    On Wed, Mar 10, 2010 at 11:54 PM, Ted Yu wrote:
    I pressed the send key a bit early.

    I will have to dig a bit deeper.
    Hopefully someone can find reader.close() call after which I will look for
    another possible root cause :-)

    On Wed, Mar 10, 2010 at 7:48 PM, Ted Yu wrote:

    Thanks to Andy for the log he provided.

    You can see from the log below that size increased steadily from
    341535057
    to 408181692, approaching maxSize. Then OOME:


    2010-03-10 18:38:32,936 INFO org.apache.hadoop.mapred.ReduceTask:
    reserve:
    pos=start requestedSize=3893000 size=341535057 numPendingRequests=0
    maxSize=417601952
    2010-03-10 18:38:32,936 INFO org.apache.hadoop.mapred.ReduceTask:
    reserve:
    pos=end requestedSize=3893000 size=345428057 numPendingRequests=0
    maxSize=417601952
    ...
    2010-03-10 18:38:35,950 INFO org.apache.hadoop.mapred.ReduceTask:
    reserve:
    pos=end requestedSize=635753 size=408181692 numPendingRequests=0
    maxSize=417601952
    2010-03-10 18:38:36,603 INFO org.apache.hadoop.mapred.ReduceTask: Task
    attempt_201003101826_0001_r_000004_0: Failed fetch #1 from
    attempt_201003101826_0001_m_000875_0

    2010-03-10 18:38:36,603 WARN org.apache.hadoop.mapred.ReduceTask:
    attempt_201003101826_0001_r_000004_0 adding host
    hd17.dfs.returnpath.net to penalty box, next contact in 4 seconds
    2010-03-10 18:38:36,604 INFO org.apache.hadoop.mapred.ReduceTask:
    attempt_201003101826_0001_r_000004_0: Got 1 map-outputs from previous
    failures
    2010-03-10 18:38:36,605 FATAL org.apache.hadoop.mapred.TaskRunner:
    attempt_201003101826_0001_r_000004_0 : Map output copy failure :
    java.lang.OutOfMemoryError: Java heap space
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1513)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1413)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1266)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1200)
    Looking at the call to unreserve() in ReduceTask, two were for
    IOException
    and the other was for Sanity check (line 1557). Meaning they wouldn't be
    called in normal execution path.

    I see one call in IFile.InMemoryReader close() method:
    // Inform the RamManager
    ramManager.unreserve(bufferSize);

    And InMemoryReader is used in createInMemorySegments():
    Reader<K, V> reader =
    new InMemoryReader<K, V>(ramManager, mo.mapAttemptId,
    mo.data, 0, mo.data.length);

    But I don't see reader.close() in ReduceTask file.
    On Wed, Mar 10, 2010 at 3:34 PM, Chris Douglas <chrisdo@yahoo-inc.com
    wrote:
    I don't think this OOM is a framework bug per se, and given the
    rewrite/refactoring of the shuffle in MAPREDUCE-318 (in 0.21), tuning
    the
    0.20 shuffle semantics is likely not worthwhile (though data informing
    improvements to trunk would be excellent). Most likely (and
    tautologically),
    ReduceTask simply requires more memory than is available and the job
    failure
    can be avoided by either 0) increasing the heap size or 1) lowering
    mapred.shuffle.input.buffer.percent. Most of the tasks we run have a
    heap of
    1GB. For a reduce fetching >200k map outputs, that's a reasonable, even
    stingy amount of space. -C


    On Mar 10, 2010, at 5:26 AM, Ted Yu wrote:

    I verified that size and maxSize are long. This means MR-1182 didn't
    resolve
    Andy's issue.

    According to Andy:
    At the beginning of the job there are 209,754 pending map tasks and 32
    pending reduce tasks

    My guess is that GC wasn't reclaiming memory fast enough, leading to
    OOME
    because of large number of in-memory shuffle candidates.

    My suggestion for Andy would be to:
    1. add -verbose:gc as a JVM parameter
    2. modify reserve() slightly to calculate the maximum outstanding
    numPendingRequests and print the maximum.

    Based on the output from the above two items, we can discuss a solution.
    My intuition is to place an upper bound on numPendingRequests beyond which
    canFitInMemory() returns false.

    My two cents.

    On Tue, Mar 9, 2010 at 11:51 PM, Christopher Douglas
    wrote:

    That section of code is unmodified in MR-1182. See the patches/svn
    log.
    -C

    Sent from my iPhone


    On Mar 9, 2010, at 7:44 PM, "Ted Yu" wrote:

    I just downloaded hadoop-0.20.2 tar ball from cloudera mirror.
    This is what I see in ReduceTask (line 999):
    public synchronized boolean reserve(int requestedSize, InputStream
    in)

    throws InterruptedException {
    // Wait till the request can be fulfilled...
    while ((size + requestedSize) > maxSize) {

    I don't see the fix from MR-1182.

    That's why I suggested to Andy that he manually apply MR-1182.

    Cheers

    On Tue, Mar 9, 2010 at 5:01 PM, Andy Sautins <
    andy.sautins@returnpath.net
    wrote:

    Thanks Christopher.
    The heap size for reduce tasks is configured to be 640M (
    mapred.child.java.opts set to -Xmx640m ).

    Andy

    -----Original Message-----
    From: Christopher Douglas
    Sent: Tuesday, March 09, 2010 5:19 PM
    To: common-user@hadoop.apache.org
    Subject: Re: Shuffle In Memory OutOfMemoryError

    No, MR-1182 is included in 0.20.2

    What heap size have you set for your reduce tasks? -C

    Sent from my iPhone

    On Mar 9, 2010, at 2:34 PM, "Ted Yu" wrote:

    Andy:
    You need to manually apply the patch.

    Cheers

    On Tue, Mar 9, 2010 at 2:23 PM, Andy Sautins <

    andy.sautins@returnpath.net wrote:
    Thanks Ted. My understanding is that MAPREDUCE-1182 is included
    in the
    0.20.2 release. We upgraded our cluster to 0.20.2 this weekend
    and
    re-ran
    the same job scenarios. Running with
    mapred.reduce.parallel.copies
    set to 1
    and continue to have the same Java heap space error.



    -----Original Message-----
    From: Ted Yu
    Sent: Tuesday, March 09, 2010 12:56 PM
    To: common-user@hadoop.apache.org
    Subject: Re: Shuffle In Memory OutOfMemoryError

    This issue has been resolved in
    http://issues.apache.org/jira/browse/MAPREDUCE-1182

    Please apply the patch
    M1182-1v20.patch<

    http://issues.apache.org/jira/secure/attachment/12424116/M1182-1v20.patch
    On Sun, Mar 7, 2010 at 3:57 PM, Andy Sautins <
    andy.sautins@returnpath.net
    wrote:

    Thanks Ted. Very helpful. You are correct that I misunderstood
    the

    code
    at ReduceTask.java:1535. I missed the fact that it's in a
    IOException

    catch
    block. My mistake. That's what I get for being in a rush.
    For what it's worth I did re-run the job with
    mapred.reduce.parallel.copies set with values from 5 all the way
    down to

    1.
    All failed with the same error:
    Error: java.lang.OutOfMemoryError: Java heap space
    at

    org.apache.hadoop.mapred.ReduceTask$ReduceCopier
    $MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)

    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier
    $MapOutputCopier.getMapOutput(ReduceTask.java:1408)

    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier
    $MapOutputCopier.copyOutput(ReduceTask.java:1261)

    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run
    (ReduceTask.java:1195)

    So from that it does seem like something else might be going on,
    yes?

    I
    need to do some more research.
    I appreciate your insights.

    Andy

    -----Original Message-----
    From: Ted Yu
    Sent: Sunday, March 07, 2010 3:38 PM
    To: common-user@hadoop.apache.org
    Subject: Re: Shuffle In Memory OutOfMemoryError

    My observation is based on this call chain:
    MapOutputCopier.run() calling copyOutput() calling
    getMapOutput()
    calling
    ramManager.canFitInMemory(decompressedLength)

    Basically ramManager.canFitInMemory() makes decision without
    considering
    the
    number of MapOutputCopiers that are running. Thus 1.25 * 0.7 of
    total

    heap
    may be used in shuffling if default parameters were used.
    Of course, you should check the value for
    mapred.reduce.parallel.copies

    to
    see if it is 5. If it is 4 or lower, my reasoning wouldn't
    apply.
    About ramManager.unreserve() call, ReduceTask.java from hadoop
    0.20.2

    only
    has 2731 lines. So I have to guess the location of the code
    snippet you
    provided.
    I found this around line 1535:
    } catch (IOException ioe) {
    LOG.info("Failed to shuffle from " +
    mapOutputLoc.getTaskAttemptId(),
    ioe);

    // Inform the ram-manager
    ramManager.closeInMemoryFile(mapOutputLength);
    ramManager.unreserve(mapOutputLength);

    // Discard the map-output
    try {
    mapOutput.discard();
    } catch (IOException ignored) {
    LOG.info("Failed to discard map-output from " +
    mapOutputLoc.getTaskAttemptId(), ignored);
    }
    Please confirm the line number.

    If we're looking at the same code, I am afraid I don't see how
    we
    can
    improve it. First, I assume IOException shouldn't happen that
    often.
    Second,
    mapOutput.discard() just sets:
    data = null;
    for in memory case. Even if we call mapOutput.discard() before
    ramManager.unreserve(), we don't know when GC would kick in and
    make more
    memory available.
    Of course, given the large number of map outputs in your system,
    it
    became
    more likely that the root cause from my reasoning made OOME
    happen
    sooner.
    Thanks


    On Sun, Mar 7, 2010 at 1:03 PM, Andy Sautins <
    andy.sautins@returnpath.net wrote:
    Ted,
    I'm trying to follow the logic in your mail and I'm not sure
    I'm
    following. If you would mind helping me understand I would
    appreciate

    it.
    Looking at the code maxSingleShuffleLimit is only used in
    determining

    if
    the copy _can_ fit into memory:
    boolean canFitInMemory(long requestedSize) {
    return (requestedSize < Integer.MAX_VALUE &&
    requestedSize < maxSingleShuffleLimit);
    }

    It also looks like the RamManager.reserve should wait until
    memory

    is
    available so it should hit a memory limit for that reason.
    What does seem a little strange to me is the following (

    ReduceTask.java
    starting at 2730 ):
    // Inform the ram-manager
    ramManager.closeInMemoryFile(mapOutputLength);
    ramManager.unreserve(mapOutputLength);

    // Discard the map-output
    try {
    mapOutput.discard();
    } catch (IOException ignored) {
    LOG.info("Failed to discard map-output from " +
    mapOutputLoc.getTaskAttemptId(), ignored);
    }
    mapOutput = null;

    So to me that looks like the ramManager unreserves the memory
    before

    the
    mapOutput is discarded. Shouldn't the mapOutput be discarded
    _before_

    the
    ramManager unreserves the memory? If the memory is unreserved
    before

    the
    actual underlying data references are removed then it seems like
    another
    thread can try to allocate memory ( ReduceTask.java:2730 )
    before
    the
    previous memory is disposed ( mapOutput.discard() ).

    Not sure that makes sense. One thing to note is that the
    particular

    job
    that is failing does have a good number ( 200k+ ) of map
    outputs. The

    large
    number of small map outputs may be why we are triggering a
    problem.

    Thanks again for your thoughts.

    Andy


    -----Original Message-----
    From: Jacob R Rideout
    Sent: Sunday, March 07, 2010 1:21 PM
    To: common-user@hadoop.apache.org
    Cc: Andy Sautins; Ted Yu
    Subject: Re: Shuffle In Memory OutOfMemoryError

    Ted,

    Thank you. I filed MAPREDUCE-1571 to cover this issue. I might
    have
    some time to write a patch later this week.

    Jacob Rideout

    On Sat, Mar 6, 2010 at 11:37 PM, Ted Yu <yuzhihong@gmail.com>
    wrote:

    I think there is mismatch (in ReduceTask.java) between:
    this.numCopiers =
    conf.getInt("mapred.reduce.parallel.copies",
    5);
    and:
    maxSingleShuffleLimit = (long)(maxSize *
    MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION);
    where MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION is 0.25f

    because
    copiers = new ArrayList<MapOutputCopier>(numCopiers);
    so the total memory allocated for in-mem shuffle is 1.25 *
    maxSize

    A JIRA should be filed to correlate the constant 5 above and
    MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION.

    Cheers

    On Sat, Mar 6, 2010 at 8:31 AM, Jacob R Rideout <

    apache@jacobrideout.net
    wrote:
    Hi all,
    We are seeing the following error in our reducers of a
    particular

    job:
    Error: java.lang.OutOfMemoryError: Java heap space
    at

    org.apache.hadoop.mapred.ReduceTask$ReduceCopier
    $MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)

    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier
    $MapOutputCopier.getMapOutput(ReduceTask.java:1408)

    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier
    $MapOutputCopier.copyOutput(ReduceTask.java:1261)

    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run
    (ReduceTask.java:1195)

    After enough reducers fail the entire job fails. This error
    occurs
    regardless of whether mapred.compress.map.output is true. We
    were

    able
    to avoid the issue by reducing
    mapred.job.shuffle.input.buffer.percent
    to 20%. Shouldn't the framework via
    ShuffleRamManager.canFitInMemory
    and ShuffleRamManager.reserve correctly detect the memory
    available for allocation? I would think that with poor
    configuration
    settings (and default settings in particular) the job may not
    be as
    efficient, but wouldn't die.

    Here is some more context in the logs, I have attached the
    full
    reducer log here: http://gist.github.com/323746


    2010-03-06 07:54:49,621 INFO
    org.apache.hadoop.mapred.ReduceTask:
    Shuffling 4191933 bytes (435311 raw bytes) into RAM from
    attempt_201003060739_0002_m_000061_0
    2010-03-06 07:54:50,222 INFO
    org.apache.hadoop.mapred.ReduceTask:

    Task
    attempt_201003060739_0002_r_000000_0: Failed fetch #1 from
    attempt_201003060739_0002_m_000202_0
    2010-03-06 07:54:50,223 WARN
    org.apache.hadoop.mapred.ReduceTask:
    attempt_201003060739_0002_r_000000_0 adding host
    hd37.dfs.returnpath.net to penalty box, next contact in 4
    seconds
    2010-03-06 07:54:50,223 INFO
    org.apache.hadoop.mapred.ReduceTask:
    attempt_201003060739_0002_r_000000_0: Got 1 map-outputs from

    previous
    failures
    2010-03-06 07:54:50,223 FATAL
    org.apache.hadoop.mapred.TaskRunner:
    attempt_201003060739_0002_r_000000_0 : Map output copy
    failure :
    java.lang.OutOfMemoryError: Java heap space
    at

    org.apache.hadoop.mapred.ReduceTask$ReduceCopier
    $MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)

    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier
    $MapOutputCopier.getMapOutput(ReduceTask.java:1408)

    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier
    $MapOutputCopier.copyOutput(ReduceTask.java:1261)

    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run
    (ReduceTask.java:1195)

    We tried this both in 0.20.1 and 0.20.2. We had hoped
    MAPREDUCE-1182
    would address the issue in 0.20.2, but it did not. Does
    anyone
    have
    any comments or suggestions? Is this a bug I should file a
    JIRA
    for?

    Jacob Rideout
    Return Path

  • Alex Kozlov at May 9, 2010 at 4:42 am
    In our case there were three things that helped:


    - reduce mapred.reduce.parallel.copies (and incur some execution time
    penalty on large clusters)
    - increase the # of reducers (and just have a larger # of output files)
    - decrease the amount of map output (may or may not be possible in your
    case); see the sketch after this list
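
    Expressed as 0.20 job configuration, the three knobs map roughly onto the
    settings below; the values are placeholders, and shrinking the map output
    is shown only as a commented-out combiner since the right approach is
    job-specific (MyCombiner is a hypothetical class name):

    import org.apache.hadoop.mapred.JobConf;

    public class ReduceShufflePressure {
        public static void main(String[] args) {
            JobConf conf = new JobConf();
            // fewer parallel fetch threads per reduce
            conf.setInt("mapred.reduce.parallel.copies", 2);
            // more reduces, so each one shuffles a smaller share of the map output
            conf.setNumReduceTasks(64);
            // shrink the map output itself, e.g. with a job-specific combiner
            // conf.setCombinerClass(MyCombiner.class);
        }
    }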


    Alex K

    On Fri, May 7, 2010 at 8:25 PM, Ted Yu wrote:

    You need to lower mapred.job.shuffle.input.buffer.percent to 20% or 25%.
    I didn't have time recently to find the root cause in 0.20.2

    I was told that shuffle has been rewritten in 0.21
    You may give it a try.
    On Fri, May 7, 2010 at 8:08 PM, Bo Shi wrote:

    Hey Ted, any further insights on this? We're encountering a similar
    issue (on CD2). I'll be applying MAPREDUCE-1182 to see if that
    resolves our case but it sounds like that JIRA didn't completely
    eliminate the problem for some folks.
    On Wed, Mar 10, 2010 at 11:54 PM, Ted Yu wrote:
    I pressed the send key a bit early.

    I will have to dig a bit deeper.
    Hopefully someone can find reader.close() call after which I will look for
    another possible root cause :-)

    On Wed, Mar 10, 2010 at 7:48 PM, Ted Yu wrote:

    Thanks to Andy for the log he provided.

    You can see from the log below that size increased steadily from
    341535057
    to 408181692, approaching maxSize. Then OOME:


    2010-03-10 18:38:32,936 INFO org.apache.hadoop.mapred.ReduceTask:
    reserve:
    pos=start requestedSize=3893000 size=341535057 numPendingRequests=0
    maxSize=417601952
    2010-03-10 18:38:32,936 INFO org.apache.hadoop.mapred.ReduceTask:
    reserve:
    pos=end requestedSize=3893000 size=345428057 numPendingRequests=0
    maxSize=417601952
    ...
    2010-03-10 18:38:35,950 INFO org.apache.hadoop.mapred.ReduceTask:
    reserve:
    pos=end requestedSize=635753 size=408181692 numPendingRequests=0
    maxSize=417601952
    2010-03-10 18:38:36,603 INFO org.apache.hadoop.mapred.ReduceTask: Task
    attempt_201003101826_0001_r_000004_0: Failed fetch #1 from
    attempt_201003101826_0001_m_000875_0

    2010-03-10 18:38:36,603 WARN org.apache.hadoop.mapred.ReduceTask:
    attempt_201003101826_0001_r_000004_0 adding host
    hd17.dfs.returnpath.net to penalty box, next contact in 4 seconds
    2010-03-10 18:38:36,604 INFO org.apache.hadoop.mapred.ReduceTask:
    attempt_201003101826_0001_r_000004_0: Got 1 map-outputs from previous
    failures
    2010-03-10 18:38:36,605 FATAL org.apache.hadoop.mapred.TaskRunner:
    attempt_201003101826_0001_r_000004_0 : Map output copy failure :
    java.lang.OutOfMemoryError: Java heap space
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1513)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1413)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1266)
    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1200)
    Looking at the call to unreserve() in ReduceTask, two were for
    IOException
    and the other was for Sanity check (line 1557). Meaning they wouldn't
    be
    called in normal execution path.

    I see one call in IFile.InMemoryReader close() method:
    // Inform the RamManager
    ramManager.unreserve(bufferSize);

    And InMemoryReader is used in createInMemorySegments():
    Reader<K, V> reader =
    new InMemoryReader<K, V>(ramManager, mo.mapAttemptId,
    mo.data, 0, mo.data.length);

    But I don't see reader.close() in ReduceTask file.
    On Wed, Mar 10, 2010 at 3:34 PM, Chris Douglas <chrisdo@yahoo-inc.com
    wrote:
    I don't think this OOM is a framework bug per se, and given the
    rewrite/refactoring of the shuffle in MAPREDUCE-318 (in 0.21), tuning
    the
    0.20 shuffle semantics is likely not worthwhile (though data
    informing
    improvements to trunk would be excellent). Most likely (and
    tautologically),
    ReduceTask simply requires more memory than is available and the job
    failure
    can be avoided by either 0) increasing the heap size or 1) lowering
    mapred.shuffle.input.buffer.percent. Most of the tasks we run have a
    heap of
    1GB. For a reduce fetching >200k map outputs, that's a reasonable,
    even
    stingy amount of space. -C


    On Mar 10, 2010, at 5:26 AM, Ted Yu wrote:

    I verified that size and maxSize are long. This means MR-1182 didn't
    resolve
    Andy's issue.

    According to Andy:
    At the beginning of the job there are 209,754 pending map tasks and
    32
    pending reduce tasks

    My guess is that GC wasn't reclaiming memory fast enough, leading to
    OOME
    because of large number of in-memory shuffle candidates.

    My suggestion for Andy would be to:
    1. add -verbose:gc as a JVM parameter
    2. modify reserve() slightly to calculate the maximum outstanding
    numPendingRequests and print the maximum.

    Based on the output from the above two items, we can discuss a solution.
    My intuition is to place an upper bound on numPendingRequests beyond which
    canFitInMemory() returns false.

    My two cents.

    On Tue, Mar 9, 2010 at 11:51 PM, Christopher Douglas
    wrote:

    That section of code is unmodified in MR-1182. See the patches/svn
    log.
    -C

    Sent from my iPhone


    On Mar 9, 2010, at 7:44 PM, "Ted Yu" wrote:

    I just downloaded hadoop-0.20.2 tar ball from cloudera mirror.
    This is what I see in ReduceTask (line 999):
    public synchronized boolean reserve(int requestedSize,
    InputStream
    in)

    throws InterruptedException {
    // Wait till the request can be fulfilled...
    while ((size + requestedSize) > maxSize) {

    I don't see the fix from MR-1182.

    That's why I suggested to Andy that he manually apply MR-1182.

    Cheers

    On Tue, Mar 9, 2010 at 5:01 PM, Andy Sautins <
    andy.sautins@returnpath.net
    wrote:

    Thanks Christopher.
    The heap size for reduce tasks is configured to be 640M (
    mapred.child.java.opts set to -Xmx640m ).

    Andy

    -----Original Message-----
    From: Christopher Douglas
    Sent: Tuesday, March 09, 2010 5:19 PM
    To: common-user@hadoop.apache.org
    Subject: Re: Shuffle In Memory OutOfMemoryError

    No, MR-1182 is included in 0.20.2

    What heap size have you set for your reduce tasks? -C

    Sent from my iPhone

    On Mar 9, 2010, at 2:34 PM, "Ted Yu" wrote:

    Andy:
    You need to manually apply the patch.

    Cheers

    On Tue, Mar 9, 2010 at 2:23 PM, Andy Sautins <

    andy.sautins@returnpath.net wrote:
    Thanks Ted. My understanding is that MAPREDUCE-1182 is
    included
    in the
    0.20.2 release. We upgraded our cluster to 0.20.2 this weekend
    and
    re-ran
    the same job scenarios. Running with
    mapred.reduce.parallel.copies
    set to 1
    and continue to have the same Java heap space error.



    -----Original Message-----
    From: Ted Yu
    Sent: Tuesday, March 09, 2010 12:56 PM
    To: common-user@hadoop.apache.org
    Subject: Re: Shuffle In Memory OutOfMemoryError

    This issue has been resolved in
    http://issues.apache.org/jira/browse/MAPREDUCE-1182

    Please apply the patch
    M1182-1v20.patch<

    http://issues.apache.org/jira/secure/attachment/12424116/M1182-1v20.patch
    On Sun, Mar 7, 2010 at 3:57 PM, Andy Sautins <
    andy.sautins@returnpath.net
    wrote:
    Thanks Ted. Very helpful. You are correct that I
    misunderstood
    the

    code
    at ReduceTask.java:1535. I missed the fact that it's in a
    IOException

    catch
    block. My mistake. That's what I get for being in a rush.
    For what it's worth I did re-run the job with
    mapred.reduce.parallel.copies set with values from 5 all the
    way
    down to

    1.
    All failed with the same error:
    Error: java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)

    So from that it does seem like something else might be going on, yes? I
    need to do some more research.
    I appreciate your insights.

    Andy

    -----Original Message-----
    From: Ted Yu
    Sent: Sunday, March 07, 2010 3:38 PM
    To: common-user@hadoop.apache.org
    Subject: Re: Shuffle In Memory OutOfMemoryError

    My observation is based on this call chain:
    MapOutputCopier.run() calling copyOutput() calling
    getMapOutput()
    calling
    ramManager.canFitInMemory(decompressedLength)

    Basically ramManager.canFitInMemory() makes its decision without
    considering the number of MapOutputCopiers that are running. Thus
    1.25 * 0.7 of the total heap may be used in shuffling if default
    parameters were used.
    Of course, you should check the value of mapred.reduce.parallel.copies
    to see if it is 5. If it is 4 or lower, my reasoning wouldn't apply.
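    To put numbers on that estimate, here is a small worked example using the
    640m reduce-task heap mentioned in this thread and the defaults cited above
    (0.70 shuffle buffer fraction, 0.25 single-segment fraction, 5 copiers);
    a sketch of the arithmetic only, not the actual ReduceTask code.

    public class ShuffleMath {
      public static void main(String[] args) {
        long heapBytes = 640L * 1024 * 1024;                  // -Xmx640m
        long maxSize = (long) (heapBytes * 0.70);             // in-memory shuffle buffer (~448 MB)
        long maxSingleShuffleLimit = (long) (maxSize * 0.25); // one segment's cap (~112 MB)
        int copiers = 5;                                      // mapred.reduce.parallel.copies default
        // Worst case in the reasoning above: every copier holds a near-maximum
        // segment at once, so in-flight shuffle data approaches
        // 5 * 0.25 * maxSize = 1.25 * maxSize, about 0.875 of the whole heap.
        System.out.println("maxSize               = " + maxSize);
        System.out.println("maxSingleShuffleLimit = " + maxSingleShuffleLimit);
        System.out.println("worst-case in-flight  = " + copiers * maxSingleShuffleLimit);
      }
    }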
    About the ramManager.unreserve() call: ReduceTask.java from Hadoop 0.20.2
    only has 2731 lines, so I have to guess the location of the code snippet
    you provided.
    I found this around line 1535:
    } catch (IOException ioe) {
      LOG.info("Failed to shuffle from " + mapOutputLoc.getTaskAttemptId(),
               ioe);

      // Inform the ram-manager
      ramManager.closeInMemoryFile(mapOutputLength);
      ramManager.unreserve(mapOutputLength);

      // Discard the map-output
      try {
        mapOutput.discard();
      } catch (IOException ignored) {
        LOG.info("Failed to discard map-output from " +
                 mapOutputLoc.getTaskAttemptId(), ignored);
      }
    Please confirm the line number.

    If we're looking at the same code, I am afraid I don't see how we can
    improve it. First, I assume IOException shouldn't happen that often.
    Second, mapOutput.discard() just sets:
    data = null;
    for the in-memory case. Even if we call mapOutput.discard() before
    ramManager.unreserve(), we don't know when GC would kick in and make
    more memory available.
    Of course, given the large number of map outputs in your system, it
    becomes more likely that the root cause from my reasoning makes the OOME
    happen sooner.
    Thanks


    On Sun, Mar 7, 2010 at 1:03 PM, Andy Sautins <andy.sautins@returnpath.net> wrote:
    Ted,
    I'm trying to follow the logic in your mail and I'm not sure I'm
    following. If you wouldn't mind helping me understand, I would
    appreciate it.
    Looking at the code, maxSingleShuffleLimit is only used in determining if
    the copy _can_ fit into memory:
    boolean canFitInMemory(long requestedSize) {
      return (requestedSize < Integer.MAX_VALUE &&
              requestedSize < maxSingleShuffleLimit);
    }

    It also looks like RamManager.reserve should wait until memory is
    available, so it shouldn't hit a memory limit for that reason.
    What does seem a little strange to me is the following (ReduceTask.java
    starting at 2730):
    // Inform the ram-manager
    ramManager.closeInMemoryFile(mapOutputLength);
    ramManager.unreserve(mapOutputLength);

    // Discard the map-output
    try {
      mapOutput.discard();
    } catch (IOException ignored) {
      LOG.info("Failed to discard map-output from " +
               mapOutputLoc.getTaskAttemptId(), ignored);
    }
    mapOutput = null;

    So to me that looks like the ramManager unreserves the memory before the
    mapOutput is discarded. Shouldn't the mapOutput be discarded _before_ the
    ramManager unreserves the memory? If the memory is unreserved before the
    actual underlying data references are removed, then it seems like another
    thread can try to allocate memory (ReduceTask.java:2730) before the
    previous memory is disposed (mapOutput.discard()).
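    The reordering being asked about here would look roughly like the
    following, reusing the identifiers from the snippet quoted above; this is
    just a sketch of the idea, not a tested patch.

    // Discard the map-output first, so the in-memory data reference is
    // dropped before the accounting is released (identifiers as in the
    // quoted ReduceTask snippet above).
    try {
      mapOutput.discard();
    } catch (IOException ignored) {
      LOG.info("Failed to discard map-output from " +
               mapOutputLoc.getTaskAttemptId(), ignored);
    }
    mapOutput = null;

    // Only then inform the ram-manager that the memory is free
    ramManager.closeInMemoryFile(mapOutputLength);
    ramManager.unreserve(mapOutputLength);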

    Not sure that makes sense. One thing to note is that the particular job
    that is failing does have a good number (200k+) of map outputs. The large
    number of small map outputs may be why we are triggering a problem.

    Thanks again for your thoughts.

    Andy


    -----Original Message-----
    From: Jacob R Rideout
    Sent: Sunday, March 07, 2010 1:21 PM
    To: common-user@hadoop.apache.org
    Cc: Andy Sautins; Ted Yu
    Subject: Re: Shuffle In Memory OutOfMemoryError

    Ted,

    Thank you. I filed MAPREDUCE-1571 to cover this issue. I might have
    some time to write a patch later this week.

    Jacob Rideout

    On Sat, Mar 6, 2010 at 11:37 PM, Ted Yu <yuzhihong@gmail.com> wrote:

    I think there is mismatch (in ReduceTask.java) between:
    this.numCopiers =
    conf.getInt("mapred.reduce.parallel.copies",
    5);
    and:
    maxSingleShuffleLimit = (long)(maxSize *
    MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION);
    where MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION is 0.25f

    because
    copiers = new ArrayList<MapOutputCopier>(numCopiers);
    so the total memory allocated for in-mem shuffle is 1.25 *
    maxSize

    A JIRA should be filed to correlate the constant 5 above and
    MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION.

    Cheers

    On Sat, Mar 6, 2010 at 8:31 AM, Jacob R Rideout <

    apache@jacobrideout.net
    wrote:
    Hi all,
    We are seeing the following error in our reducers of a
    particular

    job:
    Error: java.lang.OutOfMemoryError: Java heap space
    at

    org.apache.hadoop.mapred.ReduceTask$ReduceCopier
    $MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)

    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier
    $MapOutputCopier.getMapOutput(ReduceTask.java:1408)

    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier
    $MapOutputCopier.copyOutput(ReduceTask.java:1261)

    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run
    (ReduceTask.java:1195)

    After enough reducers fail the entire job fails. This error
    occurs
    regardless of whether mapred.compress.map.output is true.
    We
    were

    able
    to avoid the issue by reducing
    mapred.job.shuffle.input.buffer.percent
    to 20%. Shouldn't the framework via
    ShuffleRamManager.canFitInMemory
    and.ShuffleRamManager.reserve correctly detect the the
    memory
    available for allocation? I would think that with poor
    configuration
    settings (and default settings in particular) the job may
    not
    be as
    efficient, but wouldn't die.

    Here is some more context in the logs, I have attached the
    full
    reducer log here: http://gist.github.com/323746


    2010-03-06 07:54:49,621 INFO
    org.apache.hadoop.mapred.ReduceTask:
    Shuffling 4191933 bytes (435311 raw bytes) into RAM from
    attempt_201003060739_0002_m_000061_0
    2010-03-06 07:54:50,222 INFO
    org.apache.hadoop.mapred.ReduceTask:

    Task
    attempt_201003060739_0002_r_000000_0: Failed fetch #1 from
    attempt_201003060739_0002_m_000202_0
    2010-03-06 07:54:50,223 WARN
    org.apache.hadoop.mapred.ReduceTask:
    attempt_201003060739_0002_r_000000_0 adding host
    hd37.dfs.returnpath.net to penalty box, next contact in 4
    seconds
    2010-03-06 07:54:50,223 INFO
    org.apache.hadoop.mapred.ReduceTask:
    attempt_201003060739_0002_r_000000_0: Got 1 map-outputs
    from
    previous
    failures
    2010-03-06 07:54:50,223 FATAL
    org.apache.hadoop.mapred.TaskRunner:
    attempt_201003060739_0002_r_000000_0 : Map output copy
    failure :
    java.lang.OutOfMemoryError: Java heap space
    at

    org.apache.hadoop.mapred.ReduceTask$ReduceCopier
    $MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)

    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier
    $MapOutputCopier.getMapOutput(ReduceTask.java:1408)

    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier
    $MapOutputCopier.copyOutput(ReduceTask.java:1261)

    at
    org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run
    (ReduceTask.java:1195)

    We tried this both in 0.20.1 and 0.20.2. We had hoped
    MAPREDUCE-1182
    would address the issue in 0.20.2, but it did not. Does
    anyone
    have
    any comments or suggestions? Is this a bug I should file a
    JIRA
    for?

    Jacob Rideout
    Return Path
