FAQ
I am running a Hadoop program that processes terabyte-scale data. The code
was tested successfully on a small sample (100G) and worked. However,
when I try it on the full problem, the program freezes forever at Map
99% Reduce 33%. There is no error, and the userlog folder stays small
(<30M); if something had gone wrong it would contain gigabytes of error logs.

I checked the mapper and reducer logs; it seems that the reducer is
waiting for a map output that never arrives. What could be causing this?
Most of the configuration is left at the defaults; I only set
"mapred.child.java.opts=-Xmx2000M" and
"hadoop.job.history.user.location=none". The problem occurs on both
0.19.2 and 0.20.2. Thanks!
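
For reference, those two options could be set either with -D on the command
line or in the job driver; a minimal, illustrative fragment (not the actual
driver; MyJob is a placeholder class, old 0.19/0.20 JobConf API assumed):

import org.apache.hadoop.mapred.JobConf;

JobConf conf = new JobConf(MyJob.class);               // MyJob is a placeholder
conf.set("mapred.child.java.opts", "-Xmx2000M");       // heap for each map/reduce child JVM
conf.set("hadoop.job.history.user.location", "none");  // suppress per-user job history files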

Example of Mapper logs:

2011-03-24 12:37:22,775 INFO org.apache.hadoop.mapred.Merger: Merging 3
sorted segments
2011-03-24 12:37:22,776 INFO org.apache.hadoop.mapred.Merger: Down to
the last merge-pass, with 3 segments left of total size: 461743 bytes
2011-03-24 12:37:22,885 INFO org.apache.hadoop.mapred.MapTask: Index:
(11015008, 10030254, 607594)
2011-03-24 12:37:22,889 INFO org.apache.hadoop.mapred.TaskRunner:
Task:attempt_201103231501_0007_m_000286_0 is done. And is in the process
of commiting
2011-03-24 12:37:22,897 INFO org.apache.hadoop.mapred.TaskRunner: Task
'attempt_201103231501_0007_m_000286_0' done.

Example of Reducer logs:

2011-03-24 13:50:18,484 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201103231501_0007_r_000018_0: Got 0 new map-outputs
2011-03-24 13:50:18,484 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201103231501_0007_r_000018_0 Scheduled 0 outputs (0 slow hosts
and0 dup hosts)
2011-03-24 13:51:18,544 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201103231501_0007_r_000018_0 Need another 1 map output(s) where
0 is already in progress
2011-03-24 13:51:18,545 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201103231501_0007_r_000018_0: Got 0 new map-outputs
2011-03-24 13:51:18,545 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201103231501_0007_r_000018_0 Scheduled 0 outputs (0 slow hosts
and0 dup hosts)


Shi


  • Kevin Leach at Mar 24, 2011 at 7:27 pm
    Shi,

    The key here is the mapper stuck at 99%: nothing can move on until all
    mappers complete.
    Is it possible that your data in the larger set has an incomplete record,
    or something similar, at the end?

    Kevin

  • Shi Yu at Mar 24, 2011 at 7:54 pm
    Hi Kevin,

    Thanks for the reply. I can hardly imagine an incomplete record here.
    The mapper is very simple: it reads each line as a String, splits it by
    tab, and outputs a TextPair for sort and secondary sort. If there were an
    incomplete record, there should be an error, and the only place it could
    happen is the tab-splitting stage. I use the LZO codec to compress the
    mapper output and the reducer output.
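
    As a rough illustration only (not the actual code), such a mapper on the
    old 0.19/0.20 API might look like the sketch below; TextPair stands for
    the poster's own composite-key Writable, not a stock Hadoop class:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class TabSplitMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, TextPair, Text> {
      // Reads one tab-delimited line and emits a (TextPair, Text) record so the
      // natural key and the secondary-sort field travel together to the reducer.
      public void map(LongWritable offset, Text line,
                      OutputCollector<TextPair, Text> output, Reporter reporter)
          throws IOException {
        String[] fields = line.toString().split("\t");
        if (fields.length < 2) {
          return;                           // silently skip malformed records
        }
        output.collect(new TextPair(fields[0], fields[1]), new Text(fields[1]));
      }
    }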

    There is one issue that I think might be the reason. The log directory
    has the following structure; it seems I had 294 mappers. Notice that one
    directory is missing, "attempt_201103231501_0007_m_000292_0". Where is
    the output of mapper no. 292? Was it a failed node or something?


    All the logs have just been sitting there for 2 hours (most timestamps
    are ~12:35, while the current time is 14:50; a few folders were touched
    by me around 14:45). So neither mappers nor reducers have generated any
    logs in the past two hours.

    Shi

    drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:36
    attempt_201103231501_0007_m_000281_0
    drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:37
    attempt_201103231501_0007_m_000282_0
    drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:37
    attempt_201103231501_0007_m_000283_0
    drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:35
    attempt_201103231501_0007_m_000284_0
    drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:37
    attempt_201103231501_0007_m_000285_0
    drwxr-xr-x 2 sheeyu users 61 2011-03-24 13:59
    attempt_201103231501_0007_m_000286_0
    drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:37
    attempt_201103231501_0007_m_000287_0
    drwxr-xr-x 2 sheeyu users 61 2011-03-24 14:44
    attempt_201103231501_0007_m_000288_0
    drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:37
    attempt_201103231501_0007_m_000289_0
    drwxr-xr-x 2 sheeyu users 85 2011-03-24 12:37
    attempt_201103231501_0007_m_000289_1
    drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:22
    attempt_201103231501_0007_m_000290_0
    drwxr-xr-x 2 sheeyu users 61 2011-03-24 14:45
    attempt_201103231501_0007_m_000291_0
    drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07
    attempt_201103231501_0007_m_000293_0
    drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07
    attempt_201103231501_0007_r_000000_0
    drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07
    attempt_201103231501_0007_r_000001_0
    drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07
    attempt_201103231501_0007_r_000002_0
    drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07
    attempt_201103231501_0007_r_000003_0
    drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07
    attempt_201103231501_0007_r_000004_0
    drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07
    attempt_201103231501_0007_r_000005_0
    drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07
    attempt_201103231501_0007_r_000006_0
    drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07
    attempt_201103231501_0007_r_000007_0
    drwxr-xr-x 2 sheeyu users 61 2011-03-24 14:47
    attempt_201103231501_0007_r_000008_0
    drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07
    attempt_201103231501_0007_r_000009_0
    drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07
    attempt_201103231501_0007_r_000010_0
    drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07
    attempt_201103231501_0007_r_000011_0
    drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07
    attempt_201103231501_0007_r_000012_0
    drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07
    attempt_201103231501_0007_r_000013_0
    drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07
    attempt_201103231501_0007_r_000014_0
    drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07
    attempt_201103231501_0007_r_000015_0
    drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07
    attempt_201103231501_0007_r_000016_0
    drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07
    attempt_201103231501_0007_r_000017_0
    drwxr-xr-x 2 sheeyu users 61 2011-03-24 14:42
    attempt_201103231501_0007_r_000018_0

    Shi
  • Kevin Leach at Mar 24, 2011 at 9:04 pm
    Shi,

    This page states "Of course, the framework discards the sub-directory of
    unsuccessful task-attempts.":
    http://hadoop-karma.blogspot.com/2011/01/hadoop-cookbook-how-to-write.html

    So yes, the missing directory is likely a failure.

    If you can, narrow the problem down by running sections of your data on a
    smaller number of mappers. Look at just the tail of your input with a
    couple of mappers, and then compare that to a run of your program without
    Hadoop.

    Kevin

  • Shi Yu at Mar 24, 2011 at 9:53 pm
    Hi Kevin,

    Thanks for the suggestion. I think I found the problem. My code is a
    chained map/reduce, and the previous iteration produced a .lzo_deflate
    output file that is 40 times larger than the other files. That is because
    of one special "key" value, which occurs far more often than the other
    keys. I used a self-defined partitioner:

    public int getPartition(TextPair key, Text value, int numPartitions) {
        return (key.getFirst().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    Maybe all the occurrences of this key were sent to a single reducer,
    which produced that huge compressed file. Moreover, the .lzo_deflate file
    is not splittable, is it? I removed the huge file and reran the current
    task, and it worked smoothly. I didn't notice this problem on the small
    test because the output was not compressed in that run.

    Now I am trying to append a random string to that special key: the
    previous outputs are only intermediate input for the next task, so there
    is no reason to force all of them onto a single reducer.
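
    For illustration, one common way to do that is to append a small random
    salt to the hot key on the map side, so that the hash partitioner above
    spreads it over several reducers, and strip the salt again in the next
    job's mapper. A sketch with placeholder names (SALT_BUCKETS and
    "THE_HOT_KEY" are illustrative, not from this job):

    public class KeySalter {
      private static final int SALT_BUCKETS = 16;           // fan the hot key out over 16 reducers
      private final java.util.Random random = new java.util.Random();

      public String salt(String key) {
        if ("THE_HOT_KEY".equals(key)) {
          return key + "#" + random.nextInt(SALT_BUCKETS);  // e.g. THE_HOT_KEY#7
        }
        return key;                                          // ordinary keys are untouched
      }

      public static String unsalt(String key) {              // applied in the next job's mapper
        int idx = key.lastIndexOf('#');
        return idx < 0 ? key : key.substring(0, idx);
      }
    }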

    Shi

  • Kevin Leach at Mar 25, 2011 at 1:41 am
    Good. Data skew should not make the job look stuck, though. Try sending
    status updates so that at least you can tell one mapper is still busy.
    Yes, adding extra data to the key, or including another field in it, can
    help reduce the data skew.
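
    For illustration, on the old 0.19/0.20 API those status updates go
    through the Reporter passed to map(); a minimal fragment (variable names
    are placeholders):

    // Inside the map() method: keep the TaskTracker informed so a long-running
    // attempt shows up as busy rather than hung (and is not killed by the task timeout).
    reporter.setStatus("processing split offset " + offset.get());
    reporter.progress();                                   // heartbeat
    reporter.incrCounter("app", "records processed", 1);   // visible in the job UI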

    Kevin

