FAQ
As this topic comes up reasonably often on the list, I thought others might
be interested in this:

http://arstechnica.com/business/news/2009/10/dram-study-turns-assumptions-about-errors-upside-down.ars?utm_source=rss&utm_medium=rss&utm_campaign=rss
Basically, the takeaway is that RAM errors are pretty common, especially on
machines under high load (like Hadoop clusters). ECC RAM is important to
catch them, and the error rate gets way worse at 20 months.

-Todd


  • Arvind Sharma at Dec 2, 2009 at 4:03 pm
    I have seen similar error logs in the Hadoop JIRA (HADOOP-2691, HDFS-795), but I am not sure this one is exactly the same scenario.

    Hadoop - 0.19.2

    The client-side DFSClient fails to write when a few of the DataNodes (DNs) in the grid go down. I see this error:

    ***************************

    2009-11-13 13:45:27,815 WARN DFSClient | DFSOutputStream ResponseProcessor exception for block blk_3028932254678171367_1462691 java.io.IOException: Bad response 1 for block blk_3028932254678171367_1462691 from datanode 10.201.9.225:50010
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2341)
    2009-11-13 13:45:27,815 WARN DFSClient | Error Recovery for block blk_3028932254678171367_1462691 bad datanode[2] 10.201.9.225:50010
    2009-11-13 13:45:27,815 WARN DFSClient | Error Recovery for block blk_3028932254678171367_1462691 in pipeline 10.201.9.218:50010, 10.201.9.220:50010, 10.201.9.225:50010: bad datanode 10.201.9.225:50010
    2009-11-13 13:45:37,433 WARN DFSClient | DFSOutputStream ResponseProcessor exception for block blk_-6619123912237837733_1462799 java.io.IOException: Bad response 1 for block blk_-6619123912237837733_1462799 from datanode 10.201.9.225:50010
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2341)
    2009-11-13 13:45:37,433 WARN DFSClient | Error Recovery for block blk_-6619123912237837733_1462799 bad datanode[1] 10.201.9.225:50010
    2009-11-13 13:45:37,433 WARN DFSClient | Error Recovery for block blk_-6619123912237837733_1462799 in pipeline 10.201.9.218:50010, 10.201.9.225:50010: bad datanode 10.201.9.225:50010

    ***************************

    The only way I could get my client program to write successfully to the DFS was to restart it.

    Any suggestions on how to get around this problem on the client side? As I understood it, the DFSClient API takes care of situations like this, so clients should not need to worry when some of the DNs go down.

    Also, the replication factor is 3 in my setup and there are 10 DNs (out of which two went down).


    Thanks!
    Arvind
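
    A possible client-side stopgap, sketched by analogy with the restart workaround described above rather than taken from the thread: catch the write failure, close the broken stream, wait for the NameNode to notice the dead DataNodes, and recreate the file from scratch. The path handling, retry count, and delay below are illustrative assumptions, and the write restarts from the beginning of the file on every attempt.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RetryingDfsWriter {

        // Illustrative values; tune for your cluster.
        private static final int MAX_ATTEMPTS = 3;
        private static final long RETRY_DELAY_MS = 10000L;

        /** Writes the whole payload, recreating the file if the write pipeline fails. */
        public static void writeWithRetry(Configuration conf, Path path, byte[] data)
                throws IOException, InterruptedException {
            IOException lastError = null;
            for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
                // Note: FileSystem.get() returns a cached instance; restarting the whole
                // client process (the workaround above) also discards that cached state.
                FileSystem fs = FileSystem.get(conf);
                FSDataOutputStream out = null;
                try {
                    out = fs.create(path, true);  // overwrite on every attempt
                    out.write(data);
                    out.close();                  // close() finalizes the block pipeline
                    return;                       // success
                } catch (IOException e) {
                    lastError = e;
                    if (out != null) {
                        try { out.close(); } catch (IOException ignored) { }
                    }
                    // Give the NameNode time to mark the dead DataNodes before retrying.
                    Thread.sleep(RETRY_DELAY_MS);
                }
            }
            throw lastError;
        }
    }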
  • Arvind Sharma at Dec 4, 2009 at 12:50 pm
    Any suggestions would be welcome :-)

    Arvind






    ________________________________
    From: Arvind Sharma <arvind321@yahoo.com>
    To: common-user@hadoop.apache.org
    Sent: Wed, December 2, 2009 8:02:39 AM
    Subject: DFSClient write error when DN down



  • Todd Lipcon at Dec 4, 2009 at 4:36 pm
    Hi Arvind,

    Looks to me like you've identified the JIRAs that are causing this.
    Hopefully they will be fixed soon.

    -Todd
    On Fri, Dec 4, 2009 at 4:43 AM, Arvind Sharma wrote:

    Any suggestions would be welcome :-)

    Arvind








  • Arvind Sharma at Dec 4, 2009 at 5:01 pm
    Thanks, Todd!

    Just wanted another confirmation I guess :-)

    Arvind





  • Edward Capriolo at Dec 4, 2009 at 5:04 pm

    I will give you another confirmation...

    This has happened on my dev cluster (5 nodes). I was running an 0.18.3 cluster
    at the time. Replication was set to 3, and 2 nodes went down. I did not
    look into this very deeply. My hunch was that new files had been created
    by a map/reduce program and were replicated only to the two
    nodes that went down. This caused the job to die, and the file system
    was not 'right' until I brought the two DataNodes back online. FSCK
    did not think anything was wrong with the filesystem, and everything
    that did not touch the parent paths those files were in was fine.
    However, M/R apps that tried to use the parent path failed. I tried
    restarting the NameNode and all DataNodes, to no avail.

    In that case, I just did whatever I could to bring those DataNodes back
    up. Even if you can bring them up without much storage, the act of
    bringing them up cleared the issue. Sorry for the off-the-cuff,
    unconfirmed description. This event only happened to me once, so I
    never looked into it again. If it is any consolation, signs point to
    it not happening very often.

    Edward
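
    A small diagnostic sketch (not from the thread) for the situation Edward describes: if you suspect that every replica of some blocks sat on the DataNodes that went down, the FileSystem API can list which hosts hold each block of a file. The directory path below is a placeholder.

    import java.util.Arrays;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationReport {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Illustrative path; point this at the parent directory the failing jobs read.
            for (FileStatus status : fs.listStatus(new Path("/data/suspect-dir"))) {
                BlockLocation[] blocks =
                    fs.getFileBlockLocations(status, 0, status.getLen());
                for (BlockLocation block : blocks) {
                    // Hosts currently holding replicas of this block; an empty list means
                    // every replica was on a DataNode that is now unavailable.
                    System.out.println(status.getPath() + " offset " + block.getOffset()
                        + " -> " + Arrays.toString(block.getHosts()));
                }
            }
        }
    }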
  • Madhur Khandelwal at Dec 4, 2009 at 8:30 pm
    Hi all,

    I have a 3-node cluster running a Hadoop (0.20.1) job. I am noticing the
    following exception during the shuffle phase, because of which the TaskTracker
    on one of the nodes is getting blacklisted (after 4 occurrences of the
    exception). I have the config set to run 8 maps and 8 reduces simultaneously,
    and the rest of the settings are left at their defaults. Any pointers would be helpful.

    2009-12-04 01:04:36,237 INFO org.apache.hadoop.mapred.ReduceTask: Failed to shuffle from attempt_200912031748_0002_m_000035_0
    java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:129)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
        at sun.net.www.http.ChunkedInputStream.fastRead(ChunkedInputStream.java:221)
        at sun.net.www.http.ChunkedInputStream.read(ChunkedInputStream.java:662)
        at java.io.FilterInputStream.read(FilterInputStream.java:116)
        at sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(HttpURLConnection.java:2391)
        at org.apache.hadoop.mapred.IFileInputStream.doRead(IFileInputStream.java:149)
        at org.apache.hadoop.mapred.IFileInputStream.read(IFileInputStream.java:101)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1522)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)

    Here is the error message on the web jobtracker UI:
    java.io.IOException: Task process exit with nonzero status of 137.
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)

    Around the same time, the tasktracker log has the following WARN messages:
    2009-12-04 01:09:19,051 WARN org.apache.hadoop.ipc.Server: IPC Server Responder, call ping(attempt_200912031748_0002_r_000008_0) from 127.0.0.1:42371: output error
    2009-12-04 01:09:21,984 WARN org.apache.hadoop.ipc.Server: IPC Server Responder, call getMapCompletionEvents(job_200912031748_0002, 38, 10000, attempt_200912031748_0002_r_000008_0) from 127.0.0.1:42371: output error
    2009-12-04 01:10:02,114 WARN org.apache.hadoop.mapred.TaskRunner: attempt_200912031748_0002_r_000008_0 Child Error
    2009-12-04 01:10:07,567 INFO org.apache.hadoop.mapred.TaskRunner: attempt_200912031748_0002_r_000008_0 done; removing files.

    There is one more exception I see in the task log, not sure if it's related:
    2009-12-04 01:01:37,120 INFO org.apache.hadoop.mapred.ReduceTask: Failed to shuffle from attempt_200912031748_0002_m_000044_0
    java.io.IOException: Premature EOF
        at sun.net.www.http.ChunkedInputStream.fastRead(ChunkedInputStream.java:234)
        at sun.net.www.http.ChunkedInputStream.read(ChunkedInputStream.java:662)
        at java.io.FilterInputStream.read(FilterInputStream.java:116)
        at sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(HttpURLConnection.java:2391)
        at org.apache.hadoop.mapred.IFileInputStream.doRead(IFileInputStream.java:149)
        at org.apache.hadoop.mapred.IFileInputStream.read(IFileInputStream.java:101)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1522)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
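
    An illustrative place to start tuning, not advice given in the thread: exit status 137 means the task JVM was killed with SIGKILL (128 + 9), which on Linux is often the kernel's out-of-memory killer, and 8 maps plus 8 reduces per node at the default 200 MB child heap is a lot of concurrent memory. The property values below are assumptions to adapt, set through the old JobConf API that 0.20.1 ships with; the per-node slot counts themselves live in mapred-site.xml on each TaskTracker.

    import org.apache.hadoop.mapred.JobConf;

    public class ShuffleTuning {

        /** Applies illustrative settings that trade task concurrency for memory headroom. */
        public static void tune(JobConf conf) {
            // Per-task child JVM heap (the 0.20 default is -Xmx200m).
            conf.set("mapred.child.java.opts", "-Xmx512m");

            // Fewer parallel map-output fetches per reducer eases the load on each
            // TaskTracker's HTTP server during the shuffle (the default is 5).
            conf.setInt("mapred.reduce.parallel.copies", 4);

            // Fraction of the reducer heap used to buffer fetched map outputs in memory
            // before spilling (the default is 0.70).
            conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.50f);

            // mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum
            // control the 8 + 8 slots per node, but they are read by the TaskTracker at
            // startup, so they must be lowered in mapred-site.xml rather than per job.
        }
    }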
  • Raymond Jennings III at Dec 4, 2009 at 8:33 pm
    Does the combiner run once per data node or once per map task? (That is, can it run multiple times on the same data node, after each map task?) Thanks.
  • Owen O'Malley at Dec 4, 2009 at 9:43 pm

    On Fri, Dec 4, 2009 at 12:32 PM, Raymond Jennings III wrote:

    Does the combiner run once per data node or one per map task? (That it can
    run multiple times on the same data node after each map task.) Thanks.
    The combiner can run 0, 1, or many times on each data value. It can run in
    both the map task and reduce task.

    -- Owen
  • Mike Kendall at Dec 4, 2009 at 10:00 pm
    Are you sure it can be run in the reduce task? If it does, it still runs
    before the reducer is called though... so the flow of your data will
    still be: data -> mapper(s) -> optional combiner(s) -> reducer(s) ->
    output_data


  • Raymond Jennings III at Dec 4, 2009 at 10:03 pm
    I would still like to know how many times it will run, given how many mappers run. I realize it may never run, but what determines how many times, if any?

  • Mike Kendall at Dec 5, 2009 at 12:29 am
    From what I understand, the combiner runs when nodes are idle and
    you're waiting on a few processes that are taking too long... so the
    cluster tries to optimize by putting these idle nodes to work by doing
    optional preprocessing...


  • Owen O'Malley at Dec 5, 2009 at 1:09 am
    The combiner runs when it is spilling the intermediate output to disk. So
    the flow looks like:

    in map:
    map writes into buffer
    when buffer is "full" do a quick sort, combine and write to disk
    merge sort the partial outputs from disk, combine and write to disk

    in reduce:
    fetch output from maps into buffer
    when buffer is "full" do a merge sort, combine and write to disk
    merge sort the partial outputs and feed to the reduce

    So you'll have as many combines, in general, as the framework needs to spill
    to disk. It all depends on the data sizes. The zero-times case is rare, but it
    can happen if a partition has only a single value in it (because that value is very, very large).

    -- Owen
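
    To make that concrete, here is a minimal word count in the old mapred API (the one 0.20 still ships) that registers its reducer class as the combiner. Because the framework may apply the combiner zero, one, or many times during those spills, its input and output types must match and the operation must tolerate being reapplied; summing counts does. The class names are illustrative.

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class WordCountWithCombiner {

        public static class TokenMapper extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
                StringTokenizer tokens = new StringTokenizer(value.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    output.collect(word, ONE);
                }
            }
        }

        // Summing is associative and commutative, so it is safe to run 0, 1, or many times.
        public static class SumReducer extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterator<IntWritable> values,
                    OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
                int sum = 0;
                while (values.hasNext()) {
                    sum += values.next().get();
                }
                output.collect(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws IOException {
            JobConf conf = new JobConf(WordCountWithCombiner.class);
            conf.setJobName("wordcount-with-combiner");
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);
            conf.setMapperClass(TokenMapper.class);
            conf.setCombinerClass(SumReducer.class);  // may run during map-side and reduce-side spills
            conf.setReducerClass(SumReducer.class);
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            JobClient.runJob(conf);
        }
    }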
  • Amandeep Khurana at Dec 4, 2009 at 9:18 pm
    Seems like the reducer isn't able to read from the mapper node. Do you see
    anything in the DataNode logs? Also check the NameNode logs. Make sure
    you have DEBUG logging enabled.

    -Amandeep


    Amandeep Khurana
    Computer Science Graduate Student
    University of California, Santa Cruz

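
    For the DEBUG suggestion, one hedged sketch: the usual route is raising the level in conf/log4j.properties on each node (e.g. log4j.logger.org.apache.hadoop.mapred=DEBUG) and restarting the daemon, but the bundled log4j API can also raise it inside your own client JVM. The package names below are the likely ones to watch for shuffle problems.

    import org.apache.log4j.Level;
    import org.apache.log4j.Logger;

    public class DebugLogging {

        /** Raises verbosity for this JVM only; DataNode/TaskTracker daemons need their own
         *  log4j.properties change (or the hadoop daemonlog -setlevel tool). */
        public static void enable() {
            Logger.getLogger("org.apache.hadoop.mapred").setLevel(Level.DEBUG);
            Logger.getLogger("org.apache.hadoop.hdfs").setLevel(Level.DEBUG);
            Logger.getLogger("org.apache.hadoop.ipc").setLevel(Level.DEBUG);
        }
    }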
