I've been trying to run a fairly small input file (300MB) on Cloudera Hadoop
0.20.1. The job I'm using probably writes to over 1000 part-files at once,
across the whole grid. The grid has 33 nodes. I get the following exception
in the run logs:

10/01/30 17:24:25 INFO mapred.JobClient:  map 100% reduce 12%
10/01/30 17:24:25 INFO mapred.JobClient: Task Id : attempt_201001261532_1137_r_000013_0, Status : FAILED
java.io.EOFException
        at java.io.DataInputStream.readByte(DataInputStream.java:250)
        at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
        at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
        at org.apache.hadoop.io.Text.readString(Text.java:400)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2869)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2794)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2077)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2263)

....lots of EOFExceptions....

10/01/30 17:24:25 INFO mapred.JobClient: Task Id : attempt_201001261532_1137_r_000019_0, Status : FAILED
java.io.IOException: Bad connect ack with firstBadLink 10.2.19.1:50010
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2871)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2794)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2077)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2263)

10/01/30 17:24:36 INFO mapred.JobClient:  map 100% reduce 11%
10/01/30 17:24:42 INFO mapred.JobClient:  map 100% reduce 12%
10/01/30 17:24:49 INFO mapred.JobClient:  map 100% reduce 13%
10/01/30 17:24:55 INFO mapred.JobClient:  map 100% reduce 14%
10/01/30 17:25:00 INFO mapred.JobClient:  map 100% reduce 15%
From searching around, it seems the most common cause of BadLink and
EOFExceptions is nodes running out of file descriptors. But across all the
grid machines, the kernel file-max (/proc/sys/fs/file-max) is set to 1573039.
Furthermore, we set ulimit -n to 65536 via hadoop-env.sh.

Where else should I be looking for what's causing this?
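
(For reference, a minimal way to verify both of those limits from a shell,
assuming a Linux /proc filesystem and that the Hadoop daemons run as a user
named "hadoop"; the username is an assumption here:)

    # kernel-wide file handle ceiling, plus current usage for comparison
    cat /proc/sys/fs/file-max   # expected to print 1573039 on these nodes
    cat /proc/sys/fs/file-nr    # first field = handles currently allocated
    # per-user descriptor limit as the daemon user sees it
    su - hadoop -c 'ulimit -n'  # expected 65536 if hadoop-env.sh took effect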


  • Meng Mao at Feb 3, 2010 at 9:05 pm
    Also, which ulimit is the important one: the one for the user running the
    job, or the one for the hadoop user that owns the Hadoop processes?
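
    (What matters is the limit each running JVM actually holds: child task JVMs
    inherit the TaskTracker's limit, and the DataNode has its own, both fixed at
    daemon start-up from the owning user's ulimit, so on a non-secure 0.20
    cluster it is the hadoop user's limit that applies. A sketch for reading the
    live value, assuming pgrep and a kernel new enough to expose
    /proc/<pid>/limits:)

        # read the open-files limit the DataNode JVM is actually running with
        DN_PID=$(pgrep -f 'hdfs.server.datanode.DataNode' | head -1)
        grep 'Max open files' /proc/$DN_PID/limits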
  • Meng Mao at Feb 4, 2010 at 7:53 pm
    I wrote a Hadoop job that checks the ulimits across the nodes, and every
    node reports:
    core file size          (blocks, -c) 0
    data seg size           (kbytes, -d) unlimited
    scheduling priority             (-e) 0
    file size               (blocks, -f) unlimited
    pending signals                 (-i) 139264
    max locked memory       (kbytes, -l) 32
    max memory size         (kbytes, -m) unlimited
    open files                      (-n) 65536
    pipe size            (512 bytes, -p) 8
    POSIX message queues     (bytes, -q) 819200
    real-time priority              (-r) 0
    stack size              (kbytes, -s) 10240
    cpu time               (seconds, -t) unlimited
    max user processes              (-u) 139264
    virtual memory          (kbytes, -v) unlimited
    file locks                      (-x) unlimited


    Does anything in there point to a file number limit? From what I
    understand, a high open-files limit like 65536 should be enough. I estimate
    only a couple thousand part-files on HDFS being written to at once, and
    around 200 open files on the local filesystem per node.
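
    (A rough back-of-the-envelope, assuming replication factor 3: each
    part-file being written holds open a write pipeline through 3 datanodes, so
    ~2000 concurrent writers means ~6000 active streams cluster-wide, about 180
    per node on 33 nodes, before counting any concurrent reads. Each such
    stream ties up one DataXceiver thread on its datanode, and those threads
    are capped separately from file descriptors, by dfs.datanode.max.xcievers.)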
  • Meng Mao at Feb 5, 2010 at 7:43 am
    Not sure what else I could check to find where the problem lies. Should I
    be looking in the datanode logs? I looked briefly and didn't see anything
    from around the time the exceptions started getting reported.
    lsof during the job execution? The number of open threads?

    I'm at a loss here.
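
    (A sketch of such live checks, assuming the daemons run as user "hadoop"
    and that pgrep plus the JDK's jstack are on the path; "DataXceiver" is the
    runnable 0.20's DataNode uses for its block read/write handler threads:)

        DN_PID=$(pgrep -f 'hdfs.server.datanode.DataNode' | head -1)
        ls /proc/$DN_PID/fd | wc -l           # descriptors open in the DataNode
        jstack $DN_PID | grep -c DataXceiver  # threads serving block I/O
        lsof -u hadoop | wc -l                # everything open by user hadoop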
  • Todd Lipcon at Feb 5, 2010 at 4:06 pm
    Yes, you're likely to see an error in the DN log. Do you see anything
    about max number of xceivers?

    -Todd
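
    (A quick way to look, assuming Cloudera's default log directory; when the
    cap is hit, the 0.20 DataNode logs an IOException about exceeding the limit
    of concurrent xceivers, using the property's historical "ie" misspelling,
    so it is safest to match both spellings:)

        grep -iE 'xc(ie|ei)ver' /var/log/hadoop/*datanode*.log | tail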
  • Meng Mao at Feb 5, 2010 at 10:19 pm
    Ack, after looking at the logs again, there are definitely xceiver errors.
    It's set to 256!
    I thought I had ruled this out as a possible cause, but I guess I was
    wrong. Gonna retest right away.
    Thanks!
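
    (The usual fix: raise the cap in hdfs-site.xml on every datanode and
    restart the DataNodes. The property name really is spelled with the
    transposed "ie" in 0.20; 4096 was a commonly used value, and both the
    value and placement here follow common practice rather than anything
    specific to this cluster:)

        <!-- hdfs-site.xml on each datanode; takes effect after restart -->
        <property>
          <name>dfs.datanode.max.xcievers</name>
          <value>4096</value>
        </property>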

Discussion Overview
group: common-user
categories: hadoop
posted: Feb 3, '10 at 12:30a
active: Feb 5, '10 at 10:19p
posts: 6
users: 2 (Meng Mao: 5 posts, Todd Lipcon: 1 post)
website: hadoop.apache.org...
irc: #hadoop
