We're running Hadoop 0.20.1. The job probably writes upwards of 1,000 part-files at once across the whole grid, which has 33 nodes. I get the following exceptions in the run logs:
10/01/30 17:24:25 INFO mapred.JobClient: map 100% reduce 12%
10/01/30 17:24:25 INFO mapred.JobClient: Task Id : attempt_201001261532_1137_r_000013_0, Status : FAILED
java.io.EOFException
at java.io.DataInputStream.readByte(DataInputStream.java:250)
at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
at org.apache.hadoop.io.Text.readString(Text.java:400)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2869)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2794)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2077)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2263)
....lots of EOFExceptions....
10/01/30 17:24:25 INFO mapred.JobClient: Task Id : attempt_201001261532_1137_r_000019_0, Status : FAILED
java.io.IOException: Bad connect ack with firstBadLink 10.2.19.1:50010
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2871)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2794)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2077)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2263)
10/01/30 17:24:36 INFO mapred.JobClient: map 100% reduce 11%
10/01/30 17:24:42 INFO mapred.JobClient: map 100% reduce 12%
10/01/30 17:24:49 INFO mapred.JobClient: map 100% reduce 13%
10/01/30 17:24:55 INFO mapred.JobClient: map 100% reduce 14%
10/01/30 17:25:00 INFO mapred.JobClient: map 100% reduce 15%
From searching around, it seems the most common cause of these BadLink and EOFException errors is nodes running out of file descriptors. But across all of the grid machines, fs.file-max has been set to 1573039. Furthermore, we set ulimit -n to 65536 in hadoop-env.sh.
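For reference, this is roughly how those limits are applied on our nodes; the sysctl and /proc checks below are just the standard Linux ones (exact locations may vary by distro), nothing Hadoop-specific:
# /etc/sysctl.conf on every grid node: system-wide descriptor cap
fs.file-max = 1573039
# reload the setting and confirm it took effect
sysctl -p
cat /proc/sys/fs/file-max
# conf/hadoop-env.sh: per-process limit inherited by the Hadoop daemons and tasks
ulimit -n 65536
# spot-check that a running DataNode actually picked the limit up (needs a kernel that exposes /proc/<pid>/limits)
cat /proc/$(pgrep -f DataNode | head -1)/limits | grep 'open files'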
Where else should I be looking for what's causing this?