I had a job go awry over the weekend: it wrote ~500GB of map output to our
HDFS cluster (it never made it to the reduce step) and then locked up.
After restarting the cluster, I get a lot of errors like these in the
datanode logs whenever I run a job:

----

2008-02-04 23:10:59,562 ERROR org.apache.hadoop.dfs.DataNode: DataXceiver: java.io.IOException: Block blk_-1714483428308373630 is valid, and cannot be written to.
at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:551)
at org.apache.hadoop.dfs.DataNode$BlockReceiver.&lt;init&gt;(DataNode.java:901)
at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:804)
at java.lang.Thread.run(Thread.java:619)

2008-02-04 23:10:59,916 ERROR org.apache.hadoop.dfs.DataNode: DataXceiver: java.io.IOException: Block blk_-1714483428308373630 is valid, and cannot be written to.
at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:551)
at org.apache.hadoop.dfs.DataNode$BlockReceiver.&lt;init&gt;(DataNode.java:901)
at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:804)
at java.lang.Thread.run(Thread.java:619)

2008-02-04 23:10:59,945 ERROR org.apache.hadoop.dfs.DataNode: DataXceiver: java.io.IOException: Block blk_-1714483428308373630 is valid, and cannot be written to.
at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:551)
at org.apache.hadoop.dfs.DataNode$BlockReceiver.&lt;init&gt;(DataNode.java:901)
at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:804)
at java.lang.Thread.run(Thread.java:619)

----
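If I'm reading this right, the datanode thinks it already holds a finalized
copy of that block on local disk. Something like the following should confirm
that on one of the affected nodes (/data/dfs/data is just a placeholder for
whatever dfs.data.dir is set to in hadoop-site.xml):

----

# Placeholder path: substitute the real dfs.data.dir from hadoop-site.xml.
# A finalized block is stored as blk_<id> next to a matching .meta checksum file.
find /data/dfs/data -name 'blk_-1714483428308373630*'

----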

and these:

----

2008-02-04 23:12:15,656 ERROR org.apache.hadoop.dfs.DataNode: DataXceiver: java.io.IOException: Block blk_-707985392830118202 has already been started (though not completed), and thus cannot be created.
at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:568)
at org.apache.hadoop.dfs.DataNode$BlockReceiver.&lt;init&gt;(DataNode.java:901)
at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:804)
at java.lang.Thread.run(Thread.java:619)

2008-02-04 23:12:16,547 INFO org.apache.hadoop.dfs.DataNode: Served block blk_6046518089785041038 to /172.31.100.36
2008-02-04 23:12:17,480 ERROR org.apache.hadoop.dfs.DataNode: DataXceiver: java.io.IOException: Block blk_-707985392830118202 has already been started (though not completed), and thus cannot be created.
at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:568)
at org.apache.hadoop.dfs.DataNode$BlockReceiver.&lt;init&gt;(DataNode.java:901)
at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:804)
at java.lang.Thread.run(Thread.java:619)

----
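My guess, and it's only a guess, is that these are half-written blocks left
over from the crashed job: as far as I can tell, in 0.15.x a block that is
still being written lives in a tmp directory under dfs.data.dir, so leftovers
from the lockup could make the datanode think those writes are still in
progress. Listing that directory with the datanode stopped should show
anything that never got finalized:

----

# Assumption: in-progress block files live under ${dfs.data.dir}/tmp in 0.15.x.
# With the datanode stopped, anything left here is a write that never finalized.
ls -l /data/dfs/data/tmp/

----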

and these:

----

2008-02-04 23:12:14,480 INFO org.apache.hadoop.dfs.DataNode: Exception writing to mirror 172.31.100.47:50010
java.net.SocketException: Connection reset
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
at java.io.DataOutputStream.write(DataOutputStream.java:90)
at org.apache.hadoop.dfs.DataNode$BlockReceiver.receiveChunk(DataNode.java:1333)
at org.apache.hadoop.dfs.DataNode$BlockReceiver.receiveBlock(DataNode.java:1386)
at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:938)
at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:804)
at java.lang.Thread.run(Thread.java:619)

2008-02-04 23:12:14,563 ERROR org.apache.hadoop.dfs.DataNode: DataXceiver: java.net.SocketException: Broken pipe
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
at java.io.DataOutputStream.flush(DataOutputStream.java:106)
at org.apache.hadoop.dfs.DataNode$BlockReceiver.receiveBlock(DataNode.java:1394)
at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:938)
at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:804)
at java.lang.Thread.run(Thread.java:619)

----
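
The connection resets and broken pipes look secondary to me, i.e. the
downstream mirror datanode dying mid-write rather than a network problem, but
it's cheap to rule out file-descriptor or socket pressure on the datanodes
(50010 is the datanode data-transfer port shown in the log above):

----

# Quick sanity checks on a datanode: fd limit and open data-transfer sockets.
ulimit -n
netstat -tn | grep ':50010' | wc -l

----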

The namenode doesn't log any errors. Jobs that hit these datanode errors
generally end up locking up.

I'm running 0.15.3 on a 13-node cluster. It looks like a bunch of blocks
got allocated on the datanodes that the namenode doesn't know about, and
the datanodes are now refusing to accept new blocks with the same IDs.
Does this sound likely? What's a good fix for it?
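
In case it helps anyone suggest a fix: I was planning to dump the namenode's
view of every block and compare it against what's actually on the datanodes'
disks. Roughly this (I believe fsck and these flags are in 0.15.3, but
corrections welcome):

----

# Walk the namespace from the namenode's point of view; -blocks and -locations
# print each block ID and the datanodes the namenode believes hold a replica.
bin/hadoop fsck / -files -blocks -locations

----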

Thanks!
Colin Evans
