Hello,

The following problem has occurred three times for us in the past two weeks, twice of which in the last 24 hours.

Our current setup is a single server that runs everything: namenode, datanode, and client. We're using JNI in the client as an interface to HDFS.
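For context, the write path on the Java side of the JNI bridge boils down to roughly the following (a simplified sketch; the path and payload are illustrative, not our actual code). The FSDataOutputStream::write call mentioned in the client log below is the one that fails:

-----

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteSketch {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // The JNI layer ultimately goes through FSDataOutputStream::write,
        // which is the call that fails in the client log below.
        FSDataOutputStream out = fs.create(new Path("/tmp/example-output"));
        try {
            out.write("example payload".getBytes("UTF-8"));
        } finally {
            out.close();
        }
        fs.close();
    }
}

-----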

It seems as if, after some period of operation, the client becomes unable to write any more data to the datanode and throws an exception. The symptoms look similar to http://issues.apache.org/jira/browse/HDFS-693 .

Here are the relevant logs from the client:


-----

09/11/11 10:58:21 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_-5973015382562856275_32855java.net.SocketTimeoutException: 63000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/127.0.0.1:52549 remote=/127.0.0.1:50010]
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
at java.io.DataInputStream.readFully(DataInputStream.java:178)
at java.io.DataInputStream.readLong(DataInputStream.java:399)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2367)

09/11/11 10:58:22 WARN hdfs.DFSClient: Error Recovery for block blk_-5973015382562856275_32855 bad datanode[0] 127.0.0.1:50010

Exception in thread "Thread-9" java.io.IOException: All datanodes 127.0.0.1:50010 are bad. Aborting...
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2493)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2046)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2209)
Call to org.apache.hadoop.fs.FSDataOutputStream::write failed!

-----

I'm not entirely sure what to make of this: it seems like some sort of hiccup, since the datanode is available again shortly afterwards. As a matter of fact, I have been unable to reproduce this error in a controlled way; I've only noticed the application crash while not watching.

Here are the relevant logs from the DataNode:

-----

2009-11-11 10:56:30,429 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_6654737502435554218_32905 src: /127.0.0.1:33103 dest: /127.0.0.1:50010
2009-11-11 10:56:30,854 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /127.0.0.1:33103, dest: /127.0.0.1:50010, bytes: 488, op: HDFS_WRITE, cliID: DFSClient_attempt_200910192055_3176_r_000000_0, srvID: DS-107694665-62.129.149.242-50010-1254143880412, blockid: blk_6654737502435554218_32905
2009-11-11 10:56:30,855 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for block blk_6654737502435554218_32905 terminating
2009-11-11 10:56:36,184 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /127.0.0.1:50010, dest: /127.0.0.1:33161, bytes: 222249, op: HDFS_READ, cliID: DFSClient_attempt_200910192055_3178_m_000001_0, srvID: DS-107694665-62.129.149.242-50010-1254143880412, blockid: blk_-5388231964351796801_32881
2009-11-11 10:56:41,440 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_698694794042576935_32906 src: /127.0.0.1:33208 dest: /127.0.0.1:50010
2009-11-11 10:56:41,463 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /127.0.0.1:33192, dest: /127.0.0.1:50010, bytes: 246338, op: HDFS_WRITE, cliID: DFSClient_-648412294, srvID: DS-107694665-62.129.149.242-50010-1254143880412, blockid: blk_-5491664625164378664_32900
2009-11-11 10:56:41,473 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for block blk_-5491664625164378664_32900 terminating
2009-11-11 10:56:41,737 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /127.0.0.1:33208, dest: /127.0.0.1:50010, bytes: 1757, op: HDFS_WRITE, cliID: DFSClient_attempt_200910192055_3177_r_000000_0, srvID: DS-107694665-62.129.149.242-50010-1254143880412, blockid: blk_698694794042576935_32906
2009-11-11 10:56:41,746 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for block blk_698694794042576935_32906 terminating
2009-11-11 10:58:27,516 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_-5973015382562856275_32855 java.io.EOFException: while trying to read 65557 bytes
2009-11-11 10:58:27,517 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for block blk_-5973015382562856275_32855 Interrupted.
2009-11-11 10:58:27,517 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for block blk_-5973015382562856275_32855 terminating
2009-11-11 10:58:27,518 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-6041681763208648701_32907 src: /127.0.0.1:33655 dest: /127.0.0.1:50010
2009-11-11 10:58:27,518 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_9134662523757703658_32907 src: /127.0.0.1:33688 dest: /127.0.0.1:50010
2009-11-11 10:58:27,519 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /127.0.0.1:50010, dest: /127.0.0.1:33533, bytes: 0, op: HDFS_READ, cliID: DFSClient_-1667266750, srvID: DS-107694665-62.129.149.242-50010-1254143880412, blockid: blk_6654737502435554218_32905
2009-11-11 10:58:27,519 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-5973015382562856275_32855 received exception java.io.EOFException: while trying to read 65557 bytes
2009-11-11 10:58:27,531 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-107694665-62.129.149.242-50010-1254143880412, infoPort=50075, ipcPort=50020):DataXceiver
java.io.EOFException: while trying to read 65557 bytes
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:264)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:308)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:372)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:524)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:357)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103)
at java.lang.Thread.run(Thread.java:619)
2009-11-11 10:58:30,442 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /127.0.0.1:33655, dest: /127.0.0.1:50010, bytes: 9833, op: HDFS_WRITE, cliID: DFSClient_-648412294, srvID: DS-107694665-62.129.149.242-50010-1254143880412, blockid: blk_-6041681763208648701_32907

-----

Again, not much useful information: the exception shown in the DataNode logs has a later timestamp than the actual error in the client.

Now, I'm wondering about a few things:

- What could be causing these problems? A network error seems out of the question, since all communication goes over localhost (127.0.0.1), but I'm having trouble figuring out what steps to take next to investigate the problem.

- Could this problem (or this class of problems) be related to the fact that I'm only running a single datanode? In other words, would the client still throw an exception if it had multiple datanodes to choose from and only a single one failed to respond?

- Given that the DataNode seems to return to normal operation some time after the error occurs, would increasing the timeout from 63 seconds to, say, 300 seconds be a wise idea? (A sketch of the configuration change I have in mind follows below.) Are these kinds of "hiccups" to be expected during normal operation?
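If raising the timeout is a reasonable route, this is roughly what I would try on the client side. The property names are my best guess from reading DFSClient (dfs.socket.timeout for the read/response side and dfs.datanode.socket.write.timeout for the write side), so please correct me if they're wrong:

-----

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class TimeoutSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Read/response timeout used by the client-datanode pipeline, in
        // milliseconds; the 63000 ms in the log appears to be the 60 s
        // default plus a per-datanode extension. Raising it to 300 s:
        conf.setInt("dfs.socket.timeout", 300000);

        // Write-side socket timeout between client and datanode.
        conf.setInt("dfs.datanode.socket.write.timeout", 300000);

        FileSystem fs = FileSystem.get(conf);
        // ... perform writes as usual ...
        fs.close();
    }
}

-----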


Thanks in advance!

Regards,

- Leon Mergen
