FAQ

On Nov 25, 2009, at 11:27 AM, David J. O'Dell wrote:

I've intermittently seen the following errors on both of my
clusters; it happens when writing files.
I was hoping this would go away with the new version, but I see the
same behavior on both versions.
The namenode logs don't show any problems; it's always on the client
and datanodes.
[leaving errors below for reference]

I've seen similar errors on my 0.19.2 cluster when the cluster is
decently busy. I've traced this more or less to the host in question
doing verification on its blocks, an operation which seems to take the
datanode out for upwards of 500 seconds in some cases.

In 0.19.2, if you look at
o.a.h.hdfs.server.datanode.FSDataset.FSVolumeSet, you will see that
all of its methods are synchronized. All dataset operations on the
node seem to drop through methods in this class, which in turn causes
a backup when one thread spends a large amount of time holding the
monitor...
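
To make the pattern concrete, here is a minimal, simplified Java
sketch of what I mean (this is not the actual FSDataset/FSVolumeSet
source, just an illustration of one coarse monitor covering both the
fast and slow paths):

// Simplified sketch, not real Hadoop code: every dataset operation
// funnels through synchronized methods on a single object, so one slow
// caller (e.g. block verification scanning the disks) holds the monitor
// and stalls every other DataXceiver thread on the node.
class VolumeSetSketch {
    private final java.util.List<String> volumes =
            new java.util.ArrayList<String>();

    // Fast path used by ordinary writes -- still waits for the monitor.
    synchronized String getNextVolume() {
        return volumes.isEmpty() ? null : volumes.get(0);
    }

    // Slow path: while this runs, getNextVolume() callers block.
    synchronized void verifyBlocks() throws InterruptedException {
        Thread.sleep(500000L); // stands in for minutes of disk scanning
    }
}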

You can grab a few jstacks and use a dump analyzer (like https://tda.dev.java.net/)
to poke through them to see if you have the same behavior.
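
If you'd rather not page through raw dumps by hand, a quick
programmatic check is also possible; here is a rough sketch using the
standard ThreadMXBean API (it only sees the JVM it runs in, so for a
datanode you would still attach with jstack -- the class name below is
made up for illustration):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Prints every BLOCKED thread together with the monitor it is waiting
// on and the thread currently holding that monitor.
public class BlockedThreadReport {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info.getThreadState() == Thread.State.BLOCKED) {
                System.out.printf("%s blocked on %s held by %s%n",
                        info.getThreadName(),
                        info.getLockName(),
                        info.getLockOwnerName());
            }
        }
    }
}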

I have not spent enough time digging into this to understand whether
the whole dataset really needs to be locked during the operation or if
the locks could be moved closer to the FSDir operations.
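
Purely as speculation, if the locks could move down, one shape it
might take is a lock per volume rather than one monitor over the whole
set. The sketch below is made up for illustration and is not a
proposed patch:

import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical finer-grained locking: a scan holds only its own
// volume's lock, so writers picking other volumes are not stalled
// behind it.
class PerVolumeLockingSketch {
    static class Volume {
        final ReentrantLock lock = new ReentrantLock();

        void verifyBlocks() {
            lock.lock();
            try {
                // long-running scan of this volume only
            } finally {
                lock.unlock();
            }
        }
    }

    private final List<Volume> volumes;

    PerVolumeLockingSketch(List<Volume> volumes) {
        this.volumes = volumes;
    }

    Volume pickVolumeForWrite(int index) {
        return volumes.get(index % volumes.size());
    }
}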

dave bayer

original log clips included here:
Client log:
09/11/25 10:54:15 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.1.75.11:37852 remote=/10.1.75.125:50010]
09/11/25 10:54:15 INFO hdfs.DFSClient: Abandoning block blk_-105422935413230449_22608
09/11/25 10:54:15 INFO hdfs.DFSClient: Waiting to find target node: 10.1.75.125:50010

Datanode log:
2009-11-25 10:54:51,170 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.1.75.125:50010, storageID=DS-1401408597-10.1.75.125-50010-1258737830230, infoPort=50075, ipcPort=50020):DataXceiver
java.net.SocketTimeoutException: 120000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/10.1.75.104:50010]
        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:213)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:282)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103)
        at java.lang.Thread.run(Thread.java:619)
