Hello,

We have a 35 node cluster running Hadoop 0.20.2 and are seeing periods
of namenode inactivity that cause block read and write errors. We have
enabled garbage collection logging and have determined that garbage
collection is not the cause of the inactivity. We have been able to
reproduce this behavior with as few as three clients doing reads and
writes within Hadoop.
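
GC logging of this kind is usually enabled through the NameNode JVM options
in conf/hadoop-env.sh, roughly like this (the exact flags and log path below
are illustrative, not our exact settings):

export HADOOP_NAMENODE_OPTS="$HADOOP_NAMENODE_OPTS -verbose:gc \
  -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
  -Xloggc:/var/log/hadoop/namenode-gc.log"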

During the namenode inactivity we see little, if any, namenode logging.
Namenode stats: 9992150 files and directories, 6629279 blocks = 16621429
total. Heap Size is 7.29 GB / 11.7 GB (62%).


During the inactive state, strace reports many calls to sched_yield:

[pid 26249] sched_yield( <unfinished ...>
[pid 26246] sched_yield( <unfinished ...>
[pid 26249] <... sched_yield resumed> ) = 0
[pid 26246] <... sched_yield resumed> ) = 0
[pid 26249] sched_yield( <unfinished ...>
[pid 26246] sched_yield( <unfinished ...>
[pid 26249] <... sched_yield resumed> ) = 0
[pid 26246] <... sched_yield resumed> ) = 0
[pid 26249] sched_yield( <unfinished ...>
[pid 26246] sched_yield( <unfinished ...>
[pid 26249] <... sched_yield resumed> ) = 0
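
A trace like this can be captured by attaching strace to the namenode JVM,
for example (the pid and output path are placeholders):

# follow all threads, record only sched_yield calls
strace -f -e trace=sched_yield -p <namenode-pid> -o /tmp/namenode.strace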

An example of the errors seen:

Apr 27, 2010 5:02:19 PM org.apache.hadoop.hdfs.DFSClient$DFSOutputStream createBlockOutputStream
INFO: Exception in createBlockOutputStream java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.4.30.11:51257 remote=/10.4.30.14:50010]
Apr 27, 2010 5:02:19 PM org.apache.hadoop.hdfs.DFSClient$DFSOutputStream nextBlockOutputStream
INFO: Abandoning block blk_1939160217996593958_48349447
Apr 27, 2010 5:02:19 PM org.apache.hadoop.hdfs.DFSClient$DFSOutputStream nextBlockOutputStream
INFO: Waiting to find target node: 10.4.30.14:50010
Apr 27, 2010 5:02:21 PM org.apache.hadoop.hdfs.DFSClient$DFSOutputStream createBlockOutputStream
INFO: Exception in createBlockOutputStream java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.4.30.11:50554 remote=/10.4.30.30:50010]
Apr 27, 2010 5:02:21 PM org.apache.hadoop.hdfs.DFSClient$DFSOutputStream nextBlockOutputStream
INFO: Abandoning block blk_1314187820700137040_48349451
Apr 27, 2010 5:02:21 PM org.apache.hadoop.hdfs.DFSClient$DFSOutputStream nextBlockOutputStream
INFO: Waiting to find target node: 10.4.30.30:50010
Apr 27, 2010 5:02:29 PM org.apache.hadoop.hdfs.DFSClient$DFSOutputStream createBlockOutputStream
INFO: Exception in createBlockOutputStream java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.4.30.11:51369 remote=/10.4.30.14:50010]
Apr 27, 2010 5:02:29 PM org.apache.hadoop.hdfs.DFSClient$DFSOutputStream nextBlockOutputStream
INFO: Abandoning block blk_8612699045898320923_48349455

The periods of inactivity last 40-70 seconds, followed by 20-30 seconds
of activity, and then the process repeats. We have seen this behavior on
both Hadoop 0.20.1 and 0.20.2 with Sun JRE build 1.6.0_13-b03.

Any suggestions on this issue are greatly appreciated.

--

The datanodes occasionally report:

2010-04-27 09:54:21,247 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.4.30.24:50010, storageID=DS-1258702238-10.4.30.24-50010-1265310891468, infoPort=50075, ipcPort=50020):DataXceiver
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(Unknown Source)
at java.net.SocketInputStream.read(Unknown Source)
at java.io.DataInputStream.readByte(Unknown Source)
at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
at org.apache.hadoop.io.Text.readString(Text.java:400)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:313)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103)
at java.lang.Thread.run(Unknown Source)


<configuration>

<property>
<name>dfs.name.dir</name>
<value>/hadoop</value>
</property>

<property>
<name>dfs.hosts</name>
<value>/opt/hadoop/conf/dfs.hosts</value>
</property>

<property>
<name>dfs.http.address</name>
<value>hfs-030010:50070</value>
</property>

<property>
<name>dfs.balance.bandwidthPerSec</name>
<value>800000000</value>
</property>

<property>
<name>dfs.datanode.max.xcievers</name>
<value>4096</value>
</property>

<property>
<name>dfs.datanode.socket.write.timeout</name>
<value>0</value>
</property>

<property>
<name>dfs.datanode.handler.count</name>
<value>34</value>
</property>

<property>
<name>dfs.namenode.handler.count</name>
<value>64</value>
</property>

<property>
<name>dfs.namenode.logging.level</name>
<value>block</value>
</property>

</configuration>


  • Todd Lipcon at Apr 27, 2010 at 6:34 pm
    Those errors would indicate problems on the DN or client level, not the NN
    level.

    I'd double-check your networking and make sure you don't have any
    switching issues. Also check for swapping on your DNs; if you see more
    than a few MB swapped out, you are oversubscribing memory and should
    scale it back.
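
    A quick way to eyeball that on each node is to look at current swap usage
    with free and at ongoing swap activity (the si/so columns) with vmstat:

    free -m
    vmstat 5 3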

    -Todd

    --
    Todd Lipcon
    Software Engineer, Cloudera
  • Elsif at Apr 27, 2010 at 7:50 pm
    The network checks out fine. Each node has 4 GB of memory and there are
    no other Java processes running besides the datanode. The nodes all
    have a very low load and very little swap in use:

    top - 12:40:45 up 272 days, 1:41, 2 users, load average: 0.10, 0.13, 0.22
    Tasks: 63 total, 1 running, 62 sleeping, 0 stopped, 0 zombie
    Cpu(s): 0.0%us, 6.0%sy, 0.0%ni, 92.7%id, 0.7%wa, 0.3%hi, 0.3%si, 0.0%st
    Mem: 3635144k total, 3109948k used, 525196k free, 2872k buffers
    Swap: 15624944k total, 87440k used, 15537504k free, 2127708k cached

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    20053 root 20 0 1291m 686m 6960 S 3.7 19.3 11:31.74 java


    One additional correlation point we are seeing is that the block reports
    are taking a long time:

    2010-04-27 08:32:47,470 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 733396 blocks got processed in 308347 msecs
    2010-04-27 09:42:47,701 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 734668 blocks got processed in 217988 msecs
    2010-04-27 10:24:02,943 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 736449 blocks got processed in 271354 msecs
    2010-04-27 11:23:08,826 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 738089 blocks got processed in 305396 msecs
    2010-04-27 12:23:10,333 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 738367 blocks got processed in 306389 msecs
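
    These lines can be pulled out of the datanode logs with a quick grep,
    e.g. (the log path will vary by install):

    grep "BlockReport of" /var/log/hadoop/*datanode*.log | tail -n 5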
  • Todd Lipcon at Apr 27, 2010 at 9:01 pm

    On Tue, Apr 27, 2010 at 12:50 PM, elsif wrote:

    2010-04-27 08:32:47,470 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 733396 blocks got processed in 308347 msecs

    You have 733 thousand blocks on each DN? This is most likely your issue.
    There are some patches in trunk to improve performance for heavy-storage
    DNs, but that number of blocks is still very high. My guess is that you are
    storing lots of tiny files, which is not one of HDFS's strong suits.
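
    One quick way to sanity-check the small-files theory is the fsck summary,
    which reports total size, total files, and total blocks for the namespace
    (total size divided by total files gives the average file size):

    hadoop fsck / 2>/dev/null | grep -E 'Total (size|files|blocks)'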

    If you periodically jstack the DataNode JVM you may be able to see
    something causing contention. My guess is that it is related to your
    high number of blocks.
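
    Something along these lines works for periodic dumps (the jps/awk match on
    the process name and the output location are just placeholders):

    # dump DataNode threads every 10 seconds for roughly five minutes
    DN_PID=$(jps | awk '/DataNode/ {print $1}')
    for i in $(seq 1 30); do
      jstack "$DN_PID" > /tmp/dn-jstack.$(date +%s).txt
      sleep 10
    done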

    -Todd




    --
    Todd Lipcon
    Software Engineer, Cloudera
