Hadoop data nodes failing to start
Hello everyone-

So I have a 5-node cluster that I've been running for a few weeks with no problems. Today I decided to add nodes and double its size to 10. After doing all the setup and starting the cluster, I discovered that 4 of the 10 nodes had failed to start up. Specifically, the data nodes didn't start; the task trackers seemed to start fine. Thinking it was something I did incorrectly with the expansion, I reverted to the 5-node configuration, but I'm seeing the same problem, with only 2 of 5 nodes starting correctly. Here is what I'm seeing in the hadoop-*-datanode*.log files:

2009-04-07 12:35:40,628 INFO org.apache.hadoop.dfs.DataNode: Starting Periodic block scanner.
2009-04-07 12:35:45,548 INFO org.apache.hadoop.dfs.DataNode: BlockReport of 9269 blocks got processed in 1128 msecs
2009-04-07 12:35:45,584 ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(10.254.165.223:50010, storageID=DS-202528624-10.254.131.244-50010-1238604807366, infoPort=50075, ipcPort=50020):DataXceiveServer: Exiting due to:java.nio.channels.ClosedSelectorException
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:66)
at sun.nio.ch.SelectorImpl.selectNow(SelectorImpl.java:88)
at sun.nio.ch.Util.releaseTemporarySelector(Util.java:135)
at sun.nio.ch.ServerSocketAdaptor.accept(ServerSocketAdaptor.java:120)
at org.apache.hadoop.dfs.DataNode$DataXceiveServer.run(DataNode.java:997)
at java.lang.Thread.run(Thread.java:619)

After this the data node shuts down. This same message is appearing on all the failed nodes. Help!

-kevin
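[A quick way to gauge how widespread this failure is would be to grep each node's datanode log for the fatal exception. The sketch below runs against a sample log line written to /tmp; the real logs would live under the Hadoop install's logs/ directory on each slave, and the paths here are illustrative assumptions, not part of the original report.]

```shell
# Create a sample log line like the one above, then grep for the fatal exception.
# On a real cluster you would grep $HADOOP_HOME/logs/hadoop-*-datanode-*.log
# on each slave instead (e.g. via ssh in a loop over conf/slaves).
log=/tmp/hadoop-datanode-demo.log
echo 'ERROR org.apache.hadoop.dfs.DataNode: DataXceiveServer: Exiting due to:java.nio.channels.ClosedSelectorException' > "$log"
if grep -q 'ClosedSelectorException' "$log"; then
  echo "datanode on $(hostname) hit the fatal selector exception"
fi
```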


  • Kevin Eppinger at Apr 8, 2009 at 1:24 pm
    FYI: Problem fixed. It was apparently a timeout condition present in 0.18.3 that only popped up when the additional nodes were added. The solution was to put the following entry in hadoop-site.xml:

    <property>
    <name>dfs.datanode.socket.write.timeout</name>
    <value>0</value>
    </property>

    Thanks to 'jdcryans' and 'digarok' from IRC for the help.

    -kevin
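[For context: a value of 0 disables the DataNode's socket write timeout entirely, rather than lengthening it. A minimal sketch of writing that override, using a /tmp path for illustration; the real file would be $HADOOP_HOME/conf/hadoop-site.xml, and on 0.18.x the datanode would then be restarted, e.g. with bin/hadoop-daemon.sh stop/start datanode.]

```shell
# Write a demo hadoop-site.xml containing the timeout override.
# The /tmp path is an assumption for illustration only.
conf=/tmp/hadoop-site-demo.xml
cat > "$conf" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.datanode.socket.write.timeout</name>
    <value>0</value>
  </property>
</configuration>
EOF
# Confirm the property landed in the file.
grep 'dfs.datanode.socket.write.timeout' "$conf"
```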

    -----Original Message-----
    From: Kevin Eppinger
    Sent: Tuesday, April 07, 2009 1:05 PM
    To: core-user@hadoop.apache.org
    Subject: Hadoop data nodes failing to start

  • Jean-Daniel Cryans at Apr 8, 2009 at 1:30 pm
    Kevin,

    I'm glad it worked for you.

We talked a bit about 5114 yesterday; any chance of trying the 0.18 branch
on that same cluster without the socket timeout workaround?

    Thx,

    J-D

  • Kevin Eppinger at Apr 8, 2009 at 9:30 pm
    Unfortunately not. I don't have much leeway to experiment with this cluster.

    -kevin

    -----Original Message-----
    From: jdcryans@gmail.com On Behalf Of Jean-Daniel Cryans
    Sent: Wednesday, April 08, 2009 8:30 AM
    To: core-user@hadoop.apache.org
    Subject: Re: Hadoop data nodes failing to start


Discussion Overview
group: common-user
categories: hadoop
posted: Apr 7, '09 at 6:05p
active: Apr 8, '09 at 9:30p
posts: 4
users: 2
website: hadoop.apache.org...
irc: #hadoop
