xceiverCount limit reason
Hello all,

I'm running HBase on top of Hadoop and I'm having some difficulty tuning the
Hadoop configuration so that it works well with HBase.
My setup is 4 desktop-class machines with 1 GB of RAM each: 2 run a
datanode and a region server, 1 runs only a region server, and 1 runs the
namenode and the HBase master.

When I start HBase, about 300 regions must be loaded on the 3 region servers,
so a lot of concurrent accesses hit Hadoop. My first problem, using the
default configuration, was seeing too many of these:
DataXceiver: java.net.SocketTimeoutException: 480000 millis timeout while
waiting for channel to be ready for write.

I was wondering what causes such a timeout. Where is the bottleneck?
At first I thought it was a network problem (I have 100 Mbit/s interfaces),
but after monitoring the network it seems the load is low when it happens.
Anyway, I found the parameter
dfs.datanode.socket.write.timeout and set it to 0 to disable the timeout.
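For reference, here is a minimal sketch of that override, written as Java against
org.apache.hadoop.conf.Configuration purely for readability; in practice the
property lives in hadoop-site.xml in each datanode's conf directory, and the
class name below is made up:

import org.apache.hadoop.conf.Configuration;

public class DisableWriteTimeout {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // The default is 480000 ms (8 minutes), matching the exception above;
        // a value of 0 disables the datanode's write timeout entirely.
        // Shown in code only for illustration; the datanode itself reads this
        // from hadoop-site.xml, not from client code.
        conf.setInt("dfs.datanode.socket.write.timeout", 0);
        System.out.println("dfs.datanode.socket.write.timeout = "
                + conf.get("dfs.datanode.socket.write.timeout"));
    }
}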

Then I saw this in the datanode logs:
xceiverCount 256 exceeds the limit of concurrent xcievers 255
What exactly is the role of these xceivers? Do they receive replicated blocks
and/or files from clients? When are their threads created, and when do they
end? (See the sketch after the next paragraph.)

Anyway, I found the parameter
dfs.datanode.max.xcievers
and raised it to 511, then to 1023, and today to 2047; but my cluster is not
that big (300 HBase regions, 200 GB including a replication factor of 2), so
I'm not sure I can keep raising this limit for long. Moreover, it considerably
increases the amount of virtual memory the datanode JVM needs (about 2 GB now,
with only 500 MB of heap). That leads to excessive swapping, and a new problem
arises: some leases expire, and eventually my entire cluster fails.
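To make the xceiver questions concrete, here is a deliberately simplified,
hypothetical sketch of the model the error message implies. It is not the
actual org.apache.hadoop.dfs.DataNode code, only an illustration of
one-thread-per-served-connection with a configurable cap:

import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch only -- not the real DataNode implementation.
// Every block transfer the datanode serves (reads by clients, writes from
// clients, replication transfers from other datanodes) gets its own xceiver
// thread, and dfs.datanode.max.xcievers caps how many may exist at once.
public class XceiverModelSketch {
    private static final int MAX_XCEIVERS = 256; // cf. "limit of concurrent xcievers 255"
    private static final AtomicInteger xceiverCount = new AtomicInteger(0);

    public static void serve(ServerSocket server) throws IOException {
        while (true) {
            final Socket s = server.accept();
            if (xceiverCount.incrementAndGet() > MAX_XCEIVERS) {
                // The real datanode refuses the request and logs
                // "xceiverCount N exceeds the limit of concurrent xcievers M".
                xceiverCount.decrementAndGet();
                s.close();
                continue;
            }
            new Thread(new Runnable() {
                public void run() {
                    try {
                        // Stream block data over the socket here. The thread lives
                        // until the connection closes or a socket timeout fires.
                    } finally {
                        xceiverCount.decrementAndGet();
                    }
                }
            }).start();
        }
    }
}

The lifecycle is the point: a thread appears when a connection arrives and
disappears only when that connection closes or times out, so every open read
or write stream pins one xceiver on each datanode serving it.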

Are there other parameters I can tune to avoid creating so many concurrent
xceivers? Could raising dfs.replication.interval help, for example?

Could the fact that I run the region server on the same machine as the
datanode increase the number of xceivers? In that case I'll try a different
layout and let the network bottleneck keep the datanodes from being stressed.

Any clue about the inner workings of the Hadoop xceivers would be appreciated.
Thanks.

-- Jean-Adrien


  • Jean-Adrien at Jan 8, 2009 at 2:29 pm
    Some more information about the case.

    I read HADOOP-3633 / 3859 / 3831 in Jira.
    I run Hadoop 0.18.1, so I don't have the fix for 3831.
    Nevertheless, my problem seems different.
    The threads are created as soon as the client (HBase) requests data; the data
    arrives at HBase without problem, but the threads never end. Looking at the
    number-of-threads graph:

    http://www.nabble.com/file/p21352818/launch_tests.png
    (you might need to go to Nabble to see the image:
    http://www.nabble.com/xceiverCount-limit-reason-tp21349807p21349807.html)

    The graph covers three runs of Hadoop / HBase (A/B/C):
    A:
    I configured Hadoop with dfs.datanode.max.xcievers=2023 and
    dfs.datanode.socket.write.timeout=0.
    As soon as I start HBase, the regions load their data from DFS and the number
    of threads climbs to about 1100 in 2-3 minutes, then stays in that range.
    All DataXceiver threads are in one of these two states:

    "org.apache.hadoop.dfs.DataNode$DataXceiver@6a2f81" daemon prio=10
    tid=0x08289c00 nid=0x6bb6 runnable [0x8f980000..0x8f981140]
    java.lang.Thread.State: RUNNABLE
    at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
    at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
    at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
    at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
    - locked <0x95838858> (a sun.nio.ch.Util$1)
    - locked <0x95838868> (a java.util.Collections$UnmodifiableSet)
    - locked <0x95838818> (a sun.nio.ch.EPollSelectorImpl)
    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
    at
    org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:260)
    at
    org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:155)
    at
    org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
    at
    org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
    - locked <0x95838b90> (a java.io.BufferedInputStream)
    at java.io.DataInputStream.readShort(DataInputStream.java:295)
    at
    org.apache.hadoop.dfs.DataNode$DataXceiver.readBlock(DataNode.java:1115)
    at
    org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1037)
    at java.lang.Thread.run(Thread.java:619)

    "org.apache.hadoop.dfs.DataNode$DataXceiver@1abf87e" daemon prio=10
    tid=0x90bbd400 nid=0x61ae runnable [0x7b68a000..0x7b68afc0]
    java.lang.Thread.State: RUNNABLE
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:129)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
    - locked <0x9671a8e0> (a java.io.BufferedInputStream)
    at java.io.DataInputStream.readShort(DataInputStream.java:295)
    at
    org.apache.hadoop.dfs.DataNode$DataXceiver.readBlock(DataNode.java:1115)
    at
    org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1037)
    at java.lang.Thread.run(Thread.java:619)


    B:
    I changed the Hadoop configuration, reintroducing the default 8-minute timeout.
    Once again, as soon as HBase gets data from DFS, the number of threads grows to
    1100. After 8 minutes the timeout fires, and the threads fail one after another
    with this exception:

    2009-01-08 14:21:09,305 WARN org.apache.hadoop.dfs.DataNode: DatanodeRegistration(192.168.1.13:50010, storageID=DS-1681396969-127.0.1.1-50010-1227536709605, infoPort=50075, ipcPort=50020):Got exception while serving blk_-1718199459793984230_722338 to /192.168.1.13:
    java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/192.168.1.13:50010 remote=/192.168.1.13:37462]
            at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
            at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
            at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
            at org.apache.hadoop.dfs.DataNode$BlockSender.sendChunks(DataNode.java:1873)
            at org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1967)
            at org.apache.hadoop.dfs.DataNode$DataXceiver.readBlock(DataNode.java:1109)
            at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1037)
            at java.lang.Thread.run(Thread.java:619)

    C:
    During this third session I made the same run, but stopped HBase before the
    timeout fired. In this case, the threads end correctly.

    Is it the responsibility of the Hadoop client to manage its connection pool
    with the server? In that case the problem would be an HBase problem.
    Anyway, I have found my problem; it is not a matter of performance.

    Thanks for your answers.
    Have a nice day.

    -- Jean-Adrien
  • Raghu Angadi at Jan 8, 2009 at 6:38 pm

    Jean-Adrien wrote:
    > Is it the responsibility of the Hadoop client to manage its connection pool
    > with the server? In that case the problem would be an HBase problem.
    > Anyway, I have found my problem; it is not a matter of performance.
    Essentially, yes. The client has to close the file to relinquish the
    connections, if clients are using the common read/write interface.

    Currently, if a client keeps many HDFS files open, it results in many
    threads being held at the datanodes. As you noticed, the timeout at the DNs helps.

    Various solutions are possible at different levels: the application (HBase),
    the client API, HDFS, etc. https://issues.apache.org/jira/browse/HADOOP-3856
    is a proposal at the HDFS level.

    Raghu.
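
    To illustrate Raghu's point with the ordinary client API, here is a minimal
    read sketch (the path and class name are made up): while the FSDataInputStream
    is open, the serving datanode keeps a DataXceiver thread parked in readBlock,
    exactly as in the dumps above; closing the stream is what lets that thread
    finish, which matches what happened in run C when HBase was stopped.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadAndRelease {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical file; any HDFS path behaves the same way.
            FSDataInputStream in = fs.open(new Path("/hbase/example-region/example-file"));
            try {
                byte[] buf = new byte[64 * 1024];
                while (in.read(buf) != -1) {
                    // While the stream is open (reading or simply idle), the datanode
                    // holds one DataXceiver thread for this reader.
                }
            } finally {
                in.close(); // relinquishes the connection, letting the datanode thread exit
            }
            fs.close();
        }
    }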
