Grokbase Groups HBase user June 2010
Hi,

I have an HBase/Hadoop cluster set up. Six days ago we did a cold restart of
our system.
I recently noticed that an HBase query was timing out with:

org.apache.hadoop.hbase.client.NoServerForRegionException: Timed out trying
to locate root region


I looked at the master logs, and none of the region servers had connected:

2010-06-04 00:00:21,510 INFO org.apache.hadoop.hbase.master.ServerManager: 0
region servers, 0 dead, average load NaN


The master wrote this to stderr when it started:

java.io.EOFException
....
org.apache.hadoop.ipc.RemoteException: java.io.IOException: Could not
complete write to file /hbase/devLogsTable/1225469767/oldlogfile.log by
DFSClient_-107490689

The region servers have been trying to connect to the master ever since,
failing with the error:

2010-06-03 14:33:28,960 WARN
org.apache.hadoop.hbase.regionserver.HRegionServer: Unable to connect to
master. Retrying. Error was: java.net.ConnectException: Connection refused


All the region server and master processes are running now, but none of
the region servers are connected.


My first question is how to monitor this problem. I monitor the processes,
and they are all fine, and none of the logs report an error.
How do I check the general health of the cluster?


My second question is: why did this happen?

thanks
ishwar

  • Jean-Daniel Cryans at Jun 11, 2010 at 7:56 pm
    You can check the general health using the web UI, which runs on the
    master node at port 60010.

    For the errors, the context you gave is so limited that giving any
    meaningful answer is impossible. Please post full logs on a web server
    or on pastebin.com (or your preferred code pasting site) if it fits.

    J-D
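
For the monitoring question, one option besides the web UI is to watch the master log for the ServerManager summary line quoted in the original post. The following is a minimal sketch, assuming the line keeps the format shown there ("N region servers, M dead, average load X"). The function names and the `min_servers` threshold are illustrative and not part of HBase.

```python
import re
from typing import Iterable, Optional, Tuple

# Matches the ServerManager summary line quoted in the original post, e.g.
# "... ServerManager: 0 region servers, 0 dead, average load NaN"
# Assumption: the master keeps logging the line in this exact shape.
SUMMARY_RE = re.compile(
    r"ServerManager:\s+(\d+)\s+region servers,\s+(\d+)\s+dead,"
    r"\s+average load\s+(\S+)"
)


def parse_summary(line: str) -> Optional[Tuple[int, int, str]]:
    """Return (live, dead, average_load_text), or None if the line does not match."""
    match = SUMMARY_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), match.group(3)


def latest_live_count(lines: Iterable[str]) -> Optional[int]:
    """Return the live region server count from the last summary line seen."""
    latest = None
    for line in lines:
        parsed = parse_summary(line)
        if parsed is not None:
            latest = parsed[0]
    return latest


def cluster_looks_unhealthy(lines: Iterable[str], min_servers: int = 1) -> bool:
    """True if the last summary reports fewer than min_servers live servers.

    A log with no summary line at all is also treated as unhealthy,
    which is a conservative choice made for this sketch.
    """
    live = latest_live_count(lines)
    return live is None or live < min_servers
```

A cron job or an existing monitoring agent could feed the tail of the master log to `cluster_looks_unhealthy` and raise an alert when it returns True. Unlike a process check, this would have flagged the "0 region servers" state described above.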
  • Ishwar ramani at Jun 16, 2010 at 7:55 pm
    Hi Jean,

    It happened again today during a server restart. This involved a Hadoop
    start followed by an HBase start.
    There was also an exception when the HBase master came up, while reading a
    file from Hadoop. I am not sure if that is the problem.
    I have pasted those logs too.


    Current state of the system: master, zookeeper, region servers are all up.
    But region servers are not connected to master.

    Here are the logs ....


    1. Logs on the HBase master and the Hadoop namenode.
    hbase-master.out: http://pastebin.com/6a88nRh5
    hadoop-namenode: http://pastebin.com/wHP5uQBh

    2. Syslog on the HBase master.
    http://pastebin.com/S9KVVsSf

    3. Syslog on the HBase region servers. I posted one; the other is the same.
    http://pastebin.com/kR42Xt2t


    I ran netstat -tna to confirm that the master is listening on
    127.0.0.121:60000.

    I restarted only the region servers, and they were able to connect fine.


    thanks
    ishwar
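
Running netstat on the master shows what is bound locally, but not whether the port is reachable from the other hosts. The sketch below is a small probe that could be run from a region server machine to see whether a TCP connection to the master's port 60000 succeeds or is refused, as in the "Connection refused" warning above. The helper name `can_connect` and the host name in the usage note are hypothetical.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    A refused connection, timeout, or name resolution failure all
    raise OSError (or a subclass), which is reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `can_connect("hbase-master.example.com", 60000)` (a placeholder host name) returning False from a region server host, while the same call succeeds on the master itself, would suggest looking at which address the master is bound to or at firewall rules between the machines.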


Discussion Overview
group: user @
categories: hbase, hadoop
posted: Jun 11, '10 at 7:48p
active: Jun 16, '10 at 7:55p
posts: 3
users: 2
website: hbase.apache.org
