Hi,
I have an HBase/Hadoop cluster set up. Six days ago we did a cold restart of
our system.
I recently noticed that an HBase query was timing out with:
org.apache.hadoop.hbase.client.NoServerForRegionException: Timed out trying to locate root region
I looked at the master logs, and none of the region servers had connected:
2010-06-04 00:00:21,510 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers, 0 dead, average load NaN
The master wrote this to stderr when it started:
java.io.EOFException
....
org.apache.hadoop.ipc.RemoteException: java.io.IOException: Could not complete write to file /hbase/devLogsTable/1225469767/oldlogfile.log by DFSClient_-107490689
The region servers have been trying to connect to the master ever since,
failing with this error:
2010-06-03 14:33:28,960 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Unable to connect to master. Retrying. Error was: java.net.ConnectException: Connection refused
All the region server and master processes are running now, but none of
the region servers are connected to the master.
My first question is how to monitor for this problem. None of the logs report
an error, and the processes I monitor are all running, so nothing flagged it.
How do I check the general health of the cluster?
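The only stopgap I can think of is to watch the master log for the
ServerManager status line quoted above and alert when it shows zero region
servers. Below is a rough sketch of that check, not a proper solution. The
line format is copied from my own log; the names latest_status and is_healthy
and the expected_live threshold are just placeholders I made up, and I am
assuming the master keeps writing this line while it runs.

```python
import re
from typing import NamedTuple, Optional

# Pattern based on the single status line seen in the master log, e.g.
# "... ServerManager: 0 region servers, 0 dead, average load NaN".
# Assumption: other HBase versions may word this line differently.
STATUS_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ INFO "
    r"org\.apache\.hadoop\.hbase\.master\.ServerManager: "
    r"(?P<live>\d+)\s+region servers, (?P<dead>\d+) dead, "
    r"average load (?P<load>\S+)"
)


class MasterStatus(NamedTuple):
    timestamp: str  # timestamp of the status line, without milliseconds
    live: int       # region servers the master says are connected
    dead: int       # region servers the master says are dead


def latest_status(log_text: str) -> Optional[MasterStatus]:
    """Return the last ServerManager status line in the log, or None."""
    latest = None
    for line in log_text.splitlines():
        match = STATUS_RE.match(line)
        if match:
            latest = MasterStatus(
                match.group("ts"),
                int(match.group("live")),
                int(match.group("dead")),
            )
    return latest


def is_healthy(status: Optional[MasterStatus], expected_live: int = 1) -> bool:
    """Healthy means: a status line exists, enough servers are connected,
    and none are reported dead. expected_live is a hypothetical threshold."""
    if status is None:
        return False
    return status.live >= expected_live and status.dead == 0
```

A cron job could read the master log, pass its contents to latest_status, and
page someone when is_healthy returns False. It would also make sense to check
that the timestamp is recent, since a stale log would otherwise look fine.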
My second question is: why did this happen?
Thanks,
ishwar