Grokbase Groups HBase user March 2011
Hi,

I'm setting up an HBase cluster with four region nodes, with the master
running outside the data nodes.
It works 'well'. The problem is that after some stress tests (say, launching
20 threads that put data through the REST API, each value about 4 MB in
size), the HMaster always shuts down due to a ZooKeeper session timeout.
After the timeout I tried to restart the HMaster, and then I saw lots of

====
2011-03-09 17:56:16,032 DEBUG
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Pushed=107 entries
from
hdfs://cloud135:9000/hbase/.logs/cloud136,60020,1299572235779/cloud136%3A60020.1299636326073
2011-03-09 17:56:16,032 DEBUG
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog 134 of
219:
hdfs://cloud135:9000/hbase/.logs/cloud136,60020,1299572235779/cloud136%3A60020.1299636333037,
length=96233748
2011-03-09 17:56:16,032 INFO org.apache.hadoop.hbase.util.FSUtils:
Recovering file
hdfs://cloud135:9000/hbase/.logs/cloud136,60020,1299572235779/cloud136%3A60020.1299636333037
2011-03-09 17:56:18,609 INFO org.apache.hadoop.hbase.util.FSUtils: Finished
lease recover attempt for
hdfs://cloud135:9000/hbase/.logs/cloud136,60020,1299572235779/cloud136%3A60020.1299636333037
====

in the log file. This recovery takes a pretty long time, and it eventually
leads to a Java heap out-of-memory error, even though I set HBASE_HEAPSIZE
to 4000 in conf/hbase-site.xml.

My question is that the recovery takes such a long time, and in the meantime
I cannot do anything, e.g. run bin/hbase shell and list.
Is this normal, or some sort of configuration issue? If it's normal, is there
any guideline or document that addresses this performance issue?
Thanks a lot for your reply.

I'm using hbase-0.90.1-CDH3B4.

Best regards,

Mao Xu-Feng


  • Stack at Mar 9, 2011 at 7:01 pm

    On Wed, Mar 9, 2011 at 2:05 AM, 茅旭峰 wrote:
    It works 'well'. The problem is that after some stress tests (say, launching
    20 threads that put data through the REST API, each value about 4 MB in
    size), the HMaster always shuts down due to a ZooKeeper session timeout.
    What is 4MB? Each of your entries?

    Regarding the shutdown of the RS because of ZK session loss, can you
    correlate the session expiration with anything? Swapping? Or a long GC
    pause?

    After the timeout I tried to restart the HMaster, and then I
    saw lots of

    ====
    2011-03-09 17:56:16,032 DEBUG
    org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Pushed=107 entries
    from
    hdfs://cloud135:9000/hbase/.logs/cloud136,60020,1299572235779/cloud136%3A60020.1299636326073
    2011-03-09 17:56:16,032 DEBUG
    org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog 134 of
    219:

    This is the master splitting WAL files. It must complete before
    regions can come back online.

    What's interesting about the above is the number of log files we're
    splitting. We have to split 219 files before we get back online.
    That's a lot, and it'll take a bunch of time. I wonder whether your
    larger-than-usual cell sizes are what causes the run-up in the number
    of hlogs? On the regionserver, you should see messages about it trying
    to clear the WAL logs. Do you?

    in the log file. This recovery takes a pretty long time, and it eventually
    leads to a Java heap out-of-memory error, even though I set HBASE_HEAPSIZE
    to 4000 in conf/hbase-site.xml.
    This is also interesting. Again, I'd guess it's your big cells that are
    bringing on this new condition. A bunch of work was done before the
    0.90 release to defend against log splitting bringing on OOMEs, but
    your big cells seem to bring it on. Can you post more of your master
    log?

    St.Ack

  • 茅旭峰 at Mar 10, 2011 at 3:24 am
    Thanks Stack for your reply!

    Yes, our application uses big cells, ranging from 4 MB to 15 MB per entry.

    Regarding shutting down of RS because of ZK session loss,

    The master logs are below,

    ========
    2011-03-09 10:14:37,511 INFO org.apache.hadoop.hbase.master.CatalogJanitor:
    Scanned 6943 catalog row(s) and gc'd 1 unreferenced parent region(s)
    2011-03-09 10:14:39,010 INFO
    org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
    This client just lost it's session with ZooKeeper, trying to reconnect.
    2011-03-09 10:14:39,010 INFO org.apache.zookeeper.ClientCnxn: Unable to
    reconnect to ZooKeeper service, session 0x12e70bfa76b00c0 has expired,
    closing socket connection
    2011-03-09 10:14:39,010 INFO
    org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
    Trying to reconnect to zookeeper
    2011-03-09 10:14:39,011 INFO org.apache.zookeeper.ZooKeeper: Initiating
    client connection, connectString=cloud140:2181 sessionTimeout=180000
    watcher=hconnection
    2011-03-09 10:14:39,012 INFO org.apache.zookeeper.ClientCnxn: Opening socket
    connection to server cloud140/10.241.67.33:2181
    2011-03-09 10:14:39,014 INFO org.apache.zookeeper.ClientCnxn: Socket
    connection established to cloud140/10.241.67.33:2181, initiating session
    2011-03-09 10:14:39,024 INFO org.apache.zookeeper.ClientCnxn: Unable to
    reconnect to ZooKeeper service, session 0x12e70bfa76b00bf has expired,
    closing socket connection
    2011-03-09 10:14:39,024 FATAL org.apache.hadoop.hbase.master.HMaster:
    master:60000-0x12e70bfa76b00bf master:60000-0x12e70bfa76b00bf received
    expired from ZooKeeper, aborting
    org.apache.zookeeper.KeeperException$SessionExpiredException:
    KeeperErrorCode = Session expired
    at
    org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:328)
    at
    org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:246)
    at
    org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
    at
    org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
    2011-03-09 10:14:39,025 INFO org.apache.hadoop.hbase.master.HMaster:
    Aborting
    2011-03-09 10:14:39,025 INFO org.apache.zookeeper.ClientCnxn: EventThread
    shut down
    2011-03-09 10:14:39,058 INFO org.apache.zookeeper.ClientCnxn: Session
    establishment complete on server cloud140/10.241.67.33:2181, sessionid =
    0x12e70bfa76b00ce, negotiated timeout = 40000
    2011-03-09 10:14:39,074 INFO
    org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
    Reconnected successfully. This disconnect could have been caused by a
    network partition or a long-running GC pause, either way it's recommended
    that you verify your environment.
    2011-03-09 10:14:39,075 INFO org.apache.zookeeper.ClientCnxn: EventThread
    shut down
    2011-03-09 10:14:39,125 WARN org.apache.hadoop.hbase.zookeeper.ZKUtil:
    master:60000-0x12e70bfa76b00bf Unable to get data of znode
    /hbase/unassigned/445e3880a1281b2e549ac7a36c83ba4f
    org.apache.zookeeper.KeeperException$SessionExpiredException:
    KeeperErrorCode = Session expired for
    /hbase/unassigned/445e3880a1281b2e549ac7a36c83ba4f
    at
    org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
    at
    org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:921)
    at
    org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataNoWatch(ZKUtil.java:586)
    at
    org.apache.hadoop.hbase.zookeeper.ZKAssign.getDataNoWatch(ZKAssign.java:765)
    at
    org.apache.hadoop.hbase.master.AssignmentManager.handleSplitReport(AssignmentManager.java:1754)
    at
    org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:281)
    at
    org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:639)
    at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
    at
    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at
    org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
    at
    org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
    2011-03-09 10:14:39,125 ERROR
    org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher:
    master:60000-0x12e70bfa76b00bf Received unexpected KeeperException,
    re-throwing exception
    org.apache.zookeeper.KeeperException$SessionExpiredException:
    KeeperErrorCode = Session expired for
    /hbase/unassigned/445e3880a1281b2e549ac7a36c83ba4f
    at
    org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
    at
    org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:921)
    at
    org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataNoWatch(ZKUtil.java:586)
    at
    org.apache.hadoop.hbase.zookeeper.ZKAssign.getDataNoWatch(ZKAssign.java:765)
    at
    org.apache.hadoop.hbase.master.AssignmentManager.handleSplitReport(AssignmentManager.java:1754)
    at
    org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:281)
    at
    org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:639)
    at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
    at
    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at
    org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
    at
    org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
    2011-03-09 10:14:39,125 WARN
    org.apache.hadoop.hbase.master.AssignmentManager: Exception while validating
    RIT during split report
    org.apache.zookeeper.KeeperException$SessionExpiredException:
    KeeperErrorCode = Session expired for
    /hbase/unassigned/445e3880a1281b2e549ac7a36c83ba4f
    at
    org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
    at
    org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:921)
    at
    org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataNoWatch(ZKUtil.java:586)
    at
    org.apache.hadoop.hbase.zookeeper.ZKAssign.getDataNoWatch(ZKAssign.java:765)
    at
    org.apache.hadoop.hbase.master.AssignmentManager.handleSplitReport(AssignmentManager.java:1754)
    at
    org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:281)
    at
    org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:639)
    at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
    at
    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at
    org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
    at
    org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
    2011-03-09 10:14:39,516 DEBUG org.apache.hadoop.hbase.master.HMaster:
    Stopping service threads
    2011-03-09 10:14:39,516 INFO org.apache.hadoop.ipc.HBaseServer: Stopping
    server on 60000
    2011-03-09 10:14:39,516 INFO org.apache.hadoop.hbase.master.HMaster$1:
    cloud135:60000-BalancerChore exiting
    2011-03-09 10:14:39,516 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
    handler 7 on 60000: exiting
    2011-03-09 10:14:39,516 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC
    Server listener on 60000
    2011-03-09 10:14:39,516 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
    handler 3 on 60000: exiting
    2011-03-09 10:14:39,516 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
    handler 5 on 60000: exiting
    2011-03-09 10:14:39,516 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
    handler 6 on 60000: exiting
    2011-03-09 10:14:39,516 INFO org.apache.hadoop.hbase.master.CatalogJanitor:
    cloud135:60000-CatalogJanitor exiting
    2011-03-09 10:14:39,517 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
    handler 0 on 60000: exiting
    2011-03-09 10:14:39,517 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
    handler 1 on 60000: exiting
    2011-03-09 10:14:39,517 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
    handler 8 on 60000: exiting
    2011-03-09 10:14:39,517 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
    handler 2 on 60000: exiting
    2011-03-09 10:14:39,517 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC
    Server Responder
    2011-03-09 10:14:39,517 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
    handler 4 on 60000: exiting
    2011-03-09 10:14:39,517 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
    handler 9 on 60000: exiting
    2011-03-09 10:14:39,517 INFO org.apache.hadoop.hbase.master.HMaster:
    Stopping infoServer
    2011-03-09 10:14:39,517 INFO org.apache.hadoop.hbase.master.LogCleaner:
    master-cloud135:60000.oldLogCleaner exiting
    2011-03-09 10:14:39,518 INFO org.mortbay.log: Stopped
    SelectChannelConnector@0.0.0.0:60010
    2011-03-09 10:14:39,519 WARN org.apache.hadoop.hbase.zookeeper.ZKUtil:
    master:60000-0x12e70bfa76b00bf Unable to get data of znode /hbase/master
    org.apache.zookeeper.KeeperException$SessionExpiredException:
    KeeperErrorCode = Session expired for /hbase/master
    at
    org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
    at
    org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:921)
    at
    org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:549)
    at
    org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAsAddress(ZKUtil.java:620)
    at
    org.apache.hadoop.hbase.master.ActiveMasterManager.stop(ActiveMasterManager.java:180)
    at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:293)
    2011-03-09 10:14:39,520 ERROR
    org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher:
    master:60000-0x12e70bfa76b00bf Received unexpected KeeperException,
    re-throwing exception
    org.apache.zookeeper.KeeperException$SessionExpiredException:
    KeeperErrorCode = Session expired for /hbase/master
    at
    org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
    at
    org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:921)
    at
    org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:549)
    at
    org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAsAddress(ZKUtil.java:620)
    at
    org.apache.hadoop.hbase.master.ActiveMasterManager.stop(ActiveMasterManager.java:180)
    at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:293)
    2011-03-09 10:14:39,520 ERROR
    org.apache.hadoop.hbase.master.ActiveMasterManager:
    master:60000-0x12e70bfa76b00bf Error deleting our own master address node
    org.apache.zookeeper.KeeperException$SessionExpiredException:
    KeeperErrorCode = Session expired for /hbase/master
    at
    org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
    at
    org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:921)
    at
    org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:549)
    at
    org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAsAddress(ZKUtil.java:620)
    at
    org.apache.hadoop.hbase.master.ActiveMasterManager.stop(ActiveMasterManager.java:180)
    at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:293)
    2011-03-09 10:14:39,520 DEBUG
    org.apache.hadoop.hbase.catalog.CatalogTracker: Stopping catalog tracker
    org.apache.hadoop.hbase.catalog.CatalogTracker@5e29c58e
    2011-03-09 10:14:39,521 INFO
    org.apache.hadoop.hbase.master.AssignmentManager$TimeoutMonitor:
    cloud135:60000.timeoutMonitor exiting
    2011-03-09 10:14:39,621 INFO
    org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
    Closed zookeeper sessionid=0x12e70bfa76b00ce
    2011-03-09 10:14:39,623 INFO org.apache.zookeeper.ZooKeeper: Session:
    0x12e70bfa76b00ce closed
    2011-03-09 10:14:39,623 INFO org.apache.zookeeper.ClientCnxn: EventThread
    shut down
    2011-03-09 10:14:39,623 INFO org.apache.hadoop.hbase.master.HMaster: HMaster
    main thread exiting
    =======

    I'm inclined to think it's not related to swapping, a long GC pause or
    session expiration, because the timestamps indicate that all of this
    happened within a single second, '2011-03-09 10:14:39'.

    ----- You said
    What's interesting about the above is the number of log files we're
    splitting. We have to split 219 files before we get back online.
    That's a lot, and it'll take a bunch of time. I wonder whether your
    larger-than-usual cell sizes are what causes the run-up in the number
    of hlogs? On the regionserver, you should see messages about it trying
    to clear the WAL logs. Do you?
    -----

    What is the recommended or usual cell size that HBase is meant to
    support?
    I'm not clear about the 'messages about it trying to clear...' issue;
    can you explain a bit more?

    ----- You said
    This is also interesting. Again, I'd guess it's your big cells that are
    bringing on this new condition. A bunch of work was done before the
    0.90 release to defend against log splitting bringing on OOMEs, but
    your big cells seem to bring it on. Can you post more of your master
    log?
    -----

    The master log looks like
    =====
    2011-03-09 17:14:00,943 INFO
    org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Closed path
    hdfs://cloud135:9000/hbase/richard/fee41daaa54b8d629a3ed20a4c25109c/recovered.edits/0000000000003130149
    (wrote 6 edits in 95ms)
    2011-03-09 17:14:00,997 INFO
    org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Closed path
    hdfs://cloud135:9000/hbase/richard/ff3a4482f21a69b8b5773cfeedb909f1/recovered.edits/0000000000003131006
    (wrote 23 edits in 1812ms)
    2011-03-09 17:14:01,009 INFO
    org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Closed path
    hdfs://cloud135:9000/hbase/richard/ffcf64cfc0fc36ab8796c6da1b8ab668/recovered.edits/0000000000003132448
    (wrote 9 edits in 300ms)
    2011-03-09 17:14:01,021 INFO
    org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Closed path
    hdfs://cloud135:9000/hbase/richard/ffdbb9aed85f5ce3a85bd05f9c1a4336/recovered.edits/0000000000003129287
    (wrote 9 edits in 242ms)
    2011-03-09 17:14:01,026 FATAL org.apache.hadoop.hbase.master.HMaster:
    Unhandled exception. Starting shutdown.
    java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.hbase.KeyValue.readFields(KeyValue.java:1970)
    at org.apache.hadoop.hbase.KeyValue.readFields(KeyValue.java:1977)
    at
    org.apache.hadoop.hbase.regionserver.wal.WALEdit.readFields(WALEdit.java:118)
    at
    org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1766)
    at
    org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1894)
    at
    org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:198)
    at
    org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:172)
    at
    org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.parseHLog(HLogSplitter.java:429)
    at
    org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLog(HLogSplitter.java:262)
    at
    org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLog(HLogSplitter.java:188)
    at
    org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:196)
    at
    org.apache.hadoop.hbase.master.MasterFileSystem.splitLogAfterStartup(MasterFileSystem.java:180)
    at
    org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:379)
    at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:278)
    2011-03-09 17:14:01,037 INFO org.apache.hadoop.hbase.master.HMaster:
    Aborting
    2011-03-09 17:14:01,037 DEBUG org.apache.hadoop.hbase.master.HMaster:
    Stopping service threads
    2011-03-09 17:14:01,037 INFO org.apache.hadoop.ipc.HBaseServer: Stopping
    server on 60000
    2011-03-09 17:14:01,037 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
    handler 1 on 60000: exiting
    2011-03-09 17:14:01,037 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
    handler 0 on 60000: exiting
    2011-03-09 17:14:01,037 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
    handler 2 on 60000: exiting
    2011-03-09 17:14:01,037 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
    handler 9 on 60000: exiting
    2011-03-09 17:14:01,037 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
    handler 3 on 60000: exiting
    2011-03-09 17:14:01,037 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
    handler 5 on 60000: exiting
    2011-03-09 17:14:01,037 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
    handler 8 on 60000: exiting
    2011-03-09 17:14:01,038 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC
    Server listener on 60000
    2011-03-09 17:14:01,038 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
    handler 6 on 60000: exiting
    2011-03-09 17:14:01,038 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
    handler 7 on 60000: exiting
    2011-03-09 17:14:01,038 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
    handler 4 on 60000: exiting
    2011-03-09 17:14:01,038 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC
    Server Responder
    2011-03-09 17:14:01,038 INFO org.apache.hadoop.hbase.master.HMaster:
    Stopping infoServer
    2011-03-09 17:14:01,038 INFO org.apache.hadoop.hbase.master.LogCleaner:
    master-cloud135:60000.oldLogCleaner exiting
    2011-03-09 17:14:01,040 INFO org.mortbay.log: Stopped
    SelectChannelConnector@0.0.0.0:60010
    2011-03-09 17:14:01,174 DEBUG
    org.apache.hadoop.hbase.catalog.CatalogTracker: Stopping catalog tracker
    org.apache.hadoop.hbase.catalog.CatalogTracker@1f57ea4a
    2011-03-09 17:14:01,174 INFO
    org.apache.hadoop.hbase.master.AssignmentManager$TimeoutMonitor:
    cloud135:60000.timeoutMonitor exiting
    2011-03-09 17:14:01,175 INFO
    org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
    Closed zookeeper sessionid=0x22e9990fcfe0006
    2011-03-09 17:14:01,183 INFO org.apache.zookeeper.ClientCnxn: EventThread
    shut down
    2011-03-09 17:14:01,184 INFO org.apache.zookeeper.ZooKeeper: Session:
    0x22e9990fcfe0006 closed
    2011-03-09 17:14:01,254 INFO org.apache.zookeeper.ZooKeeper: Session:
    0x22e9990fcfe0005 closed
    2011-03-09 17:14:01,254 INFO org.apache.zookeeper.ClientCnxn: EventThread
    shut down
    2011-03-09 17:14:01,254 INFO org.apache.hadoop.hbase.master.HMaster: HMaster
    main thread exiting
    Wed Mar 9 17:16:33 CST 2011 Stopping hbase (via master)
    =====

    Hopefully this helps to address the issue. Thanks again for your
    reply.

  • Ted Yu at Mar 10, 2011 at 3:30 am
    java.lang.OutOfMemoryError: Java heap space
    Have you tried increasing the heap memory?
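
    (For reference, the heap for the HBase daemons is normally set via
    HBASE_HEAPSIZE in conf/hbase-env.sh rather than in hbase-site.xml; a minimal
    sketch, where the 8000 MB value is only an example, not a recommendation:)
    ===
    # conf/hbase-env.sh -- value is in MB; 8000 here is illustrative
    export HBASE_HEAPSIZE=8000
    ===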
  • 茅旭峰 at Mar 10, 2011 at 5:00 am
    Yes, I increased the heap memory to 16 GB, and then the master fell into a
    loop like

    =====
    2011-03-10 12:56:25,765 DEBUG
    org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Creating writer
    path=hdfs://cloud135:9000/hbase/richard/1992d77ac89a289dfb15e1a626e037c7/recovered.edits/0000000000001113639
    region=1992d77ac89a289dfb15e1a626e037c7
    2011-03-10 12:56:25,792 WARN
    org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found existing old
    edits file. It could be the result of a previous failed split attempt.
    Deleting
    hdfs://cloud135:9000/hbase/richard/260eb775f84574acf15edccc397ae88b/recovered.edits/0000000000001113703,
    length=3865335
    2011-03-10 12:56:25,879 WARN
    org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found existing old
    edits file. It could be the result of a previous failed split attempt.
    Deleting
    hdfs://cloud135:9000/hbase/richard/7fe64e1b343f6a4920c5a15907c7604c/recovered.edits/0000000000001112552,
    length=3280578
    2011-03-10 12:56:25,880 INFO
    org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Using syncFs
    -- HDFS-200
    =====

    and when I tried scan '-ROOT-' in the hbase shell, I got

    =====
    ERROR: org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to
    contact region server cloud134:60020 for region -ROOT-,,0, row '', but
    failed after 7 attempts.
    Exceptions:
    org.apache.hadoop.hbase.NotServingRegionException:
    org.apache.hadoop.hbase.NotServingRegionException: Region is not online:
    -ROOT-,,0
    at
    org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2321)
    at
    org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(HRegionServer.java:1766)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at
    sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at
    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at
    org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
    at
    org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
    =====

    Looks like I've lost the ROOT region?


  • 茅旭峰 at Mar 10, 2011 at 7:02 am
    It seems like there are lots of WAL files in the .logs and .oldlogs
    directories. Is there any parameter to control the size of those WAL
    files, or the frequency at which the WAL files are checked?

    Thanks a lot!
  • 陈加俊 at Mar 10, 2011 at 8:11 am
    copy your hbase-site.xml here


    --
    Thanks & Best regards
    jiajun
  • 茅旭峰 at Mar 10, 2011 at 8:17 am
    I have not set much in the XML file.

    hbase-site.xml
    ===
    <configuration>
    <property>
    <name>hbase.rootdir</name>
    <value>hdfs://cloud135:9000/hbase</value>
    </property>
    <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
    </property>
    <property>
    <name>hbase.client.keyvalue.maxsize</name>
    <value>20971520</value>
    </property>
    <property>
    <name>zookeeper.session.timeout</name>
    <value>180000</value>
    </property>
    </configuration>
    ===

    zoo.cfg
    ===
    # The number of milliseconds of each tick
    tickTime=2000
    # The number of ticks that the initial
    # synchronization phase can take
    initLimit=10
    # The number of ticks that can pass between
    # sending a request and getting an acknowledgement
    syncLimit=5
    # the directory where the snapshot is stored.
    dataDir=/home/hadoop/zookeeper
    # the port at which the clients will connect
    clientPort=2181
    server.1=cloud135:2888:3888
    server.2=cloud137:2888:3888
    server.3=cloud139:2888:3888
    ===
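
    (Side note: the master log earlier in this thread shows "negotiated timeout
    = 40000" even though zookeeper.session.timeout is 180000. The ZooKeeper
    server caps client sessions at its maxSessionTimeout, which defaults to
    20 * tickTime, i.e. 40000 ms with tickTime=2000, so the 180-second request
    gets silently reduced. A zoo.cfg sketch to raise that cap; the 200000 value
    is only an example:)
    ===
    # zoo.cfg -- raise the server-side cap on negotiated session timeouts
    maxSessionTimeout=200000
    ===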

    Thanks for your attention!

  • Stack at Mar 10, 2011 at 10:46 pm

    On Wed, Mar 9, 2011 at 11:01 PM, 茅旭峰 wrote:
    It seems like there are lots of WAL files in the .logs and .oldlogs
    directories. Is there any parameter to control the size of those WAL
    files, or the frequency at which the WAL files are checked?
    Check your regionserver logs. They should be acting to clean up WALs
    when the number is in excess of 32.
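
    Roughly, the WAL-related knobs on the regionserver side look like the sketch
    below (the values shown are the defaults as I recall them, not
    recommendations):
    ===
    <!-- hbase-site.xml (regionserver side); illustrative sketch -->
    <property>
    <name>hbase.regionserver.maxlogs</name>
    <value>32</value> <!-- flush/cleanup pressure kicks in beyond this many WALs -->
    </property>
    <property>
    <name>hbase.regionserver.logroll.period</name>
    <value>3600000</value> <!-- ms between time-based log rolls -->
    </property>
    ===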

    Is it possible that you have edits that are much larger than 15MB?

    To get your server up and running again, though probably with data loss,
    you could move the content of .logs aside.
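
    (A sketch of that, assuming hbase.rootdir=hdfs://cloud135:9000/hbase as in
    the config posted earlier; any edits still sitting in those WALs are lost:)
    ===
    bin/stop-hbase.sh
    hadoop fs -mv /hbase/.logs /hbase/.logs.saved   # park the old WALs out of the way
    bin/start-hbase.sh
    ===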

    St.Ack
  • Stack at Mar 10, 2011 at 10:44 pm

    On Wed, Mar 9, 2011 at 7:23 PM, 茅旭峰 wrote:
    Thanks Stack for your reply!

    Yes, our application uses big cells, ranging from 4 MB to 15 MB per entry.
    OK. That's kind of at the outer edge of what we are used to. Have you
    seen the 'storing blobs' discussion elsewhere on this list? You might
    consider writing such blobs to the filesystem directly. See the
    discussion of blobs that is close to this mail thread on this list.
    Regarding shutting down of RS because of ZK session loss,

    The master logs are below,

    ========
    2011-03-09 10:14:37,511 INFO org.apache.hadoop.hbase.master.CatalogJanitor:
    Scanned 6943 catalog row(s) and gc'd 1 unreferenced parent region(s)
    2011-03-09 10:14:39,010 INFO
    org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
    This client just lost it's session with ZooKeeper, trying to reconnect.
    2011-03-09 10:14:39,010 INFO org.apache.zookeeper.ClientCnxn: Unable to
    reconnect to ZooKeeper service, session 0x12e70bfa76b00c0 has expired,
    closing socket connection

    So, this looks like the master has lost its session with ZooKeeper,
    probably because of a long GC pause.

    =======

    I'm inclined to think it's not related to swapping, a long GC pause or
    session expiration, because the timestamps indicate that all of this
    happened within a single second, '2011-03-09 10:14:39'.
    What about logs before this one:

    2011-03-09 10:14:37,511 INFO org.apache.hadoop.hbase.master.CatalogJanitor:
    Scanned 6943 catalog row(s) and gc'd 1 unreferenced parent region(s)

    Are they in the same second?

    What is the recommended or usual cell size that HBase is meant to
    support?

    Usually we'll recommend you put your blobs elsewhere if they are 10MB or so.

    HBase can be made to support big blobs; it's just that we don't have much
    experience running it with big blobs, and our defaults are
    definitely not set for big cells.

    Is it possible that your cells are even larger than 15 MB? Do you have
    strict accounting of the sizes you are entering?

    I'm not clear about the 'messages about it trying to clear...' issue;
    can you explain a bit more?

    Over in the regionserver logs, you'll see it trying to catch up on the
    WAL logs. It'll log that the number of WALs (where the WALs are the
    files the master is trying to split here) is in excess of the upper
    limit of 32 (I believe this is the default).
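
    (A quick way to see how many WALs a given regionserver is carrying, sketched
    against one of the directories from your master log; the count includes the
    "Found N items" header line that hadoop fs -ls prints:)
    ===
    hadoop fs -ls /hbase/.logs/cloud136,60020,1299572235779 | wc -l
    ===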

    ----- You said
    This is also interesting. Again, I'd guess it's your big cells that are
    bringing on this new condition. A bunch of work was done before the
    0.90 release to defend against log splitting bringing on OOMEs, but
    your big cells seem to bring it on. Can you post more of your master
    log?
    -----

    Hopefully this helps to address the issue. Thanks again for your
    reply.

    The Master log doesn't tell us anything. Can we see even more of it?

    Meantime, I'm going to open an issue on big cells bringing on Master
    OOME: https://issues.apache.org/jira/browse/HBASE-3619. You can
    attach your master log to the issue.

    Thanks,

    St.Ack
