HBase user mailing list, April 2011 (archived from Grokbase)
Our HBase master and region servers all came down during a quiet period
today (no read or write traffic). After inspecting the logs, I think I've
pieced together what might have happened:

1. One of the region servers (hadoop11) timed out and started closing
its regions and aborting.
2. The master started splitting logs for the hadoop11 region server.
3. While working on the first HLog, something caused the master to hang
for 4.5 minutes.
4. This was longer than the 180-second default ZooKeeper session timeout, so
the master received a session-expired event from ZooKeeper.
5. The master aborted.

Some snippets from the logs are pasted below. Does anyone know what may
have caused this? Was the hang really a garbage collection? How might I
prevent this from happening again?

Thanks,
Garrett


// From the master (hadoop01)
2011-04-28 07:59:24,004 INFO
org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer
ephemeral node deleted, processing expiration [hadoop11,60020,1302898865706]
2011-04-28 07:59:24,005 DEBUG org.apache.hadoop.hbase.master.ServerManager:
Added=hadoop11,60020,1302898865706 to dead servers, submitted shutdown handler
to be executed, root=false, meta=false
2011-04-28 07:59:24,005 INFO
org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs
for hadoop11,60020,1302898865706
[ ... ]
2011-04-28 07:59:24,984 DEBUG
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Creating writer
path=hdfs://hadoop01/hbase/processed/03819443e22488ec15418f22c67f02f5/recovered.edits/0000000000060056396
region=03819443e22488ec15418f22c67f02f5
2011-04-28 07:59:25,101 INFO
org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Using syncFs
-- HDFS-200
2011-04-28 07:59:25,101 DEBUG
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Creating writer
path=hdfs://hadoop01/hbase/processed/c56e10565f9e97fd19fb0879f59fd8b0/recovered.edits/0000000000060060517
region=c56e10565f9e97fd19fb0879f59fd8b0
[ // oh no! a 4.5-minute gap! ]
2011-04-28 08:03:58,571 DEBUG
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Pushed=77384 entries
from
hdfs://hadoop01/hbase/.logs/hadoop11,60020,1302898865706/hadoop11%3A60020.1303174692580
2011-04-28 08:03:58,572 DEBUG
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog 2 of
29:
hdfs://hadoop01/hbase/.logs/hadoop11,60020,1302898865706/hadoop11%3A60020.1303174703672,
length=63864464
[ ... ]
2011-04-28 08:03:59,055 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
273689ms instead of 1000ms, this is likely due to a long garbage collecting
pause and it's usually bad, see
http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A9
[ ... ]
2011-04-28 08:04:00,261 FATAL org.apache.hadoop.hbase.master.HMaster:
master:60000-0x12f4c26d7652910 master:60000-0x12f4c26d7652910 received
expired from ZooKeeper, aborting
org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired
at
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:328)
at
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:246)
at
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
at
org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
2011-04-28 08:04:00,261 INFO org.apache.hadoop.hbase.master.HMaster:
Aborting

// From the regionserver (hadoop11)
2011-04-28 07:41:06,851 DEBUG
org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=1013.02 MB,
free=210.38 MB, max=1.19 GB, blocks=14843, accesses=5097608, hits=43675,
hitRatio=0.85%%, cachingAccesses=1016414, cachingHits=43675,
cachingHitsRatio=4.29%%, evictions=528, evicted=957896,
evictedPerRun=1814.197021484375
2011-04-28 07:46:06,851 DEBUG
org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=1013.02 MB,
free=210.38 MB, max=1.19 GB, blocks=14843, accesses=5097608, hits=43675,
hitRatio=0.85%%, cachingAccesses=1016414, cachingHits=43675,
cachingHitsRatio=4.29%%, evictions=528, evicted=957896,
evictedPerRun=1814.197021484375
2011-04-28 07:51:06,852 DEBUG
org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=1013.02 MB,
free=210.38 MB, max=1.19 GB, blocks=14843, accesses=5097608, hits=43675,
hitRatio=0.85%%, cachingAccesses=1016414, cachingHits=43675,
cachingHitsRatio=4.29%%, evictions=528, evicted=957896,
evictedPerRun=1814.197021484375
2011-04-28 07:59:39,198 INFO org.apache.zookeeper.ClientCnxn: Client session
timed out, have not heard from server in 285991ms for sessionid
0x12f4c26d7652917, closing socket connection and attempting reconnect
2011-04-28 07:59:40,600 INFO org.apache.zookeeper.ClientCnxn: Opening socket
connection to server hadoop01/172.30.160.15:2181
2011-04-28 07:59:40,601 INFO org.apache.zookeeper.ClientCnxn: Socket
connection established to hadoop01/172.30.160.15:2181, initiating session
2011-04-28 07:59:40,602 DEBUG
org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period 3600000ms
elapsed
2011-04-28 07:59:40,652 INFO org.apache.zookeeper.ClientCnxn: Session
establishment complete on server hadoop01/172.30.160.15:2181, sessionid =
0x12f4c26d7652917, negotiated timeout = 40000
2011-04-28 07:59:41,288 INFO org.apache.zookeeper.ClientCnxn: Client session
timed out, have not heard from server in 285993ms for sessionid
0x12f4c26d7652920, closing socket connection and attempting reconnect
[ ... ]
2011-04-28 07:59:23,174 INFO
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
Reconnected successfully. This disconnect could have been caused by a
network partition or a long-running GC pause, either way it's recommended
that you verify your environment.
2011-04-28 07:59:23,174 INFO org.apache.zookeeper.ClientCnxn: EventThread
shut down
2011-04-28 07:59:24,004 INFO org.apache.zookeeper.ClientCnxn: Unable to read
additional data from server sessionid 0x12f4c26d7652920, likely server has
closed socket, closing socket connection and attempting reconnect
2011-04-28 07:59:25,773 INFO org.apache.zookeeper.ClientCnxn: Opening socket
connection to server hadoop01/172.30.160.15:2181
2011-04-28 07:59:25,774 INFO org.apache.zookeeper.ClientCnxn: Socket
connection established to hadoop01/172.30.160.15:2181, initiating session
2011-04-28 07:59:25,775 INFO org.apache.zookeeper.ClientCnxn: Unable to
reconnect to ZooKeeper service, session 0x12f4c26d7652920 has expired,
closing socket connection
2011-04-28 07:59:25,775 FATAL
org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
serverName=hadoop11.sdc2.svcmot.com,60020,1302898865706, load=(requests=0,
regions=46, usedHeap=4130, maxHeap=6117):
regionserver:60020-0x12f4c26d7652920-0x12f4c26d7652920
regionserver:60020-0x12f4c26d7652920-0x12f4c26d7652920 received expired from
ZooKeeper, aborting
org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired
at
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:328)
at
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:246)
at
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
at
org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)


  • Ted Dunning at Apr 29, 2011 at 6:36 am
    Swap and gc are the usual culprits for this.

    Are you running a recent enough version to have Todd's wondrous mslab
    option? (A rough config sketch follows at the end of this message.)
    On Thu, Apr 28, 2011 at 9:48 PM, Garrett Wu wrote:

    Some snippets from the logs are pasted below. Does anyone know what may
    have caused this? Was the hang really a garbage collection? How might I
    prevent this from happening again?
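
    A rough hbase-site.xml sketch of the MSLAB option mentioned above (the
    property names are the standard 0.90.x ones; the chunk-size value is an
    illustrative assumption, not something taken from this thread):

    <!-- Enable the MemStore-Local Allocation Buffer (MSLAB) to cut down on
         old-generation fragmentation and the long CMS pauses it causes. -->
    <property>
      <name>hbase.hregion.memstore.mslab.enabled</name>
      <value>true</value>
    </property>
    <!-- Illustrative chunk size (2 MB is the usual default). -->
    <property>
      <name>hbase.hregion.memstore.mslab.chunksize</name>
      <value>2097152</value>
    </property>
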
  • Michel Segel at Apr 29, 2011 at 3:59 pm
    +1 on Todd's GC stuff. I just implemented it and it looks like there's a bit of improvement. I'm still early on in my testing...

    Sent from a remote device. Please excuse any typos...

    Mike Segel
    On Apr 29, 2011, at 1:35 AM, Ted Dunning wrote:

    Swap and gc are the usual culprits for this.

    Are you running a recent enough version to have Todd's wondrous mslab
    option?
    On Thu, Apr 28, 2011 at 9:48 PM, Garrett Wu wrote:

    Some snippets from the logs are pasted below. Does anyone know what may
    have caused this? Was the hang really a garbage collection? How might I
    prevent this from happening again?
  • Stack at Apr 29, 2011 at 4:14 pm

    On Thu, Apr 28, 2011 at 9:48 PM, Garrett Wu wrote:
    2. The master started splitting logs for the hadoop11 region server.
    Splitting logs is a pretty intense operation. It could have run up
    your master heap and CPU such that it brought on a long GC pause.
    Look for a big gap in your logging just before the zk session timeout.
    (In 0.92.0, log splitting is distributed across the cluster; the
    master does nought but orchestrate the process.)

    In your regionserver log, there does seem to be a big pause just
    before the zk session timeout. Similarly in the master (at about the
    point where you've inserted the 'oh no').

    For more on what Ted and Michel are referring to, see
    http://hbase.apache.org/book.html#jvm; a rough session-timeout sketch
    follows at the end of this message.

    St.Ack
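
    As a rough sketch of the timeout side of this (GC flags themselves go in
    hbase-env.sh and depend on your heap, so they are left out here), the
    session timeout HBase requests from ZooKeeper can be raised in
    hbase-site.xml. The 120000 ms value below is an illustrative assumption,
    not a recommendation from this thread:

    <!-- Session timeout HBase asks ZooKeeper for. Raising it buys headroom
         against GC pauses at the cost of slower detection of truly dead
         servers. Note the granted value is also capped by the ZooKeeper
         server's maxSessionTimeout; the regionserver log above shows it was
         negotiated down to 40000 ms. -->
    <property>
      <name>zookeeper.session.timeout</name>
      <value>120000</value>
    </property>

    Either way, the more durable fix is keeping GC pauses short (MSLAB above,
    plus the JVM tuning in the book chapter linked here) rather than only
    lengthening the timeout.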

  • Garrett Wu at Apr 29, 2011 at 10:12 pm
    Thanks for the info guys, it was all very helpful.

    I do have the mslab option enabled, but it looks like this was just a GC
    pause triggered by the load of log splitting, as Stack points out. Glad to
    see distributed log splitting in 0.92.

    Garrett
