Root table couldn't be opened
In my cluster (version 0.90.3), the ROOT table couldn't be opened when one region server crashed because of GC.

The logs show:

// Master assigned the root table to 82
2011-07-28 21:34:34,710 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region -ROOT-,,0.70236052 on 158-1-101-82,20020,1311885942386

// The host of 82 crashed; the master finished splitting its logs and went on to reassign ROOT and META. But the region server process didn't exit, so the ROOT verification passed.
I think we shouldn't verify ROOT/META in the shutdown handler processing (see the sketch after the region server logs below).

2011-07-28 22:19:53,746 DEBUG org.apache.hadoop.hbase.master.ServerManager: Added=158-1-101-82,20020,1311885942386 to dead servers, submitted shutdown handler to be executed, root=true, meta=true
2011-07-28 22:25:10,085 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: hlog file splitting completed in 316329 ms for hdfs://158.1.101.82:9000/hbase/.logs/158-1-101-82,20020,1311885942386
2011-07-28 22:26:54,790 DEBUG org.apache.hadoop.hbase.master.HMaster: Not running balancer because processing dead regionserver(s): [158-1-101-82,20020,1311885942386]
2011-07-28 22:27:11,176 WARN org.apache.hadoop.hbase.master.CatalogJanitor: Failed scan of catalog table
java.net.SocketTimeoutException: Call to 158-1-101-82/158.1.101.82:20020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/158.1.101.82:57428 remote=158-1-101-82/158.1.101.82:20020]
at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:802)
at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:775)
at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
at $Proxy6.getRegionInfo(Unknown Source)
at org.apache.hadoop.hbase.catalog.CatalogTracker.verifyRegionLocation(CatalogTracker.java:426)
at org.apache.hadoop.hbase.catalog.CatalogTracker.getMetaServerConnection(CatalogTracker.java:273)
at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:333)
at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnectionDefault(CatalogTracker.java:366)
at org.apache.hadoop.hbase.catalog.MetaReader.fullScan(MetaReader.java:255)
at org.apache.hadoop.hbase.catalog.MetaReader.fullScan(MetaReader.java:237)
at org.apache.hadoop.hbase.master.CatalogJanitor.scan(CatalogJanitor.java:139)
at org.apache.hadoop.hbase.master.CatalogJanitor.chore(CatalogJanitor.java:88)
at org.apache.hadoop.hbase.Chore.run(Chore.java:66)
Caused by: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/158.1.101.82:57428 remote=158-1-101-82/158.1.101.82:20020]
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:165)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
at java.io.FilterInputStream.read(FilterInputStream.java:116)
at org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(HBaseClient.java:299)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
at java.io.DataInputStream.readInt(DataInputStream.java:370)
at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:539)
at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:477)
2011-07-28 22:28:30,577 DEBUG org.apache.hadoop.hbase.master.ServerManager: Server REPORT rejected; currently processing 158-1-101-82,20020,1311885942386 as dead server
2011-07-28 22:28:37,591 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x23171e103d10018 Creating (or updating) unassigned node for 1028785192 with OFFLINE state
2011-07-28 22:28:37,704 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for .META.,,1.1028785192 so generated a random one; hri=.META.,,1.1028785192, src=, dest=158-1-101-202,20020,1311878322145; 2 (online=2, exclude=null) available servers
2011-07-28 22:28:37,704 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region .META.,,1.1028785192 to 158-1-101-202,20020,1311878322145
2011-07-28 22:28:37,733 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=M_ZK_REGION_OFFLINE, server=158-1-101-82:20000, region=1028785192/.META.
2011-07-28 22:28:37,766 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=158-1-101-202,20020,1311878322145, region=1028785192/.META.



Region server logs:
2011-07-28 22:19:17,389 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction requested for ufdr7,,1311890810267.c1b556627c511076bafbb1b802589cb6. because regionserver20020.cacheFlusher; priority=4, compaction queue size=1

// The region server was blocked (in a GC pause) for a long time.
2011-07-28 22:28:24,829 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 552455ms for sessionid 0x13171e103d7003e, closing socket connection and attempting reconnect
2011-07-28 22:28:24,829 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 552455ms for sessionid 0x23171e103d10020, closing socket connection and attempting reconnect
2011-07-28 22:28:25,186 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server /158.1.101.222:2181
2011-07-28 22:28:25,838 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server /158.1.101.82:2181
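
To make the sequence of events easier to follow, here is a minimal, self-contained Java sketch of the race described above. The classes and method names are simplified stand-ins written for illustration only (they are not HBase code); the timestamps in the comments come from the logs:

import java.util.HashSet;
import java.util.Set;

public class RootAssignmentRace {

    // Simplified region server: "processAlive" means the JVM still answers RPCs,
    // even though its ZooKeeper session has already expired.
    static class RegionServer {
        final String name;
        boolean processAlive = true;
        boolean hostsRoot = true;   // it was serving -ROOT- before the GC pause

        RegionServer(String name) { this.name = name; }

        // Stand-in for the getRegionInfo RPC that the master uses (via
        // CatalogTracker.verifyRegionLocation) to check a catalog region's location.
        boolean answersGetRegionInfoForRoot() {
            return processAlive && hostsRoot;
        }
    }

    public static void main(String[] args) {
        RegionServer rs82 = new RegionServer("158-1-101-82,20020,1311885942386");
        Set<String> deadServers = new HashSet<String>();

        // 22:19:53 - the ZK session of 82 expires during a long GC pause; the master
        // starts dead-server processing although the JVM on 82 never exited.
        deadServers.add(rs82.name);

        // 22:25:10 - log splitting finishes; before reassigning -ROOT-, the master
        // checks whether -ROOT- is still reachable at its recorded location.
        // (The fix discussed later in the thread would consult deadServers here
        // instead of trusting the RPC answer.)
        boolean rootVerified = rs82.answersGetRegionInfoForRoot(); // 82 is out of GC: true!
        System.out.println(rootVerified
                ? "ROOT 'verified' on the dead server -> master skips reassigning it"
                : "ROOT unreachable -> master reassigns it");

        // 22:28:30 - the server's report is rejected ("currently processing ... as
        // dead server"), so the real region server aborts shortly afterwards.
        rs82.processAlive = false;

        // Result: -ROOT- is no longer served anywhere, yet it was never reassigned.
        System.out.println("ROOT still reachable afterwards? " + rs82.answersGetRegionInfoForRoot());
    }
}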


  • Ted Yu at Aug 11, 2011 at 2:16 am
    I think this is HBASE-4168, whose fix was committed today.
    We run 0.90.4 patched with HBASE-4168.

    Please try the patch.
  • Gaojinchao at Aug 15, 2011 at 8:17 am
    Thanks for your reply. It seems HBASE-4168 can't fix my case: in my case there is no error in the master logs.
    The master can reach the region server and get the region information.

    The master logs say that the dead server was carrying ROOT and META, but the master only assigned META and skipped ROOT.

    // It says that this is a region server carrying ROOT and META.
    2011-07-28 22:19:53,746 DEBUG org.apache.hadoop.hbase.master.ServerManager: Added=158-1-101-82,20020,1311885942386 to dead servers, submitted shutdown handler to be executed, root=true, meta=true
    2011-07-28 22:25:10,085 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: hlog file splitting completed in 316329 ms for hdfs://158.1.101.82:9000/hbase/.logs/158-1-101-82,20020,1311885942386
    2011-07-28 22:26:54,790 DEBUG org.apache.hadoop.hbase.master.HMaster: Not running balancer because processing dead regionserver(s): [158-1-101-82,20020,1311885942386]
    2011-07-28 22:27:11,176 WARN org.apache.hadoop.hbase.master.CatalogJanitor: Failed scan of catalog table
    2011-07-28 22:28:30,577 DEBUG org.apache.hadoop.hbase.master.ServerManager: Server REPORT rejected; currently processing 158-1-101-82,20020,1311885942386 as dead server
    // It shows that META was assigned, but there are no such logs for ROOT. The exception above came from the CatalogJanitor, not from the shutdown handler.
    2011-07-28 22:28:37,591 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x23171e103d10018 Creating (or updating) unassigned node for 1028785192 with OFFLINE state
    2011-07-28 22:28:37,704 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for .META.,,1.1028785192 so generated a random one; hri=.META.,,1.1028785192, src=, dest=158-1-101-202,20020,1311878322145; 2 (online=2, exclude=null) available servers
    2011-07-28 22:28:37,704 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region .META.,,1.1028785192 to 158-1-101-202,20020,1311878322145
    2011-07-28 22:28:37,733 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=M_ZK_REGION_OFFLINE, server=158-1-101-82:20000, region=1028785192/.META.
    2011-07-28 22:28:37,766 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=158-1-101-202,20020,1311878322145, region=1028785192/.META.
  • Stack at Aug 16, 2011 at 4:12 am

    On Wed, Aug 10, 2011 at 7:05 PM, Gaojinchao wrote:
    In my cluster (version 0.90.3), the ROOT table couldn't be opened when one region server crashed because of GC.

    The logs show:

    // Master assigned the root table to 82
    2011-07-28 21:34:34,710 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region -ROOT-,,0.70236052 on 158-1-101-82,20020,1311885942386

    // The host of 82 crashed; the master finished splitting its logs and went on to reassign ROOT and META. But the region server process didn't exit, so the ROOT verification passed.
    I think we shouldn't verify ROOT/META in the shutdown handler processing.

    82 did not exit?

    Why did the master replay its logs if it did not exit?

    2011-07-28 22:19:53,746 DEBUG org.apache.hadoop.hbase.master.ServerManager: Added=158-1-101-82,20020,1311885942386 to dead servers, submitted shutdown handler to be executed, root=true, meta=true

    Isn't this the master handling 82 like it's been shut down?

    2011-07-28 22:28:30,577 DEBUG org.apache.hadoop.hbase.master.ServerManager: Server REPORT rejected; currently processing 158-1-101-82,20020,1311885942386 as dead server
    So it looks like 82 tried to come back in (after GC, I suppose) but we
    told it to go away.

    Why did we not notice that -ROOT- was on 82 and, as part of the
    shutdown handling of 82, reassign it? This is what you are saying
    in your subsequent message (I like how you noticed the log
    message that says 82 has root and meta). I'm not sure why it did
    not reassign root. Either it is skipping something in the shutdown
    handler, or the verify-location check for root has a bug in that
    we are not considering the fact that the current (dead) server
    held -ROOT-: if verification returns the current server as
    holding -ROOT-, then we should ignore it.

    Good stuff Gao,
    St.Ack
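
    A minimal sketch of the guard described above, written as simplified
    stand-in code (not the actual shutdown-handler or CatalogTracker
    implementation): if the server that answers the -ROOT- location check is
    the very server being processed as dead, the answer should not be trusted
    and -ROOT- should be reassigned anyway.

    public class RootVerificationGuard {

        // Decide whether -ROOT- must be reassigned while processing deadServer.
        // verifiedLocation is the server that answered the location check, or
        // null if nothing answered.
        static boolean rootNeedsReassign(String deadServer, String verifiedLocation) {
            if (verifiedLocation == null) {
                return true;    // nobody is serving -ROOT-: reassign it
            }
            // Verification "passed", but possibly only because the GC-paused
            // process on the dead server answered: ignore that answer.
            return verifiedLocation.equals(deadServer);
        }

        public static void main(String[] args) {
            String dead = "158-1-101-82,20020,1311885942386";
            System.out.println(rootNeedsReassign(dead, dead));     // true  -> reassign
            System.out.println(rootNeedsReassign(dead, "158-1-101-202,20020,1311878322145")); // false -> leave it
            System.out.println(rootNeedsReassign(dead, null));     // true  -> reassign
        }
    }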
  • Gaojinchao at Aug 16, 2011 at 4:24 am
    Why did the master replay its logs if it did not exit?
    The ZK session expired because of GC, but the region server process wasn't shut down.

    (I like how you noticed the log message that says 82 has root and meta)

    Added=158-1-101-82,20020,1311885942386 to dead servers, submitted shutdown handler to be executed, root=true, meta=true
    It says that 82 had ROOT and META: "root=true" shows that the dead region server was carrying the ROOT table.


  • Stack at Aug 16, 2011 at 4:34 am

    On Mon, Aug 15, 2011 at 9:23 PM, Gaojinchao wrote:
    Why did the master replay its logs if it did not exit?
    Sorry. Which logs?
    The ZK session expired because of GC, but the region server process wasn't shut down.
    Right, but it probably went down soon after it came out of GC, right?

    (I like how you noticed the log message that says 82 has root and meta)

    Added=158-1-101-82,20020,1311885942386 to dead servers, submitted shutdown handler to be executed, root=true, meta=true
    It says that 82 had ROOT and META: "root=true" shows that the dead region server was carrying the ROOT table.
    So, you think there is a bug in our shutdown handler where we are not
    doing -ROOT- processing properly?
    St.Ack
  • Gaojinchao at Aug 16, 2011 at 6:47 am
    Yes. I think the shutdown handler should not verify the ROOT/META location on a machine that is already included in the dead-servers set (see the sketch after the logs below).

    2011-07-28 22:25:09,336 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Waiting for split writer threads to finish
    2011-07-28 22:25:09,450 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Split writers finished
    2011-07-28 22:25:09,602 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Closed path hdfs://158.1.101.82:9000/hbase/.META./1028785192/recovered.edits/0000000000000025786 (wrote 121 edits in 2567ms)
    2011-07-28 22:25:09,860 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Closed path hdfs://158.1.101.82:9000/hbase/ufdr5/745550eb514e441f31ff26dbde8402ae/recovered.edits/0000000000000617740 (wrote 211642 edits in 141887ms)

    // Log splitting finished and the handler went to assign the ROOT table first. At the same time, the region server came out of GC, so verifying ROOT passed.
    2011-07-28 22:25:10,085 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: hlog file splitting completed in 316329 ms for hdfs://158.1.101.82:9000/hbase/.logs/158-1-101-82,20020,1311885942386

    // The region server's report is rejected, so the region server will shut itself down.
    2011-07-28 22:28:30,577 DEBUG org.apache.hadoop.hbase.master.ServerManager: Server REPORT rejected; currently processing 158-1-101-82,20020,1311885942386 as dead server
    2011-07-28 22:28:37,591 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x23171e103d10018 Creating (or updating) unassigned node for 1028785192 with OFFLINE state
    2011-07-28 22:28:37,704 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for .META.,,1.1028785192 so generated a random one; hri=.META.,,1.1028785192, src=, dest=158-1-101-202,20020,1311878322145; 2 (online=2, exclude=null) available servers
    2011-07-28 22:28:37,704 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region .META.,,1.1028785192 to 158-1-101-202,20020,1311878322145
    2011-07-28 22:28:37,733 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=M_ZK_REGION_OFFLINE, server=158-1-101-82:20000, region=1028785192/.META.

    All of the logs are in the attachment.
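
    As a companion to the sketch after Stack's earlier message, here is the
    check proposed at the top of this message (skip verification entirely when
    the recorded location is already in the dead-servers set), again as
    simplified stand-in code rather than the real HBase implementation:

    import java.util.Collections;
    import java.util.Set;

    public class SkipVerifyForDeadServers {

        // Only verify a catalog region's recorded location if that host is not
        // already known to be dead; otherwise just reassign the region.
        static boolean shouldVerifyLocation(String recordedLocation, Set<String> deadServers) {
            // Asking a dead server is pointless at best and misleading at
            // worst: a GC-paused but still-running process may answer the RPC.
            return recordedLocation != null && !deadServers.contains(recordedLocation);
        }

        public static void main(String[] args) {
            Set<String> dead = Collections.singleton("158-1-101-82,20020,1311885942386");
            System.out.println(shouldVerifyLocation("158-1-101-82,20020,1311885942386", dead));  // false -> reassign directly
            System.out.println(shouldVerifyLocation("158-1-101-202,20020,1311878322145", dead)); // true  -> safe to verify
        }
    }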

