Hi Bryan, thanks so much for the prompt reply. Responses (and lovely
exceptions!) in-line:
On Sat, 16 Mar 2013 18:46:40 -0400
Bryan Beaudreault wrote:
1) Create empty dirs on the target cluster. The Backup script (and perhaps
distcp) don't seem to do this reliably.
A quick look with -lsr showed that my source and destination /hbase
trees had exactly the same number of files. I confirmed this with
lsr_diff.py from Akela, so no dice there.
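For anyone following along, the same comparison can be sketched in plain shell once the path column of each cluster's recursive listing is saved to a file (the filenames and sample paths below are illustrative, not from my clusters):

```shell
# On each cluster, keep just the path column of the recursive listing, e.g.:
#   hadoop fs -lsr /hbase | awk '{print $NF}' | sort > src.txt    (source)
#   hadoop fs -lsr /hbase | awk '{print $NF}' | sort > dest.txt   (target)
# Sample data standing in for those two listings:
printf '/hbase/users_raw/aa/d/f1\n/hbase/users_raw/bb/d/f2\n' > src.txt
printf '/hbase/users_raw/aa/d/f1\n' > dest.txt

# comm -3 prints paths unique to either side; no output means the trees match.
comm -3 src.txt dest.txt
```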
2) Do NOT copy over region directories which contained an empty "splits"
directory. These are remnants of a bug in the 0.90.x version of HBase that
I don't have on hand but basically resulted in region dirs being left
behind and not cleaned up after splits happened.
I found a splits directory in only one of the regions, so before firing
up the master and regionservers I moved that whole region out of
the /hbase tree. I still had the issue.
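In case it helps anyone else hunting for these, here is a sketch of filtering a saved -lsr listing for leftover splits directories (the sample lines are made up; on a live cluster you'd feed the real "hadoop fs -lsr /hbase" output through the same awk):

```shell
# Fake lines in the shape of "hadoop fs -lsr /hbase" output:
cat > listing.txt <<'EOF'
drwxr-xr-x - hbase hbase 0 2013-03-16 /hbase/users_raw/aaaa1111
drwxr-xr-x - hbase hbase 0 2013-03-16 /hbase/users_raw/bbbb2222/splits
EOF

# A region dir that still carries a "splits" child is a leftover from the
# old split bug; print the parent so it can be moved out of /hbase first.
awk '$NF ~ /\/splits$/ { sub(/\/splits$/, "", $NF); print $NF }' listing.txt
```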
This is the master opening the last region after boot. Everything up
until this point seems hunky-dory.
DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region users_raw,ffe04a49daf268ab45095cedd08e0c23,1360302484363.9f22d0767a1a0f2d7e62366e0ff4c6d6. has been deleted.
INFO org.apache.hadoop.hbase.master.AssignmentManager: The master has opened the region users_raw,ffe04a49daf268ab45095cedd08e0c23,1360302484363.9f22d0767a1a0f2d7e62366e0ff4c6d6. that was online on data04.chuci.org,60020,1363478141080
and then we start to get the errors on the master:
DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_FAILED_OPEN, server=data04.bob.skimlinks.com,60020,1363478141080, region=e6c3d9b765cc79585b99d6ef66122529
DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Found an existing plan for users_raw,3aa7e2b1bde7134ff042256b32db284c,1362048128482.e6c3d9b765cc79585b99d6ef66122529. destination server is data04.chuci.org,60020,1363478141080
DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for users_raw,3aa7e2b1bde7134ff042256b32db284c,1362048128482.e6c3d9b765cc79585b99d6ef66122529. so generated a random one; hri=users_raw,3aa7e2b1bde7134ff042256b32db284c,1362048128482.e6c3d9b765cc79585b99d6ef66122529., src=, dest=data01.chuci.org,60020,1363478055981; 3 (online=3, available=2) available servers
DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for e6c3d9b765cc79585b99d6ef66122529
DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; was=users_raw,3aa7e2b1bde7134ff042256b32db284c,1362048128482.e6c3d9b765cc79585b99d6ef66122529. state=CLOSED, ts=1363478405568, server=data04.chuci.org,60020,1363478141080
DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x33d759f93cb0000 Creating (or updating) unassigned node for e6c3d9b765cc79585b99d6ef66122529 with OFFLINE state
Here is the related exception on the regionserver:
DEBUG org.apache.hadoop.hbase.regionserver.StoreFile: Store file hdfs://name01.chuci.org:8020/hbase/users_raw/e6c3d9b765cc79585b99d6ef66122529/d/8726648272599185172.70f10cbd2f6384f300679cfdbac46cd5 is a top reference to hdfs://name01.chuci.org:8020/hbase/users_raw/70f10cbd2f6384f300679cfdbac46cd5/d/8726648272599185172
ERROR org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed open of region=users_raw,3aa7e2b1bde7134ff042256b32db284c,1362048128482.e6c3d9b765cc79585b99d6ef66122529., starting to roll back the global memstore size.
java.io.IOException: java.io.IOException: java.io.FileNotFoundException: File does not exist: /hbase/users_raw/70f10cbd2f6384f300679cfdbac46cd5/d/8726648272599185172
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1301)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1254)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1227)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1209)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:393)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:170)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44064)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1695)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1691)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1689)
at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:554)
at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:467)
at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:3981)
at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:3929)
at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:332)
at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:108)
at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:175)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:662)
Sure enough, I don't have a region 70f10cbd2f6384f300679cfdbac46cd5.
I think accounting for both of those issues above, regardless of whether
you use distcp or Backup, should get you closer to a successful migration.
Let me know if you have any other questions.
Well, I'm just looking for pointers now on how to proceed (and also
getting increasingly worried that I may have corruption in the 3u5
cluster as well), so any suggestions would be gratefully received.
Cheers,
Steph
--
Steph Gosling <steph@chuci.org>