Hi all,

I'm trying to copy our HBase data from a 3u5 cluster to a new 4.2 cluster,
but I'm having serious difficulty getting the destination running.

Right now, broadly, I'm:

1. Shutting down HBase on the source,
2. Running distcp over hftp from the source namenode to the destination
   HDFS (roughly the command below),
3. Checking that the map job finishes cleanly and that permissions are
   correct,
4. Starting the destination.
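
The distcp step is essentially this (hostnames are placeholders for ours):

    # run on the destination (CDH4) cluster with source HBase down; hftp is
    # read-only and version-independent, so the newer MapReduce does the copy.
    # -p preserves permissions/ownership on the copied files.
    hadoop distcp -p \
        hftp://source-nn.example.com:50070/hbase \
        hdfs://dest-nn.example.com:8020/hbase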

The destination master and regionservers come up clean and regions get
assigned, but then the master goes into an endless reassignment loop for
regions that weren't online, and I get a slew of exceptions on the
regionservers about directories not existing in HDFS. I don't have the
exact messages to hand (I'll follow up once this attempt at the copy
finishes). Finally, the regionservers seem to run out of file descriptors,
leaving sockets in CLOSE_WAIT talking to other datanodes. The source DB is
consistent and runs fine; the destination shows some holes, but doing an
offline ROOT and .META. rebuild seems clean.
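
For reference, the rebuild was along these lines (from memory; the tool
reconstructs ROOT and .META. from the region directories on HDFS):

    # run only while HBase is stopped
    hbase org.apache.hadoop.hbase.util.hbck.OfflineMetaRepair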

My question (in the absence of exceptions to show) is this: I half remember
reading somewhere (this may have been from upstream) that HBase upgrades
between consecutive major versions would be clean, but with 3u5 → 4.2 I'm
jumping from 0.90.6 to 0.94.2. Does this actually work, or do I need to
upgrade to 4.0 or 4.1 first?

--


  • Bryan Beaudreault at Mar 16, 2013 at 10:47 pm
We've recently and successfully migrated multiple CDH3 clusters (ranging
from 3u2 to 3u5) directly to CDH4.2. We did it similarly to you, copying to
a brand-new cluster. Instead of distcp we used a modified version of
Mozilla's Backup job, available at
https://github.com/mozilla-metrics/akela. Our first few test runs ran into
problems similar to the ones you've noticed, so the modifications we needed
to make were:

1) Create empty dirs on the target cluster. The Backup script (and perhaps
distcp) doesn't seem to copy these reliably.
2) Do NOT copy over region directories that contain an empty "splits"
directory. These are remnants of a bug in the 0.90.x line of HBase (I don't
have the JIRA number on hand) that basically resulted in region dirs being
left behind and not cleaned up after splits happened. A rough sketch of
both checks follows this list.
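
Both checks are scriptable against the copied /hbase; a rough sketch
(paths and patterns assumed, adjust to taste):

    # 1) paths present on the source but missing on the destination
    #    (empty dirs are what the copy tools tend to skip); run the
    #    listing on each cluster, then diff the two files:
    hadoop fs -lsr /hbase | awk '{print $NF}' | sort > hbase-tree.txt
    diff source-tree.txt dest-tree.txt

    # 2) region dirs still carrying a leftover "splits" directory,
    #    to exclude from the copy:
    hadoop fs -lsr /hbase | awk '{print $NF}' | grep '/splits$'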

We made other minor improvements as well, got the job compiling under
CDH4.2, and wrote a bunch of Fabric scripts for automation; we're hoping to
open-source it at some point in the near future. There are a number of
higher priorities right now, though, so that timeline probably doesn't jibe
with yours.

I think accounting for both of the issues above, regardless of whether you
use distcp or Backup, should get you closer to a successful migration.

    Let me know if you have any other questions.

    - Bryan

    --
  • Steph Gosling at Mar 17, 2013 at 12:53 am
    Hi Brian, thanks so much for the prompt reply. Response (and lovely
    exceptions!) in-line:


On Sat, 16 Mar 2013 18:46:40 -0400, Bryan Beaudreault wrote:

> 1) Create empty dirs on the target cluster. The Backup script (and
> perhaps distcp) doesn't seem to copy these reliably.

A quick look with -lsr showed that my source and destination /hbase had
exactly the same number of files, and I confirmed this with lsr_diff.py
from Akela, so no dice there.
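
The count check was simply (run against each cluster):

    # count files (non-directories) under /hbase
    hadoop fs -lsr /hbase | grep -v '^d' | wc -l
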
> 2) Do NOT copy over region directories that contain an empty "splits"
> directory. These are remnants of a bug in the 0.90.x line of HBase (I
> don't have the JIRA number on hand) that basically resulted in region
> dirs being left behind and not cleaned up after splits happened.

I found a "splits" directory in only one of the regions, so before firing
up the master and regionservers I moved that whole region out of the
/hbase tree (a plain rename, sketched below). Still had the issue.
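
The move was just (the quarantine path is one I made up):

    hadoop fs -mkdir /hbase_quarantine
    hadoop fs -mv /hbase/users_raw/<region-with-splits-dir> /hbase_quarantine/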

    This is the master opening the last region after boot. Everything up
    until this point seems hunky-dory.

    DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region users_raw,ffe04a49daf268ab45095cedd08e0c23,1360302484363.9f22d0767a1a0f2d7e62366e0ff4c6d6. has been deleted.
    INFO org.apache.hadoop.hbase.master.AssignmentManager: The master has opened the region users_raw,ffe04a49daf268ab45095cedd08e0c23,1360302484363.9f22d0767a1a0f2d7e62366e0ff4c6d6. that was online on data04.chuci.org,60020,1363478141080

    and then we start to get the errors on the master:

    DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_FAILED_OPEN, server=data04.bob.skimlinks.com,60020,1363478141080, region=e6c3d9b765cc79585b99d6ef66122529
    DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Found an existing plan for users_raw,3aa7e2b1bde7134ff042256b32db284c,1362048128482.e6c3d9b765cc79585b99d6ef66122529. destination server is data04.chuci.org,60020,1363478141080
    DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for users_raw,3aa7e2b1bde7134ff042256b32db284c,1362048128482.e6c3d9b765cc79585b99d6ef66122529. so generated a random one; hri=users_raw,3aa7e2b1bde7134ff042256b32db284c,1362048128482.e6c3d9b765cc79585b99d6ef66122529., src=, dest=data01.chuci.org,60020,1363478055981; 3 (online=3, available=2) available servers
    DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for e6c3d9b765cc79585b99d6ef66122529
    DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; was=users_raw,3aa7e2b1bde7134ff042256b32db284c,1362048128482.e6c3d9b765cc79585b99d6ef66122529. state=CLOSED, ts=1363478405568, server=data04.chuci.org,60020,1363478141080
    DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x33d759f93cb0000 Creating (or updating) unassigned node for e6c3d9b765cc79585b99d6ef66122529 with OFFLINE state

    Here is the related exception on the regionserver

    DEBUG org.apache.hadoop.hbase.regionserver.StoreFile: Store file hdfs://name01.chuci.org:8020/hbase/users_raw/e6c3d9b765cc79585b99d6ef66122529/d/8726648272599185172.70f10cbd2f6384f300679cfdbac46cd5 is a top reference to hdfs://name01.chuci.org:8020/hbase/users_raw/70f10cbd2f6384f300679cfdbac46cd5/d/8726648272599185172
    ERROR org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed open of region=users_raw,3aa7e2b1bde7134ff042256b32db284c,1362048128482.e6c3d9b765cc79585b99d6ef66122529., starting to roll back the global memstore size.
    java.io.IOException: java.io.IOException: java.io.FileNotFoundException: File does not exist: /hbase/users_raw/70f10cbd2f6384f300679cfdbac46cd5/d/8726648272599185172
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1301)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1254)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1227)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1209)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:393)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:170)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44064)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1695)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1691)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1689)

    at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:554)
    at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:467)
    at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:3981)
    at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:3929)
    at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:332)
    at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:108)
    at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:175)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
    at java.lang.Thread.run(Thread.java:662)

Sure enough, I don't have a region 70f10cbd2f6384f300679cfdbac46cd5.

> I think accounting for both of the issues above, regardless of whether
> you use distcp or Backup, should get you closer to a successful
> migration.
>
> Let me know if you have any other questions.

Well, I'm just looking for pointers now on how to proceed (and also
getting increasingly worried that I may have corruption in the 3u5
cluster as well), so any suggestions would be gratefully received.

    Cheers,

    Steph

    --
    Steph Gosling <steph@chuci.org>

    --
  • Bryan Beaudreault at Mar 18, 2013 at 3:35 pm
    Steph,

We were also concerned about corruption in the 3u5 cluster, and while I
wouldn't call it corruption, I would call it inconsistent state.
Bookkeeping in the later versions of HBase has improved, so hopefully
you'll be in a better position once you get this successfully migrated --
we have been.

During my migrations I never saw an exception regarding "X is a top
reference to Y". It sounds like another issue with splits not being
cleaned up, though. What I would try is using the HFile tool to read the
keys in the region that is a top reference
(e6c3d9b765cc79585b99d6ef66122529) to find the first and last key, then
using the HBase API to find which region those keys actually belong to.
If they belong to a different region, then this region is probably a
zombie that was never cleaned up after a previous split. In that case you
could probably move it out of the way.
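
A rough sketch of that check, using the 0.94 tool names (the store file
path is a placeholder -- point it at a real HFile rather than the
reference file itself, since references are just pointers):

    # -e prints only the keys of an HFile, so you can see its first
    # and last row:
    hbase org.apache.hadoop.hbase.io.hfile.HFile -e -f \
        /hbase/users_raw/<region>/d/<storefile>

    # then compare those rows against the region boundaries in .META.:
    echo "scan '.META.', {COLUMNS => 'info:regioninfo'}" | hbase shell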

    Another alternative would be to incrementally move out of the way all
    regions that give you a problem as you try to start up. Once you get it up
    and running, bulk load the moved aside regions back into the table. Since
    timestamp is part of each KeyValue this will be an idempotent operation.
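
Sketch of the bulk load, assuming you park the problem regions under a
made-up /hbase_rescue dir (LoadIncrementalHFiles expects a directory
whose subdirs are column families full of HFiles, which is exactly a
region dir's layout):

    hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles \
        /hbase_rescue/<region-dir> users_raw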

    As far as I know, region references should go away once a split is
    completed and transitioned. So the final suggestion would be to just
    proactively move away all regions that seem to be a reference to another
    region. This is probably easily seen from hdfs, and again you could bulk
    load them back in if you are concerned about losing data.
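
For example (hedging on the exact naming, but reference files look like
<hfile-id>.<parent-region-hash>, as in the one from your log):

    # list store files whose name ends in ".<32-hex parent region>":
    hadoop fs -lsr /hbase/users_raw | awk '{print $NF}' | grep -E '\.[0-9a-f]{32}$'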

    Let me know how those go.

    - Bryan

    --
  • Steph Gosling at Mar 18, 2013 at 11:31 pm
    Bryan,

Thanks for the input and the suggestions. After some back and forth with
my developers we've decided to start afresh on the new version and
backfill what we can with batch jobs after the fact; apparently that
existing data isn't time-sensitive, though recollecting it would be a
pain.

With regards to corruption vs. inconsistency, I'm hoping that I'm in the
latter camp, but the fact that I've not managed to complete a CopyTable
on that table without one of the MR tasks running out of heap (even
giving them 10x more memory than the largest StoreFile) doesn't fill me
with hope.

With your last suggestion, I'm curious what you mean when you say that
irregularities could be seen from HDFS. Do you mean programmatically
looking for filenames ending in .xyz and then looking for regions with
the same name (or the absence thereof), or something else?

Thanks again for the help (and sincerest apologies for misspelling your
name earlier).

    Cheers,

    Steph


    --
    Steph Gosling <steph@chuci.org>

    --
  • Bryan Beaudreault at Mar 19, 2013 at 4:14 pm
    Hey Steph,

    Responses inline.

    On Mon, Mar 18, 2013 at 7:31 PM, Steph Gosling wrote:

> Bryan,
>
> Thanks for the input and the suggestions. After some back and forth
> with my developers we've decided to start afresh on the new version and
> backfill what we can with batch jobs after the fact; apparently that
> existing data isn't time-sensitive, though recollecting it would be a
> pain.
If that's the case, and you create the tables with the same splits, you
may be able to just grab all the HFiles and bulk load them into the new
table. Not sure exactly if it will work, but it's worth a shot and should
be very easy to test on a single HFile (sketch below).
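
Something like this, if you go that route (split keys and paths are
placeholders; the shell's SPLITS option pre-creates the region
boundaries):

    # create the table pre-split at the old region start keys:
    echo "create 'users_raw', 'd', {SPLITS => ['k1', 'k2', 'k3']}" | hbase shell

    # then bulk load each old region dir into it:
    hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles \
        hdfs:///hbase_old/users_raw/<region-dir> users_raw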

> With regards to corruption vs. inconsistency, I'm hoping that I'm in
> the latter camp, but the fact that I've not managed to complete a
> CopyTable on that table without one of the MR tasks running out of heap
> (even giving them 10x more memory than the largest StoreFile) doesn't
> fill me with hope.

Your memory issues may not be due to either corruption or inconsistency.
Try using -Dhbase.client.scanner.caching=X (default is 100) to set
scanner caching, which controls how many rows are pulled back in each
next() call in the scan that backs the mappers. You may currently be
pulling back too many rows at once to fit into memory, so start with
something reasonably small and work up from there.
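
E.g. (the peer address is a placeholder; raise the caching value until
memory becomes the constraint again):

    hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
        -Dhbase.client.scanner.caching=10 \
        --peer.adr=dest-zk1:2181:/hbase users_raw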


> With your last suggestion, I'm curious what you mean when you say that
> irregularities could be seen from HDFS. Do you mean programmatically
> looking for filenames ending in .xyz and then looking for regions with
> the same name (or the absence thereof), or something else?

Yeah, similar to how there were empty splits directories in some of your
folders, I find that a lot of state is determined by the layout on HDFS.
For instance, in this case it is not normal for an HFile name to end in
.70f10cbd2f6384f300679cfdbac46cd5, so you could pretty easily pick out
all of the problem files programmatically and deal with them how you
will.

> Thanks again for the help (and sincerest apologies for misspelling your
> name earlier).
    No worries, it happens all the time :).

    --
