Grokbase Groups: HBase user, April 2011
Hi,

I am trying failover cases on a small 3-node fully-distributed cluster
of the following topology:
- master node - NameNode, JobTracker, QuorumPeerMain, HMaster;
- slave nodes - DataNode, TaskTracker, QuorumPeerMain, HRegionServer.

ROOT and META are initially served by two different nodes.

I create table 'incr' with a single column family 'value', put 'incr',
'00000000', 'value:main', '00000000' to achieve an 8-byte counter cell
with content that is still human-readable, then start calling

$ incr 'incr', '00000000', 'value:main', 1

once every second or two. Then I kill -9 one of my region servers, the
one that serves 'incr'.
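
For completeness, the same workload sketched against the Java client API
(0.90-era calls; the class name and sleep interval below are just mine for
illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class IncrLoop {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "incr");
    while (true) {
      // same effect as the shell command: incr 'incr', '00000000', 'value:main', 1
      long v = table.incrementColumnValue(Bytes.toBytes("00000000"),
          Bytes.toBytes("value"), Bytes.toBytes("main"), 1L);
      System.out.println("counter = " + v);
      Thread.sleep(1500); // roughly once every second or two
    }
  }
}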

The subsequent shell incr times out. I terminate it with Ctrl-C,
launch hbase-shell again and repeat the command, getting the following
message repeated several times:

11/04/27 13:57:43 INFO ipc.HbaseRPC: Server at
regionserver1/10.50.3.68:60020 could not be reached after 1 tries,
giving up.

Tailing the master log yields the following diagnostics:

2011-04-27 14:08:32,982 INFO
org.apache.hadoop.hbase.master.LoadBalancer: Calculated a load balance
in 0ms. Moving 1 regions off of 1 overloaded servers onto 1 less
loaded servers
2011-04-27 14:08:32,982 INFO org.apache.hadoop.hbase.master.HMaster:
balance hri=incr,,1303892996561.cf314a59d3a5c79a77153f82b40015d7.,
src=regionserver1,60020,1303895356068,
dest=regionserver2,60020,1303898049443
2011-04-27 14:08:32,982 DEBUG
org.apache.hadoop.hbase.master.AssignmentManager: Starting
unassignment of region
incr,,1303892996561.cf314a59d3a5c79a77153f82b40015d7. (offlining)
2011-04-27 14:08:32,982 DEBUG
org.apache.hadoop.hbase.master.AssignmentManager: Attempted to
unassign region incr,,1303892996561.cf314a59d3a5c79a77153f82b40015d7.
but it is not currently assigned anywhere

hbase hbck finds 2 inconsistencies (regionserver1 down, region not
served). hbase hbck -fix reports 2 initial and 1 eventual
inconsistency, migrating the region to a live region server. However,
when I repeat the test with regionserver2 and regionserver1 swapped
(i.e. kill -9 the region server process on regionserver2, the initial
evacuation target), hbase hbck -fix throws

org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
setting up proxy interface
org.apache.hadoop.hbase.ipc.HRegionInterface to
regionserver2/10.50.3.68:60020 after attempts=1
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1008)
at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:172)
at org.apache.hadoop.hbase.util.HBaseFsck.getMetaEntries(HBaseFsck.java:746)
at org.apache.hadoop.hbase.util.HBaseFsck.doWork(HBaseFsck.java:133)
at org.apache.hadoop.hbase.util.HBaseFsck.main(HBaseFsck.java:989)

zookeeper.session.timeout is set to 1000 ms (i.e. 1 second), and the
configuration is consistent across the cluster, so these are not the
causes.
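
For the record, a quick way to see what that property actually resolves to
on a node is something like the following (a minimal sketch; it assumes the
node's hbase-site.xml is on the classpath):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class PrintZkTimeout {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // prints the configured value; the timeout the server actually grants
    // is negotiated with ZooKeeper and shows up in the region server log
    System.out.println("zookeeper.session.timeout = "
        + conf.get("zookeeper.session.timeout"));
  }
}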

Manual region reassignment also works the first time, and only the
first time. Subsequent attempts leave the 'incr' regions unassigned,
and I cannot even query the table's regions from the client, since
HTable instances fail to connect.

As soon as I restart the killed region server, cluster operation resumes.
However, as far as I understand the HBase book, this is not the
intended behavior. The cluster should automatically evacuate regions
from dead region servers to known alive ones.

I run the cluster on RH 5 with Sun JDK 1.6.0_24.
JAVA_HOME=/usr/java/jdk1.6.0_24 is set in hadoop-env.sh (I wonder
whether I should duplicate the assignment in hbase-env.sh).
Is this one of the issues known to be fixed in 0.90.2 or later
releases? I grepped Jira and found no matching issues; the failover
scenarios described there are far more complex.
What other logs or config files should I check and/or post here?

Reg.,
Alex Romanovsky
(message might appear duplicate; I apologize if it does so)

  • Jean-Daniel Cryans at Apr 27, 2011 at 6:35 pm
    Hi Alex,

    Before answering I made sure it was working for me and it does. In
    your master log after killing the -ROOT- region server you should see
    lines like this:

    INFO org.apache.hadoop.hbase.zookeeper.RegionServerTracker:
    RegionServer ephemeral node deleted, processing expiration
    [servername]
    DEBUG org.apache.hadoop.hbase.master.ServerManager: Added= servername
    to dead servers, submitted shutdown handler to be executed, root=true,
    meta=false
    ...
    INFO org.apache.hadoop.hbase.catalog.RootLocationEditor: Unsetting
    ROOT region location in ZooKeeper
    ...
    DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning
    region -ROOT-,,0.70236052 to servername
    ...

    Then when killing the .META. region server you would have some
    equivalent lines such as:

    DEBUG org.apache.hadoop.hbase.master.ServerManager: Added=servername
    to dead servers, submitted shutdown handler to be executed,
    root=false, meta=true
    ...
    DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning
    region .META.,,1.1028785192 to servername

    If it doesn't show, then there might be some other issue. Other comments inline.

    J-D

    On Wed, Apr 27, 2011 at 4:03 AM, Alex Romanovsky
    wrote:
    Hi,

    I am trying failover cases on a small 3-node fully-distributed cluster
    of the following topology:
    - master node - NameNode, JobTracker, QuorumPeerMain, HMaster;
    - slave nodes - DataNode, TaskTracker, QuorumPeerMain, HRegionServer.

    ROOT and META are initially served by two different nodes.

    I create table 'incr' with a single column family 'value', put 'incr',
    '00000000', 'value:main', '00000000' to achieve an 8-byte counter cell
    with content that is still human-readable, then start calling

    $ incr 'incr', '00000000', 'value:main', 1

    once every second or two. Then I kill -9 one of my region servers, the
    one that serves 'incr'.

    The subsequent shell incr times out. I terminate it with Ctrl-C,
    launch hbase-shell again and repeat the command, getting the following
    message repeated several times:

    11/04/27 13:57:43 INFO ipc.HbaseRPC: Server at
    regionserver1/10.50.3.68:60020 could not be reached after 1 tries,
    giving up.
    That's somewhat expected; the shell is configured not to retry a lot,
    so the regions might not have been reassigned yet.
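    If you want the client doing the increments to ride out the failover
    instead, you can give it more retries in its own Configuration, something
    along these lines (illustrative values, relevant lines only):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;

    Configuration conf = HBaseConfiguration.create();
    conf.setInt("hbase.client.retries.number", 20); // attempts before giving up
    conf.setLong("hbase.client.pause", 1000);       // pause between attempts, in ms
    HTable table = new HTable(conf, "incr");        // this client will retry longer
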
    Tailing the master log yields the following diagnostics:

    2011-04-27 14:08:32,982 INFO
    org.apache.hadoop.hbase.master.LoadBalancer: Calculated a load balance
    in 0ms. Moving 1 regions off of 1 overloaded servers onto 1 less
    loaded servers
    2011-04-27 14:08:32,982 INFO org.apache.hadoop.hbase.master.HMaster:
    balance hri=incr,,1303892996561.cf314a59d3a5c79a77153f82b40015d7.,
    src=regionserver1,60020,1303895356068,
    dest=regionserver2,60020,1303898049443
    2011-04-27 14:08:32,982 DEBUG
    org.apache.hadoop.hbase.master.AssignmentManager: Starting
    unassignment of region
    incr,,1303892996561.cf314a59d3a5c79a77153f82b40015d7. (offlining)
    2011-04-27 14:08:32,982 DEBUG
    org.apache.hadoop.hbase.master.AssignmentManager: Attempted to
    unassign region incr,,1303892996561.cf314a59d3a5c79a77153f82b40015d7.
    but it is not currently assigned anywhere
    That's 11 minutes after you killed the region server, right? Anything
    else after 13:57:43?
    hbase hbck finds 2 inconsistencies (regionserver1 down, region not
    served). hbase hbck -fix reports 2 initial and 1 eventual
    inconsistency, migrating the region to a live region server.
    How long after you killed the RS did you run this? Was anything shown
    in the master log (like repeating lines) before that? If so, what?
    However,
    when I repeat the test with regionserver2 and regionserver1 swapped
    (i.e. kill -9 the region server process on regionserver2, the initial
    evacuation target), hbase hbck -fix throws
    org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
    setting up proxy interface
    org.apache.hadoop.hbase.ipc.HRegionInterface to
    regionserver2/10.50.3.68:60020 after attempts=1
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1008)
    at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:172)
    at org.apache.hadoop.hbase.util.HBaseFsck.getMetaEntries(HBaseFsck.java:746)
    at org.apache.hadoop.hbase.util.HBaseFsck.doWork(HBaseFsck.java:133)
    at org.apache.hadoop.hbase.util.HBaseFsck.main(HBaseFsck.java:989)
    So it seems that when you ran hbck the region server wasn't detected
    as dead yet because hbck tried to connect to it.
    zookeeper.session.timeout is set to 1000 ms (i.e. 1 second), and the
    configuration is consistent across the cluster, so these are not the
    causes.
    I won't rule that out until you prove to us that it's the case. In your
    logs you should have a line like this after starting a region server:

    INFO org.apache.zookeeper.ClientCnxn: Session establishment complete
    on server zk_server:2181, sessionid = 0xsome_hex, negotiated timeout =
    1000

    If not, then review that configuration.
    Manual region reassignment also helps for the first time, and only for
    the first time. Subsequent retries leave 'incr' regions not assigned
    anywhere, and I cannot even query table regions on the client since
    HTable instances fail to connect.

    As soon as I restart the killed region server, cluster operation resumes.
    However, as far as I understand the HBase book, this is not the
    intended behavior. The cluster should automatically evacuate regions
    from dead region servers to known alive ones.
    It really seems like the region server was never considered dead. The
    log should tell.
    I run the cluster on RH 5, Sun JDK 1.6.0_24.
    JAVA_HOME=/usr/java/jdk1.6.0_24 in hadoop-env.sh (wonder whether I
    should duplicate the assignment in hbase-env.sh).
    Is this one of the issues known to be fixed in 0.90.2 or later
    releases? I grepped Jira and found no matching issues described;
    failover scenarios mentioned there are far more complex.
    What other logs or config files shall I check and/or post here?
    AFAIK this is not a known issue, and it works well for us. Feel free
    to pastebin whole logs.
    Reg.,
    Alex Romanovsky
    (message might appear duplicate; I apologize if it does so)
    It did, why?
  • Alex Romanovsky at Apr 28, 2011 at 10:24 am
    Thanks a lot for your help, Jean!

    It was a reverse DNS lookup issue: we recently changed our default
    domain suffix.
    I noticed it by looking up the server name from the "No HServerInfo
    found" message in the list returned by
    admin.getClusterStatus().getServerInfo().
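
    For anyone hitting the same thing, the check amounts to something like
    this (0.90 API, method names from memory):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.ClusterStatus;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HServerInfo;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class ListServers {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        ClusterStatus status = admin.getClusterStatus();
        for (HServerInfo info : status.getServerInfo()) {
          // compare these names with what reverse DNS returns on each host
          System.out.println(info.getServerName());
        }
      }
    }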

    I'll drop the DNS cache on every cluster host now and restart the cluster.

    WBR,
    Alex Romanovsky

    P.S.
    It did, why?
    My first message didn't appear in the list for a long time, and I
    thought it could be because I had sent it before I actually
    subscribed to the list. Really sorry for the inconvenience.
  • Jean-Daniel Cryans at Apr 28, 2011 at 7:06 pm
    Happy that you could figure it out quickly, and even happier that you
    wrote back to the list with details.

    Thanks!

    J-D


Discussion Overview
group: user@hbase.apache.org
categories: hbase, hadoop
posted: Apr 27, '11 at 11:03a
active: Apr 28, '11 at 7:06p
posts: 5
users: 2
website: hbase.apache.org
