Grokbase Groups HBase user June 2011
How does one recover when a regionserver dies? We have this problem periodically, and we basically have to restart HBase or all our jobs die with these types of errors:

org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact region server c1-s35.blablabla.com:60020 for region urlhashv4,F4657B47F9881A42AF88864EC5EA9B27,1307217134729.4fa3defeeaeb59dc56f7ce6f155b2a0b., row 'F471203BA4FF5DD2BD2549308FD81F4A', but failed after 10 attempts.
Exceptions:


Eventually this results in a general failure with Wrong Region exceptions, and the whole table seems to become corrupt. The errors one sees at the regionserver level are:

2011-06-22 10:32:35,559 WARN org.apache.hadoop.hbase.regionserver.HRegion: File hdfs://c1-m01:54310/hbase/urlhashv4/d3c3f27ac1ce7a2dff35ddf367fe779d/recovered.edits/0000000000097403816 is zero-length, deleting.
2011-06-22 10:32:35,563 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Failed delete of hdfs://c1-m01:54310/hbase/urlhashv4/d3c3f27ac1ce7a2dff35ddf367fe779d/recovered.edits/0000000000097403816
2011-06-22 10:33:19,769 WARN org.apache.hadoop.hbase.regionserver.HRegion: File hdfs://c1-m01:54310/hbase/urlhashv4/9d0d6214bebdefd5466d0e6918c3630c/recovered.edits/0000000000097403669 is zero-length, deleting.
2011-06-22 10:33:19,770 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Failed delete of hdfs://c1-m01:54310/hbase/urlhashv4/9d0d6214bebdefd5466d0e6918c3630c/recovered.edits/0000000000097403669


Shouldn't the master detect deaths and rebalance the regions to other regionservers? Or is there a manual way to do this without having to restart the whole thing?

Thanks,

Robert Gonzalez
Maxpoint Interactive


  • Jean-Daniel Cryans at Jun 22, 2011 at 5:46 pm
    Hadoop and HBase versions please.

    (No, you shouldn't have to do anything special.)

    J-D


  • Robert Gonzalez at Jun 22, 2011 at 5:54 pm
    HBase: 0.90.0
    Hadoop: 0.20.2+320

    -----Original Message-----
    From: jdcryans@gmail.com On Behalf Of Jean-Daniel Cryans
    Sent: Wednesday, June 22, 2011 12:46 PM
    To: user@hbase.apache.org
    Subject: Re: recovery from regionserver death

    Hadoop and HBase versions please.

    (no you shouldn't have to do anything special)

    J-D

  • Robert Gonzalez at Jun 22, 2011 at 7:06 pm
    When I do a consistency check, I get:

    Chain of regions in table urlhashv4 is broken; edges does not contain 7BB16418308C2CB6B8AE56982781A5C6
    Table urlhashv4 is inconsistent.

    This is the same thing I saw before. Is there any way to create an empty region that covers the range of keys that is missing? If I could do that, I could go on. The data is not super-critical; it can be regenerated.

    Why do these regions just disappear like this? They are not in the HDFS directory for the table at all.
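
For anyone reading the consistency-check message above: as far as I can tell, "edges does not contain KEY" means the checker walked the table's regions in key order and found a region ending at KEY with no region starting at KEY. The sketch below illustrates that idea only; it is not HBase code. The function names and the sample keys are hypothetical, region boundaries are modeled as plain strings (HBase compares raw bytes), and an empty string stands for the open start or end of the table.

```python
from bisect import bisect_right
from typing import List, Tuple

# A region is modeled as (start_key, end_key); "" means open-ended.
Region = Tuple[str, str]


def find_broken_edges(regions: List[Region]) -> List[str]:
    """Return end keys that no region picks up as its start key."""
    start_keys = {start for start, _ in regions}
    broken = []
    for _, end in sorted(regions):
        # The last region ends with "", which needs no successor.
        if end != "" and end not in start_keys:
            broken.append(end)
    return broken


def find_holes(regions: List[Region]) -> List[Region]:
    """Return the (start, end) key ranges that no region covers."""
    starts = sorted(start for start, _ in regions)
    holes = []
    for end in find_broken_edges(regions):
        # The hole runs up to the next region start after the broken edge,
        # or to the end of the table ("") if there is none.
        i = bisect_right(starts, end)
        next_start = starts[i] if i < len(starts) else ""
        holes.append((end, next_start))
    return holes
```

With made-up boundaries `[("", "3A"), ("3A", "7B"), ("9C", "")]`, `find_broken_edges` reports `"7B"` and `find_holes` reports the range `("7B", "9C")`, which is the key range an empty replacement region would have to cover.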


Discussion Overview
group: user @
categories: hbase, hadoop
posted: Jun 22, '11 at 5:44p
active: Jun 22, '11 at 7:06p
posts: 4
users: 2
website: hbase.apache.org
