Grokbase Groups HBase user May 2011
Hello everybody,
I've run into a strange problem. We run a 6 RS cluster, and suddenly
the client application started reporting "region not online" errors. In
the web console all regionservers appeared up. I ran hbck and got
strange results:

Number of Tables: 2
Number of live region servers: 6
Number of dead region servers: 12

The cluster was in an inconsistent state. With status 'detailed' in the
hbase shell I got the list of dead machines:

12 dead servers
search-hadoop-eu006.v300.gmx.net,60020,1305025929461
search-hadoop-eu002.v300.gmx.net,60020,1305019508570
search-hadoop-eu004.v300.gmx.net,60020,1305019551236
search-hadoop-eu003.v300.gmx.net,60020,1305025688666
search-hadoop-eu005.v300.gmx.net,60020,1305025841017
search-hadoop-eu006.v300.gmx.net,60020,1306156842070
search-hadoop-eu005.v300.gmx.net,60020,1305019568146
search-hadoop-eu001.v300.gmx.net,60020,1305025543786
search-hadoop-eu004.v300.gmx.net,60020,1305025761173
search-hadoop-eu002.v300.gmx.net,60020,1305025611163
search-hadoop-eu006.v300.gmx.net,60020,1305019572576
search-hadoop-eu003.v300.gmx.net,60020,1305019547053


It appears that all live regionservers are also listed as dead. I tried
hbck -fix and the cluster is now in OK state, but it still reports the 12
dead machines above.
I've checked the logs but found nothing obvious. Any idea? We use CDH3u0.


Thanks
Daniel

  • Jean-Daniel Cryans at May 23, 2011 at 4:49 pm
    It was fixed in 0.90.3, before that we didn't clear the list.

    J-D
  • Jinsong Hu at May 23, 2011 at 5:30 pm
    Hi,

    Today I ran "hbase hbck" to check our production cluster and dev cluster.
    The production cluster comes out clean, but in our dev cluster I have seen
    more than 2K errors like this:

    ERROR: Region HEARTBEAT_MASTERPATCH,time\x09daily\x092010-08-15\x09uobkayhian_production\x09patch-0000694,1287356584131.02f9ec575b19864ae44e714d9245138f. found in META, but not in HDFS, and deployed on m0002040.ppops.net:60020

    I checked the hbase GUI, and indeed it is correct: the region is loaded by
    the region server, but the hdfs directory is not there.

    I am running cdh3u0, and I wonder how this can happen. Once it has happened,
    what can I do to bring the table back to a healthy state?

    Jimmy.
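
A minimal sketch of how the region named in an hbck error like the one above could be picked apart, so that the encoded name can be looked up in logs or on HDFS. The regular expression assumes the region name has the layout table,start key,region id.encoded name. (as it appears in this error), and the /hbase root in `expected_region_dir` is an assumption based on the path used later in this thread, not something hbck reports.

```python
import re

# Sample hbck error from this thread, joined onto one logical line.
HBCK_ERROR = (
    r"ERROR: Region HEARTBEAT_MASTERPATCH,time\x09daily\x092010-08-15"
    r"\x09uobkayhian_production\x09patch-0000694,1287356584131."
    r"02f9ec575b19864ae44e714d9245138f. found in META, but not in HDFS, "
    r"and deployed on m0002040.ppops.net:60020"
)

# Assumed layout: <table>,<start key>,<region id>.<32 hex chars>.
REGION_RE = re.compile(
    r"Region (?P<table>[^,]+),.*,(?P<region_id>\d+)\.(?P<encoded>[0-9a-f]{32})\."
)


def parse_hbck_region(line):
    """Return (table, region_id, encoded_name), or None if the line does not match."""
    match = REGION_RE.search(line)
    if match is None:
        return None
    return match.group("table"), match.group("region_id"), match.group("encoded")


def expected_region_dir(table, encoded, root="/hbase"):
    """Directory to look for on HDFS; the root path is an assumption."""
    return f"{root}/{table}/{encoded}"


if __name__ == "__main__":
    table, region_id, encoded = parse_hbck_region(HBCK_ERROR)
    print(encoded)
    print(expected_region_dir(table, encoded))
```

The encoded name is the string to search for in the master and regionserver logs; the directory can then be checked by hand with hadoop dfs -ls.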
  • Jean-Daniel Cryans at May 23, 2011 at 5:54 pm
    I don't remember seeing this sort of issue a lot, or at all... Usually
    the region would not be on .META. so it looks like a different issue.

    Could you grep the master logs and see what the story of that
    region is? Just look for 02f9ec575b19864ae44e714d9245138f and try to
    figure out what happened to that region; it might give us a clue.

    J-D
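
The log search suggested here can be sketched as a small script, equivalent to running grep over the log files. The log location in the usage part is hypothetical; point the glob pattern at wherever the master (or regionserver) logs actually live.

```python
import glob


def grep_logs(pattern, needle):
    """Yield (path, line_number, line) for every log line that contains needle."""
    for path in sorted(glob.glob(pattern)):
        # errors="replace" keeps the scan going if a log has odd bytes in it.
        with open(path, errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if needle in line:
                    yield path, number, line.rstrip("\n")


if __name__ == "__main__":
    # Hypothetical log path; the encoded region name is the one from this thread.
    for path, number, line in grep_logs(
        "/var/log/hbase/*master*.log*", "02f9ec575b19864ae44e714d9245138f"
    ):
        print(f"{path}:{number}: {line}")
```

Reading the hits in time order should show when the region was assigned, opened, split, or closed, if the logs still cover that period.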
  • Jinsong Hu at May 23, 2011 at 6:39 pm
    I checked the master. Unfortunately, I must have a wrong setting, because
    none of the master logs are there.
    So I checked the regionserver which hosted this region. I have 14 days of
    logs there, and I grepped for 02f9ec575b19864ae44e714d9245138f
    but didn't find anything. Then I searched all the regionservers' logs for
    the last several days and didn't find
    anything related to this region either.


    Jimmy.

  • Jinsong Hu at May 25, 2011 at 4:20 pm
    This is a follow-up on what I have found. I exported several of the
    tables hbck complained about to hdfs, truncated the original tables,
    imported them again, and ran hbck. hbck still complained that the
    hdfs directory is not there. I went to hdfs and took a look, and the region's
    hdfs directory is there, so hbck's complaint is bogus this time.

    By accident, I ran the same hbck on one of the regionservers, and to my
    surprise, the check came out clean for all tables! I then ran the
    command on several other regionservers and then on all 3 hbase masters,
    and all of them came out clean,
    even for the table that had the problem before and that I didn't export
    and import.

    I tried several other non-hbase machines that have the proper configuration,
    and sure enough, all of them reported problems.

    So it seems the result of hbck depends on whether it runs on a non-hbase
    machine or an hbase machine. Judging from the results they show,
    neither of them is correct. The correct result should be that the imported
    tables are clean and the non-imported tables are not.

    Can anybody explain why hbck behaves this way?

    Jimmy



  • Stack at May 25, 2011 at 5:03 pm

    On Wed, May 25, 2011 at 9:18 AM, Jinsong Hu wrote:
    > I tried several other non-hbase machines that has proper configuration, sure
    > enough, all of them complain problems.

    This is interesting Jinsong. For sure the configuration was pointed
    at the right filesystem. Do you think there could have been a
    suppressed error or some such thing when remotely querying the filesystem
    for the presence of region directories? Can you add in a bit of
    printf'ing to see what's going on in hbck?

    Thanks for digging in on this.
    St.Ack
  • Jinsong Hu at May 25, 2011 at 5:50 pm
    Hi, Stack:
    You have a point. I checked the hbck result from the non-hbase machine, and
    it shows:
    Summary:
    2418 inconsistencies detected.
    Status: INCONSISTENT
    That number seemed very familiar to me, so I went to the master admin
    page and found:
    Total: servers: 6 requests=2783, regions=2417

    If we add the root region back in, then essentially hbck is complaining
    that every region is bad,
    which is not true.

    On the other hand, hbck on the hbase machine says
    0 inconsistencies detected.
    Status: OK
    That is probably too good to be true too.

    I ran "hadoop dfs -ls /hbase/table_name | grep region_id" and confirmed that
    on both machines
    the region's directory showed up. On both machines I was running as the hdfs
    account.

    When you say to print more info, does that mean I need to modify the hbck
    code? I might do it later
    when I can find some time.

    Jimmy.
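
The arithmetic behind "every region is bad" can be sketched as follows, using the two snippets quoted in this post as input. The extra 1 for the root region is the poster's own reading of the numbers; whether the master's region count really excludes it is an assumption here.

```python
import re

# Output fragments copied from the post above.
HBCK_SUMMARY = "Summary:\n2418 inconsistencies detected.\nStatus: INCONSISTENT"
MASTER_TOTAL = "Total: servers: 6 requests=2783, regions=2417"


def inconsistencies(hbck_output):
    """Number of inconsistencies in an hbck summary, or None if not found."""
    match = re.search(r"(\d+) inconsistencies detected", hbck_output)
    return int(match.group(1)) if match else None


def master_regions(status_line):
    """Region count from the master's 'Total:' line, or None if not found."""
    match = re.search(r"regions=(\d+)", status_line)
    return int(match.group(1)) if match else None


def flags_every_region(hbck_output, status_line, uncounted=1):
    """True if hbck reports as many problems as there are regions.

    `uncounted` is the number of regions assumed to be missing from the
    master's total (1 for the root region, per the post).
    """
    bad = inconsistencies(hbck_output)
    total = master_regions(status_line)
    if bad is None or total is None:
        return False
    return bad == total + uncounted


if __name__ == "__main__":
    print(flags_every_region(HBCK_SUMMARY, MASTER_TOTAL))
```

An inconsistency count that tracks the total region count this closely points at the check itself (for example, hbck not seeing the filesystem) rather than at thousands of individually broken regions.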



  • Stack at May 25, 2011 at 7:28 pm
    On Wed, May 25, 2011 at 10:49 AM, Jinsong Hu wrote:
    > if we add the root region back in, then essentially the hbck is complaining
    > every region is bad,
    > which is not true.

    I did notice and recently fix an issue where HBCK will print an ERROR
    for all regions that follow a bad one, so rather than just one
    ERROR message, you get an ERROR for the bad one and for all the
    good (and bad) ones that follow.

    > When you say I print more info, does that mean I need to modify the hbck
    > code ? I might do it later
    > when I can find some time.

    Yes. That is what I was suggesting. hbck is a client-only
    application, so you could make changes and try stuff without having to
    change your cluster software.


    Thanks for digging in.
    St.Ack
  • Stack at May 23, 2011 at 4:51 pm

    On Mon, May 23, 2011 at 9:27 AM, Daniel Iancu wrote:
    > Hello everybody
    > I've run into this strange problem. We run a 6 RS cluster and suddenly the
    > client application started reporting errors, region not online. In the web
    > console all regionserver appeared up.

    What happened at this time? (Check the master log at this timestamp;
    it should give you a clue.)

    > I've run hbck and got strange results ...
    > 12 dead servers
    > search-hadoop-eu006.v300.gmx.net,60020,1305025929461
    > search-hadoop-eu002.v300.gmx.net,60020,1305019508570
    > search-hadoop-eu004.v300.gmx.net,60020,1305019551236
    > search-hadoop-eu003.v300.gmx.net,60020,1305025688666
    > search-hadoop-eu005.v300.gmx.net,60020,1305025841017
    > search-hadoop-eu006.v300.gmx.net,60020,1306156842070
    > search-hadoop-eu005.v300.gmx.net,60020,1305019568146
    > search-hadoop-eu001.v300.gmx.net,60020,1305025543786
    > search-hadoop-eu004.v300.gmx.net,60020,1305025761173
    > search-hadoop-eu002.v300.gmx.net,60020,1305025611163
    > search-hadoop-eu006.v300.gmx.net,60020,1305019572576
    > search-hadoop-eu003.v300.gmx.net,60020,1305019547053

    We used to hang on to the list of dead servers. In 0.90.2 we fixed
    this ("HBASE-3580 Remove RS from DeadServer when new instance checks
    in"). I'm not sure this change made it into the released cdh3 (you
    might check the cdh CHANGES).

    So, do the online regionservers have the same startcode (the last
    number listed above)? I'd guess not.

    St.Ack
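
The startcode comparison asked about here can be sketched like this. The dead-server names are the ones from the thread; the names of the live servers are not given anywhere in it, so they would have to be pasted in from status 'detailed'. Reading the startcode as a start time in milliseconds since the epoch is an assumption of this sketch.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Dead-server entries as listed in the thread: host,port,startcode.
DEAD = """\
search-hadoop-eu006.v300.gmx.net,60020,1305025929461
search-hadoop-eu002.v300.gmx.net,60020,1305019508570
search-hadoop-eu004.v300.gmx.net,60020,1305019551236
search-hadoop-eu003.v300.gmx.net,60020,1305025688666
search-hadoop-eu005.v300.gmx.net,60020,1305025841017
search-hadoop-eu006.v300.gmx.net,60020,1306156842070
search-hadoop-eu005.v300.gmx.net,60020,1305019568146
search-hadoop-eu001.v300.gmx.net,60020,1305025543786
search-hadoop-eu004.v300.gmx.net,60020,1305025761173
search-hadoop-eu002.v300.gmx.net,60020,1305025611163
search-hadoop-eu006.v300.gmx.net,60020,1305019572576
search-hadoop-eu003.v300.gmx.net,60020,1305019547053
""".split()


def parse_server_name(name):
    """Split 'host,port,startcode' into (host, port, startcode)."""
    host, port, startcode = name.rsplit(",", 2)
    return host, int(port), int(startcode)


def started_at(startcode):
    """Start time in UTC, assuming the startcode is milliseconds since the epoch."""
    return datetime.fromtimestamp(startcode // 1000, tz=timezone.utc)


def startcodes_by_host(names):
    """Map each host to the sorted startcodes it appears with."""
    by_host = defaultdict(list)
    for name in names:
        host, _port, startcode = parse_server_name(name)
        by_host[host].append(startcode)
    return {host: sorted(codes) for host, codes in by_host.items()}


def still_listed_as_dead(live_names, dead_names):
    """Live server names (same host, port and startcode) that are also in the dead list."""
    return sorted(set(live_names) & set(dead_names))


if __name__ == "__main__":
    for host, codes in sorted(startcodes_by_host(DEAD).items()):
        print(host, [started_at(code).isoformat() for code in codes])
```

If `still_listed_as_dead` comes back empty for the real live list, the dead entries are older instances of the same hosts with different startcodes, which matches the stale-list explanation given in this reply.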
