FAQ
In the master logs, I am seeing "regions in transition timed out" and
"region has been PENDING_CLOSE for too long, running forced unassign".
Both of these log messages occur at INFO level, so I assume they are
innocuous. Should I be concerned?



-geoff


  • Stack at Sep 3, 2011 at 5:51 am
    Are you having trouble getting any of your data out of your tables?

    To get rid of them, try restarting your master.

    Before you restart your master, do "HBASE-4126 Make timeoutmonitor
    timeout after 30 minutes instead of 3"; i.e. set
    "hbase.master.assignment.timeoutmonitor.timeout" to 1800000 in
    hbase-site.xml.
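
    [As a sketch, the property block Stack describes would go in
    hbase-site.xml like this; 1800000 ms = 30 minutes:]

    ```xml
    <!-- hbase-site.xml: raise the assignment timeout monitor from the
         old 3-minute default to 30 minutes (value is in milliseconds) -->
    <property>
      <name>hbase.master.assignment.timeoutmonitor.timeout</name>
      <value>1800000</value>
    </property>
    ```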

    St.Ack
  • Geoff Hendrey at Sep 3, 2011 at 7:12 am
    "Are you having trouble getting any of your data out of your tables?"

    Depends what you mean. We see corruptions from time to time that prevent
    us from getting data, one way or another. Today's corruption was regions
    with duplicate start and end rows. We fixed that by deleting the
    offending regions from HDFS and running add_table.rb to restore the
    meta table. The other common corruption is holes in ".META.", which we
    repair with a little tool we wrote. We'd love to learn why we see these
    corruptions with such regularity (seemingly much more often than others
    on the list).
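
    [For illustration only - this is not Geoff's tool, just a minimal
    sketch of the consistency check such a ".META." repair tool performs,
    with hypothetical names. Each region is a (start_key, end_key) pair
    and '' marks the table boundary:]

    ```python
    # Sketch of the check a .META. repair tool performs (illustrative,
    # hypothetical function name; not the tool from this thread).
    # A region is (start_key, end_key); '' denotes the table boundary.

    def find_meta_problems(regions):
        """Return (holes, overlaps) in a table's region chain.

        A hole is a gap where one region's end key sorts before the next
        region's start key; an overlap is the reverse.
        """
        regions = sorted(regions, key=lambda r: r[0])
        holes, overlaps = [], []
        for (s1, e1), (s2, e2) in zip(regions, regions[1:]):
            if e1 < s2:
                holes.append((e1, s2))     # gap: keys in [e1, s2) unserved
            elif e1 > s2:
                overlaps.append((s2, e1))  # keys in [s2, e1) served twice
        return holes, overlaps

    # Example: a hole between 'f' and 'k', an overlap between 'p' and 'r'.
    regions = [('', 'f'), ('k', 'r'), ('p', 'z'), ('z', '')]
    holes, overlaps = find_meta_problems(regions)
    print(holes)     # [('f', 'k')]
    print(overlaps)  # [('p', 'r')]
    ```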

    We will implement the timeout you suggest and see how it goes.

    Thanks,
    Geoff

  • Stuart Smith at Oct 29, 2011 at 8:39 pm
    Hello Geoff,

      I usually don't show up here, since I use CDH, and good form means I should stay on CDH-users,
    But!
      I've been seeing the same issues for months:

     - PENDING_CLOSE too long, master tries to reassign - I see a continuous stream of these.
     - WrongRegionExceptions due to overlapping regions & holes in the regions.

    I just spent all day yesterday cribbing off of St.Ack's check_meta.rb script to write a Java program to fix up overlaps & holes in an offline fashion (HBase down, directly on HDFS), and will start testing next week (cross my fingers!).

    It seems like the pending close messages can be ignored?
    And once I test my tool, and confirm I know a little bit about what I'm doing, maybe we could share notes?

    Take care,
      -stu



  • Geoff Hendrey at Oct 29, 2011 at 9:36 pm
    Sure. I posted the code many weeks back for a tool that will repair holes in .META..

    If you search the list, you should find it. I'll send you the latest code for that. Maybe I made some fixes after I posted the code. Please ping me if I forget. I've used it to repair huge tables (and fixed subtle bugs in the process), so I'm confident it works.

    No matter what anyone tells me, I know HBase is horribly broken for the use case of doing bulk writes from an MR job. It shits the bed every time you pass a certain scale. For this reason we've completely rewritten our code so that we use bulkloading. It's way more efficient and always works.

    Please ping me until I send you the code. Otherwise I will forget.

    Sent from my iPhone
  • Ted Yu at Oct 29, 2011 at 11:19 pm
    In 0.92 (to be released in 2 weeks), you can expect improvement in this
    regard.
    See HBASE-3368.

    Geoff:
    Can you publish your tool on the HBase JIRA?

    Thanks
  • Geoff Hendrey at Oct 30, 2011 at 2:09 am
    Stuart -

    Have you disabled splitting? I believe you can work around the PENDING_CLOSE issue by presplitting your table and disabling splitting. Worked for us.
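
    [One common way to realize the "disable splitting" part in HBase of
    this era - a sketch, not necessarily what Geoff's team did - was to
    raise the split threshold far beyond any realistic region size:]

    ```xml
    <!-- hbase-site.xml: effectively disable automatic splits by raising
         the split threshold beyond any expected region size (100 GB) -->
    <property>
      <name>hbase.hregion.max.filesize</name>
      <value>107374182400</value>
    </property>
    ```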

    Sent from my iPhone
  • Stuart Smith at Nov 14, 2011 at 11:20 pm
    Thanks Geoff!

      The slow reply was due to the saga being moved to the Cloudera lists.

    I ended up trying to merge all my regions (offline) using the Java API (since I had gotten to about 20K regions for a given table), and messing up badly, so I just started from scratch, and have started reloading data with a new max region filesize.

    This took the number of regions I had from 20K to the high hundreds, and so far, HBase seems much happier - I'm only about 1/2 to 2/3 of the way to where I was before, though, so we'll see what happens, but it does seem to work a lot better :)

    Btw.. if you use the merge API.. make sure you don't accidentally comment out the code that sorts your region listing by key before you start merging.. the API will happily let you merge any two random regions.. creating lots of interesting overlaps.... :O
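
    [Stuart's caveat can be shown with a toy model, using hypothetical
    helper names rather than the HBase merge API: merging two regions
    yields a region spanning min(start)..max(end), so merging
    non-adjacent regions swallows the keyspace in between and overlaps
    the regions that were there:]

    ```python
    # Toy model of region merging (hypothetical helpers, not the HBase
    # API). A region is (start_key, end_key).

    def merge_two(a, b):
        # A merged region spans from the smaller start to the larger end.
        return (min(a[0], b[0]), max(a[1], b[1]))

    def merge_adjacent_pairs(regions):
        """Merge regions pairwise. Only safe if sorted by start key."""
        out = []
        for i in range(0, len(regions) - 1, 2):
            out.append(merge_two(regions[i], regions[i + 1]))
        if len(regions) % 2:          # odd count: last region unmerged
            out.append(regions[-1])
        return out

    regions = [('a', 'f'), ('f', 'k'), ('k', 'p'), ('p', 'z')]

    # Sorted: each merge joins neighbours, the chain stays contiguous.
    print(merge_adjacent_pairs(sorted(regions)))
    # [('a', 'k'), ('k', 'z')]

    # Unsorted: merging ('a','f') with ('k','p') yields ('a','p'),
    # which overlaps the surviving ('f','k') region.
    shuffled = [('a', 'f'), ('k', 'p'), ('f', 'k'), ('p', 'z')]
    print(merge_adjacent_pairs(shuffled))
    # [('a', 'p'), ('f', 'z')]
    ```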


    Take care,
      -stu




  • Geoff Hendrey at Nov 14, 2011 at 11:22 pm
    Thanks, and CCing my team.

  • Geoff Hendrey at Nov 14, 2011 at 11:29 pm
    Oh, and by the way, in the case of scan-for-a-single-value being super slow, a guy on our team found that the client caches region meta information aggressively. It can be turned off using hbase.client.prefetch.limit, and you will see scan-for-a-single-value become about 10x faster.
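
    [A sketch of where that knob would live, taking the property name
    from the thread; the exact value that disables prefetching may vary
    by version, so verify against your release:]

    ```xml
    <!-- client-side hbase-site.xml: shrink region-location prefetching.
         Property name as given in the thread; check your version's
         docs for the value that disables it outright. -->
    <property>
      <name>hbase.client.prefetch.limit</name>
      <value>1</value>
    </property>
    ```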

    We've also been using the merge script, but it sure is slow.

    -geoff

    -----Original Message-----
    From: Stuart Smith
    Sent: Monday, November 14, 2011 3:20 PM
    To: user@hbase.apache.org
    Subject: Re: PENDING_CLOSE for too long

    Thanks Geoff!

      The slow reply was due to the saga being moved to the cloudera lists.

    I ended up trying to merge all my regions (offline) using the java API (since I had gotten to about 20K regions for a given table), and messing up badly, so I just started from scratch, and have started reloading data with a new max region filesize.

    This took the number of regions I had from 20K to high hundreds, and so far, hbase seems much happier - I'm only about 1/2 - 2/3's of the way to where I was before, though, so we'll see what happens, but it does seem to work a lot better :)

    Btw.. if you use the merge API.. make sure you don't accidently comment out code that sorts your region listing by key before you start merging.. the API will happily let you merge any two random regions.. creating lots of interesting overlaps.... :O


    Take care,
      -stu




    ________________________________
    From: Geoff Hendrey <ghendrey@decarta.com>
    To: user@hbase.apache.org
    Cc: user@hbase.apache.org; Stuart Smith <stu24mail@yahoo.com>
    Sent: Saturday, October 29, 2011 7:08 PM
    Subject: Re: PENDING_CLOSE for too long

    Stuart -

    Have you disabled splitting? I believe you can work around the issue of PENDInG_CLOSE by presplitting your table and disabling splitting. Worked for us.

    Sent from my iPhone
    On Oct 29, 2011, at 4:19 PM, "Ted Yu" wrote:

    In 0.92 (to be released in 2 weeks), you can expect improvement in this
    regard.
    See HBASE-3368.

    Geoff:
    Can you publish your tool on HBASE JIRA ?

    Thanks
    On Sat, Oct 29, 2011 at 2:35 PM, Geoff Hendrey wrote:

    Sure. I posted the code many weeks back for a tool that will repair holes
    in .mETA.

    If you do a check on the list, you should find it. I'll send you the
    latest code for that. Maybe I made some fixes after I posted the code.
    Please ping me if I forget. I've used it to repair huge tables  (and fixed
    subtle bugs in the process) so I'm confident it works.

    No matter what anyone tells me, I know hbase is horribly broken for the
    use case of doing bulk writes from an mr job. It shits the bed every time
    you pass a certain scale. For this reason we've completely rewritten our
    code so that we use bulkloading. It's way more efficient and always work.

    Please ping me until I send you the code. Otherwise I will forget.

    Sent from my iPhone
    On Oct 29, 2011, at 1:39 PM, "Stuart Smith" wrote:

    Hello Geoff,

      I usually don't show up here, since I use CDH, and good form means I
    should stay on CDH-users,
    But!
      I've been seeing the same issues for months:

      - PENDING_CLOSE too long, master tries to reassign - I see an
    continuous stream of these.
      - WrongRegionExceptions due to overlapping regions & holes in the regions.
    I just spent all day yesterday cribbing off of St.Ack's check_meta.rb
    script to write a java program to fix up overlaps & holes in an offline
    fashion (hbase down, directly on hdfs), and will start testing next week
    (cross my fingers!).
    It seems like the pending close messages can be ignored?
    And once I test my tool, and confirm I know a little bit about what I'm
    doing, maybe we could share notes?
    Take care,
      -stu



    ________________________________
    From: Geoff Hendrey <ghendrey@decarta.com>
    To: user@hbase.apache.org
    Cc: hbase-user@hadoop.apache.org
    Sent: Saturday, September 3, 2011 12:11 AM
    Subject: RE: PENDING_CLOSE for too long

    "Are you having trouble getting to any of your data out in tables?"

    depends what you mean. We see corruptions from time to time that prevent
    us from getting data, one way or another. Today's corruption was regions
    with duplicate start and end rows. We fixed that by deleting the
    offending regions from HDFS, and running add_table.rb to restore the
    meta. The other common corruption is the holes in ".META." that we
    repair with a little tool we wrote. We'd love to learn why we see these
    corruptions with such regularity (seemingly much higher than others on
    the list).

    We will implement the timeout you suggest, and see how it goes.

    Thanks,
    Geoff

    -----Original Message-----
    From: saint.ack@gmail.com On Behalf Of
    Stack
    Sent: Friday, September 02, 2011 10:51 PM
    To: user@hbase.apache.org
    Cc: hbase-user@hadoop.apache.org
    Subject: Re: PENDING_CLOSE for too long

    Are you having trouble getting to any of your data out in tables?

    To get rid of them, try restarting your master.

    Before you restart your master, do "HBASE-4126  Make timeoutmonitor
    timeout after 30 minutes instead of 3"; i.e. set
    "hbase.master.assignment.timeoutmonitor.timeout" to 1800000 in
    hbase-site.xml.
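For anyone following along, the setting Stack describes would look something like this in hbase-site.xml (1800000 ms = 30 * 60 * 1000, i.e. 30 minutes):

```xml
<!-- Raise the assignment timeout monitor to 30 minutes (HBASE-4126). -->
<property>
  <name>hbase.master.assignment.timeoutmonitor.timeout</name>
  <value>1800000</value>
</property>
```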

    St.Ack

    On Fri, Sep 2, 2011 at 1:40 PM, Geoff Hendrey <ghendrey@decarta.com>
    wrote:
    In the master logs, I am seeing "regions in transition timed out" and
    "region has been PENDING_CLOSE for too long, running forced unassign".
    Both of these log messages occur at INFO level, so I assume they are
    innocuous. Should I be concerned?



    -geoff
  • Lars hofhansl at Nov 14, 2011 at 11:44 pm
    Hi Stuart,

    when you get some time, could you tell us how you "mess[ed] up badly", so that others can avoid the same mistakes?

    Thanks.


    -- Lars



    ----- Original Message -----
    From: Stuart Smith <stu24mail@yahoo.com>
    To: "user@hbase.apache.org" <user@hbase.apache.org>
    Cc:
    Sent: Monday, November 14, 2011 3:20 PM
    Subject: Re: PENDING_CLOSE for too long

    Thanks Geoff!

      The slow reply was due to the saga being moved to the cloudera lists.

    I ended up trying to merge all my regions (offline) using the java API (since I had gotten to about 20K regions for a given table), and messing up badly, so I just started from scratch, and have started reloading data with a new max region filesize.

    This took the number of regions I had from 20K to high hundreds, and so far, hbase seems much happier - I'm only about 1/2 to 2/3 of the way to where I was before, though, so we'll see what happens, but it does seem to work a lot better :)

    Btw.. if you use the merge API.. make sure you don't accidentally comment out code that sorts your region listing by key before you start merging.. the API will happily let you merge any two random regions.. creating lots of interesting overlaps.... :O
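To make the gotcha concrete: regions must be ordered by start key, using the unsigned byte-wise ordering HBase itself uses for row keys, before any pairwise merge. A stdlib-only sketch of just that comparison (the class and key values here are made up for illustration; in HBase code you would use Bytes.compareTo):

```java
import java.util.Arrays;

public class SortRegionsByStartKey {
    // Unsigned lexicographic byte[] comparison, matching how HBase
    // orders row keys (Bytes.compareTo behaves the same way).
    static int compareUnsigned(byte[] a, byte[] b) {
        int len = Math.min(a.length, b.length);
        for (int i = 0; i < len; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length; // shorter key sorts first
    }

    public static void main(String[] args) {
        // Hypothetical region start keys, deliberately out of order.
        byte[][] startKeys = {
            "row-m".getBytes(), "row-a".getBytes(), "row-z".getBytes()
        };
        // Sort before merging, so only truly adjacent regions get merged.
        Arrays.sort(startKeys, SortRegionsByStartKey::compareUnsigned);
        for (byte[] k : startKeys) {
            System.out.println(new String(k)); // row-a, row-m, row-z
        }
    }
}
```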


    Take care,
      -stu




    ________________________________
    From: Geoff Hendrey <ghendrey@decarta.com>
    To: user@hbase.apache.org
    Cc: user@hbase.apache.org; Stuart Smith <stu24mail@yahoo.com>
    Sent: Saturday, October 29, 2011 7:08 PM
    Subject: Re: PENDING_CLOSE for too long

    Stuart -

    Have you disabled splitting? I believe you can work around the issue of PENDING_CLOSE by presplitting your table and disabling splitting. Worked for us.
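For context, the usual way to disable splitting in this era was to pre-split at table creation time and then raise the split threshold so high it is never reached; a sketch (the 100 GB value is an arbitrary illustration, not something from this thread):

```xml
<!-- Effectively disable automatic splits: no region should ever
     reach this size. 107374182400 bytes = 100 GB. -->
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>107374182400</value>
</property>
```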

    Sent from my iPhone
    On Oct 29, 2011, at 4:19 PM, "Ted Yu" wrote:

    In 0.92 (to be released in 2 weeks), you can expect improvement in this
    regard.
    See HBASE-3368.

    Geoff:
    Can you publish your tool on the HBASE JIRA?

    Thanks
    On Sat, Oct 29, 2011 at 2:35 PM, Geoff Hendrey wrote:

    Sure. I posted the code many weeks back for a tool that will repair holes
    in .META.

    If you do a check on the list, you should find it. I'll send you the
    latest code for that. Maybe I made some fixes after I posted the code.
    Please ping me if I forget. I've used it to repair huge tables (and fixed
    subtle bugs in the process) so I'm confident it works.

    No matter what anyone tells me, I know hbase is horribly broken for the
    use case of doing bulk writes from an mr job. It shits the bed every time
    you pass a certain scale. For this reason we've completely rewritten our
    code so that we use bulkloading. It's way more efficient and always works.

    Please ping me until I send you the code. Otherwise I will forget.

    Sent from my iPhone
    On Oct 29, 2011, at 1:39 PM, "Stuart Smith" wrote:

    Hello Geoff,

       I usually don't show up here, since I use CDH, and good form means I
    should stay on CDH-users,
    But!
       I've been seeing the same issues for months:

      - PENDING_CLOSE too long, master tries to reassign - I see a
    continuous stream of these.
      - WrongRegionExceptions due to overlapping regions & holes in the regions.
    I just spent all day yesterday cribbing off of St.Ack's check_meta.rb
    script to write a java program to fix up overlaps & holes in an offline
    fashion (hbase down, directly on hdfs), and will start testing next week
    (cross my fingers!).
    It seems like the pending close messages can be ignored?
    And once I test my tool, and confirm I know a little bit about what I'm
    doing, maybe we could share notes?
    Take care,
       -stu



  • Geoff Hendrey at Oct 31, 2011 at 5:47 pm
    Hi Guys -

    This is a fairly complete little Tool (Configured) whose purpose is to move a whole slew of regions out into a backup directory and restore .META. when done. We found that we needed to do this when a huge volume of keys had been generated into a production table, and it turned out the whole set of keys had an incorrect prefix. Thus, what we really wanted to do was move the data out of all the regions into some backup directory in one fell swoop. This tool accepts some parameters with -D (hadoop arguments). It will remove a slew of contiguous regions, relink .META., and place the removed data in a backup directory in HDFS. It has been tested on big tables and includes catches for some subtle gotchas, like being careful when parsing region names to guard against rowkeys that actually contain commas. It worked for me, but use at your own risk.

    Basically you give it -Dregion.remove.regionname.start=STARTREGION and -Dregion.remove.regionname.end=ENDREGION, and all the data between STARTREGION and ENDREGION will be moved out of your table, where STARTREGION and ENDREGION are region names.

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HConstants;
    import org.apache.hadoop.hbase.HRegionInfo;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.NotServingRegionException;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.hbase.util.FSUtils;
    import org.apache.hadoop.hbase.util.Writables;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    /**
    * @author ghendrey
    */
    public class RemoveRegions extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new RemoveRegions(), args);
    System.exit(exitCode);
    }

    private static void deleteMetaRow(HRegionInfo closedRegion, HTable hMetaTable) throws IOException {
    Delete del = new Delete(closedRegion.getRegionName()); //Delete the original row from .META.
    hMetaTable.delete(del);
    System.out.println("Deleted the region's row from .META. " + closedRegion.getRegionNameAsString());
    }

    private static HRegionInfo closeRegion(Result result, HBaseAdmin admin) throws RuntimeException, IOException {

    byte[] bytes = result.getValue(HConstants.CATALOG_FAMILY, HConstants.REGIONINFO_QUALIFIER);
    HRegionInfo closedRegion = Writables.getHRegionInfo(bytes);

    try {
    admin.closeRegion(closedRegion.getRegionName(), null); //. Close the existing region if open.
    System.out.println("Closed the Region " + closedRegion.getRegionNameAsString());
    } catch (Exception nse) {
    System.out.println("Skipped closing the region because: " + nse.getMessage());
    }
    return closedRegion;
    }

    private static HRegionInfo getRegionInfo(String exclusiveStartRegionName, Configuration hConfig) throws IOException {
    HTable readTable = new HTable(hConfig, Bytes.toBytes(".META."));
    Get readGet = new Get(Bytes.toBytes(exclusiveStartRegionName));
    Result readResult = readTable.get(readGet);
    byte[] readBytes = readResult.getValue(HConstants.CATALOG_FAMILY, HConstants.REGIONINFO_QUALIFIER);
    HRegionInfo regionInfo = Writables.getHRegionInfo(readBytes); //Read the existing hregioninfo.
    System.out.println("got region info: " + regionInfo);
    return regionInfo;
    }

    private static void createBackupDir(Configuration conf) throws IOException {

    String path = conf.get("region.remove.backupdir", "regionBackup-" + System.currentTimeMillis());
    Path backupDirPath = new Path(path);
    FileSystem fs = backupDirPath.getFileSystem(conf);
    System.out.println("creating backup dir: " + backupDirPath.toString());
    fs.mkdirs(backupDirPath);
    }

    public int run(String[] strings) throws Exception {
    try {
    System.setProperty("javax.xml.parsers.DocumentBuilderFactory", "com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl");
    Configuration conf = getConf();
    Configuration hConfig = HBaseConfiguration.create(conf);
    hConfig.set("hbase.zookeeper.quorum", System.getProperty("hbase.zookeeper.quorum", "doop2.dt.sv4.decarta.com,doop3.dt.sv4.decarta.com,doop4.dt.sv4.decarta.com,doop5.dt.sv4.decarta.com,doop7.dt.sv4.decarta.com,doop8.dt.sv4.decarta.com,doop9.dt.sv4.decarta.com,doop10.dt.sv4.decarta.com"));
    HBaseAdmin admin = new HBaseAdmin(hConfig);
    HBaseAdmin.checkHBaseAvailable(hConfig);


    System.out.println("regions will be moved out from between region.remove.regionname.start and region.remove.regionname.end (exclusive)");
    String exclusiveStartRegionName = conf.get("region.remove.regionname.start");
    if (null == exclusiveStartRegionName) {
    throw new RuntimeException("Current implementation requires an exclusive region.remove.regionname.start");
    }
    System.out.println("region.remove.regionname.start=" + exclusiveStartRegionName);
    String exclusiveEndRegionName = conf.get("region.remove.regionname.end");
    if (null == exclusiveEndRegionName) {

    throw new RuntimeException("Current implementation requires an exclusive region.remove.regionname.end");
    }
    System.out.println("region.remove.regionname.end=" + exclusiveEndRegionName);

    //CREATE A BACKUP DIR FOR THE REGION DATA TO BE MOVED INTO
    createBackupDir(hConfig);


    Path hbaseRootPath = FSUtils.getRootDir(hConfig);
    if (null == hbaseRootPath) {
    throw new RuntimeException("couldn't determine hbase root dir");
    } else {
    System.out.println("hbase rooted at " + hbaseRootPath.toString());
    }

    HTable hMetaTable = new HTable(hConfig, Bytes.toBytes(".META."));
    System.out.println("connected to .META.");

    //get region info for start and end regions
    HRegionInfo exclusiveStartRegionInfo = getRegionInfo(exclusiveStartRegionName, hConfig);
    HRegionInfo exclusiveEndRegionInfo = getRegionInfo(exclusiveEndRegionName, hConfig);


    //CLOSE all the regions starting with the exclusiveStartRegionName (including it), and up to but excluding closing the exclusiveEndRegionName
    //and DELETE rows from .META.
    Scan scan = new Scan(Bytes.toBytes(exclusiveStartRegionName), Bytes.toBytes(exclusiveEndRegionName));
    ResultScanner metaScanner = hMetaTable.getScanner(scan);
    for (Iterator<Result> iter = metaScanner.iterator(); iter.hasNext();) {
    Result res = iter.next();
    //CLOSE REGION
    HRegionInfo closedRegion = closeRegion(res, admin);
    //MOVE ACTUAL DATA OUT OF HBASE HDFS INTO BACKUP AREA
    moveDataToBackup(closedRegion, hConfig);
    //DELETE ROW FROM META TABLE
    deleteMetaRow(closedRegion, hMetaTable);
    }

    //now reinsert the startrow into .META. with its endrow pointing to the startrow of the exclusiveEndRegionInfo
    //This effectively "relinks" the linked list of .META., now that all the interstitial region-rows have been removed from .META.
    relinkStartRow(exclusiveStartRegionInfo, exclusiveEndRegionInfo, hConfig, admin);


    return 0;

    } catch (Exception ex) {
    throw new RuntimeException(ex.getMessage(), ex);
    }

    }

    private void relinkStartRow(HRegionInfo exclusiveStartRegionInfo, HRegionInfo exclusiveEndRegionInfo, Configuration hConfig, HBaseAdmin admin) throws IllegalArgumentException, IOException {
    //Now we are going to recreate the region info for exclusiveStartRegion, such that its endKey points to the startKey
    //of the exclusiveEndRegion.
    HTableDescriptor descriptor = new HTableDescriptor(exclusiveStartRegionInfo.getTableDesc()); //Reuse the existing region's HTableDescriptor
    // Just changing the end key, nothing else. This performs the "relink" step
    byte[] startKey = exclusiveStartRegionInfo.getStartKey();
    byte[] endKey = exclusiveEndRegionInfo.getStartKey();
    HRegionInfo newStartRegion = new HRegionInfo(descriptor, startKey, endKey);
    byte[] value = Writables.getBytes(newStartRegion);
    Put put = new Put(newStartRegion.getRegionName());
    put.add(HConstants.CATALOG_FAMILY, HConstants.REGIONINFO_QUALIFIER, value); //Insert the new entry in .META. using new hregioninfo name as row key and add an info:regioninfo whose contents is the serialized new hregioninfo.
    HTable metaTable = new HTable(hConfig, ".META.");
    metaTable.put(put);
    System.out.println("New row in .META.: " + newStartRegion.getRegionNameAsString() + " End key is " + Bytes.toString(exclusiveEndRegionInfo.getStartKey()));
    admin.assign(newStartRegion.getRegionName(), true); //Assign the new region.
    System.out.println("Assigned the new region " + newStartRegion.getRegionNameAsString());
    }

    private static void moveDataToBackup(HRegionInfo closedRegion, Configuration conf) throws IOException {


    Path rootPath = FSUtils.getRootDir(conf);
    String tablename = closedRegion.getRegionNameAsString().split(",")[0]; //split regionname on comma. tablename comes before first comma
    Path tablePath = new Path(rootPath, tablename);
    String[] dotSplit = closedRegion.getRegionNameAsString().split("\\.", 0);
    String regionId = dotSplit[dotSplit.length - 1]; //split regionname on dot. regionId between last two dots
    Path regionPath = new Path(tablePath, regionId);
    System.out.println(regionPath);
    FileSystem fs = FileSystem.get(conf);

    Path regionBackupPath = new Path(conf.get("region.remove.backupdir", "regionBackup-" + System.currentTimeMillis()) + "/" + regionId);

    //Path regionBackupPath = new Path(backupPath, regionId);
    System.out.println("moving to: " + regionBackupPath);
    fs.rename(regionPath, regionBackupPath);

    }
    }
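The region-name parsing in moveDataToBackup() above can be exercised in isolation. A stdlib-only sketch of the same two splits (the sample name below is hypothetical; region names in this era look like tablename,startkey,timestamp.encodedname.):

```java
public class RegionNameParts {
    // Table names cannot contain commas, so everything before the
    // first comma is the table name, even if the row key has commas.
    static String tableName(String regionName) {
        return regionName.split(",")[0];
    }

    // The encoded region id sits between the last two dots. Splitting
    // on "." with limit 0 drops the trailing empty string left by the
    // final dot, so the last element is the encoded id, even when the
    // start key itself contains dots.
    static String regionId(String regionName) {
        String[] dotSplit = regionName.split("\\.", 0);
        return dotSplit[dotSplit.length - 1];
    }

    public static void main(String[] args) {
        // Hypothetical example name, not taken from the thread.
        String name = "mytable,row.with,punct,1319849051891.d41d8cd98f00b204e9800998ecf8427e.";
        System.out.println(tableName(name)); // mytable
        System.out.println(regionId(name));  // d41d8cd98f00b204e9800998ecf8427e
    }
}
```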

    -----Original Message-----
    From: Stuart Smith
    Sent: Saturday, October 29, 2011 1:39 PM
    To: user@hbase.apache.org
    Subject: Re: PENDING_CLOSE for too long

    Hello Geoff,

      I usually don't show up here, since I use CDH, and good form means I should stay on CDH-users,
    But!
      I've been seeing the same issues for months:

     - PENDING_CLOSE too long, master tries to reassign - I see a continuous stream of these.
     - WrongRegionExceptions due to overlapping regions & holes in the regions.

    I just spent all day yesterday cribbing off of St.Ack's check_meta.rb script to write a java program to fix up overlaps & holes in an offline fashion (hbase down, directly on hdfs), and will start testing next week (cross my fingers!).

    It seems like the pending close messages can be ignored?
    And once I test my tool, and confirm I know a little bit about what I'm doing, maybe we could share notes?

    Take care,
      -stu



  • Stack at Oct 31, 2011 at 7:51 pm
    Thanks Geoff. Mind making a JIRA and attaching the code as a patch?
    Copying and pasting from email might not work so well. Thanks boss,
    St.Ack
    On Mon, Oct 31, 2011 at 10:46 AM, Geoff Hendrey wrote:
    Hi Guys -

    This is a fairly complete little Tool (Configured) whose purpose is to move a whole slew of regions out into a backup directory and restore .META. when done. We found that we needed to do this when a huge volume of keys had been generated into a production table, and it turned out the whole set of keys had an incorrect prefix. Thus, what we really wanted to do was move the data out of all the regions into some backup directory in one fell swoop. This tool accepts some parameters with -D (hadoop arguments). It will remove a slew of contiguous regions, relink .META., and place the removed data in a backup directory in HDFS. It has been tested on big tables and includes catches for some subtle gotchas, like being careful when parsing region names to guard against rowkeys that actually contain commas. It worked for me, but use at your own risk.

    Basically you give it -Dregion.remove.regionname.start=STARTREGION and -Dregion.remove.regionname.end=ENDREGION, and all the data between STARTREGION and ENDREGION will be moved out of your table, where STARTREGION and ENDREGION are region names.

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Iterator;
    import java.util.logging.Level;
    import java.util.logging.Logger;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HConstants;
    import org.apache.hadoop.hbase.HRegionInfo;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.NotServingRegionException;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.hbase.util.FSUtils;
    import org.apache.hadoop.hbase.util.Writables;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    /**
     * @author ghendrey
     */
    public class RemoveRegions extends Configured implements Tool {

       public static void main(String[] args) throws Exception {
           int exitCode = ToolRunner.run(new RemoveRegions(), args);
           System.exit(exitCode);
       }

       private static void deleteMetaRow(HRegionInfo closedRegion, HTable hMetaTable) throws IOException {
           Delete del = new Delete(closedRegion.getRegionName()); //Delete the original row from .META.
           hMetaTable.delete(del);
           System.out.println("Deleted the region's row from .META. " + closedRegion.getRegionNameAsString());
       }

       private static HRegionInfo closeRegion(Result result, HBaseAdmin admin) throws RuntimeException, IOException {

           byte[] bytes = result.getValue(HConstants.CATALOG_FAMILY, HConstants.REGIONINFO_QUALIFIER);
           HRegionInfo closedRegion = Writables.getHRegionInfo(bytes);

           try {
               admin.closeRegion(closedRegion.getRegionName(), null); //. Close the existing region if open.
               System.out.println("Closed the Region " + closedRegion.getRegionNameAsString());
           } catch (Exception nse) {
               System.out.println("Skipped closing the region because: " + nse.getMessage());
           }
           return closedRegion;
       }

       private static HRegionInfo getRegionInfo(String exclusiveStartRegionName, Configuration hConfig) throws IOException {
           HTable readTable = new HTable(hConfig, Bytes.toBytes(".META."));
           Get readGet = new Get(Bytes.toBytes(exclusiveStartRegionName));
           Result readResult = readTable.get(readGet);
           byte[] readBytes = readResult.getValue(HConstants.CATALOG_FAMILY, HConstants.REGIONINFO_QUALIFIER);
           HRegionInfo regionInfo = Writables.getHRegionInfo(readBytes); //Read the existing hregioninfo.
           System.out.println("got region info: " + regionInfo);
           return regionInfo;
       }

       private static void createBackupDir(Configuration conf) throws IOException {

           String path = conf.get("region.remove.backupdir", "regionBackup-" + System.currentTimeMillis());
           Path backupDirPath = new Path(path);
           FileSystem fs = backupDirPath.getFileSystem(conf);
           FSUtils.DirFilter dirFilt = new FSUtils.DirFilter(fs);
           System.out.println("creating backup dir: " + backupDirPath.toString());
           fs.mkdirs(backupDirPath);
       }

       public int run(String[] strings) throws Exception {
           try {
               System.setProperty("javax.xml.parsers.DocumentBuilderFactory", "com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl");
               Configuration conf = getConf();
               Configuration hConfig = HBaseConfiguration.create(conf);
               hConfig.set("hbase.zookeeper.quorum", System.getProperty("hbase.zookeeper.quorum", "doop2.dt.sv4.decarta.com,doop3.dt.sv4.decarta.com,doop4.dt.sv4.decarta.com,doop5.dt.sv4.decarta.com,doop7.dt.sv4.decarta.com,doop8.dt.sv4.decarta.com,doop9.dt.sv4.decarta.com,doop10.dt.sv4.decarta.com"));
               HBaseAdmin admin = new HBaseAdmin(hConfig);
               HBaseAdmin.checkHBaseAvailable(hConfig);


               System.out.println("regions will be moved out from between region.remove.regionname.start and region.remove.regionname.end (exclusive)");
               String exclusiveStartRegionName = conf.get("region.remove.regionname.start");
               if (null == exclusiveStartRegionName) {
                   throw new RuntimeException("Current implementation requires an exclusive region.remove.regionname.start");
               }
               System.out.println("region.remove.regionname.start=" + exclusiveStartRegionName);
               String exclusiveEndRegionName = conf.get("region.remove.regionname.end");
               if (null == exclusiveEndRegionName) {

                   throw new RuntimeException("Current implementation requires an exclusive region.remove.endrow");
               }
               System.out.println("region.remove.regionname.end=" + exclusiveEndRegionName);

               //CREATE A BACKUP DIR FOR THE REGION DATA TO BE MOVED INTO
               createBackupDir(hConfig);


               Path hbaseRootPath = FSUtils.getRootDir(hConfig);
               if (null == hbaseRootPath) {
                   throw new RuntimeException("couldn't determine hbase root dir");
               } else {
                   System.out.println("hbase rooted at " + hbaseRootPath.toString());
               }

               HTable hMetaTable = new HTable(hConfig, Bytes.toBytes(".META."));
               System.out.println("connected to .META.");

               //get region info for start and end regions
               HRegionInfo exclusiveStartRegionInfo = getRegionInfo(exclusiveStartRegionName, hConfig);
               HRegionInfo exclusiveEndRegionInfo = getRegionInfo(exclusiveEndRegionName, hConfig);


               //CLOSE all the regions starting with the exclusiveStartRegionName (including it), and up to but excluding closing the exclusiveEndRegionName
               //and DELETE rows from .META.
               Scan scan = new Scan(Bytes.toBytes(exclusiveStartRegionName), Bytes.toBytes(exclusiveEndRegionName));
               ResultScanner metaScanner = hMetaTable.getScanner(scan);
               int i = 0;
               for (Iterator<Result> iter = metaScanner.iterator(); iter.hasNext();) {
                   Result res = iter.next();
                   //CLOSE REGION
                   HRegionInfo closedRegion = closeRegion(res, admin);
                   //MOVE ACTUAL DATA OUT OF HBASE HDFS INTO BACKUP AREA
                   moveDataToBackup(closedRegion, hConfig);
                   //DELETE ROW FROM META TABLE
                   deleteMetaRow(closedRegion, hMetaTable);
               }

               //now reinsert the startrow into .META. with it's endrow pointing to the startrow of the exclusiveEndRegionInfo
               //This effectively "relinks" the link list of .META., now that all the interstitial region-rows have been removed from .META.
               relinkStartRow(exclusiveStartRegionInfo, exclusiveEndRegionInfo, hConfig, admin);


               return 0;

           } catch (Exception ex) {
               throw new RuntimeException(ex.getMessage(), ex);
           }

       }

       private void relinkStartRow(HRegionInfo exclusiveStartRegionInfo, HRegionInfo exclusiveEndRegionInfo, Configuration hConfig, HBaseAdmin admin) throws IllegalArgumentException, IOException {
           //Now we are going to recreate the region info for exclusiveStartRegion, such that it's endKey points to the startKey
           //of the exclusiveEndRegion.
           HTableDescriptor descriptor = new HTableDescriptor(exclusiveStartRegionInfo.getTableDesc()); //Use existing hregioninfo htabledescriptor and this construction
           // Just changing the End key , nothing else. This performs the "unlink" step
           byte[] startKey = exclusiveStartRegionInfo.getStartKey();
           byte[] endKey = exclusiveEndRegionInfo.getStartKey();
           HRegionInfo newStartRegion = new HRegionInfo(descriptor, startKey, endKey);
           byte[] value = Writables.getBytes(newStartRegion);
           Put put = new Put(newStartRegion.getRegionName()); //  Same time stamp from the record.
           put.add(HConstants.CATALOG_FAMILY, HConstants.REGIONINFO_QUALIFIER, value); //Insert the new entry in .META. using new hregioninfo name as row key and add an info:regioninfo whose contents is the serialized new hregioninfo.
           HTable metaTable = new HTable(hConfig, ".META.");
           metaTable.put(put);
           System.out.println("New row in .META.: " + newStartRegion.getRegionNameAsString() + " End key is " + Bytes.toString(exclusiveEndRegionInfo.getStartKey()));
           admin.assign(newStartRegion.getRegionName(), true); //Assign the new region.
           System.out.println("Assigned the new region " + newStartRegion.getRegionNameAsString());
       }

        private static void moveDataToBackup(HRegionInfo closedRegion, Configuration conf) throws IOException {
            Path rootPath = FSUtils.getRootDir(conf);
            // Region name format: <tablename>,<startkey>,<timestamp>.<encodedname>.
            String regionName = closedRegion.getRegionNameAsString();
            String tablename = regionName.split(",")[0]; // table name precedes the first comma
            Path tablePath = new Path(rootPath, tablename);
            String[] dotSplit = regionName.split("\\.");
            String regionId = dotSplit[dotSplit.length - 1]; // encoded region id follows the last dot
            Path regionPath = new Path(tablePath, regionId);
            System.out.println("Backing up region dir: " + regionPath);
            FileSystem fs = FileSystem.get(conf);
            Path regionBackupPath = new Path(conf.get("region.remove.backupdir",
                    "regionBackup-" + System.currentTimeMillis()) + "/" + regionId);
            System.out.println("Moving to: " + regionBackupPath);
            fs.rename(regionPath, regionBackupPath);
        }
    }
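The tool above fills a single known hole. Finding where the holes are in the first place amounts to scanning the table's regions in start-key order and checking that each region's end key equals the next region's start key. A minimal, HBase-independent sketch of that check (the `HoleFinder` class and its string-pair representation of region boundaries are illustrative, not part of the original tool):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HoleFinder {
    // A region covers [startKey, endKey); .META. is intact when consecutive
    // regions abut exactly, i.e. prev.endKey == next.startKey.
    static List<String> findHoles(List<String[]> regions) {
        List<String> holes = new ArrayList<>();
        for (int i = 1; i < regions.size(); i++) {
            String prevEnd = regions.get(i - 1)[1];
            String nextStart = regions.get(i)[0];
            if (!prevEnd.equals(nextStart)) {
                holes.add(prevEnd + " -> " + nextStart);
            }
        }
        return holes;
    }

    public static void main(String[] args) {
        // Regions sorted by start key; "" marks the table's first start key
        // and last end key, as in HBase region boundaries.
        List<String[]> regions = Arrays.asList(
                new String[] {"", "bbb"},
                new String[] {"bbb", "ddd"},
                new String[] {"eee", ""}); // hole between ddd and eee
        System.out.println(findHoles(regions)); // prints [ddd -> eee]
    }
}
```

Against a live cluster, the same comparison would run over the deserialized info:regioninfo values from a .META. scan rather than string pairs.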

    -----Original Message-----
    From: Stuart Smith
    Sent: Saturday, October 29, 2011 1:39 PM
    To: user@hbase.apache.org
    Subject: Re: PENDING_CLOSE for too long

    Hello Geoff,

      I usually don't show up here, since I use CDH, and good form means I should stay on CDH-users,
    But!
      I've been seeing the same issues for months:

     - PENDING_CLOSE too long, master tries to reassign - I see a continuous stream of these.
     - WrongRegionExceptions due to overlapping regions & holes in the regions.

    I just spent all day yesterday cribbing off of St.Ack's check_meta.rb script to write a java program to fix up overlaps & holes in an offline fashion (hbase down, directly on hdfs), and will start testing next week (cross my fingers!).

    It seems like the pending close messages can be ignored?
    And once I test my tool, and confirm I know a little bit about what I'm doing, maybe we could share notes?

    Take care,
      -stu



    ________________________________
    From: Geoff Hendrey <ghendrey@decarta.com>
    To: user@hbase.apache.org
    Cc: hbase-user@hadoop.apache.org
    Sent: Saturday, September 3, 2011 12:11 AM
    Subject: RE: PENDING_CLOSE for too long

    "Are you having trouble getting to any of your data out in tables?"

    depends what you mean. We see corruptions from time to time that prevent
    us from getting data, one way or another. Today's corruption was regions
    with duplicate start and end rows. We fixed that by deleting the
    offending regions from HDFS, and running add_table.rb to restore the
    meta. The other common corruption is the holes in ".META." that we
    repair with a little tool we wrote. We'd love to learn why we see these
    corruptions with such regularity (seemingly much higher than others on
    the list).

    We will implement timeout you suggest, and see how it goes.

    Thanks,
    Geoff

    -----Original Message-----
    From: saint.ack@gmail.com On Behalf Of
    Stack
    Sent: Friday, September 02, 2011 10:51 PM
    To: user@hbase.apache.org
    Cc: hbase-user@hadoop.apache.org
    Subject: Re: PENDING_CLOSE for too long

    Are you having trouble getting to any of your data out in tables?

    To get rid of them, try restarting your master.

    Before you restart your master, do "HBASE-4126  Make timeoutmonitor
    timeout after 30 minutes instead of 3"; i.e. set
    "hbase.master.assignment.timeoutmonitor.timeout" to 1800000 in
    hbase-site.xml.
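
    (As an hbase-site.xml fragment, that setting is:)

    ```xml
    <property>
      <name>hbase.master.assignment.timeoutmonitor.timeout</name>
      <value>1800000</value> <!-- 30 minutes, in milliseconds -->
    </property>
    ```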

    St.Ack
    On Fri, Sep 2, 2011 at 1:40 PM, Geoff Hendrey wrote:
    In the master logs, I am seeing "regions in transition timed out" and
    "region has been PENDING_CLOSE for too long, running forced unasign".
    Both of these log messages occur at INFO level, so I assume they are
    innocuous. Should I be concerned?



    -geoff
  • Geoff Hendrey at Oct 31, 2011 at 5:48 pm
    attached is my original email to the list, which contains code for a tool to repair your "hole" in .META.




Discussion Overview
group: user@hbase.apache.org
categories: hbase, hadoop
posted: Sep 2, '11 at 8:41p
active: Nov 14, '11 at 11:44p
posts: 14
users: 5
website: hbase.apache.org
