Taking down ROOT/META regionserver can result in cluster becoming in-operational
--------------------------------------------------------------------------------

Key: HBASE-1457
URL: https://issues.apache.org/jira/browse/HBASE-1457
Project: Hadoop HBase
Issue Type: Bug
Affects Versions: 0.20.0
Reporter: ryan rawson
Fix For: 0.20.0


Take down a regionserver via controlled or uncontrolled shutdown and the master doesn't properly reassign the root/meta regions.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


  • ryan rawson (JIRA) at May 29, 2009 at 5:48 am
    [ https://issues.apache.org/jira/browse/HBASE-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    ryan rawson updated HBASE-1457:
    -------------------------------

    Attachment: HBASE-1457.patch

    May not apply cleanly against trunk :-(
  • ryan rawson (JIRA) at May 29, 2009 at 7:00 am
    [ https://issues.apache.org/jira/browse/HBASE-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    ryan rawson reassigned HBASE-1457:
    ----------------------------------

    Assignee: ryan rawson
  • ryan rawson (JIRA) at May 29, 2009 at 7:02 am
    [ https://issues.apache.org/jira/browse/HBASE-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    ryan rawson updated HBASE-1457:
    -------------------------------

    Status: Patch Available (was: Open)
  • Andrew Purtell (JIRA) at May 29, 2009 at 4:54 pm
    [ https://issues.apache.org/jira/browse/HBASE-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714508#action_12714508 ]

    Andrew Purtell commented on HBASE-1457:
    ---------------------------------------

    Patch looks good. I'll try it out.
  • stack (JIRA) at May 29, 2009 at 5:36 pm
    [ https://issues.apache.org/jira/browse/HBASE-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714525#action_12714525 ]

    stack commented on HBASE-1457:
    ------------------------------

    Patch changes MetaRegion so it takes an HRegionInfo rather than a region name and startkey.

    It ensures -ROOT- and .META. assignment happens first; previously -ROOT- didn't get special treatment. Also, it doesn't depend on getting the close of a catalog region. Instead, on server exit, it checks whether the server was carrying catalog regions and, if it was, schedules them for immediate assignment (no log splitting when a server exits, as opposed to crashes).

    It takes the updating of the region historian out of the main code path that processes all's-well messages, putting it instead on the todo queue to be processed by a worker thread IF meta and root are online.
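
    [Editorial sketch] The catalog-first ordering described above can be illustrated roughly as follows. This is a minimal sketch, not the code from the patch: the class name CatalogFirstAssignment, the rank() helper, and the use of plain strings for region names are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical illustration only; not the HBase master implementation. */
public class CatalogFirstAssignment {

    /** Assumed ranking: -ROOT- first, then .META., then every other region. */
    public static int rank(String regionName) {
        if (regionName.startsWith("-ROOT-")) return 0;
        if (regionName.startsWith(".META.")) return 1;
        return 2;
    }

    /**
     * Given the regions an exited server was carrying, return them in the
     * order they should be scheduled for assignment. The sort is stable, so
     * regions of equal rank keep their original relative order.
     */
    public static List<String> assignmentOrder(List<String> regionsOnExitedServer) {
        List<String> ordered = new ArrayList<>(regionsOnExitedServer);
        ordered.sort(Comparator.comparingInt(CatalogFirstAssignment::rank));
        return ordered;
    }

    public static void main(String[] args) {
        // Region names here are illustrative placeholders.
        List<String> carried = Arrays.asList("users,,1", ".META.,,1", "-ROOT-,,0");
        System.out.println(assignmentOrder(carried));
    }
}
```

    The point of the sketch is only the ordering: catalog regions are pulled to the front of the work list so that user regions, whose assignment needs the catalog, are never scheduled ahead of them.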



  • stack (JIRA) at May 29, 2009 at 5:54 pm
    [ https://issues.apache.org/jira/browse/HBASE-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    stack updated HBASE-1457:
    -------------------------

    Attachment: HBASE-1457-v2.patch



    I reviewed Ryan's patch and it all looks good to me. I was going to suggest adding toString to the new anonymous TODO queue addition but see it is already done. The attached patch applies cleanly to TRUNK.
  • ryan rawson (JIRA) at May 30, 2009 at 4:55 am
    [ https://issues.apache.org/jira/browse/HBASE-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    ryan rawson updated HBASE-1457:
    -------------------------------

    Attachment: HBASE-1457-v3.patch

    Fixes a case where ROOT isn't recovered after a hard, kill -9 style regionserver crash. Beefed up handling of ROOT/META in ProcessServerShutdown.
  • stack (JIRA) at May 30, 2009 at 1:36 pm
    [ https://issues.apache.org/jira/browse/HBASE-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714713#action_12714713 ]

    stack commented on HBASE-1457:
    ------------------------------

    This last patch seems to work great; the only odd thing is that it always reassigns -ROOT- and .META. I flush the -ROOT- region, then I kill -9 the -ROOT- server. I see the logs being split and then -ROOT- assigned. I see the regionserver opening -ROOT- AND applying edits, but then when the -ROOT- scanner runs, it says server and startcode are empty and thinks the .META. assignment is invalid.

    I'm looking into the above some.
  • Andrew Purtell (JIRA) at May 30, 2009 at 6:36 pm
    [ https://issues.apache.org/jira/browse/HBASE-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714756#action_12714756 ]

    Andrew Purtell commented on HBASE-1457:
    ---------------------------------------

    +1 for committing the -v3 patch now to trunk and the 0.19 branch, and working on stack's reported nit in another issue.
  • ryan rawson (JIRA) at May 31, 2009 at 8:42 am
    [ https://issues.apache.org/jira/browse/HBASE-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    ryan rawson updated HBASE-1457:
    -------------------------------

    Attachment: HBASE-1457-v4.patch

    The latest fix, including:
    - make region historian writes go into the todo queue
    - make the todo queue a priority queue, putting higher-priority items at the top
    - ensure double assignment of ROOT/META can't happen
    - prevent assignment bugs when the cluster is mis-loaded, and ensure ROOT/META get assigned as fast as possible to the first server (rather than the best server, as was done previously)
    -- assignment could get stuck when the 'best' server was unable to contact the master because ROOT/META was offline. Very ugly bug.
    - reduce how much we retry in pending operations. Retries can delay recovery because if META/ROOT goes down while processing a TODO, the recovery of META/ROOT has to wait until the currently running pending operation times out. This could previously take over 5 minutes (!!). 1 second timeouts * 10 * 2-3 per commit() * 2 attempts takes a long time.
    - fix a bug where, if ROOT was unavailable, some pending operations might fail and not get requeued.
    - handle bugs where a server would go offline and 'forget' to mention that ROOT or META went offline, thus delaying reassignment. Now we force META/ROOT offline ASAP and get them reassigned as fast as possible on clean shutdown.
    - improved unclean shutdown handling of META: instead of waiting for the ROOT scanner to detect a bad assignment and fix it, be more proactive and queue META for assignment once the log split is finished. This can improve META recovery time by 5-10 seconds.
    - fixed a rare but deadly NPE in ProcessRegionOpen, and improved the handling of failed todo operations: instead of putting them back into the todo queue, put them into the delayed queue (since the NPE is a side effect of not having ROOT online yet).

    Yes, all these fixes are incorporated in this relatively small patch (933 lines of diff).
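
    [Editorial sketch] Two of the ideas in the list above, a priority-ordered todo queue and a separate delayed queue for operations that fail, can be sketched as below. This is a minimal illustration under stated assumptions, not the patch's code: the class TodoQueueSketch, the Op type, the integer priorities, and the releaseDelayed() trigger are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.Queue;
import java.util.function.Predicate;

/** Hypothetical illustration only; not the HBase master implementation. */
public class TodoQueueSketch {

    /** A pending operation. A lower priority value means it runs earlier. */
    static final class Op {
        final String name;
        final int priority;
        final long seq; // insertion order, used to break priority ties FIFO

        Op(String name, int priority, long seq) {
            this.name = name;
            this.priority = priority;
            this.seq = seq;
        }
    }

    private final PriorityQueue<Op> todo = new PriorityQueue<>(
        Comparator.comparingInt((Op o) -> o.priority).thenComparingLong(o -> o.seq));
    private final Queue<Op> delayed = new ArrayDeque<>();
    private long nextSeq = 0;

    public void submit(String name, int priority) {
        todo.add(new Op(name, priority, nextSeq++));
    }

    /**
     * Runs the highest-priority operation. If the attempt fails, the operation
     * is parked in the delayed queue rather than being put straight back on
     * the todo queue, so it cannot block or starve catalog recovery.
     * Returns the name of the operation attempted, or null if none was queued.
     */
    public String runNext(Predicate<String> attempt) {
        Op op = todo.poll();
        if (op == null) {
            return null;
        }
        if (!attempt.test(op.name)) {
            delayed.add(op);
        }
        return op.name;
    }

    /** Moves delayed operations back to the todo queue, e.g. once ROOT is online. */
    public void releaseDelayed() {
        while (!delayed.isEmpty()) {
            todo.add(delayed.poll());
        }
    }

    public int delayedCount() {
        return delayed.size();
    }
}
```

    In this sketch a ROOT assignment submitted with priority 0 is always polled before a historian update submitted with priority 2, regardless of submission order, and a failed operation waits in the delayed queue until releaseDelayed() is called.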

  • stack (JIRA) at May 31, 2009 at 3:56 pm
    [ https://issues.apache.org/jira/browse/HBASE-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714888#action_12714888 ]

    stack commented on HBASE-1457:
    ------------------------------

    I tested it. Works great. There is an issue where, if -ROOT- goes down, after a successful redeploy I see that .META. is also redeployed (it says the assignment is invalid though it is not). Will make a separate issue for this.

    Working on the backport. It's a little sticky.
  • stack (JIRA) at May 31, 2009 at 4:04 pm
    [ https://issues.apache.org/jira/browse/HBASE-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    stack updated HBASE-1457:
    -------------------------

    Attachment: 1457-0.19.patch

    0.19 patch based on v4. Testing now.
  • stack (JIRA) at May 31, 2009 at 4:30 pm
    [ https://issues.apache.org/jira/browse/HBASE-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    stack updated HBASE-1457:
    -------------------------

    Resolution: Fixed
    Fix Version/s: 0.19.4 (was: 0.20.0)
    Status: Resolved (was: Patch Available)

    Tested the 0.19 patch. The recovery is not as sleek as it is in 0.20 because there is no zk back in the branch, but it works. I left the 'alls well' message in trunk but removed it in the branch. It's a little obnoxious but we can turn it off just before release. In the meantime it will help debugging. Thanks for the great patch, Ryan.
