FAQ
Hi,



Could somebody explain the expected behavior of HBase 0.18.1 nodes in the
following Hadoop cluster failure scenarios:



- HBase master region server fails

- HBase slave region server fails

- Hadoop master server fails

- Hadoop slave server fails

- Hadoop master and HBase master both fail (for instance when they're
installed on the same computer and that machine's disk fails)

- An HBase slave region server fails, but the HBase data can be
recovered and copied to another node, and a new node is added in place of
the failed one.



Any help would be appreciated,



Gennady Gilin


  • Cosmin Lehene at Jan 7, 2009 at 3:36 pm
    Hi,

    My answers inline...

    On 1/7/09 12:05 PM, "Genady" wrote:

    Hi,



    Could somebody explain the expected behavior of HBase 0.18.1 nodes in the
    following Hadoop cluster failure scenarios:



    - HBase master region server fails
    You need to manually designate a different machine as the master, redistribute
    the configuration files to all other region servers, and restart the cluster.

    Maybe someone on the development team could explain whether this will change
    with the ZooKeeper integration.
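
    For illustration, a minimal sketch of the configuration piece, assuming the
    0.18.x-style "hbase.master" host:port property that region servers and
    clients read at startup (the hostname and port below are hypothetical); it
    mirrors the hbase-site.xml entry you would change on every node before
    restarting:

        // Sketch only: the equivalent of editing hbase-site.xml on each node.
        // Hostname and port are hypothetical.
        import org.apache.hadoop.conf.Configuration;

        public class PointAtNewMaster {
            public static void main(String[] args) {
                Configuration conf = new Configuration();
                conf.set("hbase.master", "new-master-host:60000");
                System.out.println("hbase.master = " + conf.get("hbase.master"));
            }
        }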

    - HBase slave region server fails
    This is handled transparently. Regions served by the failed region server
    are reassigned to the rest of the region servers.
    - Hadoop master server fails
    I suppose you mean the HDFS namenode. Currently, the namenode is a single
    point of failure in HDFS and needs manual intervention to configure a new
    namenode. A secondary namenode can be configured; however, it only keeps a
    metadata replica and does not act as a failover node.

    http://wiki.apache.org/hadoop/NameNode
    - Hadoop slave server fails
    If an HDFS datanode fails, its files are already replicated on two other
    datanodes. Eventually the replication will be fixed by the namenode, which
    creates a third replica on one of the remaining datanodes.
    - Hadoop master and HBase master both fail (for instance when they're
    installed on the same computer and that machine's disk fails)
    These services run independently, so see the answers above for what happens
    in each case.
    - An HBase slave region server fails, but the HBase data can be
    recovered and copied to another node, and a new node is added in place of
    the failed one.
    HBase region servers don't actually hold the data; data is stored in HDFS.
    Region servers just serve regions, and when a region server fails its
    regions are reassigned (see above).

    Cosmin




  • Andrew Purtell at Jan 7, 2009 at 7:23 pm
    You can improve on the current situation with a little development
    effort, as we have done (privately) at my employer. Comments inline
    below.
    From: Cosmin Lehene
    My answers inline...
    On 1/7/09 12:05 PM, "Genady" wrote:
    Could somebody explain the expected behavior of HBase 0.18.1
    nodes in the following Hadoop cluster failure scenarios:



    - HBase master region server fails
    You need to manually designate a different machine as the
    master, redistribute the configuration files to all other
    region servers, and restart the cluster.

    Maybe someone on the development team could explain whether
    this will change with the ZooKeeper integration.
    First, my understanding is that the ZK integration will handle
    master role reassignment without requiring a cluster restart.
    J-D could say more (or deny).

    What we (my employer) do currently is run the Hadoop and HBase
    daemons as child processes of custom monitoring daemons that
    use a private DHT which supports TTLs on cells to write
    heartbeats into the cloud. This same mechanism also supports
    service discovery. All HBase processes in particular can be
    automatically restarted should the location of the master shift.
    (The location of the master may shift if a node has a hard
    failure.) We write Hadoop and HBase configuration files on the
    fly as necessary.

    This all took me only a few days to implement and a few more
    to debug.
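
    Very roughly, the watchdog part has this shape (a sketch only, not our
    actual code; lookupMasterLocation() stands in for whatever service
    discovery you have, and the launch command is illustrative):

        import java.io.IOException;

        public class RegionServerWatchdog {
            // Hypothetical stand-in for a real service-discovery lookup.
            static String lookupMasterLocation() { return "master-host:60000"; }

            static Process launchRegionServer() throws IOException {
                // In practice, rewrite hbase-site.xml with the current master
                // address before launching the daemon.
                return new ProcessBuilder("bin/hbase", "regionserver", "start")
                        .inheritIO().start();
            }

            public static void main(String[] args) throws Exception {
                String currentMaster = lookupMasterLocation();
                Process child = launchRegionServer();
                while (true) {
                    Thread.sleep(10_000L);
                    String master = lookupMasterLocation();
                    if (!child.isAlive() || !master.equals(currentMaster)) {
                        // Master moved or child died: regenerate config, restart.
                        child.destroy();
                        currentMaster = master;
                        child = launchRegionServer();
                    }
                }
            }
        }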

    Relocation of the Hadoop namenode is trickier. I believe it
    is possible to have it write the fs image out to an NFS share
    such that a service relocation to another host with the same
    NFS mount will pick up the latest edits seamlessly. However, I
    do not trust NFS under a number of failure conditions, so I will
    not try this myself. There might be other, better strategies
    for replicating the fs image.
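
    For what it's worth, the usual way to get the image onto such a mount is to
    list it as an additional dfs.name.dir entry; the namenode writes its image
    and edit log to every listed directory. A minimal sketch of the equivalent
    programmatic setting, with hypothetical paths:

        import org.apache.hadoop.conf.Configuration;

        public class NameDirConfig {
            public static void main(String[] args) {
                Configuration conf = new Configuration();
                // Mirrors the hdfs-site.xml entry; the second, NFS-mounted copy
                // survives loss of the local disk. Paths are hypothetical.
                conf.set("dfs.name.dir", "/data/dfs/name,/mnt/nfs/dfs/name");
                System.out.println("dfs.name.dir = " + conf.get("dfs.name.dir"));
            }
        }
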
    - HBase slave region server fails
    This is handled transparently. Regions served by the failed
    region server are reassigned to the rest of the region
    servers.
    - Hadoop master server fails
    I suppose you mean the HDFS namenode. Currently, the
    namenode is a single point of failure in HDFS and needs
    manual intervention to configure a new namenode. A
    secondary namenode can be configured; however, it only
    keeps a metadata replica and does not act as a failover node.
    http://wiki.apache.org/hadoop/NameNode
    See my comments above.
    - Hadoop slave server fails
    If an HDFS datanode fails, its files are already replicated
    on two other datanodes. Eventually the replication will be
    fixed by the namenode, which creates a third replica on one
    of the remaining datanodes.
    You want to make sure your client is requesting the default
    replication. The stock Hadoop config does allow DFS clients
    to specify a replication factor of only 1. However, the HBase
    DFS client will always request the default, so this is not
    an issue for HBase.
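
    To make that concrete, an HDFS client can pass an explicit replication
    factor when it creates a file, and that value, not just the cluster-wide
    default, decides how many copies exist. A minimal sketch (path, buffer and
    block size are illustrative):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class ReplicationDemo {
            public static void main(String[] args) throws Exception {
                FileSystem fs = FileSystem.get(new Configuration());
                Path p = new Path("/tmp/replication-demo");
                // Explicitly request 3 replicas instead of relying on whatever
                // dfs.replication the client configuration happens to carry.
                FSDataOutputStream out = fs.create(
                        p, true, 4096, (short) 3, 64L * 1024 * 1024);
                out.close();
                System.out.println("replication = "
                        + fs.getFileStatus(p).getReplication());
            }
        }
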
    - Hadoop master and HBase master both fail (
    for instance when they're installed on the same computer
    and that machine's disk fails)
    These services run independently, so see the answers above
    for what happens in each case.
    Don't run them on the same node regardless. The Hadoop
    namenode can become very busy given a lot of DFS file system
    activity. Let it have its own dedicated node to avoid problems,
    e.g. replication stalls.
    - An HBase slave region server fails, but
    the HBase data can be recovered and copied to another node,
    and a new node is added in place of the failed one.
    HBase region servers don't actually hold the data. Data
    is stored in HDFS. Region servers just serve regions, and
    when a region server fails its regions are reassigned
    (see above).

    Cosmin
    - Andy
  • Jean-Daniel Cryans at Jan 7, 2009 at 7:34 pm
    With ZK in 0.20, a cluster restart won't be necessary. Since the ROOT
    address will be stored in ZK, the clients will practically never communicate
    with the master, and the region servers will just keep serving regions. If
    the master fails, the RS should not block gets/puts but won't be able to do
    splits. However, the new master will have to be started manually (or we can
    implement a simple way to have extra masters sleeping, just in case) so that
    it can take the unique master lock, which will contain its address.

    BTW, this is all under development, but we (nitay and I) will follow the
    Bigtable way.
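
    To give a feel for the mechanism, a minimal sketch of a master taking an
    ephemeral "master lock" znode that carries its address, using the plain
    ZooKeeper client API; the znode path, quorum address and port are made up
    for illustration, not the layout HBase will actually use:

        import org.apache.zookeeper.CreateMode;
        import org.apache.zookeeper.KeeperException;
        import org.apache.zookeeper.ZooDefs;
        import org.apache.zookeeper.ZooKeeper;

        public class MasterLockSketch {
            public static void main(String[] args) throws Exception {
                ZooKeeper zk = new ZooKeeper("zk-host:2181", 30_000, event -> { });
                byte[] myAddress = "this-master-host:60000".getBytes("UTF-8");
                try {
                    // Ephemeral: the lock vanishes if this master's session dies,
                    // so a sleeping standby can grab it and publish its own address.
                    zk.create("/master-lock-sketch", myAddress,
                            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                    System.out.println("became active master");
                } catch (KeeperException.NodeExistsException e) {
                    System.out.println("another master holds the lock; standing by");
                }
            }
        }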

    J-D



  • Andrew Purtell at Jan 7, 2009 at 8:04 pm

    From: Jean-Daniel Cryans

    With ZK in 0.20, a cluster restart won't be necessary.
    Since the ROOT address will be stored in ZK, the clients
    will practically never communicate with the master, and
    the region servers will just keep serving regions. If
    the master fails, the RS should not block gets/puts but
    won't be able to do splits. However, the new master will
    have to be started manually (or we can implement a
    simple way to have extra masters sleeping, just in case)
    so that it can take the unique master lock, which will
    contain its address.
    If the master fails, and then a HRS fails, then what?
    Especially what happens if the HRS is carrying ROOT or META?

    There's no reason that all the live HRS cannot use ZK to
    negotiate among themselves who should also assume the master
    role, since the master role will also go on a diet. After ZK
    integration, is there a need for separate processes for the
    master and region server functions?

    In fact the master role might be distributed among the HRS
    via ZK. About the only need for a master would be to manage
    region assignments upon splits and HRS failures. Why not put
    up locks (or appropriate synchronization primitives) for
    every region and have the HRS figure out among themselves
    who should carry new or unassigned regions?
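
    Roughly the shape I have in mind, as a sketch only (the znode names,
    quorum address and region names are invented for illustration): each HRS
    tries an ephemeral create per unassigned region, and whoever wins the
    create serves that region:

        import org.apache.zookeeper.CreateMode;
        import org.apache.zookeeper.KeeperException;
        import org.apache.zookeeper.ZooDefs;
        import org.apache.zookeeper.ZooKeeper;

        public class RegionClaimSketch {
            public static void main(String[] args) throws Exception {
                ZooKeeper zk = new ZooKeeper("zk-host:2181", 30_000, e -> { });
                byte[] myAddress = "this-hrs-host:60020".getBytes("UTF-8");
                String[] unassigned = { "region-A", "region-B" }; // found elsewhere
                for (String region : unassigned) {
                    try {
                        // The ephemeral create is the per-region lock: exactly one
                        // HRS succeeds, and the claim vanishes if that HRS dies.
                        zk.create("/region-claim-" + region, myAddress,
                                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                        System.out.println("serving " + region);
                    } catch (KeeperException.NodeExistsException taken) {
                        System.out.println(region + " already claimed");
                    }
                }
            }
        }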

    Just thinking out loud here.

    - Andy
  • Jean-Daniel Cryans at Jan 7, 2009 at 8:17 pm
    I'm in favor of doing things incrementally, but also in favor of a system less
    dependent on a single master. As always, you have good ideas Andrew ;) Locks
    and all leases should be transferred to ZK.

    For the next release, if we stick to the current plan, an HRS failing while the
    master is unavailable is indeed very bad, so it would be advisable to keep
    the master downtime as short as possible (by keeping a sleeping master that
    keeps pinging the unique master lock every 2 seconds, for example). Then the
    first thing the new master has to do is to scan ROOT to see if all META
    assignments are correct and fix those that are wrong (same with ROOT, after
    reading ZK). Then it scans META to confirm all region assignments, reassigning
    regions when needed.

    J-D



  • Stack at Jan 7, 2009 at 11:32 pm

    Andrew Purtell wrote:
    There's no reason that all the live HRS cannot use ZK to
    negotiate among themselves who should also assume the master
    role, since the master role will also go on a diet. After ZK
    integration, is there a need for separate processes for the
    master and region server functions?
    Profiling the master yesterday, I saw it doing next to nothing but waiting
    for something to do (heavy upload on a small cluster) -- and this is
    before a bunch of its functionality is redone over in ZK.
    In fact the master role might be distributed among the HRS
    via ZK. About the only need for a master would be to manage
    region assignments upon splits and HRS failures. Why not put
    up locks (or appropriate synchronization primitives) for
    every region and have the HRS figure out among themselves
    who should carry new or unassigned regions?
    I like this idea of yours Andrew. Make an issue?

    St.Ack
  • Stack at Jan 7, 2009 at 11:50 pm

    stack wrote:
    Andrew Purtell wrote:
    In fact the master role might be distributed among the HRS
    via ZK. About the only need for a master would be to manage
    region assignments upon splits and HRS failures. Why not put
    up locks (or appropriate synchronization primitives) for
    every region and have the HRS figure out among themselves
    who should carry new or unassigned regions?
    Thinking more on the above: HRS failure recovery is currently done by the
    single-node master in a single thread, when what we really need is
    distributed processing of the failure, splitting the commit logs (as per the
    BT paper), so failures are cleaned up promptly.

    The base point is that after ZK integration, the master can be shrunk even
    further. Rather than running the actual split itself, its role shrinks to
    that of an orchestrator.
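
    To sketch what "orchestrator" could mean here (an invented layout, not a
    design): the master publishes one task znode per commit log of the dead
    HRS, and the region servers race to claim and split them, so the master
    never splits anything itself:

        import org.apache.zookeeper.CreateMode;
        import org.apache.zookeeper.ZooDefs;
        import org.apache.zookeeper.ZooKeeper;

        public class SplitOrchestratorSketch {
            public static void main(String[] args) throws Exception {
                ZooKeeper zk = new ZooKeeper("zk-host:2181", 30_000, e -> { });
                // Commit logs left by a dead HRS; in reality these would be listed
                // from the dead server's log directory in HDFS.
                String[] logs = { "log.0001", "log.0002", "log.0003" };
                for (String log : logs) {
                    // One persistent task node per log; HRSs would claim a task
                    // (e.g. with an ephemeral child) and run the split themselves.
                    zk.create("/split-task-" + log, new byte[0],
                            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
                }
                System.out.println("published " + logs.length + " split tasks");
            }
        }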

    St.Ack
