Hi

I'm looking into NameNode high availability, but so far I have found only
one approach, which uses DRBD.
I tried to make it work using Xen over DRBD, but it didn't quite work - in
fact, I gained some very valuable experience recovering metadata from the
SecondaryNameNode. :)

Are there any other approaches which would make the NameNode
highly available?

Also, while we're on the subject, is it possible to serve the config directory
from NFS, to have a single configuration for all the nodes?

Thanks in advance for any ideas.


  • Todd Lipcon at Oct 1, 2009 at 7:03 pm

    On Thu, Oct 1, 2009 at 10:53 AM, Stas Oskin wrote:

    Hi

    I'm looking into NameNode high availability, but so far I have found only
    one approach, which uses DRBD.
    I tried to make it work using Xen over DRBD, but it didn't quite work - in
    fact, I gained some very valuable experience recovering metadata from the
    SecondaryNameNode. :)
    Could you share the way in which it didn't quite work? That would be
    valuable information for the community.

    Always good to learn how to recover metadata :) You can do fire drills like
    this on a pseudo-distributed cluster too - probably good for any ops people
    out there who haven't tried it before.

    Are there any other approaches which would make the NameNode
    highly available?
    I think this discussion came up last week on the list. Check the archives.


    Also, while we're on the subject, is it possible to serve the config directory
    from NFS, to have a single configuration for all the nodes?
    Yes, it should work fine, but you'll really be kicking yourself when your
    NFS server is down and thus the entirety of your Hadoop cluster won't start
    either :) I'd recommend rsync, personally. Keep things simple :)
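    For example, a minimal sketch of pushing the config out with rsync (the
    slaves file and the paths are assumptions - adjust for your layout):

        # Push the master's Hadoop config to every worker listed in conf/slaves.
        for host in $(cat /etc/hadoop/conf/slaves); do
          rsync -av --delete /etc/hadoop/conf/ "$host":/etc/hadoop/conf/
        done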

    -Todd
  • Stas Oskin at Oct 1, 2009 at 10:11 pm
    Hi.

    Could you share the way in which it didn't quite work? That would be
    valuable information for the community.
    The idea is to have a Xen machine dedicated to the NN, and maybe to the
    SNN, running over DRBD, as described here:
    http://www.drbd.org/users-guide/ch-xen.html

    The VM would be monitored by Heartbeat, which would restart it on another
    node when it fails.

    I wanted to go that way as I thought it was perfect for a small cluster,
    since the node can then be reused for other tasks.
    Once the cluster grows enough, the VM could be live-migrated to a
    dedicated machine - with minimal downtime.

    The problem is that it didn't work as expected. Xen over DRBD is just not
    reliable as described. The most basic operation, live domain migration,
    works in only 50% of cases. Most often the migration leaves the DRBD
    device in read-only status, meaning the domain can't be cleanly shut down
    - only killed. This in turn often leads to NN metadata corruption.
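    For reference, a rough sketch of how I'd inspect the DRBD state after a
    failed migration (assuming a resource named r0):

        cat /proc/drbd      # look at ro: (node roles) and ds: (disk states)
        drbdadm role r0     # expect Primary/Secondary on the surviving node
        drbdadm disconnect r0 && drbdadm connect r0   # try a clean reconnect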


    Always good to learn how to recover metadata :) You can do fire drills like
    this on a pseudo-distributed cluster too - probably good for any ops people
    out there who haven't tried it before.
    By the way, several times I managed to break the SNN's main checkpoint as
    well. In that case, I manually replaced the checkpoint with the contents
    of the "previous" directory.

    Is it intended that such copying has to be done manually? I mean, if the
    NN can't importCheckpoint from the SNN, shouldn't it offer to import the
    previous one?
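    For the record, the recovery itself was roughly along these lines (paths
    are examples only):

        # Point fs.checkpoint.dir (hadoop-site.xml / hdfs-site.xml) at the
        # surviving SNN checkpoint, make sure dfs.name.dir is empty, then:
        bin/hadoop namenode -importCheckpoint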

    And another question while we're at it - if the metadata is rolled back
    to the last stable checkpoint, what happens to the blocks on the
    DataNodes for files created after the checkpoint? Will the DataNodes
    erase them eventually, or will they just be left there forever?


    Are there any other approaches which would make the NameNode
    highly available?
    I think this discussion came up last week on the list. Check the archives.
    Can you tell me the name of that discussion on the list?

    Also, while we're on the subject, is it possible to serve the config directory
    from NFS, to have a single configuration for all the nodes?

    Yes, it should work fine, but you'll really be kicking yourself when your
    NFS server is down and thus the entirety of your Hadoop cluster won't start
    either :) I'd recommend rsync, personally. Keep things simple :)
    Good idea :).

    Thanks again.
  • Steve Loughran at Oct 2, 2009 at 9:50 am

    Stas Oskin wrote:
    Hi.

    Could you share the way in which it didn't quite work? That would be
    valuable information for the community.
    The idea is to have a Xen machine dedicated to the NN, and maybe to the
    SNN, running over DRBD, as described here:
    http://www.drbd.org/users-guide/ch-xen.html

    The VM would be monitored by Heartbeat, which would restart it on another
    node when it fails.

    I wanted to go that way as I thought it was perfect for a small cluster,
    since the node can then be reused for other tasks.
    Once the cluster grows enough, the VM could be live-migrated to a
    dedicated machine - with minimal downtime.

    The problem is that it didn't work as expected. Xen over DRBD is just not
    reliable as described. The most basic operation, live domain migration,
    works in only 50% of cases. Most often the migration leaves the DRBD
    device in read-only status, meaning the domain can't be cleanly shut down
    - only killed. This in turn often leads to NN metadata corruption.
    It's probably a quirk of virtualisation - all those clocks and things
    cause trouble for any HA protocol running round the cluster. I would not
    blame Xen; VMWare and VirtualBox are also tricky.

    As you have a virtual infrastructure, why not have an image of the
    primary NN ready to bring up on demand when the NN goes down, pointed at
    a copy of the NN datasets?
  • Stas Oskin at Oct 2, 2009 at 10:42 am
    Hi.

    The HA service (Heartbeat) is running on Dom0, and when the primary
    node is down, it basically just starts the VM on the other node. So
    there shouldn't be any timing issues.

    Can you explain a bit more about your approach - how to automate it, for
    example?

    Thanks.
  • Steve Loughran at Oct 2, 2009 at 11:23 am

    Stas Oskin wrote:
    Hi.

    The HA service (Heartbeat) is running on Dom0, and when the primary
    node is down, it basically just starts the VM on the other node. So
    there shouldn't be any timing issues.

    Can you explain a bit more about your approach - how to automate it, for
    example?
    * You need to have something - a resource manager - keeping an eye on the
    NN from somewhere. Needless to say, that needs to be fairly HA too.

    * Your NN image has to be ready to go.

    * When the deployed NN goes away, bring up a new machine with the same
    image, hostname *and IP address* (see the sketch below). You can't always
    pull that last one off; it depends on the infrastructure. Without it,
    you'd need to bring up all the nodes with DNS caching set to a short time
    and update a DNS entry.

    This isn't real HA, it's recovery.
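    A minimal Heartbeat v1 sketch of that kind of takeover (the hostname,
    floating IP and init script name are assumptions, not a prescription):

        # /etc/ha.d/haresources - a single resource line. nn-host1 is the
        # preferred node, 10.0.0.50 the floating IP that fs.default.name
        # points at, hadoop-namenode an LSB init script for the NN.
        nn-host1 IPaddr::10.0.0.50/24/eth0 hadoop-namenode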
  • Stas Oskin at Oct 2, 2009 at 3:18 pm
    Hi.

    * You need to have something - a resource manager - keeping an eye on the
    NN from somewhere. Needless to say, that needs to be fairly HA too.
    * Your NN image has to be ready to go.
    * When the deployed NN goes away, bring up a new machine with the same
    image, hostname *and IP address*. You can't always pull that last one
    off; it depends on the infrastructure. Without it, you'd need to bring up
    all the nodes with DNS caching set to a short time and update a DNS
    entry.

    This isn't real HA, it's recovery.
    All this can be done with Heartbeat and Xen:

    1) Heartbeat is P2P, so there is no SPOF here.
    2) It is possible to start the Xen VM on another node when the original
    node has failed.

    The only question left is how to keep access to the NN/SNN metadata when
    one of the NN VMs fails.

    Maybe by keeping NFS exports on both Dom0s, and writing to them in
    parallel?

    In that case, if one Dom0 fails, taking the VM with it, the VM will come
    up on the other node and read the metadata over NFS from its own Dom0.

    It's less clean than keeping the metadata inside the VM as well, but it
    might work, as DRBD won't be used here.
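    For what it's worth, HDFS can do the parallel writes itself: dfs.name.dir
    takes a comma-separated list of directories and the NN writes its image
    and edit log to all of them, so a local path plus an NFS path gives a
    live second copy. A minimal sketch (paths are examples):

        <!-- hdfs-site.xml (hadoop-site.xml on older releases) -->
        <property>
          <name>dfs.name.dir</name>
          <!-- local disk plus an NFS mount; both receive every write -->
          <value>/data/dfs/name,/mnt/nfs/dfs/name</value>
        </property>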

    1) What do you think of this approach? Maybe there is a better solution?

    2) What happens if the NN crashes in the middle of work - can it corrupt
    the metadata in some way that would require a manual restore of the SNN
    checkpoint? Meaning the process would still require user intervention.

    3) Any idea where that HA discussion from last week is in the archives?

    Thanks again!
  • Otis Gospodnetic at Oct 3, 2009 at 5:07 am
    Related (but not helping the immediate question): China Telecom developed
    something they call HyperDFS. They modified Hadoop and made it possible
    to run a cluster of NNs, thus eliminating the SPOF.

    I don't have the details - the presenter at Hadoop World (last round of
    sessions, 2nd floor) mentioned it, but didn't give a clear answer when
    asked about contributing it back.

    Otis
    --
    Sematext is hiring -- http://sematext.com/about/jobs.html?mls
    Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR


  • Edward Capriolo at Oct 3, 2009 at 1:01 pm

    Stas,

    I think your setup can work, but there are some inherent complexities
    that are not accounted for. For instance, NameNode metadata is not
    written to disk in a transactional fashion. Thus, even though you have
    block-level replication, you cannot be sure that the underlying NameNode
    data is in a consistent state. (This should not be an issue with live
    migration, though.)

    I have worked with Linux-HA, DRBD, OCFS2 and many other HA technologies,
    so let me summarize my experience. The idea is that a normal standalone
    system has components that fail, but the mean time to failure is high -
    say 99.9% availability; disks fail most often, so solid state or RAID1
    gets you to, say, 99.99%. If you look really hard at what the NameNode
    does, even a massive Hadoop file system may amount to only a few hundred
    MB to a few GB of NameNode data. Assuming you have a hot spare, restoring
    a few GB of data would not take very long. (Large volumes of data are
    tricky because they take longer to restore.)

    If Xen live migration works, it is a slam dunk and very cool! You might
    not even miss a ping. But let's say you're NOT doing a live migration:
    down the Xen instance and bring it up on the other node. The failover
    might take 15 seconds, but it may work more reliably.

    From experience, I will tell you that on a different project I went way
    cutting-edge: DRBD, Linux-HA and OCFS2 for mounting the same file system
    simultaneously on two nodes. It worked OK, but I sank weeks of research
    into it and it was very complex - OCFS2 conf files, HA conf files, DRBD
    conf files, kernel modules - and no matter how much documentation I
    wrote, no one could follow it but me.

    The lesson I learned was that I really did not need that sub-second
    failover, nor did I need to be able to dual-mount the drive. With those
    things stripped out, I had better results. So you might want to consider:
    do you need live migration? Do you need Xen?

    This doc gets by with fewer moving parts:
    http://www.cloudera.com/blog/2009/07/22/hadoop-ha-configuration/
    ContextWeb presented at Hadoop World NYC and mentioned that with this
    setup they had 6 failovers - 3 planned, 3 unplanned - and it worked each
    time.
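    That setup boils down to a DRBD-replicated partition holding
    dfs.name.dir, with Heartbeat moving a floating IP and the NN service
    between two nodes. A rough sketch of the DRBD side (hostnames, devices
    and addresses are examples, not taken from the article):

        # /etc/drbd.conf (DRBD 8.x) - one resource for the NN metadata volume
        resource nn-meta {
          protocol C;                # synchronous replication
          on nn-host1 {
            device    /dev/drbd0;
            disk      /dev/sdb1;     # backing partition for dfs.name.dir
            address   10.0.0.1:7789;
            meta-disk internal;
          }
          on nn-host2 {
            device    /dev/drbd0;
            disk      /dev/sdb1;
            address   10.0.0.2:7789;
            meta-disk internal;
          }
        }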

    FYI - I did imply earlier that the DRBD/Xen approach should work. Sorry
    for being misleading. People have mixed results with some of these HA
    tools; I have them working in some cases, and then in other edge cases
    they do not. Your mileage may vary.
  • Stas Oskin at Oct 3, 2009 at 3:08 pm
    Hi Edward.
    Thanks for the follow-up, and for the time spent in private chat. I must
    say that while Xen migration is cool, I'm really looking for HA only.

    So far, the above-mentioned Xen/DRBD hasn't worked reliably for me, and
    I've also burnt at least 3 days on it.

    There are too many components in this whole mix, and it appears that many
    of them still have bugs.

    Therefore, I'm seriously thinking of abandoning this approach and
    following the above-mentioned guide instead - as I already have the
    infrastructure (Heartbeat and DRBD) ready.

    My only concern so far was that I really wanted to keep some kind of
    logical separation (i.e. DataNode and NameNode), with a plan to
    seamlessly migrate to a separate machine in the future. But it seems it
    really isn't worth the hassle.

    So my new plan:

    1) Drop Xen completely, to keep the number of possible fault layers to a
    minimum.

    2) Keep DRBD for the NN and SNN metadata, possibly mixing it with MySQL
    data. This means I get 2 HA solutions for the cost of one - any comments
    on whether this mixup is healthy?

    3) Create NFS exports on each server for SNN backup - in the unlikely
    event that DRBD breaks on every server at once.

    An additional point I forgot to add: in my case DRBD runs over software
    RAID 1, which means I have disk-level redundancy as well.
    So far this setup has worked just fine.
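    In other words, the storage stack per server, bottom-up, would be (device
    names are examples):

        # RAID1 first (disk-level HA), DRBD on top (server-level HA), then a
        # filesystem for dfs.name.dir on whichever node is primary.
        mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
        drbdadm create-md nn-meta && drbdadm up nn-meta  # disk is /dev/md0
        mkfs.ext3 /dev/drbd0   # once, on the primary node only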

    Any comments from the list regarding the above?

    Edward, as we discussed, I'm also going to check out VLinux for limiting
    the processes, in case they all end up running on the same machine.

    Thanks in advance!


  • Stas Oskin at Oct 3, 2009 at 2:08 pm
    Hi.
    Interesting - aren't they supposed to contribute back, as Hadoop is open
    source?

    Regards.

  • Steve Loughran at Oct 5, 2009 at 9:29 am

    Stas Oskin wrote:
    Hi.
    Interesting - aren't they supposed to contribute back, as Hadoop is open
    source?

    Regards.


    1. The Apache license says there is no need to contribute back.
    2. Even the LGPL and GPL say there is no need to contribute back if you
    don't distribute the code.

    So no, they don't have to.

    By not distributing their patches, they are saying "we agree to take on
    all maintenance and testing costs forever".

    Their choice.
  • Isabel Drost at Oct 5, 2009 at 11:44 am

    On Mon, 05 Oct 2009 10:28:58 +0100 Steve Loughran wrote:

    2. Even the LGPL and GPL say there is no need to contribute back if you
    don't distribute the code.
    Sorry in advance for the nitpicking: IANAL, but AFAIK even the LGPL and
    GPL do not force you to contribute back. The only thing the GPL does is
    force you to distribute the source code (including any modifications on
    your side) along with the binary you created*.

    ;)

    Isabel


    * The goal being to enable the user of your binary to freely use the
    software, to study the way it was created, to improve it and to share
    it freely with others.
  • Steve Loughran at Oct 5, 2009 at 12:01 pm

    Isabel Drost wrote:
    On Mon, 05 Oct 2009 10:28:58 +0100
    Steve Loughran wrote:
    2. Even the LGPL and GPL say there is no need to contribute back if you
    don't distribute the code.
    Sorry in advance for the nitpicking: IANAL, but AFAIK even the LGPL and
    GPL do not force you to contribute back. The only thing the GPL does is
    force you to distribute the source code (including any modifications on
    your side) along with the binary you created*.
    Good point. You only have to publish the source to the customers who
    receive the binaries, not get the changes back into the codebase.

    If you run GPL code on your own servers, you aren't distributing it, so
    the requirements to publish any source don't even kick in.

    Now, the ASF license doesn't even require you to publish your changes.
    Even so, because you end up taking on all maintenance costs - including
    testing - it's not something I'd recommend. You might think that because
    you have put in six months' worth of effort your code is somehow better -
    and it may be - but the value of that effort will evaporate unless you
    keep investing effort in merging your changes with the codebase, and
    there is a risk that other people who want the same feature will do
    something different.

    In the specific case of Hadoop, it is a very fast-moving codebase. As
    someone who keeps his own branch and updates it every few weeks, I can
    say the code split was pretty traumatic all round. Things have stopped
    moving now, but it will have hurt everyone who branched. Whoever wrote an
    HA namenode will have to deal not only with those changes, but with the
    ongoing extensions to the NN to improve recovery - the backup namenode,
    and other implications.

    I'm steering clear of all that, but I am trying to make it easier to
    manage and configure the various services, so that you can bring them up
    in different ways quite rapidly. Ideally I'd like every worker node to
    handle the failure and migration of the master nodes far more gracefully
    than today.

    -steve
