How does an offline Datanode come back up?
Hi All,

I've been working through Michael Noll's multi-node cluster setup example
(Running_Hadoop_On_Ubuntu_Linux) for Hadoop and I have a working setup. On my
slave machine, which is currently running a datanode, I then killed the
process in an effort to simulate some sort of failure of the slave machine's
datanode. I had assumed that the namenode would be polling its datanodes and
would therefore attempt to bring back up any node that goes down. On looking
at my slave machine it seems that the datanode process is still down (I've
checked jps).

Obviously I'm missing something! Does Hadoop look after its datanodes? Is
there a config setting that I may have missed? Do I need to create some sort
of external tool to poll for and attempt to bring up nodes that have gone
down?

Thanks
Will

--
View this message in context: http://www.nabble.com/How-does-an-offline-Datanode-come-back-up---tp20192214p20192214.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.


  • Alex Loddengaard at Oct 27, 2008 at 6:23 pm
    I'm pretty sure that failed nodes won't be automatically added back to
    the cluster when they go down. It's the sysadmin's responsibility to deal
    with downed nodes and get them back into the cluster.

    Alex
    On 10/27/08, wmitchell wrote:

    Does Hadoop look after its datanodes? Is there a config setting that I
    may have missed? Do I need to create some sort of external tool to poll
    for and attempt to bring up nodes that have gone down?
  • Steve Loughran at Oct 28, 2008 at 10:26 am

    wmitchell wrote:
    I had assumed that the namenode would be polling its datanodes and would
    therefore attempt to bring back up any node that goes down. On looking at
    my slave machine it seems that the datanode process is still down (I've
    checked jps).
    That's up to you or your management tools. The namenode knows that the
    datanode is unreachable, but doesn't know how to go about reconnecting
    it to the network. Which, given that there are many causes of "down",
    sort of makes sense. The switch failing, the HDDs dying or the process
    crashing all look the same: no datanode heartbeats.
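    Nothing restarts the process for you, but the namenode's heartbeat view
    is easy to monitor from anywhere with the cluster config. A minimal
    sketch that polls it through "hadoop dfsadmin -report"; the exact wording
    of the report summary varies between Hadoop versions, so the parsing
    below is an assumption you may need to adjust for your release:

        #!/usr/bin/env python
        # Sketch: ask the namenode how many datanodes it considers dead.
        # Assumes the "hadoop" launcher is on PATH; the report wording
        # differs across versions, so the regexes are assumptions.
        import re
        import subprocess
        import sys

        def dead_datanode_count():
            # dfsadmin -report prints the namenode's view of the cluster,
            # which is driven by the same datanode heartbeats described above.
            out = subprocess.run(["hadoop", "dfsadmin", "-report"],
                                 capture_output=True, text=True,
                                 check=True).stdout
            m = (re.search(r"(\d+)\s+dead", out, re.IGNORECASE)
                 or re.search(r"dead datanodes\D*(\d+)", out, re.IGNORECASE))
            return int(m.group(1)) if m else 0

        if __name__ == "__main__":
            dead = dead_datanode_count()
            if dead:
                print("WARNING: namenode reports %d dead datanode(s)" % dead)
                sys.exit(1)  # non-zero exit so cron or a Nagios check can alert
            print("all datanodes are heartbeating")

    Wire the exit status into cron mail or a monitoring check; this only
    detects the condition, something else still has to do the restart.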
  • Norbert Burger at Oct 29, 2008 at 3:01 am
    Along these lines, I'm curious what "management tools" folks are using to
    ensure cluster availability (i.e., auto-restart failed datanodes/namenodes).

    Are you using a custom cron script, or maybe something more complex
    (Ganglia, Nagios, puppet, etc.)?

    Thanks,
    Norbert
    On 10/28/08, Steve Loughran wrote:

    That's up to you or your management tools. The namenode knows that the
    datanode is unreachable, but doesn't know how to go about reconnecting it
    to the network.
  • David Wei at Oct 29, 2008 at 3:08 am
    I think using crontab would be a good solution. Just use a test script
    that checks the processes are alive and restarts them when they are down.



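    A minimal sketch of the kind of per-slave cron check suggested above,
    assuming "jps" is on the PATH (it ships with the JDK) and that
    HADOOP_HOME points at the Hadoop install; the paths are assumptions to
    adjust for your layout:

        #!/usr/bin/env python
        # Local watchdog for a slave: restart the datanode if its JVM is gone.
        import os
        import subprocess

        HADOOP_HOME = os.environ.get("HADOOP_HOME", "/usr/local/hadoop")  # assumed default
        DAEMON = os.path.join(HADOOP_HOME, "bin", "hadoop-daemon.sh")

        def datanode_running():
            # jps lists local JVMs by main class, the same check Will did by hand.
            jps = subprocess.run(["jps"], capture_output=True, text=True).stdout
            return any(line.endswith("DataNode") for line in jps.splitlines())

        if __name__ == "__main__":
            if not datanode_running():
                # hadoop-daemon.sh is the same per-node start script that the
                # normal start-dfs.sh machinery runs on each slave.
                subprocess.run([DAEMON, "start", "datanode"], check=False)

    Run it from the slave's crontab every minute or so, and remember to
    disable it when you deliberately take the datanode down.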
  • Edward Capriolo at Oct 29, 2008 at 12:32 pm
    Someone on the list is looking at monitoring Hadoop with Nagios. Nagios
    can be configured with an event_handler, and in the past I have written
    event handlers to do operations like this: if a node is down, use an SSH
    key and restart it.

    However, since you already have an SSH key on your master node, you
    should be able to have a centralized node restarter running from the
    master's cron. Maybe that's an interesting argument for running a
    separate Nagios instance as your hadoop user!

    In any case you can also run a cron job on each slave as suggested above.

    The thing about all systems like this is that you have to remember to
    shut them down when you actually want the service down for servicing, etc.

    We run Nagios and Cacti, so I would like to develop check scripts for
    these services. I am going to get an SVN repo together; if anyone is
    interested in contributing, let me know.
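    A sketch of the centralized restarter described above, run from the
    master's cron and reusing the passwordless SSH key that the standard
    start/stop scripts already rely on. The slaves file location and remote
    paths are assumptions; adjust them for your installation:

        #!/usr/bin/env python
        # Centralized restarter: ssh to each slave, restart a dead datanode.
        import os
        import subprocess

        HADOOP_HOME = os.environ.get("HADOOP_HOME", "/usr/local/hadoop")  # assumed default
        SLAVES_FILE = os.path.join(HADOOP_HOME, "conf", "slaves")
        # Only restart if no DataNode JVM shows up in jps on the slave.
        REMOTE_CMD = ("jps | grep -q DataNode || "
                      "%s/bin/hadoop-daemon.sh start datanode" % HADOOP_HOME)

        def restart_dead_datanodes():
            with open(SLAVES_FILE) as f:
                slaves = [line.strip() for line in f
                          if line.strip() and not line.startswith("#")]
            for host in slaves:
                # BatchMode stops ssh from hanging on a password prompt.
                subprocess.run(["ssh", "-o", "BatchMode=yes", host, REMOTE_CMD],
                               check=False)

        if __name__ == "__main__":
            restart_dead_datanodes()

    As noted above, remember to disable it when you deliberately take nodes
    down for maintenance.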
  • Steve Loughran at Oct 29, 2008 at 2:41 pm

    Norbert Burger wrote:
    Along these lines, I'm curious what "management tools" folks are using to
    ensure cluster availability (i.e., auto-restart failed datanodes/namenodes).

    Are you using a custom cron script, or maybe something more complex
    (Ganglia, Nagios, puppet, etc.)?
    We use SmartFrog, http://smartfrog.org/, to do this kind of thing, not
    just because it comes from our organisation, but because it gives us the
    ability to manage other parts of the system at the same time.

    To get SF deploying Hadoop in a way I'm happy with, I have had to make a
    fair few changes to the lifecycle of the "services": things like
    namenode, datanode, jobtracker and tasktracker. Most of the changes are
    in HADOOP-3628, though I need to push through another iteration of this
    [1]. Even with the changes I'm worried about race conditions and
    shutdown, as the existing code assumes that every node starts in its own
    process, which is what I recommend for production. We gave a talk on this
    topic in August at the Hadoop UK event [2].

    None of this stuff is in a public release yet, but I may cut one next
    week which includes an unsupported 0.20-alpha-patched version of Hadoop
    in an RPM. This RPM can be pushed out to the machines through your RPM
    publish mechanism of choice; when the SmartFrog daemon comes up it
    deploys whatever it has been told to, or it announces to the world that
    it is unpurposed and gets told what to deploy by something it trusts.

    Failure handling is still interesting. With a language like SmartFrog you
    can declare how failures can be handled; we have various workflowy
    containers to do things like
    -retry and restart
    -kill and report upwards (default)
    -roll back the whole deployment and restart
    For things like tasktrackers, such a loss is best handled by killing and
    restarting. But the filesystem is much more temperamental, and it is FS
    and HDD failures that create the most stress in any project. That, and
    the accidental deletion of the entire dataset. A node in the cluster that
    is only a tasktracker is disposable: if there are any problems you may as
    well flip the power switch and have the PXE reboot bring it back to a
    blank state. Datanode failures, though, are an issue. If the data on the
    node is replicated in >1 place, I'd decommission the node and do the same
    thing. If the data isn't adequately replicated yet, you want to get the
    stuff off it first. And if you think it's a physical HDD problem, it's
    time to stop using that particular disk.
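    As a purely schematic rendering of that decision logic (not SmartFrog;
    all of the names here are illustrative):

        def handle_node_failure(role, data_fully_replicated=True,
                                hdd_suspect=False):
            # Illustrative sketch of the branches described above.
            if role == "tasktracker":
                # Disposable: power-cycle and let the PXE reboot reimage it.
                return "reimage"
            if role == "datanode":
                if not data_fully_replicated:
                    # Under-replicated data: copy it off before anything else.
                    return "drain data, then decommission"
                if hdd_suspect:
                    return "decommission and retire the disk"
                return "decommission and reimage"
            # Default policy from the list above.
            return "kill and report upwards"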

    I think everyone is still learning the main failure modes of a cluster,
    and still deciding how to react.

    [1] https://issues.apache.org/jira/browse/HADOOP-3628
    [2]
    http://people.apache.org/~stevel/slides/deploying_hadoop_with_smartfrog.pdf


    --
    Steve Loughran http://www.1060.org/blogxter/publish/5
    Author: Ant in Action http://antbook.org/
