Norbert Burger wrote:
Along these lines, I'm curious what "management tools" folks are using to
ensure cluster availability (i.e., auto-restarting failed datanodes/namenodes).
Are you using a custom cron script, or maybe something more complex
(Ganglia, Nagios, puppet, etc.)?
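The custom-cron-script approach Norbert mentions can be as small as a watchdog that checks `jps` output once a minute. A minimal sketch, assuming the role names appear in `jps` output and the stock hadoop-daemon.sh start script; the function and file names here are illustrative, not from the thread:

```shell
#!/bin/sh
# Minimal cron watchdog sketch. A crontab entry such as
#   * * * * * /usr/local/bin/hadoop-watchdog.sh
# would run it once a minute.

# Decide whether a role needs restarting, given a jps-style listing.
# The listing is a parameter so the logic is testable without a JVM;
# in the cron job you would pass "$(jps)".
needs_restart() {
  role="$1"; listing="$2"
  if printf '%s\n' "$listing" | grep -q "$role"; then
    echo no     # role is running
  else
    echo yes    # role is missing
  fi
}

# In the real script:
#   if [ "$(needs_restart DataNode "$(jps)")" = yes ]; then
#     "$HADOOP_HOME/bin/hadoop-daemon.sh" start datanode
#   fi
```

This only restarts the process; as the rest of the thread argues, blind restarting is fine for tasktrackers but riskier for datanodes.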
We use SmartFrog (http://smartfrog.org/) to do this kind of thing, not
just because it comes from our organisation, but because it gives us the
ability to manage other parts of the system at the same time.
To get SF deploying Hadoop in a way I'm happy with, I have had to make a
fair few changes to the lifecycle of the "services": things like the
namenode, datanode, jobtracker and tasktracker. Most of the changes are
in HADOOP-3628, though I need to push through another iteration of this.
Even with the changes I'm worried about race conditions and shutdown, as
the existing code assumes that every node starts in its own process,
which is what I recommend for production. We gave a talk on this topic
in August at the Hadoop UK event.
None of this stuff is in a public release yet, but I may cut one next
week which includes an unsupported 0.20-alpha-patched version of Hadoop
in an RPM. This RPM can be pushed out to the machines through your RPM
publish mechanism of choice; when the SmartFrog daemon comes up it
deploys whatever it has been told to, or it announces to the world that
it is unpurposed and waits to be told what to deploy by someone it
trusts.
Failure handling is still interesting. With a language like SmartFrog
you can declare how failures should be handled; we have various
workflowy containers to do things like:
- retry and restart
- kill and report upwards (default)
- roll back the whole deployment and restart
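An illustrative shell analogue of the first two policies, "retry and restart" falling back to "kill and report upwards"; SmartFrog declares this in its own component language, so this is a sketch of the behaviour, not SmartFrog syntax:

```shell
#!/bin/sh
# Retry a command up to $max times; on exhaustion, report the failure
# upwards by returning non-zero so the caller can escalate.
retry() {
  max="$1"; shift
  n=0
  until "$@"; do
    n=$((n + 1))
    if [ "$n" -ge "$max" ]; then
      # "kill and report upwards": give up and propagate the failure
      echo "giving up after $n attempts" >&2
      return 1
    fi
    echo "attempt $n failed, retrying" >&2
  done
}

retry 3 true && echo "service up"
```

The "roll back the whole deployment" policy is the interesting one: it needs the declarative model, since a script has no record of what was deployed to undo.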
For things like tasktrackers, such loss is best handled by killing and
restarting. But the filesystem is much more temperamental, and it is FS
and HDD failures that create the most stress in any project. That and
accidental deletions of the entire dataset. A node in the cluster that
is only a tasktracker is disposable: if there are any problems, you may
as well flip the power switch and let the PXE reboot bring it back to a
blank state. Datanode failures, though, are an issue. If the data on the
node is replicated in more than one place, I'd decommission the node and
do the same thing. If the data isn't adequately replicated yet, you want
to get the data off it first. And if you think it's a physical HDD
problem, it's time to stop using that particular disk.
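The decommission step in the 0.18/0.20-era CLI works through an excludes file plus `hadoop dfsadmin -refreshNodes`. A hedged sketch, with illustrative paths, assuming hdfs-site.xml already points dfs.hosts.exclude at the excludes file:

```shell
#!/bin/sh
# Sketch of decommissioning a datanode. Assumes hdfs-site.xml sets
# dfs.hosts.exclude to the excludes file passed in (paths illustrative).
decommission() {
  node="$1"; excludes="$2"
  echo "$node" >> "$excludes"     # namenode stops assigning blocks here
  hadoop dfsadmin -refreshNodes   # and starts re-replicating off the node
}

# Usage: decommission slave3.example.com /etc/hadoop/excludes
# Watch the namenode web UI until the node shows as decommissioned;
# only then is it safe to power the machine off or reimage it.
```

This drains the blocks gracefully, which is exactly the "get the data off it first" case above; for an adequately replicated node you can skip the wait.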
I think everyone is still learning the main failure modes of a cluster,
and still deciding how to react.
On 10/28/08, Steve Loughran wrote:
I've been working through Michael Noll's multi-node cluster setup
example (Running_Hadoop_On_Ubuntu_Linux) for Hadoop and I have a working
setup. I then killed the datanode process on my slave machine in an
effort to simulate some sort of failure on the slave machine's datanode.
I had assumed that the namenode would have been polling its datanodes
and would thus attempt to bring back up any node that goes down. On
looking at my slave machine it seems that the datanode process is still
down (I've checked with jps).
That's up to you or your management tools. The namenode knows that the
datanode is unreachable, but doesn't know how to go about reconnecting it to
the network. Which, given there are many causes of "down", sort of makes
sense. The switch failing, the HDDs dying, or the process crashing all
look the same: no datanode heartbeats.
Steve Loughran http://www.1060.org/blogxter/publish/5
Author: Ant in Action http://antbook.org/