On Sat, Oct 3, 2009 at 1:07 AM, Otis Gospodnetic wrote:
Related (but not helping the immediate question): China Telecom developed something they call HyperDFS. They modified Hadoop and made it possible to run a cluster of NNs, thus eliminating the SPOF.
I don't have the details; the presenter at Hadoop World (last round of sessions, 2nd floor) mentioned it, but didn't give a clear answer when asked about contributing it back.
Otis
--
Sematext is hiring --
http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
----- Original Message -----
From: Steve Loughran <stevel@apache.org>
To: common-user@hadoop.apache.org
Sent: Friday, October 2, 2009 7:22:45 AM
Subject: Re: NameNode high availability
Stas Oskin wrote:
Hi.
The HA service (heartbeat) is running on Dom0, and when the primary
node is down, it basically just starts the VM on the other node. So
there are not supposed to be any timing issues.
Can you explain a bit more about your approach, how to automate it for
example?
* You need to have something, a "resource manager", keeping an eye on the NN
from somewhere (see the sketch below). Needless to say, that needs to be fairly HA too.
* Your NN image has to be ready to go.
* When the deployed NN goes away, bring up a new machine with the same image,
hostname *and IP address*. You can't always pull the latter off; it depends on
the infrastructure. Without that, you'd need to bring up all the nodes with DNS
caching set to a short time and update a DNS entry.
This isn't real HA, it's recovery.
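The minimal shape of that watchdog looks something like the Python sketch
below. It is only an illustration, not a real resource manager: the NN
endpoint and the start-standby-nn.sh bring-up script are made-up names, and
a production setup also needs fencing, quorum, and a watchdog for the
watchdog.

#!/usr/bin/env python
# Minimal NameNode watchdog sketch. Polls the NN's RPC port and, after
# several consecutive failures, invokes a (hypothetical) script that
# brings up the standby image.
import socket
import subprocess
import time

NN_HOST, NN_PORT = "namenode.example.com", 8020    # assumed NN RPC endpoint
FAILOVER_CMD = ["/usr/local/bin/start-standby-nn.sh"]  # hypothetical script
FAILURES_BEFORE_FAILOVER = 3
POLL_SECONDS = 10

def nn_is_up(host, port, timeout=5):
    """True if something accepts a TCP connection on the NN port."""
    try:
        sock = socket.create_connection((host, port), timeout)
        sock.close()
        return True
    except (socket.error, socket.timeout):
        return False

def main():
    failures = 0
    while True:
        if nn_is_up(NN_HOST, NN_PORT):
            failures = 0
        else:
            failures += 1
            print("NN check failed (%d/%d)" % (failures, FAILURES_BEFORE_FAILOVER))
            if failures >= FAILURES_BEFORE_FAILOVER:
                subprocess.call(FAILOVER_CMD)
                break  # hand over; do not keep firing failovers
        time.sleep(POLL_SECONDS)

if __name__ == "__main__":
    main()

Linux-HA/heartbeat is effectively running this loop for you, plus the hard
parts (quorum, fencing) that the sketch skips.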
Stas,
I think your setup does work, but there are some inherent complexities
that are not accounted for. For instance, NameNode metadata is not
written to disk in a transactional fashion. Thus, even though you have
block-level replication, you cannot be sure that the underlying
namenode data is in a consistent state. (It should not be an issue
with live-migrate, though.)
I have worked with Linux-HA, DRBD, OCFS2 and many other HA
technologies, so let me summarize my experiences. The concept: a
normal standalone system has components that fail, but the mean time
to failure is high, say 99.9% availability. Normally disk components
fail most often; with solid state or a RAID1, say 99.99%. If you look
really hard at what the namenode does, a massive Hadoop file system
may be only a few hundred MB to a few GB of NameNode data. Assuming
you have a hot spare, restoring a few GB of data would not take very
long. (Large volumes of data are tricky because they take longer to
restore.)
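To put rough numbers on those nines (back-of-the-envelope, with an assumed
100 MB/s restore rate):

# Back-of-the-envelope: availability figures as downtime per year,
# and how long restoring a few GB of NN metadata takes.
HOURS_PER_YEAR = 365 * 24  # 8760

for availability in (0.999, 0.9999):
    downtime_h = HOURS_PER_YEAR * (1 - availability)
    print("%.4f availability -> %.1f hours down/year" % (availability, downtime_h))

# Restoring the metadata: assume 2 GB at ~100 MB/s (assumed disk/net rate)
size_mb, rate_mb_s = 2 * 1024, 100.0
print("2 GB restore at 100 MB/s: ~%.0f seconds" % (size_mb / rate_mb_s))

That is roughly 8.8 hours of downtime a year at 99.9%, under an hour at
99.99%, and about 20 seconds to copy a 2 GB image back.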
If Xen LiveMigrate works, it is a slam dunk and very cool! You might
not even miss a ping. But let's say you're NOT doing a live migration:
down the Xen instance and bring it up on the other node. The failover
might take 15 seconds, but it may work more reliably.
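The non-live failover really is just "down here, up there". Roughly, as an
untested Python sketch (hostnames and the domain config path are made up,
and real fencing should be IPMI or a power strip, not ssh):

# Sketch of a cold Xen failover: kill the domU on the failed node if it
# still answers, then create it on the standby from replicated storage.
import subprocess

DOMAIN = "namenode-vm"
OLD_NODE = "xen-host-a"                 # assumed hostnames
NEW_NODE = "xen-host-b"
DOM_CFG = "/etc/xen/namenode-vm.cfg"    # assumed domain config

# Best effort: make sure the old instance is really gone before starting
# a second copy against the same storage.
subprocess.call(["ssh", OLD_NODE, "xm", "destroy", DOMAIN])

# Bring the VM up on the standby node from the replicated image.
rc = subprocess.call(["ssh", NEW_NODE, "xm", "create", DOM_CFG])
raise SystemExit(rc)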
From experience, I will tell you that on a different project I went
way cutting edge: DRBD, Linux-HA, and OCFS2 for mounting the same file
system simultaneously on two nodes. It worked OK, but I sank weeks of
research into it and it was very complex: OCFS2 conf files, HA conf
files, DRBD conf files, kernel modules, and no matter how much
documentation I wrote, no one could follow it but me.
My lesson learned was that I really did not need that sub-second
failover, nor did I need to be able to dual-mount the drive. With
those things stripped out I had better results. So you might want to
consider: do you need live-migrate? Do you need Xen?
This doc tries using fewer moving parts:
http://www.cloudera.com/blog/2009/07/22/hadoop-ha-configuration/
ContextWeb presented at Hadoop World NYC; they mentioned that with
this setup they had 6 failures, 3 planned and 3 unplanned, and it
worked each time.
FYI, I did imply earlier that the DRBD/Xen approach should work. Sorry
for being misleading. I find people have mixed results with some of
these HA tools: I have them working in some cases, and then in other
edge cases they do not. Your mileage may vary.