I'm trying to implement NameNode failover (or at least a backup of the
NameNode's local data), but it is hard because there is no official
documentation. Pages on this subject have been created, but they are
still empty:
I have been browsing the web and the Hadoop mailing list to see how this
should be implemented, but I only got more confused. People are asking
questions such as whether we even need the SecondaryNameNode, since the
NameNode can write its local data to multiple locations and one of those
locations can be a disk mounted from another machine. I think I
understand the motivation for the SecondaryNameNode (to create a
snapshot of the NameNode data every n seconds/hours), but setting up
(deploying and running) the SecondaryNameNode on a different machine
than the NameNode is not as trivial as I expected. First, I found that
to run the SecondaryNameNode on a machine other than the NameNode, I
should change the masters file on the NameNode (replace localhost with
the SecondaryNameNode host) and set some properties in hadoop-site.xml
on the SecondaryNameNode (fs.default.name, fs.checkpoint.dir, ...).
This was enough to start the SecondaryNameNode when starting the
NameNode with bin/start-dfs.sh, but it didn't create an image on the
SecondaryNameNode. Then I found that I need to set dfs.http.address to
the NameNode address (so now I have the NameNode address in both
fs.default.name and dfs.http.address).
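The SecondaryNameNode settings described above can be sketched as a small Python script that writes the hadoop-site.xml properties. This is a minimal sketch, not a verified configuration: the hostname namenode.example.com, the RPC port 9000 and the checkpoint path are hypothetical placeholders, and 50070 is assumed to be the NameNode's web (HTTP) port, which is the usual default. The one thing it encodes deliberately is that dfs.http.address should name the NameNode's HTTP server (the SecondaryNameNode downloads fsimage and edits over HTTP), so reusing the RPC port from fs.default.name there is worth checking as a possible cause of an error like the one below.

```python
import xml.etree.ElementTree as ET


def secondary_site_xml(namenode_host, rpc_port, http_port, checkpoint_dir):
    """Return a <configuration> element (as text) for a SecondaryNameNode
    running on a different host than the NameNode.

    fs.default.name   -> NameNode RPC address (hdfs://host:port)
    dfs.http.address  -> NameNode web address, used to fetch fsimage/edits
    fs.checkpoint.dir -> local directory where checkpoints are stored
    """
    if rpc_port == http_port:
        # Same host and same port would mean the HTTP client is talking to
        # the RPC server, not to the NameNode's web server.
        raise ValueError("dfs.http.address must use the NameNode HTTP port, "
                         "not the RPC port from fs.default.name")
    props = {
        "fs.default.name": f"hdfs://{namenode_host}:{rpc_port}",
        "dfs.http.address": f"{namenode_host}:{http_port}",
        "fs.checkpoint.dir": checkpoint_dir,
    }
    conf = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(conf, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(conf, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical host, ports and path; substitute your own.
    print(secondary_site_xml("namenode.example.com", 9000, 50070,
                             "/hadoop/dfs/namesecondary"))
```

The generated text is only the configuration element; a real hadoop-site.xml usually also carries an XML declaration at the top.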
Now I get the following exception:
2008-10-28 09:18:00,098 ERROR NameNode.Secondary - Exception in
2008-10-28 09:18:00,098 ERROR NameNode.Secondary -
java.net.SocketException: Unexpected end of file from server
My questions are the following:
How do I resolve this problem (this exception)?
Do I need an additional property in the SecondaryNameNode's hadoop-site.xml or
How should NameNode failover ideally work? Is it like this:
The SecondaryNameNode runs on a separate machine from the NameNode and
stores the NameNode's data (fsimage and edits) locally in
fs.checkpoint.dir. When the NameNode machine crashes, we start the
NameNode on the machine where the SecondaryNameNode was running and set
dfs.name.dir to fs.checkpoint.dir. We also need to change how DNS
resolves the NameNode hostname (from the primary to the secondary).
Is this correct?
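The proposed failover amounts to a configuration change on the former SecondaryNameNode machine. A minimal Python sketch of that change, treating the configuration as a plain dict (a simplification; promote_config is a hypothetical helper, not part of Hadoop): it copies fs.checkpoint.dir into dfs.name.dir and leaves fs.default.name untouched, on the assumption that DNS now points the old NameNode hostname at this machine. It does not establish that the checkpoint directory is usable as a name directory as-is, and any namespace changes made after the last checkpoint would not be in it.

```python
def promote_config(secondary_conf):
    """Derive the config for a NameNode started on the former
    SecondaryNameNode machine (hypothetical helper, illustration only).

    The checkpoint directory becomes the name directory; fs.default.name
    is kept because clients are assumed to follow the DNS change.
    """
    if "fs.checkpoint.dir" not in secondary_conf:
        raise KeyError("fs.checkpoint.dir is not set on the secondary")
    conf = dict(secondary_conf)  # copy so the input dict is left unchanged
    conf["dfs.name.dir"] = conf["fs.checkpoint.dir"]
    return conf


if __name__ == "__main__":
    secondary = {
        "fs.default.name": "hdfs://namenode.example.com:9000",  # hypothetical
        "fs.checkpoint.dir": "/hadoop/dfs/namesecondary",        # hypothetical
    }
    print(promote_config(secondary)["dfs.name.dir"])
```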