I have been working on setting up a Hadoop on Demand cluster on
three machines and have run into a bit of a snag. I went through the
admin and user guides and have successfully installed torque and HOD.
When I run "hod allocate", it successfully starts hodring on two of the
machines but not the third. The result is that I have a working
Namenode and Jobtracker (though its UI does not seem to work
presently) but no slave nodes.
Even with debug level 4 set in all sections, nothing indicates a
failure: the ringmaster has no problem communicating with the
running hodring jobs, and pbsdsh returns without error. I can find no
logs on any of the machines indicating a torque issue (though I admit
I am not terribly familiar with torque) and no logs at all for HOD on
the machine that is not running hodring.
It would appear that the pbsdsh job simply isn't starting hodring
on the one node, given the lack of any HOD log on that machine. Either
the node is not being recognized (which seems somewhat unlikely, as it
shows up in pbsnodes as free) or there is a relatively silent failure
somewhere. If you have any suggestions I would much appreciate them.
One quick side note: I have successfully run a standard
hadoop-0.20.0 cluster on these three machines with no difficulty,
which should rule out connection, ssh or firewall issues.
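
To make the "recognized but not running hodring" case concrete, here is
a rough sketch of the comparison I have in mind. It is only an
illustration: the host names (node1, node2, node3) and the sample text
are made up, the helper names are my own, and I am assuming the usual
"pbsnodes -a" layout of a node name on its own line followed by
indented "key = value" lines. The set of unusable states is likewise an
assumption and may need adjusting.

```python
# Sketch: list nodes that torque reports as usable but where no hodring
# was seen. All host names and the sample output are hypothetical.

# Assumed torque states that mean a node cannot take work.
UNUSABLE_STATES = {"down", "offline", "state-unknown"}

# Made-up text in the layout "pbsnodes -a" is assumed to print.
SAMPLE_PBSNODES = """\
node1
     state = free
     np = 2
     ntype = cluster

node2
     state = job-exclusive
     np = 2
     ntype = cluster

node3
     state = free
     np = 2
     ntype = cluster
"""


def parse_pbsnodes(text):
    """Return {node_name: state} parsed from pbsnodes-style text."""
    states = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current node's block.
            current = None
            continue
        if not line[0].isspace():
            # An unindented line starts a new node block.
            current = line.strip()
            states[current] = None
        elif current is not None and "=" in line:
            key, _, value = line.partition("=")
            if key.strip() == "state":
                states[current] = value.strip()
    return states


def nodes_without_hodring(states, hodring_hosts):
    """Return usable nodes (per torque) that are not running hodring."""
    missing = []
    for node, state in states.items():
        # A state may be a comma-separated list such as "down,offline".
        parts = set((state or "").split(","))
        if parts & UNUSABLE_STATES:
            continue
        if node not in hodring_hosts:
            missing.append(node)
    return sorted(missing)


if __name__ == "__main__":
    states = parse_pbsnodes(SAMPLE_PBSNODES)
    # Hosts where a hodring process was actually observed (hypothetical).
    print(nodes_without_hodring(states, {"node1", "node2"}))
```

In practice the sample text would be replaced with the real pbsnodes
output, and the hodring host set with whatever the ringmaster log or a
process listing on each machine shows.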