Can you tell me how you fixed the Failover Controller issue?
I am having the same issue that you experienced with High Availability. I
see the error below in the Failover Controller log.
6:42:12.449 PM INFO org.apache.hadoop.ipc.Client
Retrying connect to server: vm-F0CD-5B46.nam.nsroot.net/10.49.216.121:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1 SECONDS)
6:42:12.450 PM WARN org.apache.hadoop.ha.HealthMonitor
Transport-level exception trying to monitor health of NameNode at vm-F0CD-5B46.nam.nsroot.net/10.49.216.121:8020: Call From vm-F0CD-5B46.nam.nsroot.net/10.49.216.121 to vm-F0CD-5B46.nam.nsroot.net:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
Thanks
Madhu
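A "Connection refused" in the HealthMonitor line generally means nothing was listening on that address and port when the call was made, for example because the NameNode process is down or bound to a different interface. Below is a minimal sketch for checking this from the Failover Controller host, assuming a plain TCP probe is a useful first test. The function name `can_connect` is made up for illustration; it is not part of Hadoop.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the hostname and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ConnectionRefusedError, timeouts, and DNS resolution failures
        return False
```

For the log above, the call would be `can_connect("vm-F0CD-5B46.nam.nsroot.net", 8020)`. A `False` result only says the port is unreachable from this host; it does not say why.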
On Thursday, October 18, 2012 10:48:25 PM UTC-5, Romit Singhai wrote:
Hi Aaron,
The issue was with the failover controller. I was able to figure it out,
and it seems to be working now.
Thanks,
Romit
On Thu, Oct 18, 2012 at 6:00 PM, Aaron T. Myers <a...@cloudera.com> wrote:
Hi Romit,
That's pretty surprising. My guess would be that you have some hostnames
misconfigured. Perhaps you could send the relevant portion of your
hdfs-site.xml?
--
Aaron T. Myers
Software Engineer, Cloudera
On Thu, Oct 18, 2012 at 2:11 PM, Romit Singhai <rom...@gmail.com> wrote:
Hi Aaron,
Thanks for your response. I configured the QJM and have two NameNodes,
i.e., namenode738 and namenode739, configured on the cluster.
I have currently configured shell(/bin/true) in the fencing method as a
fallback option. All services are up and running, and the cluster is currently
working.
To test the failover, I used the following command:
sudo -u hdfs hdfs haadmin -failover namenode738 namenode739
I receive the following error:
sudo -u hdfs hdfs haadmin -failover namenode738 namenode739
12/10/18 16:03:38 INFO ipc.Client: Retrying connect to server:
abc-snn-10ge/172.16.10.1:8019. Already tried 0 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1 SECONDS)
Operation failed: Call From abc-nn-bond0/172.16.10.2 to
abc-snn-10ge:8019 failed on connection exception:
java.net.ConnectException: Connection refused; For more details see:
http://wiki.apache.org/hadoop/ConnectionRefused
[root@abc-nn-bond0 hadoop-hdfs]#
Thanks in advance!
Romit.
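The refused connection in this output targets port 8019, which, as far as I know, is the default RPC port of the ZooKeeper Failover Controller (`dfs.ha.zkfc.port`), so it may be worth confirming that a ZKFC is actually running on that host. A small sketch for pulling the target host, IP, and port out of such a retry message is shown below; the regular expression and the name `retry_target` are illustrative assumptions, not anything Hadoop provides.

```python
import re

# Matches the target in Hadoop IPC client retry messages such as
# "Retrying connect to server: host/ip:port."
RETRY_TARGET = re.compile(
    r"Retrying connect to server:\s*(?P<host>[^/\s]+)/(?P<ip>[\d.]+):(?P<port>\d+)"
)


def retry_target(log_text):
    """Return (host, ip, port) of the first retry target, or None."""
    # The mail client wrapped the log line, so collapse whitespace first
    match = RETRY_TARGET.search(" ".join(log_text.split()))
    if match is None:
        return None
    return match.group("host"), match.group("ip"), int(match.group("port"))
```

Applied to the output above, this yields the host `abc-snn-10ge`, the IP `172.16.10.1`, and the port `8019`.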
On Thu, Oct 18, 2012 at 11:52 AM, Aaron T. Myers <a...@cloudera.com> wrote:
Hi Romit,
QJM is supported as of 4.1.0, but if you're on 4.1.0 I strongly
recommend you upgrade to 4.1.1 promptly, as it addresses a serious HDFS
performance regression present in 4.1.0.
Regarding updated docs, yes, the HA guide was updated to include
instructions for deploying the QJM. See the following link for the HA
guide, and note the two forks in the guide which discuss QJM vs. NFS:
https://ccp.cloudera.com/display/CDH4DOC/CDH4+High+Availability+Guide
Regarding CM - no, CM is not required to deploy the QJM, though CM 4.1
will make this much easier.
As for the fencing configuration when using QJM, see this section of
the guide, which discusses this subject in detail:
https://ccp.cloudera.com/display/CDH4DOC/Software+Configuration+for+Quorum-based+Storage#SoftwareConfigurationforQuorum-basedStorage-FencingConfiguration
I hope this helps,
Aaron
--
Aaron T. Myers
Software Engineer, Cloudera
On Thu, Oct 18, 2012 at 11:46 AM, Romit Singhai <rom...@gmail.com> wrote:
Hi Todd,
Is QJM supported only from 4.1.1 onwards? We already have 4.1.
Is there an updated HA guide published for CDH4.1.1, and is CM
required to be installed for QJM?
Also, to use QJM, do we need to specify QJM as the fencing option in the
configuration files, like sshfence?
Thanks,
Romit
On Thu, Oct 18, 2012 at 10:56 AM, Todd Lipcon <to...@cloudera.com> wrote:
On Thu, Oct 18, 2012 at 10:34 AM, Romit Singhai <rom...@gmail.com> wrote:
Hi Todd,
I did not have this setting configured, so I believe it defaults to 5
seconds. I see an error related to the fencing method in my log file stating
that the fencing method is unable to fence the NameNode.
I am using sshfence as the fencing method, and both NameNodes are
running as user hdfs. The nodes can do passwordless SSH to each other
using the root user, which I have specified in the config file for the
fencing method.
Am I missing something?
Apparently something is misconfigured with the sshfence setup, then,
if it fails to fence. Configuring sshfence properly can be a bit of a pain
- it's one of the reasons that we built QuorumJournalManager for CDH4.1.
You could mess around with the config to get sshfence to work, but
instead I'd recommend just upgrading to 4.1.1 and using QJM.
Thanks
-Todd
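For readers who do want to keep sshfence, the two settings involved are, to my knowledge, `dfs.ha.fencing.methods` and `dfs.ha.fencing.ssh.private-key-files` in hdfs-site.xml. The sketch below generates the corresponding `<property>` elements; it is an illustration only. The `sshfence(root)` user argument, the `shell(/bin/true)` fallback, and the key path are assumed values loosely based on this thread, and the helper name `fencing_properties` is hypothetical.

```python
import xml.etree.ElementTree as ET


def fencing_properties(methods, private_key_file):
    """Build hdfs-site.xml <property> elements for HA fencing."""
    props = {
        # Methods are tried in order; one per line in the property value
        "dfs.ha.fencing.methods": "\n".join(methods),
        # Key must be readable by the user the failover controller runs as
        "dfs.ha.fencing.ssh.private-key-files": private_key_file,
    }
    elements = []
    for name, value in props.items():
        prop = ET.Element("property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
        elements.append(prop)
    return elements


# Hypothetical values: SSH in as root, then fall back to shell(/bin/true)
for element in fencing_properties(
    ["sshfence(root)", "shell(/bin/true)"], "/var/lib/hadoop-hdfs/.ssh/id_rsa"
):
    print(ET.tostring(element, encoding="unicode"))
```

One thing worth checking in a setup like the one described above: the SSH connection is made by the failover controller process, so the private key has to be readable by the user that process runs as (hdfs here), even when the remote login user is root.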
On Thursday, October 18, 2012 9:55:56 AM UTC-7, Todd Lipcon wrote:
Hi Romit,
How long are you waiting to allow the failover to happen? Is there
anything in the ZKFC logs on the failed node?
-Todd
--
Todd Lipcon
Software Engineer, Cloudera
On Thu, Oct 18, 2012 at 9:53 AM, Romit Singhai <rom...@gmail.com>
wrote:
Hello Experts,
I have installed and configured CDH4.1, and all the services seem to be
running fine. I also validated the MR functionality by running a small job
on the cluster.
While testing HA with automatic failover, the standby NameNode does not
transition to the active state when the currently active NameNode is
stopped. It remains in the standby state, and hence the jobs
running on the cluster fail.
When the stopped node is restarted, it comes up in the standby state, and
only then does the other (previously standby) node transition to active.
Any suggestions or insights into why the transition is not happening
automatically?
Thanks,
Romit
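One way to make a failover test like this repeatable is to stop the active NameNode and then poll the other one with `hdfs haadmin -getServiceState` until it reports "active" or a deadline passes. The sketch below assumes the `hdfs` CLI is on the PATH and that the command prints the state on standard output; the helper names and the timeout values are illustrative, not taken from the thread.

```python
import subprocess
import time


def get_service_state(namenode_id):
    """Ask haadmin for a NameNode's HA state ("active" or "standby")."""
    result = subprocess.run(
        ["hdfs", "haadmin", "-getServiceState", namenode_id],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def wait_until_active(namenode_id, timeout=60.0, interval=2.0,
                      get_state=get_service_state):
    """Poll until namenode_id reports "active"; return True on success."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            if get_state(namenode_id) == "active":
                return True
        except (subprocess.CalledProcessError, OSError):
            # haadmin failed or is not installed; treat as "not active yet"
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

The `get_state` parameter exists only so the polling loop can be exercised without a cluster. The NameNode ID to pass in is whichever ID is configured for the standby in hdfs-site.xml.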