Hello Romit,

Can you tell me how you fixed the Failover Controller issue?
I am having the same issue you experienced with High Availability. I see
the error below in the Failover Controller log.

6:42:12.449 PM INFO org.apache.hadoop.ipc.Client

Retrying connect to server: vm-F0CD-5B46.nam.nsroot.net/10.49.216.121:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1 SECONDS)

6:42:12.450 PM WARN org.apache.hadoop.ha.HealthMonitor

Transport-level exception trying to monitor health of NameNode at vm-F0CD-5B46.nam.nsroot.net/10.49.216.121:8020: Call From vm-F0CD-5B46.nam.nsroot.net/10.49.216.121 to vm-F0CD-5B46.nam.nsroot.net:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused


Thanks
Madhu
On Thursday, October 18, 2012 10:48:25 PM UTC-5, Romit Singhai wrote:

Hi Aaron,

The issue was with the failover controller. I was able to figure it out,
and it seems to be working.

Thanks,
Romit

On Thu, Oct 18, 2012 at 6:00 PM, Aaron T. Myers <a...@cloudera.com> wrote:
Hi Romit,

That's pretty surprising. My guess would be that you have some hostnames
misconfigured. Perhaps you could send the relevant portion of your
hdfs-site.xml?
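In particular, the nameservice and NameNode RPC address properties, as a
rough sketch (the nameservice name and hostnames below are only
placeholders; 8020 is the default NameNode RPC port):

<!-- hdfs-site.xml (illustrative values only) -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>nn1-host.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>nn2-host.example.com:8020</value>
</property>

If the hostnames in those values don't resolve consistently on every node,
the failover machinery will have trouble reaching the NameNodes.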


--
Aaron T. Myers
Software Engineer, Cloudera



On Thu, Oct 18, 2012 at 2:11 PM, Romit Singhai <rom...@gmail.com> wrote:
Hi Aaron,

Thanks for your response. I configured the QJM and have two NameNodes,
namenode738 and namenode739, configured on the cluster.

I have currently configured shell(/bin/true) as a fallback fencing method.
All services are up and running, and the cluster is currently working.
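In hdfs-site.xml that looks roughly like the following (shell(/bin/true)
always succeeds, so it sits at the end of the method list as a last-resort
fallback; any other methods I list would come before it):

<!-- hdfs-site.xml (sketch of the fencing entry) -->
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>shell(/bin/true)</value>
</property>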

Now, to test the failover, I used the following command:

sudo -u hdfs hdfs haadmin -failover namenode738 namenode739

I receive the following error:

sudo -u hdfs hdfs haadmin -failover namenode738 namenode739
12/10/18 16:03:38 INFO ipc.Client: Retrying connect to server:
abc-snn-10ge/172.16.10.1:8019. Already tried 0 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1 SECONDS)
Operation failed: Call From abc-nn-bond0/172.16.10.2 to
abc-snn-10ge:8019 failed on connection exception:
java.net.ConnectException: Connection refused; For more details see:
http://wiki.apache.org/hadoop/ConnectionRefused
[root@abc-nn-bond0 hadoop-hdfs]#


Thanks in advance!

Romit.

On Thu, Oct 18, 2012 at 11:52 AM, Aaron T. Myers <a...@cloudera.com> wrote:
Hi Romit,

QJM is supported as of 4.1.0, but if you're on 4.1.0 I strongly
recommend you upgrade to 4.1.1 promptly, as it addresses a serious HDFS
performance regression present in 4.1.0.

Regarding updated docs, yes, the HA guide was updated to include
instructions for deploying the QJM. See the following link for the HA
guide, and note the two forks in the guide which discuss QJM vs. NFS:

https://ccp.cloudera.com/display/CDH4DOC/CDH4+High+Availability+Guide
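The QJM-specific piece of that configuration is pointing the shared edits
directory at a quorum of JournalNodes, roughly like this (hostnames and the
nameservice name are placeholders; 8485 is the default JournalNode port):

<!-- hdfs-site.xml (illustrative) -->
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster</value>
</property>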

Regarding CM - no, CM is not required to deploy the QJM, though CM 4.1
will make this much easier.

As for the fencing configuration when using QJM, see this section of
the guide, which discusses this subject in detail:


https://ccp.cloudera.com/display/CDH4DOC/Software+Configuration+for+Quorum-based+Storage#SoftwareConfigurationforQuorum-basedStorage-FencingConfiguration

I hope this helps,
Aaron

--
Aaron T. Myers
Software Engineer, Cloudera



On Thu, Oct 18, 2012 at 11:46 AM, Romit Singhai <rom...@gmail.com> wrote:
Hi Todd,

Is QJM only supported from 4.1.1 onwards, given that we already have 4.1?

Is there a modified HA guide published for CDH 4.1.1, and is CM required
to be installed for QJM?

Also, to use QJM, do we need to specify QJM as the fencing option in the
configuration files, the way sshfence is specified?

Thanks,
Romit

On Thu, Oct 18, 2012 at 10:56 AM, Todd Lipcon <to...@cloudera.com> wrote:
On Thu, Oct 18, 2012 at 10:34 AM, Romit Singhai <rom...@gmail.com> wrote:
Hi Todd,

I did not have this setting configured, so I believe it defaults to 5
seconds. I see an error related to the fencing method in my log file,
stating that the fencing method is unable to fence the NameNode.

I am using sshfence as the fencing method, and both NameNodes are running
as user hdfs. The nodes can do passwordless SSH to each other using the
root user, which I have specified in the config file for the fencing
method.
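Roughly, the relevant entries in my hdfs-site.xml look like this (the key
path shown is just an example):

<!-- hdfs-site.xml (sketch; actual paths may differ) -->
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence(root)</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/root/.ssh/id_rsa</value>
</property>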

Am I missing something?

Apparently something is misconfigured with the sshfence setup, then,
if it fails to fence. Configuring sshfence properly can be a bit of a pain
- it's one of the reasons that we built QuorumJournalManager for CDH4.1.

You could mess around with the config to get sshfence to work, but
instead I'd recommend just upgrading to 4.1.1 and using QJM.

Thanks
-Todd


On Thursday, October 18, 2012 9:55:56 AM UTC-7, Todd Lipcon wrote:

Hi Romit,

How long are you waiting to allow the failover to happen? Is there
anything in the ZKFC logs on the failed node?

-Todd

On Thu, Oct 18, 2012 at 9:53 AM, Romit Singhai <rom...@gmail.com> wrote:
Hello Experts,

I have installed and configured CDH4.1, and all the services seem to be
running fine. I also validated the MR functionality by running a small job
on the cluster.

While testing HA with automatic failover, the standby NameNode does not
transition to the active state when the currently active NameNode is
stopped. It continues to remain in the standby state, and hence the jobs
running on the cluster fail.

On restarting the stopped node, it starts in the standby state,
transitioning the current standby node to active.
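For reference, the automatic-failover settings I have in place are roughly
the following (the ZooKeeper hostnames are placeholders):

<!-- hdfs-site.xml -->
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>

<!-- core-site.xml -->
<property>
  <name>ha.zookeeper.quorum</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>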

Any suggestions/insights into why the transition is not happening
automatically?

Thanks,
Romit

--
Todd Lipcon
Software Engineer, Cloudera