I'm trying to run a large CopyTable job between clusters in completely
different datacenters, and I need to determine what network connectivity
is required.
As per the Cloudera blog post about CopyTable, I understand that the
network should be such that "MR TaskTrackers can access all the HBase and
ZK nodes in the destination cluster." So in practice that means that the
source TaskTrackers should be able to access:
* ZooKeeper on port 2181
* the Master on its RPC port (16000)
* the RegionServers on their RPC port (16020)
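
To make the list above concrete, here is a rough sketch of how I'd check those ports from a source TaskTracker node. The hostnames are placeholders I made up (not real nodes), and the ports are simply the ones listed above; this only tests that a TCP connection can be opened, not that HBase itself is healthy.

```python
import socket

# Hypothetical destination endpoints; replace with the real hostnames.
# Ports are the ones from the list above.
ENDPOINTS = [
    ("zk1.dest.example.com", 2181),       # ZooKeeper client port
    ("master.dest.example.com", 16000),   # HBase Master RPC port
    ("rs1.dest.example.com", 16020),      # RegionServer RPC port
]


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def report(endpoints):
    """Return one status line per (host, port) pair."""
    lines = []
    for host, port in endpoints:
        status = "open" if can_connect(host, port) else "UNREACHABLE"
        lines.append(f"{host}:{port} {status}")
    return lines
```

Running `print("\n".join(report(ENDPOINTS)))` on each source TaskTracker would show which destination ports are blocked from that node.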
Anything else I need to configure here? Does Hadoop on the source need to
talk directly to the destination Hadoop, etc.?
Also, what's unclear to me is what I should be doing with DNS. I'm guessing
that the source cluster needs to be able to resolve the hostnames of the
remote RegionServers and Master nodes as stored in ZooKeeper. Is there
anything else DNS-related I need to set up?
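
For the DNS part, this is a minimal sketch of the check I have in mind, assuming the guess above is right that the source nodes must resolve the destination hostnames. The hostnames in the usage line are hypothetical; the real ones would be whatever the destination Master and RegionServers register in ZooKeeper.

```python
import socket


def resolve(hostname):
    """Return the sorted, de-duplicated IP addresses for hostname.

    An empty list means the name does not resolve from this node.
    """
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})


def unresolved(hostnames):
    """Return the hostnames that this node cannot resolve."""
    return [name for name in hostnames if not resolve(name)]
```

For example, `unresolved(["master.dest.example.com", "rs1.dest.example.com"])` run on a source node would list the destination names that still need DNS or /etc/hosts entries there.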
Thanks for your time!
Lex Toumbourou
Lead engineer at scrunch.com <http://scrunch.com/>