On Jun 7, 2011, at 12:07 AM, sanjeev.taran@us.pwc.com wrote:


> I wanted to know if anyone has any tips or tutorials on how to install a
> Hadoop cluster across multiple datacenters.
Generally, this is a bad idea. Why?
1) Inter-datacenter bandwidth is expensive compared to cluster bandwidth.
2) This extra topological constraint is not currently well-modeled in the Hadoop architecture. This means that you will likely find assumptions in the software that are not true in the inter-datacenter case.
3) None of the biggest users currently do this. Unless you plan on putting serious money into the game, stick with what is well established to work.
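On point 2: Hadoop's only built-in notion of network topology is rack awareness, driven by an admin-supplied script (the topology.script.file.name property). Hadoop passes node addresses as arguments and reads one rack path per node from stdout. A minimal sketch of such a script follows; the subnets and rack names are invented for illustration. Note that the hierarchy is effectively one level deep: there is no first-class "datacenter" tier above racks, which is part of why the inter-datacenter case is poorly modeled.

```shell
# Hypothetical HDFS rack-awareness (topology) script. Hadoop invokes it
# with node IPs/hostnames as arguments and expects one rack path per
# node on stdout. The IP ranges and rack names below are made up.
resolve_rack() {
  case "$1" in
    10.1.*) echo /dc1/rack1 ;;    # hypothetical datacenter-1 subnet
    10.2.*) echo /dc2/rack1 ;;    # hypothetical datacenter-2 subnet
    *)      echo /default-rack ;; # Hadoop's conventional fallback
  esac
}

for node in "$@"; do
  resolve_rack "$node"
done
```

Even if you encode a datacenter in the path this way, HDFS block placement still treats every leaf as just another rack.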

I would note that, in my other life, I work with a batch-oriented distributed computing system called Condor (http://www.cs.wisc.edu/condor/). Condor is designed to naturally span the globe (I've seen it spanning around 50 clusters). However, it is batch-job oriented, not data oriented. If you've had to wedge your problem into the MapReduce paradigm, this might be a good alternative.
> Do you need ssh connectivity between the nodes across these data centers?

Definitely not. SSH is only used in the wrapper scripts to start the HDFS daemons. It's a usability crutch for smaller clusters that don't have proper management.
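To illustrate the point: the stock start-dfs.sh wrapper does little more than loop over a host list and ssh to each machine to run the per-node daemon script. A toy re-creation is below (the slaves-file name and hostnames are hypothetical, and echo stands in for actually executing ssh):

```shell
# Toy sketch of what the start-dfs.sh wrapper does: read a slaves file
# and run the per-node daemon script on each host over ssh. Nothing in
# HDFS itself requires SSH; any management tool that can run
# "hadoop-daemon.sh start datanode" locally on each node works as well.
start_slaves() {
  slaves_file=$1
  while read -r host; do
    # echo stands in for: ssh "$host" "$HADOOP_HOME/bin/hadoop-daemon.sh start datanode"
    echo "ssh $host hadoop-daemon.sh start datanode"
  done < "$slaves_file"
}
```

In other words, replace the ssh loop with your configuration management or init system and the wrapper scripts become unnecessary.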

If your ops folks don't have a better way to manage what is running on your cluster, fire them.


Discussion Overview
group: common-user
posted: Jun 7, '11 at 5:08a
active: Jun 7, '11 at 12:58p


