I've gone back through the thread and here's Elton's initial question...
On 06/06/11 08:22, elton sky wrote:
As I don't have experience with big scale clusters, I cannot figure out
why the inter-rack communication in a mapreduce job is "significantly"
slower. I saw the Cisco Catalyst 4900 series switch can reach up to
320Gbps capacity. Connected to 48 nodes with 1Gbps ethernet each, there
shouldn't be much contention at the switch, should there?
Elton's question deals with why connections within the same switch are faster than connections that traverse a set of switches.
The issue isn't so much one of the fabric within the switch itself, but the width of the connection between the two switches.
If you have 40Gbps (each direction) on a switch and you want it to communicate seamlessly with machines on the next switch, you have to be able to bond four 10GbE ports together.
(Note: there's a bit more to it, but that's the general idea.)
You're going to see a significant slowdown in communication between nodes on different racks because of the bandwidth limit on the ports connecting the switches, not the 'fabric' within the switch itself.
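To put rough numbers on it, here's a back-of-the-envelope sketch using Elton's figure of 48 nodes at 1Gbps per rack and an assumed 4 x 10GbE bonded uplink; the port counts and speeds are assumptions for illustration, not measurements from any real cluster:

// Illustrative oversubscription math for a single rack.
// The port counts and speeds below are assumptions, not measurements.
public class RackBandwidth {
    public static void main(String[] args) {
        double nodePortGbps = 1.0;     // each node attaches at 1 Gbps
        int nodesPerRack = 48;         // nodes hanging off one top-of-rack switch
        double uplinkGbps = 4 * 10.0;  // assumed 4 x 10 GbE bonded uplink to the next switch

        double offRackDemand = nodesPerRack * nodePortGbps;    // 48 Gbps worst case
        double oversubscription = offRackDemand / uplinkGbps;  // 1.2 : 1
        double perNodeShare = uplinkGbps / nodesPerRack;       // ~0.83 Gbps per node

        System.out.printf("Worst-case off-rack demand:  %.0f Gbps%n", offRackDemand);
        System.out.printf("Oversubscription ratio:      %.2f : 1%n", oversubscription);
        System.out.printf("Per-node inter-rack share:   %.2f Gbps%n", perNodeShare);
    }
}

In other words, the switch fabric is fine for intra-rack traffic, but the moment every node wants to talk off-rack they're all sharing the trunk, and the math gets much worse with a slower or unbonded uplink.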
To your point, you can monitor your jobs and see how much of your work is being done by 'data local' tasks. In one job we had 519 tasks started, of which 482 were 'data local'.
So for ~93% of the tasks we didn't have any network latency issue at all. For the remaining ~7%, you have to ask what fraction actually involved pulling data across a 'trunk' between switches. So in practice, network latency isn't going to be a huge factor in overall efficiency.
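If you want to pull those locality numbers programmatically rather than reading them off the JobTracker web UI, something like the sketch below works against the newer org.apache.hadoop.mapreduce API. The job id is made up, and the counter names can differ between Hadoop versions, so treat this as an outline rather than gospel:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Cluster;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobCounter;
import org.apache.hadoop.mapreduce.JobID;

// Rough sketch: report what fraction of map tasks ran data-local vs. rack-local.
public class LocalityReport {
    public static void main(String[] args) throws Exception {
        Cluster cluster = new Cluster(new Configuration());
        Job job = cluster.getJob(JobID.forName(args[0])); // e.g. "job_201106060822_0001" (hypothetical id)
        Counters counters = job.getCounters();

        long launched  = counters.findCounter(JobCounter.TOTAL_LAUNCHED_MAPS).getValue();
        long dataLocal = counters.findCounter(JobCounter.DATA_LOCAL_MAPS).getValue();
        long rackLocal = counters.findCounter(JobCounter.RACK_LOCAL_MAPS).getValue();

        System.out.printf("maps launched: %d, data-local: %d (%.0f%%), rack-local: %d (%.0f%%)%n",
                launched, dataLocal, 100.0 * dataLocal / launched,
                rackLocal, 100.0 * rackLocal / launched);
    }
}

Anything that isn't data-local or rack-local is the traffic that ends up crossing the trunk between switches.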
However, that's just for Hadoop. What happens when you run HBase? ;-)
(You can have more network traffic during an m/r job.)