You are correct. The other big factor here is the cost of the connections between the mappers and the reducers: with N mappers and M reducers you create up to M*N connections during the shuffle, which can be a very large cost as well. The basic tricks you can play are:

- Filter data before doing the join so that less data is passed over the network.
- Make your block sizes larger so that more data is processed per map task and fewer connections are made between mappers and reducers.
- Make sure compression is enabled for the intermediate data between the map and reduce phases; LZO is usually a good choice for this.

But for the most part this is a very difficult problem to address, because fundamentally a join requires transferring all records with the same key to the same node. There is some experimental work on very-large-scale processing, but it is still just that, experimental, and it does not actually try to make a join scale linearly.
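To make the mechanics concrete, here is a toy Python sketch (plain Python, not Hadoop code; the table names, record values, and reducer count are made up for illustration) of how a reduce-side join works: each mapper tags records with their source table, records are hash-partitioned by key, and every record with a given key must travel to the same reducer, which is exactly the network transfer being discussed.

```python
# Toy sketch of a reduce-side join (illustrative only, not Hadoop code).
# Every record with the same key must reach the same reducer, so the
# shuffle step below is the part that moves both tables over the network
# (up to M*N mapper-to-reducer connections in a real cluster).

from collections import defaultdict

NUM_REDUCERS = 4  # "M" in the discussion above; arbitrary for this sketch

def map_phase(records, source_tag):
    """Emit (key, (tag, value)) pairs, tagging each record with its table."""
    for key, value in records:
        yield key, (source_tag, value)

def partition(key):
    """Hash-partition keys across reducers, like Hadoop's default partitioner."""
    return hash(key) % NUM_REDUCERS

def shuffle(mapped):
    """Group tagged records by reducer and key. On a cluster this grouping
    is the network shuffle; here it is just an in-memory regrouping."""
    buckets = [defaultdict(list) for _ in range(NUM_REDUCERS)]
    for key, tagged in mapped:
        buckets[partition(key)][key].append(tagged)
    return buckets

def reduce_phase(bucket):
    """For each key, join the records from the two tagged streams."""
    for key, tagged in bucket.items():
        left = [v for tag, v in tagged if tag == "orders"]
        right = [v for tag, v in tagged if tag == "users"]
        for l in left:
            for r in right:
                yield key, l, r

# Hypothetical input tables keyed by user id.
orders = [(1, "order-a"), (2, "order-b"), (1, "order-c")]
users = [(1, "alice"), (2, "bob")]

mapped = list(map_phase(orders, "orders")) + list(map_phase(users, "users"))
joined = [row for bucket in shuffle(mapped) for row in reduce_phase(bucket)]
```

Note that `mapped` contains every byte of both tables, which is why filtering before the join and compressing the intermediate data are the main levers: they shrink what the shuffle has to carry, but they cannot remove the shuffle itself.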
On 6/8/11 4:17 PM, "Shantian Purkad" wrote:
From: Shantian Purkad <email@example.com>
To: "firstname.lastname@example.org" <email@example.com>
Sent: Tuesday, June 7, 2011 3:53 PM
Subject: Linear scalability question
I have a question on the linear scalability of Hadoop.
We have a situation where we have to do reduce-side joins on two big tables (10+ TB). This causes a lot of data to be transferred over the network, and the network is becoming a bottleneck.
In a few years these tables will have 100+ TB of data, and the reduce-side joins will demand even more data transfer over the network. Since network bandwidth is limited and cannot be increased simply by adding more nodes, Hadoop will no longer be linearly scalable in this case.
Is my understanding correct? Am I missing anything here? How do people address these kinds of bottlenecks?
Thanks and Regards,