Linear scalability question
Hi,

I have a question on the linear scalability of Hadoop.

We have a situation where we have to do reduce-side joins on two big tables (10+ TB). This causes a lot of data to be transferred over the network, and the network is becoming a bottleneck.

In a few years these tables will have 100+ TB of data, and the reduce-side joins will demand even more data transfer over the network. Since network bandwidth is limited and cannot be increased just by adding more nodes, Hadoop will no longer be linearly scalable in this case.

Is my understanding correct? Am I missing anything here? How do people address these kinds of bottlenecks?

Thanks and Regards,
Shantian
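
For reference, here is a minimal sketch of the reduce-side join pattern under discussion, assuming two tab-separated inputs with the join key in the first column (the class names and column layout are illustrative assumptions, not taken from the thread):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ReduceSideJoin {

    // Mapper for table A: tag each record with its source so the
    // reducer can tell the two tables apart. A second mapper for
    // table B would emit "B\t" + row in the same way.
    public static class TableAMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] cols = line.toString().split("\t", 2);
            ctx.write(new Text(cols[0]), new Text("A\t" + cols[1]));
        }
    }

    // Every record with a given key is shuffled to one reducer; that
    // shuffle is exactly the network transfer the question is about.
    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            List<String> aRows = new ArrayList<String>();
            List<String> bRows = new ArrayList<String>();
            for (Text v : values) {
                String[] parts = v.toString().split("\t", 2);
                if (parts[0].equals("A")) {
                    aRows.add(parts[1]);
                } else {
                    bRows.add(parts[1]);
                }
            }
            // Inner join: emit the cross product of matching rows.
            for (String a : aRows) {
                for (String b : bRows) {
                    ctx.write(key, new Text(a + "\t" + b));
                }
            }
        }
    }
}

Each table would get its own tagging mapper (MultipleInputs can route the two input paths to them), and the reducer buffers one side in memory, which is why filtering early and compressing map output matter at these sizes.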


  • Shantian Purkad at Jun 8, 2011 at 9:18 pm
    any comments?



  • Robert Evans at Jun 9, 2011 at 2:24 pm
    Shantian,

    You are correct. The other big factor here is the cost of connections between the mappers and the reducers: with N mappers and M reducers you make M*N connections between them (for example, 10,000 maps feeding 1,000 reduces means 10,000,000 shuffle fetches), and this can be a very large cost as well. The basic tricks you can play are:

    - Filter the data before doing the join, so that less data is pushed through the network.
    - Make your block sizes larger, so that more data is processed per map and fewer connections are made between mappers and reducers.
    - Make sure you have compression enabled between the map and reduce phases; LZO is usually a good choice for this.

    But for the most part this is a very difficult problem to address, because fundamentally a join requires transferring all data with the same key to the same node. There is some experimental work on very-large-scale processing, but it is still just that, experimental, and it does not actually try to make a join scale linearly.

    --Bobby Evans
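
    As a concrete sketch of the knobs mentioned above, here is how a job driver might set them, using the pre-0.21 property names that were current when this thread was written (the LZO codec class comes from the separately installed hadoop-lzo package, so that class name is an assumption about the cluster):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class JoinDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Compress intermediate map output so less data crosses
            // the network during the shuffle.
            conf.setBoolean("mapred.compress.map.output", true);
            // Assumes hadoop-lzo is installed on the cluster.
            conf.set("mapred.map.output.compression.codec",
                     "com.hadoop.compression.lzo.LzoCodec");

            // Larger blocks mean fewer, bigger map tasks, and so
            // fewer mapper-to-reducer connections. This applies to
            // files written by this job; existing inputs keep the
            // block size they were written with.
            conf.setLong("dfs.block.size", 256L * 1024 * 1024); // 256 MB

            Job job = new Job(conf, "reduce-side-join");
            // ... set mapper, reducer, input/output paths as usual ...
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }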

