Why inter-rack communication in mapreduce slow?
hello everyone,

As I don't have experience with large-scale clusters, I cannot figure out why
the inter-rack communication in a MapReduce job is "significantly" slower
than intra-rack communication.
I saw that a Cisco Catalyst 4900 series switch can reach up to 320 Gbps of
forwarding capacity. With 48 nodes connected at 1 Gbps Ethernet each, there
should not be much contention at the switch, should there?

Can anyone explain to me how this "slowness" happens?

Thanks
Elton


  • Steve Loughran at Jun 6, 2011 at 9:28 am

    I don't know enough about these switches; I do hear stories about
    buffering and the like, and I also hear that a lot of switches don't
    always expect all the ports to light up simultaneously.

    Outside Hadoop, try setting up some simple bandwidth tests to measure
    inter-rack bandwidth: have every node on one rack try to talk to one on
    another rack at full rate.

    Set up every node talking to every other node at least once, to make
    sure there aren't odd problems between two nodes, which can happen if
    one of the NICs is playing up.

    Once you are happy that the basic bandwidth between servers is OK, then
    it's time to start worrying about adding Hadoop to the mix.

    -steve
  • Elton sky at Jun 6, 2011 at 11:04 am
    Thanks for the reply, Steve,

    I totally agree a benchmark is a good idea. But the problem is I don't have
    a switch to play with, only a small cluster.
    I am curious about this, which is why I posted the question.
    Can some experienced people share their knowledge with us?

    Cheers
  • Joey Echeverria at Jun 6, 2011 at 1:02 pm
    Larger Hadoop installations are space dense, 20-40 nodes per rack.
    When you get to that density with multiple racks, it becomes expensive
    to buy a switch with enough capacity for all of the nodes in all of
    the racks. The typical solution is to install a switch per rack with
    uplinks to a core switch to route between the racks. In this
    arrangement, you'll be limited by the uplink bandwidth to the core
    switch for inter-rack communication. Typically these uplinks are 10-20
    Gbps (bidirectional).

    Assuming you have 32 nodes in a rack with 1 Gbps links, then 20 Gbps
    isn't enough bandwidth to push all of those ports at full tilt between
    racks. That's why Hadoop has the ability to take advantage of rack
    locality. It will try to schedule I/O local to a rack where it's less
    likely to block.
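
    As a rough check of those numbers (back-of-the-envelope only, using the
    figures above):

        # Oversubscription of a rack uplink, using the figures above.
        nodes_per_rack = 32       # nodes, each with a 1 Gbps NIC
        uplink_gbps = 20          # rack-to-core uplink

        worst_case_demand = nodes_per_rack * 1               # 32 Gbps if every node pushes cross-rack
        oversubscription = worst_case_demand / uplink_gbps   # 1.6:1
        per_node_share = uplink_gbps / nodes_per_rack         # 0.625 Gbps per node

        print("%d Gbps demand vs %d Gbps uplink: %.1f:1 oversubscribed, "
              "~%.2f Gbps per node" % (worst_case_demand, uplink_gbps,
                                       oversubscription, per_node_share))

    Inside the rack, each of those nodes still gets its full 1 Gbps through the
    top-of-rack switch, which is exactly the intra- vs inter-rack gap being
    asked about.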

    -Joey


    --
    Joseph Echeverria
    Cloudera, Inc.
    443.305.9434
  • Darren at Jun 6, 2011 at 1:19 pm
    I never understood how Hadoop can saturate an inter-rack fiber switch.
    It's supposed to operate on the principle of moving the code to the data
    because of the I/O cost of moving the data, right?
  • John Armstrong at Jun 6, 2011 at 1:22 pm

    But what happens when a reducer on rack A gets most of its input from
    mappers on rack A, but needs a serious chunk of data from mappers on racks
    B, C, D...
  • Darren at Jun 6, 2011 at 1:26 pm
    I'm not a Hadoop jedi, but in that case, wouldn't one of the Hadoop
    "trackers" get bottlenecked resolving those dependencies?

    Again, this exposes the oddity of Hadoop IMO: it tries NOT to
    be I/O bound, but seems very I/O bound...

    Sorry, not trying to shift the thread topic.

  • John Armstrong at Jun 6, 2011 at 1:29 pm

    I'm not a jedi either, but I think that the key is in the word "tries".
    Distributed computing is extremely I/O bound, and Hadoop tries to bring
    that down to just very.
  • Darren at Jun 6, 2011 at 1:35 pm
    Yeah, that's a good point.

    I wonder though, what the load on the tracker nodes (ports et al.) would
    be if an inter-rack fiber switch at tens of Gbps is getting maxed out.

    Seems to me that if there is that much traffic being moved across
    racks, the tracker node (or whatever node it is) would overload
    first?

    If I recall correctly, in order for an intra-rack node to be
    selected, the tracker service has to be consulted, putting further
    load on it...

  • John Armstrong at Jun 6, 2011 at 1:40 pm

    It could happen, but I don't think it would always. For example, tracker
    is on rack A; sees that the best place to put reducer R is on rack B; sees
    reducer still needs a few hellabytes from mapper M on rack C; tells M to
    send data to R; switches on B and C get throttled, leaving A free to handle
    other things.

    In fact, it almost makes me wonder if an ideal setup is not only to have
    each of the main control daemons on their own nodes, but to put THOSE nodes
    on their own rack and keep all the data elsewhere.
  • Darren at Jun 6, 2011 at 1:50 pm
    Yeah, the way you described it, maybe not, because the hellabytes
    are all coming from one rack. But in reality, wouldn't this be
    more uniform because of how Hadoop/HDFS work (distributed more evenly)?

    And if that is true, then for all the switched packets passing through
    the inter-rack switch, a consultation of the tracker would have preceded
    it?

    Well, I'm just theorizing, and I'm sure we'll see more concrete numbers
    on the relation between # racks, # nodes, # switches, # trackers and
    their configurations.

    I like your idea about racking the trackers though. So it won't need any
    tracker trackers?!? ;)

  • Joey Echeverria at Jun 6, 2011 at 1:54 pm
    Most of the network bandwidth used during a MapReduce job should come
    from the shuffle/sort phase. This part doesn't use HDFS. The
    TaskTrackers running reduce tasks will pull intermediate results from
    TaskTrackers running map tasks over HTTP. In most cases, it's
    difficult to get rack locality during this process because of the
    contract between mappers and reducers. If you wanted locality, your
    data would already have to be grouped by key in your source files and
    you'd need to use a custom partitioner.
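
    A conceptual sketch of why that is (plain Python, not Hadoop's actual API):
    with default hash partitioning, each mapper's output is split across all
    reducers, so during the shuffle each reducer generally has to fetch a piece
    from every mapper, whatever rack it runs on. The data below is made up.

        NUM_REDUCERS = 4

        def partition(key, num_reducers=NUM_REDUCERS):
            # Stand-in for Hadoop's HashPartitioner: bucket a key by its hash.
            return sum(ord(c) for c in key) % num_reducers

        # Pretend intermediate output of two mappers running on different racks.
        mapper_outputs = {
            "mapper_on_rack_A": [("apple", 1), ("banana", 1), ("cherry", 1)],
            "mapper_on_rack_B": [("apple", 1), ("durian", 1), ("banana", 1)],
        }

        # Which reducers need data from which mappers?
        fetches = {}
        for mapper, records in mapper_outputs.items():
            for key, _ in records:
                fetches.setdefault(partition(key), set()).add(mapper)

        for reducer, sources in sorted(fetches.items()):
            print("reducer %d fetches from: %s" % (reducer, sorted(sources)))

    A custom partitioner could steer keys toward reducers on the rack that holds
    most of that key's map output, but as noted above that only helps if the
    input data is already grouped by key.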

    -Joey


    --
    Joseph Echeverria
    Cloudera, Inc.
    443.305.9434
  • Mauricio Cavallera at Jun 7, 2011 at 1:45 am
    Unsubscribe
  • James Seigel at Jun 7, 2011 at 1:53 am
    Not sure that will make the interconnect faster, but it is worth a try.
  • Elton sky at Jun 6, 2011 at 1:59 pm
    Hi John,

    Because for map tasks, the job tracker tries to assign them to local data
    nodes, so there's not much network traffic.
    Then the only potential issue will be, as you said, the reducers, which copy
    data from all maps.
    So in other words, if the application only creates small intermediate
    output, e.g. grep, wordcount, this jam between racks is not likely to happen,
    is it?

  • Darren at Jun 6, 2011 at 3:00 pm
    IMO, that's right, because MapReduce/Hadoop was originally designed for
    that kind of text processing purpose (i.e., few stages, low dependency,
    highly parallel).

    It's when one tries to solve general-purpose algorithms of modest
    complexity that MapReduce gets into I/O churning problems.
  • Michael Segel at Jun 6, 2011 at 3:43 pm
    Well, the problem is pretty basic.

    Take your typical 1 GbE switch with 42 ports.
    Each port is capable of doing 1 GbE in each direction across the switch's fabric.
    Depending on your hardware, that's a fabric of roughly 40 Gbps, shared.

    Depending on your hardware, you are usually using 1 or maybe 2 ports to 'trunk' to your network's backplane. (To keep this simple, let's just say that it's a 1-2 GbE 'trunk' to your next rack.)
    So you end up with 1 GbE of traffic from each node trying to reach another node on the next rack. So if that's 20 nodes per rack and they all want to communicate... you end up with 20 GbE (each direction) trying to fit through a 1-2 GbE pipe.

    Think of Rush hour in Chicago, or worse, rush hour in Atlanta where people don't know how to drive. :-P
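
    To put rough numbers on that (purely illustrative; the 200 GB of shuffle
    data is an assumed figure, not anyone's measurement):

        # Rough time to move shuffle data between racks, using the figures above.
        shuffle_gb = 200.0        # assumed cross-rack shuffle volume
        trunk_gbps = 2.0          # the 1-2 GbE trunk between racks
        in_rack_gbps = 20.0       # 20 nodes x 1 GbE inside one rack

        def transfer_seconds(gigabytes, gbps):
            return gigabytes * 8 / gbps   # GB -> gigabits, then divide by the rate

        print("across the 2 GbE trunk: %4.0f s" % transfer_seconds(shuffle_gb, trunk_gbps))
        print("inside the rack:        %4.0f s" % transfer_seconds(shuffle_gb, in_rack_gbps))

    That is roughly 13 minutes versus 80 seconds for the same data, which is
    where the "significantly slower" impression comes from.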

    The quick fix... spend the $8-10K per switch to get a ToR switch that has 10+ GbE uplink capability (usually 4 ports). Then you have at least 10 GbE per rack.

    JMHO

    -Mike


  • Chris Smith at Jun 6, 2011 at 4:42 pm
    Elton,

    Rapleaf's blog has an interesting posting on their experience that's
    worth a read: http://blog.rapleaf.com/dev/2010/08/26/analyzing-some-interesting-networks-for-mapreduce-clusters/

    And if you want to get an idea of the interaction between CPU, disk
    and network, there's nothing like a picture: see Slide 9 in this deck, of
    a very simple Terasort MapReduce job. Obviously the real world is
    very different, but the individual MapReduce jobs follow a similar
    pattern.

    Even doubling the node network performance in this simple example
    would not get you much performance improvement as the job is CPU bound
    for 50% of the time and only uses the network for roughly 10% of the
    remaining time.

    Chris
  • Michael Segel at Jun 6, 2011 at 6:04 pm
    Chris,

    I've gone back through the thread and here's Elton's initial question:

        As I don't have experience with large-scale clusters, I cannot figure out why
        the inter-rack communication in a MapReduce job is "significantly" slower
        than intra-rack communication.
        I saw that a Cisco Catalyst 4900 series switch can reach up to 320 Gbps of
        forwarding capacity. With 48 nodes connected at 1 Gbps Ethernet each, there
        should not be much contention at the switch, should there?

    Elton's question deals with why connections within the same switch are faster than connections that traverse a set of switches.
    The issue isn't so much one of the fabric within the switch itself, but the width of the connection between the two switches.

    If you have 40 Gbps (each direction) on a switch and you want it to communicate seamlessly with machines on the next switch, you have to be able to bond four 10 GbE ports together.
    (Note: there's a bit more to it, but it's the general idea.)

    You're going to have a significant slowdown in communication between nodes that are on different racks because of the bandwidth limitations on the ports used to connect the switches, not the 'fabric' within the switch itself.

    To your point, you can monitor your jobs and see how much of your work is being done by 'data local' tasks. In one job we had 519 tasks started, of which 482 were 'data local'.
    So we had ~93% of the tasks where we didn't have an issue with any network latency. And for the other 7% of the tasks, you have to consider what percentage involved pulling data across a 'trunk'. So yes, network latency isn't going to be a huge factor in terms of improving overall efficiency.
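
    Working those counters out (trivial, but it confirms the percentages quoted
    above):

        total_tasks = 519
        data_local = 482
        print("data-local: %.1f%%, non-local: %.1f%%"
              % (100.0 * data_local / total_tasks,
                 100.0 * (total_tasks - data_local) / total_tasks))
        # -> data-local: 92.9%, non-local: 7.1%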

    However, that's just for Hadoop. What happens when you run HBase? ;-)
    (You can have more network traffic during an m/r job.)

    HTH

    -Mike
  • Elton sky at Jun 7, 2011 at 1:25 am
    Michael,
    Depending on your hardware, that's a fabric of roughly 40 Gbps, shared.
    So that fabric is shared by all 42 ports. And even if I just used 2 ports
    out of 42, connecting to 2 racks, then if there's enough traffic coming, these 2
    ports could use all 40 Gbps. Is this right?

    -Elton
  • Michael Segel at Jun 7, 2011 at 2:16 am

    Elton,

    That 40 Gbps of fabric is shared across all of the ports in the switch. But the maximum bandwidth each port can handle is only 1 GbE, bidirectional.
    So if you've got 40 Gbps of traffic with only one 1 GbE port, you can only push 1 Gbps through that port.

    I realize I'm oversimplifying it, but that's the basic problem. This is why you will find switches that have 1 GbE ports with 10 GbE uplinks.

    HTH

    -Mike
  • Steve Loughran at Jun 7, 2011 at 9:48 am

    On 06/06/2011 02:40 PM, John Armstrong wrote:
    In fact, it almost makes me wonder if an ideal setup is not only to have
    each of the main control daemons on their own nodes, but to put THOSE nodes
    on their own rack and keep all the data elsewhere.
    I'd give them a 10 Gbps connection to the main network fabric, as with any
    ingress/egress nodes whose aim in life is to get data into and out of
    the cluster. There's a lot to be said for fast nodes within the
    datacentre that don't host datanodes, as that way their writes get
    scattered everywhere, which is what you need when loading data into HDFS.

    You don't need separate racks for this, just more complicated wiring.

    -steve

    (disclaimer, my network knowledge generally stops at Connection Refused
    and No Route to Host messages)
  • Michael Segel at Jun 7, 2011 at 11:23 am
    We've looked at the network problem for the past two years.

    10 GbE is now a reality. Even a year ago prices were still at a premium.
    Because you now have 10 GbE and you have 2 TB drives at a price sweet spot, you really need to sit down and think out your cluster design.
    You need to look at things in terms of power usage, density, cost and vendor relationships...

    There's really more to this problem.

    If you're looking for a simple answer: if you run 1 GbE ToR switches, buy a reliable switch that not only has better uplinks but also allows you to bond ports to create a fatter pipe.
    Arista and Blade Network Technologies (now part of IBM) are producing some interesting switches and the prices aren't too bad. (Plus they claim to play nice with Cisco switches...)

    If you're looking to build out a very large cluster... you really need to take a holistic approach.


    HTH

    -Mike
  • Elton sky at Jun 6, 2011 at 1:40 pm
    Thanks Joey,

    So the bandwidth is throttled by the core switch when many nodes are requesting
    traffic and the core switch cannot keep up.
    It only happens when the cluster is busy enough.
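
    As an aside, the rack locality discussed in this thread only works if Hadoop
    actually knows which rack each node lives in, which is usually done by
    pointing the topology.script.file.name property (the name used in Hadoop of
    this era) at a small script. A minimal sketch, with made-up addresses and
    rack names; Hadoop calls it with one or more node IPs or hostnames and
    expects one rack path per argument on stdout:

        import sys

        # Made-up mapping of node addresses to rack paths.
        RACKS = {
            "10.0.1.11": "/rack1",
            "10.0.1.12": "/rack1",
            "10.0.2.11": "/rack2",
            "10.0.2.12": "/rack2",
        }

        DEFAULT_RACK = "/default-rack"

        print(" ".join(RACKS.get(arg, DEFAULT_RACK) for arg in sys.argv[1:]))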

Discussion Overview
group: common-user
categories: hadoop
posted: Jun 6, '11 at 7:23a
active: Jun 7, '11 at 11:23a
posts: 24
users: 9
website: hadoop.apache.org...
irc: #hadoop
