FAQ
Hi,

I am trying to understand the effects of increasing the block size or the minimum
split size. If I increase them, then a mapper will process more data,
effectively reducing the number of mappers that will be spawned. As there is
an overhead in starting mappers, this seems good.

However, if I increase their values too much, what negative effects will
come up? Put another way, how do I compute the best number of mappers to
start for processing data of a given size on a cluster?

For calculations, let us assume 100 GB of data and 4 machines (dual core).

Also, if I set the JVM reuse flag to -1, will it make a difference?
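
For concreteness, my back-of-the-envelope numbers: with 64 MB splits, 100 GB
works out to roughly 1,600 map tasks, while with 256 MB splits it is roughly
400. On 4 dual-core machines (about 8 concurrent map slots) that is roughly
200 waves of maps versus roughly 50, so the per-mapper startup overhead is
paid far fewer times.

As I understand it, these are the knobs involved (a rough sketch against the
old JobConf API; the class name is just a placeholder, and dfs.block.size is
an HDFS-side setting applied when the files are written, so it is not shown):

    import org.apache.hadoop.mapred.JobConf;

    public class SplitSizeSketch {
        public static JobConf configure() {
            JobConf conf = new JobConf();
            // Ask for splits of at least ~256 MB, so fewer, larger maps are spawned.
            conf.setLong("mapred.min.split.size", 256L * 1024 * 1024);
            // Reuse each task JVM for an unlimited number of tasks
            // (same effect as setting mapred.job.reuse.jvm.num.tasks to -1).
            conf.setNumTasksToExecutePerJvm(-1);
            return conf;
        }
    }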

Thanks,
Tarandeep


  • Jothi Padmanabhan at Jun 12, 2009 at 11:36 am
    If the number of maps is reduced, it is possible that the size of the
    individual map outputs will increase. A couple of possible issues come to
    mind immediately:
    1. The number of spills in the map might increase, which might incur extra
    cost during merging.
    2. While the reduces might pull in more data per fetch (which is good), it
    might also result in a state where the reducer cannot hold the map output
    in memory and has to spill it to disk.

    JVM reuse should help, but if the individual task completion time is very
    high, there might not be any discernible performance gain.
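
    If you do go with larger maps, these are the knobs I would look at first (a
    rough sketch only; the defaults are from memory, so please verify them
    against your hadoop-default.xml, and treat the values as illustrative
    rather than recommendations):

        import org.apache.hadoop.mapred.JobConf;

        public class SpillTuningSketch {
            public static void tune(JobConf conf) {
                // Map side: a larger sort buffer means fewer spill files to merge later.
                conf.setInt("io.sort.mb", 200);     // default is 100 (MB)
                // More spill files can be merged in a single pass.
                conf.setInt("io.sort.factor", 20);  // default is 10
                // Reduce side: fraction of the reducer heap used to buffer fetched
                // map outputs before they are merged to disk.
                conf.set("mapred.job.shuffle.input.buffer.percent", "0.70");
            }
        }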

    Jothi

  • Tarandeep Singh at Jun 12, 2009 at 4:42 pm
    Thanks Jothi...

    -Tarandeep
  • Owen O'Malley at Jun 13, 2009 at 12:00 am

    On Jun 11, 2009, at 11:06 AM, Tarandeep Singh wrote:

    I am trying to understand the effects of increasing block size or minimum
    split size. If I increase them, then a mapper will process more data,
    effectively reducing the number of mappers that will be spawned. As there is
    an overhead in starting mappers, this seems good.

    Even more important is that the shuffle depends on the number of maps *
    reduces. For the sort benchmark, we found that it was much more performant
    to have a few very large maps (500 MB+).
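
    To put rough numbers on that for the 100 GB case above: with 64 MB splits
    you get roughly 1,600 maps, so with, say, 8 reduces the shuffle has to move
    roughly 12,800 map-output segments. With ~500 MB maps you get roughly 200
    maps, or roughly 1,600 segments, about an 8x reduction in fetches for the
    same volume of data. (These are back-of-the-envelope figures, not benchmark
    results.)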

    -- Owen
  • Tarandeep Singh at Jun 13, 2009 at 12:36 am

    On Fri, Jun 12, 2009 at 4:59 PM, Owen O'Malley wrote:

    Even more important is that the shuffle depends on the number of maps *
    reduces. For the sort benchmark, we found that it was much more performant
    to have a few very large maps (500 MB+).

    Owen, what were the values of the other parameters for your sort benchmark,
    like io.sort.* etc.? Is this documented somewhere so that I can take a look,
    or could you please paste them here?

    thanks,
    Tarandeep

  • Harish Mallipeddi at Jun 13, 2009 at 11:17 am


    Tarandeep,

    Check the following links:

    http://developer.yahoo.com/blogs/hadoop/Yahoo2009.pdf
    http://people.apache.org/~omalley/tera-2009/
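
    From memory, the knobs discussed in those writeups include io.sort.mb,
    io.sort.factor, io.sort.record.percent, mapred.reduce.parallel.copies and
    the JVM reuse setting, but please take the actual values from the linked
    reports rather than from my memory.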

    Cheers,

    --
    Harish Mallipeddi
    http://blog.poundbang.in

Discussion Overview
group: common-user
category: hadoop
posted: Jun 11, '09 at 6:06p
active: Jun 13, '09 at 11:17a
posts: 6
users: 4
website: hadoop.apache.org
irc: #hadoop
