FAQ
I'm working on a cluster with 360 reducer slots. I've got a big job, so
when I launch it I follow the recommendations in the Hadoop documentation
and set mapred.reduce.tasks=350, i.e. slightly less than the available
number of slots.

The problem is that my reducers can still take a long time (2-4 hours) to
run. So I end up grabbing a big slab of reducers and starving everybody
else out. I've got my priority set to VERY_LOW and
mapred.reduce.slowstart.completed.maps to 0.9, so I think I've done
everything I can do on the job parameters front. Currently there isn't a
way to make the individual reducers run faster, so I'm trying to figure out
the best way to run my job so that it plays nice with other users of the
cluster. My rule of thumb has always been to not try and do any scheduling
myself, but let Hadoop handle it for me, but I don't think that works in
this scenario.
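
For reference, here is one way the settings described above might be applied
with the old org.apache.hadoop.mapred driver API. This is a sketch only:
BigJobDriver is a made-up name, and IdentityMapper/IdentityReducer stand in
for whatever the real job uses.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.JobPriority;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class BigJobDriver {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(BigJobDriver.class);
        conf.setJobName("big-job");

        // Stand-ins for the real mapper and reducer.
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // The three job parameters mentioned above.
        conf.setNumReduceTasks(350);               // mapred.reduce.tasks, just under the 360 slots
        conf.setJobPriority(JobPriority.VERY_LOW); // let other users' jobs go first
        // Hold reducers back until 90% of the maps have finished.
        conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.9f);

        JobClient.runJob(conf);
      }
    }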

Questions:

1. Am I correct in thinking that long reducer times just mess up Hadoop's
scheduling granularity to a degree that it can't handle? Is a 4-hour reducer
outside the normal operating range of Hadoop?
2. Is there any way to stagger task launches? (Aside from manually.)
3. What if I set mapred.reduce.tasks to some value much, much larger
than the number of available reducer slots, like 100,000? Will that make
the amount of work sent to each reducer smaller (hence increasing the
scheduler granularity) or will it have no effect?
4. In this scenario, do I just have to reconcile myself to the fact that
my job is going to squat on a block of reducers no matter what and
set mapred.reduce.tasks
to something much less than the available number of slots?


  • James Seigel at May 18, 2011 at 9:30 pm
    W.P.,

    Hard to help out without knowing more about the characteristics of your data. How many keys are you expecting? How many values per key?

    Cheers
    James.
  • W.P. McNeill at May 18, 2011 at 9:43 pm
    Altogether my reducers are handling about 10^8 keys. The number of values
    per key varies, ranging from 1 to 100. I'd guess the mean and mode are
    around 10, but I'm not sure.
  • W.P. McNeill at May 18, 2011 at 9:54 pm
    Also, the values are much larger (maybe a factor of 10^3) than the keys. I
    get the impression that this is unusual for Hadoop apps. (It certainly
    isn't true of word count in any event.)
  • W.P. McNeill at May 19, 2011 at 1:05 am
    Here's a consequence I see of the values being much larger than the
    keys: there's not much point in adding a combiner.

    My mapper emits pairs of the form:

    <Key, Value>

    where the size of Value is much greater than the size of Key. The reducer
    then processes input of the form:

    <Key, Iterator<Value>>

    The reducer then looks at the set of Values corresponding to a Key and
    sorts it into one of two bins. I don't think this is particularly
    CPU-intensive; however, the reducer needs access to the entire set of
    Values. The set can't be boiled down into some smaller sufficient statistic
    the way that, say, a word count program can combine the counts for a word
    from different documents into a single number. As a result, the only
    combiner strategy I can see is to have the mapper emit each Value as a
    single-item list:

    <Key, [Value]>

    Have a combiner combine the lists:

    <Key, [Value, Value...]>

    and then the reducer would work on lists of lists.

    <Key, Iterator<[Value, Value...]>>

    This would save on redundant Key IO, but since Values are so much bigger
    than Keys I don't think this would matter.
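
A minimal sketch of the list-based combiner idea described in the reply
above, using the old mapred API. The Text key/value types and the
ValueListWritable and ListConcatCombiner names are assumptions for
illustration, not from this thread, and as the reply notes the saving is
only the duplicated keys, which are small here.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    import org.apache.hadoop.io.ArrayWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Hypothetical "list of Values" type. ArrayWritable needs a subclass that
    // fixes the element class so it can be deserialized. The mapper would wrap
    // each Value in a one-element ValueListWritable. (In practice each class
    // goes in its own source file.)
    public class ValueListWritable extends ArrayWritable {
      public ValueListWritable() { super(Text.class); }
      public ValueListWritable(Text[] values) { super(Text.class, values); }
    }

    // Combiner that concatenates the one-element lists emitted by the mapper,
    // turning the map output into <Key, [Value, Value, ...]>. It would be
    // registered with conf.setCombinerClass(ListConcatCombiner.class); the
    // reducer then receives an Iterator of lists and flattens them.
    public class ListConcatCombiner extends MapReduceBase
        implements Reducer<Text, ValueListWritable, Text, ValueListWritable> {
      @Override
      public void reduce(Text key, Iterator<ValueListWritable> values,
                         OutputCollector<Text, ValueListWritable> output,
                         Reporter reporter) throws IOException {
        List<Text> merged = new ArrayList<Text>();
        while (values.hasNext()) {
          for (Writable w : values.next().get()) {
            merged.add((Text) w);
          }
        }
        output.collect(key, new ValueListWritable(merged.toArray(new Text[0])));
      }
    }
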
  • James Seigel at May 18, 2011 at 10:00 pm
    W.P.,

    Sounds like you are going to be taking a long time no matter what. With a keyspace of about 10^8, that means that either Hadoop is going to eventually allocate 10^8 reducers (if you set your reducer count to 10^8) or is going to re-use the ones you have 10^8 / (number of reducers you allocate) times. It is probably just a big job :)

    Look into the fair scheduler, or specify fewer reducers for this job and suffer a slight slowdown, but allow other jobs to get reducers when they need them.

    You *might* get some efficiencies if you can reduce the number of keys, or ensure that very few keys are getting big lists of data (anti-parallel). Make sure you are using a combiner if there is an opportunity to reduce the amount of data that goes through the shuffle. That is always a good thing; IO is slow.

    Also, see if you can break your job up into smaller pieces so the more expensive operations happen on less data.

    Good luck!

    Cheers
    James.


  • W.P. McNeill at May 18, 2011 at 10:05 pm
    I'm using fair scheduler and JVM reuse. It is just plain a big job.

    I'm not using a combiner right now, but that's something to look at.

    What about bumping the mapred.reduce.tasks up to some huge number? I think
    that shouldn't make a difference, but I'm hearing conflicting information on
    this.
  • James Seigel at May 18, 2011 at 10:09 pm
    W.P.,

    Upping mapred.reduce.tasks to a huge number just means that it will eventually spawn a number of reducers equal to that huge number. You still only have 360 slots, so there is no real advantage, UNLESS you are running into OOM errors, which we've seen with higher re-use on a smaller number of reducers.

    Anyhoo, someone else can chime in and correct me if I am off base.

    Does that make sense?

    Cheers
    James.
  • Joey Echeverria at May 19, 2011 at 1:02 am
    The one advantage you would get with a large number of reducers is
    that the scheduler will be able to give open reduce slots to other
    jobs without having to preempt your tasks.

    This will reduce the risk of you losing a reducer 3 hours into a 4-hour run.

    -Joey


    --
    Joseph Echeverria
    Cloudera, Inc.
    443.305.9434
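
To put rough numbers on the granularity point above (an illustration only,
using the figures from this thread and ignoring the extra per-task startup,
shuffle, and output overhead that many small tasks add): with
mapred.reduce.tasks set to 100,000 on 360 slots, each slot runs about
100,000 / 360 ≈ 278 reduce tasks in sequence. If the total reduce work stays
the same, a partition that takes 4 hours as 1/350th of the key space shrinks
to roughly 4 h × 350 / 100,000 ≈ 50 seconds as 1/100,000th of it, so a slot
frees up for other jobs every minute or so instead of every few hours.
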
  • Michel Segel at May 19, 2011 at 11:53 am
    The fair scheduler won't help unless you set it to allow preemptive execution, which may not be a good thing...

    The fair scheduler will wait until the current task completes before assigning a new task to the open slot. So if you have a long-running job... you're SOL.

    A combiner will definitely help, but you will still have the issue of long-running jobs. You could put your job in a queue that limits the number of slots... But then you will definitely increase the time to run your job.

    If you could suspend a task... But that's a non-trivial solution...

    Sent from a remote device. Please excuse any typos...

    Mike Segel
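
One way to implement the "queue that limits the number of slots" idea above
is a dedicated fair scheduler pool. This is a rough sketch only: the pool
name is made up, the cap itself (for example a per-pool limit on reduce
slots, supported by newer fair scheduler versions) has to be defined by the
cluster admin in the scheduler's allocation file, and depending on the fair
scheduler version the pool is picked up from mapred.fairscheduler.pool or
from the job property named by mapred.fairscheduler.poolnameproperty.

    // Route the job into a hypothetical capped pool defined in the fair
    // scheduler allocation file; the cap is enforced by the scheduler,
    // not by this setting.
    conf.set("mapred.fairscheduler.pool", "low-priority-batch");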
