I've been thinking (which is always a dangerous thing) about data locality lately.

If we look at file systems, there is this idea of 'reserved space'. This space is used for a variety of reasons, including reducing fragmentation on busy file systems. It allows the file system driver to make smarter decisions about block placement, helping overall throughput.

At LinkedIn, we're about to build a new grid with a few hundred nodes. I'm beginning to wonder if it wouldn't make sense to actually 'hold back' some task slots from use with this same concept in mind. Take a grid that is full: all of the task slots are in use. When a task ends, the scheduler has to decide which waiting task gets the freed slot. If we assume a fairly FIFO view of the world (default scheduler, capacity, maybe fair share?), it pulls the next task off the queue and pushes it to the task slot. If only one task slot is free, locality doesn't enter into the picture at all. In essence, we've fragmented our execution.

If we were to leave even 1 slot 'always' free (and therefore sacrifice execution capacity by 1 slot), the scheduler could potentially make sure the task is host- or rack-local. If it can't, no loss--it wouldn't have been local anyway. Obviously, reserving more slots as 'always' free increases our likelihood of being local. It just comes down to whether the tradeoff is worth it.
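To make the idea concrete, here is a toy sketch (plain Python, not Hadoop code; all names are made up for illustration) of the 'reserved slot' decision: when a slot frees up on a node, run a waiting task whose data lives on that node if one exists; otherwise hold the slot back while we are within the reserve budget, and only fall back to a non-local task once the reserve is exhausted.

```python
from collections import deque

def assign(node, queue, idle_slots, reserve):
    """Return the task to run on `node`, or None to hold the slot back.

    queue      -- FIFO of waiting tasks; each is (name, {nodes holding its data})
    idle_slots -- number of slots currently free across the grid
    reserve    -- how many slots we are willing to keep 'always' free
    """
    # Prefer a task whose input data lives on this node (host-local).
    for i, (name, locations) in enumerate(queue):
        if node in locations:
            del queue[i]
            return name
    # No local task: hold the slot back if the reserve allows it,
    # hoping a better-placed slot frees up soon.
    if idle_slots <= reserve:
        return None
    # Reserve exhausted -- take the locality hit and run the next task.
    name, _ = queue.popleft()
    return name

queue = deque([("t1", {"nodeB"}), ("t2", {"nodeA", "nodeC"})])
print(assign("nodeA", queue, idle_slots=1, reserve=1))  # t2 is local to nodeA
```

With reserve=0 this degenerates to the plain FIFO behavior described above; the tradeoff in the post is exactly how large a reserve buys how much extra locality.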

I guess the real question comes down to how much of an impact data locality really has. I know in the case of the bigger grids at Yahoo!, the ops team suspected (but never did the homework to verify) that our grids and their usage were so massive that data locality rarely happened, especially for "popular" data. I can't help but wonder if the situation would have been better if we had kept x% (say .005%?) of the grid free based on the speculation above.

Thoughts?


  • Todd Lipcon at May 28, 2010 at 6:44 pm
    Hi Allen,

    Recent versions of the fair scheduler have configurations for "delay
    scheduling" - essentially, it will wait for a few seconds when a slot opens
    up to try to find a local task before assigning a non-local one. This is
    specifically to avoid the issue you're describing.
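    A rough sketch of the delay-scheduling rule (again a toy in Python, not the actual FairScheduler code; the class and field names are invented for illustration): a job with no task local to the freed slot skips its turn for a bounded time instead of immediately launching non-locally.

    ```python
    import time
    from dataclasses import dataclass

    @dataclass
    class Task:
        name: str
        locations: set  # nodes holding this task's input data

    @dataclass
    class Job:
        pending: list            # waiting tasks, FIFO order
        skipped_since: float = None  # when this job first got skipped

    def pick_task(node, job, max_delay, now=time.monotonic):
        """Return ('local', task), ('nonlocal', task), or ('wait', None)."""
        for task in job.pending:
            if node in task.locations:
                job.skipped_since = None  # got a local slot; reset the clock
                return ("local", task)
        # Nothing local on this node: start (or continue) the wait clock.
        if job.skipped_since is None:
            job.skipped_since = now()
        if now() - job.skipped_since < max_delay:
            return ("wait", None)  # skip this slot, hope for a local one
        return ("nonlocal", job.pending[0])  # waited long enough; run anywhere
    ```

    The key point from the paper is that a wait of only a few seconds recovers most locality, because slots on a busy grid free up constantly.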

    Check out Matei's Eurosys 2010 paper here:

    http://www.cs.berkeley.edu/~matei/papers/2010/eurosys_delay_scheduling.pdf

    I believe this got lumped in with MAPREDUCE-706.

    Thanks
    -Todd
    On Fri, May 28, 2010 at 11:37 AM, Allen Wittenauer wrote:





    --
    Todd Lipcon
    Software Engineer, Cloudera
  • Allen Wittenauer at May 28, 2010 at 6:55 pm

    On May 28, 2010, at 11:43 AM, Todd Lipcon wrote:


    Thanks. I'll take a look at this and see how applicable this is to Capacity.
  • Arun C Murthy at May 29, 2010 at 1:10 am

    Thanks. I'll take a look at this and see how applicable this is to
    Capacity.
    The CapacityScheduler has the fix over at MAPREDUCE-517. The patch
    needs some more tweaking, but it's nearly there.

    Arun

Discussion Overview
group: mapreduce-user @ hadoop.apache.org...
categories: hadoop
posted: May 28, '10 at 6:38p
active: May 29, '10 at 1:10a
posts: 4
users: 3
irc: #hadoop
