FAQ
Hi, all
I wonder how Hadoop schedules mappers and reducers (e.g., considering
load balancing and affinity to data). For example, how does it decide on
which nodes mappers and reducers are executed, and when?
Thanks!

Gerald


  • Jeff Zhang at Oct 29, 2010 at 7:15 am
Each TaskTracker tells the JobTracker how many free slots it has through
its heartbeat, and the JobTracker then chooses the best TaskTracker for
each task, taking data locality into consideration.
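The heartbeat-driven choice above can be sketched in a few lines. This is a hypothetical simplification, not Hadoop's actual classes: each heartbeat reports a free-slot count, and the JobTracker prefers trackers that hold a replica of the task's input, breaking ties by free slots.

```java
import java.util.*;

public class PickTracker {
    // Pick a tracker with capacity, preferring ones holding an input replica;
    // among equally local trackers, prefer the one with more free slots.
    static String pick(Map<String, Integer> freeSlots, Set<String> replicaHosts) {
        String best = null;
        for (Map.Entry<String, Integer> e : freeSlots.entrySet()) {
            if (e.getValue() <= 0) continue; // no capacity on this tracker
            boolean local = replicaHosts.contains(e.getKey());
            if (best == null
                || (local && !replicaHosts.contains(best))
                || (local == replicaHosts.contains(best)
                    && e.getValue() > freeSlots.get(best))) {
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Integer> slots = new LinkedHashMap<>();
        slots.put("nodeA", 1);
        slots.put("nodeB", 4); // most free slots, but holds no replica
        slots.put("nodeC", 2);
        Set<String> replicas = new HashSet<>(Arrays.asList("nodeA", "nodeC"));
        System.out.println(pick(slots, replicas)); // nodeC: local, and freer than nodeA
    }
}
```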

    --
    Best Regards

    Jeff Zhang
  • Harsh J at Oct 29, 2010 at 2:11 pm
    Hello,
    On Fri, Oct 29, 2010 at 12:45 PM, Jeff Zhang wrote:
Each TaskTracker tells the JobTracker how many free slots it has through
its heartbeat, and the JobTracker then chooses the best TaskTracker for
each task, taking data locality into consideration.
Yes. To add some more: a scheduler is responsible for assigning tasks
(based on various stats, including data locality) to the proper
TaskTrackers. TaskScheduler.assignTasks(TaskTracker) is used to hand a
TaskTracker its tasks, and the scheduler type is configurable (some
examples are the Eager/FIFO scheduler, the Capacity scheduler, etc.).

This scheduling is done when a heartbeat response is about to be sent
back to a TaskTracker that called JobTracker.heartbeat(...).
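The FIFO flavor of this pluggable contract can be sketched as follows. The types here are made up for illustration (they are not Hadoop's API): on each heartbeat the JobTracker asks the configured scheduler which tasks the reporting tracker should run next, and a FIFO policy simply drains the oldest job's queue first.

```java
import java.util.*;

public class FifoSchedulerSketch {
    record PendingTask(String jobId, String taskId) {}

    // Jobs in submission (FIFO) order, each with its remaining tasks.
    private final LinkedHashMap<String, Deque<PendingTask>> jobs = new LinkedHashMap<>();

    void submit(String jobId, List<String> taskIds) {
        Deque<PendingTask> q = new ArrayDeque<>();
        for (String t : taskIds) q.add(new PendingTask(jobId, t));
        jobs.put(jobId, q);
    }

    // Called per heartbeat: hand out up to freeSlots tasks, oldest job first.
    List<PendingTask> assignTasks(int freeSlots) {
        List<PendingTask> out = new ArrayList<>();
        for (Deque<PendingTask> q : jobs.values()) {
            while (out.size() < freeSlots && !q.isEmpty()) out.add(q.poll());
            if (out.size() == freeSlots) break;
        }
        return out;
    }

    public static void main(String[] args) {
        FifoSchedulerSketch s = new FifoSchedulerSketch();
        s.submit("job1", List.of("m1", "m2"));
        s.submit("job2", List.of("m3"));
        System.out.println(s.assignTasks(3)); // job1's tasks first, then job2's
    }
}
```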


    --
    Harsh J
    www.harshj.com
  • Zhenhua Guo at Nov 1, 2010 at 2:49 am
    Thanks!
One more question: is the input file replicated to each node where a
mapper runs, or is just the portion processed by that mapper
transferred?

    Gerald
  • Harsh J at Nov 1, 2010 at 3:36 am
    Hi,
    On Mon, Nov 1, 2010 at 8:19 AM, Zhenhua Guo wrote:
    Thanks!
    One more question. Is the input file replicated on each node where a
    mapper is run? Or just the portion processed by a mapper is
    transferred?
With HDFS in use, this is what happens: mappers are run on nodes
where the input file's blocks are already present [data-local map
tasks]. If no TaskTracker slot is available on such a node for the
mapper to run, it is run somewhere else and the input block (the
"portion processed by a mapper") is fetched from one of the DataNodes
in the same rack [rack-local map tasks].
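The distinction can be stated as a tiny classification rule. This is a toy sketch with made-up names, not the Hadoop API: a map task is "data-local" if the chosen tracker hosts a replica of its input block, "rack-local" if only its rack does, and off-rack otherwise.

```java
import java.util.Set;

public class MapLocality {
    static String classify(String host, String rack,
                           Set<String> replicaHosts, Set<String> replicaRacks) {
        if (replicaHosts.contains(host)) return "data-local";  // block is on this node
        if (replicaRacks.contains(rack)) return "rack-local";  // block is in this rack
        return "off-rack";                                     // block must cross racks
    }

    public static void main(String[] args) {
        Set<String> hosts = Set.of("nodeA", "nodeB");
        Set<String> racks = Set.of("rack1");
        System.out.println(classify("nodeA", "rack1", hosts, racks)); // data-local
        System.out.println(classify("nodeC", "rack1", hosts, racks)); // rack-local
        System.out.println(classify("nodeD", "rack9", hosts, racks)); // off-rack
    }
}
```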

    --
    Harsh J
    www.harshj.com
  • He Chen at Nov 1, 2010 at 3:43 am
If you use the default scheduler of Hadoop 0.20.2 or higher, the
JobQueueTaskScheduler takes data locality into account. That means that
when a heartbeat from a TT arrives, the JT first checks a cache that
maps each node to the data-local tasks that node has. The JT assigns
node-local tasks first, then rack-local, non-local, recovery, and
speculative tasks, if the tasks have default priorities.

If a TT gets a non-local task, it will query the nodes that have the
data in order to finish the task; you can also decide whether to keep the
fetched data on this TT by configuring the Hadoop mapred-site.xml file.

BTW, even if a TT gets a data-local task, it may also ask the other
owners of the data (if you have more than one replica) for data to
accelerate processing. (This is my understanding; can anyone confirm?)
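The cache-then-fall-through behavior described above can be sketched like this. The code is a hypothetical simplification, not Hadoop's actual implementation: pending tasks are indexed by the node and rack holding their input, and a heartbeat is served by trying node-local, then rack-local, then any remaining task.

```java
import java.util.*;

public class LocalityCache {
    private final Map<String, Deque<String>> byNode = new HashMap<>();
    private final Map<String, Deque<String>> byRack = new HashMap<>();
    private final Deque<String> all = new ArrayDeque<>();
    private final Set<String> unassigned = new HashSet<>();

    void add(String task, String inputNode, String inputRack) {
        byNode.computeIfAbsent(inputNode, k -> new ArrayDeque<>()).add(task);
        byRack.computeIfAbsent(inputRack, k -> new ArrayDeque<>()).add(task);
        all.add(task);
        unassigned.add(task);
    }

    // Pop the first still-unassigned task, skipping entries handed out elsewhere.
    private String poll(Deque<String> q) {
        while (q != null && !q.isEmpty()) {
            String t = q.poll();
            if (unassigned.remove(t)) return t;
        }
        return null;
    }

    String assign(String heartbeatNode, String heartbeatRack) {
        String t = poll(byNode.get(heartbeatNode));         // node-local first
        if (t == null) t = poll(byRack.get(heartbeatRack)); // then rack-local
        if (t == null) t = poll(all);                       // then non-local
        return t;
    }

    public static void main(String[] args) {
        LocalityCache c = new LocalityCache();
        c.add("t1", "nodeA", "rack1");
        c.add("t2", "nodeB", "rack1");
        c.add("t3", "nodeC", "rack2");
        System.out.println(c.assign("nodeA", "rack1")); // t1 (node-local)
        System.out.println(c.assign("nodeX", "rack1")); // t2 (rack-local)
        System.out.println(c.assign("nodeX", "rackY")); // t3 (non-local)
    }
}
```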

    Hope this will help.

    Chen
  • Hemanth Yamijala at Nov 1, 2010 at 4:01 am
    Hi,
    On Mon, Nov 1, 2010 at 9:13 AM, He Chen wrote:
    If you use the default scheduler of hadoop 0.20.2 or higher. The
    jobQueueScheduler will take the data locality into account.
This is true irrespective of the scheduler in use. Other schedulers
currently add a layer that decides which job to pick up first, based on
the constraints they choose to satisfy, like fairness or queue
capacities. Once a job is picked up, the logic for picking a task within
the job is currently in framework code that all schedulers use.
BTW, even if a TT gets a data-local task, it may also ask the other
owners of the data (if you have more than one replica) for data to
accelerate processing. (This is my understanding; can anyone confirm?)
    Not that I am aware of. The task's input location is used directly to
    read the data.
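The two-layer split described above can be sketched as follows. The names are hypothetical, not Hadoop's API: the pluggable scheduler only orders jobs according to its policy, and a shared framework step then picks the actual task within the first job that still has one.

```java
import java.util.*;

public class TwoLayerScheduling {
    interface JobOrdering { List<String> orderJobs(List<String> jobIds); }

    // Stand-in for a fairness-style policy: fewest running tasks first.
    static JobOrdering fewestRunningFirst(Map<String, Integer> running) {
        return jobs -> jobs.stream()
                .sorted(Comparator.comparingInt(running::get))
                .toList();
    }

    // Shared framework step: take a task from the first job that has one.
    static String pickTask(List<String> orderedJobs, Map<String, Deque<String>> pending) {
        for (String j : orderedJobs) {
            Deque<String> q = pending.get(j);
            if (q != null && !q.isEmpty()) return q.poll();
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, Integer> running = Map.of("jobA", 5, "jobB", 1);
        Map<String, Deque<String>> pending = new HashMap<>();
        pending.put("jobA", new ArrayDeque<>(List.of("a1")));
        pending.put("jobB", new ArrayDeque<>(List.of("b1")));
        List<String> order = fewestRunningFirst(running).orderJobs(List.of("jobA", "jobB"));
        System.out.println(pickTask(order, pending)); // jobB runs fewer tasks, so b1
    }
}
```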

    Thanks
    Hemanth
  • Zhenhua Guo at Nov 3, 2010 at 9:41 pm
Thanks, Jeff, Harsh, He, Hemanth. That information is quite helpful!

    Gerald

Discussion Overview
group: common-dev
categories: hadoop
posted: Oct 28, '10 at 9:53p
active: Nov 3, '10 at 9:41p
posts: 8
users: 5
website: hadoop.apache.org...
irc: #hadoop
