FAQ
Hello,
I recall asking this question before, but this is in addition to what I
asked.
Firstly, to recap my question and Arun's specific response:

-- On May 20, 2008, at 9:03 AM, Saptarshi Guha wrote:
-- Hello,
-- Does the "Data-local map tasks" counter mean the number of tasks
-- that had the input data already present on the machine they
-- are running on?
-- i.e. there wasn't a need to ship the data to them.

Response from Arun
-- Yes. Your understanding is correct. More specifically, it means that
-- the map task got scheduled on a machine on which one of the
-- replicas of its input split block was present and was served by
-- the datanode running on that machine. *smile* Arun


Now, is Hadoop designed to schedule a map task on a machine which has
one of the replicas of its input split block?
Failing that, does it then assign the map task to a machine close to one
that contains a replica of its input split block?
Are there any performance metrics for this?

Many thanks
Saptarshi


Saptarshi Guha | saptarshi.guha@gmail.com | http://www.stat.purdue.edu/~sguha


  • Heyongqiang at Jul 1, 2008 at 1:17 am
    Hadoop does not implement a clever task scheduler. When a data node heartbeats with the namenode and wants a job, it simply gets one.
    The selection does not consider the task's input file at all.





    Best regards,

    Yongqiang He
    2008-06-25



    From: Saptarshi Guha
    Sent: 2008-06-30 21:12:24
    To: core-user@hadoop.apache.org
    Cc:
    Subject: Data-local tasks

  • Amar Kamat at Jul 1, 2008 at 4:40 am

    Saptarshi Guha wrote:
    > Now, is Hadoop designed to schedule a map task on a machine which has
    > one of the replicas of its input split block?
    Yes.
    > Failing that, does it then assign the map task to a machine close to
    > one that contains a replica of its input split block?
    The scheduling is tasktracker-based rather than split-based. By that
    I mean that the tasktracker asks for a task and the JobTracker (JT)
    schedules a task to that tracker.
    If there is any split that is data-local to the tasktracker and not yet
    scheduled, it will be assigned to the tracker. If no such split can be
    found, the JT will assign a high-priority split to it. The priority
    among the splits is based on their ordering given by the jobclient. By
    default they are sorted by split size (decreasing order). A split is
    either data-local (on the same machine), rack-local (within the same
    rack), or non-local. There is no other measure of closeness. The
    scheduling problem is 'given a tasktracker, find the best split' rather
    than 'given a split, find the best/closest tracker'.
    > Are there any performance metrics for this?

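The tasktracker-driven selection described in Amar's reply can be sketched roughly as follows. This is only an illustration of the rule as stated in the thread (data-local split first, otherwise the highest-priority pending split), not Hadoop's actual JobTracker code. The names `Split`, `order_splits`, `pick_split`, and the `rack_of` mapping are hypothetical and were made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Split:
    """Hypothetical stand-in for an input split (not a Hadoop class)."""
    name: str
    size: int
    hosts: frozenset  # machines holding a replica of the split's block


def order_splits(splits):
    """Default priority from the thread: larger splits come first."""
    return sorted(splits, key=lambda s: s.size, reverse=True)


def pick_split(tracker_host, pending, rack_of):
    """Given a tasktracker asking for work, choose a split for it.

    `pending` must already be in priority order. Returns a pair
    (split, locality), or (None, None) when nothing is left.
    The caller is responsible for removing the chosen split.
    """
    if not pending:
        return None, None

    # Rule 1: any unscheduled split with a replica on this machine wins.
    for split in pending:
        if tracker_host in split.hosts:
            return split, "data-local"

    # Rule 2: otherwise hand out the highest-priority split and only
    # report how local it happens to be (assumption: no rack preference
    # is applied here, since the thread does not describe one).
    split = pending[0]
    tracker_rack = rack_of.get(tracker_host)
    same_rack = tracker_rack is not None and any(
        rack_of.get(h) == tracker_rack for h in split.hosts
    )
    return split, "rack-local" if same_rack else "non-local"


if __name__ == "__main__":
    rack_of = {"node1": "rackA", "node2": "rackA", "node3": "rackB"}
    pending = order_splits([
        Split("s1", 64, frozenset({"node1"})),
        Split("s2", 128, frozenset({"node3"})),
        Split("s3", 32, frozenset({"node2"})),
    ])
    for host in ("node2", "node4"):
        split, locality = pick_split(host, pending, rack_of)
        print(host, split.name, locality)
```

Note that the function takes the tracker as its input and searches the splits, which mirrors the 'given a tasktracker, find the best split' framing; nothing in it looks for the closest tracker for a given split.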

Discussion Overview
group: common-user @
categories: hadoop
posted: Jun 30, '08 at 1:42p
active: Jul 1, '08 at 5:08a
posts: 4
users: 3
website: hadoop.apache.org...
irc: #hadoop
