Hi,
I had some questions specifically on the Map-Reduce phase:

[1] For the reduce phase, the TaskTrackers on the reduce nodes poll the JobTracker to learn which maps have completed; when the JobTracker reports completed maps, the reducer then pulls the data from the nodes where those maps ran. This is a "pull" model, as opposed to a "push" model in which the map directly sends a region of its output to the appropriate reduce node. Is the pull model the default in 0.20, 0.23, etc.?

In the pull model, how does the reduce node know it is responsible for a particular region of map output? (Is this determined up front? Where does it get this information?)

[2] There can be multiple reduce tasks per reduce node, and the number of reduce tasks is configurable. What about the number of reduce nodes? How is that determined?

[3] Pre-0.23, the map/reduce task slots for a node are allocated statically. Is this based purely on configuration?

Thanks in advance!


  • Real great.. at Dec 16, 2011 at 1:44 pm
    [1]. I think the reducers are allocated a space before execution begins,
    and it depends on the number of reducers. If I am not mistaken, hash
    logic is used to implement this.
    [2]. I do not think we can determine the 'number' of reduce nodes. I
    assume it is determined by load conditions and the number of free reduce
    slots per node.
    [3]. Didn't get the third one.
    --
    Regards,
    R.V.
  • Harsh J at Dec 16, 2011 at 2:23 pm

    On Fri, Dec 16, 2011 at 7:03 PM, Ann Pal wrote:
    Hi,
    I had some questions specifically on the Map-Reduce phase:

    [1] For the reduce phase, the TaskTrackers corresponding to the reduce node,
    poll the Job Tracker to know about maps that have completed and if the
    Jobtracker informs it about maps that are complete, it then pulls the data
    from the map node where the map is complete. This is a "pull" model as
    opposed to "push" model where the map directly sends a region of the map
    output to the appropriate reduce node. Is the pull model the default  in
    0.20, 0.23 etc ?
    Yes, it is.
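As an illustration of the pull model only (the class and method names below are invented for the sketch, not Hadoop's actual API), the shuffle can be thought of as reducers polling a tracker for completed maps and then fetching their own partition from each:

```python
# Toy simulation of the "pull" shuffle model. Illustrative only:
# JobTracker/Reducer here are invented stand-ins, not Hadoop classes.

class JobTracker:
    def __init__(self):
        self.completed_maps = []          # (map_id, node) pairs

    def report_map_done(self, map_id, node):
        self.completed_maps.append((map_id, node))

    def poll(self, already_fetched):
        # Reducers poll; the tracker reports only maps they haven't seen yet.
        return [m for m in self.completed_maps if m[0] not in already_fetched]

class Reducer:
    def __init__(self, reducer_id, tracker):
        self.id = reducer_id
        self.tracker = tracker
        self.fetched = set()
        self.data = []

    def shuffle_step(self, map_outputs):
        # Pull model: the reducer initiates each transfer, taking only the
        # partition whose ID matches its own reducer ID.
        for map_id, node in self.tracker.poll(self.fetched):
            self.data.extend(map_outputs[map_id][self.id])
            self.fetched.add(map_id)

tracker = JobTracker()
# map_outputs[map_id][partition_id] -> list of values
map_outputs = {"m0": {0: ["a"], 1: ["b"]}, "m1": {0: ["c"], 1: ["d"]}}
r0 = Reducer(0, tracker)

tracker.report_map_done("m0", "nodeA")
r0.shuffle_step(map_outputs)              # pulls m0's partition 0
tracker.report_map_done("m1", "nodeB")
r0.shuffle_step(map_outputs)              # pulls m1's partition 0
print(r0.data)                            # -> ['a', 'c']
```

Note how the transfer rate is set by the puller: each reducer fetches as it is able, rather than every finished map pushing at it simultaneously.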
    In the pull model, how does the Reduce node know it is responsible for a
    particular region of map output? (Is this determined up front? From where it
    gets this information?)
The reducer ID is the same as the partition ID from the map side. Thereby,
reducer 0 will pull every map's 0th partition, and so on.
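Concretely, Hadoop's default HashPartitioner derives the partition from the key's hash, roughly `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`. The sketch below mimics that logic in Python (Python's `hash()` stands in for Java's `hashCode()`, so the exact partition numbers differ; the mapping behavior is the point):

```python
# Mimics the logic of Hadoop's default HashPartitioner:
#   partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
# Python's hash() stands in for Java's hashCode() here.

def partition_for(key, num_reduce_tasks):
    # Mask to a non-negative value, then take the modulus, so every
    # occurrence of the same key lands in the same partition.
    return (hash(key) & 0x7FFFFFFF) % num_reduce_tasks

num_reduce_tasks = 3
keys = ["apple", "banana", "apple", "cherry"]
parts = [partition_for(k, num_reduce_tasks) for k in keys]

# The same key always maps to the same partition, hence to the same reducer.
assert parts[0] == parts[2]
# Every partition ID is a valid reducer ID.
assert all(0 <= p < num_reduce_tasks for p in parts)
```

Since the number of partitions equals the number of reduce tasks configured for the job (set via `JobConf.setNumReduceTasks()` in the Java API), partition IDs and reducer IDs range over exactly the same values, which is how each reducer knows which region of every map's output is its own.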
    [2]There can be multiple reduce tasks per reduce node. The number of reduce
    tasks is configurable, How about the number of reduce nodes? How is this
    determined?
Reducers are assigned by the scheduler in use. They are assigned once
per TT heartbeat, to get a roughly equal distribution today, but there
is no way to induce locality of reducers from the user program itself.
So far I've not seen a case where that would be absolutely necessary,
as today's schedulers are pretty capable of doing things right.
    [3]Pre 0.23, The map/reduce tasks slots for a node are allocated statically
    . Is this based on just configuration ?
    What do you mean by 'allocated statically' here? Are you talking about
    slot limit configurations?

    --
    Harsh J
  • Ann Pal at Dec 16, 2011 at 9:24 pm
    Thanks a lot for your answers!
    For [1]: With the "pull" model, the chances of seeing a TCP-incast problem (where multiple map nodes send data to the same reduce node at the same time) are minimal, since the reducer is responsible for retrieving only the data it can handle. Is this a valid assumption?

    For [3]: What I meant was, how does the infrastructure decide how many map/reduce slots are present on a given node? Is it based on the capacity (memory/CPU) of the node?

    Thanks again in advance..

  • Bejoy Ks at Dec 17, 2011 at 3:13 pm
    Ann
    Adding on to the responses: the map outputs are transferred to the
    corresponding reducer over HTTP, not over raw TCP connections managed
    by the framework.
    The available hardware certainly decides the maximum number of tasks
    a node can handle; it depends on the number of cores, the available
    physical memory, etc. But that is not the only deciding factor. There
    are others, such as:
    - what purposes your cluster is used for
    - if you use HBase on your cluster, the hardware you need to allocate
    for it
    - the memory requirements of the typical jobs on your cluster, etc.
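To make the "just configuration" part of [3] concrete: pre-0.23, the per-node slot counts are static settings the admin puts in mapred-site.xml on each TaskTracker, typically sized by hand to the node's cores and memory. A minimal fragment (the values here are examples, not recommendations):

```xml
<!-- mapred-site.xml on each TaskTracker node (pre-0.23).
     Slot counts are static per-node configuration; size the values
     to the node's cores and memory. -->
<configuration>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
</configuration>
```

The TaskTracker reads these at startup and advertises that many free slots in its heartbeats; the scheduler then fills them.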

    Regards
    Bejoy.K.S

Discussion Overview
group: mapreduce-user @ hadoop
posted: Dec 16, '11 at 1:33p
active: Dec 17, '11 at 3:13p
posts: 5
users: 4
website: hadoop.apache.org...
irc: #hadoop
