Hi all,

I had some queries about map tasks' awareness of their input. From what
I understand, every map task instance is destined to process the data in
a specific input split (which can span HDFS blocks).

1) Do these map tasks have a unique instance number? If yes, how are
they mapped to their specific input splits, and what parameters does the
mapping use (for example, map task number to input file byte offset)?
Where exactly is this mapping kept (at what level: JobTracker,
TaskTracker, or each task)?

2) Coming to a practical scenario: I run Hadoop in local mode with a
MapReduce job of 10 maps. Since there is inherent JVM parallelism (say
the node can afford to run two map task JVMs simultaneously), I assume
that some map tasks run concurrently. Since HDFS does not play a role in
this case, how is the map-task-to-input-split mapping carried out? Is
there a concept of an input split at all, or will all the maps start
scanning from the start of the input file?

Please help me with these queries.

Thanks,
Matthew John


  • Harsh J at Apr 1, 2011 at 8:20 am
    Hey Matthew,

    You can gain some more knowledge on this by reading up on how the
    MapReduce parts interact with their DFS counterparts in Hadoop's
    architecture.

    Yahoo's resources carry a good graphical representation and
    description, for starters:
    http://developer.yahoo.com/hadoop/tutorial/module4.html#dataflow

    On Wed, Mar 30, 2011 at 11:51 AM, Matthew John wrote:
    > Hi all,
    >
    > I had some queries about map tasks' awareness of their input. From
    > what I understand, every map task instance is destined to process the
    > data in a specific input split (which can span HDFS blocks).

    > 1) Do these map tasks have a unique instance number?
    Yes, all tasks carry a unique ID.
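    For illustration, here's a minimal sketch of a mapper printing its own
    ID (this uses the new org.apache.hadoop.mapreduce API; the class name
    IdAwareMapper is made up). The attempt ID embeds the job ID, the task
    type (m/r), the task number, and the attempt number:

        import java.io.IOException;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;

        public class IdAwareMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
            @Override
            protected void setup(Context context) throws IOException, InterruptedException {
                // Prints something like attempt_201103301021_0001_m_000003_0:
                // job ID, task type (m), task number, attempt number.
                System.err.println("Running as " + context.getTaskAttemptID());
            }
        }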
    > If yes, how are they mapped to their specific input splits, and what
    > parameters does the mapping use (for example, map task number to
    > input file byte offset)? Where exactly is this mapping kept (at what
    > level: JobTracker, TaskTracker, or each task)?
    Roughly speaking, one TaskInProgress (TIP) object is generated per
    InputSplit, and a list of these is kept by the JobInProgress (JIP)
    object in the JobTracker's memory. The scheduler is then responsible
    for choosing the right TaskTracker for each of the to-be-run TIPs.
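    The split list itself comes from InputFormat.getSplits(), which the
    job client computes at submission time. A rough sketch that dumps the
    map-number-to-split mapping for an input path (new-API class names;
    SplitDump is a made-up name, and the JobTracker-internal TIP
    bookkeeping isn't user-visible):

        import java.util.List;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.mapreduce.InputSplit;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

        public class SplitDump {
            public static void main(String[] args) throws Exception {
                Job job = new Job(new Configuration());
                FileInputFormat.addInputPath(job, new Path(args[0]));
                // Index i corresponds to map task number i; each split
                // records the file, byte offset, length, and preferred hosts.
                List<InputSplit> splits = new TextInputFormat().getSplits(job);
                for (int i = 0; i < splits.size(); i++) {
                    System.out.println("map " + i + " -> " + splits.get(i));
                }
            }
        }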
    > 2) Coming to a practical scenario: I run Hadoop in local mode with a
    > MapReduce job of 10 maps. Since there is inherent JVM parallelism
    > (say the node can afford to run two map task JVMs simultaneously), I
    > assume that some map tasks run concurrently. Since HDFS does not
    > play a role in this case, how is the map-task-to-input-split mapping
    > carried out? Is there a concept of an input split at all, or will
    > all the maps start scanning from the start of the input file?
    In 'local' mode, splits are still generated, using simple seek offsets
    with a default or supplied split size. Every map task then seeks to
    its assigned split's start and processes until the split's end is
    reached.
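    Just to illustrate the arithmetic with made-up numbers (no Hadoop APIs
    involved here):

        public class LocalSplitMath {
            public static void main(String[] args) {
                long fileLength = 1000000L; // hypothetical local file size, in bytes
                long splitSize = 131072L;   // default or user-supplied split size
                int task = 0;
                for (long start = 0; start < fileLength; start += splitSize) {
                    // Each map task seeks to its own start offset, so the
                    // maps do NOT all scan from the beginning of the file.
                    long length = Math.min(splitSize, fileLength - start);
                    System.out.println("map " + task++ + ": seek to " + start
                            + ", read " + length + " bytes");
                }
            }
        }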

    P.S. If you're delving into the code of a current release, give
    http://wiki.apache.org/hadoop/HadoopMapRedClasses a read. Pretty
    helpful before you dive in.
