FAQ
I have a general question about how the number of mapper tasks is
calculated. As far as I know, the number is primarily based on the number
of input splits: if I have 5 splits and 10 TaskTrackers running in the
cluster, I will have 5 mapper tasks running in my MR job, right?
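
For context on where that number really comes from: in the old-API
FileInputFormat, mapred.map.tasks is only a hint, and the actual split size
is clamped by the DFS block size. A simplified sketch of that computation
(the 128 MB block size here is an assumption) shows why a 5 GB uncompressed
file can produce around 40 splits:

// Simplified sketch of the split-size math in the old-API
// FileInputFormat.getSplits() (Hadoop 0.20.x era); sizes are assumed.
public class SplitMath {
    public static void main(String[] args) {
        long totalSize = 5L * 1024 * 1024 * 1024; // 5 GB of input, as described
        int  numSplits = 5;                       // the mapred.map.tasks hint
        long blockSize = 128L * 1024 * 1024;      // assumed DFS block size
        long minSize   = 1;                       // mapred.min.split.size default

        long goalSize  = totalSize / Math.max(numSplits, 1);
        long splitSize = Math.max(minSize, Math.min(goalSize, blockSize));

        // goalSize is 1 GB, but it is clamped down to the 128 MB block size,
        // so the 5 GB file yields 40 splits -- and therefore ~40 map tasks.
        System.out.println(totalSize / splitSize); // prints 40
    }
}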

But what I found is that sometimes, when the input is huge (5 GB), I still
have 5 splits (which is on purpose), yet I get more than 40 mapper tasks
running. How does this happen? And if I compress the huge input down to a
smaller size, the number of mappers goes back to 5. Is something tricky
happening here related to the DFS block locations of the input?
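
One plausible explanation for the compressed case (an assumption, since the
codec is not named in the thread): gzip-style codecs are not splittable, so
each compressed file becomes exactly one split regardless of its size. The
stock text input format makes that decision roughly like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

// Simplified sketch of the splittability test in the old-API
// TextInputFormat (Hadoop 0.20.x era); not the exact source.
public class SplittabilityCheck {
    private final CompressionCodecFactory codecs;

    public SplittabilityCheck(Configuration conf) {
        this.codecs = new CompressionCodecFactory(conf);
    }

    protected boolean isSplitable(FileSystem fs, Path file) {
        CompressionCodec codec = codecs.getCodec(file);
        // No codec: a plain file, splittable at block boundaries.
        // A gzip-style codec: the whole file must go to a single mapper.
        return codec == null;
    }
}

With uncompressed input this falls through to block-sized splits (hence
~40 mappers), while the compressed copies come out as one split per file
(hence 5).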

BTW, our InputFormat is a special kind of FileInputFormat that does not
split each file; instead, we copy each file to DFS, and the location of the
file on DFS becomes the input key to the mapper task.
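
A minimal sketch of such a non-splitting format under the old mapred API
(the class name and record reader here are illustrative, since the actual
code is not shown). One caveat worth checking: if isSplitable() is not
overridden to return false, the base FileInputFormat will still split large
uncompressed files at block boundaries, which would match the 40-mapper
behavior described above.

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Hypothetical sketch of the kind of FileInputFormat described above:
// one split per file, with the file's DFS path as the mapper's input key.
public class WholeFileInputFormat extends FileInputFormat<Text, NullWritable> {

    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false; // never split: one mapper per file
    }

    @Override
    public RecordReader<Text, NullWritable> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        final Path path = ((FileSplit) split).getPath();
        return new RecordReader<Text, NullWritable>() {
            private boolean done = false;
            public boolean next(Text key, NullWritable value) {
                if (done) return false;
                key.set(path.toString()); // the DFS location as the input key
                done = true;
                return true;
            }
            public Text createKey() { return new Text(); }
            public NullWritable createValue() { return NullWritable.get(); }
            public long getPos() { return done ? 1 : 0; }
            public void close() {}
            public float getProgress() { return done ? 1.0f : 0.0f; }
        };
    }
}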

--
--Anfernee

  • Chiku Singh at Jul 26, 2011 at 3:54 am
    What is your use case? Why would you want to use only 5 mappers rather
    than all 10 TaskTrackers?

    "If an individual file is so large that it will affect seek time it will be
    split to several Splits" (http://wiki.apache.org/hadoop/HadoopMapReduce)

    "if a split span over more than one dfs block, you lose the data locality
    scheduling benefits." (https://issues.apache.org/jira/browse/HADOOP-2560)
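
    If the aim were instead to spread the work across all 10 TaskTrackers,
    the usual old-API knobs are the map-count hint and the split-size
    bounds. A sketch, assuming splittable input and placeholder job class,
    input, and output paths:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    // Sketch of the knobs that influence map-task count (old mapred API).
    // MapCountHint and the args[] paths are placeholders.
    public class MapCountHint {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(MapCountHint.class);
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            // A hint only: the real count still comes from getSplits().
            conf.setNumMapTasks(10);

            // Raising the minimum split size is the reliable way to push
            // the split count down for large, splittable inputs.
            conf.setLong("mapred.min.split.size", 512L * 1024 * 1024);

            JobClient.runJob(conf);
        }
    }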
