FAQ
Hello!
I would like to know how Hadoop is computing the number of mappers when CombineFileInputFormat is used? I have read the API specification for CombineFileInputFormat (http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/CombineFileInputFormat.html), but unfortunately I could not understand the way that the input splits are computed.
We have a cluster with 10 Data nodes, and data files (mapfile) spread over them.
We have used this InputFormat in our M/R jobs. Reading the spec, we have to distinguish betwenn three scenarios:
1. if the mapred.max.split.size property it is not specfied, then we will have one mapper. This behavior is correct regarding the spec:
"If maxSplitSize is not specified, then blocks from the same rack are combined in a single split; no attempt is made to create node-local splits"
2. mapred.max.split.size value specified and equals with the the block size (our case 64M). According to spec: "If the maxSplitSize is equal to the block size, then this class is similar to the default spliting behaviour in Hadoop: each block is a locally processed split."
Question: So for my understanding, in this case, the number of splits it is calculated the same as in the case when you don't use CombineFileInputFormat?
3. According to spec: "If a maxSplitSize is specified, then blocks on the same node are combined to form a single split. Blocks that are left over are then combined with other blocks in the same rack"
Question 1: From the above, I have understood that the number of splits is equal with the number of nodes ("blocks on the same node are combined to form a single split"). I have observed that is not the case. So how you compute?
Question 2. What is the best practice to set up the mapred.max.split.size(maxSplitSize) greater or equal with the block size?
(In my opinion, I'll use the same size as block size in order do not loose data locality, but please correct me if I'm wrong)

In the spec, it is stated that "A split cannot have files from different pools". What means pool? A datanode?

I'll look forward for your answers.
Thank you,
Florin

Search Discussions

  • Florin P at Sep 21, 2011 at 8:00 pm
    Hello1
    I would like to know how Hadoop is
    computing the number of mappers when CombineFileInputFormat
    is used? I have read the API specification for
    CombineFileInputFormat (http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/CombineFileInputFormat.html),
    but unfortunately I could not understand the way that the
    input splits are computed.
    We have a cluster with 10 Data nodes, and data files
    (mapfile) spread over them.
    We have used this InputFormat in our M/R jobs. Reading the
    spec, we have to distinguish betwenn three scenarios:
    1. if the  mapred.max.split.size property it is
    not specified, then we will have one mapper. This behavior is
    correct regarding the spec:
    "If maxSplitSize is not specified, then blocks from the
    same rack are combined in a single split; no attempt is made
    to create node-local splits"
    2. mapred.max.split.size value specified and equals with
    the the block size (our case 64M). According to spec: "If
    the maxSplitSize is equal to the block size, then this class
    is similar to the default spliting behaviour in Hadoop: each
    block is a locally processed split."
    Question: So for my understanding, in this case, the number
    of splits it is calculated the same as in the case when you
    don't use CombineFileInputFormat?
    3. According to spec: "If a maxSplitSize is specified, then
    blocks on the same node are combined to form a single split.
    Blocks that are left over are then combined with other
    blocks in the same rack"
    Question 1: From the above, I have
    understood that the number of splits is equal with the
    number of nodes ("blocks on the same node are combined to
    form a single split"). I have observed that is not the case.
    So how you compute?
    Question 2. What is the best practice to set up the
    mapred.max.split.size(maxSplitSize) greater or equal with
    the block size?
    (In my opinion, I'll use the same size as block size
    in order do not loose data locality, but please correct me
    if I'm wrong)

    In the spec, it is stated that "A split cannot have files
    from different pools". What means pool? A datanode?

    I'll look forward for your answers.
    Thank you,
    Florin

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupmapreduce-user @
categorieshadoop
postedAug 12, '11 at 6:36a
activeSep 21, '11 at 8:00p
posts2
users1
websitehadoop.apache.org...
irc#hadoop

1 user in discussion

Florin P: 2 posts

People

Translate

site design / logo © 2021 Grokbase