FAQ
Hi everyone

According to my understanding of Hadoop, it saves a MapReduce job's
intermediate results into files on the mapper's hard drive, and each
key will occupy its own file. I am curious what will happen if the
mapper's hard drive does not have enough inodes to hold all the
generated keys, because every file needs an inode.

Best wishes!

Chen


  • Harsh J at Sep 25, 2011 at 7:50 pm
    Chen,

    Files are stored based on the reducer partitions, not per key. The
    result is that there are far fewer files than you might imagine there
    ought to be. The keys are kept sorted inside the partitioned files, so
    you do not lose your key groups either.

    See Partitioner, which is responsible for doing the partitioning of
    your map outputs:
    (http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html#Partitioner)
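
    For reference, the default partitioner is plain hashing. A minimal
    sketch against the 0.20 mapred API (the class name below is mine, not
    Hadoop's) looks like this:

        import org.apache.hadoop.mapred.JobConf;
        import org.apache.hadoop.mapred.Partitioner;

        // Sketch of a hash-based partitioner, equivalent in spirit to the
        // default HashPartitioner: each key maps to one of numPartitions
        // buckets (one per reducer), so the map side never needs anything
        // close to one file per key.
        public class SimpleHashPartitioner<K, V> implements Partitioner<K, V> {

          public void configure(JobConf job) {
            // plain hashing needs no configuration
          }

          public int getPartition(K key, V value, int numPartitions) {
            // mask the sign bit so the result is always non-negative
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
          }
        }

    You would install it with JobConf#setPartitionerClass, the same hook a
    custom partitioner uses.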

    --
    Harsh J
  • Arun C Murthy at Sep 25, 2011 at 7:55 pm
    There is only one file per map. Actually two: an output file and an
    index file used to quickly look up the offset/length of a given
    reducer's partition.

    The index file is also cached in memory for performance.
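
    For context, on a 0.20-era TaskTracker both files live under the task
    attempt's directory inside mapred.local.dir; from memory (treat the
    exact names as an assumption) the layout is roughly:

        ${mapred.local.dir}/taskTracker/jobcache/<job-id>/<attempt-id>/output/file.out
        ${mapred.local.dir}/taskTracker/jobcache/<job-id>/<attempt-id>/output/file.out.index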

    Arun
  • He Chen at Sep 25, 2011 at 9:01 pm
    Hi Arun and Harsh J

    Thank you for your replies.

    Yes, there will be two files in the end, but while the map is running
    there can be more than two.

    The scenario I mentioned will not occur with Hadoop's default
    partitioner. But if a custom partitioner did lead to the above problem,
    is there any safeguard to prevent it?

    We all know that an unbalanced key distribution can lead to differences
    in reduce tasks' execution times, even in a homogeneous environment. It
    would be easier to rebalance skewed keys if each key occupied its own
    file.
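
    For illustration only, a hypothetical skew-aware partitioner could pin
    known-heavy keys to a dedicated partition and hash the rest; the class
    name and hot-key list below are made up:

        import java.util.Arrays;
        import java.util.HashSet;
        import java.util.Set;

        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapred.JobConf;
        import org.apache.hadoop.mapred.Partitioner;

        // Hypothetical sketch: send a known set of "hot" keys to a reserved
        // partition and spread everything else over the remaining reducers.
        // A real job would load the hot-key list from configuration or a
        // sampling pass rather than hard-coding it.
        public class SkewAwarePartitioner<V> implements Partitioner<Text, V> {

          private static final Set<String> HOT_KEYS =
              new HashSet<String>(Arrays.asList("the", "a")); // illustrative only

          public void configure(JobConf job) { }

          public int getPartition(Text key, V value, int numPartitions) {
            if (numPartitions == 1) {
              return 0;
            }
            if (HOT_KEYS.contains(key.toString())) {
              return 0; // partition 0 is reserved for the heavy keys
            }
            // spread the remaining keys over partitions 1..numPartitions-1
            return 1 + (key.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
          }
        }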

    Thanks!

    Chen
  • Arun C Murthy at Sep 26, 2011 at 4:39 am

    Irrespective of the partitioner used, a single file stores all the
    keys/values written during each 'spill', i.e. one iteration of sorting
    the records in the in-memory sort buffer and writing them out.

    You can have multiple spills, but each spill contains lots of
    keys/values; we never create a file per record. You would very quickly
    run out of inodes.

    In the very early days we had a file per reducer, and that caused huge
    issues, never mind a file per record.
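
    As a footnote, the number of spills per map in 0.20 is driven by the
    sort-buffer settings; a minimal sketch of tuning them from the job
    configuration (the values are only illustrative):

        import org.apache.hadoop.mapred.JobConf;

        // Sketch: a larger sort buffer means fewer spill files per map task;
        // each spill is a single sorted, partitioned file, never a file per
        // record.
        public class SpillTuning {
          public static void main(String[] args) {
            JobConf conf = new JobConf();
            conf.setInt("io.sort.mb", 200);                // in-memory sort buffer size, in MB
            conf.setFloat("io.sort.spill.percent", 0.80f); // fill fraction that triggers a spill
            conf.setNumReduceTasks(8);                     // each spill holds 8 partitions, one per reducer
          }
        }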

    Arun
