FAQ
Hi all,

I had some doubts regarding the functioning of Hadoop MapReduce:

1) I understand that every MapReduce job is parameterized using an XML file
(with all the job configurations). So whenever I set certain parameters
in my MR code (say I set the split size to 320000 KB), it gets reflected
in the job (number of mappers). How exactly does that happen? Do the
parameters coded in the MR module override the default parameters set in the
configuration XML? And how does the JobTracker ensure that the
configuration is followed by all the TaskTrackers? What is the mechanism
followed?

2) Assume I am running cascading (chained) MR modules. In this case I feel
there is a huge overhead when the output of MR1 is written back to HDFS and
then read from there as the input of MR2. Can this be avoided? (Maybe store
it in some memory without hitting HDFS and the NameNode.) Please let me know
if there is some means of achieving this, because it would increase the
efficiency of chained MR to a great extent.

Matthew


  • Harsh J at Feb 10, 2011 at 1:08 pm
    Hello,

    On Thu, Feb 10, 2011 at 5:16 PM, Matthew John
    wrote:
    Hi all,

    I had some doubts regarding the functioning of Hadoop MapReduce:

    1) I understand that every MapReduce job is parameterized using an XML file
    (with all the job configurations). So whenever I set certain parameters
    in my MR code (say I set the split size to 320000 KB), it gets reflected
    in the job (number of mappers). How exactly does that happen? Do the
    parameters coded in the MR module override the default parameters set in the
    configuration XML? And how does the JobTracker ensure that the
    configuration is followed by all the TaskTrackers? What is the mechanism
    followed?
    Yes, your configurations are applied over the defaults that are loaded
    from Hadoop's core/etc. jars.

    A job is represented by its job file plus jars/files, where the job file
    is the 'job.xml' produced by the configuration-saving mechanism that runs
    upon submission of a job. This file is distributed by the JobTracker to
    all workers to read and utilize, as part of its submission and
    localization process. I suggest reading Hadoop's source code starting
    from the submit call.
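    To make the override behavior concrete, here is a toy model in plain
    Java of the two points above: job-level settings shadowing defaults, and
    a max split size determining the mapper count as roughly
    ceil(fileSize / splitSize). The class and method names are illustrative
    only, and the real Hadoop split computation also involves the block size
    and a minimum split size; this is just a sketch of the idea.

    ```java
    import java.util.HashMap;
    import java.util.Map;

    // A rough conceptual model (not the actual Hadoop API) of how a job's
    // configuration is resolved: defaults are loaded first, then per-job
    // settings are overlaid on top of them.
    public class ConfigModel {
        public static String resolve(Map<String, String> defaults,
                                     Map<String, String> jobSettings,
                                     String key) {
            // A job-level setting wins; otherwise fall back to the default.
            return jobSettings.getOrDefault(key, defaults.get(key));
        }

        // Mapper count for a splittable file: roughly ceil(fileSize / splitSize).
        public static long mapperCount(long fileSizeBytes, long splitSizeBytes) {
            return (fileSizeBytes + splitSizeBytes - 1) / splitSizeBytes;
        }

        public static void main(String[] args) {
            Map<String, String> defaults = new HashMap<>();
            // Illustrative 64 MB default split size.
            defaults.put("mapred.max.split.size", String.valueOf(64L * 1024 * 1024));

            Map<String, String> job = new HashMap<>();
            // ~320000 KB, as in the question; overrides the default above.
            job.put("mapred.max.split.size", String.valueOf(320_000L * 1024));

            long split = Long.parseLong(resolve(defaults, job, "mapred.max.split.size"));
            long mappers = mapperCount(1_000_000_000L, split); // a 1 GB input file
            System.out.println(split + " " + mappers); // 327680000 4
        }
    }
    ```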
    2) Assume I am running cascading (chained) MR modules. In this case I feel
    there is a huge overhead when the output of MR1 is written back to HDFS and
    then read from there as the input of MR2. Can this be avoided? (Maybe store
    it in some memory without hitting HDFS and the NameNode.) Please let me know
    if there is some means of achieving this, because it would increase the
    efficiency of chained MR to a great extent.
    Pipelining between jobs is not possible in stock Apache Hadoop. Have a
    look at HOP (the Hadoop Online Prototype), which has some of what you seek.

    --
    Harsh J
    www.harshj.com
  • Greg Roelofs at Feb 10, 2011 at 11:53 pm

    2) Assume I am running cascading (chained) MR modules. In this case I feel
    there is a huge overhead when the output of MR1 is written back to HDFS and
    then read from there as the input of MR2. Can this be avoided? (Maybe store
    it in some memory without hitting HDFS and the NameNode.) Please let me know
    if there is some means of achieving this, because it would increase the
    efficiency of chained MR to a great extent.
    Pipelining between jobs is not possible in stock Apache Hadoop. Have a
    look at HOP (the Hadoop Online Prototype), which has some of what you seek.
    It is possible under some circumstances. With ChainMapper and ChainReducer,
    if the key/value signatures of the inputs and outputs of all mappers and
    reducers are the same, then the only disk I/O is at the endpoints. Note,
    however, that there is _no_ buffering at all (just a single-element queue
    between each pair), so all maps and reduces in each ChainMapper or
    ChainReducer chain have to reside in memory simultaneously.
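    The pattern described above can be sketched in plain Java as function
    composition: each record is handed directly from one map stage to the
    next in memory, with reads and writes only at the ends of the chain.
    This is a toy model of the idea, not the Hadoop ChainMapper/ChainReducer
    API; all names here are illustrative.

    ```java
    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Function;

    // Toy model of a ChainMapper-style pipeline: several map stages composed
    // in memory, each record passed one at a time between stages (the
    // "single-element queue"), with I/O only at the endpoints.
    public class ChainModel {
        public static <T> List<T> runChain(List<T> input, List<Function<T, T>> stages) {
            List<T> output = new ArrayList<>();
            for (T record : input) {                 // read once, at the front of the chain
                T current = record;
                for (Function<T, T> stage : stages) {
                    current = stage.apply(current);  // direct handoff, no intermediate HDFS write
                }
                output.add(current);                 // write once, at the end of the chain
            }
            return output;
        }

        public static void main(String[] args) {
            List<Function<String, String>> stages = List.of(
                    String::trim,
                    s -> s.toLowerCase(),
                    s -> s.replace(' ', '_'));
            System.out.println(runChain(List.of("  Hello World  ", " Foo Bar "), stages));
            // [hello_world, foo_bar]
        }
    }
    ```

    Note how this also shows the memory constraint Greg mentions: every stage
    object in the list must exist simultaneously for the chain to run.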

    I haven't ever used them, btw, so I don't know how useful or efficient they
    are. I just came across them while working on another feature that turns
    out to be fundamentally incompatible with them...

    Greg

Discussion Overview
group: common-user
categories: hadoop
posted: Feb 10, '11 at 11:47a
active: Feb 10, '11 at 11:53p
posts: 3
users: 3
website: hadoop.apache.org...
irc: #hadoop