Hi all,
I have a couple of questions about how Hadoop MapReduce works:
1) I understand that every MapReduce job is parameterized by an XML file
(containing all the job configuration). Whenever I set certain parameters
in my MR code (say I set the split size to 320,000 KB), the change is
reflected in the job (e.g. in the number of mappers); a rough sketch of how
I am setting this is below. How exactly does that happen? Do the parameters
set in the MR code override the defaults from the configuration XML? And how
does the JobTracker ensure that the configuration is honoured by all the
TaskTrackers? What is the mechanism?
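For reference, here is roughly how I am setting the split size in my driver.
This is only a minimal sketch using the new mapreduce API: the class name and
input/output paths are placeholders, and I have left the mapper/reducer setup out.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SplitSizeDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "split-size-test");
        job.setJarByClass(SplitSizeDriver.class);

        // Cap each input split at roughly 320,000 KB; this is the setting
        // that ends up changing the number of map tasks launched.
        FileInputFormat.setMaxInputSplitSize(job, 320000L * 1024L);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}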
2) Assume I am running cascading (chained) MR jobs. In this case I feel
there is a big overhead when the output of MR1 is written back to HDFS and
then read from there as the input of MR2 (a sketch of the kind of chained
driver I mean follows below). Can this be avoided, perhaps by keeping the
intermediate data in memory without going through HDFS and the NameNode?
Please let me know if there is some way of doing this, because it would
improve the efficiency of chained MR jobs a great deal.
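To make the chaining concrete, this is the shape of driver I have in mind
(a rough sketch with hypothetical paths), where MR1's output directory on
HDFS becomes MR2's input directory:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // MR1: read the real input, write results to an intermediate HDFS dir.
        Job mr1 = new Job(conf, "MR1");
        mr1.setJarByClass(ChainDriver.class);
        FileInputFormat.addInputPath(mr1, new Path("/data/input"));          // hypothetical path
        FileOutputFormat.setOutputPath(mr1, new Path("/data/intermediate")); // written to HDFS
        if (!mr1.waitForCompletion(true)) System.exit(1);

        // MR2: read MR1's output back off HDFS -- this round trip is the
        // overhead I would like to avoid.
        Job mr2 = new Job(conf, "MR2");
        mr2.setJarByClass(ChainDriver.class);
        FileInputFormat.addInputPath(mr2, new Path("/data/intermediate"));
        FileOutputFormat.setOutputPath(mr2, new Path("/data/output"));
        System.exit(mr2.waitForCompletion(true) ? 0 : 1);
    }
}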
Matthew