at Feb 17, 2011 at 7:50 am
Do you mean profiling the data path in MapReduce? I think the general consensus is that a decent amount of time is spent in deserialization and in data copies in the HDFS stack, although of course there is work underway to improve this. For example, take a look at https://issues.apache.org/jira/browse/HDFS-347, which covers optimizations to the HDFS read path (at least for locally stored data). My guess is that these are two of the more surprising things you'll see in a profile. Of course, many jobs are IO-bound in their tasks, so this may not matter.
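If you want to collect a profile from a real job yourself, one low-effort starting point is Hadoop's built-in per-task profiling, which reruns a handful of task attempts under the JVM's HPROF agent. A sketch, assuming a Hadoop 0.20-era cluster and the bundled examples jar (property names were renamed under mapreduce.task.profile.* in later releases):

```shell
# Profile the first map and first reduce attempt of a WordCount job.
# mapred.task.profile.params mirrors the framework's default HPROF string;
# %s is replaced by the per-attempt profile output file.
hadoop jar hadoop-examples.jar wordcount \
  -D mapred.task.profile=true \
  -D mapred.task.profile.maps=0-0 \
  -D mapred.task.profile.reduces=0-0 \
  -D mapred.task.profile.params="-agentlib:hprof=cpu=samples,heap=sites,force=n,thread=y,verbose=n,file=%s" \
  input output
```

The resulting profile.out for each profiled attempt ends up alongside the task logs, where the cpu=samples section should make hotspots like deserialization show up directly in the stack-sample counts.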
On Feb 16, 2011, at 11:40 PM, Matthew John wrote:
I want to know if anyone has already done an in-depth analysis of the
MapReduce mechanism. Has anyone gone into a bytecode-level
understanding of the Map and Reduce mechanism? It would be good if we
could take a simple MapReduce job (say, WordCount) and then attempt the
analysis. Please send me pointers if there is already some work done in
this area, or help me with how to proceed with such an analysis if you
feel a specific technique, piece of software, or development
environment has ready-made plugins to help in this regard.