After mapping and reducing some data, I need to do an additional
processing step. This additional step shares the contract of a reduce
function, expecting its input data (the output of the original reduce)
to be grouped by key.
Currently, I achieve the above using two iterations:
1. MyMapper -> MyFirstReducer
2. IdentityMapper -> MySecondReducer
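To make the dataflow concrete, here is a minimal Python sketch of the two-job chain (the actual jobs are Hadoop/Java; the reducer bodies here are hypothetical stand-ins for MyFirstReducer and MySecondReducer):

```python
from itertools import groupby
from operator import itemgetter

def shuffle(pairs):
    """Group (key, value) pairs by key, as Hadoop's shuffle/sort phase does."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [(k, [v for _, v in g]) for k, g in groupby(ordered, key=itemgetter(0))]

def identity_map(pairs):
    """IdentityMapper: emits every input pair unchanged."""
    return list(pairs)

# Hypothetical stand-ins for the two reduce functions:
def first_reduce(key, values):
    return key, sum(values)   # e.g. sum the values per key

def second_reduce(key, values):
    return key, max(values)   # e.g. keep the maximum per key

# Job 1: MyMapper output -> shuffle -> MyFirstReducer
mapped = [("a", 1), ("b", 2), ("a", 3)]
job1_out = [first_reduce(k, vs) for k, vs in shuffle(mapped)]

# Job 2: IdentityMapper -> shuffle -> MySecondReducer
# The identity pass exists only to re-group data that job 1
# already emitted grouped by key.
job2_out = [second_reduce(k, vs) for k, vs in shuffle(identity_map(job1_out))]
print(job2_out)
```

Note that the identity map contributes nothing except triggering a second shuffle over data that was already grouped, which is exactly the overhead in question.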
As my project is purely academic, I am wondering whether this approach
really is the best I can do with respect to performance. Unless Hadoop
has some built-in optimization around the IdentityMapper class
(v0.18.3), I believe my current approach causes the intermediate data
between the two reduce functions to be completely read and rewritten to
HDFS for no reason.
Can Hadoop be instructed to skip running the IdentityMapper entirely in
my application? Or is there some other/better way to chain the two
reduce steps?
Thanks in advance.