After mapping and reducing some data, I need to do an additional
processing step. This additional step shares the contract of a reduce
function, expecting its input data (the output from the original reduce)
to be grouped by key.

Currently, I achieve the above using two iterations (driver sketched below):

1. MyMapper -> MyFirstReducer
2. IdentityMapper -> MySecondReducer
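
Roughly, my driver for the two jobs looks like the sketch below (old
org.apache.hadoop.mapred API; the class name, the paths, the
Text/IntWritable types and the SequenceFile formats are just
placeholders for whatever the real jobs use):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class MapReduceReduceDriver {

    public static void main(String[] args) throws Exception {
        Path input = new Path(args[0]);    // original input
        Path between = new Path(args[1]);  // output of job 1, input of job 2
        Path output = new Path(args[2]);   // final output

        // Iteration 1: MyMapper -> MyFirstReducer. The reduce output is
        // written as SequenceFiles so the second job can read the
        // keys/values back without re-parsing text.
        JobConf first = new JobConf(MapReduceReduceDriver.class);
        first.setJobName("map-reduce");
        first.setMapperClass(MyMapper.class);
        first.setReducerClass(MyFirstReducer.class);
        first.setOutputKeyClass(Text.class);          // placeholder key type
        first.setOutputValueClass(IntWritable.class); // placeholder value type
        first.setOutputFormat(SequenceFileOutputFormat.class);
        // On 0.18 the path setters may still live on JobConf itself.
        FileInputFormat.setInputPaths(first, input);
        FileOutputFormat.setOutputPath(first, between);
        JobClient.runJob(first);

        // Iteration 2: IdentityMapper re-emits each (key, value) pair
        // unchanged, so the framework shuffles and groups by key again
        // for MySecondReducer.
        JobConf second = new JobConf(MapReduceReduceDriver.class);
        second.setJobName("identity-reduce");
        second.setMapperClass(IdentityMapper.class);
        second.setReducerClass(MySecondReducer.class);
        second.setOutputKeyClass(Text.class);
        second.setOutputValueClass(IntWritable.class);
        second.setInputFormat(SequenceFileInputFormat.class);
        FileInputFormat.setInputPaths(second, between);
        FileOutputFormat.setOutputPath(second, output);
        JobClient.runJob(second);
    }
}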

As my project is purely academic, I am wondering if this approach really
is the best I can do with respect to performance. Unless Hadoop has some
built-in optimization around the IdentityMapper class (v0.18.3), I
believe my current approach causes the intermediate data between the two
reduce functions to be completely written to and read back from HDFS for
no reason.

Can Hadoop be instructed to completely skip running the IdentityMapper
in my application? Or is there some other/better way to do
"MapReduceReduce"?

Thanks in advance.

/Jørn

Discussion Overview
group: common-user
categories: hadoop
posted: Mar 3, '10 at 8:48p
active: Mar 3, '10 at 8:48p
posts: 1
users: 1 (Jørn Schou-Rode, 1 post)
website: hadoop.apache.org...
irc: #hadoop
