at Aug 13, 2011 at 3:22 pm
IdentityMapper and IdentityReducer do what their names imply: they essentially return their input as their output.
TeraSort relies on the sorting that is built into Hadoop's sort and shuffle phase.
So, the map() method in TeraSort looks like this:
map(offset, line) -> (line, _)
offset is the key passed to map() and represents the byte offset of the line (which is the value). map() returns the line as the key, along with a placeholder value that is never used.
reduce() looks like this:
reduce(line, values) -> (line)
Again, the input is returned as is. The sort and shuffle layer between map() and reduce() guarantees that the keys (lines) arrive at the reducer in sorted order. That's why the overall output is the sorted input.
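To make the flow above concrete, here is a minimal single-process sketch in Python. It is not Hadoop code and not the actual TeraSort source; the names map_fn, reduce_fn and run_job are made up for illustration, and the in-memory sort merely stands in for the sort and shuffle phase. It also assumes that reduce() emits the line once per value, so that duplicate lines survive.

```python
from itertools import groupby

def map_fn(offset, line):
    # Emit the line as the key; the value is a placeholder that is never used.
    yield (line, None)

def reduce_fn(line, values):
    # Return the key unchanged, once per value (assumption: keeps duplicate lines).
    for _ in values:
        yield line

def run_job(text):
    # Map phase: feed (byte offset, line) pairs to map_fn.
    pairs = []
    offset = 0
    for line in text.splitlines():
        pairs.extend(map_fn(offset, line))
        offset += len(line) + 1  # assumes single-byte characters and "\n" endings

    # Stand-in for sort and shuffle: order all pairs by key.
    pairs.sort(key=lambda kv: kv[0])

    # Reduce phase with a single reducer: keys arrive in sorted order.
    output = []
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        output.extend(reduce_fn(key, [value for _, value in group]))
    return output
```

Calling run_job("pear\napple\nfig\napple") returns the lines in sorted order even though neither map_fn nor reduce_fn does any sorting itself.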
This is all easy when there's just one reducer. A question to check that you have understood things so far: what is the issue with more than one reducer?
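One way to explore that question is with a small Python sketch. This is an illustration under assumptions, not Hadoop's implementation: toy_hash_partition is a made-up stand-in for hash-style partitioning, make_range_partition is a hypothetical range partitioner with hand-picked split points, and sorted() stands in for the per-reducer sort and shuffle.

```python
from bisect import bisect_right

def toy_hash_partition(key, num_reducers):
    # Hypothetical stand-in for a hash partitioner: byte sum modulo reducer count.
    return sum(key.encode("utf-8")) % num_reducers

def make_range_partition(split_points):
    # split_points must be sorted and contain num_reducers - 1 entries.
    # Keys below split_points[0] go to reducer 0, the next range to reducer 1, etc.
    def partition(key, num_reducers):
        return bisect_right(split_points, key)
    return partition

def run_with_reducers(lines, num_reducers, partition):
    # Route each line (the key) to one reducer.
    buckets = [[] for _ in range(num_reducers)]
    for line in lines:
        buckets[partition(line, num_reducers)].append(line)
    # Each reducer receives only its own keys, in sorted order.
    return [sorted(bucket) for bucket in buckets]

def concatenate(parts):
    # Read the reducer outputs back in reducer order.
    return [line for part in parts for line in part]

lines = ["date", "banana", "apple", "cherry"]
hashed = run_with_reducers(lines, 2, toy_hash_partition)
ranged = run_with_reducers(lines, 2, make_range_partition(["c"]))
```

With the toy hash, every reducer's output is sorted on its own, but concatenate(hashed) is not globally sorted. With the range partitioner, all keys of reducer 0 are smaller than all keys of reducer 1, so concatenate(ranged) is the fully sorted input.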
On 13.08.2011 at 17:10, Sean Hogan wrote:
Thanks for the link, but it hasn't helped answer my original question - that
Sort.java seems to use IdentityMapper and IdentityReducer. Perhaps it is the
Sort.java that is used when executing the below command, but I can't figure
out what it actually uses for the mapper and reducer. It's entirely possible
I'm just missing something obvious.
I'm interested in seeing how the map and reduce fit into sorting with the command:
$ hadoop jar hadoop-*-examples.jar sort input output
I'd appreciate it if someone could explain what mappers/reducers are used in
that above command (link to the implementation of whatever sort they use and
how it fits into MapReduce)