Hi all,

I'm interested in learning how Hadoop implements its sort algorithm
in the map/reduce framework. Could someone point me to the directory of the
source code that has the mapper/reducer that the Sort example uses by
default when I invoke:

$ hadoop jar hadoop-*-examples.jar sort input output

Thanks. I've found Sort.java here:

http://svn.apache.org/viewvc/hadoop/common/trunk/mapreduce/src/examples/org/apache/hadoop/examples/

But I have not been able to track down the mapper/reducer implementation.

-Sean Hogan


  • Arun C Murthy at Aug 12, 2011 at 11:51 pm
    Sean,

    The sort impl is spread out over many files. I'd start with MapTask and ReduceTask and follow from there on.

    LMK if you need more info.

    thanks,
    Arun
  • Sean Hogan at Aug 13, 2011 at 1:27 pm
    Hi all,

    I'm interested in learning how Hadoop implements its sort algorithm
    in the map/reduce framework. Could someone point me to the directory of the
    source code that has the mapper/reducer that the Sort example uses by
    default when I invoke:

    $ hadoop jar hadoop-*-examples.jar sort input output

    Thanks. I've found Sort.java here:

    http://svn.apache.org/viewvc/hadoop/common/trunk/mapreduce/src/examples/org/apache/hadoop/examples/

    But I have not been able to track down the mapper/reducer implementation.

    -Sean Hogan
  • Kai Voigt at Aug 13, 2011 at 1:58 pm
    Hi,

    A quick Google search would have turned this up. Here's one link:

    http://code.google.com/p/hop/source/browse/trunk/src/examples/org/apache/hadoop/examples/?r=131

    Kai

    --
    Kai Voigt
    k@123.org
  • Sean Hogan at Aug 13, 2011 at 3:11 pm
    Thanks for the link, but it hasn't answered my original question: Sort.java
    seems to use IdentityMapper and IdentityReducer. Perhaps that is the
    Sort.java used when executing the command below, but I can't figure
    out what it actually uses for the mapper and reducer. It's entirely possible
    I'm just missing something obvious.

    I'm interested in seeing how the map and reduce steps fit into sorting with
    the following command:

    $ hadoop jar hadoop-*-examples.jar sort input output

    I'd appreciate it if someone could explain what mappers/reducers are used in
    the above command (a link to the implementation of whatever sort they use and
    how it fits into MapReduce).

    Thanks.

    -Sean
  • Kai Voigt at Aug 13, 2011 at 3:22 pm
    Hi,

    The IdentityMapper and IdentityReducer do what their names imply: they return their input unchanged as their output.

    TeraSort, like the Sort example, relies on the sorting that is built into Hadoop's sort & shuffle phase.

    So, conceptually, the map() method in TeraSort looks like this:

    map(offset, line) -> (line, _)

    Here, offset is the input key to map() and represents the byte offset of the line; the line itself is the input value. map() emits the line as the key, along with a placeholder value that is not needed.

    reduce() looks like this:

    reduce(line, values) -> (line)

    Again, the input is returned as is. The sort & shuffle layer between map() and reduce() guarantees that the keys (lines) arrive at each reducer in sorted order. That's why, with a single reducer, the overall output is the sorted input.
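
    As a rough illustration of the flow described above, here is a minimal plain-Java sketch of a one-reducer sort. It does not use any Hadoop API: the class and method names (IdentitySortSketch, shuffle, sortLines) are hypothetical, and a TreeMap merely stands in for the framework's sort & shuffle layer, which in Hadoop itself is implemented very differently (see MapTask and ReduceTask). One further assumption: reduce() emits the line once per value, so duplicate lines survive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of a one-reducer sort job; none of these names are Hadoop APIs.
public class IdentitySortSketch {

    // map(offset, line) -> (line, _): the line becomes the key; the value is unused.
    static Map.Entry<String, String> map(long offset, String line) {
        return Map.entry(line, "");
    }

    // Stand-in for sort & shuffle: group values by key, with keys kept in sorted order.
    static TreeMap<String, List<String>> shuffle(List<Map.Entry<String, String>> mapOutput) {
        TreeMap<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> kv : mapOutput) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        return grouped;
    }

    // reduce(line, values) -> line, emitted once per value (assumption: keep duplicates).
    static void reduce(String line, List<String> values, List<String> out) {
        for (String ignored : values) {
            out.add(line);
        }
    }

    static List<String> sortLines(List<String> lines) {
        List<Map.Entry<String, String>> mapOutput = new ArrayList<>();
        long offset = 0;
        for (String line : lines) {
            mapOutput.add(map(offset, line));
            offset += line.length() + 1; // assumes single-byte characters and '\n' line ends
        }
        List<String> out = new ArrayList<>();
        // TreeMap iterates in ascending key order, so reduce() sees the keys sorted.
        shuffle(mapOutput).forEach((line, values) -> reduce(line, values, out));
        return out;
    }

    public static void main(String[] args) {
        // prints [apple, apple, fig, pear]
        System.out.println(sortLines(List.of("pear", "apple", "fig", "apple")));
    }
}
```

    Note that neither map() nor reduce() compares anything; all of the ordering comes from the stand-in shuffle step, which is the point being made in this post.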

    This is all easy when there's just one reducer. A question to check that you've understood things so far: what's the issue with more than one reducer?

    Kai

  • Sean Hogan at Aug 13, 2011 at 3:45 pm
    Oh, okay, got it - if there is more than one reducer, then there needs to be
    a way to guarantee that the overall output from multiple reducers will still
    be sorted.
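
    To make the problem concrete, here is a small plain-Java sketch (not Hadoop code; HashPartitionSketch and runAndConcatenate are hypothetical names). It assigns keys to reducers with the same arithmetic as Hadoop's default HashPartitioner, sorts within each reducer, and concatenates the per-reducer outputs. The keys are single letters chosen so the hash values are easy to check by hand, and a TreeSet stands in for the per-reducer sort, so the keys are assumed to be distinct.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Simulation only: shows that hash partitioning loses the global order.
public class HashPartitionSketch {

    // Same arithmetic as Hadoop's default HashPartitioner.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Each "reducer" holds its keys in sorted order; outputs are concatenated in reducer order.
    static List<String> runAndConcatenate(List<String> keys, int numReducers) {
        List<TreeSet<String>> reducers = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            reducers.add(new TreeSet<>());
        }
        for (String key : keys) {
            reducers.get(partition(key, numReducers)).add(key);
        }
        List<String> out = new ArrayList<>();
        for (TreeSet<String> reducer : reducers) {
            out.addAll(reducer);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("d", "a", "c", "b");
        // One reducer: prints [a, b, c, d]
        System.out.println(runAndConcatenate(keys, 1));
        // Two reducers: "b" and "d" hash to reducer 0, "a" and "c" to reducer 1.
        // prints [b, d, a, c]
        System.out.println(runAndConcatenate(keys, 2));
    }
}
```

    Each reducer's own output is sorted, but the concatenation is not, because the hash decides which reducer a key goes to without regard to key order.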

    So I want to find where the implementation of the shuffle/sort phase is
    located, or find something on how Hadoop implements the MapReduce
    sort/shuffle phase.

    Thanks!

    -Sean
  • Kai Voigt at Aug 13, 2011 at 3:50 pm
    Good job. In MapReduce you can write your own Partitioner, which is the code that determines which reducer gets which keys.

    For simplicity, assume you're running 26 reducers. Your custom Partitioner will make sure the first reducer gets all keys starting with 'a', the second gets all keys starting with 'b', and so on.

    Since the keys are sorted within each reducer, and every key in one reducer's range precedes every key in the next reducer's range, you can concatenate your 26 output files in reducer order to get an overall sorted output.
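
    The 26-reducer idea can be sketched in plain Java as follows. This is a simulation rather than a Hadoop Partitioner implementation: RangePartitionSketch and its methods are hypothetical names, and the sketch assumes every key is non-empty and starts with a lowercase ASCII letter 'a' to 'z'.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Simulation only: first-letter partitioning across 26 "reducers".
public class RangePartitionSketch {

    static final int NUM_REDUCERS = 26;

    // Reducer 0 gets keys starting with 'a', reducer 1 gets 'b', and so on.
    static int partition(String key) {
        if (key.isEmpty() || key.charAt(0) < 'a' || key.charAt(0) > 'z') {
            throw new IllegalArgumentException("key must start with a-z: " + key);
        }
        return key.charAt(0) - 'a';
    }

    static List<String> runAndConcatenate(List<String> keys) {
        List<List<String>> reducers = new ArrayList<>();
        for (int i = 0; i < NUM_REDUCERS; i++) {
            reducers.add(new ArrayList<>());
        }
        for (String key : keys) {
            reducers.get(partition(key)).add(key);
        }
        List<String> out = new ArrayList<>();
        for (List<String> reducer : reducers) {
            Collections.sort(reducer); // stands in for the per-reducer sorted input
            out.addAll(reducer);       // concatenate outputs in reducer order
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [apple, avocado, banana, blueberry, cherry]
        System.out.println(runAndConcatenate(
                List.of("cherry", "apple", "banana", "avocado", "blueberry")));
    }
}
```

    The concatenation is globally sorted because the partition boundaries follow the key order. A fixed first-letter split can leave some reducers with far more data than others; choosing split points from a sample of the input is one way to balance them, and Hadoop ships a TotalOrderPartitioner class for range partitioning along those lines.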

    Making sense?

    Kai

  • Sean Hogan at Aug 13, 2011 at 3:54 pm
    Yep, got it.

    Thanks.

    -Sean

Discussion Overview
group: common-user @
categories: hadoop
posted: Aug 12, '11 at 11:12p
active: Aug 13, '11 at 3:54p
posts: 9
users: 4
website: hadoop.apache.org...
irc: #hadoop
