FAQ
If the data generated by the Map function and Reduce function is far bigger
than the available RAM on my server, is it possible to sort the data?


  • Josh Patterson at May 30, 2010 at 4:33 pm
    Kevin,
    As the data is sorted, it is spilled to disk as needed. There are parameters
    you can set that affect the spill mechanics. Basically, each map task has a
    circular memory buffer that it writes its output to, and when the contents of
    the buffer reach a certain (tunable) threshold, a background thread begins to
    spill the contents to disk. In the process of spilling to disk, the thread
    divides the data into partitions and sorts each partition by key (there are a
    lot of tricks you can do here to provide fancier algorithms). Before the map
    task finishes, the spill files are merged into a single file that is both
    partitioned and sorted. This is the file that the reduce tasks will then
    download at some point.
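    The spill-and-merge behavior described above is essentially an external
    merge sort. Here is a minimal sketch in Python (the buffer size, record
    format, and file handling are illustrative only, not Hadoop's actual
    implementation):

```python
import heapq
import tempfile

def external_sort(records, buffer_size=3):
    """Sort (key, value) records that may not all fit in memory.

    Records accumulate in an in-memory buffer; when the buffer fills,
    it is sorted and "spilled" to a temporary file on disk. At the end,
    all spill files are merged into one sorted stream, analogous to the
    map side's final merge of its spill files.
    """
    spills = []

    def spill(buf):
        buf.sort()  # sort this run in memory before writing it out
        f = tempfile.TemporaryFile(mode="w+")
        for key, value in buf:
            f.write(f"{key}\t{value}\n")
        f.seek(0)
        spills.append(f)

    buffer = []
    for rec in records:
        buffer.append(rec)
        if len(buffer) >= buffer_size:
            spill(buffer)
            buffer = []
    if buffer:
        spill(buffer)  # final partial buffer

    def read(f):
        for line in f:
            key, value = line.rstrip("\n").split("\t", 1)
            yield (key, value)

    # k-way merge of the already-sorted spill files via a heap
    yield from heapq.merge(*(read(f) for f in spills))

# Usage: more records than fit in the (deliberately tiny) buffer.
data = [("b", "2"), ("d", "4"), ("a", "1"), ("c", "3"), ("e", "5")]
print(list(external_sort(data)))
# [('a', '1'), ('b', '2'), ('c', '3'), ('d', '4'), ('e', '5')]
```

    Because each spill file is already sorted, the final pass only needs a
    streaming merge, so memory use stays bounded regardless of input size.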

    Something else you can look at is enabling LZO compression of the map task
    output, which improves throughput by touching the disk less.
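    For the 0.20-era releases this thread dates from, the relevant job
    configuration looks roughly like the snippet below. The LZO codec class
    comes from the separate hadoop-lzo package, and exact property names vary
    by Hadoop version, so treat these as assumptions to verify against your
    release:

```xml
<!-- Illustrative mapred-site.xml / per-job configuration snippet -->
<!-- Compress map output (LZO codec assumed installed separately) -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
<!-- Spill tuning: in-memory sort buffer size (MB) and the
     fill fraction at which the background spill starts -->
<property>
  <name>io.sort.mb</name>
  <value>200</value>
</property>
<property>
  <name>io.sort.spill.percent</name>
  <value>0.80</value>
</property>
```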

    For a better understanding of the shuffle, check out Tom White's O'Reilly
    book, Hadoop: The Definitive Guide, chapter 6, section "Shuffle and Sort".


    Josh Patterson

    Solutions Architect
    Cloudera

    On Sun, May 30, 2010 at 12:22 PM, Kevin Tse wrote:

    If the data generated by the Map function and Reduce function is far bigger
    than the available RAM on my server, is it possible to sort the data?

Discussion Overview
group: common-user @
categories: hadoop
posted: May 30, '10 at 4:22p
active: May 30, '10 at 4:33p
posts: 2
users: 2
website: hadoop.apache.org...
irc: #hadoop

