Hi,

I was wondering what scheduling algorithm Hadoop (version 0.20.2 in
particular) uses for a ReduceTask to determine the order in which it
reads the map outputs from the various mappers that have been run. In
particular, suppose we have 10 maps called map1, map2, ..., map10, and
2 reducers, r1 and r2. Which map's output does r1/r2 read from first?

Also, suppose that mapred.reduce.parallel.copies is set to 5. Do both
r1 and r2 then read from 5 map outputs concurrently?

Thanks,
Virajith


  • David Rosenstrauch at Jun 29, 2011 at 10:37 pm

    You're missing two key steps here. After the mappers run, a partition
    step assigns each record to a reducer based on its key (spreading the
    records across the reducers), and a sort step puts the records within
    each partition in key order.

    So your question is really a moot one. The records output by a given
    map step get spread across multiple reducers, and not all sent to a
    single reducer.
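
To illustrate how one map's records end up spread across reducers, here is a minimal sketch. It assumes the job uses the default hash partitioning (the formula below matches Hadoop's HashPartitioner); the class name, method names, and sample keys are hypothetical and only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PartitionSketch {
    // Same formula as Hadoop's default HashPartitioner:
    // clear the sign bit of the hash, then take it modulo the reducer count.
    public static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Splits the keys emitted by ONE map task into one bucket per reducer.
    public static List<List<String>> partition(List<String> mapOutputKeys, int numReducers) {
        List<List<String>> buckets = new ArrayList<>();
        for (int r = 0; r < numReducers; r++) {
            buckets.add(new ArrayList<>());
        }
        for (String key : mapOutputKeys) {
            buckets.get(partitionFor(key, numReducers)).add(key);
        }
        return buckets;
    }

    public static void main(String[] args) {
        // Hypothetical keys emitted by a single map task.
        List<String> map1Keys = Arrays.asList("a", "b", "c", "d");
        // With 2 reducers this prints [[b, d], [a, c]]:
        // bucket 0 goes to the first reducer, bucket 1 to the second.
        System.out.println(partition(map1Keys, 2));
    }
}
```

So every reducer typically receives a slice of every map's output, rather than the whole output of any one map.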

    DR
  • Virajith Jalaparti at Jun 29, 2011 at 10:46 pm
    Hi,

    I guess I did not frame my question properly. What I actually meant
    was this: after the map phase, the output of each map is partitioned
    by key and written to disk as a single file. The ReduceTask then
    starts up and has to read the intermediate values from the partitions
    that the various map tasks have created for it. How does the
    ReduceTask decide which partition to read first?
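
To make the setup concrete, here is a rough sketch of one reducer pulling its own partition from every map output with a bounded number of concurrent copies. It is a simulation, not Hadoop code: the class, method, and data names are hypothetical, the map outputs are in-memory strings, and the order in which copies are submitted is simply list order, which is an assumption of the sketch and says nothing about the order Hadoop 0.20.2 actually uses. The pool size plays the role of mapred.reduce.parallel.copies, which, as far as I know, applies to each reduce task separately, so r1 and r2 would each run up to 5 copies at once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ShuffleSketch {
    // mapOutputs.get(m).get(r) stands in for the partition map m wrote for reducer r.
    public static List<String> fetchForReducer(List<List<String>> mapOutputs,
                                               int reducer, int parallelCopies) {
        // One pool per reducer: at most parallelCopies copies run at the same time.
        ExecutorService copiers = Executors.newFixedThreadPool(parallelCopies);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (List<String> oneMapOutput : mapOutputs) {
                // The "copy" just reads this reducer's slice of one map output.
                Callable<String> copy = () -> oneMapOutput.get(reducer);
                pending.add(copiers.submit(copy));
            }
            // Collect results in submission order; the copies themselves may overlap.
            List<String> fetched = new ArrayList<>();
            for (Future<String> f : pending) {
                fetched.add(f.get());
            }
            return fetched;
        } catch (Exception e) {
            throw new RuntimeException("copy failed", e);
        } finally {
            copiers.shutdown();
        }
    }

    public static void main(String[] args) {
        // Three hypothetical map outputs, each split into two partitions.
        List<List<String>> mapOutputs = Arrays.asList(
                Arrays.asList("map1-part0", "map1-part1"),
                Arrays.asList("map2-part0", "map2-part1"),
                Arrays.asList("map3-part0", "map3-part1"));
        // Reducer 0 with up to 5 parallel copies.
        System.out.println(fetchForReducer(mapOutputs, 0, 5));
    }
}
```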

    Thanks,
    Virajith

Discussion Overview
group: mapreduce-user @
categories: hadoop
posted: Jun 29, '11 at 9:29p
active: Jun 29, '11 at 10:46p
posts: 3
users: 2
website: hadoop.apache.org...
irc: #hadoop
