Separate Server Sets for Map and Reduce
We have lots of servers but a limited storage pool. My map jobs handle
lots of small input files (approx. 300 MB compressed), but the reduce
input is huge (about 100 GB), requiring lots of temporary and local
storage. I would like to divide my server pool into two kinds: one set
with small disks (for map jobs) and a few with big storage (for the
combine and reduce jobs).

Is there something I can do that forces the reduce tasks to run on
specific nodes?

I have searched Google and some forums but have not found an answer.

Best regards,

Raj
--
View this message in context: http://old.nabble.com/Seperate-Server-Sets-for-Map-and-Reduce-tp29216327p29216327.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


  • Alex Kozlov at Jul 20, 2010 at 8:28 pm
    Hi RajVish,

    I am just wondering why the reduce input is huge: would increasing the
    number of reducers make it smaller, or is it a 'fixed cost'? Having the
    reducer input size >> mapper input size definitely makes the job very
    hard to schedule on a homogeneous cluster, and it may also make it
    non-scalable.
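
    For reference, the reducer count is a job-level setting
    (*mapred.reduce.tasks* in the 0.20-era property names). A minimal
    sketch, assuming that property and a hypothetical target of roughly
    1 GB of input per reducer:

        <!-- per-job configuration (or a mapred-site.xml default):
             request more, smaller reduce partitions so each reducer's
             input fits comfortably on local disk -->
        <property>
          <name>mapred.reduce.tasks</name>
          <value>100</value> <!-- hypothetical: ~100 GB / ~1 GB each -->
        </property>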

    Regarding your question, you can certainly force the mapper/reducer
    slot ratio to differ between nodes using *
    mapred.tasktracker.{map,reduce}.tasks.maximum*, but this will have
    implications for data locality and scalability.
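
    As a minimal sketch (assuming Hadoop 0.20/1.x-style mapred-site.xml,
    read by each TaskTracker at startup), the big-disk nodes could
    advertise only reduce slots and the small-disk nodes only map slots:

        <!-- mapred-site.xml on a big-disk node: no map slots, only
             reduce slots; small-disk nodes would invert these values -->
        <property>
          <name>mapred.tasktracker.map.tasks.maximum</name>
          <value>0</value>
        </property>
        <property>
          <name>mapred.tasktracker.reduce.tasks.maximum</name>
          <value>8</value> <!-- example value; size to cores and disks -->
        </property>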

    Also, you may still end up with the same problem since the mappers cache
    their output on a local disk and mapper output == reducer input.
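
    One partial mitigation (an assumption on my part, not something the
    original post mentions) is to point the intermediate-data directories
    at whatever larger volumes each node does have, via *mapred.local.dir*:

        <!-- mapred-site.xml: spread intermediate map output and
             reduce-side merge space across the largest local volumes -->
        <property>
          <name>mapred.local.dir</name>
          <value>/data1/mapred/local,/data2/mapred/local</value>
          <!-- hypothetical paths; the comma-separated list is used
               round-robin for spills -->
        </property>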

    Alex K
