FAQ
Hi everyone,

I have to chain multiple MapReduce jobs (actually 2 to 4 jobs), and each
job depends on the output of the preceding job. In the reducer of each job
I'm doing very little (just grouping by key from the maps). I want to pass
the output of one MapReduce job to the next job without having to go to
disk. Does anyone have any ideas on how to do this?

Thanks.


  • Lukáš Vlček at Apr 8, 2009 at 10:14 pm
    Hi,
    I am by no means a Hadoop expert, but I think you cannot start a Map task
    until the previous Reduce has finished. This means that you probably have
    to store the Map output on disk first, because (a) it may not fit into
    memory and (b) you would risk data loss if the system crashes.
    For job chaining, you can check the JobControl class:
    http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/jobcontrol/JobControl.html

    You can also look at https://issues.apache.org/jira/browse/HADOOP-3702

    Regards,
    Lukas
    On Wed, Apr 8, 2009 at 11:30 PM, asif md wrote:

    Hi everyone,

    I have to chain multiple MapReduce jobs (actually 2 to 4 jobs), and each
    job depends on the output of the preceding job. In the reducer of each
    job I'm doing very little (just grouping by key from the maps). I want to
    pass the output of one MapReduce job to the next job without having to go
    to disk. Does anyone have any ideas on how to do this?

    Thanks.

    --
    http://blog.lukas-vlcek.com/
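
To illustrate the job-chaining idea in the reply above, here is a minimal sketch of dependency-ordered execution, where a job runs only after every job it depends on has finished. It is plain Java and not the Hadoop JobControl API; the class `JobChainSketch` and its methods `addJob` and `executionOrder` are hypothetical names invented for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a job-control helper; NOT the Hadoop JobControl API.
class JobChainSketch {
    // Maps each job name to the names of the jobs whose output it consumes.
    private final Map<String, List<String>> dependencies = new LinkedHashMap<>();

    void addJob(String name, String... dependsOn) {
        dependencies.put(name, Arrays.asList(dependsOn));
    }

    // Returns an order in which every job appears after all jobs it depends on.
    List<String> executionOrder() {
        List<String> done = new ArrayList<>();
        while (done.size() < dependencies.size()) {
            boolean progressed = false;
            for (Map.Entry<String, List<String>> e : dependencies.entrySet()) {
                // A job is ready once all of its prerequisites have completed.
                if (!done.contains(e.getKey()) && done.containsAll(e.getValue())) {
                    done.add(e.getKey());
                    progressed = true;
                }
            }
            if (!progressed) {
                throw new IllegalStateException("cyclic or missing dependency");
            }
        }
        return done;
    }

    public static void main(String[] args) {
        JobChainSketch chain = new JobChainSketch();
        chain.addJob("job3", "job2"); // registered first, but must run last
        chain.addJob("job1");
        chain.addJob("job2", "job1");
        System.out.println(chain.executionOrder());
    }
}
```

Running `main` prints `[job1, job2, job3]`: the registration order does not matter, only the declared dependencies do. In a real cluster each step would be a submitted job whose input path is the previous job's output path.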
  • Nathan Marz at Apr 8, 2009 at 10:31 pm
    You can also try decreasing the replication factor for the
    intermediate files between jobs. This will make writing those files
    faster.
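
A rough way to see why a lower replication factor speeds up writing intermediate files: every block is stored on as many nodes as the replication factor, so the total volume written grows linearly with it. The sketch below is only this arithmetic, under the assumption of a hypothetical 10 GiB intermediate output; `ReplicationCostSketch` and `bytesWritten` are names made up for the example, and the common default of 3 replicas is assumed for comparison.

```java
// Illustrative arithmetic only; it does not talk to HDFS.
class ReplicationCostSketch {
    // Total bytes written across the cluster when each block of a file
    // is stored on `replication` nodes.
    static long bytesWritten(long fileBytes, int replication) {
        if (replication < 1) {
            throw new IllegalArgumentException("replication must be >= 1");
        }
        return fileBytes * replication;
    }

    public static void main(String[] args) {
        long intermediate = 10L * 1024 * 1024 * 1024; // hypothetical 10 GiB output
        long withThree = bytesWritten(intermediate, 3); // assumed default factor
        long withOne = bytesWritten(intermediate, 1);   // reduced factor
        System.out.println("replication 3: " + withThree + " bytes");
        System.out.println("replication 1: " + withOne + " bytes");
    }
}
```

The trade-off is durability: with a single replica, losing a node can mean rerunning the job that produced the intermediate file.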
  • Jason hadoop at Apr 9, 2009 at 3:25 am
    Chapter 8 of my book covers this in detail; the alpha chapter should be
    available on the Apress web site.
    Chain mapping rules!
    http://www.apress.com/book/view/1430219424

    --
    Alpha Chapters of my book on Hadoop are available
    http://www.apress.com/book/view/9781430219422
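
The "chain mapping" mentioned above can be pictured as running several map steps back to back, handing records from one step to the next in memory instead of writing them out between steps. The following is a plain-Java sketch of that idea and not Hadoop's chaining classes; `ChainMapSketch`, `runChain`, and the two example steps are hypothetical. For simplicity it processes one whole step at a time, whereas a real framework would typically stream record by record.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration of chained map steps; NOT a Hadoop API.
class ChainMapSketch {
    // Feeds the output of each map step into the next one, entirely in memory.
    @SafeVarargs
    static List<String> runChain(List<String> input,
                                 Function<String, List<String>>... steps) {
        List<String> current = input;
        for (Function<String, List<String>> step : steps) {
            List<String> next = new ArrayList<>();
            for (String record : current) {
                next.addAll(step.apply(record)); // a step may emit 0..n records
            }
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        Function<String, List<String>> splitWords =
                line -> Arrays.asList(line.split(" "));
        Function<String, List<String>> upperCase =
                word -> Arrays.asList(word.toUpperCase());
        System.out.println(runChain(Arrays.asList("a b", "c"), splitWords, upperCase));
    }
}
```

Running `main` prints `[A, B, C]`. This only covers map-style steps; any step that needs grouping by key still requires all records for a key to be collected first.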

Discussion Overview
group: common-user @
categories: hadoop
posted: Apr 8, '09 at 9:32p
active: Apr 9, '09 at 3:25a
posts: 4
users: 4
website: hadoop.apache.org...
irc: #hadoop