Is it possible to connect the output of one MapReduce job so that it becomes the
input to another MapReduce job?

Basically, the output of reduce() would be passed straight to another map()
function without having to store intermediate data on the filesystem.

Kevin

--

Founder/CEO Spinn3r.com

Location: *San Francisco, CA*
Skype: *burtonator*

Skype-in: *(415) 871-0687*


  • Arko Provo Mukherjee at Sep 27, 2011 at 7:30 pm
    Hi,

I am not sure how you can avoid the filesystem; however, I chained two jobs as follows:

    // For Job 1
    FileInputFormat.addInputPath(job1, new Path(args[0]));
    FileOutputFormat.setOutputPath(job1, new Path(args[1]));

    // For job 2
    FileInputFormat.addInputPath(job2, new Path(args[1]));
    FileOutputFormat.setOutputPath(job2, new Path(args[2]));

    Assuming
    args[0] --> Input to first mapper
    args[1] --> Output of first reducer / Input to second mapper
    args[2] --> Output of second reducer

    Hope this helps!
    Warm regards
    Arko
  • Marcos Luis Ortiz Valmaseda at Sep 27, 2011 at 7:42 pm
    Have you considered Oozie for this? It's a workflow engine developed by the
    Yahoo! engineers.
    Yahoo/oozie at GitHub
    https://github.com/yahoo/oozie

    Oozie at InfoQ
    http://www.infoq.com/articles/introductionOozie

    Oozie's examples:
    http://www.infoq.com/articles/oozieexample
    http://yahoo.github.com/oozie/releases/2.3.0/DG_Examples.html

    Oozie at Cloudera
    https://ccp.cloudera.com/display/CDHDOC/Oozie+Installation

    Regards


    --
    Marcos Luis Ortíz Valmaseda
    Linux Infrastructure Engineer
    Linux User # 418229
    http://marcosluis2186.posterous.com
    http://www.linkedin.com/in/marcosluis2186
    Twitter: @marcosluis2186
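For reference, chaining two map-reduce actions in an Oozie workflow looks roughly like the definition below. This is a sketch of the workflow XML only; the workflow name, `com.example.*` class names, and the `${inputDir}`/`${intermediateDir}`/`${outputDir}` parameters are illustrative, and the intermediate directory is still written to HDFS between the two actions:

```xml
<workflow-app name="two-stage-wf" xmlns="uri:oozie:workflow:0.2">
  <start to="first-mr"/>

  <action name="first-mr">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapred.mapper.class</name>
          <value>com.example.FirstMapper</value>
        </property>
        <property>
          <name>mapred.reducer.class</name>
          <value>com.example.FirstReducer</value>
        </property>
        <property>
          <name>mapred.input.dir</name>
          <value>${inputDir}</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>${intermediateDir}</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="second-mr"/>
    <error to="fail"/>
  </action>

  <action name="second-mr">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapred.mapper.class</name>
          <value>com.example.SecondMapper</value>
        </property>
        <property>
          <name>mapred.reducer.class</name>
          <value>com.example.SecondReducer</value>
        </property>
        <property>
          <name>mapred.input.dir</name>
          <value>${intermediateDir}</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>${outputDir}</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>

  <kill name="fail">
    <message>Map-reduce action failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
```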
  • Mike Spreitzer at Sep 27, 2011 at 8:38 pm
    It looks to me like Oozie will not do what was asked. In
    http://yahoo.github.com/oozie/releases/3.0.0/WorkflowFunctionalSpec.html#a0_Definitions
    I see:

    3.2.2 Map-Reduce Action
    ...
    The workflow job will wait until the Hadoop map/reduce job completes
    before continuing to the next action in the workflow execution path.

    That implies to me that the output of one job is held in some intermediate
    storage (likely HDFS) for a while before being read by the consuming
    job(s).

    Regards,
    Mike Spreitzer
  • Niels Basjes at Sep 28, 2011 at 7:21 am
    To me it sounds like the asker should check out tools like Storm and S4
    instead of Hadoop.

    http://www.infoq.com/news/2011/09/twitter-storm-real-time-hadoop

    --
    With kind regards,
    Niels Basjes
  • Arun C Murthy at Sep 27, 2011 at 7:45 pm

    On Sep 27, 2011, at 12:09 PM, Kevin Burton wrote:

    Currently there is no way to pipeline jobs in such a manner. With hadoop-0.23 it's doable, but it will take more effort.

    Arun

Discussion Overview
group: mapreduce-user @ hadoop
posted: Sep 27, '11 at 7:10p
active: Sep 28, '11 at 7:21a
posts: 6
users: 6
website: hadoop.apache.org...
irc: #hadoop
