FAQ
Hi,

As a Hadoop beginner, I wonder how to send the output key-value pairs of the
reducers back to the input of the mappers for iterative processing.

What's Hadoop streaming? Can I pipe the output stream of the reducers back to
the input stream of the mappers to achieve what I want?

Any pointer would be greatly appreciated.
--
View this message in context: http://www.nabble.com/Q%3A-Sending-output-of-reduce-to-mapper-tf4581957.html#a13079722
Sent from the Hadoop Users mailing list archive at Nabble.com.


  • Arun C Murthy at Oct 7, 2007 at 12:26 pm
    Hi Ken,
    On Sat, Oct 06, 2007 at 08:54:54PM -0700, Ken Pu wrote:

    > Hi,
    >
    > As a Hadoop beginner, I wonder how to send the output key-value pairs of the
    > reducers back to the input of the mappers for iterative processing.

    A map-reduce job has only one set of maps and one set of reduces.

    The way to do what you seek is to chain jobs together, i.e. the output of job1 becomes the input of job2 and so on. That is fairly easy since the output of a job (i.e. of its reduces) is usually on HDFS.

    The onus of waiting for job completion is on the user code, i.e. you have to ensure job1 is complete before launching job2 and so on.

    The ways to do that are:
    a) http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/JobClient.html#runJob(org.apache.hadoop.mapred.JobConf) which submits the job and returns only after it completes (success or failure).
    b) http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/JobClient.html#submitJob(org.apache.hadoop.mapred.JobConf) to just submit the job and poll yourself; look at src/java/org/apache/hadoop/mapred/JobClient.java, in particular the implementation of *runJob*, to see how to do that.
    c) If you don't want to poll, use the *job.end.notification.url* property to set up a URL which will be invoked once the job completes, so you can do async stuff. (Take a look at src/test/org/apache/hadoop/mapred/NotificationTestCase.java for an example of how to use that.)
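
The chaining idea can be sketched locally, without Hadoop, to show the control flow only. Everything below is an assumption made for illustration: `run_job` is a hypothetical in-memory stand-in for a blocking job submission (it is not a Hadoop API), and the halving mapper, summing reducer and stopping rule are toy choices. In a real setup each round would be a separate job whose input path is the previous job's output directory.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Hypothetical stand-in for a blocking job submission (not a Hadoop API).

    Runs one map phase, groups the mapper output by key, runs one reduce
    phase, and returns the reducer output only after everything has finished.
    """
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


def halve_mapper(key, value):
    # Toy map step: halve each value (integer division).
    yield key, value // 2


def sum_reducer(key, values):
    # Toy reduce step: add up all values seen for a key.
    yield key, sum(values)


def iterate(records, max_rounds=10):
    """Chain jobs: the output of one round is the input of the next."""
    rounds = 0
    # Toy stopping rule: continue while any value is still greater than 1.
    while rounds < max_rounds and any(v > 1 for _, v in records):
        records = run_job(records, halve_mapper, sum_reducer)
        rounds += 1
    return records, rounds


result, rounds = iterate([("a", 8), ("b", 3), ("a", 4)])
print(result, rounds)
```

The loop only starts the next round after `run_job` has returned, which mirrors the requirement that job1 must be complete before job2 is launched.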

    > What's Hadoop streaming? Can I pipe the output stream of the reducers back to
    > the input stream of the mappers to achieve what I want?

    Hadoop streaming is a utility which allows the user to create and run map/reduce jobs with any executables as the mapper and/or the reducer.
    E.g. one can use standard Unix utilities as the mapper/reducer:
    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
        -input myInputDirs \
        -output myOutputDir \
        -mapper /bin/cat \
        -reducer /bin/wc
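
As a rough local sketch of what such a streaming job computes, the two executables can be imitated with plain Python functions that consume and produce lines. This rests on simplifying assumptions: a single reducer, ASCII input, and no modelling of how the framework splits lines into keys and values. The function names are hypothetical and not part of Hadoop.

```python
def cat_mapper(lines):
    # Identity mapper, like /bin/cat: pass every input line through unchanged.
    for line in lines:
        yield line


def wc_reducer(lines):
    # Counting reducer, like /bin/wc: report lines, words and characters seen.
    n_lines = n_words = n_chars = 0
    for line in lines:
        n_lines += 1
        n_words += len(line.split())
        n_chars += len(line)
    yield "%d %d %d" % (n_lines, n_words, n_chars)


def run_locally(lines):
    # Local stand-in for one streaming job with a single reducer:
    # map, sort the mapper output, then reduce.
    return list(wc_reducer(sorted(cat_mapper(lines))))


counts = run_locally(["hello world\n", "foo\n"])
print(counts)
```

With more than one reducer, each reducer would see only its share of the mapper output and emit its own counts.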

    Hope that helps.

    Arun
  • Ken Pu at Oct 15, 2007 at 5:43 am
    Thanks - it certainly helps!

    Ken




