Programming Multiple rounds of mapreduce
Hello,

I am trying to write a program where I need to run multiple rounds of map
and reduce.

The output of each round of map-reduce must be fed in as the input of the
next round.

Can anyone please point me to a link or material that explains how I can
achieve this?

Thanks a lot in advance!

Thanks & regards
Arko


  • Bibek Paudel at Jun 13, 2011 at 9:53 pm
    Hi,

    The way I do it is:

    create job1
    feed all the configuration parameters (including input and output paths) to job1
    run job1

    create job2
    feed all config params to job2 (output of job1 as input, another path as output)
    run job2

    ... and so on.

    I think this is the recommended way of running multiple rounds of MR in Hadoop.

    -b
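    The steps Bibek describes can be sketched as a single driver class. This is
    a minimal sketch, not code from the thread: it assumes the 0.20-era
    `org.apache.hadoop.mapreduce.Job` API, and the `Round1Mapper`/`Round1Reducer`
    and `Round2Mapper`/`Round2Reducer` class names and the key/value types are
    placeholders you would replace with your own.

    ```java
    // Driver sketch: run two chained MapReduce jobs, where job2 reads the
    // directory that job1 wrote. Mapper/reducer classes are placeholders.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class TwoRoundDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]); // output of round 1, input of round 2
        Path output = new Path(args[2]);

        Job job1 = new Job(conf, "round-1");
        job1.setJarByClass(TwoRoundDriver.class);
        job1.setMapperClass(Round1Mapper.class);    // placeholder class
        job1.setReducerClass(Round1Reducer.class);  // placeholder class
        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job1, input);
        FileOutputFormat.setOutputPath(job1, intermediate);
        if (!job1.waitForCompletion(true)) {
          System.exit(1);                           // stop if round 1 failed
        }

        Job job2 = new Job(conf, "round-2");
        job2.setJarByClass(TwoRoundDriver.class);
        job2.setMapperClass(Round2Mapper.class);    // placeholder class
        job2.setReducerClass(Round2Reducer.class);  // placeholder class
        job2.setOutputKeyClass(Text.class);
        job2.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job2, intermediate); // feed job1's output in
        FileOutputFormat.setOutputPath(job2, output);
        System.exit(job2.waitForCompletion(true) ? 0 : 1);
      }
    }
    ```

    For more rounds you repeat the same pattern in a loop, using the previous
    round's output path as the next round's input path.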
  • Marcos Ortiz at Jun 13, 2011 at 9:55 pm
    Well, you can define a job for each round and then define the running
    workflow based on your implementation to chain your jobs together.

    --
    Marcos Luís Ortíz Valmaseda
    Software Engineer (UCI)
    http://marcosluis2186.posterous.com
    http://twitter.com/marcosluis2186
  • GOEKE, MATTHEW (AG/1000) at Jun 13, 2011 at 10:02 pm
    If you know for certain that it needs to be split into multiple work units, I would suggest looking into Oozie. It is easy to install, lightweight, and has a low learning curve; for my purposes it has been very helpful so far. I am also fairly certain you can chain multiple job confs into the same run, but I have not actually tried that, so I can't promise it is easy or possible.

    http://www.cloudera.com/blog/2010/07/whats-new-in-cdh3-b2-oozie/

    If you are not running CDH3u0 then you can also get the tarball and documentation directly here:
    https://ccp.cloudera.com/display/SUPPORT/CDH3+Downloadable+Tarballs

    Matt
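
    For the chaining Matt mentions (several job confs submitted in one run),
    Hadoop ships a helper in the old API: `org.apache.hadoop.mapred.jobcontrol`.
    The following is a sketch only, assuming `conf1` and `conf2` are `JobConf`s
    already configured elsewhere with their mappers, reducers, and paths:

    ```java
    // Sketch: express a dependency between two jobs with JobControl, so
    // step2 runs only after step1 succeeds. conf1/conf2 configured elsewhere.
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.jobcontrol.Job;
    import org.apache.hadoop.mapred.jobcontrol.JobControl;

    public class ChainedRun {
      public static void runChain(JobConf conf1, JobConf conf2) throws Exception {
        Job step1 = new Job(conf1);
        Job step2 = new Job(conf2);
        step2.addDependingJob(step1);      // step2 waits for step1 to succeed

        JobControl control = new JobControl("two-round-chain");
        control.addJob(step1);
        control.addJob(step2);

        new Thread(control).start();       // JobControl implements Runnable
        while (!control.allFinished()) {   // poll until both jobs are done
          Thread.sleep(1000);
        }
        control.stop();
      }
    }
    ```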

    -----Original Message-----
    From: Marcos Ortiz
    Sent: Monday, June 13, 2011 4:57 PM
    To: mapreduce-user@hadoop.apache.org
    Cc: Arko Provo Mukherjee
    Subject: Re: Programming Multiple rounds of mapreduce



    This e-mail message may contain privileged and/or confidential information, and is intended to be received only by persons entitled
    to receive such information. If you have received this e-mail in error, please notify the sender immediately. Please delete it and
    all attachments from any servers, hard drives or any other media. Other use of this e-mail by you is strictly prohibited.

    All e-mails and attachments sent and received are subject to monitoring, reading and archival by Monsanto, including its
    subsidiaries. The recipient of this e-mail is solely responsible for checking for the presence of "Viruses" or other "Malware".
    Monsanto, along with its subsidiaries, accepts no liability for any damage caused by any such code transmitted by or accompanying
    this e-mail or any attachment.


    The information contained in this email may be subject to the export control laws and regulations of the United States, potentially
    including but not limited to the Export Administration Regulations (EAR) and sanctions regulations issued by the U.S. Department of
    Treasury, Office of Foreign Asset Controls (OFAC). As a recipient of this information you are obligated to comply with all
    applicable U.S. export laws and regulations.
  • Alejandro Abdelnur at Jun 13, 2011 at 10:14 pm
    Thanks Matt,

    Arko, if you plan to use Oozie, you can have a simple coordinator job that
    does that. For example, the following schedules a workflow every 5 minutes
    that consumes the output produced by the previous run; you just have to
    provide the initial data:

    Thxs.

    Alejandro

    ----
    <coordinator-app name="coord-1" frequency="${coord:minutes(5)}"
                     start="${start}" end="${end}" timezone="UTC"
                     xmlns="uri:oozie:coordinator:0.1">
      <controls>
        <concurrency>1</concurrency>
      </controls>

      <datasets>
        <dataset name="data" frequency="${coord:minutes(5)}"
                 initial-instance="${start}" timezone="UTC">
          <uri-template>${nameNode}/user/${coord:user()}/examples/${dataRoot}/${YEAR}-${MONTH}-${DAY}-${HOUR}-${MINUTE}</uri-template>
        </dataset>
      </datasets>

      <input-events>
        <data-in name="input" dataset="data">
          <instance>${coord:current(0)}</instance>
        </data-in>
      </input-events>

      <output-events>
        <data-out name="output" dataset="data">
          <instance>${coord:current(1)}</instance>
        </data-out>
      </output-events>

      <action>
        <workflow>
          <app-path>${nameNode}/user/${coord:user()}/examples/apps/subwf-1</app-path>
          <configuration>
            <property>
              <name>jobTracker</name>
              <value>${jobTracker}</value>
            </property>
            <property>
              <name>nameNode</name>
              <value>${nameNode}</value>
            </property>
            <property>
              <name>queueName</name>
              <value>${queueName}</value>
            </property>
            <property>
              <name>examplesRoot</name>
              <value>${examplesRoot}</value>
            </property>
            <property>
              <name>inputDir</name>
              <value>${coord:dataIn('input')}</value>
            </property>
            <property>
              <name>outputDir</name>
              <value>${coord:dataOut('output')}</value>
            </property>
          </configuration>
        </workflow>
      </action>
    </coordinator-app>
    ------
  • Moustafa Gaber at Jun 13, 2011 at 10:30 pm
    I think HaLoop is a framework which can answer your question:
    http://code.google.com/p/haloop/


    --
    Best Regards,
    Mostafa Ead
  • Arko Provo Mukherjee at Jun 13, 2011 at 10:39 pm
    Hello,

    Thanks everyone for your responses.

    I am new to Hadoop, so this was a lot of new information for me. I will
    surely go through all of these.

    However, I was actually hoping that someone could point me to some example
    code where multiple rounds of map-reduce have been used.

    Please let me know if anyone has any such examples, as they are the best way
    for me to learn :-)

    Thanks much!
    Cheers
    Arko


  • Moustafa Gaber at Jun 14, 2011 at 1:13 am
    Actually, HaLoop is a new framework on top of Hadoop that targets transitive
    closure algorithms. This type of algorithm consists of rounds of Hadoop
    jobs, so I think it may contain some useful examples for you.

    --
    Best Regards,
    Mostafa Ead
  • Sean Owen at Jun 14, 2011 at 6:29 am
    You could have a look at the MapReduce pipelines in Apache Mahout
    (http://mahout.apache.org). See for instance
    org.apache.mahout.cf.taste.hadoop.item.RecommenderJob. This shows how
    most of Mahout constructs and runs a series of rounds of MapReduce to
    accomplish a task. Each job feeds into one or more of the later
    rounds. It is at least an example of getting it done in straight
    Hadoop -- though workflow systems like Oozie et al. are probably the
    kinds of things you want to look at now.


Discussion Overview
group: mapreduce-user
categories: hadoop
posted: Jun 13, '11 at 9:47p
active: Jun 14, '11 at 6:29a
posts: 9
users: 7
website: hadoop.apache.org
irc: #hadoop
