FAQ
Hi, I need to implement a specific use case that comes up often in the
machine learning / NLP community. Often we want to run some kind of
optimization process on a data set, but we want to run the optimization
from several different sets of initial parameters. While this is not the
usual MR paradigm of splitting up a large task and then recombining the
partial outputs, I would like to use Hadoop to handle the parallelization.
The streaming documentation page
(http://hadoop.apache.org/core/docs/current/streaming.html) mentions that
streaming can be used to create jobs with multiple different parameters,
but it does not give an example, so it's not clear to me how to give each
mapper (or each reducer) a specific set of parameters. If each
mapper/reducer had access to some kind of job index number, I could
potentially write a side file which maps ids->params, but this seems clumsy.
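
That said, the streaming docs do say that jobconf parameters are exported
to the streaming commands as environment variables, with dots turned into
underscores, so a mapper may be able to derive an index from mapred_task_id
and look up its parameters in a side file shipped with -file. A minimal,
untested sketch (the side file name and the task-id format parsed here are
assumptions, so check them against your Hadoop version):

    #!/usr/bin/env python
    # mapper.py - pick this task's parameters via its task index.
    import os
    import sys

    def task_index(task_id):
        # A task id looks roughly like task_200808132008_0001_m_000003_0;
        # the next-to-last field is the map task number (an assumption).
        return int(task_id.split("_")[-2])

    def load_params(path, index):
        # Hypothetical side file: line i holds the whitespace-separated
        # parameters for task index i; ship it with "-file params.tsv".
        with open(path) as f:
            return f.read().splitlines()[index].split()

    params = load_params("params.tsv", task_index(os.environ["mapred_task_id"]))

    for line in sys.stdin:
        # Placeholder: tag each record with this task's parameters; the
        # real optimization would go here instead.
        sys.stdout.write("%s\t%s\n" % (" ".join(params), line.strip()))
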
The only solution that I have now is that my mapper phase will replicate
the data, pairing it with a set of keys that represent the different
parameter settings. Each reducer will then see key-value pairs; by reading
the key it can get its parameters, and the value holds the data. Any other
solutions?
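
For concreteness, here is an untested sketch of that replication approach
(the parameter values are made up, and the number of reduce tasks would be
set to the number of parameter settings so that each reducer handles one):

    #!/usr/bin/env python
    # replicate_mapper.py - emit one copy of every record per parameter
    # setting, keyed by the parameter setting itself.
    import sys

    PARAM_SETTINGS = ["lr=0.01", "lr=0.1", "lr=1.0"]  # hypothetical values

    for line in sys.stdin:
        record = line.strip()
        for params in PARAM_SETTINGS:
            sys.stdout.write("%s\t%s\n" % (params, record))

    #!/usr/bin/env python
    # replicate_reducer.py - parse the key to recover the parameters,
    # gather the records for that key, and optimize once per key.
    import sys

    def optimize(params, records):
        # Placeholder for the actual optimization over the full data set.
        sys.stdout.write("%s\toptimized over %d records\n" % (params, len(records)))

    current_key, records = None, []
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        if key != current_key:
            if current_key is not None:
                optimize(current_key, records)
            current_key, records = key, []
        records.append(value)
    if current_key is not None:
        optimize(current_key, records)
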

Thanks!

Ashish


  • Ashish Venugopal at Aug 13, 2008 at 9:46 pm
    Also, just to clarify a couple of points:
    I am using Hadoop On Demand, which means that to run a job I first have to
    allocate a cluster. I am using the "hod script" mechanism, where the cluster
    is allocated for the running time of my hod script. If my script could
    schedule multiple MR jobs but only relinquish control when all the jobs are
    done, I could simply schedule one MR job per parameter setting.
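
    Something like this untested Python driver is what I have in mind: it
    launches one streaming job per parameter setting and only exits once every
    job has finished, so the hod-allocated cluster stays up long enough. The
    jar path, input/output paths, and the use of -cmdenv to pass the
    parameters are illustrative assumptions:

        #!/usr/bin/env python
        # run_jobs.py - one streaming job per parameter setting, run
        # concurrently; the script exits only when all jobs are done.
        import subprocess

        PARAM_SETTINGS = ["lr=0.01", "lr=0.1", "lr=1.0"]  # hypothetical values

        procs = []
        for i, params in enumerate(PARAM_SETTINGS):
            cmd = [
                "hadoop", "jar", "hadoop-streaming.jar",
                "-input", "data/input",
                "-output", "data/output-%d" % i,      # one output dir per job
                "-mapper", "optimize.py",
                "-cmdenv", "OPT_PARAMS=%s" % params,  # params via environment
                "-file", "optimize.py",
            ]
            procs.append(subprocess.Popen(cmd))

        # Relinquish control only when every job has completed.
        exit_codes = [p.wait() for p in procs]
        assert all(code == 0 for code in exit_codes), "a job failed"
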

    Ashish

