Hi, I need to implement a use case that comes up often in the machine
learning / NLP community. We often want to run some optimization process
over a data set, but starting from several different initial parameter
settings. While this is not the usual MR paradigm of splitting up a large
task and recombining the partial outputs, I would still like to use Hadoop
to handle the parallelization.
The streaming documentation page (
http://hadoop.apache.org/core/docs/current/streaming.html) mentions that
streaming can be used to create jobs with multiple different parameters, but
gives no example, so it's not clear to me how to give each mapper (or each
reducer) a specific set of parameters. If each mapper/reducer had access to
some kind of job index number, I could potentially write a side file that
maps ids -> params, but this seems clumsy.
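For what it's worth, here is a minimal sketch of that side-file idea. It assumes the parameter table is available to the script and that the task's index can be read from an environment variable; the variable name `mapred_task_partition` is an assumption that would need to be checked against the Hadoop version in use:

```python
import os

# Hypothetical side-file contents mapping task index -> parameter set.
PARAMS = [
    {"learning_rate": 0.1},
    {"learning_rate": 0.01},
    {"learning_rate": 0.001},
]

def params_for_task(param_table, task_index):
    """Pick this task's parameter set from the side file by index."""
    return param_table[task_index % len(param_table)]

# Hadoop streaming exports the job configuration to the child process as
# environment variables (dots replaced by underscores); which variable
# carries the task/partition index depends on the Hadoop version, so
# `mapred_task_partition` here is an assumption to verify.
task_index = int(os.environ.get("mapred_task_partition", "0"))
my_params = params_for_task(PARAMS, task_index)
```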
The only solution I have now is for the mapper phase to replicate the data,
pairing each record with a set of keys that represent the different
parameters. Each reducer then sees a key-value pair: by reading the key it
can recover its parameters, and the value carries the data. Any other
solutions?
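To make the replicate-and-key scheme concrete, here is a sketch of the mapper/reducer logic as it might look in a Python streaming job. The parameter keys (`lr=0.1`, etc.) and the optimization stub are illustrative placeholders, not part of any real job:

```python
# Hypothetical parameter sets; the real job would define one key per
# initial-parameter configuration to try.
PARAM_SETS = {
    "lr=0.1":  {"learning_rate": 0.1},
    "lr=0.01": {"learning_rate": 0.01},
}

def map_line(line):
    """Mapper side: replicate one input record under every parameter key.

    With N parameter sets, each record is emitted N times, once per key,
    so the shuffle delivers a full copy of the data to each key's reducer.
    """
    return ["%s\t%s" % (key, line) for key in sorted(PARAM_SETS)]

def reduce_pair(key, values):
    """Reducer side: recover the parameter set from the key, then run the
    optimization over the values (shown here as a stub)."""
    params = PARAM_SETS[key]
    data = list(values)
    # ... run the optimization on `data` with `params` ...
    return params, len(data)

# Small in-memory demonstration; the real streaming scripts would read
# records from sys.stdin and write tab-separated key/value lines to stdout.
for record in ["doc1", "doc2"]:
    for emitted in map_line(record):
        print(emitted)
```

The obvious cost is that the input is duplicated N times across the shuffle, which is why a per-task parameter lookup would be preferable if Hadoop exposes one.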