Sorry to ask so many questions, but the answers will help the user list
offer you the best advice, as this is not a typical MR use case.
- Do you foresee the reducer storing the data on a local file system to the
- Do you need to use specific input formats for the job, or is it really
just text files?
- Are the input files on the HDFS, or are you (e.g.) reading from HBase, or
some other source?
If your data is on HDFS, and if it is just text files, have you considered
a simple HDFS getmerge (hadoop fs -getmerge) on each machine? Several tools
(e.g. Fabric) could trigger the getmerge on each machine for you.
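
As a rough sketch of the getmerge idea, the snippet below builds a
"hadoop fs -getmerge" command and runs it on each machine in turn. It uses
plain ssh through Python's subprocess module rather than Fabric, only to
stay dependency-free. The function names are made up for illustration, and
it assumes passwordless ssh to each host and that the hadoop client is on
the remote PATH.

```python
import shlex
import subprocess


def getmerge_command(hdfs_dir, local_path):
    """Build the argv for `hadoop fs -getmerge <hdfs_dir> <local_path>`."""
    return ["hadoop", "fs", "-getmerge", hdfs_dir, local_path]


def ssh_command(host, argv):
    """Wrap an argv so it runs on `host` over ssh, shell-quoting each argument."""
    return ["ssh", host, shlex.join(argv)]


def getmerge_on_hosts(hosts, hdfs_dir, local_path, runner=subprocess.run):
    """Run getmerge on every host, one after another.

    `runner` is injectable (an assumption of this sketch) so the commands
    can be inspected without touching a real cluster. Returns the list of
    commands that were issued.
    """
    issued = []
    for host in hosts:
        cmd = ssh_command(host, getmerge_command(hdfs_dir, local_path))
        runner(cmd, check=True)  # subprocess.run raises on a non-zero exit
        issued.append(cmd)
    return issued
```

For example, getmerge_on_hosts(["node1", "node2"], "/data/input",
"/tmp/merged.txt") would merge the HDFS directory into a local file on each
of the two (hypothetical) hosts.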
The problem with using MR for this is that you would be circumventing (if
it is at all possible) the job scheduling, which is trying to balance the
load across the cluster.
On Thu, Aug 23, 2012 at 10:47 AM, Hamid Oliaei wrote: