The implied reason is that it puts a bottleneck on I/O - to write one file
(with current HDFS semantics), the bytes for that file all have to pass
through a single host. So, you can have N reduces writing to HDFS in
parallel, or you can have one output file written from one machine. It also
means, in the current implementation, that you must have enough room (x2 or
x3 at this point) for that whole output file on a single drive of a single
machine.

Unless your output is being read by something other than Java, it's pretty
easy to make your next process read all of the output files in parallel.
I've even done
this when generating MapFiles from jobs... there is code in place to make
this work already. Alternatively, you can force the job to use a single
reducer in the job settings.
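
To illustrate the "read all of the output files" approach, here is a rough
sketch in plain Java. It is not the Hadoop code mentioned above. It assumes
the part files have already been copied to a local directory, that they are
named part-NNNNN, and that they hold UTF-8 text lines. The class and method
names (PartFileReader, listParts, readAllLines) are made up for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: treats a directory of part-* files as one logical output.
final class PartFileReader {

    // Lists the part-* files in a (local) job output directory, sorted by name.
    static List<Path> listParts(Path outputDir) throws IOException {
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(outputDir, "part-*")) {
            for (Path p : stream) {
                parts.add(p);
            }
        }
        Collections.sort(parts);
        return parts;
    }

    // Reads every part file in parallel and returns all lines.
    // Lines come back in part-file order because the stream is ordered.
    static List<String> readAllLines(Path outputDir) throws IOException {
        try {
            return listParts(outputDir).parallelStream()
                    .flatMap(p -> readLines(p).stream())
                    .collect(Collectors.toList());
        } catch (UncheckedIOException e) {
            throw e.getCause();
        }
    }

    // Wraps the checked exception so the method can be used inside a stream.
    private static List<String> readLines(Path p) {
        try {
            return Files.readAllLines(p, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        for (String line : readAllLines(dir)) {
            System.out.println(line);
        }
    }
}
```

The consumer never needs a merged file: it only needs to know the output
directory, and each part can be read independently of the others.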
On 7/19/06, Thomas FRIOL wrote:
Each reduce task produces one part file in the DFS. Why doesn't the job
tracker merge them at the end of the job to produce only one file?
It seems to me that a single file would make the results easier to process.
I think there is certainly a reason for the current behavior, but I really
need to get results of my map reduce job in a single file. Maybe someone
can give me a clue to solve my problem.
Thanks for any help.
Bryan A. Pendleton
Ph: (877) geek-1-bp