Hi all,

Each reduce task produces one part file in the DFS. Why does the job tracker
not merge them at the end of the job to produce a single file?
It seems to me that a single file would make the results easier to process.

I am sure there is a reason for the current behavior, but I really
need to get the results of my map reduce job in a single file. Maybe someone
can give me a clue to solve my problem.

Thanks for any help.

Thomas.

--
Thomas FRIOL
Eclipse Developer
Solutions & Technologies
ANYWARE TECHNOLOGIES
Tel: +33 (0)561 000 653
Mobile: +33 (0)609 704 810
Fax: +33 (0)561 005 146
www.anyware-tech.com


  • Bryan A. Pendleton at Jul 19, 2006 at 5:57 pm
    The implied reason is that merging would create an I/O bottleneck: to write
    one file (with current HDFS semantics), all the bytes for that file have to
    pass through a single host. So you can either have N reduces writing to HDFS
    in parallel, or have one output file written from one machine. It also
    means, in the current implementation, that you must have enough room (x2 or
    x3 at this point) for that whole output file on a single drive of a single
    machine.

    As long as your output is being read from Java, it's pretty easy to make
    your next process read all of the output files in parallel. I've even done
    this when generating MapFiles from jobs; there is already code in place to
    make this work. Alternatively, you can force the job to use a single
    reducer in the job settings.
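
To illustrate the "read all of the output files" option, here is a minimal sketch in plain Java. It is not taken from Hadoop: the class name PartFileMerger and its methods are hypothetical, and it assumes the part files are flat files that have already been copied to a local directory (it does not talk to the DFS, and it would not handle MapFiles, which are directories). It visits the part-* files in name order and appends them to one file. Note that it funnels every byte through one machine, which is exactly the bottleneck described above, so it only makes sense for modest outputs.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Hadoop: concatenates local part-* files.
final class PartFileMerger {

    // Lists the regular files named part-* in a job output directory,
    // sorted by name so part-00000 comes before part-00001, and so on.
    static List<Path> listParts(Path outputDir) throws IOException {
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(outputDir, "part-*")) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    parts.add(p);
                }
            }
        }
        Collections.sort(parts);
        return parts;
    }

    // Appends every part file to mergedFile (created or truncated first)
    // and returns the number of part files that were merged.
    // mergedFile must not itself be named part-*.
    static int mergeParts(Path outputDir, Path mergedFile) throws IOException {
        List<Path> parts = listParts(outputDir);
        try (OutputStream out = Files.newOutputStream(mergedFile)) {
            for (Path part : parts) {
                Files.copy(part, out);
            }
        }
        return parts.size();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: PartFileMerger <outputDir> <mergedFile>");
            return;
        }
        int merged = mergeParts(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("merged " + merged + " part files");
    }
}
```

A consumer that does not need a physical single file could call listParts and process each part on its own thread instead of merging. For the single-reducer route, the job configuration class JobConf has a setNumReduceTasks(int) method, so setting it to 1 should give one part file at the cost of running the whole reduce on one machine.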

    --
    Bryan A. Pendleton
    Ph: (877) geek-1-bp

Discussion Overview
group: common-dev @
category: hadoop
posted: Jul 19, '06 at 1:06p
active: Jul 19, '06 at 5:57p
posts: 2
users: 2
website: hadoop.apache.org...
irc: #hadoop
