Hi all,

I created some data using the randomwriter utility and compressed the
map task outputs using the options
-D mapred.output.compress=true
-D mapred.map.output.compression.type=BLOCK

I set the bytes per map to 128 MB, but due to compression the final
size of each map task's output is around 75 MB.

I want to use these individual 75 MB compressed files as input to
another map task. How do I get Hadoop to decompress the files before
computing the input splits for the map tasks?

Thanks,
Abhishek
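
For context, a minimal sketch of setting equivalent compression options
through the old org.apache.hadoop.mapred API (current as of 2010) rather
than via -D flags. The class name is illustrative, and the standard helper
shown sets the job-output compression type (mapred.output.compression.type)
rather than the map-output variant quoted in the flags above:

import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

// Illustrative class name; not part of Hadoop.
public class CompressedOutputConf {
  static JobConf configure() {
    JobConf conf = new JobConf(CompressedOutputConf.class);
    // Equivalent of -D mapred.output.compress=true
    FileOutputFormat.setCompressOutput(conf, true);
    // BLOCK compression: compress runs of records in each SequenceFile
    // block, rather than compressing records one at a time.
    SequenceFileOutputFormat.setOutputCompressionType(
        conf, SequenceFile.CompressionType.BLOCK);
    return conf;
  }
}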

  • Rekha Joshi at Apr 15, 2010 at 9:40 am
    By default, with compressed files you lose the ability to control splits: each file is read as a single split and handled by a single mapper (the first sketch after this post shows roughly how that decision is made).

    There has been some discussion around this for bzip2 and gzip, and fixes have been made to make bzip2 splittable. Refer to HADOOP-4012.

    Also, Kevin Weil maintains LZO compression support with an LzoTextInputFormat, which overcomes this disadvantage and is faster. Refer to http://github.com/kevinweil/hadoop-lzo (a usage sketch follows at the end of this post).

    Cheers,
    /R
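
For a rough picture of the default behavior described above, here is a
minimal sketch of the splittability check applied by the old-API
TextInputFormat (paraphrasing Hadoop 0.20-era code; the class name is
illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

class SplittabilityCheck {
  // Paraphrase of TextInputFormat.isSplitable: a file whose name matches
  // no registered compression codec can be split; a compressed file
  // (.gz, .deflate, ...) becomes one split for one mapper.
  static boolean isSplitable(Configuration conf, Path file) {
    return new CompressionCodecFactory(conf).getCodec(file) == null;
  }
}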

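
And a minimal sketch of plugging hadoop-lzo into a job, assuming the
com.hadoop.mapreduce.LzoTextInputFormat class from the repository above
and input .lzo files that have already been indexed with the library's
indexer; the class name, job name, and paths are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import com.hadoop.mapreduce.LzoTextInputFormat;

public class LzoInputJob {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "lzo-input-example");
    job.setJarByClass(LzoInputJob.class);
    // Indexed .lzo files are split at LZO block boundaries, so one large
    // compressed file can feed many mappers instead of just one.
    job.setInputFormatClass(LzoTextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}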
