FAQ
GZIP is not splittable. Does that mean a GZIP block-compressed SequenceFile
can't take advantage of MR parallelism?

How do I control the size of the blocks to be compressed in a SequenceFile?

--
--Sean


  • Sean Bigdatafun at Jan 31, 2011 at 5:12 pm

    On Mon, Jan 31, 2011 at 12:36 AM, Niels Basjes wrote:

    Hi,

    2011/1/31 Sean Bigdatafun <sean.bigdatafun@gmail.com>:
    GZIP is not splittable.
    Correct, gzip is a stream compression format, which effectively means
    you can only start decompressing at the beginning of the data.
    Does that mean a GZIP block-compressed SequenceFile can't take advantage
    of MR parallelism?

    AFAIK it should be splittable in the same blocks as the compression was
    done.
    Splittable within the same block?

    Normally, each mapper would pick one HDFS block (64 MB in an HDFS with the
    default configuration) of a 1 GB file for map processing, provided the file
    is not GZIP compressed. That is the scenario for an uncompressed file.

    But since GZIP is not splittable, can a mapper pick a block, and if so,
    how? (If it can't, then we can't use the MapReduce framework for
    parallelism.)

    Can you explain in more detail?
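
To make the distinction in this exchange concrete, here is a minimal Python sketch. It is not Hadoop code and not the SequenceFile on-disk format: the `SYNC` marker, the 4-byte length prefix, and the helper names `write_blocks` and `read_split` are all hypothetical, chosen only for illustration. It shows that one gzip stream cannot be decoded from the middle, while a file made of independently gzip-compressed blocks separated by sync markers can be divided between two readers at an arbitrary byte offset.

```python
import gzip
import struct
import zlib

# Hypothetical marker written before every block so a reader can resynchronize.
SYNC = b"\xffSYNC-MARKER-01\xff"


def write_blocks(records, records_per_block):
    """Compress each group of records as its own gzip member, prefixed by SYNC and a length."""
    out = bytearray()
    for i in range(0, len(records), records_per_block):
        payload = gzip.compress(b"".join(records[i:i + records_per_block]))
        out += SYNC + struct.pack(">I", len(payload)) + payload
    return bytes(out)


def read_split(buf, start, end):
    """Decode every block whose sync marker begins in [start, end)."""
    pos = buf.find(SYNC, start)  # skip forward to the next block boundary
    chunks = []
    while 0 <= pos < end:
        header = pos + len(SYNC)
        (length,) = struct.unpack(">I", buf[header:header + 4])
        body = header + 4
        chunks.append(gzip.decompress(buf[body:body + length]))
        pos = body + length  # blocks are contiguous, so the next marker starts here
    return b"".join(chunks)


records = [f"record-{i:05d}\n".encode() for i in range(2000)]
data = b"".join(records)

# Case 1: a single gzip stream. Starting in the middle fails.
whole = gzip.compress(data)
try:
    gzip.decompress(whole[len(whole) // 2:])
    mid_stream_ok = True
except (OSError, EOFError, zlib.error):
    mid_stream_ok = False

# Case 2: block-wise compression. Two "mappers" split the file at an arbitrary byte.
packed = write_blocks(records, records_per_block=100)
mid = len(packed) // 2
part_a = read_split(packed, 0, mid)
part_b = read_split(packed, mid, len(packed))
```

In this sketch a block belongs to whichever split contains the first byte of its sync marker, so the two readers together see every record exactly once even though `mid` almost certainly falls inside a compressed block. The real framework's boundary rules may differ in detail.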

    How do I control the size of the blocks to be compressed in a SequenceFile?
    Can't help you with that one.
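
For what the block-size question amounts to in practice, here is a small Python sketch of threshold-based buffering. It is an illustration only, not the SequenceFile writer: `pack_blocks` and its `block_size` parameter are hypothetical names. Records are buffered until their total size reaches the threshold, and the buffer is then compressed as one block. As far as I know, Hadoop's corresponding knob is the `io.seqfile.compress.blocksize` configuration property (a byte threshold), but please check the documentation for your version.

```python
import gzip


def pack_blocks(records, block_size):
    """Buffer records and compress them as one block once block_size bytes are pending."""
    blocks, pending, pending_bytes = [], [], 0
    for rec in records:
        pending.append(rec)
        pending_bytes += len(rec)
        if pending_bytes >= block_size:
            blocks.append(gzip.compress(b"".join(pending)))
            pending, pending_bytes = [], 0
    if pending:
        # Flush whatever is left as a final, smaller block.
        blocks.append(gzip.compress(b"".join(pending)))
    return blocks


# 10,000 records of 13 bytes each (130,000 bytes uncompressed).
records = [f"record-{i:05d}\n".encode() for i in range(10_000)]

small = pack_blocks(records, block_size=4_096)    # many small blocks
large = pack_blocks(records, block_size=65_536)   # few large blocks

restored = b"".join(gzip.decompress(b) for b in small)
```

A smaller threshold gives more blocks, and therefore more places where a reader can start, usually at the cost of a somewhat worse compression ratio because every block starts compressing from scratch.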

    --
    Kind regards,

    Niels Basjes


    --
    --Sean

Discussion Overview
group: hdfs-user @
categories: hadoop
posted: Jan 31, '11 at 8:26a
active: Jan 31, '11 at 5:12p
posts: 2
users: 1
website: hadoop.apache.org...
irc: #hadoop
