FAQ
GZIP is not splittable. Does that mean a GZIP block-compressed SequenceFile
can't take advantage of MapReduce parallelism?

How can I control the size of the block to be compressed in a SequenceFile?

--
--Sean


  • Niels Basjes at Jan 31, 2011 at 8:36 am
    Hi,

    2011/1/31 Sean Bigdatafun <sean.bigdatafun@gmail.com>:
    GZIP is not splittable.
    Correct, gzip is a stream compression format, which effectively means
    you can only start decompressing at the beginning of the data.
    Does that mean a GZIP block compressed sequencefile can't take advantage of MR parallelism?
    AFAIK it should be splittable along the same blocks in which the compression was done.
    How to control the size of block to be compressed in SequenceFile?
    Can't help you with that one.

    --
    Kind regards,

    Niels Basjes
  • Sean Bigdatafun at Jan 31, 2011 at 5:12 pm

    On Mon, Jan 31, 2011 at 12:36 AM, Niels Basjes wrote:

    AFAIK it should be splittable in the same blocks as the compression was
    done.
    Splittable within the same block?

    Normally, each mapper would pick one HDFS block (64MB in an HDFS with the
    default configuration) of a 1GB file for map processing, provided the file
    is not GZIP compressed; that is the scenario for an uncompressed file.

    But since GZIP is not splittable, can a mapper pick a block at all, and if
    so, how? (If it can't, then we can't use the MapReduce framework for
    parallelism.)

    Can you elaborate?

    --
    --Sean
  • Harsh J at Jan 31, 2011 at 5:41 pm
    Hello,

    On Mon, Jan 31, 2011 at 10:41 PM, Sean Bigdatafun
    wrote:
    But since GZIP is not splittable, can a mapper pick a block at all, and if
    so, how? (If it can't, then we can't use the MapReduce framework for
    parallelism.)
    Can you elaborate?
    The basic fact is that GZip is not a splittable compression algorithm,
    but SequenceFiles can be written with a set 'block size' for their
    records, and can also be block-compressed with a chosen algorithm.
    A SequenceFile draws its own 'block' boundaries, and thus lets you
    achieve a splittable file with GZip compression applied within those
    blocks.

    --
    Harsh J
    www.harshj.com
  • Harsh J at Jan 31, 2011 at 11:02 am

    On Mon, Jan 31, 2011 at 1:56 PM, Sean Bigdatafun wrote:
    How to control the size of block to be compressed in SequenceFile?
    It is specified when creating a SequenceFile.Writer object. See the various
    SequenceFile.createWriter() overloads.

    --
    Harsh J
    www.harshj.com
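
The idea in the replies above (gzip applied per block, with the container format drawing its own block boundaries) can be sketched with plain Java and no Hadoop dependency. This is only an illustration under stated assumptions: the class BlockGzipSketch, its methods, and the [length][gzip bytes] layout are hypothetical and are not the real SequenceFile format, which also has a header and sync markers. In Hadoop itself, as far as I know, the buffer size for block compression is read from the io.seqfile.compress.blocksize configuration property; treat that as something to verify against your Hadoop version.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical container, NOT the SequenceFile format: a sequence of
// [int compressedLength][gzip bytes] blocks, each compressed on its own.
public class BlockGzipSketch {

    // Compress one buffer as a complete, self-contained gzip stream.
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(raw);
        }
        return bos.toByteArray();
    }

    // Decompress one self-contained gzip stream.
    static byte[] gunzip(byte[] compressed) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        return out.toByteArray();
    }

    // Write the pending raw bytes as one block and remember where it starts.
    static void flushBlock(ByteArrayOutputStream pending, DataOutputStream out,
                           List<Integer> offsets) throws IOException {
        offsets.add(out.size()); // bytes written so far = start offset of this block
        byte[] gz = gzip(pending.toByteArray());
        out.writeInt(gz.length);
        out.write(gz);
        pending.reset();
    }

    // Buffer records until at least blockSize raw bytes are pending, then flush a block.
    // Block start offsets are appended to offsetsOut; they play the role of split points.
    static byte[] write(List<String> records, int blockSize, List<Integer> offsetsOut)
            throws IOException {
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(file);
        ByteArrayOutputStream pending = new ByteArrayOutputStream();
        for (String r : records) {
            pending.write((r + "\n").getBytes(StandardCharsets.UTF_8));
            if (pending.size() >= blockSize) {
                flushBlock(pending, out, offsetsOut);
            }
        }
        if (pending.size() > 0) {
            flushBlock(pending, out, offsetsOut);
        }
        out.flush();
        return file.toByteArray();
    }

    // Read and decompress the single block starting at offset, ignoring all earlier bytes.
    static String readBlock(byte[] file, int offset) throws IOException {
        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(file, offset, file.length - offset));
        byte[] gz = new byte[in.readInt()];
        in.readFully(gz);
        return new String(gunzip(gz), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        List<String> records = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            records.add("record-" + i);
        }
        List<Integer> offsets = new ArrayList<>();
        byte[] file = write(records, 256, offsets);
        // Each offset could be handed to a different worker, much as a mapper gets a split.
        for (int off : offsets) {
            System.out.println("block at " + off + ": " + readBlock(file, off).length() + " chars");
        }
    }
}
```

Because every block is a complete gzip stream, a reader can begin at any recorded block offset without touching the preceding data, which is what a plain .gz file does not allow. A smaller blockSize gives more split points at the cost of a worse compression ratio.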

Discussion Overview
group: mapreduce-user
categories: hadoop
posted: Jan 31, '11 at 8:26a
active: Jan 31, '11 at 5:41p
posts: 5
users: 3
website: hadoop.apache.org...
irc: #hadoop
