Jan 31, 2011 at 5:41 pm
On Mon, Jan 31, 2011 at 10:41 PM, Sean Bigdatafun
On Mon, Jan 31, 2011 at 12:36 AM, Niels Basjes wrote:
2011/1/31 Sean Bigdatafun <email@example.com>:
GZIP is not splittable.
Correct, gzip is a stream compression format, which effectively means
you can only start decompressing at the beginning of the data.
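To illustrate the point about gzip streams, here is a minimal Python sketch (not Hadoop code; the helper name try_decompress_from and the sample records are made up for the illustration). It compresses some data as a single gzip stream and then tries to decompress it starting from the middle, which is roughly what a mapper handed the second half of a plain .gz file would have to do.

```python
import gzip
import zlib

# Sample data: many small records compressed as ONE gzip stream.
data = b"".join(b"record %d\n" % i for i in range(10000))
compressed = gzip.compress(data)


def try_decompress_from(offset):
    """Return True if decompression succeeds when started at `offset`."""
    try:
        gzip.decompress(compressed[offset:])
        return True
    except (OSError, EOFError, zlib.error):
        # No gzip header at this position, or the deflate state is unknown.
        return False


started_at_beginning = try_decompress_from(0)
started_in_middle = try_decompress_from(len(compressed) // 2)
```

Starting at offset 0 works; starting in the middle fails because there is no gzip header there and the decompressor lacks the state built up from the earlier bytes.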
Does that mean a GZIP block-compressed SequenceFile can't take advantage
of MR parallelism?
AFAIK it should be splittable in the same blocks as the compression was
Splittable within the same block?
Normally, each mapper would pick an HDFS block (64MB in an HDFS with default
configuration) of a 1GB file for map processing if the file were not GZIP
compressed; this is the scenario for an uncompressed file.
But as GZIP is not splittable, how (if at all) can a mapper pick a block? (If it
can't, then we can't use the MapReduce framework for parallelism.)
Can you give a more detailed answer?
The base fact is that GZip is not a splittable compression algorithm,
but SequenceFiles can be written with a set 'block size' for its
records, and can also be Block-Compressed with a chosen algorithm.
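The idea of block compression can be sketched in a few lines of Python. This is a toy container, not the real SequenceFile format: as far as I know, SequenceFile marks block boundaries with sync markers inside the file, whereas this sketch simply keeps a list of offsets. The names write_blocks and read_block and the length-prefix layout are assumptions made for the illustration. The point is only that each block is compressed on its own, so a reader can start at any block boundary.

```python
import gzip
import struct


def write_blocks(records, records_per_block):
    """Compress each group of records independently.

    Returns the container bytes and the byte offset of every block.
    """
    out = bytearray()
    offsets = []
    for i in range(0, len(records), records_per_block):
        block = gzip.compress(b"".join(records[i:i + records_per_block]))
        offsets.append(len(out))
        out += struct.pack(">I", len(block))  # 4-byte length prefix
        out += block
    return bytes(out), offsets


def read_block(container, offset):
    """Decompress the single block that starts at `offset`."""
    (length,) = struct.unpack_from(">I", container, offset)
    start = offset + 4
    return gzip.decompress(container[start:start + length])


records = [b"record %d\n" % i for i in range(1000)]
container, offsets = write_blocks(records, records_per_block=100)

# A "mapper" can begin at any block boundary without reading earlier blocks.
third_block = read_block(container, offsets[2])
```

Here third_block holds records 200 through 299, and it was recovered without touching the first two blocks. That is the property that makes such a file splittable even though gzip itself is not.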
SequenceFile draws its own 'block' boundaries and thus can let you
achieve a splittable file with GZip compression applied in its made-up