Apr 15, 2010 at 9:40 am
By default, compressed files are not splittable: you lose the ability to control splits, and each file is essentially read as one split by one mapper.
There has been some discussion around this for bzip2 and gzip, and some fixes have been made to allow bzip2 to be splittable. See HADOOP-4012.
Also, Kevin's hadoop-lzo provides LZO compression and LzoTextInputFormat, which overcome this disadvantage and are faster. See http://github.com/kevinweil/hadoop-lzo
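
To illustrate why a gzip file ends up as a single split, here is a small standalone Python sketch (not Hadoop code). It compresses some made-up records and then tries to start decoding from the middle of the stream, which is roughly what a mapper handed the second half of the file would have to do. The helper name can_decompress_from and the sample data are just for illustration.

```python
import gzip
import zlib


def can_decompress_from(data: bytes, offset: int) -> bool:
    """Return True if a gzip decoder can start reading at `offset`."""
    # wbits=31 tells zlib to expect a gzip header at the start of the input.
    decoder = zlib.decompressobj(31)
    try:
        decoder.decompress(data[offset:])
        return True
    except zlib.error:
        # No gzip header here, so the decoder cannot start at this point.
        return False


# Illustrative input: 100,000 short text records, compressed as one gzip stream.
payload = b"".join(b"record %d\n" % i for i in range(100000))
compressed = gzip.compress(payload)

start_ok = can_decompress_from(compressed, 0)
middle_ok = can_decompress_from(compressed, len(compressed) // 2)

print("decode from start: ", start_ok)
print("decode from middle:", middle_ok)
```

A reader can only begin at the start of the stream, so the file cannot be cut at arbitrary byte offsets the way an uncompressed text file can. Block-oriented or indexed formats (bzip2, indexed LZO) work around this by giving readers known points at which decoding can begin.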
On 4/15/10 6:56 AM, "abhishek sharma" wrote:
I created some data using the randomwriter utility and compressed the
map task outputs using the options
I set the bytes per map to 128 MB, but due to compression the final
size of each map task's output is around 75 MB.
I want to use these individual 75MB compressed files as input to
another Map task.
How do I get Hadoop to first decompress the files before computing the
input splits for the map tasks?