Mar 3, 2009 at 8:44 am
Thanks for posting this, Johan,
I tried to handle gzip files but could not, for the reasons you state,
and resorted to uncompressed input. I will try the LZO format and post the
performance difference between compressed and uncompressed data on EC2,
which seems to have very slow disk I/O. We have seen really bad import
speeds (worse than Mac minis, even with the largest instances) on
PostGIS and MySQL with EC2, so I think this could be very relevant to
EC2 users.
On Tue, Mar 3, 2009 at 9:32 AM, Johan Oskarsson wrote:
Thought I'd pass on this blog post I just wrote about how we compress our
raw log data in Hadoop using LZO at Last.fm.
The essence of the post is that we're able to make the compressed files
splittable by indexing where each compressed chunk starts in the file,
similar to the gzip input format being worked on.
This actually gives us a performance boost in certain jobs that read a lot
of data while saving us disk space at the same time.
http://blog.oskarsson.nu/2009/03/hadoop-feat-lzo-save-disk-space-and.html
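
For anyone who wants to see the indexing idea in miniature, here is a rough Python sketch. It is not the Last.fm code and does not use the real LZO file format: zlib from the standard library stands in for LZO, and the length-prefixed chunk layout and the function names (write_chunked, make_splits, read_split) are made up for illustration. It only shows why recording where each compressed chunk starts lets several readers work on one file independently.

```python
import io
import struct
import zlib


def write_chunked(records, out, chunk_records=2):
    """Compress records in independent chunks and return each chunk's start offset.

    The returned list of offsets plays the role of the index file.
    Records are assumed not to contain newlines.
    """
    offsets = []
    for i in range(0, len(records), chunk_records):
        text = "\n".join(records[i:i + chunk_records])
        payload = zlib.compress(text.encode("utf-8"))
        offsets.append(out.tell())
        out.write(struct.pack(">I", len(payload)))  # 4-byte big-endian length header
        out.write(payload)
    return offsets


def make_splits(file_size, split_size):
    """Cut the file into fixed-size byte ranges, ignoring chunk boundaries."""
    return [(s, min(s + split_size, file_size))
            for s in range(0, file_size, split_size)]


def read_split(data, offsets, start, end):
    """Decode every chunk whose start offset falls in [start, end).

    A chunk that begins inside the split is read in full even if it runs
    past `end`, so each chunk is handled by exactly one split.
    """
    lines = []
    for off in offsets:
        if start <= off < end:
            (size,) = struct.unpack_from(">I", data, off)
            chunk = zlib.decompress(data[off + 4:off + 4 + size])
            lines.extend(chunk.decode("utf-8").split("\n"))
    return lines


if __name__ == "__main__":
    buf = io.BytesIO()
    index = write_chunked(["a", "b", "c", "d", "e"], buf)
    data = buf.getvalue()
    for start, end in make_splits(len(data), 16):
        print((start, end), read_split(data, index, start, end))
```

Without the index, a reader handed an arbitrary byte range has no reliable way to find the next point where decompression can begin, which is why a plain compressed file ends up being processed by a single task.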