Hi,

Thought I'd pass on this blog post I just wrote about how we compress
our raw log data in Hadoop using LZO at Last.fm.

The essence of the post is that we're able to make the files splittable
by indexing where each compressed block starts in the file, similar to
the splittable gzip input format being worked on. This actually gives
us a performance boost in certain jobs that read a lot of data, while
saving disk space at the same time.
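
To give a rough idea, the index is just a list of byte offsets where
each compressed block starts. Below is a simplified sketch of building
one (not our actual code: it assumes an lzop-style container in which
every block is prefixed by 4-byte big-endian uncompressed and
compressed lengths, and it ignores the per-block checksum fields a real
file may carry):

    import java.io.DataInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    public class LzoBlockIndexer {
        // Record the byte offset of every compressed block so that a job
        // can later align splits to block boundaries and seek straight
        // to the first block of its split.
        public static List<Long> indexBlocks(String path, long headerLen)
                throws IOException {
            List<Long> offsets = new ArrayList<Long>();
            DataInputStream in = new DataInputStream(new FileInputStream(path));
            try {
                in.skipNBytes(headerLen);          // skip the file header
                long pos = headerLen;
                while (true) {
                    int uncompressedLen = in.readInt();
                    if (uncompressedLen == 0) {
                        break;                     // a zero length ends the stream
                    }
                    int compressedLen = in.readInt();
                    offsets.add(pos);              // this block starts here
                    in.skipNBytes(compressedLen);  // jump over the compressed data
                    pos += 8 + compressedLen;      // two length fields + payload
                }
            } finally {
                in.close();
            }
            return offsets;
        }
    }

With offsets like these on hand, an input format can hand each mapper a
range of whole blocks that it can decompress independently of the rest
of the file.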

http://blog.oskarsson.nu/2009/03/hadoop-feat-lzo-save-disk-space-and.html

/Johan


  • Miles Osborne at Mar 3, 2009 at 8:38 am
    That's very interesting. For us poor souls using streaming, would we
    be able to use it?

    (Right now I'm looking at a 100+ GB gzipped file...)

    Miles

  • Johan Oskarsson at Mar 3, 2009 at 8:40 am
    We use it with Python (Dumbo) and streaming, so it should certainly be
    possible. I haven't tried it myself though, so can't give any detailed
    pointers.
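
    The general shape with streaming would presumably be to hand the job
    a splittable LZO input format on the command line, something like the
    untested sketch below (the input format class name here is a guess;
    use whatever your LZO library build actually provides):

        hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
            -inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat \
            -input /logs/raw.lzo \
            -output /logs/out \
            -mapper mapper.py \
            -reducer reducer.py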

    /Johan

  • Tim Robertson at Mar 3, 2009 at 8:44 am
    Thanks for posting this, Johan.

    I tried unsuccessfully to handle gzip files for the reasons you state
    and resorted to uncompressed files. I will try the LZO format and post
    the performance difference of compressed vs. uncompressed on EC2, which
    seems to have very slow disk I/O. We have seen really bad import speeds
    on PostGIS and MySQL with EC2 (worse than Mac minis, even with the
    largest instances), so I think this might be very applicable to EC2
    users.

    Cheers,

    Tim



Discussion Overview
group: common-user
categories: hadoop
posted: Mar 3, '09 at 8:32a
active: Mar 3, '09 at 8:44a
posts: 4
users: 3
website: hadoop.apache.org...
irc: #hadoop
