File Compression
I have a question about file compression in Hadoop. When I set io.seqfile.compression.type=BLOCK, does this also compress the actual files I load into the DFS, or does it only control map/reduce file compression? If it doesn't compress files on the file system, is there any way to compress a file when it's loaded? The concern is that I am just getting started with Pig/Hadoop and have a very small cluster of around 5 nodes, so I want to limit I/O wait by compressing the actual data. As a test, when I compressed our 4 GB log file using rar, it came out to only 280 MB.

Thanks,
Michael

  • Arun C Murthy at Nov 13, 2007 at 5:18 pm
    Michael,
    If you are loading files into HDFS as a SequenceFile and you set io.seqfile.compression.type=BLOCK (or RECORD), the file will have compressed records. Equivalently, you can use one of the many SequenceFile.createWriter methods (see http://lucene.apache.org/hadoop/api/org/apache/hadoop/io/SequenceFile.html) to specify the compression type, compression codec, etc. A short sketch follows below.

    Arun
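
    A minimal sketch of the createWriter approach Arun mentions, against the Hadoop 0.x-era API. The output path, the LongWritable/Text key and value classes, the DefaultCodec choice, and the class name are illustrative assumptions, not from the thread:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.DefaultCodec;

    public class CompressedSeqFileLoad {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // Hypothetical destination path in the DFS.
            Path out = new Path("/logs/access.seq");

            // Ask for BLOCK compression explicitly instead of relying on
            // io.seqfile.compression.type being set in the configuration.
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, out,
                    LongWritable.class, Text.class,
                    SequenceFile.CompressionType.BLOCK,
                    new DefaultCodec());
            try {
                writer.append(new LongWritable(1L), new Text("first log line"));
            } finally {
                writer.close();
            }
        }
    }

    Passing the compression type to createWriter keeps the choice in code rather than cluster configuration, so every record appended to this file is block-compressed regardless of site settings.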
  • Devaraj Das at Nov 13, 2007 at 5:24 pm
    Yes, io.seqfile.compression.type controls compression of only the map/reduce files. One way to compress files on the DFS, independent of map/reduce, is to use the java.util.zip package over the OutputStream that DistributedFileSystem.create returns. For example, you can use java.util.zip.GZIPOutputStream: pass the org.apache.hadoop.fs.FSDataOutputStream that org.apache.hadoop.dfs.DistributedFileSystem.create() returns as an argument to the GZIPOutputStream constructor. A sketch follows below.
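
    A minimal sketch of Devaraj's suggestion, assuming the file system is obtained via FileSystem.get() (which yields the DistributedFileSystem when HDFS is the configured default); the path, class name, and payload are illustrative assumptions:

    import java.util.zip.GZIPOutputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class GzipUpload {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // create() hands back an FSDataOutputStream; wrapping it in a
            // GZIPOutputStream compresses bytes before they reach the DFS.
            FSDataOutputStream raw = fs.create(new Path("/logs/access.log.gz")); // hypothetical path
            GZIPOutputStream gz = new GZIPOutputStream(raw);
            try {
                gz.write("a line of log data\n".getBytes("UTF-8"));
            } finally {
                // close() writes the gzip trailer and closes the underlying stream.
                gz.close();
            }
        }
    }

    Closing the GZIPOutputStream is essential: it finishes the gzip stream so the file is a valid .gz archive, and it closes the wrapped FSDataOutputStream for you.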
