Devaraj Das
at Nov 13, 2007 at 5:24 pm
Yes, io.seqfile.compression.type controls compression of only the mapred files. One
way you can compress files on the DFS, independent of mapred, is to use the
java.util.zip package over the OutputStream that
DistributedFileSystem.create returns. For example, you can use
java.util.zip.GZIPOutputStream: pass the
org.apache.hadoop.fs.FSDataOutputStream that
org.apache.hadoop.dfs.DistributedFileSystem.create() returns as an argument
to the GZIPOutputStream constructor.
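
A minimal sketch of that approach (it assumes fs.default.name points at your
HDFS, and the path and written bytes are just placeholders):

    import java.util.zip.GZIPOutputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class GzipToDfs {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // FileSystem.get() returns the DistributedFileSystem when
        // fs.default.name points at HDFS.
        FileSystem fs = FileSystem.get(conf);

        // create() returns an FSDataOutputStream; wrapping it in a
        // GZIPOutputStream means everything written ends up gzip-compressed
        // in the file on the DFS.
        FSDataOutputStream out = fs.create(new Path("/logs/app.log.gz"));
        GZIPOutputStream gzOut = new GZIPOutputStream(out);

        gzOut.write("some log data\n".getBytes("UTF-8"));
        gzOut.finish();  // write the gzip trailer
        gzOut.close();   // also closes the underlying FSDataOutputStream
      }
    }

You would then read it back by wrapping the FSDataInputStream from
fs.open() in a java.util.zip.GZIPInputStream the same way.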
-----Original Message-----
From: Michael Harris
Sent: Tuesday, November 13, 2007 10:27 PM
To: hadoop-user@lucene.apache.org
Subject: File Compression
I have a question about file compression in Hadoop. When I
set io.seqfile.compression.type=BLOCK, does this also
compress actual files I load into the DFS, or does it only
control the map/reduce file compression? If it doesn't
compress the files on the file system, is there any way to
compress a file when it's loaded? The concern here is that I
am just getting started with Pig/Hadoop and have a very small
cluster of around 5 nodes. I want to limit IO wait by
compressing the actual data. As a test, when I compressed our
4 GB log file using rar it came out at only 280 MB.
Thanks,
Michael