Hi,

The size of my gzipped weblog files is about 35MB. However, upon enabling
block compression and inserting the logs into another Hive table
(SequenceFile), the file size bloats to about 233MB. I've done similar
processing on a local Hadoop/Hive cluster, and while the compression is not
as good as gzipping, it is still not this bad. What could be going wrong?

I looked at the header of the resulting file and here's what it says:

SEQ^F"org.apache.hadoop.io.BytesWritable^Yorg.apache.hadoop.io.Text^A^@'org.apache.hadoop.io.compress.GzipCodec

Does Amazon Elastic MapReduce behave differently or am I doing something
wrong?

Saurabh.
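
For reference, the kind of setup described above looks roughly like this as a Hive script (a minimal sketch only; the table and column names are hypothetical, and the two SET lines are the ones quoted later in the thread):

    -- Hypothetical staging and target tables; the target is a SequenceFile table.
    CREATE TABLE weblogs_seq (line STRING)
    STORED AS SEQUENCEFILE;

    -- Ask for compressed, block-mode SequenceFile output before the INSERT.
    SET hive.exec.compress.output=true;
    SET io.seqfile.compression.type=BLOCK;

    INSERT OVERWRITE TABLE weblogs_seq
    SELECT line FROM weblogs_raw;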

  • Zheng Shao at Feb 1, 2010 at 7:53 am
    I would first check whether it is really block compression or
    record compression. Also, maybe the block size is too small, but I am
    not sure whether that is tunable in SequenceFile.

    Zheng
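
The header pasted in the original message is itself a clue here: in a SequenceFile header, the two single-byte flags after the value class name are the "compressed" and "block compressed" booleans, and ^A followed by ^@ appears to decode to compressed=true, block-compressed=false, i.e. record compression with GzipCodec, which would explain the poor ratio. On the Hive side, the settings that will be in effect for the INSERT can be echoed in the same session (a minimal sketch; in the Hive CLI, SET with just a property name prints its current value):

    SET hive.exec.compress.output;
    SET io.seqfile.compression.type;
    SET mapred.output.compress;
    SET mapred.output.compression.codec;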
  • Saurabh Nanda at Feb 3, 2010 at 8:56 am
    Thanks, Zheng. Will do some more tests and get back.

    Saurabh.
  • Saurabh Nanda at Feb 18, 2010 at 4:26 pm
    Hi Zheng,

    I cross-checked. I am setting the following in my Hive script before the
    INSERT command:

    SET io.seqfile.compression.type=BLOCK;
    SET hive.exec.compress.output=true;

    A 132 MB (gzipped) input file, after going through a cleanup and being loaded
    into a SequenceFile table, grows to 432 MB. What could be going wrong?

    Saurabh.
  • Zheng Shao at Feb 18, 2010 at 7:08 pm
    Did you also:

    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

    Zheng
  • Saurabh Nanda at Feb 19, 2010 at 1:46 pm
    I'm confused here, Zheng. There are two sets of configuration variables:
    those starting with io.* and those starting with mapred.*. To make sure
    that the final output table is compressed, which ones do I have to set?

    Saurabh.
  • Saurabh Nanda at Feb 19, 2010 at 1:53 pm
    And also hive.exec.compress.*. So that makes it three sets of configuration
    variables:

    mapred.output.compress.*
    io.seqfile.compress.*
    hive.exec.compress.*

    What's the relationship between these configuration parameters, and which
    ones should I set to get a well-compressed output table?

    Saurabh.
  • Zheng Shao at Feb 19, 2010 at 6:09 pm
    hive.exec.compress.output controls whether or not to compress Hive
    output. (This overrides mapred.output.compress in Hive.)

    All other compression flags are from Hadoop. Please see
    http://hadoop.apache.org/common/docs/r0.18.0/hadoop-default.html

    Zheng
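
Putting the whole thread together, a script that requests block-compressed, gzip-coded SequenceFile output would look roughly like this (a sketch only: hive.exec.compress.output is the Hive-level switch described above, the rest are Hadoop-level flags, and both "type" properties are set here because the thread does not settle which one the output format consults):

    SET hive.exec.compress.output=true;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
    SET mapred.output.compression.type=BLOCK;
    SET io.seqfile.compression.type=BLOCK;

    -- followed by the INSERT into the SequenceFile table.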
