FAQ
All:

I have some questions regarding Hive that I hope you can help me with. I haven't had much luck with the documentation on these, so any tips would be much appreciated.

I initially posted these on the Amazon Elastic MapReduce message board, since some are S3-related, but I have gotten no love there.

- Can you create an external table that covers .bz2 files? For example, if I push a bunch of .bz2-compressed log files to a directory, can I select rows directly from the external table? If not, what is the best way of loading the .bz2 files into a temporary Hive table so that I can do wildcarding?

All of these files are in subdirectories, with the directory names serving as partition names.
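
For concreteness, a sketch of what such an external table might look like; the table name, columns, and bucket are hypothetical, and it assumes the cluster's Hadoop ships a codec that can read the files:

    -- Hypothetical layout: tab-delimited log lines pushed to
    -- s3n://mybucket/logs/dt=2010-01-25/part-0000.bz2 and so on.
    CREATE EXTERNAL TABLE logs (
      ts      STRING,
      host    STRING,
      request STRING
    )
    PARTITIONED BY (dt STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE
    LOCATION 's3n://mybucket/logs/';

    -- Register each dated subdirectory as a partition.
    ALTER TABLE logs ADD PARTITION (dt='2010-01-25')
      LOCATION 's3n://mybucket/logs/dt=2010-01-25/';

    -- Reads then decompress transparently, provided the codec is available:
    SELECT host, COUNT(1) FROM logs WHERE dt='2010-01-25' GROUP BY host;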

- Is there some trick to storing compressed Hive tables that isn't clearly documented? I tried the recipe in the Hive tutorial but didn't have much luck. Has anyone here had any success? This is using Hive 0.4.0 in Amazon's cloud.

- Has anyone tried compressing the tables with bzip2 instead? Is there an easy way of stream-compressing the data when using the S3 interfaces?
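
For the stream-compression part, one possible approach (bucket and paths hypothetical; it relies on "hadoop fs -put" accepting "-" to read from stdin, and the result is only readable on a Hadoop version that ships the bzip2 codec, 0.19 or later per the replies below):

    # Compress on the fly and write straight to S3, with no uncompressed staging copy.
    bzip2 -c access.log | hadoop fs -put - s3n://mybucket/logs/dt=2010-01-25/access.log.bz2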

- Which is more efficient: storing the tables in S3 as s3:// or s3n://?

Thanks for your help!

Adam

  • Carl Steinbach at Jan 25, 2010 at 6:59 pm
    Hi Adam,

    Hive relies on the underlying Hadoop implementation for compression
    support, i.e. whether Hive can read bzip2-compressed files depends on
    whether the Hadoop cluster the files are stored in supports the bzip2
    compression codec. Support for bzip2 was added in Hadoop 0.19, and it
    looks like Amazon's EMR is running a variant of Hadoop 0.18.3, which
    supports gzip but not bzip2.
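
    To illustrate, the codecs a cluster recognizes are listed in the
    io.compression.codecs property of the Hadoop configuration; a sketch of
    the entry on a 0.19+ cluster that includes bzip2 (not an EMR-specific
    recipe):

        <!-- hadoop-site.xml (core-site.xml on later versions) -->
        <property>
          <name>io.compression.codecs</name>
          <value>org.apache.hadoop.io.compress.DefaultCodec,
                 org.apache.hadoop.io.compress.GzipCodec,
                 org.apache.hadoop.io.compress.BZip2Codec</value>
        </property>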

    There is a discussion of these issues on the Amazon EMR help forum here:
    http://developer.amazonwebservices.com/connect/thread.jspa?messageID=145636

    Thanks.

    Carl
  • Adam J. O'Donnell at Jan 25, 2010 at 7:41 pm
    Damn, thanks for the tip, Carl. I forgot that the version of Hadoop Amazon is running is a little old.

    Any ideas on getting table compression on output to work? For example, do I have to specify hive.exec.compress.output on every run, or should I put that into my hive-site.xml? I assume that isn't stored in the metadata store, is it? Also, how efficient is the block storage? Is there a knob I can adjust on that?

    Thanks!
  • Zheng Shao at Jan 26, 2010 at 1:55 am
    You can just put it in hive-site.xml.
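
    For reference, a sketch of the hive-site.xml entries involved (gzip
    rather than bzip2 here, since EMR's Hadoop 0.18.3 lacks the bzip2
    codec); the same properties can also be set per session with SET:

        <!-- hive-site.xml: compress the files Hive jobs write -->
        <property>
          <name>hive.exec.compress.output</name>
          <value>true</value>
        </property>
        <property>
          <name>mapred.output.compression.codec</name>
          <value>org.apache.hadoop.io.compress.GzipCodec</value>
        </property>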

    The block storage is pretty efficient since the compression codec is
    native. It should be close to what you get with command-line tools
    like "bzip2" and "gzip".

    Zheng
