Hi all,
I have a question regarding compression and indexing.

I would like to compress our Hive data (currently stored as
SequenceFile). I also have an index on this table and would like to
keep using it.

Question 1:
SequenceFile compression can be block-based or record-based. For indexing
to work, do I need block-based compression? If both block-based and
record-based compression work with indexing, can someone provide
insight into which to use when?

Question 2:
BZip2 is also a block-based compression format and is splittable in
Hadoop. Do you see any issues with storing data in BZip2 files and using
indexing on that data?

Question 3 (and perhaps, the most important):
What are the best practices for compression (with or without indexing)?
Are folks typically using SequenceFile compression rather than other
compression formats (like BZip2)? If using SequenceFile compression, are
folks using record-based or block-based?


Thank you in advance!
Mark

--
Mark Grover, Business Intelligence Analyst
OANDA Corporation

www: oanda.com www: fxtrade.com
e: mgrover@oanda.com

"Best Trading Platform" - World Finance's Forex Awards 2009.
"The One to Watch" - Treasury Today's Adam Smith Awards 2009.


  • Yongqiang he at Sep 15, 2011 at 9:41 pm
    Question 1:
    Indexing should work for both, but I suggest you use block compression.
    Question 3 (and perhaps, the most important):
    Block-based compression.
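
    One practical point that applies to either compression type: a compact
    index records the data file names and offsets of the indexed rows, so
    rewriting the table's files with compression leaves existing index
    entries pointing at the old layout. The sketch below shows a rebuild
    after the rewrite. The table `events`, column `id`, and index
    `events_id_idx` are hypothetical names used only for illustration, and
    the statements are an outline under those assumptions rather than a
    tested recipe for Hive 0.7.1.

    ```sql
    -- Hypothetical names: table `events`, indexed column `id`,
    -- index `events_id_idx`. Shown for context: an index created
    -- with Hive's compact index handler.
    CREATE INDEX events_id_idx ON TABLE events (id)
    AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
    WITH DEFERRED REBUILD;

    -- After the table's files have been rewritten with compression,
    -- rebuild so the index entries refer to the new files.
    ALTER INDEX events_id_idx ON events REBUILD;
    ```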

  • Mark Grover at Sep 16, 2011 at 2:07 pm
    Thanks, Yongqiang!

    Could you please confirm my understanding of how to use block compression?

    As of now, I am setting these properties before populating the table
    that should contain compressed data:
    SET io.seqfile.compression.type=BLOCK;
    SET hive.exec.compress.output=true;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
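
    For context, here is a minimal sketch of how settings like these are
    usually applied within one session. The table names `events_raw` and
    `events_compressed` and their columns are hypothetical; only the three
    SET statements come from this post.

    ```sql
    -- Session settings from this post: compressed output,
    -- SequenceFile block compression, Gzip codec.
    SET hive.exec.compress.output=true;
    SET io.seqfile.compression.type=BLOCK;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

    -- Hypothetical target table stored as SequenceFile.
    CREATE TABLE events_compressed (id BIGINT, payload STRING)
    STORED AS SEQUENCEFILE;

    -- The settings affect files written by this query; files already
    -- on disk are not changed by the SET statements alone.
    INSERT OVERWRITE TABLE events_compressed
    SELECT id, payload FROM events_raw;
    ```

    The settings are session-scoped, so they need to be issued in the same
    session (or script) as the INSERT that writes the compressed files.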

    Question 1:
    Do I need to set io.seqfile.compress.blocksize? If so, to what? It's set
    to 1000000 by default.

    Question 2:
    Do I need to set hive.merge.mapfiles? If so, to what? It's set to true
    by default.

    Question 3:
    Any other options I need to set up?

    Thanks again!

    Mark
    P.S. I am using Hive 0.7.1 with Hadoop 0.20.

Discussion Overview
group: user @
categories: hive, hadoop
posted: Sep 15, '11 at 9:17p
active: Sep 16, '11 at 2:07p
posts: 3
users: 2
website: hive.apache.org

2 users in discussion

Mark Grover: 2 posts
Yongqiang he: 1 post
