I have a few questions about compression and indexing in Hive.
I would like to compress our Hive data (currently stored as
SequenceFiles). I also have an index on this table and would like to
keep using it once the data is compressed.
Question 1:
SequenceFile compression can be block-based or record-based. For
indexing to work, do I need block-based compression? If both block-based
and record-based compression work with indexing, can someone provide
insight into when to use which?
Question 2:
BZip2 is also a block-based compression format and is splittable in
Hadoop. Do you see any issues with storing data in BZip2 files and using
indexing on that data?
Question 3 (and perhaps the most important):
What are the best practices for compression (with or without indexing)?
Are people typically using SequenceFile compression rather than other
compression formats (like BZip2)? If using SequenceFile compression, are
they using record-based or block-based?
Thank you in advance!
Mark Grover, Business Intelligence Analyst
www: oanda.com www: fxtrade.com