I have some questions regarding Hive that I hope you can help me with. I haven't had much luck with the documentation on these topics, so any tips would be much appreciated.
I initially posted these on the Amazon Elastic MapReduce message board, since some of them are S3-related, but I have gotten no response there.
- Can you create an external table over .bz2 files? For example, if I push a bunch of .bz2-compressed log files to a directory, can I select rows directly from the external table? If not, what is the best way to load the .bz2 files into a temporary Hive table so that I can do wildcarding?
All of these files are in subdirectories, with the directory names serving as partition names.
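To make this concrete, here is roughly what I have in mind (the bucket name, column names, and partition values are made up, and I'm assuming Hive's default text input format will pick up the bzip2 codec transparently):

```sql
-- Hypothetical layout: s3n://my-bucket/logs/dt=2009-11-01/part-00000.bz2, etc.
CREATE EXTERNAL TABLE logs (
  ts  STRING,
  msg STRING
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3n://my-bucket/logs/';

-- Each subdirectory would then be registered as a partition:
ALTER TABLE logs ADD PARTITION (dt='2009-11-01')
  LOCATION 's3n://my-bucket/logs/dt=2009-11-01/';
```

If that works as-is against compressed files, great; if not, I'd like to know the recommended workaround.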
- Is there some trick to storing compressed Hive tables that isn't clearly documented? I tried the recipe in the Hive tutorial but didn't have much luck. Has anyone here had any success? This is with Hive 0.4.0 on Amazon's cloud.
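For reference, the recipe I tried was along these lines (the target table name is made up; the codec class name is the standard Hadoop one):

```sql
SET hive.exec.compress.output=true;
SET mapred.output.compress=true;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

-- Write a compressed copy of the data:
INSERT OVERWRITE TABLE logs_compressed
SELECT * FROM logs;
```

The job runs, but I'm not convinced the output actually ends up compressed, so I may be missing a setting.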
- Has anyone tried compressing the tables with .bz2 instead? Is there an easy way to stream-compress the data when using the S3 interfaces?
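I'm guessing the bzip2 variant would just be a codec swap in the same recipe, but I haven't verified this:

```sql
-- Presumably switching the codec is enough; BZip2Codec ships with Hadoop.
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec;
```

If there's more to it than that, pointers would be appreciated.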
- Which is more efficient for storing tables in S3: the block-based s3:// filesystem or the native s3n:// filesystem?
Thanks for your help!