On Thursday, April 18, 2013 01:06:06 UTC+2, Justin Erickson wrote:
> In general, even with just MapReduce, we'd recommend against using GZIP
> compressed text files for the following reasons:
> * GZIP with a non-splittable file format (i.e. text files) will require
> remote reads to process the entire file for files larger than an HDFS block
> * GZIP is a very CPU-expensive compression codec optimized for storage
> density above performance so it will often be a performance bottleneck

Just a comment: I replaced uncompressed text files with gzipped files
because gzip -9 compressed both better *and* faster than lzop -9, and Hive
queries over the gzipped data run about 3 times faster than the same
queries over uncompressed files.
Maybe my setup is special in some way (compared to the rest of the
group here), but it works very well for me. Maybe the disks in my cluster
are the bottleneck and gzip just helps in this case. I don't know.
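
For anyone who wants to check this on their own data, here is a minimal sketch of how one might measure ratio versus time. It uses Python's standard zlib module (the same DEFLATE algorithm gzip uses); LZO is not in the standard library, so it is not covered here. The function name `measure` and the sample data are made up for illustration, and timings will differ from the gzip and lzop command-line tools.

```python
import time
import zlib


def measure(data: bytes, level: int):
    """Return (compression ratio, elapsed seconds) for zlib at `level`."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    return len(data) / len(compressed), elapsed


if __name__ == "__main__":
    # Hypothetical sample: repetitive tab-separated log lines.
    sample = b"".join(
        b"2013-04-18\tuser%d\tclick\n" % (i % 1000) for i in range(200_000)
    )
    for level in (1, 6, 9):
        ratio, secs = measure(sample, level)
        print(f"level {level}: ratio {ratio:.1f}x, {secs:.3f}s")
```

Running it on a real extract of your table files, rather than on synthetic lines, is what makes the numbers meaningful.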

LZO is a pain, and not only because of the licence issues and the work
needed to bring the native codec online. Everyone here seems to ignore that
LZO (like gzip) is *not* splittable on its own. You need the indexer and its
.index files for that. Please correct me if I'm wrong. When you write LZO
files larger than your block size, you get a file that cannot be split by
default. And with generated index files you have your data mixed with
metadata, which is a pain, too!
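
To illustrate what an index buys you, here is a rough sketch of the idea, not the actual hadoop-lzo index format. It uses concatenated gzip members from Python's standard library as a stand-in for LZO blocks, and the helper names (`write_indexed`, `read_from`, `can_start_at`) are made up. A reader can start decompressing only at an offset recorded in the index; any other offset fails.

```python
import gzip
import io
import zlib


def write_indexed(chunks):
    """Compress each chunk as its own gzip member.

    Returns (blob, offsets), where offsets[i] is the byte position at
    which member i starts. The offsets list plays the role of an index.
    """
    out = io.BytesIO()
    offsets = []
    for chunk in chunks:
        offsets.append(out.tell())
        out.write(gzip.compress(chunk))
    return out.getvalue(), offsets


def read_from(blob, offset):
    """Decompress the single gzip member that starts at `offset`."""
    decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)  # expect gzip framing
    return decoder.decompress(blob[offset:])


def can_start_at(blob, offset):
    """True if decompression can begin at `offset`, False otherwise."""
    try:
        read_from(blob, offset)
        return True
    except zlib.error:
        return False


if __name__ == "__main__":
    chunks = [b"a\n" * 1000, b"b\n" * 1000, b"c\n" * 1000]
    blob, offsets = write_indexed(chunks)
    print(read_from(blob, offsets[1]) == chunks[1])  # indexed offset
    print(can_start_at(blob, offsets[1] + 1))        # arbitrary offset
```

Without the offsets list, a task handed the second half of the file has no way to find a valid starting point, which is the situation described above for files written without an index.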


Group: impala-user
Posted: Apr 17, '13 at 9:49p
Active: Jan 17, '14 at 9:45a


