I believe there is an architectural mismatch with Hadoop generally, in that gzipped files aren't "splittable" the same way LZO files are, so there's less opportunity for dividing up the work in parallel.

Some background:


I see the idea of splittable gzip has been explored, but I don't know if that ever went anywhere:


On Aug 21, 2013, at 6:19 AM, Jon Bjarnason wrote:

Has there been any movement on this? We are using gzipped text files and are very happy with them. This is a blocker for moving to Impala.

There is a lot of native support for gzipped files in HDFS, and it seems odd that Impala doesn't support them.



On Wednesday, April 17, 2013 11:06:06 PM UTC, Justin Erickson wrote:
Impala supports LZO-compressed and uncompressed text files. GZIP is currently supported with splittable formats such as SequenceFiles, RCFiles, etc.

In general, even with just MapReduce, we'd recommend against using GZIP compressed text files for the following reasons:
* GZIP with a non-splittable file format (i.e. text files) will require remote reads to process the entire file for files larger than an HDFS block
* GZIP is a very CPU-expensive compression codec optimized for storage density above performance so it will often be a performance bottleneck
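
To illustrate the first point, here is a minimal Python sketch (not Impala or Hadoop code) of why a plain gzip stream cannot be split: decoding only works from the start of the stream, so a reader handed a block from the middle of the file has nothing it can decode on its own. The sample data and the helper name `can_decompress_from` are made up for illustration.

```python
import gzip
import zlib

# Hypothetical sample data standing in for a large gzipped text file.
raw = b"".join(f"row-{i}\tvalue-{i}\n".encode() for i in range(50_000))
compressed = gzip.compress(raw)


def can_decompress_from(offset: int) -> bool:
    """Try to start gzip decoding at an arbitrary byte offset."""
    decoder = zlib.decompressobj(wbits=31)  # wbits=31: expect a gzip header
    try:
        decoder.decompress(compressed[offset:])
        return True
    except zlib.error:
        # No gzip header at this offset, so the decoder cannot start here.
        return False


print(can_decompress_from(0))                     # start of the stream
print(can_decompress_from(len(compressed) // 2))  # middle of the stream
```

Only the read from offset 0 succeeds, which is why a single task ends up reading the whole file, including the HDFS blocks stored on other nodes.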

For better performance, we recommend using a splittable file format with Snappy compression such as Snappy-compressed Avro or SequenceFiles. If you need to use text files for external accessibility, LZO-compressed text is probably your best choice.
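
As a rough sketch of what "splittable" buys you, the Python below compresses records in independent blocks and keeps an index of block offsets, so any block can be decompressed without reading the ones before it. This is only an illustration of the idea: it is not the SequenceFile or Avro on-disk layout (those use sync markers rather than a separate index), `zlib` stands in for Snappy because it ships with Python, and the names `write_blocks` and `read_block` are hypothetical.

```python
import zlib


def write_blocks(records, records_per_block=1000):
    """Compress records in independent blocks; return (blob, index)."""
    blob = bytearray()
    index = []  # (offset, length) of each compressed block
    for start in range(0, len(records), records_per_block):
        chunk = b"".join(records[start:start + records_per_block])
        packed = zlib.compress(chunk)  # stand-in for Snappy
        index.append((len(blob), len(packed)))
        blob += packed
    return bytes(blob), index


def read_block(blob, index, block_no):
    """Decompress one block without touching any other block."""
    offset, length = index[block_no]
    return zlib.decompress(blob[offset:offset + length])


records = [f"row-{i}\n".encode() for i in range(5000)]
blob, index = write_blocks(records)

# A worker assigned block 2 can start there directly.
print(read_block(blob, index, 2).splitlines()[0])  # b'row-2000'
```

Because each block is self-contained, separate workers can each take a different block range of the same file, which is the parallelism a single gzip stream does not allow.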

That said, we do have GZIP compression for text files as part of our roadmap considerations, but I don't have a timeline given its current level of feedback relative to other higher-priority items.


On Wed, Apr 17, 2013 at 2:44 PM, Josh Hansen wrote:
Impala 0.7.1 fails to query an external table backed by files ending with a .sql.gz extension. These are gzipped tab-separated value files and I can successfully query them with Hive.

$ impala-shell
Connected to $HOST:21000
Unable to load history: [Errno 2] No such file or directory
Welcome to the Impala shell. Press TAB twice to see a list of available commands.

Copyright (c) 2012 Cloudera, Inc. All rights reserved.

(Build version: Impala v0.7.1 (70cfa54) built on Tue Apr 16 22:10:43 PDT 2013)

[$HOST:21000] > select * from exampletable limit 10;
Query: select * from exampletable limit 10
ERROR: AnalysisException: Failed to load metadata for table: exampletable
CAUSED BY: TableLoadingException: Failed to load metadata for table: exampletable
CAUSED BY: RuntimeException: Compressed text files are not supported: hdfs://$HOST:8020/path/to/file.sql.gz
[$HOST:21000] >

Apparently there have been issues in this area before (IMPALA-14) - is there some connection? That issue seems to imply support for gzipped files, but apparently that is no longer the case. Regression?

Cluster is CDH4.2.0 installed using parcels from Cloudera Manager 4.5

BTW, the "Unable to load history: [Errno 2] No such file or directory" seems to only appear on the first invocation of impala-shell. Probably shouldn't be considered an error at all in that case, since obviously there would be no history on the first invocation.
- Josh

To unsubscribe from this group and stop receiving emails from it, send an email to impala-user+unsubscribe@cloudera.org.

Group: impala-user
Posted: Apr 17, '13 at 9:49p
Active: Jan 17, '14 at 9:45a