are very happy with it. This is a blocker for moving into Impala.
There is a lot of native support for gzipped files in hdfs and it seems odd
that Impala doesn't support it.
Thanks,
Jon
On Wednesday, April 17, 2013 11:06:06 PM UTC, Justin Erickson wrote:
Impala supports LZO-compressed and uncompressed text files. GZIP is
currently supported with splittable formats such as SequenceFiles, RCFiles,
etc.
In general, even with just MapReduce, we'd recommend against using GZIP
compressed text files for the following reasons:
* GZIP with a non-splittable file format (i.e. text files) will require
remote reads to process the entire file for files larger than an HDFS block
* GZIP is a very CPU-expensive compression codec optimized for storage
density over performance, so it will often be a performance bottleneck
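
[Editorial illustration, not part of the original message.] The first point can be shown with a small sketch, assuming Python's standard gzip and zlib modules. A reader can only begin decoding a gzip stream at its header, so a task handed the second half of the file cannot decode it on its own. The helper name can_decode_from and the sample rows are hypothetical and chosen only for illustration.

```python
import gzip
import zlib


def can_decode_from(blob: bytes, offset: int) -> bool:
    """Return True if a gzip decoder can start decoding at `offset`."""
    # wbits=31 tells zlib to expect a gzip header at the start of the input.
    decoder = zlib.decompressobj(wbits=31)
    try:
        decoder.decompress(blob[offset:])
        return True
    except zlib.error:
        # No gzip header at this position, so decoding cannot begin here.
        return False


# Hypothetical sample data standing in for a tab-separated text file.
rows = "".join(f"{i}\tvalue-{i}\n" for i in range(10_000)).encode()
blob = gzip.compress(rows)

print(can_decode_from(blob, 0))               # start of the stream
print(can_decode_from(blob, len(blob) // 2))  # middle of the stream
```

Decoding from offset 0 should succeed, while decoding from the middle should fail because no gzip header is found there. This is why a single task has to read a gzipped text file from the beginning, including any blocks stored on other nodes.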
For better performance, we recommend using a splittable file format with
Snappy compression such as Snappy-compressed Avro or SequenceFiles. If you
need to use text files for external accessibility, LZO-compressed text is
probably your best choice.
That said, we do have GZIP compression for text files as part of our
roadmap considerations, but I don't have a timeline given its current level
of feedback relative to other higher-priority items.
Thanks,
Justin
On Wed, Apr 17, 2013 at 2:44 PM, Josh Hansen <[email protected]> wrote:
Impala 0.7.1 fails to query an external table backed by files ending with
a .sql.gz extension. These are gzipped tab-separated value files and I can
successfully query them with Hive.
Output:
$ impala-shell
Connected to $HOST:21000
Unable to load history: [Errno 2] No such file or directory
Welcome to the Impala shell. Press TAB twice to see a list of available commands.
Copyright (c) 2012 Cloudera, Inc. All rights reserved.
(Build version: Impala v0.7.1 (70cfa54) built on Tue Apr 16 22:10:43 PDT 2013)
[$HOST:21000] > select * from exampletable limit 10;
Query: select * from exampletable limit 10
ERROR: AnalysisException: Failed to load metadata for table: exampletable
CAUSED BY: TableLoadingException: Failed to load metadata for table: exampletable
CAUSED BY: RuntimeException: Compressed text files are not supported: hdfs://$HOST:8020/path/to/file.sql.gz
[$HOST:21000] >
Apparently there have been issues in this area before (IMPALA-14,
https://issues.cloudera.org/browse/IMPALA-14). Is there some connection?
That issue seems to imply support for gzipped files, but apparently that is
no longer the case. Regression?
Cluster is CDH4.2.0 installed using parcels from Cloudera Manager 4.5
BTW, the "Unable to load history: [Errno 2] No such file or directory"
seems to only appear on the first invocation of impala-shell. Probably
shouldn't be considered an error at all in that case, since obviously there
would be no history on the first invocation.
- Josh