Impala 0.7.1 fails to query an external table backed by files ending with a
.sql.gz extension. These are gzipped tab-separated value files and I can
successfully query them with Hive.

Output:

$ impala-shell
Connected to $HOST:21000
Unable to load history: [Errno 2] No such file or directory
Welcome to the Impala shell. Press TAB twice to see a list of available
commands.

Copyright (c) 2012 Cloudera, Inc. All rights reserved.

(Build version: Impala v0.7.1 (70cfa54) built on Tue Apr 16 22:10:43 PDT
2013)

[$HOST:21000] > select * from exampletable limit 10;
Query: select * from exampletable limit 10
ERROR: AnalysisException: Failed to load metadata for table: exampletable
CAUSED BY: TableLoadingException: Failed to load metadata for table:
exampletable
CAUSED BY: RuntimeException: Compressed text files are not supported:
hdfs://$HOST:8020/path/to/file.sql.gz
[$HOST:21000] >

Apparently there have been issues in this area before (IMPALA-14,
https://issues.cloudera.org/browse/IMPALA-14) - is there some connection? That
issue seems to imply support for gzipped files, but apparently that is no
longer the case. Regression?

Cluster is CDH4.2.0 installed using parcels from Cloudera Manager 4.5

BTW, the "Unable to load history: [Errno 2] No such file or directory"
seems to only appear on the first invocation of impala-shell. Probably
shouldn't be considered an error at all in that case, since obviously there
would be no history on the first invocation.
- Josh


  • Jon Bjarnason at Aug 21, 2013 at 1:19 pm
    Has there been any movement on this? We are using gzipped text files and
    are very happy with them. This is a blocker for our move to Impala.

    There is a lot of native support for gzipped files in HDFS, and it seems
    odd that Impala doesn't support them.

    Thanks,

    Jon
    On Wednesday, April 17, 2013 11:06:06 PM UTC, Justin Erickson wrote:

    Impala supports LZO-compressed and uncompressed text files. GZIP is
    currently supported with splittable formats such as SequenceFiles, RCFiles,
    etc.

    In general, even with just MapReduce, we'd recommend against using
    GZIP-compressed text files for the following reasons:
    * GZIP with a non-splittable file format (e.g. text files) requires remote
    reads to process the entire file whenever the file is larger than an HDFS
    block
    * GZIP is a very CPU-expensive compression codec, optimized for storage
    density over performance, so it will often be a performance bottleneck

    For better performance, we recommend using a splittable file format with
    Snappy compression such as Snappy-compressed Avro or SequenceFiles. If you
    need to use text files for external accessibility, LZO-compressed text is
    probably your best choice.
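    The non-splittability point can be illustrated directly: a gzip member can
    only be decoded starting from its header at byte 0, so a scanner handed
    the second half of a large .gz file has nothing it can decompress on its
    own. A minimal sketch using Python's standard gzip/zlib modules (the
    sample rows are invented):

```python
# Sketch: why a gzip text file is not splittable. Decoding works only
# from the gzip header at offset 0; an arbitrary mid-file split point
# has no header, so it cannot be decompressed independently.
import gzip
import zlib

data = b"col1\tcol2\n" * 100_000  # tab-separated rows, like the .sql.gz files above
compressed = gzip.compress(data)

# Decoding from the start works: the gzip header sits at offset 0.
assert gzip.decompress(compressed) == data

# Decoding from an arbitrary split point fails: no header there.
try:
    zlib.decompressobj(wbits=zlib.MAX_WBITS | 16).decompress(
        compressed[len(compressed) // 2:]
    )
    mid_split_ok = True
except zlib.error:
    mid_split_ok = False

print("mid-file split decodable:", mid_split_ok)
```

    In MapReduce terms, this is why a gzipped text file larger than one HDFS
    block ends up processed by a single reader that must fetch every block,
    mostly via remote reads.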

    That said, we do have GZIP compression for text files as part of our
    roadmap considerations, but I don't have a timeline given its current
    level of feedback relative to other higher-priority items.

    Thanks,
    Justin


  • John Russell at Aug 23, 2013 at 10:18 pm
    I believe there is an architectural mismatch with Hadoop generally, in that gzipped files aren't "splittable" the same way LZO files are, so there's less opportunity for dividing up the work in parallel.

    Some background:

    http://stackoverflow.com/questions/11229272/hadoop-mr-better-to-have-compressed-input-files-or-raw-files

    I see the idea of splittable gzip has been explored, but I don't know if that ever went anywhere:

    https://issues.apache.org/jira/browse/MAPREDUCE-491

    John
  • Justin Erickson at Sep 11, 2013 at 7:26 pm
    Despite the limitations, it is part of the roadmap, but we don't have a
    timeline for this. Note that even when this is supported, we'll recommend
    against it for the CPU and network performance issues described earlier.

  • Marcel Kornacker at Sep 12, 2013 at 3:55 am

    On Wed, Sep 11, 2013 at 8:40 PM, Nishant Patel wrote:

    Any timeline for supporting Snappy-compressed text files? We won't be
    able to migrate to Impala, as all our tables use Snappy-compressed text
    files :(

    What's the reason for not using Snappy-compressed sequence files? The
    problem with text files is that they're not splittable.
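    Marcel's suggestion would amount to rewriting the data once. A
    hypothetical Hive sketch (table and column names are invented; the
    settings shown are the CDH4-era Hive/MapReduce compression properties):

```sql
-- Hypothetical: convert a Snappy-compressed text table into a
-- Snappy-compressed SequenceFile table that Impala can scan in parallel.
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;

CREATE TABLE exampletable_seq (col1 STRING, col2 STRING)
STORED AS SEQUENCEFILE;

-- One-time rewrite of the existing data.
INSERT OVERWRITE TABLE exampletable_seq
SELECT * FROM exampletable;
```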

  • Nishant Patel at Sep 12, 2013 at 7:03 am
    I understand that it's a good idea to use Snappy-compressed sequence
    files instead of Snappy-compressed text files, but as everything is in
    production it's not possible to change right now.

    'Text files are not splittable' is not an issue for us because of the way
    we have partitioned the data: after applying compression, a single file
    does not exceed the block size (128 MB) in 95% of cases.

    Thanks for your suggestion.

    Regards,
    Nishant

  • Sven Teresniak at Jan 17, 2014 at 9:45 am

    Replying to Justin Erickson's message of Thursday, April 18, 2013 at
    1:06:06 AM UTC+2 (the recommendation against GZIP-compressed text files,
    quoted in full above):
    Just to make a comment: I replaced uncompressed text files with gzipped
    files because gzip -9 compressed both better *and* faster than lzop -9,
    and Hive queries over gzipped data are much faster (about three times)
    than over uncompressed files.
    Maybe my setup here is in some way special (compared to the rest of the
    group), but it works very well for me. Maybe the disks in my cluster are
    the bottleneck and gzip just helps in this case. I don't know.

    LZO is a pain, and not only because of the license issues and the work to
    bring the native codec online. Everyone here seems to ignore that LZO
    (like gzip) is *not* splittable on its own: you need the indexer and the
    .index files for that (please correct me if I'm wrong). So when you write
    LZO files larger than your block size, you have a file that cannot be
    split by default, and with generated index files you have your data mixed
    with metadata, which is a pain too!
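    The density-vs-CPU trade-off discussed in this thread can be sketched
    with Python's zlib, which implements the same DEFLATE algorithm gzip
    uses; the log-like sample rows are invented, and timings vary by machine,
    so only output sizes are compared:

```python
# Sketch: compression-level trade-off on repetitive, log-like rows.
# Level 1 is gzip's fastest setting, level 9 its densest.
import zlib

row = b"2013-04-17\t42\tGET /index.html\t200\n"
data = row * 50_000

fast = zlib.compress(data, level=1)   # like gzip -1
dense = zlib.compress(data, level=9)  # like gzip -9

ratio_fast = len(fast) / len(data)
ratio_dense = len(dense) / len(data)

# Higher levels trade CPU for density; on repetitive data level 9 is at
# least as dense as level 1. Whether it is also *faster* than lzop -9,
# as reported above, depends on the setup.
assert len(dense) <= len(fast)
print(f"level 1 ratio: {ratio_fast:.4f}, level 9 ratio: {ratio_dense:.4f}")
```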


Discussion Overview
group: impala-user
categories: hadoop
posted: Apr 17, '13 at 9:49p
active: Jan 17, '14 at 9:45a
posts: 7
users: 7
website: cloudera.com
irc: #hadoop
