FAQ
Hi all,

I'm trying to figure out whether Impala is a good fit for my use case. Here
are the details of what I am doing:

- Log data loaded into Hive tables on EC2 EMR
- Analytics queries performed in Hive (each takes roughly 2 to 20 minutes,
depending on the query and dataset).
- Results stored in S3 and readied for production display


From the brief research I have done, it looks like Impala can speed up
these analytics queries and run them in seconds rather than minutes. It
would be a great help if you could answer the following questions and
confirm my assumptions:

1) I assume queries executed on Impala are run in the shell and that their
results cannot be written to a file on S3 without some custom script.
2) I assume loading still has to be done through EMR and that Impala needs
to be restarted after each ETL run.
3) What hardware is recommended for good query performance? For example, if
I have a 2-node small cluster that runs an analytics query in 2 minutes
(SELECT val, COUNT(*) FROM test GROUP BY val), can I install Impala on that
same cluster, run the same query, and expect it to be predictably faster?
Or do I need at least 4 XLarge instances, as the example indicated
(http://blog.cloudera.com/blog/2013/02/from-zero-to-impala-in-minutes/)?
4) Does Impala scale in the same manner as Hive as the dataset grows?

  • Nong at Feb 17, 2013 at 7:59 pm
    Your use case sounds like a good fit for Impala. Answers inline.
    On Friday, February 15, 2013 10:50:50 AM UTC-8, Chris Bates wrote:

    1) I assume queries executed on Impala are run in the shell and that their
    results cannot be written to a file on S3 without some custom script.

    What mechanism did you use to insert into S3 with Hive? For queries that
    do not contain an INSERT, the results are simply returned to the client.
    When the client is the shell, it just prints them out. If the query
    contains an INSERT, we write the data into HDFS.
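
As an illustration of the "custom script" route from question 1: because a query without an INSERT only returns rows to the client, one option is to have the client write those rows to a local file and then copy that file to S3. The sketch below only builds the two command lines and does not run them. It assumes an impala-shell version that supports the -B, -q, -o and --output_delimiter options, and an installed AWS CLI; neither is confirmed in this thread. The function name, local path, and bucket are hypothetical.

```python
import shlex


def build_export_commands(query, local_path, s3_uri, delimiter=","):
    """Return two argv lists: run the query into a local file, then copy it to S3.

    Assumption: impala-shell and the AWS CLI are installed and on PATH.
    """
    run_query = [
        "impala-shell",
        "-B",                                # delimited output instead of a pretty-printed table
        "--output_delimiter=" + delimiter,   # field separator used with -B
        "-q", query,                         # run a single query and exit
        "-o", local_path,                    # write the result rows to this file
    ]
    copy_to_s3 = ["aws", "s3", "cp", local_path, s3_uri]
    return run_query, copy_to_s3


# Hypothetical values, for illustration only.
query_cmd, copy_cmd = build_export_commands(
    "SELECT val, COUNT(*) FROM test GROUP BY val",
    "/tmp/val_counts.csv",
    "s3://example-bucket/reports/val_counts.csv",
)
for cmd in (query_cmd, copy_cmd):
    # Print a shell-safe rendering of each command.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Each list could then be passed to subprocess.run(cmd, check=True) to execute it, so a failed query stops the script before the upload step.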

    2) I assume loading still has to be done through EMR and that Impala needs
    to be restarted after each ETL run.

    If the underlying services (HDFS or the Hive metastore) have been
    restarted, Impala will need to be restarted as well.

    3) What hardware is recommended for good query performance? For example, if
    I have a 2-node small cluster that runs an analytics query in 2 minutes
    (SELECT val, COUNT(*) FROM test GROUP BY val), can I install Impala on that
    same cluster, run the same query, and expect it to be predictably faster?
    Or do I need at least 4 XLarge instances, as the example indicated
    (http://blog.cloudera.com/blog/2013/02/from-zero-to-impala-in-minutes/)?

    Take a look at our FAQ
    <https://ccp.cloudera.com/display/IMPALA10BETADOC/Impala+Frequently+Asked+Questions#ImpalaFrequentlyAskedQuestions-ImpalaSystemRequirements>
    for some more details. I don't have much experience running on EC2. The
    medium/large instances look comparable to our development machines, where
    we know Impala runs well. The small instances look like they'd benefit
    from more hardware.

    4) Does Impala scale in the same manner as Hive as the dataset grows?

    Yup, with one possible caveat. If your dataset happens to be small enough
    to fit in the aggregate memory of the cluster, queries over the same
    tables will be served out of the OS buffer cache. If your dataset grows
    beyond that, you will notice a significant drop in query performance
    because you are now reading from disk. The same applies to Hive, but in
    our experience Impala is much more likely to be I/O-bound, so the
    performance difference is more pronounced.
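
The caveat above comes down to a simple comparison between dataset size and the memory the cluster can devote to caching. A rough sketch of that arithmetic follows. The function name, the example cluster sizes, and especially the cache_fraction parameter (the share of each node's RAM assumed to be free for the OS buffer cache) are illustrative assumptions, not figures from this thread.

```python
def fits_in_buffer_cache(dataset_gb, nodes, ram_per_node_gb, cache_fraction=0.5):
    """Estimate whether a dataset can be served from the cluster's OS buffer cache.

    cache_fraction is an assumed share of each node's RAM left over for caching
    after the OS and running services take their part; tune it for a real cluster.
    """
    if not 0 < cache_fraction <= 1:
        raise ValueError("cache_fraction must be in (0, 1]")
    # Aggregate memory across the cluster that is plausibly available for caching.
    cache_gb = nodes * ram_per_node_gb * cache_fraction
    return dataset_gb <= cache_gb


# Hypothetical cluster: 4 nodes with 15 GB of RAM each, half of it free for caching.
for size_gb in (20, 50):
    print(size_gb, "GB fits in cache:", fits_in_buffer_cache(size_gb, nodes=4, ram_per_node_gb=15))
```

A dataset close to the threshold is worth re-benchmarking as it grows, since that is where the drop from cached reads to disk reads would first show up.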

Discussion Overview
group: impala-user @
categories: hadoop
posted: Feb 15, '13 at 6:50p
active: Feb 17, '13 at 7:59p
posts: 2
users: 2
website: cloudera.com
irc: #hadoop