Your use case sounds like a good fit for Impala. Answers inline.
On Friday, February 15, 2013 10:50:50 AM UTC-8, Chris Bates wrote:
I'm trying to figure out whether Impala's use case is right for me. Here
are details of what I am doing:
- Log data loaded into Hive tables on EC2 EMR
- Analytics queries performed in Hive (they take ~2 to 20 minutes depending
on query and dataset).
- Results stored in S3 and readied for production display
From the brief research I have done, it looks like Impala can speed up
these analytics queries and run them in seconds not minutes. It would be a
great help if you can answer the following questions and confirm my
1) I assume queries executed on Impala are done in shell and cannot be
output to a file on S3 without some custom script.
What was the mechanism for inserting into S3 with Hive? For queries that
do not contain an 'insert', the results are just returned to the client.
In the case where the client is the shell, it just prints it out. If the
query contains an insert, we write the data into HDFS.
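
For queries without an insert, one way to get the results into a file is a small wrapper script around the shell. The following is only a minimal sketch, assuming your impala-shell version supports the -q (query), -B (delimited output), -o (output file) and -i (impalad address) options; check impala-shell --help first. The helper names and the output path are hypothetical, and copying the resulting file to S3 is left to whatever tool you already use.

```python
import subprocess


def build_impala_shell_cmd(query, output_path, impalad=None):
    """Build an impala-shell command line that writes query results to a file.

    Assumes impala-shell accepts -q, -B, -o and -i; verify against
    `impala-shell --help` for the version you have installed.
    `impalad` is an optional host:port string (hypothetical value).
    """
    cmd = ["impala-shell", "-B", "-q", query, "-o", output_path]
    if impalad is not None:
        cmd += ["-i", impalad]
    return cmd


def run_query_to_file(query, output_path, impalad=None):
    """Run the query through impala-shell and return the output file path.

    Raises subprocess.CalledProcessError if impala-shell exits non-zero.
    """
    subprocess.run(build_impala_shell_cmd(query, output_path, impalad), check=True)
    return output_path
```

Calling run_query_to_file("SELECT val, COUNT(*) FROM test GROUP BY val", "/tmp/val_counts.tsv") would leave the delimited results in that local file, ready to be uploaded.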
2) I assume loading still has to be done through EMR and Impala needs to
be restarted with each ETL run.
If the underlying services (HDFS or the Hive metastore) have been restarted,
Impala will need to be restarted as well.
3) What is the recommended hardware set to query performance? For
example, if I have a 2 node small cluster that runs an analytics query in 2
minutes (SELECT val, COUNT(*) FROM test GROUP BY val), can I use that same
cluster with Impala installed and run that same query and expect it to be
predictably faster? Or do I need 4 XLarge instances minimum as the example
Take a look at our FAQ for
some more details. I don't have much experience running on EC2. The
medium/large instances look comparable to our development machines, where we
know Impala runs well. The small instances look like they'd benefit from
4) Does Impala scale in the same manner as Hive as dataset grows?
Yup. There might be one caveat though. If your dataset happens to be
small enough to fit in the aggregate memory of the cluster, queries over
the same tables will be served out of the OS buffer cache. If your dataset
grows beyond that, you will notice a significant drop in query performance
because you are now reading off disk. The same applies to Hive, but in
our experience Impala is much more likely to be I/O-bound, so the
performance difference is more pronounced.
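
As a rough way to see which side of that line you are on, you can compare the dataset size with the memory the cluster could devote to caching. This is only a back-of-the-envelope sketch: the function name, the node sizes, and the cache_fraction default are assumptions for illustration, not measured values or Impala settings.

```python
def fits_in_buffer_cache(dataset_bytes, num_nodes, mem_per_node_bytes,
                         cache_fraction=0.5):
    """Return True if the dataset could plausibly stay in the OS buffer cache.

    cache_fraction is an assumed share of each node's RAM that the OS can
    use for caching file data; it is a guess and should be tuned per cluster.
    """
    aggregate_cache = num_nodes * mem_per_node_bytes * cache_fraction
    return dataset_bytes <= aggregate_cache


GIB = 1024 ** 3
# Hypothetical cluster: 2 nodes with 8 GiB of RAM each (8 GiB assumed cache).
print(fits_in_buffer_cache(6 * GIB, 2, 8 * GIB))   # 6 GiB dataset
print(fits_in_buffer_cache(20 * GIB, 2, 8 * GIB))  # 20 GiB dataset
```

Once the check starts returning False for your data, expect repeated queries to hit disk rather than memory.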