I have a cluster with 4 data nodes on which I am running very simple
queries for performance testing. The tables are in Parquet format. The
environment is Impala 1.2.1 / CDH 4.5 / CentOS.
- I start with a table of 100 million rows and add 100 million rows at
a time until I reach 1 billion rows. I run a count(*) on the table after
each addition. The time taken for this simple count(*) grows roughly
linearly with the row count, apart from one outlier:
1.13s for 101,713,307 rows
16.77s (!) for 203,426,614 rows
2.23s for 305,139,921 rows
4.72s for 406,853,228 rows
5.30s for 508,566,535 rows
4.59s for 610,279,842 rows
5.90s for 711,993,149 rows
8.94s for 813,706,456 rows
11.33s for 915,419,763 rows
7.09s for 1,017,133,070 rows
- I run a simple count/group-by rollup query on the table with 100
million rows and on another table with the same schema but 1 billion rows.
The time taken for 100 million rows is between 4.49 and 5.58 seconds, and
for 1 billion rows between 36.96 and 38.54 seconds, again showing
linear growth in time.
- The timings vary widely when I execute the same query repeatedly, and I
am a bit surprised to see such variation. Each timing is taken with a
single query running at a time; no other query is running on the cluster
to distort the tests.
- These tables contain a timestamp column, and I am unable to run
ANALYZE TABLE <table> COMPUTE STATISTICS on them from Hive because of the
known issue of Hive not recognizing the Parquet timestamp column (missing
Parquet jar file in Hive). Impala 1.2.1 does not support analyzing tables
from impala-shell, so I can't analyze the table from Impala either. I
realize this could be a major issue, and I can certainly attribute some of
the time increase to the missing statistics. However, I also want to make
sure the queries are using all 4 impalad daemons in a fairly balanced way.
How do I check that?
- The cluster was configured manually, i.e. without Cloudera Manager.
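
For reference, here is a minimal Python sketch of how the count(*) timings above could be checked for linearity. It uses only the numbers listed in this post. The function names (least_squares, r_squared, report) are my own, and an ordinary least-squares fit is just one reasonable way to look at this, not something Impala provides. It prints the fitted seconds per 100 million rows, the R^2 of the fit, and the per-run throughput, first for all runs and then with the 16.77 s run excluded.

```python
# Timings copied from the count(*) runs listed above: (rows, seconds).
MEASUREMENTS = [
    (101_713_307, 1.13),
    (203_426_614, 16.77),
    (305_139_921, 2.23),
    (406_853_228, 4.72),
    (508_566_535, 5.30),
    (610_279_842, 4.59),
    (711_993_149, 5.90),
    (813_706_456, 8.94),
    (915_419_763, 11.33),
    (1_017_133_070, 7.09),
]


def least_squares(points):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(points, slope, intercept):
    """Fraction of the variance in y explained by the fitted line."""
    mean_y = sum(y for _, y in points) / len(points)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
    ss_tot = sum((y - mean_y) ** 2 for _, y in points)
    return 1 - ss_res / ss_tot


def report(points):
    """Print the fit summary and the throughput of each individual run."""
    slope, intercept = least_squares(points)
    fit = r_squared(points, slope, intercept)
    print(f"fit: {slope * 1e8:.2f} s per 100M rows, "
          f"intercept {intercept:.2f} s, R^2 = {fit:.2f}")
    for rows, secs in points:
        print(f"{rows:>13,} rows  {secs:6.2f} s  {rows / secs / 1e6:7.1f} M rows/s")


if __name__ == "__main__":
    report(MEASUREMENTS)
    # Same fit with the 16.77 s run excluded, to see how much one outlier matters.
    report([p for p in MEASUREMENTS if p[1] != 16.77])
```

A low R^2 on the full data set would suggest that run-to-run variation, rather than row count alone, accounts for much of the spread in these timings.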