My last couple of emails failed to go through. Here is another attempt.
HBase is running but is not in use. The only things running on the cluster
are the TPC-H queries, one at a time. I tried to stop HBase anyway for my
next run, but I got a message saying to stop Impala and Hue first, so I
didn't do that.
I have Impala daemons running on all 4 nodes in the cluster. The catalog
service is on my third node.
Here are the results of running TPC-H query 17 three times. I waited a few
minutes between each query to gather statistics. I'm getting the
statistics from the impalad screens and the top command, taken from my #1
machine, where I run the queries, and the #3 machine, where the catalog
service is. There are impalad daemons on all machines in the cluster.
There is nothing else running on the cluster. Each node has 48GB of memory.
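For reference, the per-node snapshot boils down to something like the Python
sketch below (a sketch only; it assumes top and free are available in batch
mode and that the processes are named impalad and catalogd):

    # Sketch: grab RES and %MEM for impalad/catalogd plus overall host memory,
    # roughly matching the numbers recorded below. Process names are assumptions.
    import subprocess

    def snapshot(processes=("impalad", "catalogd")):
        top_out = subprocess.run(["top", "-b", "-n", "1"],
                                 capture_output=True, text=True).stdout
        for line in top_out.splitlines():
            if any(p in line for p in processes):
                print(line)  # columns include RES and %MEM
        # Overall host memory: physical, cached, and swap usage.
        print(subprocess.run(["free", "-m"], capture_output=True, text=True).stdout)

    snapshot()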
Query times were:
1. 6 min, 46 sec
2. 24 min, 21 sec
3. 51 min, 42 sec
Values after reboot of cluster
****************************************************************************
Hadoop1 impalad
************************
Host Memory Usage - 2.5GB Physical, 3.3 GB Cached
Resident Memory - 227 MB
Top Command - RES = 226m, %MEM = .5
Hadoop3 impalad
************************
Host Memory Usage - 4.6GB Physical, 3.7GB Cached
Resident Memory - 229 MB
Top Command - RES = 229m, %MEM = .5
Hadoop3 catalogd
************************
Top Command - RES = 326m, %MEM = .7
*******************************************************************************
Values after running query 17 first time
************************************
Query took: 6 min 46 sec
************************************
Hadoop1 impalad
************************
Host Memory Usage - 38.4GB Physical, 4.6GB Cached
Resident Memory - 39.7GB
Top Command - RES = 39GB, %MEM = 84.5
Hadoop3 impalad
************************
Host Memory Usage - 34.5GB Physical, 4.6GB Cached
Resident Memory - 29.8 GB
Top Command - RES = 29GB, %MEM = 63.3
Hadoop3 catalogd
************************
Top Command - RES = 309m, %MEM = .6
*******************************************************************************
Values after running query 17 second time
************************************
Query took: 24 min 21 sec
************************************
Hadoop1 impalad
************************
Host Memory Usage - 46.7GB Physical, 0GB Cached, Swap Used 8.4GB
Resident Memory - 41.8GB
Top Command - RES = 41GB, %MEM = 88.9
Physical Memory display shows 46.5 of 47.1 GB used
Hadoop3 impalad
************************
Host Memory Usage - 45GB Physical, 1.8GB Cached
Resident Memory - 40GB
Top Command - RES = 39GB, %MEM = 84.9
Physical Memory display shows 45 of 47.1 GB used
Hadoop3 catalogd
************************
Top Command - RES = 314m, %MEM = .7
*******************************************************************************
Values after running query 17 third time
************************************
Query took: 51 min 42 sec
************************************
Hadoop1 impalad
************************
Host Memory Usage - 46.6GB Physical, 0GB Cached, Swap Used 15.4GB
Resident Memory - 42GB
Top Command - RES = 42GB, %MEM = 89.2
Physical Memory display shows 46.5 of 47.1 GB used
Hadoop3 impalad
************************
Host Memory Usage - 46.8GB Physical, 0GB Cached, Swap Used 27GB
Resident Memory - 40.7GB
Top Command - RES = 40GB, %MEM = 86.4
Physical Memory display shows 46.8 of 47.1 GB used
Hadoop3 catalogd
************************
Top Command - RES = 272m, %MEM = .5
On Sat, Jan 18, 2014 at 3:14 PM, Alan Choi wrote:
Hi Jim,
By default, Impala imposes a memory limit of 80% of the system memory,
so you have a mem-limit of 37GB (as shown in memz.txt). Impala
shouldn't use a whole lot more than 37GB.
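For clarity, here is the arithmetic behind that figure (a quick sketch; 80% is
the default fraction, and ~47GB is the usable physical memory on a 48GB node):

    # Default mem-limit arithmetic: 80% of the node's physical memory.
    physical_gb = 47.1   # approximate usable physical memory on a 48GB node
    default_fraction = 0.80
    print(f"expected mem-limit ~= {physical_gb * default_fraction:.1f} GB")  # ~37.7 GB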
Two things I want to make sure of. First, you probably have some other
processes running on the node, such as the CM agent, the monitor, and the
Impala catalog service. How much memory is consumed by these processes?
Second, just to make sure, you don't have HBase running, right?
Thanks,
Alan
On Sat, Jan 18, 2014 at 5:46 AM, Jim Williams wrote:
Hi Nong,
My cluster is CM managed. I have not touched any parameters; I'm using it
straight out of the box. Let me know if I should do something with the
process mem limit, and where I can set that, or any other parameters that
will help performance.
Here is what I did:
I brought up the cluster (so it's a fresh start of everything including
the OS) and ran TPC-H query 7. Then I captured the memz and the metrics
values.
Note that after I did this, I ran query 7 again and it took twice as long
to run as the first time. I ran it again and it took 3 times as long as
the second time. So the times were (4 min, 8 min, 24 min).
This is a 4-node cluster; each node has 48GB of memory. My TPC-H database is 100GB.
Thanks,
Jim
On Fri, Jan 17, 2014 at 12:52 PM, Nong Li wrote:
Inline.
On Fri, Jan 17, 2014 at 4:04 AM, wrote:
Hello,
I'm trying to do the TPC-H queries with a small 4-node cluster. I'm
running Impala 1.2.1 and I've run into 2 issues.
1. When I load my Parquet tables with a 10GB TPC-H database, the
lineitem table has some corrupt data. Some of the date fields have
unprintable characters in them instead of dates. At the 1GB level I don't
get this.
You're running into
https://issues.cloudera.org/browse/IMPALA-692. I
recommend upgrading to 1.2.3.
2. When I run at the 100GB level (text format because of the issue above),
there appears to be a memory leak in Impala. After I run a few queries, all
the memory on my systems (48GB) is taken up and not released. So, at that
point, it is non-stop swapping.
Is this a CM-managed cluster? Do you have process mem limits enabled? If
you can get the cluster in this state or close to it, can you
send us the /memz page and /metrics from the debug web page?
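Something like the sketch below would work for grabbing them (it assumes the
impalad debug webserver is on its default port, 25000, and that the host is
reachable as hadoop1; both are assumptions, not details from this thread):

    # Sketch: save the /memz and /metrics debug pages from one impalad.
    # The hostname "hadoop1" and the default debug port 25000 are assumptions.
    import urllib.request

    for page in ("memz", "metrics"):
        url = f"http://hadoop1:25000/{page}"
        with urllib.request.urlopen(url) as resp:
            body = resp.read().decode("utf-8", errors="replace")
        with open(f"{page}.txt", "w") as f:
            f.write(body)
        print(f"saved {url} -> {page}.txt")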
Has anyone seen these issues?
Thanks,
Jim
On Monday, July 22, 2013 12:14:00 PM UTC-4, Aron MacDonald wrote:
Hi All,
Recently a few of us following this forum have collaborated and
compiled some interesting Impala benchmarking results.
We've used the TPC-H dataset on our small Hadoop clusters to test
query run-times with Text and Parquet file formats.
The following spreadsheet (which is still a work in progress) was
prepared by myself, Lee Jung-Yup, Henrik Behrens and most recently Tim
Hejit (from OnMarc).
https://docs.google.com/spreadsheet/ccc?key=0AgQ09vI0R_wIdEVMeTQwZGJSOVQwcFRSRFFFUmcxWWc#gid=6
On row 21 of the summary sheet we have tried to benchmark our
environments using a simple cost-performance factor metric. The cost of
our environments is calculated monthly from hardware and operating costs
(e.g. electricity), assuming the hardware is fully depreciated (straight-line
method) over 3 years.
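As a rough illustration of that monthly cost basis (the actual figures and the
full cost-performance formula are in the spreadsheet; all numbers below are made
up purely for illustration):

    # Illustration only: monthly cost basis = straight-line hardware depreciation
    # over 3 years plus monthly operating costs. All numbers below are hypothetical.
    hardware_cost = 6000.0        # hypothetical total hardware price
    depreciation_months = 36      # 3 years, straight-line
    monthly_operating = 50.0      # hypothetical electricity etc. per month

    monthly_cost = hardware_cost / depreciation_months + monthly_operating
    print(f"monthly cost ~= {monthly_cost:.2f}")  # ~216.67 with these numbers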
While the initial focus was on simple queries on a single table, some
attempts have also been made to compare performance using the complex
queries documented in TPC-H (http://www.tpc.org/tpch/).
To re-create these tests (on Text tables) in your own environment you
can use the following link:
https://github.com/kj-ki/tpc-h-impala/tree/master/tpch_impala
If you also want to try it with Parquet format tables, then the
following link has the files I created/used.
https://docs.google.com/file/d/0Bxydpie8Km_fNUtvREdxYVNJWkE/edit?usp=sharing
Why not try these on your own clusters (especially those of you running
significantly larger clusters)?
We’d be only too happy to include your results.
Kind Regards
Aron