Could you provide the query profiles for the queries you're testing? These
will help identify where the bottleneck is. (Query profiles can be accessed
via the impalad web interface at http://your-impalad-server:25000/queries.)

On Fri, Mar 8, 2013 at 6:44 AM, Sékine Coulibaly wrote:

I've set up a bench platform on a physical machine. I'm running CDH 4.2.0 and
Impala 0.6 on an IBM x3650 M3 (8 HT cores at 1.6 GHz, 32 GB RAM, RAID 5 with
6x150 GB SAS drives), running RedHat 6.3.
I'm running Impala tests on it, and the results are somewhat
disappointing.

I'm using a table with the following schema:

delivery_srv string
media_type string
server_name string
error_code string
dte string
count_sessions int
cmd string
status_code string
client_ip string
client_type string
session_id string
asset_name string
count_sent int
indicator int
interface string
server_ip string
customer_id string

I've loaded content into this table through Hive as follows:

SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;

CREATE TABLE logs_text (
  delivery_pop STRING, media_library STRING, videoserver STRING,
  severity STRING, dte STRING, nb_sessions INT, cmd STRING,
  status_code STRING, client_ip STRING, browser_type STRING,
  session_id STRING, content_name STRING, nb_bytes_sent INT,
  progress INT, interface STRING, ip_port STRING, mac STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';

LOAD DATA LOCAL INPATH '/tmp/logs/*.txt' INTO TABLE logs_text;
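
As a quick sanity check after the load (just an illustrative step, not part of
the original workflow), a plain count in Hive should come back near the 37
million rows mentioned below:

-- Run in Hive once the LOAD DATA step has finished
SELECT COUNT(*) FROM logs_text;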

CREATE TABLE logs_sf_snappy (
  delivery_srv STRING, media_type STRING, server_name STRING,
  error_code STRING, dte STRING, count_sessions INT, cmd STRING,
  status_code STRING, client_ip STRING, client_type STRING,
  session_id STRING, asset_name STRING, count_sent INT,
  indicator INT, interface STRING, server_ip STRING, customer_id STRING
) STORED AS SEQUENCEFILE;

INSERT OVERWRITE TABLE logs_sf_snappy SELECT * FROM logs_text;
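
Since both tables are created and populated through Hive, two checks may be
worth doing before benchmarking (a minimal sketch; the exact refresh command
depends on the Impala version):

-- In Hive: confirm the table really is stored as a SequenceFile
-- (see the InputFormat/OutputFormat lines in the output)
DESCRIBE FORMATTED logs_sf_snappy;

-- In impala-shell: reload the catalog so Hive-created tables are visible
-- (the 0.x shell exposes this as a plain 'refresh' command)
refresh;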

The log files contain around 37 million rows.

Here are some results:

--> Impala:
Req#1
SELECT COUNT(1) FROM logs_sf_snappy;
Query time: 2.83 sec (counted 37 million rows)

Req#2
SELECT COUNT(1), asset_name FROM logs_sf_snappy GROUP BY asset_name LIMIT 5000;
Query time: 6.43 sec (grouped 37 million rows)
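
(For scale, using the numbers above: 37 million rows in 2.83 sec is roughly
13 million rows scanned per second for Req#1, and 37 million rows in 6.43 sec
is roughly 5.8 million rows per second for the GROUP BY in Req#2.)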

--> PostgreSQL 9.2 (same hardware):
Req#1
Query time: 8.3 sec

Req#2
Query time: 20 sec

These are basically the kinds of queries I'll be running for my use case.

Although Impala is around three times faster than PostgreSQL 9.2 here
(8.3 sec / 2.83 sec ≈ 2.9x on Req#1), I was expecting a performance
difference of at least one order of magnitude (in other words, around
1 sec for Req#1 and 2 sec for Req#2), and I'm wondering if my setup is
wrong. I've set up short-circuit reads and I'm using a pseudo-distributed
configuration.

I'm not sure I'm getting the most out of this very basic Impala setup, and I
would appreciate it if anyone could provide a basis for comparison and/or
point out configuration items that may be wrong.

iotop shows very light hard-drive activity at the start of the query, so
disk I/O is most likely not the bottleneck. Of course, running these
queries against the logs_text table takes much longer (e.g. 42 sec for Req#2).
Regarding CPU, the load average stays quite low (around 0.35) while the
query is running, so I'm quite puzzled :(
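
In case it helps anyone suggest where to look, the query plan can be dumped as
well (a minimal sketch; EXPLAIN output and its level of detail vary by Impala
version):

-- Run from impala-shell to inspect the plan for the aggregation query
EXPLAIN SELECT COUNT(1), asset_name FROM logs_sf_snappy GROUP BY asset_name LIMIT 5000;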


Any help appreciated!

Thank you
