Does the PostgreSQL table have any indexes?
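For example, something like this lists any indexes on the table (the table name here is a placeholder, substitute your own):
SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'logs_sf_snappy';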
On Fri, Mar 8, 2013 at 6:44 AM, Sékine Coulibaly wrote:
I've set up a bench platform on a physical machine. I'm running CDH 4.2.0 and
Impala 0.6 on an IBM x3650 M3 (8 HT cores, 1.6 GHz, 32 GB RAM, RAID 5 with
6x150 GB SAS drives), running RedHat 6.3.
I'm running Impala tests on it and the results are somewhat disappointing.
I'm using a table with the following schema:
delivery_srv string
media_type string
server_name string
error_code string
dte string
count_sessions int
cmd string
status_code string
client_ip string
client_type string
session_id string
asset_name string
count_sent int
indicator int
interface string
server_ip string
customer_id string
I've loaded content into this table through Hive as follows:
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;

CREATE TABLE logs_text (
  delivery_pop STRING, media_library STRING, videoserver STRING,
  severity STRING, dte STRING, nb_sessions INT, cmd STRING,
  status_code STRING, client_ip STRING, browser_type STRING,
  session_id STRING, content_name STRING, nb_bytes_sent INT,
  progress INT, interface STRING, ip_port STRING, mac STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n';

LOAD DATA LOCAL INPATH '/tmp/logs/*.txt' INTO TABLE logs_text;

CREATE TABLE logs_sf_snappy (
  delivery_srv STRING, media_type STRING, server_name STRING,
  error_code STRING, dte STRING, count_sessions INT, cmd STRING,
  status_code STRING, client_ip STRING, client_type STRING,
  session_id STRING, asset_name STRING, count_sent INT,
  indicator INT, interface STRING, server_ip STRING, customer_id STRING
)
STORED AS SEQUENCEFILE;

INSERT OVERWRITE TABLE logs_sf_snappy SELECT * FROM logs_text;
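(A side note for anyone following along: tables created or loaded through Hive only become visible to Impala after a metadata refresh. A minimal sketch, assuming the 0.x impala-shell, run before querying:
refresh;
)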
The log files contain around 37 million rows.
Here are some results:
--> Impala:
Req#1
SELECT COUNT(1) FROM logs_sf_snappy;
Query time: 2.83 sec (counted 37 million rows)
Req#2
SELECT COUNT(1), asset_name FROM logs_sf_snappy GROUP BY asset_name LIMIT 5000;
Query time: 6.43 sec (grouped 37 million rows)
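(Timings like these can be read straight from impala-shell, which prints the elapsed time after each query; for example:
impala-shell -q "SELECT COUNT(1) FROM logs_sf_snappy"
)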
--> PostgreSQL 9.2 (same hardware):
Req#1
Query time: 8.3 sec
Req#2
Query time: 20 sec
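(As a sanity check on the PostgreSQL side, the plans behind these timings can be inspected with EXPLAIN ANALYZE; a minimal sketch, assuming the PostgreSQL table is also named logs_sf_snappy:
EXPLAIN ANALYZE SELECT COUNT(1) FROM logs_sf_snappy;
EXPLAIN ANALYZE SELECT COUNT(1), asset_name FROM logs_sf_snappy GROUP BY asset_name LIMIT 5000;
Among other things, this shows whether PostgreSQL is doing a sequential scan or using an index, which bears directly on the index question above.)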
These are basically the kinds of queries I'll run for my use case.
Although Impala is around 3x faster than PostgreSQL 9.2, I was expecting a
performance difference of at least one order of magnitude (in other words,
around 1 sec for Req#1 and 2 sec for Req#2), and I'm wondering if my setup
is wrong.
I've set up short-circuit reads and I'm using a pseudo-distributed
configuration.
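(For reference, short-circuit reads are enabled through hdfs-site.xml; this is a sketch of the relevant properties for CDH 4.2, and the exact names and socket path may vary by version:
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <name>dfs.domain.socket.path</name>
  <value>/var/run/hadoop-hdfs/dn._PORT</value>
</property>
)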
I'm not sure I'm getting the most out of this very basic Impala setup, and
I'd appreciate it if anyone could provide a basis for comparison and/or help
point out configuration items that may be wrong.
iotop shows very light hard-drive activity at the start of the query, so
disk I/O is most likely not the bottleneck. Of course, running these
queries on the logs_text table takes much longer (e.g. 42 sec for Req#2).
Regarding CPU, the load average remains quite low (around 0.35) during
query processing, so I'm quite puzzled :(
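(A low overall load average can still hide one saturated core: 0.35 on a box with 16 hardware threads is consistent with a single busy thread for part of the query. Per-core usage during the query can be watched with something like the following; mpstat is part of the sysstat package:
mpstat -P ALL 1
)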
Any help appreciated!
Thank you