FAQ
About 1.9 GB Snappy-compressed; raw is around 9.5 GB.
Yes, the Postgres table has indexes, but not on the columns I used for my Impala
queries (in other words, not on asset_name).
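For reference, table and index sizes on the PostgreSQL side can be checked with the built-in size functions; something like the following (the table name `logs` is a placeholder for the actual table):

```sql
-- Hypothetical table name; substitute the real PostgreSQL table.
SELECT pg_size_pretty(pg_relation_size('logs'))       AS table_size,
       pg_size_pretty(pg_indexes_size('logs'))        AS index_size,
       pg_size_pretty(pg_total_relation_size('logs')) AS total_size;
```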


2013/3/8 Greg Rahn <grahn@cloudera.com>
How many megabytes in size is the 37 million rows file?
Does the Postgresql table have any indexes?
On Fri, Mar 8, 2013 at 6:44 AM, Sékine Coulibaly wrote:

I've set up a bench platform on a physical machine. I'm running CDH 4.2.0
and Impala 0.6 on an IBM x3650 M3 (8 HT cores at 1.6 GHz, 32 GB RAM, RAID 5
with 6x150 GB SAS drives), running RedHat 6.3.
I'm running Impala tests on it and the results are somewhat disappointing.

I'm using a table with the following schema:

delivery_srv string
media_type string
server_name string
error_code string
dte string
count_sessions int
cmd string
status_code string
client_ip string
client_type string
session_id string
asset_name string
count_sent int
indicator int
interface string
server_ip string
customer_id string

I've loaded content into this table through Hive as follows:

SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;

CREATE TABLE logs_text (
  delivery_pop STRING, media_library STRING, videoserver STRING,
  severity STRING, dte STRING, nb_sessions INT, cmd STRING,
  status_code STRING, client_ip STRING, browser_type STRING,
  session_id STRING, content_name STRING, nb_bytes_sent INT,
  progress INT, interface STRING, ip_port STRING, mac STRING
) ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n';

LOAD DATA LOCAL INPATH '/tmp/logs/*.txt' INTO TABLE logs_text;

CREATE TABLE logs_sf_snappy (
  delivery_srv STRING, media_type STRING, server_name STRING,
  error_code STRING, dte STRING, count_sessions INT, cmd STRING,
  status_code STRING, client_ip STRING, client_type STRING,
  session_id STRING, asset_name STRING, count_sent INT,
  indicator INT, interface STRING, server_ip STRING, customer_id STRING
) STORED AS SEQUENCEFILE;

INSERT OVERWRITE TABLE logs_sf_snappy SELECT * FROM logs_text;
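One thing worth checking after loading through Hive: Impala caches table metadata, so tables created or loaded outside Impala may not be visible until the catalog is refreshed. In 0.6-era releases this was done from impala-shell; the exact syntax varied across early versions, so roughly:

```sql
-- Run from impala-shell. Early releases reloaded the whole catalog with a
-- bare "refresh"; later versions accept a table name (REFRESH logs_sf_snappy).
refresh;
SHOW TABLES;
```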

The log files contain around 37 million rows.

Here are some results:
--> Impala:
Req#1
SELECT COUNT(1) FROM logs_sf_snappy;
Query time: 2.83 sec (counted 37 million rows)
Req#2
SELECT COUNT(1), asset_name FROM logs_sf_snappy GROUP BY asset_name LIMIT 5000;
Query time: 6.43 sec (grouped 37 million rows)

--> Postgresql 9.2 (same hardware):
Req#1
Query time : 8.3 sec

Req#2
Query time : 20 sec
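For an apples-to-apples comparison it may help to see where PostgreSQL spends its time with EXPLAIN (ANALYZE, BUFFERS), available in 9.2; a sketch, with the table name `logs` assumed:

```sql
-- Hypothetical PostgreSQL table name. Shows whether time goes to the
-- sequential scan or the hash aggregate, and whether blocks come from cache.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(1), asset_name
FROM logs
GROUP BY asset_name
LIMIT 5000;
```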

These are basically the kinds of requests I'll run for my use case.

Although Impala is around 3x faster than PostgreSQL 9.2, I was expecting a
performance difference of at least one order of magnitude (in other words,
around 1 sec for Req#1 and 2 secs for Req#2), and I'm wondering if my setup
is wrong.
I've set up short-circuit reads and I'm using a pseudo-distributed
configuration.
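For reference, the CDH4-era short-circuit read settings for Impala lived in hdfs-site.xml and looked roughly like the following (property names per the CDH 4.2 documentation; the socket path is a typical default, not necessarily yours):

```xml
<!-- Sketch of a CDH 4.2 short-circuit read configuration for Impala. -->
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <name>dfs.domain.socket.path</name>
  <value>/var/run/hadoop-hdfs/dn._PORT</value>
</property>
<property>
  <name>dfs.client.file-block-storage-locations.timeout</name>
  <value>3000</value>
</property>
```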

I'm not sure I'm getting the most out of this very basic Impala setup, and
I'd appreciate it if anyone could provide a comparison baseline and/or help
point out configuration items that may be wrong.

iotop shows very light hard-drive activity at the start of the request,
so disk I/O is most likely not the bottleneck. Of course, running these
queries on the logs_text table takes much longer (e.g. 42 secs for Req#2).
Regarding CPU, the load average remains quite low (around 0.35) during
query processing, so I'm quite puzzled :(
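A load average that low on this box points away from a CPU bottleneck: with 8 cores (16 hardware threads), 0.35 is only about 2% of logical-core capacity, consistent with the query running mostly single-threaded or waiting on something. A minimal sketch of that arithmetic (the helper name is mine, not from any tool):

```python
def load_per_core(load_avg: float, cores: int) -> float:
    """Express a load average as a fraction of total logical-core capacity."""
    return load_avg / cores

# ~0.35 load average on 16 logical cores, as reported in the thread.
print(f"{load_per_core(0.35, 16):.1%}")  # prints "2.2%"
```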


Any help appreciated !

Thank you

Discussion: impala-user @ hadoop (posted Mar 8, 2013; active Dec 2, 2013; 16 posts, 8 users).