On Wed, Mar 20, 2013 at 3:07 AM, Sékine Coulibaly wrote:
I set up the HDFS data on the RAID array (disregard what I said previously: those tests were done on a single-drive HDFS). Here are the results:
A raw 100GB file write shows:
real 11m20
user 46sec
sys 1m12
The HDFS read is 11 minutes long (so around 150MB/s).
The query execution times are identical to when they were run on the non-RAID disk, which makes me think the system is not I/O bound (I expected more throughput from the RAID array than from a single drive).
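For reference, a quick way I can double-check this while a query runs (a sketch only; iostat is from the sysstat package and the 5-second interval is arbitrary):
iostat -xm 5
If %util stays well below 100% on the RAID device during the query, that also points away from an I/O bottleneck.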
When looking at the profile, I can see that most of the time is spent in:
AGGREGATION_NODE (id=1):(5s894ms 94.81%)
- BuildTime: 5s547ms
Not sure I can improve BuildTime from my end. Is there room for improvement here?
Unfortunately there's no way at the moment to make this query run
faster. The reason is that aggregation is single-threaded within a
single plan fragment executing on an impalad.
However, we are planning on adding multi-threaded join and aggregation
execution sometime after the GA release, although we can't promise any
specific dates.
Marcel
Here is the profile:
Query (id=d29b07c309d34f27:9ea3d18614178baa):
- PlanningTime: 7ms
Query d29b07c309d34f27:9ea3d18614178baa:(260ms 0.00%)
Aggregate Profile:
Coordinator Fragment:(5s903ms 0.00%)
- RowsProduced: 5.00K
CodeGen:
- CodegenTime: 2ms
- CompileTime: 118ms
- LoadTime: 4ms
- ModuleFileSize: 44.61 KB
AGGREGATION_NODE (id=3):(5s906ms 0.59%)
- BuildBuckets: 32.77K
- BuildTime: 10ms
- GetResultsTime: 541K clock cycles
- MemoryUsed: 2.79 MB
- RowsReturned: 5.00K
- RowsReturnedRate: 846.00 /sec
EXCHANGE_NODE (id=2):(5s871ms 99.46%)
- BytesReceived: 1.75 MB
- ConvertRowBatchTime: 1010K clock cycles
- DeserializeRowBatchTimer: 1ms
- MemoryUsed: 0.00
- RowsReturned: 28.37K
- RowsReturnedRate: 4.83 K/sec
Averaged Fragment 1:(5s891ms 0.00%)
completion times: min:5s888ms max:5s888ms mean: 5s888ms stddev:0
execution rates: min:356.10 MB/sec max:356.10 MB/sec mean:356.10
MB/sec stddev:0.00 /sec
split sizes: min: 2.05 GB, max: 2.05 GB, avg: 2.05 GB, stddev: 0.00
- RowsProduced: 28.37K
CodeGen:
- CodegenTime: 4ms
- CompileTime: 122ms
- LoadTime: 5ms
- ModuleFileSize: 44.61 KB
DataStreamSender:
- BytesSent: 1.75 MB
- DataSinkTime: 10ms
- SerializeBatchTime: 3ms
- ThriftTransmitTime: 11ms
AGGREGATION_NODE (id=1):(5s894ms 94.81%)
- BuildBuckets: 32.77K
- BuildTime: 5s547ms
- GetResultsTime: 1ms
- MemoryUsed: 2.79 MB
- RowsReturned: 28.37K
- RowsReturnedRate: 4.81 K/sec
HDFS_SCAN_NODE (id=0):(309ms 5.25%)
- BytesRead: 2.05 GB
- DecompressionTime: 11s149ms
- MaterializeTupleTime: 4s297ms
- MemoryUsed: 0.00
- PerDiskReadThroughput: 2.05 GB/sec
- RowsReturned: 37.47M
- RowsReturnedRate: 121.12 M/sec
- ScanRangesComplete: 43
- ScannerThreadsReadTime: 996ms
- TotalReadThroughput: 329.91 MB/sec
Fragment 1:
Instance d29b07c309d34f27:9ea3d18614178bac:(5s891ms 0.00%)
Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:43/2.20B
- RowsProduced: 28.37K
CodeGen:
- CodegenTime: 4ms
- CompileTime: 122ms
- LoadTime: 5ms
- ModuleFileSize: 44.61 KB
DataStreamSender:
- BytesSent: 1.75 MB
- DataSinkTime: 10ms
- SerializeBatchTime: 3ms
- ThriftTransmitTime: 11ms
AGGREGATION_NODE (id=1):(5s894ms 94.81%)
- BuildBuckets: 32.77K
- BuildTime: 5s547ms
- GetResultsTime: 1ms
- MemoryUsed: 2.79 MB
- RowsReturned: 28.37K
- RowsReturnedRate: 4.81 K/sec
HDFS_SCAN_NODE (id=0):(309ms 5.25%)
File Formats: SEQUENCE_FILE/SNAPPY_BLOCKED:43
- BytesRead: 2.05 GB
- DecompressionTime: 11s149ms
- MaterializeTupleTime: 4s297ms
- MemoryUsed: 0.00
- PerDiskReadThroughput: 2.05 GB/sec
- RowsReturned: 37.47M
- RowsReturnedRate: 121.12 M/sec
- ScanRangesComplete: 43
- ScannerThreadsReadTime: 996ms
- TotalReadThroughput: 329.91 MB/sec
Thank you !
Sekine
2013/3/13 Greg Rahn <[email protected]>
Another option is just to use direct I/O via the dd option "oflag=direct"
(or iflag=direct for direct path reading)
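For example, something along these lines bypasses the page cache on both the write and the read (the path and size here are just placeholders):
dd if=/dev/zero of=/data01/ddtest bs=1M count=10240 oflag=direct
dd if=/data01/ddtest of=/dev/null bs=1M iflag=direct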
On Wed, Mar 13, 2013 at 9:57 AM, Todd Lipcon wrote:
Depending on the amount of RAM on the machine, you probably should
actually use a larger file for this test. To see actual disk throughput I
usually like to use a file at least 2x RAM, preferably 10x RAM if there is
free disk space on the node.
e.g. use: dd if=/dev/zero of=/path/to/file bs=1M count=100000
(creates a 100G file)
The test will take longer but will be more accurate.
Alternatively, you can drop the buffer cache completely before running
the second test: echo 1 | sudo tee /proc/sys/vm/drop_caches
but the downside of this is that your JVM startup time will be especially
slow as well, given it will have to seek around disk to load shared
libraries, etc.
Doing something in the middle (e.g. 10G and drop_caches) is probably pretty
accurate.
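As a rough sketch of that middle-ground approach (the path is a placeholder; with 32GB of RAM the drop_caches step is what makes a 10G read meaningful):
# write a 10G test file, flush it, drop the page cache, then time the read back
dd if=/dev/zero of=/path/to/testfile bs=1M count=10240
sync
echo 1 | sudo tee /proc/sys/vm/drop_caches
dd if=/path/to/testfile of=/dev/null bs=1M
GNU dd prints the throughput for each run when it finishes.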
-Todd
On Wed, Mar 13, 2013 at 9:48 AM, Marcel Kornacker <[email protected]>
wrote:
--
Todd Lipcon
Software Engineer, Cloudera
Sekine, could you run this
dd if=/dev/zero of=<file path on your raid drive> bs=1024 count=3145728
This will tell you what your raw write throughput on that drive is.
Then copy that file into hdfs:
hadoop fs -put <file path> /tmp
Then measure how long it takes to read it:
time hadoop fs -cat /tmp/<file name> > /dev/null
How does your hdfs read throughput compare with the raw throughput of
that drive?
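As a quick way to turn the timing into MB/s (the dd command above writes roughly 3 GiB; the 20 seconds below is just an example figure):
echo $((3 * 1024 / 20))   # ≈ 153 MB/s if the read took 20 seconds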
On Tue, Mar 12, 2013 at 9:35 AM, Sékine Coulibaly <[email protected]>
wrote:
Yes, there are multiple disks, but it's a RAID 5, so I guess it's seen by the system as one disk.
Since reducing the crunched data volume seems to be the only possible improvement given the query type, the RC format could be a good guess. I expect to load as little as 200MB and so get a performance boost of around 10x.
However, my last RC format tests were not very successful...
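Roughly what I have in mind for the next RC attempt, as a sketch only (the new table name is arbitrary and the compression settings are my assumption):
hive -e "
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
CREATE TABLE logs_rc_snappy LIKE logs_sf_snappy;
ALTER TABLE logs_rc_snappy SET FILEFORMAT RCFILE;
INSERT OVERWRITE TABLE logs_rc_snappy SELECT * FROM logs_text;
"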
2013/3/12 Nong Li <[email protected]>
Based on the runtime profile, Impala is running as fast as the hardware allows.
The runtime profile indicates that your DN is set up with 1 disk and we are getting ~85MB/sec from that disk. This sounds reasonable to me. You are reading about 2GB of data from it, so this ends up taking ~25 seconds. With this setup, Impala is completely bottlenecked on IO.
Are you expecting more IO throughput? I see you have multiple disks, but the throughput does not reflect that. I'd start by investigating the IO setup.
Here are the relevant counters from the runtime profile.
Fragment 1:
Instance 46fa998af31342a5:92e5c5ed0f0c9221:(26s138ms 0.00%)
Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:43/2.20B
<-- One disk
HDFS_SCAN_NODE (id=0):(25s880ms 99.01%)
File Formats: SEQUENCE_FILE/SNAPPY_BLOCKED:43
- BytesRead: 2.05 GB
- PerDiskReadThroughput: 82.83 MB/sec
<-- Disk throughput
On Mon, Mar 11, 2013 at 9:21 AM, Sékine Coulibaly
<[email protected]>
wrote:
About 1.9 GB Snappy-compressed; raw is around 9.5 GB.
Yes, the pg table has indexes, but not on the columns I used for my Impala queries (in other words, not on asset_name).
2013/3/8 Greg Rahn <[email protected]>
How many megabytes in size is the 37-million-row file?
Does the PostgreSQL table have any indexes?
On Fri, Mar 8, 2013 at 6:44 AM, Sékine Coulibaly
<[email protected]>
wrote:
I've set up a bench platform on a physical machine. I'm running CDH 4.2.0 and Impala 0.6 on an IBM M3 x3650 (8 HT cores, 1.6GHz, 32GB RAM, RAID 5 with 6x150GB SAS drives), running RedHat 6.3.
I'm running Impala tests on it and the results are somewhat disappointing.
I'm using a table with the following schema:
delivery_srv string
media_type string
server_name string
error_code string
dte string
count_sessions int
cmd string
status_code string
client_ip string
client_type string
session_id string
asset_name string
count_sent int
indicator int
interface string
server_ip string
customer_id string
I've loaded content into this table through Hive as follows:
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;
create table logs_text (
  delivery_pop STRING, media_library STRING, videoserver STRING,
  severity STRING, dte STRING, nb_sessions INT, cmd STRING,
  status_code STRING, client_ip STRING, browser_type STRING,
  session_id STRING, content_name STRING, nb_bytes_sent INT,
  progress INT, interface STRING, ip_port STRING, mac STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';
load data local inpath '/tmp/logs/*.txt' into table logs_text;
create table logs_sf_snappy (
  delivery_srv string, media_type string, server_name string,
  error_code string, dte string, count_sessions int, cmd string,
  status_code string, client_ip string, client_type string,
  session_id string, asset_name string, count_sent int,
  indicator int, interface string, server_ip string, customer_id string
) STORED AS SEQUENCEFILE;
insert overwrite table logs_sf_snappy select * from logs_text;
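To sanity-check the size and format of the result, something like this works (the warehouse path is the Hive default, so treat it as an assumption):
hadoop fs -du -s -h /user/hive/warehouse/logs_sf_snappy
hive -e "DESCRIBE FORMATTED logs_sf_snappy;"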
The log files contain around 37 million rows.
Here are some results :
--> Impala :
Req#1
SELECT COUNT(1) FROM logs_sf_snappy;
Query time : 2.83 sec (counted 37 million rows)
Req#2
SELECT COUNT(1),asset_name FROM logs_sf_snappy GROUP BY asset_name
LIMIT 5000;
Query time : 6.43 sec (grouped 37 million rows)
--> Postgresql 9.2 (same hw) :
Req#1
Query time : 8.3 sec
Req#2
Query time : 20 sec
These are basically the kinds of queries I'll run for my use case.
Although Impala is around 3x faster than PostgreSQL 9.2 here, I was expecting a performance difference of at least one order of magnitude (in other words, around 1 sec for Req#1 and 2 secs for Req#2), and I'm wondering if my setup is wrong.
I've set up short-circuit reads and I'm using a pseudo-distributed configuration.
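For reference, here is roughly how I double-check the short-circuit properties (the config path is the CDH default on my box, so treat it as an assumption):
grep -A1 -E 'dfs.client.read.shortcircuit|dfs.domain.socket.path' /etc/hadoop/conf/hdfs-site.xml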
I'm not sure I'm getting the most out of this very basic Impala setup, and I'd appreciate it if anyone could provide a basis for comparison and/or help point out configuration items that may be wrong.
iotop shows very light hard-drive activity at the start of the request, so disk I/O is most likely not the bottleneck. Of course, running these queries on the logs_text table takes much longer (e.g. 42 secs for Req#2).
Regarding CPU, the load average remains quite low (around 0.35) during the request processing, so I'm quite puzzled :(
Any help appreciated !
Thank you