- Which CDH version are you running?
- What else is running on your machines?
- Did you follow the instructions on this page?
https://ccp.cloudera.com/display/IMPALA10BETADOC/Configuring+Impala+for+Performance
Even though this isn't the root cause of your performance problem, this line
Hdfs split stats (<volume id>:<# splits>/<split lengths>):
1:2/179.31M 2:3/307.76M 3:4/447.74M 5:1/134.22M 6:1/131.34M
7:1/134.22M 8:1/134.22M
indicates that you have a very imbalanced physical data distribution:
disk 3 has to read about 3 times as much data (447.74M) as most of the
others (roughly 134M each).
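The imbalance can be checked mechanically. Below is a minimal Python sketch, not an Impala tool. It assumes the split stats line has the `<volume id>:<# splits>/<split lengths>` layout shown above and that every length carries the same `M` suffix, so values are compared as printed. The name `parse_split_stats` is made up for illustration.

```python
import re
from statistics import median


def parse_split_stats(line):
    """Return {volume_id: (num_splits, length)} for entries like '3:4/447.74M'.

    Assumption: every length uses the same 'M' suffix, so no unit conversion is done.
    """
    entries = re.findall(r"(\d+):(\d+)/([\d.]+)M", line)
    return {int(vol): (int(n), float(length)) for vol, n, length in entries}


# Split stats copied from the profile quoted in this thread.
line = ("1:2/179.31M 2:3/307.76M 3:4/447.74M 5:1/134.22M 6:1/131.34M "
        "7:1/134.22M 8:1/134.22M")

stats = parse_split_stats(line)
lengths = {vol: length for vol, (_, length) in stats.items()}

# Volume with the most data to read, relative to the median volume.
busiest = max(lengths, key=lengths.get)
ratio = lengths[busiest] / median(lengths.values())

# Volume ids between the smallest and largest seen that received no splits.
unused = [v for v in range(min(lengths), max(lengths) + 1) if v not in lengths]

print(f"busiest volume: {busiest} ({lengths[busiest]:.2f}M, {ratio:.1f}x the median)")
print(f"volume ids with no splits in this range: {unused}")
```

For the line quoted above, this reports volume 3 at about 3.3 times the median, and volume id 4 with no splits at all.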
My general suggestion would be to move to snappy-compressed sequence
files.
On Wed, Jan 23, 2013 at 5:36 AM, Rajesh Balamohan
wrote:
Hi Experts,
Would the following output from one of the instances help in debugging the
root cause of slow performance?
simple-scheduler.cc:171] SimpleScheduler locality percentage 100% (6 out of 6)
Instance 612825795e0e4afc:aa7c06de5b5c06da:(31s863ms 0.00%)
Hdfs split stats (<volume id>:<# splits>/<split lengths>): 1:2/179.31M
2:3/307.76M 3:4/447.74M 5:1/134.22M 6:1/131.34M 7:1/134.22M 8:1/134.22M
- RowsProduced: 1
CodeGen:
- CodegenTime: 1268K clock cycles
- CompileTime: 36ms
- LoadTime: 3ms
- ModuleFileSize: 40.16 KB
DataStreamSender:
- BytesSent: 16.00 B
- DataSinkTime: 1928K clock cycles
- SerializeBatchTime: 14K clock cycles
- ThriftTransmitTime: 670K clock cycles
AGGREGATION_NODE (id=1):(31s863ms 0.02%)
- BuildBuckets: 1.02K
- BuildTime: 3ms
- GetResultsTime: 11K clock cycles
- MemoryUsed: 32.01 KB
- RowsReturned: 1
- RowsReturnedRate: 0
HDFS_SCAN_NODE (id=0):(31s858ms 99.98%)
File Formats: TEXT/NONE:13
- BytesRead: 1.37 GB
- DelimiterParseTime: 901ms
- MaterializeTupleTime: 280K clock cycles
- MemoryUsed: 0.00
- PerDiskReadThroughput: 4.99 MB/sec
- RowsReturned: 796.95K
- RowsReturnedRate: 25.01 K/sec
- ScanRangesComplete: 13
- ScannerThreadsReadTime: 4m40s
- TotalReadThroughput: 43.71 MB/sec
Instance 612825795e0e4afc:aa7c06de5b5c06db:(26s042ms 0.00%)
On Wed, Jan 23, 2013 at 2:59 PM, Rajesh Balamohan
wrote:
Hi All,
I have been trying Impala 4.1 (17-Jan-2013 release) on 7 physical machines
with 10 disks (@ 10K RPM) per machine. I am using CDH4.1 on these machines.
However, when I run the query on Impala, I get a very low PerDiskReadThroughput.
I am sure the scan rate can be at least 10 disks x 60 MB/s = 600 MB/s per
machine.
Are there any ways to debug why I see such a low scan rate?
Any pointers would be of great help.
HDFS_SCAN_NODE (id=0):(7s314ms 99.47%)
File Formats: TEXT/NONE:26
- BytesRead: 145.12 MB
- DelimiterParseTime: 98ms
- MaterializeTupleTime: 420K clock cycles
- MemoryUsed: 0.00
- PerDiskReadThroughput: 2.52 MB/sec
- RowsReturned: 1.38M
- RowsReturnedRate: 189.21 K/sec
- ScanRangesComplete: 26
- ScannerThreadsReadTime: 57s484ms
- TotalReadThroughput: 19.87 MB/sec
--
~Rajesh.B
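As a rough cross-check of the counters in the profile above, both throughput figures can be approximately reproduced from BytesRead and the two timers. This is a sketch based only on the arithmetic of the posted numbers. That PerDiskReadThroughput corresponds to BytesRead divided by ScannerThreadsReadTime is an inference from those numbers, not something confirmed in this thread, and the variable names are illustrative.

```python
# Counters copied from the HDFS_SCAN_NODE profile in the original message.
bytes_read_mb = 145.12          # BytesRead
scan_node_time_s = 7.314        # 7s314ms, time attributed to the scan node
scanner_read_time_s = 57.484    # ScannerThreadsReadTime, 57s484ms

# Assumption: the per-disk figure is bytes read over the scanner threads' read time.
per_read_time_mb_s = bytes_read_mb / scanner_read_time_s

# Assumption: the total figure is bytes read over the scan node's elapsed time.
total_mb_s = bytes_read_mb / scan_node_time_s

# Expectation stated in the message: 10 disks x 60 MB/s per machine.
expected_mb_s = 10 * 60
fraction_of_expected = total_mb_s / expected_mb_s

print(f"bytes / scanner read time: {per_read_time_mb_s:.2f} MB/s (reported 2.52)")
print(f"bytes / scan node time:    {total_mb_s:.2f} MB/s (reported 19.87)")
print(f"share of expected 600 MB/s: {fraction_of_expected:.1%}")
```

Under these assumptions, the reported total throughput is only a few percent of the expected 600 MB/s per machine. Note also that this instance read only about 145 MB in total, so a short scan like this one may not reflect sustained disk bandwidth.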