Rajesh, a few questions:
- what CDH version are you running on?
- what else is running on your machines?
- did you follow the instructions on this page?
https://ccp.cloudera.com/display/IMPALA10BETADOC/Configuring+Impala+for+Performance

This isn't the root cause of your performance problem, but this line

Hdfs split stats (<volume id>:<# splits>/<split lengths>):
1:2/179.31M 2:3/307.76M 3:4/447.74M 5:1/134.22M 6:1/131.34M
7:1/134.22M 8:1/134.22M

indicates that your physical data distribution is very imbalanced:
disk 3 has to do about three times as much work as most of the others.
My general suggestion would be to move to Snappy-compressed sequence
files.
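
To make the imbalance concrete, here is a minimal Python sketch that parses the split stats quoted above and compares the busiest volume against the median one. The helper name `parse_split_stats` is made up for illustration and is not part of Impala; the numbers are copied from the profile in this thread.

```python
import re

def parse_split_stats(line):
    """Parse '<volume id>:<# splits>/<split lengths>' entries.

    Returns {volume_id: (num_splits, total_length_in_M)}.
    Hypothetical helper for illustration only, not an Impala API.
    """
    stats = {}
    for vol, n, size in re.findall(r"(\d+):(\d+)/([\d.]+)M", line):
        stats[int(vol)] = (int(n), float(size))
    return stats

# Values copied from the "Hdfs split stats" line in the profile.
line = ("1:2/179.31M 2:3/307.76M 3:4/447.74M 5:1/134.22M "
        "6:1/131.34M 7:1/134.22M 8:1/134.22M")

stats = parse_split_stats(line)
sizes = sorted(size for _, size in stats.values())
median = sizes[len(sizes) // 2]            # middle of the 7 volumes
busiest = max(stats, key=lambda v: stats[v][1])
ratio = stats[busiest][1] / median

print(f"busiest volume: {busiest}, {ratio:.2f}x the median volume")
```

With these numbers, volume 3 carries roughly 3.3 times the data of the median volume, which is where the "3 times as much work" estimate comes from.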

On Wed, Jan 23, 2013 at 5:36 AM, Rajesh Balamohan wrote:
Hi Experts,

Would the following output from one of the instances help in debugging the
root cause of slow performance?

simple-scheduler.cc:171] SimpleScheduler locality percentage 100% (6 out of 6)

Instance 612825795e0e4afc:aa7c06de5b5c06da:(31s863ms 0.00%)
Hdfs split stats (<volume id>:<# splits>/<split lengths>): 1:2/179.31M 2:3/307.76M 3:4/447.74M 5:1/134.22M 6:1/131.34M 7:1/134.22M 8:1/134.22M
- RowsProduced: 1
CodeGen:
- CodegenTime: 1268K clock cycles
- CompileTime: 36ms
- LoadTime: 3ms
- ModuleFileSize: 40.16 KB
DataStreamSender:
- BytesSent: 16.00 B
- DataSinkTime: 1928K clock cycles
- SerializeBatchTime: 14K clock cycles
- ThriftTransmitTime: 670K clock cycles
AGGREGATION_NODE (id=1):(31s863ms 0.02%)
- BuildBuckets: 1.02K
- BuildTime: 3ms
- GetResultsTime: 11K clock cycles
- MemoryUsed: 32.01 KB
- RowsReturned: 1
- RowsReturnedRate: 0
HDFS_SCAN_NODE (id=0):(31s858ms 99.98%)
File Formats: TEXT/NONE:13
- BytesRead: 1.37 GB
- DelimiterParseTime: 901ms
- MaterializeTupleTime: 280K clock cycles
- MemoryUsed: 0.00
- PerDiskReadThroughput: 4.99 MB/sec
- RowsReturned: 796.95K
- RowsReturnedRate: 25.01 K/sec
- ScanRangesComplete: 13
- ScannerThreadsReadTime: 4m40s
- TotalReadThroughput: 43.71 MB/sec
Instance 612825795e0e4afc:aa7c06de5b5c06db:(26s042ms 0.00%)



On Wed, Jan 23, 2013 at 2:59 PM, Rajesh Balamohan wrote:
Hi All,

I have been trying Impala 4.1 (17-Jan-2013 release) on 7 physical machines
with 10 disks (@ 10K RPM) per machine. I am using CDH4.1 on these machines.

However, when I run the query on Impala, I get a very low
PerDiskReadThroughput.

I would expect the scan rate to be at least 10 disks x 60 MB/s = 600 MB/s
per machine.

Are there any ways to debug why I see such a low scan rate?

Any pointers would be of great help.

HDFS_SCAN_NODE (id=0):(7s314ms 99.47%)
File Formats: TEXT/NONE:26
- BytesRead: 145.12 MB
- DelimiterParseTime: 98ms
- MaterializeTupleTime: 420K clock cycles
- MemoryUsed: 0.00
- PerDiskReadThroughput: 2.52 MB/sec
- RowsReturned: 1.38M
- RowsReturnedRate: 189.21 K/sec
- ScanRangesComplete: 26
- ScannerThreadsReadTime: 57s484ms
- TotalReadThroughput: 19.87 MB/sec
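
A rough Python sketch of the arithmetic behind the question, using only the counters shown above. It assumes, without confirmation from the Impala source, that the scan node's 7s314ms can be treated as elapsed time and that the expected 600 MB/s per machine is the right baseline for a single instance.

```python
# Counters copied from the HDFS_SCAN_NODE profile above.
bytes_read_mb = 145.12        # BytesRead
scan_node_secs = 7.314        # scan node time, treated as elapsed time (assumption)
thread_read_secs = 57.484     # ScannerThreadsReadTime, summed over scanner threads

# Expected rate stated in the post: 10 disks x 60 MB/s per machine.
expected_mb_per_sec = 10 * 60

# Bytes over elapsed time lands close to the reported TotalReadThroughput.
total_rate = bytes_read_mb / scan_node_secs

# Bytes over summed thread read time lands close to PerDiskReadThroughput.
per_thread_rate = bytes_read_mb / thread_read_secs

shortfall = expected_mb_per_sec / total_rate

print(f"total:      {total_rate:.2f} MB/s (profile reports 19.87)")
print(f"per thread: {per_thread_rate:.2f} MB/s (profile reports 2.52)")
print(f"expected rate is about {shortfall:.0f}x the observed rate")
```

The two derived rates come out close to the reported counters, so the low PerDiskReadThroughput appears to reflect the long cumulative time scanner threads spent in reads, not a small amount of data. Either way, the observed rate is roughly 30 times below the 600 MB/s estimate.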


--
~Rajesh.B





  • Rajesh Balamohan at Jan 23, 2013 at 4:43 pm
    Hi Marcel,

    Thanks a lot for the very quick reply. Please see my answers inline.

    - what CDH version are you running on?

    ===> CDH 4.1.2

    - what else is running on your machines?
    ===> Nothing other than CDH-related processes. Every machine runs DataNode,
    TaskTracker, and impalad. I am using hadoop-0.20-mapreduce, not YARN.
    When running Impala queries, I make sure that no MapReduce jobs are
    scheduled (as they would tend to skew the results).

    - did you follow the instructions on this page?
    ====> Yes. Please let me know if I have to paste "vars" output here for
    debugging.

    I will move the Hive tables to SequenceFile with the Snappy codec and try
    it out. (The data set is > 200 GB.)

    I am running with the default number of threads per disk (1) and the
    default "num_disks=0" value. I hope this isn't the cause of the
    degradation.

    Impala is definitely faster than Hive by a huge margin. But looking at
    "PerDiskThroughput", I got the feeling that the true horsepower of impalad
    hasn't been unlocked on my set of machines. :)

    ~Rajesh.B

    On Wed, Jan 23, 2013 at 8:57 PM, Marcel Kornacker wrote:

    Rajesh, a few questions:
    - what CDH version are you running on?
    - what else is running on your machines?
    - did you follow the instructions on this page?

    https://ccp.cloudera.com/display/IMPALA10BETADOC/Configuring+Impala+for+Performance

    Even though this isn't the root cause of your performance problem, this
    line
    Hdfs split stats (<volume id>:<# splits>/<split lengths>):
    1:2/179.31M 2:3/307.76M 3:4/447.74M 5:1/134.22M 6:1/131.34M
    7:1/134.22M 8:1/134.22M
    indicates that you have a very imbalanced physical data distribution,
    with disk 3 having to do 3 times as much work as most of the others.
    My general suggestion would be to move to snappy-compressed sequence
    files.


Discussion Overview
group: impala-user @
categories: hadoop
posted: Jan 23, '13 at 3:27p
active: Jan 23, '13 at 4:43p
posts: 2
users: 2
website: cloudera.com
irc: #hadoop
