FAQ
The query "select * from scl limit 5" in Impala is sometimes fast (~5 seconds) but sometimes very slow (~2 minutes), much slower than in Hive. I can reproduce the slowness by first flushing the Linux OS cache (echo 1 > /proc/sys/vm/drop_caches) and then running the query. Here's a summary of the query runtimes (in mm:ss):

Impala uncached, 2:12
Impala cached, 0:05
Hive uncached, 0:06
Hive cached, 0:06

Why the slowness?

My setup is Impala version 0.6 on a 10-node EC2 cluster launched by Cloudera Manager, using instance type m2.4xlarge and AMI ami-3c994355 in region us-east-1. The "scl" table contains sequence files using block compression (gzip).
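In case it helps, here is roughly what each "uncached" measurement looks like; the hostname below is illustrative, and the drop_caches step assumes root on every DataNode:

# on every node: flush the OS page cache
sync; echo 1 > /proc/sys/vm/drop_caches
# then, from one node, time the query
time impala-shell -i ip-10-0-0-1 -q "select * from scl limit 5"
time hive -e "select * from scl limit 5"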

Thanks.


  • Steven Wong at Mar 20, 2013 at 9:13 pm
    Query profiles below. Let me know if there's additional info I can supply.

    For cached (fast) query:

    Query (id=bc7205ab6c2d4495:98ef69e1dea7d300):
    - PlanningTime: 321ms
    Query bc7205ab6c2d4495:98ef69e1dea7d300:(4s674ms 0.00%)
    Aggregate Profile:
    Coordinator Fragment:(4s229ms 0.00%)
    - RowsProduced: 5
    CodeGen:
    - CodegenTime: 0K clock cycles
    - CompileTime: 50ms
    - LoadTime: 4ms
    - ModuleFileSize: 44.62 KB
    EXCHANGE_NODE (id=1):(4s229ms 100.00%)
    - BytesReceived: 9.60 KB
    - ConvertRowBatchTime: 4K clock cycles
    - DeserializeRowBatchTimer: 23K clock cycles
    - MemoryUsed: 0.00
    - RowsReturned: 5
    - RowsReturnedRate: 1.00 /sec
    Averaged Fragment 1:(4s302ms 0.00%)
    completion times: min:4s257ms max:4s375ms mean: 4s314ms stddev:39ms
    execution rates: min:61.05 GB/sec max:62.87 GB/sec mean:62.02 GB/sec stddev:602.89 MB/sec
    split sizes: min: 266.80 GB, max: 267.75 GB, avg: 267.55 GB, stddev: 266.16 MB
    - RowsProduced: 0
    CodeGen:
    - CodegenTime: 10ms
    - CompileTime: 156ms
    - LoadTime: 4ms
    - ModuleFileSize: 44.62 KB
    DataStreamSender:
    - BytesSent: 983.00 B
    - DataSinkTime: 39K clock cycles
    - SerializeBatchTime: 6K clock cycles
    - ThriftTransmitTime: 278K clock cycles
    HDFS_SCAN_NODE (id=0):(4s299ms 99.93%)
    - BytesRead: 11.87 MB
    - DecompressionTime: 333ms
    - MaterializeTupleTime: 545K clock cycles
    - MemoryUsed: 0.00
    - PerDiskReadThroughput: 69.07 MB/sec
    - RowsReturned: 0
    - RowsReturnedRate: 0
    - ScanRangesComplete: 0
    - ScannerThreadsReadTime: 598ms
    - TotalReadThroughput: 1.57 MB/sec
    Fragment 1:
    Instance bc7205ab6c2d4495:98ef69e1dea7d302:(4s302ms 0.07%)
    Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:1758/141.54B 1:1804/145.96B
    - RowsProduced: 0
    CodeGen:
    - CodegenTime: 11ms
    - CompileTime: 155ms
    - LoadTime: 4ms
    - ModuleFileSize: 44.62 KB
    DataStreamSender:
    - BytesSent: 0.00
    - DataSinkTime: 0K clock cycles
    - SerializeBatchTime: 0K clock cycles
    - ThriftTransmitTime: 0K clock cycles
    HDFS_SCAN_NODE (id=0):(4s299ms 99.93%)
    - BytesRead: 3.20 MB
    - DecompressionTime: 0K clock cycles
    - MaterializeTupleTime: 0K clock cycles
    - MemoryUsed: 0.00
    - PerDiskReadThroughput: 5.14 MB/sec
    - RowsReturned: 0
    - RowsReturnedRate: 0
    - ScanRangesComplete: 0
    - ScannerThreadsReadTime: 622ms
    - TotalReadThroughput: 726.67 KB/sec
    Instance bc7205ab6c2d4495:98ef69e1dea7d303:(4s302ms 0.05%)
    Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:1804/143.86B 1:1784/143.54B
    - RowsProduced: 0
    CodeGen:
    - CodegenTime: 10ms
    - CompileTime: 158ms
    - LoadTime: 4ms
    - ModuleFileSize: 44.62 KB
    DataStreamSender:
    - BytesSent: 0.00
    - DataSinkTime: 0K clock cycles
    - SerializeBatchTime: 0K clock cycles
    - ThriftTransmitTime: 0K clock cycles
    HDFS_SCAN_NODE (id=0):(4s300ms 99.95%)
    - BytesRead: 2.12 MB
    - DecompressionTime: 0K clock cycles
    - MaterializeTupleTime: 0K clock cycles
    - MemoryUsed: 0.00
    - PerDiskReadThroughput: 1.25 MB/sec
    - RowsReturned: 0
    - RowsReturnedRate: 0
    - ScanRangesComplete: 0
    - ScannerThreadsReadTime: 1s697ms
    - TotalReadThroughput: 481.55 KB/sec
    Instance bc7205ab6c2d4495:98ef69e1dea7d304:(4s276ms 0.07%)
    Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:1802/146.59B 1:1710/140.67B
    - RowsProduced: 0
    CodeGen:
    - CodegenTime: 10ms
    - CompileTime: 157ms
    - LoadTime: 4ms
    - ModuleFileSize: 44.62 KB
    DataStreamSender:
    - BytesSent: 0.00
    - DataSinkTime: 0K clock cycles
    - SerializeBatchTime: 0K clock cycles
    - ThriftTransmitTime: 0K clock cycles
    HDFS_SCAN_NODE (id=0):(4s273ms 99.93%)
    - BytesRead: 3.10 MB
    - DecompressionTime: 0K clock cycles
    - MaterializeTupleTime: 0K clock cycles
    - MemoryUsed: 0.00
    - PerDiskReadThroughput: 16.54 MB/sec
    - RowsReturned: 0
    - RowsReturnedRate: 0
    - ScanRangesComplete: 0
    - ScannerThreadsReadTime: 187ms
    - TotalReadThroughput: 704.22 KB/sec
    Instance bc7205ab6c2d4495:98ef69e1dea7d305:(4s312ms 0.07%)
    Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:1826/143.05B 1:1739/144.15B
    - RowsProduced: 0
    CodeGen:
    - CodegenTime: 10ms
    - CompileTime: 156ms
    - LoadTime: 4ms
    - ModuleFileSize: 44.62 KB
    DataStreamSender:
    - BytesSent: 0.00
    - DataSinkTime: 0K clock cycles
    - SerializeBatchTime: 0K clock cycles
    - ThriftTransmitTime: 0K clock cycles
    HDFS_SCAN_NODE (id=0):(4s309ms 99.93%)
    - BytesRead: 3.09 MB
    - DecompressionTime: 0K clock cycles
    - MaterializeTupleTime: 0K clock cycles
    - MemoryUsed: 0.00
    - PerDiskReadThroughput: 7.19 MB/sec
    - RowsReturned: 0
    - RowsReturnedRate: 0
    - ScanRangesComplete: 0
    - ScannerThreadsReadTime: 429ms
    - TotalReadThroughput: 698.89 KB/sec
    Instance bc7205ab6c2d4495:98ef69e1dea7d306:(4s349ms 0.07%)
    Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:2075/140.81B 1:2101/145.67B
    - RowsProduced: 0
    CodeGen:
    - CodegenTime: 10ms
    - CompileTime: 155ms
    - LoadTime: 4ms
    - ModuleFileSize: 44.62 KB
    DataStreamSender:
    - BytesSent: 0.00
    - DataSinkTime: 0K clock cycles
    - SerializeBatchTime: 0K clock cycles
    - ThriftTransmitTime: 0K clock cycles
    HDFS_SCAN_NODE (id=0):(4s346ms 99.93%)
    - BytesRead: 3.16 MB
    - DecompressionTime: 0K clock cycles
    - MaterializeTupleTime: 0K clock cycles
    - MemoryUsed: 0.00
    - PerDiskReadThroughput: 17.38 MB/sec
    - RowsReturned: 0
    - RowsReturnedRate: 0
    - ScanRangesComplete: 0
    - ScannerThreadsReadTime: 181ms
    - TotalReadThroughput: 717.78 KB/sec
    Instance bc7205ab6c2d4495:98ef69e1dea7d307:(4s268ms 0.08%)
    Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:1833/145.10B 1:1774/142.27B
    - RowsProduced: 0
    CodeGen:
    - CodegenTime: 11ms
    - CompileTime: 156ms
    - LoadTime: 4ms
    - ModuleFileSize: 44.62 KB
    DataStreamSender:
    - BytesSent: 0.00
    - DataSinkTime: 0K clock cycles
    - SerializeBatchTime: 0K clock cycles
    - ThriftTransmitTime: 0K clock cycles
    HDFS_SCAN_NODE (id=0):(4s264ms 99.92%)
    - BytesRead: 3.11 MB
    - DecompressionTime: 0K clock cycles
    - MaterializeTupleTime: 0K clock cycles
    - MemoryUsed: 0.00
    - PerDiskReadThroughput: 5.18 MB/sec
    - RowsReturned: 0
    - RowsReturnedRate: 0
    - ScanRangesComplete: 0
    - ScannerThreadsReadTime: 598ms
    - TotalReadThroughput: 706.00 KB/sec
    Instance bc7205ab6c2d4495:98ef69e1dea7d308:(4s237ms 0.08%)
    Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:1778/143.33B 1:1811/144.03B
    - RowsProduced: 5
    CodeGen:
    - CodegenTime: 10ms
    - CompileTime: 156ms
    - LoadTime: 4ms
    - ModuleFileSize: 44.62 KB
    DataStreamSender:
    - BytesSent: 9.60 KB
    - DataSinkTime: 393K clock cycles
    - SerializeBatchTime: 62K clock cycles
    - ThriftTransmitTime: 1ms
    HDFS_SCAN_NODE (id=0):(4s234ms 99.92%)
    - BytesRead: 67.32 MB
    - DecompressionTime: 3s332ms
    - MaterializeTupleTime: 2ms
    - MemoryUsed: 0.00
    - PerDiskReadThroughput: 46.50 MB/sec
    - RowsReturned: 5
    - RowsReturnedRate: 1.00 /sec
    - ScanRangesComplete: 2
    - ScannerThreadsReadTime: 1s447ms
    - TotalReadThroughput: 9.63 MB/sec
    Instance bc7205ab6c2d4495:98ef69e1dea7d309:(4s266ms 0.07%)
    Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:1878/147.69B 1:1782/139.64B
    - RowsProduced: 0
    CodeGen:
    - CodegenTime: 10ms
    - CompileTime: 155ms
    - LoadTime: 4ms
    - ModuleFileSize: 44.62 KB
    DataStreamSender:
    - BytesSent: 0.00
    - DataSinkTime: 0K clock cycles
    - SerializeBatchTime: 0K clock cycles
    - ThriftTransmitTime: 0K clock cycles
    HDFS_SCAN_NODE (id=0):(4s262ms 99.93%)
    - BytesRead: 27.28 MB
    - DecompressionTime: 0K clock cycles
    - MaterializeTupleTime: 0K clock cycles
    - MemoryUsed: 0.00
    - PerDiskReadThroughput: 574.00 MB/sec
    - RowsReturned: 0
    - RowsReturnedRate: 0
    - ScanRangesComplete: 0
    - ScannerThreadsReadTime: 47ms
    - TotalReadThroughput: 726.89 KB/sec
    Instance bc7205ab6c2d4495:98ef69e1dea7d30a:(4s345ms 0.07%)
    Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:1857/149.40B 1:1748/138.02B
    - RowsProduced: 0
    CodeGen:
    - CodegenTime: 10ms
    - CompileTime: 158ms
    - LoadTime: 4ms
    - ModuleFileSize: 44.62 KB
    DataStreamSender:
    - BytesSent: 0.00
    - DataSinkTime: 0K clock cycles
    - SerializeBatchTime: 0K clock cycles
    - ThriftTransmitTime: 0K clock cycles
    HDFS_SCAN_NODE (id=0):(4s342ms 99.93%)
    - BytesRead: 3.18 MB
    - DecompressionTime: 0K clock cycles
    - MaterializeTupleTime: 0K clock cycles
    - MemoryUsed: 0.00
    - PerDiskReadThroughput: 6.60 MB/sec
    - RowsReturned: 0
    - RowsReturnedRate: 0
    - ScanRangesComplete: 0
    - ScannerThreadsReadTime: 482ms
    - TotalReadThroughput: 724.22 KB/sec
    Instance bc7205ab6c2d4495:98ef69e1dea7d30b:(4s365ms 0.08%)
    Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:1821/147.52B 1:1711/139.92B
    - RowsProduced: 0
    CodeGen:
    - CodegenTime: 10ms
    - CompileTime: 155ms
    - LoadTime: 4ms
    - ModuleFileSize: 44.62 KB
    DataStreamSender:
    - BytesSent: 0.00
    - DataSinkTime: 0K clock cycles
    - SerializeBatchTime: 0K clock cycles
    - ThriftTransmitTime: 0K clock cycles
    HDFS_SCAN_NODE (id=0):(4s362ms 99.92%)
    - BytesRead: 3.19 MB
    - DecompressionTime: 0K clock cycles
    - MaterializeTupleTime: 0K clock cycles
    - MemoryUsed: 0.00
    - PerDiskReadThroughput: 10.88 MB/sec
    - RowsReturned: 0
    - RowsReturnedRate: 0
    - ScanRangesComplete: 0
    - ScannerThreadsReadTime: 293ms
    - TotalReadThroughput: 726.22 KB/sec


    For uncached (slow) query:

    Query (id=905d867225e5471f:bdd72b4f09d29ce5):
    - PlanningTime: 24s617ms
    Query 905d867225e5471f:bdd72b4f09d29ce5:(2m12s 0.00%)
    Aggregate Profile:
    Coordinator Fragment:(2m10s 0.00%)
    - RowsProduced: 5
    CodeGen:
    - CodegenTime: 0K clock cycles
    - CompileTime: 51ms
    - LoadTime: 16ms
    - ModuleFileSize: 44.62 KB
    EXCHANGE_NODE (id=1):(2m10s 100.00%)
    - BytesReceived: 9.73 KB
    - ConvertRowBatchTime: 5K clock cycles
    - DeserializeRowBatchTimer: 14ms
    - MemoryUsed: 0.00
    - RowsReturned: 5
    - RowsReturnedRate: 0
    Averaged Fragment 1:(2m11s 0.00%)
    completion times: min:2m10s max:2m11s mean: 2m11s stddev:369ms
    execution rates: min:2.03 GB/sec max:2.05 GB/sec mean:2.04 GB/sec stddev:6.15 MB/sec
    split sizes: min: 266.80 GB, max: 267.75 GB, avg: 267.55 GB, stddev: 266.16 MB
    - RowsProduced: 0
    CodeGen:
    - CodegenTime: 11ms
    - CompileTime: 155ms
    - LoadTime: 17ms
    - ModuleFileSize: 44.62 KB
    DataStreamSender:
    - BytesSent: 996.00 B
    - DataSinkTime: 52ms
    - SerializeBatchTime: 27ms
    - ThriftTransmitTime: 28ms
    HDFS_SCAN_NODE (id=0):(2m11s 100.00%)
    - BytesRead: 7.81 MB
    - DecompressionTime: 276ms
    - MaterializeTupleTime: 2262K clock cycles
    - MemoryUsed: 0.00
    - PerDiskReadThroughput: 152.63 KB/sec
    - RowsReturned: 0
    - RowsReturnedRate: 0
    - ScanRangesComplete: 0
    - ScannerThreadsReadTime: 1m
    - TotalReadThroughput: 48.64 KB/sec
    Fragment 1:
    Instance 905d867225e5471f:bdd72b4f09d29ce7:(2m11s 0.00%)
    Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:1758/141.54B 1:1804/145.96B
    - RowsProduced: 0
    CodeGen:
    - CodegenTime: 11ms
    - CompileTime: 155ms
    - LoadTime: 5ms
    - ModuleFileSize: 44.62 KB
    DataStreamSender:
    - BytesSent: 0.00
    - DataSinkTime: 0K clock cycles
    - SerializeBatchTime: 0K clock cycles
    - ThriftTransmitTime: 0K clock cycles
    HDFS_SCAN_NODE (id=0):(2m11s 100.00%)
    - BytesRead: 3.13 MB
    - DecompressionTime: 0K clock cycles
    - MaterializeTupleTime: 0K clock cycles
    - MemoryUsed: 0.00
    - PerDiskReadThroughput: 49.35 KB/sec
    - RowsReturned: 0
    - RowsReturnedRate: 0
    - ScanRangesComplete: 0
    - ScannerThreadsReadTime: 1m4s
    - TotalReadThroughput: 24.37 KB/sec
    Instance 905d867225e5471f:bdd72b4f09d29ce8:(2m10s 0.00%)
    Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:1804/143.86B 1:1784/143.54B
    - RowsProduced: 0
    CodeGen:
    - CodegenTime: 11ms
    - CompileTime: 156ms
    - LoadTime: 12ms
    - ModuleFileSize: 44.62 KB
    DataStreamSender:
    - BytesSent: 0.00
    - DataSinkTime: 0K clock cycles
    - SerializeBatchTime: 0K clock cycles
    - ThriftTransmitTime: 0K clock cycles
    HDFS_SCAN_NODE (id=0):(2m10s 100.00%)
    - BytesRead: 2.04 MB
    - DecompressionTime: 0K clock cycles
    - MaterializeTupleTime: 0K clock cycles
    - MemoryUsed: 0.00
    - PerDiskReadThroughput: 19.70 KB/sec
    - RowsReturned: 0
    - RowsReturnedRate: 0
    - ScanRangesComplete: 0
    - ScannerThreadsReadTime: 1m46s
    - TotalReadThroughput: 15.91 KB/sec
    Instance 905d867225e5471f:bdd72b4f09d29ce9:(2m11s 0.00%)
    Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:1802/146.59B 1:1710/140.67B
    - RowsProduced: 0
    CodeGen:
    - CodegenTime: 11ms
    - CompileTime: 157ms
    - LoadTime: 16ms
    - ModuleFileSize: 44.62 KB
    DataStreamSender:
    - BytesSent: 0.00
    - DataSinkTime: 0K clock cycles
    - SerializeBatchTime: 0K clock cycles
    - ThriftTransmitTime: 0K clock cycles
    HDFS_SCAN_NODE (id=0):(2m11s 100.00%)
    - BytesRead: 3.01 MB
    - DecompressionTime: 0K clock cycles
    - MaterializeTupleTime: 0K clock cycles
    - MemoryUsed: 0.00
    - PerDiskReadThroughput: 41.62 KB/sec
    - RowsReturned: 0
    - RowsReturnedRate: 0
    - ScanRangesComplete: 0
    - ScannerThreadsReadTime: 1m14s
    - TotalReadThroughput: 23.43 KB/sec
    Instance 905d867225e5471f:bdd72b4f09d29cea:(2m11s 0.00%)
    Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:1826/143.05B 1:1739/144.15B
    - RowsProduced: 0
    CodeGen:
    - CodegenTime: 11ms
    - CompileTime: 156ms
    - LoadTime: 15ms
    - ModuleFileSize: 44.62 KB
    DataStreamSender:
    - BytesSent: 0.00
    - DataSinkTime: 0K clock cycles
    - SerializeBatchTime: 0K clock cycles
    - ThriftTransmitTime: 0K clock cycles
    HDFS_SCAN_NODE (id=0):(2m11s 100.00%)
    - BytesRead: 3.01 MB
    - DecompressionTime: 0K clock cycles
    - MaterializeTupleTime: 0K clock cycles
    - MemoryUsed: 0.00
    - PerDiskReadThroughput: 59.28 KB/sec
    - RowsReturned: 0
    - RowsReturnedRate: 0
    - ScanRangesComplete: 0
    - ScannerThreadsReadTime: 51s960ms
    - TotalReadThroughput: 23.38 KB/sec
    Instance 905d867225e5471f:bdd72b4f09d29ceb:(2m11s 0.00%)
    Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:2075/140.81B 1:2101/145.67B
    - RowsProduced: 0
    CodeGen:
    - CodegenTime: 10ms
    - CompileTime: 155ms
    - LoadTime: 27ms
    - ModuleFileSize: 44.62 KB
    DataStreamSender:
    - BytesSent: 0.00
    - DataSinkTime: 0K clock cycles
    - SerializeBatchTime: 0K clock cycles
    - ThriftTransmitTime: 0K clock cycles
    HDFS_SCAN_NODE (id=0):(2m11s 100.00%)
    - BytesRead: 3.11 MB
    - DecompressionTime: 0K clock cycles
    - MaterializeTupleTime: 0K clock cycles
    - MemoryUsed: 0.00
    - PerDiskReadThroughput: 134.09 KB/sec
    - RowsReturned: 0
    - RowsReturnedRate: 0
    - ScanRangesComplete: 0
    - ScannerThreadsReadTime: 23s767ms
    - TotalReadThroughput: 24.21 KB/sec
    Instance 905d867225e5471f:bdd72b4f09d29cec:(2m11s 0.00%)
    Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:1833/145.10B 1:1774/142.27B
    - RowsProduced: 0
    CodeGen:
    - CodegenTime: 11ms
    - CompileTime: 155ms
    - LoadTime: 14ms
    - ModuleFileSize: 44.62 KB
    DataStreamSender:
    - BytesSent: 0.00
    - DataSinkTime: 0K clock cycles
    - SerializeBatchTime: 0K clock cycles
    - ThriftTransmitTime: 0K clock cycles
    HDFS_SCAN_NODE (id=0):(2m11s 100.00%)
    - BytesRead: 3.07 MB
    - DecompressionTime: 0K clock cycles
    - MaterializeTupleTime: 0K clock cycles
    - MemoryUsed: 0.00
    - PerDiskReadThroughput: 54.03 KB/sec
    - RowsReturned: 0
    - RowsReturnedRate: 0
    - ScanRangesComplete: 0
    - ScannerThreadsReadTime: 58s112ms
    - TotalReadThroughput: 23.80 KB/sec
    Instance 905d867225e5471f:bdd72b4f09d29ced:(2m9s 0.00%)
    Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:1778/143.33B 1:1811/144.03B
    - RowsProduced: 5
    CodeGen:
    - CodegenTime: 11ms
    - CompileTime: 154ms
    - LoadTime: 15ms
    - ModuleFileSize: 44.62 KB
    DataStreamSender:
    - BytesSent: 9.73 KB
    - DataSinkTime: 523ms
    - SerializeBatchTime: 270ms
    - ThriftTransmitTime: 288ms
    HDFS_SCAN_NODE (id=0):(2m9s 100.00%)
    - BytesRead: 43.30 MB
    - DecompressionTime: 2s767ms
    - MaterializeTupleTime: 8ms
    - MemoryUsed: 0.00
    - PerDiskReadThroughput: 840.20 KB/sec
    - RowsReturned: 5
    - RowsReturnedRate: 0
    - ScanRangesComplete: 4
    - ScannerThreadsReadTime: 52s768ms
    - TotalReadThroughput: 278.02 KB/sec
    Instance 905d867225e5471f:bdd72b4f09d29cee:(2m11s 0.00%)
    Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:1878/147.69B 1:1782/139.64B
    - RowsProduced: 0
    CodeGen:
    - CodegenTime: 11ms
    - CompileTime: 155ms
    - LoadTime: 11ms
    - ModuleFileSize: 44.62 KB
    DataStreamSender:
    - BytesSent: 0.00
    - DataSinkTime: 0K clock cycles
    - SerializeBatchTime: 0K clock cycles
    - ThriftTransmitTime: 0K clock cycles
    HDFS_SCAN_NODE (id=0):(2m11s 100.00%)
    - BytesRead: 11.25 MB
    - DecompressionTime: 0K clock cycles
    - MaterializeTupleTime: 0K clock cycles
    - MemoryUsed: 0.00
    - PerDiskReadThroughput: 217.42 KB/sec
    - RowsReturned: 0
    - RowsReturnedRate: 0
    - ScanRangesComplete: 0
    - ScannerThreadsReadTime: 52s966ms
    - TotalReadThroughput: 25.22 KB/sec
    Instance 905d867225e5471f:bdd72b4f09d29cef:(2m11s 0.00%)
    Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:1857/149.40B 1:1748/138.02B
    - RowsProduced: 0
    CodeGen:
    - CodegenTime: 10ms
    - CompileTime: 154ms
    - LoadTime: 14ms
    - ModuleFileSize: 44.62 KB
    DataStreamSender:
    - BytesSent: 0.00
    - DataSinkTime: 0K clock cycles
    - SerializeBatchTime: 0K clock cycles
    - ThriftTransmitTime: 0K clock cycles
    HDFS_SCAN_NODE (id=0):(2m11s 100.00%)
    - BytesRead: 3.12 MB
    - DecompressionTime: 0K clock cycles
    - MaterializeTupleTime: 0K clock cycles
    - MemoryUsed: 0.00
    - PerDiskReadThroughput: 58.31 KB/sec
    - RowsReturned: 0
    - RowsReturnedRate: 0
    - ScanRangesComplete: 0
    - ScannerThreadsReadTime: 54s789ms
    - TotalReadThroughput: 24.10 KB/sec
    Instance 905d867225e5471f:bdd72b4f09d29cf0:(2m11s 0.00%)
    Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:1821/147.52B 1:1711/139.92B
    - RowsProduced: 0
    CodeGen:
    - CodegenTime: 11ms
    - CompileTime: 155ms
    - LoadTime: 44ms
    - ModuleFileSize: 44.62 KB
    DataStreamSender:
    - BytesSent: 0.00
    - DataSinkTime: 0K clock cycles
    - SerializeBatchTime: 0K clock cycles
    - ThriftTransmitTime: 0K clock cycles
    HDFS_SCAN_NODE (id=0):(2m11s 100.00%)
    - BytesRead: 3.11 MB
    - DecompressionTime: 0K clock cycles
    - MaterializeTupleTime: 0K clock cycles
    - MemoryUsed: 0.00
    - PerDiskReadThroughput: 52.29 KB/sec
    - RowsReturned: 0
    - RowsReturnedRate: 0
    - ScanRangesComplete: 0
    - ScannerThreadsReadTime: 1m
    - TotalReadThroughput: 23.98 KB/sec



    ________________________________
    From: Greg Rahn [grahn@cloudera.com]
    Sent: Tuesday, March 19, 2013 5:49 PM
    To: impala-user@cloudera.org
    Subject: Re: select * from table limit 5

    In order to answer "why" we need to quantify "what", as in what is the execution profile in each case. Could you provide the query profiles for both the cached and uncached (fast and slow) impala executions?
    You can obtain these by going to the Impala URL for whatever node you are connected to (defaults to port 25000), clicking the /queries link, finding the execution, clicking the "Profile" link, and attaching that text for both cases.
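    If it is easier to script, the same page can also be pulled straight from the debug webserver, along these lines (hostname illustrative):

    $ curl -s http://impalad-host:25000/queries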


    On Tue, Mar 19, 2013 at 1:21 PM, Steven Wong wrote:
    The query "select * from scl limit 5" in Impala is sometimes fast (~5 seconds) but sometimes very slow (~2 minutes), much slower than in Hive. I can reproduce the slowness by first flushing the Linux OS cache (echo 1 > /proc/sys/vm/drop_caches) and then running the query. Here's a summary of the query runtimes (in mm:ss):

    Impala uncached, 2:12
    Impala cached, 0:05
    Hive uncached, 0:06
    Hive cached, 0:06

    Why the slowness?

    My setup is Impala version 0.6 on a 10-node EC2 cluster launched by Cloudera Manager, using instance type m2.4xlarge and AMI ami-3c994355 in region us-east-1. The "scl" table contains sequence files using block compression (gzip).

    Thanks.
  • Steven Wong at Mar 21, 2013 at 9:45 pm
    Should I look into /var/log/impalad/impalad.INFO on all nodes or only the node I connected to? What particular words or text patterns should I look for in the log?
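    Is a blind scan along these lines sufficient, or is there something more specific to look for? (The pattern below is just my guess.)

    grep -iE "error|warn" /var/log/impalad/impalad.INFO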

    Runtimes in mm:ss for "select count(*) from scl where [1 partition containing 339M rows]":

    Impala uncached, 1:55
    Impala cached, 0:32
    Hive uncached, 2:30
    Hive cached, 1:38

    For count(*), Impala is faster than Hive, cached or uncached. In contrast, for select * limit 5, Impala is much slower than Hive if uncached and roughly equal to Hive if cached, as seen from the runtimes in my original email.

    Note that a difference between the 2 queries is that the count(*) has a predicate to prune away all except 1 partition, whereas the select limit 5 doesn't have any predicate.
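    For reference, the pruned query has this general shape; the partition column name and value below are placeholders, not the real ones:

    select count(*) from scl where ds = '2013-03-01';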


    ________________________________
    From: Greg Rahn [grahn@cloudera.com]
    Sent: Wednesday, March 20, 2013 4:58 PM
    To: impala-user@cloudera.org
    Subject: Re: select * from table limit 5

    In looking at the query profiles, it is apparent that there is a dramatic difference in the IO rates. Even in the "fast" scenario, it is still rather slow on an absolute scale (access to data in the file system page cache should be sub second). I've put the relevant sections/data below. What this shows is that for the "fast" scenario, impala reads 67.32 MB of data in 1s447ms, then spends 3s332ms decompressing it. In the "slow" case, it reads 43.30 MB of data in 52s768ms (horribly slow), then spends 2s767ms decompressing it.

    Could you validate that you don't see any errors or warnings in your /var/log/impalad/impalad.INFO files?

    Out of curiosity, how does a count(*) for this table compare between hive and impala (or any full table scan query)?

    As a comparison point, on my system (10 data nodes) I can flush the page cache and impala is faster than hive for a table backed by delimited text, even if I run hive second (so it should benefit from the first impala run).

    $ impala-shell -V -i localhost -q "select * from foo limit 5;"
    Query finished, fetching results ...
    17938 7931 1 2
    17939 7816 3 3
    17940 7819 2 2
    17941 7969 4 3
    17942 7884 1 2
    Returned 5 row(s) in 0.98s

    $ hive -e "select * from foo limit 5;"
    Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
    Hive history file=/tmp/grahn/hive_job_log_grahn_201303201650_663744682.txt
    OK
    17938 7931 1 2
    17939 7816 3 3
    17940 7819 2 2
    17941 7969 4 3
    17942 7884 1 2
    Time taken: 5.217 seconds

    Data from the provided profiles:
    -- fast
    Instance bc7205ab6c2d4495:98ef69e1dea7d308:(4s237ms 0.08%)
    Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:1778/143.33B 1:1811/144.03B
    - RowsProduced: 5
    ...
    HDFS_SCAN_NODE (id=0):(4s234ms 99.92%)
    - BytesRead: 67.32 MB
    - DecompressionTime: 3s332ms
    - MaterializeTupleTime: 2ms
    - MemoryUsed: 0.00
    - PerDiskReadThroughput: 46.50 MB/sec
    - RowsReturned: 5
    - RowsReturnedRate: 1.00 /sec
    - ScanRangesComplete: 2
    - ScannerThreadsReadTime: 1s447ms
    - TotalReadThroughput: 9.63 MB/sec

    -- slow
    Instance 905d867225e5471f:bdd72b4f09d29ced:(2m9s 0.00%)
    Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:1778/143.33B 1:1811/144.03B
    - RowsProduced: 5
    ...
    HDFS_SCAN_NODE (id=0):(2m9s 100.00%)
    - BytesRead: 43.30 MB
    - DecompressionTime: 2s767ms
    - MaterializeTupleTime: 8ms
    - MemoryUsed: 0.00
    - PerDiskReadThroughput: 840.20 KB/sec
    - RowsReturned: 5
    - RowsReturnedRate: 0
    - ScanRangesComplete: 4
    - ScannerThreadsReadTime: 52s768ms [this is very slow]
    - TotalReadThroughput: 278.02 KB/sec

    -- example of cached data from my system using delimited text files for "select * from foo limit 5;"

    HDFS_SCAN_NODE (id=0):(83ms 90.59%)
    - BytesRead: 3.11 MB
    - DelimiterParseTime: 8ms
    - MaterializeTupleTime: 334K clock cycles
    - MemoryUsed: 0.00
    - PerDiskReadThroughput: 1.92 GB/sec
    - RowsReturned: 5
    - RowsReturnedRate: 59.00 /sec
    - ScanRangesComplete: 1
    - ScannerThreadsReadTime: 1ms
    - TotalReadThroughput: 0.00 /sec

    On Wed, Mar 20, 2013 at 2:13 PM, Steven Wong wrote:
    Query profiles below. Let me know if there's additional info I can supply.
    <snip>
  • Marcel Kornacker at Mar 21, 2013 at 10:21 pm

    On Thu, Mar 21, 2013 at 3:10 PM, Greg Rahn wrote:
    For the query originally in question (select * from foo limit 5), hive does
    not issue a map reduce job; it looks to run it in "local mode", so it is
    basically just getting the first block of the first file and returning the
    results using the hive shell jvm to make this call. In contrast, impala is
    actually running in fully distributed mode, which puts it at a disadvantage.
    If impala were to only use the coordinator node you are connected to and
    make the same request as hive does, it would be much faster. This is an
    optimization that can be looked into. I've filed IMPALA-165 for this.
    Actually, there's a query option that will make it run single-node.

    Steve, could you try setting NUM_NODES=1 in the impala shell before
    running your query?
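    I.e., something like this in impala-shell, then rerun the select (prompt abbreviated):

    > set NUM_NODES=1
    > select * from scl limit 5;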
    On Thu, Mar 21, 2013 at 2:45 PM, Steven Wong wrote:

    Should I look into /var/log/impalad/impalad.INFO on all nodes or only the
    node I connected to? What particular words or text patterns should I look
    for in the log?

    Runtimes in mm:ss for "select count(*) from scl where [1 partition
    containing 339M rows]":

    Impala uncached, 1:55
    Impala cached, 0:32
    Hive uncached, 2:30
    Hive cached, 1:38


    For count(*), Impala is faster than Hive, cached or uncached. In contrast,
    for select * limit 5, Impala is much slower than Hive if uncached and
    roughly equal to Hive if cached, as seen from the runtimes in my original
    email.

    Note that a difference between the 2 queries is that the count(*) has a
    predicate to prune away all except 1 partition, whereas the select limit 5
    doesn't have any predicate.


    ________________________________
    From: Greg Rahn [grahn@cloudera.com]
    Sent: Wednesday, March 20, 2013 4:58 PM

    To: impala-user@cloudera.org
    Subject: Re: select * from table limit 5

    In looking at the query profiles, it is apparent that there is a dramatic
    difference in the IO rates. Even in the "fast" scenario, it is still rather
    slow on an absolute scale (access to data in the file system page cache
    should be sub second). I've put the relevant sections/data below. What
    this shows is that for the "fast" scenario, impala reads 67.32 MB of data in
    1s447ms, then spends 3s332ms decompressing it. In the "slow" case, it reads
    43.30 MB of data in 52s768ms (horribly slow), then spends 2s767ms
    decompressing it.

    Could you validate that you don't see any errors or warnings in your
    /var/log/impalad/impalad.INFO files?

    Out of curiosity, how does a count(*) for this table compare between hive
    and impala (or any full table scan query)?

    As a comparison point, on my system (10 data nodes) I can flush the page
    cache and impala is faster than hive for a table backed by delimited text,
    even if I run hive second (so it should benefit from the first impala run).

    $ impala-shell -V -i localhost -q "select * from foo limit 5;"
    Query finished, fetching results ...
    17938 7931 1 2
    17939 7816 3 3
    17940 7819 2 2
    17941 7969 4 3
    17942 7884 1 2
    Returned 5 row(s) in 0.98s

    $ hive -e "select * from foo limit 5;"
    Logging initialized using configuration in
    file:/etc/hive/conf.dist/hive-log4j.properties
    Hive history file=/tmp/grahn/hive_job_log_grahn_201303201650_663744682.txt
    OK
    17938 7931 1 2
    17939 7816 3 3
    17940 7819 2 2
    17941 7969 4 3
    17942 7884 1 2
    Time taken: 5.217 seconds

    Data from the provided profiles:
    -- fast
    Instance bc7205ab6c2d4495:98ef69e1dea7d308:(4s237ms 0.08%)
    Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:1778/143.33B 1:1811/144.03B
    - RowsProduced: 5
    ...
    HDFS_SCAN_NODE (id=0):(4s234ms 99.92%)
    - BytesRead: 67.32 MB
    - DecompressionTime: 3s332ms
    - MaterializeTupleTime: 2ms
    - MemoryUsed: 0.00
    - PerDiskReadThroughput: 46.50 MB/sec
    - RowsReturned: 5
    - RowsReturnedRate: 1.00 /sec
    - ScanRangesComplete: 2
    - ScannerThreadsReadTime: 1s447ms
    - TotalReadThroughput: 9.63 MB/sec

    -- slow
    Instance 905d867225e5471f:bdd72b4f09d29ced:(2m9s 0.00%)
    Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:1778/143.33B 1:1811/144.03B
    - RowsProduced: 5
    ...
    HDFS_SCAN_NODE (id=0):(2m9s 100.00%)
    - BytesRead: 43.30 MB
    - DecompressionTime: 2s767ms
    - MaterializeTupleTime: 8ms
    - MemoryUsed: 0.00
    - PerDiskReadThroughput: 840.20 KB/sec
    - RowsReturned: 5
    - RowsReturnedRate: 0
    - ScanRangesComplete: 4
    - ScannerThreadsReadTime: 52s768ms [this is very slow]
    - TotalReadThroughput: 278.02 KB/sec

    -- example of cached data from my system using delimited text files for
    "select * from foo limit 5;"

    HDFS_SCAN_NODE (id=0):(83ms 90.59%)
    - BytesRead: 3.11 MB
    - DelimiterParseTime: 8ms
    - MaterializeTupleTime: 334K clock cycles
    - MemoryUsed: 0.00
    - PerDiskReadThroughput: 1.92 GB/sec
    - RowsReturned: 5
    - RowsReturnedRate: 59.00 /sec
    - ScanRangesComplete: 1
    - ScannerThreadsReadTime: 1ms
    - TotalReadThroughput: 0.00 /sec
    On Wed, Mar 20, 2013 at 2:13 PM, Steven Wong wrote:

    Query profiles below. Let me know if there's additional info I can
    supply.
    <snip>
  • Marcel Kornacker at Mar 22, 2013 at 1:15 am
    On 'select * from scl limit 5'?
    On Thu, Mar 21, 2013 at 6:07 PM, Steven Wong wrote:
    With this:

    [ip-XXX-XXX-XXX-XXX:21000] > set NUM_NODES=1
    NUM_NODES set to 1

    Both the uncached and cached queries ran slower than before: 6:00 and 0:20, respectively (compared to 2:12 and 0:05 previously).


    ________________________________________
    From: Marcel Kornacker [marcel@cloudera.com]
    Sent: Thursday, March 21, 2013 3:21 PM
    To: impala-user@cloudera.org
    Subject: Re: select * from table limit 5
    On Thu, Mar 21, 2013 at 3:10 PM, Greg Rahn wrote:
    For the query originally in question (select * from foo limit 5), hive does
    not issue a map reduce job; it looks to run it in "local mode", so it is
    basically just getting the first block of the first file and returning the
    results using the hive shell jvm to make this call. In contrast, impala is
    actually running in fully distributed mode, which puts it at a disadvantage.
    If impala were to only use the coordinator node you are connected to and
    make the same request as hive does, it would be much faster. This is an
    optimization that can be looked into. I've filed IMPALA-165 for this.
    Actually, there's a query option that will make it run single-node.

    Steve, could you try setting NUM_NODES=1 in the impala shell before
    running your query?
    On Thu, Mar 21, 2013 at 2:45 PM, Steven Wong wrote:

    Should I look into /var/log/impalad/impalad.INFO on all nodes or only the
    node I connected to? What particular words or text patterns should I look
    for in the log?

    Runtimes in mm:ss for "select count(*) from scl where [1 partition
    containing 339M rows]":

    Impala uncached, 1:55
    Impala cached, 0:32
    Hive uncached, 2:30
    Hive cached, 1:38


    For count(*), Impala is faster than Hive, cached or uncached. In contrast,
    for select * limit 5, Impala is much slower than Hive if uncached and
    roughly equal to Hive if cached, as seen from the runtimes in my original
    email.

    Note that a difference between the 2 queries is that the count(*) has a
    predicate to prune away all except 1 partition, whereas the select limit 5
    doesn't have any predicate.


    ________________________________
    From: Greg Rahn [grahn@cloudera.com]
    Sent: Wednesday, March 20, 2013 4:58 PM

    To: impala-user@cloudera.org
    Subject: Re: select * from table limit 5

    In looking at the query profiles, it is apparent that there is a dramatic
    difference in the IO rates. Even in the "fast" scenario, it is still rather
    slow on an absolute scale (access to data in the file system page cache
    should be sub second). I've put the relevant sections/data below. What
    this shows is that for the "fast" scenario, impala reads 67.32 MB of data in
    1s447ms, then spends 3s332ms decompressing it. In the "slow" case, it reads
    43.30 MB of data in 52s768ms (horribly slow), then spends 2s767ms
    decompressing it.

    Could you validate that you don't see any errors or warnings in your
    /var/log/impalad/impalad.INFO files?

    Out of curiosity, how does a count(*) for this table compare between hive
    and impala (or any full table scan query)?

    As a comparison point, on my system (10 data nodes) I can flush the page
    cache and impala is faster than hive for a table backed by delimited text,
    even if I run hive second (so it should benefit from the first impala run).

    $ impala-shell -V -i localhost -q "select * from foo limit 5;"
    Query finished, fetching results ...
    17938 7931 1 2
    17939 7816 3 3
    17940 7819 2 2
    17941 7969 4 3
    17942 7884 1 2
    Returned 5 row(s) in 0.98s

    $ hive -e "select * from foo limit 5;"
    Logging initialized using configuration in
    file:/etc/hive/conf.dist/hive-log4j.properties
    Hive history file=/tmp/grahn/hive_job_log_grahn_201303201650_663744682.txt
    OK
    17938 7931 1 2
    17939 7816 3 3
    17940 7819 2 2
    17941 7969 4 3
    17942 7884 1 2
    Time taken: 5.217 seconds

    Data from the provided profiles:
    -- fast
    Instance bc7205ab6c2d4495:98ef69e1dea7d308:(4s237ms 0.08%)
    Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:1778/143.33B 1:1811/144.03B
    - RowsProduced: 5
    ...
    HDFS_SCAN_NODE (id=0):(4s234ms 99.92%)
    - BytesRead: 67.32 MB
    - DecompressionTime: 3s332ms
    - MaterializeTupleTime: 2ms
    - MemoryUsed: 0.00
    - PerDiskReadThroughput: 46.50 MB/sec
    - RowsReturned: 5
    - RowsReturnedRate: 1.00 /sec
    - ScanRangesComplete: 2
    - ScannerThreadsReadTime: 1s447ms
    - TotalReadThroughput: 9.63 MB/sec

    -- slow
    Instance 905d867225e5471f:bdd72b4f09d29ced:(2m9s 0.00%)
    Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:1778/143.33B 1:1811/144.03B
    - RowsProduced: 5
    ...
    HDFS_SCAN_NODE (id=0):(2m9s 100.00%)
    - BytesRead: 43.30 MB
    - DecompressionTime: 2s767ms
    - MaterializeTupleTime: 8ms
    - MemoryUsed: 0.00
    - PerDiskReadThroughput: 840.20 KB/sec
    - RowsReturned: 5
    - RowsReturnedRate: 0
    - ScanRangesComplete: 4
    - ScannerThreadsReadTime: 52s768ms [this is very slow]
    - TotalReadThroughput: 278.02 KB/sec

    -- example of cached data from my system using delimited text files for
    "select * from foo limit 5;"

    HDFS_SCAN_NODE (id=0):(83ms 90.59%)
    - BytesRead: 3.11 MB
    - DelimiterParseTime: 8ms
    - MaterializeTupleTime: 334K clock cycles
    - MemoryUsed: 0.00
    - PerDiskReadThroughput: 1.92 GB/sec
    - RowsReturned: 5
    - RowsReturnedRate: 59.00 /sec
    - ScanRangesComplete: 1
    - ScannerThreadsReadTime: 1ms
    - TotalReadThroughput: 0.00 /sec
    On Wed, Mar 20, 2013 at 2:13 PM, Steven Wong wrote:

    Query profiles below. Let me know if there's additional info I can
    supply.
    <snip>
  • Steven Wong at Mar 22, 2013 at 10:23 pm
    Yes, on "select * from scl limit 5."


    ________________________________________
    From: Marcel Kornacker [marcel@cloudera.com]
    Sent: Thursday, March 21, 2013 6:15 PM
    To: impala-user@cloudera.org
    Subject: Re: select * from table limit 5

    On 'select * from scl limit 5'?
    On Thu, Mar 21, 2013 at 6:07 PM, Steven Wong wrote:
    With this:

    [ip-XXX-XXX-XXX-XXX:21000] > set NUM_NODES=1
    NUM_NODES set to 1

    Both the uncached and cached queries ran slower than before: 6:00 and 0:20, respectively (compared to 2:12 and 0:05 previously).


    ________________________________________
    From: Marcel Kornacker [marcel@cloudera.com]
    Sent: Thursday, March 21, 2013 3:21 PM
    To: impala-user@cloudera.org
    Subject: Re: select * from table limit 5
    On Thu, Mar 21, 2013 at 3:10 PM, Greg Rahn wrote:
    For the query originally in question (select * from foo limit 5), hive does
    not issue a map reduce job; it looks to run it in "local mode", so it is
    basically just getting the first block of the first file and returning the
    results using the hive shell jvm to make this call. In contrast, impala is
    actually running in fully distributed mode, which puts it at a disadvantage.
    If impala were to only use the coordinator node you are connected to and
    make the same request as hive does, it would be much faster. This is an
    optimization that can be looked into. I've filed IMPALA-165 for this.
    Actually, there's a query option that will make it run single-node.

    Steve, could you try setting NUM_NODES=1 in the impala shell before
    running your query?
    On Thu, Mar 21, 2013 at 2:45 PM, Steven Wong wrote:

    Should I look into /var/log/impalad/impalad.INFO on all nodes or only the
    node I connected to? What particular words or text patterns should I look
    for in the log?

    Runtimes in mm:ss for "select count(*) from scl where [1 partition
    containing 339M rows]":

    Impala uncached, 1:55
    Impala cached, 0:32
    Hive uncached, 2:30
    Hive cached, 1:38


    For count(*), Impala is faster than Hive, cached or uncached. In contrast,
    for select * limit 5, Impala is much slower than Hive if uncached and
    roughly equal to Hive if cached, as seen from the runtimes in my original
    email.

    Note that a difference between the 2 queries is that the count(*) has a
    predicate to prune away all except 1 partition, whereas the select limit 5
    doesn't have any predicate.


    ________________________________
    From: Greg Rahn [grahn@cloudera.com]
    Sent: Wednesday, March 20, 2013 4:58 PM

    To: impala-user@cloudera.org
    Subject: Re: select * from table limit 5

    In looking at the query profiles, it is apparent that there is a dramatic
    difference in the IO rates. Even in the "fast" scenario, it is still rather
    slow on an absolute scale (access to data in the file system page cache
    should be sub second). I've put the relevant sections/data below. What
    this shows is that for the "fast" scenario, impala reads 67.32 MB of data in
    1s447ms, then spends 3s332ms decompressing it. In the "slow" case, it reads
    43.30 MB of data in 52s768ms (horribly slow), then spends 2s767ms
    decompressing it.

    Could you validate that you don't see any errors or warnings in your
    /var/log/impalad/impalad.INFO files?

    Out of curiosity, how does a count(*) for this table compare between hive
    and impala (or any full table scan query)?

    As a comparison point, on my system (10 data nodes) I can flush the page
    cache and impala is faster than hive for a table backed by delimited text,
    even if I run hive second (so it should benefit from the first impala run).

    $ impala-shell -V -i localhost -q "select * from foo limit 5;"
    Query finished, fetching results ...
    17938 7931 1 2
    17939 7816 3 3
    17940 7819 2 2
    17941 7969 4 3
    17942 7884 1 2
    Returned 5 row(s) in 0.98s

    $ hive -e "select * from foo limit 5;"
    Logging initialized using configuration in
    file:/etc/hive/conf.dist/hive-log4j.properties
    Hive history file=/tmp/grahn/hive_job_log_grahn_201303201650_663744682.txt
    OK
    17938 7931 1 2
    17939 7816 3 3
    17940 7819 2 2
    17941 7969 4 3
    17942 7884 1 2
    Time taken: 5.217 seconds

    Data from the provided profiles:
    -- fast
    Instance bc7205ab6c2d4495:98ef69e1dea7d308:(4s237ms 0.08%)
    Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:1778/143.33B 1:1811/144.03B
    - RowsProduced: 5
    ...
    HDFS_SCAN_NODE (id=0):(4s234ms 99.92%)
    - BytesRead: 67.32 MB
    - DecompressionTime: 3s332ms
    - MaterializeTupleTime: 2ms
    - MemoryUsed: 0.00
    - PerDiskReadThroughput: 46.50 MB/sec
    - RowsReturned: 5
    - RowsReturnedRate: 1.00 /sec
    - ScanRangesComplete: 2
    - ScannerThreadsReadTime: 1s447ms
    - TotalReadThroughput: 9.63 MB/sec

    -- slow
    Instance 905d867225e5471f:bdd72b4f09d29ced:(2m9s 0.00%)
    Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:1778/143.33B 1:1811/144.03B
    - RowsProduced: 5
    ...
    HDFS_SCAN_NODE (id=0):(2m9s 100.00%)
    - BytesRead: 43.30 MB
    - DecompressionTime: 2s767ms
    - MaterializeTupleTime: 8ms
    - MemoryUsed: 0.00
    - PerDiskReadThroughput: 840.20 KB/sec
    - RowsReturned: 5
    - RowsReturnedRate: 0
    - ScanRangesComplete: 4
    - ScannerThreadsReadTime: 52s768ms [this is very slow]
    - TotalReadThroughput: 278.02 KB/sec

    -- example of cached data from my system using delimited text files for
    "select * from foo limit 5;"

    HDFS_SCAN_NODE (id=0):(83ms 90.59%)
    - BytesRead: 3.11 MB
    - DelimiterParseTime: 8ms
    - MaterializeTupleTime: 334K clock cycles
    - MemoryUsed: 0.00
    - PerDiskReadThroughput: 1.92 GB/sec
    - RowsReturned: 5
    - RowsReturnedRate: 59.00 /sec
    - ScanRangesComplete: 1
    - ScannerThreadsReadTime: 1ms
    - TotalReadThroughput: 0.00 /sec
    On Wed, Mar 20, 2013 at 2:13 PM, Steven Wong wrote:

    Query profiles below. Let me know if there's additional info I can
    supply.
    <snip>
  • Greg Rahn at Mar 22, 2013 at 6:08 pm
    When you are running the hive queries, are you exiting the hive shell
    between query executions, or just re-running the query in the same shell?

    On Tue, Mar 19, 2013 at 1:21 PM, Steven Wong wrote:

    The query "select * from scl limit 5" in Impala is sometimes fast (~5
    seconds) but sometimes very slow (~2 minutes), much slower than in Hive. I
    can reproduce the slowness by first flushing the Linux OS cache (echo 1 >
    /proc/sys/vm/drop_caches) and then running the query. Here's a summary of
    the query runtimes (in mm:ss):

    Impala uncached, 2:12
    Impala cached, 0:05
    Hive uncached, 0:06
    Hive cached, 0:06

  • Steven Wong at Mar 22, 2013 at 10:24 pm
    Each query (Impala and Hive) was run in a separate shell instance via:

    impala -q ...
    hive -e ...


    ________________________________
    From: Greg Rahn [grahn@cloudera.com]
    Sent: Friday, March 22, 2013 11:08 AM
    To: impala-user@cloudera.org
    Subject: Re: select * from table limit 5

    When you are running the hive queries, are you exiting the hive shell between query executions, or just re-running the query in the same shell?


    On Tue, Mar 19, 2013 at 1:21 PM, Steven Wong wrote:
    The query "select * from scl limit 5" in Impala is sometimes fast (~5 seconds) but sometimes very slow (~2 minutes), much slower than in Hive. I can reproduce the slowness by first flushing the Linux OS cache (echo 1 > /proc/sys/vm/drop_caches) and then running the query. Here's a summary of the query runtimes (in mm:ss):

    Impala uncached, 2:12
    Impala cached, 0:05
    Hive uncached, 0:06
    Hive cached, 0:06
  • Greg Rahn at Mar 22, 2013 at 11:37 pm
    What kind of volumes is the HDFS data on in AWS? EBS or local?
    Are there two dfs.data.dir volumes per DataNode?
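    If they are local (instance-store) volumes, it may also be worth a quick sanity check of the raw read rate on each data disk, something along these lines (the config key and device name are illustrative, adjust for your layout):

    $ hdfs getconf -confKey dfs.data.dir
    $ sudo hdparm -t /dev/xvdb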
    On Fri, Mar 22, 2013 at 3:24 PM, Steven Wong wrote:

    Each query (Impala and Hive) was run in a separate shell instance via:

    impala -q ...
    hive -e ...


    ________________________________
    From: Greg Rahn [grahn@cloudera.com]
    Sent: Friday, March 22, 2013 11:08 AM

    To: impala-user@cloudera.org
    Subject: Re: select * from table limit 5

    When you are running the hive queries, are you exiting the hive shell
    between query executions, or just re-running the query in the same shell?

    On Tue, Mar 19, 2013 at 1:21 PM, Steven Wong wrote:

    The query "select * from scl limit 5" in Impala is sometimes fast (~5
    seconds) but sometimes very slow (~2 minutes), much slower than in Hive. I
    can reproduce the slowness by first flushing the Linux OS cache (echo 1 >
    /proc/sys/vm/drop_caches) and then running the query. Here's a summary of
    the query runtimes (in mm:ss):

    Impala uncached, 2:12
    Impala cached, 0:05
    Hive uncached, 0:06
    Hive cached, 0:06

  • Steven Wong at Mar 23, 2013 at 12:32 am
    HDFS is using 2 local volumes per m2.4xlarge node. It was set up by Cloudera Manager in AWS, so you can reproduce it yourself if you want.


    ________________________________
    From: Greg Rahn [grahn@cloudera.com]
    Sent: Friday, March 22, 2013 4:37 PM
    To: impala-user@cloudera.org
    Subject: Re: select * from table limit 5

    What kind of volumes is the HDFS data on in AWS? EBS or local?
    Are there two dfs.data.dir volumes per DataNode?

    On Fri, Mar 22, 2013 at 3:24 PM, Steven Wong wrote:
    Each query (Impala and Hive) was run in a separate shell instance via:

    impala -q ...
    hive -e ...


    ________________________________
    From: Greg Rahn [grahn@cloudera.com]
    Sent: Friday, March 22, 2013 11:08 AM

    To: impala-user@cloudera.org
    Subject: Re: select * from table limit 5

    When you are running the hive queries, are you exiting the hive shell between query executions, or just re-running the query in the same shell?


    On Tue, Mar 19, 2013 at 1:21 PM, Steven Wong wrote:
    The query "select * from scl limit 5" in Impala is sometimes fast (~5 seconds) but sometimes very slow (~2 minutes), much slower than in Hive. I can reproduce the slowness by first flushing the Linux OS cache (echo 1 > /proc/sys/vm/drop_caches) and then running the query. Here's a summary of the query runtimes (in mm:ss):

    Impala uncached, 2:12
    Impala cached, 0:05
    Hive uncached, 0:06
    Hive cached, 0:06
