Hi,

Yes, Impala does take a long time to load such an extremely wide table.
I've filed JIRA IMPALA-428 to track it. Thanks for reporting it.

Thanks,
Alan

On Tue, Jun 18, 2013 at 4:02 AM, Neeraj Chaplot wrote:

The query plan is:

Query (id=c41c881772ac72e:9e1630860f637eba):
Summary:
Start Time: 2013-06-18 15:01:56
End Time: 2013-06-18 15:46:13
Query Type: QUERY
Query State: FINISHED
Query Status: OK
Impala Version: impalad version 1.0.1 RELEASE (build df844fb967cec8740f08dfb8b21962bc053527ef)
User: root
Default Db: default
Sql Statement: select count(1) from imp_ext_test
Plan:
----------------
PLAN FRAGMENT 0
PARTITION: UNPARTITIONED

3:AGGREGATE
output: SUM(<slot 0>)
group by:
tuple ids: 1
2:EXCHANGE
tuple ids: 1

PLAN FRAGMENT 1
PARTITION: RANDOM

STREAM DATA SINK
EXCHANGE ID: 2
UNPARTITIONED

1:AGGREGATE
output: COUNT(1)
group by:
tuple ids: 1
0:SCAN HDFS
table=default.imp_ext_test #partitions=1 size=695.17MB
tuple ids: 0
----------------
Query Timeline: 44m17s
- Start execution: 2.461ms (2.461ms)
- Planning finished: 44m6s (44m5s)
- Rows available: 44m16s (10s822ms)
- First row fetched: 44m17s (443.984ms)
- Unregister query: 44m17s (2.778ms)
ImpalaServer:
- ClientFetchWaitTimer: 444.987ms
- RowMaterializationTimer: 21.879us
Execution Profile c41c881772ac72e:9e1630860f637eba:(Active: 10s821ms, % non-child: 0.00%)
- FinalizationTimer: 0ns
Coordinator Fragment:(Active: 10s589ms, % non-child: 0.00%)
- AverageThreadTokens: 0.00
- RowsProduced: 1
CodeGen:(Active: 109.22ms, % non-child: 1.03%)
- CodegenTime: 506.933us
- CompileTime: 86.490ms
- LoadTime: 22.531ms
- ModuleFileSize: 74.45 KB
AGGREGATION_NODE (id=3):(Active: 10s589ms, % non-child: 0.05%)
ExecOption: Codegen Enabled
- BuildBuckets: 1.02K (1024)
- BuildTime: 2.957us
- GetResultsTime: 3.760us
- LoadFactor: 0.00
- MemoryUsed: 32.01 KB
- RowsReturned: 1
- RowsReturnedRate: 0
EXCHANGE_NODE (id=2):(Active: 10s585ms, % non-child: 99.96%)
- BytesReceived: 16.00 B
- ConvertRowBatchTime: 3.576us
- DataArrivalWaitTime: 10s585ms
- DeserializeRowBatchTimer: 4.917us
- FirstBatchArrivalWaitTime: 0ns
- MemoryUsed: 0.00
- RowsReturned: 1
- RowsReturnedRate: 0
- SendersBlockedTimer: 0ns
- SendersBlockedTotalTimer(*): 0ns
Averaged Fragment 1:(Active: 10s589ms, % non-child: 0.00%)
split sizes: min: 695.17 MB, max: 695.17 MB, avg: 695.17 MB, stddev: 0.00
completion times: min:10s590ms max:10s590ms mean: 10s590ms stddev:0ns
execution rates: min:65.64 MB/sec max:65.64 MB/sec mean:65.64 MB/sec stddev:0.00 /sec
num instances: 1
- AverageThreadTokens: 10.05
- RowsProduced: 1
CodeGen:(Active: 95.569ms, % non-child: 0.90%)
- CodegenTime: 816.588us
- CompileTime: 88.714ms
- LoadTime: 6.853ms
- ModuleFileSize: 74.45 KB
DataStreamSender (dst_id=2):(Active: 258.881us, % non-child: 0.00%)
- BytesSent: 16.00 B
- NetworkThroughput(*): 77.97 KB/sec
- OverallThroughput: 60.36 KB/sec
- SerializeBatchTime: 28.202us
- ThriftTransmitTime(*): 200.390us
- UncompressedRowBatchSize: 16.00 B
AGGREGATION_NODE (id=1):(Active: 10s589ms, % non-child: 0.05%)
- BuildBuckets: 1.02K (1024)
- BuildTime: 97.70us
- GetResultsTime: 4.189us
- LoadFactor: 0.00
- MemoryUsed: 32.01 KB
- RowsReturned: 1
- RowsReturnedRate: 0
HDFS_SCAN_NODE (id=0):(Active: 10s584ms, % non-child: 99.95%)
- AverageHdfsReadThreadConcurrency: 0.52
- AverageIoMgrQueueCapacity: 244.57
- AverageIoMgrQueueSize: 0.00
- AverageScannerThreadConcurrency: 0.14
- BytesRead: 695.17 MB
- MemoryUsed: 0.00
- NumDisksAccessed: 1
- PerReadThreadRawHdfsThroughput: 118.43 MB/sec
- RowsRead: 10.00K (10000)
- RowsReturned: 10.00K (10000)
- RowsReturnedRate: 944.00 /sec
- ScanRangesComplete: 16
- ScannerThreadsInvoluntaryContextSwitches: 70
- ScannerThreadsTotalWallClockTime: 1m32s
- DelimiterParseTime: 678.996ms
- MaterializeTupleTime(*): 74.412us
- ScannerThreadsSysTime: 12.994ms
- ScannerThreadsUserTime: 713.884ms
- ScannerThreadsVoluntaryContextSwitches: 813
- TotalRawHdfsReadTime(*): 5s869ms
- TotalReadThroughput: 66.21 MB/sec
Fragment 1:
Instance c41c881772ac72e:9e1630860f637ebc (host=impetus-i0060.impetus.co.in:22000):(Active: 10s589ms, % non-child: 0.00%)
Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:16/695.17 MB
- AverageThreadTokens: 10.05
- RowsProduced: 1
CodeGen:(Active: 95.569ms, % non-child: 0.90%)
- CodegenTime: 816.588us
- CompileTime: 88.714ms
- LoadTime: 6.853ms
- ModuleFileSize: 74.45 KB
DataStreamSender (dst_id=2):(Active: 258.881us, % non-child: 0.00%)
- BytesSent: 16.00 B
- NetworkThroughput(*): 77.97 KB/sec
- OverallThroughput: 60.36 KB/sec
- SerializeBatchTime: 28.202us
- ThriftTransmitTime(*): 200.390us
- UncompressedRowBatchSize: 16.00 B
AGGREGATION_NODE (id=1):(Active: 10s589ms, % non-child: 0.05%)
ExecOption: Codegen Enabled
- BuildBuckets: 1.02K (1024)
- BuildTime: 97.70us
- GetResultsTime: 4.189us
- LoadFactor: 0.00
- MemoryUsed: 32.01 KB
- RowsReturned: 1
- RowsReturnedRate: 0
HDFS_SCAN_NODE (id=0):(Active: 10s584ms, % non-child: 99.95%)
Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:16/695.17 MB
Hdfs Read Thread Concurrency Bucket: 0:47.62% 1:52.38% 2:0%
File Formats: TEXT/NONE:16
ExecOption: Codegen enabled: 16 out of 16
- AverageHdfsReadThreadConcurrency: 0.52
- AverageIoMgrQueueCapacity: 244.57
- AverageIoMgrQueueSize: 0.00
- AverageScannerThreadConcurrency: 0.14
- BytesRead: 695.17 MB
- MemoryUsed: 0.00
- NumDisksAccessed: 1
- PerReadThreadRawHdfsThroughput: 118.43 MB/sec
- RowsRead: 10.00K (10000)
- RowsReturned: 10.00K (10000)
- RowsReturnedRate: 944.00 /sec
- ScanRangesComplete: 16
- ScannerThreadsInvoluntaryContextSwitches: 70
- ScannerThreadsTotalWallClockTime: 1m32s
- DelimiterParseTime: 678.996ms
- MaterializeTupleTime(*): 74.412us
- ScannerThreadsSysTime: 12.994ms
- ScannerThreadsUserTime: 713.884ms
- ScannerThreadsVoluntaryContextSwitches: 813
- TotalRawHdfsReadTime(*): 5s869ms
- TotalReadThroughput: 66.21 MB/sec
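
To make the Query Timeline above easier to read, here is a minimal Python sketch that converts the per-step durations to seconds and reports how much of the total went into planning. The helper names (`parse_duration`, `share`) and the unit table are illustrative assumptions, not part of Impala; the duration strings are copied from the profile.

```python
import re

# Per-step deltas copied from the "Query Timeline" section of the profile.
TIMELINE = {
    "Start execution": "2.461ms",
    "Planning finished": "44m5s",
    "Rows available": "10s822ms",
    "First row fetched": "443.984ms",
    "Unregister query": "2.778ms",
}

# Assumed unit suffixes; only m, s and ms actually occur in this timeline.
UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 1e-3, "us": 1e-6, "ns": 1e-9}

# Two-letter units are listed first so "ms" is not read as "m" followed by "s".
TOKEN = re.compile(r"(\d+(?:\.\d+)?)(ms|us|ns|h|m|s)")


def parse_duration(text):
    """Convert a profile duration such as '10s822ms' to seconds."""
    return sum(float(value) * UNIT_SECONDS[unit] for value, unit in TOKEN.findall(text))


def share(timeline, step):
    """Fraction of the summed step durations spent in the given step."""
    seconds = {name: parse_duration(d) for name, d in timeline.items()}
    return seconds[step] / sum(seconds.values())


print(f"Planning share: {share(TIMELINE, 'Planning finished'):.1%}")
```

With these numbers planning accounts for well over 99% of the elapsed time, while the scan and aggregation themselves finish in about 10 seconds.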



On Tue, Jun 18, 2013 at 4:11 PM, wrote:

Hi All,

I have an Impala external table with about 10k columns. The first time I
run select count(1) on that table, it takes more than 10 minutes. After
that, any select or aggregate returns in under a second.

I checked the Impala server logs and found that "catalog.HdfsTable: load
table" is taking a lot of time.

When I refresh the Impala cache and run any query, the first query takes a
long time, and subsequent queries are very fast.

Logs:

13/06/18 15:01:56 INFO service.Frontend: analyze query select count(1) from imp_ext_test
13/06/18 15:01:57 INFO catalog.HdfsTable: load table imp_ext_test
13/06/18 15:46:02 INFO catalog.HdfsTable: load partition block md for imp_ext_test
13/06/18 15:46:02 INFO catalog.HdfsTable: loaded partition PartitionBlockMetadata{#blocks=0, #filenames=0, totalStringLen=0}
13/06/18 15:46:02 INFO catalog.HdfsTable: loaded partition PartitionBlockMetadata{#blocks=16, #filenames=6, totalStringLen=420}
13/06/18 15:46:02 INFO catalog.HdfsTable: loaded disk ids for table default.imp_ext_test
13/06/18 15:46:02 INFO catalog.HdfsTable: 1
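
The gap between the "load table" entry and the next log entry can be measured directly from the timestamps. The following is a small Python sketch, assuming the `yy/mm/dd HH:MM:SS` prefix seen in the log lines above; the function names are hypothetical helpers, not part of Impala.

```python
from datetime import datetime

# First three log entries from the post, with wrapped lines joined.
LOG_LINES = [
    "13/06/18 15:01:56 INFO service.Frontend: analyze query select count(1) from imp_ext_test",
    "13/06/18 15:01:57 INFO catalog.HdfsTable: load table imp_ext_test",
    "13/06/18 15:46:02 INFO catalog.HdfsTable: load partition block md for imp_ext_test",
]


def parse_ts(line):
    """Parse the leading 'yy/mm/dd HH:MM:SS' timestamp of a log line."""
    date_part, time_part = line.split()[:2]
    return datetime.strptime(f"{date_part} {time_part}", "%y/%m/%d %H:%M:%S")


def gaps(lines):
    """Seconds elapsed between each pair of consecutive log lines."""
    stamps = [parse_ts(line) for line in lines]
    return [(later - earlier).total_seconds() for earlier, later in zip(stamps, stamps[1:])]


print(gaps(LOG_LINES))  # [1.0, 2645.0]
```

The second gap, 2645 seconds (44m5s), is the time between "load table" and the first partition metadata message.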

Could you please help me reduce the load table time? My table has about
10k columns.

Thanks

Discussion Overview
group: impala-user @
categories: hadoop
posted: Jun 18, '13 at 10:41a
active: Jun 18, '13 at 11:33p
posts: 4
users: 3
website: cloudera.com
irc: #hadoop