That sounds great. Reducing the file size will reduce the amount of disk I/O.
In addition, how about adding a configuration option to change the Parquet page size?
I think the reason the RCFile table gradually became faster to query than the Parquet
table as more columns were selected is mainly that Parquet's decompression cost was
higher than RCFile's.
If we could increase the Parquet page size (the default is 64 KB), we could expect a
further performance gain.
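For illustration only (a sketch, not an existing Impala option): the knob I mean is the data page size on the writer side. Assuming the Parquet C++ writer-properties API, it would look roughly like this:

// Sketch: build writer properties with a larger data page size.
// The 1 MB value is an assumption for illustration, not a tuned number.
#include <iostream>
#include <memory>
#include <parquet/properties.h>

int main() {
  std::shared_ptr<parquet::WriterProperties> props =
      parquet::WriterProperties::Builder()
          .compression(parquet::Compression::SNAPPY)  // same codec as the benchmark
          ->data_pagesize(1024 * 1024)                // 1 MB pages instead of the 64 KB default
          ->build();
  std::cout << "data page size: " << props->data_pagesize() << " bytes" << std::endl;
  return 0;
}

Larger pages would let each Snappy decompression call cover more values, which is where I suspect the per-column overhead comes from.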
BTW, in the query profile below, DecompressionTime is always 0 for the RCFile table.
There is no decompress_timer_ variable in hdfs-rcfile-scanner.h.
Is this a bug? (A minimal sketch of the scoped-timer pattern follows the profile.)
HDFS_SCAN_NODE (id=0):(Active: 38s514ms, % non-child: 100.00%)
Hdfs split stats (<volume id>:<# splits>/<split lengths>):
0:10/528.46 MB 1:11/424.84 MB 2:7/169.25 MB 3:7/168.13 MB 4:12/489.22 MB
5:7/336.64 MB
File Formats: *RC_FILE*/SNAPPY_BLOCKED:54
ExecOption: Codegen enabled: 0 out of 108
- AverageHdfsReadThreadConcurrency: 0.56
- HdfsReadThreadConcurrencyCountPercentage=0: 44.16
- HdfsReadThreadConcurrencyCountPercentage=1: 55.84
- HdfsReadThreadConcurrencyCountPercentage=2: 0.00
- HdfsReadThreadConcurrencyCountPercentage=3: 0.00
- HdfsReadThreadConcurrencyCountPercentage=4: 0.00
- HdfsReadThreadConcurrencyCountPercentage=5: 0.00
- HdfsReadThreadConcurrencyCountPercentage=6: 0.00
- HdfsReadThreadConcurrencyCountPercentage=7: 0.00
- HdfsReadThreadConcurrencyCountPercentage=8: 0.00
- AverageIoMgrQueueCapcity: 256.00
- AverageIoMgrQueueSize: 0.00
- AverageScannerThreadConcurrency: 0.68
- BytesRead: 2.07 GB
- BytesSkipped: 0.00
- DecompressionTime: 0ns  // always zero
- MemoryUsed: 144.05 MB
- NumDisksAccessed: 8
- PerReadThreadRawHdfsThroughput: 109.15 MB/sec
- RowsRead: 51.48M (51475163)
- RowsReturned: 0
- RowsReturnedRate: 0
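Just to make the question concrete, here is a minimal, self-contained sketch of the scoped-timer pattern (my own stand-in types, not Impala's actual counter code): if the decompression path is never wrapped in such a scope, the counter stays at 0ns, exactly as in the profile above.

#include <chrono>
#include <cstdint>
#include <iostream>
#include <thread>
#include <vector>

// Stand-in for a runtime-profile counter: total nanoseconds spent decompressing.
struct Counter {
  int64_t total_ns = 0;
};

// RAII scoped timer: the time between construction and destruction is added
// to the counter, which is the usual way a DecompressionTime counter is fed.
class ScopedTimer {
 public:
  explicit ScopedTimer(Counter* counter)
      : counter_(counter), start_(std::chrono::steady_clock::now()) {}
  ~ScopedTimer() {
    auto elapsed = std::chrono::steady_clock::now() - start_;
    counter_->total_ns +=
        std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count();
  }

 private:
  Counter* counter_;
  std::chrono::steady_clock::time_point start_;
};

// Hypothetical decompression step, standing in for the Snappy block decode
// inside an RCFile scanner.
void DecompressBlock(const std::vector<uint8_t>& block) {
  (void)block;
  std::this_thread::sleep_for(std::chrono::milliseconds(2));
}

int main() {
  Counter decompress_timer;  // plays the role of decompress_timer_
  std::vector<uint8_t> block(64 * 1024);
  for (int i = 0; i < 5; ++i) {
    ScopedTimer timer(&decompress_timer);  // without this scope, the counter stays 0
    DecompressBlock(block);
  }
  std::cout << "DecompressionTime: " << decompress_timer.total_ns << "ns" << std::endl;
  return 0;
}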
On Thursday, May 30, 2013 5:02:56 AM UTC+9, Nong wrote:
Thanks for doing this. We're working on improving the performance of the Parquet
scanner as well as adding better encodings for strings (which we see a lot of) to
improve file size. We believe there's quite a bit of low-hanging fruit.
Nong
On Wed, May 29, 2013 at 4:53 AM, Jung-Yup Lee <ljy...@gmail.com> wrote:
Hi all,
I'd like to share my simple performance test comparing 3 different file types
(Text, Parquet, RCFile).
- Environment
* My cluster consists of 8 DataNodes; each node has a 24-core CPU, 64 GB of memory,
and 6 disks.
- Total file size of each file type

                 TEXT (no compression)   PARQUET (snappy)   RCFILE (snappy)
  Total size     58.5 GB                 19.2 GB            16.5 GB
  Num. of files  888                     236
  Num. of rows   400M                    400M               400M
- Results 1 (with OS caching)
Query 1> select <list of columns> from text_test1 where col1='#'; (with OS caching)
Query 2> select <list of columns> from par_test1 where col1='#'; (with OS caching)
Query 3> select <list of columns> from rc_test1 where col1='#'; (with OS caching)
* I ran the above queries while increasing the number of columns in the select list.
* The above queries returned no rows, but lazy I/O and lazy decompression did not occur.
Query latency (s) by number of columns in the select list:

  # cols   TEXT (no compr.)   PARQUET (snappy)   RCFILE (snappy)
       1    9.83                3.24               7.31
       2   11.00                4.23               8.15
       3   11.36                5.71               9.44
       4   10.87                7.19              10.41
       5   ~11                  8.84              12.43
       6   ~11                  9.68              12.43
       7   ~11                 10.90              13.07
       8   ``                  13.17              13.53
       9   ``                  14.09              15.11
      10   ``                  14.17              15.78
      11                       15.98              16.75
      12                       16.61              18.31
      13                       17.54              18.04
      14                      *19.76*            *18.70*
      15                       20.12              19.96
      16                       19.81              19.44
      17                       23.85              20.13
      18                       25.44              22.17
      21   ``
      22   ``
      23   ``
      24   ~11                 28.98              26.17
- Results 2 (without OS caching)
Query 1> select <list of columns> from text_test1 where col1='#'; (without OS caching)
Query 2> select <list of columns> from par_test1 where col1='#'; (without OS caching)
Query 3> select <list of columns> from rc_test1 where col1='#'; (without OS caching)
* I ran the above queries while increasing the number of columns in the select list.
* The above queries returned no rows, but lazy I/O and lazy decompression did not occur.
(A sketch of one way to drop the OS cache between runs follows the table.)
Query latency (s) by number of columns in the select list:

  # cols   TEXT (no compr.)   PARQUET (snappy)   RCFILE (snappy)
       1   75.59                7.81              26.87
       2   74.32               10.44              27.71
       3   ``                  13.64              28.45
       4   ``                  19.66              27.94
       5   ``                  24.38              29.65
       6                       26.58              30.79
       7                       28.27              32.77
       8                       30.37              33.23
       9                      *35.27*            *35.08*
      10                       42.05              36.48
      21   ``
      22   ``
      23   ``
      24   ~75                 70.68              62.62
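For completeness, one way to get cold-cache numbers is to sync and drop the Linux page cache between runs. A minimal sketch (assumes Linux and root access; not necessarily how the numbers above were produced):

// Flush dirty pages, then ask the kernel to drop the page cache so that the
// next scan has to read from disk.
#include <cstdio>
#include <cstdlib>
#include <unistd.h>

int main() {
  sync();  // write dirty pages back before dropping caches
  std::FILE* f = std::fopen("/proc/sys/vm/drop_caches", "w");
  if (f == nullptr) {
    std::perror("open /proc/sys/vm/drop_caches");
    return EXIT_FAILURE;
  }
  std::fputs("3", f);  // 3 = drop page cache + dentries + inodes
  std::fclose(f);
  return EXIT_SUCCESS;
}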
Hope this helps you decide which file type to use in your data warehouse. =)