Hi Nong,
That sounds great. Improving the file size will reduce the amount of disk I/O.
In addition to that, how about providing a configuration option to change the
page size of Parquet?
I think the reason querying the RCFile table gradually became faster than
querying the Parquet table (as more columns were selected) was mainly that the
decompression cost of Parquet was higher than that of RCFile.
If we can increase the page size of Parquet (the default is 64 KB), we can
expect a further performance gain.
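
For comparison, the parquet-mr MapReduce writer already exposes page and
row-group size settings per job (what is being requested here is the same
knob for Impala's writer). A minimal sketch, assuming the 2013-era
parquet.hadoop package names and a 1 MB page size chosen only for
illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import parquet.hadoop.ParquetOutputFormat;
    import parquet.hadoop.metadata.CompressionCodecName;

    public class ParquetWriteSettings {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "write-parquet");
            // Larger pages mean fewer page headers and fewer decompression
            // calls per column chunk, at the cost of coarser-grained reads.
            ParquetOutputFormat.setPageSize(job, 1024 * 1024);           // 1 MB pages
            ParquetOutputFormat.setBlockSize(job, 1024 * 1024 * 1024);   // 1 GB row groups
            ParquetOutputFormat.setCompression(job, CompressionCodecName.SNAPPY);
            // ... set mapper, input/output formats, and paths as usual ...
        }
    }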

BTW, in the query profile the DecompressionTime is always 0 for the RCFile
table, and there is no decompress_timer_ variable in hdfs-rcfile-scanner.h.
Is this a bug?

         HDFS_SCAN_NODE (id=0):(Active: 38s514ms, % non-child: 100.00%)
           Hdfs split stats (<volume id>:<# splits>/<split lengths>):
0:10/528.46 MB 1:11/424.84 MB 2:7/169.25 MB 3:7/168.13 MB 4:12/489.22 MB
5:7/336.64 MB
           File Formats: RC_FILE/SNAPPY_BLOCKED:54
           ExecOption: Codegen enabled: 0 out of 108
            - AverageHdfsReadThreadConcurrency: 0.56
              - HdfsReadThreadConcurrencyCountPercentage=0: 44.16
              - HdfsReadThreadConcurrencyCountPercentage=1: 55.84
              - HdfsReadThreadConcurrencyCountPercentage=2: 0.00
              - HdfsReadThreadConcurrencyCountPercentage=3: 0.00
              - HdfsReadThreadConcurrencyCountPercentage=4: 0.00
              - HdfsReadThreadConcurrencyCountPercentage=5: 0.00
              - HdfsReadThreadConcurrencyCountPercentage=6: 0.00
              - HdfsReadThreadConcurrencyCountPercentage=7: 0.00
              - HdfsReadThreadConcurrencyCountPercentage=8: 0.00
            - AverageIoMgrQueueCapcity: 256.00
            - AverageIoMgrQueueSize: 0.00
            - AverageScannerThreadConcurrency: 0.68
            - BytesRead: 2.07 GB
            - BytesSkipped: 0.00
            - DecompressionTime: 0ns   // always zero
            - MemoryUsed: 144.05 MB
            - NumDisksAccessed: 8
            - PerReadThreadRawHdfsThroughput: 109.15 MB/sec
            - RowsRead: 51.48M (51475163)
            - RowsReturned: 0
            - RowsReturnedRate: 0

On Thursday, May 30, 2013 5:02:56 AM UTC+9, Nong wrote:

Thanks for doing this. We're working on improving the performance of the
Parquet scanner, as well as adding better encodings for strings (which we
see a lot of) to improve file size. We believe there's quite a bit of
low-hanging fruit.

Nong


On Wed, May 29, 2013 at 4:53 AM, Jung-Yup Lee <ljy...@gmail.com> wrote:
Hi all,

I'd like to share a simple performance test comparing 3 different
file types (Text, Parquet, RCFile).


- Environment

* My cluster consists of 8 DataNodes, and each node is equipped with a
24-core CPU, 64 GB of memory, and 6 disks.


- Total file size of each file type


                    TEXT (no compr.)   PARQUET (Snappy)   RCFILE (Snappy)
    Total size      58.5 GB            19.2 GB            16.5 GB
    Num. of files   8                  88                 236
    Num. of rows    400M               400M               400M


- Results 1 (with OS caching)

Query1> select <list of columns> from text_test1 where col1='#'; (with OS caching)

Query2> select <list of columns> from par_test1 where col1='#'; (with OS caching)

Query3> select <list of columns> from rc_test1 where col1='#'; (with OS caching)

* I ran the queries above while increasing the number of columns in the
select list.

* The queries above returned no rows; even so, lazy I/O and lazy decompression
did not occur.


Query latency (s) by number of columns in the select list:

    # columns   TEXT (no compr.)   PARQUET (Snappy)   RCFILE (Snappy)
        1           9.83               3.24               7.31
        2          11.00               4.23               8.15
        3          11.36               5.71               9.44
        4          10.87               7.19              10.41
        5          ~11                 8.84              12.43
        6          ~11                 9.68              12.43
        7          ~11                10.90              13.07
        8          ~11                13.17              13.53
        9          ~11                14.09              15.11
       10          ~11                14.17              15.78
       11          ~11                15.98              16.75
       12          ~11                16.61              18.31
       13          ~11                17.54              18.04
       14          ~11               *19.76*            *18.70*
       15          ~11                20.12              19.96
       16          ~11                19.81              19.44
       17          ~11                23.85              20.13
       18          ~11                25.44              22.17
       24          ~11                28.98              26.17



- Results 2 (without OS caching)

Query1> select <list of columns> from text_test1 where col1='#'; (without OS caching)

Query2> select <list of columns> from par_test1 where col1='#'; (without OS caching)

Query3> select <list of columns> from rc_test1 where col1='#'; (without OS caching)

* I ran the queries above while increasing the number of columns in the
select list.

* The queries above returned no rows; even so, lazy I/O and lazy decompression
did not occur.


Query latency (s) by number of columns in the select list:

    # columns   TEXT (no compr.)   PARQUET (Snappy)   RCFILE (Snappy)
        1          75.59               7.81              26.87
        2          74.32              10.44              27.71
        3          ~75                13.64              28.45
        4          ~75                19.66              27.94
        5          ~75                24.38              29.65
        6          ~75                26.58              30.79
        7          ~75                28.27              32.77
        8          ~75                30.37              33.23
        9          ~75               *35.27*            *35.08*
       10          ~75                42.05              36.48
       24          ~75                70.68              62.62
Hope this helps you decide which file type to use in your data
warehouse. =)


  • Julien Le Dem at Jun 6, 2013 at 2:37 pm
    The following applies when writing from MapReduce (not from Impala):
    in M/R, one file is created per task.
    If your job is just doing a conversion, you probably have a map-only job,
    which means you get one output file per input split.
    You can adjust the split size by setting a bigger value for
    mapred.min.split.size, as in the sketch below.
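
    A minimal sketch of that knob, assuming a Hadoop 2.x MapReduce driver; the
    class name and the 1 GB target below are illustrative, not from this thread:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.mapreduce.Job;

        public class TextToParquetJob {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                // Ask for input splits of at least ~1 GB, so a map-only conversion
                // job writes one output file per ~1 GB of input rather than one
                // per HDFS block.
                conf.setLong("mapred.min.split.size", 1024L * 1024L * 1024L);
                Job job = Job.getInstance(conf, "text-to-parquet");
                // ... set input/output formats, mapper, and paths as usual ...
                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }

    If the driver goes through ToolRunner, the same value can be passed on the
    command line as -D mapred.min.split.size=1073741824.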

    On Thu, Jun 6, 2013 at 3:41 AM, Tim Heijt wrote:

    Maybe a stupid question, but how do you increase the file size of the
    Parquet files? We have a nice block size of 1 GB, yet the file size is
    ~350 MB at most. We don't know which parameter to change to increase it.

