FAQ
We have a Parquet table for a day of activity (close to 5 billion records)
which is partitioned by minute. There are several minutes where the total
data size is over 1 GB, but when populating the table Impala split the
files, and I can't find any file larger than 260 MB. Since the HDFS block
size is 1 GB, this is not ideal: the optimal Parquet file size should also
be 1 GB.

Is there any setting for Impala to make sure that the files don't get split
into smaller sizes to improve I/O performance during queries?

Thank you,
Daniel


  • Nong Li at Jun 21, 2013 at 5:28 pm
    This is an issue in our implementation that causes Impala to generate
    files smaller than the target size. It is fixed for our upcoming release.

    Thanks
    Nong
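    For reference, later Impala releases expose a PARQUET_FILE_SIZE query
    option that sets the target size of Parquet files written by INSERT
    statements; whether it is available in the release discussed in this
    thread is not confirmed here. A minimal sketch, with hypothetical table
    and column names (activity_parquet, activity_staging):

        -- Target roughly 1 GB per Parquet data file (value in bytes).
        SET PARQUET_FILE_SIZE=1073741824;
        -- Hypothetical tables; the partition value comes last in the
        -- select list for the dynamic partition column.
        INSERT OVERWRITE activity_parquet PARTITION (minute)
        SELECT event_id, payload, minute FROM activity_staging;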

  • Daniel Uribe at Jun 21, 2013 at 9:06 pm
    Thanks for the information, Nong. I'm glad to know this will be fixed in
    the next release. Any estimate of when it might be available?

    Best regards,
    Daniel
  • Skye Wanderman-Milne at Jun 21, 2013 at 9:41 pm
    Impala 1.1 will be released within the next month.

  • Marcel Kornacker at Jul 10, 2013 at 4:38 pm

    On Tue, Jul 9, 2013 at 9:27 PM, Andrei wrote:
    Hello,

    We have a similar problem. Our use case is storing market data, about 700
    billion records (3 years of roughly 600M-1B records per day). One of the
    frequent queries is of the form

    "select ... from quotes where symbol = $P_SYMBOL and timestamp between
    $P_START and $P_END ... "

    So the first problem is that far too many files are created for 7 TB of
    data (due to the aforementioned 260 MB limit). Do you think adding a
    MAXSIZE clause to the CREATE [internal] TABLE statement would be a good
    idea? What if I want the storage file size to be larger than the HDFS
    block?

    Regarding partitioning, if one splits the table by symbol/date it results
    in 6M files (8000 symbols x 750 days), which might be too much for the
    name node to handle. Will Impala (the query optimizer) be smart enough if
    I partition the table by, say, [symbol, year(timestamp) ||
    weekofyear(timestamp)] (i.e. use a function in the partition definition)?

    What exactly do you mean by "use function in partition definition"?
    The select list of the INSERT statement that populates the table?

    If you have a table that is partitioned by something like (symbol
    string, year int, weekofyear int), you will need predicates on those
    partition columns in your query in order to take advantage of
    partition pruning. In other words, simply having "timestamp between
    <start> and <end>" won't be sufficient, you'll need to add "year
    between <> and <> ...".
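    To make this concrete, here is a minimal sketch in Impala SQL of the
    scheme described above; the table and column names (quotes_raw,
    quotes_part, ts, price) are hypothetical, and the syntax is that of
    current Impala releases:

        -- Table with explicit partition key columns.
        CREATE TABLE quotes_part (ts TIMESTAMP, price DOUBLE)
        PARTITIONED BY (symbol STRING, year INT, weekofyear INT)
        STORED AS PARQUETFILE;

        -- The partition values are computed in the select list of the
        -- INSERT that populates the table (dynamic partitioning).
        INSERT INTO quotes_part PARTITION (symbol, year, weekofyear)
        SELECT ts, price, symbol, year(ts), weekofyear(ts) FROM quotes_raw;

        -- Pruning requires predicates on the partition key columns.
        SELECT count(*) FROM quotes_part
        WHERE symbol = 'IBM' AND year = 2013 AND weekofyear BETWEEN 1 AND 5;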
    Thanks in advance,
    Andrei.
  • Andrei at Jul 11, 2013 at 3:15 am
    Thanks, Marcel, that answers my question. I was thinking more about
    whether Impala could optimize in the following fashion: rewrite

    ... timestamp between '2013-01-01' and '2013-02-01'

    into

    ... year(timestamp) = 2013 and weekofyear(timestamp) between 1 and 5
    and timestamp between '2013-01-01' and '2013-02-01'

    i.e. automatically detect (and insert) the partition information.

    Regarding the first question (too many Parquet files), do you think the
    ORC format
    <http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds_Hive/orcfile.html>
    is better suited for such a use case? I like the idea of having stripes
    (partitions) inside one file, as well as lightweight indexes to seek/skip
    rows.
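    Since Impala does not do this rewrite automatically (per Marcel's answer
    above), the partition predicates have to be added by hand alongside the
    original timestamp range. A sketch against the hypothetical quotes_part
    table from the earlier example:

        -- The partition predicates prune partitions; the timestamp range
        -- still filters rows inside the partitions that are kept.
        SELECT ts, price FROM quotes_part
        WHERE symbol = 'IBM'
          AND year = 2013 AND weekofyear BETWEEN 1 AND 5
          AND ts BETWEEN '2013-01-01' AND '2013-02-01';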

