Skye Wanderman-Milne replied (Apr 9, 2014):
If you're using Impala to write the Parquet files, you can set the
PARQUET_FILE_SIZE query option before running your insert statements.
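
For reference, a minimal sketch of that approach in impala-shell (the table names and the 256 MB target are only illustrative, not from the thread):

-- Shrink the target Parquet file size so the INSERT produces more, smaller files.
-- 268435456 bytes = 256 MB; the default at the time was 1 GB.
SET PARQUET_FILE_SIZE=268435456;
INSERT OVERWRITE TABLE my_parquet_table SELECT * FROM staging_table;

With roughly 2 GB of compressed data, a 256 MB target gives about 2 GB / 256 MB ≈ 8 files instead of 3, so more disks and cores can take part in the scan.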

On Mon, Apr 7, 2014 at 1:09 AM, György Balogh wrote:

Hi,

We have a cluster with 3 nodes and 4 disks in each node. When we load data into
a Parquet table, it creates 3 chunks in parallel (I assume one per node).

This strategy is fine if we have hundreds of GB of data. But for tables with a
total compressed size of around 2 GB, only 3 chunks will be created, so query
performance will be far from the maximum because only 3 disks will read the data
and the rest will be idle. Is it possible to change this policy? For example,
creating as many chunks as the total number of disks in the cluster (12 in our
case) seems to be a better strategy. I know small Parquet files are not optimal,
but they are still better in this situation since all the disks and cores can
work in parallel.

Thank you,
Gyorgy

