On Fri, May 3, 2013 at 1:11 PM, Jung-Yup Lee wrote:
Hi Nong,

Will storing the repetition level in the Parquet DataPage for nested data be
included in the upcoming release?
I am really looking forward to this feature.

The upcoming 1.1 release of Impala will not support complex/nested structures.

On Saturday, May 4, 2013 2:04:19 AM UTC+9, Nong wrote:

Hi Jonathan,

Unfortunately there is no way to do this currently. Parquet is designed
to write large, non-splittable blocks to maximize IO throughput for larger
datasets (maximizing sequential reads vs disk seeks). The block size is
configurable in the format specification but is not currently exposed to
the user. We plan to expose more of these configuration options in the
upcoming release.
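A minimal sketch of the tradeoff described above, with illustrative numbers only (this is back-of-the-envelope arithmetic, not Impala's actual writer logic): because each block is non-splittable, the block size directly determines how many backing files a dataset of a given size produces.

```python
# Illustrative arithmetic: with a fixed, non-splittable block size, the
# number of backing files falls out of dataset size / block size.

def expected_files(dataset_bytes: int, block_bytes: int) -> int:
    """Rough count of Parquet files a writer would produce."""
    return max(1, -(-dataset_bytes // block_bytes))  # ceiling division

gb = 1024 ** 3
# A ~10 GB table with 1 GB blocks:
print(expected_files(10 * gb, 1 * gb))           # 10
# Smaller blocks yield more files (and so more potential scanners):
print(expected_files(10 * gb, 256 * 1024 ** 2))  # 40
```

Fewer, larger blocks favor sequential-read throughput per file; more, smaller blocks favor scan parallelism across a cluster.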


On Thu, May 2, 2013 at 4:10 PM, Jonathan Larson <jonathan....@gmail.com> wrote:
So I'm using Impala and loving it, but when I pulled a 700 M row table
(about 10 GB) into a parquet table - it only made 16 backing files in HDFS.
The source table had hundreds of files and I have a couple hundred CPUs to
throw at things. Now, my performance is many times worse when querying
against the parquet table (because it parallelizes only out to 16 workers)
as compared to the old non-parquet format which allows hundreds of workers
working in parallel. In Hive, I used to get around this problem by setting:

set mapred.max.split.input.size

and also by bumping up the


I tried setting a couple of the Impala settings, but I wasn't a good
guesser apparently. Does anyone know how I can convince Impala to use more
than 16 workers so I can see the benefits of using Parquet?
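The constraint behind the slowdown can be sketched in a few lines (a hedged illustration of the scheduling described in this thread, not Impala's actual scheduler): since a non-splittable Parquet file can feed at most one scanner, the file count caps parallelism no matter how many CPUs are available.

```python
# Hedged sketch: one non-splittable file feeds at most one worker, so
# effective parallelism is bounded by the file count.

def effective_workers(num_files: int, num_cpus: int) -> int:
    """Upper bound on concurrent scanners for a non-splittable format."""
    return min(num_files, num_cpus)

print(effective_workers(16, 200))   # 16  (the 16-file Parquet table)
print(effective_workers(400, 200))  # 200 (the old many-file table)
```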

Any thoughts appreciated :) Sorry if I missed something obvious!


Discussion Overview
group: impala-user
posted: May 3, '13 at 8:12p
active: May 3, '13 at 8:42p
2 users in discussion: Jung-Yup Lee (1 post), Marcel Kornacker (1 post)