As far as I know, in Impala a Parquet file consists of a single HDFS block,
and that block contains a number of column chunks.
Because more than one scanner thread cannot read a block at the same time,
I think the query latency will be proportional to the number of columns in
the select list.
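
To illustrate that reasoning, here is a rough toy model in Python. It is only a sketch of the argument above, not how Impala's scanner is actually implemented: the column names, chunk sizes, and single-thread throughput are made-up numbers, and `estimated_scan_seconds` is a hypothetical helper.

```python
# Toy model: a single scanner thread reads the selected column chunks of a
# one-block Parquet file one after another, so read time adds up per column.
# All figures are hypothetical and chosen only for illustration.

CHUNK_BYTES = {"a": 200_000_000, "b": 200_000_000, "c": 200_000_000}
THROUGHPUT = 100_000_000  # bytes per second for one thread (assumed)


def estimated_scan_seconds(columns, chunk_bytes=CHUNK_BYTES, throughput=THROUGHPUT):
    """Sum the chunk sizes of the selected columns and divide by throughput."""
    total_bytes = sum(chunk_bytes[c] for c in columns)
    return total_bytes / throughput


# Compare the estimate for one, two, and three selected columns.
for cols in (["a"], ["a", "b"], ["a", "b", "c"]):
    print(cols, estimated_scan_seconds(cols))
```

With equal-sized chunks the estimate grows linearly with the number of selected columns, which is the pattern described in the question below.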

On Saturday, May 25, 2013 1:18:02 AM UTC+9, gerrard...@gmail.com wrote:


When I run a select query on Parquet data with ~50 million rows and 10
columns, I get much worse performance as I select more columns.
Suppose the following query returns 3 rows:

select a from table where a = 12345;

This query returns in 2 seconds. Then if I query:

select a, b from table where a = 12345;

the query returns in 4 seconds, and so on. Is this expected behaviour, given
that Parquet is a columnar store? Is there a way to optimise this?

group: impala-user
posted: May 24, '13 at 7:44p
active: May 24, '13 at 7:44p

1 user in discussion

Jung-Yup Lee: 1 post


