Nope, I don't need to read all of the columns. Unfortunately, in my case, I
suspect the columns I need to read for aggregation will dominate the
Parquet file because they have high entropy. My filtered columns
(low-cardinality string fields and timestamps) should compress down to
almost nothing with dictionary encoding.

Awesome regarding the index stuff. I noticed you listed it in your 2.0
milestone, so no surprises there. Looking forward to it.

On Wednesday, October 9, 2013 10:52:19 AM UTC-7, Marcel Kornacker wrote:
On Wed, Oct 9, 2013 at 10:37 AM, <ke...@pulse.io> wrote:
I'm trying to design some interactive queries using Impala. My tests show
Impala has the desired latency when I keep the number of rows scanned in the
single-digit millions (which is really impressive, by the way). To keep the
data that size, I need to choose partition keys that will keep the reads
smallish (probably in the 10-200 MB range). However, the Impala docs
strongly encourage 1 GB partitions to improve compression and sequential-read
efficiency, and there's also the well-documented issue with HDFS and large
numbers of files.
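The tradeoff can be sketched with some back-of-the-envelope arithmetic. The row size and ingest rate below are illustrative assumptions, not measured values:

```python
# Back-of-the-envelope arithmetic for the partition-size tradeoff: smaller
# partitions mean smaller reads but many more files on HDFS.
BYTES_PER_ROW = 100                      # assumed average on-disk row size
ROWS_PER_MINUTE = 33_000                 # assumed ingest rate

def partition_stats(minutes, days_retained=30):
    """Return (MB per partition file, total file count over the retention
    window), assuming one file per partition."""
    rows = minutes * ROWS_PER_MINUTE
    size_mb = rows * BYTES_PER_ROW / 1e6
    files = days_retained * 24 * 60 // minutes
    return size_mb, files

for minutes in (30, 300):
    size_mb, files = partition_stats(minutes)
    print(f"{minutes:>4}-minute partitions: {size_mb:,.0f} MB/file, {files} files")
# 30-minute partitions: ~99 MB each but 1440 files over 30 days;
# 5-hour partitions: ~990 MB each and only 144 files.
```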
Do you always need to read all of the columns of that table?
Reading through the Parquet docs, it seems like its column statistics and
indexes could let Impala users have their cake and eat it too: you keep
large sequential files, but use the indices to avoid reading most of the
file and grab only the pages you need. Is this what the road map means when
it mentions Parquet indices (i.e. "Parquet enhancements – continued
performance gains including index pages")?
To make this a bit more concrete, here's an example. Say I have a time
series of data, and I only want to access about 30 minutes of it, which
equates to roughly 1 million rows or 100 MB (too big for an HBase scan, but
a small snack for Impala). Right now, my choices seem to be 1) create
30-minute partitions with 100 MB or smaller files, or 2) create larger
5-hour partitions and just eat the extra latency from scanning the unneeded
data. On the other hand, if there were a simple min-max stat on each page of
the timestamp column in the Parquet file, you could easily skip all the
unnecessary pages. Then I could store larger partitions but still get
low-latency queries.
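The page-skipping idea above can be sketched in a few lines. The stats layout here is hypothetical, kept deliberately simple, and not Parquet's actual metadata format:

```python
# Sketch of min-max page skipping: keep (min, max) timestamp stats per page
# and read only pages whose range overlaps the query's [lo, hi] window.
def pages_to_read(page_stats, lo, hi):
    """page_stats: list of (min_ts, max_ts) per page; returns the indices
    of pages that could contain rows with timestamps in [lo, hi]."""
    return [i for i, (mn, mx) in enumerate(page_stats)
            if mx >= lo and mn <= hi]

# A 5-hour partition split into 10 pages covering 30 minutes each.
pages = [(m, m + 30) for m in range(0, 300, 30)]
# A 30-minute query window straddling a page boundary touches only 2 of
# the 10 pages; the other 8 are skipped without being read.
print(pages_to_read(pages, 75, 105))   # → [2, 3]
```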

Am I understanding the point of the indexes correctly?
You are, and Parquet 2.0 defines the layout of those index pages. It also
has the option of sorted files (i.e., you could scan truly contiguous
sections of the columns you care about, which in your case would require
sorting on the timestamp column).
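With a file sorted on the timestamp column, the matching rows form a single contiguous run that two binary searches can locate. A minimal sketch, assuming an in-memory sorted column rather than Parquet's actual reader:

```python
import bisect

# With the file sorted on the timestamp column, rows matching a time range
# form one contiguous run; two binary searches find its bounds.
def contiguous_range(sorted_ts, lo, hi):
    """Return (start, end) such that sorted_ts[start:end] holds exactly
    the timestamps in [lo, hi]."""
    start = bisect.bisect_left(sorted_ts, lo)
    end = bisect.bisect_right(sorted_ts, hi)
    return start, end

timestamps = list(range(0, 1000, 10))    # sorted timestamp column
start, end = contiguous_range(timestamps, 250, 490)
# Only rows [25, 50) need to be scanned; everything else is skipped.
```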

However: the upcoming Impala 1.2 release will support Parquet 2.0 but
will not write or read those index pages; we'll be adding that
functionality in a future release (1.2.1 or 2.0).


To unsubscribe from this group and stop receiving emails from it, send an
email to impala-user+unsubscribe@cloudera.org.
