FAQ

On Wed, Oct 9, 2013 at 10:37 AM, <ke...@pulse.io> wrote:
I'm trying to design some interactive queries using Impala. My tests show
Impala has the desired latency when I keep the number of rows scanned in the
single-digit millions (which is really impressive, by the way). To keep the
data that size, I need to choose partition keys that will keep the reads
smallish (probably in the 10-200 MB range). However, the Impala docs
really encourage 1 GB partitions to improve compression and sequential-read
efficiency, and there's also the well-documented issue with HDFS and large
numbers of files.
Do you always need to read all of the columns of that table?
Reading through the Parquet docs, it seems like their column statistics and
indexes could allow users of Impala to have their cake and eat it too, in
the sense that you can have large sequential files, but by using the indices
you could avoid reading most of the file and only grab the pages you need.

Is this what the roadmap means when it mentions Parquet indices (i.e.
"Parquet enhancements – continued performance gains including index pages")?

To make this a bit more concrete, here's an example. Say I have a time
series of data, and I only want to access about 30 minutes of it, which
equates to roughly 1 million rows or 100 MB (too big for an HBase sequential
scan, but a small snack for Impala). Right now, my choices seem to be 1)
create 30-minute partitions with 100 MB or smaller files, or 2) create larger
5-hour partitions and just eat the extra latency from scanning the unneeded
data. On the other hand, if there were a simple min/max stat on each page of
the timestamp column in the Parquet file, you could easily skip all the
unnecessary pages. Then I could store larger partitions but still get
low-latency queries.
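The page-skipping idea in this example can be sketched as a toy model. Everything here (the `Page` class, the `scan` function, and the fake timestamps) is a hypothetical illustration of min/max pruning, not Impala's or Parquet's actual reader:

```python
from dataclasses import dataclass

@dataclass
class Page:
    """A toy column 'page': min/max stats plus the row timestamps."""
    ts_min: int
    ts_max: int
    rows: list

def scan(pages, lo, hi):
    """Return timestamps with lo <= ts < hi, skipping any page whose
    [ts_min, ts_max] range cannot overlap the query window."""
    out = []
    for p in pages:
        if p.ts_max < lo or p.ts_min >= hi:
            continue  # whole page eliminated using only its stats
        out.extend(ts for ts in p.rows if lo <= ts < hi)
    return out

# Five "hours" of data stored as 30-minute pages; query one 30-minute window.
pages = [Page(t, t + 29, list(range(t, t + 30))) for t in range(0, 300, 30)]
hit = scan(pages, 60, 90)  # touches one page; the other nine are skipped
```

With stats like these, a 5-hour partition costs roughly the same to query as a 30-minute one, which is exactly the trade-off the example is after.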

Am I understanding the point of the indexes correctly?
You are, and Parquet 2.0 defines the layout of those index pages. It
also has the option of sorted files (i.e., you could really scan
contiguous sections of the columns you care about, which in your case
would require sorting on the timestamp column).
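The payoff of sorted files is that the qualifying pages form one contiguous run, so a reader can binary-search the page minimums instead of testing every page. A rough stdlib sketch (toy structures and names, not the actual Parquet 2.0 index-page format):

```python
import bisect

# Per-page minimum timestamps for a file sorted on the timestamp column.
page_mins = [0, 30, 60, 90, 120, 150]

def page_run(lo, hi):
    """(first, last) slice bounds of the contiguous run of pages that
    may contain rows with lo <= ts < hi."""
    first = max(bisect.bisect_right(page_mins, lo) - 1, 0)
    last = bisect.bisect_left(page_mins, hi)
    return first, last

first, last = page_run(60, 90)  # a single page to read: pages[2:3]
```

On unsorted data the same window could hit every page, which is why Marcel notes that this scheme requires sorting on the timestamp column.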

However: the upcoming Impala 1.2 release will support Parquet 2.0 but
will not write or read those index pages; we'll be adding that
functionality in a future release (1.2.1 or 2.0).

Marcel

To unsubscribe from this group and stop receiving emails from it, send an
email to impala-user+unsubscribe@cloudera.org.

  • Keith at Oct 9, 2013 at 6:00 pm
    Nope, I don't need to read all the columns. Unfortunately, in my case, I
    suspect the columns I need to read for aggregation will dominate the
    Parquet file because they have high entropy. My filtered columns
    (low-cardinality string fields and timestamps) should compress down to
    almost nothing using dictionary encoding.
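The "compress down to almost nothing" claim for dictionary encoding can be illustrated with a quick toy encoder (hypothetical code, not Parquet's actual implementation):

```python
def dict_encode(values):
    """Replace each value with a small integer index into a dictionary."""
    dictionary, indices, seen = [], [], {}
    for v in values:
        if v not in seen:
            seen[v] = len(dictionary)
            dictionary.append(v)
        indices.append(seen[v])
    return dictionary, indices

# ~1M rows drawn from only three distinct strings (low cardinality):
col = ["us-east", "us-west", "eu-central"] * 333_334
dictionary, indices = dict_encode(col)
# Only 3 dictionary entries; each index fits in 2 bits, so the encoded
# column is a tiny fraction of the raw string data.
```

Parquet additionally applies an RLE/bit-packing hybrid to the dictionary indices, which shrinks repetitive columns like these even further.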

    Awesome regarding the index stuff. I noticed you listed it in your 2.0
    milestone, so no surprises there. Looking forward to it.

    Keith

Discussion Overview
group: impala-user
category: hadoop
posted: Oct 9, '13 at 5:52p
active: Oct 9, '13 at 6:00p
posts: 2
users: 2
website: cloudera.com
irc: #hadoop

2 users in discussion

Keith: 1 post Marcel Kornacker: 1 post

site design / logo © 2021 Grokbase