Hi Keith,

You are correct that Impala does not perform delayed materialization (i.e.,
it materializes then filters).

For text and sequence files, Impala will parse and store the columns needed
to evaluate the predicates first (this still requires reading the entire
file to locate all the rows and columns, but saves cycles on rows that do
not pass the filter). For all other file formats, Impala materializes all
selected columns before filtering.
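
As a rough illustration of the difference (a toy Python sketch, not Impala's
actual scanner code; the function names, the comma-delimited rows, and the
"columns parsed" counter are all assumptions made up for this example), the
two strategies look something like this:

```python
from typing import Callable, Dict, List, Sequence, Tuple

Row = Dict[int, str]  # column index -> parsed value


def materialize_then_filter(
    lines: Sequence[str],
    selected: Sequence[int],
    predicate: Callable[[Row], bool],
) -> Tuple[List[Row], int]:
    """Parse every selected column of every row, then apply the predicate."""
    parsed = 0  # hypothetical cost counter: columns parsed and stored
    out: List[Row] = []
    for line in lines:
        fields = line.split(",")  # stands in for locating field boundaries
        row = {c: fields[c] for c in selected}
        parsed += len(selected)
        if predicate(row):
            out.append(row)
    return out, parsed


def predicate_columns_first(
    lines: Sequence[str],
    selected: Sequence[int],
    predicate_cols: Sequence[int],
    predicate: Callable[[Row], bool],
) -> Tuple[List[Row], int]:
    """Parse only the predicate columns first; parse the rest for survivors.

    Assumes predicate_cols is a subset of selected.
    """
    parsed = 0
    out: List[Row] = []
    for line in lines:
        fields = line.split(",")  # the whole line is still read either way
        row = {c: fields[c] for c in predicate_cols}
        parsed += len(predicate_cols)
        if not predicate(row):
            continue  # rejected rows never pay for the remaining columns
        rest = [c for c in selected if c not in row]
        for c in rest:
            row[c] = fields[c]
        parsed += len(rest)
        out.append(row)
    return out, parsed


# 1000 rows of 20 columns; row i holds the values i, i+1, ..., i+19.
lines = [",".join(str(i + c) for c in range(20)) for i in range(1000)]
selected = list(range(20))
predicate_cols = [0, 1]


def keep(row: Row) -> bool:
    return row[0] == "5" and row[1] == "6"


eager_rows, eager_cost = materialize_then_filter(lines, selected, keep)
lazy_rows, lazy_cost = predicate_columns_first(
    lines, selected, predicate_cols, keep
)
print(len(eager_rows), eager_cost, lazy_cost)
```

Both functions return the same rows and both read every line in full; the
only thing that changes is how many columns get parsed for rows that fail
the filter.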

Skye

On Wed, Mar 26, 2014 at 12:09 PM, Keith Simmons wrote:

I'm pretty sure this was answered at the last Impala meetup, but I'm going
to ask again just to be sure.

If I'm running a query where several columns are selected, but only a few
are involved in the predicates, will Impala filter then materialize rows,
or materialize then filter? For example, if I'm selecting twenty columns,
but only filtering on two columns, and only 10 rows out of millions pass my
filter, do I still have to pay the penalty to materialize the millions of
rows? I suspect the answer is yes, as reducing my select columns seems to
drastically alter my query profiles, but I wanted to confirm before I chase
the wrong optimization strategy.

I believe the Parquet folks have done predicate pushdown for their Avro
reader here: https://github.com/Parquet/parquet-mr/pull/68. That said,
I'm basing this entirely on the name of the pull request. I haven't
actually looked at the code.

Keith

To unsubscribe from this group and stop receiving emails from it, send an
email to impala-user+unsubscribe@cloudera.org.

Posted to impala-user on Mar 27, '14 at 9:56p by Skye Wanderman-Milne.