consideration implementing index in Parquet.
https://groups.google.com/forum/#!searchin/parquet-dev/index/parquet-dev/Izazbxw5uiA/unU_XeTi27sJ
Once an index becomes available in Parquet, we will be able to take
advantage of it.
Hope this helps.
On Saturday, May 4, 2013 3:34:11 AM UTC+9, Jonathan Larson wrote:
Are there plans to eventually allow for secondary indices (e.g. akin to
a Hive style index) in Impala? Just working with some enormous datasets
and Parquet / Impala provides incomparable speed... but we're still often
looking to pull things into columnar stores for retrieval for UI
interaction (Impala pulls it down to ~20 seconds). If we were able to
define and physically materialize an index on a column or two,
wouldn't the performance be even faster for those cases (I'm sure it would
drop to sub-second)? I know Impala isn't *intended* to do the RTQ stuff as
much as Ad-Hoc (which is our predominant use case), but it seems pretty
close to being able to even provide at least some level of RTQ too. As a
test right now, I'm doing "group by" selections into new tables in Hive and
then querying those reduced derived tables in Impala, which isn't as fast
as an index due to the scan, but still provides a massive speedup because
the tables are only in the millions of rows instead of billions. Even just
doing this pulls the queries to UI-ready levels.
With an index capability, I could avoid this step - and the traversal
logic would be far faster. Maybe I'm pushing too far down this lane
though... thoughts? Just wanted to hear what people thought :)
Thanks!
-Jonathan
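
For readers following along, the "group by into a derived table" workaround
described in the quoted message can be sketched roughly as below. This is
only an illustration under loud assumptions: it uses Python's built-in
sqlite3 module as a stand-in for Hive and Impala, and the table names
(events, events_by_user) and columns are hypothetical, not taken from the
original post. It shows the shape of the idea (aggregate once, then query
the much smaller table), not actual Impala behavior or performance.

```python
import sqlite3


def build_demo(conn, n_rows=10000):
    """Create a hypothetical raw table and a reduced, pre-aggregated table."""
    conn.execute("CREATE TABLE events (user_id INTEGER, amount INTEGER)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?)",
        ((i % 100, i % 7) for i in range(n_rows)),
    )
    # The "group by" selection into a new table: one row per user_id
    # instead of one row per event.
    conn.execute(
        "CREATE TABLE events_by_user AS "
        "SELECT user_id, COUNT(*) AS n_events, SUM(amount) AS total_amount "
        "FROM events GROUP BY user_id"
    )


def total_from_raw(conn, user_id):
    """Answer the question by scanning the full raw table."""
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM events WHERE user_id = ?",
        (user_id,),
    ).fetchone()
    return row[0]


def total_from_rollup(conn, user_id):
    """Answer the same question from the reduced derived table."""
    row = conn.execute(
        "SELECT total_amount FROM events_by_user WHERE user_id = ?",
        (user_id,),
    ).fetchone()
    return row[0] if row else 0


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_demo(conn)
    raw_rows = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
    rollup_rows = conn.execute("SELECT COUNT(*) FROM events_by_user").fetchone()[0]
    print("raw rows:", raw_rows, "rollup rows:", rollup_rows)
    print("totals match:", total_from_raw(conn, 42) == total_from_rollup(conn, 42))
```

Both lookups still scan a table; the derived one is simply far smaller,
which matches the point made in the message that this is slower than a
true index but still a large speedup.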