Jonathan, thank you for your question.

I agree completely that a lot of workloads include some form of
single-row lookups or range scans over a small number of rows, and for
those particular queries indices make a lot of sense. However, index
support is best implemented in the underlying storage manager and not
in the query engine that runs on top of it. Impala does support HBase
as a storage manager, in addition to HDFS, and HBase gives you an
ordered key space (in effect a primary index) plus row lookups and
range scans. Have you looked into HBase as an option?
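
To illustrate why an ordered key space acts like a primary index, here is a minimal Python sketch. It is a toy in-memory stand-in, not the HBase API: the class name `OrderedKeyStore`, its methods, and the `user#...` keys are all made up for illustration. Because the keys are kept sorted, both a single-row lookup and a range scan start with a binary search instead of a scan over every row.

```python
import bisect


class OrderedKeyStore:
    """Toy store that keeps rows sorted by key (hypothetical, not the HBase API)."""

    def __init__(self, rows):
        # rows: iterable of (key, value) pairs; sort once by key.
        items = sorted(rows, key=lambda kv: kv[0])
        self._keys = [k for k, _ in items]
        self._values = [v for _, v in items]

    def get(self, key):
        """Single-row lookup: binary search for the exact key."""
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return None

    def scan(self, start, stop):
        """Range scan over keys in [start, stop): two binary searches, then a slice."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return list(zip(self._keys[lo:hi], self._values[lo:hi]))


# Made-up sample rows, inserted out of order on purpose.
store = OrderedKeyStore([
    ("user#003", "c"),
    ("user#001", "a"),
    ("user#010", "d"),
    ("user#002", "b"),
])
point = store.get("user#002")
window = store.scan("user#002", "user#010")
```

The same idea is what makes a lookup on a column other than the key expensive: without a second sorted structure over that column, every row has to be examined.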

HDFS, being basically a file system, is unfortunately not in an ideal
position to implement secondary indices. I am aware that Hive supports
some form of secondary indices, but I don't have any first-hand
experience of how useful those actually are or how widely they are
used.

Marcel

On Fri, May 3, 2013 at 11:34 AM, Jonathan Larson wrote:
Are there plans to eventually allow for secondary indices (e.g. akin to a
Hive-style index) in Impala? We're working with some enormous datasets, and
Parquet / Impala provides incomparable speed... but we're still often
looking to pull things into columnar stores for retrieval for UI interaction
(Impala brings it down to ~20 seconds). If we were able to define and
physically materialize an index on top of a column or two, wouldn't the
performance be even better for those cases (I'm sure it would drop below a
second)? I know Impala isn't *intended* to do the RTQ stuff as much as
ad hoc (which is our predominant use case), but it seems pretty close to
being able to provide at least some level of RTQ too. As a test right
now, I'm doing "group by" selections into new tables in Hive and then
querying those reduced derived tables in Impala. That isn't as fast as an
index because of the scan, but it still provides a massive speedup because
the tables are only in the millions of rows instead of billions. Even just
doing this brings the queries to UI-ready levels.
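
The derived-table workaround described above can be sketched roughly as follows. This is only an illustration under stated assumptions: Python's built-in `sqlite3` stands in for Hive and Impala so the example runs anywhere, and the table names (`events`, `events_daily`), columns, and rows are all made up. The point is the shape of the approach: aggregate once into a smaller table, then serve the interactive queries from that table.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the real warehouse (assumption).
conn = sqlite3.connect(":memory:")

# Hypothetical raw fact table; in practice this would hold billions of rows.
conn.execute("CREATE TABLE events (user_id TEXT, day TEXT, clicks INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("u1", "2013-05-01", 3),
        ("u1", "2013-05-01", 4),
        ("u2", "2013-05-01", 1),
        ("u1", "2013-05-02", 2),
    ],
)

# One-off "group by" selection into a new, much smaller derived table.
conn.execute(
    """
    CREATE TABLE events_daily AS
    SELECT user_id, day, SUM(clicks) AS clicks, COUNT(*) AS n_events
    FROM events
    GROUP BY user_id, day
    """
)

# Interactive queries then scan only the reduced table.
daily_for_u1 = conn.execute(
    "SELECT user_id, day, clicks FROM events_daily WHERE user_id = ? ORDER BY day",
    ("u1",),
).fetchall()
derived_row_count = conn.execute("SELECT COUNT(*) FROM events_daily").fetchone()[0]
```

The trade-off is that the derived table answers only the groupings it was built for and has to be rebuilt when the raw data changes, which is the extra step an index would avoid.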

With an index capability, I could avoid this step, and the traversal logic
would be far faster still. Maybe I'm pushing too far down this lane
though... thoughts? Just wanted to hear what people thought :)

Thanks!
-Jonathan

Discussion Overview
group: impala-user @
categories: hadoop
posted: May 3, '13 at 8:51p
active: May 9, '13 at 9:03p
posts: 3
users: 2
website: cloudera.com
irc: #hadoop