Hi Marcel,
   Thanks for your thoughtful reply. To answer your question: yes, we are
in fact using HBase for now (or even Cassandra). The problem is that we
must write a custom rowkey structure every time we need to index
something, and then be aware of that structure when scanning HBase for
quick retrieval. To further qualify my question: I was really hoping
Parquet would implement indices that Impala could take advantage of (which
Jung-Yup seems to have mentioned may happen). A mechanism like that,
combined with SQL DDL to create the indices and a query planner that
automatically generates the index access code, is far more powerful than
having my devs continually adjust to an ever-changing HBase API. While I
can have my team write a more generic handler, it seems very elegant and
attractive to have this natively in HiveQL. Having it at this layer would
provide an abstraction and standardization that would save a lot of
development time.
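
To make the "custom rowkey structure" point concrete, here is a minimal
sketch of the kind of hand-rolled index we end up maintaining. It does not
use the real HBase client API. `SortedKeyTable`, `index_rowkey`, `lookup`
and the `\x00` separator are hypothetical names and choices for
illustration only; a sorted Python list stands in for HBase's ordered key
space.

```python
import bisect

SEP = "\x00"  # assumed separator; sorts below any printable character


def index_rowkey(value: str, primary_key: str) -> str:
    """Build an index-table rowkey: indexed value first, then the primary key."""
    return f"{value}{SEP}{primary_key}"


class SortedKeyTable:
    """Toy stand-in for an ordered key space (not the HBase API)."""

    def __init__(self):
        self._keys = []

    def put(self, rowkey: str) -> None:
        # Keep keys sorted and unique, as an ordered store would.
        i = bisect.bisect_left(self._keys, rowkey)
        if i == len(self._keys) or self._keys[i] != rowkey:
            self._keys.insert(i, rowkey)

    def prefix_scan(self, prefix: str) -> list:
        # Seek to the first key >= prefix, then read while the prefix matches.
        i = bisect.bisect_left(self._keys, prefix)
        out = []
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            out.append(self._keys[i])
            i += 1
        return out


def lookup(table: SortedKeyTable, value: str) -> list:
    """Return the primary keys of all rows whose indexed column equals value."""
    prefix = value + SEP
    return [key[len(prefix):] for key in table.prefix_scan(prefix)]


# Every writer must know this layout; so must every reader.
index = SortedKeyTable()
for pk, city in [("u1", "berlin"), ("u2", "paris"), ("u3", "berlin")]:
    index.put(index_rowkey(city, pk))

print(lookup(index, "berlin"))
```

Both the write path and the scan path have to agree on the key layout,
which is exactly the coupling a declarative index plus a query planner
would hide.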
   Also, you mentioned the Hive secondary indices. They are nice and
provide a small performance boost for jobs that use them, but they are
basically moot for interactive retrieval because every query still
requires M/R to spin up. Impala doesn't have the M/R spin-up time, so
secondary indices would be FAR more valuable in it than they are in Hive
in terms of query responsiveness, and the implementation over something
like Parquet (once it supports indices) would seem pretty
straightforward.

Thanks!
-Jonathan
On Friday, May 3, 2013 1:51:24 PM UTC-7, Marcel Kornacker wrote:

Jonathan, thank you for your question.

I agree completely that a lot of workloads include some form of
single-row lookups or range scans over a small number of rows, and for
those particular queries indices make a lot of sense. However, index
support is best implemented in the underlying storage manager and not
in the query engine that runs on top of it. Impala does support HBase
as a storage manager, in addition to HDFS, and HBase gives you an
ordered key space (a primary index) plus row lookups and range scans.
Have you looked into HBase as an option?

HDFS, being basically a file system, is unfortunately not in an ideal
position to implement secondary indices. I am aware that Hive supports
some form of secondary indices, but I don't have any first-hand
experience of how useful those actually are or how widely they are
being used.

Marcel

On Fri, May 3, 2013 at 11:34 AM, Jonathan Larson
<jonathan....@gmail.com> wrote:
Are there plans to eventually allow for secondary indices (e.g. akin to a
Hive-style index) in Impala? We are working with some enormous datasets,
and Parquet / Impala provides incomparable speed, but we're still often
looking to pull things into columnar stores for retrieval for UI
interaction (Impala pulls it down to ~20 seconds). If we were able to
define and physically materialize an index on top of a column or two,
wouldn't performance be even better for those cases (I'm sure it would
drop to sub-second)? I know Impala isn't *intended* to do the RTQ stuff as
much as ad hoc queries (which is our predominant use case), but it seems
pretty close to being able to provide at least some level of RTQ too. As a
test right now, I'm doing "group by" selections into new tables in Hive
and then querying those reduced derived tables in Impala. That isn't as
fast as an index because of the scan, but it still provides a massive
speedup because the tables are only in the millions of rows instead of
billions. Even just doing this pulls the queries to UI-ready levels.
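
For anyone curious, the "group by into a derived table" workaround looks
roughly like the sketch below. SQLite (via Python's standard `sqlite3`
module) stands in for Hive and Impala purely so the example runs anywhere.
The table and column names (`events`, `events_daily`, `user_id`, `day`,
`bytes`) and the helper `daily_totals` are made up for illustration and
are not from our real schema.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the Hive/Impala warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, day TEXT, bytes INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("u1", "2013-05-01", 100),
        ("u1", "2013-05-01", 250),
        ("u2", "2013-05-01", 40),
        ("u1", "2013-05-02", 10),
    ],
)

# Step 1 (done in Hive in our setup): materialize a reduced table.
conn.execute(
    """
    CREATE TABLE events_daily AS
    SELECT user_id, day, COUNT(*) AS n_events, SUM(bytes) AS total_bytes
    FROM events
    GROUP BY user_id, day
    """
)


def daily_totals(conn, user_id):
    """Step 2 (done in Impala in our setup): query only the small table."""
    return conn.execute(
        "SELECT day, n_events, total_bytes FROM events_daily "
        "WHERE user_id = ? ORDER BY day",
        (user_id,),
    ).fetchall()


print(daily_totals(conn, "u1"))
```

The UI query still scans `events_daily`, but that table has one row per
group rather than one row per event, which is where the speedup comes
from. The cost is that the derived table has to be rebuilt whenever the
base data changes.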

With an index capability, I could avoid this step, and the traversal logic
would be far faster. Maybe I'm pushing too far down this lane, though.
Thoughts? I just wanted to hear what people think :)

Thanks!
-Jonathan

Discussion Overview
group: impala-user @
categories: hadoop
posted: May 3, '13 at 8:51p
active: May 9, '13 at 9:03p
posts: 3
users: 2
website: cloudera.com
irc: #hadoop