On Mon, May 6, 2013 at 12:24 PM, Jonathan Larson wrote:
Hi Marcel,
Thanks for your thoughtful reply. To answer your question: yes, we are
in fact using HBase for now (or even Cassandra). The problem is
that we must write a custom rowkey structure every time we need to index
something... and then be aware of that structure when scanning HBase for
quick retrieval. To further qualify my question: I was really
hoping Parquet would implement indices that Impala would be able to take
advantage of (which Jung-Yup seems to have mentioned may happen). Using a
mechanism like that along with the SQL DDL used to create indices... and a
query planner to automatically write the code to access the indices is far
more powerful than having my devs continually adjust to an ever-changing HBase
API. While I can have my team write a more generic handler, it just seems
very elegant and attractive to have this in native HiveQL... Having it at this
layer would provide an abstraction and standardization that would save a lot
of development time.
Also, you mentioned the Hive secondary indices... they are nice and
provide a small performance boost, but they are still basically moot
for interactive retrieval because they require M/R to spin up. Still, they
do provide some performance boost for jobs that use them. Impala doesn't
have the M/R spin-up time, so secondary indices within it would seem FAR more
valuable than they are in Hive, in terms of a responsive query... and the
implementation over something like Parquet (once it supports indices) would
seem pretty straightforward.
they would need to be maintained explicitly, i.e., they would be
out of sync with the base data after new data gets added, because new
data might show up simply by copying data into HDFS, and the user
would be forced to recreate the index manually. This applies
independently of the file format, so Parquet's (not-yet-implemented)
random-access functionality wouldn't help.
Regarding HBase: does your hesitation regarding HBase stem from the
fact that you can't easily store composite row keys (you need to
convert them into a single string, and map it into a string column in
the Impala/Hive table) and that you don't like writing against the HBase
API? We are planning on improving the integration of Impala and HBase,
and this would include the ability to store composite row keys (so
that you could map them into, say, two integer columns in your Impala
table) and support for INSERT/UPDATE/DELETE, even for single rows (i.e.,
INSERT INTO <hbasetable> VALUES (...)).
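To illustrate the composite row key point above, here is a minimal sketch
in Python. It is not the HBase or Impala API: encode_key, decode_key and
scan_prefix are hypothetical names, and a sorted Python list stands in for
an ordered key space. It assumes both key parts are non-negative 32-bit
integers and packs them big-endian at fixed width, so that byte order
matches (first, second) tuple order and a range scan on the first part
stays contiguous.

```python
import bisect
import struct

# Hypothetical helper: pack two non-negative 32-bit integers into one key.
# Big-endian, fixed-width encoding makes byte order equal to tuple order.
def encode_key(first: int, second: int) -> bytes:
    return struct.pack(">II", first, second)

# Hypothetical helper: recover the two integer parts from a packed key.
def decode_key(key: bytes) -> tuple:
    return struct.unpack(">II", key)

# A sorted list of packed keys stands in for an ordered key space.
keys = sorted(
    encode_key(a, b) for a, b in [(2, 7), (1, 300), (1, 5), (3, 1), (2, 40)]
)

def scan_prefix(sorted_keys, first):
    """Range scan: return all keys whose first component equals `first`."""
    start = bisect.bisect_left(sorted_keys, encode_key(first, 0))
    stop = bisect.bisect_right(sorted_keys, encode_key(first, 0xFFFFFFFF))
    return [decode_key(k) for k in sorted_keys[start:stop]]

print(scan_prefix(keys, 1))  # [(1, 5), (1, 300)]
```

A naive decimal string key such as "1:300" would sort before "1:5", which
is one reason hand-rolled string row keys need care about padding and
ordering.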
Marcel
Thanks!
-Jonathan
On Friday, May 3, 2013 1:51:24 PM UTC-7, Marcel Kornacker wrote:
Jonathan, thank you for your question.
I agree completely that a lot of workloads include some form of
single-row lookups or range scans over a small number of rows, and for
those particular queries indices make a lot of sense. However, index
support is best implemented in the underlying storage manager and not
in the query engine that runs on top of it. Impala does support HBase
as a storage manager, in addition to HDFS, and HBase gives you an
ordered key space/a primary index plus row lookups and range scans.
Have you looked into HBase as an option?
HDFS, being basically a file system, is unfortunately not in an ideal
position to implement secondary indices. I am aware that Hive supports
some form of secondary indices, but I don't have any first-hand
experience of how useful those actually are or how widely they are being
used.
Marcel
On Fri, May 3, 2013 at 11:34 AM, Jonathan Larson
wrote:
Are there plans to eventually allow for secondary indices (e.g. akin to a
Hive style index) in Impala? Just working with some enormous datasets and
Parquet / Impala provides incomparable speed... but we're still often
looking to pull things into columnar stores for retrieval for UI interaction
(Impala pulls it down to ~20 seconds). If we were able to define and
physically materialize an index on top of a column or two too - wouldn't the
performance be even faster for those cases (I'm sure it would drop sub
second)? I know Impala isn't *intended* to do the RTQ stuff as much as
Ad-Hoc (which is our predominant use case), but it seems pretty close to
being able to even provide at least some level of RTQ too. As a test right
now, I'm doing "group by" selections into new tables in Hive and then
querying those reduced derived tables in Impala, which aren't as fast as an
index due to the scan, but still provides a massive speedup because the
tables are only in the millions of rows instead of billions. Even just
doing this pulls the queries to UI-ready levels.
With an index capability, I could avoid this step - and the traversal logic
would be even far faster. Maybe I'm pushing too far down this lane
though... thoughts? Just wanted to hear what people thought :)
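The "group by into a derived table" workaround described above can be
sketched roughly as follows. This is only an illustration under stated
assumptions: Python's built-in sqlite3 module stands in for Hive/Impala,
and the table and column names (events, events_by_user_category, user_id,
category, amount) and the lookup function are hypothetical, not taken from
the actual dataset.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the real warehouse.
conn = sqlite3.connect(":memory:")

# Hypothetical base table; in practice this would hold billions of rows.
conn.execute("CREATE TABLE events (user_id INTEGER, category TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (1, "a", 10.0), (1, "a", 5.0), (1, "b", 2.0),
        (2, "a", 7.0), (2, "b", 1.0), (2, "b", 4.0),
    ],
)

# Pre-aggregate once into a much smaller derived table.
conn.execute(
    "CREATE TABLE events_by_user_category AS "
    "SELECT user_id, category, COUNT(*) AS n, SUM(amount) AS total "
    "FROM events GROUP BY user_id, category"
)

def lookup(user_id):
    """Interactive query: scans only the reduced derived table."""
    cur = conn.execute(
        "SELECT category, n, total FROM events_by_user_category "
        "WHERE user_id = ? ORDER BY category",
        (user_id,),
    )
    return cur.fetchall()

print(lookup(1))  # [('a', 2, 15.0), ('b', 1, 2.0)]
```

The derived table still has to be scanned and has to be rebuilt when the
base data changes, which is the step an index would avoid.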
Thanks!
-Jonathan