Jonathan, thank you for your question.

I agree completely that a lot of workloads include some form of
single-row lookups or range scans over a small number of rows, and for
those particular queries indices make a lot of sense. However, index
support is best implemented in the underlying storage manager and not
in the query engine that runs on top of it. Impala does support HBase
as a storage manager, in addition to HDFS, and HBase gives you an
ordered key space (in effect a primary index) plus row lookups and
range scans. Have you looked into HBase as an option?
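
To illustrate what an ordered key space buys you, here is a minimal Python sketch. It is a toy in-memory stand-in, not the HBase API: the class `OrderedKeyStore`, its methods, and the `user#NNNN` keys are all made up for illustration. It only shows why keeping row keys sorted makes both single-row lookups and range scans cheap.

```python
import bisect


class OrderedKeyStore:
    """Toy stand-in for an ordered key space (hypothetical, not the HBase API)."""

    def __init__(self):
        self._keys = []   # row keys, always kept in sorted order
        self._rows = {}   # row key -> row contents

    def put(self, key, row):
        # Insert the key at its sorted position the first time it is seen.
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = row

    def get(self, key):
        # Single-row lookup by primary key.
        return self._rows.get(key)

    def scan(self, start, stop):
        # Range scan: all rows with start <= key < stop, in key order.
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]


store = OrderedKeyStore()
for key in ["user#0003", "user#0001", "user#0010", "user#0002"]:
    store.put(key, {"id": key})
```

A scan touches only the keys inside the requested range, which is the property a full-file scan over HDFS cannot offer.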

HDFS, being basically a file system, is unfortunately not in an ideal
position to implement secondary indices. I am aware that Hive supports
some form of secondary indices, but I don't have any first-hand
experience of how useful those actually are or how widely they are
used.

Marcel

On Fri, May 3, 2013 at 11:34 AM, Jonathan Larson wrote:
Are there plans to eventually allow for secondary indices (e.g. akin to a
Hive style index) in Impala? Just working with some enormous datasets and
Parquet / Impala provides incomparable speed... but we're still often
looking to pull things into columnar stores for retrieval for UI interaction
(Impala pulls it down to ~20 seconds). If we were able to define and
physically materialize an index on top of a column or two too - wouldn't the
performance be even faster for those cases (I'm sure it would drop sub
second)? I know Impala isn't *intended* to do the RTQ stuff as much as
Ad-Hoc (which is our predominant use case), but it seems pretty close to
being able to even provide at least some level of RTQ too. As a test right
now, I'm doing "group by" selections into new tables in Hive and then
querying those reduced derived tables in Impala, which aren't as fast as an
index due to the scan, but still provides a massive speedup because the
tables are only in the millions of rows instead of billions. Even just
doing this pulls the queries to UI-ready levels.
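
The pre-aggregation workaround described above can be sketched in Python as follows. This is only an illustration of the idea (collapse many detail rows into one summary row per group, then query the much smaller result). The function `rollup` and the column names `region`, `day`, and `bytes` are hypothetical and do not come from the thread.

```python
from collections import defaultdict


def rollup(rows, key_cols, value_col):
    """Collapse detail rows into one summary entry per distinct key."""
    totals = defaultdict(lambda: [0, 0])  # key -> [row count, running sum]
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        totals[key][0] += 1
        totals[key][1] += row[value_col]
    return {k: {"n": n, "total": s} for k, (n, s) in totals.items()}


# Hypothetical detail rows; in practice this would be the billions-of-rows table.
events = [
    {"region": "west", "day": "2013-05-01", "bytes": 10},
    {"region": "west", "day": "2013-05-01", "bytes": 5},
    {"region": "east", "day": "2013-05-01", "bytes": 7},
]
summary = rollup(events, ["region", "day"], "bytes")
```

Queries against `summary` still scan, but over far fewer rows, which matches the speedup described for the derived tables.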

With an index capability, I could avoid this step - and the traversal logic
would be even far faster. Maybe I'm pushing too far down this lane
though... thoughts? Just wanted to hear what people thought :)

Thanks!
-Jonathan


  • Jonathan Larson at May 6, 2013 at 4:24 pm
    Hi Marcel,
       Thanks for your thoughtful reply. To answer your question: yes, we
    are in fact using HBase for now (or even Cassandra). The problem is
    that we must write a custom rowkey structure every time we need to
    index something... and then be aware of that structure when scanning HBase
    for quick retrieval. I guess to further qualify my question - I was really
    hoping Parquet would implement indices that Impala would be able to take
    advantage of (which Jung-Yup seems to have mentioned may happen). Using a
    mechanism like that along with the SQL DDL used to create indices... and a
    query planner to automatically write the code to access the indices is far
    more powerful than having my devs continually adjust an ever changing HBase
    API. While I can have my team write a more generic handler, it just seems
    very elegant and attractive to have in native HiveQL... Having it at this
    layer would provide an abstraction and standardization that would save a
    lot of development time.
       Also, you mentioned the Hive secondary indices... they are nice and
    provide a small performance boost, but it is still basically a moot issue
    for interactive retrieval because it requires M/R to spin up. Still, they
    do provide some performance boost for jobs that use them. Impala doesn't
    have the M/R spinup time, so secondary indices within it would seem FAR
    more valuable than they are in Hive - in terms of a responsive query... and
    the implementation over something like Parquet (once it supports indices)
    would seem pretty straightforward.

    Thanks!
    -Jonathan
  • Marcel Kornacker at May 9, 2013 at 9:03 pm

    Jonathan, the problem with secondary HDFS-resident indices is that
    they would need to be maintained explicitly, i.e., they would be
    out of sync with the base data after new data gets added, because new
    data might show up simply by copying data into HDFS, and the user
    would be forced to recreate the index manually. This applies
    independently of the file format, so Parquet's (not-yet-implemented)
    random-access functionality wouldn't help.
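
The staleness problem can be illustrated with a small Python sketch. Everything here is hypothetical: `FileIndex` is a toy index over an in-memory dict of files, not anything Impala, Hive, or Parquet provides. It shows how a file that appears outside the index's control leaves the index silently incomplete until it is rebuilt by hand.

```python
class FileIndex:
    """Toy secondary index over {file name: list of rows} (hypothetical)."""

    def __init__(self, files, column):
        self.column = column
        self.indexed_files = set()
        self.postings = {}  # column value -> list of (file name, row offset)
        self.rebuild(files)

    def rebuild(self, files):
        # Full rebuild: the only way this index learns about new files.
        self.postings = {}
        for name, rows in files.items():
            for offset, row in enumerate(rows):
                self.postings.setdefault(row[self.column], []).append((name, offset))
        self.indexed_files = set(files)

    def is_stale(self, files):
        # Stale as soon as the set of files differs from what was indexed.
        return set(files) != self.indexed_files

    def lookup(self, value):
        return self.postings.get(value, [])


files = {"part-0": [{"user": "a"}, {"user": "b"}]}
index = FileIndex(files, "user")

# Simulates a file being copied into the directory behind the index's back.
files["part-1"] = [{"user": "a"}]
```

At this point `index.lookup("a")` still returns only the row in `part-0`; the row in `part-1` is invisible until `index.rebuild(files)` is called.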

    Regarding HBase: does your hesitation stem from the
    fact that you can't easily store composite row keys (you need to
    convert them into a single string, and map it into a string column in
    the Impala/Hive table) and you don't like writing against the HBase
    API? We are planning on improving the integration of Impala and HBase,
    and this would include the ability to store composite row keys (so
    that you could map them into, say, two integer columns in your Impala
    table) and support for INSERT/UPDATE/DELETE, even for single rows (i.e.,
    INSERT INTO <hbasetable> VALUES (...)).
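
As an illustration of the "convert them into a single string" step, here is one possible encoding, sketched in Python. The helpers `encode_key` and `decode_key`, the `:` separator, and the fixed width are assumptions made for this example, not a scheme prescribed by HBase or Impala. Zero-padding to a fixed width is used so that the string order of the keys matches the numeric order of the two components, which keeps range scans meaningful.

```python
def encode_key(a, b, width=10):
    """Pack two non-negative integers into one sortable string row key."""
    limit = 10 ** width
    if not (0 <= a < limit and 0 <= b < limit):
        raise ValueError("key component does not fit in the fixed width")
    # Fixed-width, zero-padded fields: lexicographic order == numeric order.
    return f"{a:0{width}d}:{b:0{width}d}"


def decode_key(key):
    """Recover the two integer components from an encoded row key."""
    a, b = key.split(":")
    return int(a), int(b)
```

Every reader and writer of the table has to agree on this layout, which is the maintenance burden described earlier in the thread.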

    Marcel

Discussion Overview
group: impala-user @
category: hadoop
posted: May 3, '13 at 8:51p
active: May 9, '13 at 9:03p
posts: 3
users: 2
website: cloudera.com
irc: #hadoop
