FAQ
Hi,

We would like to create an impala table which can supply updated data to.
Currently we are using inserts with a version timestamp to allow us to get
the latest row version as well as run "as of" a specific time.

The solution we have is using a join to "max(version_timestamp), pkey ...
group by pkey"
which has the obvious performance issue of a join.

Q) What other options can you recommend?



select version_timestamp as rev, m.pkey, field1, field2

from my_table m

inner join (

     select max(version_timestamp) as rev,

     pkey

     from my_table

     group by pkey

) v on v.rev = m.version_timestamp;

Search Discussions

  • Marcel Kornacker at Jun 18, 2013 at 3:52 am

    On Mon, Jun 17, 2013 at 7:54 AM, Paul Birnie wrote:
    Hi,

    We would like to create an impala table which can supply updated data to.
    Currently we are using inserts with a version timestamp to allow us to get
    the latest row version as well as run "as of" a specific time.
    This might be more adequately stored in hbase, which already supports
    versioning and timestamps. The Impala hbase scanner will only retrieve
    the latest version of a cell, which is exactly what you want.

    Marcel
    The solution we have is using a join to "max(version_timestamp), pkey ...
    group by pkey"
    which has the obvious performance issue of a join.

    Q) What other options can you recommend?



    select version_timestamp as rev, m.pkey, field1, field2

    from my_table m

    inner join (

    select max(version_timestamp) as rev,

    pkey

    from my_table

    group by pkey

    ) v on v.rev = m.version_timestamp;



  • Paul Birnie at Jul 22, 2013 at 10:18 am
    Hi Marcel,

    I created an hbase table and wrote 1 million rows (1 million unique keys)
    to it.

    The problem is that running an hbase aggregate query against the database
    is very slow - as I posted here.

    "slow queries from impala against hbase table<https://groups.google.com/a/cloudera.org/forum/#!mydiscussions/impala-user/BV5AiK8vlfQ>
    "
    (
    https://groups.google.com/a/cloudera.org/forum/#!topic/impala-user/BV5AiK8vlfQ
    )

    It feels like there is a throughput barrier between impala and hbase

    kind regards
    Paul Birnie

    On Tuesday, 18 June 2013 04:52:44 UTC+1, Marcel Kornacker wrote:
    On Mon, Jun 17, 2013 at 7:54 AM, Paul Birnie wrote:
    Hi,

    We would like to create an impala table which can supply updated data to.
    Currently we are using inserts with a version timestamp to allow us to get
    the latest row version as well as run "as of" a specific time.
    This might be more adequately stored in hbase, which already supports
    versioning and timestamps. The Impala hbase scanner will only retrieve
    the latest version of a cell, which is exactly what you want.

    Marcel
    The solution we have is using a join to "max(version_timestamp), pkey ...
    group by pkey"
    which has the obvious performance issue of a join.

    Q) What other options can you recommend?



    select version_timestamp as rev, m.pkey, field1, field2

    from my_table m

    inner join (

    select max(version_timestamp) as rev,

    pkey

    from my_table

    group by pkey

    ) v on v.rev = m.version_timestamp;



Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupimpala-user @
categorieshadoop
postedJun 17, '13 at 2:54p
activeJul 22, '13 at 10:18a
posts3
users2
websitecloudera.com
irc#hadoop

2 users in discussion

Paul Birnie: 2 posts Marcel Kornacker: 1 post

People

Translate

site design / logo © 2021 Grokbase