Greg Rahn at Jul 20, 2013 at 6:16 pm

This sounds like reasonable behavior.
You are basically comparing the best case (Parquet, optimized for scans) to
the worst case (HBase, optimized for key access) with a query that does zero
filtering, so there is nothing HBase can push down as a filter, nor does it
use a key.
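
For contrast, here is a hedged sketch of the kind of query the HBase-backed
table is actually built for: a point lookup on the row key, which touches a
single region rather than scanning the whole table (the key value is only an
example, borrowed from the region listing later in the thread).

-- a row-key lookup, the access pattern HBase is optimized for
select key, dimension, book, value
from hive_day1
where key = 'trade467242';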


On Sat, Jul 20, 2013 at 12:12 AM, Paul Birnie wrote:

Hi,

I am using the Cloudera QuickStart VM.

I created 3 tables

-- an external table

CREATE EXTERNAL TABLE day1
(
  key STRING,
  dimension STRING,
  book STRING,
  value INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/cloudera/opsmisdb/day1';


-- a parquet table

create table parquet_day1 LIKE day1 STORED AS PARQUETFILE;
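
The thread does not show how parquet_day1 was populated; presumably something
along these lines was run (a sketch, assuming the CSV data is already
queryable through the external day1 table):

-- copy the delimited data into the Parquet-backed table
INSERT OVERWRITE TABLE parquet_day1 SELECT * FROM day1;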


-- and from the Hive shell, I create a Hive table
-- which is backed by an underlying HBase table
-- and, after a "refresh;", is accessible in impala-shell
--

CREATE TABLE hive_day1 (key string, dimension string, book string, value int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:dimension,cf1:book,cf1:value")
TBLPROPERTIES ("hbase.table.name" = "hbase_day1");
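
As the comment above notes, the table only becomes visible to Impala after a
catalog refresh; at the time that looked roughly like this from impala-shell
(a sketch -- the exact catalog-refresh command depends on the Impala release,
and the count is just a hypothetical sanity check):

-- in impala-shell, after creating the table from the Hive shell
refresh;
select count(*) from hive_day1;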



-- The problem is that running any Impala query against the hive_day1 table
-- (which is actually stored in HBase) is very slow, even on a small
-- (1 million row) dataset:
--

select dimension, sum(value) from hive_day1 group by dimension

takes 15 seconds; on the parquet_day1 table it takes 1 second.


--

Q) Is this expected behavior, or is there some config setting that I have wrong?

Q) Is there some HBase setting I can change to make access faster (bloom
filters / in-memory caching)?



  • Paul Birnie at Jul 22, 2013 at 10:49 am
    Thanks Greg,

    I was thinking: if there were some way to enable in-memory caching of the
    1 million rows in HBase, then the effect of columnar vs. non-columnar would
    be negligible (based on the assumption that the main benefit of a columnar
    format is eliminating disk-head seek time on each "row" read, which would
    not occur if all the data were in memory).

    Q) Is there a setting to tell HBase to cache all data in memory?
    Q) If such a setting worked, could the performance be the same as / faster
    than Parquet, or is there another performance barrier between reading from
    HBase and streaming from Parquet (an HDFS file on disk)? E.g., HBase is
    accessed via an API that does not support streaming rows out of memory, and
    an API call is required per row?

    kind regards
    Paul

  • Paul Birnie at Jul 22, 2013 at 5:13 pm
    I wrote a small program to test the scan performance of HBase on that same
    VM (Cloudera CDH 4.3 QuickStart VM).

    Q) Is there some setting I can change to bring scan performance closer to
    HDFS speeds (better in-memory caching)?

    Interestingly, I get a "full" scan of the 1 million rows completed in 8
    seconds if I setCaching to 10000 (
    http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html#setCaching(int)
    ):

    13/07/21 07:14:37 WARN conf.Configuration: hadoop.native.lib is deprecated.
    Instead, use io.native.lib.available
    scan details
    dump:{"timeRange":[0,9223372036854775807],"batch":-1,"startRow":"trade0","stopRow":"trade999999","totalColumns":1,"cacheBlocks":true,"families":{"cf1":["dimension"]},"maxVersions":1,"caching":10000}
    iteration: 0 rowsRead:999999 table scan time :7(seconds)
    scan details
    dump:{"timeRange":[0,9223372036854775807],"batch":-1,"startRow":"trade0","stopRow":"trade999999","totalColumns":1,"cacheBlocks":true,"families":{"cf1":["dimension"]},"maxVersions":1,"caching":10000}
    iteration: 1 rowsRead:999999 table scan time :5(seconds)
    scan details
    dump:{"timeRange":[0,9223372036854775807],"batch":-1,"startRow":"trade0","stopRow":"trade999999","totalColumns":1,"cacheBlocks":true,"families":{"cf1":["dimension"]},"maxVersions":1,"caching":10000}
    iteration: 2 rowsRead:999999 table scan time :5(seconds)


    -- code start ---
    package com.test.hbase;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    // http://hbase.apache.org/book/perf.reading.html
    public class TestHBase {

        public static void main(String[] args) throws Exception {

            // scanner caching (rows fetched per RPC) is passed on the command line
            int cacheSize = Integer.valueOf(args[0]);

            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "hbase_day1");

            try {
                byte[] family = Bytes.toBytes("cf1");
                byte[] qual = Bytes.toBytes("dimension");

                for (int i = 0; i < 10; i++) {

                    // http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html
                    Scan scan = new Scan();

                    /* optional explicit key range:
                    scan.setStartRow(Bytes.toBytes("trade0"));
                    scan.setStopRow(Bytes.toBytes("trade999999"));
                    */

                    scan.setCaching(cacheSize);   // rows per RPC round trip
                    scan.addColumn(family, qual); // fetch only cf1:dimension

                    ResultScanner rs = table.getScanner(scan);
                    long scanStartTime = System.currentTimeMillis();

                    System.out.println("scan details dump:" + scan);

                    int rowsRead = 0;
                    for (Result r = rs.next(); r != null; r = rs.next()) {
                        rowsRead++;
                        byte[] valueObj = r.getValue(family, qual);
                        String value = Bytes.toString(valueObj);
                        //System.out.println(value);
                    }

                    long scanEndTime = System.currentTimeMillis();
                    long scanDurationSeconds = (scanEndTime - scanStartTime) / 1000;

                    System.out.println("iteration: " + i + " rowsRead:" + rowsRead
                            + " table scan time :" + scanDurationSeconds + "(seconds)");

                    rs.close();
                }
            } finally {
                table.close();
            }
        }
    }
    -- code end ---
    I also tried changing the settings of the hbase table (with no luck):

    hbase(main):007:0> disable 'hbase_day1'
    0 row(s) in 2.0480 seconds
    hbase(main):008:0> alter 'hbase_day1',{NAME => 'cf1', IN_MEMORY => 'true'}
    Updating all regions with the new schema...
    2/2 regions updated.
    Done.
    0 row(s) in 1.2980 seconds
    hbase(main):009:0> alter 'hbase_day1',{NAME => 'cf1', BLOOMFILTER => 'ROW'}
    Updating all regions with the new schema...
    2/2 regions updated.
    Done.
    0 row(s) in 1.2790 seconds
    hbase(main):011:0> alter 'hbase_day1',{NAME => 'cf1', VERSIONS => '1'}
    Updating all regions with the new schema...
    2/2 regions updated.
    Done.
    0 row(s) in 1.2160 seconds

    hbase(main):012:0> describe 'hbase_day1'
    DESCRIPTION                                                         ENABLED
     {NAME => 'hbase_day1', FAMILIES => [{NAME => 'cf1',                false
      DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW',
      REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE',
      MIN_VERSIONS => '0', TTL => '2147483647', KEEP_DELETED_CELLS => 'false',
      BLOCKSIZE => '65536', IN_MEMORY => 'true', ENCODE_ON_DISK => 'true',
      BLOCKCACHE => 'true'}]}
    1 row(s) in 0.1410 seconds

    output of the hbase admin port gives:

    hbase_day1,,1374390717054.0ef786fa72a5753b13cfbd2ffce6aa7b.  trade467242
      numberOfStores=1, numberOfStorefiles=2, storefileUncompressedSizeMB=61,
      storefileSizeMB=61, compressionRatio=1.0000, memstoreSizeMB=0,
      storefileIndexSizeMB=0, readRequestsCount=816108, writeRequestsCount=0,
      rootIndexSizeKB=75, totalStaticIndexSizeKB=43, totalStaticBloomSizeKB=640,
      totalCompactingKVs=0, currentCompactedKVs=0, compactionProgressPct=NaN

    hbase_day1,trade467242,1374390717054.0ff5f0e08adc58fe61fc902fa30b9db3.  trade467242
      numberOfStores=1, numberOfStorefiles=2, storefileUncompressedSizeMB=81,
      storefileSizeMB=81, compressionRatio=1.0000, memstoreSizeMB=0,
      storefileIndexSizeMB=0, readRequestsCount=669893, writeRequestsCount=0,
      rootIndexSizeKB=100, totalStaticIndexSizeKB=56, totalStaticBloomSizeKB=768,
      totalCompactingKVs=0, currentCompactedKVs=0, compactionProgressPct=NaN


  • Greg Rahn at Jul 22, 2013 at 5:15 pm
    The data can certainly be resident in the HBase block cache, given it's
    sized large enough, but that simply saves an IO penalty if one exists.
    The Impala HDFS scanner should outperform the HBase scanner every time when
    there is no filtering on the query. HBase is a better fit if you can use
    predicates on keys or other columns that eliminate ranges of data from even
    being read.

    If we step back a bit, is there a higher-level objective you are interested
    in? It would help to understand your use case and the context around it.



  • Greg Rahn at Jul 23, 2013 at 6:21 pm
    If I understand the use case correctly, the data is not updated in place;
    rather, a new record is added when an attribute changes for a given key.
    I think you will find that neither Impala nor HBase solves this perfectly --
    it's about trade-offs.

    HBase does versioning, which is why Marcel mentioned it; however, that was
    the only context in that thread -- there was no data-access discussion.
    That said, HBase is a better solution if you are doing access by key or if
    you have selective query predicates that allow data to be skipped rather
    than read. Its forte is not full scans.
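
    To make that concrete, a hedged sketch of the kind of selective predicate
    this refers to, assuming Impala can translate a row-key range into HBase
    scan start/stop rows (the key naming is borrowed from the scan test earlier
    in the thread):

    -- only a slice of the key space is read, not the whole table
    select dimension, sum(value)
    from hive_day1
    where key >= 'trade0' and key < 'trade2'
    group by dimension;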

    Impala is a great tool for scans and analytics, but the current bits lack
    support for window functions, which is probably the most elegant way in SQL
    to solve this. But even then, there are drawbacks to this data design as
    it requires a full view of all the data in order to find the most recent
    event.

    Depending on how the data will be consumed, you may find cases where one
    method/tool outperforms the other and vice versa -- I don't think there is
    a single best way. You may find that both tools solve only a portion of
    the requirements.

    --
    -- example with a SQL window function to find the most recent versioned event
    --
    create table events (event_id int, event_status int, event_ts timestamp);

    insert into events (event_id, event_status, event_ts) values (1000, 1, now());
    insert into events (event_id, event_status, event_ts) values (1000, 2, now());
    insert into events (event_id, event_status, event_ts) values (1000, 3, now());
    insert into events (event_id, event_status, event_ts) values (1000, 4, now());

    insert into events (event_id, event_status, event_ts) values (2000, 1, now());
    insert into events (event_id, event_status, event_ts) values (2000, 2, now());
    insert into events (event_id, event_status, event_ts) values (2000, 3, now());
    insert into events (event_id, event_status, event_ts) values (2000, 4, now());

    insert into events (event_id, event_status, event_ts) values (3000, 1, now());
    insert into events (event_id, event_status, event_ts) values (3000, 2, now());
    insert into events (event_id, event_status, event_ts) values (3000, 3, now());
    insert into events (event_id, event_status, event_ts) values (3000, 4, now());


    select * from
    (
       select
         event_id,
         event_status,
         event_ts,
         row_number() over (partition by event_id order by event_ts desc) as row_number
       from events
    ) x
    where row_number = 1;

     event_id | event_status | event_ts                   | row_number
    ----------+--------------+----------------------------+------------
         1000 |            4 | 2013-07-23 10:35:31.274211 |          1
         2000 |            4 | 2013-07-23 10:35:31.276625 |          1
         3000 |            4 | 2013-07-23 10:35:31.278503 |          1





    On Mon, Jul 22, 2013 at 3:53 PM, Paul Birnie wrote:

    Hi Greg,

    I am trying to solve what I call the "in-flight" problem: some rows in a
    system represent a trade / user that is changing state during an in-flight
    window, after which they become static. The in-flight window could be a day
    (or years).

    As I posted here, one option to get the latest version of a specific row is
    to join the table back to itself -- but this seems like a bad design when
    most of the time you always want the latest version.
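
    The self-join pattern described above might look roughly like this (a
    sketch against a hypothetical versioned layout, day1_versions, where each
    state change appends a new row with a later version_ts -- both names are
    illustrative, not from the thread):

    -- hypothetical table: day1_versions(key, dimension, book, value, version_ts)
    select d.key, d.dimension, d.book, d.value
    from day1_versions d
    join (
        -- latest version timestamp per key
        select key, max(version_ts) as max_ts
        from day1_versions
        group by key
    ) latest
        on d.key = latest.key
       and d.version_ts = latest.max_ts;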

    Marcel suggested I try HBase -- which works very well for inserts, it's
    just slow to run aggregate queries against the contents:

    https://groups.google.com/a/cloudera.org/forum/#!topic/impala-user/x0cvcUQ0dY4

    I have noticed this posting: http://hbase.apache.org/book/perf.hdfs.html

    "Performance Comparisons of HBase vs. HDFS

    A fairly common question on the dist-list is why HBase isn't as performant
    as HDFS files in a batch context (e.g., as a MapReduce source or sink). The
    short answer is that HBase is doing a lot more than HDFS (e.g., reading the
    KeyValues, returning the most current row or specified timestamps, etc.),
    and as such HBase is 4-5 times slower than HDFS in this processing context.
    There is room for improvement and this gap will, over time, be reduced, but
    HDFS will always be faster in this use-case."

    and this jira: https://issues.cloudera.org/browse/IMPALA-381

    Q) Any ideas for some alternative design options to solve the "in-flight"
    problem?

    Q) Given that my aggregate query is currently taking 15 seconds against
    HBase, perhaps there is some setting I can change to make it only 5 times
    slower than Parquet?

  • Greg Rahn at Jul 23, 2013 at 7:32 pm
    I should clarify: when I mentioned HBase and Impala, I'm referring
    specifically to Impala using HBase as the storage manager for the former,
    and Impala using HDFS as the storage manager for the latter.


Discussion Overview
group: impala-user
categories: hadoop
posted: Jul 20, '13 at 6:16p
active: Jul 23, '13 at 7:32p
posts: 6
users: 2
website: cloudera.com
irc: #hadoop

2 users in discussion: Greg Rahn (4 posts), Paul Birnie (2 posts)
