Grokbase Groups HBase user July 2011
Something like Execution Plan as in the RDBMS world?
Hello,



we have a three-part row key, designed so that the first part matters for
distribution/partitioning as the system grows. Depending on the access
pattern, we might end up using e.g. RegEx filters on row keys. I wonder if
there is some kind of execution plan (as known from an RDBMS) when running
an HBase query, to better understand how HBase processes the query and what
execution path it takes to generate the result set.



Or is this a completely different way of thinking?



Thanks,

Thomas


  • Ted Dunning at Jul 27, 2011 at 6:22 am
    HBase doesn't really have a lot of flexibility in how it executes a query.

    A get is a get.

    A scan is a scan.

    A filtered scan is a scan.

    There might be a few diagnostics that would tell you how many records were
    rejected in a scan, which coprocessors executed, or which column families
    were used, but there isn't a lot of magic optimization going on. This is
    very different from SQL, where your query doesn't directly specify what
    actually happens.
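
    For instance (just a sketch against the client API of the day, with made-up table, row and value names; imports from org.apache.hadoop.hbase.* omitted), the client code states exactly which operation runs, and there is no planner in between:

    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");

    // A get is a get: fetch exactly one row by key.
    Get get = new Get(Bytes.toBytes("row-1"));
    Result single = table.get(get);

    // A scan is a scan: walk a key range in order.
    Scan scan = new Scan(Bytes.toBytes("row-1"), Bytes.toBytes("row-9"));

    // A filtered scan is still a scan: the same range is read,
    // the filter only decides which cells come back.
    scan.setFilter(new ValueFilter(CompareFilter.CompareOp.EQUAL,
        new BinaryComparator(Bytes.toBytes("some-value"))));
    ResultScanner results = table.getScanner(scan);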

  • Andrew Purtell at Jul 27, 2011 at 6:25 am
    > Or is this a completely different way of thinking?

    Yes.

    There isn't an "execution plan" when using HBase, as that term is commonly understood in the RDBMS world. The commands you issue against HBase through the client API are executed in order, as you issue them.

    > Depending on the access pattern, we might end up using e.g. RegEx filters on row keys. I wonder if there is some kind of execution plan when running an HBase query to better understand ...

    Exposing filter statistics (hit/skip ratio etc.) and other per-query metrics, such as the number of store files read or how many keys were examined, is an interesting idea, perhaps along the lines of what you ask, but HBase does not currently support that level of query performance introspection.

    What people do is measure the application metrics of interest and try different approaches to optimize them.
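
    The simplest version of that is timing each variant from the client and counting what comes back (a generic sketch only; 'table' and 'scan' stand for whichever connection and scan you are comparing):

    long start = System.nanoTime();
    int rows = 0;
    ResultScanner scanner = table.getScanner(scan);
    try {
        for (Result r : scanner) {
            rows++;
        }
    } finally {
        scanner.close();
    }
    System.out.printf("%d rows in %.1f ms%n", rows, (System.nanoTime() - start) / 1e6);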

    Best regards,


    - Andy

    Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)


  • Steinmaurer Thomas at Aug 4, 2011 at 10:58 am
    Hi Andy and Ted!

    Thanks for your reply. Basically, I'm currently trying a range scan and a regex row filter on a very small table (~115K rows), just to get used to things. Hadoop/HBase etc. is running in the available Cloudera VM.

    I have the following row key, as already discussed in other threads.

    vehicle_id: up to 16 characters
    device_id: up to 16 characters
    timestamp: YYYYMMDDhhmmss

    Pretty much one row every 5 minutes for a particular vehicle and device.

    Now I want to get the rows for an entire day for a particular vehicle and device.

    The following range scan implementation:

    Scan scan = new Scan();

    String startKey =
        String.format(HBASE_ROWKEY_DATASOURCEID_FORMAT, "57").replace(' ', '0')   // Vehicle ID
        + "-"
        + String.format(HBASE_ROWKEY_DATASOURCEID_FORMAT, "1").replace(' ', '0')  // Device ID
        + "-"
        + "20110808000000";
    String endKey =
        String.format(HBASE_ROWKEY_DATASOURCEID_FORMAT, "57").replace(' ', '0')   // Vehicle ID
        + "-"
        + String.format(HBASE_ROWKEY_DATASOURCEID_FORMAT, "1").replace(' ', '0')  // Device ID
        + "-"
        + "20110808235959";

    scan.setStartRow(Bytes.toBytes(startKey));
    scan.setStopRow(Bytes.toBytes(endKey));
    scan.addColumn(Bytes.toBytes("data_details"), Bytes.toBytes("temperature1_value"));

    Takes < 1 sec.

    Whereas the following regex-based row filter implementation:

    List<Filter> filters = new ArrayList<Filter>();

    RowFilter rf = new RowFilter(
        CompareFilter.CompareOp.EQUAL,
        new RegexStringComparator(".{14}57\\-.{15}1\\-20110808.{6}"));
    filters.add(rf);

    QualifierFilter qf = new QualifierFilter(
        CompareFilter.CompareOp.EQUAL,
        new RegexStringComparator("temperature1_value"));
    filters.add(qf);

    FilterList filterList1 = new FilterList(filters);
    scan.setFilter(filterList1);


    Takes around 6 sec on a very small table.


    We aren't sure if we need the regex row filter capabilities at all or if range scans are sufficient for our access pattern, but a better understanding of how to optimize the regex approach would be helpful.


    Thanks!

    Thomas



  • Jean-Daniel Cryans at Aug 4, 2011 at 4:58 pm
    Your first scan starts at an offset in the table and has a stop row,
    whereas the other one is a full table scan with filtering (if my
    understanding is correct). Compare the second scan against a straight-up
    full table scan and you should see where the slowdown comes from (it
    should come from the fact that it has to read everything).
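
    Put differently (a sketch only, reusing the startKey/endKey and filterList1 from your mail): if you still want the filters, keep the start/stop rows as well, so the filters only run over the bounded key range instead of over the whole table:

    Scan scan = new Scan();
    scan.setStartRow(Bytes.toBytes(startKey));   // same bounds as the fast range scan
    scan.setStopRow(Bytes.toBytes(endKey));
    scan.setFilter(filterList1);                 // filters now only see rows inside that range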

    Hope that helps,

    J-D


  • Michael Segel at Aug 4, 2011 at 10:05 pm
    Thomas,

    If I understand you correctly, you have a row key of A,B,C and you want to fetch only the rows matching A and C.
    You can use a start row of A
    and an end row of A1,

    so that the scan starts at the first row for the given vehicle_id and stops when the vehicle_id changes.

    You would then have to do a server-side filter on the values of C to get the timestamps for a given day.
    (You could do this with a client-side filter, but that means pushing all the data over the wire.)
    [Note: having said that, you could just do a client-side filter, since you only have 115K rows and you're going to get a subset of that returned by the range scan.]
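
    A rough sketch of that idea (assuming the 16-character zero-padded key parts and "-" separators from the earlier mail; the 34 in the regex is 16 + 1 + 16 + 1 characters before the timestamp, so adjust it if the layout differs):

    String vehiclePrefix = String.format("%16s", "57").replace(' ', '0');   // the "A" part, zero-padded
    Scan scan = new Scan();
    scan.setStartRow(Bytes.toBytes(vehiclePrefix));
    scan.setStopRow(Bytes.toBytes(vehiclePrefix + "~"));   // the "A1" idea: stop once the vehicle_id changes
    // server-side filter on the timestamp part ("C") for the requested day
    scan.setFilter(new RowFilter(CompareFilter.CompareOp.EQUAL,
        new RegexStringComparator(".{34}20110808.*")));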

    The idea of doing something like the following:
    SELECT *
    FROM TABLE
    WHERE A=x
    AND DAY(C) = y [or some variation]
    {A and C are part of a composite index}

    doesn't work in HBase.

    If your key were ACB, meaning that vehicle_id, timestamp, device_id formed the composite key, then you could do a start/stop range scan using A and C.
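
    For example (a hypothetical sketch of that alternative ACB layout, reusing the zero-padded vehicle prefix from the sketch above):

    // key layout assumed here: vehicle_id + "-" + timestamp + "-" + device_id
    String dayStart = vehiclePrefix + "-" + "20110808000000";
    String dayStop  = vehiclePrefix + "-" + "20110809000000";   // stop row is exclusive, so use the start of the next day
    Scan dayScan = new Scan(Bytes.toBytes(dayStart), Bytes.toBytes(dayStop));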

    Sorry if I'm missing something, since I jumped into the middle of the discussion.

    -Mike

