Grokbase Groups HBase user May 2011
Hi,

We have a table split across multiple regions (approx. 50-60 regions at a
64 MB split size) with a rowid schema of
[ReverseTimestamp/itemtimestamp/customerid/itemid]. This stores the
activities for an item for a customer. We have lots of data for lots of
items per customer in this table.

When we try to look up the activities for an item over the last 30 days from
this table, we use a Scan with a RowFilter and RegexComparator. The scan
takes a lot of time (almost 15-20 secs) to return the activities for an
item.
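
For reference, the pattern described above looks roughly like this in the
0.90-era HBase Java client (the table name and regex are made up for
illustration; the comparator class is actually called RegexStringComparator).
Because the filter is a regex over the row key, HBase has no start/stop key
to narrow the scan, so every row in every region is read and tested:

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.CompareFilter;
    import org.apache.hadoop.hbase.filter.RegexStringComparator;
    import org.apache.hadoop.hbase.filter.RowFilter;

    public class SlowActivityScan {
      public static void main(String[] args) throws IOException {
        HTable table = new HTable(HBaseConfiguration.create(), "activities");
        Scan scan = new Scan();
        // Row-key regex: the server still visits every row to test it.
        scan.setFilter(new RowFilter(CompareFilter.CompareOp.EQUAL,
            new RegexStringComparator(".*/cust42/item123$")));
        ResultScanner scanner = table.getScanner(scan);
        try {
          for (Result r : scanner) {
            System.out.println(r); // process each matching activity row
          }
        } finally {
          scanner.close();
          table.close();
        }
      }
    }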

We are hooked up to the HBase tables directly from a web application, so a
response time of around 20 secs is unacceptable. We have also noticed that
whenever we do any scan-type operation, the response time is never in an
acceptable range for a web application.

Are we doing something wrong? If HBase scans are this slow, it would be
really hard to hook HBase up directly to any web application.

Could somebody please suggest how to improve this, or some other options
(design, architectural) to remedy this kind of issue when dealing with a lot
of data?

Note: We have tried setCaching and SingleColumnValueFilter, with no
significant effect.

---------------------------
Thanks & Regards
Himanish


  • Connolly Juhani at May 12, 2011 at 6:12 am
    By naming rows from the timestamp, the rowids are all going to be
    sequential when inserting. So all new inserts will go into the same
    region. When checking the last 30 days you will also be reading from the
    same region where all the writing is happening, i.e. the one that is
    already busy writing the edit log for all those entries. You might want
    to consider an alternative way of naming your rows that would spread the
    reading/writing more evenly; one possibility is sketched below.
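
    As one illustration of such an alternative (my own sketch, not from the
    thread): lead the key with the customer and item instead of the
    timestamp, so writes for different items land in different regions while
    one item's rows still sort together, newest first.

        import org.apache.hadoop.hbase.util.Bytes;

        public class ActivityKeys {
          // Hypothetical layout: customer/item first, reverse timestamp last.
          // Different items spread across regions; one item's activities stay
          // contiguous and sorted newest-first.
          static byte[] rowKey(String customerId, String itemId, long tsMillis) {
            long reverseTs = Long.MAX_VALUE - tsMillis;
            // Zero-pad so the byte order of the string matches numeric order.
            return Bytes.toBytes(customerId + "/" + itemId + "/"
                + String.format("%019d", reverseTs));
          }
        }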
    However, since you are naming rows by timestamps, you should be able to
    restrict the scan to a start and end date. You are doing this, right? If
    you're not, you are scanning every row in the table when you only need
    the rows between start and end.
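
    With the reverse-timestamp-first key you already have, that restriction
    looks roughly like this (a sketch; it assumes the reverse timestamp is
    encoded as a zero-padded decimal string so byte order matches time order):

        import java.io.IOException;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.client.HTable;
        import org.apache.hadoop.hbase.client.Result;
        import org.apache.hadoop.hbase.client.ResultScanner;
        import org.apache.hadoop.hbase.client.Scan;
        import org.apache.hadoop.hbase.util.Bytes;

        public class BoundedScan {
          public static void main(String[] args) throws IOException {
            long now = System.currentTimeMillis();
            long thirtyDays = 30L * 24 * 60 * 60 * 1000;
            // Reverse timestamps: newer rows have smaller keys, so the scan
            // starts at "now" and stops at "30 days ago".
            byte[] start = Bytes.toBytes(
                String.format("%019d", Long.MAX_VALUE - now));
            byte[] stop = Bytes.toBytes(
                String.format("%019d", Long.MAX_VALUE - (now - thirtyDays)));

            HTable table = new HTable(HBaseConfiguration.create(), "activities");
            Scan scan = new Scan(start, stop); // only the 30-day window is read
            scan.setCaching(500);              // fetch rows in batches per RPC
            ResultScanner scanner = table.getScanner(scan);
            try {
              for (Result r : scanner) {
                // still need to pick out the wanted item, client- or server-side
              }
            } finally {
              scanner.close();
              table.close();
            }
          }
        }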

    Someone may need to correct me, but based on my memory of the implementation
    scans are entirely sequential, so region a gets scanned, then b, then c. You
    could speed this up by scanning multiple regions in parallel processes and
    merging the results.
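
    A rough sketch of that approach (thread count and structure are my own;
    it assumes the merged results don't need to stay globally sorted):

        import java.io.IOException;
        import java.util.ArrayList;
        import java.util.List;
        import java.util.concurrent.Callable;
        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Executors;
        import java.util.concurrent.Future;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.client.HTable;
        import org.apache.hadoop.hbase.client.Result;
        import org.apache.hadoop.hbase.client.ResultScanner;
        import org.apache.hadoop.hbase.client.Scan;
        import org.apache.hadoop.hbase.util.Pair;

        public class ParallelScan {
          // One scan per region, run in parallel, results merged at the end.
          public static List<Result> scanAllRegions(final String tableName)
              throws Exception {
            HTable probe = new HTable(HBaseConfiguration.create(), tableName);
            Pair<byte[][], byte[][]> keys = probe.getStartEndKeys();
            probe.close();

            ExecutorService pool = Executors.newFixedThreadPool(8);
            List<Future<List<Result>>> futures =
                new ArrayList<Future<List<Result>>>();
            for (int i = 0; i < keys.getFirst().length; i++) {
              final byte[] start = keys.getFirst()[i];
              final byte[] stop = keys.getSecond()[i];
              futures.add(pool.submit(new Callable<List<Result>>() {
                public List<Result> call() throws IOException {
                  HTable table = new HTable(HBaseConfiguration.create(), tableName);
                  ResultScanner scanner = table.getScanner(new Scan(start, stop));
                  List<Result> rows = new ArrayList<Result>();
                  try {
                    for (Result r : scanner) {
                      rows.add(r);
                    }
                  } finally {
                    scanner.close();
                    table.close();
                  }
                  return rows;
                }
              }));
            }
            List<Result> merged = new ArrayList<Result>();
            for (Future<List<Result>> f : futures) {
              merged.addAll(f.get());
            }
            pool.shutdown();
            return merged;
          }
        }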
  • Ryan Rawson at May 12, 2011 at 6:21 am
    Scans are in serial.

    To use DB parlance, consider a Scan + filter the moral equivalent of a
    "SELECT * FROM <> WHERE col='val'" with no index, so a full table
    scan is engaged.

    The typical ways to solve performance issues are these:
    - arrange your data using the primary key so you can scan the smallest
    portion of the table possible.
    - use another table as an index; unfortunately HBase doesn't maintain it
    for you (see the sketch after this list).
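
    As an illustration of the second option (a sketch; the table and column
    names are made up, and HBase offers no transaction across the two tables,
    so the application must tolerate brief disagreement between them):

        import java.io.IOException;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.client.HTable;
        import org.apache.hadoop.hbase.client.Put;
        import org.apache.hadoop.hbase.util.Bytes;

        public class IndexedWrite {
          public static void writeActivity(byte[] mainRowKey, String customerId,
              String itemId, long tsMillis, byte[] activity) throws IOException {
            HTable main = new HTable(HBaseConfiguration.create(), "activities");
            HTable index = new HTable(HBaseConfiguration.create(), "activities-by-item");
            try {
              // Data row in the main table.
              Put data = new Put(mainRowKey);
              data.add(Bytes.toBytes("a"), Bytes.toBytes("activity"), activity);
              main.put(data);

              // Index row: customer/item first, so one item's activities are a
              // narrow key range; the value points back at the main row.
              byte[] indexKey = Bytes.toBytes(customerId + "/" + itemId + "/"
                  + String.format("%019d", Long.MAX_VALUE - tsMillis));
              Put ref = new Put(indexKey);
              ref.add(Bytes.toBytes("a"), Bytes.toBytes("ref"), mainRowKey);
              index.put(ref);
            } finally {
              main.close();
              index.close();
            }
          }
        }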

    -ryan
  • Himanish Kushary at May 12, 2011 at 8:32 pm
    Thanks for your help. We are implementing our own secondary index table
    to get rid of the scan and replace those calls with Gets.

    One common pattern we are following, to keep the frontend web application
    as performant as we expect, is to always use Gets from the UI instead of
    Scans.
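
    For example, a read path over an index shaped like the one sketched
    earlier could look like this (table and column names are illustrative,
    not our actual schema):

        import java.io.IOException;
        import java.util.ArrayList;
        import java.util.List;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.client.Get;
        import org.apache.hadoop.hbase.client.HTable;
        import org.apache.hadoop.hbase.client.Result;
        import org.apache.hadoop.hbase.client.ResultScanner;
        import org.apache.hadoop.hbase.client.Scan;
        import org.apache.hadoop.hbase.util.Bytes;

        public class IndexedRead {
          // Narrow scan over one item's index rows, then batched Gets on the
          // main table.
          public static Result[] lastThirtyDays(String customerId, String itemId)
              throws IOException {
            long now = System.currentTimeMillis();
            long cutoff = now - 30L * 24 * 60 * 60 * 1000;
            String prefix = customerId + "/" + itemId + "/";
            byte[] start = Bytes.toBytes(prefix
                + String.format("%019d", Long.MAX_VALUE - now));
            byte[] stop = Bytes.toBytes(prefix
                + String.format("%019d", Long.MAX_VALUE - cutoff));

            HTable index = new HTable(HBaseConfiguration.create(), "activities-by-item");
            HTable main = new HTable(HBaseConfiguration.create(), "activities");
            try {
              List<Get> gets = new ArrayList<Get>();
              ResultScanner scanner = index.getScanner(new Scan(start, stop));
              try {
                for (Result r : scanner) {
                  // Each index row stores the main-table row key as its value.
                  gets.add(new Get(
                      r.getValue(Bytes.toBytes("a"), Bytes.toBytes("ref"))));
                }
              } finally {
                scanner.close();
              }
              return main.get(gets); // one batched multi-get for the main rows
            } finally {
              index.close();
              main.close();
            }
          }
        }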

    Thanks
    Himanish
  • Ryan Rawson at May 12, 2011 at 8:53 pm
    Don't forget that a Get is just a 1-row scan; they share the same code
    path internally. The only difference, of course, is that a Get returns
    just that one row and is therefore fairly fast (unless your row is huge,
    think hundreds of MBs).

    -ryan
