FAQ
As we load more and more data into HBase we are finding "millions of
columns" to be a challenge for us. We have some very wide rows, and we are
taking 12-15 seconds to read those rows. Since HBase does not sort columns,
and thereby cannot support a scan of columns, we are seeing some serious
limitations in how we can model data in HBase. We always need to read the
entire row, thus taking a 15-second hit.

Is/has there been any talk about building in some support for sorted columns
and the ability to read/scan across columns? Millions of columns are
challenging if you can only read a single column, a list of columns, or the
entire thing. How does Bigtable support this? It seems that HBase is limited
as a column-based data store unless it can support this. Our columns are
truly dynamic, so we do not even necessarily know what they are in order to
request them by name in a list. We want to be able to read/scan them just
like we do rows.

We would love the ability to support the following read method (through
Thrift). We can of course do this on our own from the entire row, but that
requires reading the 2-million-column row into memory first.

getRowWithColumnRange(tableName, row, startColumn, stopColumn)

The above would be even better if it could be set up like a scanner where we
could stop at any point. Basically, instead of scanning rows we would scan
columns for a given row. This would be the best way to support an
offset/limit pattern.

colScanID = colScannerOpenWithStop(tableName, row, startColumn, stopColumn)
colScannerGetList(colScanID, 1000)
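
(For illustration only: below is a rough sketch of what getRowWithColumnRange
could look like against the native Java client, assuming an HBase release that
ships ColumnRangeFilter and using made-up names -- table "mytable", family
"cf", row "wide-row". Nothing like this is exposed through Thrift, which is
the gap described above.)

// Hypothetical sketch: read only the qualifiers of one wide row that fall in
// [startColumn, stopColumn), in batches, instead of pulling the whole row.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.ColumnRangeFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class GetRowWithColumnRange {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");        // assumed table name

    byte[] row = Bytes.toBytes("wide-row");            // assumed row key
    // Restrict the scan to this single row: the stop row is the row key plus
    // a trailing 0x00 byte, the smallest key strictly greater than it.
    Scan scan = new Scan(row, Bytes.add(row, new byte[] {0}));
    scan.addFamily(Bytes.toBytes("cf"));               // assumed column family
    // Only qualifiers in [startColumn, stopColumn) come back.
    scan.setFilter(new ColumnRangeFilter(
        Bytes.toBytes("startColumn"), true,
        Bytes.toBytes("stopColumn"), false));
    scan.setBatch(1000);                               // at most 1000 cells per Result

    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result chunk : scanner) {
        for (KeyValue kv : chunk.raw()) {
          // process one cell of the requested column range
        }
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}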

Of course, once these changes were in place, people would push the size of
rows even further. We have seen somewhere around 20+ million columns cause
OOM errors. One row per region should be the theoretical limit on row size,
but I am sure more work is needed to ensure that this holds.

Thanks.


  • Stack at Aug 10, 2011 at 6:34 pm

    On Wed, Aug 10, 2011 at 2:39 AM, Wayne wrote:
    As we load more and more data into HBase we are seeing the "millions of
    columns" to be a challenge for us. We have some very wide rows and we are
    taking 12-15 seconds to read those rows.
    How many columns when it's taking this long, Wayne?

    Since HBase does not sort columns
    They are sorted.
    and thereby can not support a scan of columns
    How do you mean? You only want a subset of the columns? Can you add
    a filter, or add some subset of the columns to the Scan specification?

    You can also read just a piece of the row, if that is all you are
    interested in (though you are on the other side of Thrift, right, and this
    facility may not be exposed -- I have not checked).
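
    (A minimal sketch of the "ask for a subset of the columns" route, assuming
    a hypothetical table "mytable", family "cf", and qualifiers that are known
    up front -- which, per the original mail, they may not be.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class GetSubsetOfColumns {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");       // assumed table name
        byte[] cf = Bytes.toBytes("cf");                  // assumed column family

        Get get = new Get(Bytes.toBytes("wide-row"));     // assumed row key
        get.addColumn(cf, Bytes.toBytes("colA"));         // only the named qualifiers
        get.addColumn(cf, Bytes.toBytes("colB"));         // come back, not the whole row

        Result result = table.get(get);
        byte[] valueA = result.getValue(cf, Bytes.toBytes("colA"));
        table.close();
      }
    }
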
    Is/has there been any talk about building in some support for sorted columns
    and the ability to read/scan across columns? Millions of columns are
    challenging if you can only read a single column/list of columns or the
    entire thing.
    When you say read/scan across columns, can you say more about what you'd
    like? You'd like to read N columns at a time?
    How does bigtable support this? It seems that hbase is limited
    as a column based data store unless it can support this. Our columns are
    truly dynamic so we do not even necessarily know what they are to request
    them by name in a list. We want to be able to read/scan them just like for
    rows.
    In Java you'd do
    http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html#setBatch(int)
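
    (Roughly, and assuming a hypothetical table "mytable" with family "cf":
    setBatch caps how many cells come back in each Result, so a very wide row
    arrives in chunks and the client can stop as soon as it has enough.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ScanWideRowInChunks {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");        // assumed table name
        byte[] row = Bytes.toBytes("wide-row");            // assumed row key

        // Limit the scan to the one row; the stop row is the key plus 0x00.
        Scan scan = new Scan(row, Bytes.add(row, new byte[] {0}));
        scan.addFamily(Bytes.toBytes("cf"));               // assumed column family
        scan.setBatch(100);                                // 100 cells per Result

        ResultScanner scanner = table.getScanner(scan);
        try {
          for (Result chunk : scanner) {                   // each chunk <= 100 cells
            for (KeyValue kv : chunk.raw()) {
              // qualifiers arrive in sorted order; use them, then...
            }
            break;                                         // ...stop after the first chunk
          }
        } finally {
          scanner.close();
          table.close();
        }
      }
    }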

    We would love the ability to support the following read method (through
    Thrift). We can of course do this on our own from the entire row but it
    requires reading the 2 million col row into memory first.
    How big are the cells? How big is the 2M-column row? You don't know the
    names, but do they fit a pattern that you could filter on? (Though
    again, filters are not exposed in Thrift, though that looks like it's
    getting fixed.)
    getRowWithColumnRange(tableName, row, startColumn, stopColumn)

    The above would be even better if it could be set up like a scanner where we
    could stop at any point. Basically instead of scanning rows we would scan
    columns for a given row. This would be the best way to support an offset,
    limit pattern.

    colScanID = colScannerOpenWithStop(tableName, row, startColumn, stopColumn)
    colScannerGetList(colScanID, 1000)

    Of course once these changes occurred people would be pushing the size of
    rows even more. We have seen somewhere around 20+ million columns cause OOM
    errors. One row per region should be the theoretical limit to the row size,
    but there is more work needed I am sure to ensure that this is true.
    The above look useful. Stick them into an issue, Wayne.

    St.Ack
    P.S. I'm still working (slowly) on the recovery tool you asked for in
    your last mail.
  • Wayne at Aug 10, 2011 at 7:08 pm
    I think you are right in that Thrift is all we see and it is very limited.
    Comments in-line.
    On Wed, Aug 10, 2011 at 8:33 PM, Stack wrote:
    On Wed, Aug 10, 2011 at 2:39 AM, Wayne wrote:
    As we load more and more data into HBase we are seeing the "millions of
    columns" to be a challenge for us. We have some very wide rows and we are
    taking 12-15 seconds to read those rows.
    How many columns when it's taking this long, Wayne?
    ~2 million columns take 15 seconds.
    Since HBase does not sort columns
    They are sorted.

    They are not sorted (that we can see). Columns come back in the order they
    were saved to the row, or by some other logic I am not sure of, but columns
    are not sorted like rows. We have always expected/wanted columns to be
    sorted on retrieval, but they are not. Googling this, it seems consistent
    with comments out there.

    and thereby can not support a scan of columns
    How do you mean? You only want a subset of the columns? Can you add
    a filter or add some subset of the columns to the Scan specification?
    Yes, we only want a subset of columns. Thrift has no filters... that I know
    of? We can ask for a specific column or a list of columns, but since we do
    not know up front what the columns even are, it does not help us.
    You can also read a piece of the row only if that is all you are
    interested in (though you are on other side of thrift, right, and this
    facility may not be exposed -- I have not checked)
    Not exposed... that I know of in Thrift... this would be great to be able
    to do. We would love to "chunk" the row back and thereby start getting data
    faster. Waiting 15 seconds for anything is a real problem for us.
    Is/has there been any talk about building in some support for sorted columns
    and the ability to read/scan across columns? Millions of columns are
    challenging if you can only read a single column/list of columns or the
    entire thing.
    When you say read/scan across columns, can you say more what you'd
    like? You'd like to read N columns at a time?
    We would like most of all to read N columns. It is the offset/limit
    problem: give me the first 100, then the next 100 starting at offset 100,
    and so on. That is what we are trying to support. We could easily get
    something like this to work by scanning the columns and stopping once we
    have enough, or even by chunking back parts of the row and stopping once
    we have enough. Right now we read 2 million values in 15 seconds and return
    100 of them to the end user. We would prefer to read 100 from HBase and
    return 100 to the user in 50 ms.
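
    (For what it's worth, that offset/limit shape looks a lot like the Java
    client's ColumnPaginationFilter -- a sketch follows, assuming a hypothetical
    table "mytable", family "cf", and a release that ships that filter; it is
    not reachable through Thrift for us today.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.filter.ColumnPaginationFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PageThroughColumns {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");        // assumed table name

        Get get = new Get(Bytes.toBytes("wide-row"));      // assumed row key
        get.addFamily(Bytes.toBytes("cf"));                // assumed column family
        // "Give me 100 columns starting at column offset 100": the filter skips
        // the first 100 qualifiers and returns the next 100, so the client
        // never pulls back the whole 2-million-column row.
        get.setFilter(new ColumnPaginationFilter(100, 100));

        Result page = table.get(get);
        for (KeyValue kv : page.raw()) {
          // one page of at most 100 columns
        }
        table.close();
      }
    }
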
    How does bigtable support this? It seems that hbase is limited
    as a column based data store unless it can support this. Our columns are
    truly dynamic so we do not even necessarily know what they are to request
    them by name in a list. We want to be able to read/scan them just like for
    rows.
    In java you'd do

    http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html#setBatch(int)

    We would love the ability to support the following read method (through
    Thrift). We can of course do this on our own from the entire row but it
    requires reading the 2 million col row into memory first.
    How big are the cells? How big is the 2M row? You don't know the
    name but do they fit a pattern that you could filter on? (Though
    again, filters are not exposed in thrift though that looks like its
    getting fixed)
    Not sure about filters; we will have to look into them more closely, as
    they are not exposed in Thrift yet. The cells are small -- simple doubles
    or varchar(50)-type stuff. Our row keys and column keys are actually much
    bigger than the values. Not sure about the row size, but it is pretty big.
    The thing is, we don't want to read the whole thing back; we would instead
    prefer to read parts of the row in a way that lets us iterate through
    pages, again to support the offset/limit pattern.
    getRowWithColumnRange(tableName, row, startColumn, stopColumn)

    The above would be even better if it could be set up like a scanner where we
    could stop at any point. Basically instead of scanning rows we would scan
    columns for a given row. This would be the best way to support an offset,
    limit pattern.

    colScanID = colScannerOpenWithStop(tableName, row, startColumn, stopColumn)
    colScannerGetList(colScanID, 1000)

    Of course once these changes occurred people would be pushing the size of
    rows even more. We have seen somewhere around 20+ million columns cause OOM
    errors. One row per region should be the theoretical limit to the row size,
    but there is more work needed I am sure to ensure that this is true.
    The above look useful. Stick them into an issue, Wayne.
    Ok.
    St.Ack
    P.S. I'm still working (slowly) on the recovery tool you asked for in
    your last mail.
    Thanks. I am hopeful something will be there before we need it!!
