Grokbase Groups HBase user June 2016
Hi:

I'm currently using HBase 1.1.2 and am in the process of determining how
best to proceed with the column layout for an upcoming expansion of our
data pipeline.

Background:

Table A: billions of rows, 1.3 TB (with snappy compression), rowkey is sha1
Table B: billions of rows (more than Table A), 1.8 TB (with snappy
compression), rowkey is sha1


These tables represent data obtained via a combination batch/streaming
process. We want to expand our data pipeline to run an assortment of
analyses on these tables (both batch and streaming) and be able to store
the results in each table as appropriate. Table A is a set of unique
entries with some example data, whereas Table B is correlated to Table A
(via Table A's sha1), but is not de-duplicated (that is to say, it contains
contextual data).

For the expansion of the data pipeline, we want to store the results in
Table A if context is not needed, or in Table B if it is. We have a
theoretically unlimited number of different analyses that we may want to
perform and store results for; that is to say, I need to assume there will
be a substantial number of data sets stored in these tables, which will
grow over time and could each themselves be somewhat wide in terms of
columns.

Originally, I had considered storing these in column families, where each
analysis is grouped together in a different column family. However, I have
read in the HBase book documentation that HBase does not perform well with
many column families (a few default, ~10 max), so I have discarded this
option.

The next two options both involve using wide tables with many columns in a
single column family (e.g. "d"), where all the various analyses would be
grouped into the same family across a potentially large number of columns
in total. Each of these analyses needs to maintain its own versions so we
can correlate the data from each one. The variants that come to mind to
accomplish that, and on which I would appreciate some feedback, are:

    1. Use HBase's native versioning to store the version of the analysis
    2. Encode a version in the column name itself

I know HBase's native versions use the server's timestamp by default, but
they can take any long value, so we could assign a particular time value to
represent a version of a particular analysis. However, the doc also warns
that there could be negative ramifications, because HBase uses the version
timestamps internally for things like TTL handling and delete/maintenance
processing. Do people use versions in this way? Are the TTL issues of great
concern? (We likely won't be deleting things from the tables often, but
can't guarantee that we won't ever do so.)
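
To make option 1 concrete, here's a minimal sketch against the 1.1.x Java
client of what writing and reading a caller-specified version might look
like. The table/family/qualifier names and the version value are
illustrative, not ours:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class AnalysisVersionSketch {
        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("tableA"))) {
                byte[] row  = Bytes.toBytes("example-sha1"); // stand-in for a real sha1 rowkey
                byte[] fam  = Bytes.toBytes("d");
                byte[] qual = Bytes.toBytes("analysisfoo_column1");
                long version = 4L; // analysis version supplied as the cell timestamp

                // Write: pass our own long where the server timestamp would go.
                Put put = new Put(row);
                put.addColumn(fam, qual, version, Bytes.toBytes("result"));
                table.put(put);

                // Read back exactly that version. Note the family's VERSIONS
                // setting must be high enough to retain all analysis versions.
                Get get = new Get(row);
                get.addColumn(fam, qual);
                get.setTimeStamp(version);
                Result r = table.get(get);
            }
        }
    }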

Encoding a version in the column name itself would make the column names
bigger, and I know it's encouraged for column names to be as small as
possible.

Adjacent to the native-version-or-not question, there's the general column
naming. I was originally thinking of having a prefix followed by the column
name, optionally with the version in the middle depending on whether option
1 or 2 is chosen above. This would allow prefix filters to be used during
gets/scans to gather all columns for a given analysis type, etc., but it
would perhaps result in larger column names across billions of rows.

e.g. *analysisfoo_4_column1*
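
As a sketch of that prefix-filtered access (reusing the table handle and
illustrative names from the snippet above, plus
org.apache.hadoop.hbase.filter.ColumnPrefixFilter):

    // Gather all columns for one analysis type by qualifier prefix.
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("d"));
    scan.setFilter(new ColumnPrefixFilter(Bytes.toBytes("analysisfoo_")));
    try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result result : scanner) {
            // Each Result carries only the analysisfoo_* cells of its row.
        }
    }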

In practice, is this done, and can it perform well? Or is it better to pick
a fixed width and use a number in its place, which is then translated via,
say, another table?

e.g. *100000_1000_100000* (or something to that effect -- fixed-width
numbers that are stand-in IDs for potentially longer descriptions).
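
If the fixed-width route were taken, qualifier construction might look
something like the helper below. This is only a sketch; the widths and the
idea of a separate lookup table mapping IDs back to descriptions are
assumptions drawn from the example above:

    // Fixed-width numeric stand-ins for analysis, version, and column,
    // zero-padded so all qualifiers have identical length. The id-to-name
    // mapping would live in a separate lookup table.
    static byte[] qualifier(int analysisId, int version, int columnId) {
        // e.g. (100000, 1000, 100000) -> "100000_1000_100000"
        return Bytes.toBytes(String.format("%06d_%04d_%06d", analysisId, version, columnId));
    }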

Thanks,
- Ken


  • Ken Hampson at Jun 11, 2016 at 3:16 am
    I realize that was probably a bit of a wall of text... =)

    So, TL;DR: I'm wondering:
    1) If people have used and had good experiences with caller-specified
    version timestamps (esp. given the caveats in the HBase book doc re: issues
    with deletions and TTLs).

    2) About suggestions for optimal column naming for potentially large
    numbers of different column groupings for very wide tables.

    Thanks,
    - Ken
  • Anil gupta at Jun 11, 2016 at 6:47 pm
    My 2 cents:

#1. The HBase version timestamp is purely used for storing & purging
historical data on the basis of TTL. If you try to build an app that toys
with timestamps, you might run into issues, so you might need to be very
careful with that.

#2. HBase guidance is usually to keep column names to around 5-6 chars,
because HBase stores data as KVs. But it's hard to keep doing that in
**real-world apps**. When you use block encoding/compression, the
performance penalty of wide columns is reduced. For example, Apache Phoenix
uses FAST_DIFF encoding by default because its column names are not short.
Here is another blog post that discusses the performance of
encoding/compression:
http://hadoop-hbase.blogspot.com/2016/02/hbase-compression-vs-blockencoding_17.html
I have been using user-friendly column names (more readable, rather than
short abbreviations) and I still get decent performance in my apps.
(Obviously, YMMV; my apps are performing within our SLA.) In prod, I have a
table that has 1100+ columns, and its column names are not short. Hence, I
would recommend you go ahead with your non-short column naming. You might
need to try out different encodings/compression to see what provides the
best performance, e.g. along the lines of the sketch below.
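
A minimal sketch of that with the 1.x admin API (table/family names are
illustrative, and the new encoding only takes effect as store files are
rewritten by compactions):

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.io.compress.Compression;
    import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;

    public class EnableFastDiff {
        public static void main(String[] args) throws IOException {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Admin admin = conn.getAdmin()) {
                // FAST_DIFF block encoding plus Snappy compression on family "d".
                HColumnDescriptor d = new HColumnDescriptor("d");
                d.setDataBlockEncoding(DataBlockEncoding.FAST_DIFF);
                d.setCompressionType(Compression.Algorithm.SNAPPY);
                admin.modifyColumn(TableName.valueOf("tableA"), d);
            }
        }
    }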

    HTH,
    Anil Gupta
  • Ken Hampson at Jun 13, 2016 at 2:34 am
    Hi, Anil:

Thanks for the feedback! I'll proceed with the non-short column naming.
It's good to have feedback from real-world production cases.

    Thanks again,
    - Ken
