I'm about to design my HBase table, and I wonder:
1) What is the maximum number of column families that is still considered
reasonable? 100, 1K, 1M ... more?
2) What is a reasonable number of columns per column family? 100, 1K, 1M
... more?

Please advise.
Yonatan


  • Stack at Dec 20, 2008 at 11:19 pm

    yonatan maman wrote:
    I'm about to design my HBase table, and I wonder:
    1) What is the maximum number of column families that is still considered
    reasonable? 100, 1K, 1M ... more?
    Keep it small for now, I'd say, until we do more work on parallelizing the
    querying of different column families on the server side. I'd suggest low tens.
    2) What is a reasonable number of columns per column family? 100, 1K, 1M
    ... more?
    Low hundreds until we fix the bug that makes us slow when there are lots of
    columns.

    St.Ack
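    For reference, column families are declared up front in the table schema, while
    the columns (qualifiers) inside a family are created freely as rows are written.
    A minimal sketch of creating a table with a single small family, assuming the
    current HBase Java admin API rather than the 0.x client from the time of this
    thread, and an illustrative table name of "entities":

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.TableName;
        import org.apache.hadoop.hbase.client.Admin;
        import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
        import org.apache.hadoop.hbase.client.Connection;
        import org.apache.hadoop.hbase.client.ConnectionFactory;
        import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

        public class CreateEntitiesTable {
            public static void main(String[] args) throws Exception {
                Configuration conf = HBaseConfiguration.create();
                try (Connection conn = ConnectionFactory.createConnection(conf);
                     Admin admin = conn.getAdmin()) {
                    // A single family ("attr") stays well inside the "low tens" advice above;
                    // columns within the family appear implicitly as data is written.
                    admin.createTable(TableDescriptorBuilder
                        .newBuilder(TableName.valueOf("entities"))
                        .setColumnFamily(ColumnFamilyDescriptorBuilder.of("attr"))
                        .build());
                }
            }
        }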
  • Ryan LeCompte at Dec 20, 2008 at 11:24 pm
    Ah darn, I was just experimenting with changing my schema to support
    1000s of columns... however, once I did that I started running out of
    memory again. :-(

  • Yonatan maman at Dec 20, 2008 at 11:33 pm
    I want to use HBase to implement an Entity-attribute-value model
    (http://en.wikipedia.org/wiki/Entity-attribute-value_model); in an RDBMS it
    looks like:

    col1: entityID

    col2: attributeName

    col3: value


    Would it be reasonable to have one HBase table like this:

    entityID as row key
    attr as column family

    so suppose I have 2 entities:
    e1: has 2 attributes: a1 with value v1 and a2 with value v2
    e2: has 2 attributes: a1 with value v11 and a3 with value v33


    e1--->attr:a1=v1, attr:a2=v2
    e2--->attr:a1=v11, attr:a3=v33

    I guess the number of different attributes should be in the low hundreds (as
    suggested by Stack). Has this 'bug' been taken care of now? What will be the
    limit after it is fixed?

    -- Yonatan
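    A minimal sketch of writes against that layout, assuming a current HBase Java
    client; the "entities" table and "attr" family names simply mirror the example
    earlier in the thread:

        import java.util.Arrays;

        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.TableName;
        import org.apache.hadoop.hbase.client.Connection;
        import org.apache.hadoop.hbase.client.ConnectionFactory;
        import org.apache.hadoop.hbase.client.Put;
        import org.apache.hadoop.hbase.client.Table;
        import org.apache.hadoop.hbase.util.Bytes;

        public class EavWriter {
            private static final byte[] ATTR = Bytes.toBytes("attr");

            public static void main(String[] args) throws Exception {
                try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                     Table table = conn.getTable(TableName.valueOf("entities"))) {
                    // Row key = entity id; one column per attribute inside the "attr" family.
                    Put e1 = new Put(Bytes.toBytes("e1"));
                    e1.addColumn(ATTR, Bytes.toBytes("a1"), Bytes.toBytes("v1"));
                    e1.addColumn(ATTR, Bytes.toBytes("a2"), Bytes.toBytes("v2"));

                    Put e2 = new Put(Bytes.toBytes("e2"));
                    e2.addColumn(ATTR, Bytes.toBytes("a1"), Bytes.toBytes("v11"));
                    e2.addColumn(ATTR, Bytes.toBytes("a3"), Bytes.toBytes("v33"));

                    table.put(Arrays.asList(e1, e2));
                }
            }
        }

    Reading e1 back then returns only the attributes it actually has (attr:a1 and
    attr:a2), which is the sparse-row behaviour the EAV layout relies on.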





  • Thibaut_ at Dec 21, 2008 at 2:08 pm
    Hi,

    Just as a temporary fix, you could also use something like Google Protocol
    Buffers or Facebook's Thrift for the data modelling and only save the binary
    output in HBase.

    You will, however, lose the ability to filter on columns or fetch only the
    columns you are interested in, and must always fetch all of the data related
    to an entity.

    Thibaut
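    A rough sketch of that idea with Protocol Buffers; the entity.proto message
    below is purely hypothetical (it is not from this thread), and the Entity class
    the code refers to is what protoc would generate from it:

        // Hypothetical entity.proto, compiled with protoc:
        //   syntax = "proto3";
        //   message Entity {
        //     string entity_id = 1;
        //     map<string, string> attributes = 2;
        //   }
        import java.util.Map;

        import org.apache.hadoop.hbase.client.Put;
        import org.apache.hadoop.hbase.client.Result;
        import org.apache.hadoop.hbase.util.Bytes;

        public class BlobEntityCodec {
            private static final byte[] DATA = Bytes.toBytes("data");  // single family
            private static final byte[] BLOB = Bytes.toBytes("blob");  // single column

            // Serialize the whole entity into one cell value.
            static Put encode(String entityId, Map<String, String> attrs) {
                Entity entity = Entity.newBuilder()
                    .setEntityId(entityId)
                    .putAllAttributes(attrs)
                    .build();
                return new Put(Bytes.toBytes(entityId)).addColumn(DATA, BLOB, entity.toByteArray());
            }

            // The trade-off described above: the whole blob must be fetched and parsed on
            // the client; there is no server-side filtering on individual attributes.
            static Entity decode(Result result) throws Exception {
                return Entity.parseFrom(result.getValue(DATA, BLOB));
            }
        }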


  • Andrew Purtell at Dec 21, 2008 at 4:12 pm
    I use JSON for exactly this. A simple row/column/timestamp
    key leads to a compound structure encoding all of the object
    attributes, or maybe arrays of objects, etc. At the scale
    where HBase is an effective solution you need to
    denormalize ("insert time join") for query efficiency anyhow,
    and I can serve the results out as is. Most of the work then
    is done in the mapreduce tasks that produce and store the
    JSON encodings in batch. I also build several views of the
    data into multiple tables -- materialized views basically.
    At Hadoop/HBase scale, disk space is cheap, seek time is not.

    Because of this, query processing time is low enough that I can serve
    results right out of HBase without needing an intermediate caching layer
    such as memcached or Tokyo Cabinet (jgray's favorite).
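    A minimal sketch of that pattern, assuming the Jackson library for the JSON
    encoding (not named in this thread) and a single "data:json" cell per row:

        import java.util.Map;

        import com.fasterxml.jackson.databind.ObjectMapper;
        import org.apache.hadoop.hbase.client.Put;
        import org.apache.hadoop.hbase.util.Bytes;

        public class JsonEntityCodec {
            private static final ObjectMapper MAPPER = new ObjectMapper();
            private static final byte[] DATA = Bytes.toBytes("data");
            private static final byte[] JSON = Bytes.toBytes("json");

            // A batch job (e.g. a MapReduce task) builds Puts like this one; the stored
            // value is the JSON encoding of the already-denormalized object, so it can
            // be served out as-is at read time.
            static Put encode(String rowKey, Map<String, Object> entity) throws Exception {
                return new Put(Bytes.toBytes(rowKey))
                    .addColumn(DATA, JSON, MAPPER.writeValueAsBytes(entity));
            }
        }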
  • Dru Jensen at Dec 22, 2008 at 6:10 pm
    JSON+
    Question: Is it an acceptable design to use the timestamp as a data
    element?

    I am currently adding the date to the column name and setting the
    number of versions in the table to 1.

    Current: htable.put('table','family:date', 'JSON');

    What I would like to do is use the timestamp as a data element to
    store the date of the entry and set the number of versions to infinite.

    Proposed: htable.put('table', 'family:', 'JSON', 'date');

    Is this a good approach? Are there any gotchas? Is there a way to
    get all of the versions for a row/column in a single call? I need to
    graph the results over time.

  • Stack at Dec 23, 2008 at 5:40 pm

    You can do that, but why not just use the cell's timestamp/version?

    You can get multiple versions using the HTable get methods, but not using
    Scanners currently. With Scanners, only one version -- either at or just
    before the stipulated timestamp -- is returned.

    St.Ack
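    A sketch of both halves of that, assuming a current HBase Java client and the
    "family:" column from the proposal above (the family must be configured to keep
    enough versions, e.g. via ColumnFamilyDescriptorBuilder.setMaxVersions):

        import org.apache.hadoop.hbase.Cell;
        import org.apache.hadoop.hbase.CellUtil;
        import org.apache.hadoop.hbase.client.Get;
        import org.apache.hadoop.hbase.client.Put;
        import org.apache.hadoop.hbase.client.Result;
        import org.apache.hadoop.hbase.client.Table;
        import org.apache.hadoop.hbase.util.Bytes;

        public class VersionedJson {
            private static final byte[] FAMILY = Bytes.toBytes("family");
            private static final byte[] QUAL = Bytes.toBytes("");  // empty qualifier, as in "family:"

            // Write the JSON blob using the entry date as the cell timestamp.
            static void putVersion(Table table, String row, long dateMillis, byte[] json) throws Exception {
                table.put(new Put(Bytes.toBytes(row)).addColumn(FAMILY, QUAL, dateMillis, json));
            }

            // Fetch every stored version of the cell in a single Get and walk the history.
            static void printHistory(Table table, String row) throws Exception {
                Get get = new Get(Bytes.toBytes(row)).setMaxVersions();  // all versions
                Result result = table.get(get);
                for (Cell cell : result.getColumnCells(FAMILY, QUAL)) {
                    System.out.println(cell.getTimestamp() + " -> "
                        + Bytes.toString(CellUtil.cloneValue(cell)));
                }
            }
        }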
  • Yonatan maman at Dec 21, 2008 at 4:21 pm
    Can you elaborate on this approach?
    Specifically, how can I ask queries like:
    give me all entities that have attribute a1 with value v1 (and attribute a2
    with value v2)...

    -- Yonatan
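    For the column-per-attribute layout (not the serialized-blob one), that kind of
    value-based query can be pushed to the servers with a filtered scan; a sketch
    assuming a current HBase client and its SingleColumnValueFilter:

        import org.apache.hadoop.hbase.CompareOperator;
        import org.apache.hadoop.hbase.client.Result;
        import org.apache.hadoop.hbase.client.ResultScanner;
        import org.apache.hadoop.hbase.client.Scan;
        import org.apache.hadoop.hbase.client.Table;
        import org.apache.hadoop.hbase.filter.FilterList;
        import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
        import org.apache.hadoop.hbase.util.Bytes;

        public class AttributeQuery {
            private static final byte[] ATTR = Bytes.toBytes("attr");

            // Print all entities with attr:a1 = v1 AND attr:a2 = v2.
            static void findMatches(Table table) throws Exception {
                SingleColumnValueFilter a1 = new SingleColumnValueFilter(
                    ATTR, Bytes.toBytes("a1"), CompareOperator.EQUAL, Bytes.toBytes("v1"));
                a1.setFilterIfMissing(true);  // drop rows that lack the attribute entirely
                SingleColumnValueFilter a2 = new SingleColumnValueFilter(
                    ATTR, Bytes.toBytes("a2"), CompareOperator.EQUAL, Bytes.toBytes("v2"));
                a2.setFilterIfMissing(true);
                Scan scan = new Scan().setFilter(
                    new FilterList(FilterList.Operator.MUST_PASS_ALL, a1, a2));
                try (ResultScanner scanner = table.getScanner(scan)) {
                    for (Result row : scanner) {
                        System.out.println(Bytes.toString(row.getRow()));
                    }
                }
            }
        }

    A scan like this still visits every row, so at scale Andrew's approach of
    materializing additional tables (views keyed by the attribute values you query
    on) is usually the better fit.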

