Hi All,

Does anyone have comments on how HBase will perform in a 4-node cluster
compared to an equivalent MySQL configuration?

Thanks,

Rafael

  • Michael Bieniosek at Oct 11, 2007 at 11:20 pm
    MySQL and HBase are optimized for different operations. What are you trying to do?

    -Michael

  • Jim Kellerman at Oct 12, 2007 at 2:35 am
    Performance always depends on the work load. However, having said
    that, you should read Michael Stonebraker's paper "The End of an
    Architectural Era (It's Time for a Complete Rewrite)" which was
    presented at the Very Large Data Bases (VLDB) conference. You can
    find a PDF copy of the paper here:
    http://www.vldb.org/conf/2007/papers/industrial/p1150-stonebraker.pdf

    In this paper he presents compelling evidence that column oriented
    databases (HBase is a column oriented database) can outperform
    traditional RDBMS systems (MySQL) by an order of magnitude or more
    for almost every kind of work load. Here's a brief summary of why
    this is so:

    - writes: a row oriented database writes the whole row regardless
    of whether values are supplied for every field.
    Space is reserved for null fields, so the number of bytes
    written is the same for every row. In a column oriented
    database, only the columns for which values are supplied are
    written. Nulls are free. Also row oriented databases must write
    a row descriptor so that when the row is read, the column values
    can be found.

    - reads: Unless every column is being returned on a read, a column
    oriented database is faster because it only reads the columns
    requested. The row oriented database must read the entire row,
    figure out where the requested columns are and only return that
    portion of the data read.

    - compression: works better on a column oriented database because
    the data is similar, and stored together, which is not the case
    in a row oriented database.

    - scans: suppose you have a 600GB database with 200 columns of
    equal length (the TPC-H OLTP benchmark has 212 columns) but
    while you are scanning the table you only want to return 5
    of the columns. Each column takes up 3GB of the 600GB. A row
    oriented database will have to read the entire 600GB to extract
    the 15GB of data desired. Think about how long it takes to read
    600GB vs 15GB. Furthermore, in a column oriented database, each
    column can be read in parallel, and the inner loop only executes
    once per column rather than once per row as in the row oriented
    database.

    - bulk loads: row oriented databases have to construct their
    indexes as the load progresses, so even if the load goes from
    low value to high, btrees must be split and reorganized. For
    column oriented databases, this is not true.

    - adding capacity: in a row oriented database, you generally have
    to dump the database, create a new partitioning scheme and then
    load the dumped data into a new database. With HBase, storage
    is only limited by the DFS. Need more storage? Add another data
    node.
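
    To make the scan arithmetic concrete, here is a minimal sketch in
    Python. It assumes, as the example does, that every column has the
    same width and that a scan reads whole columns. The function name
    scan_bytes_gb is made up for illustration. With those figures, five
    columns of 3GB each come to 15GB.

```python
def scan_bytes_gb(total_gb, total_columns, wanted_columns):
    """Return (row_store_gb, column_store_gb) read by one full-table scan.

    Assumption: equal-width columns, no indexes, no compression.
    """
    per_column_gb = total_gb / total_columns
    row_store_gb = float(total_gb)                    # every row is read in full
    column_store_gb = per_column_gb * wanted_columns  # only the requested columns
    return row_store_gb, column_store_gb


row_gb, col_gb = scan_bytes_gb(600, 200, 5)
print(f"row store reads {row_gb} GB, column store reads {col_gb} GB")
```

    This only counts bytes read. It says nothing about seek patterns or
    the cost of stitching columns back into rows.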

    We have done almost no tuning for HBase, but I'd be willing to bet
    that it would handily beat MySQL in a drag race.

    ---
    Jim Kellerman, Senior Engineer; Powerset
    jim@powerset.com

  • Jeff Hammerbacher at Oct 12, 2007 at 4:20 pm
    hmm, i'm going to have to disagree strongly with jim here on several points:

    1) the paper you reference has nothing to do with column-store performance:
    it's all about a new, in-memory oltp system being worked on in stonebraker's
    lab/vertica. it's mainly about removing disk access via replication (rather
    than maintaining a redo log) and being smart about partitioning your data to
    maximize "one-site" transactions.
    2) column store technology has been around for a while; sybase iq would rule
    the world if column-oriented data stores were a one-size-fits-all solution
    to every database problem.
    3) you totally ignore the impact of having an in-memory "write-optimized
    store" to amortize the cost of writes to the on-disk "read-optimized store"
    (memtable and sstable in bigtable parlance--dunno what they're called in
    hbase). otherwise, write and bulk load performance for a column-oriented
    data store is generally atrocious.
    4) your section on "adding capacity" has NOTHING at all to do with
    organizing your data on disk in a column-oriented fashion; it's a property
    of any reasonably well-designed horizontally partitioned data store.
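
    For anyone unfamiliar with the split in point 3, here is a toy
    sketch in Python of the idea: writes land in an in-memory table and
    are flushed in batches as sorted, immutable runs. TinyLSM and its
    methods are hypothetical names, the "runs" are plain lists standing
    in for on-disk files, and this is not how BigTable or HBase actually
    implement it.

```python
import bisect


class TinyLSM:
    """Toy write-optimized store: buffer writes in memory, flush sorted runs."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold
        self.memtable = {}  # mutable, in memory, absorbs every write
        self.runs = []      # flushed runs, newest last; each a sorted list of (key, value)

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # One batched, sorted write instead of many small in-place updates.
        if self.memtable:
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):  # the newest run wins
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None


store = TinyLSM(flush_threshold=2)
store.put("a", 1)
store.put("b", 2)   # reaches the threshold and triggers a flush
store.put("a", 3)   # newer value stays in the memtable
print(store.get("a"), store.get("b"), len(store.runs))
```

    The sketch leaves out merging of runs (compaction) and any
    durability log, both of which a real store would need.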

    there's a ton of hot air around this space in general, so refraining from
    making claims like "column oriented databases ... can outperform traditional
    RDBMS systems ... by an order of magnitude or more for almost every kind of
    work load" will prevent my head from exploding.
    thanks,
    jeff
  • Jim Kellerman at Oct 12, 2007 at 4:44 pm
    FYI: I just heard Stonebraker talk at the High Performance Transaction Systems Workshop this week. His presentation focused on column oriented databases and not just in memory databases.

    His talk was quite controversial with the traditional database folks, but he did make some valid points.

    I had no intention of making your head explode, but rather to get people to at least rethink the conventional wisdom surrounding row oriented databases. After all, Stonebraker wrote the databases that most modern ones are built from. He should know something about the topic.

    ---
    Jim Kellerman, Senior Engineer; Powerset
    jim@powerset.com

  • Jonathan Hendler at Oct 12, 2007 at 5:55 pm
    One of the valid points Stonebraker makes, I think, has to do with
    compression (and null values). For example - does HBase also offer
    tools, or a strategy for compression? Maybe it's comparing apples to
    [whatever].

    Since Vertica is also a distributed database, I think it may be
    interesting to the newbies like myself on the list. To keep the
    conversation topical - while it's true there's a major campaign of PR
    around Vertica, I'd be interested in hearing more about how HBase
    compares with other "column stores" or hybrids. There's a lot of
    discussion in Semantic Web communities about these systems, since row
    databases don't "scale well" for arbitrary reading of apparently
    randomized, unstructured directed graphs. I'm NOT speaking from VAST
    experience in this, but enough to know that there might be some fire in
    the hot air. To experienced DBAs it can seem like a collection of "cheap
    tricks" - but a collection of cheap tricks is as revolutionary as things
    might get until we all have quantum computers running on Mr. Fusions.

    Really, Hadoop, HDFS, HBase, etc. have such a range of potential uses
    that I'm looking for the broad view of "to Hadoop or not Hadoop".
  • Doug Cutting at Oct 12, 2007 at 6:30 pm

    Jonathan Hendler wrote:
    Since Vertica is also a distributed database, I think it may be
    interesting to the newbies like myself on the list. To keep the
    conversation topical - while it's true there's a major campaign of PR
    around Vertica, I'd be interested in hearing more about how HBase
    compares with other "column stores" or hybrids.

    Vertica is presumably based on C-Store. C-Store seems not optimized for
    immediate query of recently updated data, but rather for delayed queries
    over mostly read-only data warehouses. HBase (modeled after BigTable)
    is instead optimized for real-time access to read-write data. So I
    think it depends a bit on what your application needs.

    E.g., from the C-Store paper: "we expect read-only queries to be run in
    historical mode. In this mode, the query selects a timestamp, T, less
    than the one of the most recently committed transactions [...]"
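
    As a rough illustration of what a read "as of" a timestamp T means,
    here is a minimal sketch in Python. VersionedCell, write and
    read_as_of are hypothetical names chosen for this example; it is not
    C-Store's or HBase's actual implementation, only the general idea of
    keeping timestamped versions and reading the latest one at or
    before T.

```python
import bisect


class VersionedCell:
    """Keeps every written version of one value, ordered by timestamp."""

    def __init__(self):
        self.timestamps = []  # sorted ascending
        self.values = []      # values[i] was written at timestamps[i]

    def write(self, ts, value):
        i = bisect.bisect_right(self.timestamps, ts)
        self.timestamps.insert(i, ts)
        self.values.insert(i, value)

    def read_as_of(self, ts):
        """Return the newest value written at or before ts, or None."""
        i = bisect.bisect_right(self.timestamps, ts)
        return self.values[i - 1] if i else None


cell = VersionedCell()
cell.write(10, "old")
cell.write(20, "new")
print(cell.read_as_of(15))  # a historical read sees only writes up to T=15
```

    A reader that picks T slightly in the past never sees later writes,
    which is why this style suits queries over mostly read-only data.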

    Doug
  • Jim Kellerman at Oct 12, 2007 at 9:47 pm
    Stonebraker has a new column oriented store called H-Store. It is also talked about in the paper.

    And now I'll shut up. I didn't intend to create such a firestorm.

    ---
    Jim Kellerman, Senior Engineer; Powerset
    jim@powerset.com

  • Toby DiPasquale at Oct 12, 2007 at 10:11 pm

    On 10/12/07, Jim Kellerman wrote:
    Stonebraker has a new column oriented store called H-Store. It is also talked about in the paper.

    H-Store is not column oriented. He only borrows certain techniques
    from his work on C-Store.
    From the paper: section 4.1, second paragraph, first sentence: "At
    each site in the grid, rows of tables are placed contiguously in main
    memory, with conventional B-tree indexing."

    --
    Toby DiPasquale
  • Joydeep Sen Sarma at Oct 12, 2007 at 6:48 pm
    As Doug pointed out - Vertica is for warehouse processing, HBase for
    real-time online processing.

    Compression of on-disk data helps in the former case, since the queries
    scan large amounts of data and disk/bus/memory serial bandwidth
    bottlenecks are common. It's akin to map-reduce. Also data sizes are
    orders of magnitude larger than for real-time data stores.

    For real-time data stores - accesses are random and bottlenecks are more
    likely to be random disk ops. In this case compression (on-disk) has
    little benefit (seek time dominates data transmission time). On the
    other hand - compression of in-memory data often helps (since better use
    of cache reduces need for disk ios). (Also randomly accessing compressed
    on-disk data is generally expensive).
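
    A back-of-the-envelope sketch in Python of why seek time dominates
    for random reads but not for scans. The disk figures (10 ms seek,
    50 MB/s transfer, 64 KB blocks, 2x compression) are assumptions
    picked for illustration, not measurements, and the function names
    are made up. Decompression CPU cost is ignored.

```python
def random_read_ms(block_kb, seek_ms, transfer_mb_per_s, compression_ratio=1.0):
    """Time to fetch one block at a random offset: one seek plus the transfer."""
    on_disk_mb = block_kb / compression_ratio / 1024
    return seek_ms + on_disk_mb / transfer_mb_per_s * 1000


def sequential_scan_s(total_mb, transfer_mb_per_s, compression_ratio=1.0):
    """Time to stream a whole data set; seeks are negligible and left out."""
    return total_mb / compression_ratio / transfer_mb_per_s


# Assumed single-disk figures: 10 ms seek, 50 MB/s, 64 KB block.
plain = random_read_ms(64, seek_ms=10, transfer_mb_per_s=50)
packed = random_read_ms(64, seek_ms=10, transfer_mb_per_s=50, compression_ratio=2.0)
print(f"random read: {plain:.3f} ms plain vs {packed:.3f} ms compressed")

scan_plain = sequential_scan_s(1024, 50)
scan_packed = sequential_scan_s(1024, 50, compression_ratio=2.0)
print(f"1 GB scan: {scan_plain:.2f} s plain vs {scan_packed:.2f} s compressed")
```

    Under these assumptions, halving the bytes on disk barely changes
    the random read but roughly halves the scan time.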

    -----Original Message-----
    From: Jonathan Hendler
    Sent: Friday, October 12, 2007 10:53 AM
    To: hadoop-user@lucene.apache.org
    Subject: Re: HBase performance

    One of the valid points Stonebraker makes, I think, has to do with
    compression (and null values). For example - does HBase also offer
    tools, or a strategy for compression? Maybe it's comparing apples to
    [whatever].

    Since Vertica is also a distributed database, I think it may be
    interesting to the newbies like myself on the list. To keep the
    conversation topical - while it's true there's a major campaign of PR
    around Vertica, I'd be interested in hearing more about how HBase
    compares with other "column stores" or hybrids. There's a lot of
    discussion in Semantic Web communities about these systems, since row
    databases don't "scale well" for arbitrary reading of apparently
    randomized, unstructured directed graphs. I'm NOT speaking from VAST
    experience in this, but enough to know that there might be some fire in
    the hot air. To experienced DBAs it can seem like a collection of "cheap
    tricks" - but a collection of cheap tricks is as revolutionary as things
    might get until we all have quantum computers running on Mr. Fusions.

    Really, Hadoop, HDFS, Hbase, etc has such a range of potential uses
    that I'm looking for the broad view of "to Hadoop or not Hadoop".





    Jim Kellerman wrote:
    FYI: I just heard Stonebraker talk at the High Performance Transaction
    Systems Workshop this week. His presentation focused on column oriented
    databases and not just in memory databases.
    His talk was quite controversial with the traditional database folks,
    but he did make some valid points.
    I had no intention of making your head explode, but rather to get
    people to at least rethink the conventional wisdom surrounding row
    oriented databases. After all Stonebraker wrote databases that most
    modern ones are built from. He should know something about the topic.
    ---
    Jim Kellerman, Senior Engineer; Powerset
    jim@powerset.com


    -----Original Message-----
    From: Jeff Hammerbacher
    Sent: Friday, October 12, 2007 9:20 AM
    To: hadoop-user@lucene.apache.org
    Subject: Re: HBase performance

    hmm, i'm going to have to disagree strongly with jim here on
    several points:

    1) the paper you reference has nothing to do with
    column-store performance:
    it's all about a new, in-memory oltp system being worked on
    in stonebraker's lab/vertica. it's mainly about removing
    disk access via replication (rather than maintaining a redo
    log) and being smart about partitioning your data to maximize
    "one-site" transactions.
    2) column store technology has been around for a while;
    sybase iq would rule the world if column-oriented data stores
    were a one-size-fits-all solution to every database problem.
    3) you totally ignore the impact of having an in-memory
    "write-optimized store" to amortize the cost of writes to the
    on-disk "read-optimized store"
    (memtable and sstable in bigtable parlance--dunno what
    they're called in hbase). otherwise, write and bulk load
    performance for a column-oriented data store is generally atrocious.
    4) your section on "adding capacity" has NOTHING at all to do
    with organizing your data on disk in a column-oriented
    fashion; it's a property of any reasonably well-designed
    horizontally partitioned data store.

    there's a ton of hot air around this space in general, so
    refraining from making claims like "column oriented databases
    ... can outperform traditional RDBMS systems ... by an order
    of magnitude or more for almost every kind of work load" will
    prevent my head from exploding.
    thanks,
    jeff
    On 10/11/07, Jim Kellerman wrote:

    12345678901234567890123456789012345678901234567890123456789012345

    Performance always depends on the work load. However, having said
    that, you should read Michael Stonebraker's paper "The End of an
    Architectural Era (It's Time for a Complete Rewrite)" which was
    presented at the Very Large Database Conference. You can find a PDF
    copy of the paper here:

    http://www.vldb.org/conf/2007/papers/industrial/p1150-stonebraker.pdf
    In this paper he presents compelling evidence that column oriented
    databases (HBase is a column oriented database) can outperform
    traditional RDBMS systems (MySql) by an order of magnitude
    or more for
    almost every kind of work load. Here's a brief summary of
    why this is
    so:

    - writes: a row oriented database writes the whole row regardless
    of whether or not values are supplied for every field or not.
    Space is reserved for null fields, so the number of bytes
    written is the same for every row. In a column oriented
    database, only the columns for which values are supplied are
    written. Nulls are free. Also row oriented databases must write
    a row descriptor so that when the row is read, the column values
    can be found.

    - reads: Unless every column is being returned on a read, a column
    oriented database is faster because it only reads the columns
    requested. The row oriented database must read the entire row,
    figure out where the requested columns are and only return that
    portion of the data read.

    - compression: works better on a column oriented database because
    the data is similar, and stored together, which is not the case
    in a row oriented database.

    - scans: suppose you have a 600GB database with 200 columns of
    equal length (the TPC-H OLTP benchmark has 212 columns) but
    while you are scanning the table you only want to return 5
    of the columns. Each column takes up 3GB of the 600GB. A row
    oriented database will have to read the entire 600GB to extract
    the 20GB of data desired. Think about how long it takes to read
    600GB vs 20GB. Furthermore, in a column oriented database, each
    column can be read in parallel, and the inner loop only executes
    once per column rather than once per row as in the row oriented
    database.

    - bulk loads: row oriented databases have to construct their
    indexes as the load progresses, so even if the load goes from
    low value to high, btrees must be split and reorganized. For
    column oriented databases, this is not true.

    - adding capacity: in a row oriented database, you generally have
    to dump the database, create a new partitioning scheme and then
    load the dumped data into a new database. With HBase, storage
    is only limited by the DFS. Need more storage? Add another data
    node.
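
    The writes and reads points above can be illustrated with a toy
    in-memory model. This is only a sketch under stated assumptions:
    the table, the column names and every function name below are made
    up for illustration, and none of it is an HBase or MySQL API. Real
    engines work on pages and files, not Python tuples.

```python
# Toy model only: hypothetical names, not HBase or MySQL internals.
COLUMNS = ["id", "name", "email", "phone"]

ROWS = [
    {"id": 1, "name": "a", "email": None, "phone": None},
    {"id": 2, "name": "b", "email": "b@example.com", "phone": None},
    {"id": 3, "name": None, "email": None, "phone": "555-0100"},
]

def to_row_store(rows):
    """Row layout: every row keeps a slot for every column, null or not."""
    return [tuple(row[c] for c in COLUMNS) for row in rows]

def to_column_store(rows):
    """Column layout: one map per column keyed by row position; nulls are absent."""
    store = {c: {} for c in COLUMNS}
    for position, row in enumerate(rows):
        for c in COLUMNS:
            if row[c] is not None:
                store[c][position] = row[c]
    return store

def cells_read_row_store(row_store, wanted):
    """A projection over the row layout still walks every slot of every row."""
    return sum(len(record) for record in row_store)

def cells_read_column_store(column_store, wanted):
    """A projection over the column layout touches only the requested columns."""
    return sum(len(column_store[c]) for c in wanted)

row_store = to_row_store(ROWS)
column_store = to_column_store(ROWS)

print("cells stored, row layout:   ", sum(len(r) for r in row_store))
print("cells stored, column layout:", sum(len(m) for m in column_store.values()))
print("cells read for ['name'], row layout:   ",
      cells_read_row_store(row_store, ["name"]))
print("cells read for ['name'], column layout:",
      cells_read_column_store(column_store, ["name"]))
```

    The model deliberately ignores the cost of stitching a full row back
    together from separate columns, so it only shows the storage and
    projection side of the argument.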

    We have done almost no tuning for HBase, but I'd be willing to bet
    that it would handily beat MySql in a drag race.

    ---
    Jim Kellerman, Senior Engineer; Powerset jim@powerset.com


    -----Original Message-----
    From: Rafael Turk
    Sent: Thursday, October 11, 2007 3:36 PM
    To: hadoop-user@lucene.apache.org
    Subject: HBase performance

    Hi All,

    Does any one have comments about how Hbase will perform in a
    4 node cluster compared to an equivalent MySQL configuration?

    Thanks,

    Rafael
  • Peter W. at Oct 12, 2007 at 10:24 pm
    Hi,

    I've had some limited experience with Oracle, SQL Server,
    Informix and at least one commercial in-memory database.

    More recently, I use mysql memory tables for fun speeding
    up bulk read-write operations such as:

    set max_heap_table_size=250*1024*1024;
    create table mem_proptbl (field_one varchar(32),value_one varchar(100),
    index using hash(value_one)) engine=memory;

    downside is i/o time and churning when later writing to disk.

    Column-oriented approaches like SPARQL remind me of
    XQuery, good for specific uses but with limited adoption.

    HBase looks to be a component for distributed, RAM and
    log based byte-arrays that should be able to be COMPRESSED
    by simply bzip2ing the logs...

    It's a much needed scalability tool complementary to RDBMS
    and its columns don't affect how I store the data offline.

    Thanks to its contributors for Rocking the House.

    Later,

    Peter W.

    Jonathan Hendler wrote:
    One of the valid points ... has to do with
    compression (and null values). For example - does HBase also offer
    tools, or a strategy for compression?
  • Jim Kellerman at Oct 12, 2007 at 10:54 pm
    One more comment and then I'll really shut up, I promise. On re-reading the paper, you are all absolutely correct about C-Store, H-Store and Vertica.

    What is not in the paper, but was part of what he presented this week, is the application of column oriented stores to the TPC-H benchmark.

    The TPC-H OLTP telco benchmark has a schema of 212 columns, contains ~600GB data and each transaction accesses only 6 or 7 of the columns. In a full table scan, a row oriented store must read all 600GB of data. It has no choice. A column oriented store need only read the 6-7 columns which is approximately 20GB. I don't think anyone will argue that you can read 20GB a whole lot faster than 600GB.
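
    The 20GB figure can be checked with a back-of-the-envelope
    calculation. The sketch below assumes all 212 columns are the same
    width, which is a simplification and not something the benchmark
    schema guarantees; the function name is made up for illustration.

```python
# Back-of-the-envelope check, assuming equal-width columns (an assumption,
# not a property of the benchmark schema).
TOTAL_GB = 600
NUM_COLUMNS = 212

def column_scan_gb(columns_read, total_gb=TOTAL_GB, num_columns=NUM_COLUMNS):
    """Data a column store reads when scanning only `columns_read` columns."""
    return total_gb * columns_read / num_columns

for k in (6, 7):
    print(f"{k} columns: {column_scan_gb(k):.1f} GB of {TOTAL_GB} GB")
```

    Under that assumption, 6 to 7 columns come to roughly 17 to 20GB,
    which is about 3% of what a full row scan would read.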

    Jeff Hammerbacher wrote:
    4) your section on "adding capacity" has NOTHING at all to do
    with organizing your data on disk in a column-oriented fashion;
    it's a property of any reasonably well-designed horizontally
    partitioned data store.

    Hmm, well, the column oriented-ness of BigTable and HBase does a pretty nice job of horizontal partitioning.

    Jonathan Hendler wrote:
    One of the valid points ... has to do with compression (and null
    values). For example - does HBase also offer tools, or a
    strategy for compression?

    Yes, see hbase.HColumnDescriptor.java; compression is controlled on a per column family basis.

    ---
    Jim Kellerman, Senior Engineer; Powerset
    jim@powerset.com
  • Jason Watkins at Oct 13, 2007 at 5:07 am

    Jim Kellerman wrote:
    - writes: a row oriented database writes the whole row regardless
    of whether or not values are supplied for every field or not.
    Space is reserved for null fields, so the number of bytes
    written is the same for every row. In a column oriented
    database, only the columns for which values are supplied are
    written. Nulls are free. Also row oriented databases must write
    a row descriptor so that when the row is read, the column values
    can be found.

    While I believe this is true for the basic N-Ary Storage Model as
    published in the literature, I believe most practical products have
    some mechanism of null compression within a page. Perhaps someone with
    more experience could confirm if this is the case?

    Jim Kellerman wrote:
    - reads: Unless every column is being returned on a read, a column
    oriented database is faster because it only reads the columns
    requested. The row oriented database must read the entire row,
    figure out where the requested columns are and only return that
    portion of the data read.

    Partly. This is ignoring that the column oriented store has to do
    tuple reconstruction which also has overhead. As published in the
    literature, a hybrid of rows across pages but with attributes
    organized as columns within each page is better than a pure column
    store in almost all workloads (reference PAX storage manager in the
    literature).
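
    The hybrid layout described above can be sketched roughly as
    follows. This is a simplified illustration, not the actual PAX
    page format: rows are split across pages, and inside each page the
    values are grouped by attribute. All names are hypothetical.

```python
# Simplified sketch of a "rows across pages, columns within a page" layout.
# Hypothetical names; not the real PAX page format.

def to_hybrid_pages(rows, columns, page_size):
    """Split rows into pages; inside each page keep one list per attribute."""
    pages = []
    for start in range(0, len(rows), page_size):
        chunk = rows[start:start + page_size]
        pages.append({c: [r[i] for r in chunk] for i, c in enumerate(columns)})
    return pages

def scan_column(pages, column):
    """Column scan: visit only the requested attribute list in each page."""
    for page in pages:
        yield from page[column]

def reconstruct_row(pages, page_size, row_index):
    """Tuple reconstruction stays inside one page, so it needs no cross-file join."""
    page = pages[row_index // page_size]
    offset = row_index % page_size
    return tuple(values[offset] for values in page.values())

columns = ["id", "name", "qty"]
rows = [(1, "a", 10), (2, "b", 20), (3, "c", 30)]
pages = to_hybrid_pages(rows, columns, page_size=2)

print(list(scan_column(pages, "qty")))
print(reconstruct_row(pages, 2, 2))
```

    In this sketch a column scan skips the other attributes within each
    page, while rebuilding a full row only needs the single page that
    holds it.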

    All that said, I found his paper extremely interesting, particularly
    the willingness to forgo disk altogether.

    Jason
