FAQ
Guys,

I am trying to understand Vertica and how it applies to the Hadoop
world. Is this basically a way to store large amounts of data and run
SQL-like queries on it that also result in map/reduce jobs, as in Hadoop/
Hive? Or am I trying to compare apples and oranges? If not, are
Vertica queries faster in getting results than Hive (5 minutes versus
seconds)?

Thanks,
Ryan


  • Edward Capriolo at Oct 17, 2009 at 3:00 pm

    There was a presentation on hadoop+vertica at hadoop world nyc.

    http://www.cloudera.com/hadoop-world-nyc

    http://www.cloudera.com/sites/all/themes/cloudera/static/hw09/3%20%20-%204-00%20Omer%20Trajman,%20Vertica,%20Hadoop%20-%20Vertica%20v2.ppt

    One major difference is that Vertica is a column-based datastore while
    by default Hive is row-based, but there are many other differences.

    Edward
  • Arijit Mukherjee at Oct 19, 2009 at 4:48 am
    I've been in touch with Vertica for the past year. The main concept
    behind Vertica is column-orientation, which in turn allows a high
    degree of compression and faster query processing (mainly when
    retrieving large data sets), as it reads only the columns required
    instead of fetching the entire row and applying a projection on it.
    It's Mike Stonebraker's brainchild and builds on the original C-Store.

    Performance is very impressive even on not-so-high-end hardware -
    loading is very fast, as are the queries. But, as of now, it does not
    support map/reduce. That is supported in Greenplum (another DW
    datastore) - but my experience with Greenplum was not so good
    performance-wise, and it needs quite high-end machines.

    Arijit



    --
    "And when the night is cloudy,
    There is still a light that shines on me,
    Shine on until tomorrow, let it be."
  • Sanjay Sharma at Oct 19, 2009 at 5:32 am
    I do not work for Vertica, but they were my neighbors in Hadoop World NYC where they did talk about a Hadoop MapReduce connector.

    http://www.vertica.com/company/news/vertica-analytic-database-broadens-reach-with-flexstore-
    "With version 3.5, Vertica also introduces native support for MapReduce via connectivity to the standard Hadoop framework. The Vertica 3.5 interface to Hadoop gives MapReduce developers the ability to perform highly scalable in-database analytics by making it easy to store and retrieve data from the equally scalable Vertica Analytic Database. This unique combination of MPP analytic database and MPP compute framework gives enterprises the flexibility to process large sets of structured and unstructured data and make them available to business users at Web speeds."

    BTW, GreenPlum, AsterData and Vertica all claim some kind of interface with Hadoop now.


    Also, IMHO, their focus is certainly on working ALONG with Hadoop MapReduce. The symbiosis can be seen as Hadoop MapReduce handling the unstructured data, with the structured part being handled by these column-based databases over a rich SQL interface.
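A minimal sketch of that division of labor: a map/reduce-style pass extracts structure from raw text, and the structured result is then loaded into a SQL store for ad-hoc queries. SQLite stands in here for an analytic database such as Vertica, and the log format is invented for the example.

```python
# Sketch: unstructured data -> map/reduce-style aggregation -> SQL warehouse.

import sqlite3
from collections import Counter

raw_logs = [
    "2009-10-17 GET /index.html 200",
    "2009-10-17 GET /missing 404",
    "2009-10-18 GET /index.html 200",
]

# "MapReduce" phase: parse each raw line and aggregate hits per (day, status).
counts = Counter()
for line in raw_logs:
    day, _method, _path, status = line.split()
    counts[(day, status)] += 1

# "Warehouse" phase: load the structured aggregate and query it with SQL.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE hits (day TEXT, status INTEGER, n INTEGER)")
db.executemany("INSERT INTO hits VALUES (?, ?, ?)",
               [(d, int(s), n) for (d, s), n in counts.items()])
rows = db.execute(
    "SELECT day, SUM(n) FROM hits WHERE status = 200 GROUP BY day ORDER BY day"
).fetchall()
print(rows)  # [('2009-10-17', 1), ('2009-10-18', 1)]
```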

    It would be interesting to know how HBase and Hive/Pig figure in their plans.


    -Sanjay

    -----Original Message-----
    From: Arijit Mukherjee
    Sent: Monday, October 19, 2009 10:18 AM
    To: hive-user@hadoop.apache.org
    Subject: Re: Hive vs. Vertica


    Follow us on Twitter- https://twitter.com/impetuscalling.

    *Impetus Celebrates Green Diwali.

    NOTE: This message may contain information that is confidential, proprietary, privileged or otherwise protected by law. The message is intended solely for the named addressee. If received in error, please destroy and notify the sender. Any use of this email is prohibited when received in error. Impetus does not represent, warrant and/or guarantee, that the integrity of this communication has been maintained nor that the communication is free of errors, virus, interception or interference.
  • Jeff Hammerbacher at Oct 19, 2009 at 5:36 am
    FWIW, there was a lot of experimentation with columnar storage on the Hive
    lists recently: https://issues.apache.org/jira/browse/HIVE-352. Zebra, a Pig
    subproject, is also focused on columnar storage in HDFS for Pig queries.
  • Amogh Vasekar at Oct 19, 2009 at 12:50 pm
    Yahoo! had an Everest MPP framework based on columnar storage; I don't know how popular it was, but it required pretty high-end machines. Zebra, I guess, partially aims at getting that into Hadoop using the TFile implementation, and its source is available in contrib.

    Amogh


  • Steve Lihn at Oct 22, 2009 at 7:03 pm
    Can someone explain why a column-based database compresses the data
    so well, especially time-series data? (I've heard it's sometimes 80%
    smaller than on a traditional RDBMS.)
    Intuitively, isn't a columnar DB similar to an RDBMS indexed on every
    column? Is there a simple illustration of why the storage is better
    in a columnar DB?

    Thanks, Steve
  • Jeff Hammerbacher at Oct 22, 2009 at 7:54 pm
    Hey Steve,

    I'd recommend checking out some of the blog posts on the Vertica blog. Right
    now, the entries on http://databasecolumn.vertica.com/page/3/ answer this
    question fairly well.

    Regards,
    Jeff
  • Steve Lihn at Oct 22, 2009 at 10:24 pm
    Jeff,
    If I read it correctly, the savings on storage in Vertica come from
    compressing the data files, not from some magic way of storing the
    raw data.
    If that is true, wouldn't compression add CPU overhead, just like
    the serde overhead in Hive?

    Steve
  • Arijit Mukherjee at Oct 23, 2009 at 4:56 am
    I'll try to explain a bit. I've worked with Vertica for some time in
    the last year or so...

    Because Vertica is column-oriented, the data type within a column is
    uniform - each column is entirely INT or STRING etc. - whereas in
    row-stores each row can be a mix of various data types. Because of
    this uniformity in each column, compression works faster and better
    in Vertica, and they have very high-quality compression algorithms.
    Compression and decompression do put some overhead on the CPU, but
    CPUs have become fast compared to I/O, so whatever additional cost
    the CPU overhead adds, Vertica makes up and more through reduced I/O.
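A back-of-the-envelope calculation of that trade-off, using assumed (not measured) throughput numbers: even paying a decompression cost, reading 5x fewer bytes wins when the disk is the bottleneck.

```python
# Assumed figures for illustration only: disk and decompression throughputs
# vary widely by hardware; the point is the shape of the arithmetic.

raw_bytes = 10 * 10**9           # 10 GB of column data, uncompressed
ratio = 5                        # assumed 5:1 compression
disk_bps = 100 * 10**6           # ~100 MB/s sequential disk read
decompress_bps = 500 * 10**6     # ~500 MB/s decompression throughput

t_uncompressed = raw_bytes / disk_bps
t_compressed = (raw_bytes / ratio) / disk_bps + (raw_bytes / ratio) / decompress_bps

print(f"uncompressed scan: {t_uncompressed:.0f} s")  # 100 s
print(f"compressed scan:   {t_compressed:.0f} s")    # 24 s (20 s I/O + 4 s CPU)
```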

    Data access is faster because when executing a query, Vertica scans
    only the required columns (people normally select a given set of
    column names) - thus it's a pure select, instead of the combination
    of select + project needed in row-stores (where even for a selected
    set of columns, the entire row has to be read).
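A toy model of that I/O difference (illustrative sizes, not Vertica internals): scanning two columns out of twenty touches far less data in a column store, because a row store must read whole rows regardless.

```python
# Bytes touched when scanning 2 of 20 fixed-width columns over 1M rows.

n_rows = 1_000_000
n_cols = 20
bytes_per_value = 8              # assume fixed-width 8-byte values

row_store_bytes = n_rows * n_cols * bytes_per_value   # whole rows read
col_store_bytes = n_rows * 2 * bytes_per_value        # only the 2 columns read

print(row_store_bytes // col_store_bytes, "x less I/O")  # 10 x less I/O
```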

    Hope this helps.

    The incorporation of Map/Reduce in Vertica 3.5 is news to me. I
    mentioned this once during a telecon with Vertica. I think it's a
    good step - especially now that Greenplum and AsterData have both
    come up with similar M/R frameworks.

    Regards
    Arijit

  • Arijit Mukherjee at Oct 23, 2009 at 5:30 am
    Oh - by the way - data is compressed when loaded into Vertica,
    normally at about a 5:1 ratio. But Vertica will charge you based on
    the raw data volume :-)

    Arijit


Discussion Overview
group: user @
categories: hive, hadoop
posted: Oct 17, '09 at 5:11a
active: Oct 23, '09 at 5:30a
posts: 11
users: 7
website: hive.apache.org
