Performance between Hive queries vs. Hive over HBase queries
Hive user mailing list (user@hive.apache.org), March 2011
Hi,

I loaded a data set of 1 million rows into both a native Hive table and an
HBase table. For the HBase table, I created a corresponding Hive table so
that the data in HBase can be queried with HiveQL. Both tables have a key
column and a value column.

For the same query (select value, count(*) from table group by value), the
Hive-only query runs much faster (~30 seconds) than the same query run
through Hive over HBase (~150 seconds).

Is this expected?

Regards,
Biju
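
[Editor's sketch] The setup described above might look roughly like the following HiveQL. This is not Biju's actual DDL: the table names kv_native and kv_hbase, the HBase table name kv, the column family cf, the qualifier val, and the STRING column types are all assumptions made for illustration. The storage handler class and the hbase.columns.mapping / hbase.table.name properties are the ones used by Hive's HBase integration.

```sql
-- Native Hive table: rows are stored as files in HDFS and read directly.
CREATE TABLE kv_native (key STRING, value STRING);

-- Hive table mapped onto an existing HBase table (assumed to be named 'kv',
-- with an assumed column family 'cf' holding the value in qualifier 'val').
CREATE EXTERNAL TABLE kv_hbase (key STRING, value STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:val")
TBLPROPERTIES ("hbase.table.name" = "kv");

-- The same aggregate, run once against each table.
SELECT value, count(*) FROM kv_native GROUP BY value;
SELECT value, count(*) FROM kv_hbase  GROUP BY value;
```

Both queries do a full pass over the data; the difference discussed in this thread is in how the rows reach the Hive tasks.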

  • John Sichi at Mar 8, 2011 at 6:06 am
    Yes.

    JVS
  • Biju Kaimal at Mar 8, 2011 at 6:14 am
    Hi,

    Could you please explain the reason for the behavior?

    Regards,
    Biju
  • John Sichi at Mar 8, 2011 at 6:18 am
    For native tables, Hive reads rows directly from HDFS.

    For HBase tables, it has to go through the HBase region servers, which reconstruct rows from column families (combining cache + HDFS).

    HBase makes it possible to keep your table up to date in real time, but you have to pay an overhead cost at query time.

    On the other hand, with native Hive tables, there's latency in loading new batches of data.

    JVS
  • Otis Gospodnetic at Mar 9, 2011 at 5:52 am
    Hi,

    John, are there plans or specific JIRA issues for this particular
    performance hit that you or somebody else is working on? Those of us
    interested in performance improvements when Hive points to external
    tables in HBase would like to know what to watch.

    Thanks,
    Otis
    ----
    Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
    Lucene ecosystem search :: http://search-lucene.com/


  • John Sichi at Mar 9, 2011 at 9:23 pm
    There's one here specifically for the Hive portion, but really a full-stack system profile is needed for deciding where to attack it:

    https://issues.apache.org/jira/browse/HIVE-1231

    I don't know of anyone currently working in this area.

    JVS
  • Otis Gospodnetic at Mar 9, 2011 at 9:24 pm
    Hi,

    Biju's example shows a factor of 5 decrease in performance when Hive points to
    HBase tables.

    Does anyone know how much this factor varies? Is it often closer to 1, or is it
    more often close to 10?
    Just trying to get a better feel for this...

    Thanks,
    Otis
    ----
    Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
    Lucene ecosystem search :: http://search-lucene.com/


  • John Sichi at Mar 9, 2011 at 9:31 pm
    Factor of 5 closely matches the results I got when I was testing.

    JVS
  • Edward Capriolo at Mar 9, 2011 at 9:51 pm

    There is going to be overhead. Data has to move
    HDFS->RegionServer->TaskTracker. Another factor is how many
    column families your query has to span in the table.
  • Vaibhav Aggarwal at Mar 8, 2011 at 6:30 am
    If you are querying for a particular key, you should see better performance,
    though. We have filter push-down for equality predicates on the HBase key column.
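
    [Editor's sketch] A query of the kind that can benefit from this push-down
    might look like the following. The table name kv_hbase (a hypothetical Hive
    table mapped to HBase, with its key column mapped to the HBase row key) and
    the key literal are assumptions for illustration only.

    ```sql
    -- Equality on the column mapped to the HBase row key: a candidate for
    -- key push-down, so only the matching row needs to be fetched.
    SELECT value FROM kv_hbase WHERE key = 'row-000042';

    -- No predicate on the key: every row still has to be scanned through
    -- the region servers, as in the original group-by query.
    SELECT value, count(*) FROM kv_hbase GROUP BY value;
    ```

    Whether the push-down actually applies depends on the Hive version and on
    how the key column is mapped, so it is worth checking the query plan.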
