I have an HBase table (say table_1) of about 1.8 million rows and 12
columns. Each cell value is at most 30 characters long.
The table holds 12.1 GB of data in HBase.

My CDH cluster is made up of 3 AWS m1.large instances. Each instance
has 8 GB of RAM.

When I run the simple query
select count(*) from table_1;
Impala never returns any results, and impalad on one of the instances shows
100% CPU utilization. After some time the impala shell throws an error
saying IO failure. This behavior is consistent.

The same query works on tables in HDFS. It also works on smaller tables
in HBase itself.

When I try the same query in Hive on the large HBase table, it returns the
results after a long time.

I am using Impala version 0.7.

What could be the problem?


  • Alan at May 16, 2013 at 6:05 pm
    Hi Abhishek,

    As for the wrong result, you probably have seen my reply on the other
    thread. You're likely hitting IMPALA-356, which we'll fix shortly.

    The query performance is definitely a lot slower than expected. Do you have
    the query profile to share with us?

    Thanks,
    Alan
    On Wednesday, May 15, 2013 9:11:37 AM UTC-7, abhishek desai wrote:


    I have upgraded to Impala 1.0. My queries no longer hang, but
    aggregation queries take a long time to execute when run on HBase tables.
    I have a table with 18 million records and ran two simple queries on it:

    1. select personid, sum(duration) from hbase_vieweractivity group by personid
       Time taken to execute: 5m6.286s

    2. select count(*) from hbase_vieweractivity
       Time taken to execute: 3m57.515s

    The same queries, when executed on a similar table in HDFS, finish within a
    few seconds.

    Another issue is that the number of records returned by count(*) is
    less than the actual number of records in HBase. The returned result is 17
    million, whereas I have 18 million records in HBase.

    What could be the problem with the queries running on HBase?


  • Abhishek desai at May 17, 2013 at 8:12 am
    Hi Alan,

    I am guessing the query profile is the one generated under /var/log/impalad.
    I have attached logs for 3 query runs.

    Please let me know if it indicates any configuration problems.

    Thanks


  • Abhishek desai at May 17, 2013 at 8:13 am
    The errors and warnings file under /var/log/impalad did not show any
    problems.


  • Alan Choi at May 20, 2013 at 7:46 pm
    Hi Abhishek,

    Your HBase scan rate is ~20 MB/sec. This is roughly the same as what I
    measured when I ran my own casual benchmark locally.

    HBase is really good for pinpoint lookups, and we're aware that its
    scanning is quite a bit slower than what you would get from HDFS.

    Thanks,
    Alan
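
As a back-of-the-envelope check on the numbers in this thread, the following Python sketch converts the reported timings into rows per second and estimates how long a full scan would take at a given scan rate. The helper names (`parse_duration`, `rows_per_second`, `scan_seconds`) are made up for illustration. The thread does not say whether the 12.1 GB size applies to the 18-million-row table, or whether the ~20 MB/sec rate is per node or for the whole cluster, so treat the results as rough estimates only.

```python
def parse_duration(text):
    """Convert a minutes/seconds duration such as '3m57.515s' to seconds."""
    minutes, sep, seconds = text.rstrip("s").partition("m")
    if not sep:
        # No minutes component, e.g. '12.5s'.
        return float(minutes)
    return int(minutes) * 60 + float(seconds)


def rows_per_second(rows, duration):
    """Average number of rows processed per second over the whole query."""
    return rows / parse_duration(duration)


def scan_seconds(size_mb, rate_mb_per_s):
    """Time to read size_mb megabytes at a constant scan rate."""
    return size_mb / rate_mb_per_s


if __name__ == "__main__":
    # Figures quoted in the thread.
    print(rows_per_second(18_000_000, "3m57.515s"))  # count(*) query
    print(rows_per_second(18_000_000, "5m6.286s"))   # group-by query
    # Assumption: 12.1 GB read by a single scanner at 20 MB/sec.
    print(scan_seconds(12.1 * 1024, 20))
```

With the thread's figures, the count(*) run works out to roughly 75,800 rows per second, and a 12.1 GB scan at 20 MB/sec would take about ten minutes under the single-scanner assumption.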

Discussion Overview
group: impala-user @
categories: hadoop
posted: May 7, '13 at 8:58a
active: May 20, '13 at 7:46p
posts: 5
users: 2
website: cloudera.com
irc: #hadoop

2 users in discussion

Abhishek desai: 3 posts Alan Choi: 2 posts
