Hi all,

I have mapped an HBase table to Hive. I have 3 machines in EC2, installed
using Cloudera Manager.
I have a total of 2 million records.

Query:  select count(*) from table
Result: 2 million using Hive
        approx. 1.5 million using Impala

I have flushed and major-compacted the table in HBase,
and then refreshed it.
The results still do not match.

Finally, I restarted the cluster using Cloudera Manager; the result is still
the same.

Can someone give me some pointers to solve this issue?

Thanks,
Senthil


  • Abhishek desai at May 16, 2013 at 9:13 am
    I am facing the same issue. I have 18825718 (18 million) rows in HBase, but
    Impala's count(*) returns 17648100 (17 million). The query takes
    about 3m57.515s to execute. How long does it take on your system for
    2 million records?
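To put the mismatches quoted in this thread on a common footing, here is a minimal sketch (not from the original posts) that computes how many rows are missing and what fraction of the reference count that is. The function name `count_gap` is made up for illustration; the numbers are the ones reported above.

```python
from typing import Tuple


def count_gap(reference_rows: int, observed_rows: int) -> Tuple[int, float]:
    """Return (missing_rows, missing_fraction) of observed vs. reference."""
    if reference_rows <= 0:
        raise ValueError("reference_rows must be positive")
    missing = reference_rows - observed_rows
    return missing, missing / reference_rows


# Counts quoted in this thread: (HBase/Hive count, Impala count).
reported = {
    "Senthil (approximate)": (2_000_000, 1_500_000),
    "Abhishek": (18_825_718, 17_648_100),
}

for who, (reference, observed) in reported.items():
    missing, fraction = count_gap(reference, observed)
    print(f"{who}: {missing} rows missing ({fraction:.1%})")
```

A gap that is a large, stable share of the table is more consistent with whole regions being skipped than with a few rows being lost, although the sketch itself cannot tell the two apart.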
  • Senthil Kumar at May 16, 2013 at 9:42 am
    Abhishek,

    It took 24 secs. We are testing with sample data.

    Others,
    When is Impala JIRA-300 going to be resolved?
    Even refresh <tablename> does not solve my problem.

    I need to close this evaluation for the selection of a Hadoop distribution.

    Thanks,
    Senthil
  • Senthil Kumar at May 16, 2013 at 10:49 am
    An update on this thread.

    After following some Impala JIRAs, I moved the last region to another
    regionserver, so that the last two regions now reside on separate
    regionservers.

    It worked fine.

    Won't this be a bug for my production setup?
    Eagerly waiting for the patch.

    Thanks,
    Senthil
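
The condition this workaround changes can be checked with a small sketch. It only tests whether the last two regions of a table are hosted on the same regionserver; it does not reproduce or explain the Impala bug itself. The `Region` type, the key values, and the server names are hypothetical and would have to be filled in from the HBase master UI.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    start_key: bytes  # inclusive; b"" marks the first region of the table
    end_key: bytes    # exclusive; b"" marks the last region of the table
    server: str       # hostname of the regionserver hosting the region


def last_two_share_server(regions: List[Region]) -> bool:
    """True if the two regions at the end of the key space share a regionserver."""
    if len(regions) < 2:
        return False
    # Sorting by start key puts b"" (the first region) at the front.
    ordered = sorted(regions, key=lambda r: r.start_key)
    return ordered[-1].server == ordered[-2].server


# Hypothetical layout, for illustration only.
before = [
    Region(b"", b"row3", "rs1"),
    Region(b"row3", b"row6", "rs2"),
    Region(b"row6", b"", "rs2"),
]
# Same table after moving the last region to a different regionserver.
after = before[:2] + [Region(b"row6", b"", "rs3")]

print(last_two_share_server(before))  # True
print(last_two_share_server(after))   # False
```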





  • Alan at May 16, 2013 at 5:53 pm
    Hi Senthil,

    I've filed IMPALA-356 to track the bug. We'll fix it shortly.

  • Paul Birnie at Jun 23, 2013 at 11:30 am
    I see IMPALA-356 is marked as resolved in JIRA for Impala 1.0.1.

    However, I am seeing the same issue as reported below on the Cloudera Impala
    QuickStart VM that I downloaded:

    1 million rows reported in HBase
    300k rows reported in Impala

    [cloudera@localhost ~]$ rpm -qa | grep impala
    impala-1.0-1.p0.824.el6.x86_64
    impala-server-1.0-1.p0.824.el6.x86_64
    impala-shell-1.0-1.p0.824.el6.x86_64
    hue-impala-2.2.0+189-1.cdh4.2.0.p0.8.el6.x86_64
    impala-state-store-1.0-1.p0.824.el6.x86_64

    I am using Flume->Hive Sink to write the data

    [localhost.localdomain:21000] > select count(key) from hive_live_metric4;
    Query: select count(key) from hive_live_metric4
    Query finished, fetching results ...
    +------------+
    | count(key) |
    +------------+
    | 313260     |
    +------------+
    Returned 1 row(s) in 3.74s

    hive> select count(key) from hive_live_metric4;
    Total MapReduce jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks determined at compile time: 1
    In order to change the average load for a reducer (in bytes):
       set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
       set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
       set mapred.reduce.tasks=<number>
    Starting Job = job_201306231039_0002, Tracking URL = http://localhost.localdomain:50030/jobdetails.jsp?jobid=job_201306231039_0002
    Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201306231039_0002
    Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
    2013-06-23 12:18:41,089 Stage-1 map = 0%, reduce = 0%
    ...
    ...
    2013-06-23 12:22:14,067 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 95.9 sec
    MapReduce Total cumulative CPU time: 1 minutes 35 seconds 900 msec
    Ended Job = job_201306231039_0002
    MapReduce Jobs Launched:
    Job 0: Map: 2 Reduce: 1 Cumulative CPU: 95.9 sec HDFS Read: 576 HDFS Write: 8 SUCCESS
    Total MapReduce CPU Time Spent: 1 minutes 35 seconds 900 msec
    OK
    1000000
    Time taken: 221.353 seconds

    btw: I am using a "string" row key

    hive> describe hive_live_metric4;
    OK
    key string from deserializer
    dimension string from deserializer
    value int from deserializer
    Time taken: 0.386 seconds


    hbase(main):002:0> describe "hbase_live_metric4"
    DESCRIPTION                                                          ENABLED
     {NAME => 'hbase_live_metric4', FAMILIES => [{NAME => 'cf1',         true
     DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'NONE',
     REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE',
     MIN_VERSIONS => '0', TTL => '2147483647',
     KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536',
     IN_MEMORY => 'false', ENCODE_ON_DISK => 'true',
     BLOCKCACHE => 'true'}]}
    1 row(s) in 0.0710 seconds


    hbase> scan "hbase_live_metric4"

     trade100060  column=cf1:dimension, timestamp=1371999751269, value=EQUITY
     trade100060  column=cf1:value, timestamp=1371999751269, value=1376
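
Because the row key is a string, HBase orders rows byte by byte rather than numerically, and that ordering decides which region a row lands in. The sketch below shows how a key such as `trade100060` maps to a region given a sorted list of region start keys. The split points are hypothetical; the real ones are not shown in this post.

```python
from bisect import bisect_right
from typing import List


def region_index(start_keys: List[bytes], row_key: bytes) -> int:
    """Index of the region holding row_key.

    start_keys must be sorted and begin with b"" (the first region).
    A region's start key is inclusive, so a key equal to a start key
    belongs to the region that begins there.
    """
    return bisect_right(start_keys, row_key) - 1


# Hypothetical split points, for illustration only.
start_keys = [b"", b"trade3", b"trade6"]

print(region_index(start_keys, b"trade100060"))  # 0
print(region_index(start_keys, b"trade99"))      # 2
print(b"trade100060" < b"trade99")               # True: byte order, not numeric
```

If one engine skips a region, every key that sorts into that region's range is missing from the count, regardless of the number embedded in the key.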


  • Alan Choi at Jun 27, 2013 at 10:39 pm
    Hi Paul,

    It seems that Impala might be missing some regions. To confirm that, can
    you send us two things:

    1. The list of regions (with start/stop keys). You can find it in the HBase
    master web UI. The link should be
    http://<hbase_master_address>:60010/table.jsp?name=<hbase_table_name>

    2. The coordinator log with GLOG_v=2. You can get it by running
    "export GLOG_v=2", then restarting Impala and re-running the query.

    Thanks,
    Alan
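
The comparison requested above can be sketched as follows. The URL pattern is the one given in this reply. The scanned ranges would have to be extracted by hand from the coordinator log, whose format is not shown in this thread, so `scanned` is a hypothetical input, as are the host name and key ranges. The sketch also assumes that each scanned range matches one region exactly, which may not hold if an engine merges adjacent regions into a single scan.

```python
from typing import List, Set, Tuple

KeyRange = Tuple[bytes, bytes]  # (start_key, stop_key); b"" means open-ended


def region_list_url(master: str, table: str, port: int = 60010) -> str:
    """URL of the HBase master page that lists a table's regions."""
    return f"http://{master}:{port}/table.jsp?name={table}"


def unscanned_regions(regions: List[KeyRange], scanned: Set[KeyRange]) -> List[KeyRange]:
    """Regions listed by the master that do not appear among the scanned ranges."""
    return [r for r in regions if r not in scanned]


# Hypothetical values, for illustration only.
regions = [(b"", b"trade3"), (b"trade3", b"trade6"), (b"trade6", b"")]
scanned = {(b"", b"trade3"), (b"trade3", b"trade6")}

print(region_list_url("master.example.com", "hbase_live_metric4"))
print(unscanned_regions(regions, scanned))
```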

Discussion Overview
group: impala-user @
categories: hadoop
posted: May 16, '13 at 7:44a
active: Jun 27, '13 at 10:39p
posts: 8
users: 4
website: cloudera.com
irc: #hadoop
