FAQ
Hi,
    I'm running Impala 1.2.3 with an RCFile table with 38,687 partitions
that was created from Hive. Afterwards, I ran a metadata refresh and
compared the select count(1) results, and noticed that they differed
(the Impala result was significantly smaller than Hive's). On further
investigation, I determined that Impala was not considering some of my
later partitions.

The Hive "show partitions" results came back as expected. I tried using the
SHOW TABLE STATS command in Impala, but I'm getting an error:
[ip-10-124-195-6.ec2.internal:21000] > SHOW TABLE STATS rcfile_3p;
Query: show TABLE STATS rcfile_3p
ERROR: IllegalArgumentException: Comparison method violates its general contract!

Thanks for your help.

Best,
Sammy

To unsubscribe from this group and stop receiving emails from it, send an email to impala-user+unsubscribe@cloudera.org.


  • Alan Choi at Jan 14, 2014 at 3:32 am
    Hi Sammy,

    If you run "explain select count(*) from your_tbl", the plan will tell you
    the number of partitions being scanned. Is that number correct?

    If it is correct, that probably means that some of the data files can't be
    read correctly by Impala.

    If it's not correct, then maybe you can try running "invalidate metadata"
    (that's different from refresh; see this link
    <http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_langref_sql.html?scroll=invalidate_metadata_unique_1>
    for more details)?

    Thanks,
    Alan

  • Alan Choi at Jan 14, 2014 at 4:34 am
    Hi Sammy,

    Good catch. Thanks for reporting this issue. I've filed IMPALA-749
    <https://issues.cloudera.org/browse/IMPALA-749> to track it.

    Thanks,
    Alan

    On Mon, Jan 13, 2014 at 7:52 PM, Sammy Yu wrote:

    Hi Alan,
    Thanks for the reply. The explain shows 32,767 partitions, which is
    less than the expected 38,687 partitions:

    0:SCAN HDFS
       table=default.raw_3p #partitions=32767/32767 size=130.69GB

    I ran "invalidate metadata", but the explain still came back with 32,767.

    Are there any logs that I can provide? Is there a way to dump
    catalogd's view of the partitions, the way I can see the partitions in
    Hive's metastore?

    mysql> select count(1) from PARTITIONS where TBL_ID=6;
    +----------+
    | count(1) |
    +----------+
    |    38687 |
    +----------+
    1 row in set (0.00 sec)

    Thanks,
    Sammy
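
Editorial sketch: the comparison described in this message (the partition count in the explain plan versus the count in the metastore) can be scripted. The Python below is a minimal illustration, not part of the original exchange. It assumes the plan contains a `#partitions=<selected>/<total>` entry as shown in the message; the function names are hypothetical. It also flags the case where the total equals 32,767, which is 2^15 - 1, the largest signed 16-bit integer. That is only an observation about the number; this thread does not state the root cause.

```python
import re

# Largest value of a signed 16-bit integer.
INT16_MAX = 2**15 - 1

# Matches the "#partitions=<selected>/<total>" entry of a scan node.
_PARTITIONS_RE = re.compile(r"#partitions=(\d+)/(\d+)")


def scanned_partitions(plan_text):
    """Return (selected, total) from the first '#partitions=' entry, or None."""
    match = _PARTITIONS_RE.search(plan_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def missing_partitions(plan_text, expected_total):
    """Return how many partitions the plan is short of the expected total."""
    counts = scanned_partitions(plan_text)
    if counts is None:
        raise ValueError("no '#partitions=' entry found in plan")
    _, total = counts
    return expected_total - total


def at_int16_ceiling(total):
    """True if the reported total is exactly the signed 16-bit maximum."""
    return total == INT16_MAX


if __name__ == "__main__":
    plan = "0:SCAN HDFS\ntable=default.raw_3p #partitions=32767/32767 size=130.69GB"
    print(missing_partitions(plan, 38687))               # prints 5920
    print(at_int16_ceiling(scanned_partitions(plan)[1]))  # prints True
```

With the numbers from this message, the plan is 5,920 partitions short of the metastore count.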





  • Sammy Yu at Jan 15, 2014 at 12:52 am
    Hi Alan,
        Thanks so much for looking into the issue and determining the root
    cause. I can see that IMPALA-749 is assigned to 1.2.4.
    Regarding the catalogd resource issue, is there anything I can
    provide in terms of logs to confirm that this is the issue I'm seeing? I
    hate to ask, but regarding the roadmap: another email mentioned
    that the next major release, 1.3, will be available at the end of
    Q1 or early Q2. Does this mean we will likely see a 1.2.4 release before
    then, and will it fix both of these issues?

    Best,
    Sammy

  • Alan Choi at Jan 15, 2014 at 3:37 am
    Hi Sammy,

    For the catalogd resource issue, can you run "jstack <catalogd pid>" when
    you see the CPU running high?

    We're working on 1.2.4, and it will be released shortly.

    Thanks,
    Alan
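
Editorial sketch: a single thread dump can miss what a busy process is doing, so it may help to take a few dumps some seconds apart while the CPU is high. The Python below is one possible way to do that, not something prescribed in this thread. It assumes `jstack` is on the PATH and is run as a user allowed to attach to the catalogd process; the function name and its parameters are hypothetical.

```python
import subprocess
import time


def capture_thread_dumps(pid, count=3, interval_s=10.0,
                         run=subprocess.run, sleep=time.sleep):
    """Run 'jstack <pid>' `count` times, `interval_s` seconds apart.

    Returns the captured dumps as a list of strings. `run` and `sleep`
    are injectable so the function can be exercised without a real JVM.
    """
    dumps = []
    for i in range(count):
        result = run(["jstack", str(pid)],
                     capture_output=True, text=True, check=True)
        dumps.append(result.stdout)
        if i < count - 1:
            sleep(interval_s)  # wait before taking the next dump
    return dumps


def save_dumps(dumps, prefix="catalogd-jstack"):
    """Write each dump to '<prefix>-<n>.txt' and return the file names."""
    names = []
    for n, text in enumerate(dumps, start=1):
        name = "%s-%d.txt" % (prefix, n)
        with open(name, "w") as handle:
            handle.write(text)
        names.append(name)
    return names
```

For example, `save_dumps(capture_thread_dumps(4894))` would write three dumps for the catalogd PID shown in the top output elsewhere in this thread.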

  • Sammy Yu at Jan 15, 2014 at 6:15 pm
    Hi Alan,
        Thanks so much for the response. I'm looking forward to 1.2.4.
    Regarding the high CPU usage issue, I think it is caused when catalogd
    first starts and begins to enumerate the list of partitions (for
    those tables with a lot of partitions). It takes 40 minutes on my
    cluster before it settles back down. I have attached a couple of
    jstack outputs. I can see a bunch of queries going to the MySQL
    metastore DB for each partition. I'm not sure if this is the
    bottleneck, but is it possible to do a single bulk query for
    all the partitions of this table instead?

    Thanks,
    Sammy
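
Editorial sketch: the difference between one query per partition and a single bulk query can be illustrated with a toy example. The Python below uses an in-memory SQLite database as a stand-in for the MySQL metastore and a simplified `PARTITIONS` table. The `PART_ID` and `PART_NAME` columns and the partition names are assumptions made for the illustration, and the code says nothing about how catalogd actually loads metadata. It only shows that both approaches return the same rows while issuing very different numbers of queries.

```python
import sqlite3


def build_demo_metastore(num_partitions, tbl_id=6):
    """Create an in-memory stand-in for the metastore PARTITIONS table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE PARTITIONS ("
        "PART_ID INTEGER PRIMARY KEY, TBL_ID INTEGER, PART_NAME TEXT)"
    )
    conn.executemany(
        "INSERT INTO PARTITIONS (PART_ID, TBL_ID, PART_NAME) VALUES (?, ?, ?)",
        [(i, tbl_id, "part=%05d" % i) for i in range(num_partitions)],
    )
    return conn


def load_one_by_one(conn, tbl_id):
    """Fetch partition ids first, then issue one query per partition."""
    ids = [row[0] for row in conn.execute(
        "SELECT PART_ID FROM PARTITIONS WHERE TBL_ID = ? ORDER BY PART_ID",
        (tbl_id,))]
    queries = 1
    names = []
    for part_id in ids:
        row = conn.execute(
            "SELECT PART_NAME FROM PARTITIONS WHERE PART_ID = ?",
            (part_id,)).fetchone()
        names.append(row[0])
        queries += 1
    return names, queries


def load_in_bulk(conn, tbl_id):
    """Fetch all partition names for the table with a single query."""
    rows = conn.execute(
        "SELECT PART_NAME FROM PARTITIONS WHERE TBL_ID = ? ORDER BY PART_ID",
        (tbl_id,)).fetchall()
    return [row[0] for row in rows], 1
```

With a table of N partitions, `load_one_by_one` issues N + 1 queries and `load_in_bulk` issues one. Against a remote database each query is also a network round trip.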


  • Alan Choi at Jan 14, 2014 at 4:29 am
    Hi Sammy,

    The huge CPU usage is more likely to be caused by a large number of files
    or blocks. We're aware of the performance problem when the catalog/metadata
    is huge. We're working on it. Stay tuned!

    Thanks,
    Alan


    On Mon, Jan 13, 2014 at 5:29 PM, Sammy Yu wrote:

    Hi,
    I also see that catalogd is not behaving very nicely after this table
    was found. It's eating up a lot of CPU; this is the output from top:

    PID  USER   PR NI VIRT  RES  SHR S %CPU %MEM TIME+    COMMAND
    4894 impala 20 0  3008m 1.2g 23m S 83.1 16.8 17:37.46 catalogd

    Is there a limit to the number of partitions a table can have in
    Impala? If there is, could one create a table per partition? Would
    having 30,000+ tables cause problems in Impala?

    Thanks,
    Sammy

Discussion Overview
group: impala-user
categories: hadoop
posted: Jan 11, '14 at 3:39a
active: Jan 15, '14 at 6:15p
posts: 7
users: 2
website: cloudera.com
irc: #hadoop

2 users in discussion

Alan Choi: 4 posts Sammy Yu: 3 posts
