FAQ
Hi,

Currently i am trying to join 2 tables. but see unexpected results from
join. Below are the results.
========================================================
[localhost:21000] > select count(*) from demographics;
6132409
[localhost:21000] > select count(*) from useractivity;
33282073
[localhost:21000] > select count(*) from demographics join useractivity on
(demographics.seqid = useractivity.seqid);
74322301
[localhost:21000] >
=========================================================

Is the result correct? since i was making a inner join i expected count(*)
in the join to be max of first 2 queries(i.e 33282073) , but instead
getting this high number(74322301). Is my understanding correct?

Unfortunately there is some problem running hive query on the setup and
hence couldn't cross verify with Hive.

any pointers to solve this would be helpful.

Thanks and Regards
Suresh

--

Search Discussions

  • Suresh K at Jan 31, 2013 at 11:45 am
    Hi Marcel,

    Thanks, that was indeed the issue. I was expecting seqid to be unique
    across table, but it was unique only within the day partition(the table was
    partitioned as year/month/day).


    Thanks and Regards,
    Suresh
    On Thursday, 24 January 2013 21:06:22 UTC+5:30, Marcel Kornacker wrote:

    Hi Suresh,

    if you have duplicates in your {demographics,useractivity}.seqid, you
    would end up with a join result that is larger than either one of the
    tables itself. (If there were only one seqid value in all rows in both
    tables, your result would have |demographics| x |useractivity| rows.)

    To verify, run "select count(distinct seqid) from demographics" and
    the same for the useractivity table. Keep in mind that aggregation is
    done in an in-memory hash table, for which you need enough memory
    available.

    Marcel
    On Thu, Jan 24, 2013 at 4:50 AM, Suresh K wrote:
    Hi,

    Currently i am trying to join 2 tables. but see unexpected results from
    join. Below are the results.
    ========================================================
    [localhost:21000] > select count(*) from demographics;
    6132409
    [localhost:21000] > select count(*) from useractivity;
    33282073
    [localhost:21000] > select count(*) from demographics join useractivity on
    (demographics.seqid = useractivity.seqid);
    74322301
    [localhost:21000] >
    =========================================================

    Is the result correct? since i was making a inner join i expected count(*)
    in the join to be max of first 2 queries(i.e 33282073) , but instead getting
    this high number(74322301). Is my understanding correct?

    Unfortunately there is some problem running hive query on the setup and
    hence couldn't cross verify with Hive.

    any pointers to solve this would be helpful.

    Thanks and Regards
    Suresh

    --

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupimpala-user @
categorieshadoop
postedJan 24, '13 at 12:50p
activeJan 31, '13 at 11:45a
posts2
users1
websitecloudera.com
irc#hadoop

1 user in discussion

Suresh K: 2 posts

People

Translate

site design / logo © 2022 Grokbase