Thanks, that was indeed the issue. I was expecting seqid to be unique
across table, but it was unique only within the day partition(the table was
partitioned as year/month/day).
Thanks and Regards,
Suresh
On Thursday, 24 January 2013 21:06:22 UTC+5:30, Marcel Kornacker wrote:
Hi Suresh,
if you have duplicates in your {demographics,useractivity}.seqid, you
would end up with a join result that is larger than either one of the
tables itself. (If there were only one seqid value in all rows in both
tables, your result would have |demographics| x |useractivity| rows.)
To verify, run "select count(distinct seqid) from demographics" and
the same for the useractivity table. Keep in mind that aggregation is
done in an in-memory hash table, for which you need enough memory
available.
Marcel
Hi Suresh,
if you have duplicates in your {demographics,useractivity}.seqid, you
would end up with a join result that is larger than either one of the
tables itself. (If there were only one seqid value in all rows in both
tables, your result would have |demographics| x |useractivity| rows.)
To verify, run "select count(distinct seqid) from demographics" and
the same for the useractivity table. Keep in mind that aggregation is
done in an in-memory hash table, for which you need enough memory
available.
Marcel
On Thu, Jan 24, 2013 at 4:50 AM, Suresh K wrote:
Hi,
Currently i am trying to join 2 tables. but see unexpected results from
join. Below are the results.
========================================================
[localhost:21000] > select count(*) from demographics;
6132409
[localhost:21000] > select count(*) from useractivity;
33282073
[localhost:21000] > select count(*) from demographics join useractivity on
(demographics.seqid = useractivity.seqid);
74322301
[localhost:21000] >
=========================================================
Is the result correct? since i was making a inner join i expected count(*)
in the join to be max of first 2 queries(i.e 33282073) , but instead getting
this high number(74322301). Is my understanding correct?
Unfortunately there is some problem running hive query on the setup and
hence couldn't cross verify with Hive.
any pointers to solve this would be helpful.
Thanks and Regards
Suresh
--
Hi,
Currently i am trying to join 2 tables. but see unexpected results from
join. Below are the results.
========================================================
[localhost:21000] > select count(*) from demographics;
6132409
[localhost:21000] > select count(*) from useractivity;
33282073
[localhost:21000] > select count(*) from demographics join useractivity on
(demographics.seqid = useractivity.seqid);
74322301
[localhost:21000] >
=========================================================
Is the result correct? since i was making a inner join i expected count(*)
in the join to be max of first 2 queries(i.e 33282073) , but instead getting
this high number(74322301). Is my understanding correct?
Unfortunately there is some problem running hive query on the setup and
hence couldn't cross verify with Hive.
any pointers to solve this would be helpful.
Thanks and Regards
Suresh
--