CDH2 Pig 0.5+. Mapred mode, with CDH2 0.20.1+ Both latest as of 2 weeks ago.
Joins on multiple columns have null key values matching.
IN = LOAD 'test_nulls' using PigStorage(',') as (ind:chararray, ts:int, f1:int, f2:int);
IN2 = LOAD 'test_nulls' using PigStorage(',') as (ind:chararray, ts:int, f1:int, f2:int);
--- both the above are the same
dump IN;
(,1,2,3)
(,-5,5,5)
( ,100,200,300)
( ,0,200,300)
(a,4,5,6)
(a,7,8,9)
(b,10,11,12)
(b,11,11,12)
IN_NULLS = FILTER IN BY ind is NULL;
dump IN_NULLS;
(,1,2,3)
(,-5,5,5)
J1 = JOIN IN by (ind), IN2 by (ind);
dump J1;
( ,0,200,300, ,0,200,300)
(a,4,5,6,a,4,5,6)
(a,4,5,6,a,7,8,9)
(a,7,8,9,a,4,5,6)
(a,7,8,9,a,7,8,9)
( ,100,200,300, ,100,200,300)
(b,10,11,12,b,10,11,12)
(b,10,11,12,b,11,11,12)
(b,11,11,12,b,10,11,12)
(b,11,11,12,b,11,11,12)
The above is the expected result of the self-join on the first column.
J2 = JOIN IN by (ind, ts) IN2 by (ind, ts);
dump J2;
( ,0,200,300, ,0,200,300)
( ,100,200,300, ,100,200,300)
(a,4,5,6,a,4,5,6)
(a,7,8,9,a,7,8,9)
(b,10,11,12,b,10,11,12)
(b,11,11,12,b,11,11,12)
(,-5,5,5,,-5,5,5)
(,1,2,3,,1,2,3)
The above is incorrect, since it matched the rows that have NULL for the ind field.
There is a work-around, by explicitly filtering for null on the join columns before the join, but the above still looks incorrect to me.
I suspect it is fixed in 0.6 or later, but I have not been able to find a JIRA ticket or message on this list about this.