So, bizarrely, either I am not understanding how Pig does joins or there is
a bug. It has been quite frustrating to troubleshoot.
The issue is this: after doing a join to get set5, I do a foreach generate
to make set6. Depending on the order in the join statement, one value gets
erased by another. Here is the specific part I am talking about:
set1 = JOIN Z2 by demo,small_table by demo;
set2 = foreach set1 generate Z2::uid as uid,Z2::c2 as c2,Z2::ss2k as
ss2k,Z2::time_id as time_id ,Z2::countryCode as countryCode,Z2::segment as
segment,small_table::value as alsodemo;
set3 = filter set2 BY segment == 1;
set4 = filter set2 BY segment == 2;
set4_a = foreach set4 generate uid, c2, ss2k, time_id, countryCode, alsodemo;
set5 = join set4_a by (uid,c2,ss2k,time_id,countryCode) full, set3 by
(uid,c2,ss2k,time_id,countryCode);
set6 = foreach set5 generate ((set3::uid IS NULL) ? set4_a::uid : set3::uid)
as uid,
((set3::c2 IS NULL) ? set4_a::c2 : set3::c2) as c2,
((set3::ss2k IS NULL) ? set4_a::ss2k : set3::ss2k) as ss2k,
((set3::time_id IS NULL) ? set4_a::time_id : set3::time_id) as time_id,
((set3::countryCode IS NULL) ? set4_a::countryCode :
set3::countryCode) as countryCode,
gender, alsodemo as min_age;
If set5 joins set4_a on the left and set3 on the right, then set5 itself
outputs properly, but set6 does not: the "gender" column overwrites the
alsodemo column.
If set5 joins set3 on the left and set4_a on the right, then set5 still
works, but set6 again does not: in this case alsodemo is fine, but it
overwrites gender. Basically, the last two columns of set6 always hold the
same value. Qualifying the references as set4_a::gender and set3::alsodemo
doesn't change anything (in fact, originally that's how this was done, but
in troubleshooting we have been changing things to try and fix it).
This is a super bizarre example of the script behaving in a pretty awful
fashion. I'm not sure what is causing it and would love to know if anyone
has any ideas, or whether this is a bug.
I can post the full script, but a lot of it isn't really germane to this.
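
In the meantime, a stripped-down reproduction along the following lines might
help isolate the problem. This is only a sketch: the file names, the relation
names (left_rel, right_rel, joined, out) and the two-column schemas are made
up for illustration and are not taken from the real script. It assumes two
small tab-separated files and uses DESCRIBE to show which alias Pig assigns
to each column after the full join.

```pig
-- Hypothetical inputs: two tiny tab-separated files sharing a uid key.
left_rel  = LOAD 'left.tsv'  AS (uid:chararray, alsodemo:chararray);
right_rel = LOAD 'right.tsv' AS (uid:chararray, gender:chararray);

-- Same shape as set5: a full outer join on the shared key.
joined = JOIN left_rel BY uid FULL, right_rel BY uid;

-- Shows the relation-qualified aliases (left_rel::..., right_rel::...).
DESCRIBE joined;

-- Same shape as set6: coalesce the key, then project one column per side,
-- each referenced through its fully qualified alias.
out = FOREACH joined GENERATE
    ((right_rel::uid IS NULL) ? left_rel::uid : right_rel::uid) AS uid,
    right_rel::gender  AS gender,
    left_rel::alsodemo AS min_age;

DESCRIBE out;
DUMP out;
```

If the last two columns of out come back identical here as well, swapping the
two sides of the JOIN and rerunning should show whether the behavior follows
the join order, as it does in the full script.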