Grokbase Groups Pig user January 2011
FAQ
So, bizarrely, I am either not understanding how pig does joins or there is
a bug... it has been quite frustrating to troubleshoot.

The issue is this: after doing a join to get set5, I do a foreach generate
to make set6. Depending on the order in the join statement, one value gets
erased by another. Here is the specific part I am talking about:

set1 = JOIN Z2 by demo,small_table by demo;
set2 = foreach set1 generate Z2::uid as uid,Z2::c2 as c2,Z2::ss2k as
ss2k,Z2::time_id as time_id ,Z2::countryCode as countryCode,Z2::segment as
segment,small_table::value as alsodemo;
set3 = filter set2 BY segment == 1;
set4 = filter set2 BY segment == 2;
set4_a = foreach set4 generate uid, c2, ss2k, time_id, countryCode, alsodemo
as gender;
set5 = join set4_a by (uid,c2,ss2k,time_id,countryCode) full, set3 by
(uid,c2,ss2k,time_id,countryCode);
set6 = foreach set5 generate ((set3::uid IS NULL) ? set4_a::uid : set3::uid)
as uid,
((set3::c2 IS NULL) ? set4_a::c2 : set3::c2) as c2,
((set3::ss2k IS NULL) ? set4_a::ss2k : set3::ss2k) as ss2k,
((set3::time_id IS NULL) ? set4_a::time_id : set3::time_id) as
time_id,
((set3::countryCode IS NULL) ? set4_a::countryCode :
set3::countryCode) as countryCode,
gender, alsodemo as min_age;

If set5 joins set4_a on the left and set3 on the right, then while set5 will
output properly, set6 will not. the "gender" column will erase the alsodemo
column.

If set5 joins set3 on the left and set4_a on the right, then set5 still
works, but set6 will not: in this case, alsodemo will be fine, but it will
erase gender. Basically, the last two columns are always the same. Making
the references set4_a::gender and set3::alsodemo don't change anything (in
fact, originally that's how this was done, but in troubleshooting we have
been changing things to try and fix it).

This is a super bizarre example of the script behaving in a pretty awful
fashion. Not sure what is causing it, would love to know if anyone has any
ideas, and if this is a bug?

I can post the full script, but a lot of it isn't really germane to this.

Search Discussions

  • Xiaomeng Wan at Jan 26, 2011 at 4:30 pm
    Isn't it the same problem mentioned in

    http://mail-archives.apache.org/mod_mbox/pig-user/201101.mbox/%3C90570B63-0991-4127-8E3E-69E2B6E4B942@few.vu.nl%3E

    if so, there is already a fix.

    Shawn
    On Wed, Jan 26, 2011 at 7:36 AM, Jonathan Coveney wrote:
    So, bizarrely, I am either not understanding how pig does joins or there is
    a bug... it has been quite frustrating to troubleshoot.

    The issue is this: after doing a join to get set5, I do a foreach generate
    to make set6. Depending on the order in the join statement, one value gets
    erased by another. Here is the specific part I am talking about:

    set1 = JOIN Z2 by demo,small_table by demo;
    set2 = foreach set1 generate Z2::uid as uid,Z2::c2 as c2,Z2::ss2k as
    ss2k,Z2::time_id as time_id ,Z2::countryCode as countryCode,Z2::segment as
    segment,small_table::value as alsodemo;
    set3 = filter set2 BY segment == 1;
    set4 = filter set2 BY segment == 2;
    set4_a = foreach set4 generate uid, c2, ss2k, time_id, countryCode, alsodemo
    as gender;
    set5 = join set4_a by (uid,c2,ss2k,time_id,countryCode) full, set3 by
    (uid,c2,ss2k,time_id,countryCode);
    set6 = foreach set5 generate ((set3::uid IS NULL) ? set4_a::uid : set3::uid)
    as uid,
    ((set3::c2 IS NULL) ? set4_a::c2 : set3::c2) as c2,
    ((set3::ss2k IS NULL) ? set4_a::ss2k : set3::ss2k) as ss2k,
    ((set3::time_id IS NULL) ? set4_a::time_id : set3::time_id) as
    time_id,
    ((set3::countryCode IS NULL) ? set4_a::countryCode :
    set3::countryCode) as countryCode,
    gender, alsodemo as min_age;

    If set5 joins set4_a on the left and set3 on the right, then while set5 will
    output properly, set6 will not. the "gender" column will erase the alsodemo
    column.

    If set5 joins set3 on the left and set4_a on the right, then set5 still
    works, but set6 will not: in this case, alsodemo will be fine, but it will
    erase gender. Basically, the last two columns are always the same. Making
    the references set4_a::gender and set3::alsodemo don't change anything (in
    fact, originally that's how this was done, but in troubleshooting we have
    been changing things to try and fix it).

    This is a super bizarre example of the script behaving in a pretty awful
    fashion. Not sure what is causing it, would love to know if anyone has any
    ideas, and if this is a bug?

    I can post the full script, but a lot of it isn't really germane to this.

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupuser @
categoriespig, hadoop
postedJan 26, '11 at 2:46p
activeJan 26, '11 at 4:30p
posts2
users2
websitepig.apache.org

2 users in discussion

Jonathan Coveney: 1 post Xiaomeng Wan: 1 post

People

Translate

site design / logo © 2021 Grokbase