Grokbase Groups Pig user April 2010
FAQ
CDH2 Pig 0.5+. Mapred mode, with CDH2 0.20.1+ Both latest as of 2 weeks ago.

Joins on multiple columns have null key values matching.

IN = LOAD 'test_nulls' using PigStorage(',') as (ind:chararray, ts:int, f1:int, f2:int);
IN2 = LOAD 'test_nulls' using PigStorage(',') as (ind:chararray, ts:int, f1:int, f2:int);
--- both the above are the same

dump IN;
(,1,2,3)
(,-5,5,5)
( ,100,200,300)
( ,0,200,300)
(a,4,5,6)
(a,7,8,9)
(b,10,11,12)
(b,11,11,12)

IN_NULLS = FILTER IN BY ind is NULL;
dump IN_NULLS;
(,1,2,3)
(,-5,5,5)

J1 = JOIN IN by (ind), IN2 by (ind);
dump J1;
( ,0,200,300, ,0,200,300)
(a,4,5,6,a,4,5,6)
(a,4,5,6,a,7,8,9)
(a,7,8,9,a,4,5,6)
(a,7,8,9,a,7,8,9)
( ,100,200,300, ,100,200,300)
(b,10,11,12,b,10,11,12)
(b,10,11,12,b,11,11,12)
(b,11,11,12,b,10,11,12)
(b,11,11,12,b,11,11,12)

The above is the expected result of the self-join on the first column.

J2 = JOIN IN by (ind, ts) IN2 by (ind, ts);
dump J2;
( ,0,200,300, ,0,200,300)
( ,100,200,300, ,100,200,300)
(a,4,5,6,a,4,5,6)
(a,7,8,9,a,7,8,9)
(b,10,11,12,b,10,11,12)
(b,11,11,12,b,11,11,12)
(,-5,5,5,,-5,5,5)
(,1,2,3,,1,2,3)


The above is incorrect, since it matched the rows that have NULL for the ind field.

There is a work-around, by explicitly filtering for null on the join columns before the join, but the above still looks incorrect to me.
I suspect it is fixed in 0.6 or later, but I have not been able to find a JIRA ticket or message on this list about this.

Search Discussions

  • Hc busy at Apr 16, 2010 at 4:38 pm
    What scott noticed is present when the multiple column join key is used in a
    distributed setting. The trap is that when you unit test the behavior/PIG
    script and it does the join right in a local environment and then you get
    F'ed after u deploy to production in distributed enviro.

    On Thu, Apr 15, 2010 at 4:24 PM, Scott Carey wrote:

    CDH2 Pig 0.5+. Mapred mode, with CDH2 0.20.1+ Both latest as of 2 weeks
    ago.

    Joins on multiple columns have null key values matching.

    IN = LOAD 'test_nulls' using PigStorage(',') as (ind:chararray, ts:int,
    f1:int, f2:int);
    IN2 = LOAD 'test_nulls' using PigStorage(',') as (ind:chararray, ts:int,
    f1:int, f2:int);
    --- both the above are the same

    dump IN;
    (,1,2,3)
    (,-5,5,5)
    ( ,100,200,300)
    ( ,0,200,300)
    (a,4,5,6)
    (a,7,8,9)
    (b,10,11,12)
    (b,11,11,12)

    IN_NULLS = FILTER IN BY ind is NULL;
    dump IN_NULLS;
    (,1,2,3)
    (,-5,5,5)

    J1 = JOIN IN by (ind), IN2 by (ind);
    dump J1;
    ( ,0,200,300, ,0,200,300)
    (a,4,5,6,a,4,5,6)
    (a,4,5,6,a,7,8,9)
    (a,7,8,9,a,4,5,6)
    (a,7,8,9,a,7,8,9)
    ( ,100,200,300, ,100,200,300)
    (b,10,11,12,b,10,11,12)
    (b,10,11,12,b,11,11,12)
    (b,11,11,12,b,10,11,12)
    (b,11,11,12,b,11,11,12)

    The above is the expected result of the self-join on the first column.

    J2 = JOIN IN by (ind, ts) IN2 by (ind, ts);
    dump J2;
    ( ,0,200,300, ,0,200,300)
    ( ,100,200,300, ,100,200,300)
    (a,4,5,6,a,4,5,6)
    (a,7,8,9,a,7,8,9)
    (b,10,11,12,b,10,11,12)
    (b,11,11,12,b,11,11,12)
    (,-5,5,5,,-5,5,5)
    (,1,2,3,,1,2,3)


    The above is incorrect, since it matched the rows that have NULL for the
    ind field.

    There is a work-around, by explicitly filtering for null on the join
    columns before the join, but the above still looks incorrect to me.
    I suspect it is fixed in 0.6 or later, but I have not been able to find a
    JIRA ticket or message on this list about this.


  • Alan Gates at Apr 16, 2010 at 5:23 pm

    On Apr 16, 2010, at 9:37 AM, hc busy wrote:

    What scott noticed is present when the multiple column join key is
    used in a
    distributed setting. The trap is that when you unit test the
    behavior/PIG
    script and it does the join right in a local environment and then
    you get
    F'ed after u deploy to production in distributed enviro.
    In 0.7 local mode uses Hadoop's LocalJobRunner, so hopefully we'll
    avoid that will fix these issues with development and deployment
    differences.

    Alan.

    On Thu, Apr 15, 2010 at 4:24 PM, Scott Carey
    wrote:
    CDH2 Pig 0.5+. Mapred mode, with CDH2 0.20.1+ Both latest as of
    2 weeks
    ago.

    Joins on multiple columns have null key values matching.

    IN = LOAD 'test_nulls' using PigStorage(',') as (ind:chararray,
    ts:int,
    f1:int, f2:int);
    IN2 = LOAD 'test_nulls' using PigStorage(',') as (ind:chararray,
    ts:int,
    f1:int, f2:int);
    --- both the above are the same

    dump IN;
    (,1,2,3)
    (,-5,5,5)
    ( ,100,200,300)
    ( ,0,200,300)
    (a,4,5,6)
    (a,7,8,9)
    (b,10,11,12)
    (b,11,11,12)

    IN_NULLS = FILTER IN BY ind is NULL;
    dump IN_NULLS;
    (,1,2,3)
    (,-5,5,5)

    J1 = JOIN IN by (ind), IN2 by (ind);
    dump J1;
    ( ,0,200,300, ,0,200,300)
    (a,4,5,6,a,4,5,6)
    (a,4,5,6,a,7,8,9)
    (a,7,8,9,a,4,5,6)
    (a,7,8,9,a,7,8,9)
    ( ,100,200,300, ,100,200,300)
    (b,10,11,12,b,10,11,12)
    (b,10,11,12,b,11,11,12)
    (b,11,11,12,b,10,11,12)
    (b,11,11,12,b,11,11,12)

    The above is the expected result of the self-join on the first
    column.

    J2 = JOIN IN by (ind, ts) IN2 by (ind, ts);
    dump J2;
    ( ,0,200,300, ,0,200,300)
    ( ,100,200,300, ,100,200,300)
    (a,4,5,6,a,4,5,6)
    (a,7,8,9,a,7,8,9)
    (b,10,11,12,b,10,11,12)
    (b,11,11,12,b,11,11,12)
    (,-5,5,5,,-5,5,5)
    (,1,2,3,,1,2,3)


    The above is incorrect, since it matched the rows that have NULL
    for the
    ind field.

    There is a work-around, by explicitly filtering for null on the join
    columns before the join, but the above still looks incorrect to me.
    I suspect it is fixed in 0.6 or later, but I have not been able to
    find a
    JIRA ticket or message on this list about this.


  • Hc busy at Apr 16, 2010 at 6:29 pm
    Cool! can't wait until CDH has >0.7...

    Kinda surprised that nobody encountered this problem before... Can I file a
    ticket?
    On Fri, Apr 16, 2010 at 10:21 AM, Alan Gates wrote:


    On Apr 16, 2010, at 9:37 AM, hc busy wrote:

    What scott noticed is present when the multiple column join key is used in
    a
    distributed setting. The trap is that when you unit test the behavior/PIG
    script and it does the join right in a local environment and then you get
    F'ed after u deploy to production in distributed enviro.
    In 0.7 local mode uses Hadoop's LocalJobRunner, so hopefully we'll avoid
    that will fix these issues with development and deployment differences.

    Alan.


    On Thu, Apr 15, 2010 at 4:24 PM, Scott Carey <scott@richrelevance.com
    wrote:
    CDH2 Pig 0.5+. Mapred mode, with CDH2 0.20.1+ Both latest as of 2
    weeks
    ago.

    Joins on multiple columns have null key values matching.

    IN = LOAD 'test_nulls' using PigStorage(',') as (ind:chararray, ts:int,
    f1:int, f2:int);
    IN2 = LOAD 'test_nulls' using PigStorage(',') as (ind:chararray, ts:int,
    f1:int, f2:int);
    --- both the above are the same

    dump IN;
    (,1,2,3)
    (,-5,5,5)
    ( ,100,200,300)
    ( ,0,200,300)
    (a,4,5,6)
    (a,7,8,9)
    (b,10,11,12)
    (b,11,11,12)

    IN_NULLS = FILTER IN BY ind is NULL;
    dump IN_NULLS;
    (,1,2,3)
    (,-5,5,5)

    J1 = JOIN IN by (ind), IN2 by (ind);
    dump J1;
    ( ,0,200,300, ,0,200,300)
    (a,4,5,6,a,4,5,6)
    (a,4,5,6,a,7,8,9)
    (a,7,8,9,a,4,5,6)
    (a,7,8,9,a,7,8,9)
    ( ,100,200,300, ,100,200,300)
    (b,10,11,12,b,10,11,12)
    (b,10,11,12,b,11,11,12)
    (b,11,11,12,b,10,11,12)
    (b,11,11,12,b,11,11,12)

    The above is the expected result of the self-join on the first column.

    J2 = JOIN IN by (ind, ts) IN2 by (ind, ts);
    dump J2;
    ( ,0,200,300, ,0,200,300)
    ( ,100,200,300, ,100,200,300)
    (a,4,5,6,a,4,5,6)
    (a,7,8,9,a,7,8,9)
    (b,10,11,12,b,10,11,12)
    (b,11,11,12,b,11,11,12)
    (,-5,5,5,,-5,5,5)
    (,1,2,3,,1,2,3)


    The above is incorrect, since it matched the rows that have NULL for the
    ind field.

    There is a work-around, by explicitly filtering for null on the join
    columns before the join, but the above still looks incorrect to me.
    I suspect it is fixed in 0.6 or later, but I have not been able to find a
    JIRA ticket or message on this list about this.



Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupuser @
categoriespig, hadoop
postedApr 15, '10 at 11:25p
activeApr 16, '10 at 6:29p
posts4
users3
websitepig.apache.org

People

Translate

site design / logo © 2021 Grokbase