Grokbase Groups Pig user July 2011
FAQ
Trying to join two sets and generate a set from the join and I am getting a


$ hadoop fs -cat DIM/\*
2011,01,31
2011,02,28
2011,03,31
2011,04,30
2011,05,31
2011,06,30
2011,07,31
2011,08,31
2011,09,30
2011,10,31
2011,11,30
2011,12,31


$ hadoop fs -cat ACCT/\*
2011,7,26,key1,23.25,2470.0
2011,7,26,key2,10.416666666666668,232274.08333333334
2011,7,26,key3,82.83333333333333,541377.25
2011,7,26,key4,78.5,492823.33333333326
2011,7,26,key5,110.83333333333334,729811.9166666667
2011,7,26,key6,102.16666666666666,675941.25
2011,7,26,key7,118.91666666666666,770896.75


grunt> DIM = LOAD 'DIM' USING PigStorage(',') AS (year:int, month:int,
days:int);
grunt> ACCT = LOAD 'ACCT' USING PigStorage(',') AS (year:int, month:int,
day: int, account:chararray, metric1:double, metric2:double);
grunt> AjD = JOIN ACCT BY (year,month), DIM BY (year,month) USING
'replicated';
grunt> dump AjD;
...
(2011,7,26,key1,23.25,2470.0,2011,7,31)
(2011,7,26,key2,10.416666666666668,232274.08333333334,2011,7,31)
(2011,7,26,key3,82.83333333333333,541377.25,2011,7,31)
(2011,7,26,key4,78.5,492823.33333333326,2011,7,31)
(2011,7,26,key5,110.83333333333334,729811.9166666667,2011,7,31)
(2011,7,26,key6,102.16666666666666,675941.25,2011,7,31)
(2011,7,26,key7,118.91666666666666,770896.75,2011,7,31)
grunt> describe AjD;
AjD: {ACCT::year: int,ACCT::month: int,ACCT::day: int,ACCT::account:
chararray,ACCT::metric1: double,ACCT::metric2: double,DIM::year:
int,DIM::month: int,DIM::days: int}

grunt> FINAL = FOREACH AjD
GENERATE ACCT.year, ACCT.month, ACCT.account, (ACCT.metric2 / DIM.days);
grunt> dump FINAL;
...
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator
for alias FINAL. Backend error : Scalar has more than one row in the output.
1st : (2011,7,26,key1,23.25,2470.0), 2nd
:(2011,7,26,key2,10.416666666666668,232274.08333333334)

However if I store it and reload it to shed the "join" schema:

grunt> STORE AjD INTO 'AjD' using PigStorage(',');
grunt> AjD2 = LOAD 'AjD' USING PigStorage(',') AS (year:int, month:int,
day:int, account:chararray, metric1:double, metric2:double, year2:int,
month2:int, days:int);

grunt> FINAL = FOREACH AjD2
GENERATE year, month, account, (metric2 /days);
grunt> dump FINAL;
...
(2011,7,key1,79.6774193548387)
(2011,7,key2,7492.712365591398)
(2011,7,key3,17463.782258064515)
(2011,7,key4,15897.526881720427)
(2011,7,key5,23542.319892473122)
(2011,7,key6,21804.5564516129)
(2011,7,key7,24867.637096774193)

What am I missing to make this work without storing and reloading?

Thanks,
Rob

Search Discussions

  • Norbert Burger at Jul 29, 2011 at 5:35 pm
    Isn't the query parser getting confused because you've chosen the same
    column name mappings for the LHS and RHS of your join?

    What happens if you change this:

    AjD = JOIN ACCT BY (year,month), DIM BY (year,month) USING 'replicated';

    to:

    AjD = JOIN ACCT BY ($0,$1), DIM BY ($0,$1) USING 'replicated';

    Norbert
    On Fri, Jul 29, 2011 at 12:47 PM, rob parker wrote:

    Trying to join two sets and generate a set from the join and I am getting a


    $ hadoop fs -cat DIM/\*
    2011,01,31
    2011,02,28
    2011,03,31
    2011,04,30
    2011,05,31
    2011,06,30
    2011,07,31
    2011,08,31
    2011,09,30
    2011,10,31
    2011,11,30
    2011,12,31


    $ hadoop fs -cat ACCT/\*
    2011,7,26,key1,23.25,2470.0
    2011,7,26,key2,10.416666666666668,232274.08333333334
    2011,7,26,key3,82.83333333333333,541377.25
    2011,7,26,key4,78.5,492823.33333333326
    2011,7,26,key5,110.83333333333334,729811.9166666667
    2011,7,26,key6,102.16666666666666,675941.25
    2011,7,26,key7,118.91666666666666,770896.75


    grunt> DIM = LOAD 'DIM' USING PigStorage(',') AS (year:int, month:int,
    days:int);
    grunt> ACCT = LOAD 'ACCT' USING PigStorage(',') AS (year:int, month:int,
    day: int, account:chararray, metric1:double, metric2:double);
    grunt> AjD = JOIN ACCT BY (year,month), DIM BY (year,month) USING
    'replicated';
    grunt> dump AjD;
    ...
    (2011,7,26,key1,23.25,2470.0,2011,7,31)
    (2011,7,26,key2,10.416666666666668,232274.08333333334,2011,7,31)
    (2011,7,26,key3,82.83333333333333,541377.25,2011,7,31)
    (2011,7,26,key4,78.5,492823.33333333326,2011,7,31)
    (2011,7,26,key5,110.83333333333334,729811.9166666667,2011,7,31)
    (2011,7,26,key6,102.16666666666666,675941.25,2011,7,31)
    (2011,7,26,key7,118.91666666666666,770896.75,2011,7,31)
    grunt> describe AjD;
    AjD: {ACCT::year: int,ACCT::month: int,ACCT::day: int,ACCT::account:
    chararray,ACCT::metric1: double,ACCT::metric2: double,DIM::year:
    int,DIM::month: int,DIM::days: int}

    grunt> FINAL = FOREACH AjD
    GENERATE ACCT.year, ACCT.month, ACCT.account, (ACCT.metric2 / DIM.days);
    grunt> dump FINAL;
    ...
    ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open
    iterator
    for alias FINAL. Backend error : Scalar has more than one row in the
    output.
    1st : (2011,7,26,key1,23.25,2470.0), 2nd
    :(2011,7,26,key2,10.416666666666668,232274.08333333334)

    However if I store it and reload it to shed the "join" schema:

    grunt> STORE AjD INTO 'AjD' using PigStorage(',');
    grunt> AjD2 = LOAD 'AjD' USING PigStorage(',') AS (year:int, month:int,
    day:int, account:chararray, metric1:double, metric2:double, year2:int,
    month2:int, days:int);

    grunt> FINAL = FOREACH AjD2
    GENERATE year, month, account, (metric2 /days);
    grunt> dump FINAL;
    ...
    (2011,7,key1,79.6774193548387)
    (2011,7,key2,7492.712365591398)
    (2011,7,key3,17463.782258064515)
    (2011,7,key4,15897.526881720427)
    (2011,7,key5,23542.319892473122)
    (2011,7,key6,21804.5564516129)
    (2011,7,key7,24867.637096774193)

    What am I missing to make this work without storing and reloading?

    Thanks,
    Rob
  • Raghu Angadi at Jul 29, 2011 at 6:01 pm
    GENERATE ACCT.year, ACCT.month, ACCT.account, (ACCT.metric2 / DIM.days);
    should be GENERATE ACCT::year, ACCT::month ... etc.

    this is a common mistake to use '.' instead of '::'.. I wish the error
    message is more user friendly..
    PIG supports 'scalars' and assumes your ACCT would be a single row table at
    runtime when access fields like ACCT.year.

    On Fri, Jul 29, 2011 at 9:47 AM, rob parker wrote:

    Trying to join two sets and generate a set from the join and I am getting a


    $ hadoop fs -cat DIM/\*
    2011,01,31
    2011,02,28
    2011,03,31
    2011,04,30
    2011,05,31
    2011,06,30
    2011,07,31
    2011,08,31
    2011,09,30
    2011,10,31
    2011,11,30
    2011,12,31


    $ hadoop fs -cat ACCT/\*
    2011,7,26,key1,23.25,2470.0
    2011,7,26,key2,10.416666666666668,232274.08333333334
    2011,7,26,key3,82.83333333333333,541377.25
    2011,7,26,key4,78.5,492823.33333333326
    2011,7,26,key5,110.83333333333334,729811.9166666667
    2011,7,26,key6,102.16666666666666,675941.25
    2011,7,26,key7,118.91666666666666,770896.75


    grunt> DIM = LOAD 'DIM' USING PigStorage(',') AS (year:int, month:int,
    days:int);
    grunt> ACCT = LOAD 'ACCT' USING PigStorage(',') AS (year:int, month:int,
    day: int, account:chararray, metric1:double, metric2:double);
    grunt> AjD = JOIN ACCT BY (year,month), DIM BY (year,month) USING
    'replicated';
    grunt> dump AjD;
    ...
    (2011,7,26,key1,23.25,2470.0,2011,7,31)
    (2011,7,26,key2,10.416666666666668,232274.08333333334,2011,7,31)
    (2011,7,26,key3,82.83333333333333,541377.25,2011,7,31)
    (2011,7,26,key4,78.5,492823.33333333326,2011,7,31)
    (2011,7,26,key5,110.83333333333334,729811.9166666667,2011,7,31)
    (2011,7,26,key6,102.16666666666666,675941.25,2011,7,31)
    (2011,7,26,key7,118.91666666666666,770896.75,2011,7,31)
    grunt> describe AjD;
    AjD: {ACCT::year: int,ACCT::month: int,ACCT::day: int,ACCT::account:
    chararray,ACCT::metric1: double,ACCT::metric2: double,DIM::year:
    int,DIM::month: int,DIM::days: int}

    grunt> FINAL = FOREACH AjD
    GENERATE ACCT.year, ACCT.month, ACCT.account, (ACCT.metric2 / DIM.days);
    grunt> dump FINAL;
    ...
    ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open
    iterator
    for alias FINAL. Backend error : Scalar has more than one row in the
    output.
    1st : (2011,7,26,key1,23.25,2470.0), 2nd
    :(2011,7,26,key2,10.416666666666668,232274.08333333334)

    However if I store it and reload it to shed the "join" schema:

    grunt> STORE AjD INTO 'AjD' using PigStorage(',');
    grunt> AjD2 = LOAD 'AjD' USING PigStorage(',') AS (year:int, month:int,
    day:int, account:chararray, metric1:double, metric2:double, year2:int,
    month2:int, days:int);

    grunt> FINAL = FOREACH AjD2
    GENERATE year, month, account, (metric2 /days);
    grunt> dump FINAL;
    ...
    (2011,7,key1,79.6774193548387)
    (2011,7,key2,7492.712365591398)
    (2011,7,key3,17463.782258064515)
    (2011,7,key4,15897.526881720427)
    (2011,7,key5,23542.319892473122)
    (2011,7,key6,21804.5564516129)
    (2011,7,key7,24867.637096774193)

    What am I missing to make this work without storing and reloading?

    Thanks,
    Rob
  • Raghu Angadi at Jul 29, 2011 at 6:07 pm
    Is implicit scalar conversion going to stay in PIG? My preference would to
    make it explicit like SCALAR(ACCT.year)..
    On Fri, Jul 29, 2011 at 11:00 AM, Raghu Angadi wrote:

    GENERATE ACCT.year, ACCT.month, ACCT.account, (ACCT.metric2 / DIM.days);
    should be GENERATE ACCT::year, ACCT::month ... etc.

    this is a common mistake to use '.' instead of '::'.. I wish the error
    message is more user friendly..
    PIG supports 'scalars' and assumes your ACCT would be a single row table at
    runtime when access fields like ACCT.year.

    On Fri, Jul 29, 2011 at 9:47 AM, rob parker wrote:

    Trying to join two sets and generate a set from the join and I am getting
    a


    $ hadoop fs -cat DIM/\*
    2011,01,31
    2011,02,28
    2011,03,31
    2011,04,30
    2011,05,31
    2011,06,30
    2011,07,31
    2011,08,31
    2011,09,30
    2011,10,31
    2011,11,30
    2011,12,31


    $ hadoop fs -cat ACCT/\*
    2011,7,26,key1,23.25,2470.0
    2011,7,26,key2,10.416666666666668,232274.08333333334
    2011,7,26,key3,82.83333333333333,541377.25
    2011,7,26,key4,78.5,492823.33333333326
    2011,7,26,key5,110.83333333333334,729811.9166666667
    2011,7,26,key6,102.16666666666666,675941.25
    2011,7,26,key7,118.91666666666666,770896.75


    grunt> DIM = LOAD 'DIM' USING PigStorage(',') AS (year:int, month:int,
    days:int);
    grunt> ACCT = LOAD 'ACCT' USING PigStorage(',') AS (year:int, month:int,
    day: int, account:chararray, metric1:double, metric2:double);
    grunt> AjD = JOIN ACCT BY (year,month), DIM BY (year,month) USING
    'replicated';
    grunt> dump AjD;
    ...
    (2011,7,26,key1,23.25,2470.0,2011,7,31)
    (2011,7,26,key2,10.416666666666668,232274.08333333334,2011,7,31)
    (2011,7,26,key3,82.83333333333333,541377.25,2011,7,31)
    (2011,7,26,key4,78.5,492823.33333333326,2011,7,31)
    (2011,7,26,key5,110.83333333333334,729811.9166666667,2011,7,31)
    (2011,7,26,key6,102.16666666666666,675941.25,2011,7,31)
    (2011,7,26,key7,118.91666666666666,770896.75,2011,7,31)
    grunt> describe AjD;
    AjD: {ACCT::year: int,ACCT::month: int,ACCT::day: int,ACCT::account:
    chararray,ACCT::metric1: double,ACCT::metric2: double,DIM::year:
    int,DIM::month: int,DIM::days: int}

    grunt> FINAL = FOREACH AjD
    GENERATE ACCT.year, ACCT.month, ACCT.account, (ACCT.metric2 /
    DIM.days);
    grunt> dump FINAL;
    ...
    ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open
    iterator
    for alias FINAL. Backend error : Scalar has more than one row in the
    output.
    1st : (2011,7,26,key1,23.25,2470.0), 2nd
    :(2011,7,26,key2,10.416666666666668,232274.08333333334)

    However if I store it and reload it to shed the "join" schema:

    grunt> STORE AjD INTO 'AjD' using PigStorage(',');
    grunt> AjD2 = LOAD 'AjD' USING PigStorage(',') AS (year:int, month:int,
    day:int, account:chararray, metric1:double, metric2:double, year2:int,
    month2:int, days:int);

    grunt> FINAL = FOREACH AjD2
    GENERATE year, month, account, (metric2 /days);
    grunt> dump FINAL;
    ...
    (2011,7,key1,79.6774193548387)
    (2011,7,key2,7492.712365591398)
    (2011,7,key3,17463.782258064515)
    (2011,7,key4,15897.526881720427)
    (2011,7,key5,23542.319892473122)
    (2011,7,key6,21804.5564516129)
    (2011,7,key7,24867.637096774193)

    What am I missing to make this work without storing and reloading?

    Thanks,
    Rob
  • Rob parker at Jul 29, 2011 at 7:40 pm
    Thanks!

    Rob
    On Fri, Jul 29, 2011 at 11:07 AM, Raghu Angadi wrote:

    Is implicit scalar conversion going to stay in PIG? My preference would to
    make it explicit like SCALAR(ACCT.year)..
    On Fri, Jul 29, 2011 at 11:00 AM, Raghu Angadi wrote:

    GENERATE ACCT.year, ACCT.month, ACCT.account, (ACCT.metric2 /
    DIM.days);
    should be GENERATE ACCT::year, ACCT::month ... etc.

    this is a common mistake to use '.' instead of '::'.. I wish the error
    message is more user friendly..
    PIG supports 'scalars' and assumes your ACCT would be a single row table at
    runtime when access fields like ACCT.year.


    On Fri, Jul 29, 2011 at 9:47 AM, rob parker <rob.s.parker@gmail.com
    wrote:
    Trying to join two sets and generate a set from the join and I am
    getting
    a


    $ hadoop fs -cat DIM/\*
    2011,01,31
    2011,02,28
    2011,03,31
    2011,04,30
    2011,05,31
    2011,06,30
    2011,07,31
    2011,08,31
    2011,09,30
    2011,10,31
    2011,11,30
    2011,12,31


    $ hadoop fs -cat ACCT/\*
    2011,7,26,key1,23.25,2470.0
    2011,7,26,key2,10.416666666666668,232274.08333333334
    2011,7,26,key3,82.83333333333333,541377.25
    2011,7,26,key4,78.5,492823.33333333326
    2011,7,26,key5,110.83333333333334,729811.9166666667
    2011,7,26,key6,102.16666666666666,675941.25
    2011,7,26,key7,118.91666666666666,770896.75


    grunt> DIM = LOAD 'DIM' USING PigStorage(',') AS (year:int, month:int,
    days:int);
    grunt> ACCT = LOAD 'ACCT' USING PigStorage(',') AS (year:int, month:int,
    day: int, account:chararray, metric1:double, metric2:double);
    grunt> AjD = JOIN ACCT BY (year,month), DIM BY (year,month) USING
    'replicated';
    grunt> dump AjD;
    ...
    (2011,7,26,key1,23.25,2470.0,2011,7,31)
    (2011,7,26,key2,10.416666666666668,232274.08333333334,2011,7,31)
    (2011,7,26,key3,82.83333333333333,541377.25,2011,7,31)
    (2011,7,26,key4,78.5,492823.33333333326,2011,7,31)
    (2011,7,26,key5,110.83333333333334,729811.9166666667,2011,7,31)
    (2011,7,26,key6,102.16666666666666,675941.25,2011,7,31)
    (2011,7,26,key7,118.91666666666666,770896.75,2011,7,31)
    grunt> describe AjD;
    AjD: {ACCT::year: int,ACCT::month: int,ACCT::day: int,ACCT::account:
    chararray,ACCT::metric1: double,ACCT::metric2: double,DIM::year:
    int,DIM::month: int,DIM::days: int}

    grunt> FINAL = FOREACH AjD
    GENERATE ACCT.year, ACCT.month, ACCT.account, (ACCT.metric2 /
    DIM.days);
    grunt> dump FINAL;
    ...
    ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open
    iterator
    for alias FINAL. Backend error : Scalar has more than one row in the
    output.
    1st : (2011,7,26,key1,23.25,2470.0), 2nd
    :(2011,7,26,key2,10.416666666666668,232274.08333333334)

    However if I store it and reload it to shed the "join" schema:

    grunt> STORE AjD INTO 'AjD' using PigStorage(',');
    grunt> AjD2 = LOAD 'AjD' USING PigStorage(',') AS (year:int, month:int,
    day:int, account:chararray, metric1:double, metric2:double, year2:int,
    month2:int, days:int);

    grunt> FINAL = FOREACH AjD2
    GENERATE year, month, account, (metric2 /days);
    grunt> dump FINAL;
    ...
    (2011,7,key1,79.6774193548387)
    (2011,7,key2,7492.712365591398)
    (2011,7,key3,17463.782258064515)
    (2011,7,key4,15897.526881720427)
    (2011,7,key5,23542.319892473122)
    (2011,7,key6,21804.5564516129)
    (2011,7,key7,24867.637096774193)

    What am I missing to make this work without storing and reloading?

    Thanks,
    Rob
  • Thejas Nair at Jul 29, 2011 at 8:32 pm

    On 7/29/11 11:07 AM, Raghu Angadi wrote:
    Is implicit scalar conversion going to stay in PIG? My preference would to
    make it explicit like SCALAR(ACCT.year)..
    That's my preference as well , you can vote/ submit a patch ! -
    https://issues.apache.org/jira/browse/PIG-1967

    -Thejas
  • Thejas Nair at Jul 29, 2011 at 8:32 pm

    On 7/29/11 9:47 AM, rob parker wrote:

    grunt> describe AjD;
    AjD: {ACCT::year: int,ACCT::month: int,ACCT::day: int,ACCT::account:
    chararray,ACCT::metric1: double,ACCT::metric2: double,DIM::year:
    int,DIM::month: int,DIM::days: int}

    grunt> FINAL = FOREACH AjD
    GENERATE ACCT.year, ACCT.month, ACCT.account, (ACCT.metric2 / DIM.days);
    grunt> dump FINAL;
    ...
    ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator
    for alias FINAL. Backend error : Scalar has more than one row in the output.
    1st : (2011,7,26,key1,23.25,2470.0), 2nd
    :(2011,7,26,key2,10.416666666666668,232274.08333333334)
    The column in AjD schema is ACCT::year, not ACCT.year (note the "::" vs
    ".").

    So you need to change it to -
    grunt> FINAL = FOREACH AjD
    GENERATE ACCT::year, ACCT::month, ACCT::account, (ACCT::metric2 /
    DIM::days);

    You are accidentally using the relation-as-scalar feature here.
    http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#Casting+Relations+to+Scalars
    There is a jira open to deprecate the syntax for this feature so that
    users don't accidentally end up using it -
    https://issues.apache.org/jira/browse/PIG-1967

    -Thejas

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupuser @
categoriespig, hadoop
postedJul 29, '11 at 4:47p
activeJul 29, '11 at 8:32p
posts7
users4
websitepig.apache.org

People

Translate

site design / logo © 2021 Grokbase