Grokbase Groups Pig user January 2012
Hi,

I've just hit a bug that's present in all versions of Pig that I've
tested. If I generate multiple relations from different projections of
the same grouped input, then union them together and do another group
with a composite key, the local rearrange step chooses the wrong
fields to group by. Versions 0.8.1 and 0.9.1 generate incorrect
output; trunk crashes with a "duplicate uid in schema" error. I
encountered the problem in a fairly complex script, but managed to
boil it down to the following test case:

---- bug.pig

a = LOAD 'bug.in' AS (x:int, y:chararray, z:chararray);

SPLIT a INTO a1 IF x==1, a2 IF x==2, a3 IF x==3;

grouped = COGROUP a1 BY y, a2 BY y, a3 BY y;
projected = FOREACH grouped GENERATE a1.z AS z1, a2.z AS z2, a3.z AS z3;

b1 = FOREACH projected GENERATE FLATTEN(z1) AS first, FLATTEN(z2) AS second;
b2 = FOREACH projected GENERATE FLATTEN(z2) AS first, FLATTEN(z3) AS second;

c = UNION b1, b2;
-- results are as expected until this point
d = GROUP c BY (first,second);
STORE d INTO 'bug.out';

---- Input (bug.in; fields are tab-separated):

1 foo line1
2 foo line2
3 foo line3
3 foo line4

---- Expected output:

(line1,line2) {(line1,line2)}
(line2,line3) {(line2,line3)}
(line2,line4) {(line2,line4)}

---- Actual output from 0.8/0.9
---- (notice that the group is being done on (first,first) instead of (first,second)):

(line1,line1) {(line1,line2)}
(line2,line2) {(line2,line3),(line2,line4)}
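
As a cross-check on the two outputs above, here is a minimal Python sketch that models the intended semantics of the script. It is plain Python, not Pig, and it says nothing about Pig's internals; the helper names `union_of_projections` and `group_by` are invented for illustration. Grouping the unioned tuples on (first, second) gives the expected output, while grouping on (first, first) reproduces what 0.8/0.9 print.

```python
from collections import defaultdict
from itertools import product

# Rows of bug.in as (x, y, z) tuples.
ROWS = [(1, "foo", "line1"), (2, "foo", "line2"),
        (3, "foo", "line3"), (3, "foo", "line4")]

def union_of_projections(rows):
    """Model SPLIT, COGROUP, the two FLATTEN projections, and UNION."""
    # SPLIT + COGROUP BY y: for each y, one bag of z values per x.
    bags = defaultdict(lambda: {1: [], 2: [], 3: []})
    for x, y, z in rows:
        if x in (1, 2, 3):
            bags[y][x].append(z)
    c = []
    for y in sorted(bags):
        z1, z2, z3 = bags[y][1], bags[y][2], bags[y][3]
        # Two FLATTENs in one GENERATE yield the cross product of the bags.
        c.extend(product(z1, z2))  # b1: (first, second) taken from (z1, z2)
        c.extend(product(z2, z3))  # b2: (first, second) taken from (z2, z3)
    return c

def group_by(tuples, key):
    """Group tuples into lists under the key computed by `key`."""
    groups = defaultdict(list)
    for t in tuples:
        groups[key(t)].append(t)
    return dict(groups)

c = union_of_projections(ROWS)
expected = group_by(c, key=lambda t: (t[0], t[1]))  # GROUP c BY (first,second)
observed = group_by(c, key=lambda t: (t[0], t[0]))  # key Pig 0.8/0.9 appears to use

for name, result in (("expected", expected), ("observed", observed)):
    print(name)
    for k in sorted(result):
        print(" ", k, sorted(result[k]))
```

The model only covers this test case: it assumes an inner cross product per group, so a group with an empty bag contributes no rows.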

---- Stack trace from trunk:

2012-01-09 13:25:55,230 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: COGROUP,GROUP_BY,UNION
2012-01-09 13:25:55,258 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2270: Logical plan invalid state: duplicate uid in schema : first#298:chararray,second#298:chararray
2012-01-09 13:25:55,258 [main] ERROR org.apache.pig.tools.grunt.Grunt - org.apache.pig.impl.logicalLayer.FrontendException: ERROR 2000: Error processing rule LoadTypeCastInserter
    at org.apache.pig.newplan.optimizer.PlanOptimizer.optimize(PlanOptimizer.java:122)
    at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:287)
    at org.apache.pig.PigServer.compilePp(PigServer.java:1317)
    at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1254)
    at org.apache.pig.PigServer.execute(PigServer.java:1246)
    at org.apache.pig.PigServer.executeBatch(PigServer.java:362)
    at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:131)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:192)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:164)
    at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
    at org.apache.pig.Main.run(Main.java:589)
    at org.apache.pig.Main.main(Main.java:148)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
Caused by: org.apache.pig.impl.plan.PlanValidationException: ERROR 2270: Logical plan invalid state: duplicate uid in schema : first#298:chararray,second#298:chararray
    at org.apache.pig.newplan.logical.optimizer.SchemaResetter.validate(SchemaResetter.java:225)
    at org.apache.pig.newplan.logical.optimizer.SchemaResetter.visit(SchemaResetter.java:160)
    at org.apache.pig.newplan.logical.relational.LOUnion.accept(LOUnion.java:182)
    at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
    at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50)
    at org.apache.pig.newplan.logical.optimizer.SchemaPatcher.transformed(SchemaPatcher.java:43)
    at org.apache.pig.newplan.optimizer.PlanOptimizer.optimize(PlanOptimizer.java:113)
    ... 16 more

It's possible to work around the problem by performing multiple JOINs
instead of a single COGROUP and multiple FLATTENs, but the resulting
plan uses more MapReduce jobs and does a lot of redundant work.
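
To illustrate why a JOIN-based rewrite yields the same rows, here is a minimal Python sketch. It is a model in plain Python, not Pig, and it assumes the workaround joins a1 with a2 and a2 with a3 on y, which is one plausible reading of "multiple JOINs"; the helpers `split_on_x` and `inner_join` are hypothetical names. An inner join on y pairs every matching left row with every matching right row, which is the same cross product that COGROUP followed by two FLATTENs produces per group.

```python
from collections import defaultdict

# Rows of bug.in as (x, y, z) tuples.
ROWS = [(1, "foo", "line1"), (2, "foo", "line2"),
        (3, "foo", "line3"), (3, "foo", "line4")]

def split_on_x(rows, x_value):
    """Model one branch of SPLIT: keep (y, z) for rows with the given x."""
    return [(y, z) for x, y, z in rows if x == x_value]

def inner_join(left, right):
    """Inner join of (y, z) rows on y, keeping only the two z columns."""
    index = defaultdict(list)
    for y, z in right:
        index[y].append(z)
    return [(lz, rz) for y, lz in left for rz in index.get(y, [])]

a1, a2, a3 = (split_on_x(ROWS, x) for x in (1, 2, 3))

# Assumed workaround: one join per pair of relations instead of one COGROUP.
b1 = inner_join(a1, a2)  # (first, second) from (a1.z, a2.z)
b2 = inner_join(a2, a3)  # (first, second) from (a2.z, a3.z)
c = b1 + b2              # UNION b1, b2

print(c)
```

In this model a2 takes part in both joins, which is consistent with the redundant work noted above.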

Is this a known issue or limitation? (I searched JIRA and the list
archives, but didn't see anything that looked relevant.) If not, I'll
open an issue.

Thanks,
-- David

  • David Wahler at Jan 10, 2012 at 11:12 pm
    I simplified the test case some more and filed a bug report.

    https://issues.apache.org/jira/browse/PIG-2465
Discussion Overview
group: user @
categories: pig, hadoop
posted: Jan 9, '12 at 7:49p
active: Jan 10, '12 at 11:12p
posts: 2
users: 1
website: pig.apache.org