Pig dev mailing list, October 2009
We ran into what looks like an edge-case bug in Pig that causes it
to throw an IndexOutOfBoundsException (stack trace below). The script
just performs a couple of joins; it looks like our data was generated
incorrectly and the join is empty, which may be what's causing the
failure. It also appears to happen only when at least one of the
inputs is on the large side (at least a few hundred megs). Any ideas
on what could be happening and how to zero in on the underlying cause?
We are running off unmodified trunk.

Script:

register datagen.jar;
E = load 'Employee' using org.apache.pig.test.utils.datagen.PigPerformanceLoader()
    as (id, name, cc, dc);
D = load 'Department' using org.apache.pig.test.utils.datagen.PigPerformanceLoader()
    as (dept_id, dept_nm);
P = load 'Project' using org.apache.pig.test.utils.datagen.PigPerformanceLoader()
    as (id, emp_id, role);
R1 = JOIN E by dc, D by dept_id;
R2 = JOIN R1 by E::id, P by emp_id;
store R2 into 'TestCase2Output';

The R2 join fails with the stack trace below. It also fails if we
pre-calculate R1, store it, and load it back directly (that is, load R1,
load P, join R1 by $0, P by emp_id). We've verified that the records in
R1 and R2 have the expected fields, etc.
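
Roughly, that pre-computed variant looks like the sketch below (the
intermediate path and the use of the default PigStorage for the
store/load round trip are placeholders, not the exact script we ran):

R1 = load 'R1Out';   -- placeholder path for the stored intermediate result
P  = load 'Project' using org.apache.pig.test.utils.datagen.PigPerformanceLoader()
     as (id, emp_id, role);
R2 = JOIN R1 by $0, P by emp_id;   -- fails the same way as the two-join version
store R2 into 'TestCase2Output';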


Stack Trace:

java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
at java.util.ArrayList.RangeCheck(ArrayList.java:547)
at java.util.ArrayList.get(ArrayList.java:322)
at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:143)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:148)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:226)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:260)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POUnion.getNext(POUnion.java:162)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:249)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)

  • Alan Gates at Oct 13, 2009 at 8:02 pm
    Have you checked that each record in your input data has at least the
    number of fields you specify? Have you checked that the field
    separator in your data matches the default for PigPerformanceLoader
    (^A, I think)?

    Alan.
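
    A minimal way to run the first check in Pig itself (a sketch, not from
    the original message; it assumes that records the loader parses with
    too few fields come back with null trailing columns rather than
    raising an error):

    E       = load 'Employee' using org.apache.pig.test.utils.datagen.PigPerformanceLoader()
              as (id, name, cc, dc);
    -- any record parsed with fewer than four fields will have a null join key here
    short_E = filter E by dc is null;
    dump short_E;
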
  • Dmitriy Ryaboy at Oct 19, 2009 at 4:56 pm
    Yes and yes. In any case, the latest from SVN doesn't have this
    issue; guessing it was 921 that did it.

    -D

Discussion Overview

group: dev
categories: pig, hadoop
posted: Oct 13, '09 at 5:28p
active: Oct 19, '09 at 4:56p
posts: 3
users: 2 (Dmitriy Ryaboy: 2 posts, Alan Gates: 1 post)
website: pig.apache.org
