Grokbase Groups Pig user July 2010
I am running my Pig scripts on our QA cluster (with 4 datanodes, see below),
which has the Cloudera CDH2 release installed and a global heap max of -Xmx4096m.
I am constantly getting OutOfMemory errors (see below) on my map and reduce
jobs when I try to run my script against large data, where it produces around
600 maps.
Looking for some tips on the best configuration for Pig to get rid of
these errors. Thanks.



Error: GC overhead limit exceeded
Error: java.lang.OutOfMemoryError: Java heap space

Regards
Syed


  • Ashutosh Chauhan at Jul 8, 2010 at 12:52 am
    Syed,

    One-line stack traces aren't much help :) Please provide the full stack
    trace and the Pig script which produced it, and we can take a look.

    Ashutosh
  • Syed Wasti at Jul 8, 2010 at 8:43 pm
    Sorry about the delay, I was held up with different things.
    Here are the script and the errors below:

    AA = LOAD 'table1' USING PigStorage('\t') as
    (ID,b,c,d,e,f,g,h,i,j,k,l,m,n,o);

    AB = FOREACH AA GENERATE ID, e, f, n,o;

    AC = FILTER AB BY o == 1;

    AD = GROUP AC BY (ID, b);

    AE = FOREACH AD { A = DISTINCT AC.d;
    GENERATE group.ID, (chararray) 'S' AS type, group.b, (int)
    COUNT_STAR(filt) AS cnt, (int) COUNT(A) AS cnt_distinct; }

    The same steps are repeated to load 5 different tables and then a UNION is
    done on them.

    Final_res = UNION AE, AF, AG, AH, AI;

    The actual number of columns will be 15; here I am showing it with one table.

    Final_table = FOREACH Final_res GENERATE ID,
    (type == 'S' AND b == 1?cnt:0) AS 12_tmp,
    (type == 'S' AND b == 2?cnt:0) AS 13_tmp,
    (type == 'S' AND b == 1?cnt_distinct:0) AS 12_distinct_tmp,
    (type == 'S' AND b == 2?cnt_distinct:0) AS 13_distinct_tmp;

    It works fine until here; it is only after adding this last part of the
    query that it starts throwing heap errors.

    grp_id = GROUP Final_table BY ID;

    Final_data = FOREACH grp_id GENERATE group AS ID,
    SUM(Final_table.12_tmp), SUM(Final_table.13_tmp),
    SUM(Final_table.12_distinct_tmp), SUM(Final_table.13_distinct_tmp);

    STORE Final_data;


    Error: java.lang.OutOfMemoryError: Java heap space
    at java.util.ArrayList.<init>(ArrayList.java:112)
    at org.apache.pig.data.DefaultTuple.<init>(DefaultTuple.java:63)
    at org.apache.pig.data.DefaultTupleFactory.newTuple(DefaultTupleFactory.java:35)
    at org.apache.pig.data.DataReaderWriter.bytesToTuple(DataReaderWriter.java:55)
    at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
    at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:130)
    at org.apache.pig.data.DefaultTuple.readFields(DefaultTuple.java:289)
    at org.apache.pig.impl.io.PigNullableWritable.readFields(PigNullableWritable.java:114)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
    at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)
    at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:175)
    at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1217)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1227)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:648)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1135)


    Error: java.lang.OutOfMemoryError: Java heap space
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCombinerPackage.createDataBag(POCombinerPackage.java:139)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCombinerPackage.getNext(POCombinerPackage.java:148)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:168)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:159)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:50)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
    at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1217)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1227)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:648)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1135)


    Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.AbstractList.iterator(AbstractList.java:273)
    at org.apache.pig.data.DefaultTuple.getMemorySize(DefaultTuple.java:185)
    at org.apache.pig.data.InternalCachedBag.add(InternalCachedBag.java:89)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCombinerPackage.getNext(POCombinerPackage.java:168)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:168)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:159)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:50)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
    at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1217)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1227)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:648)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1135)


    Error: GC overhead limit exceeded
    -------
    Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.apache.pig.data.DefaultTupleFactory.newTuple(DefaultTupleFactory.java:35)
    at org.apache.pig.data.DataReaderWriter.bytesToTuple(DataReaderWriter.java:55)
    at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
    at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:130)
    at org.apache.pig.data.DefaultTuple.readFields(DefaultTuple.java:289)
    at org.apache.pig.impl.io.PigNullableWritable.readFields(PigNullableWritable.java:114)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
    at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)
    at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:175)
    at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1217)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1227)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:648)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1135)


  • Ashutosh Chauhan at Jul 8, 2010 at 9:01 pm
    Syed,

    You are likely hit by https://issues.apache.org/jira/browse/PIG-1442 .
    Your query and stack trace look very similar to the one in the JIRA
    ticket. This may get fixed in the 0.8 release.

    Ashutosh
  • Syed Wasti at Jul 8, 2010 at 10:49 pm
    Thanks Ashutosh. Is there any workaround for this? Will increasing the heap
    size help?

  • Ashutosh Chauhan at Jul 9, 2010 at 12:47 am
    I would recommend the following things, in order:

    1) Increasing the heap size should help.
    2) It seems you are on 0.7. There are a couple of memory fixes we have
    committed both on the 0.7 branch and on trunk. Those should help as
    well. So, build Pig from either trunk or the 0.7 branch and use that.
    3) Only if these don't help, try tuning the parameter
    pig.cachedbag.memusage. By default it is set to 0.1; lowering it
    should help. Try 0.05, then 0.02, and then further down. The downside is
    that the lower you go, the slower your query will run.

    Let us know if these changes get your query to completion.

    Ashutosh
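
    For 1), the OutOfMemoryErrors above are thrown inside the map tasks while they
    sort and combine their output, so the heap that matters is the child task heap
    (mapred.child.java.opts) rather than the heap of the client JVM that launches Pig.
    A minimal sketch of raising it from the Pig command line, assuming -D properties
    are forwarded into the job configuration the same way the pig.* property is in the
    next message; the 2048m figure is only an illustration, not a value recommended
    anywhere in this thread:

    pig -Dmapred.child.java.opts=-Xmx2048m myscript.pig

    The same property can instead be set cluster-wide in mapred-site.xml. Either way,
    the value must leave room for the several task JVMs that run concurrently on each
    datanode.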
  • Ashutosh Chauhan at Jul 9, 2010 at 12:59 am
    Ah, I forgot to mention how to set that parameter in 3). When launching
    Pig, pass it as a -D command-line switch, as follows:
    pig -Dpig.cachedbag.memusage=0.02f myscript.pig

  • Syed Wasti at Jul 9, 2010 at 7:51 pm
    Hi Ashutosh,
    I did not try options 2 and 3; I shall work on those sometime next week.
    Increasing the heap size did not help by itself initially; with the increased heap
    size I also came up with a UDF to do the SUM on the grouped data for the last
    step in my script, and it now completes my query without any errors.

    Syed

  • Ashutosh Chauhan at Jul 9, 2010 at 9:34 pm
    Hi Syed,

    Do you mean your query fails with an OOME if you use Pig's builtin SUM,
    but succeeds if you use your own SUM UDF? If that is so, that's
    interesting. I have a hunch why that might be the case, but would like to
    confirm. Would you mind sharing your SUM UDF?

    Ashutosh
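
    Syed's UDF itself never appears in this thread. Purely as an illustration, here is
    a minimal sketch of what a bag-summing EvalFunc typically looks like against the
    Pig UDF API of this era; the package, class name, and the assumption that each
    projected column arrives as a bag of single-field numeric tuples are illustrative
    guesses, not taken from Syed's code:

    // Hypothetical sketch -- not the UDF used by Syed.
    package example.udf;

    import java.io.IOException;

    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.DataBag;
    import org.apache.pig.data.Tuple;

    public class LongSumBag extends EvalFunc<Long> {
        @Override
        public Long exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0 || input.get(0) == null) {
                return null;                             // nothing to sum
            }
            DataBag bag = (DataBag) input.get(0);        // the projected column arrives as a bag
            long sum = 0L;
            for (Tuple t : bag) {                        // stream over the bag rather than copying it
                Object value = t.get(0);
                if (value != null) {
                    sum += ((Number) value).longValue(); // the script casts these fields to int
                }
            }
            return sum;
        }
    }

    It would be wired in with something like DEFINE SUM_BAG example.udf.LongSumBag();
    and SUM_BAG(Final_table.12_tmp) in the final FOREACH. A plain EvalFunc such as this
    is not Algebraic, so Pig does not run the combiner for it; since the traces above
    die inside PigCombiner, that difference may be part of why replacing the builtin
    SUM changed the behaviour, though that is only a guess.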
    On Fri, Jul 9, 2010 at 12:50, Syed Wasti wrote:
    Hi Ashutosh,
    Did not try option 2 and 3, I shall work sometime next week on that.
    But increasing the heap size did not help initially, with the increased heap
    size I came up with a UDF to do the SUM on the grouped data for the last
    step in my script and it completes my query without any errors now.

    Syed

    On 7/8/10 5:58 PM, "Ashutosh Chauhan" wrote:

    Aah.. forgot to tell how to set that param  in 3). While launching
    pig, provide it as -D cmd line switch, as follows:
    pig -Dpig.cachedbag.memusage=0.02f myscript.pig

    On Thu, Jul 8, 2010 at 17:45, Ashutosh Chauhan
    wrote:
    I will recommend following things in the order:

    1) Increasing heap size should help.
    2) It seems you are on 0.7. There are couple of memory fixes we have
    committed both on 0.7 branch as well as on trunk. Those should help as
    well. So, build Pig either from trunk or 0.7 branch and use that.
    3) Only if these dont help, you can try tuning the param
    pig.cachedbag.memusage. By default, it is set at 0.1, lowering it
    should help. Try with 0.05, 0.02 and then further down. Downside is,
    as you go lower and lower, it will make your query go slower.

    Let us know if these changes get your query to completion.

    Ashutosh
    On Thu, Jul 8, 2010 at 15:48, Syed Wasti wrote:
    Thanks Ashutosh, is there any workaround for this, will increasing the heap
    size help ?

    On 7/8/10 1:59 PM, "Ashutosh Chauhan" wrote:

    Syed,

    You are likely hit by https://issues.apache.org/jira/browse/PIG-1442 .
    Your query and stacktrace look very similar to the one in the jira
    ticket. This may get fixed by 0.8 release.

    Ashutosh
    On Thu, Jul 8, 2010 at 13:42, Syed Wasti wrote:
    Sorry about the delay, was held with different things.
    Here is the script and the errors below;

    AA = LOAD 'table1' USING PigStorage('\t') as
    (ID,b,c,d,e,f,g,h,i,j,k,l,m,n,o);

    AB = FOREACH AA GENERATE ID, e, f, n,o;

    AC = FILTER AB BY o == 1;

    AD = GROUP AC BY (ID, b);

    AE = FOREACH AD { A = DISTINCT AC.d;
    GENERATE group.ID, (chararray) 'S' AS type, group.b, (int)
    COUNT_STAR(filt) AS cnt, (int) COUNT(A) AS cnt_distinct; }

    The same steps are repeated to load 5 different tables and then a UNION is
    done on them.

    Final_res = UNION AE, AF, AG, AH, AI;

    The actual number of columns will be 15 here I am showing with one table.

    Final_table =   FOREACH Final_res GENERATE ID,
    (type == 'S' AND b == 1?cnt:0) AS 12_tmp,
    (type == 'S' AND b == 2?cnt:0) AS 13_tmp,
    (type == 'S' AND b == 1?cnt_distinct:0) AS 12_distinct_tmp,
    (type == 'S' AND b == 2?cnt_distinct:0) AS 13_distinct_tmp;

    It works fine until here, it is only after adding this last part of the
    query it starts throwing heap errors.

    grp_id =    GROUP Final_table BY ID;

    Final_data = FOREACH grp_reg_id GENERATE group AS ID
    SUM(Final_table.12_tmp), SUM(Final_table.13_tmp),
    SUM(Final_table.12_distinct_tmp), SUM(Final_table.13_distinct_tmp);

    STORE Final_data;


    Error: java.lang.OutOfMemoryError: Java heap space
    at java.util.ArrayList.<init>(ArrayList.java:112)
    at org.apache.pig.data.DefaultTuple.<init>(DefaultTuple.java:63)
    at org.apache.pig.data.DefaultTupleFactory.newTuple(DefaultTupleFactory.java:35)
    at org.apache.pig.data.DataReaderWriter.bytesToTuple(DataReaderWriter.java:55)
    at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
    at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:130)
    at org.apache.pig.data.DefaultTuple.readFields(DefaultTuple.java:289)
    at org.apache.pig.impl.io.PigNullableWritable.readFields(PigNullableWritable.java:114)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
    at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)
    at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:175)
    at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1217)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1227)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:648)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1135)


    Error: java.lang.OutOfMemoryError: Java heap space
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCombinerPackage.createDataBag(POCombinerPackage.java:139)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCombinerPackage.getNext(POCombinerPackage.java:148)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:168)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:159)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:50)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
    at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1217)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1227)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:648)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1135)


    Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.AbstractList.iterator(AbstractList.java:273)
    at org.apache.pig.data.DefaultTuple.getMemorySize(DefaultTuple.java:185)
    at org.apache.pig.data.InternalCachedBag.add(InternalCachedBag.java:89)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCombinerPackage.getNext(POCombinerPackage.java:168)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:168)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:159)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:50)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
    at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1217)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1227)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:648)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1135)


    Error: GC overhead limit exceeded
    -------
    Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.apache.pig.data.DefaultTupleFactory.newTuple(DefaultTupleFactory.java:35)
    at org.apache.pig.data.DataReaderWriter.bytesToTuple(DataReaderWriter.java:55)
    at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
    at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:130)
    at org.apache.pig.data.DefaultTuple.readFields(DefaultTuple.java:289)
    at org.apache.pig.impl.io.PigNullableWritable.readFields(PigNullableWritable.java:114)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
    at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)
    at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:175)
    at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1217)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1227)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:648)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1135)


    On 7/7/10 5:50 PM, "Ashutosh Chauhan" wrote:

    Syed,

    One line stack traces arent much helpful :) Please provide the full stack
    trace and the pig script which produced it and we can take a look.

    Ashutosh
    On Wed, Jul 7, 2010 at 14:09, Syed Wasti wrote:


    I am running my Pig scripts on our QA cluster (with 4 datanoes, see
    blelow)
    and has Cloudera CDH2 release installed and global heap max is
    ­Xmx4096m.I
    am
    constantly getting OutOfMemory errors (see below) on my map and reduce
    jobs, when I try run my script against large data where it produces
    around
    600 maps.
    Looking for some tips on the best configuration for pig and to get rid
    of
    these errors. Thanks.



    Error: GC overhead limit exceededError: java.lang.OutOfMemoryError: Java
    heap space

    Regards
    Syed
  • Syed Wasti at Jul 9, 2010 at 11:02 pm
    Yes Ashutosh, that is the case, and here is the code for the UDF. Let me know
    what you find.

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.pig.EvalFunc;
    import org.apache.pig.backend.executionengine.ExecException;
    import org.apache.pig.data.BagFactory;
    import org.apache.pig.data.DataBag;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;
    import org.apache.pig.impl.logicalLayer.schema.Schema;

    public class GroupSum extends EvalFunc<DataBag> {
        TupleFactory mTupleFactory;
        BagFactory mBagFactory;

        public GroupSum() {
            this.mTupleFactory = TupleFactory.getInstance();
            this.mBagFactory = BagFactory.getInstance();
        }

        public DataBag exec(Tuple input) throws IOException {
            if (input.size() != 1) {
                int errCode = 2107;
                String msg = "GroupSum expects one input but received "
                        + input.size()
                        + " inputs. \n";
                throw new ExecException(msg, errCode);
            }
            try {
                DataBag output = this.mBagFactory.newDefaultBag();
                Object o1 = input.get(0);
                if (o1 instanceof DataBag) {
                    DataBag bag1 = (DataBag) o1;
                    // A single-tuple bag is already "summed"; return it as is.
                    if (bag1.size() == 1L) {
                        return bag1;
                    }
                    sumBag(bag1, output);
                }
                return output;
            } catch (ExecException ee) {
                throw ee;
            }
        }

        private void sumBag(DataBag o1, DataBag emitTo) throws IOException {
            Iterator<?> i1 = o1.iterator();
            Tuple row = null;
            Tuple firstRow = null;

            int fld1 = 0, fld2 = 0, fld3 = 0, fld4 = 0, fld5 = 0;
            int cnt = 0;
            // Accumulate fields 1-5 across every tuple in the bag.
            while (i1.hasNext()) {
                row = (Tuple) i1.next();
                if (cnt == 0) {
                    firstRow = row;
                }
                fld1 += (Integer) row.get(1);
                fld2 += (Integer) row.get(2);
                fld3 += (Integer) row.get(3);
                fld4 += (Integer) row.get(4);
                fld5 += (Integer) row.get(5);
                cnt++;
            }
            // Field 0 has the id in it; reuse the first row to carry the sums.
            firstRow.set(1, fld1);
            firstRow.set(2, fld2);
            firstRow.set(3, fld3);
            firstRow.set(4, fld4);
            firstRow.set(5, fld5);
            emitTo.add(firstRow);
        }

        public Schema outputSchema(Schema input) {
            try {
                Schema tupleSchema = new Schema();
                tupleSchema.add(input.getField(0));
                tupleSchema.setTwoLevelAccessRequired(true);
                return tupleSchema;
            } catch (Exception e) {
            }
            return null;
        }
    }
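
    For reference, this is roughly how the UDF replaces the builtin SUM in the
    last step of the script. The jar name, the package (myudfs), and the output
    path are placeholders for wherever the class and data actually live:

    REGISTER groupsum.jar;
    grp_id = GROUP Final_table BY ID;
    -- GroupSum returns a one-tuple bag per group: field 0 carries the ID,
    -- the remaining fields carry the per-group sums.
    Final_data = FOREACH grp_id GENERATE FLATTEN(myudfs.GroupSum(Final_table));
    STORE Final_data INTO 'output';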

    On 7/9/10 2:32 PM, "Ashutosh Chauhan" wrote:

    Hi Syed,

    Do you mean your query fails with OOME if you use Pig's builtin SUM,
    but succeeds if you use your own SUM UDF? If that is so, that's
    interesting. I have a hunch why that is the case, but would like to
    confirm. Would you mind sharing your SUM UDF?

    Ashutosh
    On Fri, Jul 9, 2010 at 12:50, Syed Wasti wrote:
    Hi Ashutosh,
    Did not try option 2 and 3, I shall work sometime next week on that.
    But increasing the heap size did not help initially, with the increased heap
    size I came up with a UDF to do the SUM on the grouped data for the last
    step in my script and it completes my query without any errors now.

    Syed

    On 7/8/10 5:58 PM, "Ashutosh Chauhan" wrote:

    Aah.. forgot to tell how to set that param  in 3). While launching
    pig, provide it as -D cmd line switch, as follows:
    pig -Dpig.cachedbag.memusage=0.02f myscript.pig

    On Thu, Jul 8, 2010 at 17:45, Ashutosh Chauhan
    wrote:
    I will recommend following things in the order:

    1) Increasing heap size should help.
    2) It seems you are on 0.7. There are couple of memory fixes we have
    committed both on 0.7 branch as well as on trunk. Those should help as
    well. So, build Pig either from trunk or 0.7 branch and use that.
    3) Only if these dont help, you can try tuning the param
    pig.cachedbag.memusage. By default, it is set at 0.1, lowering it
    should help. Try with 0.05, 0.02 and then further down. Downside is,
    as you go lower and lower, it will make your query go slower.

    Let us know if these changes get your query to completion.

    Ashutosh
    On Thu, Jul 8, 2010 at 15:48, Syed Wasti wrote:
    Thanks Ashutosh, is there any workaround for this, will increasing the
    heap
    size help ?

    On 7/8/10 1:59 PM, "Ashutosh Chauhan" wrote:

    Syed,

    You are likely hit by https://issues.apache.org/jira/browse/PIG-1442 .
    Your query and stacktrace look very similar to the one in the jira
    ticket. This may get fixed by 0.8 release.

    Ashutosh
    On Thu, Jul 8, 2010 at 13:42, Syed Wasti wrote:
    Sorry about the delay, was held with different things.
    Here is the script and the errors below;

    AA = LOAD 'table1' USING PigStorage('\t') as
    (ID,b,c,d,e,f,g,h,i,j,k,l,m,n,o);

    AB = FOREACH AA GENERATE ID, e, f, n,o;

    AC = FILTER AB BY o == 1;

    AD = GROUP AC BY (ID, b);

    AE = FOREACH AD { A = DISTINCT AC.d;
    GENERATE group.ID, (chararray) 'S' AS type, group.b, (int)
    COUNT_STAR(filt) AS cnt, (int) COUNT(A) AS cnt_distinct; }

    The same steps are repeated to load 5 different tables and then a UNION
    is
    done on them.

    Final_res = UNION AE, AF, AG, AH, AI;

    The actual number of columns will be 15 here I am showing with one
    table.

    Final_table =   FOREACH Final_res GENERATE ID,
    (type == 'S' AND b == 1?cnt:0) AS 12_tmp,
    (type == 'S' AND b == 2?cnt:0) AS 13_tmp,
    (type == 'S' AND b == 1?cnt_distinct:0) AS
    12_distinct_tmp,
    (type == 'S' AND b == 2?cnt_distinct:0) AS
    13_distinct_tmp;

    It works fine until here, it is only after adding this last part of the
    query it starts throwing heap errors.

    grp_id =    GROUP Final_table BY ID;

    Final_data = FOREACH grp_reg_id GENERATE group AS ID
    SUM(Final_table.12_tmp), SUM(Final_table.13_tmp),
    SUM(Final_table.12_distinct_tmp), SUM(Final_table.13_distinct_tmp);

    STORE Final_data;


    Error: java.lang.OutOfMemoryError: Java heap space
    at java.util.ArrayList.<init>(ArrayList.java:112)
    at org.apache.pig.data.DefaultTuple.<init>(DefaultTuple.java:63)
    at org.apache.pig.data.DefaultTupleFactory.newTuple(DefaultTupleFactory.java:35)
    at org.apache.pig.data.DataReaderWriter.bytesToTuple(DataReaderWriter.java:55)
    at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
    at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:130)
    at org.apache.pig.data.DefaultTuple.readFields(DefaultTuple.java:289)
    at org.apache.pig.impl.io.PigNullableWritable.readFields(PigNullableWritable.java:114)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
    at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)
    at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:175)
    at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1217)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1227)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:648)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1135)


    Error: java.lang.OutOfMemoryError: Java heap space
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCombinerPackage.createDataBag(POCombinerPackage.java:139)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCombinerPackage.getNext(POCombinerPackage.java:148)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:168)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:159)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:50)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
    at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1217)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1227)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:648)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1135)


    Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.AbstractList.iterator(AbstractList.java:273)
    at org.apache.pig.data.DefaultTuple.getMemorySize(DefaultTuple.java:185)
    at org.apache.pig.data.InternalCachedBag.add(InternalCachedBag.java:89)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCombinerPackage.getNext(POCombinerPackage.java:168)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:168)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:159)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:50)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
    at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1217)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1227)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:648)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1135)


    Error: GC overhead limit exceeded
    -------
    Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.apache.pig.data.DefaultTupleFactory.newTuple(DefaultTupleFactory.java:35)
    at org.apache.pig.data.DataReaderWriter.bytesToTuple(DataReaderWriter.java:55)
    at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
    at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:130)
    at org.apache.pig.data.DefaultTuple.readFields(DefaultTuple.java:289)
    at org.apache.pig.impl.io.PigNullableWritable.readFields(PigNullableWritable.java:114)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
    at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)
    at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:175)
    at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1217)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1227)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:648)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1135)



    On 7/7/10 5:50 PM, "Ashutosh Chauhan" <ashutosh.chauhan@gmail.com>
    wrote:
    Syed,

    One line stack traces arent much helpful :) Please provide the full
    stack
    trace and the pig script which produced it and we can take a look.

    Ashutosh
    On Wed, Jul 7, 2010 at 14:09, Syed Wasti wrote:


    I am running my Pig scripts on our QA cluster (with 4 datanoes, see
    blelow)
    and has Cloudera CDH2 release installed and global heap max is
    ­Xmx4096m.I
    am
    constantly getting OutOfMemory errors (see below) on my map and reduce
    jobs, when I try run my script against large data where it produces
    around
    600 maps.
    Looking for some tips on the best configuration for pig and to get rid
    of
    these errors. Thanks.



    Error: GC overhead limit exceededError: java.lang.OutOfMemoryError:
    Java
    heap space

    Regards
    Syed
  • Thejas M Nair at Jul 23, 2010 at 8:16 pm
    Hi Syed,
    I think the problem you faced is the same as the one in the newly created jira - https://issues.apache.org/jira/browse/PIG-1516 .

    As a workaround, you can disable the combiner (see the above jira); a command-line sketch is below. This is what you have done indirectly, by using a new SUM UDF that does not implement the algebraic interface.
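
    For example, assuming the pig.exec.nocombiner property discussed in that
    jira, turning the combiner off for a whole script would look something like
    this (the script name is a placeholder):

    pig -Dpig.exec.nocombiner=true myscript.pig
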
    I will be submitting a patch soon for the 0.8 release.

    -Thejas


    On 7/9/10 4:01 PM, "Syed Wasti" wrote:

    Yes Ashutosh, that is the case and here the code for the UDF. Let me know
    what you find.

    public class GroupSum extends EvalFunc<DataBag> {
    TupleFactory mTupleFactory;
    BagFactory mBagFactory;

    public GroupSum() {
    this.mTupleFactory = TupleFactory.getInstance();
    this.mBagFactory = BagFactory.getInstance();
    }

    public DataBag exec(Tuple input) throws IOException {
    if (input.size() < 0) {
    int errCode = 2107;
    String msg = "GroupSum expects one input but received "
    + input.size()
    + " inputs. \n";
    throw new ExecException(msg, errCode);
    }
    try {
    DataBag output = this.mBagFactory.newDefaultBag();
    Object o1 = input.get(0);
    if (o1 instanceof DataBag) {
    DataBag bag1 = (DataBag) o1;
    if (bag1.size() == 1L) {
    return bag1;
    }
    sumBag(bag1, output);
    }
    return output;
    } catch (ExecException ee) {
    throw ee;
    }
    }

    private void sumBag(DataBag o1, DataBag emitTo) throws IOException {
    Iterator<?> i1 = o1.iterator();
    Tuple row = null;
    Tuple firstRow = null;;

    int fld1 = 0, fld2 = 0, fld3 = 0, fld4 = 0, fld5 = 0;
    int cnt = 0;
    while (i1.hasNext()) {
    row = (Tuple) i1.next();
    if (cnt == 0) {
    firstRow = row;
    }
    fld1 += (Integer) row.get(1);
    fld2 += (Integer) row.get(2);
    fld3 += (Integer) row.get(3);
    fld4 += (Integer) row.get(4);
    fld5 += (Integer) row.get(5);
    cnt ++;
    }
    //field 0 has the id in it.
    firstRow.set(1, fld1);
    firstRow.set(2, fld2);
    firstRow.set(3, fld3);
    firstRow.set(4, fld4);
    firstRow.set(5, fld5);
    emitTo.add(firstRow);
    }

    public Schema outputSchema(Schema input) {
    try {
    Schema tupleSchema = new Schema();
    tupleSchema.add(input.getField(0));
    tupleSchema.setTwoLevelAccessRequired(true);
    return tupleSchema;
    } catch (Exception e) {
    }
    return null;
    }
    }

    On 7/9/10 2:32 PM, "Ashutosh Chauhan" wrote:

    Hi Syed,

    Do you mean your query fails with OOME if you use Pig's builtin SUM,
    but succeeds if you use your own SUM UDF? If that is so, thats
    interesting. I have a hunch, why that is the case, but would like to
    confirm. Would you mind sharing your SUM UDF.

    Ashutosh
    On Fri, Jul 9, 2010 at 12:50, Syed Wasti wrote:
    Hi Ashutosh,
    Did not try option 2 and 3, I shall work sometime next week on that.
    But increasing the heap size did not help initially, with the increased heap
    size I came up with a UDF to do the SUM on the grouped data for the last
    step in my script and it completes my query without any errors now.

    Syed

    On 7/8/10 5:58 PM, "Ashutosh Chauhan" wrote:

    Aah.. forgot to tell how to set that param in 3). While launching
    pig, provide it as -D cmd line switch, as follows:
    pig -Dpig.cachedbag.memusage=0.02f myscript.pig

    On Thu, Jul 8, 2010 at 17:45, Ashutosh Chauhan
    wrote:
    I will recommend following things in the order:

    1) Increasing heap size should help.
    2) It seems you are on 0.7. There are couple of memory fixes we have
    committed both on 0.7 branch as well as on trunk. Those should help as
    well. So, build Pig either from trunk or 0.7 branch and use that.
    3) Only if these dont help, you can try tuning the param
    pig.cachedbag.memusage. By default, it is set at 0.1, lowering it
    should help. Try with 0.05, 0.02 and then further down. Downside is,
    as you go lower and lower, it will make your query go slower.

    Let us know if these changes get your query to completion.

    Ashutosh
    On Thu, Jul 8, 2010 at 15:48, Syed Wasti wrote:
    Thanks Ashutosh, is there any workaround for this, will increasing the
    heap
    size help ?

    On 7/8/10 1:59 PM, "Ashutosh Chauhan" wrote:

    Syed,

    You are likely hit by https://issues.apache.org/jira/browse/PIG-1442 .
    Your query and stacktrace look very similar to the one in the jira
    ticket. This may get fixed by 0.8 release.

    Ashutosh
    On Thu, Jul 8, 2010 at 13:42, Syed Wasti wrote:
    Sorry about the delay, was held with different things.
    Here is the script and the errors below;

    AA = LOAD 'table1' USING PigStorage('\t') as
    (ID,b,c,d,e,f,g,h,i,j,k,l,m,n,o);

    AB = FOREACH AA GENERATE ID, e, f, n,o;

    AC = FILTER AB BY o == 1;

    AD = GROUP AC BY (ID, b);

    AE = FOREACH AD { A = DISTINCT AC.d;
    GENERATE group.ID, (chararray) 'S' AS type, group.b, (int)
    COUNT_STAR(filt) AS cnt, (int) COUNT(A) AS cnt_distinct; }

    The same steps are repeated to load 5 different tables and then a UNION
    is
    done on them.

    Final_res = UNION AE, AF, AG, AH, AI;

    The actual number of columns will be 15 here I am showing with one
    table.

    Final_table = FOREACH Final_res GENERATE ID,
    (type == 'S' AND b == 1?cnt:0) AS 12_tmp,
    (type == 'S' AND b == 2?cnt:0) AS 13_tmp,
    (type == 'S' AND b == 1?cnt_distinct:0) AS
    12_distinct_tmp,
    (type == 'S' AND b == 2?cnt_distinct:0) AS
    13_distinct_tmp;

    It works fine until here, it is only after adding this last part of the
    query it starts throwing heap errors.

    grp_id = GROUP Final_table BY ID;

    Final_data = FOREACH grp_reg_id GENERATE group AS ID
    SUM(Final_table.12_tmp), SUM(Final_table.13_tmp),
    SUM(Final_table.12_distinct_tmp), SUM(Final_table.13_distinct_tmp);

    STORE Final_data;


    Error: java.lang.OutOfMemoryError: Java heap space
    at java.util.ArrayList.<init>(ArrayList.java:112)
    at org.apache.pig.data.DefaultTuple.<init>(DefaultTuple.java:63)
    at org.apache.pig.data.DefaultTupleFactory.newTuple(DefaultTupleFactory.java:35)
    at org.apache.pig.data.DataReaderWriter.bytesToTuple(DataReaderWriter.java:55)
    at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
    at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:130)
    at org.apache.pig.data.DefaultTuple.readFields(DefaultTuple.java:289)
    at org.apache.pig.impl.io.PigNullableWritable.readFields(PigNullableWritable.java:114)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
    at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)
    at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:175)
    at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1217)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1227)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:648)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1135)


    Error: java.lang.OutOfMemoryError: Java heap space
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCombinerPackage.createDataBag(POCombinerPackage.java:139)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCombinerPackage.getNext(POCombinerPackage.java:148)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:168)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:159)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:50)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
    at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1217)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1227)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:648)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1135)


    Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.AbstractList.iterator(AbstractList.java:273)
    at org.apache.pig.data.DefaultTuple.getMemorySize(DefaultTuple.java:185)
    at org.apache.pig.data.InternalCachedBag.add(InternalCachedBag.java:89)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCombinerPackage.getNext(POCombinerPackage.java:168)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:168)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:159)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:50)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
    at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1217)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1227)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:648)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1135)


    Error: GC overhead limit exceeded
    -------
    Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.apache.pig.data.DefaultTupleFactory.newTuple(DefaultTupleFactory.java:35)
    at org.apache.pig.data.DataReaderWriter.bytesToTuple(DataReaderWriter.java:55)
    at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
    at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:130)
    at org.apache.pig.data.DefaultTuple.readFields(DefaultTuple.java:289)
    at org.apache.pig.impl.io.PigNullableWritable.readFields(PigNullableWritable.java:114)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
    at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)
    at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:175)
    at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1217)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1227)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:648)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1135)



    On 7/7/10 5:50 PM, "Ashutosh Chauhan" <ashutosh.chauhan@gmail.com>
    wrote:
    Syed,

    One line stack traces arent much helpful :) Please provide the full
    stack
    trace and the pig script which produced it and we can take a look.

    Ashutosh
    On Wed, Jul 7, 2010 at 14:09, Syed Wasti wrote:


    I am running my Pig scripts on our QA cluster (with 4 datanoes, see
    blelow)
    and has Cloudera CDH2 release installed and global heap max is
    -Xmx4096m.I
    am
    constantly getting OutOfMemory errors (see below) on my map and reduce
    jobs, when I try run my script against large data where it produces
    around
    600 maps.
    Looking for some tips on the best configuration for pig and to get rid
    of
    these errors. Thanks.



    Error: GC overhead limit exceededError: java.lang.OutOfMemoryError:
    Java
    heap space

    Regards
    Syed
  • Syed Wasti at Jul 28, 2010 at 6:28 am
    Thank you Thejas for the response.
    I want to share my feedback after trying all the recommended options.
    I tried increasing the heap size, built Pig from the trunk, and disabled the combiner by setting the property you recommended. None of this worked and I am still seeing the same errors; the only thing that works for me is the UDF I created.
    Another case where it errors out with "Error: GC overhead limit exceeded" is in the reduce jobs while they are copying map outputs. A job just hangs there for a long time (over 30 mins) and finally errors out.
    I tried changing some parameters which I thought were related, but that didn't help. Do you think this is related to the newly created jira, or would you recommend any properties that I should try?

    If it helps, I am pasting the stack traces of my map task failures when running the script with the combiner disabled. Thanks.

    Regards
    Syed Wasti
    Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.ArrayList.<init>(ArrayList.java:112)
    at org.apache.pig.data.DefaultTuple.<init>(DefaultTuple.java:60)
    at org.apache.pig.data.BinSedesTuple.<init>(BinSedesTuple.java:66)
    at org.apache.pig.data.BinSedesTupleFactory.newTuple(BinSedesTupleFactory.java:37)
    at org.apache.pig.data.BinInterSedes.readTuple(BinInterSedes.java:100)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:267)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:250)
    at org.apache.pig.data.BinInterSedes.addColsToTuple(BinInterSedes.java:568)
    at org.apache.pig.data.BinSedesTuple.readFields(BinSedesTuple.java:48)
    at org.apache.pig.impl.io.PigNullableWritable.readFields(PigNullableWritable.java:114)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
    at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)
    at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:175)
    at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1222)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1265)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:686)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1173)


    Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.AbstractList.iterator(AbstractList.java:273)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.processInput(POUserFunc.java:148)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:203)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:262)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:343)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:276)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:259)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:184)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:162)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:51)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
    at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1222)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1265)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:686)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1173)



    From: tejas@yahoo-inc.com
    To: pig-user@hadoop.apache.org; mdwasti@hotmail.com
    Date: Fri, 23 Jul 2010 15:15:24 -0500
    Subject: Re: Java heap error

    Hi Syed,
    I think the problem you faced is same as what is present in the newly created jira - https://issues.apache.org/jira/browse/PIG-1516 .

    As a workaround, you can disable the combiner (See above jira). This is what you have done indirectly, by using a new sum udf that does not implement the algebraic interface.
    I will be submitting a patch soon for the 0.8 release.

    -Thejas


    On 7/9/10 4:01 PM, "Syed Wasti" wrote:

    Yes Ashutosh, that is the case and here the code for the UDF. Let me know
    what you find.

    public class GroupSum extends EvalFunc<DataBag> {
    TupleFactory mTupleFactory;
    BagFactory mBagFactory;

    public GroupSum() {
    this.mTupleFactory = TupleFactory.getInstance();
    this.mBagFactory = BagFactory.getInstance();
    }

    public DataBag exec(Tuple input) throws IOException {
    if (input.size() < 0) {
    int errCode = 2107;
    String msg = "GroupSum expects one input but received "
    + input.size()
    + " inputs. \n";
    throw new ExecException(msg, errCode);
    }
    try {
    DataBag output = this.mBagFactory.newDefaultBag();
    Object o1 = input.get(0);
    if (o1 instanceof DataBag) {
    DataBag bag1 = (DataBag) o1;
    if (bag1.size() == 1L) {
    return bag1;
    }
    sumBag(bag1, output);
    }
    return output;
    } catch (ExecException ee) {
    throw ee;
    }
    }

    private void sumBag(DataBag o1, DataBag emitTo) throws IOException {
    Iterator<?> i1 = o1.iterator();
    Tuple row = null;
    Tuple firstRow = null;;

    int fld1 = 0, fld2 = 0, fld3 = 0, fld4 = 0, fld5 = 0;
    int cnt = 0;
    while (i1.hasNext()) {
    row = (Tuple) i1.next();
    if (cnt == 0) {
    firstRow = row;
    }
    fld1 += (Integer) row.get(1);
    fld2 += (Integer) row.get(2);
    fld3 += (Integer) row.get(3);
    fld4 += (Integer) row.get(4);
    fld5 += (Integer) row.get(5);
    cnt ++;
    }
    //field 0 has the id in it.
    firstRow.set(1, fld1);
    firstRow.set(2, fld2);
    firstRow.set(3, fld3);
    firstRow.set(4, fld4);
    firstRow.set(5, fld5);
    emitTo.add(firstRow);
    }

    public Schema outputSchema(Schema input) {
    try {
    Schema tupleSchema = new Schema();
    tupleSchema.add(input.getField(0));
    tupleSchema.setTwoLevelAccessRequired(true);
    return tupleSchema;
    } catch (Exception e) {
    }
    return null;
    }
    }

    On 7/9/10 2:32 PM, "Ashutosh Chauhan" wrote:

    Hi Syed,

    Do you mean your query fails with OOME if you use Pig's builtin SUM,
    but succeeds if you use your own SUM UDF? If that is so, thats
    interesting. I have a hunch, why that is the case, but would like to
    confirm. Would you mind sharing your SUM UDF.

    Ashutosh
    On Fri, Jul 9, 2010 at 12:50, Syed Wasti wrote:
    Hi Ashutosh,
    Did not try option 2 and 3, I shall work sometime next week on that.
    But increasing the heap size did not help initially, with the increased heap
    size I came up with a UDF to do the SUM on the grouped data for the last
    step in my script and it completes my query without any errors now.

    Syed

    On 7/8/10 5:58 PM, "Ashutosh Chauhan" wrote:

    Aah.. forgot to tell how to set that param in 3). While launching
    pig, provide it as -D cmd line switch, as follows:
    pig -Dpig.cachedbag.memusage=0.02f myscript.pig

    On Thu, Jul 8, 2010 at 17:45, Ashutosh Chauhan
    wrote:
    I will recommend following things in the order:

    1) Increasing heap size should help.
    2) It seems you are on 0.7. There are couple of memory fixes we have
    committed both on 0.7 branch as well as on trunk. Those should help as
    well. So, build Pig either from trunk or 0.7 branch and use that.
    3) Only if these dont help, you can try tuning the param
    pig.cachedbag.memusage. By default, it is set at 0.1, lowering it
    should help. Try with 0.05, 0.02 and then further down. Downside is,
    as you go lower and lower, it will make your query go slower.

    Let us know if these changes get your query to completion.

    Ashutosh
    On Thu, Jul 8, 2010 at 15:48, Syed Wasti wrote:
    Thanks Ashutosh, is there any workaround for this, will increasing the
    heap
    size help ?

    On 7/8/10 1:59 PM, "Ashutosh Chauhan" wrote:

    Syed,

    You are likely hit by https://issues.apache.org/jira/browse/PIG-1442 .
    Your query and stacktrace look very similar to the one in the jira
    ticket. This may get fixed by 0.8 release.

    Ashutosh
    On Thu, Jul 8, 2010 at 13:42, Syed Wasti wrote:
    Sorry about the delay, was held with different things.
    Here is the script and the errors below;

    AA = LOAD 'table1' USING PigStorage('\t') as
    (ID,b,c,d,e,f,g,h,i,j,k,l,m,n,o);

    AB = FOREACH AA GENERATE ID, e, f, n,o;

    AC = FILTER AB BY o == 1;

    AD = GROUP AC BY (ID, b);

    AE = FOREACH AD { A = DISTINCT AC.d;
    GENERATE group.ID, (chararray) 'S' AS type, group.b, (int)
    COUNT_STAR(filt) AS cnt, (int) COUNT(A) AS cnt_distinct; }

    The same steps are repeated to load 5 different tables and then a UNION
    is
    done on them.

    Final_res = UNION AE, AF, AG, AH, AI;

    The actual number of columns will be 15 here I am showing with one
    table.

    Final_table = FOREACH Final_res GENERATE ID,
    (type == 'S' AND b == 1?cnt:0) AS 12_tmp,
    (type == 'S' AND b == 2?cnt:0) AS 13_tmp,
    (type == 'S' AND b == 1?cnt_distinct:0) AS
    12_distinct_tmp,
    (type == 'S' AND b == 2?cnt_distinct:0) AS
    13_distinct_tmp;

    It works fine until here, it is only after adding this last part of the
    query it starts throwing heap errors.

    grp_id = GROUP Final_table BY ID;

    Final_data = FOREACH grp_reg_id GENERATE group AS ID
    SUM(Final_table.12_tmp), SUM(Final_table.13_tmp),
    SUM(Final_table.12_distinct_tmp), SUM(Final_table.13_distinct_tmp);

    STORE Final_data;


    Error: java.lang.OutOfMemoryError: Java heap space
    at java.util.ArrayList.<init>(ArrayList.java:112)
    at org.apache.pig.data.DefaultTuple.<init>(DefaultTuple.java:63)
    at org.apache.pig.data.DefaultTupleFactory.newTuple(DefaultTupleFactory.java:35)
    at org.apache.pig.data.DataReaderWriter.bytesToTuple(DataReaderWriter.java:55)
    at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
    at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:130)
    at org.apache.pig.data.DefaultTuple.readFields(DefaultTuple.java:289)
    at org.apache.pig.impl.io.PigNullableWritable.readFields(PigNullableWritable.java:114)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
    at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)
    at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:175)
    at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1217)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1227)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:648)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1135)


    Error: java.lang.OutOfMemoryError: Java heap space
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCombinerPackage.createDataBag(POCombinerPackage.java:139)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCombinerPackage.getNext(POCombinerPackage.java:148)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:168)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:159)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:50)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
    at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1217)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1227)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:648)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1135)


    Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.AbstractList.iterator(AbstractList.java:273)
    at org.apache.pig.data.DefaultTuple.getMemorySize(DefaultTuple.java:185)
    at org.apache.pig.data.InternalCachedBag.add(InternalCachedBag.java:89)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCombinerPackage.getNext(POCombinerPackage.java:168)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:168)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:159)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:50)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
    at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1217)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1227)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:648)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1135)


    Error: GC overhead limit exceeded
    -------
    Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.apache.pig.data.DefaultTupleFactory.newTuple(DefaultTupleFactory.java:35)
    at org.apache.pig.data.DataReaderWriter.bytesToTuple(DataReaderWriter.java:55)
    at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
    at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:130)
    at org.apache.pig.data.DefaultTuple.readFields(DefaultTuple.java:289)
    at org.apache.pig.impl.io.PigNullableWritable.readFields(PigNullableWritable.java:114)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
    at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)
    at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:175)
    at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1217)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1227)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:648)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1135)



    On 7/7/10 5:50 PM, "Ashutosh Chauhan" <ashutosh.chauhan@gmail.com>
    wrote:
    Syed,

    One line stack traces arent much helpful :) Please provide the full
    stack
    trace and the pig script which produced it and we can take a look.

    Ashutosh
    On Wed, Jul 7, 2010 at 14:09, Syed Wasti wrote:


    I am running my Pig scripts on our QA cluster (with 4 datanoes, see
    blelow)
    and has Cloudera CDH2 release installed and global heap max is
    -Xmx4096m.I
    am
    constantly getting OutOfMemory errors (see below) on my map and reduce
    jobs, when I try run my script against large data where it produces
    around
    600 maps.
    Looking for some tips on the best configuration for pig and to get rid
    of
    these errors. Thanks.



    Error: GC overhead limit exceededError: java.lang.OutOfMemoryError:
    Java
    heap space

    Regards
    Syed

  • Thejas M Nair at Jul 29, 2010 at 12:31 am
    From the 2nd stack trace it looks like the combiner did not get disabled. You can verify that by looking at the MapReduce plan in the explain output; a sketch of how to check is below.
    It looks like for some reason the system property 'pig.exec.nocombiner' is not getting set to 'true'.
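
    For instance (the script name is a placeholder, and Final_data is just the
    alias being stored in the earlier script), you could launch with the property
    on the command line and put an explain statement in the script:

    pig -Dpig.exec.nocombiner=true myscript.pig

    -- inside the script, before the STORE:
    explain Final_data;

    If the property took effect, the MapReduce plan printed by explain should no
    longer show a combine plan (the part of the plan with the POCombinerPackage
    operator from the stack traces).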

    Can you send the other Pig script that errors out with "Error: GC overhead limit exceeded"?

    -Thejas


    On 7/27/10 11:27 PM, "Syed Wasti" wrote:



    Thank you Thejas for the response.
    I want to share my feedback after trying all the recommended options.
    Tried Increasing the heap size, built pig from the trunk and disabled the combiner by setting the property you recommended. All this did not work and still seeing the same errors, only way which is working for me is using the UDF I created.
    Another case where its errors out with "Error: GC overhead limit exceeded" I noticed is in the recuded jobs when it is in the state of copying map outputs. It just hangs out there for a long time (over 30mins) and finally errors out.
    I tried changing some parameters which I thought should be related but didnt help. Do you think this should be related to the newly created jira or would you recommend any properties that I should try.

    If it helps, I am pasting the stack trace of my map job failures when running the script with disabled combiner. Thanks.

    Regards
    Syed Wasti
    Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.ArrayList.<init>(ArrayList.java:112)
    at org.apache.pig.data.DefaultTuple.<init>(DefaultTuple.java:60)
    at org.apache.pig.data.BinSedesTuple.<init>(BinSedesTuple.java:66)
    at org.apache.pig.data.BinSedesTupleFactory.newTuple(BinSedesTupleFactory.java:37)
    at org.apache.pig.data.BinInterSedes.readTuple(BinInterSedes.java:100)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:267)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:250)
    at org.apache.pig.data.BinInterSedes.addColsToTuple(BinInterSedes.java:568)
    at org.apache.pig.data.BinSedesTuple.readFields(BinSedesTuple.java:48)
    at org.apache.pig.impl.io.PigNullableWritable.readFields(PigNullableWritable.java:114)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
    at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)
    at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:175)
    at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1222)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1265)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:686)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1173)


    Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.AbstractList.iterator(AbstractList.java:273)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.processInput(POUserFunc.java:148)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:203)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:262)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:343)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:276)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:259)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:184)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:162)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:51)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
    at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1222)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1265)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:686)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1173)



    From: tejas@yahoo-inc.com
    To: pig-user@hadoop.apache.org; mdwasti@hotmail.com
    Date: Fri, 23 Jul 2010 15:15:24 -0500
    Subject: Re: Java heap error

    Hi Syed,
    I think the problem you faced is same as what is present in the newly created jira - https://issues.apache.org/jira/browse/PIG-1516 .

    As a workaround, you can disable the combiner (See above jira). This is what you have done indirectly, by using a new sum udf that does not implement the algebraic interface.
    I will be submitting a patch soon for the 0.8 release.

    -Thejas


    On 7/9/10 4:01 PM, "Syed Wasti" wrote:

    Yes Ashutosh, that is the case and here the code for the UDF. Let me know
    what you find.

    public class GroupSum extends EvalFunc<DataBag> {
    TupleFactory mTupleFactory;
    BagFactory mBagFactory;

    public GroupSum() {
    this.mTupleFactory = TupleFactory.getInstance();
    this.mBagFactory = BagFactory.getInstance();
    }

    public DataBag exec(Tuple input) throws IOException {
    if (input.size() < 0) {
    int errCode = 2107;
    String msg = "GroupSum expects one input but received "
    + input.size()
    + " inputs. \n";
    throw new ExecException(msg, errCode);
    }
    try {
    DataBag output = this.mBagFactory.newDefaultBag();
    Object o1 = input.get(0);
    if (o1 instanceof DataBag) {
    DataBag bag1 = (DataBag) o1;
    if (bag1.size() == 1L) {
    return bag1;
    }
    sumBag(bag1, output);
    }
    return output;
    } catch (ExecException ee) {
    throw ee;
    }
    }

    private void sumBag(DataBag o1, DataBag emitTo) throws IOException {
    Iterator<?> i1 = o1.iterator();
    Tuple row = null;
    Tuple firstRow = null;;

    int fld1 = 0, fld2 = 0, fld3 = 0, fld4 = 0, fld5 = 0;
    int cnt = 0;
    while (i1.hasNext()) {
    row = (Tuple) i1.next();
    if (cnt == 0) {
    firstRow = row;
    }
    fld1 += (Integer) row.get(1);
    fld2 += (Integer) row.get(2);
    fld3 += (Integer) row.get(3);
    fld4 += (Integer) row.get(4);
    fld5 += (Integer) row.get(5);
    cnt ++;
    }
    //field 0 has the id in it.
    firstRow.set(1, fld1);
    firstRow.set(2, fld2);
    firstRow.set(3, fld3);
    firstRow.set(4, fld4);
    firstRow.set(5, fld5);
    emitTo.add(firstRow);
    }

    public Schema outputSchema(Schema input) {
    try {
    Schema tupleSchema = new Schema();
    tupleSchema.add(input.getField(0));
    tupleSchema.setTwoLevelAccessRequired(true);
    return tupleSchema;
    } catch (Exception e) {
    }
    return null;
    }
    }

    On 7/9/10 2:32 PM, "Ashutosh Chauhan" wrote:

    Hi Syed,

    Do you mean your query fails with OOME if you use Pig's builtin SUM,
    but succeeds if you use your own SUM UDF? If that is so, thats
    interesting. I have a hunch, why that is the case, but would like to
    confirm. Would you mind sharing your SUM UDF.

    Ashutosh
    On Fri, Jul 9, 2010 at 12:50, Syed Wasti wrote:
    Hi Ashutosh,
    Did not try option 2 and 3, I shall work sometime next week on that.
    But increasing the heap size did not help initially, with the increased heap
    size I came up with a UDF to do the SUM on the grouped data for the last
    step in my script and it completes my query without any errors now.

    Syed

    On 7/8/10 5:58 PM, "Ashutosh Chauhan" wrote:

    Aah.. forgot to tell how to set that param in 3). While launching
    pig, provide it as -D cmd line switch, as follows:
    pig -Dpig.cachedbag.memusage=0.02f myscript.pig

    On Thu, Jul 8, 2010 at 17:45, Ashutosh Chauhan
    wrote:
    I will recommend the following things, in order:

    1) Increasing the heap size should help.
    2) It seems you are on 0.7. There are a couple of memory fixes we have
    committed both on the 0.7 branch as well as on trunk. Those should help as
    well. So, build Pig either from trunk or the 0.7 branch and use that.
    3) Only if these don't help, you can try tuning the param
    pig.cachedbag.memusage. By default, it is set at 0.1; lowering it
    should help. Try 0.05, then 0.02, and then further down. The downside is,
    as you go lower and lower, it will make your query go slower.

    Let us know if these changes get your query to completion.

    Ashutosh
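
    These same knobs can also be passed when Pig is embedded in Java rather than launched from the shell. A rough sketch under that assumption (Pig 0.7-era API; the script name is a placeholder, and the mapred.child.java.opts value is only an illustrative heap setting, not something prescribed in this thread):

    import java.util.Properties;
    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class TunedPigLaunch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Option 3 above: shrink the fraction of heap that cached bags may use.
            props.setProperty("pig.cachedbag.memusage", "0.02");
            // Option 1 above: give each map/reduce task a larger heap (value illustrative).
            props.setProperty("mapred.child.java.opts", "-Xmx2048m");
            PigServer pigServer = new PigServer(ExecType.MAPREDUCE, props);
            // Register the script, then read back or store an alias defined in it.
            pigServer.registerScript("myscript.pig");
        }
    }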
    On Thu, Jul 8, 2010 at 15:48, Syed Wasti wrote:
    Thanks Ashutosh, is there any workaround for this? Will increasing the heap
    size help?

    On 7/8/10 1:59 PM, "Ashutosh Chauhan" wrote:

    Syed,

    You are likely hit by https://issues.apache.org/jira/browse/PIG-1442 .
    Your query and stacktrace look very similar to the one in the jira
    ticket. This may get fixed by the 0.8 release.

    Ashutosh
    On Thu, Jul 8, 2010 at 13:42, Syed Wasti wrote:
    Sorry about the delay, was held with different things.
    Here is the script and the errors below;

    AA = LOAD 'table1' USING PigStorage('\t') as
    (ID,b,c,d,e,f,g,h,i,j,k,l,m,n,o);

    AB = FOREACH AA GENERATE ID, e, f, n,o;

    AC = FILTER AB BY o == 1;

    AD = GROUP AC BY (ID, b);

    AE = FOREACH AD { A = DISTINCT AC.d;
    GENERATE group.ID, (chararray) 'S' AS type, group.b, (int)
    COUNT_STAR(filt) AS cnt, (int) COUNT(A) AS cnt_distinct; }

    The same steps are repeated to load 5 different tables and then a UNION
    is
    done on them.

    Final_res = UNION AE, AF, AG, AH, AI;

    The actual number of columns will be 15 here I am showing with one
    table.

    Final_table = FOREACH Final_res GENERATE ID,
    (type == 'S' AND b == 1?cnt:0) AS 12_tmp,
    (type == 'S' AND b == 2?cnt:0) AS 13_tmp,
    (type == 'S' AND b == 1?cnt_distinct:0) AS
    12_distinct_tmp,
    (type == 'S' AND b == 2?cnt_distinct:0) AS
    13_distinct_tmp;

    It works fine until here, it is only after adding this last part of the
    query it starts throwing heap errors.

    grp_id = GROUP Final_table BY ID;

    Final_data = FOREACH grp_reg_id GENERATE group AS ID
    SUM(Final_table.12_tmp), SUM(Final_table.13_tmp),
    SUM(Final_table.12_distinct_tmp), SUM(Final_table.13_distinct_tmp);

    STORE Final_data;


    Error: java.lang.OutOfMemoryError: Java heap space
    at java.util.ArrayList.(ArrayList.java:112)
    at org.apache.pig.data.DefaultTuple.(DefaultTuple.java:63)
    at org.apache.pig.data.DefaultTupleFactory.newTuple(DefaultTupleFactory.java:35)
    at org.apache.pig.data.DataReaderWriter.bytesToTuple(DataReaderWriter.java:55)
    at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
    at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:130)
    at org.apache.pig.data.DefaultTuple.readFields(DefaultTuple.java:289)
    at org.apache.pig.impl.io.PigNullableWritable.readFields(PigNullableWritable.java:114)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
    at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)
    at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:175)
    at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1217)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1227)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:648)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1135)


    Error: java.lang.OutOfMemoryError: Java heap space
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCombinerPackage.createDataBag(POCombinerPackage.java:139)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCombinerPackage.getNext(POCombinerPackage.java:148)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:168)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:159)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:50)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
    at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1217)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1227)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:648)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1135)


    Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.AbstractList.iterator(AbstractList.java:273)
    at org.apache.pig.data.DefaultTuple.getMemorySize(DefaultTuple.java:185)
    at org.apache.pig.data.InternalCachedBag.add(InternalCachedBag.java:89)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCombinerPackage.getNext(POCombinerPackage.java:168)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:168)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:159)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:50)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
    at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1217)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1227)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:648)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1135)


    Error: GC overhead limit exceeded
    -------
    Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.apache.pig.data.DefaultTupleFactory.newTuple(DefaultTupleFactory.java:35)
    at org.apache.pig.data.DataReaderWriter.bytesToTuple(DataReaderWriter.java:55)
    at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
    at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:130)
    at org.apache.pig.data.DefaultTuple.readFields(DefaultTuple.java:289)
    at org.apache.pig.impl.io.PigNullableWritable.readFields(PigNullableWritable.java:114)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
    at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)
    at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:175)
    at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1217)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1227)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:648)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1135)



    On 7/7/10 5:50 PM, "Ashutosh Chauhan" <ashutosh.chauhan@gmail.com>
    wrote:
    Syed,

    One line stack traces arent much helpful :) Please provide the full
    stack
    trace and the pig script which produced it and we can take a look.

    Ashutosh
    On Wed, Jul 7, 2010 at 14:09, Syed Wasti wrote:


    I am running my Pig scripts on our QA cluster (with 4 datanoes, see
    blelow)
    and has Cloudera CDH2 release installed and global heap max is
    -Xmx4096m.I
    am
    constantly getting OutOfMemory errors (see below) on my map and reduce
    jobs, when I try run my script against large data where it produces
    around
    600 maps.
    Looking for some tips on the best configuration for pig and to get rid
    of
    these errors. Thanks.



    Error: GC overhead limit exceededError: java.lang.OutOfMemoryError:
    Java
    heap space

    Regards
    Syed

  • Syed Wasti at Jul 29, 2010 at 6:11 pm
    Hi Thejas,
    It is from the same script I shared earlier; I will paste it here again. The error I see is in the same map-reduce job, where it fails with the OOME.
    I have a similar script where I am calling MAX, MIN and SUM functions on the grouped data, and it fails with similar errors.

    AA = LOAD 'table1' USING PigStorage('\t') as
    (ID,b,c,d,e,f,g,h,i,j,k,l,m,n,o);

    AB = FOREACH AA GENERATE ID, e, f, n,o;

    AC = FILTER AB BY o == 1;

    AD = GROUP AC BY (ID, b);

    AE = FOREACH AD { A = DISTINCT AC.d;
    GENERATE group.ID, (chararray) 'S' AS type, group.b, (int)
    COUNT_STAR(filt) AS cnt, (int) COUNT(A) AS cnt_distinct; }

    The same steps are repeated to load 5 different tables and then a UNION is
    done on them.

    Final_res = UNION AE, AF, AG, AH, AI;

    The actual number of columns will be 15 here I am showing with one table.

    Final_table = FOREACH Final_res GENERATE ID,
    (type == 'S' AND b == 1?cnt:0) AS 12_tmp,
    (type == 'S' AND b == 2?cnt:0) AS 13_tmp,
    (type == 'S' AND b == 1?cnt_distinct:0) AS 12_distinct_tmp,
    (type == 'S' AND b == 2?cnt_distinct:0) AS 13_distinct_tmp;

    It works fine until here, it is only after adding this last part of the
    query it starts throwing heap errors.

    grp_id = GROUP Final_table BY ID;

    Final_data = FOREACH grp_reg_id GENERATE group AS ID
    SUM(Final_table.12_tmp), SUM(Final_table.13_tmp),
    SUM(Final_table.12_distinct_tmp), SUM(Final_table.13_distinct_tmp);

    STORE Final_data;

    Regards
    Syed Wasti




    From: tejas@yahoo-inc.com
    To: pig-user@hadoop.apache.org; mdwasti@hotmail.com
    Date: Wed, 28 Jul 2010 17:29:21 -0700
    Subject: Re: Java heap error






    From the 2nd stack trace it looks like the combiner did not get disabled. You can verify that by looking at the MapReduce plan in the explain output.
    It looks like for some reason the system property 'pig.exec.nocombiner' is not getting set to 'true'.



    Can you send the other pig script that errors out with "Error: GC overhead limit exceeded" ?



    -Thejas





    On 7/27/10 11:27 PM, "Syed Wasti" wrote:







    Thank you Thejas for the response.

    I want to share my feedback after trying all the recommended options.

    Tried increasing the heap size, built pig from the trunk, and disabled the combiner by setting the property you recommended. None of this worked; I am still seeing the same errors, and the only thing that works for me is using the UDF I created.

    Another case where it errors out with "Error: GC overhead limit exceeded" is in the reduce jobs, when they are in the state of copying map outputs. It just hangs there for a long time (over 30 mins) and finally errors out.

    I tried changing some parameters which I thought should be related, but that didn't help. Do you think this is related to the newly created jira, or would you recommend any properties that I should try?



    If it helps, I am pasting the stack trace of my map job failures when running the script with disabled combiner. Thanks.



    Regards

    Syed Wasti

    Error: java.lang.OutOfMemoryError: GC overhead limit exceeded

    at java.util.ArrayList.(ArrayList.java:112)

    at org.apache.pig.data.DefaultTuple.(DefaultTuple.java:60)

    at org.apache.pig.data.BinSedesTuple.(BinSedesTuple.java:66)

    at org.apache.pig.data.BinSedesTupleFactory.newTuple(BinSedesTupleFactory.java:37)

    at org.apache.pig.data.BinInterSedes.readTuple(BinInterSedes.java:100)

    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:267)

    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:250)

    at org.apache.pig.data.BinInterSedes.addColsToTuple(BinInterSedes.java:568)

    at org.apache.pig.data.BinSedesTuple.readFields(BinSedesTuple.java:48)

    at org.apache.pig.impl.io.PigNullableWritable.readFields(PigNullableWritable.java:114)

    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)

    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)

    at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)

    at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)

    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:175)

    at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1222)

    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1265)

    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:686)

    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1173)





    Error: java.lang.OutOfMemoryError: GC overhead limit exceeded

    at java.util.AbstractList.iterator(AbstractList.java:273)

    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.processInput(POUserFunc.java:148)

    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:203)

    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:262)

    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:343)

    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291)

    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:276)

    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:259)

    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:184)

    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:162)

    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:51)

    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)

    at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1222)

    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1265)

    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:686)

    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1173)








  • Thejas M Nair at Jul 29, 2010 at 7:40 pm
    Hi Syed,
    Disabling the combiner in the pig query should get this working.
    As I mentioned, it looks like the combiner is being used in your query. You can confirm that by running explain on your query and checking the MR plan. For some reason the system property 'pig.exec.nocombiner' is not getting set to 'true' in pig. Could it be a typo in the cmdline argument you are adding to disable it (-Dpig.exec.nocombiner=true)?
    -Thejas
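
    For anyone else hitting this: from grunt the check is simply "explain Final_data;". Embedded in Java it might look like the sketch below; the alias and script name are placeholders, and the script here is assumed to define the alias without STOREing it.

    import java.util.Properties;
    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class ExplainCheck {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.setProperty("pig.exec.nocombiner", "true");  // the property discussed above
            PigServer pigServer = new PigServer(ExecType.MAPREDUCE, props);
            pigServer.registerScript("myscript_without_store.pig");  // defines Final_data
            // If the property took effect, the MapReduce plan printed here should
            // show no Combine Plan section for the job in question.
            pigServer.explain("Final_data", System.out);
        }
    }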


  • ToddG at Jul 20, 2010 at 9:28 pm
    I'd like to include running various PIG scripts in my continuous build
    system. Of course, I'll only use small datasets for this, and in the
    beginning, I'll only target a local machine instance. However, this
    brings up several questions:


    Q: What's the best way to run Pig from Java? Here's what I'm doing,
    following a pattern I found in some of the pig tests:

    1. Create Pig resources in a base class (shamelessly copied from
    PigExecTestCase):

    protected MiniCluster cluster;
    protected PigServer pigServer;

    @Before
    public void setUp() throws Exception {

    String execTypeString = System.getProperty("test.exectype");
    if(execTypeString!=null && execTypeString.length()>0){
    execType = PigServer.parseExecType(execTypeString);
    }
    if(execType == MAPREDUCE) {
    cluster = MiniCluster.buildCluster();
    pigServer = new PigServer(MAPREDUCE, cluster.getProperties());
    } else {
    pigServer = new PigServer(LOCAL);
    }
    }

    2. Test classes sub class this to get access to the MiniCluster and
    PigServer (copied from TestPigSplit):

    @Test
    public void notestLongEvalSpec() throws Exception{
    inputFileName = "notestLongEvalSpec-input.txt";
    createInput(new String[] {"0\ta"});

    pigServer.registerQuery("a = load '" + inputFileName + "';");
    for (int i=0; i< 500; i++){
    pigServer.registerQuery("a = filter a by $0 == '1';");
    }
    Iterator<Tuple> iter = pigServer.openIterator("a");
    while (iter.hasNext()){
    throw new Exception();
    }
    }

    3. ERROR

    This pattern works for simple PIG directives, but when I want to load up
    entire pig scripts, which have REGISTER and DEFINE directives,
    pigServer.registerQuery() fails with:

    org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error
    during parsing. Unrecognized alias REGISTER
    at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1170)
    at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114)
    at org.apache.pig.PigServer.registerQuery(PigServer.java:425)
    at org.apache.pig.PigServer.registerQuery(PigServer.java:441)
    at
    com.audiencescience.apollo.reporting.NetworkRevenueReportTest.shouldParseNetworkRevenueReportScript(NetworkRevenueReportTest.java:74)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at
    sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at
    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

    Any suggestions?

    -Todd
  • Jeff Zhang at Jul 21, 2010 at 1:42 am
    Hi Todd,

    The method registerQuery cannot handle REGISTER and DEFINE statements. You
    should use the methods registerJar and registerFunction instead.

    Another way is to put your script in a file and then use registerScript to
    execute the pig script.
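
    A minimal sketch of that suggestion (the jar path, UDF class and script name are placeholders):

    import org.apache.pig.ExecType;
    import org.apache.pig.FuncSpec;
    import org.apache.pig.PigServer;

    public class RegisterExample {
        public static void main(String[] args) throws Exception {
            PigServer pigServer = new PigServer(ExecType.LOCAL);

            // Equivalent of: REGISTER my-udfs.jar;
            pigServer.registerJar("my-udfs.jar");

            // Equivalent of: DEFINE MyFunc com.example.MyFunc();
            pigServer.registerFunction("MyFunc", new FuncSpec("com.example.MyFunc"));

            // Or hand over a whole script file, which may itself contain REGISTER/DEFINE:
            pigServer.registerScript("myscript.pig");
        }
    }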




    --
    Best Regards

    Jeff Zhang
  • Corbin Hoenes at Jul 21, 2010 at 5:58 am
    Hey Todd, we run against entire pig scripts with some helper classes we built. Basically they preprocess the variables and then call registerScript; the test looks like this:

    @Before
    public void setUp() throws Exception {
    Helper.delete(OUT_FILE);
    runner = new PigRunner();
    }


    @Test
    public void testRecordCount() throws Exception {
    runner.execute("myscript.pig", "param1=foo","param2=bar");

    Iterator<Tuple> tuples = runner.getPigServer().openIterator("foo");
    assertEquals(41L, Helper.countTuples(tuples));
    }

    It's been very useful for us to test this way. Would love to see more chatter about other techniques.

    On Jul 20, 2010, at 3:26 PM, ToddG wrote:

    I'd like to include running various PIG scripts in my continuous build system. Of course, I'll only use small datasets for this, and in the beginning, I'll only target a local machine instance. However, this brings up several questions:


    Q: What's the best way to run PIG from Java? Here's what I'm doing, following a pattern I found in some of the pig tests:

    1. Create Pig resources in a base class (shamelessly copied from PigExecTestCase):

    protected MiniCluster cluster;
    protected PigServer pigServer;

    @Before
    public void setUp() throws Exception {

    String execTypeString = System.getProperty("test.exectype");
    if(execTypeString!=null && execTypeString.length()>0){
    execType = PigServer.parseExecType(execTypeString);
    }
    if(execType == MAPREDUCE) {
    cluster = MiniCluster.buildCluster();
    pigServer = new PigServer(MAPREDUCE, cluster.getProperties());
    } else {
    pigServer = new PigServer(LOCAL);
    }
    }

    2. Test classes subclass this to get access to the MiniCluster and PigServer (copied from TestPigSplit):

    @Test
    public void notestLongEvalSpec() throws Exception{
    inputFileName = "notestLongEvalSpec-input.txt";
    createInput(new String[] {"0\ta"});

    pigServer.registerQuery("a = load '" + inputFileName + "';");
    for (int i=0; i< 500; i++){
    pigServer.registerQuery("a = filter a by $0 == '1';");
    }
    Iterator<Tuple> iter = pigServer.openIterator("a");
    while (iter.hasNext()){
    throw new Exception();
    }
    }

    3. ERROR

    This pattern works for simple PIG directives, but when I load up entire pig scripts, which have REGISTER and DEFINE directives, pigServer.registerQuery() fails with:

    org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during parsing. Unrecognized alias REGISTER
    at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1170)
    at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114)
    at org.apache.pig.PigServer.registerQuery(PigServer.java:425)
    at org.apache.pig.PigServer.registerQuery(PigServer.java:441)
    at com.audiencescience.apollo.reporting.NetworkRevenueReportTest.shouldParseNetworkRevenueReportScript(NetworkRevenueReportTest.java:74)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

    Any suggestions?

    -Todd
  • Corbin Hoenes at Jul 21, 2010 at 6:03 am
    Trying to attach the PigRunner class in case that helps give you a start using registerScript.
  • Corbin Hoenes at Jul 21, 2010 at 6:08 am
    okay no attachments...try this gist:

    http://gist.github.com/484135
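
    (The gist itself isn't reproduced in this archive, so the following is only a
    rough sketch of what a runner like that might look like: naive $name parameter
    substitution into a temp file, then PigServer.registerScript. The class and
    method names are taken from the test above; everything else is an assumption.)

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;
    import java.io.FileWriter;

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigRunner {
        private final PigServer pigServer;

        public PigRunner() throws Exception {
            pigServer = new PigServer(ExecType.LOCAL);
        }

        public PigServer getPigServer() {
            return pigServer;
        }

        // Each param is "name=value", matching runner.execute("myscript.pig", "param1=foo", ...).
        public void execute(String scriptPath, String... params) throws Exception {
            // Read the script text.
            StringBuilder sb = new StringBuilder();
            BufferedReader in = new BufferedReader(new FileReader(scriptPath));
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    sb.append(line).append('\n');
                }
            } finally {
                in.close();
            }
            // Naive parameter substitution: replace each $name with its value.
            String script = sb.toString();
            for (String param : params) {
                String[] kv = param.split("=", 2);
                script = script.replace("$" + kv[0], kv[1]);
            }
            // Write the substituted script to a temp file and run it;
            // registerScript handles REGISTER and DEFINE lines as well.
            File substituted = File.createTempFile("pig-test-", ".pig");
            substituted.deleteOnExit();
            FileWriter out = new FileWriter(substituted);
            try {
                out.write(script);
            } finally {
                out.close();
            }
            pigServer.registerScript(substituted.getAbsolutePath());
        }
    }
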
    On Jul 21, 2010, at 12:02 AM, Corbin Hoenes wrote:

    Trying to attach the PigRunner class in case that helps give you a start using registerScript.


  • Dmitriy Ryaboy at Jul 21, 2010 at 8:12 am
    Corbin,
    Have you looked at PigUnit? https://issues.apache.org/jira/browse/PIG-1404
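
    (PIG-1404 was still an open patch at this point, so the API below is
    approximate; in roughly the form PigUnit later shipped, a test of Corbin's
    example might look like this, with the script name, parameters, alias, and
    expected rows all made up for illustration:)

    import org.apache.pig.pigunit.PigTest;
    import org.junit.Test;

    public class MyScriptPigUnitTest {

        @Test
        public void testRecordCount() throws Exception {
            String[] params = { "param1=foo", "param2=bar" };
            PigTest test = new PigTest("myscript.pig", params);

            // Assert on the contents of an alias instead of iterating tuples by hand.
            test.assertOutput("foo", new String[] { "(key1,41)", "(key2,7)" });
        }
    }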

    On Tue, Jul 20, 2010 at 11:07 PM, Corbin Hoenes wrote:

    okay no attachments...try this gist:

    http://gist.github.com/484135
  • Corbin Hoenes at Jul 21, 2010 at 1:06 pm
    Dmitriy,

    Nope, that's new to me; thanks for pointing it out. I've been using this home-grown class since Pig 0.5 and really like the idea of unit testing moving into Pig as a first-class citizen.

    On Jul 21, 2010, at 2:11 AM, Dmitriy Ryaboy wrote:

    Corbin,
    Have you looked at PigUnit? https://issues.apache.org/jira/browse/PIG-1404

  • Dave Viner at Jul 21, 2010 at 4:23 pm
    PigUnit looks awesome. Can this make it into either the latest piggybank
    release or the next core release?


    On Wed, Jul 21, 2010 at 6:06 AM, Corbin Hoenes wrote:

    Dmitriy,

    Nope, that's new to me; thanks for pointing it out. I've been using this
    home-grown class since Pig 0.5 and really like the idea of unit testing
    moving into Pig as a first-class citizen.

  • Dmitriy Ryaboy at Jul 21, 2010 at 5:01 pm
    Everyone likes it, no one has time to work on it. You guys can feel free to
    jump in and make it a real thing :)
    There's definitely still time for this to make it into Pig 0.8.

    -D
    On Wed, Jul 21, 2010 at 9:22 AM, Dave Viner wrote:

    PigUnit looks awesome. Can this make it into either the latest piggybank
    release or the next core release?


