I have a Pig script that works well on small test data sets but fails on a run over realistic-sized data. The logs show:
INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_201106061024_0331 has failed!
…
job_201106061024_0331 CitedItemsGrpByDocId,DedupTCPerDocId GROUP_BY,COMBINER Message: Job failed!
…
attempt_201106061024_0331_m_000198_0 […] Error: java.lang.OutOfMemoryError: Java heap space
and the same for all attempts of a few of the other (many) map tasks for this job.
I believe this job corresponds to these lines in my Pig script:
-- group cited items by the document they cite
CitedItemsGrpByDocId = group CitedItems by citeddocid;

-- for each cited document, count the distinct citing documents (tc)
DedupTCPerDocId =
    foreach CitedItemsGrpByDocId {
        CitingDocids = CitedItems.citingdocid;
        UniqCitingDocids = distinct CitingDocids;
        generate group, COUNT(UniqCitingDocids) as tc;
    };
I tried increasing mapred.child.java.opts, but then the job failed in a setup stage with:
Error occurred during initialization of VM
Could not reserve enough space for object heap
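To be concrete, by "increasing mapred.child.java.opts" I mean passing something along these lines when launching the script (the script name and -Xmx value are only placeholders, and I am assuming -D properties handed to pig get forwarded into the job configuration):

    pig -Dmapred.child.java.opts=-Xmx1024m myscript.pig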
Are there Hadoop or Pig job configurations/parameters I can set to get around this? Is there a Pig Latin circumlocution, or a better way to express what I want, that is not as memory-hungry?
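For example, would pre-deduplicating the (citeddocid, citingdocid) pairs before the GROUP be likely to behave better? Something like the sketch below, which I have not tried at realistic scale (the intermediate relation names are made up):

-- project down to the two ids and dedup the pairs globally,
-- instead of running DISTINCT inside the foreach on each group's bag
CitedPairs = foreach CitedItems generate citeddocid, citingdocid;
UniqCitedPairs = distinct CitedPairs;
PairsGrpByDocId = group UniqCitedPairs by citeddocid;
DedupTCPerDocId = foreach PairsGrpByDocId generate group, COUNT(UniqCitedPairs) as tc;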
Thanks in advance,
Will
William F Dowling
Sr Technical Specialist, Software Engineering