Java heap issue
Hi all,

I have a Pig script that was working completely fine. It calls a memory-intensive UDF that pulls some 600 MB of data into each mapper, but even so I was able to process the data and write out results. My mapper memory is 4096 MB, and my HDFS block size is 128 MB.

My input dataset (for a given date) is big enough to spawn some 960 mappers.

A = load 'input data set' ..;
B = load 'smaller data set'..;
C = JOIN A by key, B by key using 'replicated';
D = foreach C generate field1, MyUDF(field2) as field2;
E = store D into 'deleteme';

As you can see, it is a map-only process. My output is some 960 part files, each around 25-35 MB.

I run this processing for each day. I now have a requirement to merge the results of the above processing with the results from another date and store only the unique records.

I added the following lines:
F = load 'previous date data' ..;
G = union E, F;
H = distinct G parallel $X;
store H into 'deleteme_H';
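
For reference, $X is a Pig parameter, so the number of reducers for the distinct gets bound when the script is launched; the script name and value below are only placeholders:

pig -param X=40 myscript.pig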

When I add these steps to my process, I get "Java heap space" errors in the mapper phase. I even made F an empty data set, but I still get the same error. I wonder why I am getting these "Java heap" errors. Is the solution simply to increase the mapper memory even further?
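
For context on that last question: on a Hadoop 0.20-era cluster the map-task heap is controlled by mapred.child.java.opts, so bumping it for a single run might look like the line below (the value is only an illustration, and this assumes the pig wrapper forwards -D properties to the job configuration; otherwise the property can be set in mapred-site.xml):

pig -Dmapred.child.java.opts=-Xmx6144m myscript.pig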

Thanks!


  • Daniel Dai at May 7, 2010 at 5:29 pm
    I suspect it is because of the distinct combiner. Try the option
    -Dpig.exec.nocombiner=true on the command line and see if it works.
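
    A minimal example of such an invocation, with the script name as a placeholder:

    pig -Dpig.exec.nocombiner=true myscript.pig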

    Daniel





  • Kelvin Moss at May 10, 2010 at 8:13 am
    Hi Daniel,

    You're correct that the distinct statement is causing the issue; if I comment out the distinct, the script runs fine. However, I ran the script with the -Dpig.exec.nocombiner=true option and I still got the "Java heap space" error in the mapper. Any idea why?

    Thanks!

