Grokbase Groups Pig user June 2011

I am using pig-0.9 and hadoop-0.20. I have set to 100 and to -Xmx400m because of the memory constraints I

I have a pig script that does essentially the following:

a = load 'somedata' using SomeUDF();

b = foreach a generate x1, x2, x3, x4, x5, x6... ;

c1 = filter b by x3 is not null;
d1 = group c1 by (x1, x2, x3);
e1 = foreach d1 generate flatten(group), SUM(c1.x5), SUM(c1.x6)...;

c2 = filter b by x4 is not null;
d2 = group c2 by (x1, x2, x4);
e2 = foreach d2 generate flatten(group), SUM(c2.x5), SUM(c2.x6)...;

...

e14 = foreach d14 generate flatten(group), SUM(c14.x5), SUM(c14.x6)...;

f = union e1, e2, e3, e4 ... e14;
store f into 'somefile';
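
For readers unfamiliar with multi-query execution, here is a rough Python model (not Pig internals) of what a merged map phase does with a script shaped like the one above: each input record is run through every filter branch and emitted once per branch that passes. Only the first two branches are spelled out. The `BRANCHES` table and the `map_record` helper are hypothetical names made up for this illustration.

```python
from typing import Dict, Iterator, List, Optional, Tuple

Record = Dict[str, Optional[int]]

# Hypothetical stand-ins for the script's branches:
# (column that must be non-null, columns used as the group key).
BRANCHES: List[Tuple[str, Tuple[str, ...]]] = [
    ("x3", ("x1", "x2", "x3")),  # c1 / d1
    ("x4", ("x1", "x2", "x4")),  # c2 / d2
]


def map_record(rec: Record) -> Iterator[Tuple[int, tuple, tuple]]:
    """Yield one (branch index, group key, values) tuple per passing branch."""
    for idx, (filter_col, key_cols) in enumerate(BRANCHES):
        if rec.get(filter_col) is not None:
            key = tuple(rec[c] for c in key_cols)
            # x5 and x6 are the columns the script later feeds to SUM.
            yield idx, key, (rec["x5"], rec["x6"])


# A record with x4 missing only reaches the first branch.
partial = {"x1": 1, "x2": 2, "x3": 3, "x4": None, "x5": 10, "x6": 20}
# A record with both x3 and x4 present is emitted once per branch.
full = {"x1": 1, "x2": 2, "x3": 3, "x4": 4, "x5": 10, "x6": 20}

print(list(map_record(partial)))
print(len(list(map_record(full))))
```

With 14 branches, a single input record can therefore produce up to 14 map output records, all of which pass through the same sort buffer.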

The data has around 350 columns, so the schema is quite large. I have made
sure the input splits have around 2500 records each. Even then I see
significant spilling. The combiner does come into play, but the spilling
kills performance. Why does the multi-query optimized map require so much
RAM? The memory usage is really confusing, as I see the mappers running at
almost all the memory allotted (400m). If I increase io.sort.mb, the process
dies with an OOM exception. Any ideas?!
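
A back-of-the-envelope estimate may help frame the question. The sketch below assumes that the value 100 mentioned at the top is io.sort.mb, that Hadoop's default spill threshold of 0.80 applies, and that every record passes all 14 filters. The 20 KB per map output record is a made-up figure, since no record sizes are given here. It ignores record metadata accounting and the combiner, so treat the numbers as illustrative only.

```python
import math


def estimate_spills(records_per_split: int, branches: int,
                    bytes_per_output: int, sort_mb: int,
                    spill_percent: float = 0.80) -> int:
    """Rough count of sort-buffer spills for one map task."""
    total_bytes = records_per_split * branches * bytes_per_output
    threshold_bytes = sort_mb * 1024 * 1024 * spill_percent
    # A map task writes its output to disk at least once.
    return max(1, math.ceil(total_bytes / threshold_bytes))


def heap_left_mb(heap_mb: int, sort_mb: int) -> int:
    """Heap remaining for everything else once the sort buffer is allocated."""
    return heap_mb - sort_mb


RECORDS = 2500            # records per input split, from the post
ASSUMED_BYTES = 20 * 1024  # hypothetical size of one map output record

single = estimate_spills(RECORDS, 1, ASSUMED_BYTES, sort_mb=100)
merged = estimate_spills(RECORDS, 14, ASSUMED_BYTES, sort_mb=100)
print(single, merged)
print(heap_left_mb(400, 100), heap_left_mb(400, 200))
```

Under these assumptions, the 14-way fan-out multiplies the bytes entering the sort buffer, while raising io.sort.mb takes memory away from a heap that is already nearly full at 400m.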


group: user @
categories: pig, hadoop
posted: Jun 30, '11 at 10:21p
active: Jun 30, '11 at 10:21p

1 user in discussion

Shubham Chopra: 1 post


