The config files are up to date and specify the 12 reducers.
I've verified that this only happens when a UDF is used (mine are mostly
eval functions). An example script:
a01 = load 'file' as (key: long, value: int);
-- the same load for a02 through a31
b = cogroup a01 by key, a02 by key, ..., a31 by key;
DEFINE MEDMAD14 pigUDF.MedMad('14');
c = foreach b generate group as key, flatten(MEDMAD14(a01, a02, ..., a31));
The MedMad UDF iterates over the input bags and produces a bag containing a
rolling median and MAD (median absolute deviation), computed over the window
size passed in the DEFINE.
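Structurally the UDF is roughly like the sketch below (the rolling
median/MAD computation itself is trimmed, so treat the body as illustrative
only):

  package pigUDF;

  import java.io.IOException;
  import org.apache.pig.EvalFunc;
  import org.apache.pig.data.BagFactory;
  import org.apache.pig.data.DataBag;
  import org.apache.pig.data.Tuple;

  public class MedMad extends EvalFunc<DataBag> {
      private final int window;

      // The '14' in the DEFINE statement arrives here as a constructor argument.
      public MedMad(String window) {
          this.window = Integer.parseInt(window);
      }

      @Override
      public DataBag exec(Tuple input) throws IOException {
          DataBag result = BagFactory.getInstance().newDefaultBag();
          // Each field of 'input' is one of the cogrouped bags (a01 .. a31).
          // Walk them, keep a sliding window of 'window' values, and append
          // one (median, mad) tuple per position to 'result'.
          for (int i = 0; i < input.size(); i++) {
              DataBag relation = (DataBag) input.get(i);
              // ... rolling median / MAD over 'relation' elided ...
          }
          return result;
      }
  }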
If the cogroup is followed by parallel 11 then all is fine (11 roughly equal
result parts), but if I use parallel 12 then I get only 3 large files and 9
zero-byte ones.
What do you think?
On Thu, Jul 2, 2009 at 6:23 PM, Alan Gates wrote:
For most operations Pig uses the default Hadoop partitioner. We do set our
own partitioner for order by, but at the moment I believe that's it. How
are you launching Pig? Is it possible it's picking up an old
hadoop-site.xml file or something?
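For reference, the stock Hadoop HashPartitioner just takes hash(key) mod the
number of reduce tasks. A quick standalone illustration (the class and method
names, and the assumption that the keys are all multiples of 4, are only for
showing the arithmetic):

  public class PartitionDemo {
      // Mirrors what Hadoop's default HashPartitioner does with a key's hashCode().
      static int partitionFor(long key, int numReduceTasks) {
          // Long.hashCode() is (int)(key ^ (key >>> 32)); for small keys that's just the value.
          int hash = Long.valueOf(key).hashCode();
          return (hash & Integer.MAX_VALUE) % numReduceTasks;
      }

      public static void main(String[] args) {
          // Hypothetical keys that all happen to be multiples of 4.
          for (long key = 0; key < 48; key += 4) {
              System.out.println(key + " -> reducer " + partitionFor(key, 12)
                  + " of 12, reducer " + partitionFor(key, 11) + " of 11");
          }
      }
  }

With 12 reduces those keys land only on partitions 0, 4 and 8 (three busy
reducers, nine empty), while 11, being prime, spreads them across every
reducer. If the real group keys have any stride that shares a factor with 12,
that would produce exactly the three-busy-reducers pattern you describe.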
On Jul 2, 2009, at 4:55 AM, Tamir Kamara wrote:
Recently my cluster configuration changed from 11 reducers to 12. Since then,
on every job using Pig only 3 reducers do actual work and output results,
while the others finish quickly and output zero-byte files. The result of the
entire job is OK, but getting there takes longer because of the uneven work
distribution. Plain MR jobs are fine and do have an even work distribution.
Could this be something in Pig?
(I'm using the latest trunk.)