We are gathering some aggregate statistics in an NLP application, which we
have implemented in Pig (0.7.0 on Hadoop 0.20.2 with Java 1.6.0_22). Some of
the MapReduce jobs use a lot of memory (12 GB of RAM in one process, for a
data set that is 50 GB in text format at the start of the computation). I
would like to gain some insight into which portion of the Pig script a given
MapReduce task is executing. I dumped the plan using
EXPLAIN, but I am having trouble interpreting the output, and I can't find
any resources online that help me understand these plans. Does anyone have
any pointers, or better ideas on how to debug the logic? I bet I am doing
something that is easy to optimize away, but since I expect to use Pig more
going forward, I'd like to have a bag of tricks for diagnosing its behavior.
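For concreteness, here is a minimal sketch of how I am invoking EXPLAIN from the Grunt shell (the aliases and input path below are made-up illustrations, not our actual script):

```pig
grunt> tokens  = LOAD 'input.txt' AS (word:chararray);
grunt> grouped = GROUP tokens BY word;
grunt> counts  = FOREACH grouped GENERATE group, COUNT(tokens) AS n;
grunt> EXPLAIN counts;
```

The output prints the logical, physical, and MapReduce plans in turn; it is the last section, which splits the operators into a Map Plan and a Reduce Plan for each job, that I am struggling to map back to lines of the script.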
Greg Langmead | Senior Research Scientist | SDL Language Weaver | (t) +1 310