FAQ
We have some aggregate statistics we are gathering in an NLP application,
and we have implemented it in Pig (using 0.7.0 on Hadoop 0.20.2 with Java
1.6.0_22). But some of the mapreduce jobs use a lot of memory (12G of RAM in
one process, for a data set that is 50G in text format at the start of the
computation). I would like to gain some insight into what portion of the Pig
script is being executed by a given mapreduce task. I dumped the plan using
EXPLAIN, but I am having trouble interpreting the output. I can't find any
resources online that help me understand these plans. Does anyone have any
pointers, or any better ideas on how to debug the logic? I bet I am doing
something that is easy to optimize away, but since I expect to use Pig more
going forward, I'd like to have a bag of tricks to diagnose its behavior.

Many thanks,

Greg Langmead | Senior Research Scientist | SDL Language Weaver | (t) +1 310
437 7300

SDL PLC confidential, all rights reserved.
If you are not the intended recipient of this mail SDL requests and requires that you delete it without acting upon or copying any of its contents, and we further request that you advise us.
SDL PLC is a public limited company registered in England and Wales. Registered number: 02675207.
Registered address: Globe House, Clivemont Road, Maidenhead, Berkshire SL6 7DY, UK.


  • Daniel Dai at Nov 29, 2010 at 10:59 pm
    Sorry, there is no document describing the explain output so far. If you
    want to find the alias -> job mapping in the explain result, look at the
    "Map Reduce Plan" section of the explain result: every node there
    represents a mapreduce job, and you will find the aliases included in
    that job.

    I would also suggest using Pig 0.8, which has a stats feature that
    reports the "alias -> jobid" mapping after running a Pig script. It is
    still a couple of days from release, but you can check it out and build
    it yourself now:
    svn co http://svn.apache.org/repos/asf/pig/branches/branch-0.8
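
    For example (a minimal sketch; the script, file names, and alias names
    are made up for illustration, and the plan excerpt below is only roughly
    what the output looks like, not verbatim):

    ```pig
    -- Hypothetical word-count script, for illustration only.
    raw    = LOAD 'input.txt' AS (word:chararray);
    grpd   = GROUP raw BY word;
    counts = FOREACH grpd GENERATE group, COUNT(raw);
    EXPLAIN counts;
    -- In the "Map Reduce Plan" section of the EXPLAIN output, each
    -- MapReduce operator node lists the aliases it executes, roughly:
    --   MapReduce node scope-12
    --   Map Plan:    grpd: Local Rearrange ... | raw: New For Each ...
    --   Reduce Plan: counts: New For Each ...
    -- so this whole script compiles into a single mapreduce job.
    ```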

    Daniel


Discussion Overview
group: user @
categories: pig, hadoop
posted: Nov 29, '10 at 10:12p
active: Nov 29, '10 at 10:59p
posts: 2
users: 2
website: pig.apache.org

2 users in discussion

Daniel Dai: 1 post
Greg Langmead: 1 post
