UDFs and Amazon Elastic MapReduce (Pig user list, August 2009)
Hi all,

I had a question about running Pig jobs on Amazon's cloud services.
Specifically, how do you go about adding UDF jar files, and what modifications,
if any, do you need to make to a script to make sure it runs effectively via
MapReduce (do you need to ship/cache the UDF jar, and if so, how)?

Thanks for all the help so far,


--
Zaki Rahaman


  • Zaki rahaman at Sep 2, 2009 at 2:49 pm
    Apologies for re-posting, but I never got an answer to my question.
    Basically, when using UDF jar files, how do you go about ensuring that the
    jar file is replicated to all nodes in the cluster and that each node uses
    its own local copy of the jar rather than the 'master' copy (to avoid
    unnecessary network traffic and bandwidth issues)? It looks like this is
    accomplished via a DEFINE plus a ship/cache clause, but I'm not sure which
    one is necessary to use.

    --
    Zaki Rahaman
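    For context, a minimal sketch of the two mechanisms the question refers to,
    assuming a hypothetical Java UDF class com.example.MyUpper packaged in
    myudfs.jar and a hypothetical streaming script my_script.py:

        -- Java UDF jar: REGISTER is enough; Pig ships the jar to the task nodes
        REGISTER myudfs.jar;
        A = LOAD 'input.txt' AS (line:chararray);
        B = FOREACH A GENERATE com.example.MyUpper(line);

        -- Streaming command: DEFINE ... SHIP() copies the local file to every
        -- node, and each task runs its own local copy
        DEFINE my_cmd `my_script.py` SHIP('my_script.py');
        C = STREAM A THROUGH my_cmd;

    The SHIP/CACHE clauses belong to DEFINE for streaming commands; for a Java
    UDF jar, REGISTER alone covers distribution (see the replies below).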
  • Zjffdu at Sep 2, 2009 at 3:43 pm
    Hi zaki,

    You only need to register the UDF jar, and Pig will distribute the jar to
    the cluster for you.

    Each time you submit the Pig script, Pig distributes the UDF jar to the
    cluster.


    Best regards,
    Jeff zhang
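    In other words, a minimal sketch of that workflow (the jar path and UDF
    class name are hypothetical):

        -- REGISTER makes the classes in the jar visible to the script; Pig
        -- copies the jar out to the cluster when the job is submitted
        REGISTER /path/to/myudfs.jar;
        -- DEFINE is only needed to alias the UDF, e.g. when its constructor
        -- takes arguments
        DEFINE MyUpper com.example.MyUpper();
        A = LOAD 'input.txt' AS (line:chararray);
        B = FOREACH A GENERATE MyUpper(line);
        STORE B INTO 'output';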


  • Khanna, Richendra at Sep 3, 2009 at 1:15 am
    Hi Zaki,

    As part of the enhancements to Pig for it to work well with Amazon Elastic
    MapReduce, one of the changes made was to allow the argument passed to
    "REGISTER" to come from a remote file system. So for instance you can do:

    REGISTER s3://my-bucket/path/to/my/uploaded.jar;

    This jar is downloaded to the master by the Grunt shell script at
    interpretation time. It is then uploaded to the distributed cache by the
    Grunt shell as part of running the job. Thus there is nothing in particular
    you need in your jars/script files to ensure they are used in a scalable
    fashion.

    Also for questions related to our service, you might get a faster response
    on our forums
    (http://developer.amazonwebservices.com/connect/forum.jspa?forumID=52&start=0),
    since those are actively watched by our team.

    Thanks,
    Richendra
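    Putting that together, a complete script needs nothing beyond the REGISTER
    line pointing at S3; the bucket, jar, and class names below are hypothetical:

        -- Grunt fetches the jar from S3 at interpretation time and places it in
        -- the distributed cache when the job runs
        REGISTER s3://my-bucket/path/to/my/uploaded.jar;
        A = LOAD 's3://my-bucket/input/' AS (line:chararray);
        B = FOREACH A GENERATE com.example.MyUpper(line);
        STORE B INTO 's3://my-bucket/output/';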

