Pig user mailing list, July 2010
UDF with dependency on external jars & native code
I am new to PIG and running into a fairly basic problem. I have a UDF
which depends on some other 3rd party jars & libraries. I can call the
UDF from my PIG script either from grunt or by running "java -cp ...
org.apache.pig.Main <script>" in local mode, when I have the jars on the
classpath and the libraries on LD_LIBRARY_PATH. But, in mapreduce mode I
get errors from Hadoop because it doesn't find the classes & libraries.
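(For concreteness, the local invocation that works for me looks roughly like the
following - the paths and jar names are just placeholders:

    export LD_LIBRARY_PATH=/local/path/native-libs:${LD_LIBRARY_PATH}
    java -cp pig.jar:myudf.jar:third-party-dep.jar org.apache.pig.Main -x local script.pig
)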

I saw another thread on this forum, which had a workaround for the jar.
I can explicitly call register on the dependency, and that seems to fix
the problem. But, there doesn't seem to be a way of specifying the
native libraries to PIG such that the map/reduce jobs are set up to
access them.

I am using PIG 0.5.0. Any help is appreciated!

Thanks,
-sanjay


  • Thejas M Nair at Jul 29, 2010 at 5:47 pm
    You can use the MR distributed cache to push the native libs - see
    http://hadoop.apache.org/common/docs/r0.20.1/mapred_tutorial.html#DistributedCache

    "The DistributedCache can also be used to distribute both jars and native
    libraries for use in the map and/or reduce tasks. The child-jvm always has
    its current working directory added to the java.library.path and
    LD_LIBRARY_PATH. And hence the cached libraries can be loaded via
    System.loadLibrary or System.load . More details on how to load shared
    libraries through distributed cache are documented at
    native_libraries.htm"

    So using -Dmapred.cache.files=<dfs path to file> on your pig command line
    should work.

    Please let us know if this worked for you.

    For the jars, you can also use a command-line option:
    -Dpig.additional.jars="jar1:jar2.."

    (thanks to Pradeep for suggesting this solution)
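
    For example (all host names, paths and jar names below are placeholders -
    adjust for your cluster), the combined command line might look something
    like:

        # the shared library has already been copied to HDFS; "#libmyudf.so"
        # creates a symlink in the task's working directory, which is on the
        # task JVM's java.library.path and LD_LIBRARY_PATH
        java -cp pig.jar:${HADOOP_HOME}/conf \
            -Dmapred.cache.files=hdfs://namenode:54310/libs/libmyudf.so#libmyudf.so \
            -Dmapred.create.symlink=yes \
            -Dpig.additional.jars=/local/path/dep1.jar:/local/path/dep2.jar \
            org.apache.pig.Main script.pig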

    Thanks,
    Thejas
  • Kaluskar, Sanjay at Aug 4, 2010 at 10:14 am
    The register isn't working after I made some changes to mapred-site.xml.
    Right now I am executing the Pig script from the command line as follows:

    java -cp infapig.jar:${HADOOP_HOME}/conf org.apache.pig.Main script.pig

    In my script I have a register infapig.jar, which contains everything
    needed by my UDF. But, I get the following error:

    2010-08-04 20:44:44,933 [main] ERROR
    org.apache.pig.tools.grunt.GruntParser - ERROR 2997: Unable to recreate
    exception from backed error: Error: java.lang.ClassNotFoundException:
    com.informatica.products.pigudf.PigudfException

    PigudfException is an exception defined in one of the jars on the
    classpath of infapig.jar.
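
    (For reference, registering each dependency jar explicitly, instead of
    relying on infapig.jar's manifest Class-Path, would look something like the
    following - the dependency jar names here are hypothetical:

        register /local/path/infapig.jar;
        register /local/path/pigudf-common.jar;      -- hypothetical jar defining PigudfException
        register /local/path/other-third-party.jar;  -- hypothetical
    )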

    I have the following in mapred-site.xml:
    <property>
      <name>mapred.cache.archives</name>
      <value>hdfs://inarch03.informatica.com:54310/infadoop/installer-9.0.2-SNAPSHOT.zip#infadoop</value>
    </property>
    <property>
      <name>mapred.create.symlink</name>
      <value>yes</value>
    </property>
    <property>
      <name>mapred.job.classpath.files</name>
      <value>infadoop/infapig.jar</value>
    </property>

    I have also tried adding the following to mapred-site.xml:
    <property>
      <name>pig.additional.jars</name>
      <value>infadoop/infapig.jar</value>
    </property>

    However, the following executions work:

    java -cp infapig.jar:${HADOOP_HOME}/conf org.apache.pig.Main -x local script.pig
    java -cp infapig.jar org.apache.pig.Main script.pig

  • Thejas M Nair at Aug 4, 2010 at 6:47 pm

    On 8/4/10 3:13 AM, "Kaluskar, Sanjay" wrote:

    > The register isn't working after I made some changes to mapred-site.xml.
    > Right now I am executing the Pig script from the command line as follows:

    Do you know what change in mapred-site.xml caused it to stop working? Is it
    after adding mapred.cache.archives?

    > PigudfException is an exception defined in one of the jars on the
    > classpath of infapig.jar.

    Is PigudfException also packaged within the jar?


    -Thejas




  • Kaluskar, Sanjay at Aug 5, 2010 at 5:10 am
    I read the tutorial once more & decided to try out the simplest possible
    config that might work, which is copying the necessary native libs using
    dist cache (they are in an archive called installer-9.0.2-SNAPSHOT.zip),
    and hoping that PIG will take care of copying the jars that are
    registered. I also need to set some env variables to get my native code
    to work.

    So, now I have the following in the mapred-site.xml:

    <property>
      <name>mapred.cache.archives</name>
      <value>hdfs://inarch03.informatica.com:54310/infadoop/installer-9.0.2-SNAPSHOT.zip#infadoop</value>
    </property>
    <property>
      <name>mapred.create.symlink</name>
      <value>yes</value>
    </property>
    <property>
      <name>mapred.child.java.opts</name>
      <value> -Xmx512M -Djava.library.path=infadoop/infa-resources </value>
    </property>
    <property>
      <name>mapred.child.env</name>
      <value>LD_LIBRARY_PATH=infadoop/infa-resources,INFA_RESOURCES=infadoop/infa-resources,IMF_CPP_RESOURCE_PATH=infadoop/infa-resources</value>
    </property>
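
    (For completeness, I believe the same properties could instead be passed as
    -D options on the client command line rather than edited into
    mapred-site.xml, assuming Pig forwards them into the job configuration as
    suggested earlier in this thread; a rough sketch:

        java -cp infapig.jar:${HADOOP_HOME}/conf \
            -Dmapred.cache.archives=hdfs://inarch03.informatica.com:54310/infadoop/installer-9.0.2-SNAPSHOT.zip#infadoop \
            -Dmapred.create.symlink=yes \
            "-Dmapred.child.java.opts=-Xmx512M -Djava.library.path=infadoop/infa-resources" \
            -Dmapred.child.env=LD_LIBRARY_PATH=infadoop/infa-resources,INFA_RESOURCES=infadoop/infa-resources,IMF_CPP_RESOURCE_PATH=infadoop/infa-resources \
            org.apache.pig.Main script.pig
    )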

    With the mapred-site.xml change in place, the job setup fails, and I see
    the following error in the userlogs:

    Exception in thread "main" java.lang.NoClassDefFoundError:
    Caused by: java.lang.ClassNotFoundException:
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
    Could not find the main class: . Program will exit.



  • Kaluskar, Sanjay at Aug 6, 2010 at 2:03 pm
    I discovered why the approach of registering all the dependency jars in
    the PIG script doesn't work for me: PIG expands all the registered jars and
    repackages everything into a single jar, and some of the jars my UDF
    depends on contain entries with the same file names, so they get
    over-written.

    I have solved this problem by having a single top-level jar that names all
    the dependencies in the Class-Path entry of its manifest. This works well
    when I run things standalone or in local mode. When running in cluster
    mode, the best solution would be to have the same top-level jar included in
    the classpath when the JVM for the mapreduce task is started, but I haven't
    been able to figure out how that can be done.
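
    (For reference, the packaging looks roughly like this - the dependency jar
    names below are just illustrative, and the Class-Path entries are resolved
    relative to the location of the top-level jar:

        # MANIFEST.MF fragment for the top-level jar
        Class-Path: lib/dep1.jar lib/dep2.jar lib/dep3.jar

        # build the top-level jar with that manifest
        jar cfm infapig.jar MANIFEST.MF -C classes .
    )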

    Setting mapred.child.java.opts seems to conflict with whatever PIG
    specifies for the jvm. Setting CLASSPATH doesn't work because PIG is
    probably specifying classpath as a command-line arg. Just can't get
    around this simple problem!!!

