FAQ
I am using Hadoop indirectly through PIG, and some of the UDFs (defined
by me) need other jars at runtime (around 150), some of which have
conflicting resource names. Hence, unpacking all of them and repacking
them into a single jar doesn't work. My solution is to create a single
top-level jar that names all the dependencies in the Class-Path
attribute of its MANIFEST.MF. This is also simpler from a user's point
of view. Of course, this requires the top-level jar and all the
dependencies to be laid out in a directory structure that I can
control. Currently, I have a root directory that contains the top-level
jar and a directory called lib; all the dependencies are in lib, and
the top-level jar names them as lib/x.jar, lib/y.jar, etc. I package
all of this as a single zip file for easy installation.

Just to be clear, this is the dir structure:

root dir
--- top-level.jar
--- lib
    --- x.jar
    --- y.jar
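
For concreteness, the Class-Path attribute in top-level.jar's
META-INF/MANIFEST.MF then looks roughly like this (x.jar and y.jar are
just the example names from the tree above; the entries are
space-separated paths relative to the top-level jar):

    Manifest-Version: 1.0
    Class-Path: lib/x.jar lib/y.jar
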
I can't register top-level.jar in my PIG script (the recommended
approach) because PIG then unpacks & repackages everything into a
single jar instead of including the jar on the classpath. I can't use
the distributed cache either: if I specify top-level.jar and lib
separately in mapred.cache.files, the relative directory locations
aren't preserved, and if I use mapred.cache.archives and specify the
zip file, I can't add the top-level jar to the classpath (because the
entries in mapred.job.classpath.files must come from
mapred.cache.files).
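
To make the two distributed-cache attempts concrete, here is a minimal
sketch using the DistributedCache API (Hadoop 0.20-era property names;
the HDFS paths are placeholders for my layout):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;

    public class CacheSetupSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Attempt 1: ship the jars as individual files (mapred.cache.files).
        // The relative locations of top-level.jar and lib/x.jar are not
        // preserved on the task nodes, so the manifest Class-Path breaks.
        DistributedCache.addCacheFile(new URI("hdfs:///user/sanjay/top-level.jar"), conf);
        DistributedCache.addCacheFile(new URI("hdfs:///user/sanjay/lib/x.jar"), conf);

        // Attempt 2: ship the whole bundle as one archive (mapred.cache.archives).
        // The layout survives unpacking, but entries in mapred.job.classpath.files
        // must come from mapred.cache.files, so the unpacked top-level.jar
        // cannot be put on the task classpath this way.
        DistributedCache.addCacheArchive(new URI("hdfs:///user/sanjay/bundle.zip"), conf);
      }
    }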

If mapred.child.java.opts also allowed java.class.path to be augmented
(similar to java.library.path, which I am using for native libs stored
in another dir parallel to lib), that would have solved my problem: I
could have specified the zip in mapred.cache.archives and added the jar
to the classpath. Right now I can't see any solution other than using a
shared file system and adding top-level.jar to HADOOP_CLASSPATH. That
works because I am using a small cluster with a shared file system, but
clearly it's not always feasible (and of course it modifies Hadoop's
environment).
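
For reference, the two settings mentioned above look roughly like this
(the ./native dir and the /shared path are placeholders for my layout):

    # job configuration: the child JVM's java.library.path can be
    # augmented, but there is no equivalent for java.class.path
    mapred.child.java.opts=-Djava.library.path=./native

    # current workaround (shared file system), e.g. in hadoop-env.sh
    export HADOOP_CLASSPATH=/shared/rootdir/top-level.jar:$HADOOP_CLASSPATH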

Please suggest any alternatives you can think of.

Thanks,
-sanjay

  • Arun C Murthy at Aug 11, 2010 at 4:59 pm
    Moving to mapreduce-user@, bcc common-user@.

    Why do you need to create a single top-level jar? Just register each
    of your jars and put each in the distributed cache... however, you
    have 150 jars, which is a lot. Is there a way you can decrease that?
    I'm not sure how you do this in Pig, but in MR you have the ability
    to add a jar in the DC to the classpath of the child
    (DistributedCache.addFileToClassPath).
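
    In raw MR that call looks roughly like this (the HDFS path is a
    placeholder):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.filecache.DistributedCache;
        import org.apache.hadoop.fs.Path;

        public class ChildClasspathSketch {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Adds x.jar to both mapred.cache.files and
            // mapred.job.classpath.files, so it is localized and put on
            // the child JVM's classpath.
            DistributedCache.addFileToClassPath(
                new Path("/user/sanjay/lib/x.jar"), conf);
          }
        }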

    Hope that helps.

    Arun

