Grokbase Groups Pig user January 2011
Managing pig script jar dependencies
I'm looking for some suggestions and ideas for how to handle JAR
dependencies in a production environment.

Most of the pig scripts I write require multiple JAR files. For instance, I
have a pig script that processes some data through a Solr instance, which
requires my Solr UDF plus some Solr, Lucene, and Apache Commons jars. These
pig scripts are stored in a git repo, and that repo is deployed to our
production cluster. Obviously we don't want to store the jars in git; I'd
rather keep them in our Maven repo with the rest of the jars the company
uses.

The plan is to have a Maven pom.xml for each pig script that defines the
jars that script depends on. A shell script will then call "mvn
dependency:copy-dependencies -DoutputDirectory=pig-jars" before calling the
actual pig command to run the script. Given that, I'm trying to figure out
the best solution to a few questions.
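For concreteness, a per-script pom along these lines is one way to express it; every coordinate and version below is a placeholder, not our real ones:

```xml
<!-- pom.xml kept next to the pig script; "pom" packaging since it only lists deps -->
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.example.pig</groupId>
  <artifactId>solr-indexing-script</artifactId>
  <version>1.0</version>
  <packaging>pom</packaging>
  <dependencies>
    <dependency>
      <groupId>com.example</groupId>
      <artifactId>solr-udf</artifactId>
      <version>1.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.solr</groupId>
      <artifactId>solr-core</artifactId>
      <version>1.4.1</version>
    </dependency>
  </dependencies>
</project>
```

Running "mvn dependency:copy-dependencies -DoutputDirectory=pig-jars" against that pom drops every declared jar (plus transitives) into pig-jars/.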

* For development I'd like to store the pig jar (pig-0.7.0-core.jar) in
Maven, but there is no pom.xml for that jar (easily fixed), and the jar
bundles all its Java prerequisites (javax.servlet, Apache Commons, etc.),
which seems to make Maven unhappy when I try to import it into the company
repo. Is there a pig-only jar?
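For what it's worth, a pom-less jar can be pushed into a company repo with Maven's deploy:deploy-file goal, which generates a minimal pom on the fly. A sketch follows; the URL, repositoryId, and coordinates are placeholders for whatever your repo uses:

```shell
# Build the deploy command for a jar that ships without a pom.
# deploy:deploy-file is a standard Maven goal; every value after it here is a
# made-up example. Echoed rather than run so it can be inspected first.
DEPLOY_CMD="mvn deploy:deploy-file \
  -DgroupId=org.apache.pig -DartifactId=pig-core -Dversion=0.7.0 \
  -Dpackaging=jar -Dfile=pig-0.7.0-core.jar \
  -Durl=http://repo.example.com/releases -DrepositoryId=company-releases"
echo "$DEPLOY_CMD"
```

(install:install-file does the same thing against the local repository, which is handy for testing the import before touching the shared repo.)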

* What do other people use to deploy their code to various systems? Check in
jars with the code? Keep jars in a separate, network-based directory?

Geoff
--
Sent from my email client.


  • Dmitriy Ryaboy at Jan 21, 2011 at 12:27 am
    This is becoming a bigger problem for us as well, as use of Pig becomes more
    varied across the company.
    Would love to hear what others have found to work for them.

    D
  • Kaluskar, Sanjay at Jan 21, 2011 at 1:48 am
    I have a similar problem and can tell you what I'm doing currently, in
    case it's useful. I have a tool that generates Pig scripts from another
    representation (Informatica mappings), and in many cases the scripts also
    call UDFs that depend on about 300 jars and 580 native libraries.
    Additionally, I generate a jar for each Pig script that contains the UDFs
    called from that script, and I add that jar in the script with a register
    statement. But registering the 300 jars that the UDFs depend on
    individually is error-prone and tedious, so I have automated that part: I
    have a top-level jar that lists all 300 jars on the Class-Path in its
    MANIFEST.MF, and I add only this top-level jar to the classpath. I
    generate that top-level jar using Maven's assembly plugin. I also
    generate a zip of everything (jars, native libs) with the assembly
    plugin, use the distributed cache to distribute it, and add the native
    libs to the LD_LIBRARY_PATH.
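    The MANIFEST.MF Class-Path trick can also be sketched with the
    maven-jar-plugin's standard archive options (illustrative only; my actual
    build uses the assembly plugin, and the lib/ prefix is an example):

    ```xml
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-jar-plugin</artifactId>
      <configuration>
        <archive>
          <manifest>
            <!-- write every dependency onto the jar's MANIFEST.MF Class-Path -->
            <addClasspath>true</addClasspath>
            <classpathPrefix>lib/</classpathPrefix>
          </manifest>
        </archive>
      </configuration>
    </plugin>
    ```

    The resulting jar then pulls in all its dependencies as soon as it is on
    the classpath, so only one jar has to be registered or added by hand.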

  • Dmitriy Ryaboy at Jan 21, 2011 at 2:27 am
    Sanjay,
    Informatica compiles to Pig now, eh? Interesting...
    How do you handle jar conflicts if you bundle the whole lot? Doesn't this
    cost you a lot on job startup time?

    Dmitriy

  • Kaluskar, Sanjay at Jan 21, 2011 at 3:33 am
    Hi Dmitriy,

    Well, what I have is still experimental and not in any product, but yes,
    we can compile to a Pig script. I try to use the native relational
    operators where possible and use UDFs in other cases.

    I don't understand which conflicts you are referring to. Initially I
    tried to create a single jar (containing all 300 dependencies) using the
    maven-dependency-plugin (BTW, that seems to be the recommended approach
    and should work in many cases), but it turned out that some of our
    internal components had conflicting file names for some of their
    resources (which should probably be fixed!). My current approach works
    better because I don't try to re-package any dependency. Yes, startup
    times are slow - of course, I am open to other ideas :-)

  • Erik Onnen at Jan 21, 2011 at 6:45 am
    As a new member to the list, I offer our lone data point. We use the maven
    shade plugin: http://maven.apache.org/plugins/maven-shade-plugin/

    Shade produces an "uber" JAR with an optional declared main class.

    On the up side, for a reasonable number of dependencies (in our case
    ~40), it just works and results in a single JAR. We're lucky enough that,
    across the board, we can use one JAR for launching a message consumer, a
    Hadoop job, and a Pig job.

    That said, there are two caveats we've encountered:
    * System dependencies aren't rolled into the "uber" JAR - if you want
    something to be in the deployment artifact, you need to at a minimum put
    it into your local repo; we do this via bash scripting for HBase 0.90.0,
    for example.
    * Conflicts - so far we've managed to run "mvn dependency:tree" and
    exclude conflicting dependencies, but I'm sure there is a point where
    that will no longer work.
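    For anyone who wants to try it, a minimal shade setup looks roughly like
    this (the main class is a placeholder; ManifestResourceTransformer is the
    plugin's standard way to declare one):

    ```xml
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <executions>
        <execution>
          <phase>package</phase>
          <goals><goal>shade</goal></goals>
          <configuration>
            <transformers>
              <!-- sets Main-Class in the uber JAR's manifest -->
              <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                <mainClass>com.example.Main</mainClass>
              </transformer>
            </transformers>
          </configuration>
        </execution>
      </executions>
    </plugin>
    ```

    Run at package time, this replaces the project jar with a single merged
    JAR containing all the (non-system-scoped) dependencies.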

    I'd love to hear how others are solving the problem; so far this has
    worked for us.

    -erik

  • Alejandro Abdelnur at Jan 21, 2011 at 7:24 am
    In Oozie we ran into a similar problem.

    As workflows with pig actions proliferated, the lib/ directory of each
    workflow app had to contain Pig and its dependent JARs. This becomes a
    nightmare to maintain as the number of workflow apps grows.

    The approach to solving this was to add to Oozie the concept of a
    sharelib/ directory in HDFS: you copy into the sharelib/ all the JARs you
    want to use across multiple workflow applications.

    When submitting a workflow you can specify the sharelib/ dir you want to
    use, or you can tell Oozie to use the system sharelib/ (the default one).

    Oozie then adds all the JARs in the specified sharelib/ to the
    distributed cache for the Pig job.

    The benefit of this approach is that JAR files live only once in HDFS and
    can be managed and updated globally. And users won't miss a JAR by
    mistake.

    This feature is coming in Oozie 2.3.

    Pig could easily have a -sharelib option that points to an HDFS sharelib/
    directory, achieving the same thing.
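    Concretely, the sharelib is selected through job properties at submission
    time; assuming property names along these lines (the HDFS path is an
    example):

    ```properties
    # job.properties for a workflow that should see the shared jars
    oozie.use.system.libpath=true
    # or point at a specific sharelib directory in HDFS instead:
    oozie.libpath=hdfs://namenode:8020/user/oozie/sharelib/pig
    ```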

    <ad>
    BTW, since Oozie supports submitting pig jobs, doing 'oozie pig -f
    ....' gets you the feature for free, plus Oozie becomes a Pig server
    (you get a job ID and can track progress later), all without having to
    write a workflow.
    </ad>

    Hope this helps.

    Alejandro

  • Dmitriy Lyubimov at Jan 22, 2011 at 1:04 am
    The shade plugin is useful only up to a point. E.g., signed jars would
    not survive it in my experience (the BouncyCastle library comes to mind).

    -d
  • Dmitriy Lyubimov at Jan 22, 2011 at 1:01 am
    We have a bootstrap command that copies all libraries of our Maven
    assembly to a location in HDFS (actually, we use the Maven groupId and
    artifactId of our assembly in the hierarchical path, to ensure each
    client has the jars of exactly the same assembly build available on the
    backend).

    We also use Grunt embedding instead of server embedding; Grunt has
    invaluable preprocessing capabilities compared to PigServer. Basically we
    kick off a Java client that has Grunt integrated and knows its Maven
    build number, so it knows which HDFS locations to pass on to Pig for the
    jars. This is a little bit of a hack over Grunt, but it's only perhaps a
    hundred lines longer than doing the same thing with a PigServer.
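    For people not embedding Grunt, the same idea can be approximated from a
    plain shell wrapper: lay the jars out under a groupId/artifactId path,
    collect them into a comma-separated list, and hand that to Pig via the
    pig.additional.jars property (available in recent Pig versions). The
    paths and script name below are made up:

    ```shell
    # Collect every jar for this assembly build into a comma-separated list.
    JAR_DIR="pig-jars/com.example/my-assembly/1.0"
    ADDITIONAL=$(ls "$JAR_DIR"/*.jar 2>/dev/null | tr '\n' ',' | sed 's/,$//')
    # Echoed instead of executed so the final command can be inspected:
    echo "pig -Dpig.additional.jars=$ADDITIONAL my_script.pig"
    ```

    Unlike REGISTER statements, this keeps the jar list out of the script
    itself, so the same script runs against whatever jar set was deployed.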

    -dmitriy
    Discussion Overview
    group: user@pig.apache.org
    categories: pig, hadoop
    posted: Jan 19, '11 at 10:24p
    active: Jan 22, '11 at 1:04a
    posts: 9
    users: 6
    website: pig.apache.org

    People

    Translate

    site design / logo © 2021 Grokbase