Grokbase Groups Pig user April 2009
How does Hadoop distributed file cache work with Pig?

I have some data files and jars that are used by some UDFs I've written.
They work when I execute Pig scripts on a local file system, but they fail
with exceptions because the jars and files are not found when running in
map/reduce mode on an HDFS cluster.



Bill


  • Olga Natkovich at Apr 13, 2009 at 5:14 pm
    Distributed cache is only used for streaming right now but it would be a
    good idea to support this for UDFs as well.

    Olga

  • Kevin Weil at Apr 15, 2009 at 12:49 pm
    +1 for supporting the distributed cache from UDFs.

    Kevin




  • Mridul Muralidharan at Apr 15, 2009 at 2:14 pm
Using the distributed cache from Pig is actually not that tough.

First copy your file to HDFS - say it is copied as
hdfs://host:port/mypath/file.zip for a zip file.

Next, in your conf file, define:

a) mapred.cache.archives=hdfs://host:port/mypath/file.zip#my_location
b) mapred.create.symlink=yes

When the job runs, this creates a symlink called 'my_location' in the
current working directory (of the UDF), through which you can access the
exploded contents of your zip file.

So in your UDF, it is a simple case of opening the paths under my_location
relative to the current working directory to get to your relevant files.
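The read side of this recipe can be sketched in Java. This is a minimal
sketch, not Mridul's actual code: 'lookup.dat' is a hypothetical file
assumed to be inside file.zip, and main() fakes the symlinked directory
locally so the snippet runs outside a cluster (on a real task, Hadoop
creates the 'my_location' symlink for you).

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CacheReader {
    // In the UDF's exec(), open cached files through the relative
    // symlink ('my_location') that the framework places in the
    // task's current working directory.
    static String readFirstLine(String path) throws IOException {
        try (BufferedReader r = new BufferedReader(new FileReader(path))) {
            return r.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulate the symlinked cache dir locally; on the cluster,
        // mapred.create.symlink=yes makes Hadoop create 'my_location'
        // pointing at the exploded file.zip.
        Files.createDirectories(Paths.get("my_location"));
        Files.write(Paths.get("my_location/lookup.dat"),
                    "some lookup data".getBytes("UTF-8"));
        System.out.println(readFirstLine("my_location/lookup.dat"));
    }
}
```

Only the readFirstLine() part belongs in a real UDF; the setup in main()
stands in for what the framework does. (Note these are the property names
of that era; later Hadoop releases renamed them, e.g.
mapreduce.job.cache.archives.)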

    Hope this helps.

    Regards,
    Mridul







  • Bill Habermaas at Apr 15, 2009 at 2:53 pm
    Great tip.
    Many thanks, I'll try it.
    Bill





Discussion Overview
group: user
categories: pig, hadoop
posted: Apr 13, '09 at 5:06p
active: Apr 15, '09 at 2:53p
posts: 5
users: 4
website: pig.apache.org
