FAQ
I can use -cacheFile to load .so files into the distributed cache and it works fine (the streaming executable links against the .so and runs), but I can't get it to work with -cacheArchive. It always says it can't find the .so file. I realize that if you jar a directory, the directory will be recreated when you unjar, but I've tried jarring a file directly, and it is easy to verify that unjarring such a jar reproduces the original file as a sibling of the jar file itself. So it seems to me that -cacheArchive should have transferred the jar file to the cwd of my task, unjarred it, and produced a .so file right there, but it doesn't link up with the executable. Like I said, I know this basic approach works just fine with -cacheFile.
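
(Roughly the two invocations I'm comparing -- the jar name, library name, mapper name, and HDFS paths below are placeholders, not my real ones:)

# works: ship the .so itself and symlink it into the task's cwd
$ hadoop jar hadoop-streaming.jar -input InputDir -output out -reducer NONE \
    -file my_mapper -mapper my_mapper \
    -cacheFile hdfs://<namenode>:<port>/user/me/libfoo.so#libfoo.so

# fails: ship a jar containing the .so, expecting it to be unpacked into the cwd
$ hadoop jar hadoop-streaming.jar -input InputDir -output out -reducer NONE \
    -file my_mapper -mapper my_mapper \
    -cacheArchive hdfs://<namenode>:<port>/user/me/libfoo.jar#libfoo.jar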

What could be the problem here? I can't easily see the files on the cluster since it is a remote cluster with limited access. I don't believe I can ssh to any individual machine to investigate the files that are created for a task...but I think I have worked through the process logically and I'm not sure what I'm doing wrong.

Thoughts?

________________________________________________________________________________
Keith Wiley kwiley@keithwiley.com keithwiley.com music.keithwiley.com

"Luminous beings are we, not this crude matter."
-- Yoda
________________________________________________________________________________

  • Keith Wiley at Aug 5, 2011 at 5:27 pm
    Quick followup. I swapped out the real mapper for a little Python script that just lists the cwd's contents and dumps them to the streaming output (stderr). Oddly, it doesn't look like the .jar file was unpacked. I can see it there, but not the unpacked version, so it looks like -cacheArchive transferred the file but didn't unjar it.
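
    (Same idea without the Python: a throwaway "ls" mapper shows what actually lands in the task's working directory; the jar name and HDFS path below are placeholders, not my real ones:)

    $ hadoop jar hadoop-streaming.jar -input InputDir -output debug_out -reducer NONE \
        -mapper "ls -la" \
        -cacheArchive hdfs://<namenode>:<port>/user/me/libfoo.jar#libfoo.jar
    $ hadoop dfs -cat debug_out/*    # the directory listing shows up in the job output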

    Anyone ever seen something like this before?

    ________________________________________________________________________________
    Keith Wiley kwiley@keithwiley.com keithwiley.com music.keithwiley.com

    "I do not feel obliged to believe that the same God who has endowed us with
    sense, reason, and intellect has intended us to forgo their use."
    -- Galileo Galilei
    ________________________________________________________________________________
  • Keith Wiley at Aug 5, 2011 at 5:48 pm
    Okay, I think I understand. The symlink name that follows the pound sign in the -cacheArchive directive isn't the name of the transferred jar file -- it is the name of a directory into which the jar is placed and unjarred. So it doesn't act like jar would on a local machine, where files are recreated at the current directory level; rather, everything is pushed down one level. With a corresponding -cmdenv flag to point LD_LIBRARY_PATH to the correct location, I think I can get it to find the shared libraries now.
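
    (So if I understand it, with -cacheArchive hdfs://.../libfoo.jar#mylibs the task's working directory ends up looking roughly like this -- the jar and library names are placeholders:)

    ./mylibs/                        <- symlink named after the pound sign
    ./mylibs/libfoo.so               <- the unpacked contents sit one level down
    ./mylibs/META-INF/MANIFEST.MF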
    On Aug 5, 2011, at 10:27 , Keith Wiley wrote:

    Quick followup. I swapped out the real mapper for a little Python script that just lists the cwd's contents and dumps them to the streaming output (stderr). Oddly, it doesn't look like the .jar file was unpacked. I can see it there, but not the unpacked version, so it looks like -cacheArchive transferred the file but didn't unjar it.

    Anyone ever seen something like this before?

    ________________________________________________________________________________
    Keith Wiley kwiley@keithwiley.com keithwiley.com music.keithwiley.com

    "And what if we picked the wrong religion? Every week, we're just making God
    madder and madder!"
    -- Homer Simpson
    ________________________________________________________________________________
  • Ramya Sunil at Aug 5, 2011 at 5:45 pm
    Hi Keith,

    I have tried the exact use case you have mentioned and it works fine for me.
    Below is the command line for the same:

    [ramya]$ jar vxf samplelib.jar
    created: META-INF/
    inflated: META-INF/MANIFEST.MF
    inflated: libhdfs.so

    [ramya]$ hadoop dfs -put samplelib.jar samplelib.jar

    [ramya]$ hadoop jar hadoop-streaming.jar -input InputDir \
        -mapper "ls testlink/libhdfs.so" -reducer NONE -output out \
        -cacheArchive hdfs://<namenode>:<port>/user/ramya/samplelib.jar#testlink

    [ramya]$ hadoop dfs -cat out/*
    testlink/libhdfs.so
    testlink/libhdfs.so
    testlink/libhdfs.so


    Hope it helps.

    Thanks
    Ramya

    On 8/5/11 10:10 AM, "Keith Wiley" wrote:


    I can use -cacheFile to load .so files into the distributed cache and it works
    fine (the streaming executable links against the .so and runs), but I can't get
    it to work with -cacheArchive. It always says it can't find the .so file. I
    realize that if you jar a directory, the directory will be recreated when you
    unjar, but I've tried jarring a file directly, and it is easy to verify that
    unjarring such a jar reproduces the original file as a sibling of the jar file
    itself. So it seems to me that -cacheArchive should have transferred the jar
    file to the cwd of my task, unjarred it, and produced a .so file right there,
    but it doesn't link up with the executable. Like I said, I know this basic
    approach works just fine with -cacheFile.

    What could be the problem here? I can't easily see the files on the cluster
    since it is a remote cluster with limited access. I don't believe I can ssh
    to any individual machine to investigate the files that are created for a
    task...but I think I have worked through the process logically and I'm not
    sure what I'm doing wrong.

    Thoughts?

    ________________________________________________________________________________
    Keith Wiley kwiley@keithwiley.com keithwiley.com music.keithwiley.com

    "Luminous beings are we, not this crude matter."
    -- Yoda
    ________________________________________________________________________________
  • Keith Wiley at Aug 5, 2011 at 5:50 pm
    Right, so it was pushed down a level into the "testlink" directory. That's why my shared libraries were not linking properly to my mapper executable. I can fix that by using -cmdenv to redirect LD_LIBRARY_PATH. I think that'll work.
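
    (Roughly what I have in mind -- names and paths are placeholders. A relative LD_LIBRARY_PATH should work because the symlink is created in the task's working directory, though any existing LD_LIBRARY_PATH the executable needs would have to be appended:)

    $ hadoop jar hadoop-streaming.jar -input InputDir -output out -reducer NONE \
        -file my_mapper -mapper my_mapper \
        -cacheArchive hdfs://<namenode>:<port>/user/me/samplelib.jar#testlink \
        -cmdenv LD_LIBRARY_PATH=./testlink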
    On Aug 5, 2011, at 10:44 , Ramya Sunil wrote:

    Hi Keith,

    I have tried the exact use case you have mentioned and it works fine for me.
    Below is the command line for the same:

    [ramya]$ jar vxf samplelib.jar
    created: META-INF/
    inflated: META-INF/MANIFEST.MF
    inflated: libhdfs.so

    [ramya]$ hadoop dfs -put samplelib.jar samplelib.jar

    [ramya]$ hadoop jar hadoop-streaming.jar -input InputDir \
        -mapper "ls testlink/libhdfs.so" -reducer NONE -output out \
        -cacheArchive hdfs://<namenode>:<port>/user/ramya/samplelib.jar#testlink

    [ramya]$ hadoop dfs -cat out/*
    testlink/libhdfs.so
    testlink/libhdfs.so
    testlink/libhdfs.so


    Hope it helps.

    Thanks
    Ramya

    ________________________________________________________________________________
    Keith Wiley kwiley@keithwiley.com keithwiley.com music.keithwiley.com

    "I used to be with it, but then they changed what it was. Now, what I'm with
    isn't it, and what's it seems weird and scary to me."
    -- Abe (Grandpa) Simpson
    ________________________________________________________________________________
