Hi guys,
I'm going through Hadoop: The Definitive Guide, trying to understand how to
use DistributedCache (0.20.2) to make a configuration file available to my
Mapper on every node of the cluster. The book says I should use
DistributedCache.addCacheFile to make the file available and then retrieve
it using DistributedCache.getLocalCacheFiles. However, when I run my Job it
fails because the config file is not present. Looking at the implementation
of DistributedCache, addCacheFile sets the property mapred.cache.files
while getLocalCacheFiles reads the property mapred.cache.localFiles.
I figured maybe I should pair setLocalFiles with getLocalCacheFiles, or
getCacheFiles with addCacheFile, but I wanted to see if anyone had a
further explanation as to why the code might not be working.

Thanks!
Pony


  • Shi Yu at Jun 6, 2011 at 9:14 pm

    On 6/6/2011 3:49 PM, Juan P. wrote:
    Hi guys,
    I'm going through Hadoop: The Definitive Guide, trying to understand how to
    use DistributedCache (0.20.2) to make a configuration file available to my
    Mapper on every node of the cluster. The book says I should use
    DistributedCache.addCacheFile to make the file available and then retrieve
    it using DistributedCache.getLocalCacheFiles. However, when I run my Job it
    fails because the config file is not present. Looking at the implementation
    of DistributedCache, addCacheFile sets the property mapred.cache.files
    while getLocalCacheFiles reads the property mapred.cache.localFiles.
    I figured maybe I should pair setLocalFiles with getLocalCacheFiles, or
    getCacheFiles with addCacheFile, but I wanted to see if anyone had a
    further explanation as to why the code might not be working.

    Thanks!
    Pony
    I still don't understand: in a cluster you have a directory shared by
    all the nodes, right? Just put the configuration file in that directory
    and load it in all the mappers; isn't that simple?
    So I still don't understand why bother with DistributedCache; the only
    reason might be that a shared directory is costly in network traffic
    and usually has a storage limit.

    Shi
  • John Armstrong at Jun 6, 2011 at 9:27 pm

    On Mon, 06 Jun 2011 16:14:14 -0500, Shi Yu wrote:
    I still don't understand: in a cluster you have a directory shared by
    all the nodes, right? Just put the configuration file in that directory
    and load it in all the mappers; isn't that simple?
    So I still don't understand why bother with DistributedCache; the only
    reason might be that a shared directory is costly in network traffic
    and usually has a storage limit.
    That's exactly the problem the DistributedCache is designed for. It
    guarantees that you only need to copy the file to any given local
    filesystem once. Doing it the way you suggest, if there are a hundred
    mappers on a given node they'd all need to make their own local copy of
    the file, instead of making one local copy and reusing it locally from
    then on.
  • Juan P. at Jun 7, 2011 at 12:41 pm
    John,
    I'm not 100% clear on what you meant. Are you saying I should put the file
    into my HDFS cluster, or should I use DistributedCache? If you suggest the
    latter, could you address my original question?

    Thanks for your help!
    Pony
  • John Armstrong at Jun 7, 2011 at 1:52 pm

    On Tue, 7 Jun 2011 09:41:21 -0300, "Juan P." wrote:
    I'm not 100% clear on what you meant. Are you saying I should put the file
    into my HDFS cluster, or should I use DistributedCache? If you suggest the
    latter, could you address my original question?
    I mean that you can certainly get away with putting information into a
    known place on HDFS and loading it in each mapper or reducer, but that may
    become very inefficient as your problem scales up. Mostly I was responding
    to Shi Yu's question about why the DC is even worth using at all.
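
    For completeness, here is a minimal sketch of that "known place on HDFS"
    approach, the one the DistributedCache is meant to improve on. The class
    name and the path /config/app.properties are hypothetical; every task
    opens the file from HDFS itself in setup():

    import java.io.IOException;
    import java.util.Properties;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Each mapper reads the shared config straight from HDFS; no
    // DistributedCache involved, so every task attempt pays the HDFS read.
    public class HdfsConfigMapper extends Mapper<LongWritable, Text, Text, Text> {

      private final Properties appConfig = new Properties();

      @Override
      protected void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        FileSystem fs = FileSystem.get(conf);
        // Hypothetical config file in a well-known HDFS location.
        FSDataInputStream in = fs.open(new Path("/config/app.properties"));
        try {
          appConfig.load(in);
        } finally {
          in.close();
        }
      }
    }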

    As to your question, here's how I do it, which I think I basically lifted
    from an example in The Definitive Guide. There may be better ways, though.

    In my setup, I put files into the DC by getting Path objects (which should
    be able to reference either HDFS or local filesystem files, though I always
    have my files on HDFS to start) and using

    DistributedCache.addCacheFile(path.toUri(), conf);

    Then within my mapper or reducer I retrieve all the cached files with

    Path[] cacheFiles = DistributedCache.getLocalCacheFiles(conf);

    IIRC, this is what you were doing. The catch is that this returns all of
    the cached files, now sitting in a working directory on the local
    filesystem. Luckily, I know the filename of the file I want, so I iterate:

    for (Path cachePath : cacheFiles) {
        if (cachePath.getName().equals(cachedFilename)) {
            return cachePath;
        }
    }

    Then I've got the path to the local filesystem copy of the file I want in
    hand and I can do whatever I want with it.
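
    Putting those pieces together, here is a minimal end-to-end sketch of the
    pattern in the 0.20.2 API; the class names and the file name
    app-config.xml are just illustrative placeholders:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CachedConfigExample {

      // Driver side: register the HDFS file with the DistributedCache before
      // the job is submitted. Note that the call goes against the Job's own
      // Configuration, not a copy made before the Job was constructed.
      public static void addConfig(Job job, Path configOnHdfs) {
        DistributedCache.addCacheFile(configOnHdfs.toUri(), job.getConfiguration());
      }

      public static class ConfigMapper
          extends Mapper<LongWritable, Text, Text, Text> {

        private Path localConfig;

        @Override
        protected void setup(Context context) throws IOException {
          // Task side: the cached files have already been copied to the local
          // filesystem; pick ours out by file name.
          Path[] cacheFiles =
              DistributedCache.getLocalCacheFiles(context.getConfiguration());
          if (cacheFiles != null) {
            for (Path cachePath : cacheFiles) {
              if (cachePath.getName().equals("app-config.xml")) {
                localConfig = cachePath;
                return;
              }
            }
          }
          throw new IOException("app-config.xml not found in the DistributedCache");
        }
      }
    }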

    hth
  • Robert Evans at Jun 9, 2011 at 1:50 pm
    I think the issue you are seeing is that the distributed cache is not set up by default to create symlinks to the files it pulls over. If you want to access them through symlinks in the local working directory, call DistributedCache.createSymlink(conf) before submitting your job; otherwise you can use getLocalCacheFiles and getLocalCacheArchives to find out where the files are.

    One thing to be aware of is that cache archive and cache file URIs may optionally end with #<name>, where <name> is the name of the symlink you want on the compute node.
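
    For example, a short sketch of that symlink variant; the HDFS path and the
    #config symlink name here are hypothetical:

    import java.io.File;
    import java.net.URI;
    import java.net.URISyntaxException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;

    public class SymlinkCacheExample {

      // Driver side: enable symlink creation and register the file, using a
      // URI fragment to choose the symlink name, before submitting the job.
      public static void addConfig(Configuration conf) throws URISyntaxException {
        DistributedCache.createSymlink(conf);
        DistributedCache.addCacheFile(
            new URI("/user/pony/app-config.xml#config"), conf);
      }

      // Task side: the cached file is now reachable by its symlink name in
      // the task's working directory, without calling getLocalCacheFiles.
      public static File cachedConfig() {
        return new File("config");
      }
    }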

    --Bobby Evans


