FAQ: Distributed Cache with New API
I'm trying to use the distributed cache in a MapReduce job written to the
new API (org.apache.hadoop.mapreduce.*). In my "Tool" class, a file path is
added to the distributed cache as follows:

public int run(String[] args) throws Exception {
    Configuration conf = getConf();
    Job job = new Job(conf, "Job");
    ...
    DistributedCache.addCacheFile(new Path(args[0]).toUri(), conf);
    ...
    return job.waitForCompletion(true) ? 0 : 1;
}

The "setup()" method in my mapper tries to read the path as follows:

protected void setup(Context context) throws IOException {
    Path[] paths = DistributedCache.getLocalCacheFiles(context.getConfiguration());
}

But "paths" is null.

I'm assuming I'm setting up the distributed cache incorrectly. I've seen a
few hints in previous mailing list postings that indicate that the
distributed cache is accessed via the Job and JobContext objects in the
revised API, but the javadocs don't seem to support that.

Thanks.
Larry

  • Ted Yu at Apr 15, 2010 at 8:07 pm
    Please see the sample within
    src\core\org\apache\hadoop\filecache\DistributedCache.java:

        JobConf job = new JobConf();
        DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"), job);


  • Larry Compton at Apr 15, 2010 at 8:16 pm
    Ted,

    Thanks. I have looked at that example. The javadocs for DistributedCache
    still refer to deprecated classes, like JobConf. I'm trying to use the
    revised API.

    Larry
  • Ted Yu at Apr 15, 2010 at 8:58 pm
    Please take a look at the loop starting at line 158 in TaskRunner.java:
        p[i] = DistributedCache.getLocalCache(files[i], conf,
                                              new Path(baseDir),
                                              fileStatus,
                                              false,
                                              Long.parseLong(fileTimestamps[i]),
                                              new Path(workDir.getAbsolutePath()),
                                              false);
        }
        DistributedCache.setLocalFiles(conf, stringifyPathArray(p));

    I think the confusing part is that DistributedCache.getLocalCacheFiles() is
    paired with DistributedCache.setLocalFiles().
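
    Conceptually, the framework localizes each cached file and records the
    resulting local paths in the job's Configuration, and the task-side getter
    reads that same property back. A minimal sketch of the pairing (this is
    framework-internal, not user code, and the local path literal is made up):

        // TaskRunner side: record the localized paths (hypothetical path).
        DistributedCache.setLocalFiles(conf, "/local/taskdir/lookup.dat");
        // Task side: read the same property back.
        Path[] localized = DistributedCache.getLocalCacheFiles(conf);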

    Cheers

  • Amareshwari Sri Ramadasu at Apr 16, 2010 at 5:07 am
    Hi,
    @Ted, the TaskRunner code you quoted is internal. Users are not expected to
    call DistributedCache.getLocalCache(), and they cannot, since they do not
    know all of its parameters.
    @Larry, DistributedCache was not changed to use the new API in branch 0.20;
    that change was made only from branch 0.21. See MAPREDUCE-898
    (https://issues.apache.org/jira/browse/MAPREDUCE-898).
    If you are using branch 0.20, you are encouraged to use the deprecated
    JobConf itself.
    Alternatively, you can try the following change in your code. Change the line

        DistributedCache.addCacheFile(new Path(args[0]).toUri(), conf);

    to

        DistributedCache.addCacheFile(new Path(args[0]).toUri(), job.getConfiguration());
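
    (The Job constructor makes its own copy of the Configuration it is given,
    so anything added to conf after the Job is created never reaches the job;
    job.getConfiguration() returns the copy the job actually uses.) Put
    together, the corrected driver would look like this minimal sketch, with
    the elided setup from the original left as is:

        public int run(String[] args) throws Exception {
            Configuration conf = getConf();
            Job job = new Job(conf, "Job");
            ...
            // Modify the job's own Configuration, not the one it was copied from.
            DistributedCache.addCacheFile(new Path(args[0]).toUri(),
                                          job.getConfiguration());
            ...
            return job.waitForCompletion(true) ? 0 : 1;
        }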

    Thanks
    Amareshwari

  • Larry Compton at Apr 16, 2010 at 1:13 pm
    Thanks. That clears it up.

    Larry
  • Hgahlot at Jul 8, 2010 at 9:23 pm
    I had the same problem, and Amareshwari's suggestion solved it. I am porting
    code from the 0.18.3 API to the 0.20.2 API, and I am now facing a problem
    with setting keys through the Configuration object: a value set during job
    configuration with conf.setBoolean(<key>, <boolean value>) is not retrieved
    in the mapper. I then ported the WordCount v2.0 example from the MapReduce
    tutorial to the new API, and it has the same problem; it works fine with the
    0.18.3 API but fails in the upgraded version. Also, when I try to get the
    name of the input file using

        inputFile = conf.get("map.input.file");

    it prints null.
    Kindly let me know how to set values for these user-defined keys in the new
    API.
  • Hgahlot at Jul 9, 2010 at 10:20 pm

    Answering my own question: using job.getConfiguration().set(...) instead of
    conf.set(...) solved it.
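
    The same Configuration-copy semantics apply here. A minimal sketch of the
    fix, borrowing the wordcount.skip.patterns key from the WordCount v2.0
    example (the surrounding job setup is assumed):

        // Driver: set the flag on the Job's live Configuration.
        Job job = new Job(conf, "wordcount");
        job.getConfiguration().setBoolean("wordcount.skip.patterns", true);

        // Mapper: read it back from the task-side Configuration in setup().
        @Override
        protected void setup(Context context) {
            boolean skip = context.getConfiguration()
                                  .getBoolean("wordcount.skip.patterns", false);
        }

    For map.input.file specifically, the new API exposes the input path through
    the split instead: ((FileSplit) context.getInputSplit()).getPath().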

Discussion Overview
group: common-user @ hadoop.apache.org
categories: hadoop
posted: Apr 15, '10 at 7:57p
active: Jul 9, '10 at 10:20p
posts: 8
users: 4
website: hadoop.apache.org...
irc: #hadoop
