FAQ
I know in general that I shouldn't worry too much about
initialization costs, as they will be amortized over the life of the
job and are often a drop in the bucket time-wise. However, in my
setup I have a conf() method that needs to load some resources from
disk. This currently happens on a per-job basis. I know that each
node in my cluster is going to need these resources, and every job I
submit is going to end up doing this same thing. So I was wondering
if there is any way these resources could be loaded once per startup
of the task tracker. In some sense, this is akin to putting
something into application scope in a webapp as opposed to session
scope.

Thanks,
Grant
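
For concreteness, a minimal sketch of the per-task loading pattern described
above, using the old org.apache.hadoop.mapred API; the ResourceMapper class and
the "resources.properties" file name are hypothetical, not from the thread:

import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ResourceMapper implements Mapper {

  private Properties resources = new Properties();

  // configure() runs once per task, so every task of every job re-reads
  // the same file from local disk: the repeated cost described above.
  public void configure(JobConf job) {
    try {
      // "resources.properties" is a hypothetical file name
      resources.load(new FileInputStream("resources.properties"));
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void map(WritableComparable key, Writable value,
                  OutputCollector output, Reporter reporter) throws IOException {
    // ... use the loaded resources here ...
  }

  public void close() throws IOException {
  }
}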

  • Dennis Kubes at Oct 30, 2006 at 4:01 pm
    We have been doing some similar things, but we use custom MapRunner
    classes to load the resources (for example, files that need to be opened,
    or a shared cache to reduce network reads) once per map split and then
    pass the resources into the map tasks. Here is an example of what it
    might look like:

    Dennis

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapFileOutputFormat;
    import org.apache.hadoop.mapred.MapRunnable;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.nutch.util.NutchConfiguration;

    public class YourRunner implements MapRunnable {

      private JobConf job;
      private YourMapper mapper;   // your own Mapper, with setCache/setReaders
      private Class inputKeyClass;
      private Class inputValueClass;

      public void configure(JobConf job) {
        this.job = job;
        this.inputKeyClass = job.getInputKeyClass();
        this.inputValueClass = job.getInputValueClass();
      }

      // Close all MapFile readers, ignoring any errors on close.
      private void closeReaders(MapFile.Reader[] readers) {
        if (readers == null)
          return;
        for (int i = 0; i < readers.length; i++) {
          try {
            readers[i].close();
          }
          catch (Exception e) {
            // ignore close errors
          }
        }
      }

      public void run(RecordReader input, OutputCollector output, Reporter reporter)
        throws IOException {

        final FileSystem fs = FileSystem.get(job);
        Configuration conf = NutchConfiguration.create();
        mapper = new YourMapper();

        // Open the shared resources once per map split (a set of MapFile
        // readers and a shared cache) and hand them to the mapper.
        // "parent" and "mapfiledir" are placeholders for wherever the
        // MapFile data actually lives.
        Path filesPath = new Path(parent, mapfiledir);
        MapFile.Reader[] readers = MapFileOutputFormat.getReaders(fs, filesPath, conf);
        Map<Integer, String> cache = new HashMap<Integer, String>();

        mapper.setCache(cache);
        mapper.setReaders(readers);

        try {
          WritableComparable key =
            (WritableComparable)job.newInstance(inputKeyClass);
          Writable value = (Writable)job.newInstance(inputValueClass);

          // Standard MapRunner loop: every record in the split goes to the
          // mapper, which can use the readers and cache opened above.
          while (input.next(key, value)) {
            mapper.map(key, value, output, reporter);
          }
        }
        finally {
          mapper.close();
        }

        closeReaders(readers);
      }
    }
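
    (Presumably a runner like this is registered on the job with
    JobConf.setMapRunnerClass, e.g. job.setMapRunnerClass(YourRunner.class),
    so that it replaces the default MapRunner; the readers and cache then
    live for the whole split rather than being reopened per record.)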


  • Doug Cutting at Oct 30, 2006 at 5:49 pm
    A new JVM is used per task, so the resources will need to be re-read
    per task.

    The job jar is cached locally by the tasktracker, so it is only copied
    from DFS to the local disk once per job. Its contents are shared by all
    tasks in that job, and tasks are run connected to a directory with the
    unpacked jar contents, so you can include shared files in the job jar.

    You can also add other files that should be cached to your job, using
    the DistributedCache API:

    http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/filecache/DistributedCache.html

    But including things in the job jar is far simpler.

    Doug
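
    A minimal sketch of the two approaches described above, with hypothetical
    file names; the DistributedCache convenience methods shown (addCacheFile,
    getLocalCacheFiles) are from the classic org.apache.hadoop.filecache API
    linked above, and exact method names may differ across Hadoop versions:

    import java.io.File;
    import java.io.IOException;
    import java.net.URI;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;

    public class CachedResources {

      // At job submission time: register a file that is already in DFS so
      // the framework copies it to each node once and shares it across tasks.
      // "/shared/lexicon.dat" is a hypothetical DFS path.
      public static void addToCache(JobConf job) {
        DistributedCache.addCacheFile(URI.create("/shared/lexicon.dat"), job);
      }

      // Inside a task: look up the local copies of the cached files.
      public static Path[] localCopies(JobConf job) throws IOException {
        return DistributedCache.getLocalCacheFiles(job);
      }

      // The simpler alternative: bundle the file in the job jar. Tasks run
      // connected to a directory with the unpacked jar contents, so the file
      // can be opened with a plain relative path ("lexicon.dat" is again a
      // hypothetical name).
      public static File fromJobJar() {
        return new File("lexicon.dat");
      }
    }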


Discussion Overview
group: common-user
category: hadoop
posted: Oct 30, '06 at 1:47p
active: Oct 30, '06 at 5:49p
posts: 3
users: 3
website: hadoop.apache.org...
irc: #hadoop
