We have been doing something similar, but we use custom MapRunner
classes to load the resources (for example, files that need to be opened
or a shared cache to reduce network reads) once per Map split and then
pass them into the Map tasks. Here is an example of what it
might look like:
Dennis
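import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapFileOutputFormat;
import org.apache.hadoop.mapred.MapRunnable;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.nutch.util.NutchConfiguration;

// Register with job.setMapRunnerClass(YourRunner.class).
public class YourRunner implements MapRunnable {

  private JobConf job;
  private YourMapper mapper;
  private Class inputKeyClass;
  private Class inputValueClass;

  public void configure(JobConf job) {
    this.job = job;
    this.inputKeyClass = job.getInputKeyClass();
    this.inputValueClass = job.getInputValueClass();
  }

  // Best-effort close: a failure on one reader should not stop the rest.
  private void closeReaders(MapFile.Reader[] readers) {
    if (readers == null)
      return;
    for (int i = 0; i < readers.length; i++) {
      try {
        readers[i].close();
      }
      catch (Exception e) {
        // ignored on purpose
      }
    }
  }

  public void run(RecordReader input, OutputCollector output, Reporter reporter)
    throws IOException {

    final FileSystem fs = FileSystem.get(job);
    Configuration conf = NutchConfiguration.create();
    mapper = new YourMapper();

    // Load the shared resources once per split. parent and mapfiledir are
    // placeholders for wherever your MapFiles live.
    Path filesPath = new Path(parent, mapfiledir);
    MapFile.Reader[] readers = MapFileOutputFormat.getReaders(fs, filesPath, conf);
    Map<Integer, String> cache = new HashMap<Integer, String>();
    mapper.setCache(cache);
    mapper.setReaders(readers);

    try {
      // Reuse one key/value pair for every record in the split.
      WritableComparable key =
        (WritableComparable)job.newInstance(inputKeyClass);
      Writable value = (Writable)job.newInstance(inputValueClass);
      while (input.next(key, value)) {
        mapper.map(key, value, output, reporter);
      }
    }
    finally {
      // Release the readers even if the mapper fails to close.
      try {
        mapper.close();
      }
      finally {
        closeReaders(readers);
      }
    }
  }
}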
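The mapper side of the shared cache is not shown above. As a rough, Hadoop-free sketch of the idea (check the cache first, and only fall back to the expensive read on a miss), something like the following could work. CachedLookup, slowRead and slowReadCount are made-up names for illustration: slowRead stands in for whatever expensive lookup the mapper would do, such as a read through the MapFile readers.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/**
 * Hadoop-free sketch of a lookup backed by the cache the runner hands over.
 * CachedLookup and slowRead are hypothetical names; slowRead stands in for
 * an expensive read such as a lookup through a MapFile.Reader.
 */
class CachedLookup {
    private final Map<Integer, String> cache;
    private final Function<Integer, String> slowRead;
    private int slowReads = 0;

    CachedLookup(Map<Integer, String> cache, Function<Integer, String> slowRead) {
        this.cache = cache;
        this.slowRead = slowRead;
    }

    /** Returns the value for key, calling slowRead only on a cache miss. */
    String get(int key) {
        String value = cache.get(key);
        if (value == null) {
            slowReads++;
            value = slowRead.apply(key);
            if (value != null) {
                cache.put(key, value); // remember it for later records
            }
        }
        return value;
    }

    /** Number of times the expensive read was actually performed. */
    int slowReadCount() {
        return slowReads;
    }
}
```

Because the runner creates the cache once per split, every record in that split shares it, so a key that repeats within the split costs only one expensive read.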
Grant Ingersoll wrote:
I know in general that I shouldn't worry too much about initialization
costs, as they will be amortized over the life of the job and are
often a drop in the bucket time-wise. However, in my setup I have a
conf() method that needs to load in some resources from disk. This
is on a per job basis currently. I know that each node in my cluster
is going to need these resources and every job I submit is going to
end up doing this same thing. So I was wondering if there was any way
these resources could be loaded once per startup of the task tracker.
In some sense, this is akin to putting something into application
scope in a webapp as opposed to session scope.
Thanks,
Grant