Hi,

Is it possible (and does it make sense) to reuse JVMs across jobs?

The mapred.job.reuse.jvm.num.tasks config option is a job-specific
parameter, as its name implies. When running multiple independent jobs
simultaneously with mapred.job.reuse.jvm.num.tasks=-1 (which means
always reuse), I see many distinct Java PIDs over the duration of the
job runs, far more than mapred.tasktracker.map.tasks.maximum +
mapred.tasktracker.reduce.tasks.maximum, instead of the same Java
processes persisting. The number of live JVMs on a given
node/tasktracker at any time never exceeds map.tasks.maximum +
reduce.tasks.maximum, as expected, but idle JVMs are torn down and new
ones spawned quite often.
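
For concreteness, this is how each job enables reuse; a minimal sketch
assuming Hadoop 0.20.x and the old mapred API, where the class name,
job name, and paths are placeholders (no mapper/reducer is set, so the
identity defaults apply):

    // Minimal sketch, assuming Hadoop 0.20.x and the old mapred API.
    // The class name, job name, and paths are placeholders.
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class JvmReuseJob {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(JvmReuseJob.class);
        conf.setJobName("jvm-reuse-test");

        // Sets mapred.job.reuse.jvm.num.tasks; -1 means a JVM may run an
        // unlimited number of tasks, but only tasks of this same job.
        conf.setNumTasksToExecutePerJvm(-1);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
      }
    }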

For example, here are the counts of distinct Java PIDs observed when
submitting 1, 2, 4, and 32 copies of the same job in parallel:

    parallel jobs    distinct Java PIDs
                1                    28
                2                    39
                4                   106
               32                   740

The relevant killing and spawning logic should be in
src/mapred/org/apache/hadoop/mapred/JvmManager.java, particularly the
reapJvm() method, but I haven't dug deeper. I am wondering whether it
would be possible, and worthwhile from a performance standpoint, to
reuse JVMs across jobs, i.e. to have a common JVM pool for all Hadoop
jobs?
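
To make the behavior I think I'm seeing concrete, here is a hedged
paraphrase of the spawn-or-reap decision; every name is a simplified
stand-in, not the actual JvmManager source:

    // A hedged paraphrase of the spawn-or-reap decision, not the verbatim
    // JvmManager source: all names here are simplified stand-ins. The key
    // constraint is that each JVM is tagged with the JobID it was launched
    // for and is only reused by tasks of that same job.
    import java.util.HashMap;
    import java.util.Map;

    public class JvmPoolSketch {
      // Stand-in for map.tasks.maximum + reduce.tasks.maximum on one node.
      static final int MAX_JVMS = 4;

      // Idle JVMs on this tasktracker: fake pid -> owning job id.
      static final Map<Integer, String> idle = new HashMap<Integer, String>();
      static int live = 0;
      static int nextPid = 1000;

      // Called when a task finishes and its JVM becomes idle.
      static void releaseJvm(int pid, String jobId) {
        idle.put(pid, jobId);
      }

      // Returns the pid of the JVM that services a task of jobId.
      static int assignJvm(String jobId) {
        for (Map.Entry<Integer, String> e : idle.entrySet()) {
          if (e.getValue().equals(jobId)) {
            idle.remove(e.getKey());   // same-job idle JVM: reuse, no churn
            return e.getKey();
          }
        }
        if (live >= MAX_JVMS && !idle.isEmpty()) {
          Integer victim = idle.keySet().iterator().next();
          idle.remove(victim);         // reap an idle JVM of another job
          live--;
        }
        live++;                        // spawn a fresh JVM: the churn I see
        return nextPid++;
      }
    }

If that reading is right, a common pool would need JVMs that can be
re-targeted between jobs (classpath, distributed cache, user
isolation), which is presumably why the per-job tagging exists in the
first place.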
thanks,

- Vasilis
