Hello,

I am trying to set up a MapReduce job so that the task JVMs are reused on
each cluster node. Libraries used by my MapReduce job have a significant
initialization time, mainly creating singletons, and it would be nice if
I could make it so that these singletons are only created once per slot,
rather than once per task. The input for the job is HBase, so for a
large row scan, the initialization time is proving to be quite
significant, as the processing done on each row is rather small and the
number of tasks is high.
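
To make the cost concrete, here is a minimal sketch in plain Java (no Hadoop classes) of why JVM reuse matters for this kind of library. The names JvmReuseSketch, ExpensiveLibrary and runTask are hypothetical stand-ins, not part of any real library: a lazily built singleton is constructed once per JVM, so every task that runs in an already-warm JVM skips the initialization.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class JvmReuseSketch {

    /** Hypothetical stand-in for a library object that is slow to build. */
    static final class ExpensiveLibrary {
        // Counts constructions so the effect of reuse is observable.
        static final AtomicInteger INIT_COUNT = new AtomicInteger();

        private ExpensiveLibrary() {
            INIT_COUNT.incrementAndGet(); // a real library would do its slow setup here
        }

        // Initialization-on-demand holder: the JVM builds INSTANCE once, on first use.
        private static final class Holder {
            static final ExpensiveLibrary INSTANCE = new ExpensiveLibrary();
        }

        static ExpensiveLibrary get() {
            return Holder.INSTANCE;
        }
    }

    /** Hypothetical stand-in for the setup step of one task. */
    static void runTask() {
        ExpensiveLibrary.get(); // only the first task in a given JVM pays the cost
    }

    public static void main(String[] args) {
        // Three "tasks" in the same JVM, as would happen with JVM reuse enabled.
        for (int i = 0; i < 3; i++) {
            runTask();
        }
        System.out.println("initializations: " + ExpensiveLibrary.INIT_COUNT.get());
    }
}
```

If each task instead starts in a fresh JVM, the static state is gone and the constructor runs once per task, which is the behavior described above.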

I am setting mapred.job.reuse.jvm.num.tasks to -1 in the job
configuration, as stated in the documentation ([1]), yet I am still
seeing a different JVM start for each task. This is visible both by
watching the processes executing on each node using ps and by watching
the debugging logs from the job. Otherwise, the job is working as
expected.

I have tried switching to the deprecated JobConf class and using
setNumTasksToExecutePerJvm, but to no avail. I also tried setting
mapreduce.job.jvm.numtasks, the equivalent setting in Hadoop 0.21, in
case the documentation was out of date, though this did not help either.

I have confirmed that mapred.job.reuse.jvm.num.tasks is being
transferred to the copy of the job configuration on the task tracker, by
looking at the task tracker's copy of job.xml ([2]).

I am running Cloudera's cdh3u0 (Hadoop 0.20.2, full version string:
0.20.2-cdh3u0, r81256ad0f2e4ab2bd34b04f53d25a6c23686dd14) and HBase
0.90.1.

Thank you in advance if anyone may be able to shed light on this issue.

[1] -
http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html#Task+JVM+Reuse

[2] - The following appears in the property file (split across multiple
lines by me for readability):
<property>
<!--Loaded from /mnt/mapred/jt/jobTracker/job_201107281409_0028.xml-->
<name>mapred.job.reuse.jvm.num.tasks</name>
<value>-1</value>
</property>
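
For anyone who wants to repeat this check without reading job.xml by eye, the following is a rough sketch using only the JDK's XML parser. The class name JobXmlCheck and the lookup helper are hypothetical, and the sample string wraps the property above in a <configuration> root element, which is an assumption about the surrounding file.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class JobXmlCheck {

    /** Returns the <value> of the named <property>, or null if it is absent. */
    static String lookup(String xml, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element p = (Element) props.item(i);
                NodeList names = p.getElementsByTagName("name");
                NodeList values = p.getElementsByTagName("value");
                if (names.getLength() > 0 && values.getLength() > 0
                        && name.equals(names.item(0).getTextContent().trim())) {
                    return values.item(0).getTextContent().trim();
                }
            }
            return null;
        } catch (Exception e) {
            // Wrap parser errors so callers do not need checked exceptions.
            throw new IllegalStateException("could not parse job configuration", e);
        }
    }

    public static void main(String[] args) {
        // Sample input: the property from this message inside an assumed root element.
        String xml = "<configuration><property>"
                + "<!--Loaded from /mnt/mapred/jt/jobTracker/job_201107281409_0028.xml-->"
                + "<name>mapred.job.reuse.jvm.num.tasks</name><value>-1</value>"
                + "</property></configuration>";
        System.out.println(lookup(xml, "mapred.job.reuse.jvm.num.tasks"));
    }
}
```

This only confirms that the value reached the file. It says nothing about whether the task tracker acts on it.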

Regards,

Brandon Vargo

  • Harsh J at Jul 29, 2011 at 7:03 pm
    Brandon,

    New JVMs for each slot will be spawned across different jobs. For
    tasks of the same job, this shouldn't happen. Are you seeing this
    happen for tasks of the same job itself?

    Also, since your question may be specific to CDH use, I've moved the
    discussion to cdh-user@cloudera.org (mapreduce-user@ bcc'd)

    --
    Harsh J

Discussion Overview
group: mapreduce-user @
categories: hadoop
posted: Jul 29, '11 at 6:20p
active: Jul 29, '11 at 7:03p
posts: 2
users: 2
website: hadoop.apache.org...
irc: #hadoop
