Hi Hadoop guys,

At the moment, for each task (map or reduce) a new JVM is created by the
TaskTracker to run the Job.

We have a high number of small files in our Hadoop cluster, thus
requiring a high number of map tasks. I know this is suboptimal, but
aggregating those small files is not possible right now. So an idea came
to us: launch tasks in the TaskTracker JVM so that the overhead of
creating a new JVM disappears.

I already have a working patch against the 0.10.1 release of Hadoop that
launches tasks inside the TaskTracker JVM if a specific parameter is set
in the JobConf of the launched job (for jobs we trust ;) ). Each new task
gets its own class loader which loads every class needed by the task, as
if it were running in a brand new JVM (the same "classpath" is used).

For that to work, an upgrade of commons-logging to version 1.1 is
needed in order to circumvent class loader / memory leak issues. I've
done some profiling with JProfiler on the TaskTracker to find and
remove memory leaks, so I'm pretty confident in this code.
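
As an illustration only, here is a minimal sketch of the mechanism described
above. The property name, the Runnable stand-in for a real Task, and the
class itself are assumptions for illustration, not the actual patch:

import java.net.URL;
import java.net.URLClassLoader;

import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.mapred.JobConf;

// Hypothetical sketch: run a task inside the TaskTracker JVM behind its own
// class loader so its classes and statics stay isolated, as if it ran in a
// fresh JVM.
public class InProcessTaskLauncher {

  /** Jobs we trust set this flag in their JobConf (illustrative name). */
  public static final String RUN_IN_TRACKER = "mapred.task.run.in.tracker.jvm";

  public void launch(JobConf job, Runnable task, URL[] taskClasspath)
      throws Exception {
    if (!job.getBoolean(RUN_IN_TRACKER, false)) {
      return; // fall back to the normal behaviour: fork a child JVM (not shown)
    }
    URLClassLoader taskLoader =
        new URLClassLoader(taskClasspath, JobConf.class.getClassLoader());
    ClassLoader previous = Thread.currentThread().getContextClassLoader();
    try {
      Thread.currentThread().setContextClassLoader(taskLoader);
      task.run();
    } finally {
      Thread.currentThread().setContextClassLoader(previous);
      // commons-logging 1.1: release Log instances cached against the task's
      // class loader, otherwise the loader and its classes can never be GCed.
      LogFactory.release(taskLoader);
    }
  }
}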

If you are interested in this, please let me know.
If so, I will provide a patch against the current Hadoop trunk in Jira
as soon as possible.

--
Philippe.


  • Torsten Curdt at Mar 19, 2007 at 3:09 pm

    On 19.03.2007, at 15:46, Philippe Gassmann wrote:

    Hi Hadoop guys,

    At the moment, for each task (map or reduce) a new JVM is created
    by the
    TaskTracker to run the Job.

    We have a high number of small files in our Hadoop cluster, thus
    requiring a high number of map tasks. I know this is suboptimal, but
    aggregating those small files is not possible right now. So an idea
    came to us: launch tasks in the TaskTracker JVM so that the overhead
    of creating a new JVM disappears.
    Cool stuff! :)

    cheers
    --
    Torsten
  • Doug Cutting at Mar 19, 2007 at 4:55 pm

    Philippe Gassmann wrote:
    At the moment, for each task (map or reduce) a new JVM is created by the
    TaskTracker to run the Job.

    We have a high number of small files in our Hadoop cluster, thus
    requiring a high number of map tasks. I know this is suboptimal, but
    aggregating those small files is not possible right now. So an idea
    came to us: launch tasks in the TaskTracker JVM so that the overhead
    of creating a new JVM disappears.
    A simpler approach might be to develop an InputFormat that includes
    multiple files per split.
    I already have a working patch against the 0.10.1 release of Hadoop that
    launches tasks inside the TaskTracker JVM if a specific parameter is set
    in the JobConf of the launched job (for jobs we trust ;) ).
    Ideally this could be through a task-running interface that permits one
    to plug in different implementations. For example, sometimes it may
    make sense to run tasks in-process, sometimes to run them in a child
    JVM, and sometimes to fork a non-Java sub-process. So, rather than
    specifying a flag on the job, one would specify the runner
    implementation class.

    Doug
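    A minimal sketch of the pluggable runner idea above; the interface name,
    configuration key, and use of Runnable in place of the real Task type are
    assumptions, not existing Hadoop APIs:

    import java.io.IOException;

    import org.apache.hadoop.mapred.JobConf;

    // Hypothetical task-running interface: the TaskTracker would instantiate
    // the implementation class named in the job configuration (for example
    // via a property such as "mapred.task.launcher.class") instead of
    // honouring a boolean in-process flag.
    public interface TaskLauncher {
      // Runs one task (map or reduce) to completion, however it sees fit:
      // in-process, in a child JVM, or as a non-Java sub-process.
      void runTask(JobConf job, Runnable task) throws IOException;
    }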
  • Philippe Gassmann at Mar 19, 2007 at 5:54 pm

    Doug Cutting wrote:
    Philippe Gassmann wrote:
    At the moment, for each task (map or reduce) a new JVM is created by the
    TaskTracker to run the Job.

    We have a high number of small files in our Hadoop cluster, thus
    requiring a high number of map tasks. I know this is suboptimal, but
    aggregating those small files is not possible right now. So an idea
    came to us: launch tasks in the TaskTracker JVM so that the overhead
    of creating a new JVM disappears.
    A simpler approach might be to develop an InputFormat that includes
    multiple files per split.
    Yes, but the issue remains present if you have to deal with a high
    number of map tasks to distribute the load across many machines.
    Launching a JVM is costly; let's say it costs 1 second (I'm optimistic).
    If you have to run 2000 maps, that is 2000 seconds lost launching JVMs...
    I already have a working patch against the 0.10.1 release of Hadoop that
    launches tasks inside the TaskTracker JVM if a specific parameter is set
    in the JobConf of the launched job (for jobs we trust ;) ).
    Ideally this could be through a task-running interface that permits
    one to plug in different implementations. For example, sometimes it
    may make sense to run tasks in-process, sometimes to run them in a
    child JVM, and sometimes to fork a non-Java sub-process. So, rather
    than specifying a flag on the job, one would specify the runner
    implementation class.
    A bit of refactoring of the TaskRunner hierarchy is needed for this to
    work: the code that launches tasks in the JVM or in a separate process
    is very similar, and it would make sense for TaskRunner to be the
    superclass of an InJVMRunner and a ChildJVMRunner.
    But what can we do with MapTaskRunner and ReduceTaskRunner? It is not
    acceptable to have, let's say, two or more implementations of
    MapTaskRunner (one for child-JVM execution, one for in-tracker-JVM
    execution...). It would be painful to maintain and very complicated.
    Doug
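    A rough sketch of the refactoring being discussed, assuming the
    map/reduce distinction stays in the task object handed to the runner so
    that it does not have to be multiplied across launch modes; all names
    below are illustrative:

    // Illustrative only: the launch mode is a TaskRunner subclass, while the
    // map/reduce specifics live in the task it is handed (Runnable stands in
    // for the real Task type), so no per-mode MapTaskRunner/ReduceTaskRunner
    // variants are needed.
    public abstract class TaskRunner {
      protected final Runnable task;

      protected TaskRunner(Runnable task) {
        this.task = task;
      }

      /** Shared setup (classpath, working directory, ...) would live here. */
      public abstract void run() throws Exception;
    }

    class InJVMRunner extends TaskRunner {
      InJVMRunner(Runnable task) { super(task); }
      public void run() {
        task.run();  // execute inside the TaskTracker JVM
      }
    }

    class ChildJVMRunner extends TaskRunner {
      ChildJVMRunner(Runnable task) { super(task); }
      public void run() throws Exception {
        // fork a child JVM and wait for it, as the current code does
        // (ProcessBuilder plumbing omitted)
      }
    }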
  • Doug Cutting at Mar 19, 2007 at 6:14 pm

    Philippe Gassmann wrote:
    Yes, but the issue remains present if you have to deal with a high
    number of map tasks to distribute the load across many machines.
    Launching a JVM is costly; let's say it costs 1 second (I'm optimistic).
    If you have to run 2000 maps, that is 2000 seconds lost launching JVMs...
    The InputFormat controls the number of map tasks. So, if 2000 is so
    many that JVM startup time dominates, you can develop an InputFormat
    that splits the input into fewer tasks so that this is not a problem.
    A bit of refactoring of the TaskRunner hierarchy is needed for this to
    work: the code that launches tasks in the JVM or in a separate process
    is very similar, and it would make sense for TaskRunner to be the
    superclass of an InJVMRunner and a ChildJVMRunner.
    But what can we do with MapTaskRunner and ReduceTaskRunner? It is not
    acceptable to have, let's say, two or more implementations of
    MapTaskRunner (one for child-JVM execution, one for in-tracker-JVM
    execution...). It would be painful to maintain and very complicated.
    Perhaps it is too complicated for now, but I think we will want
    something like that long-term, so it is worth thinking about.

    Doug
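    A minimal sketch of the grouping such an InputFormat would perform; the
    helper class is hypothetical and the RecordReader plumbing (plus the
    exact getSplits signature in 0.10) is omitted:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.fs.Path;

    // Hypothetical helper: pack many small input files into a fixed number
    // of groups so the job launches far fewer map tasks.
    public class SmallFileGrouper {
      /** Distributes the input files round-robin over numSplits groups. */
      public static List<List<Path>> group(List<Path> files, int numSplits) {
        int n = Math.max(1, numSplits);
        List<List<Path>> groups = new ArrayList<List<Path>>();
        for (int i = 0; i < n; i++) {
          groups.add(new ArrayList<Path>());
        }
        for (int i = 0; i < files.size(); i++) {
          groups.get(i % n).add(files.get(i));
        }
        return groups;  // each group would back one InputSplit, i.e. one map task
      }
    }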
  • Milind Bhandarkar at Mar 19, 2007 at 6:21 pm

    Yes, but the issue remains present if you have to deal with a high
    number of map tasks to distribute the load across many machines.
    Launching a JVM is costly; let's say it costs 1 second (I'm optimistic).
    If you have to run 2000 maps, that is 2000 seconds lost launching
    JVMs...
    Executing users' code in system daemons is a security risk. In my
    experience, security always wins when pitted against performance.
    IMHO, there is a happy middle ground, i.e. to maintain a pool of
    running JVMs that are launched when the TaskTracker starts up. Even
    then, care has to be taken against memory leaks etc.

    - Milind

    --
    Milind Bhandarkar
    (mailto:milindb@yahoo-inc.com)
    (phone: 408-349-2136 W)
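    A rough sketch of the pooled-JVM idea above, assuming a hypothetical
    TaskWorker main class that loops reading task descriptions from stdin;
    none of these names exist in Hadoop:

    import java.io.IOException;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Hypothetical pool of pre-launched worker JVMs: the TaskTracker borrows
    // a warm JVM per task instead of forking a fresh one each time.
    public class TaskJvmPool {
      private final BlockingQueue<Process> idle;

      public TaskJvmPool(int size) throws IOException {
        idle = new ArrayBlockingQueue<Process>(size);
        for (int i = 0; i < size; i++) {
          idle.add(launchWorker());
        }
      }

      private Process launchWorker() throws IOException {
        // TaskWorker is a made-up main class that reads task descriptions on
        // stdin and reports status on stdout.
        return new ProcessBuilder("java", "-Xmx200m", "TaskWorker").start();
      }

      /** Borrows a warm JVM; the caller feeds it the task description. */
      public Process borrow() throws InterruptedException {
        return idle.take();
      }

      /** Returns a JVM to the pool, replacing it if it has died. */
      public void giveBack(Process worker) throws IOException {
        try {
          worker.exitValue();          // throws if the process is still alive
          idle.offer(launchWorker());  // the worker died: replace it
        } catch (IllegalThreadStateException stillRunning) {
          idle.offer(worker);
        }
      }
    }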
  • Torsten Curdt at Mar 19, 2007 at 7:36 pm

    Yes, but the issue remains present if you have to deal with a high
    number of map tasks to distribute the load across many machines.
    Launching a JVM is costly; let's say it costs 1 second (I'm optimistic).
    If you have to run 2000 maps, that is 2000 seconds lost launching
    JVMs...
    Executing users' code in system daemons is a security risk.
    Of course there is a security benefit in starting the jobs in a
    different JVM, but if you don't trust the code you are executing, this
    is probably not for you either. So the bottom line is: if you weigh up
    the performance penalty against the gained security, I am still not
    excited about the JVM spawning idea.

    If you really consider security that big of a problem - come up with
    your own language to ease and restrict the jobs.

    My 2 cents
    --
    Torsten
  • Stephane Bailliez at Mar 20, 2007 at 10:25 am

    Torsten Curdt wrote:
    Executing users' code in system daemons is a security risk.
    Of course there is a security benefit in starting the jobs in a different
    JVM, but if you don't trust the code you are executing, this is probably
    not for you either. So the bottom line is: if you weigh up the performance
    penalty against the gained security, I am still not excited about the JVM
    spawning idea.

    If you really consider security that big of a problem - come up with
    your own language to ease and restrict the jobs.
    I think security here was more about 'taking down the whole task
    tracker' risk.

    Being a complete idiot when it comes to distributed computing, I would
    say it is easy to blow up a JVM when doing such distributed jobs (be it
    because of an OOM or anything else).

    If you run within the task tracker VM, you'll have to carefully size the
    tracker VM to accommodate the resources of all possible jobs running at
    the same time, or simply allocate a gigantic amount of resources 'just in
    case', which kind of offsets the benefits of any performance improvement
    against stability.

    Not to mention cleaning up all the mess left by running jobs, including
    flushing the introspection cache to avoid leaks, which will then impact
    the performance of other jobs since it is not a selective flush.

    Failing jobs are not exactly uncommon, and running things in a sandboxed
    environment with less risk for the tracker seems like a perfectly
    reasonable choice. So yeah, VM pooling certainly makes perfect sense for
    it, or one should probably look at what Doug suggests as well.

    My 0.01 kopek ;)

    -- stephane
  • Torsten Curdt at Mar 20, 2007 at 11:29 am

    On 20.03.2007, at 11:19, Stephane Bailliez wrote:

    Torsten Curdt wrote:
    Executing users' code in system daemons is a security risk.
    Of course there is a security benefit in starting the jobs in a
    different JVM, but if you don't trust the code you are executing,
    this is probably not for you either. So the bottom line is: if you
    weigh up the performance penalty against the gained security, I am
    still not excited about the JVM spawning idea.
    If you really consider security that big of a problem - come up
    with your own language to ease and restrict the jobs.
    I think security here was more about 'taking down the whole task
    tracker' risk.
    Well, the same applies
    Being a complete idiot when it comes to distributed computing, I would
    say it is easy to blow up a JVM when doing such distributed jobs (be
    it because of an OOM or anything else).
    Then restrict what people can do - at least Google went that route.
    If you run within the task tracker VM, you'll have to carefully size
    the tracker VM to accommodate the resources of all possible jobs
    running at the same time, or simply allocate a gigantic amount of
    resources 'just in case', which kind of offsets the benefits of any
    performance improvement against stability.
    The question is whether the task tracker should have access to that
    gigantic amount of resources, in one JVM or the other.
    Not to mention cleaning up all the mess left by running jobs,
    including flushing the introspection cache to avoid leaks, which
    will then impact the performance of other jobs since it is not a
    selective flush.

    Failing jobs are not exactly uncommon, and running things in a
    sandboxed environment with less risk for the tracker seems like a
    perfectly reasonable choice. So yeah, VM pooling certainly makes
    perfect sense for it
    I am still not convinced - sorry

    It's a bit like wanting to run JSPs in a separate JVM because they
    might take down the servlet container.

    cheers
    --
    Torsten
  • Stephane Bailliez at Mar 20, 2007 at 2:06 pm

    Torsten Curdt wrote:
    Being a complete idiot when it comes to distributed computing, I would
    say it is easy to blow up a JVM when doing such distributed jobs (be
    it because of an OOM or anything else).
    Then restrict what people can do - at least Google went that route.
    I don't know what Google did on the specifics :)

    If you want to do that with Java, you have to restrict memory usage, CPU
    usage and descriptor access within each in-VM instance. That's a
    considerable amount of work that likely implies writing a specific agent
    for the VM (or rather an agent for a specific VM, because it's pretty
    unlikely that you will get the same results across VMs), assuming that
    can then really be done at the class loader level for each task (which
    seems insanely complex to me if you have to consider allocation done at
    the parent class loader level, etc.).

    At least by forking a VM you can get some reasonably bounded control
    over resource usage (or at least memory) without bringing everything
    down, since a VM is already bounded to some degree.

    Failing jobs are not exactly uncommon, and running things in a
    sandboxed environment with less risk for the tracker seems like a
    perfectly reasonable choice. So yeah, VM pooling certainly makes
    perfect sense for it
    I am still not convinced - sorry

    It's a bit like wanting to run JSPs in a separate JVM because
    they might take down the servlet container.
    It is a bit too extreme in granularity. I think it is more like
    running n different webapps within the same VM or not. So if one webapp
    is a resource hog, separating it would not harm the n-1 other
    applications, and you would either create another server instance or
    move it away to another node.

    I know of environments with a large number of nodes (not related to
    Hadoop) where they also reboot a set of nodes daily to ensure that all
    machines are really in working condition (it's usually when a machine
    reboots due to failure or whatever that someone has to rush to it
    because some service forgot to be registered, or things like that, so
    doing this periodic check gives people a better idea of their response
    time to failure). That depends on operational procedures, for sure.

    I don't think it should be designed in the spirit that everything is
    perfect in a perfect world, because we know it is not like that. So
    there will be a compromise between safety and performance, and having
    something reasonably tolerant to failure is also a performance advantage.

    Doing simple things in a task, like a deleteOnExit, is enough to leak a
    few KBs on some VMs each time, and they stay there until the VM dies
    (fixed in 1.5.0_10 if I remember correctly). Figuring out things like
    that is likely to take a severe amount of time, considering it is an
    internal leak that will not show up in your favorite Java profiler either.

    The bottom line is that even if you're 100% sure of your own code, which
    is quite unlikely (at least as far as I'm concerned), you don't know
    third-party code. So without being totally paranoid, this is something
    that cannot be ignored.

    -- stephane
  • Sylvain Wallez at Mar 20, 2007 at 4:01 pm

    Stephane Bailliez wrote:
    Torsten Curdt wrote:
    Being a complete idiot when it comes to distributed computing, I would
    say it is easy to blow up a JVM when doing such distributed jobs (be
    it because of an OOM or anything else).
    Then restrict what people can do - at least Google went that route.
    I don't know what Google did on the specifics :)
    They came up with their own language for mapreduce jobs:
    http://labs.google.com/papers/sawzall.html
    If you want to do that with Java, you have to restrict memory usage, CPU
    usage and descriptor access within each in-VM instance. That's a
    considerable amount of work that likely implies writing a specific agent
    for the VM (or rather an agent for a specific VM, because it's pretty
    unlikely that you will get the same results across VMs), assuming that
    can then really be done at the class loader level for each task (which
    seems insanely complex to me if you have to consider allocation done at
    the parent class loader level, etc.).

    At least by forking a VM you can get some reasonably bounded control
    over resource usage (or at least memory) without bringing everything
    down, since a VM is already bounded to some degree.

    Failing jobs are not exactly uncommon, and running things in a
    sandboxed environment with less risk for the tracker seems like a
    perfectly reasonable choice. So yeah, VM pooling certainly makes
    perfect sense for it
    I am still not convinced - sorry

    It's a bit like wanting to run JSPs in a separate JVM because
    they might take down the servlet container.
    It is a bit too extreme in granularity. I think it is more like
    running n different webapps within the same VM or not. So if one
    webapp is a resource hog, separating it would not harm the n-1 other
    applications, and you would either create another server instance or
    move it away to another node.

    I know of environments with a large number of nodes (not related to
    Hadoop) where they also reboot a set of nodes daily to ensure that all
    machines are really in working condition (it's usually when a machine
    reboots due to failure or whatever that someone has to rush to it
    because some service forgot to be registered, or things like that,
    so doing this periodic check gives people a better idea of their
    response time to failure). That depends on operational procedures, for
    sure.
    This could be another implementation of the TaskTracker: a single JVM that
    forks a "replacement JVM" after either a given time or a given number of
    tasks executed. This can avoid JVM fork overhead while also avoiding
    memory leak problems.

    The replacement JVM could even be pre-forked and monitor the active one,
    taking over if it no longer responds (and eventually killing it).

    Sylvain

    --
    Sylvain Wallez - http://bluxte.net
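    A rough sketch of the recycling idea, reusing the hypothetical TaskWorker
    process from the pool sketch earlier and retiring it after a fixed number
    of tasks; the threshold and class names are made up:

    import java.io.IOException;

    // Hypothetical: run tasks through one warm worker JVM and replace it
    // after maxTasks executions, bounding the impact of slow memory leaks.
    public class RecyclingWorker {
      private final int maxTasks;
      private Process worker;
      private int tasksRun;

      public RecyclingWorker(int maxTasks) throws IOException {
        this.maxTasks = maxTasks;
        this.worker = start();
      }

      private Process start() throws IOException {
        return new ProcessBuilder("java", "TaskWorker").start();
      }

      /** Hands one task description to the worker, recycling it when due. */
      public void submit(String taskDescription) throws IOException {
        if (tasksRun >= maxTasks) {
          worker.destroy();   // retire the old JVM
          worker = start();   // a pre-forked standby could take over here
          tasksRun = 0;
        }
        worker.getOutputStream().write((taskDescription + "\n").getBytes());
        worker.getOutputStream().flush();
        tasksRun++;
      }
    }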
  • Owen O'Malley at Mar 20, 2007 at 4:05 am

    On Mar 19, 2007, at 10:51 AM, Philippe Gassmann wrote:

    Doug Cutting wrote:
    A simpler approach might be to develop an InputFormat that includes
    multiple files per split.
    Yes, but the issue remains present if you have to deal with a high
    number of map tasks to distribute the load across many machines.
    Launching a JVM is costly; let's say it costs 1 second (I'm optimistic).
    If you have to run 2000 maps, that is 2000 seconds lost launching
    JVMs...
    For task granularity, the most that makes sense is roughly 10-50
    tasks per node. Given that a node runs at least 2 tasks at once and
    JVM startup costs about a second, that maps to 5-25 seconds of
    wall-clock time per node. It is noticeable, but shouldn't be the
    dominant factor.
    I already have a working patch against the 0.10.1 release of Hadoop that
    launches tasks inside the TaskTracker JVM if a specific parameter is set
    in the JobConf of the launched job (for jobs we trust ;) ).
    Another possible direction would be to have the Task JVM ask for
    another Task before exiting. I believe that Ben Reed experimented
    with that and the changes were not too extensive. For security, you
    would want to limit the JVM reuse to tasks within the same job.

    As a side note, we've already seen cases of client code that killed
    the task trackers. So it is hardly an abstract concern. *smile* (The
    client code managed to send kill signals to the entire process group,
    which included the task tracker. It was hard to debug and I'm not
    very interested in making it easier for client code to take out the
    servers.)

    -- Owen
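    A minimal sketch of the reuse idea above: the child JVM asks its
    TaskTracker for another task of the same job before exiting. The
    interface and method names are hypothetical, not the real umbilical
    protocol:

    // Hypothetical protocol between the child JVM and its TaskTracker.
    interface TaskSource {
      /** Returns the next queued task of the given job, or null if none. */
      Runnable pollNextTask(String jobId);
    }

    class ChildMain {
      /** Child JVM main loop: keep running tasks of one job, then exit. */
      static void runUntilIdle(TaskSource tracker, String jobId) {
        Runnable task;
        // For security, only tasks of the same job are reused in this JVM.
        while ((task = tracker.pollNextTask(jobId)) != null) {
          task.run();
        }
        // No more tasks for this job: exit so the tracker reclaims the slot.
      }
    }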

Discussion Overview
group: common-dev
categories: hadoop
posted: Mar 19, '07 at 2:47p
active: Mar 20, '07 at 4:01p
posts: 12
users: 7
website: hadoop.apache.org...
irc: #hadoop
