FAQ
One of the things that bothers me about the JobTracker is that it is
running user code when it creates the FileSplits. In the long term this
puts the JobTracker JVM at risk due to errors in the user code.

The JobTracker uses the InputFormat to create a set of tasks that it
then schedules. The task creation does not need to happen at the
JobTracker. If we allowed the clients to create the set of tasks, the
JobTracker would not need to load and run any user-generated code. It
would also remove some of the processing load from the JobTracker. On
the downside, it greatly increases the amount of information sent to
the JobTracker when a job is submitted.
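To make the proposal concrete, here is a minimal sketch of client-side split computation; FileSplit, SplitComputer, and BlockSplitComputer are simplified illustrative stand-ins, not the actual Hadoop classes:

```java
// Illustrative sketch only: simplified stand-ins, not the real Hadoop classes.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FileSplit {
    final String path;
    final long start;
    final long length;
    FileSplit(String path, long start, long length) {
        this.path = path;
        this.start = start;
        this.length = length;
    }
}

// Stand-in for InputFormat: the client runs this, not the JobTracker.
interface SplitComputer {
    List<FileSplit> getSplits(List<String> paths, long fileLength, long blockSize);
}

class BlockSplitComputer implements SplitComputer {
    // Cut each input file into block-sized splits, entirely on the client.
    public List<FileSplit> getSplits(List<String> paths, long fileLength, long blockSize) {
        List<FileSplit> splits = new ArrayList<>();
        for (String p : paths) {
            for (long off = 0; off < fileLength; off += blockSize) {
                splits.add(new FileSplit(p, off, Math.min(blockSize, fileLength - off)));
            }
        }
        return splits;
    }
}
```

The client would run getSplits() itself and submit the resulting list, so the JobTracker never has to load the job's InputFormat.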

ben

  • Doug Cutting at Sep 28, 2006 at 6:14 pm

    Benjamin Reed wrote:
    One of the things that bothers me about the JobTracker is that it is
    running user code when it creates the FileSplits. In the long term this
    puts the JobTracker JVM at risk due to errors in the user code.
    JVMs are supposed to be able to do this kind of thing securely. Still,
    we don't currently leverage this much, and the JVM's security is
    limited, so it is a valid concern.

    Note that, while we do avoid running user code in tasktrackers (mapping,
    sorting, and reducing are done in a subprocess), those subprocesses still
    run under a system user id. So security issues are to some degree unavoidable.

    But in terms of inadvertent denial of service, running user code in the
    job tracker, a single point of failure, does make the system more fragile.
    The JobTracker uses the InputFormat to create a set of tasks that it
    then schedules. The task creation does not need to happen at the
    JobTracker. If we allowed the clients to create the set of tasks, the
    JobTracker would not need to load and run any user-generated code. It
    would also remove some of the processing load from the JobTracker. On
    the downside, it greatly increases the amount of information sent to
    the JobTracker when a job is submitted.
    Right, so JobSubmissionProtocol.submitJob(String jobFile) could be
    altered to submitJob(String jobFile, Split[] splits). The RPC system can
    handle reasonably large values like this, so I don't think that would be
    a problem. But the memory impact on the JobTracker could become
    significant, since the splits for queued jobs would now be around. This
    could be mitigated by writing the splits to a temporary file.
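A minimal sketch of that mitigation, with an illustrative SplitSpiller class (not real Hadoop code) standing in for the JobTracker's bookkeeping:

```java
// Illustrative sketch: persist submitted splits to a temp file at submit time
// so queued jobs don't hold them in JobTracker memory. SplitSpiller is an
// invented name, not part of Hadoop.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class SplitSpiller {
    // Called at submit time: persist the splits, keep only the path in memory.
    static Path spill(List<String> serializedSplits) throws IOException {
        Path tmp = Files.createTempFile("job-splits-", ".bin");
        Files.write(tmp, serializedSplits);
        return tmp;
    }

    // Called when the job is finally scheduled: read the splits back.
    static List<String> load(Path tmp) throws IOException {
        return Files.readAllLines(tmp);
    }
}
```

Only the temporary path stays resident while the job waits in the queue; the splits are read back when the job is actually scheduled.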

    The semantics would be subtly different: if you queue a job now, the
    file listing is done just before the job is executed, not when it's
    submitted. But programs shouldn't rely on that, so I don't think this
    is a big worry.

    Overall, I don't see any major problems with this. It won't simplify
    things much: we can remove the code which computes splits in a separate
    thread, but we'd have to add code to store splits to temporary files, so
    code size is a wash. And it would remove a potential reliability problem.

    Doug
  • Bryan A. P. Pendleton at Sep 28, 2006 at 6:24 pm
    I'm largely at fault for the "user code running in the JobTracker" that
    exists.

    I support this change - but, I might reformulate it. Why not make this a
    sort of special Job? It can even be formulated roughly like this:

    input<JobDescription,FilePaths> -> map(Job,FilePath) ->
    reduce(Job,FileSplits) -> SchedulableJob

    It might even make sense to do an extra run that pre-computes cached
    locations of FileSplits, although I think that is still bottlenecked by the
    NameNode.
    --
    Bryan A. P. Pendleton
    Ph: (877) geek-1-bp
  • Benjamin Reed at Sep 29, 2006 at 7:23 am
    Please correct me if I'm reading the code incorrectly, but it seems
    like submitJob puts the submitted job on the jobInitQueue, which is
    immediately dequeued by the JobInitThread, and then initTasks() will get
    the file splits and create Tasks. Thus, it doesn't seem like there is
    any difference in memory footprint.

    ben

  • Owen O'Malley at Sep 29, 2006 at 3:49 pm

    On Sep 29, 2006, at 12:20 AM, Benjamin Reed wrote:

    Please correct me if I'm reading the code incorrectly, but it seems
    like submitJob puts the submitted job on the jobInitQueue, which is
    immediately dequeued by the JobInitThread, and then initTasks() will get
    the file splits and create Tasks. Thus, it doesn't seem like there is
    any difference in memory footprint.
    Agreed, it won't cost more memory. In fact, it will be less, because
    we won't have the init task thread running, creating InputFormats, and
    running user code. Of course, once we allow user-defined InputSplits
    we will be back in exactly the same boat of running user code on the
    JobTracker, unless we also ship over the preferred hosts for each
    InputFormat too.

    -- Owen
  • Benjamin Reed at Sep 29, 2006 at 4:15 pm
    No, even with user-defined Splits we don't need to run user code in the
    JobTracker if we make Split a Writable class that has the hosts array.

    Split will write the hosts first, so in the JobTracker, when you get the
    byte array representing the Split, any fields from the sub class will
    follow the Split serialized bytes. The JobTracker can skip the Type in
    the bytes representing the serialized Split and then deserialize just a
    Split (ignoring the rest). You can make this process robust by putting a
    fingerprint at the beginning and end of the serialized part of Split, so
    that you can detect user defined Splits that change the serialization
    order. (This is another example of why Writable is cooler than
    Serializable. It would be really hard to just deserialize a super class
    from a serialized sub class using Java serialization.)

    You would ship the full byte array to the tasktrackers so that the
    InputFormats running in the Child processes can deserialize the full type.
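A rough sketch of this layout, using plain DataInput/DataOutput in place of Hadoop's Writable plumbing; the class names and the fingerprint value are illustrative:

```java
// Illustrative sketch: the base Split writes its hosts (bracketed by
// fingerprint markers) before any subclass fields, so the JobTracker can
// deserialize just the base portion and ignore the rest.
import java.io.*;

class Split {
    static final int FINGERPRINT = 0x5EED5EED; // marks start and end of base fields
    String[] hosts = new String[0];

    void write(DataOutput out) throws IOException {
        out.writeInt(FINGERPRINT);
        out.writeInt(hosts.length);
        for (String h : hosts) out.writeUTF(h);
        out.writeInt(FINGERPRINT);
        writeRest(out);                 // subclass fields follow the base bytes
    }

    void readBase(DataInput in) throws IOException {
        if (in.readInt() != FINGERPRINT) throw new IOException("bad split header");
        hosts = new String[in.readInt()];
        for (int i = 0; i < hosts.length; i++) hosts[i] = in.readUTF();
        if (in.readInt() != FINGERPRINT)
            throw new IOException("subclass reordered base fields");
        // JobTracker stops here; any remaining bytes belong to the subclass.
    }

    void writeRest(DataOutput out) throws IOException {} // overridden by subclasses
}

// A user-defined split: its extra field lands after the base bytes.
class UserSplit extends Split {
    String customQuery = "";
    void writeRest(DataOutput out) throws IOException { out.writeUTF(customQuery); }
}
```

The JobTracker calls readBase() and stops at the trailing fingerprint; only the Child process, which has the user's classes loaded, deserializes the subclass fields that follow.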

    ben

  • Doug Cutting at Sep 29, 2006 at 4:28 pm

    Benjamin Reed wrote:
    Split will write the hosts first, so in the JobTracker, when you get the
    byte array representing the Split, any fields from the sub class will
    follow the Split serialized bytes. The JobTracker can skip the Type in
    the bytes representing the serialized Split and then deserialize just a
    Split (ignoring the rest).
    That could work, but it would be hard to pass these directly over either
    RPC or via a SequenceFile, no? We'd have to write a custom container
    for the array of Splits. The container could then be passed as a whole
    over RPC or placed in a SequenceFile. That could become a little awkward.

    Doug
  • Doug Cutting at Sep 29, 2006 at 4:22 pm

    Owen O'Malley wrote:
    Of course, once we allow user-defined InputSplits we
    will be back in exactly the same boat of running user-code on the
    JobTracker, unless we also ship over the preferred hosts for each
    InputFormat too.
    So, to entirely avoid user code in the job tracker we'd need a final
    class that represents each task to be created, a SplitLocations. These
    would correspond 1-1 to splits, but would only contain the list of
    preferred hosts. A way to implement this might be to write two parallel
    files in DFS, one with the SplitLocations, and one with the Splits.
    Then the first is passed to the job tracker with the name of the second
    file. Then only task child processes would open the split file, seeking
    to the appropriate index. We could use ArrayFile for these, and highly
    replicate them, especially their indexes.
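A toy sketch of the parallel layout, with in-memory lists standing in for the two DFS files (ArrayFile and the real class layout are not modeled here):

```java
// Illustrative sketch: one SplitLocations record of preferred hosts per split,
// stored side by side with the opaque split bytes. The JobTracker reads only
// the locations; task children read the split bytes at their index.
import java.util.ArrayList;
import java.util.List;

class SplitLocations {
    final String[] hosts;
    SplitLocations(String... hosts) { this.hosts = hosts; }
}

class SplitStore {
    final List<SplitLocations> locations = new ArrayList<>(); // JobTracker's file
    final List<byte[]> splitBytes = new ArrayList<>();        // task children's file

    // Parallel files: entry i in each list describes the same split.
    void add(SplitLocations loc, byte[] opaqueSplit) {
        locations.add(loc);
        splitBytes.add(opaqueSplit);
    }

    byte[] splitForTask(int index) { return splitBytes.get(index); }
}
```

Entry i of each list describes the same split, so the JobTracker touches only host lists while the split bytes stay opaque until a task child seeks to its index.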

    Doug

Discussion Overview
group: common-dev
categories: hadoop
posted: Sep 28, '06 at 8:04a
active: Sep 29, '06 at 4:28p
posts: 8
users: 4
website: hadoop.apache.org...
irc: #hadoop
