Benjamin Reed wrote:
One of the things that bothers me about the JobTracker is that it is
running user code when it creates the FileSplits. In the long term this
puts the JobTracker JVM at risk due to errors in the user code.
JVMs are supposed to be able to do this kind of thing securely. Still,
we don't currently make much use of that, and the JVM's security is
limited, so it is a valid concern.
Note that, while we do avoid running user code in tasktrackers (mapping,
sorting and reducing are done in a subprocess), those subprocesses still
run under a system user id. So security issues are to some degree unavoidable.
But in terms of inadvertent denial of service, running user code in the
job tracker, a single point of failure, does make the system more fragile.
The JobTracker uses the InputFormat to create a set of tasks that it
then schedules. The task creation does not need to happen at the
JobTracker. If we allowed the clients to create the set of tasks, the
JobTracker would not need to load and run any user generated code. It
would also remove some of the processing load from the JobTracker. On
the downside it does greatly increase the amount of information sent to
the JobTracker when a job is submitted.
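As a rough illustration of what client-side task creation could look like, here is a minimal sketch. SplitSketch and ClientSplitter are hypothetical names invented for this example, not existing Hadoop classes, and the fixed-size byte-range policy is only an assumption about what an InputFormat might do.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a FileSplit: a byte range of one input file.
final class SplitSketch {
    final String path;
    final long start;
    final long length;

    SplitSketch(String path, long start, long length) {
        this.path = path;
        this.start = start;
        this.length = length;
    }
}

// Hypothetical client-side helper. It runs in the submitting client's JVM,
// so an error in this code cannot take down the JobTracker.
final class ClientSplitter {
    // Cut one file of the given length into ranges of at most splitSize bytes.
    static List<SplitSketch> computeSplits(String path, long fileLength, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<SplitSketch> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            long length = Math.min(splitSize, fileLength - start);
            splits.add(new SplitSketch(path, start, length));
        }
        return splits;
    }
}
```

The client would then ship the resulting list along with the job file, which is where the extra submission payload comes from.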
Right, so JobSubmissionProtocol.submitJob(String jobFile) could be
altered to be submitJob(String jobFile, Split). The RPC system can
handle reasonably large values like this, so I don't think that would be
a problem. But the memory impact on the JobTracker could become
significant, since the splits for queued jobs would now be around. This
could be mitigated by writing the splits to a temporary file.
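A minimal sketch of the temporary-file mitigation, assuming a split is just a path plus a byte range. QueuedSplit and SplitSpill are hypothetical names used only for illustration, and the on-disk layout (a count followed by one record per split) is an assumption, not an existing Hadoop format.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical split record received from the client at submission time.
final class QueuedSplit {
    final String path;
    final long start;
    final long length;

    QueuedSplit(String path, long start, long length) {
        this.path = path;
        this.start = start;
        this.length = length;
    }
}

// Hypothetical helper: keep splits for queued jobs on disk, not in memory.
final class SplitSpill {
    // Write the splits to a new temporary file and return its location.
    static Path write(List<QueuedSplit> splits) throws IOException {
        Path file = Files.createTempFile("job-splits-", ".bin");
        try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(file))) {
            out.writeInt(splits.size());
            for (QueuedSplit s : splits) {
                out.writeUTF(s.path);
                out.writeLong(s.start);
                out.writeLong(s.length);
            }
        }
        return file;
    }

    // Read the splits back when the job is about to be scheduled.
    static List<QueuedSplit> read(Path file) throws IOException {
        try (DataInputStream in = new DataInputStream(Files.newInputStream(file))) {
            int count = in.readInt();
            List<QueuedSplit> splits = new ArrayList<>(count);
            for (int i = 0; i < count; i++) {
                String path = in.readUTF();
                long start = in.readLong();
                long length = in.readLong();
                splits.add(new QueuedSplit(path, start, length));
            }
            return splits;
        }
    }
}
```

With something like this, the JobTracker would hold only a file path per queued job and load the splits when the job starts running.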
The semantics would be subtly different: if you queue a job now, the
file listing is done just before the job is executed, not when it's
submitted. With this change the listing would be done at submission
time. But programs shouldn't rely on that, so I don't think this
is a big worry.
Overall, I don't see any major problems with this. It won't simplify
things much. We can remove the code which computes splits in a separate
thread, but we'd have to add code to store splits to temporary files, so
code size is a wash. But it would remove a potential reliability problem.