My organization is having a problem where a single large Hadoop job is
starving out other tasks.
We're running Hadoop on a cluster with N nodes, using the Fair Scheduler.
We're not using any queues. We have a very large job that takes weeks to
run. This job has M map tasks where M >> N. Each map task is long running
(maybe 1 to 4 hours). We're running the large job at "VERY LOW" priority,
but it's still starving everybody else out. Any other job you submit might
take half an hour to an hour to be allocated a mapper slot and start.
For the time being we're stuck with the M >> N and multi-hour mapper runtime
constraints, but we can't tell everyone to not use the cluster for the next
few weeks while this big job completes. Are there any other scheduling
mechanisms we can use to keep other jobs moving in the meantime? I've been reading about fair
scheduler queues, but I'm not sure what the strategy for employing them
should be. (Do I create a not-the-large-job queue, which is guaranteed
slots?) Or are we just stuck with this problem because the long mapper times
mess up the scheduling granularity?
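
To spell out what I mean by scheduling granularity, here is a toy model. It rests on my own assumption that, without preemption, a slot is only handed to another job when the map task occupying it finishes. The function name and the remaining-time numbers are made up for illustration and are not measurements from our cluster.

```python
def wait_for_slots(remaining_minutes, slots_needed):
    """Minutes a newly submitted job waits until `slots_needed` map slots are free.

    Assumes every slot is currently held by the large job, no running task is
    ever killed, and each freed slot goes straight to the new job.
    `remaining_minutes` holds the time left on each running map task.
    """
    if not 1 <= slots_needed <= len(remaining_minutes):
        raise ValueError("slots_needed must be between 1 and the number of slots")
    # The k-th slot becomes free when the k-th shortest remaining task ends.
    return sorted(remaining_minutes)[slots_needed - 1]


# Hypothetical 8-slot cluster, all slots busy with multi-hour mappers.
remaining = [35, 50, 75, 90, 120, 150, 200, 230]

first_slot_wait = wait_for_slots(remaining, 1)   # wait for a single mapper slot
four_slot_wait = wait_for_slots(remaining, 4)    # wait for four mapper slots
print(first_slot_wait, four_slot_wait)
```

If that model is right, priority alone can't help: the small job wins every freed slot, but nothing frees a slot sooner than the shortest remaining mapper.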
I'm looking for advice on solutions to try and/or pointers to documentation
to read. So far I've been working from the Fair Scheduler
Is there more extensive documentation elsewhere, or some case studies I