I wonder if there is a way to control how maps are assigned to splits
in order to balance the load across the cluster.
Here is a simplified example. I have two types of inputs: "long" and
"short". Each input is in a different file and will be processed by a
single map task. Suppose the "long" inputs take 10s to process while
the "short" inputs take 3s to process. I have two "long" inputs and
two "short" inputs. My cluster has 2 nodes and each node can execute
only one map task at a time. One possible schedule of the tasks is:
Node 1: "long map", "short map" -> 10s + 3s = 13s
Node 2: "long map", "short map" -> 10s + 3s = 13s
So, my job will be done in 13s. Another possible schedule is:
Node 1: "long map" -> 10s
Node 2: "short map", "short map", "long map" -> 3s + 3s + 10s = 16s
In that case, my job will be done in 16s. Clearly, the first schedule is better.
Is there a way to control how the schedule is built? If I can control
which inputs are processed first, I could schedule the "long" inputs
to be processed first and so they will be balanced across nodes and I
will end up with something similar to the first schedule.
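To illustrate what I mean, here is a minimal sketch in plain Python (not Hadoop code). It assumes the scheduler simply hands the next pending task to whichever node frees up first, which may not be how the real scheduler behaves. The function name `makespan` and the hard-coded durations are only for illustration.

```python
import heapq

def makespan(durations, num_nodes):
    """Total job time when tasks are dispatched, in the given order,
    to whichever node becomes free first (one task per node at a time)."""
    # Each heap entry is the time at which a node becomes free.
    free_at = [0] * num_nodes
    heapq.heapify(free_at)
    for d in durations:
        start = heapq.heappop(free_at)       # earliest available node
        heapq.heappush(free_at, start + d)   # node is busy until start + d
    return max(free_at)

LONG, SHORT = 10, 3  # processing times in seconds, from the example above

# Order that produces the second (worse) schedule: long, short, short, long.
print(makespan([LONG, SHORT, SHORT, LONG], 2))

# Dispatching the longest inputs first produces the first (better) schedule.
tasks = sorted([LONG, SHORT, SHORT, LONG], reverse=True)
print(makespan(tasks, 2))
```

Under these assumptions, sorting the inputs by decreasing expected processing time is enough to get the balanced schedule, which is why I am asking whether the processing order can be controlled.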
I could configure the job so that a "long" input gets processed by
more than one map, and so end up balancing the work, but I noticed that
overall, this takes more time than a bad schedule with only one map