I wonder if there is a way to control how maps are assigned to splits
in order to balance the load across the cluster.

Here is a simplified example. I have two types of inputs: "long" and
"short". Each input is in a different file and will be processed by a
single map task. Suppose the "long" inputs take 10s to process while
the "short" inputs take 3s to process. I have two "long" inputs and
two "short" inputs. My cluster has 2 nodes and each node can execute
only one map task at a time. A possible schedule of the tasks could be
the following:

Node 1: "long map", "short map" -> 10s + 3s = 13s
Node 2: "long map", "short map" -> 10s + 3s = 13s

So, my job will be done in 13s. Another possible schedule is:

Node 1: "long map" -> 10s
Node 2: "short map", "short map", "long map" -> 3s + 3s + 10s = 16s

And my job will be done in 16s. Clearly, the first schedule is better.
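
To make the arithmetic above easy to check, here is a minimal sketch of the two schedules. It assumes a simplified model in which each free node just takes the next task in a given order; this is not how the Hadoop scheduler is actually implemented, and the function name makespan is made up for illustration.

```python
import heapq

def makespan(durations, num_nodes):
    """Finish time when each free node takes the next task in the given order.

    Simplified model (an assumption): one task per node at a time,
    no startup overhead, no data-locality effects.
    """
    # Min-heap of the times at which each node becomes free.
    free_at = [0] * num_nodes
    heapq.heapify(free_at)
    for d in durations:
        start = heapq.heappop(free_at)      # node that frees up first
        heapq.heappush(free_at, start + d)  # it is busy until start + d
    return max(free_at)

LONG, SHORT = 10, 3
print(makespan([LONG, LONG, SHORT, SHORT], 2))  # first schedule: 13
print(makespan([LONG, SHORT, SHORT, LONG], 2))  # second schedule: 16
```

The only difference between the two calls is the order in which the tasks are handed out.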

Is there a way to control how the schedule is built? If I can control
which inputs are processed first, I could schedule the "long" inputs
to be processed first and so they will be balanced across nodes and I
will end up with something similar to the first schedule.
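
The "long inputs first" idea can be sketched as follows: sort the tasks by decreasing duration, then give each one to whichever node frees up first. This is only an illustration under the same simplified model (one task per node at a time, durations known in advance); the function and task names are hypothetical, and nothing here is a Hadoop API.

```python
import heapq

def longest_first_schedule(tasks, num_nodes):
    """Assign (name, seconds) tasks to nodes, longest tasks first.

    Returns (plan, finish), where plan[i] lists the task names run on
    node i in order and finish is the time the last node is done.
    """
    ordered = sorted(tasks, key=lambda t: t[1], reverse=True)
    # Heap entries: (time the node becomes free, node index).
    nodes = [(0, i) for i in range(num_nodes)]
    heapq.heapify(nodes)
    plan = [[] for _ in range(num_nodes)]
    for name, seconds in ordered:
        free_at, i = heapq.heappop(nodes)   # node that frees up first
        plan[i].append(name)
        heapq.heappush(nodes, (free_at + seconds, i))
    finish = max(t for t, _ in nodes)
    return plan, finish

# The inputs from the example, deliberately listed in a "bad" order.
tasks = [("short-1", 3), ("long-1", 10), ("short-2", 3), ("long-2", 10)]
plan, finish = longest_first_schedule(tasks, 2)
print(plan)    # [['long-1', 'short-1'], ['long-2', 'short-2']]
print(finish)  # 13
```

With the inputs from my example, this reproduces the first schedule regardless of the order in which the inputs are listed.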

I could configure the job so that a "long" input gets processed by
more than one map, and so end up balancing the work, but I noticed that
overall, this takes more time than a bad schedule with only one map
per input.


Rares Vernica

Group: common-user
Posted: Aug 27, '09 at 1:53a