Explicitly controlling your splits will be very challenging. Taking the case
where you have expensive (X) and cheap (C) objects to process, you may have
a file where the records are lined up X C X C X C X X X X X C C C. In this
case, you'll need to scan through the whole file and build splits such that
the lengthy run of expensive objects is broken up into separate splits, but
the run of cheap objects is consolidated. I doubt that you can do
this without scanning through the data (which often accounts for the
bulk of the time in a MapReduce program).
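To make the scan concrete, here is a minimal sketch of that kind of cost-aware pass. It assumes you can attach a numeric cost estimate to each record; the class name, the `costs` array, and the `targetCost` budget are all hypothetical and not part of Hadoop. It only shows the grouping logic, not how to turn the result into real InputSplits.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: groups consecutive records into
// splits whose estimated total cost stays near a target budget.
class CostAwareSplitter {

    // Returns the exclusive end index of each split. A split is closed as
    // soon as its accumulated cost reaches targetCost.
    static List<Integer> splitEnds(int[] costs, int targetCost) {
        List<Integer> ends = new ArrayList<>();
        int start = 0;    // index of the first record in the open split
        int running = 0;  // accumulated cost of the open split
        for (int i = 0; i < costs.length; i++) {
            running += costs[i];
            if (running >= targetCost) {
                ends.add(i + 1);
                start = i + 1;
                running = 0;
            }
        }
        // Close the final, possibly under-filled split.
        if (start < costs.length) {
            ends.add(costs.length);
        }
        return ends;
    }

    public static void main(String[] args) {
        // X C X C X C X X X X X C C C, with assumed costs X = 10 and C = 1.
        int[] costs = {10, 1, 10, 1, 10, 1, 10, 10, 10, 10, 10, 1, 1, 1};
        System.out.println(splitEnds(costs, 10));
    }
}
```

With these assumed costs, each expensive record in the long run ends up in its own split, while the trailing cheap records are consolidated into one. Note that the whole record list still has to be read once to build the boundaries, which is the cost described above.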
But how much data are you using? I would imagine that if you're operating at
the scale where Hadoop makes sense, then the high- and low-cost objects will
-- on average -- balance out and tasks will be roughly evenly proportioned.
In general, I would just dodge the problem by making sure your splits are
relatively small compared to the size of your input data. If you have 5
million objects to process, then make each split roughly equal to, say,
20,000 of them. Then even if some splits take a long time to process and
others take a short time, one CPU may get through a dozen cheap splits in
the same time that one unlucky JVM takes to process a single very expensive
split. You haven't had to manually balance anything, and you still get
to keep all your CPUs busy.
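A toy simulation of that argument, as a rough sketch only: it assumes each worker simply pulls the next split as soon as it is free, which is a simplification of how tasks actually get scheduled. The class, the method names, and the record costs are made up for illustration.

```java
import java.util.PriorityQueue;

// Hypothetical simulation, not Hadoop code: compares finish times for a few
// large splits versus many small ones under pull-based scheduling.
class SplitSchedulingSketch {

    // Each worker takes the next split as soon as it is free. Returns the
    // time at which the last worker finishes.
    static long makespan(long[] splitCosts, int workers) {
        PriorityQueue<Long> freeAt = new PriorityQueue<>();
        for (int w = 0; w < workers; w++) {
            freeAt.add(0L);
        }
        long finish = 0;
        for (long cost : splitCosts) {
            long done = freeAt.poll() + cost; // earliest-free worker runs it
            finish = Math.max(finish, done);
            freeAt.add(done);
        }
        return finish;
    }

    // Sums per-record costs into consecutive splits of recordsPerSplit records.
    static long[] toSplits(long[] recordCosts, int recordsPerSplit) {
        int n = (recordCosts.length + recordsPerSplit - 1) / recordsPerSplit;
        long[] splits = new long[n];
        for (int i = 0; i < recordCosts.length; i++) {
            splits[i / recordsPerSplit] += recordCosts[i];
        }
        return splits;
    }

    public static void main(String[] args) {
        // Eight expensive records (cost 10) followed by eight cheap ones (cost 1).
        long[] records = {10, 10, 10, 10, 10, 10, 10, 10, 1, 1, 1, 1, 1, 1, 1, 1};
        System.out.println("2 large splits: " + makespan(toSplits(records, 8), 2));
        System.out.println("8 small splits: " + makespan(toSplits(records, 2), 2));
    }
}
```

In this made-up case, two large splits leave one worker idle while the other grinds through all the expensive records, whereas eight small splits spread the same work almost evenly. With the figures above, 5 million objects at 20,000 per split works out to 250 splits.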
On Mon, Apr 20, 2009 at 11:25 PM, Barnet Wagman wrote:
Thanks Aaron, that really helps. I probably do need to control the number
of splits. My input 'data' consists of Java objects and their size (in
bytes) doesn't necessarily reflect the amount of time needed for each map
operation. I need to ensure that I have enough map tasks so that all CPUs
are utilized and the job gets done in a reasonable amount of time.
(Currently I'm creating multiple input files and making them unsplittable,
but subclassing SequenceFileInputFormat to explicitly control the number of
splits sounds like a better approach).
Aaron Kimball wrote:
Yes, there can be more than one InputSplit per SequenceFile. The file will
be split more or less along 64 MB boundaries. (The actual "edges" of the
splits will be adjusted to hit the next block of key-value pairs, so they
might be a few kilobytes off.)
The SequenceFileInputFormat regards mapred.map.tasks
as a hint, not a set-in-stone metric. (The number of reduce tasks, though,
is always 100% user-controlled.) If you need exact control over the number
of map tasks, you'll need to subclass it and modify this behavior. That
having been said -- are you sure you actually need to precisely control
this value? Or is it enough to know how many splits were created?
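To illustrate why the requested map count is only a hint, here is a small standalone model of the split-size calculation as I understand it from the old-API FileInputFormat (which SequenceFileInputFormat extends): the split size is the goal size (total bytes divided by the requested number of maps), capped at the block size and floored at the minimum split size. The class and the `approxSplitCount` helper are hypothetical, and the model ignores per-file boundaries and the small slop allowed on the last split, so check it against your Hadoop version.

```java
// Hypothetical standalone model, not Hadoop code: mirrors the
// max(minSize, min(goalSize, blockSize)) rule for choosing a split size.
class SplitSizeModel {

    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // Approximate number of splits for a single file of totalSize bytes.
    // Assumption: ignores the slop factor applied to the final split.
    static long approxSplitCount(long totalSize, int requestedMaps,
                                 long minSize, long blockSize) {
        long goalSize = totalSize / Math.max(1, requestedMaps);
        long splitSize = computeSplitSize(goalSize, minSize, blockSize);
        return (totalSize + splitSize - 1) / splitSize; // ceiling division
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long file = 1024 * mb;  // a 1 GB input file
        long block = 64 * mb;   // 64 MB blocks

        // Asking for fewer maps than blocks is not honored: block size caps it.
        System.out.println(approxSplitCount(file, 2, 1, block));
        // Asking for more maps than blocks shrinks the splits.
        System.out.println(approxSplitCount(file, 64, 1, block));
        // Raising the minimum split size is what allows fewer, larger splits.
        System.out.println(approxSplitCount(file, 2, 512 * mb, block));
    }
}
```

Under this model, requesting 2 maps for a 1 GB file with 64 MB blocks still yields 16 splits, requesting 64 maps yields 64, and only a larger minimum split size brings the count down to 2.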
On Sun, Apr 19, 2009 at 7:23 PM, Barnet Wagman <firstname.lastname@example.org> wrote:
Suppose a SequenceFile (containing keys and values that are
is used as input. Will it be divided into InputSplits? If so, what are the
criteria used for splitting?
I'm interested in this because I need to control the number of map tasks
used, which (if I understand it correctly), is equal to the number of