Inline. (Answer is specific to Hadoop 1.x since you asked for that
alone, but certain things may vary for Hadoop 2.x).
On Sun, Jul 8, 2012 at 7:07 AM, Grandl Robert wrote:
I have some questions related to basic functionality in Hadoop.
1. When a Mapper process the intermediate output data, how it knows how many
partitions to do(how many reducers will be) and how much data to go in each
partition for each reducer ?
The number of reducers is non-dynamic and is user-specified, and is
set in the job configuration. Hence the Partitioner knows about the
value it needs to use for its numPartitions (== numReduces for the
For this one in 1.x code, look at MapTask.java, in the constructors of
internal classes OldOutputCollector (Stable API) and
NewOutputCollector (New API).
The data estimated to be going into a partition, for limit/scheduling
checks, is currently a naive computation, done by summing upon the
estimate output sizes of each map. See
ResourceEstimator#getEstimatedReduceInputSize for the overall
estimation across maps, and see Task#calculateOutputSize for the
per-map estimation code.
2. A JobTracker when assigns a task to a reducer, it will also specify the
locations of intermediate output data where it should retrieve it right ?
But how a reducer will know from each remote location with intermediate
output what portion it has to retrieve only ?
The JT does not send in the information of locations when a reduce is
scheduled. When the reducers begin their shuffle phase, they query the
TaskTracker to get the map completion events, via
TaskTracker#getMapCompletionEvents protocol call. The TaskTracker by
itself calls the JobTracker#getTaskCompletionEvents protocol call to
get this info underneath. The returned structure carries the host that
has completed the map successfully, which the Reduce's copier relies
on to fetch the data from the right host's TT.
The reduce merely asks the data assigned for it for the specific
completed maps at each TT. Note that a reduce task ID is also its
partition ID, so it merely has to ask the data for its own task ID #
and the TT serves, over HTTP, the right parts of the intermediate data
Feel free to ping back if you need some more clarification! :)