I have ensured that my mapper produces a unique key for every value it writes, and furthermore that each map() call writes only one value. Note that the value is a custom class for which I implement the Writable interface methods.
I realize that it isn't very real world to have (well, want) no combining done prior to reducing, but I'm still getting my feet wet.
When the reducer runs, I expected to see one reduce() call for every map() call, and I do. However, the value I get is the composite of the values from all the reduce() calls that came before it.
So, for example, the mapper gets data like this:
ID, Name, Type, Other stuff...
A000, Cream, Group, ...
B231, Led Zeppelin, Group, ...
A044, Liberace, Individual, ...
ID is the external key from the source data and is guaranteed to be unique.
When I map it, I create a container for the row data and output that container with all the data from that row only and use the ID field as a key.
Since the key is always unique I expected the sort/shuffle step to never coalesce any two values. So I expected my reduce() method to be called once per mapped input row, and it is.
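To make that expectation concrete, here is a minimal sketch of the grouping I have in mind, written in plain Java with no Hadoop involved. ShuffleSketch, keyOf and groupByKey are made-up names for illustration only, and a TreeMap merely stands in for the sort/shuffle step; this is not how the framework is actually implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: mimics "map by ID, then group by key".
class ShuffleSketch {
    // "Map" step: the key is the ID, i.e. the first comma-separated column.
    static String keyOf(String line) {
        return line.split(",", 2)[0].trim();
    }

    // Stand-in for sort/shuffle: collect values per key, keys kept in sorted order.
    static Map<String, List<String>> groupByKey(List<String> lines) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String line : lines) {
            groups.computeIfAbsent(keyOf(line), k -> new ArrayList<>()).add(line);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "A000, Cream, Group",
                "B231, Led Zeppelin, Group",
                "A044, Liberace, Individual");
        // With unique IDs, every group should hold exactly one value.
        for (Map.Entry<String, List<String>> e : groupByKey(lines).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().size() + " value(s)");
        }
    }
}
```

Because every ID is unique, each group ends up with a single value, which is why I expected each reduce() call to see exactly one row.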
The problem is that, as each row is processed, the reducer sees cumulative value data instead of a container holding a single row. So the 'value' parameter to reduce() always carries the information from the previous reduce() calls as well.
For example, given the data above:
1st Reducer Call:
Key = A000
(object 1) : Name = Cream, Type = Group, MBID = A000, ...
2nd Reducer Call:
Key = B231
(object 1) : Name = Led Zeppelin, Type = Group, MBID = B231, ...
(object 2) : Name = Cream, Type = Group, MBID = A000, ...
So the second reduce() call has data in it from the first reduce() call. Very strange! At a guess, I would say the reducer is reusing the value object when it reads the objects back from the mapping step, but I don't know.
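If that guess is right, the behaviour could probably be reproduced outside Hadoop with something like the minimal sketch below. It assumes that the container keeps its rows in a list, that readFields() appends to that list, and that the same instance is deserialized into repeatedly. RowContainer, ReuseDemo and the clearOnRead flag are made-up names for illustration; the real class would implement org.apache.hadoop.io.Writable, which declares the same write(DataOutput) and readFields(DataInput) methods.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the custom value class (not the real one).
class RowContainer {
    final List<String> rows = new ArrayList<>();
    private final boolean clearOnRead;

    RowContainer(boolean clearOnRead) { this.clearOnRead = clearOnRead; }

    void write(DataOutput out) throws IOException {
        out.writeInt(rows.size());
        for (String row : rows) out.writeUTF(row);
    }

    void readFields(DataInput in) throws IOException {
        if (clearOnRead) rows.clear(); // drop state left over from the previous record
        int n = in.readInt();
        for (int i = 0; i < n; i++) rows.add(in.readUTF());
    }
}

class ReuseDemo {
    // Serializes a container that holds exactly one row.
    static byte[] serialize(String row) throws IOException {
        RowContainer c = new RowContainer(true);
        c.rows.add(row);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        c.write(new DataOutputStream(bytes));
        return bytes.toByteArray();
    }

    // Reads every record into the SAME instance and records how many rows
    // that instance holds after each read.
    static List<Integer> sizesAfterEachRead(boolean clearOnRead, String... inputRows) {
        RowContainer reused = new RowContainer(clearOnRead);
        List<Integer> sizes = new ArrayList<>();
        try {
            for (String row : inputRows) {
                reused.readFields(new DataInputStream(new ByteArrayInputStream(serialize(row))));
                sizes.add(reused.rows.size());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sizes;
    }

    public static void main(String[] args) {
        String[] data = {"A000,Cream,Group", "B231,Led Zeppelin,Group", "A044,Liberace,Individual"};
        System.out.println(sizesAfterEachRead(false, data)); // prints [1, 2, 3]
        System.out.println(sizesAfterEachRead(true, data));  // prints [1, 1, 1]
    }
}
```

Without the clear, the reused instance grows by one row per record, which matches the accumulation described above. Resetting the state at the top of readFields() keeps it at one row per record.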
If anyone has any ideas, I'm open to suggestions. I'm running Hadoop 0.20.2-cdh3u0.