Thanks John,
I am confused again by the result of my test case. Could you please take a look?
The relevant code is:
public static class IntSumReducer
        extends Reducer<IntPair,NullWritable,IntPair,NullWritable> {
    public void reduce(IntPair key, Iterable<NullWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
        int count=0;
        for (NullWritable iw:values) {
            count++;
            System.out.print(key.getFirst());
            System.out.print(" : ");
            System.out.println(key.getSecond());
        }
        System.out.println("number of records for this group "+Integer.toString(count));
        System.out.println("-----------------biggest key is--------------------------");
        System.out.print(key.getFirst());
        System.out.print(" ----- ");
        System.out.println(key.getSecond());
        context.write(key, NullWritable.get());
    }
}
I am using the new API (the release is from Cloudera). The output shows that each call to the reduce function processes 100 records. Since reduce is defined as
reduce(IntPair key, Iterable<NullWritable> values, Context context),
I expected the key to stay fixed during a single call. The strange thing is that the key is different on each iteration over Iterable<NullWritable> values! Following your explanation, the same pair (0 : 97) should be printed 100 times, but the actual output runs 0 : 97, 0 : 97, 0 : 96, ... 0 : 0, as shown below:
0 : 97
0 : 97
0 : 96
0 : 96
0 : 94
0 : 93
0 : 93
0 : 91
0 : 90
0 : 89
0 : 86
0 : 85
.... deleted to save space
0 : 2
0 : 1
0 : 1
0 : 0
0 : 0
number of records for this group 100
-----------------biggest key is--------------------------
0 ----- 0
4 : 99
4 : 99
4 : 98
4 : 96
4 : 95
4 : 94
4 : 93
4 : 92
4 : 91
4 : 91
4 : 90
At 2011-08-03 20:02:34, "John Armstrong" wrote:
On Wed, 3 Aug 2011 10:35:51 +0800 (CST), "Daniel,Wu" wrote:
So the key of a group is determined by the first record to arrive in the
group. If we have 3 records in a group
1: (1900,35)
2: (1900,34)
3: (1900,33)
and (1900,35) comes in as the first row, then the resulting key will be
(1900,35). When the second row (1900,34) comes in, it won't impact the
key of the group, meaning it will not overwrite the key (1900,35) with
(1900,34). Correct?
Effectively, yes. Remember that on the inside it's using the comparator
something like this:
(1900, 35).. do I have that key already? [searches collection of keys
with, say, a BST] no! I'll add it here.
(1900,34).. do I have that key already? [searches again, now getting a
result of 0 when comparing to (1900,35)] yes! [it's not the same key, but
according to the GroupComparator it is!] so I'll add its value to the key's
iterable of values.
etc.
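
The lookup described above can be sketched in plain Java. This is only an illustration of that mental model, not Hadoop's actual internals: Pair, GROUP_BY_FIRST and group are hypothetical names standing in for IntPair and the GroupComparator, and a TreeMap plays the role of the "collection of keys with, say, a BST". Because the comparator looks only at the first field, (1900,34) compares as 0 against (1900,35), so its value is appended to the existing entry and the first key stays in place.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class GroupingSketch {
    /** Hypothetical stand-in for IntPair; not the Hadoop class. */
    public static final class Pair {
        public final int first;
        public final int second;

        public Pair(int first, int second) {
            this.first = first;
            this.second = second;
        }

        @Override
        public String toString() {
            return "(" + first + "," + second + ")";
        }
    }

    /** Assumed grouping rule: two keys are "the same" if their first fields match. */
    public static final Comparator<Pair> GROUP_BY_FIRST =
            (a, b) -> Integer.compare(a.first, b.first);

    /** Collects the second field of every record under the first key seen for its group. */
    public static TreeMap<Pair, List<Integer>> group(List<Pair> records) {
        TreeMap<Pair, List<Integer>> groups = new TreeMap<>(GROUP_BY_FIRST);
        for (Pair p : records) {
            // "Do I have that key already?" If the comparator returns 0 for an
            // existing key, that entry (and its original key object) is reused.
            groups.computeIfAbsent(p, k -> new ArrayList<>()).add(p.second);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Pair> records = Arrays.asList(
                new Pair(1900, 35), new Pair(1900, 34), new Pair(1900, 33));
        for (Map.Entry<Pair, List<Integer>> e : group(records).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
        // prints: (1900,35) -> [35, 34, 33]
    }
}
```

In this sketch the group key stays (1900,35) for the whole group, which is the behaviour the explanation above predicts. It says nothing about whether a real reducer reuses or updates the key object while iterating over the values.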