FAQ
On page 243:
Per my understanding, the reducer is supposed to output the first value (the maximum) for each year, but I just don't know how it works.

suppose we have the data
1901 200
1901 300
1901 400

Since grouping is done by the year, we have only one group, but we have 3 different keys, because the key is a combination of year and temperature. The reducer output should be (key, list(value)) pairs; since we have 3 keys, we should output 3 rows, but since we have only one group, we only output 1 row. So where is the conflict? What am I misunderstanding?

public static class GroupComparator extends WritableComparator {
    protected GroupComparator() {
        super(IntPair.class, true);
    }

    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        IntPair ip1 = (IntPair) w1;
        IntPair ip2 = (IntPair) w2;
        // Group solely by the year (the first field of the composite key).
        return IntPair.compare(ip1.getFirst(), ip2.getFirst());
    }
}

static class MaxTemperatureReducer extends MapReduceBase
        implements Reducer<IntPair, NullWritable, IntPair, NullWritable> {

    public void reduce(IntPair key, Iterator<NullWritable> values,
            OutputCollector<IntPair, NullWritable> output, Reporter reporter)
            throws IOException {
        // The first key of each group carries the maximum temperature for the year.
        output.collect(key, NullWritable.get());
    }
}
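
For reference, the example on that page also relies on a sort comparator that is not shown here: it orders keys by year ascending and, within a year, by temperature descending, which is what puts the maximum first in each group. A rough sketch of it, assuming the book's IntPair type, looks like this:

public static class KeyComparator extends WritableComparator {
    protected KeyComparator() {
        super(IntPair.class, true);
    }

    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        IntPair ip1 = (IntPair) w1;
        IntPair ip2 = (IntPair) w2;
        // Sort by year ascending first...
        int cmp = IntPair.compare(ip1.getFirst(), ip2.getFirst());
        if (cmp != 0) {
            return cmp;
        }
        // ...then by temperature descending, so the maximum comes first.
        return -IntPair.compare(ip1.getSecond(), ip2.getSecond());
    }
}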


  • John Armstrong at Aug 2, 2011 at 1:35 pm

    On Tue, 2 Aug 2011 21:25:47 +0800 (CST), "Daniel,Wu" wrote:
    On page 243:
    Per my understanding, the reducer is supposed to output the first value
    (the maximum) for each year, but I just don't know how it works.

    suppose we have the data
    1901 200
    1901 300
    1901 400

    Since grouping is done by the year, we have only one group, but we have 3
    different keys, because the key is a combination of year and temperature.
    The reducer output should be (key, list(value)) pairs; since we have 3 keys,
    we should output 3 rows, but since we have only one group, we only output
    1 row. So where is the conflict? What am I misunderstanding?
    Keep reading the section in the book:

    "This still isn't enough to achieve our coal, however. A partitioner
    ensures only that one reducer receives all the records for a year; it
    doesn't change the fact that the reducer groups by key within the
    partition... The final piece of the puzzle is the setting to control the
    grouping. If we group values in the reducer by the year part of the key,
    then we will see all the records for the same year in one reduce group.
    And since they are sorted by temperature in descending order, the first is
    the maximum temperature."

    That is, in that example they also change the way the reducer groups its
    inputs.
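
    For completeness, all three pieces have to be wired together in the driver.
    A minimal sketch with the old (mapred) API, assuming the FirstPartitioner,
    KeyComparator, and MaxTemperatureMapper classes from the same example, would
    be roughly:

    JobConf conf = new JobConf(MaxTemperatureUsingSecondarySort.class);
    conf.setMapperClass(MaxTemperatureMapper.class);
    conf.setReducerClass(MaxTemperatureReducer.class);
    conf.setMapOutputKeyClass(IntPair.class);
    conf.setMapOutputValueClass(NullWritable.class);
    conf.setPartitionerClass(FirstPartitioner.class);             // send each year to one reducer
    conf.setOutputKeyComparatorClass(KeyComparator.class);        // sort: year asc, temperature desc
    conf.setOutputValueGroupingComparator(GroupComparator.class); // group reduce input by year only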
  • Daniel,Wu at Aug 2, 2011 at 1:49 pm
    We usually use something like values.next() to loop over every row in a specific group, but I don't see any code that loops over the list; at least it needs to get the first row in the list, something like
    values.get().
    Or will NullWritable.get() get the first row in the group?


    static class MaxTemperatureReducer extends MapReduceBase
            implements Reducer<IntPair, NullWritable, IntPair, NullWritable> {

        public void reduce(IntPair key, Iterator<NullWritable> values,
                OutputCollector<IntPair, NullWritable> output, Reporter reporter)
                throws IOException {
            output.collect(key, NullWritable.get());
        }
    }
    "If we group values in the reducer by the year part of the key,
    then we will see all the records for the same year in one reduce group.
    And since they are sorted by temperature in descending order, the first is
    the maximum temperature."
  • John Armstrong at Aug 2, 2011 at 1:59 pm

    On Tue, 2 Aug 2011 21:49:22 +0800 (CST), "Daniel,Wu" wrote:
    We usually use something like values.next() to loop over every row in a
    specific group, but I don't see any code that loops over the list; at least
    it needs to get the first row in the list, something like values.get().
    Or will NullWritable.get() get the first row in the group?
    No; like you said before the value is now in the key.

    The grouping comparator receives (1900,35),(1900,34),(1900,34), and so on.
    Due to the line

    return -IntPair.compare(ip1.getSecond(),ip2.getSecond());

    in the KeyComparator, these are guaranteed to come in reverse order in the
    second slot. That is, if 35 is the maximum temperature then (1900,35) will
    come before ANY other (1900,t). Then as the GroupComparator does its
    thing, any time (1900,t) comes up it gets compared AND FOUND EQUAL TO
    (1900,35), and thus its (null) value is added to the (1900,35) group.

    The reducer then gets a (1900,35) key with an Iterable of null values,
    which it pretty much discards and just emits the key, which contains the
    maximum value.

    I admit, it's a pretty subtle trick, and I'm actually glad you brought it
    up since I think I may be able to use it to solve a problem I've been
    having...
  • Daniel,Wu at Aug 3, 2011 at 2:36 am
    So the key of a group is determined by the first record that arrives in the group. If we have 3 records in a group
    1: (1900,35)
    2: (1900,34)
    3: (1900,33)

    and (1900,35) comes in as the first row, then the resulting key will be (1900,35); when the second row (1900,34) comes in, it won't impact the key of the group, meaning it will not overwrite the key (1900,35) with (1900,34). Correct?
  • Daniel,Wu at Aug 3, 2011 at 3:42 am
    Or I should ask: should the input of the reducer for the group of year 1900 look like the following
    (key, value) pairs
    (1900,35), null
    (1900,34), null
    (1900,33), null


    or like
    (1900,35), null
    (1900,35), null ==> since (1900,34) is in the same group as (1900,35), it uses (1900,35) as the key.
    (1900,35), null

  • Daniel,Wu at Aug 3, 2011 at 8:07 am
    I understand now. And it looks like the job prints the min value instead of the max value in my test. In the stdout I can see the following data: 3 is the year (I faked the data myself), 99 is the max, and 0 is the min. We can see that for year 3 there are 100 records. So inside a group the key can differ, and
    context.write(key, NullWritable.get()) will write the LAST key to the output; since the temperatures are ordered descending, the last key has the min temperature.

    3 99
    ........
    3 0
    number of records for this group 100
    -----------------biggest key is--------------------------
    3 0


    public void reduce(IntPair key, Iterable<NullWritable> values,
            Context context) throws IOException, InterruptedException {
        int count = 0;
        for (NullWritable iw : values) {
            count++;
            // The key object's contents change as the values are consumed.
            System.out.print(key.getFirst());
            System.out.print(' ');
            System.out.println(key.getSecond());
        }
        System.out.println("number of records for this group " + Integer.toString(count));
        System.out.println("-----------------biggest key is--------------------------");
        // After the loop the key holds the LAST pair of the group, i.e. the minimum.
        System.out.print(key.getFirst());
        System.out.print(' ');
        System.out.println(key.getSecond());
        context.write(key, NullWritable.get());
    }
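
    If you need both the maximum and the full iteration, one option (a sketch, not from the book) is to copy the key before consuming the values, using org.apache.hadoop.io.WritableUtils.clone, since the framework reuses the key object while iterating:

    public void reduce(IntPair key, Iterable<NullWritable> values,
            Context context) throws IOException, InterruptedException {
        // Snapshot the first (maximum) key before the iterator mutates it.
        IntPair max = WritableUtils.clone(key, context.getConfiguration());
        int count = 0;
        for (NullWritable iw : values) {
            count++;
        }
        System.out.println("number of records for this group " + count);
        context.write(max, NullWritable.get());   // emits the maximum, not the last key
    }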



  • John Armstrong at Aug 3, 2011 at 12:03 pm

    On Wed, 3 Aug 2011 10:35:51 +0800 (CST), "Daniel,Wu" wrote:
    So the key of a group is determined by the first record that arrives in the
    group. If we have 3 records in a group
    1: (1900,35)
    2: (1900,34)
    3: (1900,33)

    and (1900,35) comes in as the first row, then the resulting key will be
    (1900,35); when the second row (1900,34) comes in, it won't impact the
    key of the group, meaning it will not overwrite the key (1900,35) with
    (1900,34). Correct?
    Effectively, yes. Remember that on the inside it's using the comparator
    something like this:

    (1900, 35).. do I have that key already? [searches collection of keys
    with, say, a BST] no! I'll add it here.
    (1900,34).. do I have that key already? [searches again, now getting a
    result of 0 when comparing to (1900,35)] yes! [it's not the same key, but
    according to the GroupComparator it is!] so I'll add its value to the key's
    iterable of values.
    etc.
  • Daniel,Wu at Aug 4, 2011 at 6:08 am
    Thanks John,

    I am confused again by the result of my test case; could you please take a look?
    The related code is:

    public static class IntSumReducer
            extends Reducer<IntPair, NullWritable, IntPair, NullWritable> {

        public void reduce(IntPair key, Iterable<NullWritable> values,
                Context context) throws IOException, InterruptedException {
            int count = 0;
            for (NullWritable iw : values) {
                count++;
                System.out.print(key.getFirst());
                System.out.print(" : ");
                System.out.println(key.getSecond());
            }
            System.out.println("number of records for this group " + Integer.toString(count));
            System.out.println("-----------------biggest key is--------------------------");
            System.out.print(key.getFirst());
            System.out.print(" ----- ");
            System.out.println(key.getSecond());
            context.write(key, NullWritable.get());
        }
    }

    I am using the new API (the release is from Cloudera). We can see from the output that for each call of the reduce function, 100 records were processed. But since reduce is defined as
    reduce(IntPair key, Iterable<NullWritable> values, Context context), the key should be fixed (not change) during every single execution; yet the strange thing is that on each iteration over Iterable<NullWritable> values, the key is different! Using your explanation, the same information (0:97) should be repeated 100 times, but actually it is 0:97, 0:97, 0:96 ... 0:0, as below:


    0 : 97
    0 : 97
    0 : 96
    0 : 96
    0 : 94
    0 : 93
    0 : 93
    0 : 91
    0 : 90
    0 : 89
    0 : 86
    0 : 85
    .... deleted to save space
    0 : 2
    0 : 1
    0 : 1
    0 : 0
    0 : 0
    number of records for this group 100
    -----------------biggest key is--------------------------
    0 ----- 0
    4 : 99
    4 : 99
    4 : 98
    4 : 96
    4 : 95
    4 : 94
    4 : 93
    4 : 92
    4 : 91
    4 : 91
    4 : 90




  • John Armstrong at Aug 4, 2011 at 12:51 pm

    On Thu, 4 Aug 2011 14:07:12 +0800 (CST), "Daniel,Wu" wrote:
    I am using the new API (the release is from Cloudera). We can see from the
    output that for each call of the reduce function, 100 records were
    processed. But since reduce is defined as reduce(IntPair key,
    Iterable<NullWritable> values, Context context), the key should be fixed
    (not change) during every single execution; yet the strange thing is that
    on each iteration over Iterable<NullWritable> values, the key is
    different! Using your explanation, the same information (0:97) should be
    repeated 100 times, but actually it is 0:97, 0:97, 0:96 ... 0:0, as below
    Ah, but they're NOT different! That's the whole point!

    Think carefully: how does Hadoop decide what keys are "the same" when
    sorting and grouping reducer inputs? It uses a comparator. If the
    comparator says compare(key1,key2)==0, then as far as Hadoop is concerned
    the keys are the same.

    So here the comparator only really checks the first int in the pair:

    "compare(0:97,0:96)? well let's compare 0 and 0...
    Integer.compare(0,0)==0, so these are the same key."

    You have to be careful about the semantics of "equality" whenever you're
    using nonstandard comparators.
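
    A quick way to see this is to call the grouping comparator directly, e.g. from a main method in the same class that defines GroupComparator. A small sketch, assuming the book's IntPair has an (int, int) constructor:

    GroupComparator grouper = new GroupComparator();
    IntPair k1 = new IntPair(1900, 35);
    IntPair k2 = new IntPair(1900, 34);
    // Only the year is compared, so Hadoop treats these two keys as "the same"
    // when grouping reduce inputs, even though the temperatures differ.
    System.out.println(grouper.compare(k1, k2));   // prints 0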
  • Daniel,Wu at Aug 5, 2011 at 12:50 am
    Hi John,

    Another finding: if I remove the loop over the values (remove for (NullWritable iw:values)), then the result is the MAX temperature for each year, while my original test returned the MIN temperature for each year. The book also mentions that the value is mutable; I think the key might also be mutable, meaning that as we loop over each value in the Iterable<NullWritable>, the content of the key object is reset. Since the input is in order, if we don't loop at all (as in the new test), the key we have at the end of the reduce function is the first record of the group, which has the max value. If we loop over each value in the value list, say 100 times, the content of the key will also change 100 times, and the key at the end of the reduce function will be the last key, which has the MIN value. This theory of a mutable key explains how the test works. I just need to figure out why each iteration of for (NullWritable iw:values) can change the content of the key. If anyone knows, please help me.

    public void reduce(IntPair key, Iterable<NullWritable> values,
            Context context) throws IOException, InterruptedException {
        int count = 0;
        /* Loop removed: without iterating over the values, the key still holds
           the first (maximum) pair of the group when it is written out.
        for (NullWritable iw : values) {
            count++;
            System.out.print(key.getFirst());
            System.out.print(" : ");
            System.out.println(key.getSecond());
        }
        */
        // System.out.println("number of records for this group " + Integer.toString(count));
        System.out.println("-----------------biggest key is--------------------------");
        System.out.print(key.getFirst());
        System.out.print(" ----- ");
        System.out.println(key.getSecond());
        context.write(key, NullWritable.get());
    }


    -----------------biggest key is--------------------------
    0 ----- 97
    -----------------biggest key is--------------------------
    4 ----- 99
    -----------------biggest key is--------------------------
    8 ----- 99
    -----------------biggest key is--------------------------
    12 ----- 97
    -----------------biggest key is--------------------------
    16 ----- 98


  • John Armstrong at Aug 5, 2011 at 12:43 pm

    On Fri, 5 Aug 2011 08:50:02 +0800 (CST), "Daniel,Wu" wrote:
    The book also mentions that the value is mutable; I think the key might
    also be mutable, meaning that as we loop over each value in the
    Iterable<NullWritable>, the content of the key object is reset.
    The "mutability" of the value is one of the weirdnesses of Hadoop you have
    to get used to, and one of the few times it becomes important that Java
    object semantics are pointer semantics. Anyway, it wouldn't surprise me if
    the key were also replaced on iteration, but I'd have to dig into the
    Hadoop code to check on that. Or you could!
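
    One quick way to check it without digging into the Hadoop source (a sketch, using the reducer above) is to print the key's identity hash inside the loop. If the hash stays constant while getSecond() changes, the same key object is being reused and refilled on each iteration rather than replaced:

    for (NullWritable iw : values) {
        // Same identityHashCode on every line => one key object, mutated in place.
        System.out.println(System.identityHashCode(key) + " -> "
                + key.getFirst() + ":" + key.getSecond());
    }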

Discussion Overview
group: common-user @ hadoop.apache.org
categories: hadoop
posted: Aug 2, '11 at 1:26p
active: Aug 5, '11 at 12:43p
posts: 12
users: 2 (Daniel,Wu: 7 posts, John Armstrong: 5 posts)
website: hadoop.apache.org...
irc: #hadoop
