Hi All,

Here is what's happening. I have implemented my own WritableComparable keys
and values.
Inside a reducer I am seeing 'reduce' being invoked with the "same" key
_twice_.
I have checked that context.getKeyComparator() and
context.getSortComparator() are both WritableComparator, which
indicates that the 'compareTo' method of my key should be called during the
reduce-side merge.

Indeed, inside the 'reduce' method I captured both key instances and did the
following checks:

((WritableComparator) context.getKeyComparator()).compare((Object) key1, (Object) key2)
((WritableComparator) context.getSortComparator()).compare((Object) key1, (Object) key2)

In both calls, the result is '0', confirming that key1 and key2 are
equivalent.

So, what is going on?

Note that key1 and key2 come from different mappers, but they should have
been grouped into a single reduce call since they are equal according to the
WritableComparator. Also note that key1 and key2 are not bitwise equivalent,
but that shouldn't matter, or should it?

Many thanks in advance!

stan
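
For concreteness, a minimal sketch of the check being described, assuming a
hypothetical key type MyKey and value type MyValue (not the poster's actual code).
It keeps a deep copy of the previous key with WritableUtils.clone, since the
framework reuses the key instance between reduce() calls, and re-compares it against
the current key using the job's sort comparator:

    import java.io.IOException;
    import org.apache.hadoop.io.RawComparator;
    import org.apache.hadoop.io.WritableUtils;
    import org.apache.hadoop.mapreduce.Reducer;

    public class DuplicateGroupCheckReducer extends Reducer<MyKey, MyValue, MyKey, MyValue> {
        private MyKey previousKey;  // deep copy of the key from the previous reduce() call

        @Override
        protected void reduce(MyKey key, Iterable<MyValue> values, Context context)
                throws IOException, InterruptedException {
            if (previousKey != null) {
                @SuppressWarnings("unchecked")
                RawComparator<MyKey> sortCmp =
                        (RawComparator<MyKey>) context.getSortComparator();
                if (sortCmp.compare(previousKey, key) == 0) {
                    // Two consecutive key groups compare as equal -- the symptom above.
                    System.err.println("duplicate key group: " + key);
                }
            }
            // The framework reuses the key object, so store a copy rather than a reference.
            previousKey = WritableUtils.clone(key, context.getConfiguration());
        }
    }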


  • William Kinney at Jan 10, 2012 at 10:52 pm
    I have noticed this too with one job. Keys that are equal (equals() returns
    true, hashCode() values match, and compareTo() returns 0) are being sent to
    multiple reduce tasks, resulting in incorrect output.

    Any insight?

  • W.P. McNeill at Jan 10, 2012 at 10:59 pm
    The Hadoop framework reuses Writable objects for key and value arguments,
    so if your code stores a pointer to that object instead of copying it, you
    can find yourself with mysterious duplicate objects. This has tripped me up
    a number of times. Details on what exactly I encountered and how I fixed it
    are here:
    http://cornercases.wordpress.com/2011/03/14/serializing-complex-mapreduce-keys/
    and here:
    http://cornercases.wordpress.com/2011/08/18/hadoop-object-reuse-pitfall-all-my-reducer-values-are-the-same/
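
    A minimal illustration of that pitfall, as a fragment of a reduce() body with a
    hypothetical MyValue type (needs java.util.List/ArrayList and
    org.apache.hadoop.io.WritableUtils): the framework keeps rewriting a single
    Writable instance, so buffering references collects N pointers to one object.

        List<MyValue> buffered = new ArrayList<MyValue>();
        for (MyValue v : values) {
            // Wrong: buffered.add(v) -- every element would end up referring to
            // the single instance that the framework rewrites on each iteration.
            buffered.add(WritableUtils.clone(v, context.getConfiguration()));
        }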
  • William Kinney at Jan 10, 2012 at 11:13 pm
    I'm (unfortunately) aware of this, and it isn't the issue. My key object
    contains only long, int and String values.

    The job map output is consistent, but the reduce input groups and values
    for the key vary from one job to the next on the same input. It's like it
    isn't properly comparing and partitioning the keys.

    I have properly implemented hashCode(), equals(), and the WritableComparable
    methods.

    Also, not surprisingly, when I use a single reduce task the output is correct.
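
    For reference, a minimal sketch of a key of the shape described (one long, one
    int, one String; the class and field names are made up), with write(),
    readFields(), compareTo(), hashCode() and equals() all derived from the same
    three fields so that partitioning, sorting and grouping stay consistent:

        import java.io.DataInput;
        import java.io.DataOutput;
        import java.io.IOException;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.io.WritableComparable;

        public class EventKey implements WritableComparable<EventKey> {
            private long timestamp;
            private int shard;
            private String name = "";

            public void write(DataOutput out) throws IOException {
                out.writeLong(timestamp);
                out.writeInt(shard);
                Text.writeString(out, name);
            }

            public void readFields(DataInput in) throws IOException {
                timestamp = in.readLong();
                shard = in.readInt();
                name = Text.readString(in);
            }

            public int compareTo(EventKey o) {
                if (timestamp != o.timestamp) return timestamp < o.timestamp ? -1 : 1;
                if (shard != o.shard) return shard < o.shard ? -1 : 1;
                return name.compareTo(o.name);
            }

            public int hashCode() {
                int h = (int) (timestamp ^ (timestamp >>> 32));
                h = 31 * h + shard;
                return 31 * h + name.hashCode();  // String.hashCode() is stable across JVMs
            }

            public boolean equals(Object o) {
                if (!(o instanceof EventKey)) return false;
                EventKey k = (EventKey) o;
                return timestamp == k.timestamp && shard == k.shard && name.equals(k.name);
            }
        }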

  • William Kinney at Jan 10, 2012 at 11:16 pm
    Naturally, right after I sent that email I found that I was wrong. I was also
    using an enum field, which was the culprit.
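
    One plausible way an enum field produces exactly this symptom (a guess; the
    thread does not say how the enum was used): java.lang.Enum.hashCode() is the
    identity hash, so the same constant hashes differently in different JVMs, and
    if it feeds the key's hashCode() the default HashPartitioner can route equal
    keys to different reduce tasks. A sketch of a stable alternative, assuming a
    hypothetical Status enum field alongside a long id field:

        // Serialize the enum explicitly in write()/readFields():
        //   WritableUtils.writeEnum(out, status);
        //   status = WritableUtils.readEnum(in, Status.class);

        public int hashCode() {
            int h = (int) (id ^ (id >>> 32));
            // Use ordinal() or name().hashCode(); never status.hashCode(), which is
            // identity-based and varies from one JVM (and thus one map task) to the next.
            return 31 * h + status.ordinal();
        }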
