FAQ
I created my own Writable class to store 3 pieces of information. In
my mapreduce.Reducer class I collect all of them and then process as
a group, i.e.:

reduce(key, values, context) {
    List<Foo> myFoos = new ArrayList<Foo>();
    for (Foo value : values) {
        myFoos.add(value);
    }
}

I was perplexed when entries in the list changed underneath me, so I
added a freeze() method to it:

boolean m_frozen = false;

public void freeze() {
    m_frozen = true;
}

@Override
public void readFields(DataInput in) throws IOException {
    if (m_frozen) {
        throw new IllegalStateException();
    }
    // ... read the three fields ...
}

And noted that the exception was thrown:
at Foo.readFields(Foo.java:169)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)
at org.apache.hadoop.mapreduce.ReduceContext$ValueIterator.next(ReduceContext.java:163)

Am I doing something wrong? Should I expect this VALUEIN object to
change from underneath me? I'm using Hadoop 0.20.1 (from a Cloudera
tarball).

Chris


  • Eric Sammer at Jan 13, 2010 at 12:14 am

    On 1/12/10 6:53 PM, Wilkes, Chris wrote:
    I created my own Writable class to store 3 pieces of information. In my
    mapreduce.Reducer class I collect all of them and then process as a
    group, i.e.:

    reduce(key, values, context) {
        List<Foo> myFoos = new ArrayList<Foo>();
        for (Foo value : values) {
            myFoos.add(value);
        }
    } snip
    Am I doing something wrong? Should I expect this VALUEIN object to
    change from underneath me? I'm using hadoop 0.20.1 (from a cloudera
    tarball)
    That's the documented behavior. Hadoop reuses the same Writable instance
    and replaces the *members* in the readFields() method in most cases (all
    cases?). The instance of Foo in your example will be the same object and
    simply have its members overwritten after each call to readFields().
    Currently, you're building a list of references to the same object; at
    the end of your for loop, you'll have a list of N references all showing
    the last value read. This is one of those "gotchas." If you really need
    to build a list like this, you'd have to resort to doing a deep copy,
    but you're better off avoiding it if you can, as it will drastically
    impact performance and add the requirement that all values for a given
    key fit in memory.
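    To illustrate the deep-copy workaround, here is a self-contained sketch.
    Foo's fields here are hypothetical stand-ins for the poster's three
    pieces of information; the copy round-trips the object through its own
    wire format, which is the same idea Hadoop's WritableUtils.clone() uses:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Stand-in for the poster's three-field Writable.
class Foo {
    long id;
    int count;
    String name;

    void write(DataOutput out) throws IOException {
        out.writeLong(id);
        out.writeInt(count);
        out.writeUTF(name);
    }

    void readFields(DataInput in) throws IOException {
        id = in.readLong();
        count = in.readInt();
        name = in.readUTF();
    }

    // Deep copy by serializing and deserializing through a byte buffer,
    // so the copy is independent of the reused framework instance.
    static Foo copyOf(Foo src) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        src.write(new DataOutputStream(buf));
        Foo copy = new Foo();
        copy.readFields(new DataInputStream(
                new ByteArrayInputStream(buf.toByteArray())));
        return copy;
    }
}
```

    Inside the reduce loop the only change would be
    myFoos.add(Foo.copyOf(value)) instead of myFoos.add(value).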

    Hope this helps.
    --
    Eric Sammer
    eric@lifeless.net
    http://esammer.blogspot.com
  • Ed Mazur at Jan 13, 2010 at 5:30 pm

    On Tue, Jan 12, 2010 at 7:14 PM, Eric Sammer wrote:
    On 1/12/10 6:53 PM, Wilkes, Chris wrote:
    I created my own Writable class to store 3 pieces of information. snip
    That's the documented behavior. Hadoop reuses the same Writable instance
    and replaces the *members* in the readFields() method. snip
    What is the preferred method of avoiding value buffering? For example,
    if you're building a basic inverted index, you have one key (term)
    associated with many values (doc ids) in your reducer. If you want an
    output pair of something like <Text, IntArrayWritable>, is there a way
    to build and output the id array without buffering values? The only
    alternative I see is to instead use <Text, IntWritable> and repeat the
    term for every doc id, but this seems wasteful.

    Ed
  • Eric Sammer at Jan 13, 2010 at 6:00 pm

    On 1/13/10 12:29 PM, Ed Mazur wrote:
    What is the preferred method of avoiding value buffering? For example,
    if you're building a basic inverted index, you have one key (term)
    associated with many values (doc ids) in your reducer. snip
    Ed:

    In that case, I think you would want to buffer the values. I should
    probably correct myself and say that it depends on the application. In
    general, the assumption made by the framework is that all reduce values
    for a given key may not fit in memory. In specific implementations it
    may be fine (or even necessary) for the user to do buffering like this.
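    To make the buffering point concrete, here is a pure-JDK sketch.
    IntHolder is a hypothetical stand-in for Hadoop's IntWritable, and
    bufferDocIds for the body of an inverted-index reducer: because the
    primitive is copied out of the reused holder on each iteration,
    buffering is safe here and no deep copy is needed:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-in for Hadoop's IntWritable: a mutable int holder
// that the framework reuses across iterations of the value iterator.
class IntHolder {
    private int value;
    void set(int v) { value = v; }
    int get() { return value; }
}

class InvertedIndexSketch {
    // Buffer the doc ids for one term. Extracting the primitive with
    // get() sidesteps the object-reuse gotcha entirely.
    static List<Integer> bufferDocIds(Iterable<IntHolder> values) {
        List<Integer> docIds = new ArrayList<Integer>();
        for (IntHolder v : values) {
            docIds.add(v.get()); // copies the int, not the holder
        }
        return docIds;
    }
}
```

    The resulting list could then be packed into a single output value
    (e.g. an array-valued Writable) and emitted once per term.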

    Thanks and sorry for the confusion.
    --
    Eric Sammer
    eric@lifeless.net
    http://esammer.blogspot.com

Discussion Overview
group: mapreduce-user @ hadoop
posted: Jan 12, '10 at 11:53p
active: Jan 13, '10 at 6:00p
posts: 4
users: 3
website: hadoop.apache.org...
irc: #hadoop
