Since there are few sources on how to write a good Writable, I want to share
some tips I've learned over the last few days while writing a job that
demanded a fairly complex Writable.
I will be glad to hear more tips, corrections, etc.


Writables are a way to transfer complex data types from the mapper to the
reducer (or combiner), or to serve as a flexible output format for MapReduce jobs.
However, Hadoop does not always use them the way you think it does:
1. Hadoop reuses Writable objects (at least the old API does) - so strange
data miraculously appears in your Writables if they are not cleared right
before being used.
2. Hadoop compares and hashes Writables a lot - so writing good
"hashCode" and "equals" methods is a necessity.
3. Hadoop needs an empty constructor for Writables - so if you write another
constructor, be sure to also implement the empty one.

Any complex Writable you write (complex = more than just a couple of
fields) should:

*. Override Object's "*equals*": compare all available fields (deep
compare), checking the unique fields first so you can skip the rest when they differ.
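
For illustration, a minimal sketch - the id and name fields here are made-up
placeholders, not part of any real UserWritable:

@Override
public boolean equals(Object other) {
    if (this == other) return true;
    if (!(other instanceof UserWritable)) return false;
    UserWritable that = (UserWritable) other;
    // check the cheap, unique field first, then the rest
    if (this.id != that.id) return false;
    return this.name == null ? that.name == null : this.name.equals(that.name);
}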

*. Override Object's "*hashCode*": the simplest way is to XOR (^) the hash
codes of the most important fields.
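
Along these lines (again with the hypothetical id and name fields):

@Override
public int hashCode() {
    return (int) (id ^ (id >>> 32)) ^ (name == null ? 0 : name.hashCode());
}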

*. Create an *empty constructor* - even if you don't need one. Implementing
other constructors is fine, as long as the empty one is also available.

*. Implement Writable's mandatory *readFields()* and *write()*. Use
versioning so the serialized format can evolve over time.
At the very beginning of readFields(), clear all fields (lists,
primitives, etc).
The best way to do that is to create a clearFields() method that is
called both from readFields() and from the empty constructor.
Remember that Hadoop reuses Writables (again, at least the old API - "mapred" -
does), so this is not just a good habit, but clearly a must.
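
For illustration, a rough sketch of how these pieces can fit together - the
fields, the version byte and the exact layout are made-up placeholders, not a
prescribed format:

private static final byte VERSION = 1;

private long id;
private String name;
private List<String> tags = new ArrayList<String>();

public UserWritable() {
    clearFields();                  // Hadoop may hand us a reused instance
}

private void clearFields() {
    id = 0;
    name = "";
    tags.clear();
}

public void write(DataOutput out) throws IOException {
    out.writeByte(VERSION);         // lets readFields() evolve later
    out.writeLong(id);
    out.writeUTF(name);
    out.writeInt(tags.size());
    for (String tag : tags) {
        out.writeUTF(tag);
    }
}

public void readFields(DataInput in) throws IOException {
    clearFields();                  // wipe whatever the reused object held
    byte version = in.readByte();
    if (version > VERSION) {
        throw new IOException("Unknown UserWritable version: " + version);
    }
    id = in.readLong();
    name = in.readUTF();
    int n = in.readInt();
    for (int i = 0; i < n; i++) {
        tags.add(in.readUTF());
    }
}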

*. Implement a static "*read()*" method - this isn't mandatory, but it's simple and helpful:

public static UserWritable read(DataInput in) throws IOException {
    UserWritable u = new UserWritable();
    u.readFields(in);
    return u;
}


More golden tips are welcome, and so are remarks.

--
Oded


  • Ted Yu at May 22, 2010 at 9:51 pm
    Thanks for sharing your precious experience.

    Hadoop achieves higher efficiency with Writables than with Serializables
    through object reuse. That's why the clearFields() function is needed.

    In my job, I found it useful to persist the size of each record to the
    DataOutput before the record itself, if your Writable instance holds more
    than one record. This prepares you for a corrupt record (which we faced in
    our trial), so that you can skip to the next record and reduce data loss.
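
    A rough sketch of this idea (the "record" field and its type are made up):
    write each record's byte length first, so a reader that hits a corrupt
    record can still position itself at the start of the next one.

    public void write(DataOutput out) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        record.write(new DataOutputStream(buf));   // serialize the record first
        byte[] bytes = buf.toByteArray();
        out.writeInt(bytes.length);                // length prefix
        out.write(bytes);
    }

    public void readFields(DataInput in) throws IOException {
        int len = in.readInt();
        byte[] bytes = new byte[len];
        in.readFully(bytes);
        try {
            record.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
        } catch (IOException e) {
            // corrupt record: exactly 'len' bytes were consumed, so the stream
            // is already positioned at the next record and we can move on
        }
    }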

    Also DataOutput.writeUTF() may surprise you. At least it gave us a little
    surprise.

    Versioning is an interesting subject. I hope to see more suggestions on this
    topic.

    Cheers
  • Oded Rosen at May 22, 2010 at 10:11 pm
    Thanks,
    The "record size" tip looks helpful,
    But I do not like surprises (at least not while programming):
    what do you mean by "DataOutput.writeUTF() may surprise you"?
    --
    Oded
  • Owen O'Malley at May 22, 2010 at 10:13 pm

    On May 22, 2010, at 2:09 PM, Oded Rosen wrote:

    *. Override Object's "*equals*": compare all available fields (deep
    compare), check unique fields first, to avoid checking the rest.
    You should implement equals, but more important for keys is to define
    a RawComparator for your type. If you don't define one, the framework
    will de-serialize your type a *lot*. Look at one of the writables like
    IntWritable or Text for how to register your comparator.
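
    A rough sketch of the registration pattern Owen describes, modeled on what
    IntWritable and Text do - it assumes a hypothetical UserWritable key
    (implementing WritableComparable) whose serialized form starts with a long id:

    public static class Comparator extends WritableComparator {
        public Comparator() {
            super(UserWritable.class);
        }
        @Override
        public int compare(byte[] b1, int s1, int l1,
                           byte[] b2, int s2, int l2) {
            // compare the serialized bytes directly, without deserializing
            long id1 = readLong(b1, s1);
            long id2 = readLong(b2, s2);
            return id1 < id2 ? -1 : (id1 == id2 ? 0 : 1);
        }
    }

    static {
        // register the comparator so the framework uses it for UserWritable keys
        WritableComparator.define(UserWritable.class, new Comparator());
    }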

    -- Owen
