Hi all,

I'm using Avro as a serialization format. Assume I have a generated specific class FOO that I use as the Mapper output value type:

class FOO {
    int a;
    List<BAR> barList;
}

where BAR is another generated specific Java class.

When I iterate over "Iterable<FOO> values" in the Reducer, it is clear that the same FOO object is reused, i.e.
Iterator<FOO> it = values.iterator();
FOO foo1 = it.next();
FOO foo2 = it.next();
assertThat(foo1 == foo2, is(true));

So I have the following questions:
1) Is the list barList reused across next() calls?
2) If yes, can the objects inside barList be reused as well? For example, suppose the list contains two BAR objects after the first next() call and three after the second. Are two of those three the same objects (equal by reference) as the two from the first call? In other words, does Hadoop maintain some sort of "object pool"?
3) Why doesn't AvroTools generate clone() methods? It would be quite straightforward and, more importantly, useful given that objects are reused.

Thanks a lot in advance!

Vyacheslav


  • Joey Echeverria at Aug 3, 2011 at 11:20 am
    Hadoop reuses objects as an optimization. If you need to keep a copy
    in memory, you need to call clone yourself. I've never used Avro, but
    my guess is that the BARs are not reused, only the FOO.

    -Joey
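
The reuse described above can be reproduced without a cluster. What follows is only a minimal sketch in plain Java: `Foo`, `ReusingIterator`, and `ReuseDemo` are hypothetical stand-ins written for illustration, not Hadoop or Avro classes, and the iterator merely imitates the behaviour reported in this thread (one shared instance overwritten on every next() call).

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

final class ReuseDemo {
    // Hypothetical stand-in for a generated value class.
    static final class Foo {
        int a;
        Foo() {}
        Foo(Foo other) { this.a = other.a; } // copy constructor
    }

    // Imitates the reducer's value iterator: next() overwrites and returns one shared Foo.
    static final class ReusingIterator implements Iterator<Foo> {
        private final int[] data;
        private final Foo shared = new Foo();
        private int pos = 0;

        ReusingIterator(int... data) { this.data = data; }

        @Override public boolean hasNext() { return pos < data.length; }

        @Override public Foo next() {
            if (!hasNext()) throw new NoSuchElementException();
            shared.a = data[pos++];
            return shared;
        }
    }

    // Buggy pattern: stores the reference, so every stored element is the same object.
    static List<Integer> keepReferences(Iterator<Foo> it) {
        List<Foo> kept = new ArrayList<>();
        while (it.hasNext()) kept.add(it.next());
        return values(kept);
    }

    // Safe pattern: copies each value before the next call to next() overwrites it.
    static List<Integer> keepCopies(Iterator<Foo> it) {
        List<Foo> kept = new ArrayList<>();
        while (it.hasNext()) kept.add(new Foo(it.next()));
        return values(kept);
    }

    private static List<Integer> values(List<Foo> foos) {
        List<Integer> out = new ArrayList<>();
        for (Foo f : foos) out.add(f.a);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(keepReferences(new ReusingIterator(1, 2, 3))); // prints [3, 3, 3]
        System.out.println(keepCopies(new ReusingIterator(1, 2, 3)));     // prints [1, 2, 3]
    }
}
```

If the values are consumed one at a time and never stored, no copy is needed; the copy only matters when a value has to outlive the next call to next().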

    --
    Joseph Echeverria
    Cloudera, Inc.
    443.305.9434
  • Milind Bhandarkar at Aug 4, 2011 at 7:36 pm
    HADOOP-2399 has caused a lot of problems for users so far, and the saga
    still continues :-(

    I remember spending 18 straight hours in 2008 with a user debugging this
    issue.

    - milind

    ---
    Milind Bhandarkar
    Greenplum Labs, EMC
    (Disclaimer: Opinions expressed in this email are those of the author, and
    do not necessarily represent the views of any organization, past or present,
    the author might be affiliated with.)



  • Vyacheslav Zholudev at Aug 4, 2011 at 9:07 pm
    Just sharing today's discovery:
    Hadoop also reuses the objects inside nested lists, in my example the BAR objects.
    That is, if the first FOO object has two BAR objects in its list, then the
    second FOO object will contain the same two BAR objects (equal by reference)
    at the start of its list. So in the case of Avro it would be good if the
    auto-generated code implemented a 'clone' method.
    By the way, is it reasonable to clone Avro specific objects by serializing and
    deserializing them with SpecificDatum{Writer|Reader}?

    Vyacheslav
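
The idea of cloning through a serialize/deserialize round trip can be sketched as follows. Note the swap: this sketch uses Java's built-in `java.io` serialization as a stand-in, not Avro's SpecificDatumWriter/SpecificDatumReader, so that it runs with the standard library alone. `RoundTripCopy`, `Foo`, and `Bar` are hypothetical names for illustration; real Avro-generated classes would go through Avro's own writer and reader instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

final class RoundTripCopy {
    // Hypothetical stand-ins for the generated classes in the question.
    static final class Bar implements Serializable {
        private static final long serialVersionUID = 1L;
        int b;
        Bar(int b) { this.b = b; }
    }

    static final class Foo implements Serializable {
        private static final long serialVersionUID = 1L;
        int a;
        final List<Bar> barList = new ArrayList<>();
    }

    // Deep-copies a Serializable object graph by writing it to bytes and reading it back.
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T copy(T original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("round-trip copy failed", e);
        }
    }

    public static void main(String[] args) {
        Foo foo = new Foo();
        foo.a = 1;
        foo.barList.add(new Bar(10));

        Foo saved = copy(foo);

        // Simulate the framework overwriting the reused objects in place.
        foo.a = 2;
        foo.barList.get(0).b = 20;

        System.out.println(saved.a + " " + saved.barList.get(0).b); // prints 1 10
    }
}
```

The copy is fully independent of the original, nested BAR-like elements included, at the cost of encoding and decoding every record that is kept.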


    --
    Best,
    Vyacheslav Zholudev
  • Joey Echeverria at Aug 4, 2011 at 9:16 pm
    Wow, I didn't expect that. That's nastier than usual. I would think
    that cloning by serializing/deserializing would be unnecessarily slow.
    I would file a JIRA with Avro asking for a clone() or copy constructor
    in generated code.

    -Joey
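
A copy constructor of the kind suggested above might look like the sketch below. Everything here is hypothetical and hand-written for illustration (`DeepCopyDemo`, `Foo`, `Bar`, `refill`); it is not what Avro generates. The `refill` helper only imitates the behaviour reported earlier in the thread, where existing BAR instances are overwritten in place. Later Avro releases also provide a `GenericData.deepCopy(schema, value)` helper, which may make a hand-written copy unnecessary depending on the version in use.

```java
import java.util.ArrayList;
import java.util.List;

final class DeepCopyDemo {
    static final class Bar {
        int b;
        Bar(int b) { this.b = b; }
        Bar(Bar other) { this.b = other.b; } // copy constructor
    }

    static final class Foo {
        int a;
        List<Bar> barList = new ArrayList<>();

        Foo() {}

        // Deep copy: a new list and a new Bar for every element.
        Foo(Foo other) {
            this.a = other.a;
            for (Bar bar : other.barList) this.barList.add(new Bar(bar));
        }

        // Shallow copy, for contrast: a new list that still shares the Bar instances.
        static Foo shallowCopy(Foo other) {
            Foo copy = new Foo();
            copy.a = other.a;
            copy.barList = new ArrayList<>(other.barList);
            return copy;
        }
    }

    // Imitates the reported reuse: overwrite existing Bar objects, append or trim as needed.
    static void refill(Foo reused, int a, int... bs) {
        reused.a = a;
        for (int i = 0; i < bs.length; i++) {
            if (i < reused.barList.size()) reused.barList.get(i).b = bs[i];
            else reused.barList.add(new Bar(bs[i]));
        }
        while (reused.barList.size() > bs.length) {
            reused.barList.remove(reused.barList.size() - 1);
        }
    }

    public static void main(String[] args) {
        Foo reused = new Foo();
        refill(reused, 1, 10, 11);

        Foo shallow = Foo.shallowCopy(reused);
        Foo deep = new Foo(reused);

        refill(reused, 2, 20, 21, 22); // the "next() call" overwrites in place

        System.out.println(shallow.barList.get(0).b); // prints 20 (shared Bar was overwritten)
        System.out.println(deep.barList.get(0).b);    // prints 10 (independent copy)
    }
}
```

The shallow copy shows why copying only the list is not enough when the elements themselves are reused: the copy has to go all the way down.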


Discussion Overview
group: mapreduce-user
category: hadoop
posted: Aug 3, '11 at 7:19a
active: Aug 4, '11 at 9:16p
posts: 5
users: 3
website: hadoop.apache.org...
irc: #hadoop
