Hello,

I'm using Avro as the serialization framework for my Hadoop map-reduce jobs, and am emitting GenericRecord/null as the K/V pairs from my mapper classes. Having looked at the code, I see that my reducer only treats the "key" objects (i.e. my records) as distinct if calling .equals() on them shows a difference. However, if the schema is the same (which it is for most of my mappers), then .equals() calls .compare(), which in turn depends on the ORDER attributes set on the fields. This means that if I have no sorting defined in my schema, all records are treated as being equal to one another. Have I understood this correctly, and if so, is that not a violation of the equals contract? (For one thing, it would mean GenericRecord objects will often cause confusion when used with maps and other containers.)


Regards,

Andrew

  • Doug Cutting at Feb 9, 2012 at 7:49 pm

    On 02/09/2012 07:02 AM, Andrew Kenworthy wrote:
    > This means that if I have no sorting defined in my schema, that all
    > records are treated as being equal to one another.

    If you specify "order":"ignore" for all fields in a record, then, yes,
    all instances of that record would be equal. I cannot imagine a case
    where this would be useful, but I also don't see how this would violate
    the equals() contract.

    The default for fields is to behave as if "order":"ascending" is
    specified. Records are equal if all of their fields that are not
    specified as "order":"ignore" are equal.

    Doug
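
The rule described above (fields default to ascending order, and fields marked "order":"ignore" are skipped) can be illustrated with a small toy model. This is only a sketch in plain Java: `OrderAwareCompare`, its `Order` enum and the `compare` helper are hypothetical names made up for illustration and are not Avro classes or Avro's actual implementation.

```java
// Toy model only: hypothetical types, not Avro's classes.
final class OrderAwareCompare {
    enum Order { ASCENDING, DESCENDING, IGNORE }

    // Compares two rows of string values field by field.
    // Fields whose order is IGNORE do not take part in the comparison.
    static int compare(Order[] orders, String[] a, String[] b) {
        for (int i = 0; i < orders.length; i++) {
            if (orders[i] == Order.IGNORE) {
                continue;
            }
            int c = a[i].compareTo(b[i]);
            if (c != 0) {
                return orders[i] == Order.DESCENDING ? -c : c;
            }
        }
        return 0; // every non-ignored field matched
    }

    public static void main(String[] args) {
        String[] r1 = { "1", "x" };
        String[] r2 = { "2", "x" };
        Order[] ascending = { Order.ASCENDING, Order.ASCENDING };
        Order[] ignored = { Order.IGNORE, Order.IGNORE };
        System.out.println(compare(ascending, r1, r2) == 0); // prints false
        System.out.println(compare(ignored, r1, r2) == 0);   // prints true
    }
}
```

In this model, two rows only collapse into "equal" when every field that differs is explicitly ignored, which matches the behaviour described in the reply.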
  • Andrew Kenworthy at Feb 10, 2012 at 12:26 pm
    Hello Doug,

    Thank you for your feedback. I agree that using Order.IGNORE to ignore differences between records makes sense, as that is the criterion used to define distinction when sorting. But it looks as though only the schema name is checked when deciding whether to examine each field. As the test below shows, this can make equals asymmetric if one is not careful. (The example is a "bad" one, as it is not a good idea to have two schemas with the same name and namespace but different contents, but it shows how one might inadvertently make a wrong assumption about equality.)

    @Test
    public void test() {
        // Two schemas with the same name and namespace but different field order attributes.
        Schema schema1 = Schema.createRecord("test_record", null, "my.namespace", false);
        List<Field> fields1 = new ArrayList<Field>();
        fields1.add(new Field("attribute1", Schema.create(Schema.Type.STRING), null, null, Order.IGNORE));
        schema1.setFields(fields1);

        Schema schema2 = Schema.createRecord("test_record", null, "my.namespace", false);
        List<Field> fields2 = new ArrayList<Field>();
        fields2.add(new Field("attribute1", Schema.create(Schema.Type.STRING), null, null, Order.ASCENDING));
        schema2.setFields(fields2);

        GenericRecord record1 = new GenericData.Record(schema1);
        record1.put("attribute1", "1");
        GenericRecord record2 = new GenericData.Record(schema2);
        record2.put("attribute1", "2");

        System.out.println(record1.equals(record2)); // returns TRUE
        System.out.println(record2.equals(record1)); // returns FALSE
    }

    Andrew
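
The asymmetry reported in this test can be reproduced without Avro using a small toy model. The sketch below assumes, as the post suggests, that equality checks only the schema name and then compares fields using the receiving record's own schema. `AsymmetricEquals`, `ToySchema`, `ToyRecord` and `nameOnlyEquals` are hypothetical names for illustration, not Avro code.

```java
// Toy model only: hypothetical classes, not Avro's implementation.
final class AsymmetricEquals {
    static final class ToySchema {
        final String fullName;
        final boolean ignoreField; // true models Order.IGNORE on the single field

        ToySchema(String fullName, boolean ignoreField) {
            this.fullName = fullName;
            this.ignoreField = ignoreField;
        }
    }

    static final class ToyRecord {
        final ToySchema schema;
        final String value;

        ToyRecord(ToySchema schema, String value) {
            this.schema = schema;
            this.value = value;
        }

        // Models the reported behaviour: only the schema NAME is checked,
        // then the field is compared according to THIS record's schema.
        boolean nameOnlyEquals(ToyRecord that) {
            if (!schema.fullName.equals(that.schema.fullName)) {
                return false;
            }
            return schema.ignoreField || value.equals(that.value);
        }
    }

    public static void main(String[] args) {
        ToyRecord r1 = new ToyRecord(new ToySchema("my.namespace.test_record", true), "1");
        ToyRecord r2 = new ToyRecord(new ToySchema("my.namespace.test_record", false), "2");
        System.out.println(r1.nameOnlyEquals(r2)); // prints true
        System.out.println(r2.nameOnlyEquals(r1)); // prints false
    }
}
```

Because each side consults its own schema, the answer depends on which record receives the call, which is exactly the lack of symmetry the test demonstrates.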


    ________________________________
    From: Doug Cutting <cutting@apache.org>
    To: user@avro.apache.org
    Sent: Thursday, February 9, 2012 8:49 PM
    Subject: Re: Does Avro GenericData.Record violate the .equals contract?

  • Doug Cutting at Feb 10, 2012 at 5:57 pm
    This does look like a bug in GenericData.Record#equals(). It should
    return false when the schemas are not equal. It currently only checks
    the schema names as a performance optimization, but that optimization is
    not a good one. Can you please file a bug report in Jira?

    Thanks,

    Doug
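
The fix suggested above, returning false whenever the two schemas are not equal, can be sketched in the same toy style. This is an assumption-laden illustration rather than the actual Avro patch: `SchemaCheckedEquals`, `ToySchema.sameAs` and `schemaCheckedEquals` are hypothetical names, and the model has a single string field.

```java
// Toy model only: hypothetical classes, not the actual Avro fix.
final class SchemaCheckedEquals {
    static final class ToySchema {
        final String fullName;
        final boolean ignoreField; // true models Order.IGNORE on the single field

        ToySchema(String fullName, boolean ignoreField) {
            this.fullName = fullName;
            this.ignoreField = ignoreField;
        }

        // Full schema comparison: the name AND the field's order attribute.
        boolean sameAs(ToySchema that) {
            return fullName.equals(that.fullName) && ignoreField == that.ignoreField;
        }
    }

    static final class ToyRecord {
        final ToySchema schema;
        final String value;

        ToyRecord(ToySchema schema, String value) {
            this.schema = schema;
            this.value = value;
        }

        // Records with different schemas are never equal; otherwise the
        // shared schema decides whether the field takes part in the check.
        boolean schemaCheckedEquals(ToyRecord that) {
            if (!schema.sameAs(that.schema)) {
                return false;
            }
            return schema.ignoreField || value.equals(that.value);
        }
    }

    public static void main(String[] args) {
        ToyRecord r1 = new ToyRecord(new ToySchema("my.namespace.test_record", true), "1");
        ToyRecord r2 = new ToyRecord(new ToySchema("my.namespace.test_record", false), "2");
        System.out.println(r1.schemaCheckedEquals(r2)); // prints false
        System.out.println(r2.schemaCheckedEquals(r1)); // prints false
    }
}
```

Once both sides agree on the schema before looking at field values, the result no longer depends on which record receives the call, so symmetry is restored in this model.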
