Avro user mailing list, April 2013
It is well documented in the specification:
http://avro.apache.org/docs/current/spec.html#json_encoding

I know others have overridden this behavior by extending GenericData and/or
the JsonDecoder/Encoder. It wouldn't conform to the Avro specification's
JSON encoding, but you can extend Avro to do what you need it to.

The reason for this encoding is to make sure that round-tripping data from
binary to JSON and back results in the same data. Unions can also be more
complicated than this one and contain multiple records, each with a
different name. Disambiguating the value requires extra information, since
several Avro types map to the same JSON type. If the schema is a union of
bytes and string, is "hello" a string or a bytes literal? If it is a union
of a map and a record, is {"state":"CA", "city":"Pittsburgh"} a record with
two string fields or a map? There are other approaches, and for some users
perfect transmission of types is not critical. Generally speaking, if you
only want to output Avro data as JSON and consume it as JSON, the extra
labels are not helpful. If you want to read it back in as Avro, you need
them to know which branch of the union to take.
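
As a concrete illustration of the spec's rule (the schema here is an
invented example, not one from this thread): given the union
["null", "string", "bytes"], the JSON encoding is

    null value               -> null
    the string "hello"       -> {"string": "hello"}
    the bytes 68 65 6c 6c 6f -> {"bytes": "hello"}

Only null goes unlabeled; every other branch is wrapped in a single-pair
JSON object keyed by the type name (or by the fullname, for named types
such as records).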
On 4/6/13 6:49 PM, "Jonathan Coveney" wrote:

Err, it's the output format that deserializes the JSON and then writes it in
the binary format, not the input format. But either way, the general flow is
the same.

As a general aside, is the Java implementation correct in requiring that a
union value be written as {"string": "hello"} or the like? Seems like we
should add that to the documentation if it is a requirement.


2013/4/7 Jonathan Coveney <jcoveney@gmail.com>
Scott,

Thanks for the input. The use case is that a number of our batch processes
are built on Python streaming. Currently, the reducer will output a JSON
string as a value, and then the input format will deserialize the JSON and
write it in the binary format.

Given that our use of Python streaming isn't going away, any suggestions on
how to make this better? Is there a better way to go from a JSON string to
writing binary Avro data?
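
For reference, the hop can be done with the stock Java API. This is a
sketch only, under the assumption that the incoming JSON already carries
the union labels the spec requires; the class and method names
(JsonToBinary, convert) are illustrative, not anything from the thread:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.io.JsonDecoder;

public class JsonToBinary {
    // Decode a spec-conformant JSON string against the schema, then
    // re-encode the resulting record with the binary encoding.
    public static byte[] convert(Schema schema, String json) throws IOException {
        JsonDecoder decoder = DecoderFactory.get().jsonDecoder(schema, json);
        GenericRecord record =
            new GenericDatumReader<GenericRecord>(schema).read(null, decoder);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
        encoder.flush();
        return out.toByteArray();
    }
}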

Thanks again
Jon


2013/4/6 Scott Carey <scottcarey@apache.org>
This is due to using the JSON encoding for Avro and not the binary encoding.
It would appear that the Python version is a little bit lax on the spec.
Some have built variations of the JSON encoding that do not label the union,
but there are drawbacks to this too, as the type can be ambiguous in a very
large number of cases without a label.

Why are you using the JSON encoding for Avro? The primary purpose of the
JSON serialization form, as it stands now, is to transform the binary into a
human-readable form.
Instead of building your GenericRecord from a JSON string, try using
GenericRecordBuilder.
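
A minimal sketch of that suggestion, assuming the User schema from the Avro
getting-started guide (the field values are the ones from this thread):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;

public class BuilderExample {
    public static void main(String[] args) {
        String userSchema = "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"favorite_number\",\"type\":[\"int\",\"null\"]},"
            + "{\"name\":\"favorite_color\",\"type\":[\"string\",\"null\"]}]}";
        Schema schema = new Schema.Parser().parse(userSchema);
        // The builder takes typed values directly; no union labels involved.
        GenericRecord user = new GenericRecordBuilder(schema)
            .set("name", "Alyssa")
            .set("favorite_number", 256)
            .set("favorite_color", "blue")
            .build();
        System.out.println(user);
    }
}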

-Scott
On 4/5/13 4:59 AM, "Jonathan Coveney" wrote:

Ok, I figured out the issue:

If you make string c the following:
String c = "{\"name\": \"Alyssa\", \"favorite_number\": {\"int\": 256},
\"favorite_color\": {\"string\": \"blue\"}}";

Then this works.

This represents a divergence between the Python and the Java
implementations: the labeled form above works in Java but not in Python,
and the unlabeled form works in Python but not in Java.
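
Side by side, the divergence looks like this (the unlabeled form is
inferred from the thread, not quoted from it):

    Python accepts: {"name": "Alyssa", "favorite_number": 256, "favorite_color": "blue"}
    Java accepts:   {"name": "Alyssa", "favorite_number": {"int": 256}, "favorite_color": {"string": "blue"}}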

I think I know how to fix this (and can file a bug with my reproduction and
the fix), but I'm not sure which behavior is expected. Which implementation
is wrong?

Thanks


2013/4/5 Jonathan Coveney <jcoveney@gmail.com>
Correction: the issue is when reading the string according to the Avro
schema, not on writing. It fails before I get a chance to write :)


2013/4/5 Jonathan Coveney <jcoveney@gmail.com>
I implemented essentially the Java Avro example, but using the
GenericDatumWriter and GenericDatumReader, and hit an issue.

https://gist.github.com/jcoveney/5317904

This is the error:
Exception in thread "main" java.lang.RuntimeException: org.apache.avro.AvroTypeException: Expected start-union. Got VALUE_NUMBER_INT
    at com.spotify.hadoop.mapred.Hrm.main(Hrm.java:45)
Caused by: org.apache.avro.AvroTypeException: Expected start-union. Got VALUE_NUMBER_INT
    at org.apache.avro.io.JsonDecoder.error(JsonDecoder.java:697)
    at org.apache.avro.io.JsonDecoder.readIndex(JsonDecoder.java:441)
    at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:229)
    at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
    at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:206)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:152)
    at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:177)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:148)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:139)
    at com.spotify.hadoop.mapred.Hrm.main(Hrm.java:38)

Am I doing something wrong? Is this a bug? I'm digging in now, but I'm
curious whether anyone has seen this before.

I get the feeling I am working with Avro in a way that most people do not
:)
