I have some objects that were originally loaded into nested maps using
the JSON loader from elephantbird, and subsequently stored using
LZOPigStorage after various stages of processing.
When I later load these nested maps and pass them into a UDF for
modification, almost any attempt to reference a nested object throws a
ClassCastException saying that org.apache.pig.data.DataByteArray cannot
be cast to whatever type I'm trying to retrieve.
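
To make the failure concrete, here is a minimal Java sketch of the
pattern, written so that it compiles without Pig on the classpath.
ByteArrayStandIn is a hypothetical stand-in for
org.apache.pig.data.DataByteArray, and the key name and byte contents
are invented for illustration. It only shows the shape of the cast that
fails, under the assumption that the nested value arrives as an opaque
byte blob; it is not a description of what Pig does internally.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class NestedMapCastSketch {

    // Hypothetical stand-in for org.apache.pig.data.DataByteArray.
    static final class ByteArrayStandIn {
        final byte[] bytes;

        ByteArrayStandIn(String text) {
            this.bytes = text.getBytes(StandardCharsets.UTF_8);
        }
    }

    // Builds a record shaped like the one the UDF seems to receive:
    // the top level is a real Map, but the nested value is still bytes.
    // The key "user" and its contents are made up for this sketch.
    static Map<String, Object> reloadedRecord() {
        Map<String, Object> top = new HashMap<>();
        top.put("user", new ByteArrayStandIn("[name#kris]"));
        return top;
    }

    // Reports the runtime class of a value, which is handy to log from
    // inside a UDF before attempting any cast.
    static String describe(Object value) {
        return value == null ? "null" : value.getClass().getSimpleName();
    }

    // Attempts the cast that the UDF performs on a nested object and
    // reports whether it succeeded instead of letting the exception escape.
    @SuppressWarnings("unchecked")
    static boolean nestedCastSucceeds(Map<String, Object> record, String key) {
        try {
            Map<String, Object> nested = (Map<String, Object>) record.get(key);
            return nested != null;
        } catch (ClassCastException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Map<String, Object> record = reloadedRecord();
        System.out.println("nested value type: " + describe(record.get("user")));
        System.out.println("cast succeeded: " + nestedCastSucceeds(record, "user"));
    }
}
```

Logging the runtime class, as describe() does, is one way to confirm
which values come through as byte arrays rather than maps before
deciding how to handle them.
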
That said, if I tear the map apart in Pig Latin and pass all its bits
and pieces as arguments to the UDF for reassembly, these exceptions
don't happen. That's what I've been doing, but it gets pretty ugly and
complicated at times (especially when the layers of nesting involve
arrays that I have to FLATTEN out and then group back together).
Looking at the stored data, I've noticed that nested maps are serialized
differently than the top-level map, so I've developed a suspicion that
there's some extra bit of magic that Pig is using to parse/cast these
sub-maps that I just don't know about (to be able to call it in my
UDFs). If such a beast exists, and someone could provide me with a
pointer, I'd appreciate it.
Also, the cluster I'm working on right now is using version 0.8.1-cdh3u4
of Pig. I poked around JIRA to see if this issue has been previously
observed (and if I should bug the relevant ops-folk that much harder to
upgrade our Pig), but found nothing. If this has somehow been fixed in a
later version (or perhaps if someone can recommend a storage class that
doesn't cause this problem in the first place), that pointer would also
be very much appreciated.
Kris Coward http://unripe.melon.org/
GPG Fingerprint: 2BF3 957D 310A FEEC 4733 830E 21A4 05C7 1FEB 12B3