Grokbase Groups Pig user January 2011
FAQ
I wonder if discussion of the Piggybank and other User Defined Functions (UDFs) is
best done here (since it is *using* Pig) or on the Development list (because
it is enhancing Pig).

I'm trying to load some JSON into Pig using the PigJsonLoader.java UDF which
Kim Vogt posted about back in September. (It isn't in Piggybank, AFAICS.)
https://gist.github.com/601331


The class works for me - mostly....


This works when the JSON is just a single level:

{"field1": "value1", "field2": "value2", "field3": "value3"}

But it doesn't seem to work when the JSON is nested (e.g. with field2
holding an object):

{"field1": "value1", "field2": {"field4": "value4", "field5":
"value5", "field6": "value6"}, "field3": "value3"}

Has anyone got this working? I can't see how the existing code deals with
nesting: parseStringToTuple only creates a single Map, and there is no
recursion that I can see.
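For what it's worth, the missing piece is a recursive conversion step. Here is a sketch in Python (illustrative only; the actual UDF is Java, and the input string is made up): a loader that handles nesting must descend into inner JSON objects instead of producing one flat Map.

```python
import json

def to_nested_map(value):
    """Recursively convert parsed JSON into nested maps,
    instead of flattening everything into one level."""
    if isinstance(value, dict):
        # Recurse into nested objects -- this is the step
        # parseStringToTuple appears to be missing.
        return {k: to_nested_map(v) for k, v in value.items()}
    if isinstance(value, list):
        return [to_nested_map(v) for v in value]
    return value  # scalars pass through unchanged

record = json.loads(
    '{"field1": "value1", "field2": {"field4": "value4"}, "field3": "value3"}'
)
nested = to_nested_map(record)
```

In the Java UDF the same shape of recursion would apply to the JSONObject that json-simple returns, converting each nested JSONObject into a Pig Map rather than a string.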



Any suggestions?


  • Jacob Perkins at Jan 29, 2011 at 1:43 pm
    Alex,

    It's a hack (sort of), but here's how I always do it, since parsing JSON
    in Java will put you in an insane asylum:

    Write a map-only Wukong script that parses the JSON the way you want it.
    See the example here:

    http://thedatachef.blogspot.com/2011/01/processing-json-records-with-hadoop-and.html

    Then use the STREAM operator to stream your raw records (load them as
    chararrays first) through your Wukong script. It's not perfect, but it
    gets the job done.
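    Wukong scripts are Ruby; the same map-only idea can be sketched as a plain
    Hadoop Streaming mapper (Python here for illustration; the field names are
    made up):

    ```python
    import json
    import sys

    def map_record(line):
        """Parse one JSON record and emit the fields we care about,
        tab-separated, so Pig can split them after STREAM."""
        record = json.loads(line)
        nested = record.get("field2", {})  # hypothetical nested object
        return "\t".join([record.get("field1", ""), nested.get("field4", "")])

    if __name__ == "__main__":
        # Hadoop Streaming contract: records on stdin, fields on stdout.
        for raw in sys.stdin:
            if raw.strip():
                print(map_record(raw.strip()))
    ```

    In Pig you would load each raw record as a single chararray, pass the
    relation through the script with the STREAM operator, and use
    DEFINE ... SHIP(...) to distribute the script file to the task nodes.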

    --jacob
    @thedatachef

  • Alex McLintock at Jan 30, 2011 at 8:09 pm

    On 29 January 2011 13:43, Jacob Perkins wrote:
    Write a map-only Wukong script that parses the JSON the way you want it.
    See the example here:


    http://thedatachef.blogspot.com/2011/01/processing-json-records-with-hadoop-and.html
    Hi Jacob,

    Thanks very much for helping me out; I hadn't heard of Wukong before.
    I am a bit concerned, though, about adding Ruby to my tool stack as well
    as Pig. It seems like a step too far.
    Presumably I have to distribute Ruby and Wukong across all my job nodes,
    in the same way as if I were writing Perl or C++ streaming programs.

    With STREAMing, the script is launched once per file, right, not once per
    record?

    Alex
  • Jacob Perkins at Jan 30, 2011 at 9:01 pm
    Yes, you would have to distribute Ruby (though it's typically
    installed by default) as well as the Wukong and JSON libraries to all
    the nodes in the cluster. Unfortunately this isn't something Wukong
    gives you for free at the moment, though it is planned.

    As far as I know, Pig doesn't do anything more complex than launch a
    Hadoop streaming job and use the output in the subsequent steps.

    BTW, I write 90% of my MR jobs using either Wukong or Pig. Only when
    it's absolutely required do I use a language with as much overhead as
    Java :)

    --jacob
    @thedatachef

    Sent from my iPhone
  • Harsh J at Jan 30, 2011 at 10:24 pm
    Hello,

    On Sat, Jan 29, 2011 at 5:42 PM, Alex McLintock wrote:

    But doesn't seem to work when the JSON is nested

    {"field1": "value1", "field2": {"field4": "value4", "field5":
    "value5", "field6": "value6"}, "field3": "value3"}
    The json-simple library for Java will build the entire JSON
    representation as a JSONObject, which is _exactly_ what you need. This
    is a Java Map-like class which would contain your structure properly.
    What remains is to properly convert this to a Pig-acceptable Map
    structure.

    But what's happening in Vogt's code (and also Elephant-Bird's
    LzoJsonLoader, from which it was sourced) is that the Map is
    down-converted to a simple key-value mapping instead of a Map
    containing another Map. This was done due to a limitation in Pig
    0.6.0, where the Map type could not hold complex types in it, as
    noted in the latter class's javadoc [1].

    This limitation has gone away in 0.7.0+, I think (the Pig Map spec
    now supports <String, {Atom, Tuple, Bag, Map}>), so you can feel free
    to change or get rid of the iteration inside parseStringToTuple(...)
    so that it does not 'flatten' the Map.
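    The down-conversion Harsh describes can be shown in Python (illustrative
    only; the real loaders are Java): the 0.6-era approach stringifies every
    value, losing the nesting, while a 0.7.0+ loader can keep inner maps as
    maps.

    ```python
    import json

    def flatten_like_0_6(obj):
        # LzoJsonLoader-style down-conversion: every value becomes a
        # string, so a nested object degrades to its string form.
        return {k: str(v) for k, v in obj.items()}

    def keep_nested(obj):
        # With Pig 0.7.0+ maps-of-maps, nested dicts can be preserved.
        return {k: keep_nested(v) if isinstance(v, dict) else v
                for k, v in obj.items()}

    record = json.loads('{"a": "1", "b": {"c": "2"}}')
    ```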

    Additionally, I think the json-simple dependency could perhaps be removed
    in favor of the Jackson Core/Mapper libraries that are now shipped
    by Hadoop itself (eliminating an extra JAR); Pig does not ship the
    json-simple library along. You may want to be careful about the
    version of Jackson Core/Mapper in place inside your Hadoop, though:
    much more recent releases are available, with benefits.

    Perhaps, if you feel like it, you can contribute your change back to
    elephant-bird [2]. I think they're open to newer-Pig-related changes.

    [1] - https://github.com/kevinweil/elephant-bird/blob/master/src/java/com/twitter/elephantbird/pig/load/LzoJsonLoader.java
    [2] - https://github.com/kevinweil/elephant-bird

    --
    Harsh J
    www.harshj.com

Discussion Overview
group: user@pig.apache.org · categories: pig, hadoop
posted: Jan 29, 2011 · active: Jan 30, 2011 · posts: 5 · users: 3
website: pig.apache.org