I have written a generalized load func for nested JSON, but hit a wall.

Not sure how to access the nested data once in pig for something like the
following:

Original JSON:

{"body":[{"token":"foo2","hash":"-33333333333"},{"token":"bar2","hash":"-22222222222"}],"pmessgid":"559830","subject":[{"token":"fooo","hash":"111111"},{"token":"bar","hash":"999999"}],"userid":"77274","messageid":"559837","threadid":"104997"}


Dump of tuple.toString() to System.out from my LoadFunc (after generating
the tuple with a custom load func: a recursive JSON-walking mechanism that
generates nested maps and tuples):

([body#([token#foo2,hash#-33333333333],[token#bar2,hash#-22222222222]),subject#([token#fooo,hash#111111],[token#bar,hash#999999]),userid#77274,messageid#559837,threadid#104997,pmessgid#559830])


So far so good, I can produce the right data structure in code, and when I
dump it via the toString() it looks good!

**** My problem ->

So here is the schema in the example above:
Map<String,Object>, where Object is either a tuple (list) of
Map<String,String>s or just a String.


In my pig script, I can get this far:
A = LOAD '/jivepoc/jivecommunity/dbsqoop/usermessages-clean-features2' USING
com.proximal.pig.tools.JSONLoader() as (
json: map[]
);

If I don't qualify the map[] above, I can select an item from the map (say
'body'), and DESCRIBE reports:

certain_keys = FOREACH A GENERATE json#'body' AS b;
DESCRIBE certain_keys;
certain_keys: {b: bytearray}

Looks good: it is a bytearray if I don't further define what I have. But now
I'm stuck -> I need to load a much more detailed map[].
The problem is that the map[] values (as pointed out above) can be either a
String or a tuple of Map<String,String>s.

There is no typecasting, right? Am I missing something, or am I stuck?

Thanks
Lance



Additional info:

Code to do this:


  • Daniel Dai at Jan 18, 2011 at 8:00 pm
    Currently, we treat all map values as bytearray. However, if you project
    a map value later in the script, you have a chance to cast it at that
    point. E.g.:

    a = load '1.json' using JSONLoader() as (m:map[]);
    b = foreach a generate (map[])m#'key' as v;
    c = foreach b generate (long)v;

    But you cannot cast the map as a whole. This will be addressed in 0.9 as
    we are introducing a typed map.

    Daniel
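
Applied to the data in the original post, the cast-on-projection approach might look like the sketch below. The load path, loader class, and key names come from the post; the alias `ids` and the target types (chararray, long) are assumptions chosen for illustration. The casts from bytearray go through the loader's LoadCaster, so whether they succeed depends on how the custom JSONLoader stores its map values, which the thread does not show. The nested 'body' and 'subject' values are left out because the thread does not confirm how to cast them.

```pig
-- Sketch only: path, loader and keys are from the original post;
-- the alias "ids" and the target types are assumptions.
A = LOAD '/jivepoc/jivecommunity/dbsqoop/usermessages-clean-features2'
    USING com.proximal.pig.tools.JSONLoader() AS (json: map[]);

-- The map as a whole stays untyped; each scalar value is projected
-- out with # and cast at the point of use.
ids = FOREACH A GENERATE
    (chararray) json#'userid'    AS userid,
    (long)      json#'messageid' AS messageid,
    (long)      json#'threadid'  AS threadid;

DESCRIBE ids;
```

Casting per key keeps the loader generic: the script, not the LoadFunc, decides what type each value should have.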

Discussion Overview
group: user @
categories: pig, hadoop
posted: Jan 15, '11 at 12:13a
active: Jan 18, '11 at 8:00p
posts: 2
users: 2
website: pig.apache.org

2 users in discussion

Daniel Dai: 1 post Lance Riedel: 1 post
