Grokbase Groups Pig user June 2010
Hi,
I have written a UDF that sorts grouped data on a given field (in my case a
date field) and returns the sorted data in a databag. I want the UDF to pick
up the schema of the fields inside the input (which is a bag), and the
returned bag should carry that same schema.
In the outputSchema method the input schema is treated as a tuple schema, so
I have to wrap this schema in a tuple and put that tuple in a bag. My output
then looks like this:

grunt> grp_frds = GROUP gen_frds BY id;
grunt> grp_out = FOREACH grp_frds GENERATE FLATTEN(PartByDesc(gen_frds, 3));
-- (the second parameter is the field on which I want to sort my bag)
grunt> describe grp_out;
grp_out: {bag_of_tokenTuples::gen_frds: {id: long,dep_id: long,grp: int,date: chararray}}

In my case I don't want the date field any more, so in the next operator I
project only the remaining fields:

forc = FOREACH grp_out GENERATE FLATTEN(bag_of_tokenTuples::gen_frds.id) AS
id, FLATTEN(bag_of_tokenTuples::gen_frds.dep_id) AS dep_id,
FLATTEN(bag_of_tokenTuples::gen_frds.grp) AS grp;

All this works the way I want, but I expected the FLATTEN I apply to my UDF
to remove all the nesting, or at least to eliminate the "gen_frds" bag inside
"bag_of_tokenTuples" so that the fields sit directly under
bag_of_tokenTuples.
Looking for suggestions please.

Thanks
Syed Wasti
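
For readers following the thread, the sorting step described in the post can be sketched in plain Java. This is only an illustration under stated assumptions, not the poster's UDF and not the Pig API: the class SortBagSketch and the method sortByFieldDesc are hypothetical names, a bag is modelled as a list of rows, "Desc" in PartByDesc is assumed to mean descending order, and the date field is assumed to be a string in a format that sorts correctly as text (for example yyyy-MM-dd).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for the sorting logic; it does not use any Pig classes.
class SortBagSketch {

    // Returns a new list whose rows are ordered by the string field at
    // fieldIndex, largest value first. The input list is left unchanged.
    static List<List<Object>> sortByFieldDesc(List<List<Object>> bag, int fieldIndex) {
        List<List<Object>> sorted = new ArrayList<>(bag);
        sorted.sort(Comparator.comparing(
                (List<Object> row) -> (String) row.get(fieldIndex)).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        // Rows follow the (id, dep_id, grp, date) layout from the post;
        // the values themselves are made up for the example.
        List<List<Object>> bag = Arrays.asList(
                Arrays.<Object>asList(1L, 10L, 1, "2010-06-01"),
                Arrays.<Object>asList(1L, 11L, 2, "2010-06-11"),
                Arrays.<Object>asList(1L, 12L, 1, "2010-06-05"));
        for (List<Object> row : sortByFieldDesc(bag, 3)) {
            System.out.println(row);
        }
    }
}
```

The index 3 matches the second argument passed to PartByDesc in the post, which selects the date column when fields are counted from zero.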


  • Syed Wasti at Jun 12, 2010 at 9:06 pm
    Well, I think I should start with the first step.
    My UDF works on grouped data. I would like suggestions on how to retain
    the schema of the grouped input in my outputSchema method. Thanks.

    Regards
    Syed Wasti

group: user @
categories: pig, hadoop
posted: Jun 11, '10 at 11:29p
active: Jun 12, '10 at 9:06p
posts: 2
users: 1
website: pig.apache.org