Grokbase Groups Pig user May 2013
Hi,
   I have a dataset with three columns: group_id, position, and name. For
each group, I need to generate a concatenated string of all names, ordered
by their position. I can do this by sorting all the data by position (or by
group_id and position), then grouping by group_id, and finally
concatenating the names in each group. I have two questions:
1- Does this really work? In other words, does the GROUP BY operator retain
order?
2- What is the most efficient way to do it? Is it better, if possible, to
group first and then sort? If I order by the pair (group_id, position)
first, can I hint this to Pig to make the GROUP BY faster?
Thanks for your help.


Best regards,
Ahmed Eldawy


  • Cheolsoo Park at May 13, 2013 at 5:26 pm
    Hi Ahmed,

    Please try this:

    grped = GROUP foo BY group_id;
    sorted = FOREACH grped {
        ordered = ORDER foo BY position;
        -- MyUDF concatenates the strings in a bag
        GENERATE group, MyUDF(ordered.name);
    };

    What this will do is:
    1) Mappers will send all records with the same key to the same reducer.
    2) Each reducer will sort only the values of its own keys.

    In fact, it is possible for Pig to optimize this even further
    using secondary key sort optimization (i.e. Pig can remove the ORDER BY in
    the reducers and rely entirely on Hadoop secondary sorting instead). But
    there were some bugs with secondary key sort optimization for this case,
    and it was recently removed from trunk.

    Thanks,
    Cheolsoo
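
For readers new to the nested FOREACH, here is a rough Python sketch of what the script above computes for each group. It is only an illustration of the logic, not Pig code and not how Pig executes it. The function names, the comma separator, and the sample rows are assumptions made for this example; `concat_names` is a hypothetical stand-in for MyUDF, which the thread does not define.

```python
from collections import defaultdict


def concat_names(bag, sep=","):
    """Hypothetical stand-in for MyUDF: join the first field of each tuple in a bag."""
    return sep.join(str(t[0]) for t in bag)


def names_by_group(rows, sep=","):
    """rows: iterable of (group_id, position, name) tuples."""
    # Roughly: grped = GROUP foo BY group_id;
    groups = defaultdict(list)
    for group_id, position, name in rows:
        groups[group_id].append((position, name))

    result = {}
    for group_id, members in groups.items():
        # Roughly: ordered = ORDER foo BY position; (sorts within one group only)
        ordered = sorted(members, key=lambda m: m[0])
        # Roughly: GENERATE group, MyUDF(ordered.name);
        result[group_id] = concat_names([(name,) for _, name in ordered], sep)
    return result


# Sample rows invented for illustration; input order is deliberately shuffled.
rows = [(1, 2, "b"), (2, 1, "x"), (1, 1, "a"), (1, 3, "c")]
print(names_by_group(rows))  # {1: 'a,b,c', 2: 'x'}
```

The point of the sketch is that the sort happens inside each group after grouping, so the result does not depend on the order in which rows arrive.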
  • Ahmed Eldawy at May 13, 2013 at 7:49 pm
    Thanks Cheolsoo for your help. I'm still learning Pig and I didn't know
    about this nested structure. I'll try it and see how much performance I
    gain compared to my naive implementation.

    Best regards,
    Ahmed Eldawy

