Grokbase Groups Pig user May 2013
Hi Ahmed,

Please try this:

grped = GROUP foo BY group_id;
sorted = FOREACH grped {
     ordered = ORDER foo BY position;
     GENERATE group, MyUDF(ordered); -- MyUDF concatenates the strings in a bag
};

What this will do is:
1) Mappers send all records with the same key to the same reducer.
2) Each reducer sorts only the values of the keys it receives.
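
For illustration, here is a self-contained sketch of the same group-then-sort pattern. It assumes a tab-separated input file named names.tsv (a hypothetical path) holding the three columns from the question, and it swaps in Pig's built-in BagToString (available since Pig 0.11) for the custom UDF. Projecting a field out of an ordered bag keeps the order in practice, but treat that as an assumption to verify on your Pig version.

```pig
-- Hypothetical input path and schema; adjust to the real data.
foo = LOAD 'names.tsv' USING PigStorage('\t')
      AS (group_id:int, position:int, name:chararray);

-- Step 1: records with the same group_id go to the same reducer.
grped = GROUP foo BY group_id;

-- Step 2: each reducer sorts only the bags of the keys it receives.
result = FOREACH grped {
    ordered = ORDER foo BY position;
    -- BagToString stands in for MyUDF here: it joins the names with a comma.
    GENERATE group AS group_id, BagToString(ordered.name, ',') AS names;
};

DUMP result;
```

Each output row should pair a group_id with the names of that group joined in position order. No global ORDER BY over the whole dataset is involved.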

In fact, Pig can optimize this even further with the secondary key sort
optimization (i.e. Pig can remove the ORDER BY in the reducers and rely
entirely on Hadoop secondary sorting instead). But there were some bugs with
the secondary key sort optimization for this case, and it was recently
removed from trunk.


On Mon, May 13, 2013 at 7:52 AM, Ahmed Eldawy wrote:

I have a dataset with three columns: group_id, position, and name. For
each group, I need to generate a concatenated string of all names ordered
by their position. I can do this by sorting all data based on position, (or
group_id and position), then grouping them by group_id, and finally
concatenating names in each group. I have two questions here,
1- Does this really work? In other words, does the GROUP BY operator retain
2- What is the most efficient way to do it? Is it better, if possible, to
group first and then sort? Let's say I order by the pair (group_id,
position) first; can this be hinted to Pig to make the GROUP BY faster?
Thanks for your help

Best regards,
Ahmed Eldawy

Discussion Overview
group: user @
categories: pig, hadoop
posted: May 13, '13 at 5:26p
active: May 13, '13 at 7:49p

2 users in discussion

Ahmed Eldawy: 2 posts Cheolsoo Park: 1 post


