The way I would go about it is to stream the data through a script that separates it by department, then do your ORDER and LIMIT on each department's rows. In practice this should cut down the amount of data each sort has to handle and give you some normalization. There is also a way to do it with a GROUP first, but that seems more difficult to me.
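
For what it's worth, here is a minimal sketch of the per-department logic in plain Python, not Pig syntax. It only illustrates the "separate by department, then order and limit" idea. The function name top_n_per_department and the sample rows are made up for illustration, and the rows are assumed to be (employeeid, departmentid, salary) tuples as in the question.

```python
import heapq
from collections import defaultdict


def top_n_per_department(rows, n):
    """Return the n highest-salaried rows of each department.

    rows is an iterable of (employeeid, departmentid, salary) tuples.
    The name and signature are hypothetical, for illustration only.
    """
    # Step 1: separate the rows by department.
    by_dept = defaultdict(list)
    for employeeid, departmentid, salary in rows:
        by_dept[departmentid].append((employeeid, departmentid, salary))

    # Step 2: within each department, order by salary (descending)
    # and keep only the first n rows.
    result = []
    for departmentid in sorted(by_dept):
        result.extend(
            heapq.nlargest(n, by_dept[departmentid], key=lambda row: row[2])
        )
    return result


# Made-up sample data: (employeeid, departmentid, salary)
sample_rows = [
    (1, 10, 50000),
    (2, 10, 70000),
    (3, 10, 60000),
    (4, 20, 40000),
    (5, 20, 45000),
]

for row in top_n_per_department(sample_rows, 2):
    print(row)
```

Each department is sorted on its own here, so no single sort ever sees the whole input. This sketch holds everything in memory, so it only shows the shape of the computation and says nothing about how it would scale.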
Neil Kodner wrote:
I'm trying to perform a top-n query in Pig. For example's sake, let's say my
input data is
(employeeid, departmentid, salary).
I'm trying to get the n highest-salaried employees of each department.
I would start off by grouping the data by department but am not sure how to
sort and limit the grouped data before flattening it into my output rows.