On Jul 13, 2011 at 3:37 pm:
You can also try using a custom reducer script, as follows:
SELECT TRANSFORM(groupCol, metric, otherFieldYouCareAbout)
USING 'myGroupingReduceScript'
AS groupCol, metric, otherFieldYouCareAbout
FROM (SELECT groupCol, metric, otherFieldYouCareAbout
      FROM yourTable
      DISTRIBUTE BY groupCol
      SORT BY groupCol ASC, metric DESC) t;
All rows of a particular group go to the same instance of
myGroupingReduceScript, in sorted order. The script tracks the current
group and simply outputs the entire row. After it has output 50 rows for
the current group, it ignores the rest of the rows for the group.
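A minimal sketch of what myGroupingReduceScript could look like, assuming a Python streaming script (Hive's TRANSFORM accepts any executable) and assuming groupCol is the first tab-separated column; the function name and column layout are illustrative, not from the original email:

```python
#!/usr/bin/env python
# Sketch of a top-N-per-group streaming reducer for Hive TRANSFORM.
# Hive feeds tab-separated rows on stdin, already sorted by group
# (thanks to DISTRIBUTE BY / SORT BY); we emit at most n rows per group.
import sys

TOP_N = 50

def top_n_per_group(lines, n=TOP_N):
    current_group = None
    emitted = 0
    for line in lines:
        row = line.rstrip("\n")
        group = row.split("\t", 1)[0]  # groupCol is assumed to be column 1
        if group != current_group:     # a new group starts: reset the count
            current_group = group
            emitted = 0
        if emitted < n:                # pass the whole row through, up to n
            emitted += 1
            yield row

if __name__ == "__main__":
    for out in top_n_per_group(sys.stdin):
        print(out)
```

Because the input is sorted by group and then by metric descending, the first 50 rows seen for each group are exactly its top 50.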
The disadvantage of this approach is that all rows for a particular group go
to the same reducer. If you have millions of rows per group, that reducer may
become a bottleneck.
On Tue, Jul 12, 2011 at 11:58 PM, Igor Tatarinov wrote:
I would use a UDF. Your query would look something like this:
SELECT group_key, rank(group_key) AS rank
FROM (SELECT group_key, value
      FROM your_table
      DISTRIBUTE BY group_key
      SORT BY group_key, value DESC) t
HAVING rank <= 50;
The trick is to make the UDF keep track of the group key so that you could
reset your rank whenever a new group starts. Also, I am pretty sure you have
to put DISTRIBUTE/SORT BY into a subquery. If you put those in the main
query, the output data will be partitioned so your rank() won't work.
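A real Hive UDF would be written in Java; purely to illustrate the stateful logic it needs (the class and method names here are hypothetical), the idea is a counter that resets whenever the group key changes:

```python
# Illustration of the state a rank(group_key) UDF must keep.
# This relies on rows arriving sorted by group_key, which is why
# the DISTRIBUTE BY / SORT BY subquery is required.
class Rank:
    def __init__(self):
        self.last_key = None
        self.counter = 0

    def evaluate(self, group_key):
        if group_key != self.last_key:  # new group: restart the ranking
            self.last_key = group_key
            self.counter = 0
        self.counter += 1
        return self.counter
```

Each reducer instantiates its own copy of the UDF, which is safe here because DISTRIBUTE BY guarantees all rows of a group land on the same reducer.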
Not sure if the HAVING clause would work in Hive; if it doesn't, wrap the
query in one more subquery and use WHERE rank <= 50 instead.
Hope this helps.
On Mon, Jul 11, 2011 at 6:27 PM, Vipul Patel wrote:
I'm trying to run a query that will return the top 50 results within
each group. The table contains millions of records and there are over
100 unique values in the field I wish to group by -- consequently
using UNION to string together subqueries won't be very efficient.
Any suggestions would be very much appreciated.
Thank you in advance!
Dilip Antony Joseph
http://csgrad.blogspot.com
http://www.marydilip.info