Grokbase Groups Hive user July 2011
Hi,

I'm trying to run a query that will return the top 50 results within
each group. The table contains millions of records and there are over
100 unique values in the field I wish to group by -- consequently
using UNION to string together subqueries won't be very efficient.
Any suggestions would be very much appreciated.

Thank you in advance!

Vipul Patel


  • Igor Tatarinov at Jul 13, 2011 at 6:59 am
    I would use a UDF. Your query would look something like this:

    SELECT group_key, rank(group_key) AS rank
    FROM (
        SELECT group_key, value
        FROM your_table   -- table name not given in the thread; placeholder
        DISTRIBUTE BY group_key
        SORT BY group_key, value DESC
    ) t
    HAVING rank <= 50;

    The trick is to make the UDF keep track of the group key so that you could
    reset your rank whenever a new group starts. Also, I am pretty sure you have
    to put DISTRIBUTE/SORT BY into a subquery. If you put those in the main
    query, the output data will be partitioned so your rank() won't work.

    Not sure if the HAVING clause would work in Hive. You can use WHERE
    rank(group_key)<=50 instead.

    Hope this helps.
    igor
    decide.com
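[Editor's note: Hive UDFs are implemented in Java, but the stateful logic Igor describes can be sketched in Python for brevity. The class and method names are illustrative, not from the thread.]

```python
class RankUDF:
    """Sketch of the stateful rank() UDF: Hive calls evaluate() once per
    row, in DISTRIBUTE BY / SORT BY order within a reducer, and the
    counter resets each time the group key changes."""

    def __init__(self):
        self.last_key = None
        self.counter = 0

    def evaluate(self, group_key):
        if group_key != self.last_key:  # a new group starts: reset the rank
            self.last_key = group_key
            self.counter = 0
        self.counter += 1
        return self.counter
```

In an actual deployment this would be a Java class extending Hive's UDF base class, shipped with ADD JAR and registered via CREATE TEMPORARY FUNCTION rank AS '...'.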

  • Dilip Joseph at Jul 13, 2011 at 3:37 pm
    You can also try using a custom reducer script, as follows:

    FROM (
        SELECT groupCol, metric, otherFieldYouCareAbout
        FROM MyTable
        DISTRIBUTE BY groupCol
        SORT BY groupCol ASC, metric DESC
    ) t1
    REDUCE *
    USING 'myGroupingReduceScript.py'
    AS groupCol, metric, otherFieldYouCareAbout;


    All rows of a particular group go to the same instance of
    myGroupingReduceScript, in sorted order. The script tracks the current
    group and simply outputs the entire row. After it has output 50 rows for
    the current group, it ignores the rest of the rows for the group.

    The disadvantage of this approach is that all rows for a particular group go
    to the same reducer. If you have millions of rows per group, that may be
    slow.

    Dilip

    --
    _________________________________________
    Dilip Antony Joseph
    http://csgrad.blogspot.com
    http://www.marydilip.info

Discussion Overview
group: user
categories: hive, hadoop
posted: Jul 12, '11 at 1:27a
active: Jul 13, '11 at 3:37p
posts: 3
users: 3
website: hive.apache.org
