FAQ
I'm trying to use Hive to solve a fairly common SQL scenario that I run into. I have boiled the problem down into its most basic form:

You have a table of transactions defined as so: CREATE TABLE transactions (product_id INT, customer_id INT)
--------------------|
--Transactions------|
---product_id (INT)-|
---customer_id(INT)-|
--------------------|

****The goal is simple: For each product, produce a list of the top 5 largest customers.****

So, the base query would look like this:

SELECT product_id, customer_id, count(*) as products_bought
FROM transactions
GROUP BY product_id, customer_id

You could insert that value into another table called products_bought defined as:
CREATE TABLE prod_bought
(product_id INT, customer_id INT, products_bought INT)

Now you have an intermediate result that tells you how many times each customer bought each product. But, obviously, that doesn't completely solve the problem.

At this point, in order to solve the problem, you'd have to use a cursor or a CROSS APPLY. Here's an example in T-SQL:

--THE CURSOR METHOD:

DECLARE @productId int;
DECLARE product_cur CURSOR FOR
SELECT DISTINCT product_id
FROM transactions t

OPEN product_cur

FETCH product_cur into @productId

WHILE (@@FETCH_STATUS <> -1)
BEGIN

FETCH product_cur into @productId
INSERT top_customers_by_product
SELECT TOP 5 product_id, customer_id, products_bought
FROM prod_bought
WHERE product_id = @productId
ORDER BY products_bought desc


END
CLOSE Domains
DEALLOCATE Domains


--THE CROSS APPLY METHOD:

--First create a user defined function
CREATE FUNCTION dbo.fn_GetTopXCustomers(@ProductId INT)
RETURNS TABLE
AS
RETURN
SELECT TOP 5 product_id, customer_id, products_bought
FROM prod_bought
WHERE product_id = @productId
ORDER BY products_bought desc
GO

--Build a table of distinct product Ids
SELECT DISTINCT product_id INTO temp_distinct_product_ids FROM transactions

--Run the CROSS APPLY
SELECT A.product_id
, A.customer_id
, A.products_bought
INTO top_customers_by_product
FROM temp_distinct_product_ids T
CROSS APPLY dbo.fn_GetTopXCustomers(T.product_id) A


Okay, so there are two ways I could solve the problem in SQL (CROSS APPLY is dramatically faster for anyone that cares). How can I do the same thing in Hive? Here's the question restated: How can I implement a cursor in Hive? How can I do a for or while loop in Hive? Can I implement a CROSS APPLY in Hive?


I realize that I can implement a cursor outside of Hive and just execute the same Hive script over and over and over again. And, that's not a horrible solution as long as it leverages the full power of Hadoop. My concern is that each of the individual queries that is run inside are fairly inexpensive, but the total number of products makes the total job *very* expensive.

Also, the solution should be reusable -- I'd really prefer not to write a custom jar every time I run into this problem.

Actually, I'm also not particularly religious about using Hive. If there's some other tech that does what I need, that's cool too.

Thanks in advance.

Mike Roberts

Search Discussions

  • Zheng Shao at Jan 26, 2010 at 7:03 am
    We can use a combination of UDAF and LATERAL VIEW to implement what you want.

    1. Define a UDAF like this: max_n(5, products_bought, customer_id)
    which returns the top 5 products_bought and their customer_id in type
    of array<struct<col0:int,col1:int>>
    2. Use the Lateral views (with explode) to transform a single row into
    multiple rows.

    SELECT t.product_id, t5.products_bought, t5.customer_id
    FROM (
    SELECT product_id, max_n(5, products_bought, customer_id) as top5
    FROM temp
    GROUP BY product_id) t LATERAL VIEW explode(t.top5) t5 AS
    products_bought, customer_id;

    See http://wiki.apache.org/hadoop/Hive/LanguageManual/LateralView
    Paul is the author of UDTF and Lateral view. He might be able to give
    you more details.

    Zheng
    On Mon, Jan 25, 2010 at 10:47 PM, Mike Roberts wrote:


    I'm trying to use Hive to solve a fairly common SQL scenario that I run
    into.  I have boiled the problem down into its most basic form:



    You have a table of transactions defined as so:  CREATE TABLE transactions
    (product_id INT, customer_id INT)


    --------------------|
    --Transactions------|
    ---product_id (INT)-|
    ---customer_id(INT)-|
    --------------------|




    ****The goal is simple: For each product, produce a list of the top 5
    largest customers.****



    So, the base query would look like this:



    SELECT product_id, customer_id, count(*) as products_bought

    FROM transactions

    GROUP BY product_id, customer_id



    You could insert that value into another table called products_bought
    defined as:

    CREATE TABLE prod_bought

    (product_id INT, customer_id INT, products_bought INT)



    Now you have an intermediate result that tells you how many times each
    customer bought each product.  But, obviously, that doesn't completely solve
    the problem.



    At this point, in order to solve the problem, you'd have to use a cursor or
    a CROSS APPLY.  Here's an example in T-SQL:



    --THE CURSOR METHOD:



    DECLARE @productId int;

    DECLARE product_cur CURSOR FOR

    SELECT DISTINCT product_id

    FROM transactions t



    OPEN product_cur



    FETCH product_cur into @productId



    WHILE (@@FETCH_STATUS <> -1)

    BEGIN



    FETCH product_cur into @productId

    INSERT top_customers_by_product

    SELECT TOP 5 product_id, customer_id, products_bought

    FROM prod_bought

    WHERE product_id = @productId

    ORDER BY products_bought desc





    END

    CLOSE Domains

    DEALLOCATE Domains





    --THE CROSS APPLY METHOD:



    --First create a user defined function

    CREATE FUNCTION dbo.fn_GetTopXCustomers(@ProductId INT)

    RETURNS TABLE

    AS

    RETURN

    SELECT TOP 5 product_id, customer_id, products_bought

    FROM prod_bought

    WHERE product_id = @productId

    ORDER BY products_bought desc

    GO



    --Build a table of distinct product Ids

    SELECT DISTINCT product_id INTO temp_distinct_product_ids FROM transactions



    --Run the CROSS APPLY

    SELECT A.product_id

    , A.customer_id

    , A.products_bought

    INTO top_customers_by_product

    FROM temp_distinct_product_ids T

    CROSS APPLY dbo.fn_GetTopXCustomers(T.product_id) A





    Okay, so there are two ways I could solve the problem in SQL (CROSS APPLY is
    dramatically faster for anyone that cares).  How can I do the same thing in
    Hive?  Here's the question restated: How can I implement a cursor in Hive?
    How can I do a for or while loop in Hive?  Can I implement a CROSS APPLY in
    Hive?





    I realize that I can implement a cursor outside of Hive and just execute the
    same Hive script over and over and over again.  And, that's not a horrible
    solution as long as it leverages the full power of Hadoop.  My concern is
    that each of the individual queries that is run inside are fairly
    inexpensive, but the total number of products makes the total job *very*
    expensive.



    Also, the solution should be reusable -- I'd really prefer not to write a
    custom jar every time I run into this problem.



    Actually, I’m also not particularly religious about using Hive.  If there’s
    some other tech that does what I need, that’s cool too.



    Thanks in advance.



    Mike Roberts


    --
    Yours,
    Zheng
  • Mike Roberts at Jan 26, 2010 at 6:17 pm
    I need that max_n UDAF developed that Zheng suggests below. I have it open for bidding on Rentacoder here: http://www.rentacoder.com/RentACoder/misc/BidRequests/ShowBidRequest.asp?lngBidRequestId=1335150

    I'll post the working code when I get it.

    --Mike

    -----Original Message-----
    From: Zheng Shao
    Sent: Tuesday, January 26, 2010 12:03 AM
    To: hive-user@hadoop.apache.org; Paul Yang
    Subject: Re: How can I implement a cursor in Hive? ...or... Can I implement a CROSS APPLY in Hive?...or... How can I do a FOR or WHILE loop (inside or outside) of Hive?

    We can use a combination of UDAF and LATERAL VIEW to implement what you want.

    1. Define a UDAF like this: max_n(5, products_bought, customer_id)
    which returns the top 5 products_bought and their customer_id in type
    of array<struct<col0:int,col1:int>>
    2. Use the Lateral views (with explode) to transform a single row into
    multiple rows.

    SELECT t.product_id, t5.products_bought, t5.customer_id
    FROM (
    SELECT product_id, max_n(5, products_bought, customer_id) as top5
    FROM temp
    GROUP BY product_id) t LATERAL VIEW explode(t.top5) t5 AS
    products_bought, customer_id;

    See http://wiki.apache.org/hadoop/Hive/LanguageManual/LateralView
    Paul is the author of UDTF and Lateral view. He might be able to give
    you more details.

    Zheng
    On Mon, Jan 25, 2010 at 10:47 PM, Mike Roberts wrote:


    I'm trying to use Hive to solve a fairly common SQL scenario that I run
    into.  I have boiled the problem down into its most basic form:



    You have a table of transactions defined as so:  CREATE TABLE transactions
    (product_id INT, customer_id INT)


    --------------------|
    --Transactions------|
    ---product_id (INT)-|
    ---customer_id(INT)-|
    --------------------|




    ****The goal is simple: For each product, produce a list of the top 5
    largest customers.****



    So, the base query would look like this:



    SELECT product_id, customer_id, count(*) as products_bought

    FROM transactions

    GROUP BY product_id, customer_id



    You could insert that value into another table called products_bought
    defined as:

    CREATE TABLE prod_bought

    (product_id INT, customer_id INT, products_bought INT)



    Now you have an intermediate result that tells you how many times each
    customer bought each product.  But, obviously, that doesn't completely solve
    the problem.



    At this point, in order to solve the problem, you'd have to use a cursor or
    a CROSS APPLY.  Here's an example in T-SQL:



    --THE CURSOR METHOD:



    DECLARE @productId int;

    DECLARE product_cur CURSOR FOR

    SELECT DISTINCT product_id

    FROM transactions t



    OPEN product_cur



    FETCH product_cur into @productId



    WHILE (@@FETCH_STATUS <> -1)

    BEGIN



    FETCH product_cur into @productId

    INSERT top_customers_by_product

    SELECT TOP 5 product_id, customer_id, products_bought

    FROM prod_bought

    WHERE product_id = @productId

    ORDER BY products_bought desc





    END

    CLOSE Domains

    DEALLOCATE Domains





    --THE CROSS APPLY METHOD:



    --First create a user defined function

    CREATE FUNCTION dbo.fn_GetTopXCustomers(@ProductId INT)

    RETURNS TABLE

    AS

    RETURN

    SELECT TOP 5 product_id, customer_id, products_bought

    FROM prod_bought

    WHERE product_id = @productId

    ORDER BY products_bought desc

    GO



    --Build a table of distinct product Ids

    SELECT DISTINCT product_id INTO temp_distinct_product_ids FROM transactions



    --Run the CROSS APPLY

    SELECT A.product_id

    , A.customer_id

    , A.products_bought

    INTO top_customers_by_product

    FROM temp_distinct_product_ids T

    CROSS APPLY dbo.fn_GetTopXCustomers(T.product_id) A





    Okay, so there are two ways I could solve the problem in SQL (CROSS APPLY is
    dramatically faster for anyone that cares).  How can I do the same thing in
    Hive?  Here's the question restated: How can I implement a cursor in Hive?
    How can I do a for or while loop in Hive?  Can I implement a CROSS APPLY in
    Hive?





    I realize that I can implement a cursor outside of Hive and just execute the
    same Hive script over and over and over again.  And, that's not a horrible
    solution as long as it leverages the full power of Hadoop.  My concern is
    that each of the individual queries that is run inside are fairly
    inexpensive, but the total number of products makes the total job *very*
    expensive.



    Also, the solution should be reusable -- I'd really prefer not to write a
    custom jar every time I run into this problem.



    Actually, I'm also not particularly religious about using Hive.  If there's
    some other tech that does what I need, that's cool too.



    Thanks in advance.



    Mike Roberts


    --
    Yours,
    Zheng

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupuser @
categorieshive, hadoop
postedJan 26, '10 at 6:48a
activeJan 26, '10 at 6:17p
posts3
users2
websitehive.apache.org

2 users in discussion

Mike Roberts: 2 posts Zheng Shao: 1 post

People

Translate

site design / logo © 2022 Grokbase