You have a table of transactions defined as so: CREATE TABLE transactions (product_id INT, customer_id INT)
--------------------|
--Transactions------|
---product_id (INT)-|
---customer_id(INT)-|
--------------------|
--Transactions------|
---product_id (INT)-|
---customer_id(INT)-|
--------------------|
****The goal is simple: For each product, produce a list of the top 5 largest customers.****
So, the base query would look like this:
SELECT product_id, customer_id, count(*) as products_bought
FROM transactions
GROUP BY product_id, customer_id
You could insert that value into another table called products_bought defined as:
CREATE TABLE prod_bought
(product_id INT, customer_id INT, products_bought INT)
Now you have an intermediate result that tells you how many times each customer bought each product. But, obviously, that doesn't completely solve the problem.
At this point, in order to solve the problem, you'd have to use a cursor or a CROSS APPLY. Here's an example in T-SQL:
--THE CURSOR METHOD:
DECLARE @productId int;
DECLARE product_cur CURSOR FOR
SELECT DISTINCT product_id
FROM transactions t
OPEN product_cur
FETCH product_cur into @productId
WHILE (@@FETCH_STATUS <> -1)
BEGIN
FETCH product_cur into @productId
INSERT top_customers_by_product
SELECT TOP 5 product_id, customer_id, products_bought
FROM prod_bought
WHERE product_id = @productId
ORDER BY products_bought desc
END
CLOSE Domains
DEALLOCATE Domains
--THE CROSS APPLY METHOD:
--First create a user defined function
CREATE FUNCTION dbo.fn_GetTopXCustomers(@ProductId INT)
RETURNS TABLE
AS
RETURN
SELECT TOP 5 product_id, customer_id, products_bought
FROM prod_bought
WHERE product_id = @productId
ORDER BY products_bought desc
GO
--Build a table of distinct product Ids
SELECT DISTINCT product_id INTO temp_distinct_product_ids FROM transactions
--Run the CROSS APPLY
SELECT A.product_id
, A.customer_id
, A.products_bought
INTO top_customers_by_product
FROM temp_distinct_product_ids T
CROSS APPLY dbo.fn_GetTopXCustomers(T.product_id) A
Okay, so there are two ways I could solve the problem in SQL (CROSS APPLY is dramatically faster for anyone that cares). How can I do the same thing in Hive? Here's the question restated: How can I implement a cursor in Hive? How can I do a for or while loop in Hive? Can I implement a CROSS APPLY in Hive?
I realize that I can implement a cursor outside of Hive and just execute the same Hive script over and over and over again. And, that's not a horrible solution as long as it leverages the full power of Hadoop. My concern is that each of the individual queries that is run inside are fairly inexpensive, but the total number of products makes the total job *very* expensive.
Also, the solution should be reusable -- I'd really prefer not to write a custom jar every time I run into this problem.
Actually, I'm also not particularly religious about using Hive. If there's some other tech that does what I need, that's cool too.
Thanks in advance.
Mike Roberts