Grokbase Groups Pig user June 2010
Hi Thejas
Thanks, with your input I managed to work it out. Here's my solution. Hope it's useful to someone.

/* load in the log file */
log = LOAD 'example-user-tracking.txt' AS (user:chararray, page:chararray);

/* generate the list of unique users grouped by page */
unique_log = DISTINCT log;
g_users = GROUP unique_log BY page;
l_page_users = FOREACH g_users GENERATE group AS page, FLATTEN(unique_log.user) AS user;

/* generate a list of pages grouped by users, and add on a counter */
l_counted = FOREACH log GENERATE user, page, 1 AS counter;
g_userpages = GROUP l_counted BY (user,page);
l_userpages = FOREACH g_userpages GENERATE FLATTEN (group), COUNT(l_counted) AS occurs;

/* joined the 2 lists together using user as key */
joined = JOIN l_page_users BY user, l_userpages BY group::user;
l_page_page = FOREACH joined GENERATE l_page_users::page AS page1, l_userpages::group::page AS page2, l_userp
ages::occurs AS occurs;
g_page_page = GROUP l_page_page BY (page1, page2);
/* generate a list showing the number of occurence of starting page moving to landing page */
result = FOREACH g_page_page GENERATE group, SUM(l_page_page.occurs) AS occurs;

/* display the result */
DUMP result;

If someone is able to optimize this, please do share. Not sure if my version is the best way to achieve the result.

-----Original Message-----
From: Thejas Nair <>
To: <>;
Sent: Thu, Jul 1, 2010 6:10 am
Subject: Re: Help with writing Pig Query

Does this do what you want ? -

L1 = LOAD 'example-users.txt' AS (user:chararray, page:chararray);
L2 = LOAD 'example-users.txt' AS (user:chararray, page:chararray);
-- as of current version of pig, you need to use two different loads for
self join

J = join L1 by page, L2 by page; -- self join on page
F1 = foreach j generate L1::page as p1, L2::page as p2;
G = group F1 by p1,p2;
F2 = foreach G generate group.p1 as p1, group.p2 as p2 , COUNT(F1) as
visitcount; -- now you have the number of times user who visited p1 has
visited p2

O = order F2 by p1, visitcount;
dump O; -- you results

I haven't checked the syntax of above query.

One optimization you can do to reduce the output size of join, is to do a
group-by on user,page , then generate the count. Then do self-join on that
result, replace COUNT(F1) in F2(above) with SUM(F1.cnt)


On 6/29/10 11:37 PM, "" wrote:

I'm absolutely new with using Pig, only just picked it up like 3 days ago, and
still trying to wrap my head around it. I'm stuck with putting together a

A DUMP of my sample dataset is as follows,

log = LOAD 'example-users.txt' AS (user:chararray, page:chararray);
DUMP log;


What I'm trying to do is to say, Users visiting page 'a' also visited this
list of other pages ranked by number of times the page was visited. Can anyone
help or give me some guidance?


Search Discussions

Discussion Posts


Related Discussions

Discussion Navigation
viewthread | post
posts ‹ prev | 3 of 3 | next ›
Discussion Overview
groupuser @
categoriespig, hadoop
postedJun 30, '10 at 4:41p
activeJul 1, '10 at 3:35p

2 users in discussion

Diagnostix: 2 posts Thejas Nair: 1 post



site design / logo © 2022 Grokbase