Grokbase Groups Pig user June 2010
Does this do what you want?

L1 = LOAD 'example-users.txt' AS (user:chararray, page:chararray);
L2 = LOAD 'example-users.txt' AS (user:chararray, page:chararray);
-- as of the current version of Pig, you need to use two different loads for a
-- self join

J = join L1 by user, L2 by user; -- self join on user, pairing the pages each user visited
F1 = foreach J generate L1::page as p1, L2::page as p2;
G = group F1 by (p1, p2);
F2 = foreach G generate group.p1 as p1, group.p2 as p2, COUNT(F1) as visitcount;
-- now you have the number of times users who visited p1 have also visited p2

O = order F2 by p1, visitcount desc; -- most-visited pages first for each p1
dump O; -- your results

I haven't checked the syntax of the above query.

One optimization to reduce the output size of the join is to first do a
group-by on (user, page) and generate the count. Then do the self-join on that
result and replace COUNT(F1) in F2 (above) with SUM(F1.cnt).
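
A rough, untested sketch of that optimization might look like the following. The alias names (A1, C1 and so on) are made up for illustration. The self join here is on user, so that pages are paired per visitor. Multiplying the two per-user counts is my own assumption, made so that the total matches what COUNT(F1) would give without the pre-aggregation; the note above only says to use SUM(F1.cnt).

```pig
-- Sketch only, not run against a cluster.
-- Assumes the same two-column input file as the query above.
A1 = LOAD 'example-users.txt' AS (user:chararray, page:chararray);
A2 = LOAD 'example-users.txt' AS (user:chararray, page:chararray);

-- Collapse repeated visits: one row per (user, page) with a visit count.
G1 = GROUP A1 BY (user, page);
C1 = FOREACH G1 GENERATE group.user AS user, group.page AS page, COUNT(A1) AS cnt;
G2 = GROUP A2 BY (user, page);
C2 = FOREACH G2 GENERATE group.user AS user, group.page AS page, COUNT(A2) AS cnt;

-- Self join the (smaller) aggregated relations on user.
J = JOIN C1 BY user, C2 BY user;

-- Assumption: the product of the two counts equals the number of rows the
-- unaggregated join would have produced for this user and page pair.
F1 = FOREACH J GENERATE C1::page AS p1, C2::page AS p2, C1::cnt * C2::cnt AS cnt;

G = GROUP F1 BY (p1, p2);
F2 = FOREACH G GENERATE group.p1 AS p1, group.p2 AS p2, SUM(F1.cnt) AS visitcount;

O = ORDER F2 BY p1, visitcount DESC;
DUMP O;
```

The join then sees at most one row per user and page on each side, instead of one row per visit, which is where the saving comes from.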


On 6/29/10 11:37 PM, "" wrote:

I'm absolutely new with using Pig, only just picked it up like 3 days ago, and
still trying to wrap my head around it. I'm stuck with putting together a

A DUMP of my sample dataset is as follows,

log = LOAD 'example-users.txt' AS (user:chararray, page:chararray);
DUMP log;


What I'm trying to do is to say, Users visiting page 'a' also visited this
list of other pages ranked by number of times the page was visited. Can anyone
help or give me some guidance?


Discussion Overview
group: user @
categories: pig, hadoop
posted: Jun 30, '10 at 4:41p
active: Jul 1, '10 at 3:35p

2 users in discussion: Diagnostix (2 posts), Thejas Nair (1 post)