Hi,
Let's say you have a file with the columns userid, username, location, and amount.
To count the total number of users:
A = LOAD 'myfile' as (userid:long, username:chararray, location:chararray,
amount:long);
G = GROUP A ALL PARALLEL 40;
R = FOREACH G GENERATE COUNT($1);
dump R;
To count the number of users by location:
A = LOAD 'myfile' as (userid:long, username:chararray, location:chararray,
amount:long);
G = GROUP A BY location PARALLEL 40;
R = FOREACH G GENERATE FLATTEN(group), COUNT($1);
dump R;
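One caveat: COUNT counts records, so the two scripts above give user counts only if each userid appears on a single line. If a user can appear on several lines, something like the following nested FOREACH should count distinct users per location instead. This is a rough sketch; the aliases ids and uniq are just names I made up.

```pig
A = LOAD 'myfile' AS (userid:long, username:chararray, location:chararray,
                      amount:long);
G = GROUP A BY location;
R = FOREACH G {
    -- Project the userid column out of the bag of grouped records.
    ids  = A.userid;
    -- Drop repeated userids within this location.
    uniq = DISTINCT ids;
    GENERATE group AS location, COUNT(uniq) AS usercount;
};
DUMP R;
```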
To get the record count and the sum of amount per (location, userid) pair:
A = LOAD 'myfile' as (userid:long, username:chararray, location:chararray,
amount:long);
G = GROUP A BY (location, userid) PARALLEL 40;
R = FOREACH G GENERATE FLATTEN(group), COUNT($1) as usercount,
SUM($1.amount) as useramount;
dump R;
NOTE: PARALLEL is set to 40 only as an example. Choose the value yourself;
the right setting depends on your cluster setup, data volume, etc.
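If you would rather not repeat PARALLEL on every statement, a script-wide default may be an option. A minimal sketch, assuming your Pig release supports the default_parallel property (older releases may not); the value 20 is arbitrary:

```pig
-- Assumption: this Pig release supports the default_parallel property.
SET default_parallel 20;

A = LOAD 'myfile' AS (userid:long, username:chararray, location:chararray,
                      amount:long);
-- No PARALLEL clause here, so the script-wide default applies.
G = GROUP A BY location;
R = FOREACH G GENERATE group, COUNT(A);
DUMP R;
```

A PARALLEL clause on an individual statement still takes precedence over the default.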
To count, you always GROUP first, either ALL or BY <column name>.
Then use FOREACH to generate COUNT($1), where $1 is the bag of grouped records.
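Put together for the original question (count the records in a relation), the pattern might look like this. It is only a sketch: it reuses the example file and schema from above, and refers to the bag by its name (A) rather than by position ($1), which should be equivalent.

```pig
A = LOAD 'myfile' AS (userid:long, username:chararray, location:chararray,
                      amount:long);
-- Put every record into a single group; G has two fields: (group, A).
G = GROUP A ALL;
-- The second field ($1) is the bag named A, so COUNT(A) is the same as COUNT($1).
R = FOREACH G GENERATE COUNT(A) AS total;
-- Prints a single tuple holding the record count.
DUMP R;
```

Keep in mind that COUNT skips records whose first field is null. If those should be counted too, and your Pig version ships it, COUNT_STAR is the variant that includes them.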
Hope this helps,
-----Original Message-----
From: Anze
Sent: Friday, October 29, 2010 12:01 PM
To: user@pig.apache.org
Subject: relations count
Hi!
I hope this is not too newbie question, but it's driving me crazy... How do
you count the records in a relation? Like DUMP, but instead of list of
records, I would like their count.
Thanks,
Anze