I'm becoming a bit more comfortable writing scripts, but I'm still not always
sure how best to structure my statements to optimize performance. With SPLIT
and FILTER, for example, one could filter the raw data several times or
condense the filters into a single SPLIT statement, but the docs don't make
clear which is the best practice. Below is my script as it stands. Your input
would be greatly appreciated.

-- Queries for August by Day/Month/Week
REGISTER s3://kikin-pig-test/udfs/mypigudfs.jar;
raw = LOAD 'data' AS (timestamp:chararray, ip:chararray, userid:chararray);
dailyraw = FOREACH raw GENERATE userid, mypigudfs.ExtractDay(timestamp) AS day;
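-- matches takes a Java regular expression that must match the whole string,
-- so "contains BROKEN" is written as '.*BROKEN.*' rather than '*BROKEN*'.
SPLIT dailyraw INTO
    broken IF (userid matches '.*BROKEN.*'),
    noperm IF (userid matches '.*NOPERM.*'),
    daily IF (NOT ((userid matches '.*BROKEN.*') OR (userid matches '.*NOPERM.*')));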
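-- For comparison, a sketch of the multiple-FILTER form mentioned above. It is
-- left commented out because it would define the same three aliases as the
-- SPLIT statement. It assumes the patterns are meant as "contains" tests in
-- Java regex syntax, and it makes no claim about which form runs faster.
-- broken = FILTER dailyraw BY userid matches '.*BROKEN.*';
-- noperm = FILTER dailyraw BY userid matches '.*NOPERM.*';
-- daily  = FILTER dailyraw BY NOT ((userid matches '.*BROKEN.*')
--              OR (userid matches '.*NOPERM.*'));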
-- Daily Count(s)
daygrp = GROUP daily BY day PARALLEL 36;
daycnts = FOREACH daygrp GENERATE group, COUNT(daily);
-- NoPerm
npgrp = GROUP noperm BY day;
npcnts = FOREACH npgrp GENERATE group, COUNT(noperm);
--Broken
brkgrp = GROUP broken BY day;
brkcnts = FOREACH brkgrp GENERATE group, COUNT(broken);
-- Weekly Count(s)
weekly = FOREACH daily GENERATE userid, mypigudfs.ExtractWeek(day) AS week;
wkgrp = GROUP weekly BY week PARALLEL 36;
wkcnts = FOREACH wkgrp GENERATE group, COUNT(weekly);
--Broken
broken2 = FOREACH broken GENERATE userid, mypigudfs.ExtractWeek(day) AS week;
brkgrp2 = GROUP broken2 BY week;
brkcnts2 = FOREACH brkgrp2 GENERATE group, COUNT(broken2);
--NoPerm
noperm2 = FOREACH noperm GENERATE userid, mypigudfs.ExtractWeek(day) AS week;
npgrp2 = GROUP noperm2 BY week;
npcnts2 = FOREACH npgrp2 GENERATE group, COUNT(noperm2);
-- Monthly Count
month = GROUP weekly ALL;
mcnt = FOREACH month GENERATE COUNT(weekly);
npmonth = GROUP noperm2 ALL;
npmcnt = FOREACH npmonth GENERATE COUNT(noperm2);
brkmonth = GROUP broken2 ALL;
brkmcnt = FOREACH brkmonth GENERATE COUNT(broken2);
-- Store Output
--
Zaki Rahaman