Grokbase Groups Pig user August 2009
I saw a thread on the listserv about doing a distinct count in a nested
foreach. I'm not sure I followed exactly what was meant, but below is my
script. Any suggestions on optimizations? (It's my first stab at a "real" Pig
script, even though I know it's somewhat trivial.)

REGISTER /opt/analytics_env/pig/mypigudfs.jar;

-- Load query logs for the month of August
rawdata = LOAD '/opt/analytics_env/processed_logs/*.queryLog.2009-08*' AS
    (timestamp:chararray, ip:chararray, userid:chararray);
dailyraw = FOREACH rawdata GENERATE userid, mypigudfs.ExtractDay(timestamp)
    AS day;
daily = DISTINCT dailyraw;
daygrp = GROUP daily BY day;
daycnts = FOREACH daygrp GENERATE group, COUNT(daily);

weeklyraw = FOREACH rawdata GENERATE userid,
    mypigudfs.ExtractWeek(timestamp) AS week;
weekly = DISTINCT weeklyraw;
wkgrp = GROUP weekly BY week;
wkcnts = FOREACH wkgrp GENERATE group, COUNT(weekly);

rawmonth = FOREACH rawdata GENERATE userid;
monthly = DISTINCT rawmonth;
mnth = GROUP monthly ALL;
mcnt = FOREACH mnth GENERATE COUNT(monthly);

STORE daycnts INTO '/opt/analytics_env/pig/testdump/dailycounts' USING
    PigStorage('\t');
STORE wkcnts INTO '/opt/analytics_env/pig/testdump/weeklycounts' USING
    PigStorage('\t');
STORE mcnt INTO '/opt/analytics_env/pig/testdump/monthlycount' USING
    PigStorage('\t');

--
Zaki Rahaman
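
For reference, the "distinct count in a nested foreach" mentioned in the question is usually written along the lines of the sketch below. It reuses the paths and the ExtractDay UDF from the script above; the relation names `events` and `users` and the output field names are introduced here purely for illustration, and nothing in the thread says which form runs faster on this data.

```pig
-- Sketch only: same input and UDF as the script above; the aliases
-- `events`, `users`, `day` and `unique_users` are illustrative names.
REGISTER /opt/analytics_env/pig/mypigudfs.jar;

raw = LOAD '/opt/analytics_env/processed_logs/*.queryLog.2009-08*'
      AS (timestamp:chararray, ip:chararray, userid:chararray);

-- Keep only the fields needed and tag each row with its day.
events = FOREACH raw GENERATE userid, mypigudfs.ExtractDay(timestamp) AS day;

-- Group once by day, then de-duplicate user IDs inside each group.
daygrp = GROUP events BY day;
daycnts = FOREACH daygrp {
    users = DISTINCT events.userid;
    GENERATE group AS day, COUNT(users) AS unique_users;
};
```

This replaces the top-level DISTINCT followed by GROUP with a single GROUP whose nested block does the de-duplication. The trade-off is that each day's bag of user IDs is de-duplicated within that group, which may be costly when a single group is very large.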


  • Nikhil Gupta at Aug 27, 2009 at 11:19 pm
    Zaki,

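    Good work, your script looks fine to me.

    From this: http://hadoop.apache.org/pig/docs/r0.3.0/cookbook.html#Reduce+Your+Operator+Pipeline
    - Reducing the operator pipeline makes scripts more efficient, so something
    like -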

    A = LOAD 'A.txt' as (timestamp, value);
    result = FOREACH (GROUP (DISTINCT(A)) BY timestamp PARALLEL 10)
    {
    GENERATE group, COUNT($1);
    };
    DUMP result;

    _might_ be better [at the cost of making your script harder to read].
    However, I have never run any such benchmarks myself, and would like
    to hear about other members' experience with such optimizations.

    -nikhil
    http://stanford.edu/~nikgupta/

  • Dmitriy Ryaboy at Aug 28, 2009 at 2:36 am
    It might be faster to use the "daily" relation to generate weekly and
    monthly counts.
    Somewhere down the line you are going to run into the fact that once
    you load up September or July, you will have double rows for weeks
    that span months...

    -- Load query logs for the month of August
    rawdata = LOAD '/opt/analytics_env/processed_logs/*.queryLog.2009-08*' AS
        (timestamp:chararray, ip:chararray, userid:chararray);
    dailyraw = FOREACH rawdata GENERATE userid, mypigudfs.ExtractDay(timestamp)
        AS day;
    daily = DISTINCT dailyraw;
    daygrp = GROUP daily BY day;
    daycnts = FOREACH daygrp GENERATE group, COUNT(daily);

    -- derive weeks from the already de-duplicated daily relation
    weeklyraw = FOREACH daily GENERATE userid,
        mypigudfs.calculateWeek(day) AS week;
    weekly = DISTINCT weeklyraw;
    wkgrp = GROUP weekly BY week;
    wkcnts = FOREACH wkgrp GENERATE group, COUNT(weekly);

    -- weekly only ok if you know everything is from one month
    -- otherwise, use daily and calc month from day
    rawmonth = FOREACH weekly GENERATE userid;
    monthly = DISTINCT rawmonth;
    mnth = GROUP monthly ALL;
    mcnt = FOREACH mnth GENERATE COUNT(monthly);

    STORE daycnts INTO '/opt/analytics_env/pig/testdump/dailycounts' USING
        PigStorage('\t');
    STORE wkcnts INTO '/opt/analytics_env/pig/testdump/weeklycounts' USING
        PigStorage('\t');
    STORE mcnt INTO '/opt/analytics_env/pig/testdump/monthlycount' USING
        PigStorage('\t');


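
Dmitriy's comment about using "daily" and calculating the month from the day could look roughly like the sketch below. `ExtractMonth` is a hypothetical UDF name: the thread shows no such function, and it does not say what format `ExtractDay` returns, so treat the month step as an assumption rather than part of the original jar.

```pig
-- Sketch only: ExtractMonth is a hypothetical UDF (not shown in the thread)
-- that maps a day value to its month. Everything else follows the thread.
REGISTER /opt/analytics_env/pig/mypigudfs.jar;

rawdata = LOAD '/opt/analytics_env/processed_logs/*.queryLog.2009-08*'
          AS (timestamp:chararray, ip:chararray, userid:chararray);

-- One row per (userid, day), as in the scripts above.
dailyraw = FOREACH rawdata GENERATE userid, mypigudfs.ExtractDay(timestamp) AS day;
daily = DISTINCT dailyraw;

-- Derive the month from the day, then count distinct users per month.
monthlyraw = FOREACH daily GENERATE userid, mypigudfs.ExtractMonth(day) AS month;
monthly = DISTINCT monthlyraw;
mnthgrp = GROUP monthly BY month;
mcnts = FOREACH mnthgrp GENERATE group AS month, COUNT(monthly) AS unique_users;
```

Grouping by month instead of using GROUP ... ALL gives one output row per month, so widening the LOAD pattern to include July or September would not fold every user into a single count.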

Discussion Overview
group: user @
categories: pig, hadoop
posted: Aug 27, '09 at 9:28p
active: Aug 28, '09 at 2:36a
posts: 3
users: 3
website: pig.apache.org