Hi all,

I'm becoming a bit more comfortable writing scripts, but I'm still not
always sure of the best way to structure my statements to optimize
performance. When it comes to SPLIT and FILTER, for example, one could
filter multiple times on a raw set of data or condense the filters into one
SPLIT statement, but it's not clear from the docs what the best practice is
in this case. Below is my script as it stands. Your input would be greatly
appreciated.

-- Queries for August by Day/Month/Week

REGISTER s3://kikin-pig-test/udfs/mypigudfs.jar;

raw = LOAD 'data' AS (timestamp:chararray, ip:chararray, userid:chararray);

dailyraw = FOREACH raw GENERATE userid, mypigudfs.ExtractDay(timestamp) AS day;
-- matches takes a Java regex that must match the whole string, hence '.*BROKEN.*'
SPLIT dailyraw INTO broken IF (userid matches '.*BROKEN.*'),
    noperm IF (userid matches '.*NOPERM.*'),
    daily IF (NOT ((userid matches '.*BROKEN.*') OR (userid matches '.*NOPERM.*')));


-- Daily Count(s)

daygrp = GROUP daily BY day PARALLEL 36;
daycnts = FOREACH daygrp GENERATE group, COUNT(daily);


-- NoPerm
npgrp = GROUP noperm BY day;
npcnts = FOREACH npgrp GENERATE group, COUNT(noperm);

--Broken
brkgrp = GROUP broken BY day;
brkcnts = FOREACH brkgrp GENERATE group, COUNT(broken);


-- Weekly Count(s)

weekly = FOREACH daily GENERATE userid, mypigudfs.ExtractWeek(day) AS week;
wkgrp = GROUP weekly By week PARALLEL 36;
wkcnts = FOREACH wkgrp GENERATE group, COUNT(weekly);

--Broken
broken2 = FOREACH broken GENERATE userid, mypigudfs.ExtractWeek(day) AS week;
brkgrp2 = GROUP broken2 BY week;
brkcnts2 = FOREACH brkgrp2 GENERATE group, COUNT(broken2);


--NoPerm
noperm2 = FOREACH noperm GENERATE userid, mypigudfs.ExtractWeek(day) AS week;
npgrp2 = GROUP noperm2 BY week;
npcnts2 = FOREACH npgrp2 GENERATE group, COUNT(noperm2);


-- Monthly Count

month = GROUP weekly ALL;
mcnt = FOREACH month GENERATE COUNT(weekly);

npmonth = GROUP noperm2 ALL;
npmcnt = FOREACH npmonth GENERATE COUNT(noperm2);

brkmonth = GROUP broken2 ALL;
brkmcnt = FOREACH brkmonth GENERATE COUNT(broken2);

-- Store Output

--
Zaki Rahaman


  • Alan Gates at Sep 8, 2009 at 6:01 pm
    In this particular case, it doesn't matter which you choose. If you
    filter multiple times on a relation, then Pig will insert a split with
    the filters you give.

    Alan.
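
    To illustrate the equivalence described above, here is a sketch of the
    same three-way partition written as separate FILTER statements instead
    of one SPLIT. It assumes the dailyraw relation (userid, day) from the
    script in the question. The patterns are written as '.*BROKEN.*' because
    matches takes a Java regular expression that has to match the whole
    string.

    ```pig
    -- Assumes dailyraw = (userid, day), as built in the question's script.
    -- Each FILTER reads the same relation, which should correspond to the
    -- single SPLIT form with the same conditions.
    broken = FILTER dailyraw BY (userid MATCHES '.*BROKEN.*');
    noperm = FILTER dailyraw BY (userid MATCHES '.*NOPERM.*');
    daily  = FILTER dailyraw BY NOT ((userid MATCHES '.*BROKEN.*')
                                 OR (userid MATCHES '.*NOPERM.*'));
    ```

    Running EXPLAIN on a downstream alias for each variant is one way to
    compare the plans Pig produces on a given version.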
  • Zaki Rahaman at Sep 8, 2009 at 6:46 pm
    Another quick question: how do you deal with blank fields (e.g. field
    12, between fields 11 and 13, is always blank)? The files are
    tab-delimited.
  • Dmitriy Ryaboy at Sep 8, 2009 at 7:13 pm
    The simplest thing would be to project it out:

    a = load '/data/foo' using PigStorage() as (f1, f2, f3);
    b = foreach a generate f1, f3;

    You could write a custom loader or use the regex loader but that seems
    like overkill.

    -D
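
    For a wider file like the one in the question, the same projection can
    be sketched with positional references, so no schema has to be declared
    for every column. The path and the aliases f11 and f13 are placeholders,
    and the field numbers follow the question's description (field 12 always
    blank).

    ```pig
    -- Hypothetical path; PigStorage() splits on tabs by default.
    a = LOAD '/data/foo' USING PigStorage();
    -- Positions are zero-based: field 11 is $10, the blank field 12 is $11,
    -- and field 13 is $12. The blank field is simply not projected.
    b = FOREACH a GENERATE $10 AS f11, $12 AS f13;
    ```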

  • Alan Gates at Sep 8, 2009 at 8:42 pm
    You mean it looks something like this (converting tabs to commas for
    readability): 3,5,7,,9 ?

    In that case that field will be read as a null.

    Alan.
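
    As a small sketch of what that means in a script, the blank column still
    takes up a slot in the schema, so the fields after it keep their
    positions. The path and field names below are placeholders, and the row
    is the five-field example above.

    ```pig
    -- Hypothetical path and field names; input is tab-delimited.
    a = LOAD '/data/foo' USING PigStorage() AS (f1, f2, f3, f4, f5);
    -- For the row 3<TAB>5<TAB>7<TAB><TAB>9, f4 is read as null and f5 is 9.
    blanks    = FILTER a BY f4 IS NULL;
    nonblanks = FILTER a BY f4 IS NOT NULL;
    ```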
  • Zaki Rahaman at Sep 8, 2009 at 11:33 pm
    Yes, except it's tab-delimited. My question was more about how to set
    up the load statement so that field 13 gets loaded properly.

  • Nikhil Gupta at Sep 9, 2009 at 2:35 am
    It will get loaded as a null; it doesn't matter whether your file is
    comma-separated or tab-delimited.

Discussion Overview
group: user @
categories: pig, hadoop
posted: Sep 3, '09 at 8:04p
active: Sep 9, '09 at 2:35a
posts: 7
users: 4
website: pig.apache.org
