Grokbase Groups Pig dev June 2009
FAQ
Hi Pig team,

We'd like to get your feedback on a set of queries we implemented on Pig.

We've attached the hadoop configuration and pig queries in the email. We start the queries by issuing "pig xxx.pig". The queries are from SIGMOD'2009 paper. More details are at https://issues.apache.org/jira/browse/HIVE-396 (Shall we open a JIRA on PIG for this?)


One improvement is that we are going to change hadoop to use LZO as intermediate compression algorithm very soon. Previously we used gzip for all performance tests including hadoop, hive and pig.

The reason that we specify the number of reducers in the query is to try to match the same number of reducer as Hive automatically suggested. Please let us know what is the best way to set the number of reducers in Pig.

Are there any other improvements we can make to the Pig query and the hadoop configuration?

Thanks,
Zheng

Search Discussions

  • Zheng Shao at Jun 23, 2009 at 3:39 pm
    By the way, just for clarification, these queries are used for gathering performance data.

    Zheng
    From: Zheng Shao
    Sent: Monday, June 22, 2009 10:37 PM
    To: 'pig-dev@hadoop.apache.org'
    Subject: asking for comments on benchmark queries

    Hi Pig team,

    We'd like to get your feedback on a set of queries we implemented on Pig.

    We've attached the hadoop configuration and pig queries in the email. We start the queries by issuing "pig xxx.pig". The queries are from SIGMOD'2009 paper. More details are at https://issues.apache.org/jira/browse/HIVE-396 (Shall we open a JIRA on PIG for this?)


    One improvement is that we are going to change hadoop to use LZO as intermediate compression algorithm very soon. Previously we used gzip for all performance tests including hadoop, hive and pig.

    The reason that we specify the number of reducers in the query is to try to match the same number of reducer as Hive automatically suggested. Please let us know what is the best way to set the number of reducers in Pig.

    Are there any other improvements we can make to the Pig query and the hadoop configuration?

    Thanks,
    Zheng
  • Alan Gates at Jun 23, 2009 at 5:32 pm
    Zheng,

    I don't think you're subscribed to pig-dev (your emails have been
    bouncing to the moderator). So I've cc'd you explicitly on this.

    I don't think we need a Pig JIRA, it's probably easier if we all work
    on the hive one. I'll post my comments on the various scripts to that
    bug. I've also attached them here since pig-dev won't see the updates
    to that bug.

    Alan.

    grep_select.pig:

    Adding types in the LOAD statement will force Pig to cast the key
    field, even though it doesn't need to (it only reads and writes the
    key field). So I'd change the query to be:

    rmf output/PIG_bench/grep_select;
    a = load '/data/grep/*' using PigStorage as (key,field);
    b = filter a by field matches '.*XYZ.*';
    store b into 'output/PIG_bench/grep_select';

    field will still be cast to a chararray for the matches, but we won't
    waste time casting key and then turning it back into bytes for the
    store.

    rankings_select.pig:

    Same comment, remove the casts. pagerank will be properly cast to an
    integer.

    rmf output/PIG_bench/rankings_select;
    a = load '/data/rankings/*' using PigStorage('|') as
    (pagerank,pageurl,aveduration);
    b = filter a by pagerank > 10;
    store b into 'output/PIG_bench/rankings_select';

    rankings_uservisits_join.pig:

    Here you want to keep the casts of pagerank so that it is handled as
    the right type. adRevenue will default to double in SUM when you
    don't specify a type. You also want to project out all unneeded
    columns as soon as possible. You should set PARALLEL on the join to
    use the number of reducers appropriate for your cluster. Given that
    you have 10 machines and 5 reduce slots per machine, and speculative
    execution is off you probably want 50 reducers. I notice you set
    parallel to 60 on the group by. That will give you 10 trailing
    reducers. Unless you have a need for the result to be split 60 ways
    you should reduce that to 50 as well. (I'm assuming here when you say
    you have a 10 node cluster you mean 10 data nodes, not counting your
    name node and task tracker. The reduce formula should be 5 * number
    of data nodes.)

    A last question is how large are the uservisits and rankings data
    sets? If either is < 80M or so you can use the fragment/replicate
    join, which is much faster than the general join. The following
    script assumes that isn't the case; but if it is let me know and I can
    show you the syntax for it.

    So the end query looks like:

    rmf output/PIG_bench/html_join;
    a = load '/data/uservisits/*' using PigStorage('|') as

    (sourceIP
    ,destURL
    ,visitDate
    ,adRevenue,userAgent,countryCode,languageCode:,searchWord,duration);
    b = load '/data/rankings/*' using PigStorage('|') as
    (pagerank:int,pageurl,aveduration);
    c = filter a by visitDate > '1999-01-01' AND visitDate < '2000-01-01';
    c1 = fjjkkoreach c generate sourceIP, destURL, addRevenue;
    b1 = foreach b generate pagerank, pageurl;
    d = JOIN c1 by destURL, b1 by pageurl parallel 50;
    d1 = foreach d generate sourceIP, pagerank, adRevenue;
    e = group d1 by sourceIP parallel 50;
    f = FOREACH e GENERATE group, AVG(d1.pagerank), SUM(d1.adRevenue);
    store f into 'output/PIG_bench/html_join';

    uservisists_agrre.pig:

    Same comments as above on projecting out as early as possible and on
    setting parallel appropriately for your cluster.

    rmf output/PIG_bench/uservisits_aggre;
    a = load '/data/uservisits/*' using PigStorage('|') as

    (sourceIP
    ,destURL
    ,visitDate
    ,adRevenue,userAgent,countryCode,languageCode,searchWord,duration);
    a1 = foreach a generate sourceIP, adRevenue;
    b = group a by sourceIP parallel 50;
    c = FOREACH b GENERATE group, SUM(a. adRevenue);
    store c into 'output/PIG_bench/uservisits_aggre';


    On Jun 22, 2009, at 10:36 PM, Zheng Shao wrote:

    Hi Pig team,

    We’d like to get your feedback on a set of queries we implemented on
    Pig.

    We’ve attached the hadoop configuration and pig queries in the
    email. We start the queries by issuing “pig xxx.pig”. The queries
    are from SIGMOD’2009 paper. More details are athttps://
    issues.apache.org/jira/browse/HIVE-396 (Shall we open a JIRA on PIG
    for this?)


    One improvement is that we are going to change hadoop to use LZO as
    intermediate compression algorithm very soon. Previously we used
    gzip for all performance tests including hadoop, hive and pig.

    The reason that we specify the number of reducers in the query is to
    try to match the same number of reducer as Hive automatically
    suggested. Please let us know what is the best way to set the number
    of reducers in Pig.

    Are there any other improvements we can make to the Pig query and
    the hadoop configuration?

    Thanks,
    Zheng

    <hadoop-site.xml><hive-default.xml><hadoop-env.sh.txt>

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupdev @
categoriespig, hadoop
postedJun 23, '09 at 3:39p
activeJun 23, '09 at 5:32p
posts3
users2
websitepig.apache.org

2 users in discussion

Zheng Shao: 2 posts Alan Gates: 1 post

People

Translate

site design / logo © 2022 Grokbase