Combiner not used because optimizer inserts a foreach between group and algebraic function
------------------------------------------------------------------------------------------

Key: PIG-1637
URL: https://issues.apache.org/jira/browse/PIG-1637
Project: Pig
Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Daniel Dai
Fix For: 0.8.0


The following script no longer uses the combiner after the new optimization change.

{code}
A = load ':INPATH:/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
as (user, action, timespent, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links);
B = foreach A generate user, (int)timespent as timespent, (double)estimated_revenue as estimated_revenue;
C = group B all;
D = foreach C generate SUM(B.timespent), AVG(B.estimated_revenue);
store D into ':OUTPATH:';
{code}

This is because, after the group, the optimizer detects that the group key is not used afterward and adds a foreach statement after C. This is what the script looks like after optimization:
{code}
A = load ':INPATH:/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
as (user, action, timespent, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links);
B = foreach A generate user, (int)timespent as timespent, (double)estimated_revenue as estimated_revenue;
C = group B all;
C1 = foreach C generate B;
D = foreach C1 generate SUM(B.timespent), AVG(B.estimated_revenue);
store D into ':OUTPATH:';
{code}

That cancels the combiner optimization for D.
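
For context, the combiner helps when the functions applied after the group are algebraic, meaning the result can be assembled from partial results computed on subsets of the bag. The following is a minimal, self-contained sketch of that idea for AVG. The class and method names are hypothetical and are not Pig's UDF API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; these names are not part of Pig's UDF API.
final class AlgebraicAvgSketch {

    // Map side: reduce one chunk of values to a partial result {sum, count}.
    static double[] initial(List<Double> chunk) {
        double sum = 0.0;
        for (double v : chunk) {
            sum += v;
        }
        return new double[] {sum, chunk.size()};
    }

    // Combiner: fold several partial results into one partial result.
    static double[] combine(List<double[]> partials) {
        double sum = 0.0;
        double count = 0.0;
        for (double[] p : partials) {
            sum += p[0];
            count += p[1];
        }
        return new double[] {sum, count};
    }

    // Reduce side: turn the final partial result into the average.
    static double finish(double[] partial) {
        return partial[0] / partial[1];
    }

    public static void main(String[] args) {
        List<double[]> partials = new ArrayList<>();
        partials.add(initial(Arrays.asList(1.0, 2.0)));
        partials.add(initial(Arrays.asList(3.0)));
        System.out.println(finish(combine(partials)));
    }
}
```

In this model each map task would hand only one small {sum, count} pair to the reducer instead of the whole bag, which is the saving that is lost when the combiner is not used.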

The way to solve the issue is to merge the C1 we inserted with D. Currently, we do not merge these two foreach statements. The reason is that one output of the first foreach (B) is referenced twice in D, and the current rule assumes that after the merge we would need to calculate B twice in D. Actually, C1 only does a projection; it does not calculate B. Merging C1 and D will not result in calculating B twice, so C1 and D should be merged.
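
The reasoning above can be sketched as a small decision function. This is a simplified model under stated assumptions, not the MergeForEach rule itself: the class name, the method name, and the representation of columns as integers are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model of the merge decision; not Pig's actual classes.
final class ForEachMergeSketch {

    /**
     * @param firstIsProjectionOnly true when every output of the first foreach is a
     *        plain projection (as in "C1 = foreach C generate B")
     * @param columnsReadBySecond   for each input of the second foreach, the index of
     *        the first foreach's output column that it reads
     */
    static boolean canMerge(boolean firstIsProjectionOnly, List<Integer> columnsReadBySecond) {
        Set<Integer> seen = new HashSet<>();
        boolean duplicate = false;
        for (int col : columnsReadBySecond) {
            if (!seen.add(col)) {   // add() returns false if col was already seen
                duplicate = true;
                break;
            }
        }
        // Without duplicates, merging never repeats a calculation.
        // With duplicates, merging is still safe if nothing is calculated at all.
        return !duplicate || firstIsProjectionOnly;
    }

    public static void main(String[] args) {
        // D reads column 0 (the bag B) twice: once for SUM, once for AVG.
        System.out.println(canMerge(true, Arrays.asList(0, 0)));   // C1 only projects B
        System.out.println(canMerge(false, Arrays.asList(0, 0)));  // first foreach computes something
    }
}
```

With C1 being a pure projection, the first call models the script in this issue and allows the merge; the second call models the case the old rule was protecting against.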

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


  • Daniel Dai (JIRA) at Sep 27, 2010 at 10:24 pm
    [ https://issues.apache.org/jira/browse/PIG-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Daniel Dai updated PIG-1637:
    ----------------------------

    Attachment: PIG-1637-1.patch
  • Daniel Dai (JIRA) at Sep 28, 2010 at 6:25 pm
    [ https://issues.apache.org/jira/browse/PIG-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Daniel Dai updated PIG-1637:
    ----------------------------

    Attachment: PIG-1637-2.patch

    A bug was caught by Xuefu. Reattaching the patch.
  • Daniel Dai (JIRA) at Sep 28, 2010 at 8:01 pm
    [ https://issues.apache.org/jira/browse/PIG-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915880#action_12915880 ]

    Daniel Dai commented on PIG-1637:
    ---------------------------------

    test-patch result for PIG-1637-2.patch:

    [exec] +1 overall.
    [exec]
    [exec] +1 @author. The patch does not contain any @author tags.
    [exec]
    [exec] +1 tests included. The patch appears to include 3 new or modified tests.
    [exec]
    [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
    [exec]
    [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
    [exec]
    [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
    [exec]
    [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.

  • Xuefu Zhang (JIRA) at Sep 28, 2010 at 10:26 pm
    [ https://issues.apache.org/jira/browse/PIG-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915943#action_12915943 ]

    Xuefu Zhang commented on PIG-1637:
    ----------------------------------

    +1

    The patch looks good, except that we don't have to require that all output expressions in the first foreach contain only simple projections. As long as each output expression in the first foreach that is referenced multiple times in the second foreach contains only a simple projection, the merge can proceed. With that change, the following two loops might be better merged into one.

    @@ -93,14 +93,17 @@
             // Otherwise, we may do expression calculation more than once, defeat the benefit of this
             // optimization
             Set<Integer> inputs = new HashSet<Integer>();
    +        boolean duplicateInputs = false;
             for (Operator op : foreach2.getInnerPlan().getSources()) {
                 // If the source is not LOInnerLoad, then it must be LOGenerate. This happens when
                 // the 1st ForEach does not rely on any input of 2nd ForEach
                 if (op instanceof LOInnerLoad) {
                     LOInnerLoad innerLoad = (LOInnerLoad)op;
                     int input = innerLoad.getProjection().getColNum();
    -                if (inputs.contains(input))
    -                    return false;
    +                if (inputs.contains(input)) {
    +                    duplicateInputs = true;
    +                    break;
    +                }
                     else
                         inputs.add(input);

    @@ -109,6 +112,27 @@
                 }
             }

    +        // Duplicate inputs in the case first foreach only containing LOInnerLoad and
    +        // LOGenerate is allowed, and output plan is simple projection
    +        if (duplicateInputs) {
    +            Iterator<Operator> it1 = foreach1.getInnerPlan().getOperators();
    +            while( it1.hasNext() ) {
    +                Operator op = it1.next();
    +                if(!(op instanceof LOGenerate) && !(op instanceof LOInnerLoad))
    +                    return false;
    +                if (op instanceof LOGenerate) {
    +                    List<LogicalExpressionPlan> outputPlans = ((LOGenerate)op).getOutputPlans();
    +                    for (LogicalExpressionPlan outputPlan : outputPlans) {
    +                        Iterator<Operator> iter = outputPlan.getOperators();
    +                        while (iter.hasNext()) {
    +                            if (!(iter.next() instanceof ProjectExpression))
    +                                return false;
    +                        }
    +                    }
    +                }
    +            }
    +        }
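
The relaxed condition suggested in this comment could look roughly like the sketch below. It is not part of the attached patch and does not use Pig's logical plan classes; the class name, method name, and the boolean/integer lists standing in for output plans and column references are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the relaxed rule; not the committed MergeForEach code.
final class RelaxedMergeSketch {

    /**
     * @param outputIsSimpleProjection one flag per output of the first foreach:
     *        true if that output is a plain projection, false if it is computed
     * @param columnsReadBySecond      for each input of the second foreach, the index
     *        of the first foreach's output column that it reads
     */
    static boolean canMerge(List<Boolean> outputIsSimpleProjection,
                            List<Integer> columnsReadBySecond) {
        // Count how often the second foreach reads each output column.
        Map<Integer, Integer> refCount = new HashMap<>();
        for (int col : columnsReadBySecond) {
            refCount.merge(col, 1, Integer::sum);
        }
        // Only columns that are read more than once must be simple projections.
        for (Map.Entry<Integer, Integer> e : refCount.entrySet()) {
            if (e.getValue() > 1 && !outputIsSimpleProjection.get(e.getKey())) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Output 0 is a projection, output 1 is computed.
        List<Boolean> outputs = Arrays.asList(true, false);
        System.out.println(canMerge(outputs, Arrays.asList(0, 0, 1))); // projection read twice
        System.out.println(canMerge(outputs, Arrays.asList(0, 1, 1))); // computed output read twice
    }
}
```

The first call is a case the patch as posted would reject, because it requires every output of the first foreach to be a simple projection, while the relaxed rule accepts it since the computed output is read only once.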

  • Daniel Dai (JIRA) at Sep 28, 2010 at 10:39 pm
    [ https://issues.apache.org/jira/browse/PIG-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915950#action_12915950 ]

    Daniel Dai commented on PIG-1637:
    ---------------------------------

    Yes, it could be improved as per Xuefu's suggestion. Anyway, the current patch solves the "combiner not used" issue, so I will commit this part first and open another Jira to improve it. Also, MergeForEach is a good example for practicing the cloning framework [PIG-1587|https://issues.apache.org/jira/browse/PIG-1587], so it is better to improve it once PIG-1587 is available.
  • Daniel Dai (JIRA) at Sep 29, 2010 at 5:30 am
    [ https://issues.apache.org/jira/browse/PIG-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Daniel Dai resolved PIG-1637.
    -----------------------------

    Hadoop Flags: [Reviewed]
    Resolution: Fixed

    All tests pass except for TestSortedTableUnion / TestSortedTableUnionMergeJoin for zebra, which were already failing and will be addressed by [PIG-1649|https://issues.apache.org/jira/browse/PIG-1649].

    Patch committed to both trunk and 0.8 branch.

Posted: Sep 21, 2010 at 6:52 pm
Last active: Sep 29, 2010 at 5:30 am