in nested foreach, accumulative udf taking input from order-by does not get results in order
--------------------------------------------------------------------------------------------

Key: PIG-1963
URL: https://issues.apache.org/jira/browse/PIG-1963
Project: Pig
Issue Type: Bug
Affects Versions: 0.8.0, 0.9.0
Reporter: Thejas M Nair


This happens only when secondary sort is not being used for the order-by.
For example -
{code}
a1 = load 'fruits.txt' as (f1:int,f2);
a2 = load 'fruits.txt' as (f1:int,f2);

b = cogroup a1 by f1, a2 by f1;

d = foreach b {
    sort1 = order a1 by f2;
    sort2 = order a2 by f2; -- secondary sort not getting used here, MYCONCATBAG gets results in wrong order
    generate group, MYCONCATBAG(sort1.f1), MYCONCATBAG(sort2.f2);
}

-- explain d;
dump d;
{code}



--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


  • Thejas M Nair (JIRA) at Apr 4, 2011 at 10:23 pm
    [ https://issues.apache.org/jira/browse/PIG-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015661#comment-13015661 ]

    Thejas M Nair commented on PIG-1963:
    ------------------------------------

    The MYCONCATBAG udf used in the query in the description concatenates the entries in the bag, in the order they are received.
    The query, when run with the property pig.accumulative.batchsize=2
    and input -
    {code}
    100 apple
    200 orange
    300 strawberry
    300 pear
    100 apple
    300 pear
    400 apple
    {code}

    gives output -
    {code}
    (100,(100)(100),(apple)(apple))
    (200,(200),(orange))
    (300,(300)(300)(300),(pear)(strawberry)(pear)) -- this should be (300,(300)(300)(300),(pear)(pear)(strawberry))
    (400,(400),(apple))
    {code}
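
The behaviour described above can be modelled outside Pig. The following is only a sketch in Python, not the attached MYCONCATBAG.java: the class name `ConcatBag` and its method names are hypothetical, loosely mirroring an accumulator-style udf that is handed the bag in batches and appends entries in arrival order. The two batches are an assumption about how the three entries for key 300 would be split with a batch size of 2.

```python
class ConcatBag:
    """Hypothetical stand-in for an accumulator-style udf such as MYCONCATBAG."""

    def __init__(self):
        self._parts = []

    def accumulate(self, batch):
        # Append entries in exactly the order they arrive; nothing is re-sorted.
        for entry in batch:
            self._parts.append("(%s)" % entry)

    def get_value(self):
        # Concatenate everything seen so far for the current key.
        return "".join(self._parts)

    def cleanup(self):
        # Reset state before the next key is processed.
        self._parts = []


udf = ConcatBag()
udf.accumulate(["pear", "strawberry"])  # first batch, sorted on its own
udf.accumulate(["pear"])                # second batch, sorted on its own
result = udf.get_value()
print(result)
```

Because the udf never reorders what it receives, any ordering mistake made upstream shows up unchanged in its output.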
  • Thejas M Nair (JIRA) at Apr 4, 2011 at 10:25 pm
    [ https://issues.apache.org/jira/browse/PIG-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Thejas M Nair updated PIG-1963:
    -------------------------------

    Attachment: MYCONCATBAG.java

    attaching udf used in the example.
  • Thejas M Nair (JIRA) at Apr 4, 2011 at 10:35 pm
    [ https://issues.apache.org/jira/browse/PIG-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015665#comment-13015665 ]

    Thejas M Nair commented on PIG-1963:
    ------------------------------------

    Note that the issue is seen only when the bag used by the nested order-by statement has more than 20000 tuples, or more than the value of the pig.accumulative.batchsize property if it is set.

    This is happening because in accumulative mode the nested relational operator is passed only a portion of the bag at a time. That works fine for operations such as filter or limit. If secondary sort is used for the ordering, there is no POSort in the plan, so it works fine.

    This issue might exist in case of nested distinct as well, because it is also supposed to be a blocking operation.
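
To illustrate why a blocking operator misbehaves when it only sees part of the bag, here is a small Python simulation. It is a sketch under assumptions, not Pig code: the helper names are made up, and the bag for key 300 is assumed to arrive in input-file order (strawberry, pear, pear) with a batch size of 2.

```python
from itertools import islice


def batches(bag, size):
    """Yield successive chunks of at most `size` entries from the bag."""
    it = iter(bag)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def sort_per_batch(bag, batch_size):
    """Sort each batch independently, as a sort fed partial bags would."""
    out = []
    for chunk in batches(bag, batch_size):
        out.extend(sorted(chunk))
    return out


def sort_whole_bag(bag):
    """Sort the complete bag at once, which is what order-by is meant to do."""
    return sorted(bag)


bag_300 = ["strawberry", "pear", "pear"]
per_batch = sort_per_batch(bag_300, 2)
whole = sort_whole_bag(bag_300)
print(per_batch)
print(whole)
```

Each batch is sorted correctly on its own, but the concatenation of sorted batches is not a sorted bag. A filter or limit does not have this problem, since its result for one tuple does not depend on tuples in other batches.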

    Another query which demonstrates this issue (when property pig.accumulative.batchsize=2 is set)

    {code}
    a1 = load 'fruits.txt' as (cid:int,fruit : chararray);

    b = group a1 by cid;

    d = foreach b {
        sort1 = order a1 by fruit;
        sort2 = order a1 by fruit desc;
        generate group as cid, MYCONCATBAG(sort1.fruit), MYCONCATBAG(sort2.fruit); -- The second instance of the udf does not get sorted results
    }

    explain d;
    dump d;
    {code}

    To fix this, if such blocking relational operators exist in the plan after secondary-sort optimization, accumulative mode should be disabled by the optimizer.

  • Thejas M Nair (JIRA) at Apr 7, 2011 at 4:23 pm
    [ https://issues.apache.org/jira/browse/PIG-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Thejas M Nair updated PIG-1963:
    -------------------------------

    Attachment: PIG-1963.1.patch

    PIG-1963.1.patch - unit tests, test-patch with 0.8 branch. Running them for trunk.
  • Thejas M Nair (JIRA) at Apr 7, 2011 at 8:20 pm
    [ https://issues.apache.org/jira/browse/PIG-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Thejas M Nair updated PIG-1963:
    -------------------------------

    Fix Version/s: 0.8.0, 0.9.0
  • Daniel Dai (JIRA) at Apr 8, 2011 at 8:19 am
    [ https://issues.apache.org/jira/browse/PIG-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017339#comment-13017339 ]

    Daniel Dai commented on PIG-1963:
    ---------------------------------

    This prevents other nested relational operators, right? From the way checkUDFInput handles PORelationToExprProject, it seems the original code intended to prevent other relational operators (except foreach) as well, but missed the case of a POProject following a PORelationToExprProject. +1 for the patch if this is the case.
  • Thejas M Nair (JIRA) at Apr 8, 2011 at 4:06 pm
    [ https://issues.apache.org/jira/browse/PIG-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017500#comment-13017500 ]

    Thejas M Nair commented on PIG-1963:
    ------------------------------------

    bq. This prevents other nested relational operators, right? From the way checkUDFInput handles PORelationToExprProject, it seems the original code intended to prevent other relational operators (except foreach) as well, but missed the case of a POProject following a PORelationToExprProject. +1 for the patch if this is the case.

    Yes, the AccumulatorOptimizer code intends to turn off accumulative mode if it sees any relational operator other than POForEach and POSortedDistinct as input to an accumulative udf. The case of POProject should have been handled like PORelationToExprProject.
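
The rule stated above can be sketched as a simple check. This is a Python illustration only, not the AccumulatorOptimizer source: the function name and the representation of a nested plan as a list of operator names are assumptions made for the example. Only the operator names themselves come from this discussion.

```python
# Relational operators that, per the comment above, may feed an accumulative udf.
ALLOWED_RELATIONAL_INPUTS = {"POForEach", "POSortedDistinct"}


def can_use_accumulative_mode(relational_inputs):
    """Return True only if every relational operator feeding the udf is allowed.

    `relational_inputs` is a hypothetical flat list of operator names found
    between the grouped bag and the udf, whether reached through a
    PORelationToExprProject or through a POProject.
    """
    return all(op in ALLOWED_RELATIONAL_INPUTS for op in relational_inputs)


with_sort = can_use_accumulative_mode(["POSort"])        # nested order-by
with_foreach = can_use_accumulative_mode(["POForEach"])  # nested projection
direct = can_use_accumulative_mode([])                   # udf reads the bag directly
print(with_sort, with_foreach, direct)
```

Under this rule a nested order-by that was not replaced by secondary sort leaves a POSort in the plan, so accumulative mode is turned off and the sort sees the whole bag.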

  • Thejas M Nair (JIRA) at Apr 8, 2011 at 4:12 pm
    [ https://issues.apache.org/jira/browse/PIG-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017503#comment-13017503 ]

    Thejas M Nair commented on PIG-1963:
    ------------------------------------

    The AccumulatorOptimizer should allow accumulative mode to be used if the input relation is a non-blocking relation like filter or limit. I have created PIG-1980 to address that.

  • Thejas M Nair (JIRA) at Apr 8, 2011 at 7:23 pm
    [ https://issues.apache.org/jira/browse/PIG-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Thejas M Nair updated PIG-1963:
    -------------------------------

    Attachment: PIG-1963.1.1.patch

    PIG-1963.1.1.patch - patch to remove a test case added in PIG-1911, as that query will no longer run in accumulative mode.
  • Thejas M Nair (JIRA) at Apr 8, 2011 at 7:31 pm
    [ https://issues.apache.org/jira/browse/PIG-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Thejas M Nair resolved PIG-1963.
    --------------------------------

    Resolution: Fixed
    Assignee: Thejas M Nair

    Patch committed to trunk and 0.8 branch.


Discussion Overview
group: dev @
categories: pig, hadoop
posted: Apr 4, '11 at 10:19p
active: Apr 8, '11 at 7:31p
posts: 11
users: 1
website: pig.apache.org

1 user in discussion

Thejas M Nair (JIRA): 11 posts
