Grokbase Groups Pig dev August 2008
Limit return incorrect records when we use multiple reducer
-----------------------------------------------------------

Key: PIG-364
URL: https://issues.apache.org/jira/browse/PIG-364
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: types_branch
Reporter: Daniel Dai
Assignee: Daniel Dai
Fix For: types_branch


Currently we put the Limit(k) operator in the reducer plan. However, with n reducers, each reducer applies the limit independently, so we get up to n*k output records.
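The n*k effect can be reproduced with a small simulation. This is only a sketch: the function name is hypothetical and the round-robin partitioning is an assumption made for illustration, not how Pig or Hadoop actually assign records to reducers.

```python
from itertools import islice

def run_reducers_with_limit(records, n_reducers, k):
    """Split records round-robin across reducers (an assumption for this
    sketch); each reducer then applies Limit(k) to its own partition only."""
    partitions = [records[i::n_reducers] for i in range(n_reducers)]
    output = []
    for part in partitions:
        # Per-reducer limit: no reducer knows what the others emitted.
        output.extend(islice(part, k))
    return output

records = list(range(1000))
out = run_reducers_with_limit(records, n_reducers=4, k=10)
print(len(out))  # 40 rather than the 10 requested
```

With a single reducer the same function returns exactly k records, which is why the treatments discussed in this thread all end in a one-reducer step.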

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


  • Daniel Dai (JIRA) at Aug 8, 2008 at 12:59 am
    [ https://issues.apache.org/jira/browse/PIG-364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12620799#action_12620799 ]

    Daniel Dai commented on PIG-364:
    --------------------------------

    There seems to be no perfect solution. Here are three possible treatments:
    1. If there is a limit in the reducer and the number of reducers > 1, add another map-reduce job after it with only 1 reducer
    Cons: extra overhead
    2. Instead of adding a map-reduce job, manipulate the output files directly, keeping only the top k records
    Cons: unorthodox, extra overhead (but not as much as 1)
    3. If there is a limit in the reducer, change the parallelism of the reducer to 1
    Cons: cannot take advantage of parallel processing in the reducer
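Treatment 1 can be sketched as two passes over in-memory lists. The function names below are hypothetical and the "stages" are plain Python loops rather than real map-reduce jobs; the point is only that a final single-reducer pass restores the bound of k.

```python
def limit_stage(partitions, k):
    """One reduce stage: every reducer keeps at most k records of its partition."""
    return [part[:k] for part in partitions]

def limit_with_extra_stage(partitions, k):
    """Treatment 1: follow the parallel stage with a single-reducer stage."""
    first_pass = limit_stage(partitions, k)        # up to n*k records survive
    merged = [rec for part in first_pass for rec in part]
    return limit_stage([merged], k)[0]             # one reducer, so at most k

partitions = [list(range(i * 100, (i + 1) * 100)) for i in range(5)]
result = limit_with_extra_stage(partitions, k=10)
print(len(result))  # 10
```

The extra overhead named as the drawback is visible here: the second stage still has to read up to n*k records before discarding most of them.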
  • Daniel Dai (JIRA) at Sep 11, 2008 at 5:33 am
    [ https://issues.apache.org/jira/browse/PIG-364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Daniel Dai updated PIG-364:
    ---------------------------

    Attachment: PIG-364.patch

    This patch takes approach 1. It adds one additional map-reduce operator with 1 reducer if the requested parallelism > 1. The behavior of limit is now:

    1. If the map plan is closed before the POLimit operator, we put POLimit in the reduce plan and grant the requested parallelism. If the requested parallelism > 1, we close the reduce plan and add one additional map-reduce operator with 1 reducer.

    2. If the map plan is open before the POLimit operator, we put POLimit in the map plan, close the map plan, add another POLimit to the reduce plan, and set the parallelism of this map-reduce operator to 1. Although POLimit creates a map-reduce boundary in this case, we do not associate a parallel option with the limit keyword. I believe providing a parallel option with limit would confuse users, because it is relatively hard to explain whether the option will be granted or not.

    3. In the limited sort case, we have a POSort with limit<>-1. If the parallelism for POSort > 1, we add one additional map-reduce operator with 1 reducer.

  • Olga Natkovich (JIRA) at Sep 11, 2008 at 7:48 pm
    [ https://issues.apache.org/jira/browse/PIG-364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Olga Natkovich reassigned PIG-364:
    ----------------------------------

    Assignee: Shravan Matthur Narayanamurthy (was: Daniel Dai)

    Shravan, could you please review this patch, thanks
  • Shravan Matthur Narayanamurthy (JIRA) at Sep 12, 2008 at 8:30 pm
    [ https://issues.apache.org/jira/browse/PIG-364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630673#action_12630673 ]

    Shravan Matthur Narayanamurthy commented on PIG-364:
    ----------------------------------------------------

    The patch seems to follow the approach mentioned, but I think there is a problem with this approach. I tested both with and without the patch, and the problem exists in both cases. Consider the following script:

    A = load 'st10M';
    B = limit A 10;
    C = filter B by 2>1 parallel 10;
    dump C;

    I would expect to see only 10 tuples here, but I see 40. The reason is that there were 4 mappers that produced 40 tuples in all, and the filter requested 10 reducers, since limit crosses the map-reduce boundary. The limit on the reduce side did no limiting, as the reducers together have capacity to pass 10*10=100 tuples.
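The arithmetic in this report (4 mappers, 10 reducers, limit 10) can be checked with a short simulation. The helper names and the round-robin shuffle are assumptions made for this sketch and do not mirror Pig's actual operators.

```python
def map_side_limit(splits, k):
    """Each mapper keeps at most k tuples from its input split."""
    return [split[:k] for split in splits]

def shuffle(map_outputs, n_reducers):
    """Hypothetical round-robin shuffle of all map output to the reducers."""
    flat = [t for out in map_outputs for t in out]
    return [flat[i::n_reducers] for i in range(n_reducers)]

def reduce_side_limit(partitions, k):
    """Each reducer keeps at most k tuples from its partition."""
    return [t for part in partitions for t in part[:k]]

splits = [[(m, i) for i in range(1000)] for m in range(4)]   # 4 input splits
mapped = map_side_limit(splits, 10)          # 4 * 10 = 40 tuples leave the maps
final = reduce_side_limit(shuffle(mapped, 10), 10)   # capacity 10 * 10 = 100
print(len(final))  # 40
```

Each of the 10 reducers receives only 4 tuples, so its local limit of 10 never clips anything and all 40 tuples reach the output.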

    The problem is that we cannot guarantee the parallelism by setting the requested parallelism to some value while visiting limit, because it can be modified further down the line, as the example shows. The same thing happens when the limit is in the reducer: for the following script I see 97 tuples instead of the expected 10. If no tuples were clipped at all, this would have been 100, and the actual output is pretty close:

    A = load 'st10M';
    B1 = group A by $0 parallel 10;
    B = limit B1 10;
    C = filter B by 2>1 parallel 10;
    dump C;

    One way I can think of is to terminate the new reduce phase with a store, thereby ensuring that the parallelism stays at the value set during limit, and then start a new MapReduceOper by loading this file. Another way is to disable further changes to the reduce parallelism by maintaining a flag that controls changes to the parallelism of the map-reduce operator. But this would mean that we disobey the user's request, which might be meaningless at some places. Also, the semantics of parallel will be distorted when limit is in the picture.
  • Olga Natkovich (JIRA) at Sep 12, 2008 at 9:32 pm
    [ https://issues.apache.org/jira/browse/PIG-364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630686#action_12630686 ]

    Olga Natkovich commented on PIG-364:
    ------------------------------------

    Would the following work:

    If a limit is present, always add an MR stage with a single reducer, unless the whole plan is already terminated by a single reducer?
  • Daniel Dai (JIRA) at Sep 13, 2008 at 2:27 am
    [ https://issues.apache.org/jira/browse/PIG-364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630727#action_12630727 ]

    Daniel Dai commented on PIG-364:
    --------------------------------

    I see the problem. The largest requestedParallelism determines the number of reducers; I thought it was determined by the operator creating the map-reduce boundary. So I have to do this check after a map-reduce operator is completely compiled: if it contains a limit and requestedParallelism>1, add a single-reducer operator after it. Thank you for pointing this out.
  • Olga Natkovich (JIRA) at Sep 13, 2008 at 4:21 pm
    [ https://issues.apache.org/jira/browse/PIG-364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Olga Natkovich reassigned PIG-364:
    ----------------------------------

    Assignee: Daniel Dai (was: Shravan Matthur Narayanamurthy)
  • Daniel Dai (JIRA) at Sep 14, 2008 at 4:38 am
    [ https://issues.apache.org/jira/browse/PIG-364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Daniel Dai updated PIG-364:
    ---------------------------

    Attachment: PIG-364-2.patch

    The new patch adds an MRPlan visitor that runs after compiling. For each MapReduceOper, if it contains a limit and requestedParallelism>1, it inserts a MapReduceOper with 1 reducer after it.
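The rule this visitor applies can be sketched in a few lines. This is not the patch itself: `MROper`, its fields, and `adjust_limits` are hypothetical stand-ins, and the plan is simplified to a linear list of operators rather than a graph.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MROper:
    """Hypothetical stand-in for a compiled map-reduce operator."""
    name: str
    requested_parallelism: int = 1
    has_limit: bool = False
    limit: int = -1

def adjust_limits(plan: List[MROper]) -> List[MROper]:
    """Walk the compiled plan; after every operator that contains a limit
    and runs with more than one reducer, add a single-reducer operator."""
    adjusted = []
    for op in plan:
        adjusted.append(op)
        if op.has_limit and op.requested_parallelism > 1:
            adjusted.append(MROper(name=op.name + "-limit-adjust",
                                   requested_parallelism=1,
                                   has_limit=True,
                                   limit=op.limit))
    return adjusted

plan = [MROper("load"), MROper("group", requested_parallelism=10,
                               has_limit=True, limit=10)]
print([op.name for op in adjust_limits(plan)])
# ['load', 'group', 'group-limit-adjust']
```

Running the check after compilation matters because, as noted earlier in the thread, a later operator can still raise requestedParallelism while the plan is being built.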
  • Olga Natkovich (JIRA) at Sep 16, 2008 at 1:34 am
    [ https://issues.apache.org/jira/browse/PIG-364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Olga Natkovich updated PIG-364:
    -------------------------------

    Status: Patch Available (was: Open)
  • Shravan Matthur Narayanamurthy (JIRA) at Sep 16, 2008 at 9:16 pm
    [ https://issues.apache.org/jira/browse/PIG-364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12631554#action_12631554 ]

    Shravan Matthur Narayanamurthy commented on PIG-364:
    ----------------------------------------------------

    I see a few problems with the patch.
    1) When you insert the new limitAdjustMROp MapReduceOper into the plan, the case where multiple successors exist is not handled. For instance, if there is a split after the limit, you will only insert the limitMROp for the first outgoing edge.

    2) This is more of a semantic issue. With the approach in the patch, the semantics of limit do not hold. Consider the following:
    A = load 'URLs' as (url:string, pagerank:double);
    B = order A by pagerank parallel 100;
    C = limit B 10;
    D = foreach C generate url, CRAWL(url);
    store D into 'crawledpages';

    Here I would expect to crawl only the top 10 pages. However, with the current patch, I would probably crawl 1000 pages (10 from each of the 100 reducers) and then trim the result to 10. This might not be what users want.

    3) If we do decide to go with this approach after fixing 1, it is probably a good idea to introduce a limit operator into the map of limitAdjustMROp.
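The cost difference raised in point 2 can be illustrated by counting calls to an expensive per-tuple function. The sketch below is hypothetical throughout (`crawl_count`, the fake URLs, the stand-in `crawl`); it counts invocations only and does not model which tuples are the true top 10.

```python
def crawl_count(n_reducers, k, adjust_before_udf):
    """Return (number of crawl calls, number of tuples in the final output)."""
    calls = 0

    def crawl(url):
        # Stand-in for an expensive UDF such as CRAWL(url).
        nonlocal calls
        calls += 1
        return (url, "page for " + url)

    # After a parallel order-by, each reducer's local limit passes k tuples.
    survivors = [f"url-{r}-{i}" for r in range(n_reducers) for i in range(k)]
    if adjust_before_udf:
        survivors = survivors[:k]        # global limit applied before the UDF
    crawled = [crawl(u) for u in survivors]
    return calls, len(crawled[:k])       # output is trimmed to k either way

print(crawl_count(100, 10, adjust_before_udf=False))  # (1000, 10)
print(crawl_count(100, 10, adjust_before_udf=True))   # (10, 10)
```

Both variants return 10 tuples, so the difference is invisible in the output and shows up only as wasted work.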
  • Olga Natkovich (JIRA) at Sep 17, 2008 at 7:02 pm
    [ https://issues.apache.org/jira/browse/PIG-364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12631880#action_12631880 ]

    Olga Natkovich commented on PIG-364:
    ------------------------------------

    Let's fix (1) and leave (2) alone for now.
  • Olga Natkovich (JIRA) at Sep 18, 2008 at 5:30 am
    [ https://issues.apache.org/jira/browse/PIG-364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Olga Natkovich updated PIG-364:
    -------------------------------

    Status: Open (was: Patch Available)

    updated patch is needed
  • Daniel Dai (JIRA) at Sep 18, 2008 at 7:14 am
    [ https://issues.apache.org/jira/browse/PIG-364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12632121#action_12632121 ]

    Daniel Dai commented on PIG-364:
    --------------------------------

    Regarding Shravan's comments:
    1) Can you give a script? I tested several split-after-limit cases and didn't get this error.
    2) As discussed in [PIG-171|https://issues.apache.org/jira/browse/PIG-171], there seems to be no simple solution, so we will leave it as it is for now.
    3) No problem, if this looks better.

    Thank you
  • Shravan Matthur Narayanamurthy (JIRA) at Sep 18, 2008 at 7:57 pm
    [ https://issues.apache.org/jira/browse/PIG-364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Shravan Matthur Narayanamurthy updated PIG-364:
    -----------------------------------------------

    Attachment: limitsplit.png
  • Shravan Matthur Narayanamurthy (JIRA) at Sep 18, 2008 at 8:07 pm
    [ https://issues.apache.org/jira/browse/PIG-364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Shravan Matthur Narayanamurthy updated PIG-364:
    -----------------------------------------------

    Attachment: (was: limitsplit.png)
  • Shravan Matthur Narayanamurthy (JIRA) at Sep 18, 2008 at 8:07 pm
    [ https://issues.apache.org/jira/browse/PIG-364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Shravan Matthur Narayanamurthy updated PIG-364:
    -----------------------------------------------

    Attachment: limitsplit.png
  • Shravan Matthur Narayanamurthy (JIRA) at Sep 18, 2008 at 8:11 pm
    [ https://issues.apache.org/jira/browse/PIG-364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12632365#action_12632365 ]

    Shravan Matthur Narayanamurthy commented on PIG-364:
    ----------------------------------------------------

    Consider the following script:
    {noformat}
    a = load 'file:/etc/passwd';
    b = limit a 10;
    c = filter b by 2>1 parallel 10;
    split c into c1 if 2>1, c2 if 2>1;
    d = group c1 by $0;
    e = group c2 by $0;
    f = group d by $0, e by $0;
    dump f;
    {noformat}

    This is a case where multiple MROps are generated at the split, as shown in the figure below, if I understand the code correctly.

    !https://issues.apache.org/jira/secure/attachment/12390410/limitsplit.png!

    Now, when the job controller sees this graph of MROps, it first schedules the LD MROp. Remember that the limit adjuster has now changed the output of this operator to some temporary file. At this point, the controller can schedule either the Lim Adj Op or the now-free 2-LRs Op, whose dependency has just been resolved. If it chooses to execute the 2-LRs Op, that operator tries to read the original output of the split, which doesn't exist since the Lim Adj Op hasn't run yet, and it fails. If it chooses the Lim Adj Op, things go fine.

    To avoid this, we need to disconnect all the successors, make the Lim Adj Op their predecessor, and connect the Lim Adj Op to LD, as indicated in the figure.

    Let me know if my understanding is wrong.
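The rewiring described here can be sketched on a plain adjacency map. The function and the successor names `succ-A` and `succ-B` are hypothetical placeholders, since the figure is not reproduced in this archive; only `LD` and the limit adjuster come from the comment.

```python
def insert_limit_adjuster(successors, op, adjuster):
    """Re-point every successor of `op` at `adjuster`, then connect
    op -> adjuster. `successors` maps an operator name to the list of
    operator names that depend on it."""
    old = successors.get(op, [])
    successors[adjuster] = list(old)   # adjuster inherits all outgoing edges
    successors[op] = [adjuster]        # op now feeds only the adjuster
    return successors

plan = {"LD": ["succ-A", "succ-B"], "succ-A": [], "succ-B": []}
insert_limit_adjuster(plan, "LD", "LimAdj")
print(plan["LD"])      # ['LimAdj']
print(plan["LimAdj"])  # ['succ-A', 'succ-B']
```

Once every former successor depends on the adjuster, a scheduler that respects dependencies cannot start any of them before the adjuster has written the output they read.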
  • Shravan Matthur Narayanamurthy (JIRA) at Sep 18, 2008 at 8:13 pm
    [ https://issues.apache.org/jira/browse/PIG-364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12632365#action_12632365 ]

    shravanmn edited comment on PIG-364 at 9/18/08 1:11 PM:
    -----------------------------------------------------------------------------

    Consider the following script:
    {noformat}
    a = load 'file:/etc/passwd';
    b = limit a 10;
    c = filter b by 2>1 parallel 10;
    split c into c1 if 2>1, c2 if 2>1;
    d = group c1 by $0;
    e = group c2 by $0;
    f = group d by $0, e by $0;
    dump f;
    {noformat}

    This is a case where multiple MROps are generated at the split, as shown in the figure below, if I understand the code correctly.

    !https://issues.apache.org/jira/secure/attachment/12390412/limitsplit.png!

    Now, when the job controller sees this graph of MROps, it first schedules the LD MROp. Remember that the limit adjuster has now changed the output of this operator to some temporary file. At this point, the controller can schedule either the Lim Adj Op or the now-free 2-LRs Op, whose dependency has just been resolved. If it chooses to execute the 2-LRs Op, that operator tries to read the original output of the split, which doesn't exist since the Lim Adj Op hasn't run yet, and it fails. If it chooses the Lim Adj Op, things go fine.

    To avoid this, we need to disconnect all the successors, make the Lim Adj Op their predecessor, and connect the Lim Adj Op to LD, as indicated in the figure.

    Let me know if my understanding is wrong.

  • Daniel Dai (JIRA) at Sep 19, 2008 at 6:05 am
    [ https://issues.apache.org/jira/browse/PIG-364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Daniel Dai updated PIG-364:
    ---------------------------

    Attachment: PIG-364-3.patch

    Hi, Shravan,
    Thank you for your detailed explanation. Here is the modified patch addressing your comments 1 and 3. I tested it using the following script:

    a = load 'studenttab10k';
    b = group a by $0 parallel 10;
    c = limit b 10;
    split c into c1 if $0 lt 'bob white', c2 if $0 gte 'bob white';
    c12 = group c1 by $0;
    c22 = group c2 by $0;
    c4 = union c12, c22;
    dump c4;

  • Daniel Dai (JIRA) at Sep 19, 2008 at 6:05 am
    [ https://issues.apache.org/jira/browse/PIG-364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Daniel Dai updated PIG-364:
    ---------------------------

    Status: Patch Available (was: Open)
  • Shravan Matthur Narayanamurthy (JIRA) at Sep 19, 2008 at 3:41 pm
    [ https://issues.apache.org/jira/browse/PIG-364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12632730#action_12632730 ]

    Shravan Matthur Narayanamurthy commented on PIG-364:
    ----------------------------------------------------

    Looks good to me. Thanks Daniel for incorporating all the comments!
  • Olga Natkovich (JIRA) at Sep 19, 2008 at 7:11 pm
    [ https://issues.apache.org/jira/browse/PIG-364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Olga Natkovich updated PIG-364:
    -------------------------------

    Resolution: Fixed
    Status: Resolved (was: Patch Available)

    patch committed. thanks, daniel and shravan!
