Grokbase Groups Pig dev August 2010
proactive-spill bags should share the memory allotted for it
------------------------------------------------------------

Key: PIG-1544
URL: https://issues.apache.org/jira/browse/PIG-1544
Project: Pig
Issue Type: Bug
Reporter: Thejas M Nair


Initially, proactive-spill bags were designed for use in (co)group (InternalCacheBag). They knew the total number of proactive bags present and shared the memory limit specified by the property pig.cachedbag.memusage.
But the two proactive bag implementations that were added later, InternalDistinctBag and InternalSortedBag, are not aware of the actual number of bags in use; their users always assume total-numbags = 3.

This needs to be fixed so that all proactive-spill bags share the memory limit.
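
To make the arithmetic behind the report concrete, here is a minimal Java sketch. It assumes the per-bag limit is a fraction of the maximum heap divided by a bag count; that formula, the class name BagMemoryShare, the method perBagLimit, and the sample values are all illustrative assumptions, not Pig's actual code.

```java
// Hypothetical sketch; these names are illustrative and are not Pig's API.
final class BagMemoryShare {

    /**
     * Bytes one proactive-spill bag may hold in memory before it spills.
     * memUsageFraction stands in for the value of pig.cachedbag.memusage.
     */
    static long perBagLimit(long maxHeapBytes, double memUsageFraction, int numBags) {
        if (numBags < 1) {
            throw new IllegalArgumentException("numBags must be at least 1");
        }
        long sharedPool = (long) (maxHeapBytes * memUsageFraction);
        return sharedPool / numBags;
    }

    public static void main(String[] args) {
        long heap = Runtime.getRuntime().maxMemory();
        double fraction = 0.1;  // assumed setting, for illustration only
        int assumedBags = 3;    // the fixed count described in the issue
        int liveBags = 6;       // hypothetical number of bags actually alive

        long limitUsed = perBagLimit(heap, fraction, assumedBags);
        long limitNeeded = perBagLimit(heap, fraction, liveBags);

        // With a stale count, the bags together may hold more than the shared pool.
        System.out.println("limit per bag with assumed count: " + limitUsed);
        System.out.println("limit per bag with live count:    " + limitNeeded);
        System.out.println("worst-case total with assumed count: " + limitUsed * liveBags);
    }
}
```

If more bags are alive than the assumed count, each bag still spills only at the larger limit, so the combined in-memory size can exceed the budget that the property was meant to enforce.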

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


  • Olga Natkovich (JIRA) at Aug 16, 2010 at 10:58 pm
    [ https://issues.apache.org/jira/browse/PIG-1544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899162#action_12899162 ]

    Olga Natkovich commented on PIG-1544:
    -------------------------------------

    One way to do this is to use InternalCacheBags only for the bags that we are aware of up front. Then we can have a visitor on the plan that counts the number of bags needed and divides the memory accordingly.
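
    The counting idea can be sketched roughly as follows. This is only an illustration: PlanNode and BagCounter are hypothetical names, the plan is modeled as a simple tree, and Pig's real plan and visitor classes are not used here.

    ```java
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical plan node; Pig's real physical operators are not modeled here.
    final class PlanNode {
        final String name;
        final boolean createsBag;
        final List<PlanNode> inputs = new ArrayList<>();

        PlanNode(String name, boolean createsBag, PlanNode... inputs) {
            this.name = name;
            this.createsBag = createsBag;
            for (PlanNode in : inputs) {
                this.inputs.add(in);
            }
        }
    }

    final class BagCounter {

        /** Counts the operators in the tree that are flagged as creating a bag. */
        static int countBagSites(PlanNode root) {
            int count = root.createsBag ? 1 : 0;
            for (PlanNode in : root.inputs) {
                count += countBagSites(in);
            }
            return count;
        }

        public static void main(String[] args) {
            PlanNode load = new PlanNode("load", false);
            PlanNode distinct = new PlanNode("distinct", true, load);
            PlanNode group = new PlanNode("group", true, distinct);
            // The result would be the divisor for the shared memory limit.
            System.out.println("bag-creating operators: " + countBagSites(group));
        }
    }
    ```

    The count is a static property of the plan, so it covers only the bags that are known up front.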
  • Thejas M Nair (JIRA) at Aug 17, 2010 at 12:55 am
    [ https://issues.apache.org/jira/browse/PIG-1544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899221#action_12899221 ]

    Thejas M Nair commented on PIG-1544:
    ------------------------------------

    Note that it will not always be possible to determine, at query plan generation time, the number of bags that will be present at any one time during query execution. For example, a UDF could collect several bags. But that use case is likely to be rare, so I don't think it needs to be considered in the memory size limit estimate. It should be sufficient to determine the number of places in the query plan where bags are created.



  • Olga Natkovich (JIRA) at Aug 17, 2010 at 1:21 am
    [ https://issues.apache.org/jira/browse/PIG-1544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899227#action_12899227 ]

    Olga Natkovich commented on PIG-1544:
    -------------------------------------

    We should not be using these bags for cases like UDFs, for exactly the reason you mention.
  • Thejas M Nair (JIRA) at Aug 17, 2010 at 7:03 pm
    [ https://issues.apache.org/jira/browse/PIG-1544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899526#action_12899526 ]

    Thejas M Nair commented on PIG-1544:
    ------------------------------------

    bq. We should not be using these bags for the cases like UDF for exactly the reason you are mentioning
    The case I had in mind was not one where the UDF creates proactive-spill bags, but one where the UDF takes bags as input, they happen to be of a proactive-spilling type, and the UDF retains bags from previous rows.

    Anyway, I have come up with a more realistic(?) use case where it is difficult to determine the number of proactive-spill bags that will be present at run time:

    {code}
    L = load 'f1' as ( c1 : int, b1 : bag{ } );
    F1 = foreach L { d = distinct b1; generate c1, d; }; -- InternalDistinctBag will be created here
    G = group F1 by c1 using 'merge'; -- This group-by could [1] accumulate several of these InternalDistinctBag objects
    F2 = foreach G generate ...
    {code}

    [1] This does not happen, because the query plan has a PORelationToExpressionProject after the result from PODistinct, which copies the bag. But it looks like we can optimize and get rid of that bag in this case.


  • Olga Natkovich (JIRA) at Aug 17, 2010 at 7:07 pm
    [ https://issues.apache.org/jira/browse/PIG-1544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899529#action_12899529 ]

    Olga Natkovich commented on PIG-1544:
    -------------------------------------

    So we should not use them in this case either. We should only use internal bags for things we know up front.
  • Thejas M Nair (JIRA) at Aug 19, 2010 at 4:16 pm
    [ https://issues.apache.org/jira/browse/PIG-1544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900334#action_12900334 ]

    Thejas M Nair commented on PIG-1544:
    ------------------------------------

    While computing the number of bags, we should remember to consider the multi-query case as well.
  • Thejas M Nair (JIRA) at Aug 19, 2010 at 8:30 pm
    [ https://issues.apache.org/jira/browse/PIG-1544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900440#action_12900440 ]

    Thejas M Nair commented on PIG-1544:
    ------------------------------------

    bq. While computing the number of bags, we should remember to consider the multi-query case as well.
    In the multi-query case, the sub-plans for each query are executed one at a time for a given tuple with large bags. So the number of large bags that can't be garbage collected would be similar to that of a single query.

    Another thing to keep in mind is that multiple bags working on a common input (in the case of distinct/order-by in a nested foreach) would share some or most of their memory with the input bag, because Pig does not create copies of the column objects.
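
    The sharing effect can be illustrated with plain Java collections. This is only an analogy: String objects stand in for column objects, and SharedElements and distinctOf are hypothetical names that are not part of Pig.

    ```java
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.LinkedHashSet;
    import java.util.List;

    // Hypothetical illustration; plain collections stand in for Pig's bags.
    final class SharedElements {

        /** Builds a de-duplicated list that references the input's own element objects. */
        static <T> List<T> distinctOf(List<T> input) {
            return new ArrayList<>(new LinkedHashSet<>(input));
        }

        public static void main(String[] args) {
            // new String(...) forces separate objects so the identity check is meaningful.
            List<String> inputBag = Arrays.asList(new String("x"), new String("y"), new String("x"));
            List<String> distinctBag = distinctOf(inputBag);

            // Reference equality: the derived container holds the same object, not a copy,
            // so it adds only its own bookkeeping on top of the input's memory.
            System.out.println(distinctBag.get(0) == inputBag.get(0));
        }
    }
    ```

    Counting each such bag at its full size would therefore overestimate the memory that is really held.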

  • Olga Natkovich (JIRA) at Sep 2, 2010 at 6:28 pm
    [ https://issues.apache.org/jira/browse/PIG-1544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905628#action_12905628 ]

    Olga Natkovich commented on PIG-1544:
    -------------------------------------

    I am going to take my previous comment back and say that we should make this work for UDFs as well. The main reason is that we don't have another way to make sure that UDFs do not run out of memory. One approach that Alan proposed was to have bags ask for memory when they are created, with a central broker in charge of the memory pool. The details of this, or whether there is a better approach, still need to be thought through.
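
    One possible reading of the broker idea is sketched below. The thread does not spell out the design, so everything here is an assumption: the MemoryBroker class, its methods, and the policy of splitting the pool evenly among registered bags are hypothetical.

    ```java
    // Hypothetical sketch of a central memory broker; not an existing Pig class.
    final class MemoryBroker {
        private final long poolBytes;
        private int registeredBags;

        MemoryBroker(long poolBytes) {
            this.poolBytes = poolBytes;
        }

        /** Called by a bag when it is created. */
        synchronized void register() {
            registeredBags++;
        }

        /** Called when a bag is cleared or is no longer in use. */
        synchronized void unregister() {
            if (registeredBags > 0) {
                registeredBags--;
            }
        }

        /** Assumed policy: split the pool evenly among the bags currently registered. */
        synchronized long currentLimitPerBag() {
            return registeredBags == 0 ? poolBytes : poolBytes / registeredBags;
        }

        public static void main(String[] args) {
            MemoryBroker broker = new MemoryBroker(900L);
            broker.register();
            broker.register();
            broker.register();
            // Each bag would compare its in-memory size with this limit before adding a tuple.
            System.out.println("limit per bag: " + broker.currentLimitPerBag());
        }
    }
    ```

    Because the limit follows the live bag count, it also covers bags whose number cannot be known at plan time, such as those retained by a UDF.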
  • Olga Natkovich (JIRA) at Sep 2, 2010 at 6:30 pm
    [ https://issues.apache.org/jira/browse/PIG-1544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Olga Natkovich updated PIG-1544:
    --------------------------------

    Assignee: Thejas M Nair
    Fix Version/s: 0.9.0

Posted: Aug 12, 2010 at 9:10 pm
Active: Sep 2, 2010 at 6:30 pm