Grokbase Groups Pig dev April 2008
pig creates many small files when it spills
-------------------------------------------

Key: PIG-176
URL: https://issues.apache.org/jira/browse/PIG-176
Project: Pig
Issue Type: Bug
Reporter: Olga Natkovich
Assignee: Alan Gates


Currently, when Pig spills it can generate millions of small (under 128K) files. This is partly due to PIG-170, but even with that patch Pig can still try to spill small bags.

The proposal is to not spill small files. Alan told me that the logic is already there and we just need to bump the size limit.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

  • Pi Song (JIRA) at Apr 3, 2008 at 12:02 pm
    [ https://issues.apache.org/jira/browse/PIG-176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12585069#action_12585069 ]

    Pi Song commented on PIG-176:
    -----------------------------

    So, if the size is smaller than some threshold, don't spill, right? This is very easy to fix, but we will be able to reclaim a bit less memory than before, so some tasks will fail more often in exchange for others running faster. Is this acceptable?

    Probably the best way to go is to make it configurable, but PIG-111 isn't in yet. Sigh... I wish I had more time.
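
The trade-off described above can be sketched roughly as follows. This is a minimal illustration, not Pig's actual code: `SketchBag`, its size estimate, and the `spill(long)` signature are all hypothetical names invented for this example. A bag below the threshold simply reports that it freed nothing, so no tiny file is written, at the cost of reclaiming less memory.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a spillable bag; not Pig's actual class.
class SketchBag {
    private final List<String> contents = new ArrayList<>();
    private long memorySize = 0;

    void add(String tuple) {
        contents.add(tuple);
        // Crude estimate (assumption): 2 bytes per character.
        memorySize += 2L * tuple.length();
    }

    long getMemorySize() {
        return memorySize;
    }

    // Returns the estimated bytes freed; 0 means the bag stayed in memory.
    long spill(long sizeThreshold) {
        if (memorySize < sizeThreshold) {
            return 0; // too small: spilling would only create a tiny file
        }
        long freed = memorySize;
        contents.clear(); // a real implementation would write to disk first
        memorySize = 0;
        return freed;
    }
}

class SpillThresholdDemo {
    public static void main(String[] args) {
        long threshold = 1024; // hypothetical threshold in bytes
        SketchBag small = new SketchBag();
        small.add("a");
        SketchBag big = new SketchBag();
        for (int i = 0; i < 1000; i++) {
            big.add("some tuple payload");
        }
        System.out.println("small bag freed: " + small.spill(threshold));
        System.out.println("big bag freed: " + big.spill(threshold));
    }
}
```

With a higher threshold fewer files are created, but the memory held by all the skipped bags stays in use, which is the failure risk raised in the comment.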
  • Olga Natkovich (JIRA) at Apr 3, 2008 at 4:13 pm
    [ https://issues.apache.org/jira/browse/PIG-176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12585163#action_12585163 ]

    Olga Natkovich commented on PIG-176:
    ------------------------------------

    Pi,

    Running faster is part of it. The other part is not to fill up disks with tiny files, which causes disk fragmentation and also takes forever to clean up at the end of processing, though your suggestion of cleaning up as we go might help with that a bit.
  • Pi Song (JIRA) at Apr 7, 2008 at 1:35 pm
    [ https://issues.apache.org/jira/browse/PIG-176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12586385#action_12586385 ]

    Pi Song commented on PIG-176:
    -----------------------------

    Given that we now spill big bags first, my observation is that there are still cases where a big container bag is spilled, so its mContent becomes empty, but most of its inner bags' WeakReferences haven't been cleaned up by GC yet. In such cases, if we haven't freed up enough memory, those inner bags will be spilled unnecessarily (even though all their contents were already spilled in the big bag spill). There are possibly 2 simple ways to solve this:

    1) In SpillableMemoryManager, we try putting Thread.yield() between spills. This should allow some more time for GC to do more clean-up without degrading performance too much. However, if the main execution thread doesn't produce any bags (e.g. a map task where all keys and values are tuples and atomic data), this will give the main execution thread more time to use up more memory more quickly.

    2) Check the size of the current spillable being spilled. If it is larger than constant X, do a System.gc(). This is safer than (1), but because we explicitly call GC more often, it may have some impact on performance. However, since spilling small files is much slower than calling System.gc(), this approach should generally give better performance.

    I don't really have a processing task that incurs spilling that much. Can anyone please try (2) out?
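
Approach (2) can be sketched along these lines. This is only an illustration under stated assumptions: `SketchSpillable`, `SketchMemoryManager`, `FixedSizeBag`, and `gcActivationSize` are hypothetical names, not the classes in Pig's SpillableMemoryManager. The manager holds weak references, spills the biggest spillables first, and requests a GC after any spill larger than the activation size, so that inner bags which became garbage have a chance to be cleared before the loop reaches them.

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface; not Pig's actual Spillable.
interface SketchSpillable {
    long getMemorySize();
    long spill(); // returns estimated bytes freed
}

class FixedSizeBag implements SketchSpillable {
    private long size;
    FixedSizeBag(long size) { this.size = size; }
    public long getMemorySize() { return size; }
    public long spill() { long freed = size; size = 0; return freed; }
}

class SketchMemoryManager {
    private static final class Entry {
        final WeakReference<SketchSpillable> ref;
        final long size;
        Entry(WeakReference<SketchSpillable> ref, long size) {
            this.ref = ref;
            this.size = size;
        }
    }

    private final List<WeakReference<SketchSpillable>> spillables = new ArrayList<>();
    private final long gcActivationSize;
    int gcCalls = 0; // exposed only so the sketch can be inspected

    SketchMemoryManager(long gcActivationSize) {
        this.gcActivationSize = gcActivationSize;
    }

    void register(SketchSpillable s) {
        spillables.add(new WeakReference<>(s));
    }

    long freeAtLeast(long target) {
        // Snapshot sizes first so the sort order cannot change mid-sort.
        List<Entry> entries = new ArrayList<>();
        for (WeakReference<SketchSpillable> ref : spillables) {
            SketchSpillable s = ref.get();
            if (s != null) {
                entries.add(new Entry(ref, s.getMemorySize()));
            }
        }
        entries.sort((a, b) -> Long.compare(b.size, a.size)); // biggest first

        long freed = 0;
        for (Entry e : entries) {
            if (freed >= target) {
                break;
            }
            long spilled = spillOne(e.ref);
            freed += spilled;
            if (spilled > gcActivationSize) {
                // System.gc() is only a request; the JVM may ignore it.
                System.gc();
                gcCalls++;
            }
        }
        spillables.removeIf(ref -> ref.get() == null);
        return freed;
    }

    // Re-checks the reference at spill time: a bag cleared by an earlier
    // GC is skipped instead of being written out as a small file.
    private static long spillOne(WeakReference<SketchSpillable> ref) {
        SketchSpillable s = ref.get();
        return (s == null) ? 0 : s.spill();
    }
}
```

Because `System.gc()` is advisory, this sketch cannot guarantee that unreachable inner bags are cleared before they are visited; it only makes that more likely than spilling back to back.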
  • Pi Song (JIRA) at Apr 9, 2008 at 2:33 pm
    [ https://issues.apache.org/jira/browse/PIG-176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Pi Song updated PIG-176:
    ------------------------

    Attachment: pig_176_smallbags_v1.patch

    This patch implements (1) a spill file size threshold and (2) my idea from the last comment.

    "spill.size.threshold" and "spill.gc.activation.size" must be set as JVM parameters or in .pigrc in order to use this new feature. The default values are 0 and Long.MAX_VALUE respectively.

    There is a small problem with (1): Bag.getMemorySize() sometimes doesn't return an accurate value, so even when the threshold is set, it's still possible that files smaller than the threshold are created.

    The configuration code is still messy in MapReduceLauncher. This needs a clean-up after the configuration patch gets in.
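
For illustration, the two settings named above could be read roughly like this. The property names and defaults come from the comment; the `SpillSettings` class, the `readLong` helper, and the choice to fall back to the default on a malformed value are assumptions made for this sketch and are not taken from the patch.

```java
import java.util.Properties;

// Hypothetical holder for the two settings; not a class from the patch.
class SpillSettings {
    static final String SIZE_THRESHOLD_KEY = "spill.size.threshold";
    static final String GC_ACTIVATION_KEY = "spill.gc.activation.size";

    final long sizeThreshold;
    final long gcActivationSize;

    SpillSettings(Properties props) {
        // Defaults as stated in the comment: 0 and Long.MAX_VALUE,
        // which leave both features effectively switched off.
        sizeThreshold = readLong(props, SIZE_THRESHOLD_KEY, 0L);
        gcActivationSize = readLong(props, GC_ACTIVATION_KEY, Long.MAX_VALUE);
    }

    private static long readLong(Properties props, String key, long defaultValue) {
        String raw = props.getProperty(key);
        if (raw == null) {
            return defaultValue;
        }
        try {
            return Long.parseLong(raw.trim());
        } catch (NumberFormatException e) {
            // Assumption: ignore malformed values rather than fail the job.
            return defaultValue;
        }
    }
}

class SpillSettingsDemo {
    public static void main(String[] args) {
        // JVM parameters such as -Dspill.size.threshold=131072 show up here.
        SpillSettings settings = new SpillSettings(System.getProperties());
        System.out.println("threshold: " + settings.sizeThreshold);
        System.out.println("gc activation: " + settings.gcActivationSize);
    }
}
```

Taking a `Properties` argument rather than calling `System.getProperties()` inside the class means the same code works with either JVM parameters or a separately loaded configuration object.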
  • Alan Gates (JIRA) at Apr 17, 2008 at 3:52 pm
    [ https://issues.apache.org/jira/browse/PIG-176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12590074#action_12590074 ]

    Alan Gates commented on PIG-176:
    --------------------------------

    Pi,

    Did you want to rework this patch now that PIG-111 is in, so that you can read the properties from Pig's Properties object rather than System.getProperties()?
  • Pi Song (JIRA) at Apr 17, 2008 at 11:14 pm
    [ https://issues.apache.org/jira/browse/PIG-176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12590238#action_12590238 ]

    Pi Song commented on PIG-176:
    -----------------------------

    OK, will do that.
  • Pi Song (JIRA) at Apr 18, 2008 at 2:33 pm
    [ https://issues.apache.org/jira/browse/PIG-176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Pi Song updated PIG-176:
    ------------------------

    Attachment: pig176_v2.patch

    Updated to the latest trunk; it now makes use of the new configuration structure.
  • Alan Gates (JIRA) at May 2, 2008 at 9:00 pm
    [ https://issues.apache.org/jira/browse/PIG-176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Alan Gates resolved PIG-176.
    ----------------------------

    Resolution: Fixed
    Fix Version/s: 0.1.0

    Patch checked in at revision 652906.

Thread posted Apr 1, '08 at 12:40a; last active May 2, '08 at 9:00p.