GFCross should allow the user to set the DEFAULT_PARALLELISM value
------------------------------------------------------------------

Key: PIG-1932
URL: https://issues.apache.org/jira/browse/PIG-1932
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.8.0
Reporter: Alan Gates
Priority: Minor


The internal UDF GFCross uses a final static int, DEFAULT_PARALLELISM, to determine how widely to spread the records in a cross. It is currently hard-wired to 96, and there are no comments in the code explaining how that value was chosen. Despite the name, this value is not necessarily related to the reduce parallelism controlled by the parallel clause. It controls how many artificial join key values are generated and how many times each record is duplicated before going through the join. The higher it is set, the more key values there are (and thus the less likely the cross is to run out of memory), but also the more times each record is duplicated in the map phase before being sent to the reduce phase.
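
As a rough illustration of the trade-off described above, here is a minimal Java sketch of one way such a spread could work for a two-input cross. It is not the GFCross source: the class name, the method names, and the choice of giving each input the smallest slot count g with g^n >= spread are assumptions made for this example. A record from one input picks one slot and is copied once per slot of the other input, so any left/right pair of records shares exactly one artificial key.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; names and scheme are assumptions, not Pig's code.
class CrossSpreadSketch {

    // Integer power, used to avoid floating-point rounding.
    static long power(int base, int exp) {
        long result = 1;
        for (int i = 0; i < exp; i++) {
            result *= base;
        }
        return result;
    }

    // Smallest g such that g^numInputs >= spread.
    static int slotsPerInput(int numInputs, int spread) {
        int g = 1;
        while (power(g, numInputs) < spread) {
            g++;
        }
        return g;
    }

    // Two-input case: a record from input 0 in slot s is copied to keys
    // (s,0)..(s,g-1); a record from input 1 in slot t goes to (0,t)..(g-1,t).
    static List<String> keysFor(int inputIndex, int slot, int g) {
        List<String> keys = new ArrayList<>();
        for (int other = 0; other < g; other++) {
            keys.add(inputIndex == 0 ? slot + "," + other : other + "," + slot);
        }
        return keys;
    }

    public static void main(String[] args) {
        int g = slotsPerInput(2, 96);
        System.out.println("slots per input: " + g);
        System.out.println("copies per record: " + power(g, 1));
        System.out.println("distinct keys: " + power(g, 2));

        // A left record in slot 3 and a right record in slot 7 meet at one key.
        List<String> shared = keysFor(0, 3, g);
        shared.retainAll(keysFor(1, 7, g));
        System.out.println("shared key: " + shared);
    }
}
```

Under these assumptions, a spread of 96 with two inputs gives 10 slots per input, so each record is copied 10 times across 100 distinct keys. Raising the spread shrinks the group behind each key but multiplies the copies shipped out of the map phase.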

We should leave the default value at 96 but allow a property to override it.

We cannot use a constructor argument here because the use of the UDF is not exposed to the user, who therefore has no opportunity to pass a constructor argument to it.
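
One way to read this proposal is to keep 96 as the fallback and consult a job property first. The sketch below assumes a property named example.cross.spread, which is invented for this example and is not a real Pig setting; the actual patch may use a different name and lookup mechanism. Missing, malformed, or non-positive values fall back to the default.

```java
import java.util.Properties;

// Illustrative sketch only; the property name is hypothetical.
class SpreadConfigSketch {

    // Default taken from the issue description.
    static final int DEFAULT_PARALLELISM = 96;

    // Hypothetical property key, invented for this sketch.
    static final String SPREAD_PROPERTY = "example.cross.spread";

    // Returns the configured spread, or the default if the property is
    // absent, not an integer, or not positive.
    static int resolveSpread(Properties props) {
        String raw = props.getProperty(SPREAD_PROPERTY);
        if (raw == null) {
            return DEFAULT_PARALLELISM;
        }
        try {
            int value = Integer.parseInt(raw.trim());
            return value > 0 ? value : DEFAULT_PARALLELISM;
        } catch (NumberFormatException e) {
            return DEFAULT_PARALLELISM;
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        System.out.println("unset: " + resolveSpread(props));
        props.setProperty(SPREAD_PROPERTY, "200");
        System.out.println("overridden: " + resolveSpread(props));
    }
}
```

Reading the value from a property rather than a constructor keeps the override available even though the user never instantiates the UDF directly.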

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

  • Alan Gates (JIRA) at Mar 24, 2011 at 5:49 pm
    [ https://issues.apache.org/jira/browse/PIG-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Alan Gates updated PIG-1932:
    ----------------------------

    Fix Version/s: 0.9.0
    Assignee: Alan Gates
    Status: Patch Available (was: Open)
  • Alan Gates (JIRA) at Mar 24, 2011 at 5:49 pm
    [ https://issues.apache.org/jira/browse/PIG-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Alan Gates updated PIG-1932:
    ----------------------------

    Attachment: PIG-1932.patch

    Unit tests pass. Results of test-patch:

    [exec] -1 overall.
    [exec]
    [exec] +1 @author. The patch does not contain any @author tags.
    [exec]
    [exec] +1 tests included. The patch appears to include 3 new or modified tests.
    [exec]
    [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
    [exec]
    [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
    [exec]
    [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
    [exec]
    [exec] -1 release audit. The applied patch generated 545 release audit warnings (more than the trunk's current 544 warnings).
    [exec]

    The new release audit warning is because I added a file.
  • Alan Gates (JIRA) at Mar 24, 2011 at 11:50 pm
    [ https://issues.apache.org/jira/browse/PIG-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Alan Gates updated PIG-1932:
    ----------------------------

    Status: Open (was: Patch Available)

    Daniel convinced me I should use the parallelism value from the cross, since what's really important about this is how many join groups it creates. You want to create enough groups to keep each reducer busy.
  • Alan Gates (JIRA) at Mar 29, 2011 at 5:19 pm
    [ https://issues.apache.org/jira/browse/PIG-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Alan Gates updated PIG-1932:
    ----------------------------

    Status: Patch Available (was: Open)
  • Alan Gates (JIRA) at Mar 29, 2011 at 5:19 pm
    [ https://issues.apache.org/jira/browse/PIG-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Alan Gates updated PIG-1932:
    ----------------------------

    Attachment: PIG-1932_2.patch

    Commit-test unit tests pass.

    [exec] -1 overall.
    [exec]
    [exec] +1 @author. The patch does not contain any @author tags.
    [exec]
    [exec] +1 tests included. The patch appears to include 3 new or modified tests.
    [exec]
    [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
    [exec]
    [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
    [exec]
    [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
    [exec]
    [exec] -1 release audit. The applied patch generated 552 release audit warnings (more than the trunk's current 550 warnings).
    [exec]
    [exec]
    [exec]

    The release audit warnings are due to the new file and to javadoc changes in GFCross.
  • Daniel Dai (JIRA) at Mar 29, 2011 at 5:53 pm
    [ https://issues.apache.org/jira/browse/PIG-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012579#comment-13012579 ]

    Daniel Dai commented on PIG-1932:
    ---------------------------------

    +1
  • Alan Gates (JIRA) at Mar 31, 2011 at 1:17 am
    [ https://issues.apache.org/jira/browse/PIG-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Alan Gates updated PIG-1932:
    ----------------------------

    Resolution: Fixed
    Status: Resolved (was: Patch Available)

    Patch 2 checked in.

Discussion Overview
Group: dev @
Categories: pig, hadoop
Posted: Mar 23, '11 at 12:54a
Active: Mar 31, '11 at 1:17a
Posts: 8
Users: 1
Website: pig.apache.org
