PERFORMANCE: Distinct should make use of combiner to remove duplicate values from keys.
----------------------------------------------------------------------------------------

Key: PIG-450
URL: https://issues.apache.org/jira/browse/PIG-450
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: types_branch
Reporter: Alan Gates
Assignee: Alan Gates
Fix For: types_branch


In 2.0, distinct was improved by dropping the values in the map phase and passing only an empty tuple along with each key. It can be improved further by adding a combiner step that passes along only the first empty tuple for each key instead of all of them.
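
To illustrate the idea, here is a minimal sketch in plain Python, not the Pig or Hadoop implementation and not the contents of the patch. The function names (`map_phase`, `combine`, `reduce_phase`, `distinct`) and the sample splits are hypothetical. It simulates each map task emitting a key with an empty tuple, an optional combiner that keeps only the first empty tuple per key within one map task's output, and a reduce that emits each key once.

```python
from collections import defaultdict

EMPTY = ()  # stands in for the empty tuple passed along with each key


def map_phase(split):
    """Emit (key, empty tuple) for every record; the record itself is the key."""
    return [(record, EMPTY) for record in split]


def combine(pairs):
    """Keep only the first empty tuple seen for each key in one map task's output."""
    first_seen = {}
    for key, value in pairs:
        first_seen.setdefault(key, value)
    return list(first_seen.items())


def reduce_phase(shuffled):
    """Emit each key once, however many empty tuples arrived for it."""
    groups = defaultdict(list)
    for key, value in shuffled:
        groups[key].append(value)
    return sorted(groups)


def distinct(splits, use_combiner):
    """Return the distinct keys and the number of pairs sent from map to reduce."""
    shuffled = []
    for split in splits:
        pairs = map_phase(split)
        if use_combiner:
            pairs = combine(pairs)
        shuffled.extend(pairs)
    return reduce_phase(shuffled), len(shuffled)


# Hypothetical input: two map tasks, each reading one split.
splits = [["a", "b", "a", "a"], ["b", "b", "c", "a"]]
without, shuffled_without = distinct(splits, use_combiner=False)
with_combiner, shuffled_with = distinct(splits, use_combiner=True)
print(without, shuffled_without)
print(with_combiner, shuffled_with)
```

The result is the same either way; only the number of pairs crossing from map to reduce changes, and the saving grows as the number of distinct keys per map task shrinks.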

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


  • Alan Gates (JIRA) at Sep 23, 2008 at 9:20 pm
    [ https://issues.apache.org/jira/browse/PIG-450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Alan Gates updated PIG-450:
    ---------------------------

    Attachment: PIG-450.patch

    This patch adds a combiner step to distincts that just removes the duplicate values so that less data is carried across from map to reduce. Here are the resulting time differences (all times in seconds):
    ||Num records||Num keys||Num reducers||1.4||2.0||2.0 with this patch||
    |200M|60|1|2547|1388|142|
    |200M|16M|50|384|227|231|
    The main benefit is with a small number of keys, but there does not appear to be a penalty with a larger number of keys.
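
    As a quick check of the figures above, the ratios can be worked out directly. This is only arithmetic on the reported times; the variable names are illustrative.

```python
# Reported wall-clock times in seconds, copied from the table above:
# (num records, num keys, num reducers, 1.4, 2.0, 2.0 with this patch)
rows = [
    ("200M", "60", 1, 2547, 1388, 142),
    ("200M", "16M", 50, 384, 227, 231),
]

speedups = []
for records, keys, reducers, t14, t20, patched in rows:
    vs_14 = t14 / patched  # speedup of the patched build over 1.4
    vs_20 = t20 / patched  # speedup of the patched build over unpatched 2.0
    speedups.append((keys, vs_14, vs_20))
    print(f"{keys} keys: {vs_14:.2f}x vs 1.4, {vs_20:.2f}x vs 2.0")
```

    With 60 keys the patched run is roughly 9.8 times faster than unpatched 2.0; with 16M keys the ratio is about 0.98, a difference of 4 seconds on 227.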


  • Alan Gates (JIRA) at Sep 23, 2008 at 9:20 pm
    [ https://issues.apache.org/jira/browse/PIG-450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Alan Gates updated PIG-450:
    ---------------------------

    Status: Patch Available (was: Open)
  • Olga Natkovich (JIRA) at Sep 24, 2008 at 1:08 am
    [ https://issues.apache.org/jira/browse/PIG-450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12633986#action_12633986 ]

    Olga Natkovich commented on PIG-450:
    ------------------------------------

    +1
  • Alan Gates (JIRA) at Sep 24, 2008 at 6:16 pm
    [ https://issues.apache.org/jira/browse/PIG-450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Alan Gates updated PIG-450:
    ---------------------------

    Resolution: Fixed
    Status: Resolved (was: Patch Available)

    Patch checked in.

Discussion Overview
group: dev @
categories: pig, hadoop
posted: Sep 23, '08 at 7:32p
active: Sep 24, '08 at 6:16p
posts: 5
users: 1
website: pig.apache.org
