PERFORMANCE: streaming data to aggregate functions
--------------------------------------------------

Key: PIG-484
URL: https://issues.apache.org/jira/browse/PIG-484
Project: Pig
Issue Type: Improvement
Affects Versions: types_branch
Reporter: Olga Natkovich
Fix For: types_branch


Currently, for queries like

A = load 'data';
B = group A by $0;
C = foreach B generate group, MIN(A.$1), MAX(A.$1);

The data is first collected into a bag and only then passed to the aggregate functions. This is unnecessary and inefficient: in this case, the data can simply be streamed to the functions.
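
To make the difference concrete, here is a minimal sketch in plain Java. It does not use Pig's classes; `StreamingMinMax`, `minMaxViaBag` and `minMaxStreaming` are hypothetical names chosen for illustration. The first method copies every value of a group into a collection before aggregating, the second folds each value into running results as it arrives.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration only; none of these names come from Pig.
class StreamingMinMax {

    // Bag-style: materialize the whole group first, then scan it.
    // Memory use grows with the number of values in the group.
    static long[] minMaxViaBag(Iterator<Long> values) {
        List<Long> bag = new ArrayList<>();
        while (values.hasNext()) {
            bag.add(values.next());
        }
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (long v : bag) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        return new long[] {min, max};
    }

    // Streaming: update the running min and max per value.
    // Memory use is constant regardless of group size.
    static long[] minMaxStreaming(Iterator<Long> values) {
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        while (values.hasNext()) {
            long v = values.next();
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        return new long[] {min, max};
    }

    public static void main(String[] args) {
        List<Long> group = Arrays.asList(7L, 3L, 9L, 4L);
        long[] viaBag = minMaxViaBag(group.iterator());
        long[] streamed = minMaxStreaming(group.iterator());
        System.out.println(viaBag[0] + " " + viaBag[1]);
        System.out.println(streamed[0] + " " + streamed[1]);
    }
}
```

Both methods return the same result; only the amount of data held in memory at once differs.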

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


  • Pradeep Kamath (JIRA) at Nov 6, 2008 at 5:49 pm
    [ https://issues.apache.org/jira/browse/PIG-484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Pradeep Kamath updated PIG-484:
    -------------------------------

    Assignee: Pradeep Kamath
    Status: Patch Available (was: Open)

    Patch details:
    - The idea is to stream small bags (chunks of the data) to the "Initial" version of the algebraic function in the combiner. Previously, when the map produced an explosion of values, the single bag could grow large and cause costly spills. Sending small chunks to the aggregate function's "Initial" method avoids this.

    The code checks for a combine plan and, if one is present, replaces the POPackage and POForEach in that plan with POJoinPackage, a combination of the two that is customized for streaming small bags between the package and the foreach.
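
The chunking idea can be sketched in plain Java as follows. This is not the patch's code and does not use POJoinPackage or any Pig API; `ChunkedMin`, `initial`, `intermediate` and `CHUNK_SIZE` are hypothetical names, and the chunk size of 3 is an arbitrary assumption. The sketch relies only on MIN being algebraic, meaning partial results over chunks can be combined into the final result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; these names are illustrative, not Pig's classes.
class ChunkedMin {
    static final int CHUNK_SIZE = 3; // assumed size of each small bag

    // "Initial" step: reduce one small bag to a partial minimum.
    static long initial(List<Long> smallBag) {
        long min = Long.MAX_VALUE;
        for (long v : smallBag) {
            min = Math.min(min, v);
        }
        return min;
    }

    // Combining step: merge partial minimums into one partial minimum.
    static long intermediate(List<Long> partials) {
        return initial(partials);
    }

    // Feed values to initial() in small chunks, so that at most
    // CHUNK_SIZE values are buffered at any time.
    // Returns Long.MAX_VALUE for an empty input.
    static long minByChunks(Iterable<Long> values) {
        List<Long> chunk = new ArrayList<>(CHUNK_SIZE);
        long partial = Long.MAX_VALUE;
        for (long v : values) {
            chunk.add(v);
            if (chunk.size() == CHUNK_SIZE) {
                partial = intermediate(Arrays.asList(partial, initial(chunk)));
                chunk.clear();
            }
        }
        if (!chunk.isEmpty()) {
            // Flush the last, possibly shorter, chunk.
            partial = intermediate(Arrays.asList(partial, initial(chunk)));
        }
        return partial;
    }
}
```

Because only one small chunk and one partial result are held at a time, a group with a very large number of values never needs a single large bag.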


  • Pradeep Kamath (JIRA) at Nov 6, 2008 at 5:49 pm
    [ https://issues.apache.org/jira/browse/PIG-484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Pradeep Kamath updated PIG-484:
    -------------------------------

    Attachment: PIG-484.patch
  • Alan Gates (JIRA) at Nov 11, 2008 at 1:26 am
    [ https://issues.apache.org/jira/browse/PIG-484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Alan Gates updated PIG-484:
    ---------------------------

    Resolution: Fixed
    Status: Resolved (was: Patch Available)

    Patch checked in. I ran performance tests on large data and saw no significant changes. This is fine, as this change is more for scalability than performance.

Discussion Overview
group: dev
categories: pig, hadoop
posted: Oct 9, '08 at 7:11p
active: Nov 11, '08 at 1:26a
posts: 4
users: 1
website: pig.apache.org
