PERFORMANCE: Implement a map-side group operator to speed up processing of ordered data
----------------------------------------------------------------------------------------

Key: PIG-984
URL: https://issues.apache.org/jira/browse/PIG-984
Project: Pig
Issue Type: New Feature
Reporter: Richard Ding


The general group by operation in Pig needs both mappers and reducers (the aggregation is done in reducers). This incurs disk writes/reads between mappers and reducers.

However, in the cases where the input data has the following properties

1. The records with the same key are grouped together (for example, when the data is sorted by the keys).
2. The records with the same key are in the same mapper input.

the group by operation can be performed in the mappers only and thus remove the overhead of disk writes/reads.

Alan proposed adding a hint to the group by clause like this one:

{code}
A = load 'input' using SomeLoader(...);
B = group A by $0 using "mapside";
C = foreach B generate ...
{code}

The proposed using "mapside" clause will select a map-side group operator that collects all records for a given key into a buffer. When it sees a key change, it will emit the key and the bag of records it has buffered. It will assume that all records for a given key are collected together, and thus there is no need to buffer across keys.
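
The buffering behavior described above can be illustrated with a minimal sketch. This is not Pig's implementation; it is a plain Python generator written only to show the "flush on key change" idea. The names `collected_group` and `key_fn` are hypothetical, and the sketch assumes each record is a tuple whose first field ($0) is the group key.

```python
from typing import Any, Callable, Iterable, Iterator, List, Tuple

Record = Tuple[Any, ...]


def collected_group(
    records: Iterable[Record],
    key_fn: Callable[[Record], Any] = lambda r: r[0],
) -> Iterator[Tuple[Any, List[Record]]]:
    """Yield (key, bag) pairs, flushing the buffer whenever the key changes.

    Hypothetical helper: it trusts the caller that records with the same
    key are adjacent, exactly as the proposed operator would.
    """
    buffer: List[Record] = []
    current_key = None
    have_key = False  # distinguishes "no record seen yet" from a None key
    for record in records:
        key = key_fn(record)
        if have_key and key != current_key:
            # Key change: emit everything buffered for the previous key.
            yield current_key, buffer
            buffer = []
        current_key = key
        have_key = True
        buffer.append(record)
    if have_key:
        # End of this mapper's input: emit the last buffered group.
        yield current_key, buffer


# One mapper's input, already grouped by the first field ($0).
split = [("a", 1), ("a", 2), ("b", 3), ("c", 4), ("c", 5)]
groups = list(collected_group(split))
```

Only one group is held in memory at a time, which is why no buffering across keys is needed when the input really is grouped.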

It is expected that "SomeLoader" will be implemented by data systems such as Zebra to ensure the data emitted by the loader satisfies the above properties (1) and (2).

It will be the responsibility of the user (or the loader) to guarantee these properties (1) & (2) before invoking the mapside hint for the group by clause. The Pig runtime can't check for the errors in the input data.
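
Since the runtime does not validate the input, a user might want to check properties (1) and (2) offline on a sample before relying on the hint. The following is a sketch under stated assumptions, not part of Pig: `find_violations` is a hypothetical helper, each mapper input is modeled as a list of tuples, the key is the first field, and keys are assumed to be hashable.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple

Record = Tuple[Any, ...]

_NO_KEY = object()  # sentinel: no record has been seen yet in this split


def find_violations(
    splits: Sequence[Sequence[Record]],
    key_fn: Callable[[Record], Any] = lambda r: r[0],
) -> List[str]:
    """Report keys that break property (1) or (2) across the given splits."""
    violations: List[str] = []
    first_split: Dict[Any, int] = {}  # key -> first split it appeared in
    for index, split in enumerate(splits):
        seen = set()  # keys whose run has started in this split
        previous = _NO_KEY
        for record in split:
            key = key_fn(record)
            if key == previous:
                continue  # still inside the same run of equal keys
            if key in seen:
                # The key's run ended earlier and now starts again.
                violations.append(
                    f"property 1: key {key!r} reappears in split {index}"
                )
            elif first_split.setdefault(key, index) != index:
                # The key was already seen in a different mapper input.
                violations.append(
                    f"property 2: key {key!r} is in splits "
                    f"{first_split[key]} and {index}"
                )
            seen.add(key)
            previous = key
    return violations


grouped = [[("a", 1), ("a", 2), ("b", 3)], [("c", 4)]]
scattered = [[("a", 1), ("b", 2), ("a", 3)]]
straddling = [[("a", 1), ("b", 2)], [("b", 3), ("c", 4)]]

grouped_report = find_violations(grouped)
scattered_report = find_violations(scattered)
straddling_report = find_violations(straddling)
```

The `straddling` case shows how sorted data can still break property (2): the input is sorted overall, but key "b" is cut across two mapper inputs, so a map-only group would emit two partial bags for it.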

For the group by clauses with mapside hint, Pig Latin will only support group by columns (including *), not group by expressions nor group all.


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


  • Thejas M Nair (JIRA) at Sep 30, 2009 at 8:20 pm
    [ https://issues.apache.org/jira/browse/PIG-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760992#action_12760992 ]

    Thejas M Nair commented on PIG-984:
    -----------------------------------

    bq. For the group by clauses with mapside hint, Pig Latin will only support group by columns (including *), not group by expressions nor group all.

    Some transformations on the group-by keys might still preserve the two properties (1,2). So expressions should also be supported. As with the columns case, users need to know for sure that the two properties are satisfied.

  • Santhosh Srinivasan (JIRA) at Sep 30, 2009 at 10:23 pm
    [ https://issues.apache.org/jira/browse/PIG-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761028#action_12761028 ]

    Santhosh Srinivasan commented on PIG-984:
    -----------------------------------------

    A couple of things:

    1. I am concerned about extending the language for supporting features that can be handled internally. The scope of the language has not been defined but the language continues to evolve.

    2. I agree with Thejas' comment about allowing expressions that do not alter the property. Pig will not be able to check that but it is no different from being able to check if the data is sorted or not.
  • Richard Ding (JIRA) at Sep 30, 2009 at 11:59 pm
    [ https://issues.apache.org/jira/browse/PIG-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761049#action_12761049 ]

    Richard Ding commented on PIG-984:
    ----------------------------------

    The reason that no expressions are allowed now is that we're trying to make this new feature easy to support (and debug). When there are use cases that require the group by expression on map-side, this restriction can easily be lifted.


  • Dmitriy V. Ryaboy (JIRA) at Oct 1, 2009 at 1:15 am
    [ https://issues.apache.org/jira/browse/PIG-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761070#action_12761070 ]

    Dmitriy V. Ryaboy commented on PIG-984:
    ---------------------------------------

    Good idea.

    It should be straightforward to look at the sort info associated with the ResourceSchema (see the load/store proposal) to know whether the data is sorted; this frees us from relying on loaders, lets us follow ORDER BYs and LIMITs, etc.

    Still, this is not quite safe unless you know that the distribution key is a subset of your group key. A simple sorted input stream can still be split among mappers with some rows with the same key going to one, and some to the other. Do you have thoughts on how to handle such cases?

    This is something that can be inferred looking at the schema and distribution key. I understand wanting a manual handle to turn on the behavior while developing, but the production version of this can be done automatically ( "if distributed by and sorted on a subset of group keys, apply map-side group" rule in the optimizer).
  • Santhosh Srinivasan (JIRA) at Oct 1, 2009 at 1:23 am
    [ https://issues.apache.org/jira/browse/PIG-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761073#action_12761073 ]

    Santhosh Srinivasan commented on PIG-984:
    -----------------------------------------

    bq. This is something that can be inferred looking at the schema and distribution key. I understand wanting a manual handle to turn on the behavior while developing, but the production version of this can be done automatically ( "if distributed by and sorted on a subset of group keys, apply map-side group" rule in the optimizer).

    +1. That's what I meant when I said:

    bq. 1. I am concerned about extending the language for supporting features that can be handled internally. The scope of the language has not been defined but the language continues to evolve.
  • Alan Gates (JIRA) at Oct 1, 2009 at 4:31 pm
    [ https://issues.apache.org/jira/browse/PIG-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761257#action_12761257 ]

    Alan Gates commented on PIG-984:
    --------------------------------

    The controlling philosophic point here is that pigs are domestic animals (see http://wiki.apache.org/pig/PigPhilosophy). Just as in join, where we have exposed all possible join implementations to the user, we want to do the same with this new feature. At some future point when we have a capable optimizer, we will try to select the best type of join, and try to select this form of grouping when it's appropriate. But even then, we want to expose this functionality to the user directly because the optimizer may not have access to the necessary information to determine the best grouping choice (e.g., data sources with no schema). And we don't want to wait until the optimizer can handle these things to start exposing it.

    I don't agree with Santhosh's assertion that the language is evolving with no definition. I agree we do not yet have a comprehensive definition of Pig Latin, which we need. But this is in line with what we've done for joins, philosophically, semantically, and syntactically.
  • Santhosh Srinivasan (JIRA) at Oct 1, 2009 at 5:01 pm
    [ https://issues.apache.org/jira/browse/PIG-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761270#action_12761270 ]

    Santhosh Srinivasan commented on PIG-984:
    -----------------------------------------

    bq. But this is in line with what we've done for joins, philosophically, semantically, and syntactically.

    Not exactly; with joins we are exposing different kinds of joins. Here we are exposing the underlying aspects of the framework (mapside). If there is a parallel framework that does not do map-reduce then having mapside in the language is philosophically and semantically not correct.
  • Alan Gates (JIRA) at Oct 1, 2009 at 5:28 pm
    [ https://issues.apache.org/jira/browse/PIG-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761276#action_12761276 ]

    Alan Gates commented on PIG-984:
    --------------------------------

    I'm fine with changing the name from 'mapside' to 'collected' or something similar. I see your point that exposing the term 'mapside' is bad because it is Hadoop-specific. But I think the overall idea of allowing the user to select the type of grouping is good.
  • Richard Ding (JIRA) at Oct 5, 2009 at 7:12 pm
    [ https://issues.apache.org/jira/browse/PIG-984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Richard Ding updated PIG-984:
    -----------------------------

    Assignee: Richard Ding
  • Richard Ding (JIRA) at Oct 9, 2009 at 10:32 pm
    [ https://issues.apache.org/jira/browse/PIG-984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Richard Ding updated PIG-984:
    -----------------------------

    Status: Patch Available (was: Open)
  • Richard Ding (JIRA) at Oct 9, 2009 at 10:33 pm
    [ https://issues.apache.org/jira/browse/PIG-984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Richard Ding updated PIG-984:
    -----------------------------

    Attachment: PIG-984.patch

    Based on the feedback, the hint is changed to 'using "collected"'. It's now implemented as a map-side group operation.

    This patch contains the implementation.

  • Hadoop QA (JIRA) at Oct 10, 2009 at 4:01 am
    [ https://issues.apache.org/jira/browse/PIG-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764290#action_12764290 ]

    Hadoop QA commented on PIG-984:
    -------------------------------

    -1 overall. Here are the results of testing the latest attachment
    http://issues.apache.org/jira/secure/attachment/12421781/PIG-984.patch
    against trunk revision 823693.

    +1 @author. The patch does not contain any @author tags.

    +1 tests included. The patch appears to include 3 new or modified tests.

    +1 javadoc. The javadoc tool did not generate any warning messages.

    -1 javac. The applied patch generated 413 javac compiler warnings (more than the trunk's current 411 warnings).

    -1 findbugs. The patch appears to introduce 3 new Findbugs warnings.

    +1 release audit. The applied patch does not increase the total number of release audit warnings.

    +1 core tests. The patch passed core unit tests.

    +1 contrib tests. The patch passed contrib unit tests.

    Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/69/testReport/
    Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/69/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
    Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/69/console

    This message is automatically generated.
  • Richard Ding (JIRA) at Oct 12, 2009 at 9:20 pm
    [ https://issues.apache.org/jira/browse/PIG-984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Richard Ding updated PIG-984:
    -----------------------------

    Status: Open (was: Patch Available)
  • Richard Ding (JIRA) at Oct 12, 2009 at 9:22 pm
    [ https://issues.apache.org/jira/browse/PIG-984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Richard Ding updated PIG-984:
    -----------------------------

    Attachment: PIG-984_1.patch

    This patch fixed the above QA errors.
  • Richard Ding (JIRA) at Oct 12, 2009 at 9:22 pm
    [ https://issues.apache.org/jira/browse/PIG-984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Richard Ding updated PIG-984:
    -----------------------------

    Status: Patch Available (was: Open)
  • Santhosh Srinivasan (JIRA) at Oct 12, 2009 at 9:31 pm
    [ https://issues.apache.org/jira/browse/PIG-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764846#action_12764846 ]

    Santhosh Srinivasan commented on PIG-984:
    -----------------------------------------

    Very quick comment: the parser has a log.info call that should be converted to log.debug.

    Index: src/org/apache/pig/impl/logicalLayer/parser/QueryParser.jjt
    ===================================================================


    + [<USING> ("\"collected\"" {
    + log.info("Using mapside");

  • Hadoop QA (JIRA) at Oct 13, 2009 at 12:29 am
    [ https://issues.apache.org/jira/browse/PIG-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764902#action_12764902 ]

    Hadoop QA commented on PIG-984:
    -------------------------------

    +1 overall. Here are the results of testing the latest attachment
    http://issues.apache.org/jira/secure/attachment/12421908/PIG-984_1.patch
    against trunk revision 824446.

    +1 @author. The patch does not contain any @author tags.

    +1 tests included. The patch appears to include 3 new or modified tests.

    +1 javadoc. The javadoc tool did not generate any warning messages.

    +1 javac. The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs. The patch does not introduce any new Findbugs warnings.

    +1 release audit. The applied patch does not increase the total number of release audit warnings.

    +1 core tests. The patch passed core unit tests.

    +1 contrib tests. The patch passed contrib unit tests.

    Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/73/testReport/
    Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/73/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
    Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/73/console

    This message is automatically generated.
  • Richard Ding (JIRA) at Oct 15, 2009 at 6:13 pm
    [ https://issues.apache.org/jira/browse/PIG-984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Richard Ding updated PIG-984:
    -----------------------------

    Status: Open (was: Patch Available)
  • Richard Ding (JIRA) at Oct 15, 2009 at 6:14 pm
    [ https://issues.apache.org/jira/browse/PIG-984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Richard Ding updated PIG-984:
    -----------------------------

    Attachment: PIG-984_1.patch

    This patch removes the log.info message from the parser.
  • Richard Ding (JIRA) at Oct 15, 2009 at 6:14 pm
    [ https://issues.apache.org/jira/browse/PIG-984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Richard Ding updated PIG-984:
    -----------------------------

    Status: Patch Available (was: Open)
  • Hadoop QA (JIRA) at Oct 16, 2009 at 4:51 am
    [ https://issues.apache.org/jira/browse/PIG-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12766409#action_12766409 ]

    Hadoop QA commented on PIG-984:
    -------------------------------

    -1 overall. Here are the results of testing the latest attachment
    http://issues.apache.org/jira/secure/attachment/12422255/PIG-984_1.patch
    against trunk revision 825712.

    +1 @author. The patch does not contain any @author tags.

    +1 tests included. The patch appears to include 3 new or modified tests.

    +1 javadoc. The javadoc tool did not generate any warning messages.

    +1 javac. The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs. The patch appears to cause Findbugs to fail.

    +1 release audit. The applied patch does not increase the total number of release audit warnings.

    -1 core tests. The patch failed core unit tests.

    +1 contrib tests. The patch passed contrib unit tests.

    Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/85/testReport/
    Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/85/console

    This message is automatically generated.
  • Richard Ding (JIRA) at Oct 16, 2009 at 6:20 pm
    [ https://issues.apache.org/jira/browse/PIG-984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Richard Ding updated PIG-984:
    -----------------------------

    Status: Open (was: Patch Available)
  • Richard Ding (JIRA) at Oct 16, 2009 at 6:36 pm
    [ https://issues.apache.org/jira/browse/PIG-984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Richard Ding updated PIG-984:
    -----------------------------

    Attachment: PIG-984_1.patch

    Fix the compile errors.
    PERFORMANCE: Implement a map-side group operator to speed up processing of ordered data
    ----------------------------------------------------------------------------------------

    Key: PIG-984
    URL: https://issues.apache.org/jira/browse/PIG-984
    Project: Pig
    Issue Type: New Feature
    Reporter: Richard Ding
    Assignee: Richard Ding
    Attachments: PIG-984.patch, PIG-984_1.patch, PIG-984_1.patch, PIG-984_1.patch


  • Richard Ding (JIRA) at Oct 16, 2009 at 6:36 pm
    [ https://issues.apache.org/jira/browse/PIG-984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Richard Ding updated PIG-984:
    -----------------------------

    Status: Patch Available (was: Open)
  • Hadoop QA (JIRA) at Oct 17, 2009 at 7:52 pm
    [ https://issues.apache.org/jira/browse/PIG-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12766951#action_12766951 ]

    Hadoop QA commented on PIG-984:
    -------------------------------

    -1 overall. Here are the results of testing the latest attachment
    http://issues.apache.org/jira/secure/attachment/12422386/PIG-984_1.patch
    against trunk revision 826110.

    +1 @author. The patch does not contain any @author tags.

    +1 tests included. The patch appears to include 3 new or modified tests.

    -1 patch. The patch command could not apply the patch.

    Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/94/console

    This message is automatically generated.
  • Alan Gates (JIRA) at Oct 22, 2009 at 7:34 pm
    [ https://issues.apache.org/jira/browse/PIG-984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Alan Gates updated PIG-984:
    ---------------------------

    Resolution: Fixed
    Status: Resolved (was: Patch Available)

    Patch committed. Thanks Richard.

Discussion Overview
group: dev @
categories: pig, hadoop
posted: Sep 30, '09 at 7:09p
active: Oct 22, '09 at 7:34p
posts: 27
users: 1
website: pig.apache.org
