JOIN and cogroup should handle NULLs correctly
----------------------------------------------

Key: PIG-361
URL: https://issues.apache.org/jira/browse/PIG-361
Project: Pig
Issue Type: Sub-task
Affects Versions: types_branch
Reporter: Pradeep Kamath
Fix For: types_branch


JOIN should follow SQL semantics, i.e., if the join key is null, or part of the join key is null, in the first table, it should not join with similar keys in the second table.

Cogroup should coalesce all NULL key rows into one group.
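
The JOIN behaviour requested above can be sketched in a few lines of Python. This is only an illustration of SQL-style NULL handling, not Pig's implementation: `sql_join`, `has_null`, and `key_of` are hypothetical names, rows are plain tuples, and `None` stands in for NULL.

```python
def has_null(key):
    """True if the key is NULL or, for a composite key, any part is NULL."""
    if isinstance(key, tuple):
        return any(part is None for part in key)
    return key is None


def sql_join(left, right, key_of):
    """Inner join that never matches rows whose join key contains NULL."""
    index = {}
    for row in right:
        k = key_of(row)
        if not has_null(k):
            index.setdefault(k, []).append(row)

    out = []
    for row in left:
        k = key_of(row)
        if has_null(k):
            continue  # SQL semantics: NULL never equals NULL
        for match in index.get(k, []):
            out.append(row + match)
    return out


A = [(1, "a"), (None, "b"), (2, "c")]
B = [(1, "x"), (None, "y"), (3, "z")]
joined = sql_join(A, B, key_of=lambda r: r[0])
print(joined)  # [(1, 'a', 1, 'x')]
```

The rows with a NULL key in A and B are both dropped rather than matched with each other. The same holds when only one part of a composite key is NULL.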

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


  • Olga Natkovich (JIRA) at Aug 18, 2008 at 11:22 pm
    [ https://issues.apache.org/jira/browse/PIG-361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Olga Natkovich reassigned PIG-361:
    ----------------------------------

    Assignee: Alan Gates
  • Olga Natkovich (JIRA) at Sep 4, 2008 at 1:10 am
    [ https://issues.apache.org/jira/browse/PIG-361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Olga Natkovich reassigned PIG-361:
    ----------------------------------

    Assignee: Shravan Matthur Narayanamurthy (was: Alan Gates)

    Shravan, could you take a look, please?

    I think we want to preserve SQL semantics here:

    JOIN / INNER COGROUP: throw away all NULLs.
    OUTER COGROUP + flatten with bincond: simulates outer joins, where missing data is padded with NULLs and NULLs are assumed to be all different, so they are never multiplied.
  • Olga Natkovich (JIRA) at Sep 4, 2008 at 10:00 pm
    [ https://issues.apache.org/jira/browse/PIG-361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Olga Natkovich reassigned PIG-361:
    ----------------------------------

    Assignee: Alan Gates (was: Shravan Matthur Narayanamurthy)

    Reassigning back to Alan since he is looking into join optimization.
  • Olga Natkovich (JIRA) at Sep 4, 2008 at 10:00 pm
    [ https://issues.apache.org/jira/browse/PIG-361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628477#action_12628477 ]

    Olga Natkovich commented on PIG-361:
    ------------------------------------

    After having further discussion, here is what I think is the right thing to do:

    (1) Cogroup distinguishes between NULL keys from different relations by creating separate records

    A = load ...
    B = load ...
    C = cogroup A by $0, B by $0;
    ...

    Assuming that both A and B contain null values in the key column, C would look as follows:

    {
    ....
    NULL, {.....}, {}
    NULL, {}, {...}
    ....
    }

    The first record corresponds to all records of A with a NULL key, and the second to all records of B with a NULL key.

    (2) This is consistent with the SQL semantics that NULLs are never equal to each other. JOIN will then work as is, and an outer join expressed as COGROUP + FOREACH with bincond will work as it did in earlier versions.

    (3) The required work is to add the relation id to the comparison function. Join optimization already does that. We will try to piggyback this issue onto join optimization.
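
A rough Python sketch of the proposal above may help, with the caveat that it simulates the semantics only and says nothing about how Pig implements them. `cogroup` and `outer_join` are hypothetical helpers, `None` stands in for NULL, and the `width_a` / `width_b` arguments (the number of fields to pad with) are assumptions made for the example.

```python
def cogroup(a, b, key_of):
    """Cogroup two relations; NULL keys are kept apart per relation."""
    groups = {}
    for rel_id, relation in enumerate((a, b)):
        for row in relation:
            k = key_of(row)
            # Assumption: a NULL key is tagged with the relation id so that
            # NULLs from different inputs never land in the same group.
            group_key = (k, rel_id) if k is None else (k, None)
            bags = groups.setdefault(group_key, ([], []))
            bags[rel_id].append(row)
    return [(k, bag_a, bag_b) for (k, _), (bag_a, bag_b) in groups.items()]


def outer_join(cogrouped, width_a, width_b):
    """Simulate COGROUP + FOREACH with bincond: pad empty bags with NULLs."""
    out = []
    for _, bag_a, bag_b in cogrouped:
        left = bag_a or [(None,) * width_a]
        right = bag_b or [(None,) * width_b]
        out.extend(l + r for l in left for r in right)
    return out


A = [(1, "a"), (None, "b"), (None, "c")]
B = [(1, "x"), (None, "y")]
C = cogroup(A, B, key_of=lambda r: r[0])
outer = outer_join(C, width_a=2, width_b=2)
for row in outer:
    print(row)
```

With separate NULL groups the outer join yields four rows, and the NULL-keyed rows of A are padded rather than paired with the NULL-keyed row of B. Had the NULL keys been coalesced into one group, the flatten step would have multiplied them into (b, y) and (c, y).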
  • Olga Natkovich (JIRA) at Sep 4, 2008 at 11:52 pm
    [ https://issues.apache.org/jira/browse/PIG-361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Olga Natkovich updated PIG-361:
    -------------------------------

    Priority: Critical (was: Major)
  • Alan Gates (JIRA) at Sep 18, 2008 at 11:15 pm
    [ https://issues.apache.org/jira/browse/PIG-361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Alan Gates updated PIG-361:
    ---------------------------

    Attachment: PIG-361.patch

    This patch makes a number of changes. It removes IndexedTuple. Instead values are passed between map and reduce jobs as NullableTuples. These extend WritableComparable and contain a tuple. They also have bytes to indicate whether a tuple is null and which part of a join it comes from.

    A new type PigNullableWritable has been added. All of the NullableXWritable types now extend this (including NullableTuple). Keys passed between map and reduce jobs are now of this type. This allows the sorting to be done on the index but not the grouping or partitioning.
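
To illustrate the idea of a key that carries a null flag and an input index, here is a small Python sketch. It is not the Java code from the patch: `NullableKey` and its methods are hypothetical, and the choice to keep the index in the group key only for NULL keys is an assumption taken from the earlier comment in this thread about separating NULLs by relation.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import Any


@dataclass(frozen=True)
class NullableKey:
    value: Any
    is_null: bool
    index: int  # which input of the join/cogroup the record came from

    def sort_key(self):
        # Sorting takes the index into account; NULL keys sort first.
        # The 0 is a placeholder so that None is never compared to a value.
        return (not self.is_null, 0 if self.is_null else self.value, self.index)

    def group_key(self):
        # Grouping ignores the index for ordinary keys (assumption: NULL
        # keys keep it, so NULLs from different inputs stay apart).
        return (self.is_null, self.value, self.index if self.is_null else None)

    def partition(self, num_reducers):
        # Partitioning ignores the index as well.
        return hash(self.value) % num_reducers


records = [
    (NullableKey(1, False, 1), "x"),
    (NullableKey(None, True, 0), "b"),
    (NullableKey(1, False, 0), "a"),
    (NullableKey(None, True, 1), "y"),
]
ordered = sorted(records, key=lambda kv: kv[0].sort_key())
grouped = [
    (k, [v for _, v in grp])
    for k, grp in groupby(ordered, key=lambda kv: kv[0].group_key())
]
print([values for _, values in grouped])  # [['b'], ['y'], ['a', 'x']]
```

Within the group for key 1 the values arrive ordered by input index, while both records still reach the same group and the same partition.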

    I also found a major problem in the SortPartitioner. It was assuming all inputs were tuples and then applying the raw comparator. But in 2.0 we do not use tuples in the case of a single key. So I modified SortPartitioner to correctly determine the key type and use the correct type of comparator.
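
The general shape of that fix, picking a comparator from the key type instead of assuming tuples, might look like the Python sketch below. It is a loose analogy only: the post refers to a raw comparator in Pig's SortPartitioner, whereas everything here (`comparator_for`, `make_partitioner`, the sampled `quantiles`) is hypothetical and compares ordinary Python values.

```python
from bisect import bisect_left
from functools import cmp_to_key


def compare_scalars(a, b):
    """Three-way comparison for a single-column key."""
    return (a > b) - (a < b)


def compare_tuples(a, b):
    """Three-way, field-by-field comparison for a multi-column key."""
    for x, y in zip(a, b):
        c = compare_scalars(x, y)
        if c:
            return c
    return compare_scalars(len(a), len(b))


def comparator_for(sample_key):
    # Determine the key type first, then choose the matching comparator.
    return compare_tuples if isinstance(sample_key, tuple) else compare_scalars


def make_partitioner(quantiles):
    """Build a key -> reducer function from sorted boundary keys."""
    wrap = cmp_to_key(comparator_for(quantiles[0]))
    bounds = [wrap(q) for q in quantiles]

    def partition(key):
        return bisect_left(bounds, wrap(key))

    return partition


by_int = make_partitioner([10, 20])
by_tuple = make_partitioner([(1, "m"), (2, "a")])
print(by_int(5), by_int(15), by_int(25))           # 0 1 2
print(by_tuple((1, "a")), by_tuple((1, "z")), by_tuple((3, "a")))  # 0 1 2
```

A single-column key is compared as a bare value and never unpacked as a tuple, which is the case the original code is described as getting wrong.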


  • Olga Natkovich (JIRA) at Sep 19, 2008 at 7:07 pm
    [ https://issues.apache.org/jira/browse/PIG-361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12632774#action_12632774 ]

    Olga Natkovich commented on PIG-361:
    ------------------------------------

    +1 on the patch

    Couple of small comments:

    (1) NullableBag and NullableTuple can use a static factory to avoid an "if" on every bag/tuple construction.
    (2) The index is currently attached to all data even if we only have 1 stream. It is only a single byte, but we could optimize a bit further here later.
  • Alan Gates (JIRA) at Sep 19, 2008 at 9:35 pm
    [ https://issues.apache.org/jira/browse/PIG-361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Alan Gates resolved PIG-361.
    ----------------------------

    Resolution: Fixed

    PIG-361.patch committed.

Posted: Aug 6, '08 at 5:41 pm
Active: Sep 19, '08 at 9:35 pm