Grokbase Groups Pig dev August 2008
Semantics of generate * have changed
------------------------------------

Key: PIG-359
URL: https://issues.apache.org/jira/browse/PIG-359
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: types_branch
Reporter: Alan Gates
Priority: Minor
Fix For: types_branch


In the main trunk, the script

A = load 'myfile';
B = foreach A generate *;

returns:

(x, y, z)

In the types branch, it returns:

((x, y, z))

There is an extra level of tuple in it. In the main branch generate * seems to include an implicit flatten.
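
The difference can be pictured with a toy model. This is only a sketch, assuming a Pig tuple is represented as a Python tuple; the function names are hypothetical and are not Pig APIs.

```python
# Toy model: a Pig tuple is a Python tuple; one record of A has fields x, y, z.
record = ("x", "y", "z")

def generate_star_trunk(rec):
    """Trunk behaviour: '*' contributes the record's fields directly."""
    return tuple(rec)

def generate_star_types_branch(rec):
    """Reported types-branch behaviour: the record is wrapped in one more tuple."""
    return (rec,)

print(generate_star_trunk(record))         # ('x', 'y', 'z')
print(generate_star_types_branch(record))  # (('x', 'y', 'z'),)
```

The second result has exactly one field, which is itself the three-field tuple.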

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


  • Mridul Muralidharan at Aug 5, 2008 at 7:40 pm
    Assuming a two-field schema for A, shouldn't

    B = foreach A generate $0, $1;
    and
    B = foreach A generate *;

    be the same?

    This is similar to

    B = foreach A generate myFunc($0, $1);
    and
    B = foreach A generate myFunc(*);

    In both cases the UDF gets the tuple ($0, $1), not (($0, $1)) in the
    second case.


    Regards,
    Mridul




  • Alan Gates at Aug 5, 2008 at 7:50 pm
    I think we're saying the same thing.

    In the UDF case, both result in the UDF getting a tuple with two fields.

    In the non-UDF case, both should result in a tuple with two fields. At
    the moment generate * results in a tuple with one field, which is a
    tuple that has two fields. It should not. That's the bug.

    Alan.

  • Olga Natkovich (JIRA) at Aug 18, 2008 at 11:34 pm
    [ https://issues.apache.org/jira/browse/PIG-359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Olga Natkovich updated PIG-359:
    -------------------------------

    Priority: Major (was: Minor)

    Bumping the priority since we see users who are trying this code running into this issue.
  • Olga Natkovich (JIRA) at Aug 21, 2008 at 9:16 pm
    [ https://issues.apache.org/jira/browse/PIG-359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Olga Natkovich reassigned PIG-359:
    ----------------------------------

    Assignee: Shravan Matthur Narayanamurthy

    Shravan, could you take a look, please?
  • Shravan Matthur Narayanamurthy (JIRA) at Aug 22, 2008 at 2:21 pm
    [ https://issues.apache.org/jira/browse/PIG-359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Shravan Matthur Narayanamurthy updated PIG-359:
    -----------------------------------------------

    Status: Patch Available (was: Open)

    Added a check in CreateTuple for the case where a tuple contains a single field that is itself a tuple. In that case the inner tuple is returned.
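
A rough sketch of the kind of check described above, assuming Pig tuples are modelled as Python tuples. `create_tuple` is a hypothetical stand-in, not the actual CreateTuple code from 359.patch.

```python
def create_tuple(fields):
    """Build an output tuple; if its only field is a tuple, return that inner tuple."""
    out = tuple(fields)
    if len(out) == 1 and isinstance(out[0], tuple):
        return out[0]
    return out

record = ("x", "y", "z")

# 'generate *': the star yields the whole record as one field, which is unwrapped.
print(create_tuple([record]))   # ('x', 'y', 'z')

# Assumed side effect: a lone tuple-typed field that was not produced by '*'
# is unwrapped as well, because the check cannot tell the two cases apart.
a = (1, 2)
print(create_tuple([a]))        # (1, 2)
```

Under this reading the unwrapping applies to any single tuple field, not only to the result of `*`.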
  • Shravan Matthur Narayanamurthy (JIRA) at Aug 22, 2008 at 2:23 pm
    [ https://issues.apache.org/jira/browse/PIG-359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Shravan Matthur Narayanamurthy updated PIG-359:
    -----------------------------------------------

    Attachment: 359.patch
  • Olga Natkovich (JIRA) at Aug 22, 2008 at 2:55 pm
    [ https://issues.apache.org/jira/browse/PIG-359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12624859#action_12624859 ]

    Olga Natkovich commented on PIG-359:
    ------------------------------------

    Shravan, why is it always a good idea to do this? This is not * specific?
  • Shravan Matthur Narayanamurthy (JIRA) at Aug 25, 2008 at 10:48 am
    [ https://issues.apache.org/jira/browse/PIG-359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Shravan Matthur Narayanamurthy updated PIG-359:
    -----------------------------------------------

    Attachment: 359-1.patch

    You are right, Olga. This is * specific. I changed the patch as follows:
    When the foreach operator is created, it also builds a list of the leaves of its inner plans for optimization. At that point I now also check whether the leaf of an inner plan is a project(*). If so, I set flatten to true for that plan, which causes the foreach logic to flatten the tuple.

    The same was the case in POUserFunc when * is processed as an input: the semantics differed from the trunk. I changed it in a similar way to restore the trunk behaviour.

    Because of these changes, I needed to change a test case and a golden file.

    All of this is included in 359-1.patch. Thanks, Olga, for reviewing.
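
The foreach change described above might be pictured as follows. This is a minimal sketch, not the POForEach code: `STAR`, `plan_flatten_flags` and `foreach_row` are hypothetical names, inner-plan leaves are reduced to either `STAR` or a column position, and only tuple flattening is modelled (bags are left out).

```python
STAR = object()  # hypothetical marker for an inner plan whose leaf is project(*)

def plan_flatten_flags(leaves, explicit_flatten):
    """One flatten flag per inner plan; a project(*) leaf always gets flatten=True."""
    return [bool(flag) or leaf is STAR for leaf, flag in zip(leaves, explicit_flatten)]

def foreach_row(record, leaves, flags):
    """Evaluate each inner plan on one record and assemble the output tuple."""
    out = []
    for leaf, flatten in zip(leaves, flags):
        value = record if leaf is STAR else record[leaf]
        if flatten and isinstance(value, tuple):
            out.extend(value)   # flattened: fields are spliced into the output
        else:
            out.append(value)   # not flattened: value stays a single field
    return tuple(out)

record = ("x", "y", "z")
flags = plan_flatten_flags([STAR], [False])
print(flags)                                  # [True]
print(foreach_row(record, [STAR], flags))     # ('x', 'y', 'z')
print(foreach_row(record, [STAR], [False]))   # (('x', 'y', 'z'),)
```

With the flag forced on for star leaves, `generate *` yields the flat record; with it off, the extra level of tuple from the bug report reappears.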
  • Alan Gates (JIRA) at Aug 26, 2008 at 9:47 pm
    [ https://issues.apache.org/jira/browse/PIG-359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12625885#action_12625885 ]

    Alan Gates commented on PIG-359:
    --------------------------------

    I don't think we want the changes to POUserFunc. In the case of udf(*), the existing code already does the right thing, because lines 159-161 make sure we don't double wrap tuples. Removing these lines causes problems for scripts like this:

    A = load 'myfile' as a:tuple (...);
    B = foreach A generate udf(a);

    Now 'a' will be double wrapped (that is, there will be a tuple containing just the tuple 'a'). This isn't what we want.

    The changes to POForEach look good.
  • Shravan Matthur Narayanamurthy (JIRA) at Aug 27, 2008 at 4:06 am
    [ https://issues.apache.org/jira/browse/PIG-359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12625967#action_12625967 ]

    Shravan Matthur Narayanamurthy commented on PIG-359:
    ----------------------------------------------------

    Alan, two things.
    1) The current code isn't enough because of the following:
    A = load 'file:/etc/passwd' using PigStorage(':');
    B = foreach A generate ARITY(*,*);
    dump B;

    Trunk emits 14 (2 times the arity of each tuple in A, which is 7). The current code would emit 2. Another example of what the current code doesn't handle is

    A = load 'file:/etc/passwd' using PigStorage(':');
    B = foreach A generate ARITY($0, '---', *);
    Trunk emits 9 (2 + 7). The current code would emit 3.

    2) You are right in saying that 'a' will be double wrapped. But that's how trunk works right now, and I think it's right. Consider this script:

    A = load 'myfile' as (a:tuple(...), b:tuple(...));
    B = foreach A generate udf(a,b);

    We want 'a' and 'b' to stay intact inside the input tuple that is passed to the UDF. So we would expect the arity to be 2 rather than the combined arity of 'a' and 'b'. Generalizing this, I think double wrapping should be OK. I tested this behaviour in trunk by writing a UDF, say TupleOutputUDF, that returns a Tuple by simply copying its input tuple to the output. I tried the following script in trunk:
    A = load 'file:/etc/passwd' using PigStorage(':');
    B = foreach A generate ARITY(TupleOutputUDF(*));
    dump B;

    In trunk this returns 1. The current code returns 7.
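
The arities quoted above can be reproduced with a small model. This is a sketch under stated assumptions: Pig tuples are Python tuples, `STAR` and `build_udf_input` are hypothetical names, and the sample row is made up (seven fields, like one ':'-separated line of /etc/passwd). `splice_star=True` models the trunk behaviour and `splice_star=False` models a star argument that arrives as one nested tuple.

```python
STAR = object()  # hypothetical marker standing in for '*'

def build_udf_input(record, args, splice_star=True):
    """Build the tuple handed to a UDF from its already evaluated arguments."""
    fields = []
    for arg in args:
        if arg is STAR and splice_star:
            fields.extend(record)   # '*' contributes every field of the record
        elif arg is STAR:
            fields.append(record)   # '*' contributes the record as one nested tuple
        else:
            fields.append(arg)      # any other value is exactly one field
    return tuple(fields)

def arity(t):
    """Number of fields in the tuple, like ARITY."""
    return len(t)

row = ("root", "x", "0", "0", "root", "/root", "/bin/sh")

print(arity(build_udf_input(row, [STAR, STAR])))                              # 14
print(arity(build_udf_input(row, [STAR, STAR], splice_star=False)))           # 2
print(arity(build_udf_input(row, [row[0], "---", STAR])))                     # 9
print(arity(build_udf_input(row, [row[0], "---", STAR], splice_star=False)))  # 3

# ARITY(TupleOutputUDF(*)): the inner UDF returns one tuple value,
# which the outer call receives as a single field.
inner_result = build_udf_input(row, [STAR])
print(arity(build_udf_input(row, [inner_result])))                            # 1
```

In this model only a literal `*` argument is spliced. A tuple produced by another expression, such as a field or a UDF result, stays one field, which corresponds to the double wrapping discussed above.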
  • Alan Gates (JIRA) at Aug 27, 2008 at 9:44 pm
    [ https://issues.apache.org/jira/browse/PIG-359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Alan Gates updated PIG-359:
    ---------------------------

    Resolution: Fixed
    Status: Resolved (was: Patch Available)

    I still don't like the double wrapping. But Shravan is correct that this matches the previous behavior, and there's no good reason to change it so we shouldn't. The patch has been checked in.
  • Santhosh Srinivasan (JIRA) at Aug 28, 2008 at 5:01 pm
    [ https://issues.apache.org/jira/browse/PIG-359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12626647#action_12626647 ]

    Santhosh Srinivasan commented on PIG-359:
    -----------------------------------------

    In POUserFunc.java, the following code assumes that project(*) always returns a tuple. In a nested foreach block we could be projecting bags, in which case the code fails with a ClassCastException. For example: testNestedPlan in TestEvalPipeline.java.

    {code}
    + if(op instanceof POProject){
    +     POProject projOp = (POProject)op;
    +     if(projOp.isStar()){
    +         Tuple trslt = (Tuple) temp.result;
    +         Tuple rslt = (Tuple) res.result;
    +         for(int i=0;i<trslt.size();i++)
    +             rslt.append(trslt.get(i));
    +         continue;
    +     }
    + }
    {code}
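
One way to avoid the failing cast is to check the type of the star result before splicing it. The sketch below only illustrates that idea and is not taken from any of the patches: `append_star_result` is a hypothetical helper, a tuple is modelled as a Python tuple, and a bag as a Python list of tuples.

```python
def append_star_result(result_fields, star_value):
    """Splice a project(*) result when it is a tuple; otherwise keep it as one field."""
    if isinstance(star_value, tuple):
        result_fields.extend(star_value)   # tuple: append its fields one by one
    else:
        result_fields.append(star_value)   # e.g. a bag: pass it through unchanged
    return result_fields

print(append_star_result([], (1, 2)))         # [1, 2]
print(append_star_result([], [(1,), (2,)]))   # [[(1,), (2,)]]
```

Whether a bag should be passed through as a single field or handled some other way is a design decision this sketch does not settle.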
  • Alan Gates (JIRA) at Aug 28, 2008 at 9:42 pm
    [ https://issues.apache.org/jira/browse/PIG-359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Alan Gates updated PIG-359:
    ---------------------------

    Attachment: PIG-359-2.patch

    This patch addresses the issues Santhosh identified with the patch 359-1.
  • Santhosh Srinivasan (JIRA) at Aug 28, 2008 at 10:54 pm
    [ https://issues.apache.org/jira/browse/PIG-359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12626779#action_12626779 ]

    Santhosh Srinivasan commented on PIG-359:
    -----------------------------------------

    +1 for PIG-359-2.patch. Looks good.
