[Pig-dev] [jira] Created: (PIG-1693) There needs to be a way in foreach to indicate "and all the rest of the fields"

Alan Gates (JIRA)
Oct 22, 2010 at 5:56 pm
There needs to be a way in foreach to indicate "and all the rest of the fields"
-------------------------------------------------------------------------------

Key: PIG-1693
URL: https://issues.apache.org/jira/browse/PIG-1693
Project: Pig
Issue Type: New Feature
Components: impl
Reporter: Alan Gates


A common use case we see in Pig is that people have many columns in their data and want to operate on only a few of them. Consider, for example, a user who wants to cast one column before storing data with ten columns:

{code}
...
Z = foreach Y generate (int)firstcol, secondcol, thirdcol, fourthcol, fifthcol, sixthcol, seventhcol, eighthcol, ninthcol, tenthcol;
store Z into 'output';
{code}

Obviously this only gets worse as the user has more columns. Ideally the above could be transformed to something like:

{code}
...
Z = foreach Y generate (int)firstcol, "and all the rest";
store Z into 'output';
{code}

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

33 responses

  • Alan Gates (JIRA) at Oct 22, 2010 at 6:16 pm
    [ https://issues.apache.org/jira/browse/PIG-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923953#action_12923953 ]

    Alan Gates commented on PIG-1693:
    ---------------------------------

    I can see a couple of ways of approaching this.

    One would be something like the colon operator in Python, meaning everything in between. As the colon is not widely used for this across programming languages, I propose '...' instead, since that matches the natural-language meaning of an ellipsis. If it were used before a certain field, it would mean the beginning up to that field:

    {code}
    B = foreach A generate ..., $10;
    {code}
    would mean $0-$9

    If used between two fields, it would mean everything in between:

    {code}
    B = foreach A generate $7, ..., $10;
    {code}
    would mean $8 and $9.

    If used at the end of the line, it would mean everything after the last referenced field:

    {code}
    B = foreach A generate $10, ...;
    {code}

    would mean $11 to the end of the record.

    Another approach would be to define a symbol that means "all fields not referenced in this list of expressions". If, for example, we chose @ to mean this, then:

    {code}
    B = foreach A generate $10, @;
    {code}
    would mean $0-$9, and $11 to the end.

    Then does $10 keep its place as the eleventh field or become the first field?

    I like the '...' option better, as it allows more control of ordering and will be easier for users to understand.

    Whichever one we choose we have to answer what it means if an expression contains more than one field:

    {code}
    B = foreach A generate udf($3, $5), ..., udf($8, $10);
    {code}

    What range does '...' include? I propose it includes the highest column number on the left and the lowest on the right (thus in this example, $6 and $7).

    In the @ case it's clear that @ would refer to $0, $1, $2, $4, $6, $7, $9, and anything past $10. But the ordering becomes even stickier. Where do $4 and $9 go?

    In cases where Pig knows the schema, the '...' or '@' operator could be resolved at compile time, which will be more efficient. In cases where it does not, a new physical operator would be required to handle the @ or ellipsis end case "$1, ...", as we cannot construct a set of projections that knows exactly which columns to pass through.
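
As an illustration of the '...' rules described in this comment, here is a minimal Python sketch. It is not Pig code and not part of the original comment. The function name `ellipsis_range` and the representation of each neighbouring expression as a list of positional column numbers are assumptions made for the example. It also assumes the number of fields is known, which corresponds to the case where Pig knows the schema.

```python
def ellipsis_range(left_cols, right_cols, num_fields):
    """Positional columns covered by '...' in a generate list.

    left_cols / right_cols: column numbers referenced by the expression
    immediately before / after the '...'; an empty list means the '...'
    sits at the start / end of the list.
    num_fields: total number of fields in the record (known schema).
    """
    # Start just after the highest column referenced on the left.
    start = max(left_cols) + 1 if left_cols else 0
    # Stop just before the lowest column referenced on the right.
    end = min(right_cols) if right_cols else num_fields
    return list(range(start, end))


print(ellipsis_range([], [10], 13))         # ..., $10      -> [0, 1, ..., 9]
print(ellipsis_range([7], [10], 13))        # $7, ..., $10  -> [8, 9]
print(ellipsis_range([10], [], 13))         # $10, ...      -> [11, 12]
print(ellipsis_range([3, 5], [8, 10], 13))  # udf($3, $5), ..., udf($8, $10) -> [6, 7]
```

When the schema is unknown, `num_fields` is not available before the data is read, which is why the trailing case cannot be turned into a fixed list of projections.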


  • Milind Bhandarkar (JIRA) at Oct 22, 2010 at 6:38 pm
    [ https://issues.apache.org/jira/browse/PIG-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923962#action_12923962 ]

    Milind Bhandarkar commented on PIG-1693:
    ----------------------------------------

    I prefer the colon: it is one keystroke instead of the three you propose, and it can represent ranges very well, without any ambiguity.

    e.g. $:4, $5:6, $7:

    $:n = 0..n
    $m:n = m..n
    $n: = n..end
  • Olga Natkovich (JIRA) at Oct 22, 2010 at 11:19 pm
    [ https://issues.apache.org/jira/browse/PIG-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12924078#action_12924078 ]

    Olga Natkovich commented on PIG-1693:
    -------------------------------------

    I like '...' as well. In the ambiguous foreach example, I would suggest that we require the user to provide the start and end rather than making our own rules.
  • Olga Natkovich (JIRA) at Oct 22, 2010 at 11:21 pm
    [ https://issues.apache.org/jira/browse/PIG-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Olga Natkovich updated PIG-1693:
    --------------------------------

    Fix Version/s: 0.9.0
    Assignee: Daniel Dai
  • Santhosh Srinivasan (JIRA) at Oct 22, 2010 at 11:43 pm
    [ https://issues.apache.org/jira/browse/PIG-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12924087#action_12924087 ]

    Santhosh Srinivasan commented on PIG-1693:
    ------------------------------------------

    Why don't we add a drop columns feature? Then we could do the following for the use case stated in the ticket description.

    {code}
    Z = foreach Y drop a, b, c;
    Z1 = foreach Z generate *;
    {code}
  • Milind Bhandarkar (JIRA) at Oct 23, 2010 at 12:21 am
    [ https://issues.apache.org/jira/browse/PIG-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12924095#action_12924095 ]

    Milind Bhandarkar commented on PIG-1693:
    ----------------------------------------

    Is there a Pig philosophy stated somewhere to make Pig a "write-only" language?

    Does anyone else feel that putting ... in the statements looks like you are omitting irrelevant stuff?
  • Milind Bhandarkar (JIRA) at Oct 23, 2010 at 3:17 am
    [ https://issues.apache.org/jira/browse/PIG-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12924114#action_12924114 ]

    Milind Bhandarkar commented on PIG-1693:
    ----------------------------------------

    Talked to Olga and Thejas offline. Told them my reservations about "...".
    Ranges are a well-established concept in scripting languages.
    For example, Perl array slicing uses "..", Python uses ":".
    ... is used for varargs, which means any number of arguments, and does not define a range.

    So, ".." (notice, two dots, not three) can be considered.

    Basically, a range is specified by a beginning and an end.
    If beginning is omitted, then 0 is assumed.
    If end is omitted, then max_index(range) is assumed.
    If we use ':', then omitting the beginning or end does not look as odd as it does with "..".

    To give you an example, if I want to specify all fields after 3, there are two choices.

    $4.., or $4:

    If I want to specify all the fields up to field 6,

    $..6, or $:6

    If I want to specify fields between 3 and 10,

    $3..10 or $3:10.

    Please choose between .. and :.

  • Alan Gates (JIRA) at Oct 25, 2010 at 9:30 pm
    [ https://issues.apache.org/jira/browse/PIG-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12924730#action_12924730 ]

    Alan Gates commented on PIG-1693:
    ---------------------------------

    The point that '...' is used for varargs and thus may be confusing is a valid one. Perhaps '..' would be a better choice since it is used in both Perl and Ruby. I still don't like ':'.

    Whichever one we choose, syntax and semantics (as suggested by Olga and Milind) seem good.

  • Milind Bhandarkar (JIRA) at Oct 25, 2010 at 9:52 pm
    [ https://issues.apache.org/jira/browse/PIG-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12924742#action_12924742 ]

    Milind Bhandarkar commented on PIG-1693:
    ----------------------------------------

    If we go with "..", then can we require both the beginning and end indexes? That will avoid the ambiguity in your last example.
  • Alan Gates (JIRA) at Oct 25, 2010 at 10:04 pm
    [ https://issues.apache.org/jira/browse/PIG-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12924750#action_12924750 ]

    Alan Gates commented on PIG-1693:
    ---------------------------------

    Santhosh, I don't see how drop meets the use case. I want to cast one column and leave all the rest the same. I don't want to drop it.
  • Santhosh Srinivasan (JIRA) at Oct 25, 2010 at 10:22 pm
    [ https://issues.apache.org/jira/browse/PIG-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12924757#action_12924757 ]

    Santhosh Srinivasan commented on PIG-1693:
    ------------------------------------------

    Please ignore my comment. I was thinking about the use case of handling 'n' columns in a record of size 'm' where m >> n.
  • Alan Gates (JIRA) at Oct 26, 2010 at 4:28 pm
    [ https://issues.apache.org/jira/browse/PIG-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925022#action_12925022 ]

    Alan Gates commented on PIG-1693:
    ---------------------------------

    bq. If we go with "..", then can we require both the beginning and end indexes? That will avoid the ambiguity in your last example.

    As you suggested above, I think we should support 3 cases:

    ..$x -- $0 through $x, inclusive
    $x.. -- $x through end, inclusive
    $x..$y -- $x through $y, inclusive

    The one change I made from your syntax is keeping the '$' attached to the positional variables, because this should be legal by alias too. So if one has a schema (alpha, beta, gamma, delta, epsilon)

    ..gamma
    gamma..
    beta..delta

    would all be legal too.
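
The three forms above can be modelled with a short Python sketch. This is an illustration only, not Pig code and not part of the original comment. The function name `resolve_range` and the representation of the schema as an ordered list of alias names are assumptions. Each end of the range may be an alias, a positional reference such as `$2`, or omitted.

```python
def resolve_range(schema, start=None, end=None):
    """Return the columns selected by a '..' range, both ends inclusive.

    schema: ordered list of column aliases.
    start / end: an alias, a positional reference like '$2', or None
    when that side of the range is left out.
    """
    def index_of(ref):
        if ref.startswith("$"):
            return int(ref[1:])      # positional reference: $0, $1, ...
        return schema.index(ref)     # alias; raises ValueError if unknown

    first = 0 if start is None else index_of(start)
    last = len(schema) - 1 if end is None else index_of(end)
    return schema[first:last + 1]


schema = ["alpha", "beta", "gamma", "delta", "epsilon"]
print(resolve_range(schema, end="gamma"))                # ..gamma     -> ['alpha', 'beta', 'gamma']
print(resolve_range(schema, start="gamma"))              # gamma..     -> ['gamma', 'delta', 'epsilon']
print(resolve_range(schema, start="beta", end="delta"))  # beta..delta -> ['beta', 'gamma', 'delta']
print(resolve_range(schema, start="$1", end="$3"))       # $1..$3      -> ['beta', 'gamma', 'delta']
```

Because aliases are translated to positions before slicing, the alias and positional forms share one code path in this sketch.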
  • Milind Bhandarkar (JIRA) at Oct 27, 2010 at 6:07 pm
    [ https://issues.apache.org/jira/browse/PIG-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925476#action_12925476 ]

    Milind Bhandarkar commented on PIG-1693:
    ----------------------------------------

    +1 to Alan's last comment.
  • Scott Carey (JIRA) at Nov 6, 2010 at 12:32 am
    [ https://issues.apache.org/jira/browse/PIG-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928900#action_12928900 ]

    Scott Carey commented on PIG-1693:
    ----------------------------------

    If this doesn't work with named aliases, it's almost useless for me. Numbered references are not maintainable: what happens when you add a column to a complex flow, or remove one? Suddenly you are incrementing or decrementing numbers in statements all over the place.

    Y has 10 named columns, with full schemas.

    Use case 1, operate on subset:
    {code}
    Z = foreach Y generate myUDF(firstcol, secondcol, thirdcol) as result, fourthcol, fifthcol, sixthcol, seventhcol, eighthcol, ninthcol, tenthcol;
    {code}

    Use case 2, remove a subset:
    {code}
    Z = foreach Y generate firstcol, fourthcol, fifthcol, sixthcol, seventhcol, eighthcol, ninthcol, tenthcol;
    {code}

    Why not just make the * operator have a few different forms or use a new operator?

    Use case 1 becomes:
    {code}
    Z = foreach Y generate myUDF(firstcol, secondcol, thirdcol) as result, *+;
    {code}
    *+ would mean "all columns not referenced"

    Use case 2 becomes:
    {code}
    Z = foreach Y generate *- (secondcol, thirdcol);
    {code}

    and *- generates all columns other than the set right after it.

    I'm not saying these are the best operators or syntax, but syntax that did not involve number ranges and simply 'works' for 'generate all that have not been referenced' and 'generate all excluding (set of aliases)' would be awesome. I definitely don't want to be counting aliases to discover that fieldFoo is the 23rd alias and fieldBar is the 29th.

    There are a lot of problems with ranges combined with names, and you still have to keep track of the count of columns, which isn't fun when there are 40. A "shared" alias uses names so that scripts that consume it never have to change if the alias adds columns; if it removes columns, only scripts that used that field have to change.
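
The two name-based operators suggested in this comment can be sketched in Python as follows. This is an illustration, not Pig code and not part of the original comment. The function names `star_plus` and `star_minus`, and the short schema `a` to `e`, are hypothetical. Both operators reduce to the same idea: keep every column that is not in a given set, in schema order, with no column counting involved.

```python
def star_plus(schema, referenced):
    """'*+': every column not already referenced elsewhere in the generate list."""
    used = set(referenced)
    return [col for col in schema if col not in used]


def star_minus(schema, excluded):
    """'*-': every column except the explicitly listed ones."""
    # Same filtering as '*+'; only the source of the name set differs.
    return star_plus(schema, excluded)


schema = ["a", "b", "c", "d", "e"]
# generate myUDF(a, b, c) as result, *+;
print(star_plus(schema, ["a", "b", "c"]))   # -> ['d', 'e']
# generate *- (b, c);
print(star_minus(schema, ["b", "c"]))       # -> ['a', 'd', 'e']
```

If a column is later added to the schema, neither call needs to change, which is the maintainability point made above.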

  • Eric Yang (JIRA) at Dec 30, 2010 at 7:30 pm
    [ https://issues.apache.org/jira/browse/PIG-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12976089#action_12976089 ]

    Eric Yang commented on PIG-1693:
    --------------------------------

    *+ and *- could have readability problems: it is easy to confuse them with mathematical operations at first glance. I think using ".." would be a better choice.

    It should be possible to write as:

    {noformat}
    Z = foreach Y generate myUDF(firstcol, secondcol, thirdcol) as result, fourthcol .. tenthcol;
    Z = foreach Y generate firstcol, fourthcol .. tenthcol;
    {noformat}

    Another approach: it could be written in UDF style.

    {noformat}
    Z = foreach Y generate myUDF(firstcol, secondcol, thirdcol) as result, mirror(fourthcol, tenthcol);
    Z = foreach Y generate firstcol, mirror(fourthcol, tenthcol);
    {noformat}

  • Olga Natkovich (JIRA) at Mar 3, 2011 at 12:03 am
    [ https://issues.apache.org/jira/browse/PIG-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Olga Natkovich updated PIG-1693:
    --------------------------------

    Assignee: Thejas M Nair (was: Daniel Dai)
  • Thejas M Nair (JIRA) at Mar 4, 2011 at 1:18 am
    [ https://issues.apache.org/jira/browse/PIG-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002384#comment-13002384 ]

    Thejas M Nair commented on PIG-1693:
    ------------------------------------

    bq. If this doesn't work with named aliases, its almost useless for me. Numbered references are not maintainable,
    Alan's proposal in his comment dated '26/Oct/10 16:27' works with named aliases as well.
    I am planning to go work on that proposal.

    The use of "*" is supported in cogroup, order-by and join statements as well, so I am planning to keep it consistent and support this syntax in those statements as well.

    bq. *+ would mean "all columns not referenced"
    In this initial implementation I am planning to support only 'all columns in range'. If there is enough interest in the 'all columns not referenced' feature, it can be added later.
  • Thejas M Nair (JIRA) at Mar 25, 2011 at 6:06 pm
    [ https://issues.apache.org/jira/browse/PIG-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Thejas M Nair updated PIG-1693:
    -------------------------------

    Summary: support project-range expression. (was: There needs to be a way in foreach to indicate "and all the rest of the fields")
  • Thejas M Nair (JIRA) at Mar 25, 2011 at 6:20 pm
    [ https://issues.apache.org/jira/browse/PIG-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Thejas M Nair updated PIG-1693:
    -------------------------------

    Attachment: PIG-1693.1.patch

    PIG-1693.1.patch
    Highlights -
    - ProjectExpression in logical plan now supports project-range
    - ProjectStarExpander is called from LogicalPlanBuilder while building foreach,group,join or sort expression plans, to expand the project-range expression.
    - ProjectStarExpander expands all project-range expressions, except project-to-end (eg. $5 ..) when input schema is null. This is the only case when project-range expression is seen by logical optimizers or the physical plan.
    - Some of the logical optimizer rules have changed to consider project-to-end use cases.
    - POProject supports project-to-end expression, and project-star is a special case of project-to-end.
    - MRCompiler and some MR optimizer rules have changed to handle project-to-end case of POProject
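    The expansion rule described in these highlights can be modelled roughly as follows. This is a minimal Python sketch, not the patch's Java code: the function name `expand_project_range` and the `("to_end", n)` marker are hypothetical, and it assumes ranges are given as 0-based column positions. It only illustrates that every range is expanded at plan-build time except project-to-end on an input with a null schema.

```python
from typing import List, Optional, Tuple, Union

# Hypothetical marker for a range that cannot be expanded at plan-build time.
Deferred = Tuple[str, int]


def expand_project_range(start: Optional[int], end: Optional[int],
                         num_columns: Optional[int]) -> Union[List[int], Deferred]:
    """Expand an inclusive column range into explicit positions.

    start=None means "from $0"; end=None means "to the last column".
    num_columns=None models a null (unknown) input schema.
    """
    first = 0 if start is None else start
    if end is not None:
        # The end position is known, so the range can be expanded right away.
        return list(range(first, end + 1))
    if num_columns is None:
        # Project-to-end with an unknown schema: leave it for runtime.
        return ("to_end", first)
    # Project-to-end with a known schema: expand up to the last column.
    return list(range(first, num_columns))
```

    The sketch does no validation (for example, it does not reject a start position greater than the end position).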

  • Xuefu Zhang (JIRA) at Mar 25, 2011 at 7:30 pm
    [ https://issues.apache.org/jira/browse/PIG-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011378#comment-13011378 ]

    Xuefu Zhang commented on PIG-1693:
    ----------------------------------

    I have reviewed the parser related changes:

    1. in LogicalPlanGenerator.g
    $expr = builder.buildRangeProjectExpr(
    loc, plan, $GScope::currentOp,
    $statement::inputIndex,
    startExpr == null ? null : startExpr.expr,
    endExpr == null ? null : endExpr.expr
    );

    instead of startExpr == null ? null : startExpr.expr, just use $startExpr.expr.

    2. LogicalPlanBuilder.java
    try {
    plan.removeAndReconnect(startExpr);
    plan.removeAndReconnect(endExpr);
    } catch (FrontendException e) {
    throw new ParserValidationException(intStream, loc, e);
    }
    It is probably better to check whether startExpr and endExpr are null.

    3. An observation: the ProjectExpression class seems to be getting a little overloaded. We might need to consider subclassing it to take care of STAR, RANGE, etc., though it doesn't have to happen now.


  • Daniel Dai (JIRA) at Mar 25, 2011 at 11:33 pm
    [ https://issues.apache.org/jira/browse/PIG-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011501#comment-13011501 ]

    Daniel Dai commented on PIG-1693:
    ---------------------------------

    +1 for the other part (non parser part) of the patch.
  • Daniel Dai (JIRA) at Mar 26, 2011 at 12:31 am
    [ https://issues.apache.org/jira/browse/PIG-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011516#comment-13011516 ]

    Daniel Dai commented on PIG-1693:
    ---------------------------------

    One minor comment: it is better to change ProjectExpression.toString to print ranges in the format [x..y], [..y], [x..], which is consistent with the grammar.
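    As an illustration of the three suggested print formats, here is a small Python sketch. The helper `format_range` is hypothetical and is not the actual `ProjectExpression.toString` implementation; `None` stands for an open end of the range.

```python
from typing import Optional


def format_range(start: Optional[str], end: Optional[str]) -> str:
    """Render a range as [x..y], [..y], or [x..]; None marks an open end."""
    left = "" if start is None else start
    right = "" if end is None else end
    return "[{}..{}]".format(left, right)
```

    For instance, `format_range("$1", None)` yields `[$1..]`, matching the project-to-end form in the grammar.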
  • Thejas M Nair (JIRA) at Mar 28, 2011 at 12:06 pm
    [ https://issues.apache.org/jira/browse/PIG-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012016#comment-13012016 ]

    Thejas M Nair commented on PIG-1693:
    ------------------------------------

    bq. 3. An observation. ProjectExpression class seems getting a little overloaded. We might need to consider subclass it to take care of STAR, RANGE, etc, though it doesn't have to happen now.
    I will re-examine the design when I work on PIG-1938, which adds support for project-range as udf argument.

  • Thejas M Nair (JIRA) at Mar 28, 2011 at 12:09 pm
    [ https://issues.apache.org/jira/browse/PIG-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Thejas M Nair updated PIG-1693:
    -------------------------------

    Attachment: PIG-1693.2.patch

    PIG-1693.2.patch - addressing review comments.
    Unit tests pass.
    Test-patch results -
    [exec] -1 overall.
    [exec]
    [exec] +1 @author. The patch does not contain any @author tags.
    [exec]
    [exec] +1 tests included. The patch appears to include 15 new or modified tests.
    [exec]
    [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
    [exec]
    [exec] -1 javac. The applied patch generated 958 javac compiler warnings (more than the trunk's current 941 warnings).
    [exec]
    [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
    [exec]
    [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.

    The additional javac warnings are from code generated by antlr.
  • Thejas M Nair (JIRA) at Mar 28, 2011 at 1:03 pm
    [ https://issues.apache.org/jira/browse/PIG-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Thejas M Nair resolved PIG-1693.
    --------------------------------

    Resolution: Fixed
    Release Note:

    Project-range ( '..' ) can be used to project a range of columns from input.
    For example, the expressions -
    ..$x : projects columns $0 through $x, inclusive
    $x.. : projects columns $x through the end, inclusive
    $x..$y : projects columns $x through $y, inclusive
    If the input relation has a schema, you can also use column aliases instead of referring to columns by position. You can also combine aliases and column positions in a project-range expression (e.g., "col1 .. $5" is valid).


    This expression can be used in all cases where the use of '*' (project-star) is allowed, except as a udf argument. Support for that use case will be added in PIG-1938.

    It can be used in following statements -
    - foreach
    - join
    - order (also when it is within a nested foreach block)
    - group/cogroup

    Examples -
    {code}
    grunt> F = foreach IN generate (int)col0, col1 .. col3;
    grunt> describe F;
    F: {col0: int,col1: bytearray,col2: bytearray,col3: bytearray}
    {code}
    {code}
    grunt> SORT = order IN by col2 .. col3, col0, col4 ..;
    {code}
    {code}
    J = join IN1 by $0 .. $3, IN2 by $0 .. $3;
    {code}
    {code}
    g = group l1 by b .. c;
    {code}
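    To make the inclusive semantics and the mixing of aliases with positions concrete, here is a rough Python model. The helper names `resolve_column` and `resolve_range` are hypothetical, and the schema is assumed to be a plain list of alias names; this is not how Pig represents schemas internally.

```python
from typing import List, Optional


def resolve_column(ref: str, schema: List[str]) -> int:
    """Map a positional reference ('$5') or an alias ('col1') to a 0-based index."""
    if ref.startswith("$"):
        return int(ref[1:])
    return schema.index(ref)  # raises ValueError for an unknown alias


def resolve_range(start: Optional[str], end: Optional[str],
                  schema: List[str]) -> List[str]:
    """Return the aliases selected by 'start .. end', both ends inclusive.

    start=None models '.. end'; end=None models 'start ..'.
    """
    first = 0 if start is None else resolve_column(start, schema)
    last = len(schema) - 1 if end is None else resolve_column(end, schema)
    return schema[first:last + 1]
```

    With a six-column schema `col0` through `col5`, `resolve_range("col1", "$5", schema)` selects `col1` through `col5`, which mirrors the mixed form "col1 .. $5" mentioned in the release note.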

    Limitations:
    There are some restrictions on the use of project-to-end form of project range (eg "x .. ") when input schema is null (unknown). These are also cases where the use of project-star ('*') is restricted.

    1. In Cogroup/Group statements, project-to-end form of project-range is only allowed if the input has a schema

    2. In an order-by statement, if the input schema is null, the project-to-end form of project-range is supported only as the last sort column.
    Note: there is a bug, PIG-1939, because of which the use is also restricted when a schema is present. That should be fixed soon.
    Example:
    {code}
    grunt> describe IN;
    Schema for IN unknown.

    -- Following statement is supported (project-to-end is the last sort column)
    SORT = order IN by $2 .. $3, $6 ..;

    -- Following statement is NOT supported (project-to-end is not the last sort column)
    SORT = order IN by $6 .., $2 .. $3;
    {code}
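    The order-by restriction above can be expressed as a simple check. This is a hedged Python sketch with a hypothetical function name, `order_by_allowed`; it models only the rule stated in the release note and ignores the additional restriction caused by PIG-1939.

```python
from typing import List, Optional, Tuple

# Each sort column is (start, end); end=None marks the project-to-end form.
SortRange = Tuple[Optional[int], Optional[int]]


def order_by_allowed(sort_columns: List[SortRange], schema_known: bool) -> bool:
    """Check the rule: with a null schema, project-to-end must be the last sort column."""
    if schema_known:
        return True
    # Every sort column except the last one must have a known end position.
    return all(end is not None for _, end in sort_columns[:-1])
```

    Under this model, `$2 .. $3, $6 ..` passes on a schema-less input, while the same columns with `$6 ..` listed first do not.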



    Patch committed to trunk.
  • Thejas M Nair (JIRA) at Mar 28, 2011 at 1:05 pm
    [ https://issues.apache.org/jira/browse/PIG-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Thejas M Nair updated PIG-1693:
    -------------------------------

    Release Note:

    Project-range ( '..' ) can be used to project a range of columns from input.
    For example, the expressions -
    .. $x : projects columns $0 through $x, inclusive
    $x .. : projects columns $x through the end, inclusive
    $x .. $y : projects columns $x through $y, inclusive
    If the input relation has a schema, you can also use column aliases instead of referring to columns by position. You can also combine aliases and column positions in a project-range expression (e.g., "col1 .. $5" is valid).


    This expression can be used in all cases where the use of '*' (project-star) is allowed, except as a udf argument. Support for that use case will be added in PIG-1938.

    It can be used in following statements -
    - foreach
    - join
    - order (also when it is within a nested foreach block)
    - group/cogroup

    Examples -
    {code}
    grunt> F = foreach IN generate (int)col0, col1 .. col3;
    grunt> describe F;
    F: {col0: int,col1: bytearray,col2: bytearray,col3: bytearray}
    {code}
    {code}
    grunt> SORT = order IN by col2 .. col3, col0, col4 ..;
    {code}
    {code}
    J = join IN1 by $0 .. $3, IN2 by $0 .. $3;
    {code}
    {code}
    g = group l1 by b .. c;
    {code}

    Limitations:
    There are some restrictions on the use of project-to-end form of project range (eg "x .. ") when input schema is null (unknown). These are also cases where the use of project-star ('*') is restricted.

    1. In Cogroup/Group statements, project-to-end form of project-range is only allowed if the input has a schema

    2. In an order-by statement, if the input schema is null, the project-to-end form of project-range is supported only as the last sort column.
    Note: there is a bug, PIG-1939, because of which the use is also restricted when a schema is present. That should be fixed soon.
    Example:
    {code}
    grunt> describe IN;
    Schema for IN unknown.

    -- Following statement is supported (project-to-end is the last sort column)
    SORT = order IN by $2 .. $3, $6 ..;

    -- Following statement is NOT supported (project-to-end is not the last sort column)
    SORT = order IN by $6 .., $2 .. $3;
    {code}



  • Mridul Muralidharan (JIRA) at Apr 4, 2011 at 6:59 pm
    [ https://issues.apache.org/jira/browse/PIG-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015546#comment-13015546 ]

    Mridul Muralidharan commented on PIG-1693:
    ------------------------------------------

    This is a great feature addition.
    Hopefully, the mess created by forcibly projecting only the referenced fields (when no schema is specified) or only the fields in the schema can be alleviated without needing a dummy schema with 10+ fields at times (at least, I hope it will make things easier)!


    Just curious about one aspect.
    If you do something like:

    A = LOAD '<path>' USING MyLoader();
    B = FOREACH A GENERATE $0, $3..;
    STORE B INTO '<output>' USING MyStore();

    Do we still need a schema to 'con' Pig into projecting all the fields? This is particularly relevant when the number of fields is high (or might be 'fuzzy' at times).
    An earlier version of Pig (still?) introduced an implicit project, which forced projection of only the referenced fields (when no schema was specified) or strict adherence to the specified schema, dropping the rest of the fields from the tuple.

    At least with this change, I hope, we can do something like this to alleviate the issue:

    A = LOAD '<path>' USING MyLoader();
    B = FOREACH A GENERATE $0, $3..$64;
    STORE B INTO '<output>' USING MyStore();


    Thanks for clarifying.
  • Daniel Dai (JIRA) at Apr 4, 2011 at 10:09 pm
    [ https://issues.apache.org/jira/browse/PIG-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015653#comment-13015653 ]

    Daniel Dai commented on PIG-1693:
    ---------------------------------

    Yes, range projection works without schema as well.
  • Mridul Muralidharan (JIRA) at Apr 5, 2011 at 10:01 pm
    [ https://issues.apache.org/jira/browse/PIG-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016145#comment-13016145 ]

    Mridul Muralidharan commented on PIG-1693:
    ------------------------------------------

    I am not sure what the comment means. Do you mean (in the example above):
    a) $3.. works for an unspecified number of columns when there is no load schema?
    b) or, $3..$MAX is required? (so we should be schema aware).


    Or do you simply mean '..' works when there is no loader schema (which I assumed it would anyway), without commenting on the actual use case I refer to above?

    Thanks,
    Mridul
  • Thejas M Nair (JIRA) at Apr 5, 2011 at 10:23 pm
    [ https://issues.apache.org/jira/browse/PIG-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016158#comment-13016158 ]

    Thejas M Nair commented on PIG-1693:
    ------------------------------------

    bq. a) $3.. works for an unspecified number of columns when there is no load schema ?
    Yes, "$3 .." works for an unspecified number of columns.
    This is similar to the way project-star ("*") works without an input schema. Since Pig does not know how many columns there will be, the expansion happens at runtime. In all other cases, the expansion of the project-range expression is done before the query plan is generated.

    bq. b) or, $3..$MAX is required ? (so we should be schema aware).
    No, this is not required.
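
    A minimal sketch of the two cases described above. The relation names, the 'input' path, and the column aliases are made up for illustration:
    {code}
    -- No load schema: Pig cannot know the column count up front,
    -- so "$3 .." is expanded at runtime, as with "*".
    A = load 'input';
    B = foreach A generate $0, $3 ..;

    -- With a load schema, the range can be expanded before the query plan is generated.
    C = load 'input' as (c0, c1, c2, c3, c4, c5);
    D = foreach C generate c0, c3 ..;   -- equivalent to: c0, c3, c4, c5
    {code}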

  • Mridul Muralidharan (JIRA) at Apr 5, 2011 at 10:27 pm
    [ https://issues.apache.org/jira/browse/PIG-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016159#comment-13016159 ]

    Mridul Muralidharan commented on PIG-1693:
    ------------------------------------------

    Thanks for clarifying, Thejas!
    Not sure how this will interact with JOIN and others (which was one of the rationales for forced project, I guess?), but this perfectly fits our use cases, along with a few in coke, I guess.


    - Mridul
  • Thejas M Nair (JIRA) at Apr 5, 2011 at 10:39 pm
    [ https://issues.apache.org/jira/browse/PIG-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016161#comment-13016161 ]

    Thejas M Nair commented on PIG-1693:
    ------------------------------------

    bq. Not sure how this will interact with JOIN and others (which was one the rationale for forced project I guess ?),
    "$3 .." (i.e. project-range-to-end) without a schema will work with join, but not with group or cogroup. (This limitation is documented in the release notes of this JIRA.)

  • Thejas M Nair (JIRA) at Apr 21, 2011 at 9:54 pm
    [ https://issues.apache.org/jira/browse/PIG-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Thejas M Nair updated PIG-1693:
    -------------------------------

    Release Note:

    Project-range ( '..' ) can be used to project a range of columns from the input.
    For example, the expressions -
    .. $x : projects columns $0 through $x, inclusive
    $x .. : projects columns $x through the end, inclusive
    $x .. $y : projects columns $x through $y, inclusive
    If the input relation has a schema, you can also use column aliases instead of referring to columns by position. You can also combine aliases and column positions in a project-range expression (i.e., "col1 .. $5" is valid).


    This expression can be used in all cases where the use of '*' (project-star) is allowed, except as a UDF argument. Support for that use case will be added in PIG-1938.

    It can be used in the following statements -
    - foreach
    - join
    - order (also when it is within a nested foreach block)
    - group/cogroup

    Examples -
    {code}
    grunt> F = foreach IN generate (int)col0, col1 .. col3;
    grunt> describe F;
    F: {col0: int,col1: bytearray,col2: bytearray,col3: bytearray}
    {code}
    {code}
    grunt> SORT = order IN by col2 .. col3, col0, col4 ..;
    {code}
    {code}
    J = join IN1 by $0 .. $3, IN2 by $0 .. $3;
    {code}
    {code}
    g = group l1 by b .. c;
    {code}
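    Applied to the ten-column case from the issue description, this might look like the following sketch. The load statement and its column aliases are assumed here for illustration:
    {code}
    Y = load 'input' as (firstcol, secondcol, thirdcol, fourthcol, fifthcol, sixthcol, seventhcol, eighthcol, ninthcol, tenthcol);
    -- cast the first column and carry the remaining nine through unchanged
    Z = foreach Y generate (int)firstcol, secondcol ..;
    store Z into 'output';
    {code}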

    Limitations:
    There are some restrictions on the use of the project-to-end form of project-range (e.g. "x .. ") when the input schema is null (unknown). These are also cases where the use of project-star ('*') is restricted.

    1. In cogroup/group statements, the project-to-end form of project-range is only allowed if the input has a schema.

    2. In an order-by statement, if the input schema is null, the project-to-end form of project-range is supported only as the last sort column.
    Example -
    {code}
    grunt> describe IN;
    Schema for IN unknown.

    -- Following statement is supported
    SORT = order IN by $2 .. $3, $6 ..;

    -- Following statement is NOT supported (project-to-end is not the last sort column)
    SORT = order IN by $2 .., $6;
    {code}



    was:

    Project-range ( '..' ) can be used to project a range of columns from the input.
    For example, the expressions -
    .. $x : projects columns $0 through $x, inclusive
    $x .. : projects columns $x through the end, inclusive
    $x .. $y : projects columns $x through $y, inclusive
    If the input relation has a schema, you can also use column aliases instead of referring to columns by position. You can also combine aliases and column positions in a project-range expression (i.e., "col1 .. $5" is valid).


    This expression can be used in all cases where the use of '*' (project-star) is allowed, except as a udf argument. Support for that use case will be added in PIG-1938.

    It can be used in following statements -
    - foreach
    - join
    - order (also when it is within a nested foreach block)
    - group/cogroup

    Examples -
    {code}
    grunt> F = foreach IN generate (int)col0, col1 .. col3;
    grunt> describe F;
    F: {col0: int,col1: bytearray,col2: bytearray,col3: bytearray}
    {code}
    {code}
    grunt> SORT = order IN by col2 .. col3, col0, col4 ..;
    {code}
    {code}
    J = join IN1 by $0 .. $3, IN2 by $0 .. $3;
    {code}
    {code}
    g = group l1 by b .. c;
    {code}

    Limitations:
    There are some restrictions on the use of the project-to-end form of project-range (e.g. "x .. ") when the input schema is null (unknown). These are also cases where the use of project-star ('*') is restricted.

    1. In cogroup/group statements, the project-to-end form of project-range is only allowed if the input has a schema.

    2. In an order-by statement, if the input schema is null, the project-to-end form of project-range is supported only as the last sort column.
    Note: there is a bug, PIG-1939, because of which the use is restricted when a schema is present. That should be fixed soon.
    Example -
    {code}
    grunt> describe IN;
    Schema for IN unknown.

    -- Following statement is supported
    SORT = order IN by $2 .. $3, $6 ..;

    -- Following statement is NOT supported (project-to-end is not the last sort column)
    SORT = order IN by $2 .., $6;
    {code}



