Hi,

I've developed a UDF that receives two bags as inputs and outputs one bag.

One of the bags is different in every group and the other is always the
same.

Example code:

A = LOAD 'a' AS (group, value);
B = LOAD 'b';
G = GROUP A BY group;
R = FOREACH G GENERATE FLATTEN(my.udf(A,B));

This gives the error "Error during parsing. Invalid alias: B".
I understand why this error occurs, but I cannot figure out another
way to do this.

Do you know the best way to do this?

Thanks

--
Goodbye, and see you soon.
J:-Deu


  • Alan Gates at Apr 30, 2010 at 3:54 pm
    You need to change your group to a cogroup so that both bags are in
    your data stream. If you don't want to group bag b by the same keys
    as a (that is, you want all of b available for each group of a) then
    you can load b as a side file inside your udf.

    Alan.
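
    A minimal sketch of the cogroup suggestion above. It assumes that 'b'
    carries a key column comparable to the one in 'a'; the field names grp,
    value, and payload are illustrative and not from the original post.

    ```pig
    -- Sketch: bring both relations into the data stream with COGROUP.
    -- Assumption: 'b' has a key column (here called grp) matching A's key.
    -- The field names grp, value, and payload are hypothetical.
    A = LOAD 'a' AS (grp, value);
    B = LOAD 'b' AS (grp, payload);
    G = COGROUP A BY grp, B BY grp;
    -- Each tuple of G is (group, A: {bag}, B: {bag}),
    -- so both bags are valid arguments inside the FOREACH.
    R = FOREACH G GENERATE FLATTEN(my.udf(A, B));
    ```

    Note that the B bag in each tuple holds only the rows of 'b' whose key
    matches that group. It does not hold all of 'b', which is why the side
    file is suggested for the "always the same" case.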
  • Hc busy at Apr 30, 2010 at 4:45 pm
    Sometimes I find it necessary to project before performing the group by.
    Because Pig has no support for functions or #defines, there is no way to
    pass in which column to group by other than projecting before grouping.

    A = LOAD 'a' AS (group, value);
    B = LOAD 'b';
    B2 = FOREACH B GENERATE $5 AS group, *;
    G = GROUP A BY group, B2 BY group;
    R = FOREACH G GENERATE FLATTEN(my.udf(A,B2));

    Wouldn't introducing #define in Pig speed this up? Adding a preprocessor,
    similar to parameter substitution, to support a basic #define would be
    cool.

    #define JordiGroup(t1, t2, f1, f2) {
        G = group t1 by f1, t2 by f2;
        FOREACH G GENERATE FLATTEN(my.udf(t1,t2));
    }

    ... and later on

    R = JordiGroup(A, B, group, $5);

    Here the result of the #define is its last line. The implementation would
    need only a really simple parser to ensure that (), [] and {} match for
    blocks starting with '#define'. It would then perform substitution in the
    order the macros appear; no recursion is allowed.
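
    For comparison, Pig releases from 0.9 onward include a macro facility
    (DEFINE ... RETURNS) that is close to this proposal. The sketch below
    restates JordiGroup in that syntax; it assumes such a Pig version, and
    the names jordi_group, grp, and B2 are illustrative.

    ```pig
    -- Sketch: the proposed #define written as a Pig macro (Pig 0.9 or later).
    -- Parameters are referenced as $name; the macro result is assigned to $R.
    DEFINE jordi_group(t1, t2, f1, f2) RETURNS R {
        G = COGROUP $t1 BY $f1, $t2 BY $f2;
        $R = FOREACH G GENERATE FLATTEN(my.udf($t1, $t2));
    };

    A  = LOAD 'a' AS (grp, value);
    B  = LOAD 'b';
    -- Project the key column first so that it has a name to pass in.
    B2 = FOREACH B GENERATE $5 AS grp, *;
    R  = jordi_group(A, B2, grp, grp);
    ```

    The key of B is projected to a named field before the call rather than
    passed as $5, to keep the macro arguments plain identifiers.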



  • Dmitriy Ryaboy at Apr 30, 2010 at 4:55 pm
    I don't think there's a need to reinvent, or reimplement, the wheel here.

    You are just talking about templates. Try http://template-toolkit.org/
    (or any of the Ruby / Python variants on the theme).

    Or the Ruby Oink DSL.

    -D
  • Hc busy at Apr 30, 2010 at 4:57 pm
    Is there a Java preprocessor?
  • Dmitriy Ryaboy at Apr 30, 2010 at 5:04 pm
    http://www.stringtemplate.org/
  • Hc busy at Apr 30, 2010 at 5:51 pm
    But we don't want to extend Pig Latin to support #define?
  • Jordi Deu-Pons at May 1, 2010 at 6:38 am
    OK,

    "... then you can load b as a side file inside your udf."

    I will try to implement this approach.

    Maybe in the future it would be useful to allow a LOAD inside a FOREACH.

    Thanks.
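
    A minimal sketch of the side-file approach on the Pig Latin side. It
    assumes that my.udf has a constructor accepting the path of 'b' as a
    string and that the UDF reads the file itself; neither detail is
    confirmed in the thread, and the names udf_with_b and grp are
    illustrative.

    ```pig
    -- Sketch: hand the side-file location to the UDF instead of a second bag.
    -- Assumption: my.udf has a constructor taking the path as a String and
    -- loads 'b' itself (for example, once, on the first call).
    DEFINE udf_with_b my.udf('b');

    A = LOAD 'a' AS (grp, value);
    G = GROUP A BY grp;
    -- Only the per-group bag A is passed; all of 'b' comes from the side file.
    R = FOREACH G GENERATE FLATTEN(udf_with_b(A));
    ```

    With this arrangement the script no longer refers to the alias B inside
    the FOREACH, so the "Invalid alias: B" error does not arise.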

Discussion Overview
group: user @
categories: pig, hadoop
posted: Apr 30, '10 at 11:32a
active: May 1, '10 at 6:38a
posts: 8
users: 4
website: pig.apache.org
