Grokbase Groups Pig dev August 2008
This email discusses a use case of flattening a bag or tuple when the
schema of the bag or tuple is not known, i.e., null.

When UDFs return bags or tuples (complex type), the schema of the
complex type can be declared via the outputSchema method of the UDF. By
default, the outputSchema method in EvalFunc (the abstract base class)
returns null. When users try to flatten the output of the UDF, the
schema of the flattened column cannot be determined. An example follows.

E.g.:

--myudf returns a bag whose schema is null, i.e., not declared
B = foreach A generate flatten(myudf), $1 as x;

In the above example, since the schema of the bag returned by myudf is
not known, we have two possible options:

1. Erring on the side of safety, set the schema of the flattened column
to be a bytearray. While this is a safe assumption, UDF authors who know
the exact return value of the UDF will try to access the elements
accordingly. For example, if myudf returned a bag of tuples containing
3 elements, the following might be a possible use case:

C = foreach B generate $2 as mycolumn;

At this point, the safe assumption that the flattened column is a
single column of type bytearray yields {bytearray, x: bytearray} as the
schema for B. As a result, statement C will raise a parse exception for
out-of-bounds access, since $2 refers to a third column that the schema
for B does not have.

Given that UDF authors have complete knowledge of the return values of
the UDF, they should override the outputSchema method in the UDF to
ensure correct schemas. The other option is to specify the schema in the
"AS" clause of the generate statement, i.e.,

B = foreach A generate flatten(myudf) as (name: chararray, age: int,
gpa: float), $1 as x;

2. Set the schema of the foreach to be unknown or null. The bag returned
by the UDF could contain an arbitrary number of columns, making it
impossible to set the correct column number for the other expression, x,
in the generate clause. In all likelihood, this will break existing Pig
scripts such as:

B = foreach A generate flatten(myudf), $1 as x;
C = foreach B generate $1 + x;
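
The difference between the two options, and the effect of declaring the
schema, can be sketched with a small Python model. This is only an
illustration of the reasoning above, not Pig's implementation: the names
flatten_schema, check_position and resolve_alias are hypothetical, a
schema is modelled as a list of (alias, type) pairs, and None stands for
an unknown (null) schema.

```python
from typing import List, Optional, Tuple

Field = Tuple[Optional[str], str]   # (alias or None, type name)
Schema = Optional[List[Field]]      # None models an unknown (null) schema


def flatten_schema(udf_schema: Schema, option: int) -> Schema:
    """Model the schema of: foreach A generate flatten(myudf), $1 as x."""
    x_field: Field = ("x", "bytearray")
    if udf_schema is not None:
        # Declared via outputSchema or an AS clause: the columns are known.
        return list(udf_schema) + [x_field]
    if option == 1:
        # Option 1: treat the flattened bag as one bytearray column.
        return [(None, "bytearray"), x_field]
    # Option 2: the whole foreach has no schema.
    return None


def check_position(schema: Schema, index: int) -> None:
    """Model the parse-time check of a positional reference such as $2."""
    if schema is None:
        return  # nothing to check against when the schema is unknown
    if index >= len(schema):
        raise IndexError(f"${index} is out of bounds for {len(schema)} column(s)")


def resolve_alias(schema: Schema, alias: str) -> int:
    """Map an alias such as x to its column position."""
    if schema is None:
        raise LookupError(f"cannot resolve alias {alias!r}: schema is unknown")
    for position, (name, _type) in enumerate(schema):
        if name == alias:
            return position
    raise LookupError(f"alias {alias!r} not found")


# Hypothetical declared schema, taken from the AS clause example above.
declared = [("name", "chararray"), ("age", "int"), ("gpa", "float")]

option1 = flatten_schema(None, option=1)       # two columns, so $2 is rejected
option2 = flatten_schema(None, option=2)       # None, so x cannot be resolved
with_as = flatten_schema(declared, option=1)   # four columns, x at position 3
```

In this model, check_position(option1, 2) raises an error, matching the
out-of-bounds problem described for option 1, while
resolve_alias(option2, "x") raises an error, matching the alias problem
described for option 2. With the declared schema both calls succeed.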


Currently, I have an implementation for option 1. Any
thoughts/suggestions/comments are welcome.

Thanks,
Santhosh

  • Alan Gates at Sep 4, 2008 at 11:48 pm
    I vote for option 2, as it is consistent with other Pig operations.
    When we load a file and no schema is given, we make no assumptions.
    When we union two relations with differing schemas, the resulting
    relation has no schema. I think it makes sense to do the same thing
    here. If the user happens to know his UDF's schema, he can provide it
    via an AS clause. I agree that this will break some scripts, but it is
    consistent with the rest of the way we do things.

    Alan.
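
The consistency argument in this reply can be illustrated with a short
Python sketch. It is a simplified model under stated assumptions, not
Pig's actual logic: load_schema and union_schema are hypothetical names,
None stands for "no schema", and "differing schemas" is modelled as
plain inequality of the two field lists.

```python
from typing import List, Optional, Tuple

Field = Tuple[Optional[str], str]   # (alias or None, type name)
Schema = Optional[List[Field]]      # None models "no schema"


def load_schema(declared: Schema = None) -> Schema:
    """A load without a declared schema makes no assumptions."""
    return declared


def union_schema(left: Schema, right: Schema) -> Schema:
    """Keep a schema only when both inputs agree; otherwise there is none."""
    if left is None or right is None or left != right:
        return None
    return list(left)


# Hypothetical relations used only for this illustration.
people = load_schema([("name", "chararray"), ("age", "int")])
scores = load_schema([("name", "chararray"), ("gpa", "float")])
raw = load_schema()

same = union_schema(people, people)     # schema is preserved
mixed = union_schema(people, scores)    # None: the schemas differ
unknown = union_schema(people, raw)     # None: one side has no schema
```

Under option 2, flattening a UDF result with an undeclared schema would
fall into the same "no schema" case as mixed and unknown here.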


Discussion Overview
group: dev @
categories: pig, hadoop
posted: Aug 29, '08 at 4:52p
active: Sep 4, '08 at 11:48p
posts: 2
users: 2
website: pig.apache.org

2 users in discussion

Alan Gates: 1 post Santhosh Srinivasan: 1 post
