Black hole of multiple level dereference on "bag in bag" structure: cannot reach deeper levels
----------------------------------------------------------------------------------------------

Key: PIG-2259
URL: https://issues.apache.org/jira/browse/PIG-2259
Project: Pig
Issue Type: Bug
Components: parser
Affects Versions: 0.9.0
Environment: Pig 0.9.0 local version, on Linux x86 and Mac OS X 10.7.1
Reporter: JArod Wen
Labels: bag, dereference, pig


I noticed that dereference cannot reach the second level of a bag in a "bag in bag" structure. Here is an example:

For the following scripts:

a = load 'grade.dat' as (name, age, gpa);
b = load 'rate.dat' as (state, age, rate);
ag = group a by (name, age);
c = cogroup ag by group.age, b by age;
cf = foreach c generate $1.$0;

The relation c has the following schema:

bytearray, bag{tuple(tuple(bytearray, bytearray), bag{tuple(bytearray, bytearray, bytearray)})}, bag{tuple(bytearray, bytearray, bytearray)}

So for c, $1.$0 means the first field of the bag "ag", which is the tuple group(name, age). However, $1.$0.$0 and $1.$0.$0.$0 then return that same tuple, with no deeper dereference. In fact, we can append an arbitrary number of ".$0" after $1.$0 and still stay at the same position.

The reason for this interesting "black hole" of dereference is that when we dereference a bag, we automatically create another bag structure. So after we obtain the "group(name, age)" tuple from the bag "ag", a bag wrapper is added around the tuple and it becomes

bag{tuple(tuple(bytearray, bytearray))}

Then no matter how many dereferences are appended, this structure cannot be changed since every dereference just "takes off" the outer bag wrapper and "puts on" the same bag wrapper.

For the same reason, the following script can also produce the same "black hole":

cf = foreach c generate $1.$1.$0. ... (arbitrary number of ".$0")
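
The fixed-point behaviour described above can be illustrated with a small toy model. This is only a sketch in Python, not Pig's implementation: it assumes a bag can be modelled as a list of tuples and that projecting a column from a bag re-wraps each value in a one-field tuple. The function name `project` and the sample rows are hypothetical.

```python
# Toy model (an assumption, not Pig's code): a bag is a list of tuples.
def project(bag, index):
    """Model of bag.$index: keep one column and re-wrap each value in a 1-field tuple."""
    return [(row[index],) for row in bag]

# Hypothetical contents of the "ag" bag: (group tuple, inner bag of rows from a).
ag = [
    (("alice", 20), [("alice", 20, 3.5)]),
    (("bob", 20), [("bob", 20, 3.1)]),
]

once = project(ag, 0)       # models $1.$0
twice = project(once, 0)    # models $1.$0.$0
thrice = project(twice, 0)  # models $1.$0.$0.$0

# Every further ".$0" unwraps the 1-field tuple and wraps it again,
# so the result never changes after the first projection.
print(once == twice == thrice)
```

In this model the same holds for $1.$1 followed by any number of ".$0": the inner bag is the only field left, so projecting field 0 returns the same structure.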

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

  • Daniel Dai (JIRA) at Sep 5, 2011 at 3:26 am
    [ https://issues.apache.org/jira/browse/PIG-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13096992#comment-13096992 ]

    Daniel Dai commented on PIG-2259:
    ---------------------------------

    The major problem here is that we don't have a way to slice a bag horizontally. Consider the following bag:
    bag:{(1,{(a),(b)}),
    (2,{(c),(d)})}
    We can only slice the bag vertically:
    with bag.$0 we get all the first elements of the bag, and the resulting data structure can only be a bag:
    {(1),(2)}
    Similarly, bag.$1:
    {({(a),(b)}),({(c),(d)})}

    I guess what you want is the ability to access an individual cell inside a bag. This requires the ability to slice the bag horizontally, such as bag[0]=(1,{(a),(b)}), so that you can then refer to bag[0].$1={(a),(b)}. However, this is not what a bag is designed to be. A bag is a collection of tuples in which no order is defined, so you can only iterate through a bag. You can access tuples inside a bag with a custom UDF. I don't know how to provide something at the semantic level to access a specific tuple inside a bag. I would suggest providing more built-in UDFs for bag processing, such as:
    1. GetTupleInBag(bag, i): get the ith tuple
    2. GetFirstTupleWithValue(bag, j, value): get the first tuple that carries "value" in its jth column
    Both UDFs need to iterate through the bag to find the specific element, so the time complexity is O(n).
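
A minimal sketch of the two suggested helpers, written as plain Python functions over a bag modelled as a list of tuples. These are not existing Pig built-ins; the snake_case names mirror the proposal above, and returning `None` when nothing matches is an assumption made for illustration. Because a bag has no defined order, "ith" here only means the ith tuple in whatever order iteration happens to produce.

```python
def get_tuple_in_bag(bag, i):
    """Return the i-th tuple in iteration order, or None if there is none. O(n)."""
    for position, row in enumerate(bag):
        if position == i:
            return row
    return None


def get_first_tuple_with_value(bag, j, value):
    """Return the first tuple whose j-th column equals value, or None. O(n)."""
    for row in bag:
        if len(row) > j and row[j] == value:
            return row
    return None


# The example bag from the comment above, with string stand-ins for a, b, c, d.
bag = [(1, [("a",), ("b",)]), (2, [("c",), ("d",)])]

first = get_tuple_in_bag(bag, 0)
print(first[1])                               # the inner bag of that tuple
print(get_first_tuple_with_value(bag, 0, 2))  # the tuple keyed by 2
```

Both functions scan the bag linearly, which matches the O(n) cost noted in the comment.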
  • JArod Wen (JIRA) at Sep 6, 2011 at 4:47 pm
    [ https://issues.apache.org/jira/browse/PIG-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13098152#comment-13098152 ]

    JArod Wen commented on PIG-2259:
    --------------------------------

    Thanks, Daniel, for your comments. Besides slicing a bag horizontally, another way of thinking about "dereferencing a bag within a bag" leads to the logic of flattening a nested bag. Since a bag is an unordered set of tuples, when all tuples inside it have the same schema and one of the fields is a bag field, it should be doable to extract the fields of the inner bag.

    For example, extending the one you provided:

    bag: {(1, {(a, 0.3), (b, 0.4)}), (2, {(c, 0.5), (d, 0.6)})}.

    The dereference bag.$1.$0 could then produce the output

    new_bag: {({(a), (b)}), ({(c), (d)})}.

    So here the order still does not matter. This is different from horizontal slicing, where the order really matters. What do you think?
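
The proposed semantics can be sketched in Python as follows. This is a model of the suggestion only, not of what Pig 0.9.0 does: it assumes bags are lists of tuples, and the names `project` and `project_nested` are hypothetical.

```python
def project(bag, index):
    """Model of bag.$index: keep one column, one 1-field tuple per input tuple."""
    return [(row[index],) for row in bag]


def project_nested(bag, outer_index, inner_index):
    """Model of the proposed bag.$outer.$inner: project inside each inner bag,
    producing one result bag per outer tuple."""
    return [(project(row[outer_index], inner_index),) for row in bag]


# The example bag from the comment, with strings standing in for a, b, c, d.
bag = [(1, [("a", 0.3), ("b", 0.4)]), (2, [("c", 0.5), ("d", 0.6)])]

new_bag = project_nested(bag, 1, 0)
print(new_bag)
```

Each outer tuple is handled independently, so the result does not rely on any ordering of the tuples in either bag.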
  • Jonathan Coveney (Commented) (JIRA) at Dec 17, 2011 at 11:16 pm
    [ https://issues.apache.org/jira/browse/PIG-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171703#comment-13171703 ]

    Jonathan Coveney commented on PIG-2259:
    ---------------------------------------

    I actually think I get what Jarod means, and agree. Let's say you have a bag

    b:bag{t:tuple(x:int, b:bag{t:tuple(a:int,b:int,c:int)})}

    It'd be nice to be able to do
    b.$0.$1 in order to grab that inner bag. You could, alternately, do b.$0, flatten it, then access the $0 field, but that is way more clunky.

    I'll look around and see how hard this would be to do (probably not terribly difficult). The question is more whether we should support this (and I would say we should).
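
For comparison, the flatten-based route can be modelled like this. It is a sketch in Python of how flattening a bag field un-nests rows, with the relation and bags modelled as lists of tuples; the function name `flatten_bag_field` and the sample values are hypothetical.

```python
def flatten_bag_field(rows, index):
    """Model of flattening the bag at position index: emit one output row per
    inner tuple, splicing the inner tuple's fields in place of the bag."""
    out = []
    for row in rows:
        for inner in row[index]:
            out.append(row[:index] + inner + row[index + 1:])
    return out


# Tuples shaped like (x, bag of (a, b, c)); the values are made up.
rows = [
    (1, [(10, 11, 12), (20, 21, 22)]),
    (2, [(30, 31, 32)]),
    (3, []),  # an empty inner bag contributes no output rows
]

flat = flatten_bag_field(rows, 1)
column_a = [(r[1],) for r in flat]  # field "a" of every inner tuple
print(flat)
print(column_a)
```

Note that this route loses the grouping by outer tuple: the inner values come back as one flat list of rows rather than one bag per outer tuple.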
  • Daniel Dai (Commented) (JIRA) at Dec 20, 2011 at 8:03 pm
    [ https://issues.apache.org/jira/browse/PIG-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173468#comment-13173468 ]

    Daniel Dai commented on PIG-2259:
    ---------------------------------

    It is semantically right if this involves a flatten. Then we need to limit its usage to foreach, since that is the only operator that has the notion of flatten. I am a little worried that people may misuse it, but I am open to it.
  • JArod Wen (Commented) (JIRA) at Dec 20, 2011 at 8:15 pm
    [ https://issues.apache.org/jira/browse/PIG-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173475#comment-13173475 ]

    JArod Wen commented on PIG-2259:
    --------------------------------

    Actually, now that I am rethinking this problem, I prefer Daniel's opinion.

    This may be a question of whether or not we can assume that the bag is a typed bag. In the general case, no assumption can be made about the schema within the bag, so flatten() is necessary in order to get inside a bag of bags.

    However, if the parser knows that it is a typed bag, b.$0.$1 should be preferred.

Discussion Overview
group: dev @
categories: pig, hadoop
posted: Sep 1, '11 at 1:29a
active: Dec 20, '11 at 8:15p
posts: 6
users: 1
website: pig.apache.org