Grokbase Groups Pig dev December 2008
All,

A question on types in pig. When you say:

A = load 'myfile';

what exactly is A? For the moment let us call A a relation, since it
is a set of records, and we can pass it to a relational operator,
such as FILTER, ORDER, etc.

To clarify the question, is a relation equivalent to a bag? In some
ways it seems to be in our current semantics. Certainly you can turn
a relation into a bag:

A = load 'myfile';
B = group A all;

The schema of the relation B at this point is <group, A>, where A is
a bag. This does not necessarily mean that a relation is a bag,
because an operation had to occur to turn the relation into a bag
(the group all).

But bags can be turned into relations, and then treated again as if
they were bags:

C = foreach B {
C1 = filter A by $0 > 0;
generate COUNT(C1);
}

Here the bag A created in the previous grouping step is being treated
as if it were a relation and passed to a relational operator, and the
resulting relation (C1) treated as a bag to be passed COUNT. So at a
very minimum it seems that a bag is a type of a relation, even if not
all relations are bags.

But, if top level (non-nested) relations are bags, why isn't it legal
to do:

A = load 'myfile';
B = A.$0;

The second statement would be legal nested inside a foreach, but is
not legal at the top level.
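
For reference, the projection that `B = A.$0;` is reaching for is written at the top level with foreach ... generate, while the dotted form is accepted only on a bag inside a foreach. A minimal sketch, assuming the same untyped 'myfile' input; the aliases G and P are illustrative names that do not appear in the thread:

```pig
A = load 'myfile';
-- Top level: project the first field of every record.
B = foreach A generate $0;
-- Nested: after grouping, A is a bag, so the dotted projection is accepted.
G = group A all;
P = foreach G generate A.$0;
```

Both forms select the same field; they differ only in whether A is seen as a top-level relation or as a bag inside a record.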

We have been aware of this discrepancy for a while, and lived with
it. But I believe it is time to resolve it. We've noticed that some
parts of pig assume an equivalence between bag and relation (e.g. the
typechecker) and other parts do not (e.g. the syntax example above).
This inconsistency is confusing to users and developers alike. As
Pig Latin matures we need to strive to make it a logically coherent
and complete language.

So, thoughts on how it ought to be?

The advantage I see for saying a relation is equivalent to a bag is
simplicity of the language. There is no need to introduce another
data type. And it allows full relational operations to occur at both
the top level and nested inside foreach.

But this simplicity also seems to me the downside. Are we decoupling
the user so far from the underlying implementation that he will not
be able to see side effects of his actions? A top level relation is
assumably spread across many chunks and any operation on it will
require one or more map reduce jobs, whereas a relation nested in a
foreach is contained on one node. This also makes pig much more
complex, because while it may hide this level of detail from the
user, it clearly has to understand the difference between top level
and nested operations and handle both cases.

Alan.


  • Pi song at Dec 6, 2008 at 7:39 am
    Here is an example that I gave a while ago in JIRA Pig-158:

    A = LOAD 'fil1' ;
    B = A.($0,$1) ;
    STORE B ;

    which is similar to your top-level projection example.

    I believe there is no distinction between so-called relations and bags in
    our context.

    "A top level relation is assumably spread across many chunks and any
    operation on it will require one or more map reduce jobs, whereas a relation
    nested in a foreach is contained on one node." <== As I proposed before,
    whether to run across many nodes or not should have nothing to do with
    top-level or inner-level. The factor which comes into play should rather be
    "job size" which is heuristically calculated.

    To give users some power to control whether to run across nodes or not, we
    may later on introduce a hint keyword instead. This keeps the language
    simple, yet powerful when needed.

    Pi

  • Olga Natkovich at Dec 11, 2008 at 7:13 pm
    I think we should consider bags and relations to be the same, so that we
    can handle processing in the outer script and inside a nested foreach in
    the same way, and make it easier to extend the set of operators allowed
    inside a foreach block.

    Olga
    -----Original Message-----
    From: Alan Gates
    Sent: Friday, December 05, 2008 6:04 PM
    To: pig-dev@hadoop.apache.org
    Subject: What is a relation?

  • Ted Dunning at Dec 11, 2008 at 7:30 pm
    +1

    I always vote for consistency for the user. The difference between bags and
    relations is a really subtle point for most users.

    PIG should, frankly, be free to handle tiny computations as a bag even for
    the outer loop. There is no absolute requirement that a program be executed
    as map-reduce. Due to the functional nature of the language, PIG could even
    speculatively try to execute every computation with small inputs as a bag
    and abort if the computation takes more than, say, 2 seconds. That is still
    small in comparison to the startup time of an MR program and the win would
    be really big for cases where it works. There are bound to be gobs of other
    tricks that would take more than 10 seconds to come up with.


    --
    Ted Dunning, CTO
    DeepDyve
    4600 Bohannon Drive, Suite 220
    Menlo Park, CA 94025
    www.deepdyve.com
    650-324-0110, ext. 738
    858-414-0013 (m)
  • Pradeep Kamath at Dec 11, 2008 at 7:32 pm
    I find it somewhat inconsistent that we treat both relations and bags
    the same.

    SIZE(A) where A is a real bag will differ in implementation from
    SIZE(A) where A is a relation. For the former, all the data is already
    in a container and one can just inspect the size. For the latter, you
    have to do a group ALL followed by COUNT, which would be very confusing
    from a backend implementation point of view.

    If we do treat relations and bags as equivalent, then all statements
    which currently work on relations should work on bags (say in my input
    data). Here is an example:
    A = load 'bla' as (bg:{t:(x:int, y, z)}, str:chararray);
    B = filter A.bg by x < 100; -- Directly access the bag "bg" inside A
    -- (which is supposed to be a bag too) and filter on it; likewise, other
    -- operations possible on relations should work.
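
    For comparison, the intent of that filter can be expressed with the nested form that the thread describes as already legal, where the bag column is filtered inside a foreach. This is only a sketch under the schema given above; the alias `small` is an illustrative name, and the exact block syntax may vary by Pig version:

    ```pig
    A = load 'bla' as (bg:{t:(x:int, y, z)}, str:chararray);
    B = foreach A {
        -- bg is a bag column of the current record, so filter applies to it
        small = filter bg by x < 100;
        generate small, str;
    };
    ```

    The proposal under discussion would let the same filter be written directly against A.bg at the top level.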

    Also, A = load 'bla'; B = COUNT(A); will have to be supported (implicitly,
    by a map reduce boundary doing a group ALL followed by COUNT). This will
    be done under the covers, and it may not be obvious to a user that an
    explicit group ALL followed by COUNT and a direct COUNT(A) are the same.
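
    The explicit form referred to here can be sketched as follows, assuming the same 'bla' input; the alias G is an illustrative name:

    ```pig
    A = load 'bla';
    -- Collect every record of A into a single bag (one output record).
    G = group A all;
    -- Count the bag; B is a relation holding one record with one long.
    B = foreach G generate COUNT(A);
    ```

    A direct `B = COUNT(A);` would have to be rewritten into something like this behind the scenes, which is the hidden map reduce boundary the message describes.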


    Thanks,
    Pradeep

  • Ted Dunning at Dec 11, 2008 at 8:10 pm
    All of what you say sounds like a feature to me rather than a problem.

    Yes, the implementor needs to do it right, but that kind of goes with the
    territory.
  • Santhosh Srinivasan at Dec 11, 2008 at 8:16 pm
    In the existing implementation, Pig has the following dilemma.

    1. A subset of the relational operators is allowed inside a foreach.
    E.g.: filter, distinct, order by

    2. Non-relational operators that are allowed inside a foreach are not
    allowed outside a foreach.
    E.g.: projections (A = B.$0;), assignments of scalars (X = COUNT(D);),
    etc.
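
    The two points above can be seen together in one nested block. The sketch below is illustrative only: the input schema (k, v:int) and the aliases G, R, pos, srt, vals and uniq are assumptions, not names from the thread, and the exact block syntax may vary by Pig version:

    ```pig
    A = load 'input' as (k, v:int);
    G = group A by k;
    R = foreach G {
        pos  = filter A by v > 0;   -- relational operator applied to the bag A
        srt  = order pos by v;      -- relational operator on a nested alias
        vals = A.v;                 -- projection: accepted here, not at top level
        uniq = distinct vals;       -- relational operator on the projected bag
        generate group, COUNT(uniq), srt;
    };
    ```

    Inside the block, A behaves like a relation for filter, order and distinct, and like a bag for the projection and for COUNT.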

    Let's assume that a relation is a bag (and vice versa too).

    Scenario 1:
    -----------
    In the future if there are plans to allow all operators that exist
    inside a foreach outside and vice versa, we will have the following
    problem:

    A = load 'input';
    B = COUNT(A);
    C = group A by $0;
    D = foreach C { X = COUNT(A); generate X; };

    Is B a relation?
    If yes, is B a bag that contains tuples of longs?
    If no, is B a scalar of type long?

    Scenario 2:
    -----------

    If there are no plans to allow operators inside a foreach outside (and
    not vice versa).

    It makes good sense to treat relations as bags and vice versa but there
    are some open questions:

    1. Do storage functions indicate that the stored data is a bag?
    Likewise, do load functions treat the stored data as bags?
    2. Will there be an equivalence of the operators wherein bags can
    replace relations in all operators that support relational operator
    inputs?
    E.g: Pradeep alluded to the use of a bag column inside a relation with
    other relational operators.
    A = load 'input' as (x: int, b: {t:(a: int)});
    B = filter A.b by a > 10;

    Conclusion
    -----------

    The equivalence of bags and relations is influenced by the long term
    plan of what will be legal in the language and not necessarily
    influenced by the current state of the language. In the short term, it
    makes sense to treat relations as bags (and vice versa in some cases).
    In the long term, a relation should be treated as its own type, with
    legal operations defined on that type.

    Santhosh

  • Ted Dunning at Dec 11, 2008 at 8:38 pm
    You are correct.

    This is a problem with the language. The same function means different
    things in different contexts with (essentially) the same kind of input.

    Do you propose to fix the user confusion caused by inconsistent language
    constructs (aka a language defect) by an even less consistent
    implementation?

    That seems backwards to me. To fix a language defect, the language should
    be changed so that users get a consistent world view. Then implementations
    should implement that.

    There are a bunch of different ways to fix this. Here are two. Note that
    neither has the benefit of more than a few seconds of thought and that
    thought was from a person who is not an expert (by any stretch).

    proposal 1) COUNT and all other scalar producing functions always return a
    relation with a single row and column that contains the long of interest.
    All operations that require a long will transparently unwrap the desired
    value from such singleton relations.

    proposal 2) COUNT always returns a scalar value, but all scalar values are
    transparently treated as singleton relations when necessary.
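
A minimal sketch of the two proposals above, written as a toy Python model rather than anything from Pig itself. The names `Relation`, `count_as_relation`, `unwrap_scalar`, `count_as_scalar` and `as_relation` are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Relation:
    """Toy stand-in for a relation: an in-memory list of tuples."""
    rows: List[tuple]


def count_as_relation(rel: Relation) -> Relation:
    """Proposal 1: COUNT returns a relation with one row and one column."""
    return Relation([(len(rel.rows),)])


def unwrap_scalar(value: Union[Relation, int]) -> int:
    """Proposal 1: unwrap a singleton relation where a long is required."""
    if isinstance(value, Relation):
        if len(value.rows) != 1 or len(value.rows[0]) != 1:
            raise ValueError("not a singleton relation")
        return value.rows[0][0]
    return value


def count_as_scalar(rel: Relation) -> int:
    """Proposal 2: COUNT returns a plain scalar."""
    return len(rel.rows)


def as_relation(value: Union[Relation, int]) -> Relation:
    """Proposal 2: promote a scalar to a singleton relation where needed."""
    if isinstance(value, Relation):
        return value
    return Relation([(value,)])
```

In this model the two proposals are mirror images: either way the caller sees one consistent value, and only the direction of the implicit conversion differs.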

    On Thu, Dec 11, 2008 at 12:15 PM, Santhosh Srinivasan wrote:

    A = load 'input';
    B = COUNT(A);
    C = group A by $0;
    D = foreach C { X = COUNT(A); generate X; };

    Is B a relation?


  • Alan Gates at Dec 11, 2008 at 8:51 pm
    I think the question isn't about feature or problem, or how hard it
    is to implement. It's about what level we want the language at. The
    advantage of making a relation equivalent to a bag, as Ted and others
    point out, is ease of comprehension on the part of the users. They are
    not required to maintain an artificial distinction between what's
    happening in parallel and what's happening on a single node. To put
    Pradeep's point a slightly different way, by hiding this distinction
    we make it harder for the users to understand how pig will process
    their scripts. Moving Pig Latin further from execution will mean
    that users will, at times, make less optimal choices in writing their
    scripts because they may not realize that counting a bag at the top
    level has a very different cost than counting a bag nested in a
    foreach. So the choice is between a higher level abstraction that is
    easier to think about (e.g. Python) or a lower level abstraction that
    forces the user to think more like the machine and thus hopefully
    make better choices (e.g. C). It sounds like most of the community
    is voting for the higher abstraction.
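
To make the cost difference concrete, here is a rough Python sketch. It is a toy model under stated assumptions, not Pig's actual execution plan: a nested bag is modelled as a single in-memory list, and a top level relation as a list of chunks that are each counted separately and then combined. The function names are hypothetical.

```python
from typing import Iterable, List


def count_nested_bag(bag: List[tuple]) -> int:
    """A bag nested in a foreach sits on one node, so counting is local."""
    return len(bag)


def count_top_level(chunks: Iterable[List[tuple]]) -> int:
    """A top level relation is spread over chunks (assumed layout).

    Map step: each chunk produces a partial count.
    Reduce step: the partial counts are summed into the final count.
    """
    partial_counts = [len(chunk) for chunk in chunks]
    return sum(partial_counts)
```

Both calls answer the same question, but the second one stands in for a full map reduce job, which is the cost a user may not see if the two are written identically.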

    To respond to Pradeep's statement about the filter, namely that

    B = filter A.bg by x < 100;

    (which, if I understood correctly, would mean filtering the records
    of bg on bg.x < 100) should be legal if we say all relations are
    bags: I think this is incorrect. The filter in this statement is
    acting on A, not A.bg. The correct way to write the above statement
    would be:

    B = foreach A {
    A1 = filter bg by x < 100;
    generate A1;
    }

    I believe this holds whatever we say about relations being bags. So
    the semantic is that a relational operator always applies only to the
    relation/bag it is applied to. In order to access elements of a
    relation/bag, the foreach operator is provided.
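
The semantic described above can be sketched in Python as a toy model. Records are plain dicts here and `filter_nested_bag` is a hypothetical helper, not part of Pig; the sample data is made up for illustration.

```python
from typing import Callable, Dict, List


def filter_nested_bag(
    relation: List[Dict],
    bag_field: str,
    predicate: Callable[[Dict], bool],
) -> List[List[Dict]]:
    """Model of: B = foreach A { A1 = filter bg by <predicate>; generate A1; }

    The foreach walks the records of the outer relation; the filter is
    applied only to the bag inside each record.
    """
    return [
        [t for t in record[bag_field] if predicate(t)]
        for record in relation
    ]


# Made-up sample input with the schema (bg: bag of tuples with x, str).
A = [
    {"bg": [{"x": 5}, {"x": 150}], "str": "first"},
    {"bg": [{"x": 200}], "str": "second"},
]
B = filter_nested_bag(A, "bg", lambda t: t["x"] < 100)
```

Note that B has one output record per record of A, including an empty bag for the second record. A filter applied to A itself would instead drop whole records of A.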

    Alan.

    On Dec 11, 2008, at 12:09 PM, Ted Dunning wrote:

    All of what you say sounds like a feature to me rather than a problem.

    Yes, the implementor needs to do it right, but that kind of goes with
    the territory.

    On Thu, Dec 11, 2008 at 11:32 AM, Pradeep Kamath <pradeepk@yahoo-inc.com> wrote:
    I find it somewhat inconsistent that we treat both relations and bags
    the same.

    SIZE(A) where A is a real bag will be different in implementation than
    SIZE(A) where A is a relation. For the former, all the data is already
    in a container and one can just inspect the size. For the latter, you
    have to do a group ALL-COUNT. This would be very confusing from a
    backend implementation point of view.

    If we do treat relations and bags as equivalent, then all statements
    which currently work on relations should work on bags (say in my input
    data). Here is an example:

    A = load 'bla' as (bg:{t:(x:int, y, z)}, str:chararray);
    B = filter A.bg by x < 100; -- Directly access the bag "bg" inside A
    -- (which is supposed to be a bag too) and filter on it. Likewise, other
    -- operations possible on relations should work.

    Also A = load 'bla'; B = COUNT(A); will have to be supported (implicitly
    by a map reduce boundary doing a group ALL-COUNT). This will be done
    under the covers and it may not be obvious to a user that an explicit
    group ALL-COUNT and a direct COUNT(A) are the same.


    Thanks,
    Pradeep

    -----Original Message-----
    From: Olga Natkovich
    Sent: Thursday, December 11, 2008 11:12 AM
    To: pig-dev@hadoop.apache.org
    Subject: RE: What is a relation?

    I think we should consider bags and relations to be the same so that we
    can handle processing in the outer script as well as inside of nested
    foreach the same and make it easier to extend the set of operators
    allowed inside of a foreach block.

    Olga
    -----Original Message-----
    From: Alan Gates
    Sent: Friday, December 05, 2008 6:04 PM
    To: pig-dev@hadoop.apache.org
    Subject: What is a relation?

    All,

    A question on types in pig. When you say:

    A = load 'myfile';

    what exactly is A? For the moment let us call A a relation,
    since it is a set of records, and we can pass it to a
    relational operator, such as FILTER, ORDER, etc.

    To clarify the question, is a relation equivalent to a bag?
    In some ways it seems to be in our current semantics.
    Certainly you can turn a relation into a bag:

    A = load 'myfile';
    B = group A all;

    The schema of the relation B at this point is <group, A>,
    where A is a bag. This does not necessarily mean that a
    relation is a bag, because an operation had to occur to turn
    the relation into a bag (the group all).

    But bags can be turned into relations, and then treated again
    as if they were bags:

    C = foreach B {
    C1 = filter A by $0 > 0;
    generate COUNT(C1);
    }

    Here the bag A created in the previous grouping step is being
    treated as if it were a relation and passed to a relational
    operator, and the resulting relation (C1) treated as a bag to
    be passed COUNT. So at a very minimum it seems that a bag is
    a type of a relation, even if not all relations are bags.

    But, if top level (non-nested) relations are bags, why isn't
    it legal to do:

    A = load 'myfile';
    B = A.$0;

    The second statement would be legal nested inside a foreach,
    but is not legal at the top level.

    We have been aware of this discrepancy for a while, and lived
    with it. But I believe it is time to resolve it. We've
    noticed that some parts of pig assume an equivalence between
    bag and relation (e.g. the typechecker) and other parts do not
    (e.g. the syntax example above). This inconsistency is confusing
    to users and developers alike. As Pig Latin matures we need to
    strive to make it a logically coherent and complete language.

    So, thoughts on how it ought to be?

    The advantage I see for saying a relation is equivalent to a
    bag is simplicity of the language. There is no need to
    introduce another data type. And it allows full relational
    operations to occur at both the top level and nested inside foreach.

    But this simplicity also seems to me to be the downside. Are we
    decoupling the user so far from the underlying implementation
    that he will not be able to see side effects of his actions?
    A top level relation is presumably spread across many chunks
    and any operation on it will require one or more map reduce
    jobs, whereas a relation nested in a foreach is contained on
    one node. This also makes pig much more
    complex, because while it may hide this level of detail from
    the user, it clearly has to understand the difference between
    top level and nested operations and handle both cases.

    Alan.



Discussion Overview
group: dev @
categories: pig, hadoop
posted: Dec 6, '08 at 2:05a
active: Dec 11, '08 at 8:51p
posts: 9
users: 6
website: pig.apache.org
