Grokbase Groups Pig user July 2011
Hello,

I want to use a FOREACH statement to filter the tuples in a bag, but it
doesn't work. My Pig code is as follows:

A = LOAD '/home/test/student.txt' AS (name:chararray, no:int, score: int);
B = GROUP A BY no;
C = FOREACH B {
D = FILTER A BY A.score > 80;
GENERATE D.name, D.score;}
DUMP C;

It always returns
2011-07-19 14:50:20,329 [main] ERROR org.apache.pig.impl.plan.OperatorPlan -
Attempt to connect operator D: Filter 1-87 which is not in the plan.
2011-07-19 14:50:20,332 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 2219: Unable to process scalar in the plan

How can I fix it?

Thanks

Yong


  • Jacob Perkins at Jul 19, 2011 at 1:13 pm
    I think it's because 'A.score' is a bag but Pig needs a reference to a
    field in the tuples. This worked for me:

    A = LOAD 'foo.tsv' AS (name:chararray, no:int, score: int);
    B = GROUP A BY no;
    C = FOREACH B {
    D = FILTER A BY score > 80;
    GENERATE FLATTEN(D.(name, score));
    };
    DUMP C;

    on the following data:

    $: cat foo.tsv
    henrietta 1 25
    sally 1 82
    fred 3 120
    elsie 4 45

    yields:

    (sally,82)
    (fred,120)

    Does that work for you?

    --jacob
    @thedatachef
  • 勇胡 at Jul 19, 2011 at 2:05 pm
    How can I tell that 'A.score' is a bag? If I issue a 'describe B'
    command, I get B: {group:int, A: {name:chararray, no:int,score:int}}.
    From that, I can't find any indication that 'A.score' is a bag; it looks
    to me as if A.score is an element of the bag.
    And why does it work if I delete the qualifier 'A.'?

    I changed my Pig code to:

    A = LOAD '/home/huyong/test/student.txt' AS (name:chararray, no:int, score:
    int);
    B = GROUP A BY no;
    C = FOREACH B {
    D = FILTER A BY score > 80;
    GENERATE D.name, D.score;}
    DUMP C;

    I got an empty bag!

    The input is:
    henrietta 1 25
    sally 1 82
    fred 3 120
    elsie 4 45

    The output is:
    ({(sally)},{(82)})
    ({(fred)},{(120)})
    ({},{})

    As you can see, I got an empty tuple. Why?

    Yong

  • Gianmarco at Jul 19, 2011 at 2:20 pm
    2011/7/19 勇胡 <yongyong313@gmail.com>
    > How can I tell that 'A.score' is a bag? If I issue a 'describe B'
    > command, I get B: {group:int, A: {name:chararray, no:int,score:int}}.
    > From that, I can't find any indication that 'A.score' is a bag; it looks
    > to me as if A.score is an element of the bag.

    Because A is a bag and A.score is a projection of A on the score field,
    which is of course still a bag.

    > And why does it work if I delete the qualifier 'A.'?

    Because that is the correct way to do it.
    "FILTER relation BY field" is the correct syntax.

    > I changed my Pig code to:
    >
    > A = LOAD '/home/huyong/test/student.txt' AS (name:chararray, no:int, score: int);
    > B = GROUP A BY no;
    > C = FOREACH B {
    > D = FILTER A BY score > 80;
    > GENERATE D.name, D.score;}
    > DUMP C;
    >
    > I got an empty bag!
    >
    > The input is:
    > henrietta 1 25
    > sally 1 82
    > fred 3 120
    > elsie 4 45
    >
    > The output is:
    > ({(sally)},{(82)})
    > ({(fred)},{(120)})
    > ({},{})
    >
    > As you can see, I got an empty tuple. Why?

    Because you are performing the filter inside a FOREACH on a GROUP BY no,
    and no has 3 different values (1, 3, 4), so you get one output tuple per
    group. For one of the 3 values (namely 4) the filter returns an empty bag
    (45 < 80), so the tuple for that group contains only empty bags.

    > Yong
    Cheers,
    --
    Gianmarco De Francisci Morales
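
    One possible way to avoid the row of empty bags altogether is to filter
    the base relation before grouping, so that groups with no matching tuples
    are never formed. This is only a sketch: it assumes the same tab-separated
    foo.tsv used earlier in the thread, and the alias A80 and the field names
    'names' and 'scores' are illustrative, not taken from the thread.

    ```pig
    -- Assumes foo.tsv is tab-separated (the PigStorage default delimiter).
    A   = LOAD 'foo.tsv' AS (name:chararray, no:int, score:int);

    -- Filter the base relation first, using the bare field name.
    A80 = FILTER A BY score > 80;

    -- Only groups that still contain at least one tuple are produced.
    B   = GROUP A80 BY no;

    -- A80.name and A80.score are bag projections inside each group.
    C   = FOREACH B GENERATE group AS no, A80.name AS names, A80.score AS scores;
    DUMP C;
    ```

    With the sample data, this should produce rows for groups 1 and 3 only;
    group 4 never appears because elsie's tuple is dropped before the GROUP.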


  • Jacob Perkins at Jul 19, 2011 at 2:26 pm

    On Tue, 2011-07-19 at 16:05 +0200, 勇胡 wrote:
    > How can I tell that 'A.score' is a bag? If I issue a 'describe B'
    > command, I get B: {group:int, A: {name:chararray, no:int,score:int}}.

    The output of describe shows that A is a bag (e.g. the '{' and '}'
    characters), yes? So 'A.score' is simply the bag of all the scores in
    the group. You can go further and get a bag of both the numbers and the
    scores with 'A.(no, score)'. I admit that it _is_ confusing at first.

    > From that, I can't find any indication that 'A.score' is a bag; it looks
    > to me as if A.score is an element of the bag.

    Not true. 'score' is the name of the field. 'A.score' is a bag of just
    the scores. Using the dot '.' is a way of pulling out specific fields
    from every tuple within a bag to result in another bag. Consider:

    A = LOAD 'foo.tsv' AS (name:chararray, no:int, score: int);
    B = GROUP A BY no;
    DUMP B;

    (1,{(henrietta,1,25),(sally,1,82)})
    (3,{(fred,3,120)})
    (4,{(elsie,4,45)})

    C = FOREACH B GENERATE A.score;
    DUMP C;

    ({(25),(82)})
    ({(120)})
    ({(45)})

    Got it?

    > And why does it work if I delete the qualifier 'A.'?
    >
    > I changed my Pig code to:
    >
    > A = LOAD '/home/huyong/test/student.txt' AS (name:chararray, no:int, score: int);
    > B = GROUP A BY no;
    > C = FOREACH B {
    > D = FILTER A BY score > 80;
    > GENERATE D.name, D.score;}
    > DUMP C;
    >
    > I got an empty bag!

    'D.name' and 'D.score' are bags of tuples. You will need to FLATTEN them
    at the end, as in the example.

    > The input is:
    > henrietta 1 25
    > sally 1 82
    > fred 3 120
    > elsie 4 45
    >
    > The output is:
    > ({(sally)},{(82)})
    > ({(fred)},{(120)})
    > ({},{})
    >
    > As you can see, I got an empty tuple. Why?

    There are three tuples, one for each group (1, 3, and 4). The filter
    condition left the bags from group 4 empty, since the only tuple,
    (elsie,4,45), did not have a score > 80. If you FLATTEN the bags, the
    empty ones are discarded.

    --jacob
    @thedatachef
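
    To illustrate where a bag projection such as 'A.score' is typically put
    to use, here is a small sketch that feeds it to aggregate functions. It
    assumes the same foo.tsv as above; the alias S and the field names 'n',
    'top_score' and 'avg_score' are made up for illustration.

    ```pig
    A = LOAD 'foo.tsv' AS (name:chararray, no:int, score:int);
    B = GROUP A BY no;

    -- Inside each group, A is a bag and A.score is the bag of that group's
    -- scores. Aggregate functions such as COUNT, MAX and AVG take a bag.
    S = FOREACH B GENERATE group        AS no,
                           COUNT(A)     AS n,
                           MAX(A.score) AS top_score,
                           AVG(A.score) AS avg_score;
    DUMP S;
    ```

    For group 1 in the sample data, the bag of scores is {(25),(82)}, so the
    count is 2, the maximum is 82 and the average is 53.5.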
  • 勇胡 at Jul 20, 2011 at 9:34 am
    Thanks for your response. Now I am wondering in which situations I can
    use "." to reference a field. In Pig, if I understand correctly, each
    relation is a bag. If I issue these commands:

    A = LOAD '/home/huyong/test/student.txt' AS (name:chararray, no:int, score: int);
    B = FILTER A BY A.score>80;

    there is no problem at compile time and the Pig code executes, but in
    the end I don't get correct results. As you mentioned, A.score is a bag
    and 80 is a constant, so they are not compatible. This is a really big
    difference from SQL. If I use:

    B = FILTER A BY score>80;

    there is no problem, and the statement performs the filter as expected.

    The same problem occurs with the operators "group, cogroup, join, split,
    order, cross". The input of these operators only supports fields, not
    bags (if I use "." to reference the field, I get wrong output). If these
    ordinary operators cannot support "bag" operations, I can't see why Pig
    needs the bag type, since the operators only support flat types.

    Regards!

    Yong
  • Daniel Dai at Jul 20, 2011 at 6:00 pm
    If you refer to a field in the base relation, you only need to use the
    column name:
    B = FILTER A BY score>80;

    Here A is the base relation, so you only need to say "score" instead of
    "A.score". Otherwise, Pig will think you are using A as a scalar
    (http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#Casting+Relations+to+Scalars).

    Daniel
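
    For contrast, a minimal sketch of a case where a relation really is meant
    to be used as a scalar, as described in the linked documentation. It
    assumes Pig 0.8 or later and the foo.tsv sample from earlier in the
    thread; the aliases G and M and the field name 'top' are hypothetical.

    ```pig
    A = LOAD 'foo.tsv' AS (name:chararray, no:int, score:int);

    -- Collapse the whole relation into a single group, then a single row.
    G = GROUP A ALL;
    M = FOREACH G GENERATE MAX(A.score) AS top;

    -- M holds exactly one row, so M.top can be referenced as a scalar.
    -- 'score' is the bare field name of the relation being filtered.
    B = FILTER A BY score == M.top;
    DUMP B;
    ```

    With the sample data, the filter should keep only fred's tuple, since 120
    is the highest score. In 'FILTER A BY A.score > 80', by contrast, A has
    many rows, so it cannot serve as a single scalar value.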


Discussion Overview
group: user @
categories: pig, hadoop
posted: Jul 19, '11 at 1:01p
active: Jul 20, '11 at 6:00p
posts: 7
users: 4
website: pig.apache.org
