Hi,

I am experiencing a loss of tuples when running queries on an 8-node
cluster using Pig 0.2.0 and Hadoop 0.18.3.
For example, something as simple as the script below causes 41,429,443
tuples to be lost:

raw = LOAD '/dataset/cosmo25cmb.768g/cosmo25cmb.768g.00043.dark'
      USING PigStorage(',')
      AS (pid:long, mass:double, x:double, y:double, z:double);

g1 = GROUP raw ALL PARALLEL 7;
cnt1 = FOREACH g1 GENERATE COUNT(raw);

DUMP cnt1; -- 452984832L

g = GROUP raw BY pid PARALLEL 7;       -- regroup by key, then flatten back out:
gx = FOREACH g GENERATE FLATTEN(raw);  -- logically a no-op, so the count should not change
gy = GROUP gx ALL PARALLEL 7;
cnt = FOREACH gy GENERATE COUNT(gx);

DUMP cnt; -- 411555389L


There are no error messages in any of the map or reduce tasks.
Any idea what the problem might be?

Thanks,
Dylan


  • Alan Gates at May 11, 2009 at 3:39 pm
    No idea. Do you know which of the two answers (if either) is
    correct? Also, if you do:

    g = group raw by pid parallel 7;
    gx = foreach g generate COUNT(raw);

    and sum all the results, does the total equal either of the two
    count-all numbers?

    Finally, do you see this on small data, or only on large data? Some
    other users on the list have indicated that they see similar issues
    only when data crosses a certain size threshold.

    Alan.
  • Dylan Nunley at May 13, 2009 at 1:18 am
    The first count (452984832) is the correct one.
    Grouping on pid and then summing the individual counts produces the
    correct count. Tuples only seem to be dropped when the query involves
    more than one map-reduce job.
    This query ran over 43 GB of data; when I retested on a much smaller
    dataset, no tuples were lost.

    Thanks,
    Dylan

  • Mridul Muralidharan at May 13, 2009 at 1:23 am
    Can you try the workaround that Tamir confirmed worked for him?

    Namely, replace (co)group followed by foreach-flatten (or joins) with
    (co)group followed by filter followed by foreach-flatten?


    That is, for something like:
    g = GROUP raw BY pid PARALLEL 7;
    gx = FOREACH g GENERATE FLATTEN(raw);

    replace it with this pattern:
    g_0 = GROUP raw BY pid PARALLEL 7;
    g = FILTER g_0 BY (COUNT(raw) != 0);
    gx = FOREACH g GENERATE FLATTEN(raw);



    If this works, then it might be the same issue that Tamir and I had
    observed with the latest Pig.

    Thanks,
    Mridul



  • Dylan Nunley at May 15, 2009 at 4:20 am
    Nope, this doesn't fix the problem. But thanks for the suggestion.

    -Dylan

