Grokbase Groups Pig user October 2010
Hi!

I hope this is not too much of a newbie question, but it's driving me crazy... How do
you count the records in a relation? Like DUMP, but instead of a list of
records, I would like their count.

Thanks,

Anze

  • Gerrit Jansen van Vuuren at Oct 29, 2010 at 11:22 am
    Hi,

    Let's say you have a file with the columns userid, username, location and amount.

    To count the total number of users:
    A = LOAD 'myfile' as (userid:long, username:chararray, location:chararray,
    amount:long);
    G = GROUP A ALL PARALLEL 40;
    R = FOREACH G GENERATE COUNT($1);

    dump R;

    To count the number of users by location:

    A = LOAD 'myfile' as (userid:long, username:chararray, location:chararray,
    amount:long);
    G = GROUP A BY location PARALLEL 40;
    R = FOREACH G GENERATE FLATTEN(group), COUNT($1);

    dump R;

    To get the sum of amount per (location, userid) pair:

    A = LOAD 'myfile' as (userid:long, username:chararray, location:chararray,
    amount:long);
    G = GROUP A BY (location, userid) PARALLEL 40;
    R = FOREACH G GENERATE FLATTEN(group), COUNT($1) as usercount,
    SUM($1.amount) as useramount;


    NOTE: PARALLEL is set to 40 only as an example. You should choose the value
    yourself, based on your cluster setup, data volume, etc.

    To count, it's always GROUP first, either ALL or BY <column name>.
    Then FOREACH the grouped relation and GENERATE COUNT($1), where $1 is the
    bag of grouped records ($0 is the group key).
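
    The same pattern can also be sketched with named fields instead of
    positional references. After GROUP, the bag of grouped records carries the
    alias of the relation that was grouped (A here), so COUNT(A) refers to the
    same bag as COUNT($1). The file name and schema are the example ones from
    above; the aliases by_loc, counts, total and n are made up for illustration,
    and PARALLEL is left out for brevity.

    ```pig
    -- Same example file and schema as above.
    A = LOAD 'myfile' AS (userid:long, username:chararray, location:chararray, amount:long);

    -- Total number of records: one group holding every record, then count the bag.
    G = GROUP A ALL;
    R = FOREACH G GENERATE COUNT(A) AS total;
    DUMP R;

    -- Records per location: 'group' is the grouping key, A is the bag for that key.
    by_loc = GROUP A BY location;
    counts = FOREACH by_loc GENERATE group AS location, COUNT(A) AS n;
    DUMP counts;
    ```

    Note that COUNT skips records whose first field is null; COUNT_STAR counts
    those as well, if your Pig version provides it.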

    Hope this helps,


  • Anze at Oct 29, 2010 at 12:44 pm
    Thanks, that helps a lot! :)

    Anze

