Grokbase Groups Pig user March 2009
Hello Pig list,

I have looked at the 'distinct' keyword, but it does not seem to operate on particular fields (columns). I have a file with several categorical variables a1-a3 and am seeking to compute distinct counts of fields a2 and a3 by field a1.

How can I get distinct counts?

For example:
A = load 'test.csv' using PigStorage(',') as (a1,a2,a3);
/*
dump A;
(x, X, a)
(y, Y, b)
(x, XX, b)
(z, Z, c)
(w, X, )
(, W, d)
(x, , b)
*/

B = group A by $0;
/*
dump B;
(, {(, W, d)})
(w, {(w, X, )})
(x, {(x, X, a), (x, XX, b), (x, , b)})
(y, {(y, Y, b)})
(z, {(z, Z, c)})
*/


# how do I get distinct counts by $0 ??
#Desired output:
,1,1
w,1,1
x,3,2
y,1,1
z,1,1


Many thanks,
Avram

  • Alan Gates at Mar 18, 2009 at 9:58 pm
    A = load 'test.csv' using PigStorage(',') as (a1,a2,a3);
    B = group A by $0;
    C = foreach B {
    D1 = A.a2;
    D2 = distinct D1;
    E1 = A.a3;
    E2 = distinct D2;
    generate group, COUNT(D2), COUNT(E2);
    }
    store C into 'output';

    Alan.
  • Avram Aelony at Mar 18, 2009 at 11:20 pm
    Thanks for the response... unfortunately the output I get is:

    (, 1, 1)
    (w, 1, 1)
    (x, 3, 3) <--- $2 should be a '2' not a '3'.
    (y, 1, 1)
    (z, 1, 1)

    Desired results:
    ,1,1
    w,1,1
    x,3,2 <--- $2 should be a '2' and not a '3' because the values for 'x' are (a,b,b) and so there are only 2 distinct values.
    y,1,1
    z,1,1

    I was able to get the output of (x,3,3) another way:
    A = load 'test.csv' using PigStorage(',') as (a1,a2,a3);
    B = group A by $0;
    C = distinct B;
    D = foreach C generate group, COUNT(A.a2), COUNT(A.a3);

    ...but to make it work as a distinct count I would like to achieve (x,3,2) not (x,3,3).

    Would I need to write a UDF called COUNT_DISTINCT based mostly on the COUNT function? If so, how?
    Would replacing...
    public void exec(Tuple input, DataAtom output) throws IOException {
    output.setValue(count(input)); <--- I suppose this should return a count of hash keys or something???
    }

    ...be sufficient?

    Thanks in advance,
    Avram
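
On the "count of hash keys" idea in the message above: a minimal sketch in plain Java, assuming the values of one column for one group are already available as a list. It is not a Pig UDF and does not use Pig's `exec` signature; the class name `CountDistinct` and the method `countDistinct` are hypothetical names chosen for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CountDistinct {
    // Hypothetical helper (not part of Pig): number of distinct values in a column.
    public static int countDistinct(List<String> values) {
        // Copying into a HashSet collapses duplicates, so the set size
        // is the "count of hash keys" the question describes.
        Set<String> seen = new HashSet<>(values);
        return seen.size();
    }

    public static void main(String[] args) {
        // a3 values for group 'x' in the thread's sample data
        List<String> a3ForX = Arrays.asList("a", "b", "b");
        System.out.println(countDistinct(a3ForX)); // prints 2
    }
}
```

The set holds every distinct value in memory, which is fine for small groups like the sample but may not suit very large bags.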



  • Mridul Muralidharan at Mar 18, 2009 at 11:30 pm
    I think Alan had a typo in the script which caused your error.

    E2 = distinct E1;

    This should do it?

    Regards,
    Mridul


  • Avram Aelony at Mar 18, 2009 at 11:00 pm
    Hi Alan and Mridul,

    Thank you, that works now:

    -- Count Distinct example:
    A = load 'test.csv' using PigStorage(',') as (a1,a2,a3);
    B = group A by $0;
    C = foreach B {
        D1 = A.a2;
        D2 = distinct D1;
        E1 = A.a3;
        E2 = distinct E1;
        generate group, COUNT(D2), COUNT(E2);
    }
    store C into 'output';


    Many thanks!

    -Avram
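
To make the logic of the working script concrete, here is a rough plain-Java sketch of the same computation on the thread's sample rows: group by a1, then count the distinct a2 and a3 values in each group. It only mimics the result and says nothing about how Pig executes the script. The class `GroupedDistinctCounts` and the method `distinctCounts` are hypothetical names, and empty fields are assumed to count as ordinary values, which is what the desired output in the thread implies.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class GroupedDistinctCounts {
    // Hypothetical helper: maps each a1 key to {distinct a2 count, distinct a3 count}.
    public static Map<String, int[]> distinctCounts(List<String[]> rows) {
        Map<String, Set<String>> a2ByKey = new TreeMap<>();
        Map<String, Set<String>> a3ByKey = new TreeMap<>();
        for (String[] row : rows) {
            // One set per group and per column; sets drop duplicate values.
            a2ByKey.computeIfAbsent(row[0], k -> new HashSet<>()).add(row[1]);
            a3ByKey.computeIfAbsent(row[0], k -> new HashSet<>()).add(row[2]);
        }
        Map<String, int[]> counts = new TreeMap<>();
        for (String key : a2ByKey.keySet()) {
            counts.put(key, new int[] {a2ByKey.get(key).size(), a3ByKey.get(key).size()});
        }
        return counts;
    }

    // The seven sample rows from the original post; empty fields are empty strings.
    public static List<String[]> sampleRows() {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"x", "X", "a"});
        rows.add(new String[] {"y", "Y", "b"});
        rows.add(new String[] {"x", "XX", "b"});
        rows.add(new String[] {"z", "Z", "c"});
        rows.add(new String[] {"w", "X", ""});
        rows.add(new String[] {"", "W", "d"});
        rows.add(new String[] {"x", "", "b"});
        return rows;
    }

    public static void main(String[] args) {
        // Prints one "key,a2count,a3count" line per group, sorted by key.
        for (Map.Entry<String, int[]> e : distinctCounts(sampleRows()).entrySet()) {
            System.out.println(e.getKey() + "," + e.getValue()[0] + "," + e.getValue()[1]);
        }
    }
}
```

For group x the a2 set is {X, XX, empty} and the a3 set is {a, b}, giving x,3,2. Taking the distinct of the a2 projection twice, as in the script with the typo, would report 3 in both columns.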


Discussion Overview
group: user @
categories: pig, hadoop
posted: Mar 18, '09 at 8:44p
active: Mar 18, '09 at 11:30p
posts: 5
users: 3
website: pig.apache.org
