(, 1, 1)

(w, 1, 1)

(x, 3, 3) <--- $2 should be a '2' not a '3'.

(y, 1, 1)

(z, 1, 1)

Desired results:

,1,1

w,1,1

x,3,2 <--- $2 should be a '2' and not a '3' because the values for 'x' are (a,b,b) and so there are only 2 distinct values.

y,1,1

z,1,1

I was able to get the same (x,3,3) output another way:

A = load 'test.csv' using PigStorage(',') as (a1,a2,a3);

B = group A by $0;

C = distinct B;

D = foreach C generate group, COUNT(A.a2), COUNT(A.a3);

...but to make it work as a distinct count I would like to achieve (x,3,2) not (x,3,3).

Would I need to write a UDF called COUNT_DISTINCT, based mostly on the COUNT function? If so, how?

Would replacing...

public void exec(Tuple input, DataAtom output) throws IOException {

output.setValue(count(input)); <--- I suppose this should return a count of hash keys or something???

}

...be sufficient?

Thanks in advance,

Avram
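On the COUNT_DISTINCT question above: the core of a distinct count is collecting the values into a set and returning the set's size. The following is a minimal sketch of that logic in plain Java only. The class and method names are hypothetical, nothing here uses the Pig UDF interface, and how it would be wired into exec() depends on the Pig version in use.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Pig: it shows only the counting logic
// that a COUNT_DISTINCT-style function would need.
final class DistinctCountSketch {

    // Returns the number of distinct values in the list.
    // Assumption: an empty string counts as a value, matching the
    // desired output in this thread.
    static long countDistinct(List<String> values) {
        Set<String> seen = new HashSet<>(values);
        return seen.size();
    }
}
```

Note that this keeps every distinct value of a group in memory at once, which may matter for very large groups.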

-----Original Message-----

From: Alan Gates

Sent: Wednesday, March 18, 2009 2:56 PM

To: pig-user@hadoop.apache.org

Subject: Re: count distinct using pig?

A = load 'test.csv' using PigStorage(',') as (a1,a2,a3);

B = group A by $0;

C = foreach B {

D1 = A.a2;

D2 = distinct D1;

E1 = A.a3;

E2 = distinct E1; -- corrected from "distinct D2", which counts a2 twice and yields (x,3,3)

generate group, COUNT(D2), COUNT(E2);

}

store C into 'output';

Alan.
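As a cross-check on the intended result (3 distinct a2 values and 2 distinct a3 values for 'x'), the same group-then-distinct-count logic can be sketched outside Pig on the seven sample rows from the original message. This is only an illustration in plain Java. The class name is hypothetical, and it assumes the fields carry no leading spaces and that an empty field counts as a value.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical stand-alone check, not Pig code.
final class GroupDistinctSketch {

    // Sample rows (a1, a2, a3) copied from the "dump A" output in this thread.
    static final String[][] ROWS = {
        {"x", "X", "a"}, {"y", "Y", "b"}, {"x", "XX", "b"}, {"z", "Z", "c"},
        {"w", "X", ""}, {"", "W", "d"}, {"x", "", "b"}
    };

    // Maps each a1 key to {distinct a2 count, distinct a3 count}.
    static Map<String, int[]> distinctCounts(String[][] rows) {
        Map<String, Set<String>> a2 = new TreeMap<>();
        Map<String, Set<String>> a3 = new TreeMap<>();
        for (String[] r : rows) {
            a2.computeIfAbsent(r[0], k -> new HashSet<>()).add(r[1]);
            a3.computeIfAbsent(r[0], k -> new HashSet<>()).add(r[2]);
        }
        Map<String, int[]> out = new TreeMap<>();
        for (String key : a2.keySet()) {
            out.put(key, new int[] {a2.get(key).size(), a3.get(key).size()});
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints one "a1,distinct_a2,distinct_a3" line per group, sorted by a1.
        distinctCounts(ROWS).forEach(
            (k, v) -> System.out.println(k + "," + v[0] + "," + v[1]));
    }
}
```

Run on the sample rows, this prints the five lines listed under "Desired output" in the original message, including x,3,2.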

On Mar 18, 2009, at 1:43 PM, Avram Aelony wrote:

Hello Pig list,

I have looked at the 'distinct' keyword but it does not seem to

operate on particular fields (columns). I have a file with

several categorical variables a1-a3 and am seeking to compute

distinct counts of fields a2 and a3 by field a1.

How can I get distinct counts?

For example:

A = load 'test.csv' using PigStorage(',') as (a1,a2,a3);

/*

dump A;

(x, X, a)

(y, Y, b)

(x, XX, b)

(z, Z, c)

(w, X, )

(, W, d)

(x, , b)

*/

B = group A by $0;

/*

dump B;

(, {(, W, d)})

(w, {(w, X, )})

(x, {(x, X, a), (x, XX, b), (x, , b)})

(y, {(y, Y, b)})

(z, {(z, Z, c)})

*/

-- How do I get distinct counts by $0?

-- Desired output:

,1,1

w,1,1

x,3,2

y,1,1

z,1,1

Many thanks,

Avram
