Hi Alan and Mridul,

Thank you, that works now:

-- Count Distinct example:
A = load 'test.csv' using PigStorage(',') as (a1,a2,a3);
B = group A by $0;
C = foreach B {
    D1 = A.a2;
    D2 = distinct D1;
    E1 = A.a3;
    E2 = distinct E1;
    generate group, COUNT(D2), COUNT(E2);
}
store C into 'output';

Many thanks!

-Avram
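
For readers who want to check the expected numbers independently of Pig, here is a minimal Python sketch. It assumes the seven sample rows shown in the original question further down the thread, and it treats empty fields as ordinary values (an assumption; Pig versions may handle empty or null fields differently). The function name count_distinct_by_key is hypothetical and is not part of Pig.

```python
from collections import defaultdict

# Sample rows (a1, a2, a3) copied from the `dump A` output in the original question.
# Empty fields are kept as empty strings; that treatment is an assumption.
ROWS = [
    ("x", "X", "a"),
    ("y", "Y", "b"),
    ("x", "XX", "b"),
    ("z", "Z", "c"),
    ("w", "X", ""),
    ("", "W", "d"),
    ("x", "", "b"),
]


def count_distinct_by_key(rows):
    """Return {a1: (number of distinct a2 values, number of distinct a3 values)}."""
    a2_seen = defaultdict(set)
    a3_seen = defaultdict(set)
    for a1, a2, a3 in rows:
        # A set keeps each value once, which mirrors `distinct` inside the foreach.
        a2_seen[a1].add(a2)
        a3_seen[a1].add(a3)
    return {key: (len(a2_seen[key]), len(a3_seen[key])) for key in a2_seen}


if __name__ == "__main__":
    for key, (n2, n3) in sorted(count_distinct_by_key(ROWS).items()):
        print(f"{key},{n2},{n3}")
```

For the key 'x' this gives 3 distinct a2 values and 2 distinct a3 values, which matches the desired (x,3,2) row.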

-----Original Message-----
From: Mridul Muralidharan
Sent: Wednesday, March 18, 2009 3:28 PM
Subject: Re: count distinct using pig?

I think Alan had a typo in the script, which caused your error. The line that computes E2 should read:

E2 = distinct E1;

That should do it.

Regards,
Mridul

Avram Aelony wrote:
Thanks for the response... unfortunately the output I get is:

(, 1, 1)
(w, 1, 1)
(x, 3, 3) <--- $2 should be a '2' not a '3'.
(y, 1, 1)
(z, 1, 1)

Desired results:
,1,1
w,1,1
x,3,2 <--- $2 should be a '2' and not a '3' because the values for 'x' are (a,b,b) and so there are only 2 distinct values.
y,1,1
z,1,1

I was able to get the same (x,3,3) output another way:
A = load 'test.csv' using PigStorage(',') as (a1,a2,a3);
B = group A by $0;
C = distinct B;
D = foreach C generate group, COUNT(A.a2), COUNT(A.a3);

...but to make it work as a distinct count I would like to achieve (x,3,2) not (x,3,3).

Would I need to write a UDF called COUNT_DISTINCT based mostly on the COUNT function? If so how?
Would replacing...
public void exec(Tuple input, DataAtom output) throws IOException {
output.setValue(count(input)); <--- I suppose this should return a count of hash keys or something???
}

...be sufficient?

Avram

-----Original Message-----
From: Alan Gates
Sent: Wednesday, March 18, 2009 2:56 PM
Subject: Re: count distinct using pig?

A = load 'test.csv' using PigStorage(',') as (a1,a2,a3);
B = group A by $0;
C = foreach B {
D1 = A.a2;
D2 = distinct D1;
E1 = A.a3;
E2 = distinct D2;
generate group, COUNT(D2), COUNT(E2);
}
store C into 'output';

Alan.
On Mar 18, 2009, at 1:43 PM, Avram Aelony wrote:

Hello Pig list,

I have looked at the 'distinct' keyword but it does not seem to
operate on particular fields (columns). I have a file with
several categorical variables a1-a3 and am seeking to compute
distinct counts of fields a2 and a3 by field a1.

How can I get distinct counts?

For example:
A = load 'test.csv' using PigStorage(',') as (a1,a2,a3);
/*
dump A;
(x, X, a)
(y, Y, b)
(x, XX, b)
(z, Z, c)
(w, X, )
(, W, d)
(x, , b)
*/

B = group A by $0;
/*
dump B;
(, {(, W, d)})
(w, {(w, X, )})
(x, {(x, X, a), (x, XX, b), (x, , b)})
(y, {(y, Y, b)})
(z, {(z, Z, c)})
*/

-- How do I get distinct counts by $0?
-- Desired output:
,1,1
w,1,1
x,3,2
y,1,1
z,1,1

Many thanks,
Avram

Discussion Overview
group: user | categories: pig, hadoop | posted: Mar 18, '09 at 8:44p | active: Mar 18, '09 at 11:30p | posts: 5 | users: 3 | website: pig.apache.org
