Hi!

I am new to Pig, so pardon my naïve question. I have data like this:

(A,1)
(A,5)
(B,4)
(C,22)
(C,10)

I need to calculate the maximum value for each distinct value of the
first column:

(A,5)
(B,4)
(C,22)

Is there a good way to do it in Pig? The only way I see is to first
group by the first column and calculate the max value per group, then
join with the original data, adding a max-value column:

(A,1,5)
(A,5,5)
(B,4,4)
(C,22,22)
(C,10,22)

Then I need to filter, dropping rows where the 3rd column is bigger than
the 2nd. Finally, I will need to do a DISTINCT to remove duplicate max
values. It all sounds quite complex computationally, and I was wondering
if there is a better way...

Sincerely,

--
"Hated by fools, and fools to hate, be this my motto and my fate"
(Jonathan Swift)

•  at Dec 31, 2008 at 10:09 pm ⇧
If I understand correctly, what you want is this:

A = load 'yourfile' as (firstcol, secondcol);
B = group A by firstcol;
C = foreach B generate group, MAX($1.secondcol);

This will collect like values in the first column and find the
maximum value in the second column for each group.

Alan.
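
For readers who want to check what the three-line script above computes, here is a minimal plain-Java sketch of the same group-then-max logic. It does not use Pig at all; `GroupMax` and `maxPerKey` are hypothetical names chosen for illustration, and rows are modelled as `{key, value}` arrays.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of "group by firstcol" followed by MAX(secondcol).
// GroupMax and maxPerKey are hypothetical names, not part of Pig.
final class GroupMax {

    // Each row is {key, value}; returns key -> largest value seen for that key.
    static Map<String, Integer> maxPerKey(List<Object[]> rows) {
        Map<String, Integer> best = new TreeMap<>();
        for (Object[] row : rows) {
            String key = (String) row[0];
            Integer value = (Integer) row[1];
            // merge() stores the value if the key is new, else keeps the larger one
            best.merge(key, value, Integer::max);
        }
        return best;
    }

    public static void main(String[] args) {
        List<Object[]> rows = Arrays.asList(
            new Object[] {"A", 1}, new Object[] {"A", 5},
            new Object[] {"B", 4},
            new Object[] {"C", 22}, new Object[] {"C", 10});
        System.out.println(maxPerKey(rows)); // prints {A=5, B=4, C=22}
    }
}
```

The result matches the (A,5), (B,4), (C,22) output asked for in the original question.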
•  at Dec 31, 2008 at 10:52 pm ⇧

On Dec 31, 2008, at 14:08, Alan Gates wrote:

> If I understand correctly, what you want is this:
>
> A = load 'yourfile' as (firstcol, secondcol);
> B = group A by firstcol;
> C = foreach B generate group, MAX($1.secondcol);
>
> This will collect like values in the first column and find the
> maximum value in the second column for each group.

Thanks!

--
"La perfection est atteinte non quand il ne reste rien a ajouter, mais
quand il ne reste rien a enlever." (Antoine de Saint-Exupery)
•  at Jan 3, 2009 at 12:18 am ⇧
In reply to Alan Gates's message of Dec 31, 2008, at 14:08:

Perhaps my example was not very good. Let me rephrase it:

Having data like this:

(A,x,1)
(A,u,5)
(A,y,5)
(B,z,4)
(C,g,22)
(C,h,10)

I need to calculate:

(A,u,5)
(B,z,4)
(C,g,22)

So, for each value in the first column I need to keep only one row,
the one with the max value in the 3rd column.

I came up with something like:

A = load 'file' as (first, second, third);
B = FOREACH A GENERATE first, third;
C = GROUP B by first;
-- one row per key: (first, max of third)
D = FOREACH C GENERATE group AS first, MAX($1.third) as max;
E = JOIN D by first, A by first;
F = FILTER E by third == max;
-- 'first' exists on both sides of the JOIN, so it is qualified here
G = FOREACH F GENERATE A::first, second, third;

At this point I will get:

(A,u,5)
(A,y,5)
(B,z,4)
(C,g,22)

So I need something like DISTINCT G BY first, third but PIG does not
have it.

Any good way around it?

--
"La perfection est atteinte non quand il ne reste rien a ajouter, mais
quand il ne reste rien a enlever." (Antoine de Saint-Exupery)
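
The tie problem described in this post can be reproduced outside Pig. The following is a minimal plain-Java sketch of the max-per-group, join, and filter steps, assuming rows are modelled as `{first, second, third}` arrays; `TiedMaxRows` and `rowsAtGroupMax` are hypothetical names, not Pig APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of the GROUP/MAX + JOIN + FILTER pipeline from the post.
// TiedMaxRows and rowsAtGroupMax are hypothetical names, not Pig APIs.
final class TiedMaxRows {

    // Each row is {first, second, third}; third holds an Integer.
    static List<Object[]> rowsAtGroupMax(List<Object[]> rows) {
        // Pass 1: max of the third column per first-column value (relations B to D).
        Map<Object, Integer> maxByKey = new HashMap<>();
        for (Object[] r : rows) {
            maxByKey.merge(r[0], (Integer) r[2], Integer::max);
        }
        // Pass 2: keep rows whose third column equals their group's max (E to G).
        List<Object[]> kept = new ArrayList<>();
        for (Object[] r : rows) {
            if (maxByKey.get(r[0]).equals(r[2])) {
                kept.add(r);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Object[]> rows = Arrays.asList(
            new Object[] {"A", "x", 1}, new Object[] {"A", "u", 5},
            new Object[] {"A", "y", 5}, new Object[] {"B", "z", 4},
            new Object[] {"C", "g", 22}, new Object[] {"C", "h", 10});
        for (Object[] r : rowsAtGroupMax(rows)) {
            System.out.println(Arrays.toString(r));
        }
        // Four rows are printed: both (A,u,5) and (A,y,5) survive the filter.
    }
}
```

Filtering on equality with the group maximum cannot break ties by itself, which is why an extra de-duplication step is being asked for.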
•  at Jan 3, 2009 at 4:15 am ⇧
I think that what you need is a custom max function that operates on pairs.

Then you can group by the first field and keep the maximum pair.

Something like this:

A = load 'file' as (first, second, third);
B = GROUP A by first;
C = FOREACH B GENERATE first, PairwiseMax(B);

PairwiseMax should accept a bunch of pairs and keep the one that has the
largest second element. This should be relatively trivial to write in Java,
but I think it would be difficult in Pig.

--
Ted Dunning, CTO
DeepDyve
4600 Bohannon Drive, Suite 220
Menlo Park, CA 94025
www.deepdyve.com
650-324-0110, ext. 738
858-414-0013 (m)
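
The core of the suggested PairwiseMax function might look roughly like the sketch below. This is plain Java over lists, not an actual Pig UDF: the Pig wiring (its UDF base class, tuple and bag types) is deliberately left out, and `PairwiseMaxSketch`, `maxByThird` and `maxRowPerGroup` are hypothetical names. Ties are resolved by keeping the first row seen, which is an assumption the thread does not specify.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the logic a "PairwiseMax"-style function would need.
// All names here are hypothetical; no Pig classes are used.
final class PairwiseMaxSketch {

    // Returns the row with the largest third field; the first such row wins ties.
    static Object[] maxByThird(List<Object[]> bag) {
        Object[] best = null;
        for (Object[] row : bag) {
            if (best == null || (Integer) row[2] > (Integer) best[2]) {
                best = row;
            }
        }
        return best; // null for an empty bag
    }

    // Models GROUP ... BY first followed by applying maxByThird to each group.
    static Map<String, Object[]> maxRowPerGroup(List<Object[]> rows) {
        Map<String, List<Object[]>> groups = new LinkedHashMap<>();
        for (Object[] row : rows) {
            groups.computeIfAbsent((String) row[0], k -> new ArrayList<>()).add(row);
        }
        Map<String, Object[]> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<Object[]>> e : groups.entrySet()) {
            result.put(e.getKey(), maxByThird(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Object[]> rows = Arrays.asList(
            new Object[] {"A", "x", 1}, new Object[] {"A", "u", 5},
            new Object[] {"A", "y", 5}, new Object[] {"B", "z", 4},
            new Object[] {"C", "g", 22}, new Object[] {"C", "h", 10});
        for (Object[] r : maxRowPerGroup(rows).values()) {
            System.out.println(Arrays.toString(r)); // [A, u, 5] then [B, z, 4] then [C, g, 22]
        }
    }
}
```

Because the comparison is strict, (A,u,5) is kept and the later tie (A,y,5) is dropped, so exactly one row per key remains.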
•  at Jan 3, 2009 at 7:13 am ⇧
In reply to Ted Dunning's message of Jan 2, 2009, at 20:14:

I can certainly write a custom function, but I was curious how this type
of problem could be solved using Pig only.

If I were to write a custom function, I would not do as you suggest.
Your approach will not work very well on large data sets. I would write
a custom function which prints the first record and skips all subsequent
ones with a matching set of fields. That way even a very large data set
could first be sorted using the map/reduce framework and then be
processed by such a function, which only needs to keep one record in
memory (or rather the matching field(s) of the last record).

Still, I am curious to see if anybody could suggest a Pig-only solution
to the problem. I was thinking of another approach: grouping by the
first field, sorting the sub-records in each group, and then taking the
first one. Unfortunately this would not work either: FOREACH allows
nested ORDER, but not LIMIT :(

Sincerely,

--
"La perfection est atteinte non quand il ne reste rien a ajouter, mais
quand il ne reste rien a enlever." (Antoine de Saint-Exupery)
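
The sort-then-skip idea from this post can be sketched in plain Java as follows. The sketch assumes the input has already been sorted by the first field ascending and the third field descending; here that is done with an in-memory sort standing in for the map/reduce sort. `FirstPerKey` and `firstOfEachRun` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Plain-Java sketch of "keep the first record of each run of equal keys".
// FirstPerKey and firstOfEachRun are hypothetical names, not Pig APIs.
final class FirstPerKey {

    // Expects rows already sorted so that equal keys (row[0]) are adjacent.
    // Only the previous key is remembered while scanning; the output list is
    // used here for convenience, a real job would emit rows as it goes.
    static List<Object[]> firstOfEachRun(Iterator<Object[]> sorted) {
        List<Object[]> out = new ArrayList<>();
        Object lastKey = null;
        boolean seenAny = false;
        while (sorted.hasNext()) {
            Object[] row = sorted.next();
            if (!seenAny || !row[0].equals(lastKey)) {
                out.add(row);      // first record of a new key
                lastKey = row[0];
                seenAny = true;
            }
            // otherwise: same key as the previous record, skip it
        }
        return out;
    }

    public static void main(String[] args) {
        List<Object[]> rows = new ArrayList<>(Arrays.asList(
            new Object[] {"A", "x", 1}, new Object[] {"A", "u", 5},
            new Object[] {"A", "y", 5}, new Object[] {"B", "z", 4},
            new Object[] {"C", "g", 22}, new Object[] {"C", "h", 10}));
        // Sort by first field ascending, then third field descending (stable sort).
        rows.sort((a, b) -> {
            int byKey = ((String) a[0]).compareTo((String) b[0]);
            return byKey != 0 ? byKey : Integer.compare((Integer) b[2], (Integer) a[2]);
        });
        for (Object[] r : firstOfEachRun(rows.iterator())) {
            System.out.println(Arrays.toString(r)); // [A, u, 5] then [B, z, 4] then [C, g, 22]
        }
    }
}
```

The scan itself needs constant memory per key, but it relies entirely on the sort order, and on all records for a key being seen by the same invocation.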
•  at Jan 3, 2009 at 7:41 pm ⇧
As you like, but you still need to sort or compare the results to get what
you want. Either way, the reduce function will have to grovel through all
of the records in the group. With sorting, you pay the price of ordering
all of the records. With max selection, you only need one comparison per
record rather than log n.

On Fri, Jan 2, 2009 at 11:13 PM, Vadim Zaliva wrote:

> If I were to write a custom function, I would not do as you suggest.
> Your approach will not work very well on large data sets. I would write
> a custom function which prints the first record and skips all subsequent
> ones with a matching set of fields.
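
The cost argument above (one comparison per record for max selection versus ordering all records) can be checked with a small counting experiment. This is only an illustrative plain-Java sketch: it counts comparisons in a linear max scan and comparator calls in the JDK's own list sort, which says nothing about how Pig or Hadoop actually sort. `ComparisonCount` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Counts comparisons for "scan for the max" versus "sort everything".
// ComparisonCount and its methods are hypothetical names for illustration.
final class ComparisonCount {

    // One comparison per record after the first: n - 1 in total.
    static long comparisonsForMax(List<Integer> values) {
        long count = 0;
        int best = values.get(0);
        for (int i = 1; i < values.size(); i++) {
            count++;
            if (values.get(i) > best) {
                best = values.get(i);
            }
        }
        return count;
    }

    // Counts comparator calls made by the JDK list sort on a copy of the input.
    static long comparisonsForSort(List<Integer> values) {
        long[] count = {0};
        List<Integer> copy = new ArrayList<>(values);
        copy.sort((a, b) -> {
            count[0]++;
            return Integer.compare(a, b);
        });
        return count[0];
    }

    public static void main(String[] args) {
        List<Integer> values = new ArrayList<>();
        for (int i = 0; i < 10000; i++) {
            values.add(i);
        }
        Collections.shuffle(values, new Random(42)); // fixed seed for repeatability
        System.out.println("max scan comparisons: " + comparisonsForMax(values)); // 9999
        System.out.println("sort comparisons:     " + comparisonsForSort(values));
    }
}
```

On shuffled input the sort count should come out well above the scan count, in line with the n versus n log n reasoning in the post.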
•  at Jan 4, 2009 at 1:52 am ⇧
In reply to Ted Dunning's message of Jan 3, 2009, at 11:41:

Assuming that I want to write the function as you suggested, I do not
see under what UDF category it falls (from this document):

http://wiki.apache.org/pig/UDFManual

It is close to "Aggregate Functions" but they must return a scalar
value.

If I am to write it the way I suggested, "Filter Functions" may seem
applicable, assuming that I can keep state between invocations and that
the same instance is guaranteed to process the whole data set. But if
the data is split into chunks and the function is applied to them
independently, this is not going to work.

So, either way, I am stuck! :)

The only way I see is to split my Pig script into 2 parts, save
After its completion, the second part of my Pig script could pick up
the results and continue.

I think this is very clumsy. The problem I am trying to solve seems to
be pretty trivial and common. I think Pig should have a way to solve
it. Either of the following modifications to the Pig language would
solve my problem:

1. Allowing LIMIT as a nested operation in FOREACH (in addition to ORDER
and the others which are currently allowed)
2. Extending the DISTINCT operation with a "BY" clause, allowing users to
specify a list of fields.

Has anybody else besides me raised such suggestions? Any chance of
seeing them as part of the language anytime soon?

Sincerely,

--
"La perfection est atteinte non quand il ne reste rien a ajouter, mais
quand il ne reste rien a enlever." (Antoine de Saint-Exupery)
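
To make the second proposal concrete, here is a minimal plain-Java sketch of what a DISTINCT with a "BY" clause might mean: keep the first row seen for each distinct combination of the chosen fields. The operator is the poster's suggestion, not an existing Pig feature as far as this thread shows, and `DistinctBySketch` and `distinctBy` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the proposed "DISTINCT ... BY <fields>" semantics.
// DistinctBySketch and distinctBy are hypothetical names, not Pig APIs.
final class DistinctBySketch {

    // Keeps the first row seen for each distinct combination of the given columns.
    static List<Object[]> distinctBy(List<Object[]> rows, int... columns) {
        Map<List<Object>, Object[]> firstSeen = new LinkedHashMap<>();
        for (Object[] row : rows) {
            List<Object> key = new ArrayList<>();
            for (int c : columns) {
                key.add(row[c]);
            }
            firstSeen.putIfAbsent(key, row); // later rows with the same key are ignored
        }
        return new ArrayList<>(firstSeen.values());
    }

    public static void main(String[] args) {
        // The relation G from the earlier post, which still contains the tie for A.
        List<Object[]> g = Arrays.asList(
            new Object[] {"A", "u", 5}, new Object[] {"A", "y", 5},
            new Object[] {"B", "z", 4}, new Object[] {"C", "g", 22});
        // Equivalent of the wished-for "DISTINCT G BY first, third".
        for (Object[] r : distinctBy(g, 0, 2)) {
            System.out.println(Arrays.toString(r)); // [A, u, 5] then [B, z, 4] then [C, g, 22]
        }
    }
}
```

Which of the tied rows survives depends on input order here; a real operator would have to define that choice.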
•  at Jan 5, 2009 at 6:00 pm ⇧

Why don't you write a function that takes a bag and returns a bag? I'm not sure why it bothers you whether or not the function will be considered an aggregation function. In PiggyBank there are functions that take bags and return bags.

If I understand your problem correctly, you want a function that takes tuples grouped by the first field and returns the tuple with the highest third field from the group. That is a very simple function to write as an algebraic function that will be very efficient.

ben
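
The reason a max-by-field function suits an algebraic formulation is that it can be applied to chunks of a group and then again to the partial results, giving the same answer as one pass over the whole group. A minimal plain-Java sketch of that property follows; it does not implement Pig's own interfaces, and `AlgebraicMaxSketch`, `maxByThird` and `maxOverChunks` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch of why "row with the largest third field" combines well.
// All names are hypothetical; Pig's actual UDF interfaces are not used.
final class AlgebraicMaxSketch {

    // Returns the row with the largest third field; the first such row wins ties.
    static Object[] maxByThird(List<Object[]> rows) {
        Object[] best = null;
        for (Object[] row : rows) {
            if (best == null || (Integer) row[2] > (Integer) best[2]) {
                best = row;
            }
        }
        return best; // null when there are no rows
    }

    // Applies maxByThird to each chunk, then once more to the partial winners.
    static Object[] maxOverChunks(List<List<Object[]>> chunks) {
        List<Object[]> partials = new ArrayList<>();
        for (List<Object[]> chunk : chunks) {
            Object[] partial = maxByThird(chunk);
            if (partial != null) {
                partials.add(partial);
            }
        }
        return maxByThird(partials);
    }

    public static void main(String[] args) {
        // Group "A" split into two chunks, as if seen by two separate tasks.
        List<Object[]> chunk1 = Arrays.asList(
            new Object[] {"A", "x", 1}, new Object[] {"A", "u", 5});
        List<Object[]> chunk2 = Arrays.asList(
            new Object[] {"A", "y", 5});
        Object[] winner = maxOverChunks(Arrays.asList(chunk1, chunk2));
        System.out.println(Arrays.toString(winner)); // [A, u, 5]
    }
}
```

Each chunk contributes a single candidate row, so only one row per chunk has to travel to the final step.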
•  at Jan 5, 2009 at 6:10 pm ⇧

On Jan 5, 2009, at 9:59, Benjamin Reed wrote:

> Why don't you write a function that takes a bag and returns a bag?
> I'm not sure why it bothers you whether or not the function will be
> considered an aggregation function. In PiggyBank there are functions
> that take bags and return bags.

If this is possible I will gladly do it. I was confused by the
documentation, which states:

"An aggregate function is an eval function that takes a bag and
returns a scalar value"

So I was not sure whether I could return a bag from a UDF. I will try
that approach. Thanks!

--
"La perfection est atteinte non quand il ne reste rien a ajouter, mais
quand il ne reste rien a enlever." (Antoine de Saint-Exupery)
•  at Jan 5, 2009 at 10:28 pm ⇧
Sorry, I've been reading my mail queue LIFO. Ted is exactly right; he just has a small typo:

C = FOREACH B GENERATE first, PairwiseMax(A);

PairwiseMax is trivial to write and it is exactly the reason we have UDFs.

ben
________________________________________
From: Ted Dunning [ted.dunning@gmail.com]
Sent: Friday, January 02, 2009 8:14 PM
Subject: Re: novice user

I think that what you need is a custom max function that operates on pairs.

Then you can group by the first field and keep the maximum pair.

Something like this:

A = load 'file' as (first, second, third);
B = GROUP A by first;
C = FOREACH B GENERATE first, PairwiseMax(B);

PairwiseMax should accept a bunch of pairs and keep the one that has the
largest second element. This should be relatively trivial to write in Java,
but I think it would be difficult in Pig.

Discussion Overview
group: user
categories: pig, hadoop
posted: Dec 31, '08 at 9:52p
active: Jan 5, '09 at 10:28p
posts: 11
users: 5
website: pig.apache.org