Why don't you write a function that takes a bag and returns a bag? I'm not sure why it matters whether or not the function is considered an aggregation function. PiggyBank already has functions that take bags and return bags.
If I understand your problem correctly, you want a function that takes tuples grouped by the first field and returns the tuple with the highest third field from each group. That is a very simple function to write as an algebraic function, and it will be very efficient.
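To make the suggestion concrete, here is a minimal sketch of the algebraic pattern in plain Python. It is not Pig's UDF API. It only illustrates why "highest third field per group" can be computed on separate chunks and merged afterwards. The names `initial`, `combine`, and `max_by_third` and the sample tuples are hypothetical.

```python
def initial(chunk):
    """Per-chunk step: for each key (field 0), keep the tuple with the largest field 2."""
    best = {}
    for t in chunk:
        key = t[0]
        if key not in best or t[2] > best[key][2]:
            best[key] = t
    return best


def combine(partials):
    """Merge step: fold the per-chunk winners into a single winner per key."""
    best = {}
    for partial in partials:
        for key, t in partial.items():
            if key not in best or t[2] > best[key][2]:
                best[key] = t
    return best


def max_by_third(chunks):
    """Run the per-chunk step on every chunk, then merge the partial results."""
    return combine(initial(chunk) for chunk in chunks)


# Hypothetical data: two chunks that could have been processed independently.
chunks = [
    [("a", "x", 3), ("b", "y", 7)],
    [("a", "z", 9), ("b", "w", 1)],
]
result = max_by_third(chunks)
```

Because the maximum of partial maxima equals the overall maximum, the result does not depend on how the records are split into chunks, which is the property that lets this kind of function run efficiently.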
From: Vadim Zaliva [email@example.com]
Sent: Saturday, January 03, 2009 5:52 PM
Subject: Re: novice user
On Jan 3, 2009, at 11:41 , Ted Dunning wrote:
Assuming that I want to write the function as you suggested, I do not
see under what UDF category it falls (from this document): http://wiki.apache.org/pig/UDFManual
It is close to "Aggregate Functions", but they must return a scalar value.
If I am to write it the way I suggested, "Filter Functions" may seem
applicable, assuming that I can keep state between invocations and it
is guaranteed that the same instance will be used to process the whole
data set. But if the data is split into chunks and the function is applied
to them independently, this is not going to work.
So, either way, I am stuck! :)
The only way I see is to split my PIG script into 2 parts and save the
intermediate values. Then I can invoke a custom Hadoop map/reduce task.
After its completion, the second part of my PIG script could pick up
the results and continue.
I think this is very clumsy. The problem I am trying to solve seems to
be pretty trivial and common. I think PIG should have a way to solve
it. One of the following modifications of the PIG language would solve my
problem:
1. Allowing LIMIT as a nested operation in FOREACH (in addition to ORDER
and others which are already allowed).
2. Extending the DISTINCT operation with a "BY" clause, allowing users to
specify a list of fields.
Has anybody else besides me raised such suggestions? Any chance to
see them as part of the language anytime soon?
As you like, but you still need to sort or compare the results to get
what you want. Either way, the reduce function will have to grovel
through all of the records in the group. With sorting, you pay the price
of sorting all of the records. With max selection, you only need one
comparison per record rather than log n.
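The trade-off described here can be illustrated with a small Python sketch, assuming a non-empty group of tuples whose third field is comparable. The function names `top_by_sort` and `top_by_scan` and the sample group are hypothetical. Both return the same record, but the first orders the whole group while the second makes one comparison per record.

```python
def top_by_sort(records):
    """Sort the whole group by field 2 (descending) and take the first record."""
    return sorted(records, key=lambda t: t[2], reverse=True)[0]


def top_by_scan(records):
    """Single pass over the group, keeping the best record seen so far."""
    best = records[0]
    for t in records[1:]:
        if t[2] > best[2]:
            best = t
    return best


# Hypothetical group: all tuples share the same first field.
group = [("a", "x", 3), ("a", "y", 9), ("a", "z", 5)]
by_sort = top_by_sort(group)
by_scan = top_by_scan(group)
```

Either way every record in the group is visited. The difference is only in how much comparison work is done per record.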
On Fri, Jan 2, 2009 at 11:13 PM, Vadim Zaliva wrote:
If I were to write a custom function, I would not do as you suggest.
Your approach will not work very well on large data sets. I would write
a custom function which prints the first record and skips all subsequent
records with a matching set of fields.
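A rough Python sketch of that idea follows. It assumes that records with the same key fields arrive next to each other (for example, because the input is already sorted or grouped) and that a single pass sees all of them. The name `first_per_key`, the `key_fields` parameter, and the sample records are hypothetical, and this is not Pig code.

```python
def first_per_key(records, key_fields=(0,)):
    """Yield the first record of each run of records that share the same key fields."""
    previous_key = object()  # sentinel that compares unequal to any real key
    for record in records:
        key = tuple(record[i] for i in key_fields)
        if key != previous_key:
            yield record  # first record of a new run
            previous_key = key
        # otherwise: same key as the previous record, so skip it


# Hypothetical input, already ordered so that equal keys are adjacent.
records = [
    ("a", 1, 10),
    ("a", 2, 20),
    ("b", 3, 30),
    ("b", 4, 40),
    ("c", 5, 50),
]
kept = list(first_per_key(records))
```

The sketch only remembers the previous key, so it uses constant memory, but it gives the intended result only if the adjacency assumption holds for the whole input.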
"La perfection est atteinte non quand il ne reste rien a ajouter, mais
quand il ne reste rien a enlever." (Antoine de Saint-Exupery)