Kaluskar, Sanjay
at Mar 12, 2011 at 12:40 am
It's not very clear how you are doing this efficiently in a single eval func. Are you scanning all the tuples within a single invocation of the eval func?
For this to be scalable and O(n), wouldn't you need at least two map-reduce jobs? The first job would calculate the histograms (scaled out by having the map phase assign buckets in parallel, with the reduce phase computing the count for each bucket); the second job would calculate the percentile rank of each tuple in a map phase using the histograms (which, again, can be done in parallel).
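The two-job scheme above can be sketched in miniature outside of Pig: pass one bins values into fixed-width buckets and counts them (the reduce output of job one), and pass two derives each value's approximate percentile rank from those counts (the map phase of job two). A hedged, single-process Python sketch of the idea; the function names, bucket scheme, and bounds are illustrative, not from any code in this thread:

```python
def build_histogram(values, num_buckets, lo, hi):
    """Job 1 (sketch): count values per fixed-width bucket."""
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for v in values:
        # clamp the top edge into the last bucket
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    return counts

def percentile_rank(v, counts, lo, hi):
    """Job 2 (sketch): approximate percentile rank of v from bucket counts."""
    num_buckets = len(counts)
    width = (hi - lo) / num_buckets
    idx = min(int((v - lo) / width), num_buckets - 1)
    below = sum(counts[:idx])   # tuples in strictly lower buckets
    return 100.0 * below / sum(counts)

counts = build_histogram([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 5, 0, 10)
# counts == [1, 2, 2, 2, 3]; percentile_rank(7, counts, 0, 10) == 50.0
```

The accuracy depends entirely on the bucket width; the real map-reduce version would distribute the counting across mappers and sum partial counts in the reducers.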
-----Original Message-----
From: joshdevins@gmail.com On Behalf Of Josh Devins
Sent: 12 March 2011 04:36
To: user@pig.apache.org
Cc: Dmitriy Ryaboy; Jonathan Holloway
Subject: Re: Percentile UDF
Jon,
I haven't tried with any super large DataBags but would definitely pursue Dmitriy's grouping suggestion if memory usage/size is of concern and you have a very large distribution. Something I will have to try as well!
Cheers,
Josh
On 11 March 2011 19:18, Dmitriy Ryaboy wrote:

1) change log.info to log.debug... you don't want that spamming your logs for a billion tuples.
2) instanceof DefaultDataBag should probably be instanceof DataBag (other databags are ok too)
3) the somewhat more scalable version of this would group by value (or value range, for histograms) and provide a series of (value, numentries) pairs -- assuming numentries tends to be above 2, you can get substantial savings in terms of memory there
4) if you use an ArrayList instead of an array, you don't need to precalculate the size of the bag. But then a copy to an array is required for Percentile to work. So you could implement your own Percentile that wouldn't require you to triple-buffer...
D
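Point 3 above can be illustrated with a toy example: once the data is grouped into (value, count) pairs, a percentile can be read off the cumulative counts without ever materializing one tuple per entry. A hedged Python sketch using the nearest-rank definition (the function name and exact percentile definition are mine, not from any UDF in the thread):

```python
def percentile_from_counts(pairs, p):
    """Nearest-rank percentile over run-length-encoded data.

    pairs: iterable of (value, count) tuples; p: percentile in (0, 100].
    """
    pairs = sorted(pairs)                  # order by value
    total = sum(count for _, count in pairs)
    target = -(-p * total // 100)          # ceil(p / 100 * total)
    seen = 0
    for value, count in pairs:
        seen += count
        if seen >= target:
            return value
    raise ValueError("empty distribution")

# 1000 tuples collapse to just three pairs:
dist = [(10, 500), (20, 300), (30, 200)]
# percentile_from_counts(dist, 90) == 30
```

The memory saving is exactly what Dmitriy describes: storage is proportional to the number of distinct values (or buckets), not the number of tuples.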
On Fri, Mar 11, 2011 at 9:10 AM, Jonathan Holloway <jonathan.holloway@gmail.com> wrote:

This was my attempt at the Percentile function today:
http://pastebin.com/C883rwbJ

Any tips on improving the efficiency of it would be welcome.
Dmitriy mentioned using a histogram as an approach for this and looking at the way GROUP BY works. If anybody has any further info on that approach, I'd be interested in hearing about it.
Cheers,
Jon.
On 10 March 2011 21:01, Jonathan Holloway <jonathan.holloway@gmail.com> wrote:
Hey Josh,
That's the path I started down today. I don't suppose the UDF you wrote is in the public domain at all - would you consider contributing it to piggybank.jar? How does it fare with large datasets?
Many thanks,
Jon.
On 10 March 2011 19:49, Josh Devins wrote:
Hey Jon,
I wrote one not long ago that just relies on Apache Commons Math underlying the UDF. It's pretty straightforward, as the Percentile implementation will sort your numbers before doing the percentile calculations. The UDF then just needs to take a bag/tuple, pull out all the fields as a double[], and pass them to Percentile:
http://commons.apache.org/math/apidocs/org/apache/commons/math/stat/descriptive/rank/Percentile.html
Josh
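The UDF itself isn't shown in the thread, but the computation it delegates to Commons Math is small: sort the values, then read off the requested percentile position with linear interpolation. As a rough plain-Python illustration of that sort-then-interpolate step, using the classic pos = p * (n + 1) / 100 estimator; this is a sketch of the general technique and is not guaranteed to match Commons Math's output in every case:

```python
def percentile(values, p):
    """Percentile by sorting plus linear interpolation (sketch)."""
    xs = sorted(values)
    n = len(xs)
    pos = p * (n + 1) / 100.0    # 1-based fractional position
    if pos < 1:
        return xs[0]
    if pos >= n:
        return xs[-1]
    k = int(pos)                 # lower neighbor (1-based)
    d = pos - k                  # fractional part for interpolation
    return xs[k - 1] + d * (xs[k] - xs[k - 1])

# percentile([1, 2, 3, 4], 50) == 2.5
```

The sort is what makes this O(n log n) per bag, which is why the histogram/grouping approaches discussed above matter once bags get large.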
On 10 March 2011 19:38, Kris Coward wrote:

On Thu, Mar 10, 2011 at 03:24:52PM +0000, Jonathan Holloway wrote:
Hi all,
Does anybody have a UDF for calculating the percentile (median, 99%) at all? I took a look at the builtins and the piggybank project, but couldn't seem to see anything. Is there a reason why there isn't a builtin for this?
I think it would be because there's an inherently serial problem in there (i.e. numbering each entry based on its place in the ordered list).
Cheers,
Kris
--
Kris Coward
http://unripe.melon.org/
GPG Fingerprint: 2BF3 957D 310A FEEC 4733 830E 21A4 05C7 1FEB 12B3