at Mar 12, 2011 at 12:40 am
It's not very clear how you are doing this in a single eval func efficiently, or are you scanning all the tuples within a single invocation of the eval func?
For this to be scalable and O(n), wouldn't you need at least 2 map-reduce jobs? The first job to calculate histograms (which can be scaled up by the map phase assigning buckets in parallel) and the reduce computing the counts for each bucket; the second job to calculate the percentile rank of each tuple in a map phase using the histograms (which again can be done in parallel).
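Neither map-reduce job is spelled out in the thread; below is a minimal single-machine sketch of the same two-pass idea, with all class and method names my own. Pass one plays the role of the first job (building bucket counts); pass two plays the role of the second job (looking up each tuple's approximate percentile rank from the cumulative counts).

```java
import java.util.Arrays;

public class HistogramPercentileRank {

    // Pass 1 ("first job"): scan the values once and count how many fall
    // into each of a fixed number of equal-width buckets.
    static long[] buildHistogram(double[] values, double min, double max, int buckets) {
        long[] counts = new long[buckets];
        double width = (max - min) / buckets;
        for (double v : values) {
            int b = (int) Math.min((v - min) / width, buckets - 1);
            counts[b]++;
        }
        return counts;
    }

    // Pass 2 ("second job"): approximate the percentile rank of one value
    // as the fraction of the total count in the buckets strictly below it.
    static double percentileRank(double v, long[] counts, double min, double max, long total) {
        double width = (max - min) / counts.length;
        int b = (int) Math.min((v - min) / width, counts.length - 1);
        long below = 0;
        for (int i = 0; i < b; i++) below += counts[i];
        return 100.0 * below / total;
    }

    public static void main(String[] args) {
        double[] data = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        long[] h = buildHistogram(data, 1, 10, 5);
        System.out.println(Arrays.toString(h));            // bucket counts
        System.out.println(percentileRank(9.0, h, 1, 10, data.length));
    }
}
```

Both passes parallelize naturally: buckets can be counted independently per mapper and summed per bucket in the reducer, and rank lookups in the second job only need the (small) histogram broadcast to each mapper. The accuracy depends on the bucket width.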
From: email@example.com On Behalf Of Josh Devins
Sent: 12 March 2011 04:36
Cc: Dmitriy Ryaboy; Jonathan Holloway
Subject: Re: Percentile UDF
I haven't tried with any super large DataBags but would definitely pursue Dmitriy's grouping suggestion if memory usage/size is of concern and you have a very large distribution. Something I will have to try as well!
On 11 March 2011 19:18, Dmitriy Ryaboy wrote:
1) change log.info to log.debug... you don't want that spamming your
logs for a billion tuples.
2) instanceof DefaultDataBag should probably be instanceof DataBag
(other databags are ok too)
3) the somewhat more scalable version of this would group by value (or
value range, for histograms) and provide a series of (value,
numentries) pairs -- assuming numentries tends to be above 2, you can
get substantial savings in terms of memory there
4) if you use an ArrayList instead of an array, you don't need to
precalculate the size of the bag. But then a copy to array is required
for Percentile to work. So you can implement your own Percentile that
wouldn't require you to triple-buffer...
On Fri, Mar 11, 2011 at 9:10 AM, Jonathan Holloway <
This was my attempt at the Percentile function today: http://pastebin.com/C883rwbJ
Any tips on improving the efficiency of it would be welcome.
Dmitriy mentioned using a histogram as an approach for this and
looking at the way
GROUP BY works. If anybody has any further info on that approach
I'd be interested in hearing about it.
On 10 March 2011 21:01, Jonathan Holloway
That's the path I started down today. I don't suppose the UDF you wrote is
in the public domain
at all - would you consider contributing it to piggybank.jar?
Also, how does it fare with large datasets?
On 10 March 2011 19:49, Josh Devins wrote:
I wrote one not long ago that just relies on Apache Commons Math
underlying the UDF. It's pretty straightforward, as the Percentile class will
sort your numbers before doing the percentile calculations. The UDF just
needs to take a bag/tuple, pull out all the fields as doubles and hand them to Percentile.
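Josh's actual UDF isn't shown in the thread. For illustration, here is a plain-Java sketch of what the Commons Math Percentile class does under the hood (sort, then interpolate at the estimated position); the class name and structure here are my own, and the real UDF would simply call `new Percentile().evaluate(values, p)` from Commons Math instead.

```java
import java.util.Arrays;

public class SimplePercentile {

    // Sort-based percentile, roughly mirroring Commons Math's default
    // estimator: position = p * (n + 1) / 100, with linear interpolation
    // between the two neighbouring order statistics.
    static double evaluate(double[] values, double p) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        double pos = p / 100.0 * (n + 1);
        if (pos < 1) return sorted[0];
        if (pos >= n) return sorted[n - 1];
        int lower = (int) Math.floor(pos);      // 1-based rank below the position
        double frac = pos - lower;
        return sorted[lower - 1] + frac * (sorted[lower] - sorted[lower - 1]);
    }

    public static void main(String[] args) {
        double[] data = {15, 20, 35, 40, 50};
        System.out.println(evaluate(data, 50)); // median of the five values
    }
}
```

The sort is why this approach is simple but memory-bound: the whole bag has to fit in an array on one machine, which is exactly the concern the histogram/grouping suggestions above address.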
On 10 March 2011 19:38, Kris Coward wrote:
On Thu, Mar 10, 2011 at 03:24:52PM +0000, Jonathan Holloway wrote:
Does anybody have a UDF for calculating the percentile of a set of values?
I took a look at the builtins and the piggybank project, but
couldn't seem to see anything. Is there a
reason why there isn't a builtin for this?
I think it would be because there's an inherently serial
problem in there (i.e. numbering each entry based on its place
in the ordered list).