Grokbase Groups Pig user March 2011
FAQ
Hi all,

Does anybody know if a Percentile UDF exists at all, I've searched through
the manual and the piggybank project, but can't
seem to see one there.

Many thanks,
Jon.

Search Discussions

  • Jonathan Holloway at Mar 10, 2011 at 3:25 pm
    HI all,

    Does anybody have a UDF for calculating the percentile (median, 99%) at all?
    I took a look at the builtins
    and the piggybank project, but couldn't seem to see anything. Is there a
    reason why there isn't a builtin for
    this?

    Many thanks,
    Jon.
  • Kris Coward at Mar 10, 2011 at 6:38 pm

    On Thu, Mar 10, 2011 at 03:24:52PM +0000, Jonathan Holloway wrote:
    HI all,

    Does anybody have a UDF for calculating the percentile (median, 99%) at all?
    I took a look at the builtins
    and the piggybank project, but couldn't seem to see anything. Is there a
    reason why there isn't a builtin for
    this?
    I think it would be because there's an inherently serial problem in
    there (i.e. numbering each entry based on its place in the ordered
    list).

    Cheers,
    Kris

    --
    Kris Coward http://unripe.melon.org/
    GPG Fingerprint: 2BF3 957D 310A FEEC 4733 830E 21A4 05C7 1FEB 12B3
  • Josh Devins at Mar 10, 2011 at 7:50 pm
    Hey Jon,

    I wrote one not long ago that just relies on Apache Commons Math underlying
    the UDF. It's pretty straightforward as the Percentile implementation will
    sort your numbers before doing percentile calculations. The UDF then just
    needs to take a bag/tuple, pull out all the fields as double[] and pass into
    Percentile.

    http://commons.apache.org/math/apidocs/org/apache/commons/math/stat/descriptive/rank/Percentile.html

    Josh

    On 10 March 2011 19:38, Kris Coward wrote:
    On Thu, Mar 10, 2011 at 03:24:52PM +0000, Jonathan Holloway wrote:
    HI all,

    Does anybody have a UDF for calculating the percentile (median, 99%) at all?
    I took a look at the builtins
    and the piggybank project, but couldn't seem to see anything. Is there a
    reason why there isn't a builtin for
    this?
    I think it would be because there's an inherently serial problem in
    there (i.e. numbering each entry based on its place in the ordered
    list).

    Cheers,
    Kris

    --
    Kris Coward http://unripe.melon.org/
    GPG Fingerprint: 2BF3 957D 310A FEEC 4733 830E 21A4 05C7 1FEB 12B3
  • Edward Capriolo at Mar 10, 2011 at 8:09 pm

    On Thu, Mar 10, 2011 at 2:49 PM, Josh Devins wrote:
    Hey Jon,

    I wrote one not long ago that just relies on Apache Commons Math underlying
    the UDF. It's pretty straightforward as the Percentile implementation will
    sort your numbers before doing percentile calculations. The UDF then just
    needs to take a bag/tuple, pull out all the fields as double[] and pass into
    Percentile.

    http://commons.apache.org/math/apidocs/org/apache/commons/math/stat/descriptive/rank/Percentile.html

    Josh

    On 10 March 2011 19:38, Kris Coward wrote:
    On Thu, Mar 10, 2011 at 03:24:52PM +0000, Jonathan Holloway wrote:
    HI all,

    Does anybody have a UDF for calculating the percentile (median, 99%) at all?
    I took a look at the builtins
    and the piggybank project, but couldn't seem to see anything.  Is there a
    reason why there isn't a builtin for
    this?
    I think it would be because there's an inherently serial problem in
    there (i.e. numbering each entry based on its place in the ordered
    list).

    Cheers,
    Kris

    --
    Kris Coward                                     http://unripe.melon.org/
    GPG Fingerprint: 2BF3 957D 310A FEEC 4733  830E 21A4 05C7 1FEB 12B3
    You can get an approximated percentile without completely ordering the list.

    Caveat: Assuming your values have a small distribution. You can use
    rounding to shrink the distribution.

    In SQL terms. 'select value as val,count(1) as count from X group by
    value order by value'

    This should result in a small result set with very large counts. Get
    the total count. You can now find the percentile of this smaller list.
  • Jonathan Holloway at Mar 10, 2011 at 9:02 pm
    Hey Josh,

    That's the path I started down today, I don't suppose the UDF you wrote is
    in the public domain
    at all - would you consider contributing it to piggybank.jar at all? How
    does it fare with large datasets
    as well?

    Many thanks,
    Jon.
    On 10 March 2011 19:49, Josh Devins wrote:

    Hey Jon,

    I wrote one not long ago that just relies on Apache Commons Math underlying
    the UDF. It's pretty straightforward as the Percentile implementation will
    sort your numbers before doing percentile calculations. The UDF then just
    needs to take a bag/tuple, pull out all the fields as double[] and pass
    into
    Percentile.


    http://commons.apache.org/math/apidocs/org/apache/commons/math/stat/descriptive/rank/Percentile.html

    Josh

    On 10 March 2011 19:38, Kris Coward wrote:
    On Thu, Mar 10, 2011 at 03:24:52PM +0000, Jonathan Holloway wrote:
    HI all,

    Does anybody have a UDF for calculating the percentile (median, 99%) at all?
    I took a look at the builtins
    and the piggybank project, but couldn't seem to see anything. Is there
    a
    reason why there isn't a builtin for
    this?
    I think it would be because there's an inherently serial problem in
    there (i.e. numbering each entry based on its place in the ordered
    list).

    Cheers,
    Kris

    --
    Kris Coward http://unripe.melon.org/
    GPG Fingerprint: 2BF3 957D 310A FEEC 4733 830E 21A4 05C7 1FEB 12B3
  • Jonathan Holloway at Mar 11, 2011 at 5:11 pm
    This was my attempt at the Percentile function today:

    http://pastebin.com/C883rwbJ

    Any tips on improving the efficiency of it would be welcome. Dmitriy
    mentioned using a histogram as an approach for this and looking at the way
    GROUP BY works. If anybody has any further info on that approach I'd be
    interested in hearing about it.

    Cheers,
    Jon.
    On 10 March 2011 21:01, Jonathan Holloway wrote:

    Hey Josh,

    That's the path I started down today, I don't suppose the UDF you wrote is
    in the public domain
    at all - would you consider contributing it to piggybank.jar at all? How
    does it fare with large datasets
    as well?

    Many thanks,
    Jon.
    On 10 March 2011 19:49, Josh Devins wrote:

    Hey Jon,

    I wrote one not long ago that just relies on Apache Commons Math
    underlying
    the UDF. It's pretty straightforward as the Percentile implementation will
    sort your numbers before doing percentile calculations. The UDF then just
    needs to take a bag/tuple, pull out all the fields as double[] and pass
    into
    Percentile.


    http://commons.apache.org/math/apidocs/org/apache/commons/math/stat/descriptive/rank/Percentile.html

    Josh

    On 10 March 2011 19:38, Kris Coward wrote:
    On Thu, Mar 10, 2011 at 03:24:52PM +0000, Jonathan Holloway wrote:
    HI all,

    Does anybody have a UDF for calculating the percentile (median, 99%)
    at
    all?
    I took a look at the builtins
    and the piggybank project, but couldn't seem to see anything. Is
    there a
    reason why there isn't a builtin for
    this?
    I think it would be because there's an inherently serial problem in
    there (i.e. numbering each entry based on its place in the ordered
    list).

    Cheers,
    Kris

    --
    Kris Coward
    http://unripe.melon.org/
    GPG Fingerprint: 2BF3 957D 310A FEEC 4733 830E 21A4 05C7 1FEB 12B3
  • Dmitriy Ryaboy at Mar 11, 2011 at 6:19 pm
    1) change log.info to log.debug... you don't want that spamming your logs
    for a billion tuples.
    2) instanceof DefaultDataBag should probably be instanceof DataBag (other
    databags are ok too)
    3) the somewhat more scalable version of this would group by value (or value
    range, for histograms) and provide a series of (value, numentries) pairs --
    assuming numentries tends to be above 2, you can get substantial savings in
    terms of memory there
    4) if you use an ArrayList instead of an array, you don't need to
    precalculate the size of the bag. But then a copy to array is required for
    Percentile to work. So you can implement your own Percentile that wouldn't
    require you to triple-buffer...

    D
    On Fri, Mar 11, 2011 at 9:10 AM, Jonathan Holloway wrote:

    This was my attempt at the Percentile function today:

    http://pastebin.com/C883rwbJ

    Any tips on improving the efficiency of it would be welcome. Dmitriy
    mentioned using a histogram as an approach for this and looking at the way
    GROUP BY works. If anybody has any further info on that approach I'd be
    interested in hearing about it.

    Cheers,
    Jon.

    On 10 March 2011 21:01, Jonathan Holloway <jonathan.holloway@gmail.com
    wrote:
    Hey Josh,

    That's the path I started down today, I don't suppose the UDF you wrote is
    in the public domain
    at all - would you consider contributing it to piggybank.jar at all? How
    does it fare with large datasets
    as well?

    Many thanks,
    Jon.
    On 10 March 2011 19:49, Josh Devins wrote:

    Hey Jon,

    I wrote one not long ago that just relies on Apache Commons Math
    underlying
    the UDF. It's pretty straightforward as the Percentile implementation
    will
    sort your numbers before doing percentile calculations. The UDF then
    just
    needs to take a bag/tuple, pull out all the fields as double[] and pass
    into
    Percentile.

    http://commons.apache.org/math/apidocs/org/apache/commons/math/stat/descriptive/rank/Percentile.html
    Josh

    On 10 March 2011 19:38, Kris Coward wrote:
    On Thu, Mar 10, 2011 at 03:24:52PM +0000, Jonathan Holloway wrote:
    HI all,

    Does anybody have a UDF for calculating the percentile (median, 99%)
    at
    all?
    I took a look at the builtins
    and the piggybank project, but couldn't seem to see anything. Is
    there a
    reason why there isn't a builtin for
    this?
    I think it would be because there's an inherently serial problem in
    there (i.e. numbering each entry based on its place in the ordered
    list).

    Cheers,
    Kris

    --
    Kris Coward
    http://unripe.melon.org/
    GPG Fingerprint: 2BF3 957D 310A FEEC 4733 830E 21A4 05C7 1FEB 12B3
  • Josh Devins at Mar 11, 2011 at 11:06 pm
    Jon,

    I haven't tried with any super large DataBags but would definitely pursue
    Dmitriy's grouping suggestion if memory usage/size is of concern and you
    have a very large distribution. Something I will have to try as well!

    Cheers,

    Josh


    On 11 March 2011 19:18, Dmitriy Ryaboy wrote:

    1) change log.info to log.debug... you don't want that spamming your logs
    for a billion tuples.
    2) instanceof DefaultDataBag should probably be instanceof DataBag (other
    databags are ok too)
    3) the somewhat more scalable version of this would group by value (or
    value
    range, for histograms) and provide a series of (value, numentries) pairs --
    assuming numentries tends to be above 2, you can get substantial savings in
    terms of memory there
    4) if you use an ArrayList instead of an array, you don't need to
    precalculate the size of the bag. But then a copy to array is required for
    Percentile to work. So you can implement your own Percentile that wouldn't
    require you to triple-buffer...

    D

    On Fri, Mar 11, 2011 at 9:10 AM, Jonathan Holloway <
    jonathan.holloway@gmail.com> wrote:
    This was my attempt at the Percentile function today:

    http://pastebin.com/C883rwbJ

    Any tips on improving the efficiency of it would be welcome. Dmitriy
    mentioned using a histogram as an approach for this and looking at the way
    GROUP BY works. If anybody has any further info on that approach I'd be
    interested in hearing about it.

    Cheers,
    Jon.

    On 10 March 2011 21:01, Jonathan Holloway <jonathan.holloway@gmail.com
    wrote:
    Hey Josh,

    That's the path I started down today, I don't suppose the UDF you wrote is
    in the public domain
    at all - would you consider contributing it to piggybank.jar at all?
    How
    does it fare with large datasets
    as well?

    Many thanks,
    Jon.
    On 10 March 2011 19:49, Josh Devins wrote:

    Hey Jon,

    I wrote one not long ago that just relies on Apache Commons Math
    underlying
    the UDF. It's pretty straightforward as the Percentile implementation
    will
    sort your numbers before doing percentile calculations. The UDF then
    just
    needs to take a bag/tuple, pull out all the fields as double[] and
    pass
    into
    Percentile.

    http://commons.apache.org/math/apidocs/org/apache/commons/math/stat/descriptive/rank/Percentile.html
    Josh

    On 10 March 2011 19:38, Kris Coward wrote:
    On Thu, Mar 10, 2011 at 03:24:52PM +0000, Jonathan Holloway wrote:
    HI all,

    Does anybody have a UDF for calculating the percentile (median,
    99%)
    at
    all?
    I took a look at the builtins
    and the piggybank project, but couldn't seem to see anything. Is
    there a
    reason why there isn't a builtin for
    this?
    I think it would be because there's an inherently serial problem in
    there (i.e. numbering each entry based on its place in the ordered
    list).

    Cheers,
    Kris

    --
    Kris Coward
    http://unripe.melon.org/
    GPG Fingerprint: 2BF3 957D 310A FEEC 4733 830E 21A4 05C7 1FEB 12B3
  • Kaluskar, Sanjay at Mar 12, 2011 at 12:40 am
    It's not very clear how you are doing this in a single eval func efficiently, or are you scanning all the tuples within a single invocation of the eval func?

    For this to be scalable and O(n), wouldn't you need at least 2 map-reduce jobs? The first job to calculate histograms (which can be scaled up by the map phase assigning buckets in parallel) and the reduce computing the counts for each bucket; the second job to calculate the percentile rank of each tuple in a map phase using the histograms (which again can be done in parallel).

    -----Original Message-----
    From: joshdevins@gmail.com On Behalf Of Josh Devins
    Sent: 12 March 2011 04:36
    To: user@pig.apache.org
    Cc: Dmitriy Ryaboy; Jonathan Holloway
    Subject: Re: Percentile UDF

    Jon,

    I haven't tried with any super large DataBags but would definitely pursue Dmitriy's grouping suggestion if memory usage/size is of concern and you have a very large distribution. Something I will have to try as well!

    Cheers,

    Josh


    On 11 March 2011 19:18, Dmitriy Ryaboy wrote:

    1) change log.info to log.debug... you don't want that spamming your
    logs for a billion tuples.
    2) instanceof DefaultDataBag should probably be instanceof DataBag
    (other databags are ok too)
    3) the somewhat more scalable version of this would group by value (or
    value range, for histograms) and provide a series of (value,
    numentries) pairs -- assuming numentries tends to be above 2, you can
    get substantial savings in terms of memory there
    4) if you use an ArrayList instead of an array, you don't need to
    precalculate the size of the bag. But then a copy to array is required
    for Percentile to work. So you can implement your own Percentile that
    wouldn't require you to triple-buffer...

    D

    On Fri, Mar 11, 2011 at 9:10 AM, Jonathan Holloway <
    jonathan.holloway@gmail.com> wrote:
    This was my attempt at the Percentile function today:

    http://pastebin.com/C883rwbJ

    Any tips on improving the efficiency of it would be welcome.
    Dmitriy mentioned using a histogram as an approach for this and
    looking at the way
    GROUP BY works. If anybody has any further info on that approach
    I'd be interested in hearing about it.

    Cheers,
    Jon.

    On 10 March 2011 21:01, Jonathan Holloway
    <jonathan.holloway@gmail.com
    wrote:
    Hey Josh,

    That's the path I started down today, I don't suppose the UDF you
    wrote is
    in the public domain
    at all - would you consider contributing it to piggybank.jar at all?
    How
    does it fare with large datasets
    as well?

    Many thanks,
    Jon.
    On 10 March 2011 19:49, Josh Devins wrote:

    Hey Jon,

    I wrote one not long ago that just relies on Apache Commons Math
    underlying the UDF. It's pretty straightforward as the Percentile
    implementation
    will
    sort your numbers before doing percentile calculations. The UDF
    then
    just
    needs to take a bag/tuple, pull out all the fields as double[]
    and
    pass
    into
    Percentile.

    http://commons.apache.org/math/apidocs/org/apache/commons/math/stat/de
    scriptive/rank/Percentile.html
    Josh

    On 10 March 2011 19:38, Kris Coward wrote:
    On Thu, Mar 10, 2011 at 03:24:52PM +0000, Jonathan Holloway wrote:
    HI all,

    Does anybody have a UDF for calculating the percentile
    (median,
    99%)
    at
    all?
    I took a look at the builtins and the piggybank project, but
    couldn't seem to see anything. Is
    there a
    reason why there isn't a builtin for this?
    I think it would be because there's an inherently serial
    problem in there (i.e. numbering each entry based on its place
    in the ordered list).

    Cheers,
    Kris

    --
    Kris Coward
    http://unripe.melon.org/
    GPG Fingerprint: 2BF3 957D 310A FEEC 4733 830E 21A4 05C7 1FEB
    12B3

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupuser @
categoriespig, hadoop
postedMar 10, '11 at 2:58p
activeMar 12, '11 at 12:40a
posts10
users6
websitepig.apache.org

People

Translate

site design / logo © 2021 Grokbase