FAQ
For anyone who still needs it, I implemented Gavin's approach
at https://github.com/nathanmarz/cascalog/pull/161.
On Friday, October 1, 2010 11:40:59 AM UTC-7, Ted Dunning wrote:

Ooops. The original reference that I used for this method should have
been included.

See http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.105.1580

On Fri, Oct 1, 2010 at 11:39 AM, Ted Dunning <ted.d...@gmail.com<javascript:>
wrote:

On Fri, Oct 1, 2010 at 10:25 AM, Marc Limotte <msli...@gmail.com<javascript:>
wrote:
Fantastic. Quantile calculations were next on my list, so this is
fortuitous. I've been thinking about a few different options for how to
calculate quantile in a streaming/parallel fashion. I'm curious... what
approach is used in Mahout to do this?
The basic idea is a modification of Welford's technique for means. With
that method, you have a progressive update formula:

m = m + (x-m)/n

If you change that so that instead of the difference, you use a constant
then you get a formula for median:

median = median + p * signum(x-m) / n

For the correct constant value of p, this will converge to the median
about as fast as keeping all the samples and sorting them.

The trick is finding the right value of p. The theoretically correct
value is the probability density function at the current value of the
median, but we can't know that, especially at the beginning.

So the algorithm that I used was to take the first 100 samples to get a
first value of median, 25th and 75th %-iles by sorting. Then
I use the current estimate of the inter-quartile range to get my estimate
of p. This works very well except on massively skewed
distributions like Gamma(0.01, 0.01). Here the mean is 1 and no value is
negative, but the 25th %-ile is something like 10^-50.
This algorithm estimates the 25th %-ile as about 10^-20 which is good
enough for me, but does have a large relative error. See
the test cases for the real details; there is a disabled test case that
demonstrates this failing.

The wikipedia page I mentioned also has skewness and kertosis, which I
was planning to calculate as well. I didn't mention them in my post
because the solution would be essentially the same as for variance. I see
they are not in your OnlineSummarizer implementation. On the other hand,
it's not clear how important these stats are... to some extent they were
included from the viewpoint of: "if we're calculating variance, we might as
well calculate these stats, too".
Kurtosis and skewness are largely concepts from the method of moments and
are harder to interpret than min, max, mean and a few quantiles (IMNSHO).
Many people simply look at the median versus mean comparison to get a feel
for skewness. I like to take a leaf from R where the summary function does
this:
summary(rgamma(10000, 0.1, rate=0.1))
Min. 1st Qu. Median Mean
3rd Qu. Max.
2.0096837668e-48 7.6196326624e-06 6.7717972442e-03 9.9165978893e-01
3.6167000776e-01 5.2394817659e+01

--
You received this message because you are subscribed to the Google Groups "cascalog-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cascalog-user+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Search Discussions

  • Ted Dunning at Jul 5, 2013 at 7:33 pm
    In looking at the code here, I note that the variance computation uses sum
    of squares.

    That is very bad for numerical instability reasons. The problem arises
    because you are looking for (small) differences between the square of large
    numbers. When the ratio of mean to variance gets large you quickly lose
    all significance and can even get a negative value for variance.

    For instance, with a million samples of a normal deviate with mean of 1e9,
    I get values of variance that are -128, 0 or 128 instead of the desired
    value of 1.

    Welford's method is far superior.

    See
    http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Online_algorithm



    On Fri, Jul 5, 2013 at 11:34 AM, Mason wrote:

    For anyone who still needs it, I implemented Gavin's approach at
    https://github.com/nathanmarz/cascalog/pull/161.

    On Friday, October 1, 2010 11:40:59 AM UTC-7, Ted Dunning wrote:

    Ooops. The original reference that I used for this method should have
    been included.

    See http://citeseerx.ist.psu.**edu/viewdoc/summary?doi=10.1.**1.105.1580<http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.105.1580>

    On Fri, Oct 1, 2010 at 11:39 AM, Ted Dunning wrote:

    On Fri, Oct 1, 2010 at 10:25 AM, Marc Limotte wrote:

    Fantastic. Quantile calculations were next on my list, so this is
    fortuitous. I've been thinking about a few different options for how to
    calculate quantile in a streaming/parallel fashion. I'm curious... what
    approach is used in Mahout to do this?
    The basic idea is a modification of Welford's technique for means. With
    that method, you have a progressive update formula:

    m = m + (x-m)/n

    If you change that so that instead of the difference, you use a constant
    then you get a formula for median:

    median = median + p * signum(x-m) / n

    For the correct constant value of p, this will converge to the median
    about as fast as keeping all the samples and sorting them.

    The trick is finding the right value of p. The theoretically correct
    value is the probability density function at the current value of the
    median, but we can't know that, especially at the beginning.

    So the algorithm that I used was to take the first 100 samples to get a
    first value of median, 25th and 75th %-iles by sorting. Then
    I use the current estimate of the inter-quartile range to get my
    estimate of p. This works very well except on massively skewed
    distributions like Gamma(0.01, 0.01). Here the mean is 1 and no value
    is negative, but the 25th %-ile is something like 10^-50.
    This algorithm estimates the 25th %-ile as about 10^-20 which is good
    enough for me, but does have a large relative error. See
    the test cases for the real details; there is a disabled test case that
    demonstrates this failing.

    The wikipedia page I mentioned also has skewness and kertosis, which I
    was planning to calculate as well. I didn't mention them in my post
    because the solution would be essentially the same as for variance. I see
    they are not in your OnlineSummarizer implementation. On the other hand,
    it's not clear how important these stats are... to some extent they were
    included from the viewpoint of: "if we're calculating variance, we might as
    well calculate these stats, too".
    Kurtosis and skewness are largely concepts from the method of moments
    and are harder to interpret than min, max, mean and a few quantiles
    (IMNSHO). Many people simply look at the median versus mean comparison to
    get a feel for skewness. I like to take a leaf from R where the summary
    function does this:
    summary(rgamma(10000, 0.1, rate=0.1))
    Min. 1st Qu. Median Mean
    3rd Qu. Max.
    2.0096837668e-48 7.6196326624e-06 6.7717972442e-03 9.9165978893e-01
    3.6167000776e-01 5.2394817659e+01

    --
    You received this message because you are subscribed to the Google Groups
    "cascalog-user" group.
    To unsubscribe from this group and stop receiving emails from it, send an
    email to cascalog-user+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/groups/opt_out.

    --
    You received this message because you are subscribed to the Google Groups "cascalog-user" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to cascalog-user+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/groups/opt_out.
  • Mason at Jul 5, 2013 at 8:38 pm
    Yes. My change is to add the function "sample-variance-parallel", which
    should be stable; I left in "variance" and "sample-variance" to
    demonstrate the numerical problems (the tests should show this), and as
    a result the PR is a bit sloppy and confusing.
    On 7/5/13 12:32 PM, Ted Dunning wrote:

    In looking at the code here, I note that the variance computation uses
    sum of squares.

    That is very bad for numerical instability reasons. The problem
    arises because you are looking for (small) differences between the
    square of large numbers. When the ratio of mean to variance gets
    large you quickly lose all significance and can even get a negative
    value for variance.

    For instance, with a million samples of a normal deviate with mean of
    1e9, I get values of variance that are -128, 0 or 128 instead of the
    desired value of 1.

    Welford's method is far superior.

    See
    http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Online_algorithm




    On Fri, Jul 5, 2013 at 11:34 AM, Mason wrote:

    For anyone who still needs it, I implemented Gavin's approach at
    https://github.com/nathanmarz/cascalog/pull/161.


    On Friday, October 1, 2010 11:40:59 AM UTC-7, Ted Dunning wrote:

    Ooops. The original reference that I used for this method
    should have been included.

    See
    http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.105.1580

    On Fri, Oct 1, 2010 at 11:39 AM, Ted Dunning
    wrote:



    On Fri, Oct 1, 2010 at 10:25 AM, Marc Limotte
    wrote:

    Fantastic. Quantile calculations were next on my
    list, so this is fortuitous. I've been thinking about
    a few different options for how to calculate quantile
    in a streaming/parallel fashion. I'm curious... what
    approach is used in Mahout to do this?


    The basic idea is a modification of Welford's technique
    for means. With that method, you have a progressive
    update formula:

    m = m + (x-m)/n

    If you change that so that instead of the difference, you
    use a constant then you get a formula for median:

    median = median + p * signum(x-m) / n

    For the correct constant value of p, this will converge to
    the median about as fast as keeping all the samples and
    sorting them.

    The trick is finding the right value of p. The
    theoretically correct value is the probability density
    function at the current value of the
    median, but we can't know that, especially at the beginning.

    So the algorithm that I used was to take the first 100
    samples to get a first value of median, 25th and 75th
    %-iles by sorting. Then
    I use the current estimate of the inter-quartile range to
    get my estimate of p. This works very well except on
    massively skewed
    distributions like Gamma(0.01, 0.01). Here the mean is 1
    and no value is negative, but the 25th %-ile is something
    like 10^-50.
    This algorithm estimates the 25th %-ile as about 10^-20
    which is good enough for me, but does have a large
    relative error. See
    the test cases for the real details; there is a disabled
    test case that demonstrates this failing.


    The wikipedia page I mentioned also has skewness and
    kertosis, which I was planning to calculate as well.
    I didn't mention them in my post because the solution
    would be essentially the same as for variance. I see
    they are not in your OnlineSummarizer implementation.
    On the other hand, it's not clear how important these
    stats are... to some extent they were included from
    the viewpoint of: "if we're calculating variance, we
    might as well calculate these stats, too".


    Kurtosis and skewness are largely concepts from the method
    of moments and are harder to interpret than min, max, mean
    and a few quantiles (IMNSHO). Many people simply look at
    the median versus mean comparison to get a feel for
    skewness. I like to take a leaf from R where the summary
    function does this:
    summary(rgamma(10000, 0.1, rate=0.1))
    Min. 1st Qu. Median
    Mean 3rd Qu. Max.
    2.0096837668e-48 7.6196326624e-06 6.7717972442e-03
    9.9165978893e-01 3.6167000776e-01 5.2394817659e+01



    --
    You received this message because you are subscribed to the Google
    Groups "cascalog-user" group.
    To unsubscribe from this group and stop receiving emails from it,
    send an email to cascalog-user+unsubscribe@googlegroups.com
    For more options, visit https://groups.google.com/groups/opt_out.



    --
    You received this message because you are subscribed to the Google
    Groups "cascalog-user" group.
    To unsubscribe from this group and stop receiving emails from it, send
    an email to cascalog-user+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/groups/opt_out.
    --
    You received this message because you are subscribed to the Google Groups "cascalog-user" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to cascalog-user+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/groups/opt_out.

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupcascalog-user @
categoriesclojure, hadoop
postedJul 5, '13 at 6:34p
activeJul 5, '13 at 8:38p
posts3
users2
websiteclojure.org
irc#clojure

2 users in discussion

Mason: 2 posts Ted Dunning: 1 post

People

Translate

site design / logo © 2021 Grokbase