FAQ
Yes. My change is to add the function "sample-variance-parallel", which
should be stable; I left in "variance" and "sample-variance" to
demonstrate the numerical problems (the tests should show this), and as
a result the PR is a bit sloppy and confusing.
On 7/5/13 12:32 PM, Ted Dunning wrote:

In looking at the code here, I note that the variance computation uses
sum of squares.

That is very bad for numerical instability reasons. The problem
arises because you are looking for (small) differences between the
square of large numbers. When the ratio of mean to variance gets
large you quickly lose all significance and can even get a negative
value for variance.

For instance, with a million samples of a normal deviate with mean of
1e9, I get values of variance that are -128, 0 or 128 instead of the
desired value of 1.

Welford's method is far superior.

See
http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Online_algorithm




On Fri, Jul 5, 2013 at 11:34 AM, Mason wrote:

For anyone who still needs it, I implemented Gavin's approach at
https://github.com/nathanmarz/cascalog/pull/161.


On Friday, October 1, 2010 11:40:59 AM UTC-7, Ted Dunning wrote:

Ooops. The original reference that I used for this method
should have been included.

See
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.105.1580

On Fri, Oct 1, 2010 at 11:39 AM, Ted Dunning
wrote:



On Fri, Oct 1, 2010 at 10:25 AM, Marc Limotte
wrote:

Fantastic. Quantile calculations were next on my
list, so this is fortuitous. I've been thinking about
a few different options for how to calculate quantile
in a streaming/parallel fashion. I'm curious... what
approach is used in Mahout to do this?


The basic idea is a modification of Welford's technique
for means. With that method, you have a progressive
update formula:

m = m + (x-m)/n

If you change that so that instead of the difference, you
use a constant then you get a formula for median:

median = median + p * signum(x-m) / n

For the correct constant value of p, this will converge to
the median about as fast as keeping all the samples and
sorting them.

The trick is finding the right value of p. The
theoretically correct value is the probability density
function at the current value of the
median, but we can't know that, especially at the beginning.

So the algorithm that I used was to take the first 100
samples to get a first value of median, 25th and 75th
%-iles by sorting. Then
I use the current estimate of the inter-quartile range to
get my estimate of p. This works very well except on
massively skewed
distributions like Gamma(0.01, 0.01). Here the mean is 1
and no value is negative, but the 25th %-ile is something
like 10^-50.
This algorithm estimates the 25th %-ile as about 10^-20
which is good enough for me, but does have a large
relative error. See
the test cases for the real details; there is a disabled
test case that demonstrates this failing.


The wikipedia page I mentioned also has skewness and
kertosis, which I was planning to calculate as well.
I didn't mention them in my post because the solution
would be essentially the same as for variance. I see
they are not in your OnlineSummarizer implementation.
On the other hand, it's not clear how important these
stats are... to some extent they were included from
the viewpoint of: "if we're calculating variance, we
might as well calculate these stats, too".


Kurtosis and skewness are largely concepts from the method
of moments and are harder to interpret than min, max, mean
and a few quantiles (IMNSHO). Many people simply look at
the median versus mean comparison to get a feel for
skewness. I like to take a leaf from R where the summary
function does this:
summary(rgamma(10000, 0.1, rate=0.1))
Min. 1st Qu. Median
Mean 3rd Qu. Max.
2.0096837668e-48 7.6196326624e-06 6.7717972442e-03
9.9165978893e-01 3.6167000776e-01 5.2394817659e+01



--
You received this message because you are subscribed to the Google
Groups "cascalog-user" group.
To unsubscribe from this group and stop receiving emails from it,
send an email to cascalog-user+unsubscribe@googlegroups.com
For more options, visit https://groups.google.com/groups/opt_out.



--
You received this message because you are subscribed to the Google
Groups "cascalog-user" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to cascalog-user+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
--
You received this message because you are subscribed to the Google Groups "cascalog-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cascalog-user+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Search Discussions

Discussion Posts

Previous

Related Discussions

Discussion Navigation
viewthread | post
posts ‹ prev | 3 of 3 | next ›
Discussion Overview
groupcascalog-user @
categoriesclojure, hadoop
postedJul 5, '13 at 6:34p
activeJul 5, '13 at 8:38p
posts3
users2
websiteclojure.org
irc#clojure

2 users in discussion

Mason: 2 posts Ted Dunning: 1 post

People

Translate

site design / logo © 2021 Grokbase