FAQ
For anyone who still needs it, I implemented Gavin's approach
at https://github.com/nathanmarz/cascalog/pull/161.
On Friday, October 1, 2010 11:40:59 AM UTC-7, Ted Dunning wrote:

Ooops. The original reference that I used for this method should have
been included.

See http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.105.1580

On Fri, Oct 1, 2010 at 11:39 AM, Ted Dunning <ted.d...@gmail.com<javascript:>
wrote:

On Fri, Oct 1, 2010 at 10:25 AM, Marc Limotte <msli...@gmail.com<javascript:>
wrote:
Fantastic. Quantile calculations were next on my list, so this is
fortuitous. I've been thinking about a few different options for how to
calculate quantile in a streaming/parallel fashion. I'm curious... what
approach is used in Mahout to do this?
The basic idea is a modification of Welford's technique for means. With
that method, you have a progressive update formula:

m = m + (x-m)/n

If you change that so that instead of the difference, you use a constant
then you get a formula for median:

median = median + p * signum(x-m) / n

For the correct constant value of p, this will converge to the median
about as fast as keeping all the samples and sorting them.

The trick is finding the right value of p. The theoretically correct
value is the probability density function at the current value of the
median, but we can't know that, especially at the beginning.

So the algorithm that I used was to take the first 100 samples to get a
first value of median, 25th and 75th %-iles by sorting. Then
I use the current estimate of the inter-quartile range to get my estimate
of p. This works very well except on massively skewed
distributions like Gamma(0.01, 0.01). Here the mean is 1 and no value is
negative, but the 25th %-ile is something like 10^-50.
This algorithm estimates the 25th %-ile as about 10^-20 which is good
enough for me, but does have a large relative error. See
the test cases for the real details; there is a disabled test case that
demonstrates this failing.

The wikipedia page I mentioned also has skewness and kertosis, which I
was planning to calculate as well. I didn't mention them in my post
because the solution would be essentially the same as for variance. I see
they are not in your OnlineSummarizer implementation. On the other hand,
it's not clear how important these stats are... to some extent they were
included from the viewpoint of: "if we're calculating variance, we might as
well calculate these stats, too".
Kurtosis and skewness are largely concepts from the method of moments and
are harder to interpret than min, max, mean and a few quantiles (IMNSHO).
Many people simply look at the median versus mean comparison to get a feel
for skewness. I like to take a leaf from R where the summary function does
this:
summary(rgamma(10000, 0.1, rate=0.1))
Min. 1st Qu. Median Mean
3rd Qu. Max.
2.0096837668e-48 7.6196326624e-06 6.7717972442e-03 9.9165978893e-01
3.6167000776e-01 5.2394817659e+01

--
You received this message because you are subscribed to the Google Groups "cascalog-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cascalog-user+unsubscribe@googlegroups.com.

## Search Discussions

•  at Jul 5, 2013 at 7:33 pm ⇧
In looking at the code here, I note that the variance computation uses sum
of squares.

That is very bad for numerical instability reasons. The problem arises
because you are looking for (small) differences between the square of large
numbers. When the ratio of mean to variance gets large you quickly lose
all significance and can even get a negative value for variance.

For instance, with a million samples of a normal deviate with mean of 1e9,
I get values of variance that are -128, 0 or 128 instead of the desired
value of 1.

Welford's method is far superior.

See
http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Online_algorithm

On Fri, Jul 5, 2013 at 11:34 AM, Mason wrote:

For anyone who still needs it, I implemented Gavin's approach at
https://github.com/nathanmarz/cascalog/pull/161.

On Friday, October 1, 2010 11:40:59 AM UTC-7, Ted Dunning wrote:

Ooops. The original reference that I used for this method should have
been included.

See http://citeseerx.ist.psu.**edu/viewdoc/summary?doi=10.1.**1.105.1580<http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.105.1580>

On Fri, Oct 1, 2010 at 11:39 AM, Ted Dunning wrote:

On Fri, Oct 1, 2010 at 10:25 AM, Marc Limotte wrote:

Fantastic. Quantile calculations were next on my list, so this is
fortuitous. I've been thinking about a few different options for how to
calculate quantile in a streaming/parallel fashion. I'm curious... what
approach is used in Mahout to do this?
The basic idea is a modification of Welford's technique for means. With
that method, you have a progressive update formula:

m = m + (x-m)/n

If you change that so that instead of the difference, you use a constant
then you get a formula for median:

median = median + p * signum(x-m) / n

For the correct constant value of p, this will converge to the median
about as fast as keeping all the samples and sorting them.

The trick is finding the right value of p. The theoretically correct
value is the probability density function at the current value of the
median, but we can't know that, especially at the beginning.

So the algorithm that I used was to take the first 100 samples to get a
first value of median, 25th and 75th %-iles by sorting. Then
I use the current estimate of the inter-quartile range to get my
estimate of p. This works very well except on massively skewed
distributions like Gamma(0.01, 0.01). Here the mean is 1 and no value
is negative, but the 25th %-ile is something like 10^-50.
This algorithm estimates the 25th %-ile as about 10^-20 which is good
enough for me, but does have a large relative error. See
the test cases for the real details; there is a disabled test case that
demonstrates this failing.

The wikipedia page I mentioned also has skewness and kertosis, which I
was planning to calculate as well. I didn't mention them in my post
because the solution would be essentially the same as for variance. I see
they are not in your OnlineSummarizer implementation. On the other hand,
it's not clear how important these stats are... to some extent they were
included from the viewpoint of: "if we're calculating variance, we might as
well calculate these stats, too".
Kurtosis and skewness are largely concepts from the method of moments
and are harder to interpret than min, max, mean and a few quantiles
(IMNSHO). Many people simply look at the median versus mean comparison to
get a feel for skewness. I like to take a leaf from R where the summary
function does this:
summary(rgamma(10000, 0.1, rate=0.1))
Min. 1st Qu. Median Mean
3rd Qu. Max.
2.0096837668e-48 7.6196326624e-06 6.7717972442e-03 9.9165978893e-01
3.6167000776e-01 5.2394817659e+01

--
You received this message because you are subscribed to the Google Groups
"cascalog-user" group.
To unsubscribe from this group and stop receiving emails from it, send an

--
You received this message because you are subscribed to the Google Groups "cascalog-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cascalog-user+unsubscribe@googlegroups.com.
•  at Jul 5, 2013 at 8:38 pm ⇧
Yes. My change is to add the function "sample-variance-parallel", which
should be stable; I left in "variance" and "sample-variance" to
demonstrate the numerical problems (the tests should show this), and as
a result the PR is a bit sloppy and confusing.
On 7/5/13 12:32 PM, Ted Dunning wrote:

In looking at the code here, I note that the variance computation uses
sum of squares.

That is very bad for numerical instability reasons. The problem
arises because you are looking for (small) differences between the
square of large numbers. When the ratio of mean to variance gets
large you quickly lose all significance and can even get a negative
value for variance.

For instance, with a million samples of a normal deviate with mean of
1e9, I get values of variance that are -128, 0 or 128 instead of the
desired value of 1.

Welford's method is far superior.

See
http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Online_algorithm

On Fri, Jul 5, 2013 at 11:34 AM, Mason wrote:

For anyone who still needs it, I implemented Gavin's approach at
https://github.com/nathanmarz/cascalog/pull/161.

On Friday, October 1, 2010 11:40:59 AM UTC-7, Ted Dunning wrote:

Ooops. The original reference that I used for this method
should have been included.

See
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.105.1580

On Fri, Oct 1, 2010 at 11:39 AM, Ted Dunning
wrote:

On Fri, Oct 1, 2010 at 10:25 AM, Marc Limotte
wrote:

Fantastic. Quantile calculations were next on my
list, so this is fortuitous. I've been thinking about
a few different options for how to calculate quantile
in a streaming/parallel fashion. I'm curious... what
approach is used in Mahout to do this?

The basic idea is a modification of Welford's technique
for means. With that method, you have a progressive
update formula:

m = m + (x-m)/n

If you change that so that instead of the difference, you
use a constant then you get a formula for median:

median = median + p * signum(x-m) / n

For the correct constant value of p, this will converge to
the median about as fast as keeping all the samples and
sorting them.

The trick is finding the right value of p. The
theoretically correct value is the probability density
function at the current value of the
median, but we can't know that, especially at the beginning.

So the algorithm that I used was to take the first 100
samples to get a first value of median, 25th and 75th
%-iles by sorting. Then
I use the current estimate of the inter-quartile range to
get my estimate of p. This works very well except on
massively skewed
distributions like Gamma(0.01, 0.01). Here the mean is 1
and no value is negative, but the 25th %-ile is something
like 10^-50.
This algorithm estimates the 25th %-ile as about 10^-20
which is good enough for me, but does have a large
relative error. See
the test cases for the real details; there is a disabled
test case that demonstrates this failing.

kertosis, which I was planning to calculate as well.
I didn't mention them in my post because the solution
would be essentially the same as for variance. I see
they are not in your OnlineSummarizer implementation.
On the other hand, it's not clear how important these
stats are... to some extent they were included from
the viewpoint of: "if we're calculating variance, we
might as well calculate these stats, too".

Kurtosis and skewness are largely concepts from the method
of moments and are harder to interpret than min, max, mean
and a few quantiles (IMNSHO). Many people simply look at
the median versus mean comparison to get a feel for
skewness. I like to take a leaf from R where the summary
function does this:
summary(rgamma(10000, 0.1, rate=0.1))
Min. 1st Qu. Median
Mean 3rd Qu. Max.
2.0096837668e-48 7.6196326624e-06 6.7717972442e-03
9.9165978893e-01 3.6167000776e-01 5.2394817659e+01

--
Groups "cascalog-user" group.
To unsubscribe from this group and stop receiving emails from it,

--
Groups "cascalog-user" group.
To unsubscribe from this group and stop receiving emails from it, send
--
You received this message because you are subscribed to the Google Groups "cascalog-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cascalog-user+unsubscribe@googlegroups.com.

## Related Discussions

Discussion Overview
 group cascalog-user categories clojure, hadoop posted Jul 5, '13 at 6:34p active Jul 5, '13 at 8:38p posts 3 users 2 website clojure.org irc #clojure

### 2 users in discussion

Content

People

Support

Translate

site design / logo © 2021 Grokbase