at https://github.com/nathanmarz/cascalog/pull/161.

On Friday, October 1, 2010 11:40:59 AM UTC-7, Ted Dunning wrote:

Ooops. The original reference that I used for this method should have

been included.

See http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.105.1580

On Fri, Oct 1, 2010 at 11:39 AM, Ted Dunning <ted.d...@gmail.com<javascript:>

--Ooops. The original reference that I used for this method should have

been included.

See http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.105.1580

On Fri, Oct 1, 2010 at 11:39 AM, Ted Dunning <ted.d...@gmail.com<javascript:>

wrote:

On Fri, Oct 1, 2010 at 10:25 AM, Marc Limotte <msli...@gmail.com<javascript:>

that method, you have a progressive update formula:

m = m + (x-m)/n

If you change that so that instead of the difference, you use a constant

then you get a formula for median:

median = median + p * signum(x-m) / n

For the correct constant value of p, this will converge to the median

about as fast as keeping all the samples and sorting them.

The trick is finding the right value of p. The theoretically correct

value is the probability density function at the current value of the

median, but we can't know that, especially at the beginning.

So the algorithm that I used was to take the first 100 samples to get a

first value of median, 25th and 75th %-iles by sorting. Then

I use the current estimate of the inter-quartile range to get my estimate

of p. This works very well except on massively skewed

distributions like Gamma(0.01, 0.01). Here the mean is 1 and no value is

negative, but the 25th %-ile is something like 10^-50.

This algorithm estimates the 25th %-ile as about 10^-20 which is good

enough for me, but does have a large relative error. See

the test cases for the real details; there is a disabled test case that

demonstrates this failing.

are harder to interpret than min, max, mean and a few quantiles (IMNSHO).

Many people simply look at the median versus mean comparison to get a feel

for skewness. I like to take a leaf from R where the summary function does

this:

3rd Qu. Max.

2.0096837668e-48 7.6196326624e-06 6.7717972442e-03 9.9165978893e-01

3.6167000776e-01 5.2394817659e+01

On Fri, Oct 1, 2010 at 10:25 AM, Marc Limotte <msli...@gmail.com<javascript:>

wrote:

Fantastic. Quantile calculations were next on my list, so this is

fortuitous. I've been thinking about a few different options for how to

calculate quantile in a streaming/parallel fashion. I'm curious... what

approach is used in Mahout to do this?

The basic idea is a modification of Welford's technique for means. WithFantastic. Quantile calculations were next on my list, so this is

fortuitous. I've been thinking about a few different options for how to

calculate quantile in a streaming/parallel fashion. I'm curious... what

approach is used in Mahout to do this?

that method, you have a progressive update formula:

m = m + (x-m)/n

If you change that so that instead of the difference, you use a constant

then you get a formula for median:

median = median + p * signum(x-m) / n

For the correct constant value of p, this will converge to the median

about as fast as keeping all the samples and sorting them.

The trick is finding the right value of p. The theoretically correct

value is the probability density function at the current value of the

median, but we can't know that, especially at the beginning.

So the algorithm that I used was to take the first 100 samples to get a

first value of median, 25th and 75th %-iles by sorting. Then

I use the current estimate of the inter-quartile range to get my estimate

of p. This works very well except on massively skewed

distributions like Gamma(0.01, 0.01). Here the mean is 1 and no value is

negative, but the 25th %-ile is something like 10^-50.

This algorithm estimates the 25th %-ile as about 10^-20 which is good

enough for me, but does have a large relative error. See

the test cases for the real details; there is a disabled test case that

demonstrates this failing.

The wikipedia page I mentioned also has skewness and kertosis, which I

was planning to calculate as well. I didn't mention them in my post

because the solution would be essentially the same as for variance. I see

they are not in your OnlineSummarizer implementation. On the other hand,

it's not clear how important these stats are... to some extent they were

included from the viewpoint of: "if we're calculating variance, we might as

well calculate these stats, too".

Kurtosis and skewness are largely concepts from the method of moments andwas planning to calculate as well. I didn't mention them in my post

because the solution would be essentially the same as for variance. I see

they are not in your OnlineSummarizer implementation. On the other hand,

it's not clear how important these stats are... to some extent they were

included from the viewpoint of: "if we're calculating variance, we might as

well calculate these stats, too".

are harder to interpret than min, max, mean and a few quantiles (IMNSHO).

Many people simply look at the median versus mean comparison to get a feel

for skewness. I like to take a leaf from R where the summary function does

this:

summary(rgamma(10000, 0.1, rate=0.1))

Min. 1st Qu. Median Mean3rd Qu. Max.

2.0096837668e-48 7.6196326624e-06 6.7717972442e-03 9.9165978893e-01

3.6167000776e-01 5.2394817659e+01

You received this message because you are subscribed to the Google Groups "cascalog-user" group.

To unsubscribe from this group and stop receiving emails from it, send an email to cascalog-user+unsubscribe@googlegroups.com.

For more options, visit https://groups.google.com/groups/opt_out.