I've written a (trivial) aggregator for mean, and another for variance:

(defaggregateop mean
  "Aggregates the arithmetic mean of its input."
  ([] [0 0])
  ([[sum count] val] [(+ sum val) (inc count)])
  ([[sum count]] [(float (/ sum count))]))

(defaggregateop var
  "Aggregates the variance of its input."
  ([] [0 0 0])
  ([[sumsq sum count] val]
   [(+ sumsq (* val val)) (+ sum val) (inc count)])
  ([[sumsq sum count]]
   (let [mean (float (/ sum count))]
     [(- (float (/ sumsq count)) (* mean mean))])))

I'd like to do two things:

1) combine them into one aggregator that returns both (and thus
accumulates sum and count once only);
2) implement them with a combiner.

I've seen the definition of sum (each sum-parallel), where sum-parallel is
a defparallelop, but I don't see how that applies here, since defparallelop
doesn't allow maintaining state, and I don't see any other way to use a
combiner. Are one or both of these problems (easily) solvable?

Many thanks (still learning!),

Mike


  • Nathanmarz at Nov 9, 2011 at 5:41 am
    Hey Mike,

    Here's how to do mean and variance using combiners:

    https://gist.github.com/1350522

    Note that the implementation of "avg" is from cascalog.ops. This
    implementation makes use of predicate macros, which allow for the
    arbitrary composition of predicates. So even though "avg" itself can't
    be defined as a parallel aggregator, its components "count" and "sum"
    can, and they can be composed with "div" to produce an optimized
    version of the aggregator. A similar approach is taken for variance.

    Predicate macros, as you can see, are really powerful.

    -Nathan
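
The reason this decomposition is combiner-friendly can be sketched in plain Clojure, without Cascalog. This is only an illustration under stated assumptions: the helper names (init-stats, merge-stats, finish-stats, summarize) are hypothetical and are not part of Cascalog or of the gist. The point is that count, sum, and sum of squares can each be merged by addition, so partial results from different chunks can be combined in any order, and mean and variance are derived once at the end from the same totals.

```clojure
;; Hypothetical helper names; plain Clojure, no Cascalog required.

(defn init-stats
  "Sufficient statistics for a single value: [count sum sum-of-squares]."
  [x]
  [1 x (* x x)])

(defn merge-stats
  "Combines two partial aggregates element-wise. Addition is associative
  and commutative, so partials can be merged in any order or grouping."
  [a b]
  (mapv + a b))

(defn summarize
  "Aggregates one chunk of values into a partial [count sum sumsq]."
  [xs]
  (reduce merge-stats [0 0 0] (map init-stats xs)))

(defn finish-stats
  "Derives the mean and the population (biased) variance from the totals.
  Assumes at least one value was aggregated."
  [[n sum sumsq]]
  (let [mean (double (/ sum n))]
    {:mean mean
     :variance (- (double (/ sumsq n)) (* mean mean))}))

;; Two chunks summarized separately, then merged, as a combiner would do.
(finish-stats (merge-stats (summarize [1 2 3]) (summarize [4 5])))
;; => {:mean 3.0, :variance 2.0}
```

Merging the two partials gives the same totals as summarizing all five values at once, which is what makes the approach safe to run map-side.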

  • R Daneel at Nov 9, 2011 at 5:48 am
    Thanks very much!!! I'll be digesting this for a while :)

    Mike
  • R Daneel at Nov 9, 2011 at 7:19 am
    I'm just playing around and have added another predicate that's composed
    with the variance:

    (def unbiased-variance
      (<- [!val :> !var]
          (variance !val :> !j)
          (c/count !count)
          (* !j !count :> !k)
          (- !count 1 :> !l)
          (/ !k !l :> !var)))

    but I had to recalculate the !count variable. Is Cascalog smart enough to
    avoid actually accumulating !count twice?
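
For reference, the arithmetic behind the query above can be written out as follows. The symbols are assumptions introduced here for illustration: n stands for !count, \bar{x} for the mean, \sigma^2 for the population variance bound to !j, and s^2 for the unbiased result bound to !var.

```latex
\[
\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2,
\qquad
s^2 = \frac{n}{n-1}\,\sigma^2 .
\]
% Worked example for the values 1, 2, 3, 4, 5:
\[
\sigma^2 = \frac{55}{5} - 3^2 = 2,
\qquad
s^2 = \frac{5}{4}\cdot 2 = 2.5 .
\]
```

Note that the denominator n - 1 is zero when a group holds a single value, so such groups need separate handling.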
  • Nathan Marz at Nov 9, 2011 at 7:34 am
    Not yet. I opened up a ticket for Cascalog to detect duplicate operations
    and rewrite the query to avoid wasted work:

    https://www.assembla.com/spaces/cascalog/tickets/31-cascalog-should-detect-duplicate-operations-and-rewrite-the-query-to-avoid-wasted-work


    --
    Twitter: @nathanmarz
    http://nathanmarz.com
  • R Daneel at Nov 9, 2011 at 6:14 pm
    Great! Thanks again: predicate macros are awesome :)

Group: cascalog-user