I am experimenting with Twitter's Algebird project in a Cascalog query to
do some set approximation with HyperLogLog. I have this example, which
works great:

(use 'cascalog.api)
(require '(cascalog [ops :as c]))
(import [com.twitter.algebird HyperLogLogMonoid HLL])

(def animals
   [["barn1" "dog"]
    ["barn1" "pig"]
    ["barn1" "horse"]
    ["barn2" "giraffe"]
    ["barn1" "pig"]
    ["barn2" "aardvark"]
    ["barn1" "pig"]
    ["barn1" "camel"]
    ["barn1" "duck"]
    ["barn2" "pig"]
    ["barn2" nil]
    ["barn1" "cat"]])

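;; init-var: turn one value into an HLL (the monoid's zero for nil).
(defn hll-create
   [^HyperLogLogMonoid hll ^String s]
   [(if (nil? s) (.zero hll) (.create hll (.getBytes s)))])

;; combine-var: merge two partial HLLs.
(defn hll-plus
   [^HLL x ^HLL y]
   [(.$plus x y)])

(defparallelagg hll-unique
   :init-var #'hll-create
   :combine-var #'hll-plus)

;; Convert the final HLL into an approximate distinct count.
(defn hll-estimate-cardinality
   [^HLL hll]
   (int (.estimatedSize hll)))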

(time (?<- (stdout)
         [?barn ?unique-animals]
         (identity (HyperLogLogMonoid. 12) :> ?hll-monoid)
         (animals ?barn !animal)
         (hll-unique ?hll-monoid !animal :> ?animal-hll)
         (hll-estimate-cardinality ?animal-hll :> ?unique-animals)))


RESULTS
-----------------------
barn1 6
barn2 3
-----------------------
"Elapsed time: 3691.545 msecs"

However, there is a serious performance issue that I don't understand. When
running the query over a larger dataset, the HyperLogLog approach takes
significantly longer than cascalog.ops/distinct-count. This has
brought up a few questions:

1.) Is there a way to get a "query plan" from Cascalog? I would like to
know for sure which portions of my query are being run in the map and
reduce phases.

2.) I have little experience with Scala, I am still learning Cascalog/Clojure,
and this is my first time using Algebird. Am I doing something
fundamentally wrong that would degrade performance? I am having a tough
time finding examples of using Algebird from Clojure; if anybody knows of any,
I'd like to take a look at them.

Thanks in advance!
Tom

--
You received this message because you are subscribed to the Google Groups "cascalog-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cascalog-user+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

  • Jeroen van Dijk at Jul 10, 2013 at 3:10 pm
    Hi Tom,



    On Wed, Jul 10, 2013 at 4:35 PM, Thomas Norden wrote:

    1.) Is there a way to get a "query plan" from cascalog? I would like to
    know for sure which portions of my query are being run in the map and
    reduce phases.

    As of Cascalog 1.10.1 there is `explain`, which creates a dot file; see
    https://github.com/nathanmarz/cascalog/wiki/Cascading-Flow-visualization


    2.) I have little experience with scala, still learning cascalog/clojure
    and this is my first time using algebird. Am I doing something
    fundamentally wrong that would degrade performance? I am having a tough
    time finding examples using algebird in clojure, if anybody knows of any
    I'd like to take a look at them.

    You should probably benchmark on a bigger data set, since a big part of
    those 3 seconds is probably just startup time. I don't see anything
    suspicious in the code, but I'm not sure how well the Algebird
    library performs in combination with Cascalog. You might want to benchmark it
    against the HyperLogLog implementation in
    https://github.com/clearspring/stream-lib . Here is some of the code I use:

    (import '[com.clearspring.analytics.stream.cardinality HyperLogLog])

    (defn construct
       ([init-value]
         (doto (HyperLogLog. 12) (.offer init-value)))
       ([std-dev init-value]
         (doto (HyperLogLog. (double std-dev)) (.offer init-value))))

    (defn merge [^HyperLogLog first-hlog & hlogs]
       (.merge ^HyperLogLog first-hlog (into-array HyperLogLog hlogs)))

    (defparallelagg agg-hyperloglog
       :init-var #'hll/construct
       :combine-var #'hll/merge)
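
For readers new to defparallelagg, the init/combine contract used in the aggregators in this thread can be mimicked in plain Clojure. This is only a rough sketch: exact sets stand in for HLLs, there is no Cascalog involved, and `init-state`, `combine-states`, and `aggregate` are made-up names. The point is that when the combine step is associative and commutative, each split can be pre-aggregated on its own and the partial states merged afterwards.

```clojure
;; Plain-Clojure stand-in for a parallel aggregator (no Cascalog, no HLL).
;; init-state turns one value into a partial state; combine-states merges two.
(defn init-state [x] #{x})
(defn combine-states [a b] (into a b))

(defn aggregate
  "Folds a non-empty collection of values into a single state."
  [values]
  (reduce combine-states (map init-state values)))

;; barn1's animals from the example data, in their original order.
(def barn1 ["dog" "pig" "horse" "pig" "pig" "camel" "duck" "cat"])

;; Pretend two tasks each pre-aggregate their own split of the input,
;; and a final step combines the partial states.
(def partials (map aggregate (partition-all 4 barn1)))
(def combined (reduce combine-states partials))

(println (count combined))                ; prints 6
(println (= combined (aggregate barn1)))  ; prints true
```

The combined state is the same whether the data is folded in one pass or split first, which is the property a parallel aggregator depends on.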


    HTH,

    Jeroen
  • Thomas Norden at Jul 10, 2013 at 3:51 pm
    Thanks Jeroen

    This is good stuff. I was aware of Clearspring's library but haven't
    tested it out; I'll try that this afternoon.

    I have tested this on a larger set of user ids (~15 million), and
    the job crawls during the reduce phase.
  • David Kincaid at Jul 10, 2013 at 6:35 pm
    I don't mean to hijack this thread, but I've been thinking about how to
    apply the HLL algorithm to some of the work that I'm doing, and it isn't clear
    to me right now what the benefit is in a MapReduce setting. From the
    examples you pasted, it looks like you're reading through all your data
    anyway, so why not just count it and get a 100% accurate count? What is the
    benefit of using HLL in a MapReduce/batch setting?

    Thanks,

    Dave
  • Thomas Norden at Jul 10, 2013 at 7:57 pm
    If things work out the way I hope they will, I should be able to batch
    create the HLL object for each day, and merge days on a
    weekly/monthly/yearly scale without having to read the entire data set
    again. I would like to provide data in a more ad hoc way, such that I
    don't have to recalculate uniques if the client wants to roll the data up
    in an alternative way.
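
The rollup idea above depends on HLL sketches being mergeable without going back to the raw data. Here is a minimal illustration of that property in plain Clojure, assuming a toy register layout. The names `p`, `add-item`, `sketch`, and `merge-sketches` are made up for this example and are not Algebird's or stream-lib's API, and the cardinality estimate itself is left out.

```clojure
;; Toy HyperLogLog-style registers, for illustration only.
;; This is NOT Algebird's or stream-lib's implementation.
(def p 4)                         ; index bits, so 2^4 = 16 registers
(def m (bit-shift-left 1 p))

(defn- unsigned-hash
  "Clojure's 32-bit hash of x as a non-negative long."
  [x]
  (bit-and (long (hash x)) 0xFFFFFFFF))

(defn add-item
  "Records x: the top p hash bits pick a register, which keeps the largest
  rank (position of the first 1 bit) seen in the remaining bits."
  [registers x]
  (let [h         (unsigned-hash x)
        idx       (bit-shift-right h (- 32 p))
        remaining (bit-and h (dec (bit-shift-left 1 (- 32 p))))
        rank      (inc (- (Long/numberOfLeadingZeros remaining) (+ 32 p)))]
    (assoc registers idx (max (registers idx) rank))))

(defn sketch
  "Builds a register vector from a collection of items."
  [items]
  (reduce add-item (vec (repeat m 0)) items))

(defn merge-sketches
  "Merging is an element-wise max, so it never needs the raw items."
  [a b]
  (mapv max a b))

;; Hypothetical user ids for two days, with some repeat viewers.
(def day1 ["u1" "u2" "u3" "u4"])
(def day2 ["u3" "u4" "u5"])

(def two-days (merge-sketches (sketch day1) (sketch day2)))

;; The merged daily sketches equal a sketch built from all the raw data.
(println (= two-days (sketch (concat day1 day2)))) ; prints true
```

Because each register only keeps a maximum, per-day sketches could be stored once and combined later into weekly, monthly, or other rollups.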
  • David Kincaid at Jul 10, 2013 at 8:24 pm
    Thanks for the explanation. Maybe I'm just not seeing your use case well
    enough, but if you have the counts for each day can't you just sum them
    together to get weekly/monthly/yearly counts without having to read the
    entire data set?

    I'm not intentionally being obtuse here. I'm truly trying to understand the
    use case for HLL in a batch processing environment. It's a very clever
    algorithm and a lot of people seem to be using it, so it must be useful.
    I'm just not seeing it for some reason.

    Thanks again,

    Dave
  • Thomas Norden at Jul 10, 2013 at 8:54 pm
    I am interested in unique counts. I work for a video analytics company, and
    our clients are interested in unique viewers vs. repeat viewers. If I already
    have the unique viewers for one day, I cannot simply add that count to another
    day's. If the same user views a video on day 1 and again on day 2 and I sum the
    two unique counts, it looks like there were two unique viewers, but if I recount
    (or merge the two HLL sets) for the two days together, there is only one
    unique viewer.
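
The day 1 / day 2 case can be checked with a few lines of plain Clojure. This is only an illustration: exact sets stand in for the per-day HLLs, and the viewer name is made up.

```clojure
(require '[clojure.set :as set])

;; Exact sets stand in for per-day HLL sketches here.
(def day1-viewers #{"alice"})
(def day2-viewers #{"alice"})

;; Summing the per-day unique counts counts the repeat viewer twice.
(def summed (+ (count day1-viewers) (count day2-viewers)))

;; Merging the underlying sets first gives the two-day unique count.
(def merged (count (set/union day1-viewers day2-viewers)))

(println summed merged) ; prints "2 1"
```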

    On Wednesday, July 10, 2013 4:24:08 PM UTC-4, David Kincaid wrote:

    Thanks for the explanation. Maybe I'm just not seeing your use case well
    enough, but if you have the counts for each day can't you just sum them
    together to get weekly/monthly/yearly counts without having to read the
    entire data set?

    I'm not intentionally being obtuse here. I'm truly trying to understand
    the use case for HLL in a batch processing environment. It's a very clever
    algorithm and a lot of people seem to be using it, so it must be useful.
    I'm just not seeing it for some reason.

    Thanks again,

    Dave
    On Wednesday, July 10, 2013 2:57:04 PM UTC-5, Thomas Norden wrote:

    If things work out the way I hope they will, I should be able to batch
    create the HLL object for each day, and merge days on a
    weekly/monthly/yearly scale without having to read the entire data set
    again. I would like to provide data in a more ad hoc way, such that I
    don't have to recalculate uniques if the client wants to roll the data up
    in an alternative way.
    On Wednesday, July 10, 2013 2:35:11 PM UTC-4, David Kincaid wrote:

    I don't mean to hijack this thread, but I've been thinking about how to
    apply HLL algorithm to some of the work that I'm doing and it isn't clear
    to me right now what the benefit is in a MapReduce setting. From the
    examples you pasted it looks like you're reading through all your data
    anyway, so why not just count it and get a 100% accurate count? What is the
    benefit of using HLL in a MapReduce/batch setting?

    Thanks,

    Dave
    On Wednesday, July 10, 2013 10:50:59 AM UTC-5, Thomas Norden wrote:

    Thanks Jeroen

    This is good stuff. I was aware of clearspring's library but I haven't
    tested it out, I'll try that this afternoon.

    I have tested this out on a larger larger set (~15 million) user ids
    and the job crawls during the reduce phase.
    On Wednesday, July 10, 2013 11:10:14 AM UTC-4, Jeroen van Dijk wrote:

    Hi Tom,



    On Wed, Jul 10, 2013 at 4:35 PM, Thomas Norden wrote:

    I am experimenting with twitter's algebird project in a cascalog
    query to do some set approximation with HyperLogLog. I have this example
    which works great:

    (use 'cascalog.api)
    (require '(cascalog [ops :as c]))
    (import [com.twitter.algebird HyperLogLogMonoid HLL])

    (def animals
    [["barn1" "dog"]
    ["barn1" "pig"]
    ["barn1" "horse"]
    ["barn2" "giraffe"]
    ["barn1" "pig"]
    ["barn2" "aardvark"]
    ["barn1" "pig"]
    ["barn1" "camel"]
    ["barn1" "duck"]
    ["barn2" "pig"]
    ["barn2" nil]
    ["barn1" "cat"]])

    (defn hll-create
    [^HyperLogLogMonoid hll ^String s]
    [(if (= s nil) (.zero hll) (.create hll (.getBytes s)))])

    (defn hll-plus
    [^HLL x ^HLL y]
    [(.$plus x y)])

    (defparallelagg hll-unique
    :init-var #'hll-create
    :combine-var #'hll-plus)

    (defn hll-estimate-cardinality
    [^HLL hll]
    (int (.estimatedSize hll)))

    (time (?<- (stdout)
    [?barn ?unique-animals]
    (identity (HyperLogLogMonoid. 12) :> ?hll-monoid)
    (animals ?barn !animal)
    (hll-unique ?hll-monoid !animal :> ?animal-hll)
    (hll-estimate-cardinality ?animal-hll :> ?unique-animals)))


    RESULTS
    -----------------------
    barn1 6
    barn2 3
    -----------------------
    "Elapsed time: 3691.545 msecs"

    However there is a serious performance issue that I don't understand.
    When running the query over a larger dataset the HyperLogLog approach
    takes significantly longer than using cascalog.ops/distinct-count. This
    has brought up a few questions:

    1.) Is there a way to get a "query plan" from cascalog? I would like
    to know for sure which portions of my query are being run in the map and
    reduce phases.
    From Cascalog 1.10.1 there is `explain which creates a dotfile, see
    https://github.com/nathanmarz/cascalog/wiki/Cascading-Flow-visualization


    2.) I have little experience with scala, still learning
    cascalog/clojure and this is my first time using algebird. Am I doing
    something fundamentally wrong that would degrade performance? I am having
    a tough time finding examples using algebird in clojure, if anybody knows
    of any I'd like to take a look at them.
    You should probably try to benchmark on a bigger set since a big part
    of those 3 seconds is probably needed just for booting. I don't see
    anything suspicious with the code, but I'm not sure how performant the
    algebird library is in combination with Cascalog. You might want to
    benchmark it against the Hyperloglog implementation of
    https://github.com/clearspring/stream-lib . Here is some of the code
    I use:

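    ;; `hll` below is assumed to be this namespace (or an alias for it).
    (import '[com.clearspring.analytics.stream.cardinality HyperLogLog])

    ;; Build a sketch seeded with a single value. The one-argument arity
    ;; passes 12 to the int constructor; the two-argument arity passes a
    ;; double (std-dev) to the other constructor.
    (defn construct
      ([init-value]
       (doto (HyperLogLog. 12) (.offer init-value)))
      ([std-dev init-value]
       (doto (HyperLogLog. (double std-dev)) (.offer init-value))))

    ;; Combine sketches. Note that this name shadows clojure.core/merge
    ;; inside the namespace.
    (defn merge [^HyperLogLog first-hlog & hlogs]
      (.merge first-hlog (into-array HyperLogLog hlogs)))

    (defparallelagg agg-hyperloglog
      :init-var #'hll/construct
      :combine-var #'hll/merge)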


    HTH,

    Jeroen
    Thanks in advance!
    Tom

  • David Kincaid at Jul 10, 2013 at 9:16 pm
    Of course. That makes perfect sense. Thank you for the explanation.
    On Wednesday, July 10, 2013 3:54:58 PM UTC-5, Thomas Norden wrote:

    I am interested in unique counts. I work for a video analytics company,
    and clients are interested in unique viewers vs. repeat viewers. If I
    already have the unique viewers for one day, I cannot simply add that
    count to another day's. If the same user views a video on day 1 and again
    on day 2 and I sum the two unique counts, it looks like there were two
    unique viewers, but if I recount (or merge two HLL sets) for the two days
    together there would be only one unique viewer.
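
The point above can be checked with exact sets standing in for HLL sketches. This is only an illustration: the viewer ids are hypothetical, and `clojure.set/union` plays the role that an HLL merge would play on approximate sketches.

```clojure
(require '[clojure.set :as set])

;; Hypothetical viewer ids per day; "bob" watches on both days.
(def day1-viewers #{"alice" "bob" "carol"})
(def day2-viewers #{"bob" "dave"})

;; Adding the daily unique counts counts "bob" twice.
(def summed-uniques
  (+ (count day1-viewers) (count day2-viewers)))

;; Merging the two days first and then counting gives the true number.
(def merged-uniques
  (count (set/union day1-viewers day2-viewers)))

(println "sum of daily uniques:   " summed-uniques)  ; 5
(println "uniques over both days: " merged-uniques)  ; 4
```

Exact sets grow with the number of viewers, which is the cost an HLL sketch avoids by keeping a fixed-size summary per day.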

    On Wednesday, July 10, 2013 4:24:08 PM UTC-4, David Kincaid wrote:

    Thanks for the explanation. Maybe I'm just not seeing your use case well
    enough, but if you have the counts for each day can't you just sum them
    together to get weekly/monthly/yearly counts without having to read the
    entire data set?

    I'm not intentionally being obtuse here. I'm truly trying to understand
    the use case for HLL in a batch processing environment. It's a very clever
    algorithm and a lot of people seem to be using it, so it must be useful.
    I'm just not seeing it for some reason.

    Thanks again,

    Dave
    On Wednesday, July 10, 2013 2:57:04 PM UTC-5, Thomas Norden wrote:

    If things work out the way I hope they will, I should be able to batch
    create the HLL object for each day, and merge days on a
    weekly/monthly/yearly scale without having to read the entire data set
    again. I would like to provide data in a more ad hoc way, such that I
    don't have to recalculate uniques if the client wants to roll the data up
    in an alternative way.
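
Why per-day sketches can be rolled up without re-reading the data can be shown with a toy register structure. This is not algebird or stream-lib code: the names `add-item`, `sketch-of`, and `merge-sketches` are made up, the cardinality estimator is left out, and Clojure's built-in `hash` stands in for a proper hash function. The only thing the sketch demonstrates is that merging is an element-wise max over registers, so merging two daily sketches gives the same registers as sketching both days' data in one pass.

```clojure
;; Toy HyperLogLog-style registers, for illustration only.
(def p 4)                        ; number of index bits
(def m (bit-shift-left 1 p))     ; 2^p = 16 registers

(defn empty-sketch []
  (vec (repeat m 0)))

(defn add-item
  "Record x: the low p bits of its hash pick a register, and the register
  keeps the largest rank (1 + trailing zeros of the remaining bits) seen."
  [sketch x]
  (let [h    (bit-and (long (hash x)) 0xFFFFFFFF)   ; unsigned 32-bit hash
        idx  (bit-and h (dec m))
        rank (inc (min (- 32 p)
                       (Long/numberOfTrailingZeros (bit-shift-right h p))))]
    (update-in sketch [idx] max rank)))

(defn sketch-of [items]
  (reduce add-item (empty-sketch) items))

(defn merge-sketches
  "Merging is an element-wise max, so it never needs the original items."
  [a b]
  (mapv max a b))

;; Hypothetical viewer ids for two days.
(def day1 ["alice" "bob" "carol"])
(def day2 ["bob" "dave"])

(def rolled-up (merge-sketches (sketch-of day1) (sketch-of day2)))
(def recomputed (sketch-of (concat day1 day2)))

(println "merge equals recompute:" (= rolled-up recomputed))
```

Because max is associative and commutative, the same merge works for any grouping of days into weeks, months, or years, and each stored sketch stays at `m` registers regardless of how many ids it has seen.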
    On Wednesday, July 10, 2013 2:35:11 PM UTC-4, David Kincaid wrote:

    I don't mean to hijack this thread, but I've been thinking about how to
    apply the HLL algorithm to some of the work that I'm doing, and it isn't
    clear to me right now what the benefit is in a MapReduce setting. From the
    examples you pasted it looks like you're reading through all your data
    anyway, so why not just count it and get a 100% accurate count? What is
    the benefit of using HLL in a MapReduce/batch setting?

    Thanks,

    Dave
    On Wednesday, July 10, 2013 10:50:59 AM UTC-5, Thomas Norden wrote:

    Thanks Jeroen

    This is good stuff. I was aware of clearspring's library but I
    haven't tested it out, I'll try that this afternoon.

    I have tested this out on a larger set (~15 million user ids)
    and the job crawls during the reduce phase.
  • Jeroen van Dijk at Jul 10, 2013 at 9:57 pm
    Without benchmarking, I would guess it is more efficient to use
    HyperLogLog, because you have a small data structure to summarize the
    data instead of having to collect a huge set of identifiers.

    The reason I'm using HyperLogLog values is that I'm actually exporting
    these values to ElephantDB for ad hoc queries later on.

    Jeroen

  • Jeroen van Dijk at Jul 10, 2013 at 9:59 pm
    Sorry, I missed the above conversation. I already have something like
    what Thomas described, and it works quite well.

  • Thomas Norden at Jul 12, 2013 at 7:04 pm
    A co-worker sent this video to me. It's a pretty good talk about devs
    trying to implement algorithms from academic papers:
    http://www.youtube.com/watch?v=r9P8G7_avv4

    I can relate a lot to this guy.

Discussion Overview
group: cascalog-user
categories: clojure, hadoop
posted: Jul 10, '13 at 2:35p
active: Jul 12, '13 at 7:04p
posts: 11
users: 3
website: clojure.org
irc: #clojure
