What I would like to do is normalize a set of values from a subquery.

(defn normalize
   [n min max]
   (/ (- n min) (- max min)))

Say that my subquery produces values like:

(def values [[1] [2] [3]])

I would like to feed the values through my normalize function with the
min/max of values across the entire data set. However, when I use
cascalog.ops/min and cascalog.ops/max the values are grouped together.

(?<- (stdout) [?v ?min-v ?max-v]
               (values ?v)
               (c/min ?v :> ?min-v)
               (c/max ?v :> ?max-v))

RESULTS
-----------------------
1 1 1
2 2 2
3 3 3
-----------------------

What I would like to see is:

RESULTS
-----------------------
1 1 3
2 1 3
3 1 3
-----------------------

At this point the dataset is small enough that I could do it post-cascalog
with some other code but I would like to see if I can keep this inside of
cascalog.

Thanks,
Tom

--
You received this message because you are subscribed to the Google Groups "cascalog-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cascalog-user+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

  • Thomas Norden at Jul 8, 2013 at 12:16 pm
    I have a solution that works in this simple example, but I have a feeling
    it would cause my subquery that produces values to be run twice.

    (use 'cascalog.api)
    (require '(cascalog [ops :as c]))

    (def values
       [[1]
        [2]
        [3]])

    (defn normalize
       [n min max]
       (/ (- n min) (- max min)))

    (?<- (stdout) [?v ?min-v ?max-v ?norm-v]
       (values ?v)
       ((<- [?min-v ?max-v]
            (values ?v)
            (c/min ?v :> ?min-v)
            (c/max ?v :> ?max-v))
        ?min-v ?max-v)
       (normalize ?v ?min-v ?max-v :> ?norm-v)
       (cross-join))

    RESULTS
    -----------------------
    1 1 3 0
    2 1 3 1/2
    3 1 3 1
    -----------------------

    If I swap values out for my subquery, would this mean that my subquery
    needs to be run twice: once to get the values and again to cross-join with
    the min and max?

  • David Kincaid at Jul 8, 2013 at 1:39 pm
    Thomas, the most important information that I read about Cascalog was the
    section in the wiki called "How cascalog executes a query" (
    https://github.com/nathanmarz/cascalog/wiki/How-cascalog-executes-a-query).
    Once you read that and apply the steps to your queries you'll better
    understand what Cascalog is doing.

    In your example, it is indeed grouping on the ?v var when the min and max
    aggregations are done, because ?v is an output var that has already been
    satisfied going into the aggregation step.
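
    As a rough illustration of that grouping effect, here is a plain-Clojure
    sketch (not Cascalog code). The names per-group and global are
    hypothetical, and group-by only stands in for the grouping step; it is
    not how Cascalog is implemented.

    ```clojure
    ;; Plain-Clojure illustration only; this is not Cascalog code.
    (def values [[1] [2] [3]])

    ;; Grouping on the value itself: every group holds a single value,
    ;; so the min and max of each group are just that value.
    (def per-group
      (for [[v tuples] (sort-by key (group-by first values))]
        [v (apply min (map first tuples)) (apply max (map first tuples))]))

    ;; No grouping key: one group covers the whole data set,
    ;; so min and max span every value.
    (def global
      (let [vs (map first values)]
        [(apply min vs) (apply max vs)]))
    ```

    Here per-group has the same shape as the 1 1 1 / 2 2 2 / 3 3 3 output in
    the question, while global holds the single min/max pair that the
    normalization actually needs.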

    I'm not sure what the most efficient way to do this is going to be. It
    sounds to me like you're going to need two subqueries: one to compute the
    min and max, and another that uses that min/max subquery to do the
    normalization. Maybe something like this (untested, so I don't know if
    this will even run):

    (def min-max
          (<- [?min-v ?max-v]
              (values ?v)
              (c/min ?v :> ?min-v)
              (c/max ?v :> ?max-v)))

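    (<- [?v ?norm-v]
         (values ?v)
         (min-max ?min-v ?max-v)
         (normalize ?v ?min-v ?max-v :> ?norm-v)
         ;; the two generators share no variables, so an explicit
         ;; cross-join is likely required, as in the earlier example
         (cross-join))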

    Of course, that's going to have to run over the values twice: once to get
    the min and max, and then a second time to do the normalization. Unless
    you can hold all the values in memory inside a buffer or aggregator, I'm
    not sure how you can get away with one pass. Maybe others who are better
    at Cascalog/Cascading/MapReduce know a better way.
  • Jeroen van Dijk at Jul 9, 2013 at 9:27 am
    I think it is not possible to do it in one run, because you need at least
    one run to determine the min and max, and then you need another run to add
    the min and max to the values, if that is what you really want.

    I think you should stay away from cross-join in this case. It is not very
    performant because it needs to go through one reducer for all combinations.

    I think the easiest way is to capture the min and max in memory and then
    just append the min and max to the values, like so:

    (let [min-max (ffirst (??- (<- [?min-v ?max-v]
              (values ?v)
              (c/min ?v :> ?min-v)
              (c/max ?v :> ?max-v))))]
       (?<- (stdout) [?v ?min-v ?max-v]
         (identity min-max :> ?min-v ?max-v)
         (values ?v)))
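
    The same two-step idea can be sketched in plain Clojure, without Cascalog,
    to show the data flow. This is only an illustration on the toy data from
    the question; the name normalized is hypothetical, and an in-memory vector
    stands in for the real subquery.

    ```clojure
    ;; Plain-Clojure sketch of the two-step approach; not Cascalog code.
    (def values [[1] [2] [3]])

    (defn normalize [n min max]
      (/ (- n min) (- max min)))

    ;; Step 1: one pass over the values to find the global min and max.
    (def min-max
      (let [vs (map first values)]
        [(apply min vs) (apply max vs)]))

    ;; Step 2: a second pass that attaches min and max to every value
    ;; and computes the normalized value.
    (def normalized
      (let [[min-v max-v] min-max]
        (for [[v] values]
          [v min-v max-v (normalize v min-v max-v)])))
    ```

    On this input the rows come out as 1 1 3 0, 2 1 3 1/2 and 3 1 3 1, the
    same as the cross-join version earlier in the thread.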

    HTH


Discussion Overview
group: cascalog-user
categories: clojure, hadoop
posted: Jul 8, '13 at 11:22a
active: Jul 9, '13 at 9:27a
posts: 4
users: 3
website: clojure.org
irc: #clojure
