Hi everybody,

I am wondering how one might write a query that performs several aggregations
in parallel. Say I have a huge dataset of sales. The output should be
two numbers: the sum of sales this month and the sum of sales last month.
Creating the respective queries is easy. My understanding is that each
of the queries will initiate a whole map/reduce cycle, and then I need to
join the results together. Is it somehow possible to define a query that
would return both numbers in one swoop?

Many thanks, Tomas


  • Sam Ritchie at Nov 23, 2011 at 2:33 pm
    The cascalog.ops/sum aggregator works in parallel, so you can do

    (<- [?month ?sum-sales]
        (sum ?sales :> ?sum-sales)
        (src ?date ?sales)
        (extract-month ?date :> ?month))

    You can add a filter in there to keep only the past two months, or use
    cascalog.ops/first-n. Check out the API docs at
    nathanmarz.github.com/cascalog.
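
As a minimal sketch of the filter mentioned above, the query below keeps only two months before summing. The sample `src` data, the ISO date format, the `month-of` helper, the `extract-month` definition, the `wanted-months` set, and the `recent-month?` filter are all assumptions made for illustration; none of them come from the original post.

```clojure
(use 'cascalog.api)
(require '[cascalog.ops :as c])

;; Hypothetical in-memory source of [date sales] tuples.
(def src [["2011-11-03" 10]
          ["2011-11-20" 5]
          ["2011-10-11" 20]
          ["2011-09-30" 7]])

;; Assumes ISO "yyyy-MM-dd" date strings; returns the "yyyy-MM" prefix.
(defn month-of [date]
  (subs date 0 7))

;; Assumed definition of the extract-month operation used in the thread.
(defmapop extract-month [date]
  (month-of date))

;; Hypothetical set of months to keep (this month and last month).
(def wanted-months #{"2011-11" "2011-10"})

;; Filter operation: keeps a tuple only when its month is wanted.
(deffilterop recent-month? [month]
  (contains? wanted-months month))

;; One query, one grouping by ?month, with the filter applied before the sum.
(?<- (stdout)
     [?month ?sum-sales]
     (src ?date ?sales)
     (extract-month ?date :> ?month)
     (recent-month? ?month)
     (c/sum ?sales :> ?sum-sales))
```

This still returns one row per month rather than both sums on one row.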

    Cheers,
    Sam
    --
    Sam Ritchie, Twitter Inc
    703.662.1337
    @sritchie09

    (Too brief? Here's why! http://emailcharter.org)
  • Tomas Svarovsky at Nov 23, 2011 at 2:47 pm
    Sam,

    thanks for your super fast response. This is a good method if I wanted
    just those two numbers. What I had in mind was returning them on a
    single row, so that I can group them by user, for example. The
    output should look something like this:

    User | MTD | MTD-1
    1    | 10  | 20
    2    | 30  | 40

    Sorry for not being clearer the first time.

    T
  • Andrew Xue at Nov 25, 2011 at 2:43 am
    hey tomas - i find that i need to "horizontalize" or "columnize" data
    often and i usually do this by putting the row data into a json map

    gist of the map maker looks like this

    (defaggregateop mk-map
      ([] {})
      ([state key val] (assoc state key val))
      ([state] [state]))

    so the query would be like

    (<- [?user-id ?sales-map-json]
        (src ?date ?sales ?user-id)
        (sum ?sales :> ?sum-sales)
        (extract-month ?date :> ?month)
        (mk-map :<< [?month ?sum-sales] :> ?sales-map)
        (json-str ?sales-map :> ?sales-map-json))


    output would look like

    USER-ID | SALES-MAP-JSON
    2342423 | {"month1": "sales1", "month2": "sales2" ... "monthN": "salesN"}

    sam, nathan feel free to comment if there is a more optimal solution
    to this
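
One possible variant, sketched under assumptions: compute the monthly sums in a subquery first, then build the map in a second query that uses the subquery as its source. This is based on the assumption that a single Cascalog query may not be able to feed the output of `sum` into a second aggregator. The sample `src` data, the ISO date format, the `month-of` helper, and the `extract-month` definition are hypothetical, and `str` stands in for `json-str` to avoid depending on a JSON library.

```clojure
(use 'cascalog.api)
(require '[cascalog.ops :as c])

;; Hypothetical in-memory source of [date sales user-id] tuples.
(def src [["2011-11-03" 4 1]
          ["2011-11-20" 6 1]
          ["2011-10-11" 20 1]
          ["2011-11-07" 30 2]
          ["2011-10-02" 40 2]])

;; Assumes ISO "yyyy-MM-dd" date strings; returns the "yyyy-MM" prefix.
(defn month-of [date]
  (subs date 0 7))

;; Assumed definition of the extract-month operation used in the thread.
(defmapop extract-month [date]
  (month-of date))

;; Aggregator that collects key/value pairs into a single map.
(defaggregateop mk-map
  ([] {})
  ([state k v] (assoc state k v))
  ([state] [state]))

;; Step 1: one sum per user and month.
(def monthly-sales
  (<- [?user-id ?month ?sum-sales]
      (src ?date ?sales ?user-id)
      (extract-month ?date :> ?month)
      (c/sum ?sales :> ?sum-sales)))

;; Step 2: one row per user, with the monthly sums collected into a map.
(def sales-by-user
  (<- [?user-id ?sales-map-str]
      (monthly-sales ?user-id ?month ?sum-sales)
      (mk-map ?month ?sum-sales :> ?sales-map)
      (str ?sales-map :> ?sales-map-str)))

;; Execute the second query (and the subquery it depends on), printing rows.
(?- (stdout) sales-by-user)
```

The order of entries inside each printed map is not guaranteed.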
    hope that helps
    andy


  • Sam Ritchie at Nov 25, 2011 at 3:18 am
    Hey, that's pretty much what I do:

    (use 'cascalog.api)

    (defbufferop transpose-buf [tuples]
      [(into {} (map vec tuples))])

    (def src [["sam" true 10]
              ["sam" false 5]])

    (??<- [?name ?s]
          (src ?name ?bool ?val)
          (transpose-buf ?bool ?val :> ?outmap)
          (str ?outmap :> ?s))

    ;; produces ["sam" "{true 10, false 5}"]
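
If fixed columns are wanted instead of a map (the User | MTD | MTD-1 layout asked for earlier in the thread), one possible approach is conditional aggregation: a map operation splits each amount into a "this month" value and a "last month" value, and two `sum` aggregators then run side by side in the same grouping. This is only a sketch under assumptions, not a method proposed by anyone in the thread. The `sales` sample data, the month labels, and the `split-amount` and `split-by-month` names are hypothetical.

```clojure
(use 'cascalog.api)
(require '[cascalog.ops :as c])

;; Hypothetical in-memory source of [user month sales] tuples.
(def sales [[1 "2011-11" 4]
            [1 "2011-11" 6]
            [1 "2011-10" 20]
            [2 "2011-11" 30]
            [2 "2011-10" 40]
            [2 "2011-09" 7]])

;; Assumed month labels; in practice these would be derived from the run date.
(def this-month "2011-11")
(def last-month "2011-10")

;; Pure helper: routes an amount into [this-month-value last-month-value],
;; using 0 for the column the amount does not belong to.
(defn split-amount [month amount]
  [(if (= month this-month) amount 0)
   (if (= month last-month) amount 0)])

;; Map operation emitting two output fields per input tuple.
(defmapop split-by-month [month amount]
  (split-amount month amount))

;; One query, grouped by ?user, with both sums computed in the same pass.
(?<- (stdout)
     [?user ?mtd ?mtd-1]
     (sales ?user ?month ?amount)
     (split-by-month ?month ?amount :> ?cur ?prev)
     (c/sum ?cur :> ?mtd)
     (c/sum ?prev :> ?mtd-1))
```

Rows from other months contribute 0 to both columns, so no separate filter is needed, although adding one would reduce the data sent to the reducers.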

Discussion Overview
group: cascalog-user
categories: clojure, hadoop
posted: Nov 23, '11 at 2:24p
active: Nov 25, '11 at 3:18a
posts: 5
users: 3
website: clojure.org
irc: #clojure
