Hi everybody,

I am starting with Cascalog and though it is fun and really concise I
am still struggling with some queries.

Lets say that I have users. These users are selling things (lets say
orders) and these orders are tied to particular account. Data could
look like this

Orders
id,acc,user
1,1,1
2,1,1
3,1,1
4,2,1
5,2,1
6,3,2
7,4,2

id,credit
1,true
2,true
3,true
4,false

I would like to generate query that would tell me count of accounts
for each user. If I try something like this

(defn account-query []
(let [opptys (lfs-textline "/Users/fluke/sandbox/living_social/
data/opportunities-test.csv")
accounts (lfs-textline "/Users/fluke/sandbox/living_social/
data/accounts-test.csv")]
(?<- (stdout) [?user ?count]
(opptys ?o-line) (my-csv-parser ?o-line :> ?
oppty-id ?acc-id ?user)
(accounts ?a-line) (my-csv-parser ?a-line :> ?
acc-id ?acc-credit)
(c/count ?count)
)))

Cascalog will join all the lines so it will count the orders for each
user. How can I give it a hint to count on a different level of
aggregation? I tried to play with count and distinct but I must be
missing something. Or maybe I am looking at it from too SQL angle.

Thanks everybody in advance and thanks for this wonderful tool.

Tomas

Search Discussions

  • Sam Ritchie at Nov 18, 2011 at 3:34 am
    Hey Tomas,

    The "count" aggregator picks its grouping based on the fields you've placed
    into the results vector. In your example, you've used

    [?user ?count]

    Cascalog will perform your parsing operations and joins, then bucket all up
    the tuples by "?user", in this case. If you use this result vector instead:

    [?user ?acc-credit ?count]

    You'll get result tuples that look like this (just subbing in numbers here):

    [1 true 10]
    [1 false 5]
    [2 true 4]
    [2 false 1]

    And so on and so forth. In short, to get specific, just include all of the
    variables you want to group on.

    Hope that helps!
    On Thu, Nov 17, 2011 at 8:09 AM, Tomas Svarovsky wrote:

    Hi everybody,

    I am starting with Cascalog and though it is fun and really concise I
    am still struggling with some queries.

    Lets say that I have users. These users are selling things (lets say
    orders) and these orders are tied to particular account. Data could
    look like this

    Orders
    id,acc,user
    1,1,1
    2,1,1
    3,1,1
    4,2,1
    5,2,1
    6,3,2
    7,4,2

    id,credit
    1,true
    2,true
    3,true
    4,false

    I would like to generate query that would tell me count of accounts
    for each user. If I try something like this

    (defn account-query []
    (let [opptys (lfs-textline "/Users/fluke/sandbox/living_social/
    data/opportunities-test.csv")
    accounts (lfs-textline "/Users/fluke/sandbox/living_social/
    data/accounts-test.csv")]
    (?<- (stdout) [?user ?count]
    (opptys ?o-line) (my-csv-parser ?o-line :> ?
    oppty-id ?acc-id ?user)
    (accounts ?a-line) (my-csv-parser ?a-line :> ?
    acc-id ?acc-credit)
    (c/count ?count)
    )))

    Cascalog will join all the lines so it will count the orders for each
    user. How can I give it a hint to count on a different level of
    aggregation? I tried to play with count and distinct but I must be
    missing something. Or maybe I am looking at it from too SQL angle.

    Thanks everybody in advance and thanks for this wonderful tool.

    Tomas


    --
    Sam Ritchie, Twitter Inc
    703.662.1337
    @sritchie09

    (Too brief? Here's why! http://emailcharter.org)
  • Tomas Svarovsky at Nov 19, 2011 at 7:08 am
    Sam,

    Thanks for your quick answer. I know about this grouping behavior but
    the problem I am encountering is a little different I believe. My
    point is that (if I again use my example) count is always counting all
    lines (maybe distributed over different groups). So since I am joining
    orders with accounts the count will be dependent on number of orders.
    I would like to be able to count accounts. These numbers are equal as
    long as there is only one order per account which is usually not true.

    I found this thread http://groups.google.com/group/cascalog-user/browse_thread/thread/33f9b69bf18c9bdc
    with reference to distinct-count so maybe that is something that I
    might use. I will try to experiment with it. I would still appreciate
    any guidance or explanations if I am getting it wrong.

    Thanks

    Tomas
    On Nov 18, 4:33 am, Sam Ritchie wrote:
    Hey Tomas,

    The "count" aggregator picks its grouping based on the fields you've placed
    into the results vector. In your example, you've used

    [?user ?count]

    Cascalog will perform your parsing operations and joins, then bucket all up
    the tuples by "?user", in this case. If you use this result vector instead:

    [?user ?acc-credit ?count]

    You'll get result tuples that look like this (just subbing in numbers here):

    [1 true 10]
    [1 false 5]
    [2 true 4]
    [2 false 1]

    And so on and so forth. In short, to get specific, just include all of the
    variables you want to group on.

    Hope that helps!

    On Thu, Nov 17, 2011 at 8:09 AM, Tomas Svarovsky wrote:








    Hi everybody,
    I am starting with Cascalog and though it is fun and really concise I
    am still struggling with some queries.
    Lets say that I have users. These users are selling things (lets say
    orders) and these orders are tied to particular account. Data could
    look like this
    Orders
    id,acc,user
    1,1,1
    2,1,1
    3,1,1
    4,2,1
    5,2,1
    6,3,2
    7,4,2
    id,credit
    1,true
    2,true
    3,true
    4,false
    I would like to generate query that would tell me count of accounts
    for each user. If I try something like this
    (defn account-query []
    (let [opptys       (lfs-textline "/Users/fluke/sandbox/living_social/
    data/opportunities-test.csv")
    accounts     (lfs-textline "/Users/fluke/sandbox/living_social/
    data/accounts-test.csv")]
    (?<- (stdout) [?user ?count]
    (opptys ?o-line) (my-csv-parser ?o-line :> ?
    oppty-id ?acc-id ?user)
    (accounts ?a-line) (my-csv-parser ?a-line :> ?
    acc-id ?acc-credit)
    (c/count ?count)
    )))
    Cascalog will join all the lines so it will count the orders for each
    user. How can I give it a hint to count on a different level of
    aggregation? I tried to play with count and distinct but I must be
    missing something. Or maybe I am looking at it from too SQL angle.
    Thanks everybody in advance and thanks for this wonderful tool.
    Tomas
    --
    Sam Ritchie, Twitter Inc
    703.662.1337
    @sritchie09

    (Too brief? Here's why!http://emailcharter.org)
  • Sam Ritchie at Nov 19, 2011 at 4:00 pm
    Ah, got it. Yeah, distinct count is the right way to go here. Here's a gist
    with an example and a test: https://gist.github.com/1378988.

    Here's the code from the gist:

    (use 'cascalog.api)
    (require '[cascalog.ops :as c])

    (def src
    [[1 4 10]
    [1 4 11]
    [1 5 12]
    [2 6 13]
    [2 6 14]])

    (?<- (stdout)
    [?user-id ?count]
    (src ?user-id ?account-id ?order-id)
    (c/distinct-count ?account-id :> ?count))

    ;; RESULTS
    ;; -----------------------
    ;; 1 2
    ;; 2 1
    ;; -----------------------


    ;; If you include this in :dev-dependencies:
    ;;
    ;; [midje-cascalog "0.3.0"]
    ;;
    ;; Then you can write tests like this:

    (use 'midje.cascalog)

    (let [src [[1 4 10]
    [1 4 11]
    [1 5 12]
    [2 6 13]
    [2 6 14]]
    result [[1 2]
    [2 1]]]
    (fact?<- "Query should return 2-tuples of the form
    (user-id, #-accounts.)"
    result
    [?user-id ?count]
    (src ?user-id ?account-id ?order-id)
    (c/distinct-count ?account-id :> ?count)))


    On Fri, Nov 18, 2011 at 6:14 AM, Tomas Svarovsky
    wrote:
    Sam,

    Thanks for your quick answer. I know about this grouping behavior but
    the problem I am encountering is a little different I believe. My
    point is that (if I again use my example) count is always counting all
    lines (maybe distributed over different groups). So since I am joining
    orders with accounts the count will be dependent on number of orders.
    I would like to be able to count accounts. These numbers are equal as
    long as there is only one order per account which is usually not true.

    I found this thread
    http://groups.google.com/group/cascalog-user/browse_thread/thread/33f9b69bf18c9bdc
    with reference to distinct-count so maybe that is something that I
    might use. I will try to experiment with it. I would still appreciate
    any guidance or explanations if I am getting it wrong.

    Thanks

    Tomas
    On Nov 18, 4:33 am, Sam Ritchie wrote:
    Hey Tomas,

    The "count" aggregator picks its grouping based on the fields you've placed
    into the results vector. In your example, you've used

    [?user ?count]

    Cascalog will perform your parsing operations and joins, then bucket all up
    the tuples by "?user", in this case. If you use this result vector instead:
    [?user ?acc-credit ?count]

    You'll get result tuples that look like this (just subbing in numbers here):
    [1 true 10]
    [1 false 5]
    [2 true 4]
    [2 false 1]

    And so on and so forth. In short, to get specific, just include all of the
    variables you want to group on.

    Hope that helps!

    On Thu, Nov 17, 2011 at 8:09 AM, Tomas Svarovsky <svarov...@gooddata.com
    wrote:








    Hi everybody,
    I am starting with Cascalog and though it is fun and really concise I
    am still struggling with some queries.
    Lets say that I have users. These users are selling things (lets say
    orders) and these orders are tied to particular account. Data could
    look like this
    Orders
    id,acc,user
    1,1,1
    2,1,1
    3,1,1
    4,2,1
    5,2,1
    6,3,2
    7,4,2
    id,credit
    1,true
    2,true
    3,true
    4,false
    I would like to generate query that would tell me count of accounts
    for each user. If I try something like this
    (defn account-query []
    (let [opptys (lfs-textline "/Users/fluke/sandbox/living_social/
    data/opportunities-test.csv")
    accounts (lfs-textline "/Users/fluke/sandbox/living_social/
    data/accounts-test.csv")]
    (?<- (stdout) [?user ?count]
    (opptys ?o-line) (my-csv-parser ?o-line :> ?
    oppty-id ?acc-id ?user)
    (accounts ?a-line) (my-csv-parser ?a-line :> ?
    acc-id ?acc-credit)
    (c/count ?count)
    )))
    Cascalog will join all the lines so it will count the orders for each
    user. How can I give it a hint to count on a different level of
    aggregation? I tried to play with count and distinct but I must be
    missing something. Or maybe I am looking at it from too SQL angle.
    Thanks everybody in advance and thanks for this wonderful tool.
    Tomas
    --
    Sam Ritchie, Twitter Inc
    703.662.1337
    @sritchie09

    (Too brief? Here's why!http://emailcharter.org)


    --
    Sam Ritchie, Twitter Inc
    703.662.1337
    @sritchie09

    (Too brief? Here's why! http://emailcharter.org)

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupcascalog-user @
categoriesclojure, hadoop
postedNov 17, '11 at 11:15p
activeNov 19, '11 at 4:00p
posts4
users3
websiteclojure.org
irc#clojure

People

Translate

site design / logo © 2021 Grokbase