Could someone help me understand what I'm misunderstanding? I'm very new to
Cascalog. I've done a little with Cascading and a little with Clojure so a
lot of this is very new.

I'm trying to create a query which will summarize some invoice data. I've
just started to play with this and ran into an error that I'm not
understanding. Here is what I'm doing:

    (def monthly-summary
      (<- [?practiceid ?year ?month ?patient-count ?client-count]
          (transaction :#> 12 {0 ?practiceid 1 ?invoiceid 2 ?clientid
                               3 ?patientid 4 ?txn-date 6 ?price})
          (parse-year-and-month ?txn-date :> ?year ?month)
          (c/distinct-count ?patientid :> ?patient-count)
          (c/distinct-count ?clientid :> ?client-count)))

When I try to execute this at the REPL I get an exception which says "Same
option set to conflicting values!". But if I take out the second
(c/distinct-count ...) and the corresponding ?client-count, it works great.
If I want to count the number of distinct patients and distinct clients for
each ?practiceid, ?year, ?month, what is the proper way to do it?

Thanks,

Dave

  • Sam Ritchie at Jun 18, 2012 at 2:08 am
    David,

    This happens because distinct-count does its work by sorting all tuples on
    the variable to be de-duplicated. To do multiple distinct-counts, you'll
    need to break this query up into multiple subqueries and compute a separate
    distinct count in each one. You can use these subqueries together in
    another query if you need to perform a join between ?patient-count and
    ?client-count.

    Hope that helps,
    Sam

    --
    Sam Ritchie, Twitter Inc
    703.662.1337
    @sritchie09

    (Too brief? Here's why! http://emailcharter.org)
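
Sam's suggestion (one subquery per distinct count, then a join on the grouping fields) could be sketched roughly as follows. This is a minimal sketch, assuming the Cascalog 1.x API (`cascalog.api`, `cascalog.ops`). The in-memory `transactions` vector is a hypothetical stand-in for the `transaction` tap in the question, with year and month already parsed, and the names `patient-counts` and `client-counts` are illustrative only.

```clojure
(ns example.monthly-summary
  (:use cascalog.api)
  (:require [cascalog.ops :as c]))

;; Hypothetical in-memory generator standing in for the real tap.
;; Fields: [practiceid year month clientid patientid]
(def transactions
  [[1 2012 6 "c1" "p1"]
   [1 2012 6 "c1" "p2"]
   [1 2012 6 "c2" "p3"]
   [2 2012 6 "c3" "p4"]])

;; Subquery 1: distinct patients per practice/year/month.
(def patient-counts
  (<- [?practiceid ?year ?month ?patient-count]
      (transactions ?practiceid ?year ?month _ ?patientid)
      (c/distinct-count ?patientid :> ?patient-count)))

;; Subquery 2: distinct clients per practice/year/month.
(def client-counts
  (<- [?practiceid ?year ?month ?client-count]
      (transactions ?practiceid ?year ?month ?clientid _)
      (c/distinct-count ?clientid :> ?client-count)))

;; Outer query: the shared variable names join the two subqueries
;; on ?practiceid, ?year and ?month.
(def monthly-summary
  (<- [?practiceid ?year ?month ?patient-count ?client-count]
      (patient-counts ?practiceid ?year ?month ?patient-count)
      (client-counts ?practiceid ?year ?month ?client-count)))
```

Running `(?- (stdout) monthly-summary)` at the REPL should print one tuple per practice, year and month. Each subquery holds only one `distinct-count`, so each one sets its sort option only once and the conflict reported in the question should not arise.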
  • David Kincaid at Jun 18, 2012 at 2:23 am
    Thanks, Sam. That makes sense. That brings up another question I had. Let's
    say I split this into two sub-queries, each one using the same Tap as its
    generator. Will the resulting process that runs both subqueries and
    combines them run through the data two separate times? Let's say the data
    is in comma-delimited text files. Will two sub-queries result in processing
    all of the data twice?

    I have the whole thing working in a Cascading flow, so was using this as a
    Cascalog learning opportunity. The Cascading flow uses a Buffer to
    partition the data on customer id (?practiceid), month (?month) and year
    (?year) and calculates several counts and sums that it emits as a single
    wide tuple for each practiceid, month and year. Should I be doing the same
    thing in Cascalog? It seemed like there might be a more elegant way to do
    it, but the buffer at least only runs over all the data one time.

    Thanks again,

    Dave
  • Nathan Marz at Jun 18, 2012 at 3:09 pm
    To make things more composable, you can make a distinct count aggregator
    that doesn't make use of secondary sorting (by building an in-memory set of
    the elements and counting at the end). Then you can do both distinct counts
    at the same time with one pass of the tuples.

    --
    Twitter: @nathanmarz
    http://nathanmarz.com
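
The idea Nathan describes (collect each field's values into an in-memory set and count at the end) can be illustrated in plain Clojure, independent of Cascalog. This is only a sketch of the aggregation logic: the function names are hypothetical, and the row layout `[clientid patientid]` is an assumption. In Cascalog 1.x, an init/step/finish triple like this would typically be wrapped with `defaggregateop`, but check that against the version you are using.

```clojure
;; Aggregator state is simply the set of values seen so far.
(defn distinct-init [] #{})
(defn distinct-step [seen v] (conj seen v))
(defn distinct-finish [seen] (count seen))

;; One pass over the rows of a single group, maintaining both sets
;; at once. Each row is assumed to be [clientid patientid].
(defn count-distinct-pair [rows]
  (let [[clients patients]
        (reduce (fn [[cs ps] [clientid patientid]]
                  [(distinct-step cs clientid)
                   (distinct-step ps patientid)])
                [(distinct-init) (distinct-init)]
                rows)]
    {:client-count  (distinct-finish clients)
     :patient-count (distinct-finish patients)}))

(count-distinct-pair [["c1" "p1"] ["c1" "p2"] ["c2" "p3"]])
;; => {:client-count 2, :patient-count 3}
```

Because no sorting is involved, several such aggregators can share a single pass over the tuples. The trade-off is that each group's set of distinct values has to fit in memory.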

Discussion Overview
group: cascalog-user
categories: clojure, hadoop
posted: Jun 18, '12 at 1:29a
active: Jun 18, '12 at 3:09p
posts: 4
users: 3
website: clojure.org
irc: #clojure