FAQ
I have a situation where I need to read some text files, do some processing
and send the results to 5 different Oracle database tables. Currently I
have it setup as 5 different queries that I run in parallel using ?-. Since
each one of the queries has it's own TextDelimited tap I end up reading
through the text files 5 times. What I'd really like to be able to do is to
read through the text files one time and use the output from that to feed 5
jdbc sinks.

I thought that I could do this using a subquery then use that subquery as a
generator in the 5 queries that use the jdbc sinks. My first attempt didn't
work. It's still reading through the input data 5 times. I'll try to
simpify what I'm doing here and see if anyone can point out my mistaken
thinking.

So I'll use a sequence as my generator for this example:

(def source [ [ "a" 1] [ "b" 2] ])

Now I'll create a subquery that will be used as the generator for my
aggregating queries:

(defn initial-query [source-tap]
(<- [ ?id ?letter ?sum-letter ]
(source-tap :> ?letter ?count)
(c/sum ?letter :> ?sum-letter)
(identity 100 :> ?id)))

Now create a couple of queries that use the results of this subquery:

(defn calculate-total [source-tap]
(<- [ ?total ]
((initial-query source-tap) :> ?id ?letter ?sum-letter)
(c/sum ?sum-letter :> total)))

(defn count-letters [source-tap]
(<- [ ?total-letters ]
((initial-query source-tap) :> ?id ?letter ?sum-letter)
(c/count ?letter :> ?total-letters)))

Now when I want to execute that using

(?- (sink1) (calculate-total source)
(sink2) (count-letters source))

I'm trying to get it to only read over the source data one time and then
use that for the two second level queries. Note that my actual data and
queries are quite a bit more complex than this. Where is my setup going
wrong?

Thanks,

Dave

Search Discussions

  • David Kincaid at Aug 12, 2012 at 7:05 pm
    It almost never fails that after I post a question to a mailing list the
    answer pops into my head. I figured out that what I needed to do was to
    call the subquery only one time and pass that one instance as a value to
    the other queries. So the main() method looks something like this:

    (defn -main [& args]
    (let [ agg-source (initial-query source) ]
    (?- (sink1) (calculate-total agg-source)
    (sink2) (count-letters agg-source))))

    once I did that it is behaving as I was hoping. Making this one change has
    reduced the processing time of the MR jobs on this data from 3 hours to 1
    hours (on a 4 node cluster). Big win!
    On Sunday, August 12, 2012 9:27:15 AM UTC-5, David Kincaid wrote:

    I have a situation where I need to read some text files, do some
    processing and send the results to 5 different Oracle database tables.
    Currently I have it setup as 5 different queries that I run in parallel
    using ?-. Since each one of the queries has it's own TextDelimited tap I
    end up reading through the text files 5 times. What I'd really like to be
    able to do is to read through the text files one time and use the output
    from that to feed 5 jdbc sinks.

    I thought that I could do this using a subquery then use that subquery as
    a generator in the 5 queries that use the jdbc sinks. My first attempt
    didn't work. It's still reading through the input data 5 times. I'll try to
    simpify what I'm doing here and see if anyone can point out my mistaken
    thinking.

    So I'll use a sequence as my generator for this example:

    (def source [ [ "a" 1] [ "b" 2] ])

    Now I'll create a subquery that will be used as the generator for my
    aggregating queries:

    (defn initial-query [source-tap]
    (<- [ ?id ?letter ?sum-letter ]
    (source-tap :> ?letter ?count)
    (c/sum ?letter :> ?sum-letter)
    (identity 100 :> ?id)))

    Now create a couple of queries that use the results of this subquery:

    (defn calculate-total [source-tap]
    (<- [ ?total ]
    ((initial-query source-tap) :> ?id ?letter ?sum-letter)
    (c/sum ?sum-letter :> total)))

    (defn count-letters [source-tap]
    (<- [ ?total-letters ]
    ((initial-query source-tap) :> ?id ?letter ?sum-letter)
    (c/count ?letter :> ?total-letters)))

    Now when I want to execute that using

    (?- (sink1) (calculate-total source)
    (sink2) (count-letters source))

    I'm trying to get it to only read over the source data one time and then
    use that for the two second level queries. Note that my actual data and
    queries are quite a bit more complex than this. Where is my setup going
    wrong?

    Thanks,

    Dave
  • Paul Lam at Aug 12, 2012 at 9:08 pm
    cascalog-checkpoint is what this is for


    On Sunday, 12 August 2012 20:05:20 UTC+1, David Kincaid wrote:

    It almost never fails that after I post a question to a mailing list the
    answer pops into my head. I figured out that what I needed to do was to
    call the subquery only one time and pass that one instance as a value to
    the other queries. So the main() method looks something like this:

    (defn -main [& args]
    (let [ agg-source (initial-query source) ]
    (?- (sink1) (calculate-total agg-source)
    (sink2) (count-letters agg-source))))

    once I did that it is behaving as I was hoping. Making this one change has
    reduced the processing time of the MR jobs on this data from 3 hours to 1
    hours (on a 4 node cluster). Big win!
    On Sunday, August 12, 2012 9:27:15 AM UTC-5, David Kincaid wrote:

    I have a situation where I need to read some text files, do some
    processing and send the results to 5 different Oracle database tables.
    Currently I have it setup as 5 different queries that I run in parallel
    using ?-. Since each one of the queries has it's own TextDelimited tap I
    end up reading through the text files 5 times. What I'd really like to be
    able to do is to read through the text files one time and use the output
    from that to feed 5 jdbc sinks.

    I thought that I could do this using a subquery then use that subquery as
    a generator in the 5 queries that use the jdbc sinks. My first attempt
    didn't work. It's still reading through the input data 5 times. I'll try to
    simpify what I'm doing here and see if anyone can point out my mistaken
    thinking.

    So I'll use a sequence as my generator for this example:

    (def source [ [ "a" 1] [ "b" 2] ])

    Now I'll create a subquery that will be used as the generator for my
    aggregating queries:

    (defn initial-query [source-tap]
    (<- [ ?id ?letter ?sum-letter ]
    (source-tap :> ?letter ?count)
    (c/sum ?letter :> ?sum-letter)
    (identity 100 :> ?id)))

    Now create a couple of queries that use the results of this subquery:

    (defn calculate-total [source-tap]
    (<- [ ?total ]
    ((initial-query source-tap) :> ?id ?letter ?sum-letter)
    (c/sum ?sum-letter :> total)))

    (defn count-letters [source-tap]
    (<- [ ?total-letters ]
    ((initial-query source-tap) :> ?id ?letter ?sum-letter)
    (c/count ?letter :> ?total-letters)))

    Now when I want to execute that using

    (?- (sink1) (calculate-total source)
    (sink2) (count-letters source))

    I'm trying to get it to only read over the source data one time and then
    use that for the two second level queries. Note that my actual data and
    queries are quite a bit more complex than this. Where is my setup going
    wrong?

    Thanks,

    Dave
  • David Kincaid at Aug 12, 2012 at 11:56 pm
    Thanks, Paul! That is exactly what I'm looking for. I hadn't run across
    this yet. Cascalog is such an amazing language, but it can be difficult to
    find all of the great features.

    Thanks again,

    Dave
    On Sunday, August 12, 2012 4:08:07 PM UTC-5, Paul Lam wrote:

    cascalog-checkpoint is what this is for


    On Sunday, 12 August 2012 20:05:20 UTC+1, David Kincaid wrote:

    It almost never fails that after I post a question to a mailing list the
    answer pops into my head. I figured out that what I needed to do was to
    call the subquery only one time and pass that one instance as a value to
    the other queries. So the main() method looks something like this:

    (defn -main [& args]
    (let [ agg-source (initial-query source) ]
    (?- (sink1) (calculate-total agg-source)
    (sink2) (count-letters agg-source))))

    once I did that it is behaving as I was hoping. Making this one change
    has reduced the processing time of the MR jobs on this data from 3 hours to
    1 hours (on a 4 node cluster). Big win!
    On Sunday, August 12, 2012 9:27:15 AM UTC-5, David Kincaid wrote:

    I have a situation where I need to read some text files, do some
    processing and send the results to 5 different Oracle database tables.
    Currently I have it setup as 5 different queries that I run in parallel
    using ?-. Since each one of the queries has it's own TextDelimited tap I
    end up reading through the text files 5 times. What I'd really like to be
    able to do is to read through the text files one time and use the output
    from that to feed 5 jdbc sinks.

    I thought that I could do this using a subquery then use that subquery
    as a generator in the 5 queries that use the jdbc sinks. My first attempt
    didn't work. It's still reading through the input data 5 times. I'll try to
    simpify what I'm doing here and see if anyone can point out my mistaken
    thinking.

    So I'll use a sequence as my generator for this example:

    (def source [ [ "a" 1] [ "b" 2] ])

    Now I'll create a subquery that will be used as the generator for my
    aggregating queries:

    (defn initial-query [source-tap]
    (<- [ ?id ?letter ?sum-letter ]
    (source-tap :> ?letter ?count)
    (c/sum ?letter :> ?sum-letter)
    (identity 100 :> ?id)))

    Now create a couple of queries that use the results of this subquery:

    (defn calculate-total [source-tap]
    (<- [ ?total ]
    ((initial-query source-tap) :> ?id ?letter ?sum-letter)
    (c/sum ?sum-letter :> total)))

    (defn count-letters [source-tap]
    (<- [ ?total-letters ]
    ((initial-query source-tap) :> ?id ?letter ?sum-letter)
    (c/count ?letter :> ?total-letters)))

    Now when I want to execute that using

    (?- (sink1) (calculate-total source)
    (sink2) (count-letters source))

    I'm trying to get it to only read over the source data one time and then
    use that for the two second level queries. Note that my actual data and
    queries are quite a bit more complex than this. Where is my setup going
    wrong?

    Thanks,

    Dave

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupcascalog-user @
categoriesclojure, hadoop
postedAug 12, '12 at 2:27p
activeAug 12, '12 at 11:56p
posts4
users2
websiteclojure.org
irc#clojure

2 users in discussion

David Kincaid: 3 posts Paul Lam: 1 post

People

Translate

site design / logo © 2022 Grokbase