Hi

I have a query which uses the same subquery several times. It looks
something like

S3 SrcData => QueryA => QueryB => QueryC => etc.
^
S3 SrcData => QueryA


QueryA is being called twice, and it would be nice to be able to
"cache" the result in a temp file in HDFS instead of running the query
twice. This would be especially worthwhile because QueryA is a filter job
on the SrcData, so the cached result would be much smaller.

I found a few threads on the Cascading list which say to implement
the isSafe() function on Operation to return false:

http://groups.google.com/group/cascading-user/browse_thread/thread/59e3463093c1eebb#
http://groups.google.com/group/cascading-user/browse_thread/thread/cd283dadc6f76bbe/d09111e95b6a8852?lnk=gst&q=%22Dumping+pipe+to+disk#d09111e95b6a8852

Is there any way to access and implement this from Cascalog? Thanks

Andy

  • Andrew Xue at Dec 16, 2011 at 9:11 pm
    ok, the ascii picture didn't work so here is a google doc drawing

    https://docs.google.com/drawings/d/1JClgjaMOimujRtOThMoSZCihcvG6k_8FWdkAe-QS7NU/edit
  • Nathanmarz at Dec 21, 2011 at 10:20 pm
    Cascading will do this automatically as long as the subquery includes
    a reduce step... if it's map-only (e.g. just a filter) it will redo
    query A. I opened up an issue to expose the isSafe method for
    subqueries so you can force this optimization: https://github.com/nathanmarz/cascalog/issues/38

    You can probably force this optimization now by implementing a regular
    Cascading filter that always returns false (so it removes nothing) and
    overrides the isSafe method to return false. So something like:

    (<- [?foo ?bar]
        (source ?foo ?bar)
        (my-filter ?foo)
        ((IdentityUnsafe.) ?foo)
        (:distinct false))

    where IdentityUnsafe is your Cascading filter implementation.
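
A minimal sketch of what such an IdentityUnsafe class could look like, written with Clojure's gen-class so that it can be instantiated as (IdentityUnsafe.) in a query. The namespace name example.IdentityUnsafe is hypothetical. The sketch assumes the Cascading 1.x classes cascading.operation.BaseOperation and cascading.operation.Filter are on the classpath and that the namespace is AOT-compiled; it has not been run against any particular Cascalog release.

```clojure
;; Hypothetical namespace. It must be compiled ahead of time (for example via
;; :aot in project.clj) so the example.IdentityUnsafe class exists at query time.
(ns example.IdentityUnsafe
  (:gen-class
    :extends cascading.operation.BaseOperation
    :implements [cascading.operation.Filter]))

;; Filter.isRemove: returning false keeps every tuple, so the filter passes
;; its input through unchanged.
(defn -isRemove [this flow-process filter-call]
  false)

;; Operation.isSafe: returning false tells Cascading the operation should not
;; be executed more than once on the same record, which, per the thread, is
;; what should keep the planner from re-running the upstream pipe.
(defn -isSafe [this]
  false)
```

The query namespace would then import the class, for example with (:import [example IdentityUnsafe]), before using it as in the query above.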

  • Sam Ritchie at Dec 21, 2011 at 11:56 pm
    Hey, you'll probably find cascalog.checkpoint in cascalog-contrib helpful.
    I discuss an example here:
    http://sritchie.github.com/2011/11/15/introducing-cascalogcontrib.html.
    Something like
    (workflow ["/tmp/checkpoints"]
      queryA ([:tmp-dirs query-a-data]
               (?- (hfs-seqfile query-a-data)
                   (query-a s3-path)))
      queryB ([:deps queryA :tmp-dirs query-b-data]
               (?- (hfs-seqfile query-b-data)
                   (query-b query-a-data)))
      queryC ([:deps [queryA queryB] :tmp-dirs query-c-data]
               (?- (hfs-seqfile query-c-data)
                   (query-c query-a-data query-b-data))))
    and so on and so forth.

    --
    Sam Ritchie, Twitter Inc
    703.662.1337
    @sritchie09
    (Too brief? Here's why! http://emailcharter.org)


  • Andrew Xue at Dec 25, 2011 at 10:39 pm
    hey sam -- this looks great, but having some trouble with getting
    checkpoint working -- have you seen this error before?

    ClassCastException java.lang.String cannot be cast to clojure.lang.IFn
    cascalog.contrib.checkpoint/exec-workflow!/iter--210--214/fn--215 (checkpoint.clj:88)

    i get this error with the workflow i was testing as well as when i cut
    and pasted in the checkpoint_test.clj code and tried to (run-test!)


  • Sam Ritchie at Dec 25, 2011 at 10:54 pm
    Andy, try using http://clojars.org/cascalog-checkpoint, or

    [cascalog-checkpoint "0.1.0"]

    instead of the global cascalog-contrib; I changed the blog post, but I
    haven't been able to figure out how to take the cascalog-contrib jar off of
    clojars.
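
For reference, a sketch of where that dependency vector would sit in a Leiningen project.clj. Only the cascalog-checkpoint coordinate comes from this thread; the project name and the Cascalog version shown are placeholders and should be replaced with whatever your project actually uses.

```clojure
;; project.clj (Leiningen). The project name and the Cascalog version are
;; assumptions for illustration, not values taken from the thread.
(defproject example-pipeline "0.1.0-SNAPSHOT"
  :dependencies [[cascalog "1.8.4"]               ; assumed version
                 [cascalog-checkpoint "0.1.0"]])  ; instead of cascalog-contrib
```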
  • Andrew Xue at Dec 25, 2011 at 11:45 pm
    cool that works -- this is seriously awesome, thanks!
  • Andrew Xue at Dec 26, 2011 at 4:11 am
    can you nest workflows?

    for example something like:

    (defn inner-workflow
      [input-path output-path]
      (workflow ["tmp/inner-workflow"]
        step 1 ...
        step 2 ... etc
        ))

    (defn outer-workflow
      [input-path output-path]
      (workflow ["tmp/outer-workflow"]
        step 1 ([:tmp-dirs inner-workflow-staging]
                 (inner-workflow input-path inner-workflow-staging))
        step 2 ... etc
        ))

  • Nathanmarz at Dec 28, 2011 at 8:34 am
    You could... but it's not really recommended. For example, if you were
    to delete "/tmp/outer-workflow" but not "/tmp/inner-workflow", you'd
    run into problems. If the workflow had previously only run part way
    through inner-workflow, inner-workflow will end up emitting stale
    results if outer-workflow is rerun.
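
The failure mode described above can be illustrated with a toy model in plain Clojure. This is only a sketch under stated assumptions: it is not how cascalog-checkpoint is implemented, and every name in it (run-workflow!, clear-dir!, the atoms) is hypothetical. It assumes a workflow skips any step that already has a completion marker under its checkpoint directory and clears that directory once all steps succeed.

```clojure
;; Toy model only: checkpoint "directories" are string prefixes in a set of
;; completed-step markers held in an atom.
(defn clear-dir!
  "Removes every marker stored under dir."
  [fs dir]
  (let [prefix (str dir "/")]
    (swap! fs (fn [markers]
                (set (remove (fn [^String m] (.startsWith m prefix)) markers))))))

(defn run-workflow!
  "Runs [step-name step-fn] pairs in order, skipping steps whose marker already
  exists under checkpoint-dir. Clears the directory once every step succeeds."
  [fs checkpoint-dir steps]
  (doseq [[step-name step-fn] steps]
    (let [marker (str checkpoint-dir "/" step-name)]
      (when-not (contains? @fs marker)
        (step-fn)
        (swap! fs conj marker))))
  (clear-dir! fs checkpoint-dir))

(def fs (atom #{}))             ; markers for completed steps
(def input (atom "old"))        ; stands in for the source data
(def staged (atom nil))         ; stands in for inner step1's output
(def fail-step-2? (atom true))  ; makes the first run die part way through

(defn inner-workflow! []
  (run-workflow! fs "/tmp/inner-workflow"
                 [["step1" #(reset! staged @input)]
                  ["step2" #(when @fail-step-2?
                              (throw (RuntimeException. "step2 failed")))]]))

(defn outer-workflow! []
  (run-workflow! fs "/tmp/outer-workflow"
                 [["step1" inner-workflow!]]))

;; First run: inner step1 completes, then inner step2 fails.
(try (outer-workflow!) (catch RuntimeException _ nil))

;; Only the outer directory is cleared, the input changes, and the job is rerun.
(clear-dir! fs "/tmp/outer-workflow")
(reset! input "new")
(reset! fail-step-2? false)
(outer-workflow!)

;; Inner step1 was skipped on the rerun, so the staged data is stale.
(println "input:" @input "staged:" @staged)  ; prints: input: new staged: old
```

In this model the rerun skips inner step1 because its marker survived, so the outer workflow finishes successfully on data derived from the old input.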

  • Andrew Xue at Jan 2, 2012 at 11:05 pm
    hey sam -- there seems to be an issue with checkpoint using
    u/collectify

    i am using cascalog-1.8.5-SNAPSHOT and util.clj no longer has this
    function


  • Andrew Xue at Jan 2, 2012 at 11:08 pm
    ah ok, i guess cascalog is using the collectify in jackknife now
  • Sam Ritchie at Jan 2, 2012 at 11:44 pm
    Andy, I found that we were re-using a number of functions between
    ElephantDB, Storm and Cascalog and decided to pull them out into a separate
    library. Once I write a few more tests I'll announce it formally on the
    list. Hopefully you and the rest of the gang will find some of the pieces
    useful in your own projects.

    Cheers,
    Sam

Discussion Overview
group: cascalog-user
categories: clojure, hadoop
posted: Dec 16, '11 at 9:07p
active: Jan 2, '12 at 11:44p
posts: 12
users: 3
website: clojure.org
irc: #clojure
