I am trying to use a template tap in ":update" mode, so that the
results of multiple queries will accumulate in the output files. This
appears to usually not work (although in < 10% of my test runs it
actually does work). Below is code to reproduce the issue.

I looked through the Cascalog source and convinced myself that it
should be passing the sinkmode on to Cascading, even for template
taps. Can anyone confirm this is a bug, or perhaps provide advice on
how to use a template tap in update mode?

Thanks.
-David

====

(use 'cascalog.api)

(let [age [["alice" 28]
           ["bob" 33]]]
  (?<- (hfs-seqfile "foo"
                    :sinkmode :update
                    :sink-template "%s/"
                    :templatefields ["?n"]
                    :outfields ["?a"])
       [?n ?a]
       (age :> ?n ?a)))

(??- (hfs-seqfile "foo/alice"))
;;=> (([28]))

(let [age [["alice" 280]
           ["bob" 330]]]
  (?<- (hfs-seqfile "foo"
                    :sinkmode :update
                    :sink-template "%s/"
                    :templatefields ["?n"]
                    :outfields ["?a"])
       [?n ?a]
       (age :> ?n ?a)))

(??- (hfs-seqfile "foo/alice"))
;;=> (([280]))
;; This is surprising! I expected to see both [28] and [280] in
;; the results.

  • Marshall T. Vandegrift at Mar 7, 2012 at 7:11 pm
    David McNeil writes:
    > I am trying to use a template tap in ":update" mode, so that the
    > results of multiple queries will accumulate in the output files. This
    > appears to usually not work (although in < 10% of my test runs it
    > actually does work). Below is code to reproduce the issue.

    Unfortunately the ultimate backing Hadoop HDFS OutputFormats don't have
    any idea of "updating" a file from the result of a MapReduce job. A
    Cascading Hfs Tap is actually supposed to throw an exception if created
    with SinkMode.UPDATE, so there may in fact be a bug somewhere in here,
    but the expected behavior is still not what you're looking for.

    You *can* however safely overlap multiple TemplateTap outputs if no two
    jobs write to the same file. This might not be optimal for your use
    case, but might get you closer.

    --
    Marshall T. Vandegrift
    Damballa Staff Software Engineer | 518.859.4559m
  • Sam Ritchie at Mar 7, 2012 at 8:50 pm
    Marshall's got it, here. I brought this up to Chris (Wensel) when I was
    first learning Cascalog, and he noted that UPDATE was more of a hint to the
    tap/scheme combo than an actual directive.

    If you need a SequenceFile format with the ability to update, check out the
    dfs-datastores <https://github.com/nathanmarz/dfs-datastores> and
    dfs-datastores-cascading <https://github.com/nathanmarz/dfs-datastores-cascading>
    projects.
    "Pail" is really what you need. dfs-datastores isn't documented, but I'm
    hoping I can talk Soren Macbeth into working on a guide with me :)

    Chris, should unsupported modes throw exceptions unless overridden?

    Cheers,
    Sam

    --
    Sam Ritchie, Twitter Inc
    703.662.1337
    @sritchie09

    (Too brief? Here's why! http://emailcharter.org)
  • Chris K Wensel at Mar 26, 2012 at 12:33 am
    > Chris, should unsupported modes throw exceptions unless overridden?

    Already have a note to see what can be done about that.
  • Andrew Xue at Mar 26, 2012 at 11:36 pm
    is there a tap for text files that can update? normally files are
    written out as part-nnnnn files -- it seems like if the tap can
    figure out the highest existing nnnnn value and use that to seed the
    "start index" for the filenames it could work.
  • Chris K Wensel at Mar 27, 2012 at 4:40 am
    you would need to convince hadoop not to barf because the directory already exists.

    that said, TemplateTap does bypass much of this hadoopness, so it probably could be done, but it could get expensive without some meta info lying around already. i would suggest this be a fork of TemplateTap if someone really needed it.

    ckw
    --
    Chris K Wensel
    http://concurrentinc.com
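
Marshall's suggestion earlier in the thread (overlapping TemplateTap outputs are safe as long as no two jobs write to the same file) could be approached by giving every run its own subdirectory inside the template. The following is only a minimal sketch under that assumption: `run-sink-template` and the run id "run-002" are hypothetical names that do not appear in the thread, and the Cascalog call sits inside a `comment` form because it is untested here. Whether the parent directory "foo" survives between runs still depends on how the sink mode is handled, which the thread leaves open.

```clojure
(require '[clojure.string :as str])

(defn run-sink-template
  "Hypothetical helper: builds a sink template that nests each run's
  output under its own subdirectory, e.g. \"%s/run-002/\"."
  [run-id]
  {:pre [(not (str/blank? run-id))
         ;; reject characters that would change the path or the format string
         (not (re-find #"[/%]" run-id))]}
  (str "%s/" run-id "/"))

(comment
  ;; Untested usage sketch; needs Cascalog on the classpath.
  ;; Same query as in the original post, but each run writes under
  ;; foo/<name>/<run-id>/ so two runs never target the same file.
  (use 'cascalog.api)
  (let [age [["alice" 280]
             ["bob" 330]]]
    (?<- (hfs-seqfile "foo"
                      :sinkmode :update
                      :sink-template (run-sink-template "run-002")
                      :templatefields ["?n"]
                      :outfields ["?a"])
         [?n ?a]
         (age :> ?n ?a))))
```

Reading the accumulated results back would then mean sourcing every run directory under "foo/alice" rather than "foo/alice" itself.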

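Andrew's idea of seeding the "start index" from the existing part-nnnnn files comes down to a small calculation over a directory listing. The sketch below shows only that calculation in plain Clojure; it is not an existing Cascading or Cascalog feature, the function names are hypothetical, and it assumes file names of the exact form `part-` followed by digits. Obtaining the listing and convincing Hadoop to write into an existing directory, the hard part Chris points out, is not addressed.

```clojure
(defn part-index
  "Returns the numeric suffix of a part-nnnnn file name, or nil when
  the name does not have that form (assumed naming scheme)."
  [file-name]
  (when-let [[_ digits] (re-matches #"part-(\d+)" file-name)]
    (Long/parseLong digits)))

(defn next-part-index
  "Returns the first unused part index for a directory listing:
  one past the highest existing index, or 0 if there are no part files."
  [file-names]
  (let [indices (keep part-index file-names)]
    (if (seq indices)
      (inc (apply max indices))
      0)))

(defn part-name
  "Formats an index as a zero-padded part file name."
  [index]
  (format "part-%05d" index))

;; Example: a directory that already holds two part files.
(part-name (next-part-index ["part-00000" "part-00001" "_SUCCESS"]))
;;=> "part-00002"
```

As Chris notes, doing this on every write could get expensive for large directories unless the highest index is kept as metadata somewhere.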
Discussion Overview
group: cascalog-user
categories: clojure, hadoop
posted: Mar 7, '12 at 6:52p
active: Mar 27, '12 at 4:40a
posts: 6
users: 5
website: clojure.org
irc: #clojure