I am also looking for something like this in JCascalog. For example: I have
one data set that I need to parse. If the foo condition is satisfied, the
data should be pushed to a foo variable (some intermediate data store
instead of a tap), and if the bar condition is satisfied, the data should be
pushed to a bar variable.

This is something like a split sub-assembly in Cascading.

Could you suggest how I can do that in JCascalog?


Thanks
Sourabh
On Saturday, June 25, 2011 8:49:50 AM UTC+5:30, Evan Gamble wrote:

Is there a way in Cascalog to output to multiple output taps within
the same job? For example, tuples for which predicate foo matches
would go to output tap foo-tap, and tuples for which predicate bar
matches go to bar-tap.

I can do it by first writing to an intermediate tap, then reading from
it in multiple jobs, but that seems unnecessarily complex. Here's some
code I wrote that takes the intermediate tap/multiple jobs approach,
but I'm hoping there's a better way.

The intermediate tap in the code below is 'extr-tap'.

(defn extract-from-urls
  "Takes a directory of tabbed files where URLs are the second field
  (after UUID), fetches xhtml either from
  dcache or the local cache (depending on doc/*use-local-cache*),
  runs all extractors on the xhtml,
  and writes JSON strings with extractor name/values and URL to json-dir.
  URLs with parse errors are written to parse-error-dir.
  URLs not in dcache are written to cache-miss-dir.
  Other errors are written to trap-dir.
  If out-prefix is present it is prepended to the output paths."
  [url-dir json-dir parse-error-dir cache-miss-dir trap-dir & [out-prefix]]
  (cascalog.io/with-fs-tmp [_ tmp-dir]
    (let [extr-tap (hfs-seqfile tmp-dir)
          json-tap (hfs-textline (str out-prefix json-dir))
          parse-error-tap (hfs-textline (str out-prefix parse-error-dir))
          cache-miss-tap (hfs-textline (str out-prefix cache-miss-dir))]
      (let [extr-query (make-extractor-query url-dir (str out-prefix trap-dir))]
        (?<- extr-tap [?uuid ?url !json !parse-error !cache-miss]
             (extr-query ?uuid ?url !json !parse-error !cache-miss)))
      (?- json-tap
          (<- [?uuid ?url ?json] (extr-tap ?uuid ?url ?json _ _))
          parse-error-tap
          (<- [?uuid ?url ?parse-error] (extr-tap ?uuid ?url _ ?parse-error _))
          cache-miss-tap
          (<- [?uuid ?url] (extr-tap ?uuid ?url _ _ ?cache-miss))))))
--
You received this message because you are subscribed to the Google Groups "cascalog-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cascalog-user+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


  • Sourabh Chaki at Aug 6, 2013 at 11:02 am
    Does anyone have any idea how we can implement multiple taps in JCascalog?
    Thanks
    Sourabh

  • Robin Kraft at Aug 6, 2013 at 2:41 pm
    In regular Cascalog you could use a template tap. That may help you figure out how to do it in JCascalog:

    https://groups.google.com/d/msg/cascalog-user/Ceq-gBFmjDI/pP1ED0SrFx8J


Discussion Overview
group: cascalog-user
categories: clojure, hadoop
posted: May 31, '13 at 9:38a
active: Aug 6, '13 at 2:41p
posts: 3
users: 2
website: clojure.org
irc: #clojure

2 users in discussion

Sourabh Chaki: 2 posts Robin Kraft: 1 post
