FAQ
How do you replicate this in JCascalog? That is, how do you use the same
Subquery (possibly a source tap) in two different queries that each apply
predicates and then sink the data to two different output taps?

--
You received this message because you are subscribed to the Google Groups "cascalog-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cascalog-user+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

  • Sourabh Chaki at May 31, 2013 at 9:38 am
    I am also looking for something like this in JCascalog. For example: I have
    one data set that I need to parse. If the foo condition is satisfied, push
    the data to a foo variable (some intermediate data store instead of a tap),
    and if the bar condition is satisfied, push the data to a bar variable.

    This is something like the split sub-assembly in Cascading.

    Please suggest how I can do that in JCascalog.


    Thanks
    Sourabh
    On Saturday, June 25, 2011 8:49:50 AM UTC+5:30, Evan Gamble wrote:

    Is there a way in Cascalog to output to multiple output taps within
    the same job? For example, tuples for which predicate foo matches
    would go to output tap foo-tap, and tuples for which predicate bar
    matches go to bar-tap.

    I can do it by first writing to an intermediate tap, then reading from
    it in multiple jobs, but that seems unnecessarily complex. Here's some
    code I wrote that takes the intermediate tap/multiple jobs approach,
    but I'm hoping there's a better way.

    The intermediate tap in the code below is 'extr-tap'.

    (defn extract-from-urls
      "Takes a directory of tabbed files where URLs are the second field
      (after UUID), fetches xhtml either from
      dcache or the local cache (depending on doc/*use-local-cache*),
      runs all extractors on the xhtml,
      and writes JSON strings with extractor name/values and URL to json-dir.
      URLs with parse errors are written to parse-error-dir.
      URLs not in dcache are written to cache-miss-dir.
      Other errors are written to trap-dir.
      If out-prefix is present it is prepended to the output paths."
      [url-dir json-dir parse-error-dir cache-miss-dir trap-dir & [out-prefix]]
      (cascalog.io/with-fs-tmp [_ tmp-dir]
        (let [extr-tap        (hfs-seqfile tmp-dir)
              json-tap        (hfs-textline (str out-prefix json-dir))
              parse-error-tap (hfs-textline (str out-prefix parse-error-dir))
              cache-miss-tap  (hfs-textline (str out-prefix cache-miss-dir))]
          (let [extr-query (make-extractor-query url-dir (str out-prefix trap-dir))]
            (?<- extr-tap [?uuid ?url !json !parse-error !cache-miss]
                 (extr-query ?uuid ?url !json !parse-error !cache-miss)))
          (?- json-tap
              (<- [?uuid ?url ?json] (extr-tap ?uuid ?url ?json _ _))
              parse-error-tap
              (<- [?uuid ?url ?parse-error] (extr-tap ?uuid ?url _ ?parse-error _))
              cache-miss-tap
              (<- [?uuid ?url] (extr-tap ?uuid ?url _ _ ?cache-miss))))))
  • Sourabh Chaki at Aug 6, 2013 at 11:02 am
    Does anyone have any idea how we can implement multiple taps in JCascalog?
    Thanks
    Sourabh

  • Robin Kraft at Aug 6, 2013 at 2:41 pm
    In regular Cascalog you could use a template tap. That might guide you in figuring it out with JCascalog.

    https://groups.google.com/d/msg/cascalog-user/Ceq-gBFmjDI/pP1ED0SrFx8J

  • Sam Ritchie at Aug 6, 2013 at 3:20 pm
    You can bind the subquery to a variable, then use that variable in the
    two other subqueries. Can you send some code you've tried?

    In Cascalog, you can execute multiple queries with

    (?- <first-tap> <first-subq> <second-tap> <second-subq> ...)
    --
    Sam Ritchie, Twitter Inc
    703.662.1337
    @sritchie

  • Tomas Svarovsky at Nov 17, 2013 at 4:23 am
    Hey Guys

    I am hitting a similar problem. I have one JSON file, and the content of
    different keys should go to different files. I thought that something like
    this could work.

    (defn parse-profile-json-query
      [in]
      (<- [?customer-id ?json-out]
          (in ?stuff)
          (json-in ?stuff :> ?json-out)
          (filter-customer-id ?json-out :> ?customer-id)))

    For brevity, I am showing just one of the queries:

    (defn purchases-query
      [in]
      (let [items-counter (count_key ["items"])
            filter-fields (filter-fields-op [["_id" "$oid"]
                                             ["price"]
                                             ["qty"]
                                             ["time" "$date"]
                                             ["message" "_id"]])]
        (<- [!csv-out]
            (in _ ?data-in)
            (filter-customer-id ?data-in :> !customer-id)
            (filter-purchases ?data-in :> ?purchases)
            (explode ?purchases :> ?purchase)
            (items-counter ?purchase :> !items-count)
            (filter-fields ?purchase :> !id !price !qty !time !message)
            (csv-out !customer-id !items-count !id !price !qty !time !message
                     :> !csv-out))))


    (let [json-input (parse-profile-json-query profile-input-path)]
      (?- purchases-output-path (purchases-query json-input)
          purchased-items-output-path (purchased-items-query json-input)
          customer-output-path (customer-query json-input)))

    I thought that all the queries would be performed in one map task, but this
    does not seem to be the case. From my very brief check of the EMR
    monitoring, the tasks seem to grow linearly with the number of executed
    queries. I wanted to use explain to verify that the initial JSON parsing is
    really reused, but that seems to work only on one query.

    Any suggestions on how I should go about improving, profiling, or debugging
    the issue?

    Thanks Tomas

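
Note: the pattern discussed throughout this thread (one shared input, several predicate-specific outputs) can be modelled in plain Java. The sketch below is only an illustration under stated assumptions: it does not use JCascalog or Cascading, the names `SplitSketch`, `split`, `foo-sink`, and `bar-sink` are hypothetical and made up for the example, and it says nothing about how Cascalog actually plans or schedules its jobs. It simply reads a shared source once and routes each record to every sink whose predicate accepts it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Plain-JDK model of "one source, many predicate-specific sinks".
// All names here are hypothetical; this is not the JCascalog API.
public class SplitSketch {

    /**
     * Iterates over the shared source once and appends each record to
     * every named sink whose predicate accepts it. Records that match
     * no predicate are dropped.
     */
    public static <T> Map<String, List<T>> split(
            Iterable<T> source, Map<String, Predicate<T>> routes) {
        // One in-memory list per route stands in for an output tap.
        Map<String, List<T>> sinks = new LinkedHashMap<>();
        for (String name : routes.keySet()) {
            sinks.put(name, new ArrayList<>());
        }
        for (T record : source) {
            for (Map.Entry<String, Predicate<T>> route : routes.entrySet()) {
                if (route.getValue().test(record)) {
                    sinks.get(route.getKey()).add(record);
                }
            }
        }
        return sinks;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("foo:1", "bar:2", "foo:3", "baz:4");

        // Insertion order is kept so the printed result is deterministic.
        Map<String, Predicate<String>> routes = new LinkedHashMap<>();
        routes.put("foo-sink", line -> line.startsWith("foo"));
        routes.put("bar-sink", line -> line.startsWith("bar"));

        // prints {foo-sink=[foo:1, foo:3], bar-sink=[bar:2]}
        System.out.println(split(lines, routes));
    }
}
```

In this model the source is traversed a single time regardless of how many sinks are registered, which is the behaviour the posters above were hoping for. Whether a given Cascalog or JCascalog job achieves the same sharing has to be checked against the job's actual execution plan.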

Discussion Overview
group: cascalog-user
categories: clojure, hadoop
posted: May 29, '13 at 6:40p
active: Nov 17, '13 at 4:23a
posts: 6
users: 5
website: clojure.org
irc: #clojure
