Hi,

I have a flow that uses a number of subqueries. All the subqueries
eventually feed into a final "master" query, and I run that master
query wrapped in (with-job-conf {"mapred.reduce.tasks" N} (?- out
query)).

I am using mapred.reduce.tasks to control the output file size and to
avoid small files. The problem is that this setting limits the number of
reduce tasks across the entire flow, including all the upstream
subqueries, and performance really suffers as a result.

Is there some way to apply a job-conf only to the last part of the
flow? Thanks,
Andy


  • Sam Ritchie at Dec 15, 2011 at 9:01 am
    Andy,

    You can configure your tap to limit the number of reducers it uses on the
    final write like this:

    (hfs-textline "path/to/output" :sinkparts 5) ;; this tap only uses 5 reducers!

    See the docs (http://nathanmarz.github.com/cascalog/cascalog.api-api.html#cascalog.api/hfs-tap)
    for some other helpful options. This behavior works for all of these taps:

    (for [prefix ["hfs" "lfs"], suffix ["textline" "seqfile"]]
      (str prefix "-" suffix))
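
For readers unfamiliar with `for`, here is a minimal sketch of what the expression above evaluates to. It is plain Clojure and needs no Cascalog; the var name `tap-names` is only an illustrative assumption.

```clojure
;; Pair each filesystem prefix with each file-format suffix.
;; `tap-names` is a hypothetical name used only for this illustration.
(def tap-names
  (for [prefix ["hfs" "lfs"]
        suffix ["textline" "seqfile"]]
    (str prefix "-" suffix)))

(println tap-names)
;; prints (hfs-textline hfs-seqfile lfs-textline lfs-seqfile)
```

In other words, the four tap constructors referred to are hfs-textline, hfs-seqfile, lfs-textline and lfs-seqfile.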

    :)

    Cheers,
    Sam

    --
    Sam Ritchie, Twitter Inc
    703.662.1337
    @sritchie09

    (Too brief? Here's why! http://emailcharter.org)
  • Andrew Xue at Dec 15, 2011 at 10:55 am
    Exactly what I wanted!

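Putting the thread together, the following is a rough sketch of how the suggested tap option might replace the with-job-conf wrapper from the original question. It assumes Cascalog is on the classpath; the namespace `example.sinkparts`, the function `run-master`, and its parameters are hypothetical names, and the value 5 is simply carried over from the reply above.

```clojure
(ns example.sinkparts
  (:use cascalog.api))   ; assumption: Cascalog is available as a dependency

(defn run-master
  "Hypothetical helper: executes master-query and writes text lines to
  out-path. There is no with-job-conf wrapper, so mapred.reduce.tasks is
  not overridden for the upstream subqueries; :sinkparts is set only on
  the sink tap, as suggested in the thread."
  [master-query out-path]
  (?- (hfs-textline out-path :sinkparts 5)
      master-query))
```

A call such as (run-master query "path/to/output") would then stand in for the (with-job-conf ... (?- out query)) form, with the reducer limit attached to the output tap rather than to the whole flow.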
Discussion Overview
group: cascalog-user
categories: clojure, hadoop
posted: Dec 15, '11 at 8:51a
active: Dec 15, '11 at 10:55a
posts: 3
users: 2
website: clojure.org
irc: #clojure
