Pig user mailing list, March 2012
Hi,
I am running a Pig query on around 500 GB of input data.
The block size is 128 MB, and the split size is the default 128 MB.
I have specified 16 reducers, and around 3800 mappers are running.
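
For reference, a reducer count like this is typically set in Pig either
globally or per operator; a minimal sketch (the alias and field names are
illustrative):

    -- global default for all reduce-side operators
    SET default_parallel 16;
    -- or per operator
    grouped = GROUP logs BY user_id PARALLEL 16;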

I observe that the shuffle phase is taking a long time to complete,
approximately 25 minutes per job.

Can anyone suggest how I can bring down the shuffling time? Is there any
property that I can tweak to improve performance?

Thanks & Regards,
Austin

  • Prashant Kommireddi at Mar 13, 2012 at 3:34 pm
    What is the number of reduce shuffle bytes for this job? Also, is this
    job CPU intensive on reducers or is it simple aggregation?

    Sent from my iPhone
  • Austin Chungath at Mar 14, 2012 at 9:15 am
    Hi Prashant,

    The number of reduce shuffle bytes is around 7.5 GB, and the shuffle is
    taking around 25 to 30 minutes to finish.
    No, the job is not CPU intensive, but it contains lots of GROUP BYs and JOINs.

    For more details about the job, see the screenshot of the job details at
    http://imageshack.us/f/69/jobscreenshot.jpg/
    (P.S. I am using Pig from trunk, on Hadoop 0.20.205.)

    Thanks,
    Austin
  • Prashant Kommireddi at Mar 14, 2012 at 9:01 pm
    Can you also provide numbers for Reduce Shuffle Bytes? And Combine Input
    and Output records? How many map slots do you have on the cluster? How many
    spills on the Map and Reduce side?

    If you are confident that shuffle time is the bottleneck, you could try
    tuning a couple of parameters:

    1. mapred.inmem.merge.threshold - the number of map outputs merged at
    once on the reduce side. Set it to 0 so that the in-memory merge is
    governed by mapred.job.reduce.input.buffer.percent instead.
    2. mapred.job.reduce.input.buffer.percent - you mentioned the reduce
    side is not memory intensive, in which case you can try increasing this
    to 0.70 or 0.80 (see the sketch after this list).
    3. Make sure the combiners are doing real work (aggregation). If not,
    you could turn the combiner off.
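
    A minimal sketch of applying the first two settings from within a Pig
    script; SET passes the properties through to the underlying Hadoop job,
    and the values are the ones suggested above:

        -- let the in-memory merge be governed by the input buffer
        -- rather than by a fixed segment count
        SET mapred.inmem.merge.threshold 0;
        -- keep up to 80% of the reducer heap for buffered map outputs
        SET mapred.job.reduce.input.buffer.percent 0.80;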

    You can also play with io.sort.mb and io.sort.factor, which really depends
    on how much memory you have allocated to each task (mapred.child.java.opts).
    Tuning depends on a lot of factors; you may have to dig deeper into the
    counters.
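
    For example, assuming 1 GB task heaps, a starting point might look like
    the sketch below; the exact values are illustrative and workload
    dependent:

        -- heap per task JVM (illustrative)
        SET mapred.child.java.opts '-Xmx1024m';
        -- a larger sort buffer and wider merges mean fewer map-side spills
        SET io.sort.mb 256;
        SET io.sort.factor 100;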

    Thanks,

    Prashant
  • Prashant Kommireddi at Mar 14, 2012 at 9:12 pm
    My bad, did not realize you had pasted Counters on your previous message.

    Please try playing with the reduce-side properties I mentioned in my
    previous reply (mapred.inmem... and mapred.job.reduce...).
    The combiner is doing a good amount of work, so please do not turn it off.
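
    For what it's worth, Pig only inserts a combiner when the projection
    after a GROUP is algebraic; a sketch of a shape that keeps the combiner
    active (alias and field names are illustrative):

        grouped = GROUP logs BY user_id;
        -- COUNT and SUM are algebraic, so Pig can apply a combiner
        counts = FOREACH grouped GENERATE group, COUNT(logs), SUM(logs.bytes);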

    Also, the reducers are not doing a whole lot of aggregation: the number of
    reduce input records approximately equals the number of reduce input groups.
    That means the job is more I/O intensive than memory intensive (assuming
    aggregation is the goal of the reducer and there is no heavy processing
    logic in there). There is nothing to tune there.

    It might be useful to play with the io.sort.* properties on the map side to
    reduce map-side spills.
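
    Since the plan also has lots of JOINs: when one side of a join fits in
    memory, a fragment-replicate join avoids the reduce-side shuffle for
    that join entirely. A sketch, assuming a small dimension-style input
    (aliases are illustrative; the small relation must be listed last):

        -- small_table is shipped to every map task; no shuffle for this join
        joined = JOIN big_table BY key, small_table BY key USING 'replicated';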

    Thanks,
    Prashant
