Hi!

I have a problem with one of my reducers getting 3 times as much
data as the other 15 reducers, causing longer total runtime per job.

What would be the best way to debug this? I'm guessing I'm outputting
keys that somehow fool the partitioner. Can I tell Hadoop to save the
map outputs per reducer so I can inspect what's in them?

Thanks,
\EF
--
Erik Forsberg <forsberg@opera.com>
Developer, Opera Software - http://www.opera.com/


  • Amogh Vasekar at Jan 20, 2010 at 12:25 pm
    Can I tell hadoop to save the map outputs per reducer to be able to inspect what's in them
    You can set keep.task.files.pattern to a regex matching your job/task ids; Hadoop will then keep those tasks' map output on local disk for inspection. But be aware this can eat up a lot of local disk space.
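    A minimal sketch of how that property could be passed on the command line, for jobs whose driver uses ToolRunner/GenericOptionsParser. The jar name, class name, paths, and the regex itself are hypothetical placeholders; match the regex against your own task attempt ids:

    ```shell
    # Keep the intermediate (map-side) files for tasks whose id matches
    # the regex, so the per-partition spill files can be inspected on the
    # tasktracker's local disk afterwards.
    hadoop jar myjob.jar MyJob \
        -D keep.task.files.pattern='.*attempt_.*_m_.*' \
        input/ output/
    ```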

    The problem is most likely that your data (more specifically, your map output keys) is skewed: a disproportionate number of keys hash to the same partition id, and hence end up on one reducer. Are you implementing a join? If not, writing a custom partitioner would help.
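    One way to check for this without touching the cluster is to run a sample of your map output keys through the same formula Hadoop's default HashPartitioner uses, `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`, and count how many land in each partition. A self-contained sketch (the key sample below is synthetic; feed in your real keys):

    ```java
    import java.util.ArrayList;
    import java.util.List;

    public class SkewCheck {
        // Same formula as Hadoop's default HashPartitioner.
        static int partition(String key, int numReducers) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
        }

        public static void main(String[] args) {
            int numReducers = 16; // the 16 reducers from the question
            // Hypothetical sample of map output keys; replace with a
            // sample of your actual keys.
            List<String> keys = new ArrayList<>();
            for (int i = 0; i < 1600; i++) {
                keys.add("user-" + (i % 100));
            }

            int[] counts = new int[numReducers];
            for (String k : keys) {
                counts[partition(k, numReducers)]++;
            }

            // A large max/min gap means the default partitioner will
            // load reducers unevenly for this key distribution.
            int max = 0, min = Integer.MAX_VALUE;
            for (int c : counts) {
                max = Math.max(max, c);
                min = Math.min(min, c);
            }
            System.out.println("max=" + max + " min=" + min);
        }
    }
    ```

    If the counts come out badly skewed, a custom Partitioner can replace `partition()` with something that spreads the hot keys, e.g. by salting them.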

    Amogh

    On 1/20/10 5:33 PM, "Erik Forsberg" wrote:


Discussion Overview
group: common-user
categories: hadoop
posted: Jan 20, '10 at 12:04p
active: Jan 20, '10 at 12:25p
posts: 2
users: 2
website: hadoop.apache.org...
irc: #hadoop

2 users in discussion: Amogh Vasekar (1 post), Erik Forsberg (1 post)
