Streaming mappers frequently time out
My streaming mappers frequently die with this error:

Task attempt_201103101623_12864_m_000032_1 failed to report status for 602 seconds. Killing!

A repeated attempt of the same task generally succeeds, but it's very wasteful that the task is held up for ten minutes first. My mapper (and reducer) are C++ and use pthreads. I start a reporter thread as soon as the task starts and that reporter thread sends periodic reporter and status messages to cout using the streaming reporter syntax, but I still get these errors occasionally.

Also, the task logs for such failed mappers are always either empty or unretrievable. They don't show ten minutes of actual work on the worker thread while the reporter should have been reporting; rather, they are empty (or, like I said, totally unretrievable). It seems to me that Hadoop is failing to even start these tasks. If the C++ binary had actually been kicked off, the logs would show SOME kind of output on cerr even if the reporter thread had never started properly, because I write to cerr before starting the reporter thread, in fact before any pthread-related wonkery at all, right from the entry to main(). Yet the logs are empty, so I really think Hadoop isn't even starting the binary, but then waits ten minutes to kill the task anyway.
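
For reference, here is a minimal sketch of the reporter-thread pattern described above, with a trivial line-oriented identity mapper standing in for the real work; names such as run_reporter are illustrative. Note that, per the correction later in the thread, the reporter:status: and reporter:counter: directives must be written to cerr (stderr), which is where Hadoop Streaming parses them.

    // Build with: g++ -pthread -o my_mapper my_mapper.cpp
    #include <pthread.h>
    #include <unistd.h>
    #include <iostream>
    #include <string>

    static volatile bool g_done = false;  // set by main() when mapping finishes

    // Heartbeat thread: emit a streaming status line to stderr once a minute
    // so the TaskTracker sees progress and never reaches the timeout.
    static void* run_reporter(void*) {
        while (!g_done) {
            std::cerr << "reporter:status:heartbeat" << std::endl;  // endl flushes
            sleep(60);
        }
        return NULL;
    }

    int main() {
        // Log before any pthread work, so an empty task log really does
        // imply the binary was never launched.
        std::cerr << "mapper started" << std::endl;

        pthread_t reporter;
        pthread_create(&reporter, NULL, run_reporter, NULL);

        // Identity map: copy stdin to stdout, counting records.
        std::string line;
        long records = 0;
        while (std::getline(std::cin, line)) {
            std::cout << line << "\n";
            ++records;
        }

        // Streaming counter directive, also on stderr.
        std::cerr << "reporter:counter:MyMapper,Records," << records << std::endl;

        g_done = true;
        pthread_join(reporter, NULL);
        return 0;
    }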

Has anyone else seen anything like this?

Thanks.

________________________________________________________________________________
Keith Wiley [email protected] keithwiley.com music.keithwiley.com

"Yet mark his perfect self-contentment, and hence learn his lesson, that to be
self-contented is to be vile and ignorant, and that to aspire is better than to
be blindly and impotently happy."
-- Edwin A. Abbott, Flatland
________________________________________________________________________________


  • Keith Wiley at Mar 23, 2011 at 7:06 pm

    On Mar 23, 2011, at 10:33 AM, Keith Wiley wrote:

    I start a reporter thread as soon as the task starts and that reporter thread sends periodic reporter and status messages to cout using the streaming reporter syntax, but I still get these errors occasionally.

    Sorry, I meant to say my reporter thread sends counter and status messages to cerr, not cout.

    ________________________________________________________________________________
    Keith Wiley [email protected] keithwiley.com music.keithwiley.com

    "I do not feel obliged to believe that the same God who has endowed us with
    sense, reason, and intellect has intended us to forgo their use."
    -- Galileo Galilei
    ________________________________________________________________________________
  • Jim Falgout at Mar 23, 2011 at 7:16 pm
    I've run into that before. Try setting mapreduce.task.timeout. I seem to remember that setting it to zero may turn off the timeout, but that can of course be dangerous if you have a runaway task. The default is 600 seconds ;-)

    Check out http://hadoop.apache.org/mapreduce/docs/current/mapred-default.html. It lists a bunch of MapReduce properties.
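
    For a streaming job the timeout can also be set per job as a generic option on the command line. A sketch, assuming a 0.20-era cluster where the property still carries its old name mapred.task.timeout (mapreduce.task.timeout is the 0.21+ name); the jar and input/output paths are illustrative, and the -D generic option must precede the streaming-specific options. The value is in milliseconds, so the 600-second default is 600000, and 0 disables the timeout entirely:

        hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
            -D mapred.task.timeout=0 \
            -input /user/me/in -output /user/me/out \
            -mapper ./my_mapper -reducer ./my_reducer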

  • Keith Wiley at Mar 23, 2011 at 6:57 pm
    Maybe I could just turn it down to two or three minutes. It wouldn't fix the problem where the task doesn't start, but it would kill it and restart it more quickly.
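
    Since the property takes milliseconds, two or three minutes would be a value of 120000 or 180000, e.g. -D mapred.task.timeout=180000 per the command-line sketch in the previous reply.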

    Thanks.

    ________________________________________________________________________________
    Keith Wiley [email protected] keithwiley.com music.keithwiley.com

    "I used to be with it, but then they changed what it was. Now, what I'm with
    isn't it, and what's it seems weird and scary to me."
    -- Abe (Grandpa) Simpson
    ________________________________________________________________________________
