My streaming mappers frequently die with this error:
Task attempt_201103101623_12864_m_000032_1 failed to report status for 602 seconds. Killing!
A second attempt of the same task generally succeeds, but losing ten minutes while the first attempt hangs is very wasteful. My mapper (and reducer) are written in C++ and use pthreads. I start a reporter thread as soon as the task starts, and that thread sends periodic counter and status messages to cout using the streaming reporter syntax, but I still get these errors occasionally.
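For reference, the heartbeat thread I'm describing is roughly the sketch below (the names and the 60-second interval are just illustrative, not my exact code; this sketch emits the reporter lines on stderr, which is where the streaming docs have them read from):

    // Illustrative heartbeat thread for a Hadoop Streaming task.
    // reporter_main and the 60 s interval are made-up names/values.
    #include <pthread.h>
    #include <unistd.h>
    #include <iostream>

    static void* reporter_main(void*)
    {
        for (;;) {
            // Status line plus a counter bump so the TaskTracker sees progress.
            std::cerr << "reporter:status:mapper alive" << std::endl;
            std::cerr << "reporter:counter:MyMapper,Heartbeats,1" << std::endl;
            sleep(60);  // well under the 600-second task timeout
        }
        return NULL;
    }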
Also, the task logs for these failed mappers are always either empty or unretrievable. They don't show ten minutes of actual work on the worker thread while the reporter should have been reporting; rather, they are empty (or, as I said, totally unretrievable). It seems to me that Hadoop is failing to even start these tasks. If the C++ binary had actually been kicked off, the logs would show SOME kind of output on cerr even if the reporter thread had never started, because I write to cerr before starting the reporter thread, in fact before any pthread work at all, right at the entry to main(). Yet the logs are empty, so I really think Hadoop isn't even starting the binary, but then waits ten minutes to kill the task anyway.
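To be concrete, the ordering at the top of my main() is roughly the sketch below (again illustrative, assuming the reporter_main heartbeat function from the previous sketch), which is why an empty stderr log tells me the binary never launched:

    #include <pthread.h>
    #include <iostream>

    static void* reporter_main(void*);  // heartbeat function as sketched above

    int main()
    {
        // Very first statement: prove the binary launched, before any pthread work.
        std::cerr << "mapper: entered main()" << std::endl;

        pthread_t reporter;
        pthread_create(&reporter, NULL, reporter_main, NULL);

        // ... read records from stdin, write key/value pairs to stdout ...
        return 0;
    }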
Has anyone else seen anything like this?
Thanks.