I got the timeout error mentioned below -- after 600 seconds, the
attempt was killed and deemed a failure. I searched around about this
error, and one of the suggestions was to include "progress" statements in
the reducer -- it might be taking longer than 600 seconds and so timing
out. I added calls to context.progress() and context.setStatus(str) in the
reducer. Now it works fine -- there are no timeout errors.
But for a few jobs, it takes an awfully long time to move from "Map
100%, Reduce 99%" to "Reduce 100%". For some jobs it's 15 minutes, and for
some it was more than an hour. The reduce code is not complex -- a 2-level
loop and a couple of if-else blocks. The input size is also not huge: the
job that gets stuck for an hour at reduce 99% takes in about 130 files.
Some of them are 1-3 MB in size and a couple of them are 16 MB.
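If the reduce genuinely needs this long per record batch, another option (instead of, or in addition to, the progress calls) is raising the timeout itself. This is only a sketch; the value below is an example, not a recommendation:

```xml
<!-- mapred-site.xml: per-attempt progress timeout in milliseconds.
     Default in 0.20.x is 600000 (10 min); 0 disables the check. -->
<property>
  <name>mapred.task.timeout</name>
  <value>1800000</value>
</property>
```

That only masks the symptom, though -- it doesn't explain why a simple reducer over a few MB sits at 99% for an hour.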
Has anyone encountered this problem before? Any pointers? I use
Hadoop 0.20.2 on a Linux cluster of 16 nodes.
Thank you.
Regards,
Raghava.
On Thu, Apr 1, 2010 at 2:24 AM, Raghava Mutharaju wrote:
Hi all,
I am running a series of jobs one after another. While executing the
4th job, the job fails. It fails in the reducer --- the progress percentage
would be map 100%, reduce 99%. It gives out the following message
10/04/01 01:04:15 INFO mapred.JobClient: Task Id :
attempt_201003240138_0110_r_000018_1, Status : FAILED
Task attempt_201003240138_0110_r_000018_1 failed to report status for 602
seconds. Killing!
It makes several attempts to execute it but fails with a similar
message. I couldn't get anything from this error message and wanted to look
at the logs (located in the default dir, ${HADOOP_HOME}/logs). But I don't
find any files that match the timestamp of the job. I also did not find
history and userlogs in the logs folder. Should I look at some other place
for the logs? What could be the possible causes of the above error?
I am using Hadoop 0.20.2 and I am running it on a cluster with 16
nodes.
Thank you.
Regards,
Raghava.