I am running a 4-node cluster with hadoop-0.20.2. I suddenly ran into a situation where every task scheduled on 2 of the 4 nodes fails.
It seems like the child JVM crashes. There are no child logs under logs/userlogs. The TaskTracker log shows this:
2010-06-14 09:34:12,714 INFO org.apache.hadoop.mapred.JvmManager: In JvmRunner constructed JVM ID: jvm_201006091425_0049_m_-946174604
2010-06-14 09:34:12,714 INFO org.apache.hadoop.mapred.JvmManager: JVM Runner jvm_201006091425_0049_m_-946174604 spawned.
2010-06-14 09:34:12,727 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201006091425_0049_m_-946174604 exited. Number of tasks it ran: 0
2010-06-14 09:34:12,727 WARN org.apache.hadoop.mapred.TaskRunner: attempt_201006091425_0049_m_003179_0 Child Error
java.io.IOException: Task process exit with nonzero status of 1.
At some point I simply renamed logs/userlogs to logs/userlogsOLD. A new job recreated logs/userlogs, and no error occurred anymore on this host.
The permissions of userlogs and userlogsOLD are exactly the same. userlogsOLD contains about 378M in 132747 files. When I copy the content of userlogsOLD back into userlogs, the tasks on that node start failing again.
- This looks to me like a problem with too many files in one folder - any thoughts on this?
- Is the content of logs/userlogs cleaned up by Hadoop regularly?
- The stdout log files of the tasks do not exist, and the TaskTracker's .out files contain no specific message (other than the one posted above) - is there any other log file where an error message could be found?
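Regarding the second question, as far as I can tell hadoop-0.20.2 has a mapred.userlog.retain.hours property (default 24) that is supposed to control how long user logs are kept, so maybe the cleanup is simply not keeping up here. I assume shortening it in mapred-site.xml would look like this:

```xml
<!-- mapred-site.xml: shorten user-log retention; the default is 24 hours -->
<property>
  <name>mapred.userlog.retain.hours</name>
  <value>6</value>
</property>
```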
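Regarding the first question, one thing I considered checking is the raw entry count of the directory; ext3, for example, caps a single directory at roughly 32000 subdirectories, and hitting that limit would make the creation of new per-attempt log directories (and thus the child JVM spawn) fail. A quick count like this should show whether I am near such a limit (the path is just where userlogs sits in my setup):

```shell
#!/bin/sh
# Path to the userlogs directory; adjust to your own Hadoop log location.
USERLOGS=logs/userlogs
# Count the entries directly under userlogs. On ext3 a single directory
# can hold at most ~32000 subdirectories, so a count near that would
# explain why task-attempt log directories can no longer be created.
find "$USERLOGS" -maxdepth 1 -mindepth 1 | wc -l
```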