I run into task failures when I run several jobs on my 10-node cluster.
Before a job fails, I start seeing warnings of the following type:
WARN mapred.JobClient: Error reading task
INFO mapred.JobClient: Task Id :
attempt_201001221644_0001_r_000001_2, Status : FAILED
java.io.IOException: Task process exit with nonzero status of 1.
After searching the mailing list, I found some information suggesting
that this could be due to DNS name-resolution failures. It was
suggested that one add the IP addresses to the /etc/hosts file
(e.g., 127.0.0.1 machine.domainname) to make sure that the jobtracker
and the tasktrackers can locate each other. I did that, but the
problem still occurs if I run too many jobs.
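For reference, the /etc/hosts entries I added look roughly like this
(the hostnames and addresses below are placeholders, not my real ones):

```
# /etc/hosts on every node: map each cluster hostname to its static IP
# so lookups can be answered locally (placeholder names and addresses)
192.168.1.101   node01.cs.example.edu   node01
192.168.1.102   node02.cs.example.edu   node02
192.168.1.103   node03.cs.example.edu   node03
# ... one line per node, ten in total
```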
I believe I am hitting a DNS resolution quota somewhere, because my
cluster has no local DNS server and contacts the university servers
for name resolution.
When this problem occurs, restarting the cluster does not help; the
last time, the problem went away only after 24 hours (I am assuming
the admins replenish the quotas daily).
My questions are:
1) Why does Hadoop look up http://<machine.domainname>:port...
instead of http://ipaddress:port, even when I provide IP addresses in
/etc/hosts as well as in the conf/slaves file?
2) Has anyone faced similar problems? How did you resolve them?
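To clarify what I mean in question 1, my conf/slaves file currently
lists the slave nodes by IP address rather than by hostname
(placeholder addresses shown):

```
# conf/slaves: one slave node per line, listed by IP
192.168.1.101
192.168.1.102
192.168.1.103
# ... one line per node, ten in total
```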
I understand that the problem is not directly related to Hadoop but to
the way Linux does DNS name resolution (and the way things are set up
on my end). As far as I can tell, my Hadoop jobs generate enough DNS
queries to exhaust the allowed query quota over time. How do I reduce
the number of DNS queries Hadoop makes?