Two of the data nodes in our 60+ node cluster have had disk errors.
We did not notice until we ran into an unusual job failure in a chain of
several M/R jobs (i.e. 1st M/R - 2nd M/R - 3rd M/R ...) and investigated.
The error message in the JobTracker log was:
2009-..-.. INFO org.apache.hadoop.mapred.TaskInProgress: Error from
attempt_ ...: java.io.IOException: Could not obtain block: blk_.....
The file "part-00173" was supposed to be generated as an output file of an
earlier M/R phase.
When we tried to browse the file part-00173 in HDFS, we found a message
at the bottom of the screen (<another_data_node>:50075) saying:
java.io.IOException: No nodes contain this block
Why is this? When does replication actually happen?
The file "part-00173" must have been moved out of "attempt_..." after the
job ended successfully, and it should have had the configured number of
replicas by then, right?
It seems that part-00173 was not replicated enough.
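
One way to check how many replicas each block of the file really has would
be fsck. Below is a minimal sketch in Python that wraps the standard
"hadoop fsck <path> -files -blocks -locations" command. It assumes the
hadoop command is on PATH; the function names are made up for illustration,
and the path in the usage comment is a placeholder, not our real output
directory.

```python
import subprocess


def fsck_command(hdfs_path):
    """Build the fsck call that lists files, their blocks, and replica locations."""
    return ["hadoop", "fsck", hdfs_path, "-files", "-blocks", "-locations"]


def run_fsck(hdfs_path):
    """Run fsck and return its text output (requires the hadoop CLI on PATH)."""
    result = subprocess.run(
        fsck_command(hdfs_path), capture_output=True, text=True
    )
    return result.stdout


# Example (placeholder path, substitute the real output directory):
#   print(run_fsck("/path/to/output/part-00173"))
```

The output should list each block of the file together with the data nodes
that hold a replica, so a block with too few (or zero) locations would stand
out.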
What is worse, there are no ERROR-labelled messages in the logs, only the
INFO message above.
Our system currently reports every ERROR-labelled message from the NameNode
and JobTracker logs. But if this (probably) critical error is labelled as
INFO, we need to re-design our monitoring policy. In order to detect disk
errors, do we also need to scan the DataNode logs?
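
To make the monitoring question concrete, here is a minimal sketch (Python,
not what we run today) of a log scan that keys on the message text as well
as the log level. The two patterns are taken from the messages quoted above.
The function name, and the assumption that the level appears as " ERROR "
surrounded by spaces, are mine and only for illustration.

```python
import re
import sys

# Message fragments that point at missing blocks, whatever the log level.
# Both come from the messages quoted in this post; extend the list as needed.
SUSPECT_PATTERNS = [
    re.compile(r"Could not obtain block"),
    re.compile(r"No nodes contain this block"),
]


def find_suspect_lines(lines):
    """Return (line_number, text) pairs that are ERROR-level or match a pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        # Assumption: the level is printed as a separate word, e.g. " ERROR ".
        is_error = " ERROR " in text
        is_suspect = any(p.search(text) for p in SUSPECT_PATTERNS)
        if is_error or is_suspect:
            hits.append((number, text))
    return hits


if __name__ == "__main__":
    # Usage: python scan_log.py <logfile> [<logfile> ...]
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            for number, text in find_suspect_lines(handle):
                print(f"{path}:{number}: {text}")
```

Something like this could be pointed at the JobTracker and NameNode logs, and
at the DataNode logs too if that turns out to be necessary.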
Any help will be appreciated.