Hi all,

Two of the data nodes in our 60+ node cluster have had disk errors.
We did not notice it until we hit unusual job failures after several
phases of chained M/R jobs (i.e. 1st M/R - 2nd M/R - 3rd M/R ...) and investigated.
The error message from the JobTracker was:

2009-..-.. INFO org.apache.hadoop.mapred.TaskInProgress: Error from
attempt_ ...: java.io.IOException: Could not obtain block: blk_.....
file=/user/.../part-00173

The file "part-00173" was to be generated as output file of some earlier
phase of M/R.
We've tried to look into the file part-00173 of HDFS to have found a message
at the bottom of screen (<another_data_node>:50075) saying

java.io.IOException: No nodes contain this block

Why is this? When does replication actually happen?
The file "part-00173" should have been moved out of "attempt_..." after the job
ended successfully, and should have had the configured number of replicas, right?
It seems that part-00173 was not replicated enough (see the fsck sketch below).
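
Also, is fsck the right way to check how many replicas the NameNode actually
holds for that file? Something like this is what I have in mind (just a rough
sketch against the 0.20-era hadoop CLI, using the same elided path as above):

  hadoop fsck /user/.../part-00173 -files -blocks -locations

My understanding is that if a block lists fewer locations than dfs.replication,
or is reported as MISSING, the NameNode never re-replicated it after the disks
failed. Please correct me if that is wrong.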

What is worse, there are no ERROR-level messages in any log other than the
DataNode's.
Our monitoring system currently reports any ERROR-level message from the
NameNode and JobTracker logs. But if this (probably) critical error is only
logged at INFO level, we need to redesign our monitoring policy. To catch disk
failures, do we also need to scan the DataNode logs?
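
In the meantime, a rough sketch of what we might do as a stopgap (assuming the
default log file names under $HADOOP_LOG_DIR; please correct me if there is a
better way):

  hadoop dfsadmin -report | grep "Datanodes available"
  grep -E "ERROR|WARN" $HADOOP_LOG_DIR/hadoop-*-datanode-*.log | tail

The first line should show dead nodes as the NameNode sees them, and the second
would at least surface the disk errors that the DataNodes logged locally.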

Any help would be appreciated.


Thanks,
Manhee
