Project: Hadoop Core
Issue Type: Bug
Affects Versions: 0.20.0, 0.19.1, 0.19.0, 0.18.3
We see this frequently in our application, hbase, where dfsclients are held open across long periods of time. It would seem that any hiccup fetching a block becomes a permanent black mark: even after the serving datanode recovers from a temporary slowness or outage, the dfsclient never seems to pick up on this fact. Our perception is that the dfsclient is too sensitive to the vagaries of cluster comings and goings and succumbs too easily, especially given that a fresh dfsclient has no problem fetching the designated block.
Chatting with Raghu and Hairong yesterday, Hairong pointed out that the dfsclient frequently updates its list of block locations -- if a block has moved or if a datanode is dead, then the dfsclient should be keeping up with the changing state of the cluster (I see this happening in DFSClient#chooseDatanode on failure). But Raghu looks to have put his finger on our problem by noticing that the failures count is only ever incremented -- never decremented. ANY three failures, no matter how many blocks are in the file and even if a block that failed once now works, are enough for the DFSClient to start throwing "Could not obtain block:...".
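To make the behaviour concrete, here is a rough sketch of a counter that is only ever incremented. It is a hypothetical simplification, not the actual DFSClient code: the class name NaiveBlockReader, the method recordFetch and the constant MAX_FAILURES are invented for illustration, and the limit of 3 is taken from the "three failures" described above.

```java
// Hypothetical simplification of the behaviour described above;
// this is NOT the real DFSClient implementation.
class NaiveBlockReader {
    // Assumed limit, taken from the "three failures" mentioned in the report.
    static final int MAX_FAILURES = 3;

    // One counter for the whole open stream; it is never decremented or reset.
    private int failures = 0;

    /**
     * Records the outcome of one block fetch attempt.
     * Returns false once the stream has given up ("Could not obtain block").
     */
    boolean recordFetch(String blockId, boolean succeeded) {
        if (!succeeded) {
            failures++; // a later success on the same block does not undo this
        }
        return failures < MAX_FAILURES;
    }
}
```

With this shape, one transient failure each on three different blocks is enough to make the stream give up, even though every one of those blocks was later read successfully.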
The failures counter needs to be a little smarter. Would a patch that adds a map of blocks to failure counts be the right way to go? Failures could also note the datanode that the failure occurred against, so that if the datanode came online again (on retry), we could decrement the mark that had been made against the block.
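A minimal sketch of what such a map might look like follows. Everything in it is an assumption for discussion rather than a proposed patch: the class BlockFailureTracker and its methods are hypothetical, blocks and datanodes are identified by plain strings, and a successful read clears all marks that datanode had against the block (decrementing by one would be an equally plausible choice).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of per-block failure tracking; names are illustrative only.
class BlockFailureTracker {
    private final int maxFailuresPerBlock;

    // block id -> (datanode -> number of failures seen against that datanode)
    private final Map<String, Map<String, Integer>> failures = new HashMap<>();

    BlockFailureTracker(int maxFailuresPerBlock) {
        this.maxFailuresPerBlock = maxFailuresPerBlock;
    }

    /** Marks one failed fetch of blockId from datanode. */
    void recordFailure(String blockId, String datanode) {
        Map<String, Integer> perNode = failures.get(blockId);
        if (perNode == null) {
            perNode = new HashMap<>();
            failures.put(blockId, perNode);
        }
        Integer previous = perNode.get(datanode);
        perNode.put(datanode, previous == null ? 1 : previous + 1);
    }

    /** A successful retry removes the marks this datanode had against the block. */
    void recordSuccess(String blockId, String datanode) {
        Map<String, Integer> perNode = failures.get(blockId);
        if (perNode == null) {
            return;
        }
        perNode.remove(datanode);
        if (perNode.isEmpty()) {
            failures.remove(blockId);
        }
    }

    /** Total outstanding failure marks for one block, across all datanodes. */
    int failureCount(String blockId) {
        Map<String, Integer> perNode = failures.get(blockId);
        if (perNode == null) {
            return 0;
        }
        int total = 0;
        for (int count : perNode.values()) {
            total += count;
        }
        return total;
    }

    /** Only the block being read decides whether to give up. */
    boolean shouldGiveUp(String blockId) {
        return failureCount(blockId) >= maxFailuresPerBlock;
    }
}
```

The point of keying on the block is that failures against one block no longer count against every other block in the file, and keying on the datanode lets a recovered datanode erase its own black mark.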
What do folks think?