FAQ
DFSClient "Could not obtain block:..."
--------------------------------------

Key: HADOOP-5903
URL: https://issues.apache.org/jira/browse/HADOOP-5903
Project: Hadoop Core
Issue Type: Bug
Components: dfs
Affects Versions: 0.20.0, 0.19.1, 0.19.0, 0.18.3
Reporter: stack


We see this frequently in our application, HBase, where DFSClients are held open across long periods of time. It would seem that any hiccup fetching a block becomes a permanent black mark: even after the serving datanode recovers from a temporary slowdown or outage, the DFSClient never seems to pick up on this fact. The client is too sensitive to the vagaries of cluster comings and goings and gives up too easily, especially given that a fresh DFSClient has no problem fetching the designated block.

Chatting with Raghu and Hairong yesterday, Hairong pointed out that the DFSClient frequently updates its list of block locations -- if a block has moved or a datanode is dead, the DFSClient should keep up with the changing state of the cluster (I see this happening in DFSClient#chooseDatanode on failure). But Raghu seems to have put his finger on our problem by noticing that the failures count is only ever incremented -- never decremented. ANY three failures, no matter how many blocks are in the file, and regardless of whether a block that once failed now works, are enough for the DFSClient to start throwing "Could not obtain block:...".
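To make the complaint concrete, here is a heavily simplified sketch of the retry accounting described above -- not the actual DFSClient source. The class name and method names are hypothetical; only the threshold of three (DFSClient's MAX_BLOCK_ACQUIRE_FAILURES default) and the never-decremented counter come from the discussion:

```java
// Hypothetical simplification of the behavior described above.
// A single counter accumulates failures for the lifetime of the open
// stream, so any three failures -- against different blocks, hours
// apart -- permanently exhaust the retry budget.
public class BlockFetchRetry {
    static final int MAX_BLOCK_ACQUIRE_FAILURES = 3;

    private int failures = 0; // incremented below, never reset

    /** Record a fetch failure; returns true if the caller may still retry. */
    public boolean recordFailureAndMayRetry() {
        failures++;
        return failures < MAX_BLOCK_ACQUIRE_FAILURES;
    }

    /** A successful fetch changes nothing -- this is the complaint:
     *  a block that now reads fine wins back none of the budget. */
    public void recordSuccess() {
    }
}
```

Two transient hiccups followed by any number of successes still leave the client one failure away from "Could not obtain block:...".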

The failures counter needs to be a little smarter. Would a patch that adds a map of blocks to failure counts be the right way to go? Each failure should note the datanode it was seen against, so that if the datanode came back online (on retry), we could decrement the mark that had been made against the block.
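One possible shape for that proposal -- a sketch under assumptions, not a patch. Class and method names are invented here; the idea from the paragraph above is that failures are scoped per block, each one remembers the offending datanode, and a later success against that datanode retracts its mark:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of the proposed per-block failure map. A block is only given
// up on when enough *distinct, still-marked* datanodes have failed it.
public class PerBlockFailureTracker {
    static final int MAX_FAILURES_PER_BLOCK = 3; // same threshold, now per block

    // blockId -> datanodes currently black-marked for that block
    private final Map<Long, Set<String>> failures = new HashMap<>();

    public void recordFailure(long blockId, String datanode) {
        failures.computeIfAbsent(blockId, k -> new HashSet<>()).add(datanode);
    }

    /** A successful read from this datanode clears its mark for the block. */
    public void recordSuccess(long blockId, String datanode) {
        Set<String> marks = failures.get(blockId);
        if (marks != null) {
            marks.remove(datanode);
            if (marks.isEmpty()) {
                failures.remove(blockId);
            }
        }
    }

    public boolean shouldGiveUp(long blockId) {
        Set<String> marks = failures.get(blockId);
        return marks != null && marks.size() >= MAX_FAILURES_PER_BLOCK;
    }
}
```

With this shape, a datanode that recovers from a transient outage stops counting against the block as soon as a read from it succeeds, and failures on one block no longer poison reads of every other block in the file.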

What do folks think?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

  • Chris Douglas (JIRA) at May 24, 2009 at 3:22 am
    [ https://issues.apache.org/jira/browse/HADOOP-5903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Chris Douglas resolved HADOOP-5903.
    -----------------------------------

    Resolution: Duplicate

    Duplicate of HADOOP-3185, HADOOP-4681

Discussion Overview
group: common-dev @ hadoop.apache.org
categories: hadoop
posted: May 23, '09 at 8:05p
active: May 24, '09 at 3:22a
posts: 2
users: 1
website: hadoop.apache.org...
irc: #hadoop

1 user in discussion

Chris Douglas (JIRA): 2 posts
