Failed block replication leaves an incomplete block in receiver's tmp data directory
------------------------------------------------------------------------------------

Key: HADOOP-4702
URL: https://issues.apache.org/jira/browse/HADOOP-4702
Project: Hadoop Core
Issue Type: Bug
Components: dfs
Affects Versions: 0.17.2
Reporter: Hairong Kuang
Fix For: 0.20.0


When a failure occurs while replicating a block from a source DataNode to a target DataNode, the target node keeps an incomplete on-disk copy of the block in its temp data directory and an in-memory copy of the block in ongoingCreates queue. This causes two problems:
1. Since this block is not (and should not be) finalized, the NameNode is not aware of the existence of this incomplete block. It may schedule replicating the same block to this node again, which will fail with the message: "Block XX has already been started (though not completed), and thus cannot be created."
2. Restarting the datanode promotes the blocks under the temp data directory to valid blocks, thus introducing corrupted blocks into HDFS. Sometimes these corrupted blocks stay in the system undetected, if the partial block happens to match its checksums.

A failed block replication should clean up both the in-memory & on-disk copies of the incomplete block.
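The cleanup described above can be sketched as follows. This is a minimal illustration under stated assumptions: the class, the File[] bookkeeping, and the method shapes are hypothetical and do not match the actual FSDataset API; only the idea (drop the ongoingCreates entry, then delete the temp block and meta files) comes from the issue.

```java
import java.io.File;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposed cleanup; not the real DataNode code.
class BlockCleanup {
    // stand-in for the DataNode's ongoingCreates queue (blockId -> tmp files)
    private final Map<Long, File[]> ongoingCreates = new HashMap<>();

    void startReplication(long blockId, File blockFile, File metaFile) {
        ongoingCreates.put(blockId, new File[] { blockFile, metaFile });
    }

    // On a failed replication, drop both the in-memory and on-disk copies.
    void unfinalizeBlock(long blockId) {
        File[] files = ongoingCreates.remove(blockId);   // in-memory copy
        if (files == null) {
            return;                                      // nothing to clean up
        }
        for (File f : files) {                           // on-disk copies
            if (f.exists() && !f.delete()) {
                System.err.println("WARN: could not delete " + f.getPath());
            }
        }
    }

    boolean isOngoing(long blockId) {
        return ongoingCreates.containsKey(blockId);
    }
}
```

With this shape, a later replication of the same block to the same node no longer trips over a stale ongoingCreates entry, and a datanode restart finds no partial block file to promote.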

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


  • Tsz Wo (Nicholas), SZE (JIRA) at Nov 26, 2008 at 6:55 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651103#action_12651103 ]

    Tsz Wo (Nicholas), SZE commented on HADOOP-4702:
    ------------------------------------------------

Block replication and block creation should behave differently: block creation allows a partial block, but block replication should be atomic, i.e. either replicate the entire block or do nothing.
  • Hairong Kuang (JIRA) at Dec 5, 2008 at 11:16 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Hairong Kuang updated HADOOP-4702:
    ----------------------------------

    Attachment: tmpBlockRemoval.patch

This patch removes all traces (in-memory & on-disk) of a temporary block if block replication fails.
  • Hairong Kuang (JIRA) at Dec 5, 2008 at 11:36 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Hairong Kuang updated HADOOP-4702:
    ----------------------------------

    Priority: Blocker (was: Major)
    Fix Version/s: (was: 0.20.0)
    0.18.3
    Assignee: Hairong Kuang
  • Konstantin Shvachko (JIRA) at Dec 6, 2008 at 1:38 am
    [ https://issues.apache.org/jira/browse/HADOOP-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654018#action_12654018 ]

    Konstantin Shvachko commented on HADOOP-4702:
    ---------------------------------------------

    # {{file.delete()}} in {{unfinalizeBlock()}} should check the return value and log if unsuccessful.
# It would be good to designate a method for block and meta file removal. It could be reused in {{invalidate()}} and {{unfinalizeBlock()}}, and maybe in other places too.
    # Reverse the condition for checking whether the transfer is replication-related in {{BlockReceiver.removeBlock()}}.

I verified that the temporary block is removed when the failed transfer was initiated by block replication or block replacement (the balancer), and is not removed when it was a client write, which is what the patch is intended to do.
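Review points 1 and 2 above can be combined into one shared helper: a single removal method that checks the {{delete()}} return value and logs failures, reusable from {{invalidate()}} and {{unfinalizeBlock()}}. A hedged sketch follows; the helper name delBlockFromDisk is taken from later in this thread, but the surrounding class and signature are hypothetical:

```java
import java.io.File;

// Hypothetical home for a shared block/meta file removal helper.
class BlockFileRemover {
    // Returns true only if every existing file was actually deleted;
    // logs each failed delete() instead of silently ignoring it.
    static boolean delBlockFromDisk(File blockFile, File metaFile) {
        boolean ok = true;
        for (File f : new File[] { blockFile, metaFile }) {
            if (f != null && f.exists() && !f.delete()) {
                System.err.println("WARN: failed to delete " + f.getPath());
                ok = false;   // keep trying the other file, but report failure
            }
        }
        return ok;
    }
}
```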

  • Hairong Kuang (JIRA) at Dec 8, 2008 at 7:39 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Hairong Kuang updated HADOOP-4702:
    ----------------------------------

    Attachment: tmpBlockRemoval1.patch

This patch incorporates all of Konstantin's review comments.
  • Hairong Kuang (JIRA) at Dec 8, 2008 at 10:24 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Hairong Kuang updated HADOOP-4702:
    ----------------------------------

    Attachment: tmpBlockRemoval2.patch

    This patch makes two more changes:
1. FSDataSet.unfinalize returns early if the block is not in the ongoingCreates list;
2. BlockReceiver initializes the fields srcDataNode & datanode before initializing checksum. This avoids a NullPointerException in removeBlock in case an exception is thrown while initializing the field checksum.
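Change 2 is an initialization-ordering fix: fields that the cleanup path needs are assigned before any step that can throw, so a failure during checksum setup still leaves the cleanup code with non-null references. A sketch of the pattern, with illustrative class and field names rather than the actual BlockReceiver code:

```java
// Hypothetical illustration of the BlockReceiver init-order fix.
class ReceiverInitOrder {
    Object srcDataNode;
    Object datanode;
    Object checksum;

    ReceiverInitOrder(Object src, Object dn, boolean failChecksum) {
        // 1. initialize fields the cleanup path depends on first
        this.srcDataNode = src;
        this.datanode = dn;
        // 2. only then do work that may throw
        try {
            if (failChecksum) {
                throw new RuntimeException("bad checksum header");
            }
            this.checksum = new Object();
        } catch (RuntimeException e) {
            removeBlock();   // safe: srcDataNode and datanode are already set
            throw e;         // rethrow the original failure
        }
    }

    void removeBlock() {
        // the real cleanup would use srcDataNode/datanode; with the fixed
        // ordering, both are guaranteed non-null even on early failure
        if (srcDataNode == null || datanode == null) {
            throw new NullPointerException("cleanup ran before fields were set");
        }
    }
}
```

The caller sees the original checksum exception, not a NullPointerException from the cleanup path.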
  • Konstantin Shvachko (JIRA) at Dec 8, 2008 at 10:48 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654620#action_12654620 ]

    Konstantin Shvachko commented on HADOOP-4702:
    ---------------------------------------------

    +1
    Let us create a separate issue for reusing delBlockFromDisk() where it should/can be used. This patch goes to 0.18 so we want to minimize changes.
  • Hairong Kuang (JIRA) at Dec 9, 2008 at 8:28 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654946#action_12654946 ]

    Hairong Kuang commented on HADOOP-4702:
    ---------------------------------------

    Ant test-core was successful:
    BUILD SUCCESSFUL
    Total time: 127 minutes 28 seconds

    Ant test-patch result:
    [exec] +1 overall.

    [exec] +1 @author. The patch does not contain any @author tags.

    [exec] +1 tests included. The patch appears to include 5 new or modified tests.

    [exec] +1 javadoc. The javadoc tool did not generate any warning messages.

    [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.

    [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.

    [exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

  • Hairong Kuang (JIRA) at Dec 9, 2008 at 10:03 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654980#action_12654980 ]

    Hairong Kuang commented on HADOOP-4702:
    ---------------------------------------
> Let us create a separate issue for reusing delBlockFromDisk() where it should/can be used.

I created HADOOP-4812.
  • Hairong Kuang (JIRA) at Dec 9, 2008 at 10:05 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Hairong Kuang resolved HADOOP-4702.
    -----------------------------------

    Resolution: Fixed
    Hadoop Flags: [Reviewed]

    I've just committed this.
  • Hudson (JIRA) at Dec 10, 2008 at 4:49 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12655275#action_12655275 ]

    Hudson commented on HADOOP-4702:
    --------------------------------

    Integrated in Hadoop-trunk #684 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/684/])
    . Failed block replication leaves an incomplete block in receiver's tmp data directory. Contributed by Hairong Kuang.


Discussion Overview
group: common-dev
categories: hadoop
posted: Nov 20, '08 at 11:15p
active: Dec 10, '08 at 4:49p
posts: 12
website: hadoop.apache.org
irc: #hadoop
