[ https://issues.apache.org/jira/browse/HDFS-795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon resolved HDFS-795.

Resolution: Duplicate

HDFS-101 duplicates this, and the fix is under way there.
DFS Write pipeline does not detect defective datanode correctly in some cases (HADOOP-3339)

Key: HDFS-795
URL: https://issues.apache.org/jira/browse/HDFS-795
Project: Hadoop HDFS
Issue Type: Bug
Components: hdfs client
Affects Versions: 0.20.1
Reporter: Raghu Angadi
Priority: Critical
Fix For: 0.20.2

Attachments: toreproduce-5796.patch

The HDFS write pipeline does not remove the correct datanode in some error cases. One example: say DN2 is the second datanode in the pipeline and the write to it times out because it is in a bad state; the pipeline actually removes the first datanode instead. If the defective datanode happens to be the last one in the pipeline, the write is aborted completely with a hard error.
Essentially, the error occurs when writing to a downstream datanode fails, rather than when reading from it. This bug was actually fixed in 0.18 (HADOOP-3339), but HADOOP-1700 essentially reverted the fix. I am not sure why.
It is absolutely essential for HDFS to handle failures on a subset of the datanodes in a pipeline. At the very least, we should not have known bugs that lead to hard failures.
I will attach a patch for a hack that illustrates this problem. I am still thinking about what an automated test for this would look like.
My preferred target for this fix is 0.20.1.
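The failure mode described above can be sketched as follows. This is a hypothetical, simplified model of pipeline recovery, not the actual Hadoop client code; the function names and the list-of-strings pipeline are illustrative assumptions.

```python
# Simplified illustration of the bug: on a downstream write failure,
# the client should eject the datanode the write failed on, but the
# buggy path ejects the first datanode in the pipeline instead.

def rebuild_pipeline(nodes, failed_index):
    """Correct behavior: drop the datanode the write actually failed on."""
    return nodes[:failed_index] + nodes[failed_index + 1:]

def buggy_rebuild_pipeline(nodes, failed_index):
    """Behavior described in this issue: a write timeout to a downstream
    datanode is misattributed, and the first datanode is removed."""
    return nodes[1:]  # wrong node ejected; the bad node stays in the pipeline

pipeline = ["DN1", "DN2", "DN3"]
# The write to DN2 (index 1) times out:
print(rebuild_pipeline(pipeline, 1))        # ['DN1', 'DN3'] - recovery can proceed
print(buggy_rebuild_pipeline(pipeline, 1))  # ['DN2', 'DN3'] - bad node kept
```

In the buggy version the defective node survives each recovery round, so if it ends up as the last node in the pipeline the write has nowhere left to go and fails hard, which matches the behavior reported above.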
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

Discussion Overview
group: hdfs-dev
posted: Dec 18, '09 at 2:00a
active: Dec 18, '09 at 2:00a

1 user in discussion
Todd Lipcon (JIRA): 1 post