All datanodes are bad in 2nd phase
----------------------------------

Key: HDFS-1239
URL: https://issues.apache.org/jira/browse/HDFS-1239
Project: Hadoop HDFS
Issue Type: Bug
Components: hdfs client
Affects Versions: 0.20.1
Reporter: Thanh Do


- Setup:
number of datanodes = 2
replication factor = 2
type of failure = transient fault (a Java I/O call throws an exception or returns false)
number of failures = 2
when/where failures happen = during the 2nd phase of the pipeline, one at each datanode when it tries to perform I/O
(e.g. DataOutputStream.flush())
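
To make "transient fault" concrete, here is a minimal sketch of one way such a fault could be injected around a flush() call. The class name TransientFaultStream and its constructor parameters are hypothetical and are not part of Hadoop or of the Failure Testing Service framework; only the java.io classes are real.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/**
 * Hypothetical fault injector (not Hadoop code): the first `failures`
 * calls to flush() throw, every later call is passed through.
 */
class TransientFaultStream extends OutputStream {
    private final OutputStream delegate;
    private int failuresLeft;

    TransientFaultStream(OutputStream delegate, int failures) {
        this.delegate = delegate;
        this.failuresLeft = failures;
    }

    @Override
    public void write(int b) throws IOException {
        delegate.write(b);
    }

    @Override
    public void flush() throws IOException {
        if (failuresLeft > 0) {
            failuresLeft--;
            throw new IOException("injected transient fault");
        }
        delegate.flush();
    }
}

class TransientFaultDemo {
    public static void main(String[] args) throws IOException {
        OutputStream out = new TransientFaultStream(new ByteArrayOutputStream(), 1);
        out.write(42);
        try {
            out.flush();                 // first call fails
        } catch (IOException e) {
            System.out.println("first flush failed: " + e.getMessage());
        }
        out.flush();                     // same call succeeds when retried
        System.out.println("second flush succeeded");
    }
}
```

The point of the sketch is only that the same I/O call succeeds when retried, which is what distinguishes this setup from a permanent disk fault.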

- Details:

This is similar to HDFS-1237.
In this case, node1 throws an exception, which makes the client create
a pipeline with node2 only and redo the whole write,
which hits another failure. At this point the client considers
all datanodes bad and never retries the whole thing again
(i.e. it never asks the namenode for a new set of datanodes).
In HDFS-1237 the bug is due to a permanent disk fault; in this case it is a transient error.
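
The sequence described above can be modelled with a small, self-contained sketch. This is a simplified illustration of the reported behaviour, not the actual HDFS client code: the class PipelineRecoverySketch, the Node type, writeBlock(), and the exception text are all assumptions made for the example.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PipelineRecoverySketch {
    /** Hypothetical datanode stand-in that fails its next `failuresLeft` writes. */
    static final class Node {
        final String name;
        int failuresLeft;

        Node(String name, int failuresLeft) {
            this.name = name;
            this.failuresLeft = failuresLeft;
        }

        void write() throws IOException {
            if (failuresLeft > 0) {
                failuresLeft--;
                throw new IOException(name + ": injected transient fault");
            }
        }
    }

    /**
     * Writes through the pipeline. On a failure the failing node is dropped
     * and the write is redone with the remaining nodes. Dropped nodes are
     * never retried and no replacement nodes are requested.
     * Returns the number of nodes left in the pipeline on success.
     */
    static int writeBlock(List<Node> initial) throws IOException {
        List<Node> pipeline = new ArrayList<>(initial);
        while (!pipeline.isEmpty()) {
            Node bad = null;
            for (Node n : pipeline) {
                try {
                    n.write();
                } catch (IOException e) {
                    bad = n;
                    break;
                }
            }
            if (bad == null) {
                return pipeline.size();
            }
            pipeline.remove(bad);
        }
        throw new IOException("All datanodes are bad");
    }

    public static void main(String[] args) {
        // Two datanodes, one transient fault each, as in the setup above.
        List<Node> nodes = Arrays.asList(new Node("node1", 1), new Node("node2", 1));
        try {
            writeBlock(nodes);
            System.out.println("write succeeded");
        } catch (IOException e) {
            System.out.println("write failed: " + e.getMessage());
        }
    }
}
```

In this model both nodes would accept the write if asked a second time, yet the write aborts because each node is discarded after its first failure and the pipeline is never refilled.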

This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (thanhdo@cs.wisc.edu) and
Haryadi Gunawi (haryadi@eecs.berkeley.edu)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


  • Konstantin Shvachko (JIRA) at Jun 22, 2010 at 7:13 pm
    [ https://issues.apache.org/jira/browse/HDFS-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Konstantin Shvachko resolved HDFS-1239.
    ---------------------------------------

    Resolution: Invalid

Discussion Overview
group: hdfs-dev
category: hadoop
posted: Jun 17, 2010 at 1:10 pm
active: Jun 22, 2010 at 7:13 pm
posts: 2
users: 1
website: hadoop.apache.org...
irc: #hadoop