Bad retry logic at DFSClient
----------------------------

Key: HDFS-1233
URL: https://issues.apache.org/jira/browse/HDFS-1233
Project: Hadoop HDFS
Issue Type: Bug
Components: hdfs client
Affects Versions: 0.20.1
Reporter: Thanh Do


- Summary: failover bug; bad retry logic at DFSClient cannot fail over to the 2nd disk

- Setups:
+ # available datanodes = 1
+ # disks / datanode = 2
+ # failures = 1
+ failure type = bad disk
+ When/where failure happens = (see below)

- Details:

The setup is:
1 datanode, 1 replica, and each datanode has 2 disks (Disk1 and Disk2).

We injected a single disk failure to see if we can failover to the
second disk or not.
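
For concreteness, the setup above corresponds roughly to the following
client-side configuration; this is a minimal sketch using the 0.20
configuration keys, and the directory paths are placeholders rather than
the actual test paths.

import org.apache.hadoop.conf.Configuration;

// Minimal sketch of the setup above (0.20 config keys; paths are placeholders).
public class SingleDatanodeTwoDiskConf {
  public static Configuration build() {
    Configuration conf = new Configuration();
    // One replica, so the write pipeline contains exactly one datanode (DN1).
    conf.set("dfs.replication", "1");
    // Two storage directories on that datanode (Disk1 and Disk2).
    conf.set("dfs.data.dir", "/disk1/hdfs/data,/disk2/hdfs/data");
    return conf;
  }
}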

If a persistent disk failure happens during createBlockOutputStream
(the first phase of pipeline creation), e.g. DN1-Disk1 is bad,
then createBlockOutputStream (cbos) gets an exception and retries.
On the retry the client gets the same DN1 from the namenode,
and DN1 calls DN.writeBlock(), FSVolume.createTmpFile, and finally
getNextVolume(), which advances the volume index round-robin. Thus, on the
second try, the write successfully goes to the second disk.
Essentially, createBlockOutputStream is wrapped in a
do { ... } while (retry && --count >= 0) loop: the first cbos attempt
fails and the second succeeds in this particular scenario.
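
The following self-contained sketch (not the actual DFSClient source;
class and method names are illustrative, and it collapses the datanode's
writeBlock/createTmpFile path into one call) simulates that interaction:
a retry loop in the style of do/while(retry && --count >= 0) combined with
round-robin volume selection, so the first attempt hits the bad Disk1 and
the retry lands on the good Disk2.

public class CbosRetrySimulation {

  // Stand-in for FSVolumeSet.getNextVolume(): advances a volume index
  // round-robin, which is what moves the retried write onto Disk2.
  static final boolean[] DISK_BAD = { true, false }; // Disk1 bad, Disk2 good
  static int curVolume = 0;

  static int getNextVolume() {
    int v = curVolume;
    curVolume = (curVolume + 1) % DISK_BAD.length;
    return v;
  }

  // Stand-in for createBlockOutputStream(): fails if the chosen disk is bad.
  static boolean createBlockOutputStream() {
    int volume = getNextVolume();
    System.out.println("attempt writes to Disk" + (volume + 1));
    return !DISK_BAD[volume];
  }

  public static void main(String[] args) {
    int count = 3;  // in the same spirit as dfs.client.block.write.retries
    boolean success;
    boolean retry;
    do {
      success = createBlockOutputStream();
      retry = !success;
    } while (retry && --count >= 0);
    System.out.println(success ? "write eventually succeeded"
                               : "all retries failed");
  }
}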

Now suppose cbos succeeds but the disk failure is persistent.
Then the "retry" happens in a different loop.
First, hasError is set to true in ResponseProcessor.run().
DataStreamer.run() then goes back to the top of its loop:
while (!closed && clientRunning && !lastPacketInBlock).
The next iteration calls processDatanodeError because hasError is true.
In processDatanodeError (pde), the client sees that this is the only datanode
in the pipeline and therefore considers the whole node bad, although in
reality only one of its disks is bad. Hence pde throws an IOException saying
that all the datanodes in the pipeline (here, only DN1) are bad, and that
exception propagates to the client.
If, however, the exception were caught by the outermost
do { ... } while (retry && --count >= 0) loop, the outer retry would succeed
(as described in the previous paragraph).
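
Here is a self-contained sketch (again not the HDFS source; names are
illustrative) of the decision pde makes in this scenario: with a
single-node pipeline the one datanode that reported an error cannot be
replaced, so the client gives up on the whole node even though only one
of its disks failed.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class PdeSketch {

  // Stand-in for the pipeline-recovery decision in processDatanodeError.
  static void processDatanodeError(List<String> pipeline, int errorIndex)
      throws IOException {
    if (pipeline.size() <= 1) {
      // The only datanode is treated as bad as a whole; the per-disk
      // failover that saved createBlockOutputStream never gets a chance.
      throw new IOException("All datanodes " + pipeline
          + " are bad. Aborting...");
    }
    // With more than one datanode, the bad node is dropped and the
    // pipeline is rebuilt from the survivors.
    pipeline.remove(errorIndex);
  }

  public static void main(String[] args) {
    List<String> pipeline = new ArrayList<String>();
    pipeline.add("DN1");                 // 1 replica => single-node pipeline
    try {
      processDatanodeError(pipeline, 0); // hasError was set by the responder
    } catch (IOException e) {
      // This is the exception that surfaces to the client in the report above.
      System.out.println("client sees: " + e.getMessage());
    }
  }
}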

In summary, in a deployment with only one datanode that has multiple
disks, if one disk goes bad, the current retry logic on the DFSClient
side is not robust enough to mask the failure from the client.

This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (thanhdo@cs.wisc.edu) and
Haryadi Gunawi (haryadi@eecs.berkeley.edu)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

  • Todd Lipcon (JIRA) at Jun 17, 2010 at 5:35 pm
    [ https://issues.apache.org/jira/browse/HDFS-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Todd Lipcon resolved HDFS-1233.
    -------------------------------

    Resolution: Won't Fix

    This is a known deficiency, don't think anyone has plans to fix it. Any cluster that has multiple disks per DN likely has multiple DNs too.
