HDFS replica management
Is replica management built into HDFS? What I mean is: if I set the
replication factor to 3 and then lose 3 disks, is that data lost forever?
All 3 disks dying at the same time is, I know, a far-fetched scenario, but
if they die over a certain period of time, does HDFS re-replicate the data
to ensure that there are always 3 copies in the system?

Thanks
A

  • Ted Dunning at Jul 17, 2007 at 3:45 pm
    Assuming that you have many more than 3 disks, the chance that 3
    simultaneous disk failures hit exactly the right 3 disks is much lower
    than the chance of losing any 3 disks. Hadoop's ability to place replicas
    in different racks helps further, since losing an entire rack is one of
    the few mechanisms that correlates failures.

    For example, if you have 20 disks, then the chance that a loss of 3 disks
    takes out one particular set of 3 is about one in a thousand (there are
    C(20,3) = 1140 ways to choose 3 disks out of 20, assuming failures land
    independently), and it is impossible if the replicas span racks and the
    failures are confined to a single rack.

    Remember, you can always increase the number of replicas if you like.
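    For instance, something along these lines raises the factor on an
    existing file (a sketch against the FileSystem API; the path and class
    name here are made up):

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class BumpReplication {
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          conf.setInt("dfs.replication", 5);   // default for newly created files
          FileSystem fs = FileSystem.get(conf);
          // For an existing file, raise the factor explicitly; the namenode
          // then schedules the extra copies in the background.
          fs.setReplication(new Path("/user/data/important.dat"), (short) 5);
        }
      }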

  • Phantom at Jul 17, 2007 at 5:57 pm
    Here is the scenario I was concerned about. Consider three nodes in the
    system, A, B and C, placed in different racks. Say the disk on A fries
    today. My understanding (and I could be wrong about this) is that the
    blocks that were stored on A are not re-replicated to some other node, or
    to the new disk with which you bring A back. Now, a month later the disk
    on B could fry, and another month later the disk on C. In the absence of
    a replica synchronization algorithm like the one in S3, you could slowly
    start losing data this way. This would never happen in S3, because a
    replica synchronization algorithm is always running to guarantee that
    there will always be 3 replicas in the system: if a disk fries, the data
    is re-replicated. Of course, there is no way to protect oneself from the
    3 machines that store the replicas all losing their disks at the same
    time.

    So I was wondering whether there is a replica synchronization algorithm
    in place, or whether it is a feature planned for the future.

    A

  • Doug Cutting at Jul 17, 2007 at 6:28 pm

    Phantom wrote:
    Say the disk on A fries today. My understanding (and I could be wrong
    about this) is that the blocks that were stored on A are not
    re-replicated to some other node, or to the new disk with which you
    bring A back.

    That's incorrect. When a datanode fails to send a heartbeat to the
    namenode in a timely manner, its data is assumed missing and is
    re-replicated. And when block corruption is detected, corrupt replicas
    are removed and non-corrupt replicas are re-replicated to maintain the
    desired level of replication.
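    Conceptually, the corruption check just compares a freshly computed
    checksum of a block's bytes against the checksum recorded when the block
    was written. A toy version of the idea (not the actual HDFS code, which
    keeps CRC checksums alongside each block):

      import java.util.zip.CRC32;

      // Toy illustration of checksum-based corruption detection:
      // recompute a CRC over the block's bytes and compare it with
      // the checksum stored when the block was written.
      static boolean isCorrupt(byte[] blockData, long storedChecksum) {
        CRC32 crc = new CRC32();
        crc.update(blockData, 0, blockData.length);
        return crc.getValue() != storedChecksum;
      }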

    Doug
  • Phantom at Jul 17, 2007 at 6:30 pm
    That's awesome.

    Thanks
    A
  • Phantom at Jul 17, 2007 at 6:38 pm
    I am sure re-replication is not done on every missed heartbeat, since
    that would be very expensive and inefficient. At the same time you cannot
    really tell whether a node is partitioned away, crashed, or just slow. Is
    it threshold-based, i.e. after N missed heartbeats, re-replicate? Which
    package in the source code could I look at to glean this information?

    Thanks
    A
  • Phantom at Jul 17, 2007 at 6:42 pm
    The reason I ask is that in S3, and in P2P storage systems I have been
    involved in, we had a replica synchronization algorithm that ran once
    every night and relied on techniques like Merkle tree comparisons. Either
    way, understanding how HDFS does it would be beneficial. I don't mind
    reading through the sources, but I would appreciate being pointed to the
    correct package.
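    For concreteness, the kind of comparison I mean works roughly like this
    (a toy sketch, unrelated to the HDFS code base): each replica builds a
    hash tree over its blocks, and two replicas need only exchange the
    subtrees whose root hashes differ.

      import java.security.MessageDigest;
      import java.util.ArrayList;
      import java.util.List;

      // Toy Merkle root: hash each block, then repeatedly hash adjacent
      // pairs of digests until a single root digest remains. Replicas with
      // equal roots hold identical data; on a mismatch, recurse into the
      // children to locate the divergent blocks. Assumes at least one block.
      static byte[] merkleRoot(List<byte[]> blocks) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        List<byte[]> level = new ArrayList<byte[]>();
        for (byte[] block : blocks) {
          level.add(md.digest(block));
        }
        while (level.size() > 1) {
          List<byte[]> next = new ArrayList<byte[]>();
          for (int i = 0; i < level.size(); i += 2) {
            md.update(level.get(i));
            if (i + 1 < level.size()) {
              md.update(level.get(i + 1));
            }
            next.add(md.digest());
          }
          level = next;
        }
        return level.get(0);
      }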

    Thanks
    A
  • Doug Cutting at Jul 17, 2007 at 6:50 pm

    Phantom wrote:
    Is it threshold-based, i.e. after N missed heartbeats, re-replicate?
    Which package in the source code could I look at to glean this
    information?

    Yes, detection of datanode failure is threshold-based. It is currently
    ten minutes plus ten missed heartbeats. This is in dfs/FSNamesystem.java.
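    In outline, the liveness check is no more than something like this (an
    illustration, not the actual FSNamesystem code):

      // A datanode is considered dead once its last heartbeat is older
      // than ten minutes plus ten heartbeat intervals.
      static boolean isDead(long lastHeartbeatMs, long heartbeatIntervalMs) {
        long expireMs = 10 * 60 * 1000L + 10 * heartbeatIntervalMs;
        return System.currentTimeMillis() - lastHeartbeatMs > expireMs;
      }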

    Doug
  • Hairong Kuang at Jul 17, 2007 at 7:30 pm

    Doug Cutting wrote:
    This is in dfs/FSNamesystem.java.

    FSNamesystem.java is a huge chunk of source code. To be more specific,
    datanode failure detection is done by the HeartbeatMonitor. Once a
    datanode is detected as dead, all blocks belonging to that datanode are
    put in the neededReplications queue. The ReplicationMonitor then starts
    replicating those under-replicated blocks. All the replication-target
    selection logic is in dfs/ReplicationTargetChooser.java.
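    Schematically, that flow looks something like this (a paraphrase of the
    above, not the real classes' code):

      import java.util.LinkedList;
      import java.util.List;
      import java.util.Queue;

      // Schematic only; the real logic in org.apache.hadoop.dfs is far
      // more involved (priorities, rack awareness, throttling, ...).
      class ReplicationFlowSketch {
        private final Queue<String> neededReplications = new LinkedList<String>();

        // HeartbeatMonitor side: a datanode was declared dead, so every
        // block it held becomes a re-replication candidate.
        void onDatanodeDead(List<String> blocksOnDeadNode) {
          neededReplications.addAll(blocksOnDeadNode);
        }

        // ReplicationMonitor side: periodically drain the queue, pick a
        // target for each block (cf. ReplicationTargetChooser), and copy
        // from a surviving replica.
        void replicationPass() {
          String block;
          while ((block = neededReplications.poll()) != null) {
            String target = chooseTarget(block);
            scheduleCopyFromSurvivingReplica(block, target);
          }
        }

        // Placeholders standing in for the real target-selection and
        // block-transfer machinery.
        private String chooseTarget(String block) { return "some-datanode"; }
        private void scheduleCopyFromSurvivingReplica(String block, String target) { }
      }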

    Hairong
  • Dhruba Borthakur at Jul 17, 2007 at 9:50 pm
    A datanode is declared dead if heartbeats have been missing for 10
    minutes. Datanodes typically send a heartbeat every 3 seconds, so that
    amounts to roughly 200 consecutive missed heartbeats before a node is
    written off.

    Thanks,
    dhruba


Discussion Overview
group: common-user @
category: hadoop
posted: Jul 17, '07 at 10:46a
active: Jul 17, '07 at 9:50p
posts: 10
users: 5
website: hadoop.apache.org...
irc: #hadoop
