Aaron -

Thank you!

Is the 10 minute interval configurable?

Joseph Hammerman

-----Original Message-----
From: Aaron Kimball
Sent: Tuesday, May 26, 2009 4:29 PM
To: core-user@hadoop.apache.org
Subject: Re: Question regarding downtime model for DataNodes

A DataNode is regarded as "dead" after 10 minutes of inactivity. If a
DataNode is down for less than 10 minutes and then returns, nothing
happens. After 10 minutes, its blocks are assumed to be lost forever, so the
NameNode then begins scheduling those blocks for re-replication from the two
surviving copies. There is no real-time bound on how long this process
takes. It's a function of your available network bandwidth, server loads,
disk speeds, amount of storage used, etc. Blocks which are down to a single
surviving replica are prioritized over blocks which have two surviving
replicas (in the event of a simultaneous or near-simultaneous double fault),
since they are more vulnerable.
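As a rough sketch of where that 10-minute figure comes from (and of
Joseph's follow-up about configurability): in Hadoop of this vintage the
interval is derived from two configurable properties rather than a single
hard-coded constant. The property names and defaults below are assumptions
from the hadoop-default.xml of that era, not something stated in this
thread:

```shell
# Hedged sketch: deriving the ~10-minute dead-node interval.
# Assumed 0.18-era defaults (overridable in hadoop-site.xml):
#   heartbeat.recheck.interval = 300000 ms  (NameNode recheck period)
#   dfs.heartbeat.interval     = 3 s        (DataNode heartbeat period)
recheck_ms=300000
heartbeat_s=3
# The NameNode declares a DataNode dead after 2 * recheck + 10 * heartbeat:
timeout_s=$(( 2 * recheck_ms / 1000 + 10 * heartbeat_s ))
echo "${timeout_s}s"   # 630s, i.e. roughly the 10 minutes described above
```

Lowering heartbeat.recheck.interval would shrink the window accordingly,
at the cost of more eagerly declaring briefly-unreachable nodes dead.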

If a DataNode does reappear, then re-replication of its blocks is
cancelled, and over-provisioned blocks are scaled back to the target number
of replicas. If that machine comes back with all blocks intact, then the
node needs no "rebuilding." (In fact, some of the over-provisioned replicas
that get removed might be the ones on the original node, if the blocks are
available elsewhere.)
Don't forget that machines in Hadoop do not have strong notions of identity.
If a particular machine is taken offline and its disks are wiped, the blocks
which were there (which also existed in two other places) will be
re-replicated elsewhere from the live copies. When that same machine is then
brought back online, it has no incentive to "copy back" all the blocks that
it used to have, as there will be three replicas elsewhere in the cluster.
Blocks are never permanently bound to particular machines.

If you add recommissioned or new nodes to a cluster, you should run the
rebalancing script, which takes a random sampling of blocks from
heavily-laden nodes and moves them onto emptier nodes in an attempt to
spread the data as evenly as possible.
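The rebalancing step above can be sketched as follows; the command names
are assumptions based on the scripts shipped with 0.18-era Hadoop, and the
commands need a live cluster to do anything:

```shell
# Hedged sketch: run the HDFS balancer after adding or recommissioning nodes.
# -threshold is the tolerated per-node deviation from mean utilization,
# in percent; smaller values rebalance more aggressively but take longer.
bin/start-balancer.sh -threshold 10

# The balancer can be stopped safely at any point:
bin/stop-balancer.sh
```

The balancer runs as a background process and throttles its own bandwidth,
so it can be left running on a live cluster.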

- Aaron

On Tue, May 26, 2009 at 3:08 PM, Joe Hammerman wrote:

Hello Hadoop Users list:

We are running Hadoop version 0.18.2. My team lead has asked
me to investigate the answer to a particular question regarding Hadoop's
handling of offline DataNodes - specifically, we would like to know how long
a node can be offline before it is totally rebuilt when it has been re-added
to the cluster.
From what I've been able to determine from the documentation
it appears to me that the NameNode will simply begin scheduling block
replication on its remaining cluster members. If the offline node comes back
online, and it reports all its blocks as being uncorrupted, then the
NameNode just cleans up the "extra" blocks.
In other words, there is no explicit handling based on the
length of the outage - the behavior of the cluster will depend entirely on
how much re-replication has completed by the time the node returns.

Anyone care to shed some light on this?

Joseph Hammerman
