I have looked at the HBase MTTR scenario when we lose a full box with
its datanode and its hbase region server altogether: It means a RS
recovery, hence reading the logs files and writing new ones (splitting
By default, HDFS considers a DN as dead when there is no heartbeat for
10:30 minutes. Until this point, the NaneNode will consider it as
perfectly valid and it will get involved in all read & write
And, as we lost a RegionServer, the recovery process will take place,
so we will read the WAL & write new log files. And with the RS, we
lost the replica of the WAL that was with the DN of the dead box. In
other words, 33% of the DN we need are dead. So, to read the WAL, per
block to read and per reader, we've got one chance out of 3 to go to
the dead DN, and to get a connect or read timeout issue. With a
reasonnable cluster and a distributed log split, we will have a sure
I looked in details at the hdfs configuration parameters and their
impacts. We have the calculated values:
heartbeat.interval = 3s ("dfs.heartbeat.interval").
heartbeat.recheck.interval = 300s ("heartbeat.recheck.interval")
heartbeatExpireInterval = 2 * 300 + 10 * 3 = 630s => 10.30 minutes
At least on 1.0.3, there is no shutdown hook to tell the NN to
consider this DN as dead, for example on a software crash.
So before the 10:30 minutes, the DN is considered as fully available
by the NN. After this delay, HDFS is likely to start replicating the
blocks contained in the dead node to get back to the right number of
replica. As a consequence, if we're too aggressive we will have a side
effect here, adding workload to an already damaged cluster. According
to Stack: "even with this 10 minutes wait, the issue was met in real
production case in the past, and the latency increased badly". May be
there is some tuning to do here, but going under these 10 minutes does
not seem to be an easy path.
For the clients, they don't fully rely on the NN feedback, and they
keep, per stream, a dead node list. So for a single file, a given
client will do the error once, but if there are multiple files it will
go back to the wrong DN. The settings are:
connect/read: (3s (hardcoded) * NumberOfReplica) + 60s ("dfs.socket.timeout")
write: (5s (hardcoded) * NumberOfReplica) + 480s
That will set a 69s timeout to get a "connect" error with the default config.
I also had a look at larger failure scenarios, when we're loosing a
20% of a cluster. The smaller the cluster is the easier it is to get
there. With the distributed log split, we're actually on a better
shape from an hdfs point of view: the master could have error writing
the files, because it could bet a dead DN 3 times in a row. If the
split is done by the RS, this issue disappears. We will however get a
lot of errors between the nodes.
Finally, I had a look at the lease stuff Lease: write access lock to a
file, no other client can write to the file. But another client can
read it. Soft lease limit: another client can preempt the lease.
Default: 1 minute.
Hard lease limit: hdfs closes the file and free the resources on
behalf of the initial writer. Default: 60 minutes.
=> This should not impact HBase, as it does not prevent the recovery
process to read the WAL or to write new files. We just need writes to
be immediately available to readers, and it's possible thanks to
HDFS-200. So if a RS dies we should have no waits even if the lease
was not freed. This seems to be confirmed by tests.
=> It's interesting to note that this setting is much more aggressive
than the one to declare a DN dead (1 minute vs. 10 minutes). Or, in
HBase, than the default ZK timeout (3 minutes).
=> This said, HDFS states this: "When reading a file open for writing,
the length of the last block still being written is unknown
to the NameNode. In this case, the client asks one of the replicas for
the latest length before starting to read its content.". This leads to
an extra call to get the file length on the recovery (likely with the
ipc.Client), and we may once again go to the wrong dead DN. In this
case we have an extra socket timeout to consider.
On paper, it would be great to set "dfs.socket.timeout" to a minimal
value during a log split, as we know we will get a dead DN 33% of the
time. It may be more complicated in real life as the connections are
shared per process. And we could still have the issue with the
As a conclusion, I think it could be interesting to have a third
status for DN in HDFS: between live and dead as today, we could have
"sick". We would have:
1) Dead, known as such => As today: Start to replicate the blocks to
other nodes. You enter this state after 10 minutes. We could even wait
2) Likely to be dead: don't propose it for write blocks, put it with a
lower priority for read blocks. We would enter this state in two
2.1) No heartbeat for 30 seconds (configurable of course). As there
is an existing heartbeat of 3 seconds, we could even be more
2.2) We could have a shutdown hook in hdfs such as when a DN dies
'properly' it says to the NN, and the NN can put it in this 'half dead
=> In all cases, the node stays in the second state until the 10.30
timeout is reached or until a heartbeat is received.
For HBase it would make life much simpler I think:
- no 69s timeout on mttr path
- less connection to dead nodes leading to ressources held all other
the place finishing by a timeout...
- and there is already a very aggressive 3s heartbeat, so we would
not add any workload.