Grokbase Groups: HBase dev, July 2012
hbase mttr vs. hdfs
Hi,

I have looked at the HBase MTTR scenario where we lose a full box with
its datanode and its HBase region server altogether: it means a RS
recovery, hence reading the log files and writing new ones (splitting
logs).

By default, HDFS considers a DN as dead when there has been no heartbeat for
10:30 minutes. Until this point, the NameNode will consider it as
perfectly valid and it will be involved in all read & write
operations.

And, as we lost a RegionServer, the recovery process will take place,
so we will read the WAL & write new log files. With the RS, we also
lost the replica of the WAL that was on the DN of the dead box. In
other words, 33% of the DNs we need are dead. So, to read the WAL, per
block to read and per reader, we've got one chance out of 3 to go to
the dead DN and to get a connect or read timeout. With a
reasonably sized cluster and a distributed log split, we are sure to
hit it.


I looked in detail at the HDFS configuration parameters and their
impact. The calculated values are:
heartbeat.interval = 3s ("dfs.heartbeat.interval")
heartbeat.recheck.interval = 300s ("heartbeat.recheck.interval")
heartbeatExpireInterval = 2 * 300 + 10 * 3 = 630s => 10 minutes 30 seconds (10:30)
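A minimal sketch of the arithmetic above, reading the two settings from a
Hadoop Configuration. It is not the actual FSNamesystem code; in Hadoop 1.x
the real computation is done in milliseconds (and "heartbeat.recheck.interval"
defaults to 300000 ms), the seconds here simply match the figures quoted in
this mail.

    import org.apache.hadoop.conf.Configuration;

    public class HeartbeatExpiry {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Values in seconds, to match the figures above.
        long heartbeatInterval = conf.getLong("dfs.heartbeat.interval", 3);
        long recheckInterval = conf.getLong("heartbeat.recheck.interval", 300);

        // A DN is declared dead once no heartbeat has been seen for this long.
        long expire = 2 * recheckInterval + 10 * heartbeatInterval;
        System.out.println("heartbeatExpireInterval = " + expire + "s"); // 630s => 10:30
      }
    }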

At least on 1.0.3, there is no shutdown hook to tell the NN to
consider this DN as dead, for example on a software crash.

So before the 10:30 minutes, the DN is considered as fully available
by the NN. After this delay, HDFS is likely to start replicating the
blocks contained on the dead node to get back to the right number of
replicas. As a consequence, if we're too aggressive we will have a side
effect here, adding workload to an already damaged cluster. According
to Stack: "even with this 10 minutes wait, the issue was met in real
production case in the past, and the latency increased badly". Maybe
there is some tuning to do here, but going under these 10 minutes does
not seem to be an easy path.

The clients don't fully rely on the NN feedback: they keep,
per stream, a dead node list. So for a single file, a given
client will hit the error only once, but if there are multiple files it will
go back to the dead DN. The settings are:

connect/read: (3s (hardcoded) * NumberOfReplicas) + 60s ("dfs.socket.timeout")
write: (5s (hardcoded) * NumberOfReplicas) + 480s
("dfs.datanode.socket.write.timeout")

That will set a 69s timeout to get a "connect" error with the default config.
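The same numbers as a small sketch, using the Hadoop 1.x property names
quoted above; the 3s and 5s constants are the hardcoded per-replica
extensions, not configuration keys.

    import org.apache.hadoop.conf.Configuration;

    public class ClientDeadlines {
      // Effective deadline to get a connect/read error from a DN.
      static long readDeadlineMs(Configuration conf, int numReplicas) {
        return 3000L * numReplicas + conf.getLong("dfs.socket.timeout", 60000L);
      }

      // Effective deadline to get a write error towards a DN.
      static long writeDeadlineMs(Configuration conf, int numReplicas) {
        return 5000L * numReplicas
            + conf.getLong("dfs.datanode.socket.write.timeout", 480000L);
      }

      public static void main(String[] args) {
        Configuration conf = new Configuration();
        System.out.println(readDeadlineMs(conf, 3));  // 69000 ms = 69s
        System.out.println(writeDeadlineMs(conf, 3)); // 495000 ms
      }
    }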

I also had a look at larger failure scenarios, when we're losing
20% of a cluster. The smaller the cluster, the easier it is to get
there. With the distributed log split, we're actually in better
shape from an HDFS point of view: the master could have errors writing
the files, because it could get a dead DN 3 times in a row. If the
split is done by the RS, this issue disappears. We will however get a
lot of errors between the nodes.

Finally, I had a look at the lease mechanism.
Lease: write access lock on a file; no other client can write to the
file, but another client can read it.
Soft lease limit: another client can preempt the lease. Configurable.
Default: 1 minute.
Hard lease limit: HDFS closes the file and frees the resources on
behalf of the initial writer. Default: 60 minutes.

=> This should not impact HBase, as it does not prevent the recovery
process from reading the WAL or writing new files. We just need writes to
be immediately available to readers, which is possible thanks to
HDFS-200. So if a RS dies we should have no waits even if the lease
was not freed. This seems to be confirmed by tests.
=> It's interesting to note that this setting is much more aggressive
than the one to declare a DN dead (1 minute vs. 10 minutes), or, in
HBase, than the default ZK timeout (3 minutes).
=> That said, HDFS states this: "When reading a file open for writing,
the length of the last block still being written is unknown
to the NameNode. In this case, the client asks one of the replicas for
the latest length before starting to read its content." This leads to
an extra call to get the file length during recovery (likely through the
ipc.Client), and we may once again go to the dead DN. In this
case we have an extra socket timeout to consider.

On paper, it would be great to set "dfs.socket.timeout" to a minimal
value during a log split, as we know we will get a dead DN 33% of the
time. It may be more complicated in real life as the connections are
shared per process. And we could still have the issue with the
ipc.Client.
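Purely as an illustration, a lowered timeout for a dedicated recovery
FileSystem could look like the hypothetical helper below; the caveat above
still applies, because FileSystem instances are cached per process and a
cached instance keeps the configuration it was created with.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class SplitLogFileSystem {
      // Hypothetical helper: aggressive timeouts for the log-split path only.
      static FileSystem openForLogSplit(URI hdfsUri) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.socket.timeout", 10 * 1000);                // default 60s
        conf.setInt("dfs.datanode.socket.write.timeout", 30 * 1000); // default 480s
        // Note: FileSystem.get() may return a process-wide cached instance
        // created with the default configuration, which is exactly the
        // "connections are shared per process" issue mentioned above.
        return FileSystem.get(hdfsUri, conf);
      }
    }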


As a conclusion, I think it could be interesting to have a third
status for a DN in HDFS: between live and dead as today, we could have
"sick". We would have:
1) Dead, known as such => As today: start replicating the blocks to
other nodes. You enter this state after 10 minutes. We could even wait
longer.
2) Likely to be dead: don't propose it for writing blocks, give it a
lower priority for reading blocks. We would enter this state in two
conditions:
2.1) No heartbeat for 30 seconds (configurable of course). As there
is an existing heartbeat of 3 seconds, we could even be more
aggressive here.
2.2) We could have a shutdown hook in HDFS, such that when a DN dies
'properly' it tells the NN, and the NN can put it in this 'half dead'
state.
=> In all cases, the node stays in the second state until the 10:30
timeout is reached or until a heartbeat is received.
3) Live.

For HBase it would make life much simpler I think:
- no 69s timeout on the MTTR path
- fewer connections to dead nodes, which lead to resources being held all
over the place until a timeout finally fires...
- and there is already a very aggressive 3s heartbeat, so we would
not add any workload.

Thoughts?

Nicolas


  • Todd Lipcon at Jul 12, 2012 at 9:25 pm
    Hey Nicolas,

    Another idea that might be able to help this without adding an entire
    new state to the protocol would be to just improve the HDFS client
    side in a few ways:

    1) change the "deadnodes" cache to be a per-DFSClient structure
    instead of per-stream. So, after reading one block, we'd note that the
    DN was dead, and de-prioritize it on future reads. Of course we'd need
    to be able to re-try eventually since dead nodes do eventually
    restart.
    2) when connecting to a DN, if the connection hasn't succeeded within
    1-2 seconds, start making a connection to another replica. If the
    other replica succeeds first, then drop the connection to the first
    (slow) node.

    Wouldn't this solve the problem less invasively?
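    A toy illustration of idea (1), assuming nothing about the real
    DFSClient internals: a client-wide dead-node cache with an expiry, so
    that restarted DNs are eventually retried.

        import java.util.concurrent.ConcurrentHashMap;

        public class DeadNodeCache {
          private final ConcurrentHashMap<String, Long> deadSince =
              new ConcurrentHashMap<String, Long>();
          private final long retryAfterMs;

          public DeadNodeCache(long retryAfterMs) {
            this.retryAfterMs = retryAfterMs;
          }

          // Called when a connect/read to this DN failed.
          public void markDead(String datanodeId) {
            deadSince.putIfAbsent(datanodeId, System.currentTimeMillis());
          }

          // De-prioritize this DN until the retry window has elapsed.
          public boolean shouldAvoid(String datanodeId) {
            Long since = deadSince.get(datanodeId);
            if (since == null) {
              return false;
            }
            if (System.currentTimeMillis() - since > retryAfterMs) {
              deadSince.remove(datanodeId); // give the node another chance
              return false;
            }
            return true;
          }
        }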

    -Todd
    --
    Todd Lipcon
    Software Engineer, Cloudera
  • N Keywal at Jul 12, 2012 at 10:17 pm
    Hi Todd,

    Do you think the change would be too intrusive for HDFS? I agree,
    there are many less critical components in Hadoop :-). I was hoping
    that this state could be internal to the NN and could remain localized
    without any interface change...

    Your proposal would help for sure. I see 3 points if we try to do it
    for specific functions like recovery:
    - we would then need to manage the case when all 3 nodes time out
    after 1s, hoping that two of them are false positives...
    - the writes between DNs would still use the old timeout. I didn't
    look in detail at the impact. It won't be an issue for a single box
    crash, but for a large failure it could be.
    - we would want to change it for the ipc.Client as well. Not sure
    the change would not be visible to all functions.

    What worries me about setting very low timeouts is that it's difficult
    to validate: it tends to work until it goes to production...

    I was also thinking of making the deadNodes list public in the client,
    so HBase could tell the DFSClient: 'this node is dead, I know it
    because I'm recovering the RS', but it would have some false positives
    (a software region server crash), and it seems a little like a
    workaround...

    In the middle (thinking again about your proposal), we could add a
    function in HBase that would first check the DNs owning the WAL,
    trying to connect with a 1s timeout, to be able to tell the DFSClient
    who's dead.
    Or we could put this function in DFSClient, a kind of boolean to say
    fail fast on DN errors for this read...
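    A toy sketch of that probe, assuming the host/port pairs come from the
    block locations of the WAL file; none of this is existing HBase or HDFS
    API, it is just a plain TCP connect with a short timeout.

        import java.io.IOException;
        import java.net.InetSocketAddress;
        import java.net.Socket;

        public class DataNodeProbe {
          // True if a TCP connect to the DN succeeds within timeoutMs.
          static boolean isReachable(String host, int port, int timeoutMs) {
            Socket s = new Socket();
            try {
              s.connect(new InetSocketAddress(host, port), timeoutMs);
              return true;
            } catch (IOException e) {
              return false; // probably the dead DN: tell the DFSClient to skip it
            } finally {
              try { s.close(); } catch (IOException ignored) { }
            }
          }
        }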


  • N Keywal at Jul 13, 2012 at 7:54 am
    Another option would be to never write the WAL locally: in nearly all
    cases it won't be used, as it's on the dead box. Then, on a single box
    failure, the recovery would never be directed by the NN to a dead DN.
    And we would have 3 useful copies instead of 2, increasing global
    reliability...
  • N Keywal at Jul 13, 2012 at 1:28 pm
    I looked at this part of the HDFS code, and:
    - it's not simple to add this in a clean way, even if doing it is possible.
    - I was wrong about the 3s heartbeat: the heartbeat is actually every
    5 minutes. So changing this would not be without a lot of side effects.
    - as a side note, HADOOP-8144 is interesting...

    So not writing the WAL on the local machine could be a good medium-term
    option, which could likely be implemented with HDFS-385 (made
    available recently in "branch-1"; I don't know what that stands for).
  • Andrew Purtell at Jul 13, 2012 at 4:31 pm

    "branch-1" is the root of Hadoop 1, the next release should be
    branched from there.

    Changing the block placement policy for HBase WALs to avoid the local
    machine is interesting. Those who could make use of this approach
    would be those who:

    - Control their NameNode, its classpath, and its configuration

    - Would make the risk assessment that it is acceptable to plug
    something in there from the HBase project

    - Are not using another block placement policy, either the one
    from HDFS-RAID or their own

    Best regards,

    - Andy

    Problems worthy of attack prove their worth by hitting back. - Piet
    Hein (via Tom White)
  • Michael Stack at Jul 18, 2012 at 9:36 am

    On Fri, Jul 13, 2012 at 6:31 PM, Andrew Purtell wrote:
    Changing the block placement policy for HBase WALs to avoid the local
    machine is interesting. Those who could make use of this approach
    would be those who:

    - Control their NameNode, its classpath, and its configuration

    - Would make the risk assessment that it is acceptable to plug
    something in there from the HBase project

    - Are not using another block placement policy, either the one
    from HDFS-RAID or their own
    The above list makes this approach pretty much untenable.

    (We'd customize block policy by looking at the path to the file? We'd
    then fall back on default or pass the buck to another user-specified
    policy?)

    St.Ack
  • Ted Yu at Jul 13, 2012 at 5:03 pm
    One clarification on HDFS-385: the last post on that JIRA only means the
    submission of a patch for branch-1. I don't see it integrated yet.
  • Lars hofhansl at Jul 13, 2012 at 5:12 pm
    In that case, though, we'd slow down normal operation.
    Maybe that can be alleviated with HDFS-1783/HBASE-6116, although as mentioned in HBASE-6116, I have not been able to measure any performance improvement from this so far.

    -- Lars



    ________________________________
    From: N Keywal <nkeywal@gmail.com>
    To: dev@hbase.apache.org
    Sent: Friday, July 13, 2012 6:27 AM
    Subject: Re: hbase mttr vs. hdfs

    I looked at this part of hdfs code, and
    - it's not simple to add it in a clean way, even if doing it is possible.
    - i was wrong the the 3s hearbeat: the hearbeat is every 5 minutes
    actually. So changing this would not be without a lot of side effects.
    - as a side note HADOOP-8144 is interesting...

    So not writing the WAL on the local machine could be a good medium
    term option, that could likely be implemented with HDFS-385 (made
    available recently in "branch-1". I don't know what it stands for).
    On Fri, Jul 13, 2012 at 9:53 AM, N Keywal wrote:
    Another option would be to never write the wal locally: in nearly all
    cases it won't be used as it's on the dead box. And then the recovery
    would be directed by the NN to a dead DN in a single box failure. And
    we would have 3 copies instead of 2, increasing global reliability...
    On Fri, Jul 13, 2012 at 12:16 AM, N Keywal wrote:
    Hi Todd,

    Do you think the change would be too intrusive for hdfs? I aggree,
    there are many less critical components in hadoop :-). I was hoping
    that this state could be internal to the NN and could remain localized
    without any interface change...

    Your proposal would help for sure. I see 3 points if we try to do it
    for specific functions like recovery.
    - we would then need to manage the case when all 3 nodes timeouts
    after 1s, hoping that two of them are wrong positive...
    - the writes between DN would still be with the old timeout. I didn't
    look in details at the impact. It won't be an issue for single box
    crash, but for large failure it could.
    - we would want to change it to for the ipc.Client as well. Note sure
    if the change would not be visible to all functions.

    What worries me about setting very low timeouts is that it's difficult
    to validate, it tends to work until it goes to production...

    I was also thinking of making the deadNodes list public in the client,
    so hbase could tell to the DFSClient: 'this node is dead, I know it
    because I'm recovering the RS', but it would have some false positive
    (software region server crash), and seems a little like a
    workaround...

    In the middle (thinking again about your proposal), we could add a
    function in hbase that would first check the DNs owning the WAL,
    trying to connect with a 1s timeout, to be able to tell the DFSClient
    who's dead.
    Or we could put this function in DFSClient, a kind of boolean to say
    fail fast on dn errors for this read...


    On Thu, Jul 12, 2012 at 11:24 PM, Todd Lipcon wrote:
    Hey Nicolas,

    Another idea that might be able to help this without adding an entire
    new state to the protocol would be to just improve the HDFS client
    side in a few ways:

    1) change the "deadnodes" cache to be a per-DFSClient structure
    instead of per-stream. So, after reading one block, we'd note that the
    DN was dead, and de-prioritize it on future reads. Of course we'd need
    to be able to re-try eventually since dead nodes do eventually
    restart.
    2) when connecting to a DN, if the connection hasn't succeeded within
    1-2 seconds, start making a connection to another replica. If the
    other replica succeeds first, then drop the connection to the first
    (slow) node.

    Wouldn't this solve the problem less invasively?

    -Todd
    On Thu, Jul 12, 2012 at 2:20 PM, N Keywal wrote:
    Hi,

    I have looked at the HBase MTTR scenario when we lose a full box with
    its datanode and its hbase region server altogether: It means a RS
    recovery, hence reading the logs files and writing new ones (splitting
    logs).

    By default, HDFS considers a DN as dead when there is no heartbeat for
    10:30 minutes. Until this point, the NaneNode will consider it as
    perfectly valid and it will get involved in all read & write
    operations.

    And, as we lost a RegionServer, the recovery process will take place,
    so we will read the WAL & write new log files. And with the RS, we
    lost the replica of the WAL that was with the DN of the dead box. In
    other words, 33% of the DN we need are dead. So, to read the WAL, per
    block to read and per reader, we've got one chance out of 3 to go to
    the dead DN, and to get a connect or read timeout issue. With a
    reasonnable cluster and a distributed log split, we will have a sure
    winner.


    I looked in details at the hdfs configuration parameters and their
    impacts. We have the calculated values:
    heartbeat.interval = 3s ("dfs.heartbeat.interval").
    heartbeat.recheck.interval = 300s ("heartbeat.recheck.interval")
    heartbeatExpireInterval = 2 * 300 + 10 * 3 = 630s => 10.30 minutes

    At least on 1.0.3, there is no shutdown hook to tell the NN to
    consider this DN as dead, for example on a software crash.

    So before the 10:30 minutes, the DN is considered as fully available
    by the NN.  After this delay, HDFS is likely to start replicating the
    blocks contained in the dead node to get back to the right number of
    replica. As a consequence, if we're too aggressive we will have a side
    effect here, adding workload to an already damaged cluster. According
    to Stack: "even with this 10 minutes wait, the issue was met in real
    production case in the past, and the latency increased badly". May be
    there is some tuning to do here, but going under these 10 minutes does
    not seem to be an easy path.

    For the clients, they don't fully rely on the NN feedback, and they
    keep, per stream, a dead node list. So for a single file, a given
    client will do the error once, but if there are multiple files it will
    go back to the wrong DN. The settings are:

    connect/read:  (3s (hardcoded) * NumberOfReplica) + 60s ("dfs.socket.timeout")
    write: (5s (hardcoded) * NumberOfReplica) + 480s
    ("dfs.datanode.socket.write.timeout")

    That will set a 69s timeout to get a "connect" error with the default config.

    I also had a look at larger failure scenarios, when we're loosing a
    20% of a cluster. The smaller the cluster is the easier it is to get
    there. With the distributed log split, we're actually on a better
    shape from an hdfs point of view: the master could have error writing
    the files, because it could bet a dead DN 3 times in a row. If the
    split is done by the RS, this issue disappears. We will however get a
    lot of errors between the nodes.

    Finally, I had a look at the lease stuff Lease: write access lock to a
    file, no other client can write to the file. But another client can
    read it. Soft lease limit: another client can preempt the lease.
    Configurable.
    Default: 1 minute.
    Hard lease limit: hdfs closes the file and free the resources on
    behalf of the initial writer. Default: 60 minutes.

    => This should not impact HBase, as it does not prevent the recovery
    process to read the WAL or to write new files. We just need writes to
    be immediately available to readers, and it's possible thanks to
    HDFS-200. So if a RS dies we should have no waits even if the lease
    was not freed. This seems to be confirmed by tests.
    => It's interesting to note that this setting is much more aggressive
    than the one to declare a DN dead (1 minute vs. 10 minutes). Or, in
    HBase, than the default ZK timeout (3 minutes).
    => That said, HDFS states this: "When reading a file open for writing,
    the length of the last block still being written is unknown
    to the NameNode. In this case, the client asks one of the replicas for
    the latest length before starting to read its content.". This leads to
    an extra call to get the file length during the recovery (likely via
    the ipc.Client), and we may once again be sent to the dead DN. In this
    case we have an extra socket timeout to consider.

    On paper, it would be great to set "dfs.socket.timeout" to a minimal
    value during a log split, as we know we will get a dead DN 33% of the
    time. It may be more complicated in real life as the connections are
    shared per process. And we could still have the issue with the
    ipc.Client.
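    (As an illustration only: if the split code could afford a dedicated
    FileSystem instance instead of the shared cached one, the tuning might
    look like the sketch below; the values are examples, and the cache
    bypass assumes the standard fs.<scheme>.impl.disable.cache switch is
    available in the hdfs version used.)

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    // Sketch: a FileSystem tuned with short timeouts for the WAL-split readers
    // only, so the shared client used by the rest of the process is untouched.
    public class SplitFsFactory {
      public static FileSystem createSplitFs(Configuration base) throws IOException {
        Configuration c = new Configuration(base);
        c.setInt("dfs.socket.timeout", 3000);                // instead of 60s
        c.setInt("dfs.datanode.socket.write.timeout", 5000); // instead of 480s
        c.setBoolean("fs.hdfs.impl.disable.cache", true);    // get a private client
        return FileSystem.get(c);
      }
    }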


    As a conclusion, I think it could be interesting to have a third
    status for DN in HDFS: between live and dead as today, we could have
    "sick". We would have:
    1) Dead, known as such => As today: Start to replicate the blocks to
    other nodes. You enter this state after 10 minutes. We could even wait
    more.
    2) Likely to be dead: don't propose it for write blocks, put it with a
    lower priority for read blocks. We would enter this state in two
    conditions:
    2.1) No heartbeat for 30 seconds (configurable of course). As there
    is an existing heartbeat of 3 seconds, we could even be more
    aggressive here.
    2.2) We could have a shutdown hook in hdfs such as when a DN dies
    'properly' it says to the NN, and the NN can put it in this 'half dead
    state'.
    => In all cases, the node stays in the second state until the 10.30
    timeout is reached or until a heartbeat is received.
    3) Live.
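    (Schematically, and purely as an illustration of the proposal, nothing
    of the sort exists in HDFS:)

    // Illustration only; neither these states nor the thresholds exist in HDFS.
    public enum DatanodeLiveness {
      LIVE,  // heartbeats arriving normally
      SICK,  // no heartbeat for ~30s, or the DN announced its own shutdown:
             // excluded from new write pipelines, deprioritized for reads
      DEAD   // no heartbeat for 10:30 => re-replicate its blocks, as today
    }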

    For HBase it would make life much simpler I think:
    - no 69s timeout on the MTTR path
    - fewer connections to dead nodes, which today lead to resources being
    held all over the place until a timeout finally fires...
    - and there is already a very aggressive 3s heartbeat, so we would
    not add any workload.

    Thoughts?

    Nicolas


    --
    Todd Lipcon
    Software Engineer, Cloudera
  • Ted Yu at Jul 13, 2012 at 5:19 pm
    With HBASE-5699, the slowdown should be tolerable.
    On Fri, Jul 13, 2012 at 10:11 AM, lars hofhansl wrote:

    In that case, though, we'd slow down normal operation.
    Maybe that can be alleviated with HDFS-1783/HBASE-6116, although as
    mentioned in HBASE-6116, I have not been able to measure any performance
    improvement from this so far.

    -- Lars



    ________________________________
    From: N Keywal <nkeywal@gmail.com>
    To: dev@hbase.apache.org
    Sent: Friday, July 13, 2012 6:27 AM
    Subject: Re: hbase mttr vs. hdfs

    I looked at this part of the hdfs code, and:
    - it's not simple to add it in a clean way, even if doing it is possible.
    - I was wrong about the 3s heartbeat: the heartbeat is actually every 5
    minutes. So changing this would not be without a lot of side effects.
    - as a side note, HADOOP-8144 is interesting...

    So not writing the WAL on the local machine could be a good medium-term
    option, one that could likely be implemented with HDFS-385 (made
    available recently in "branch-1"; I don't know what that stands for).
    On Fri, Jul 13, 2012 at 9:53 AM, N Keywal wrote:
    Another option would be to never write the WAL locally: in nearly all
    cases that replica won't be usable anyway, as it's on the dead box. Then,
    on a single box failure, the recovery would not be directed by the NN to
    a dead DN. And we would have 3 usable copies instead of 2, increasing
    global reliability...
  • N Keywal at Jul 13, 2012 at 5:46 pm
    From a performance point of view, I think it could be manageable.
    To put it to an extreme: today we're writing to 3 locations, with the
    local one often being useless. If we write only to the 2 remote
    locations, we have the same reliability, without the issue of using
    the dead node when we read for recovery.

    And when we write to 2 remote locations today, we write to one which
    is on a remote rack. So if tomorrow we write to 3 remote locations, 2
    on the same rack and one on another:
    - we don't add disk i/o to the cluster: still 3 blocks written in the cluster.
    - the added latency should be low compared to the existing ones, as
    it's on the same rack.
    - we're adding some network i/o, but on the same rack.
    - as it's an append, we're on the same network socket, so the
    connection cost is not important.

    So we're getting a reliability boost at a very reasonable price I
    think (it's always cheap on paper :-)). And, for someone needing better
    write performance, having only two replicas is not unreasonable compared
    to what we have today in terms of reliability...

    I'm trying to find something different that could be made available
    sooner, without deployment issues. Maybe a hack is possible around
    DFSClient#reportBadBlocks, but there are some side effects as
    well...
  • N Keywal at Jul 16, 2012 at 12:00 pm
    I found another solution, better than the workaround I was previously
    mentioning, that could be implemented in the DFS client or the
    namenode:

    The NN returns an ordered set of DNs. We could open up this ordering. For
    an hlog file, if there is a DN on the same node as the dead RS, this
    DN would get the lowest priority. HBase would just need the file name
    of the block to make this decision.
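    As a sketch of the DFSClient-side variant (the helper is hypothetical,
    and assumes HBase passes in the host of the RS being recovered):

    import java.util.ArrayList;
    import java.util.List;

    // Sketch: reorder the replica hosts of a WAL block so the DN co-located
    // with the dead RS is tried last instead of first.
    public class WalReplicaOrdering {
      public static List<String> deprioritize(List<String> replicaHosts,
                                              String deadRsHost) {
        List<String> ordered = new ArrayList<String>();
        List<String> last = new ArrayList<String>();
        for (String host : replicaHosts) {
          if (host.equals(deadRsHost)) {
            last.add(host);    // keep it, but only as a last resort
          } else {
            ordered.add(host);
          }
        }
        ordered.addAll(last);
        return ordered;
      }
    }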

    Advantages are:
    - this part is already centralized in the hdfs namenode. To do it cleanly
    it requires publishing racks & node distribution in an interface, but
    I hope it's possible if not already done.
    - it can also be put in the DFSClient, and this solves the issues
    mentioned by Andrew: the customization would be for HBase only, and
    would not impact other applications sharing the cluster.
    - the client already modifies the node list returned by the NN, so
    we're not adding much responsibility here.
    - we just change the ordering of the nodes; nothing else changes in
    hdfs. Nevertheless, the dead node will be tried only if all the other
    nodes failed as well, so it solves the issue for the block transfer (I
    still need to look at the ipc.Client stuff, but hopefully that's a
    separate point)...

    Issues:
    - it requires a change in hdfs...

    Do you see any other issue with this approach?

    Cheers,

    N.
  • N Keywal at Jul 16, 2012 at 5:09 pm
    And to continue on this, for the files still open (i.e. our WAL
    files), we've got two calls to the dead DN:

    - one, during the input stream opening, from DFSClient#updateBlockInfo.
    This call fails, but the exception is swallowed without being logged.
    The node info is not updated, but there is no error, so we continue
    without the right info. The timeout will be 60 seconds. This call is
    on port 50020.
    - the second will be the one already mentioned for the data transfer,
    with the timeout of 69 seconds. The dead nodes list is not updated by
    the first failure, leading to a total wait time of more than 2 minutes
    (60s + 69s) if we get directed to the bad location.

  • Michael Stack at Jul 18, 2012 at 10:00 am
    The proposal seems good to me. It's minimally intrusive.

    See also below...
    On Mon, Jul 16, 2012 at 7:08 PM, N Keywal wrote:
    And to continue on this, for the files still open (i.e. our WAL
    files), we've got two calls to the dead DN:

    - one, during the input stream opening, from DFSClient#updateBlockInfo.
    This call fails, but the exception is swallowed without being logged.
    The node info is not updated, but there is no error, so we continue
    without the right info. The timeout will be 60 seconds. This call is
    on port 50020.
    - the second will be the one already mentioned for the data transfer,
    with the timeout of 69 seconds. The dead nodes list is not updated by
    the first failure, leading to a total wait time of more than 2 minutes
    (60s + 69s) if we get directed to the bad location.
    Saving this extra second timeout is worth our doing a bit of work.

    The NN is like the federal government. It has general high-level
    policies and knows about 'conditions' at the macro level: network
    topologies, placement policies. The DFSInput/OutputStream is like
    local government. It reacts to local conditions, reordering the
    node list if it just timed out the node in position zero. What's
    missing is state government: smarts in the DFSClient, a means of
    informing adjacent local governments about conditions that might
    affect their operation; dead or lagging DNs, etc.

    St.Ack
  • Michael Stack at Jul 17, 2012 at 3:42 pm

    On Thu, Jul 12, 2012 at 11:20 PM, N Keywal wrote:
    I looked in details at the hdfs configuration parameters and their
    impacts. We have the calculated values:
    heartbeat.interval = 3s ("dfs.heartbeat.interval").
    heartbeat.recheck.interval = 300s ("heartbeat.recheck.interval")
    heartbeatExpireInterval = 2 * 300 + 10 * 3 = 630s => 10.30 minutes ...
    connect/read: (3s (hardcoded) * NumberOfReplica) + 60s ("dfs.socket.timeout")
    write: (5s (hardcoded) * NumberOfReplica) + 480s ("dfs.datanode.socket.write.timeout")

    That will set a 69s timeout to get a "connect" error with the default config.
    Adding this list of configs to the manual in a table would be
    generally useful I think (these and the lease ones below), especially
    if it had a note on what happens when you change them.

    The 69s timeout is a bit rough, especially if a read on another open
    file already figured the DN dead; ditto on the write.
    On paper, it would be great to set "dfs.socket.timeout" to a minimal
    value during a log split, as we know we will get a dead DN 33% of the
    time. It may be more complicated in real life as the connections are
    shared per process. And we could still have the issue with the
    ipc.Client.
    Seems like we read DFSClient.this.socketTimeout when opening
    connections to blocks.
    As a conclusion, I think it could be interesting to have a third
    status for DN in HDFS: between live and dead as today, we could have
    "sick". We would have:
    1) Dead, known as such => As today: Start to replicate the blocks to
    other nodes. You enter this state after 10 minutes. We could even wait
    more.
    2) Likely to be dead: don't propose it for write blocks, put it with a
    lower priority for read blocks. We would enter this state in two
    conditions:
    2.1) No heartbeat for 30 seconds (configurable of course). As there
    is an existing heartbeat of 3 seconds, we could even be more
    aggressive here.
    2.2) We could have a shutdown hook in hdfs such that when a DN dies
    'properly' it tells the NN, and the NN can put it in this 'half dead'
    state.
    => In all cases, the node stays in the second state until the 10:30
    timeout is reached or until a heartbeat is received.
    I suppose, as Todd suggests, we could do this client side. The extra
    state would complicate the NN (making it difficult to get such a change
    in). The API to mark a DN dead seems like a nice-to-have. Master or
    client could pull on it when it knows a server is dead (not just the RS).
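
    To make the proposed states a bit more concrete, a rough sketch
    (whether this would live in the NN or, as suggested, client side; all
    names and the exact thresholds are made up):

        // Rough sketch of the proposed third state; nothing like this exists in HDFS.
        public enum DatanodeHealth {
          LIVE,  // heartbeats arriving normally
          SICK,  // no heartbeat for ~30s, or the DN announced its own shutdown:
                 // not offered for new writes, deprioritized for reads
          DEAD;  // no heartbeat for 10:30 => start re-replicating its blocks

          static DatanodeHealth of(long msSinceLastHeartbeat, boolean announcedShutdown) {
            if (msSinceLastHeartbeat > 630L * 1000) return DEAD;
            if (announcedShutdown || msSinceLastHeartbeat > 30L * 1000) return SICK;
            return LIVE; // a fresh heartbeat resets the counter, so a SICK node comes back
          }
        }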

    St.Ack
  • N Keywal at Jul 17, 2012 at 5:14 pm

    Adding this list of configs to the manual in a table would be
    generally useful I think (these and the lease ones below), especially
    if it had a note on what happens when you change the configs.
    The 69s timeout is a bit rough, especially if a read on another open
    file already figured the DN dead; ditto on the write.
    Agreed. I'm currently doing that. I also have a set of log
    analyses that could make it into the ref book. I will create a Jira to
    propose them.
    On paper, it would be great to set "dfs.socket.timeout" to a minimal
    value during a log split, as we know we will get a dead DN 33% of the
    time. It may be more complicated in real life as the connections are
    shared per process. And we could still have the issue with the
    ipc.Client.
    Seems like we read the DFSClient.this.socketTimeout opening
    connections to blocks.
    As a conclusion, I think it could be interesting to have a third
    status for DN in HDFS: between live and dead as today, we could have
    "sick". We would have:
    1) Dead, known as such => As today: Start to replicate the blocks to
    other nodes. You enter this state after 10 minutes. We could even wait
    more.
    2) Likely to be dead: don't propose it for write blocks, put it with a
    lower priority for read blocks. We would enter this state in two
    conditions:
    2.1) No heartbeat for 30 seconds (configurable of course). As there
    is an existing heartbeat of 3 seconds, we could even be more
    aggressive here.
    2.2) We could have a shutdown hook in hdfs such that when a DN dies
    'properly' it tells the NN, and the NN can put it in this 'half dead'
    state.
    => In all cases, the node stays in the second state until the 10:30
    timeout is reached or until a heartbeat is received.
    I suppose, as Todd suggests, we could do this client side. The extra
    state would complicate the NN (making it difficult to get such a change in).
    After some iterations I came to a solution close to his proposition,
    mentioned in my mail from yesterday.
    To me, we should fix this, and this includes HBASE-6401. The question
    is mainly which HDFS branch HBase would need it in, as the HDFS code
    changed between the 1.0.3 release and branch-2. HADOOP-8144 is
    also important for people configuring the topology, imho.
    The API to mark a DN dead seems like a nice-to-have. Master or
    client could pull on it when it knows a server is dead (not just the RS).
    Yes, there is a mechanism today to tell the NN to decommission a DN,
    but it's complex: we need to write a file with the 'unwanted' nodes,
    and we need to tell the NN to reload it. Not really a 'mark as dead'
    function.
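
    For reference, that existing mechanism is roughly the exclude file
    plus a refresh (Hadoop 1.x; the path below is only an example):

        # Roughly how the existing decommission mechanism works; clearly
        # not a lightweight "mark this DN as dead" call.
        echo "badnode.example.com" >> /path/to/dfs.exclude   # the file named by dfs.hosts.exclude
        hadoop dfsadmin -refreshNodes                        # ask the NN to re-read it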
  • Andrew Purtell at Jul 17, 2012 at 7:51 pm

    On Tue, Jul 17, 2012 at 10:14 AM, N Keywal wrote:
    To me, we should fix this, and this includes HBASE-6401. The question
    is mainly which HDFS branch HBase would need it in, as the HDFS code
    changed between the 1.0.3 release and branch-2. HADOOP-8144 is
    also important for people configuring the topology, imho.
    It's an interesting question whether the HDFS 2 write/append pipeline is
    superior to the HDFS 1 one. I've seen a common opinion that it is
    indeed superior (as in, it does not have known bugs) but I have not had
    the bandwidth to validate this personally. For us, it's mostly a moot
    point: we are planning our next production generation to be on HDFS 2
    so we can run a truly HA (hot failover) NN configuration, preferably
    with the Quorum Journal approach to edit log sharing (HDFS-3077).

    Best regards,

    - Andy

    Problems worthy of attack prove their worth by hitting back. - Piet
    Hein (via Tom White)
  • Michael Stack at Jul 18, 2012 at 9:27 am

    On Tue, Jul 17, 2012 at 7:14 PM, N Keywal wrote:
    I suppose, as Todd suggests, we could do this client side. The extra
    state would complicate the NN (making it difficult to get such a change in).
    After some iterations I came to a solution close to his proposition,
    mentioned in my mail from yesterday.
    To me, we should fix this, and this includes HBASE-6401. The question
    is mainly which HDFS branch HBase would need it in, as the HDFS code
    changed between the 1.0.3 release and branch-2. HADOOP-8144 is
    also important for people configuring the topology, imho.
    Yes. It needs to be fixed for 1.0 and 2.0. This is ugly, but could we
    have an HBase-modified DFSClient load ahead of the Hadoop one on the
    CLASSPATH so we could get the fix in earlier? (Maybe it's worth
    starting an HBaseDFSClient effort if there is a list of particular
    behaviors we want, such as the proposed reordering of the replicas
    given to us by the namenode, socket timeouts that differ depending on
    who is opening the DFSInput/OutputStream, etc.) We should work on
    getting fixes into Hadoop in the meantime (because an HBase DFSClient
    won't help the intra-DN traffic timeouts). It's kinda silly the way we
    can repeatedly time out on a DN we know elsewhere is dead while,
    meantime, the data is offline. It's kind of an important one to fix,
    I'd say.
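
    A rough sketch of what the reordering piece of such an HBaseDFSClient
    could look like: given the block locations handed back by the
    namenode and the set of hosts HBase already knows are dead (e.g. from
    the master's dead-server list), push those replicas to the end.
    LocatedBlock and DatanodeInfo are the real HDFS types; the class and
    method here are hypothetical:

        // Hypothetical: deprioritize replicas on hosts HBase already knows are dead.
        import java.util.Arrays;
        import java.util.Comparator;
        import java.util.Set;
        import org.apache.hadoop.hdfs.protocol.DatanodeInfo;
        import org.apache.hadoop.hdfs.protocol.LocatedBlock;

        public class ReplicaReorderer {
          /** Move replicas hosted on known-dead servers to the end of the location list. */
          public static void deprioritizeDeadHosts(LocatedBlock block, final Set<String> deadHosts) {
            DatanodeInfo[] locs = block.getLocations();
            Arrays.sort(locs, new Comparator<DatanodeInfo>() {
              public int compare(DatanodeInfo a, DatanodeInfo b) {
                boolean aDead = deadHosts.contains(a.getHostName());
                boolean bDead = deadHosts.contains(b.getHostName());
                // stable sort: the namenode's ordering is kept within each group
                return (aDead == bDead) ? 0 : (aDead ? 1 : -1);
              }
            });
          }
        }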
    The API to mark a DN dead seems like a nice-to-have. Master or
    client could pull on it when it knows a server is dead (not just the RS).
    Yes, there is a mechanism today to tell the NN to decommission a DN,
    but it's complex: we need to write a file with the 'unwanted' nodes,
    and we need to tell the NN to reload it. Not really a 'mark as dead'
    function.
    Yeah, I remember that bit of messing now. It is useful when ops want to
    decommission a node, but it does not serve this particular need.

    It's a bit of a tough one though, in that the NN would have to 'trust'
    the client that is pulling on this new API and saying a DN is down.

    St.Ack
