FAQ
Hi,

is it accurate to say that

- In 0.20 the Secondary NameNode acts as a cold spare; it can be used to
recreate the HDFS if the Primary NameNode fails, but with a delay of
minutes if not hours, and there is also some data loss;
- In 0.21 there are streaming edits to a Backup Node (HADOOP-4539), which
replaces the Secondary NameNode. The Backup Node can be used as a warm
spare, with failover being a matter of seconds. There can be multiple
Backup Nodes, for additional insurance against failure, and the previous
best practices apply to it;
- 0.22 will have further improvements to HDFS performance, such
as HDFS-1093.

Does the paper on HDFS Reliability by Tom White
<http://www.cloudera.com/wp-content/uploads/2010/03/HDFS_Reliability.pdf>
still represent the current state of things?

Thank you. Sincerely,
Mark


  • M. C. Srivas at Feb 14, 2011 at 5:51 pm
    The summary is quite inaccurate.
    On Mon, Feb 14, 2011 at 8:48 AM, Mark Kerzner wrote:

    > Hi,
    >
    > is it accurate to say that
    >
    > - In 0.20 the Secondary NameNode acts as a cold spare; it can be used to
    > recreate the HDFS if the Primary NameNode fails, but with a delay of
    > minutes if not hours, and there is also some data loss;

    The Secondary NN is not a spare. It augments the Primary by offloading
    some of its work to another machine. The work offloaded is "log rollup",
    or "checkpointing". This has been a source of constant confusion (someone
    named it, incorrectly, a "secondary", and now we are stuck with it).

    The Secondary NN certainly cannot take over for the Primary. That is not
    its purpose.

    Yes, there is data loss.
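
    The "log rollup" described above, and why relying on the checkpoint
    loses data, can be illustrated with a toy model. This is only a sketch:
    the ToyNameNode class and the replay/checkpoint helpers are hypothetical
    names invented for illustration, not HDFS code, and the real fsimage and
    edit log formats are far richer.

```python
# Toy model only: these names are hypothetical and are not part of HDFS.
from dataclasses import dataclass, field


@dataclass
class ToyNameNode:
    image: dict = field(default_factory=dict)  # namespace as of the last checkpoint
    edits: list = field(default_factory=list)  # operations logged since that checkpoint

    def create(self, path, size):
        self.edits.append(("create", path, size))

    def delete(self, path):
        self.edits.append(("delete", path, None))


def replay(image, edits):
    """Apply an edit log to a copy of the image and return the resulting namespace."""
    namespace = dict(image)
    for op, path, size in edits:
        if op == "create":
            namespace[path] = size
        elif op == "delete":
            namespace.pop(path, None)
    return namespace


def checkpoint(nn):
    """Roll the edit log into a fresh image, empty the log, and return a copy."""
    nn.image = replay(nn.image, nn.edits)
    nn.edits = []
    return dict(nn.image)  # the copy held on the checkpointing machine


nn = ToyNameNode()
nn.create("/a", 1)
nn.create("/b", 2)
secondary_copy = checkpoint(nn)    # the checkpoint now reflects /a and /b
nn.create("/c", 3)                 # logged on the primary only
nn.delete("/a")                    # logged on the primary only

live = replay(nn.image, nn.edits)  # what the primary would serve right now
# Paths that appear in only one of the two views, i.e. where they disagree.
stale = sorted(set(live) ^ set(secondary_copy))
print(stale)
```

    In this toy, anything logged after the last checkpoint exists only in
    the primary's edit log, which is the data at risk if the checkpoint copy
    is all that survives.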



    > - In 0.21 there are streaming edits to a Backup Node (HADOOP-4539), which
    > replaces the Secondary NameNode. The Backup Node can be used as a warm
    > spare, with failover being a matter of seconds. There can be multiple
    > Backup Nodes, for additional insurance against failure, and the previous
    > best practices apply to it;

    There is no "Backup NN" in the manner you are thinking of. Failover is
    completely manual, requires a restart of the "whole world", and takes about
    2-3 hours. If you are lucky, you may have only a little data loss (people
    have lost entire clusters due to this; from what I understand, you are far
    better off resurrecting the Primary than trying to bring up a Backup NN).

    In any case, when you run it the way you describe above, you will have to
    (a) make sure that the primary is dead
    (b) edit hdfs-site.xml on *every* datanode to point to the new IP address of
    the backup, and restart each datanode
    (c) wait 2-3 hours for the block reports from every restarted DN to
    finish

    2-3 hours afterwards:
    (d) restart all the TTs and the JT so that they connect to the new NN
    (e) finally, restart all the clients (e.g., HBase, Oozie, etc.)
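
    The ordering of steps (a) through (e) matters more than any single
    command, so here is a minimal sketch that only builds the checklist in
    that order. The function name failover_plan, the host names, and the
    address are all hypothetical; nothing here runs a Hadoop command.

```python
# Hypothetical planning helper: it builds an ordered checklist and executes nothing.
def failover_plan(datanodes, tasktrackers, jobtracker, clients, new_nn_address):
    """Return the manual failover steps in the order given in the post above."""
    steps = ["confirm the old primary NameNode is dead"]                      # (a)
    for dn in datanodes:                                                      # (b)
        steps.append(f"{dn}: point hdfs-site.xml at {new_nn_address}, restart DataNode")
    steps.append("wait until every restarted DataNode has finished its block report")  # (c)
    for tt in tasktrackers:                                                   # (d)
        steps.append(f"{tt}: restart TaskTracker")
    steps.append(f"{jobtracker}: restart JobTracker")
    for client in clients:                                                    # (e)
        steps.append(f"{client}: restart client")
    return steps


plan = failover_plan(
    datanodes=["dn1", "dn2"],      # made-up host names
    tasktrackers=["tt1"],
    jobtracker="jt1",
    clients=["hbase", "oozie"],
    new_nn_address="10.0.0.2",     # made-up address of the replacement NameNode
)
for number, step in enumerate(plan, start=1):
    print(number, step)
```

    Even on this two-datanode example the list makes the point: every
    daemon and every client is touched, and nothing after step (c) can start
    until the block reports are in.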

    Many companies, including Yahoo! and Facebook, use a couple of NetApp filers
    to hold the actual data that the NN writes. The two NetApp filers are run in
    "HA" mode with NVRAM copying. But the NN remains a single point of failure,
    and there is probably some data loss.


    > - 0.22 will have further improvements to HDFS performance, such
    > as HDFS-1093.
    >
    > Does the paper on HDFS Reliability by Tom White
    > <http://www.cloudera.com/wp-content/uploads/2010/03/HDFS_Reliability.pdf>
    > still represent the current state of things?

    See Dhruba's blog post about the Avatar NN + some custom "stackable HDFS"
    code on all the clients + ZooKeeper + the dual NetApp filers.

    It helps Facebook do manual, controlled failover during software upgrades,
    at the cost of some performance loss on the DataNodes (the DataNodes have to
    send twice as many block reports, and each block report is expensive, so it
    limits the DataNode a bit). The article does not discuss data loss when the
    failover is initiated manually, so I don't know about that.

    http://hadoopblog.blogspot.com/2010/02/hadoop-namenode-high-availability.html

    > Thank you. Sincerely,
    > Mark
  • Mark Kerzner at Feb 14, 2011 at 8:31 pm
    Thank you, M. C. Srivas, that was enormously useful. I understand it now,
    but just to be complete, I have reformulated my points according to your
    comments:

    - In 0.20 the Secondary NameNode performs checkpointing. Its data can be
    used to recreate the HDFS if the Primary NameNode fails. The procedure is
    manual and may take hours, and any changes made since the last
    checkpoint are lost;
    - In 0.21 there is a Backup Node (HADOOP-4539), which aims to help with
    HA and act as a cold spare. The data loss is less than with the Secondary
    NN, but failover is still manual and potentially error-prone, and it takes
    hours;
    - There is an AvatarNode patch available for 0.20, and Facebook runs its
    cluster that way, but the patch submitted to Apache requires testing, and
    developers adopting it must do some custom configuration and exercise
    care in their work.

    In conclusion, when building an HA HDFS cluster, one needs to follow the
    best practices outlined by Tom White
    <http://www.cloudera.com/wp-content/uploads/2010/03/HDFS_Reliability.pdf>
    and may still need to resort to specialized NFS filers to hold the
    NameNode's data.

    Sincerely,
    Mark


  • Ted Dunning at Feb 14, 2011 at 8:41 pm
    Note that the document purports to be from 2008 and, at best, was uploaded
    just about a year ago.

    That it is still pretty accurate is a tribute to either the stability or
    the stagnation of HBase, depending on how you read it.
  • M. C. Srivas at Feb 14, 2011 at 10:46 pm
    I understand you are writing a book, "Hadoop in Practice". If so, it's
    important that what the book recommends be verified in practice (I mean,
    beyond simply posting to this newsgroup; for instance, the recommendations
    on NN failover should be tried out before you write about how to do it).
    Otherwise you won't know whether your recommendations really work.


  • Mark Kerzner at Feb 14, 2011 at 10:53 pm
    I completely agree. I am using your and the group's postings to define
    the direction and approaches, but I am also trying every solution myself,
    and I am beginning to do just that now with the AvatarNode.

    Thank you,
    Mark
  • Allen Wittenauer at Feb 16, 2011 at 7:31 pm
    I'm more than a little concerned that you missed the whole thing about multiple directories, including a remote one, for the fsimage. That's probably the #1 thing that most of the big grids do to protect the NN data. In all the failures I've personally been involved in, I can remember only one where the NFS copy wasn't used to recover a namenode (and that was an especially odd bug, not a NN failure per se). The only reason to fall back to the 2ndary NN in 0.20 should be that you've hit a similarly spectacular bug. Point blank: anyone who runs the NN without it writing to a remote copy doesn't know what they are doing.

    Also, until AvatarNode comes of age (which, from what I understand, FB hasn't been running for very long themselves), there is no such thing as an HA NN. We all have high hopes that it works out, but it likely isn't anywhere near ready for primetime yet.
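
    For readers who have not seen the multiple-directory setup mentioned
    above: in 0.20 the NameNode's dfs.name.dir property takes a
    comma-separated list of directories, and the metadata is written to each
    of them. The sketch below only renders such a property for
    hdfs-site.xml; the helper name name_dir_property and both paths
    (one local disk, one NFS mount) are assumptions made up for
    illustration.

```python
import xml.etree.ElementTree as ET


def name_dir_property(directories):
    """Render a dfs.name.dir <property> element listing every metadata directory."""
    if len(directories) < 2:
        # One directory means one copy of the fsimage, which defeats the purpose.
        raise ValueError("list at least two directories, one of them remote")
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = "dfs.name.dir"
    ET.SubElement(prop, "value").text = ",".join(directories)
    return ET.tostring(prop, encoding="unicode")


# Hypothetical paths: a local disk plus a directory on an NFS mount.
snippet = name_dir_property(["/data/1/dfs/nn", "/mnt/nfs/dfs/nn"])
print(snippet)
```

    The helper cannot tell whether a path is really remote, so that part
    stays the operator's responsibility.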

Discussion Overview
group: common-user @
category: hadoop
posted: Feb 14, '11 at 4:48p
active: Feb 16, '11 at 7:31p
posts: 7
users: 4
website: hadoop.apache.org...
irc: #hadoop
