FAQ
Hi,

In the original MapReduce paper from Google, it is mentioned
that healthy workers can take over failed tasks from other
workers. Does Hadoop have the same failure recovery strategy?
My other question is: in the paper, it seems nodes can be
added or removed while the cluster is running jobs. How does
Hadoop achieve this, since the slave locations are saved in the
slaves file and the master doesn't know about new nodes until it
restarts and reloads the slave list?

Thanks,

Ming Yang


  • Owen O'Malley at Oct 19, 2007 at 4:02 am

    On Oct 18, 2007, at 8:05 PM, Ming Yang wrote:

    > In the original MapReduce paper from Google, it is mentioned
    > that healthy workers can take over failed tasks from other
    > workers. Does Hadoop have the same failure recovery strategy?
    Yes. If a task fails on one node, it is assigned to another free node
    automatically.
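
    For reference, the retry limit is configurable. A minimal hadoop-site.xml
    sketch, assuming the classic MapReduce property names
    (mapred.map.max.attempts and mapred.reduce.max.attempts, which default
    to 4; names may differ in very old releases):

    <!-- Maximum attempts per task before the whole job is failed.     -->
    <!-- A failed attempt is rescheduled, on another node if possible. -->
    <property>
      <name>mapred.map.max.attempts</name>
      <value>4</value>
    </property>
    <property>
      <name>mapred.reduce.max.attempts</name>
      <value>4</value>
    </property>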
    > My other question is: in the paper, it seems nodes can be
    > added or removed while the cluster is running jobs. How does
    > Hadoop achieve this, since the slave locations are saved in the
    > slaves file and the master doesn't know about new nodes until it
    > restarts and reloads the slave list?
    The slaves file is only used by the startup scripts when bringing up
    the cluster. If additional data nodes or task trackers (i.e. slaves)
    are started, they automatically join the cluster and will be given
    work. If the servers on one of the slaves are killed, the work will
    be redone on other nodes.
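
    To illustrate (hostnames here are hypothetical): conf/slaves is just a
    list of hosts that the start/stop scripts ssh into; it is never consulted
    again while the cluster is running.

    $ cat conf/slaves           # read only by bin/start-*.sh / bin/stop-*.sh
    node001
    node002
    node003
    $ bin/start-dfs.sh          # ssh to each listed slave, start a datanode
    $ bin/start-mapred.sh       # ssh to each listed slave, start a tasktracker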

    -- Owen
  • Nguyen Manh Tien at Oct 19, 2007 at 7:46 am
    Owen, could you show me how to start additional data nodes or task trackers?



  • Arun C Murthy at Oct 19, 2007 at 8:11 am

    Nguyen Manh Tien wrote:
    > Owen, could you show me how to start additional data nodes or task trackers?
    On a new node that you want to be a part of the cluster run:

    $ $HADOOP_HOME/bin/hadoop-daemon.sh start datanode

    or

    $ $HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker
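
    A daemon started this way joins the running cluster immediately, but it
    is not managed by the cluster-wide scripts. If you also want
    bin/start-all.sh and bin/stop-all.sh to cover the new node, append it to
    the slaves file on the master (the hostname below is hypothetical):

    $ echo node004 >> $HADOOP_HOME/conf/slaves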

    Arun

  • Lance Amundsen at Oct 20, 2007 at 6:23 am
    How can I get more "real", but simulated, Hadoop nodes for testing without
    more machines? Can I add duplicate names in the slaves file, or do I need
    separate IPs for everything?

    This is just for testing purposes. Ideally, I'd like to simulate a few
    nodes on my Windows laptop, to make it easier to debug actual distributed
    stuff.

    Maybe if I am clever with Cygwin, I could get several sessions
    communicating with each other using virtual IPs, albeit all through
    localhost. But would I have issues with a single Hadoop directory
    structure if I did this?
  • dhruba Borthakur at Oct 20, 2007 at 7:12 am
    I have done this before by manually starting more than one datanode on
    the same machine. Each datanode instance has to have its own
    configuration directory, and in each configuration the value of
    dfs.data.dir has to be different from the others.
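
    A sketch of that setup with two hypothetical configuration directories;
    hadoop-daemon.sh accepts a --config flag pointing at an alternate conf
    dir. Each conf dir's hadoop-env.sh should also set distinct
    HADOOP_LOG_DIR and HADOOP_PID_DIR values, and the datanode port settings
    must differ so the two instances don't collide (the exact port property
    names vary by Hadoop version):

    # conf1/hadoop-site.xml sets dfs.data.dir to /tmp/hadoop/data1
    # conf2/hadoop-site.xml sets dfs.data.dir to /tmp/hadoop/data2
    $ bin/hadoop-daemon.sh --config conf1 start datanode
    $ bin/hadoop-daemon.sh --config conf2 start datanode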

    Thanks,
    dhruba

  • Ming Yang at Oct 19, 2007 at 1:59 pm
    Does failure recovery also happen when running Hadoop locally as a
    single process? I was testing my code locally and didn't see any
    failure recovery. The other thing is, how is "failure" defined? If an
    uncaught exception occurs in map(...) and the task fails, can this kind
    of failure be recovered?

    Thanks,

    Ming

  • Stu Hood at Oct 20, 2007 at 12:55 pm
    You can also use the patch attached to:
    https://issues.apache.org/jira/browse/HADOOP-1989
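
    If you go that route, a typical workflow (assuming a source checkout;
    Hadoop of this era builds with ant) would be roughly:

    $ cd hadoop                          # source checkout
    $ patch -p0 < HADOOP-1989.patch      # apply the patch from the JIRA issue
    $ ant jar                            # rebuild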

    Thanks,
    Stu


    -----Original Message-----
    From: dhruba Borthakur <dhruba@yahoo-inc.com>
    Sent: Saturday, October 20, 2007 3:08am
    To: hadoop-user@lucene.apache.org
    Subject: RE: Simulated Nodes
