snapshot a map-reduce to DFS ... and restore
--------------------------------------------

Key: HADOOP-91
URL: http://issues.apache.org/jira/browse/HADOOP-91
Project: Hadoop
Type: New Feature
Components: mapred
Reporter: eric baldeschwieler
Priority: Minor


The idea is to be able to issue a command to the job tracker that
will halt a map-reduce and archive it to a directory in such a way
that it can later be restarted.

We could also set a mode that would cause this to happen to a job
when it fails. This would allow one to debug and restart a failing
job in a reasonable way, which might be important for long-running jobs. It
has certainly been important in similar systems I've seen before. One
could restart with a new jar, or workbench a single failing map or reduce.


--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira


  • Bryan Pendleton (JIRA) at Mar 20, 2006 at 6:54 pm
    [ http://issues.apache.org/jira/browse/HADOOP-91?page=comments#action_12371127 ]

    Bryan Pendleton commented on HADOOP-91:
    ---------------------------------------

    This would be very useful, although it should be noted that snapshotting a job to DFS takes extra space in proportion to the replication level, so the total comes to replication+1 times what the job itself uses. If you're running jobs that produce large intermediate results, then checkpointing with, say, the default 3x replication requires 4 times as much space as the job would otherwise. For jobs with no side effects, perhaps the default should be to checkpoint with a replication of 1 (assuming per-file replication gets added to DFS) and let lost blocks turn into lost tasks that simply get re-run. Hadoop should minimize space usage wherever possible if it's really going to scale up to huge workloads.
  • eric baldeschwieler (JIRA) at Mar 20, 2006 at 7:21 pm
    [ http://issues.apache.org/jira/browse/HADOOP-91?page=comments#action_12371131 ]

    eric baldeschwieler commented on HADOOP-91:
    -------------------------------------------

    Good points. Adding an option to specify the replication level would be a good addition. Some of this data can be regenerated automatically; only the metadata may really need a high replication level.
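
The space estimate in Bryan Pendleton's comment can be worked through with a small sketch. It assumes the job keeps one working copy of its intermediate data and that a snapshot adds one further copy per replica on DFS. The function names are hypothetical and are not part of Hadoop.

```python
def snapshot_space_factor(snapshot_replication: int) -> int:
    """Return total space as a multiple of the job's own intermediate data.

    Assumption: the job holds one working copy, and the snapshot stores
    `snapshot_replication` additional copies on DFS.
    """
    if snapshot_replication < 1:
        raise ValueError("snapshot replication must be at least 1")
    return 1 + snapshot_replication


def total_space_bytes(intermediate_bytes: int, snapshot_replication: int) -> int:
    """Total bytes used by the job plus its snapshot under the same assumption."""
    return intermediate_bytes * snapshot_space_factor(snapshot_replication)


if __name__ == "__main__":
    # Compare the 3x default mentioned in the thread with the suggested 1x.
    for replication in (3, 1):
        factor = snapshot_space_factor(replication)
        print(f"replication {replication}: {factor}x the job's own space")
```

Under these assumptions, 3x replication gives a factor of 4 and 1x replication gives a factor of 2, which is the saving the comment argues for when lost blocks can simply be recomputed.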

Discussion Overview
group: common-dev
categories: hadoop
posted: Mar 17, '06 at 4:27a
active: Mar 20, '06 at 7:21p
posts: 3
users: 1
website: hadoop.apache.org...
irc: #hadoop