Improvements to Global Black-listing of TaskTrackers
----------------------------------------------------

Key: HADOOP-6014
URL: https://issues.apache.org/jira/browse/HADOOP-6014
Project: Hadoop Core
Issue Type: Improvement
Components: mapred
Affects Versions: 0.20.0
Reporter: Arun C Murthy
Fix For: 0.21.0


HADOOP-4305 added a global black-list of tasktrackers.

We saw a scenario on one of our clusters where a few jobs caused a lot of tasktrackers to be blacklisted almost immediately. This was caused by a specific set of jobs (from the same user) whose tasks were shot down by the TaskTracker for exceeding the vmem limit of 2G. Each of these jobs had over 600 failures of the same kind. This resulted in each of these jobs black-listing some tasktrackers, which in itself is wrong since the failures had nothing to do with the node on which they occurred (i.e. high memory usage) and shouldn't have penalized the tasktracker. We clearly need to start treating system and user failures separately for black-listing etc. A DiskError is fatal and should probably lead to immediate blacklisting, while a task which was 'failed' for using too much memory shouldn't count against the tasktracker at all!
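As a rough illustration of the proposed distinction (a minimal sketch, not the actual JobTracker/TaskTracker code; FailureCause and countsAgainstTracker are hypothetical names), the blacklist counter would only move for node-level faults:

{code:java}
// Hypothetical sketch, not actual Hadoop code: decide whether a task failure
// should count toward a tasktracker's blacklist score.
public class BlacklistPolicy {

  /** Illustrative failure causes; the real failure reporting looks different. */
  enum FailureCause { DISK_ERROR, MEMORY_LIMIT_EXCEEDED, USER_CODE_ERROR }

  /**
   * A DiskError points at the node itself, so it counts (and might even
   * justify immediate blacklisting); a task shot down for exceeding its vmem
   * limit is a job problem and should not penalize the tracker.
   */
  static boolean countsAgainstTracker(FailureCause cause) {
    switch (cause) {
      case DISK_ERROR:            return true;   // node-level (system) fault
      case MEMORY_LIMIT_EXCEEDED: return false;  // job asked for too much memory
      case USER_CODE_ERROR:       return false;  // buggy user code
      default:                    return false;
    }
  }

  public static void main(String[] args) {
    System.out.println(countsAgainstTracker(FailureCause.DISK_ERROR));            // true
    System.out.println(countsAgainstTracker(FailureCause.MEMORY_LIMIT_EXCEEDED)); // false
  }
}
{code}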

The other problem is that we never configured mapred.max.tracker.blacklists and so continue to use the default value of 4. Furthermore, this config should really be a percentage of the cluster size and not an absolute number.
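For example, deriving the threshold from cluster size might look like the following (a hypothetical sketch; the method, fraction and floor are illustrative assumptions, not an existing Hadoop config):

{code:java}
// Hypothetical sketch: derive the global blacklist threshold from cluster
// size instead of the fixed mapred.max.tracker.blacklists default of 4.
public class BlacklistThreshold {

  /** Threshold scales with cluster size, with a small absolute floor. */
  static int maxTrackerBlacklists(int clusterSize, double fractionOfCluster, int floor) {
    return Math.max(floor, (int) Math.ceil(clusterSize * fractionOfCluster));
  }

  public static void main(String[] args) {
    // Illustrative: 1% of the cluster, never below the current default of 4.
    System.out.println(maxTrackerBlacklists(100,  0.01, 4));  // 4
    System.out.println(maxTrackerBlacklists(2000, 0.01, 4));  // 20
  }
}
{code}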

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

  • Jim Huang (JIRA) at Jun 12, 2009 at 2:04 am
    [ https://issues.apache.org/jira/browse/HADOOP-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718696#action_12718696 ]

    Jim Huang commented on HADOOP-6014:
    -----------------------------------

    In HADOOP-5478, we are asking for the ability to blacklist a tasktracker via a node health check script. This should help differentiate detectable system failures from user/application failures. Please take HADOOP-5478 into consideration for this issue.
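    The idea behind such a script (a hypothetical sketch only, not HADOOP-5478's actual implementation; the script path and "ERROR" convention are assumptions) is that an admin-supplied check runs on the node and its output decides whether the tracker should be taken out of service:

    {code:java}
    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    // Hypothetical sketch of a node health check: run an admin-supplied script
    // and treat any output line starting with "ERROR" as a detectable system
    // failure. Illustrative only, not the HADOOP-5478 patch.
    public class NodeHealthCheck {

      static boolean nodeIsHealthy(String scriptPath) throws Exception {
        Process p = new ProcessBuilder(scriptPath).redirectErrorStream(true).start();
        try (BufferedReader r =
                 new BufferedReader(new InputStreamReader(p.getInputStream()))) {
          String line;
          while ((line = r.readLine()) != null) {
            if (line.startsWith("ERROR")) {
              return false;  // the script detected a node-level problem
            }
          }
        }
        return p.waitFor() == 0;
      }

      public static void main(String[] args) throws Exception {
        // Path to a hypothetical script that checks disks, free space, required binaries, etc.
        System.out.println(nodeIsHealthy("/usr/local/bin/node-health.sh"));
      }
    }
    {code}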
  • Devaraj Das (JIRA) at Jun 12, 2009 at 11:30 am
    [ https://issues.apache.org/jira/browse/HADOOP-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718801#action_12718801 ]

    Devaraj Das commented on HADOOP-6014:
    -------------------------------------

    Maybe as a first step, we can just treat the failures that were explicitly initiated by the TaskTracker differently, and not have the TaskTracker be penalized for those.
  • Owen O'Malley (JIRA) at Jun 18, 2009 at 5:34 pm
    [ https://issues.apache.org/jira/browse/HADOOP-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721362#action_12721362 ]

    Owen O'Malley commented on HADOOP-6014:
    ---------------------------------------

    I'd tend to agree with Jim that we should just use HADOOP-5478 and revert the cross-job blacklisting. The problem is that user jobs are written by users and therefore are not really suitable for detecting unhealthy nodes.
  • Owen O'Malley (JIRA) at Jun 19, 2009 at 4:49 am
    [ https://issues.apache.org/jira/browse/HADOOP-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721647#action_12721647 ]

    Owen O'Malley commented on HADOOP-6014:
    ---------------------------------------

    I guess I'll go further: user jobs can demonstrate that a node is healthy, as long as we are willing to run tasks on it. But it is very hard to draw the inference the other way when you may have a run of "bad" jobs that are all expected to fail. About all we could do is notice that they are failing on all/many of the nodes and thus weaken their contribution to the node badness measure.
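    One way to picture that weakening (a hypothetical sketch, not code from any patch; the weighting formula is an assumption) is to scale a job's contribution to a node's badness by how widely its failures were spread across the cluster:

    {code:java}
    // Hypothetical sketch: down-weight a job's contribution to a node's
    // "badness" when the job fails on many of the nodes it ran on, since such
    // failures say more about the job than about any particular node.
    public class JobBadnessWeight {

      /**
       * @param nodesWithFailures distinct nodes on which this job had failures
       * @param nodesTouched      distinct nodes the job ran tasks on
       * @return weight in [0, 1]: 1.0 = full contribution, 0.0 = ignored
       */
      static double contributionWeight(int nodesWithFailures, int nodesTouched) {
        if (nodesTouched == 0) {
          return 0.0;
        }
        double spread = (double) nodesWithFailures / nodesTouched;
        return 1.0 - spread;  // failed everywhere -> contributes nothing
      }

      public static void main(String[] args) {
        System.out.println(contributionWeight(1, 100));   // high weight (~0.99): likely a node problem
        System.out.println(contributionWeight(95, 100));  // low weight (~0.05): likely a bad job
      }
    }
    {code}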
  • Amar Kamat (JIRA) at Jun 19, 2009 at 5:49 am
    [ https://issues.apache.org/jira/browse/HADOOP-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721662#action_12721662 ]

    Amar Kamat commented on HADOOP-6014:
    ------------------------------------

    bq. Maybe as a first step, we can just treat the failures that were explicitly initiated by the TaskTracker differently, and not have the TaskTracker be penalized for those.
    I think for now this will be a simple thing to do. A task can fail because of:
    1. code issues (a failure, e.g. buggy code)
    2. node issues (killed, e.g. disk errors)
    3. a mismatch (a kill recorded as a failure, e.g. insufficient memory)

    In case #3 it's not the TT's fault, and hence we should be less aggressive when counting such failures against it.

    bq. I'd tend to agree with Jim that we should just use HADOOP-5478 and revert the cross-job blacklisting.
    Cross-job blacklisting will still be required. Consider a case where a node's environment is messed up (all the basic apps, e.g. wc, sort etc., are missing). In such a case I don't think node scripts will help. The number of task/job failures looks like the right metric to me.
  • Amareshwari Sriramadasu (JIRA) at Jun 19, 2009 at 9:03 am
    [ https://issues.apache.org/jira/browse/HADOOP-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721701#action_12721701 ]

    Amareshwari Sriramadasu commented on HADOOP-6014:
    -------------------------------------------------

    bq. But it is very hard to draw the inference the other way when you may have a run of "bad" jobs that are all expected to fail.
    The current blacklisting strategy only looks at trackers blacklisted by successful jobs. Also, a TT gets blacklisted only if the #blacklists for the tracker is 50% above the average #blacklists over the active and potentially faulty trackers.
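    In other words (a hypothetical sketch of the heuristic just described, with illustrative names rather than the real JobTracker code):

    {code:java}
    import java.util.List;

    // Hypothetical sketch of the rule described above: a tracker is globally
    // blacklisted only when its per-job blacklist count is 50% above the
    // average count over the active and potentially faulty trackers.
    public class GlobalBlacklistCheck {

      static boolean shouldBlacklist(int trackerBlacklistCount,
                                     List<Integer> countsForAllTrackers) {
        if (countsForAllTrackers.isEmpty()) {
          return false;
        }
        double sum = 0;
        for (int c : countsForAllTrackers) {
          sum += c;
        }
        double average = sum / countsForAllTrackers.size();
        // "50% above the average" => strictly greater than 1.5 * average.
        return trackerBlacklistCount > 1.5 * average;
      }

      public static void main(String[] args) {
        List<Integer> counts = List.of(0, 1, 1, 2, 8);   // average = 2.4
        System.out.println(shouldBlacklist(8, counts));  // true:  8 > 3.6
        System.out.println(shouldBlacklist(2, counts));  // false: 2 <= 3.6
      }
    }
    {code}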
