[ https://issues.apache.org/jira/browse/HADOOP-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12576085#action_12576085 ]

Amar Kamat commented on HADOOP-2175:
------------------------------------

HADOOP-1984 will cause more waiting here, since it will take longer to send the fetch-failure notifications. The simplest solution seems to be to reschedule all the maps on the blacklisted node. The question is: can we do better? Can the JT infer and handle this situation?

Blacklisted hosts may not be able to serve map outputs
------------------------------------------------------

Key: HADOOP-2175
URL: https://issues.apache.org/jira/browse/HADOOP-2175
Project: Hadoop Core
Issue Type: Bug
Components: mapred
Reporter: Runping Qi
Assignee: Arun C Murthy

After a node fails 4 mappers (tasks), it is added to the blacklist and will no longer accept tasks.
However, it will continue to serve the map outputs of any mappers that ran successfully there.
The node may not be able to serve those map outputs either.
This will cause the reducers to mark the corresponding map outputs as coming from slow hosts
but to keep trying to fetch the map outputs from that node.
This may lead to waiting forever.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


  • Runping Qi (JIRA) at Mar 7, 2008 at 2:21 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12576208#action_12576208 ]

    Runping Qi commented on HADOOP-2175:
    ------------------------------------


    Re-executing the mappers will likely be the correct action in most cases.
    Even in cases where this is not optimal, the cost will not be high.
    Thus, I think that is the right behavior until somebody comes up with a simple and effective alternative.

  • Devaraj Das (JIRA) at Mar 9, 2008 at 12:33 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12576760#action_12576760 ]

    Devaraj Das commented on HADOOP-2175:
    -------------------------------------

    bq. Re-executing the mappers will likely be the correct action in most cases.
    +1
  • Hadoop QA (JIRA) at Mar 17, 2008 at 8:39 am
    [ https://issues.apache.org/jira/browse/HADOOP-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12579345#action_12579345 ]

    Hadoop QA commented on HADOOP-2175:
    -----------------------------------

    -1 overall. Here are the results of testing the latest attachment
    http://issues.apache.org/jira/secure/attachment/12377928/HADOOP-2175-v1.patch
    against trunk revision 619744.

    @author +1. The patch does not contain any @author tags.

    tests included -1. The patch doesn't appear to include any new or modified tests.
    Please justify why no tests are needed for this patch.

    javadoc +1. The javadoc tool did not generate any warning messages.

    javac +1. The applied patch does not generate any new javac compiler warnings.

    release audit +1. The applied patch does not generate any new release audit warnings.

    findbugs +1. The patch does not introduce any new Findbugs warnings.

    core tests +1. The patch passed core unit tests.

    contrib tests +1. The patch passed contrib unit tests.

    Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1979/testReport/
    Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1979/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
    Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1979/artifact/trunk/build/test/checkstyle-errors.html
    Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1979/console

    This message is automatically generated.
    Blacklisted hosts may not be able to serve map outputs
    ------------------------------------------------------

    Key: HADOOP-2175
    URL: https://issues.apache.org/jira/browse/HADOOP-2175
    Project: Hadoop Core
    Issue Type: Bug
    Components: mapred
    Reporter: Runping Qi
    Assignee: Amar Kamat
    Attachments: HADOOP-2175-v1.patch


  • Amareshwari Sriramadasu (JIRA) at Mar 20, 2008 at 5:33 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12580865#action_12580865 ]

    Amareshwari Sriramadasu commented on HADOOP-2175:
    -------------------------------------------------

    Code looks good
  • Amar Kamat (JIRA) at Mar 20, 2008 at 8:43 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12580918#action_12580918 ]

    Amar Kamat commented on HADOOP-2175:
    ------------------------------------

    Submitting a patch with a test case.
    Blacklisted hosts may not be able to serve map outputs
    ------------------------------------------------------

    Key: HADOOP-2175
    URL: https://issues.apache.org/jira/browse/HADOOP-2175
    Project: Hadoop Core
    Issue Type: Bug
    Components: mapred
    Reporter: Runping Qi
    Assignee: Amar Kamat
    Attachments: HADOOP-2175-v1.1.patch, HADOOP-2175-v1.patch


  • Devaraj Das (JIRA) at Mar 22, 2008 at 10:37 am
    [ https://issues.apache.org/jira/browse/HADOOP-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12581260#action_12581260 ]

    Devaraj Das commented on HADOOP-2175:
    -------------------------------------

    I am not clear why you have the check in JobInProgress for doing lostTaskTracker outside addTrackerTaskFailure. You could do the check inside the method, right?
    Also, inside lostTaskTracker you check whether the task was already FAILED/KILLED. Do you need to do the check for KILLED?
    On the change to MiniMRCluster, I am not convinced that this is the right thing to do (waiting for 10 seconds and then giving up).
    On TestLostBlackListedTracker, I don't think you need to make it that complicated. A simple dummy-split-based map should work. In that case you don't have to change TestRackAwareTaskPlacement. The way you get events is also not very reliable w.r.t. timing. In the first call to getTaskCompletionEvents, you might get events.length = 0. Isn't this a problem? I'd say that you should wait for the job to complete and then get the events.
  • Amar Kamat (JIRA) at Mar 22, 2008 at 12:35 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12581264#action_12581264 ]

    Amar Kamat commented on HADOOP-2175:
    ------------------------------------

    {quote}
    I am not clear why you have the check in JobInProgress for doing lostTaskTracker outside the addTrackerTaskFailure.
    {quote}
    +1
    bq. Also, inside lostTaskTracker you check for whether the task was already FAILED/KILLED
    I did that because a TIP that failed or was killed before the TT got lost should stay failed/killed. There is no need to reschedule it or change its status. Since the task was not killed because of the lost TT, I ignored it.
    bq. On the change to MiniMRCluster ....
    I think there is a problem with MiniMRCluster w.r.t. lost TTs. It keeps waiting for the TT to become idle, and eventually the test times out. I am still trying to find out why the MiniMR gets stuck.
    bq. On the TestLostBlackListedTracker, i don't think you need to make it that complicated
    +1.
    {quote}
    The way you get events is also not very reliable w.r.t timing. In the first call to getTaskCompletionEvents, you might get events.length = 0
    {quote}
    I use launchJob, which waits for the job to complete. It's a blocking call.
  • Amar Kamat (JIRA) at Mar 25, 2008 at 6:57 am
    [ https://issues.apache.org/jira/browse/HADOOP-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12581796#action_12581796 ]

    Amar Kamat commented on HADOOP-2175:
    ------------------------------------

    This patch incorporates Devaraj's comments. The changes are as follows:
    - The call to lostTaskTracker now expects a reason/info for why/how the tracker is lost.
    - Losing a blacklisted tracker happens in {{addTrackerFailure}}.
    - Before losing a tracker, a check is made that the tracker exists in the JT, and the status is updated so that the TT gets reinitialized in the next heartbeat cycle.
    - The test case no longer depends on timing.
    - The only change in {{MiniMRCluster}} concerns the default conf for the JT/TT.

    Blacklisted hosts may not be able to serve map outputs
    ------------------------------------------------------

    Key: HADOOP-2175
    URL: https://issues.apache.org/jira/browse/HADOOP-2175
    Project: Hadoop Core
    Issue Type: Bug
    Components: mapred
    Reporter: Runping Qi
    Assignee: Amar Kamat
    Attachments: HADOOP-2175-v1.1.patch, HADOOP-2175-v1.patch, HADOOP-2175-v2.patch


  • Hadoop QA (JIRA) at Mar 25, 2008 at 8:11 am
    [ https://issues.apache.org/jira/browse/HADOOP-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12581821#action_12581821 ]

    Hadoop QA commented on HADOOP-2175:
    -----------------------------------

    +1 overall. Here are the results of testing the latest attachment
    http://issues.apache.org/jira/secure/attachment/12378538/HADOOP-2175-v2.patch
    against trunk revision 619744.

    @author +1. The patch does not contain any @author tags.

    tests included +1. The patch appears to include 7 new or modified tests.

    javadoc +1. The javadoc tool did not generate any warning messages.

    javac +1. The applied patch does not generate any new javac compiler warnings.

    release audit +1. The applied patch does not generate any new release audit warnings.

    findbugs +1. The patch does not introduce any new Findbugs warnings.

    core tests +1. The patch passed core unit tests.

    contrib tests +1. The patch passed contrib unit tests.

    Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2043/testReport/
    Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2043/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
    Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2043/artifact/trunk/build/test/checkstyle-errors.html
    Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2043/console

    This message is automatically generated.
  • Amar Kamat (JIRA) at Mar 25, 2008 at 9:37 am
    [ https://issues.apache.org/jira/browse/HADOOP-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12581851#action_12581851 ]

    Amar Kamat commented on HADOOP-2175:
    ------------------------------------

    The log messages in the earlier patch were different from the ones in the trunk. This patch changes the log messages (the rest of the patch remains the same).
    Blacklisted hosts may not be able to serve map outputs
    ------------------------------------------------------

    Key: HADOOP-2175
    URL: https://issues.apache.org/jira/browse/HADOOP-2175
    Project: Hadoop Core
    Issue Type: Bug
    Components: mapred
    Reporter: Runping Qi
    Assignee: Amar Kamat
    Attachments: HADOOP-2175-v1.1.patch, HADOOP-2175-v1.patch, HADOOP-2175-v2.patch, HADOOP-2175-v2.patch


  • Hadoop QA (JIRA) at Mar 25, 2008 at 3:59 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12581990#action_12581990 ]

    Hadoop QA commented on HADOOP-2175:
    -----------------------------------

    +1 overall. Here are the results of testing the latest attachment
    http://issues.apache.org/jira/secure/attachment/12378549/HADOOP-2175-v2.patch
    against trunk revision 619744.

    @author +1. The patch does not contain any @author tags.

    tests included +1. The patch appears to include 7 new or modified tests.

    javadoc +1. The javadoc tool did not generate any warning messages.

    javac +1. The applied patch does not generate any new javac compiler warnings.

    release audit +1. The applied patch does not generate any new release audit warnings.

    findbugs +1. The patch does not introduce any new Findbugs warnings.

    core tests +1. The patch passed core unit tests.

    contrib tests +1. The patch passed contrib unit tests.

    Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2047/testReport/
    Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2047/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
    Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2047/artifact/trunk/build/test/checkstyle-errors.html
    Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2047/console

    This message is automatically generated.
    Blacklisted hosts may not be able to serve map outputs
    ------------------------------------------------------

    Key: HADOOP-2175
    URL: https://issues.apache.org/jira/browse/HADOOP-2175
    Project: Hadoop Core
    Issue Type: Bug
    Components: mapred
    Reporter: Runping Qi
    Assignee: Amar Kamat
    Fix For: 0.17.0

    Attachments: HADOOP-2175-v1.1.patch, HADOOP-2175-v1.patch, HADOOP-2175-v2.patch, HADOOP-2175-v2.patch


  • Devaraj Das (JIRA) at Mar 25, 2008 at 8:03 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582047#action_12582047 ]

    Devaraj Das commented on HADOOP-2175:
    -------------------------------------

    Come to think of it, lostTaskTracker may not be the best way to go, since it can potentially affect tasks from multiple jobs. We probably need to make lostTaskTracker take a JobInProgress argument and do failedTask for tasks of that job only. The other option is to implement APIs that give the TIPs and taskIds corresponding to a job and tasktracker combination, and then invoke failedTask in the JobInProgress for each TIP/taskId. The second approach seems cleaner in general, but for the first approach most of the necessary infrastructure is already there. Thoughts?
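
The second option above (look up the tasks for a job and tasktracker combination, then fail each one) can be sketched roughly as follows. This is only an illustration under stated assumptions: `TrackerTaskIndex`, `record`, and `tasksFor` are hypothetical names, task and job ids are plain strings, and nothing here is taken from the actual JobTracker or JobInProgress code.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical index from (tracker, job) to the task ids that ran there. */
final class TrackerTaskIndex {
    // tracker name -> job id -> task ids that ran on that tracker for that job
    private final Map<String, Map<String, Set<String>>> tasksByTrackerAndJob = new HashMap<>();

    /** Remember that a task of the given job ran on the given tracker. */
    void record(String tracker, String jobId, String taskId) {
        tasksByTrackerAndJob
            .computeIfAbsent(tracker, t -> new HashMap<>())
            .computeIfAbsent(jobId, j -> new LinkedHashSet<>())
            .add(taskId);
    }

    /** Task ids for one job on one tracker; empty if nothing was recorded. */
    Set<String> tasksFor(String tracker, String jobId) {
        return tasksByTrackerAndJob
            .getOrDefault(tracker, Collections.emptyMap())
            .getOrDefault(jobId, Collections.emptySet());
    }

    public static void main(String[] args) {
        TrackerTaskIndex index = new TrackerTaskIndex();
        index.record("tt1", "jobA", "m1");
        index.record("tt1", "jobB", "m2");
        // Only jobA's tasks on tt1 would be handed to that job's failedTask logic.
        for (String taskId : index.tasksFor("tt1", "jobA")) {
            System.out.println("would fail " + taskId);
        }
    }
}
```

With a lookup of this shape, tasks of other jobs that ran on the same tracker are left untouched, which is the concern raised about a tracker-wide lostTaskTracker call.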
  • Amar Kamat (JIRA) at Mar 26, 2008 at 2:53 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582301#action_12582301 ]

    Amar Kamat commented on HADOOP-2175:
    ------------------------------------

    After some offline discussions with some folks here, this is what seems reasonable: kill maps on a per-map basis, and tweak the "too many fetch failures" logic, which currently depends on notifications from all running reducers, so that just *one notification* is enough if the tracker in question has been blacklisted. That way we will not be too aggressive (we don't kill too many maps in one go), but we will be harsh with the map corresponding to the failed fetch. Thoughts?
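
The rule proposed above can be sketched as a small decision function. This is a minimal sketch, not the patch itself: the class and method names are hypothetical, and `REDUCER_FRACTION` is an assumed stand-in for however the existing "too many fetch failures" logic derives its threshold from the running reducers.

```java
/** Hypothetical sketch of the proposed fetch-failure rule; not Hadoop code. */
final class FetchFailurePolicy {
    // Assumed fraction of running reducers that must complain before a map
    // on a non-blacklisted tracker is re-executed (illustrative value only).
    static final double REDUCER_FRACTION = 0.5;

    /** Number of fetch-failure notifications required before killing the map. */
    static int notificationsNeeded(boolean trackerBlacklisted, int runningReducers) {
        if (trackerBlacklisted) {
            return 1; // the proposal: one notification is enough
        }
        return Math.max(1, (int) Math.ceil(REDUCER_FRACTION * runningReducers));
    }

    /** True if the map whose output could not be fetched should be re-executed. */
    static boolean shouldReexecuteMap(int notifications, boolean trackerBlacklisted,
                                      int runningReducers) {
        return notifications >= notificationsNeeded(trackerBlacklisted, runningReducers);
    }

    public static void main(String[] args) {
        System.out.println(shouldReexecuteMap(1, true, 10));
        System.out.println(shouldReexecuteMap(1, false, 10));
    }
}
```

Only the map named in the notification is affected, so other maps on the blacklisted tracker stay in place until a reducer fails to fetch them.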
  • Runping Qi (JIRA) at Mar 26, 2008 at 2:57 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582307#action_12582307 ]

    Runping Qi commented on HADOOP-2175:
    ------------------------------------

    +1

  • Amar Kamat (JIRA) at Mar 26, 2008 at 3:53 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582325#action_12582325 ]

    Amar Kamat commented on HADOOP-2175:
    ------------------------------------

    The only concern is when all the maps that are yet to be fetched are from the same blacklisted tracker. The reason is that each reducer will fetch only one map per host at a time. Hence killing all the maps will take
    {{5min * num-maps-on-tracker/num-reducers}} in the best case and {{5min * num-maps-on-tracker}} in the worst case, assuming the default config.
    Following are some possible tweaks:
    1) Keep track of the total failures registered against the tracker (per job) and kill all the maps for a job if the total failures for that job exceed 25%.
    2) Keep a set of unique hosts per job that have registered failures against a blacklisted tracker and kill all the maps for a job if all the reducers have complained against the blacklisted tracker.
    Currently we do something similar for killing a map based on fetch failures. We should do the same for trackers, i.e. re-schedule all the maps (per job, maybe) in the case of blacklisted trackers. In future we may relax the condition that the tracker be blacklisted. Thoughts?
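
The two tweaks listed above could be tracked per job and per blacklisted tracker roughly like this. It is a sketch under assumptions: the names are hypothetical, complaints are keyed by a reducer id string rather than by host, and the 25% is taken relative to the number of the job's maps on that tracker, which the comment does not specify.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical per-job bookkeeping for one blacklisted tracker; not Hadoop code. */
final class BlacklistedTrackerComplaints {
    // Assumption: the 25% threshold is relative to the job's maps on this tracker.
    static final double FAILURE_FRACTION = 0.25;

    private final int mapsOnTracker;
    private final int runningReducers;
    private int totalFailures;
    private final Set<String> complainingReducers = new HashSet<>();

    BlacklistedTrackerComplaints(int mapsOnTracker, int runningReducers) {
        this.mapsOnTracker = mapsOnTracker;
        this.runningReducers = runningReducers;
    }

    /** Register one fetch-failure notification from the given reducer. */
    void recordFetchFailure(String reducerId) {
        totalFailures++;
        complainingReducers.add(reducerId);
    }

    /** Tweak 1: total failures exceed the assumed 25% threshold. */
    boolean exceedsFailureFraction() {
        return mapsOnTracker > 0 && totalFailures > FAILURE_FRACTION * mapsOnTracker;
    }

    /** Tweak 2: every running reducer has complained at least once. */
    boolean allReducersComplained() {
        return runningReducers > 0 && complainingReducers.size() >= runningReducers;
    }

    public static void main(String[] args) {
        BlacklistedTrackerComplaints c = new BlacklistedTrackerComplaints(8, 3);
        c.recordFetchFailure("r1");
        System.out.println(c.exceedsFailureFraction() + " " + c.allReducersComplained());
    }
}
```

Either predicate returning true would be the trigger for re-scheduling all of the job's maps from that tracker at once instead of one map per notification.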
  • Runping Qi (JIRA) at Mar 26, 2008 at 4:13 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582330#action_12582330 ]

    Runping Qi commented on HADOOP-2175:
    ------------------------------------

    bq. will take 5min * num-maps-on-tracker/num-reducers in the best case and 5min * num-maps-on-tracker in the worst case assuming default config.

    Why will it take so long to fetch the map output?
    Why fetch only one map per host?



  • Amar Kamat (JIRA) at Mar 26, 2008 at 4:51 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582341#action_12582341 ]

    Amar Kamat commented on HADOOP-2175:
    ------------------------------------

    In the default case it will take at least 5 minutes to send a notification. The reducer code is such that only one map task per host is tried at a time. It will try a new map task only after the earlier one succeeds or fails.
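
The best-case and worst-case figures quoted in this thread follow from those two facts, and the arithmetic can be written out as below. This is a sketch only: the names are hypothetical, the 5-minute delay is the default mentioned above, and rounding the best case up to whole rounds is an assumption added here.

```java
import java.time.Duration;

/** Hypothetical estimate of the time needed to kill every map on one tracker. */
final class KillTimeEstimate {
    // Per the thread: at least 5 minutes pass before a notification is sent.
    static final Duration NOTIFICATION_DELAY = Duration.ofMinutes(5);

    /** Best case: each reducer reports a different map in every 5-minute round. */
    static Duration bestCase(int mapsOnTracker, int reducers) {
        if (reducers <= 0) {
            throw new IllegalArgumentException("reducers must be positive");
        }
        // Assumption: a partial round still costs a full notification delay.
        int rounds = (mapsOnTracker + reducers - 1) / reducers;
        return NOTIFICATION_DELAY.multipliedBy(rounds);
    }

    /** Worst case: all reducers report the same map, so one map dies per round. */
    static Duration worstCase(int mapsOnTracker) {
        return NOTIFICATION_DELAY.multipliedBy(mapsOnTracker);
    }

    public static void main(String[] args) {
        System.out.println(bestCase(20, 4).toMinutes() + " min best case");
        System.out.println(worstCase(20).toMinutes() + " min worst case");
    }
}
```

For example, 20 maps on the tracker and 4 reducers give 5 rounds (25 minutes) in the best case and 20 rounds (100 minutes) in the worst case.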
  • Sameer Paranjpye (JIRA) at Mar 26, 2008 at 6:05 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582376#action_12582376 ]

    Sameer Paranjpye commented on HADOOP-2175:
    ------------------------------------------

    Let's not confuse lost and blacklisted tasktrackers. A lost tasktracker is one that doesn't check in with the JT; a tasktracker blacklisted for a job is one that causes tasks to fail for that job. The two need to be handled very differently.

    We should move this issue to 0.18. We don't have a coherent model for task failures. The blacklisting logic is already messy, and adding half a dozen if statements will only make it messier.

    Blacklisted hosts may not be able to serve map outputs
    ------------------------------------------------------

    Key: HADOOP-2175
    URL: https://issues.apache.org/jira/browse/HADOOP-2175
    Project: Hadoop Core
    Issue Type: Bug
    Components: mapred
    Reporter: Runping Qi
    Assignee: Amar Kamat
    Fix For: 0.18.0

    Attachments: HADOOP-2175-v1.1.patch, HADOOP-2175-v1.patch, HADOOP-2175-v2.patch, HADOOP-2175-v2.patch


  • Devaraj Das (JIRA) at Mar 26, 2008 at 9:01 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582425#action_12582425 ]

    Devaraj Das commented on HADOOP-2175:
    -------------------------------------

    I agree with Sameer. We should probably step back and look at the model of killing a map based on fetch failure notifications. Today, we kill maps on a per-map basis: we wait for a majority of the reducers to tell the JobTracker that fetches are failing for a particular map.
    With the random ordering of map output fetches and the backoff per failed fetch, this might take a long time per map. This is what you observed, Runping, IMO.
    Instead, we should probably include the name of the tracker on which the map ran in the logic for killing a map. If we get too many fetch failure notifications for maps that ran on a particular tracker, which we will detect much faster, we should probably kill the maps on that tracker for which we are seeing fetch failure notifications. That will take care of the case where only Jetty is faulty (the tracker is not blacklisted, as it could, and probably still can, execute tasks).
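
    A minimal sketch of the per-tracker accounting proposed above might look like the following. It is not the JobTracker's actual code: the class name, the method name, and the threshold value are all hypothetical, and the real logic would live in Java inside the JobTracker. It only shows the idea of counting notifications per tracker rather than per map, and re-executing just the maps on that tracker that were actually reported.

```python
from collections import defaultdict


class TrackerFetchFailures:
    """Hypothetical per-tracker aggregation of fetch-failure notifications."""

    def __init__(self, tracker_threshold):
        # Assumed knob: total notifications for one tracker before acting.
        self.tracker_threshold = tracker_threshold
        # tracker name -> map id -> number of notifications received
        self._failed_maps = defaultdict(lambda: defaultdict(int))

    def report(self, tracker, map_id):
        """Record one notification; return the map ids to re-execute, if any."""
        per_map = self._failed_maps[tracker]
        per_map[map_id] += 1
        if sum(per_map.values()) >= self.tracker_threshold:
            # Kill only maps on this tracker that reducers complained about.
            to_kill = sorted(per_map)
            del self._failed_maps[tracker]
            return to_kill
        return []


failures = TrackerFetchFailures(tracker_threshold=3)
failures.report("tracker_a", "map_1")
failures.report("tracker_a", "map_2")
print(failures.report("tracker_a", "map_1"))
```

    Because notifications for different maps on the same tracker are pooled, the threshold is reached after three reports in this example, without waiting for a majority of reducers to report each map individually.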
