[ https://issues.apache.org/jira/browse/HADOOP-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HADOOP-2393:
--------------------------------------------

Fix Version/s: 0.18.0
Status: Patch Available (was: Open)
TaskTracker locks up removing job files within a synchronized method
---------------------------------------------------------------------

Key: HADOOP-2393
URL: https://issues.apache.org/jira/browse/HADOOP-2393
Project: Hadoop Core
Issue Type: Bug
Components: mapred
Affects Versions: 0.14.4
Environment: 0.13.1, quad-core x86-64, FC Linux, -Xmx2048
ipc.client.timeout = 10000
Reporter: Joydeep Sen Sarma
Assignee: Amareshwari Sriramadasu
Priority: Critical
Fix For: 0.18.0

Attachments: patch-2393.txt


We have some bad jobs where the reduces stall (for unknown reasons), and the TaskTracker kills those processes from time to time.
Every time one of these events happens, other (healthy) map tasks on the same node are also killed. Looking at the logs and the code up to 0.14.3, it seems the child tasks' pings to the TaskTracker time out and the child tasks self-terminate.
tasktracker log:
// notice the good 10+ second gap in logs on otherwise busy node:
2007-12-10 09:26:53,047 INFO org.apache.hadoop.mapred.TaskRunner: task_0120_r_000001_47 done; removing files.
2007-12-10 09:27:26,878 INFO org.apache.hadoop.mapred.TaskRunner: task_0120_m_000618_0 done; removing files.
2007-12-10 09:27:26,883 INFO org.apache.hadoop.ipc.Server: Process Thread Dump: Discarding call ping(task_0149_m_000007_0) from 10.16.158.113:43941
24 active threads
... huge stack trace dump in logfile ...
Something was going on at this time that caused the TaskTracker to essentially stall, so all the pings were discarded. After the stack trace dump:
2007-12-10 09:27:26,883 WARN org.apache.hadoop.ipc.Server: IPC Server handler 0 on 50050, call ping(task_0149_m_000007_0) from 10.16.158.113:43941:\
discarded for being too old (21380)
2007-12-10 09:27:26,883 WARN org.apache.hadoop.ipc.Server: IPC Server handler 1 on 50050, call ping(task_0149_m_000002_1) from 10.16.158.113:44183:\
discarded for being too old (21380)
2007-12-10 09:27:26,883 WARN org.apache.hadoop.ipc.Server: IPC Server handler 0 on 50050, call ping(task_0149_m_000007_0) from 10.16.158.113:43941:\
discarded for being too old (10367)
2007-12-10 09:27:26,883 WARN org.apache.hadoop.ipc.Server: IPC Server handler 1 on 50050, call ping(task_0149_m_000002_1) from 10.16.158.113:44183:\
discarded for being too old (10360)
2007-12-10 09:27:26,982 WARN org.apache.hadoop.mapred.TaskRunner: task_0149_m_000002_1 Child Error
Looking at the code, repeated failure of the child's pings causes termination:

    else {
      // send ping
      taskFound = umbilical.ping(taskId);
    }
    ...
    catch (Throwable t) {
      LOG.info("Communication exception: " + StringUtils.stringifyException(t));
      remainingRetries -= 1;
      if (remainingRetries == 0) {
        ReflectionUtils.logThreadInfo(LOG, "Communication exception", 0);
        LOG.warn("Last retry, killing " + taskId);
        System.exit(65);
      }
    }
The exit code is 65, matching what the TaskTracker reports.
I don't see an option to turn off the stack trace dump (which could be a likely cause), and I would hate to bump up the timeout because of this.
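The self-termination path quoted above can be sketched as a small standalone loop. This is a hedged reconstruction, not the actual Hadoop source: the Umbilical interface, runPings, and the retry accounting here are illustrative stand-ins. It shows why a run of consecutive ping timeouts ends in exit code 65:

```java
// Hedged sketch of the child task's ping loop (illustrative names; not the
// real org.apache.hadoop.mapred code).
public class PingLoop {
    public interface Umbilical {
        // true = tracker still knows the task; false = task is gone;
        // throws when the ping call times out or otherwise fails
        boolean ping(String taskId) throws Exception;
    }

    // Returns the child's exit code: 0 when the tracker reports the task gone,
    // 65 after maxRetries consecutive failed pings.
    public static int runPings(Umbilical umbilical, String taskId, int maxRetries) {
        int remainingRetries = maxRetries;
        while (true) {
            try {
                boolean taskFound = umbilical.ping(taskId);
                if (!taskFound) {
                    return 0;                  // clean exit: task no longer tracked
                }
                remainingRetries = maxRetries; // a successful ping resets the budget
            } catch (Exception t) {
                remainingRetries -= 1;
                if (remainingRetries == 0) {
                    return 65;                 // last retry: self-terminate
                }
            }
        }
    }
}
```

With ipc.client.timeout = 10000 as in this report, a TaskTracker stall lasting roughly the timeout times the retry budget is enough to kill an otherwise healthy child.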
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


  • Amareshwari Sriramadasu (JIRA) at Jun 2, 2008 at 10:12 am
    [ https://issues.apache.org/jira/browse/HADOOP-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Amareshwari Sriramadasu updated HADOOP-2393:
    --------------------------------------------

    Attachment: patch-2393.txt

Here is a patch adding a daemon thread to the TaskTracker that deletes files queued up for removal. The TaskTracker hands all file deletions to this new thread, which should remove the I/O operations from purgeJob().
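The idea can be sketched like this; the class and method names below are illustrative, not taken from the patch. purgeJob() would only enqueue a path (cheap and safe inside the synchronized method), and a daemon thread performs the slow recursive disk I/O on its own time, outside any TaskTracker lock:

```java
import java.io.File;
import java.util.concurrent.LinkedBlockingQueue;

// Minimal sketch of an asynchronous directory-cleanup queue (assumed names,
// not the actual HADOOP-2393 patch).
public class CleanupQueue {
    private final LinkedBlockingQueue<File> queue = new LinkedBlockingQueue<>();

    public CleanupQueue() {
        Thread cleaner = new Thread(() -> {
            try {
                while (true) {
                    File path = queue.take(); // blocks until work arrives
                    delete(path);             // slow I/O happens here, unsynchronized
                }
            } catch (InterruptedException e) {
                // shutdown: fall through and let the thread exit
            }
        }, "directoryCleanup");
        cleaner.setDaemon(true);              // don't keep the JVM alive
        cleaner.start();
    }

    // Called from the synchronized purgeJob(): O(1), never touches the disk.
    public void addToQueue(File path) {
        queue.add(path);
    }

    // Recursively deletes a file or directory tree.
    private static void delete(File path) {
        File[] children = path.listFiles();
        if (children != null) {
            for (File c : children) {
                delete(c);
            }
        }
        path.delete();
    }
}
```

The synchronized caller now only pays the cost of a queue insert, so handler threads serving ping() are no longer starved while a large job directory is removed.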
  • Amareshwari Sriramadasu (JIRA) at Jun 3, 2008 at 9:15 am
    [ https://issues.apache.org/jira/browse/HADOOP-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Amareshwari Sriramadasu updated HADOOP-2393:
    --------------------------------------------

    Status: Open (was: Patch Available)
  • Amareshwari Sriramadasu (JIRA) at Jun 3, 2008 at 9:15 am
    [ https://issues.apache.org/jira/browse/HADOOP-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Amareshwari Sriramadasu updated HADOOP-2393:
    --------------------------------------------

    Status: Patch Available (was: Open)

    Resubmitting for Hudson.
  • Devaraj Das (JIRA) at Jun 3, 2008 at 2:45 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Devaraj Das updated HADOOP-2393:
    --------------------------------

    Status: Open (was: Patch Available)

    Sorry, the TaskTracker should do a join on the cleanup thread in order to give it a chance to complete cleanup of the paths the TaskTracker handed it. Also, the LOG message has "directory" misspelled.
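The requested shutdown ordering can be sketched as follows; this is a hedged illustration, and the thread and method names are not from the patch. Shutdown interrupts the cleanup thread to unblock it from the queue, then joins with a bounded wait so queued deletions get a chance to finish before the process exits:

```java
// Sketch of join-on-shutdown for a cleanup thread (illustrative names).
public class JoinOnShutdown {
    // Interrupts the cleanup thread and waits up to timeoutMs for it to finish.
    // Returns true if the thread exited within the timeout.
    public static boolean shutdownAndJoin(Thread cleanupThread, long timeoutMs)
            throws InterruptedException {
        cleanupThread.interrupt();        // wake it if parked on queue.take()
        cleanupThread.join(timeoutMs);    // give in-flight cleanup a chance to finish
        return !cleanupThread.isAlive();
    }
}
```

Bounding the join keeps a wedged disk from hanging shutdown forever, while the common case still drains the queue cleanly.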
  • Amareshwari Sriramadasu (JIRA) at Jun 5, 2008 at 7:50 am
    [ https://issues.apache.org/jira/browse/HADOOP-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Amareshwari Sriramadasu updated HADOOP-2393:
    --------------------------------------------

    Status: Patch Available (was: Open)
  • Amareshwari Sriramadasu (JIRA) at Jun 5, 2008 at 7:50 am
    [ https://issues.apache.org/jira/browse/HADOOP-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Amareshwari Sriramadasu updated HADOOP-2393:
    --------------------------------------------

    Attachment: patch-2393.txt

    The patch adds a join for the task cleanup thread and the directory cleanup thread in TaskTracker.shutdown(), as suggested.
  • Amareshwari Sriramadasu (JIRA) at Jun 5, 2008 at 10:42 am
    [ https://issues.apache.org/jira/browse/HADOOP-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Amareshwari Sriramadasu updated HADOOP-2393:
    --------------------------------------------

    Status: Open (was: Patch Available)
  • Amareshwari Sriramadasu (JIRA) at Jun 5, 2008 at 11:18 am
    [ https://issues.apache.org/jira/browse/HADOOP-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Amareshwari Sriramadasu updated HADOOP-2393:
    --------------------------------------------

    Attachment: patch-2393.txt

    The boolean shuttingDown is made volatile.
    Tested the patch on a cluster.
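Why volatile matters here can be shown with a minimal flag. Only the field name shuttingDown comes from the comment above; the surrounding class is illustrative. The cleanup loop re-reads the flag on every iteration, and without volatile the JVM is allowed to keep using the false value it read first, so the loop might never observe the write made from the shutdown path:

```java
// Illustrative shutdown flag; only the field name comes from the JIRA comment.
public class ShutdownFlag {
    // volatile guarantees the cleanup thread sees the write promptly; a plain
    // boolean could legally be cached per-thread and the loop might never stop.
    private volatile boolean shuttingDown = false;

    public void requestShutdown() {
        shuttingDown = true;
    }

    // Drains queued work until it is empty or shutdown is requested;
    // returns how many items were processed (a stand-in for path deletions).
    public int drain(java.util.Queue<String> work) {
        int done = 0;
        while (!shuttingDown) {   // fresh read of the volatile flag each pass
            String path = work.poll();
            if (path == null) {
                break;
            }
            done++;
        }
        return done;
    }
}
```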
    TaskTracker locks up removing job files within a synchronized method
    ---------------------------------------------------------------------

    Key: HADOOP-2393
    URL: https://issues.apache.org/jira/browse/HADOOP-2393
    Project: Hadoop Core
    Issue Type: Bug
    Components: mapred
    Affects Versions: 0.14.4
    Environment: 0.13.1, quad-code x86-64, FC-linux. -xmx2048
    ipc.client.timeout = 10000
    Reporter: Joydeep Sen Sarma
    Assignee: Amareshwari Sriramadasu
    Priority: Critical
    Fix For: 0.18.0

    Attachments: patch-2393.txt, patch-2393.txt, patch-2393.txt


    we have some bad jobs where the reduces are getting stalled (for unknown reason). The task tracker kills these processes from time to time.
    Everytime one of these events happens - other (healthy) map tasks in the same node are also killed. Looking at the logs and code up to 0.14.3 - it seems like the child tasks pings to the task tracker are timed out and the child task self-terminates.
    tasktracker log:
    // notice the good 10+ second gap in logs on otherwise busy node:
    2007-12-10 09:26:53,047 INFO org.apache.hadoop.mapred.TaskRunner: task_0120_r_000001_47 done; removing files.
    2007-12-10 09:27:26,878 INFO org.apache.hadoop.mapred.TaskRunner: task_0120_m_000618_0 done; removing files.
    2007-12-10 09:27:26,883 INFO org.apache.hadoop.ipc.Server: Process Thread Dump: Discarding call ping(task_0149_m_000007_0) from 10.16.158.113:43941
    24 active threads
    ... huge stack trace dump in logfile ...
    something was going on at this time which caused to the tasktracker to essentially stall. all the pings are discarded. after stack trace dump:
    2007-12-10 09:27:26,883 WARN org.apache.hadoop.ipc.Server: IPC Server handler 0 on 50050, call ping(task_0149_m_000007_0) from 10.16.158.113:43941:\
    discarded for being too old (21380)
    2007-12-10 09:27:26,883 WARN org.apache.hadoop.ipc.Server: IPC Server handler 1 on 50050, call ping(task_0149_m_000002_1) from 10.16.158.113:44183:\
    discarded for being too old (21380)
    2007-12-10 09:27:26,883 WARN org.apache.hadoop.ipc.Server: IPC Server handler 0 on 50050, call ping(task_0149_m_000007_0) from 10.16.158.113:43941:\
    discarded for being too old (10367)
    2007-12-10 09:27:26,883 WARN org.apache.hadoop.ipc.Server: IPC Server handler 1 on 50050, call ping(task_0149_m_000002_1) from 10.16.158.113:44183:\
    discarded for being too old (10360)
    2007-12-10 09:27:26,982 WARN org.apache.hadoop.mapred.TaskRunner: task_0149_m_000002_1 Child Error
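    [Editor's note] The "discarded for being too old" lines above can be modeled as a handler waking up after a stall and dropping every queued call whose age already exceeds the client timeout. This is a minimal hypothetical sketch of that behavior, not Hadoop's actual ipc.Server code; the names StaleCallSketch, Call, and discardStale are illustrative.

    ```java
    import java.util.ArrayDeque;
    import java.util.Queue;

    /**
     * Sketch of why the WARN lines appear: while handler threads are stalled,
     * queued calls age past ipc.client.timeout and are dropped instead of
     * answered, so the client sees a failed ping even though the server recovers.
     */
    public class StaleCallSketch {
        static final long CLIENT_TIMEOUT_MS = 10_000; // mirrors ipc.client.timeout = 10000

        record Call(String name, long receivedAtMs) {}

        /** Drains and returns the calls a handler would discard after a stall lasting until 'now'. */
        static Queue<Call> discardStale(Queue<Call> pending, long now) {
            Queue<Call> dropped = new ArrayDeque<>();
            while (!pending.isEmpty()
                   && now - pending.peek().receivedAtMs() > CLIENT_TIMEOUT_MS) {
                dropped.add(pending.poll()); // too old: the client already gave up waiting
            }
            return dropped;
        }

        public static void main(String[] args) {
            Queue<Call> pending = new ArrayDeque<>();
            pending.add(new Call("ping(task_a)", 0));      // arrived at t=0
            pending.add(new Call("ping(task_b)", 15_000)); // arrived at t=15s
            // Handler wakes at t=21.38s after a long stall (cf. "too old (21380)"):
            Queue<Call> dropped = discardStale(pending, 21_380);
            System.out.println(dropped.size() + " discarded, " + pending.size() + " kept");
            // prints "1 discarded, 1 kept"
        }
    }
    ```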
    Looking at the code, failure of the client to ping causes termination:

        else {
          // send ping
          taskFound = umbilical.ping(taskId);
        }
        ...
        catch (Throwable t) {
          LOG.info("Communication exception: " + StringUtils.stringifyException(t));
          remainingRetries -= 1;
          if (remainingRetries == 0) {
            ReflectionUtils.logThreadInfo(LOG, "Communication exception", 0);
            LOG.warn("Last retry, killing " + taskId);
            System.exit(65);
          }
        }

    The exit code is 65, as reported by the task tracker.
    I don't see an option to turn off the stack trace dump (which could be a likely cause), and I would hate to bump up the timeout because of this.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.
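    [Editor's note] The retry-and-terminate behavior quoted in the report can be distilled into a small standalone loop. This is a hypothetical sketch, not the actual TaskTracker child code; the names Umbilical, PING_RETRIES, and pingOnce are illustrative.

    ```java
    /**
     * Distillation of the quoted child-task logic: the child pings its task
     * tracker periodically, and after a fixed number of consecutive
     * communication failures it kills itself with exit code 65.
     */
    public class PingLoopSketch {
        interface Umbilical {
            boolean ping(String taskId) throws Exception; // may throw on timeout
        }

        static final int PING_RETRIES = 3;

        /** Returns 0 to keep running, or 65 when the retry budget is exhausted. */
        static int pingOnce(Umbilical umbilical, String taskId, int[] remainingRetries) {
            try {
                umbilical.ping(taskId);
                remainingRetries[0] = PING_RETRIES; // a successful ping resets the budget
                return 0;
            } catch (Throwable t) {
                if (--remainingRetries[0] == 0) {
                    return 65; // "Last retry, killing <taskId>" in the quoted code
                }
                return 0;
            }
        }

        public static void main(String[] args) {
            int[] retries = {PING_RETRIES};
            Umbilical alwaysTimesOut = taskId -> { throw new Exception("ping timed out"); };
            int code = 0;
            for (int i = 0; i < PING_RETRIES && code == 0; i++) {
                code = pingOnce(alwaysTimesOut, "task_0149_m_000002_1", retries);
            }
            System.out.println("exit code: " + code); // prints "exit code: 65"
        }
    }
    ```

    This makes the failure mode concrete: a stalled tracker only has to discard a few consecutive pings for a perfectly healthy child to execute System.exit(65).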
  • Amareshwari Sriramadasu (JIRA) at Jun 5, 2008 at 11:20 am
    [ https://issues.apache.org/jira/browse/HADOOP-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Amareshwari Sriramadasu updated HADOOP-2393:
    --------------------------------------------

    Status: Patch Available (was: Open)
  • Amareshwari Sriramadasu (JIRA) at Jun 6, 2008 at 6:21 am
    [ https://issues.apache.org/jira/browse/HADOOP-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Amareshwari Sriramadasu updated HADOOP-2393:
    --------------------------------------------

    Status: Open (was: Patch Available)

    Trying Hudson again.
  • Amareshwari Sriramadasu (JIRA) at Jun 6, 2008 at 6:21 am
    [ https://issues.apache.org/jira/browse/HADOOP-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Amareshwari Sriramadasu updated HADOOP-2393:
    --------------------------------------------

    Status: Patch Available (was: Open)
  • Amareshwari Sriramadasu (JIRA) at Jun 6, 2008 at 6:21 am
    [ https://issues.apache.org/jira/browse/HADOOP-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Amareshwari Sriramadasu updated HADOOP-2393:
    --------------------------------------------

    Attachment: patch-2393.txt

    The class CleanupQueue is made private in TaskTracker.
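    [Editor's note] The fix's pattern, per the issue title, is to stop deleting job directories inside a synchronized TaskTracker method (which blocks the IPC handlers serving ping()) and instead hand paths to a background cleanup thread. This is a hypothetical sketch of that pattern, not the committed patch; the names CleanupQueueSketch and addToQueue are illustrative.

    ```java
    import java.io.File;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    /**
     * Asynchronous cleanup: callers enqueue a directory (cheap, O(1)) and a
     * daemon worker thread performs the slow recursive delete off-lock, so a
     * synchronized caller never holds its monitor across disk I/O.
     */
    public class CleanupQueueSketch {
        private final BlockingQueue<File> toDelete = new LinkedBlockingQueue<>();

        public CleanupQueueSketch() {
            Thread worker = new Thread(() -> {
                try {
                    while (true) {
                        File dir = toDelete.take(); // blocks until work arrives
                        deleteRecursively(dir);     // slow I/O happens off-lock
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }, "taskCleanup");
            worker.setDaemon(true);
            worker.start();
        }

        /** Called from the (synchronized) caller: only enqueues, never touches the disk. */
        public void addToQueue(File dir) {
            toDelete.add(dir);
        }

        private static void deleteRecursively(File f) {
            File[] children = f.listFiles();
            if (children != null) {
                for (File c : children) deleteRecursively(c);
            }
            f.delete();
        }

        public static void main(String[] args) throws Exception {
            File dir = java.nio.file.Files.createTempDirectory("cleanup-demo").toFile();
            new File(dir, "job.xml").createNewFile();
            CleanupQueueSketch q = new CleanupQueueSketch();
            q.addToQueue(dir);
            long deadline = System.currentTimeMillis() + 5000;
            while (dir.exists() && System.currentTimeMillis() < deadline) Thread.sleep(50);
            System.out.println("deleted: " + !dir.exists());
        }
    }
    ```

    With this shape, the synchronized section shrinks to a queue insertion, so handler threads keep answering ping() even while a large job directory is being removed.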
  • Devaraj Das (JIRA) at Jun 6, 2008 at 10:49 am
    [ https://issues.apache.org/jira/browse/HADOOP-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Devaraj Das updated HADOOP-2393:
    --------------------------------

    Resolution: Fixed
    Hadoop Flags: [Reviewed]
    Status: Resolved (was: Patch Available)

    I just committed this. Thanks, Amareshwari!

Discussion Overview
group: common-dev @ hadoop.apache.org
posted: Jun 2, '08 at 10:12a
active: Jun 6, '08 at 10:49a
posts: 14