[ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12576361#action_12576361 ]

Runping Qi commented on HADOOP-2119:
------------------------------------


Devaraj,

Chris D. has a simple test job that can reproduce this problem.
You may use that job to validate your patch.

JobTracker becomes non-responsive if the task trackers finish task too fast
---------------------------------------------------------------------------

Key: HADOOP-2119
URL: https://issues.apache.org/jira/browse/HADOOP-2119
Project: Hadoop Core
Issue Type: Bug
Components: mapred
Affects Versions: 0.16.0
Reporter: Runping Qi
Assignee: Amar Kamat
Priority: Critical
Fix For: 0.17.0

Attachments: hadoop-2119.patch, hadoop-jobtracker-thread-dump.txt


I ran a job with 0 reducers on a cluster with 390 nodes.
The mappers ran very fast.
The JobTracker lagged behind on committing completed mapper tasks.
The number of running mappers displayed on the web UI kept growing.
The JobTracker eventually stopped responding to the web UI.
No progress was reported afterwards.
The JobTracker was running on a separate node.
The JobTracker process consumed 100% CPU, with a VM size of 1.01 GB (it reached the heap space limit).
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


  • Amar Kamat (JIRA) at Mar 11, 2008 at 2:18 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12577454#action_12577454 ]

    Amar Kamat commented on HADOOP-2119:
    ------------------------------------

    With an approach similar to the one discussed above, plus some optimizations, we could process a large number of maps successfully. One of the optimizations is that batching (task commit) is now done in stages, i.e. *batch-size* TIPs from the queue get batch committed in one go.
    The job description is as follows:
    1) 250 nodes
    2) random-writer modified to do the following : map data goes to the local filesystem and reducers do nothing.
    3) num maps : 320,000
    4) num reducers : 450
    5) bytes per map : 7 MB
    6) total data : 2.5 TB
    7) batch commit size = 5000, i.e. only 5000 TIPs are committed at a time
    The map phase took approx. 40 min.
    The only remaining problem is reducer scheduling from the JT. The maps finish so fast that the map load is always low, and the reducers always start after the maps are done. Simple tricks such as increasing the number of _task completion events_, _jetty threads_ etc. might help but won't provide a scalable solution. So it seems that tweaking the load logic in the JT, i.e. {{getNewTaskForTaskTracker()}}, is the only way. We are currently trying lots of optimizations and will post a stable/final version of the approach along with a patch soon.
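The staged batching described in this comment can be pictured with a minimal sketch. It is not taken from the patch: the class name `StagedCommitSketch`, its methods, and the use of a plain object as a stand-in for the JobTracker lock are all assumptions made for illustration. The point it shows is that each pass holds the shared lock only long enough to commit a bounded number of TIPs.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Hypothetical sketch of committing finished TIPs in fixed-size stages. */
final class StagedCommitSketch {
    // Stand-in for the JobTracker lock (assumption for illustration).
    private final Object trackerLock = new Object();
    // Finished TIP ids waiting to be committed.
    private final Queue<String> pending = new ArrayDeque<>();
    private final List<String> committed = new ArrayList<>();
    private final int batchSize;

    StagedCommitSketch(int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
    }

    /** Called when a task reports completion; touches only the queue. */
    void enqueue(String tipId) {
        synchronized (pending) {
            pending.add(tipId);
        }
    }

    /** Commits at most batchSize TIPs under the tracker lock and returns how many were committed. */
    int commitOneBatch() {
        List<String> batch = new ArrayList<>();
        synchronized (pending) {
            while (batch.size() < batchSize && !pending.isEmpty()) {
                batch.add(pending.poll());
            }
        }
        synchronized (trackerLock) {
            committed.addAll(batch);
        }
        return batch.size();
    }

    int committedCount() {
        synchronized (trackerLock) {
            return committed.size();
        }
    }
}
```

With a batch size of 5000 and 12,000 queued TIPs, three passes would commit 5000, 5000 and 2000 TIPs, releasing the lock between passes so that heartbeats and RPCs can be served.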
  • Devaraj Das (JIRA) at Mar 11, 2008 at 4:50 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12577509#action_12577509 ]

    Devaraj Das commented on HADOOP-2119:
    -------------------------------------

    bq. The only problem is that of the reducer-scheduling from the JT. The maps finish so fast that the map load is always low and the reducers always start after the maps are done. Simple tricks of increasing the number of task completion events, jetty threads etc might help but wont provide a scalable solution. So it seems that tweaking the load logic in the JT i.e getNewTaskForTaskTracker() is the only way.

    The load logic seems to be there by design and is there even in the existing codebase. Since the maps are really small and they complete really fast (even before the scheduled tasktracker heartbeat interval), the tasktracker always reports with countMapTasks() = 0. Thus they always get a map task. Increasing the number of taskcompletion events or the Jetty threads will not help here since the reducers are not even launched. If we decide to tweak the load logic it should be done as a separate Jira IMO.
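The starvation effect described here can be illustrated with a deliberately simplified model. This is not the code of {{getNewTaskForTaskTracker()}}; the class `LoadLogicSketch`, the one-task-per-heartbeat rule, and all parameter names are assumptions chosen only to show why a tracker that always reports zero running maps keeps receiving maps until none are left.

```java
/** Simplified, hypothetical model of per-heartbeat task assignment (not the JobTracker code). */
final class LoadLogicSketch {
    enum Assignment { MAP, REDUCE, NONE }

    /** Hands out at most one task per heartbeat, preferring maps while the tracker has a free map slot. */
    static Assignment assign(int reportedMaps, int mapSlots, int pendingMaps,
                             int reportedReduces, int reduceSlots, int pendingReduces) {
        if (reportedMaps < mapSlots && pendingMaps > 0) {
            return Assignment.MAP;
        }
        if (reportedReduces < reduceSlots && pendingReduces > 0) {
            return Assignment.REDUCE;
        }
        return Assignment.NONE;
    }

    /**
     * Heartbeats needed before one tracker is given its first reduce when every
     * map finishes before the next heartbeat (so it always reports 0 running maps).
     * Returns -1 if no reduce is ever assigned.
     */
    static int heartbeatsUntilFirstReduce(int pendingMaps, int mapSlots,
                                          int reduceSlots, int pendingReduces) {
        int heartbeats = 0;
        while (true) {
            heartbeats++;
            Assignment a = assign(0, mapSlots, pendingMaps, 0, reduceSlots, pendingReduces);
            if (a == Assignment.MAP) {
                pendingMaps--;
            } else {
                return a == Assignment.REDUCE ? heartbeats : -1;
            }
        }
    }
}
```

In this model the first reduce is handed out only on the heartbeat after the last pending map has been assigned, which matches the observation that the reducers start after the maps are done.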
  • Amar Kamat (JIRA) at Mar 11, 2008 at 7:30 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12577569#action_12577569 ]

    Amar Kamat commented on HADOOP-2119:
    ------------------------------------

    True. My main concern there was speeding up the reducers, which as of now can be done by starting the reducers early and speeding up the shuffling process. Maps take ~1 hr while shuffling takes ~4 hrs in the context of the benchmarks reported. Hence the reducers are hit more by the slow shuffling. Fixing the load logic will require detailed analysis, whereas improving the shuffling *might* not (parametric tweaks might make it better). Hence the load logic should scale.
    bq. ... should be done as a separate Jira IMO
    +1
  • Hadoop QA (JIRA) at Mar 17, 2008 at 7:17 am
    [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12579335#action_12579335 ]

    Hadoop QA commented on HADOOP-2119:
    -----------------------------------

    -1 overall. Here are the results of testing the latest attachment
    http://issues.apache.org/jira/secure/attachment/12377817/HADOOP-2119-v4.1.patch
    against trunk revision 619744.

    @author +1. The patch does not contain any @author tags.

    tests included -1. The patch doesn't appear to include any new or modified tests.
    Please justify why no tests are needed for this patch.

    javadoc +1. The javadoc tool did not generate any warning messages.

    javac +1. The applied patch does not generate any new javac compiler warnings.

    release audit +1. The applied patch does not generate any new release audit warnings.

    findbugs +1. The patch does not introduce any new Findbugs warnings.

    core tests -1. The patch failed core unit tests.

    contrib tests -1. The patch failed contrib unit tests.

    Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1978/testReport/
    Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1978/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
    Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1978/artifact/trunk/build/test/checkstyle-errors.html
    Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1978/console

    This message is automatically generated.
  • Amar Kamat (JIRA) at Mar 17, 2008 at 8:11 am
    [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12579343#action_12579343 ]

    Amar Kamat commented on HADOOP-2119:
    ------------------------------------

    The attached patch does the following:
    Maps :
    1) Replaces {{ArrayList}} with {{LinkedList}} for the currently used caches (call them *NR* caches).
    2) Failed TIPs are added (where possible) at the front of the *NR* caches. [for fail-early]
    3) Removal of a TIP from the *NR* caches is on demand, i.e. running/non-runnable TIPs are removed while searching for a new TIP.
    4) Maintains a new set of caches called *R* caches for running TIPs. These caches are similar to the *NR* caches but provide faster removal. Additions to the caches are in the form of appends. Removal is one shot, i.e. a non-running TIP is removed at once from all the *R* caches. [for speculation]

    Reduces :
    1) Maintains a LinkedList of non-running reducers i.e *NR* cache. [for non-running tasks]
    2) Failed reducers are added to the front of *NR* cache. [for fail-early]
    3) Maintains a set of running reducers with faster removal capability. [for speculation]
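A minimal sketch of an *NR* cache as described in the two lists above: new TIPs are appended, failed TIPs go to the front, and stale entries are dropped lazily while searching. The class `NonRunningCacheSketch`, the nested `Tip` type and its flags are hypothetical names for illustration and are not the patch's data structures.

```java
import java.util.Iterator;
import java.util.LinkedList;

/** Hypothetical sketch of a non-running (NR) cache with fail-early ordering and on-demand removal. */
final class NonRunningCacheSketch {
    /** Stand-in for a TaskInProgress; only the state needed here. */
    static final class Tip {
        final String id;
        boolean running;
        boolean complete;

        Tip(String id) {
            this.id = id;
        }
    }

    private final LinkedList<Tip> cache = new LinkedList<>();

    /** New TIPs are appended. */
    void addNew(Tip tip) {
        cache.addLast(tip);
    }

    /** Failed TIPs are put at the front so they are retried first. */
    void addFailed(Tip tip) {
        tip.running = false;
        cache.addFirst(tip);
    }

    /** Returns the next runnable TIP, removing every entry visited on the way (stale or scheduled). */
    Tip findNext() {
        Iterator<Tip> it = cache.iterator();
        while (it.hasNext()) {
            Tip tip = it.next();
            it.remove(); // O(1) for a LinkedList iterator
            if (!tip.running && !tip.complete) {
                tip.running = true;
                return tip;
            }
        }
        return null;
    }

    int size() {
        return cache.size();
    }
}
```

Removing through the iterator is what makes {{LinkedList}} attractive here: deleting the current element is constant time, whereas removing from the middle of an {{ArrayList}} shifts the remaining elements.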
    ----
    Also,
    1) Search preference is as follows {{FAILED}}, {{NON-RUNNING}}, {{RUNNING}}
    2) Search order is as follows
    {noformat}
    1. Search local cache i.e strong locality
    2. Search bottom-up (i.e from the node's parent to the node's top level ancestor) for a TIP i.e weak locality.
    3. Search breadth wise across top-level ancestors for a TIP i.e for a non local TIP.
    {noformat}
    3) Introduces a _default-node_. TIPs that are not local to any node are local to the default node. This node takes care of random-writer-like cases, i.e. it adapts such cases to the cache structure. The _default-node_ belongs to the _default-rack_, and hence all the nodes share the non-local TIPs through the _default-rack_.
    4) The JobTracker need not be synchronized for providing reports to the JobClient, and hence these APIs don't lock the JT. Some staleness is okay.
    5) Commits are now in batches, but each batch takes a fixed number of tasks. The default is 5000, so 5000 tasks will be batch committed at a time. The reason for this 'fixed sized batching' is that committing too many TIPs in one go locks the JobTracker for a very long duration, causing *lost rpc/tracker* issues.
    6) TIPs use the tracker's hostname instead of the tracker name for maintaining the list of machines where the TIP failed.
    7) One major bottleneck we observed was in {{JobInProgress.isJobComplete()}}, where all the TIPs were scanned. {{isJobComplete()}} is called once for every completed/failed task (via the {{TaskCommit}} thread), which proves costly with a large number of maps. Now this check is done by using the counts of finished TIPs.
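The search order listed above (local cache, then bottom-up through the ancestors, then breadth-wise across the other top-level ancestors) can be sketched as follows. The class `LocalitySearchSketch`, its `Node` type and the per-node map are assumptions for illustration. As a simplification, each TIP is stored at one level only, whereas the patch's caches may hold a TIP at several levels.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the three-step locality search. */
final class LocalitySearchSketch {
    /** Stand-in for a topology node (host or rack) with a link to its parent. */
    static final class Node {
        final String name;
        final Node parent;

        Node(String name, Node parent) {
            this.name = name;
            this.parent = parent;
        }
    }

    private final Map<Node, LinkedList<String>> cache = new HashMap<>();
    private final List<Node> topLevel = new ArrayList<>();

    void addTopLevel(Node node) {
        topLevel.add(node);
    }

    void addTip(Node node, String tipId) {
        cache.computeIfAbsent(node, k -> new LinkedList<>()).add(tipId);
    }

    private String poll(Node node) {
        LinkedList<String> tips = cache.get(node);
        return (tips == null || tips.isEmpty()) ? null : tips.removeFirst();
    }

    /** Finds a TIP for the given host: strong locality, then weak locality, then non-local. */
    String find(Node host) {
        Node top = host;
        // Steps 1 and 2: the host itself, then each ancestor up to the top level.
        for (Node n = host; n != null; n = n.parent) {
            String tip = poll(n);
            if (tip != null) {
                return tip;
            }
            top = n;
        }
        // Step 3: breadth-wise across the other top-level ancestors.
        for (Node other : topLevel) {
            if (other == top) {
                continue;
            }
            String tip = poll(other);
            if (tip != null) {
                return tip;
            }
        }
        return null;
    }
}
```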
  • Amar Kamat (JIRA) at Mar 18, 2008 at 12:01 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12579813#action_12579813 ]

    Amar Kamat commented on HADOOP-2119:
    ------------------------------------

    Some comments on the attached patch:
    1) It uses the hostname, rather than the tracker name, to detect whether the TIP failed on a machine. This becomes an issue if there are two trackers on the same node, e.g. in the ant tests. This is the reason why some of the tests failed.
    2) The list of ancestors maintained at the JT can be incomplete, leading to stuck jobs. This can happen if the nodes have just the datanodes and no trackers.
    3) The {{isJobComplete}} logic is broken. It should also consider failed TIPs.
    ----
    Also, {{JobInProgress.isJobComplete()}} now depends on {{failedMapTIPs}} and {{failedReduceTIPs}}. The patch fixes the update to {{failedMapTIPs/failedReduceTIPs}} in {{failedTask}} since it was broken (in cases where a TIP has a speculative task).
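A counter-based completion check that also accounts for failed TIPs might look like the sketch below. Only {{failedMapTIPs}} and {{failedReduceTIPs}} are names from this thread; the class `CompletionCounterSketch`, the other fields and the update methods are assumptions for illustration. It also assumes that a failed TIP counts as finished for the purpose of this check; whether the job as a whole then succeeds or fails is outside the sketch.

```java
/** Hypothetical sketch: O(1) job-completion check from counters instead of scanning all TIPs. */
final class CompletionCounterSketch {
    private final int numMapTasks;
    private final int numReduceTasks;
    private int finishedMapTasks;
    private int finishedReduceTasks;
    private int failedMapTIPs;
    private int failedReduceTIPs;

    CompletionCounterSketch(int numMapTasks, int numReduceTasks) {
        this.numMapTasks = numMapTasks;
        this.numReduceTasks = numReduceTasks;
    }

    synchronized void mapFinished() {
        finishedMapTasks++;
    }

    synchronized void reduceFinished() {
        finishedReduceTasks++;
    }

    synchronized void mapTipFailed() {
        failedMapTIPs++;
    }

    synchronized void reduceTipFailed() {
        failedReduceTIPs++;
    }

    /** True once every TIP has either finished or failed; no per-TIP scan is needed. */
    synchronized boolean isJobComplete() {
        return finishedMapTasks + failedMapTIPs == numMapTasks
            && finishedReduceTasks + failedReduceTIPs == numReduceTasks;
    }
}
```

If the failed counters were left out, a job with one permanently failed TIP would never satisfy the check, which is the kind of breakage point 3 above refers to.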

  • Devaraj Das (JIRA) at Mar 19, 2008 at 6:52 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12580500#action_12580500 ]

    Devaraj Das commented on HADOOP-2119:
    -------------------------------------

    Some comments.
    1) Remove default-node --> use a separate list for non-local running/non-running maps. So instead of falling back to the array on a cache miss, you hit a list that you can update as well (remove items, add them to an equivalent list for running, etc.).
    2) Maintain a mapping from the level to the set of nodes in that level (except level 0). You should look at the TIPs at the topmost level cache (in case max cache level is 2, that will mean all racks) when you look for something to run on a cache miss.
    3) Change the JobInProgress code to reflect proper terminology like caches/lists etc.
    4) TIPs that don't have locations get added to a special list instead of the default-node cache (point 1).
    5) Change the signature of findNewCachedTask to take the level instead of a boolean. Also, I think it'd be better to call the method findTaskFromList, since it caters to both maps and reduces, and reduces really don't have a cache.
    6) getCurrentTime should be moved out to a place where it is called exactly once per findTask.
    7) I don't think it is that important to move tasks to the back of the list in case of speculative tasks.

  • Amar Kamat (JIRA) at Mar 20, 2008 at 4:40 am
    [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12580659#action_12580659 ]

    Amar Kamat commented on HADOOP-2119:
    ------------------------------------

    Submitting a patch after incorporating Devaraj's comments.
  • Owen O'Malley (JIRA) at Mar 20, 2008 at 5:48 am
    [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12580670#action_12580670 ]

    Owen O'Malley commented on HADOOP-2119:
    ---------------------------------------

    I really wish the removing of synchronization had been done in a different patch. It makes me very nervous...
  • Amar Kamat (JIRA) at Mar 20, 2008 at 6:03 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12580874#action_12580874 ]

    Amar Kamat commented on HADOOP-2119:
    ------------------------------------

    Some comments about the synchronization changes:
    1) The changes for synchronization are done to avoid locking the JobTracker wherever possible.
    2) At the JobTracker, the following APIs can be unsynchronized w.r.t. the JobTracker:
    {noformat}
    a) getMapTaskReports
    b) getReduceTaskReports
    c) getTaskDiagnostics
    d) getTaskCompletionEvents
    {noformat}
    3) *a*, *b* and *c* are the APIs for the JobClient, while *d* is for the reduce tasks.
    4) *a* and *b* basically lock the JobTracker (then the JobInProgress and then the TaskInProgress) so that they can get the correct value of {{completes}} (via {{isComplete()}}), while *d* locks for diagnostic information ({{taskDiagnosticData}}) (via {{taskDiagnosticData()}}, {{generateSingleReport()}} and {{addDiagnosticInfo()}}).
    5) I made {{completes}} an AtomicInteger. Updates to {{taskDiagnosticData}} are done only after synchronizing on {{taskDiagnosticData}}, i.e. the object itself.
    6) Also, the patch makes sure that data is always correct, although it might be stale. For example, when a task Task1 completes the TaskInProgress (via {{TaskInProgress.setSuccessfulTaskid(Task1)}}), there will never be a case where {{isComplete()}} is true and {{completes(Task1)}} is false.
    7) *d* actually need not lock the JobTracker. JobInProgress locking seems sufficient. Removing the JobTracker synchronization there does not change anything else.
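Point 5 above can be sketched as follows. Only {{completes}}, {{taskDiagnosticData}}, {{isComplete()}} and {{addDiagnosticInfo()}} are names from this comment; the class `TipStateSketch`, the map layout and the reader method are assumptions for illustration. Locking the map for reads as well as updates is a choice made in this sketch so that a plain {{HashMap}} stays safe; the comment itself only mentions updates.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of per-TIP state that can be read without holding an outer lock. */
final class TipStateSketch {
    // Lock-free counter: readers never block and never see a torn value.
    private final AtomicInteger completes = new AtomicInteger();
    // Guarded by synchronizing on the map object itself.
    private final Map<String, List<String>> taskDiagnosticData = new HashMap<>();

    void taskCompleted() {
        completes.incrementAndGet();
    }

    /** May be momentarily stale, but never inconsistent. */
    boolean isComplete() {
        return completes.get() > 0;
    }

    void addDiagnosticInfo(String taskId, String info) {
        synchronized (taskDiagnosticData) {
            taskDiagnosticData.computeIfAbsent(taskId, k -> new ArrayList<>()).add(info);
        }
    }

    /** Returns a copy so callers can iterate without holding the lock. */
    List<String> getDiagnosticInfo(String taskId) {
        synchronized (taskDiagnosticData) {
            List<String> data = taskDiagnosticData.get(taskId);
            return data == null ? new ArrayList<>() : new ArrayList<>(data);
        }
    }
}
```

The trade-off is the one raised later in this thread: each field is individually safe, but any invariant spanning several fields is no longer protected by a single lock and has to be argued separately.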
  • Amar Kamat (JIRA) at Mar 20, 2008 at 8:29 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12580912#action_12580912 ]

    Amar Kamat commented on HADOOP-2119:
    ------------------------------------

    Note that the staleness will be for a very short time. It will be visible at the client (JobClient/WebUI) side only for the current request.
  • Owen O'Malley (JIRA) at Mar 20, 2008 at 10:49 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12580958#action_12580958 ]

    Owen O'Malley commented on HADOOP-2119:
    ---------------------------------------

    I really wish that the synchronization changes could be done in another patch. Without a *very* careful design of the locking protocols, there are bound to be problems that will take us a long time to discover. The last time someone changed the synchronization it took a couple weeks before everyone could agree there weren't new race conditions.



  • Owen O'Malley (JIRA) at Mar 20, 2008 at 11:49 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12580969#action_12580969 ]

    Owen O'Malley commented on HADOOP-2119:
    ---------------------------------------

    In the JobTracker:
    * siblingSetAtLevel seems really arcane. I would propose that instead you add getChildren to the Node interface.
    * why is there yet another map from hostname to Node? This is already done in the node mapping.

    In the JobInProgress:
    * I'm really concerned that we are adding 5 new fields holding collections to the JobInProgress
    * reducers is a really bad name. I'd suggest runnableReduces or something.
    * nodesToMaps should be runnableMaps
    * Don't use assignment in a parameter to a method in initTasks
    * I'm bothered by all of the checks for null Nodes that just skip the location. I think it should be a warn in the job tracker logs so that admins can find the problem and should be the default node/ default rack.
    * Shouldn't we remove the node from the nodesToMaps regardless of the level? Since if it is running, it is also in the runningMaps list and we can speculate out of there.
    * nodesToMaps being null should be a fatal error
    * nodesToMaps should be a Map<Node,Set<Tip>> rather than a list of tips, so that we can remove things reasonably fast

    I'll look some more tonight.
  • Owen O'Malley (JIRA) at Mar 21, 2008 at 5:59 am
    [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12581006#action_12581006 ]

    Owen O'Malley commented on HADOOP-2119:
    ---------------------------------------

    Ah, we can't use Map<Node,Set<TIP>> because iteration would be unordered. Darn. *Smile*
  • Amar Kamat (JIRA) at Mar 21, 2008 at 6:27 am
    [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12581009#action_12581009 ]

    Amar Kamat commented on HADOOP-2119:
    ------------------------------------

    bq. siblingSetAtLevel seems really arcane.
    Node is also used elsewhere, so changing Node might prove risky. For now we thought we would keep a JT-level mapping and address this as a separate issue.
    bq. why is there yet another map from hostname to Node?
    There is no extra mapping in the JobTracker. The variable is just renamed. Earlier the map was from tracker-name to tracker-node. Now the datanodes are mapped as well.
    bq. I'm really concerned that we are adding 5 new fields holding collections to the JobInProgress
    This will be fine once we remove the array (maps/reduces).
    bq. I'm bothered by all of the checks for null Nodes ....
    If nodes are null (anywhere in the cache topology) then there won't be any cache. The cache (as per trunk) is created only if the configuration is correct. The only place a node can be null is when a tracker has just joined. In that case we iterate over all the parent nodes and schedule a task. I agree that there should be a sufficient amount of logging.
    bq. Shouldn't we remove the node from the nodesToMaps regardless of the level?
    We can't remove from the runnable-cache since it would be a costly operation. It's a list!
    bq. nodesToMaps being null should be a fatal error
    With the latest patch there will be a NullPointerException.
    bq. nodesToMaps should be a Map<Node,Set<Tip>> ...
    It can be a LinkedHashSet, where the iteration order is the order of insertion, which is exactly what we want. But then we would not be able to add failed tips at the front. We could maintain a separate cache for failed tips.
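
    The trade-off in the last point can be sketched in Java. This is only an illustration under stated assumptions, not code from the patch: the class RunnableTipCache and its method names are hypothetical, and plain Strings stand in for TIP objects. It assumes a LinkedHashSet for insertion-ordered iteration with cheap removal, and a separate deque for failed tips so that they are handed out first.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch: Strings stand in for TaskInProgress (TIP) objects.
class RunnableTipCache {
    // Insertion-ordered set: iteration follows insertion order and
    // removal by value does not require scanning a list.
    private final Set<String> pending = new LinkedHashSet<>();
    // Failed tips live in their own cache so they can be retried first,
    // which a LinkedHashSet alone cannot express (no "add at front").
    private final Deque<String> failed = new ArrayDeque<>();

    void add(String tip) {
        pending.add(tip);
    }

    void addFailed(String tip) {
        failed.addFirst(tip);
    }

    // Removes the tip from both caches; returns true if it was present in either.
    boolean remove(String tip) {
        boolean inFailed = failed.remove(tip);   // linear scan of the failed deque
        boolean inPending = pending.remove(tip); // hash-based removal
        return inFailed || inPending;
    }

    // Returns the next tip to schedule (failed tips first), or null if none is left.
    String poll() {
        String tip = failed.pollFirst();
        if (tip != null) {
            return tip;
        }
        Iterator<String> it = pending.iterator();
        if (!it.hasNext()) {
            return null;
        }
        tip = it.next();
        it.remove();
        return tip;
    }
}
```

    The cost of this design is a second collection per cache entry, which is relevant to the earlier concern about the number of collection fields in the JobInProgress.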
  • Amar Kamat (JIRA) at Mar 21, 2008 at 10:43 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12581206#action_12581206 ]

    Amar Kamat commented on HADOOP-2119:
    ------------------------------------

    I have taken Owen's comments into consideration. Here is what was done:
    bq. I really wish that the synchronization changes could be done in another patch ...
    +1. Removed all the synchronization changes. Will open another issue regarding the same.
    bq. siblingSetAtLevel seems really arcane. I would propose that instead you add getChildren to the ...
    Maintaining this information at the Node level might involve more complexity and will require more testing. A concept of children already exists in NodeBase, but looking at the code it is not very clear what they are for and how to use them. Now there is just a single set of nodes at {{maxlevel}} maintained at the JobTracker. For now this seems to be a simpler solution.
    bq. why is there yet another map from hostname to Node? This is already done in the node mapping.
    This is done to incur less penalty during job execution. While the job is running, the only penalty incurred is for resolving datanodes and newly joining trackers; resolution of trackers (before the job is submitted) is done as part of the heartbeat (a separate thread). Without this mapping there is no way to find the Node given the hostname. Also, I have renamed the variable _trackerNameToNodeMap_ that is in trunk, and I am using it to store the mapping for datanodes too.
    bq. I'm really concerned that we are adding 5 new fields holding collections to the JobInProgress
    As I said, this is required to do away with the array. Also, the total space is roughly bounded by the total number of TIPs: a TIP is either local or not, and either running or not running, and TIPs mostly just move from one list to another. Hence
    _local-maps-non-running + local-maps-running + non-local-maps-non-running + non-local-maps-running ~ total-map-tips_
    and
    _non-running-reduces + running-reduces ~ total-reduce-tips_.
    bq. reducers is a really bad name.
    Fixed.
    bq. nodesToMaps should be runnableMaps
    runnable means !failed && not-completed. Both running and non-running tips belong to the runnable category, so I have used a different name for this variable.
    bq. Don't use assignment in a parameter to a method in initTasks
    Fixed.
    bq. I'm bothered by all of the checks for null Nodes that just skip the location.
    Fixed. Now there are no null checks.
    bq. Shouldn't we remove the node from the nodesToMaps regardless of the level?
    Consider a case where _tip1_ fails on _host1_, and _host1_ belongs to _rack1_. Now _host1_ runs out of cached tips and queries _rack1_'s cache. In such a case it should not remove the tip, since some other tracker in the same rack can schedule it.
    bq. nodesToMaps being null should be a fatal error
    Fixed. In case of misconfiguration (i.e. nodesToMaps = null) the JobTracker will give a fatal error and shut down.
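
    The _tip1_/_host1_/_rack1_ case above can be sketched as follows. This is a simplified illustration, not the patch itself: the class LevelCacheSketch, its methods, and the use of Strings for nodes and tips are hypothetical, and it assumes just two cache levels (host, then rack). It also ignores stale copies of a tip that may remain cached at other levels.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: node names and tips are plain Strings.
class LevelCacheSketch {
    // Node name (a host or a rack) -> tips cached at that node.
    private final Map<String, List<String>> cache = new HashMap<>();
    // Tip -> hosts on which it has already failed.
    private final Map<String, Set<String>> failedOn = new HashMap<>();

    void cacheTip(String node, String tip) {
        cache.computeIfAbsent(node, k -> new ArrayList<>()).add(tip);
    }

    void recordFailure(String tip, String host) {
        failedOn.computeIfAbsent(tip, k -> new HashSet<>()).add(host);
    }

    // Looks in the host's own cache first, then in its rack's cache.
    // A tip that failed on this host is skipped but left in place,
    // so another tracker in the same rack can still pick it up.
    String findTask(String host, String rack) {
        for (String node : Arrays.asList(host, rack)) {
            List<String> tips = cache.getOrDefault(node, Collections.emptyList());
            Iterator<String> it = tips.iterator();
            while (it.hasNext()) {
                String tip = it.next();
                if (failedOn.getOrDefault(tip, Collections.emptySet()).contains(host)) {
                    continue; // not schedulable here; do not remove it
                }
                it.remove(); // scheduled, so drop it from this level's cache
                return tip;
            }
        }
        return null;
    }
}
```

    With _tip1_ cached at _rack1_ and recorded as failed on _host1_, a request from _host1_ finds nothing, while a later request from another host in _rack1_ still receives _tip1_.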
  • Amar Kamat (JIRA) at Mar 21, 2008 at 10:47 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12581209#action_12581209 ]

    Amar Kamat commented on HADOOP-2119:
    ------------------------------------

    Btw, in case of misconfiguration the JT will try to shut down, and if the shutdown throws an exception there is a {{System.exit(-1)}}. This will be flagged by findbugs, but afaik this is the only way.
  • Hadoop QA (JIRA) at Mar 23, 2008 at 9:51 am
    [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12581367#action_12581367 ]

    Hadoop QA commented on HADOOP-2119:
    -----------------------------------

    -1 overall. Here are the results of testing the latest attachment
    http://issues.apache.org/jira/secure/attachment/12378413/HADOOP-2119-v5.2.patch
    against trunk revision 619744.

    @author +1. The patch does not contain any @author tags.

    tests included -1. The patch doesn't appear to include any new or modified tests.
    Please justify why no tests are needed for this patch.

    javadoc +1. The javadoc tool did not generate any warning messages.

    javac +1. The applied patch does not generate any new javac compiler warnings.

    release audit +1. The applied patch does not generate any new release audit warnings.

    findbugs -1. The patch appears to introduce 1 new Findbugs warnings.

    core tests +1. The patch passed core unit tests.

    contrib tests +1. The patch passed contrib unit tests.

    Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2028/testReport/
    Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2028/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
    Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2028/artifact/trunk/build/test/checkstyle-errors.html
    Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2028/console

    This message is automatically generated.
  • Amar Kamat (JIRA) at Mar 23, 2008 at 3:39 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12581400#action_12581400 ]

    Amar Kamat commented on HADOOP-2119:
    ------------------------------------

    The findbugs warning is due to the call to {{System.exit(-1)}} in the JobTracker. In case of misconfiguration (w.r.t. cache levels), the JobTracker will try to shut down, and if the shutdown throws an exception there is a {{System.exit(-1)}} in the catch block to kill the JobTracker forcefully. Hence I think this should not be a problem.
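
    The pattern described here, a graceful shutdown with a forced exit as the fallback, might look roughly like the sketch below. The names ForcedShutdownSketch, Stoppable and stopOnFatalError are hypothetical and do not come from the JobTracker source. The exit action is passed in as a parameter only so that the fallback can be exercised without terminating the JVM; the one-argument overload uses the real System.exit.

```java
import java.util.function.IntConsumer;

// Hypothetical sketch of "try a clean shutdown, force an exit if that fails".
class ForcedShutdownSketch {
    // Stand-in for a service whose shutdown may itself fail.
    interface Stoppable {
        void shutdown() throws Exception;
    }

    // Attempts a graceful shutdown; if it throws, invokes the exit action with -1.
    static void stopOnFatalError(Stoppable service, IntConsumer exit) {
        try {
            service.shutdown();
        } catch (Exception e) {
            System.err.println("Graceful shutdown failed, forcing exit: " + e);
            exit.accept(-1);
        }
    }

    // Production-style variant: the fallback really terminates the JVM.
    static void stopOnFatalError(Stoppable service) {
        stopOnFatalError(service, System::exit);
    }
}
```

    Keeping the System.exit call confined to a single catch block like this makes it easier to justify the resulting findbugs warning as intentional.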
  • Devaraj Das (JIRA) at Mar 26, 2008 at 9:05 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582426#action_12582426 ]

    Devaraj Das commented on HADOOP-2119:
    -------------------------------------

    +1
  • Owen O'Malley (JIRA) at Mar 26, 2008 at 10:29 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582452#action_12582452 ]

    Owen O'Malley commented on HADOOP-2119:
    ---------------------------------------

    Ok, now I committed it. Thanks, Amar!
  • Hudson (JIRA) at Mar 27, 2008 at 12:21 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582614#action_12582614 ]

    Hudson commented on HADOOP-2119:
    --------------------------------

    Integrated in Hadoop-trunk #443 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/443/])
