job history directory grows without bound, locks up job tracker on new job submission
-------------------------------------------------------------------------------------

Key: HADOOP-5436
URL: https://issues.apache.org/jira/browse/HADOOP-5436
Project: Hadoop Core
Issue Type: Bug
Affects Versions: 0.19.0
Reporter: Tim Williamson


An unpleasant surprise upgrading to 0.19: requests to jobtracker.jsp would take a long time or even time out whenever new jobs were submitted. Investigation showed the call to JobInProgress.initTasks() was calling JobHistory.JobInfo.logSubmitted(), which in turn was calling JobHistory.getJobHistoryFileName(), which was pegging the CPU for a couple of minutes. Further investigation showed there were 200,000+ files in the job history folder -- and every submission was creating a FileStatus for them all, then applying a regular expression to just the name. All this just on the off chance the job tracker had been restarted (see HADOOP-3245). To make matters worse, these files cannot be safely deleted while the job tracker is running, as the disappearance of a history file at the wrong time causes a FileNotFoundException.
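
For illustration, here is a minimal sketch of the pattern described above (hypothetical class and method names; not the actual 0.19 source): every submission lists the entire history directory, materializing a FileStatus per entry, and then regex-matches only the file names.

    import java.io.IOException;
    import java.util.regex.Pattern;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Hypothetical sketch, not the actual JobHistory.getJobHistoryFileName().
    public class HistoryScanSketch {
      static String findHistoryFile(FileSystem fs, Path logDir, String jobId)
          throws IOException {
        Pattern p = Pattern.compile(".*_" + Pattern.quote(jobId) + "_.*");
        // listStatus() materializes a FileStatus for every entry in the
        // directory (200,000+ here) before any filtering can happen.
        FileStatus[] all = fs.listStatus(logDir);
        for (FileStatus st : all) {
          // Only the file name is ever inspected.
          if (p.matcher(st.getPath().getName()).matches()) {
            return st.getPath().getName();
          }
        }
        return null; // the common case on a new submission: no match at all
      }
    }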

So to summarize the issues:
- having Hadoop default to storing all the history files in a single directory is a Bad Idea
- doing expensive processing of every history file on every job submission is a Worse Idea
- doing expensive processing of every history file on every job submission while holding a lock on the JobInProgress object and thereby blocking the jobtracker.jsp from rendering is a Terrible Idea (note: haven't confirmed this, but a cursory glance suggests that's what's going on)
- not being able to clean up the mess without taking down the job tracker is just Unfortunate



  • Amar Kamat (JIRA) at Mar 8, 2009 at 5:15 am
    [ https://issues.apache.org/jira/browse/HADOOP-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12679953#action_12679953 ]

    Amar Kamat commented on HADOOP-5436:
    ------------------------------------

    Tim,

bq. Further investigation showed there were 200,000+ files in the job history folder -- and every submission was creating a FileStatus for them all, then applying a regular expression to just the name
Hey. I think the regex is passed in the DFS call and the expected answer is just *one* FileStatus object. I don't know how the regex-based search is implemented, but JobHistory doesn't create FileStatus objects for all the files.

    bq. having Hadoop default to storing all the history files in a single directory is a Bad Idea
    HADOOP-4670 is opened to address this.

    bq. doing expensive processing of every history file on every job submission is a Worse Idea
HADOOP-4372 should help, as there will be no need to access the history folder during job initialization. But I think the DFS should be efficient enough for regex-based searches.

    bq. doing expensive processing of every history file on every job submission while holding a lock on the JobInProgress object and thereby blocking the jobtracker.jsp from rendering is a Terrible Idea (note: haven't confirmed this, but a cursory glance suggests that's what's going on)
The plan is to improve JobTracker locking and make it more granular. But I think HADOOP-4372 should eliminate this.

    bq. not being able to clean up the mess without taking down the job tracker is just Unfortunate
    Look at HADOOP-4167.
  • Tim Williamson (JIRA) at Mar 8, 2009 at 7:41 am
    [ https://issues.apache.org/jira/browse/HADOOP-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12679962#action_12679962 ]

    Tim Williamson commented on HADOOP-5436:
    ----------------------------------------
bq. Hey. I think the regex is passed in the DFS call and the expected answer is just one FileStatus object. I don't know how the regex-based search is implemented, but JobHistory doesn't create FileStatus objects for all the files.
    Regarding this, the code path is:
JobHistory.getJobHistoryFileName(...) -> fs.listStatus(new Path(LOG_DIR), filter) -> listStatus(results, f, filter);
    which has the following:
    FileStatus listing[] = listStatus(f);
    if (listing != null) {
      for (int i = 0; i < listing.length; i++) {
        if (filter.accept(listing[i].getPath())) {
          results.add(listing[i]);
        }
      }
    }

    So you are right that it's not JobHistory that's creating all the FileStatus objects: it's org.apache.hadoop.fs.FileSystem that creates (potentially hundreds of thousands of) FileStatus objects, inspects only their path attribute, and then returns just the ones that match -- which, in the case of a new job submission, is none of them.
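
    For reference, the filter side is roughly this shape (hypothetical pattern string; the real filter matches on the job history file-name format): only the name is examined, but by the time accept() runs, a FileStatus has already been allocated for the entry.

    import java.util.regex.Pattern;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.PathFilter;

    // Hypothetical sketch of a name-matching PathFilter like the one
    // JobHistory hands to listStatus(); the real pattern differs.
    class JobHistoryNameFilter implements PathFilter {
      private final Pattern pattern;

      JobHistoryNameFilter(String jobId) {
        this.pattern = Pattern.compile(".*_" + Pattern.quote(jobId) + "_.*");
      }

      public boolean accept(Path path) {
        // The decision uses the bare name only, yet listStatus() has already
        // paid for a FileStatus (and a per-entry stat) before calling this.
        return pattern.matcher(path.getName()).matches();
      }
    }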

    Thanks for pointing out the other tickets; I should have searched before filing.
  • Tim Williamson (JIRA) at Mar 8, 2009 at 8:11 am
    [ https://issues.apache.org/jira/browse/HADOOP-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Tim Williamson updated HADOOP-5436:
    -----------------------------------

    Attachment: HADOOP-5436.patch

    Attached is a patch to make the filtering more efficient by passing the filter all the way down to the RawLocalFileSystem.
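
    A minimal sketch of the idea (hypothetical class and method names, not the attached patch itself): apply the name filter while the directory is being enumerated, e.g. via java.io.File.list(FilenameFilter), so that a FileStatus is only constructed for entries the filter accepts.

    import java.io.File;
    import java.io.FilenameFilter;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.PathFilter;

    // Hypothetical illustration of pushing the filter down to the local
    // directory listing; rejected names never become FileStatus objects.
    class FilteredListingSketch {
      static String[] listMatchingNames(File dir, final PathFilter filter) {
        return dir.list(new FilenameFilter() {
          public boolean accept(File d, String name) {
            // Filter on the bare name before any per-file stat work.
            return filter.accept(new Path(name));
          }
        });
      }
    }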
  • Devaraj Das (JIRA) at Mar 9, 2009 at 5:21 am
    [ https://issues.apache.org/jira/browse/HADOOP-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12680050#action_12680050 ]

    Devaraj Das commented on HADOOP-5436:
    -------------------------------------

    I think addressing HADOOP-4372 is a good first step to fix this issue (HADOOP-4670 being the next). That should eliminate the need for doing listStatus for new jobs. But yes, the listStatus implementation might be improved for performance, and that could be done independently of this JIRA.
