Hadoop JobTracker Hanging
Folks,

I need some help with the job tracker.
I am running two Hadoop clusters (with 30+ nodes) on Ubuntu. One runs version 0.19.1 (Apache) and the other runs version 0.20.1+169.68 (Cloudera).

I have the same problem with both clusters: the job tracker hangs roughly once a day.
Symptoms: the job tracker web page cannot be loaded, the command "hadoop job -list" hangs, and the jobtracker.log file stops being updated.
I cannot find any useful information in the job tracker log file.
The symptoms go away after I restart the job tracker, and the cluster runs fine for another 20+ hours. Then they come back.

I do not have any serious problems with HDFS.

Any ideas about the causes? Are there any configuration parameters I can change to reduce the chances of the problem?
Any tips for diagnosing and troubleshooting?

Thanks!

Tan

  • Ted Yu at Jun 17, 2010 at 9:39 pm
    Is upgrading to hadoop-0.20.2+228 possible?

    Use jstack to get a stack trace of the job tracker process when this happens
    again.
    Use jmap to get shared object memory maps or heap memory details.
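    For example, a minimal sketch of capturing that data the next time the JT
    hangs (assuming the JDK's jps/jstack/jmap are on the PATH and are run as the
    user that owns the JobTracker process; the output file names are only
    examples):

        # Find the JobTracker PID
        JT_PID=$(jps | awk '/JobTracker/ {print $1}')

        # Thread dump with lock information; take two or three a minute apart
        # to see whether the same threads stay blocked
        jstack -l "$JT_PID" > jt-jstack-$(date +%Y%m%d-%H%M%S).txt

        # Heap configuration/usage summary and a histogram of live objects
        # (the :live option triggers a full GC, so the JVM pauses briefly)
        jmap -heap "$JT_PID"       > jt-jmap-heap.txt
        jmap -histo:live "$JT_PID" > jt-jmap-histo.txt

    If the process is too wedged to respond, jstack -F forces a dump, at the
    cost of being more intrusive.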

  • Todd Lipcon at Jun 17, 2010 at 9:41 pm
    +1, jstack is crucial to solve these kinds of issues. Also, which scheduler
    are you using?

    Thanks
    -Todd
    --
    Todd Lipcon
    Software Engineer, Cloudera
  • Li, Tan at Jun 17, 2010 at 11:58 pm
    Thanks, Todd.
    I will try that and let you know the result.
    Tan

  • Li, Tan at Jun 18, 2010 at 12:04 am
    Thanks for your tips, Ted.
    All of our QA is done on 0.20.1, and I have a feeling it is not version related.
    I will run jstack and jmap once the problem happens again, and I may need your help to analyze the result.

    Tan

  • Todd Lipcon at Jun 18, 2010 at 12:07 am
    Li, just to narrow your search: in my experience this is usually caused by
    an OOME on the JT. Check the logs for OutOfMemoryError and see what you find.
    You may need to configure it to retain fewer jobs in memory, or increase your
    heap.

    -Todd
    --
    Todd Lipcon
    Software Engineer, Cloudera
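    As a concrete, hedged sketch of the "up your heap" option: the daemon heap is
    normally set in conf/hadoop-env.sh, and the heap-space errors show up in the
    JT log. The log path below is only an example; adjust it to wherever your
    distribution writes the jobtracker log.

        # conf/hadoop-env.sh -- daemon heap size in MB (default is 1000);
        # (a JT-specific -Xmx can usually be added via HADOOP_JOBTRACKER_OPTS instead)
        export HADOOP_HEAPSIZE=4000

        # Count heap-space errors in the JT log (example path; adjust to your install)
        grep -c "java.lang.OutOfMemoryError" /var/log/hadoop/*jobtracker*.log*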
  • Li, Tan at Jun 18, 2010 at 9:05 pm
    Todd,
    I will try to increase the HADOOP_HEAPSIZE to see if that helps.
    Tan

  • Bobby Dennett at Jun 21, 2010 at 7:50 pm
    Thanks all for your suggestions (please note that Tan is my co-worker; we
    are both working to resolve this issue). We experienced another hang this
    weekend and increased the HADOOP_HEAPSIZE setting to 6000 (MB), as we do
    periodically see "java.lang.OutOfMemoryError: Java heap space" errors in the
    jobtracker log. We are now looking into the resource allocation of the
    master node/server to ensure we aren't experiencing any issues due to the
    heap size increase. In parallel, we are also working on building "beefier"
    servers -- stronger CPUs, 3x more memory -- for the node running the primary
    namenode and jobtracker processes, as well as for the secondary namenode.

    Any additional suggestions you might have for troubleshooting/resolving
    this hanging jobtracker issue would be greatly appreciated.

    Please note that I had previously started a similar topic on Get
    Satisfaction
    (http://www.getsatisfaction.com/cloudera/topics/looking_for_troubleshooting_tips_guidance_for_hanging_jobtracker)
    where Todd is helping and the output of jstack and jmap can be found.

    Thanks,
    -Bobby
  • James Seigel at Jun 21, 2010 at 7:52 pm
    Good luck Bobby. I hope that when you get this problem licked you’ll post your solutions to help us all learn some more stuff as well :)

    Cheers
    James.
  • Ted Yu at Jun 21, 2010 at 8:17 pm
    Before the new hardware is ready, I suggest you configure the jobtracker to
    retain fewer jobs in memory, as Todd mentioned.
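    A sketch of one way to do that in 0.20, where the property
    mapred.jobtracker.completeuserjobs.maximum controls how many completed jobs
    per user the JT keeps in memory (the default is 100); the value below is only
    an example:

        # Add (or lower) this inside <configuration> in conf/mapred-site.xml:
        #   <property>
        #     <name>mapred.jobtracker.completeuserjobs.maximum</name>
        #     <value>25</value>
        #   </property>
        # then restart the JobTracker so it takes effect
        # (on CDH packages the service init scripts may be used instead):
        bin/hadoop-daemon.sh stop jobtracker
        bin/hadoop-daemon.sh start jobtracker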

  • Steve Loughran at Jun 22, 2010 at 10:18 am

    Have you tried:

    * using compressed object pointers on the Java 6 server VM? They reduce heap
    space.

    * bolder: the JRockit JVM. Not officially supported in Hadoop, but I liked
    using it right up until Oracle stopped giving away the updates with security
    patches. It has a much better heap and has had compressed pointers for a long
    time (== more stable code).

    I'm surprised it's the JT that is OOM-ing; anecdotally it's the NN and
    secondary NN that use more, especially if the files are many and the
    blocksize small. The JT should not be tracking that much data over time.
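    For reference, a hedged sketch of enabling compressed oops for the JT on a
    64-bit Sun JDK 6 (the flag needs roughly 6u14 or later and only pays off for
    heaps under about 32 GB); HADOOP_JOBTRACKER_OPTS in conf/hadoop-env.sh is the
    per-daemon hook:

        # conf/hadoop-env.sh -- extra JVM options applied only to the JobTracker
        export HADOOP_JOBTRACKER_OPTS="-server -XX:+UseCompressedOops ${HADOOP_JOBTRACKER_OPTS}"

        # After restarting the JT, confirm the flag is actually on
        JT_PID=$(jps | awk '/JobTracker/ {print $1}')
        jinfo -flag UseCompressedOops "$JT_PID"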
  • James Seigel at Jun 22, 2010 at 2:28 pm
    +1 for compressed pointers.

    Sent from my mobile. Please excuse the typos.
  • Allen Wittenauer at Jun 22, 2010 at 3:54 pm

    Pre-0.20.2, there are definitely bugs with how the JT history is handled, causing some memory leakage.

    The other fairly common condition is if you have way too many tasks per job. This is usually an indication that your data layout is way out of whack (too little data in too many files) or that you should be using CombineFileInputFormat.
  • Rahul Jain at Jun 22, 2010 at 5:21 pm
    There are two issues that were fixed in 0.21.0 and can cause the job tracker
    to run out of memory:

    https://issues.apache.org/jira/browse/MAPREDUCE-1316

    and

    https://issues.apache.org/jira/browse/MAPREDUCE-841

    We've been hit by MAPREDUCE-841 (large jobConf objects combined with a large
    number of tasks, especially when running Pig jobs) a number of times on
    Hadoop 0.20.1 and 0.20.2+.

    The current workarounds are:

    a) Be careful about what you store in the jobConf object.
    b) Understand and control the largest number of mappers/reducers that can
    be queued at any time for processing (see the sketch after this message).
    c) Provide a lot of RAM to the jobtracker.

    We use (c) to save on debugging man-hours most of the time :).

    -Rahul
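    As a sketch of one knob for workaround (b), assuming your 0.20 build carries
    it: mapred.jobtracker.maxtasks.per.job lets the JT reject jobs whose task
    count exceeds a ceiling (its default of -1 means no limit); the value below
    is purely illustrative.

        # conf/mapred-site.xml on the JobTracker node, inside <configuration>:
        #   <property>
        #     <name>mapred.jobtracker.maxtasks.per.job</name>
        #     <value>50000</value>
        #   </property>
        # Restart the JT afterwards, then confirm the property landed in the
        # config it actually reads:
        grep -A 1 "mapred.jobtracker.maxtasks.per.job" conf/mapred-site.xml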

  • Bobby Dennett at Jun 24, 2010 at 6:36 am
    Thanks for the latest round of suggestions. We will definitely check
    out compressed object pointers and are looking into what we can do
    regarding the JT history. As I mentioned previously, we are working on
    getting stronger servers for the NN/JT node and the secondary NN node
    (similar to workaround (c) in Rahul's list). Engineering is also working
    on "improving" one of our processes that accesses a large number of
    potentially smaller files, to try and reduce our maximum number of map
    tasks (similar to workaround (b)).

    On a side note, our JT process has been running since Saturday morning
    after increasing the heap size to 6,000 MB... so far, so good.
    Hopefully, I didn't just jinx it ;o)

    -Bobby
  • Hemanth Yamijala at Jun 22, 2010 at 5:20 pm
    There was also https://issues.apache.org/jira/browse/MAPREDUCE-1316,
    whose cause hit clusters at Yahoo! very badly last year. The situation
    was particularly noticeable with lots of jobs that had failed tasks,
    combined with a specific fix that enabled out-of-band heartbeats. The
    latter (i.e. the OOB heartbeats patch) is not in 0.20 AFAIK, but the
    failed tasks could still be causing it.

    Thanks
    Hemanth

  • James Seigel at Jun 18, 2010 at 4:11 am
    Up the memory from the default to about 4x the default (heap setting). This should make it better I’d think!

    We’d been having the same issue...I believe this fixed it.

    James
  • Li, Tan at Jun 18, 2010 at 5:40 pm
    Thanks for your suggestions, James.
    I will try that.
    Tan
