Thanks.
It is Hadoop 0.19.0. It looks like a peak of 90 concurrent map tasks
and 54 concurrent reduce tasks, not necessarily at the same time. You
asked how many are held in memory. I think the number is 100, the
default.
It looks like 130 MB peak heap size used on the jobtracker. The
default 1000 is allocated, currently.
I set mapred.jobtracker.completeuserjobs.maximum=5
I increased HADOOP_HEAPSIZE to 2000 in hadoop-env.sh.
I included the "kill -3" dump from the jobtracker, far below.
I'll check back in 3 days, if and when the jobtracker fails again.
Dave
On Apr 6, 2009, at 9:42 PM, Amar Kamat wrote:David Kellogg wrote:
I am running Hadoop streaming. After around 42 jobs on an 18-node
cluster, the jobtracker stops responding. This happens on normally-
working code. Here are the symptoms.
1. A job is running, but it pauses with reduce stuck at XX%
2. "hadoop job -list" hangs or takes a very long time to return
3. In the Ganglia metrics on the Jobtracker node:
a. jvm.metrics__JobTracker__gcTimeMillis rises above 20 k (20
seconds) before failure
b. jvm.metrics__JobTracker__memHeapUsedM rises above 600
before failure
c. jvm.metrics__JobTracker__gcCount rises above 1 k before
failure
The ticker looks like this.
09/04/06 03:06:28 INFO streaming.StreamJob: map 24% reduce 7%
09/04/06 03:13:44 INFO streaming.StreamJob: map 25% reduce 7%
After the 03:13:44 line, it hangs for more than 15 minutes.
In the jobtracker log, I see this.
2009-04-04 04:19:13,563 WARN org.apache.hadoop.hdfs.DFSClient:
Error Recovery for block blk_-8143535428142072268_95993 failed
because recovery from primary datanode 10.1.0.156:50010 failed 4
times. Will retry...
After restarting both dfs and mapreduce on all nodes, the problem
goes away, and the formally non-working job proceeds without failure.
David,
What version are you using?
There can be because of :
1) Number of tasks in jobtracker's memory might exceed its limits.
What is the total number of tasks in the jobtracker's memory? What
is the jobtracker's heap size? Try increasing the heap size and
also try setting the mapred.jobtracker.completeuserjobs.maximum
parameter to some low value.
2) Sometimes some slow/bad datanode causes the jobtracker to get
stuck. As you have mentioned this might be the cause. Can you let
us know the output of 'kill -3' on jobtracker process.
Does anyone else see this problem?
David Kellogg
Full thread dump Java HotSpot(TM) 64-Bit Server VM (10.0-b22 mixed
mode):
"LeaseChecker" daemon prio=10 tid=0x00007f0c3042a800 nid=0x3ef3
waiting on condition [0x0000000040bf7000..0x0000000040bf7b90]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.run
(DFSClient.java:979)
at java.lang.Thread.run(Thread.java:619)
"IPC Server handler 9 on 54311" daemon prio=10 tid=0x00007f0c307f0800
nid=0x3dd9 waiting on condition [0x000000004313f000..0x000000004313fd10]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007f0c3ca963a8> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park
(LockSupport.java:158)
at java.util.concurrent.locks.AbstractQueuedSynchronizer
$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
at java.util.concurrent.LinkedBlockingQueue.take
(LinkedBlockingQueue.java:358)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:878)
"IPC Server handler 8 on 54311" daemon prio=10 tid=0x00007f0c307ef400
nid=0x3dd8 waiting on condition [0x000000004303e000..0x000000004303ed90]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007f0c3ca963a8> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park
(LockSupport.java:158)
at java.util.concurrent.locks.AbstractQueuedSynchronizer
$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
at java.util.concurrent.LinkedBlockingQueue.take
(LinkedBlockingQueue.java:358)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:878)
"IPC Server handler 7 on 54311" daemon prio=10 tid=0x00007f0c307b0400
nid=0x3dd7 waiting on condition [0x0000000042f3d000..0x0000000042f3da10]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007f0c3ca963a8> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park
(LockSupport.java:158)
at java.util.concurrent.locks.AbstractQueuedSynchronizer
$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
at java.util.concurrent.LinkedBlockingQueue.take
(LinkedBlockingQueue.java:358)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:878)
"IPC Server handler 6 on 54311" daemon prio=10 tid=0x00007f0c307af000
nid=0x3dd6 waiting on condition [0x0000000042e3c000..0x0000000042e3ca90]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007f0c3ca963a8> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park
(LockSupport.java:158)
at java.util.concurrent.locks.AbstractQueuedSynchronizer
$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
at java.util.concurrent.LinkedBlockingQueue.take
(LinkedBlockingQueue.java:358)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:878)
"IPC Server handler 5 on 54311" daemon prio=10 tid=0x00007f0c30296800
nid=0x3dd5 waiting on condition [0x0000000042d3b000..0x0000000042d3bb10]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007f0c3ca963a8> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park
(LockSupport.java:158)
at java.util.concurrent.locks.AbstractQueuedSynchronizer
$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
at java.util.concurrent.LinkedBlockingQueue.take
(LinkedBlockingQueue.java:358)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:878)
"IPC Server handler 4 on 54311" daemon prio=10 tid=0x00007f0c302e7c00
nid=0x3dd4 waiting on condition [0x0000000042c3a000..0x0000000042c3ab90]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007f0c3ca963a8> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park
(LockSupport.java:158)
at java.util.concurrent.locks.AbstractQueuedSynchronizer
$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
at java.util.concurrent.LinkedBlockingQueue.take
(LinkedBlockingQueue.java:358)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:878)
"IPC Server handler 3 on 54311" daemon prio=10 tid=0x00007f0c302e6800
nid=0x3dd3 waiting on condition [0x0000000042b39000..0x0000000042b39c10]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007f0c3ca963a8> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park
(LockSupport.java:158)
at java.util.concurrent.locks.AbstractQueuedSynchronizer
$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
at java.util.concurrent.LinkedBlockingQueue.take
(LinkedBlockingQueue.java:358)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:878)
"IPC Server handler 2 on 54311" daemon prio=10 tid=0x00007f0c30337400
nid=0x3dd2 waiting on condition [0x0000000042a38000..0x0000000042a38c90]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007f0c3ca963a8> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park
(LockSupport.java:158)
at java.util.concurrent.locks.AbstractQueuedSynchronizer
$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
at java.util.concurrent.LinkedBlockingQueue.take
(LinkedBlockingQueue.java:358)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:878)
"IPC Server handler 1 on 54311" daemon prio=10 tid=0x00007f0c30336000
nid=0x3dd1 waiting on condition [0x0000000042937000..0x0000000042937d10]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007f0c3ca963a8> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park
(LockSupport.java:158)
at java.util.concurrent.locks.AbstractQueuedSynchronizer
$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
at java.util.concurrent.LinkedBlockingQueue.take
(LinkedBlockingQueue.java:358)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:878)
"IPC Server handler 0 on 54311" daemon prio=10 tid=0x00007f0c302a6000
nid=0x3dd0 waiting on condition [0x0000000042836000..0x0000000042836d90]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007f0c3ca963a8> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park
(LockSupport.java:158)
at java.util.concurrent.locks.AbstractQueuedSynchronizer
$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
at java.util.concurrent.LinkedBlockingQueue.take
(LinkedBlockingQueue.java:358)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:878)
"IPC Server listener on 54311" daemon prio=10 tid=0x00007f0c302b6c00
nid=0x3dcf runnable [0x0000000042735000..0x0000000042735a10]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:
215)
at sun.nio.ch.EPollSelectorImpl.doSelect
(EPollSelectorImpl.java:65)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:
69)
- locked <0x00007f0c3ca96ac8> (a sun.nio.ch.Util$1)
- locked <0x00007f0c3ca96ab0> (a java.util.Collections
$UnmodifiableSet)
- locked <0x00007f0c3ca96850> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:84)
at org.apache.hadoop.ipc.Server$Listener.run(Server.java:297)
"IPC Server Responder" daemon prio=10 tid=0x00007f0c302b5800
nid=0x3dce runnable [0x0000000042634000..0x0000000042634a90]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:
215)
at sun.nio.ch.EPollSelectorImpl.doSelect
(EPollSelectorImpl.java:65)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:
69)
- locked <0x00007f0c3cba4810> (a sun.nio.ch.Util$1)
- locked <0x00007f0c3cba47f8> (a java.util.Collections
$UnmodifiableSet)
- locked <0x00007f0c3cba4520> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
at org.apache.hadoop.ipc.Server$Responder.run(Server.java:456)
"expireLaunchingTasks" prio=10 tid=0x00007f0c308ba400 nid=0x3dcd
waiting on condition [0x0000000042533000..0x0000000042533b10]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.apache.hadoop.mapred.JobTracker
$ExpireLaunchingTasks.run(JobTracker.java:198)
at java.lang.Thread.run(Thread.java:619)
"retireJobs" prio=10 tid=0x00007f0c303a8800 nid=0x3dcc waiting on
condition [0x0000000042432000..0x0000000042432b90]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.apache.hadoop.mapred.JobTracker$RetireJobs.run
(JobTracker.java:351)
at java.lang.Thread.run(Thread.java:619)
"expireTrackers" prio=10 tid=0x00007f0c36940c00 nid=0x3dcb waiting on
condition [0x0000000041c7f000..0x0000000041c7fc10]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.apache.hadoop.mapred.JobTracker$ExpireTrackers.run
(JobTracker.java:281)
at java.lang.Thread.run(Thread.java:619)
"initJobs" prio=10 tid=0x00007f0c3693fc00 nid=0x3dca in Object.wait()
[0x0000000041b7e000..0x0000000041b7ec90]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00007f0c3ca91d90> (a java.util.ArrayList)
at java.lang.Object.wait(Object.java:485)
at org.apache.hadoop.mapred.EagerTaskInitializationListener
$JobInitThread.run(EagerTaskInitializationListener.java:51)
- locked <0x00007f0c3ca91d90> (a java.util.ArrayList)
at java.lang.Thread.run(Thread.java:619)
"Thread-14" prio=10 tid=0x00007f0c3602f400 nid=0x3dc9 waiting on
condition [0x0000000041a7d000..0x0000000041a7dd10]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007f0c3cf03938> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park
(LockSupport.java:158)
at java.util.concurrent.locks.AbstractQueuedSynchronizer
$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
at java.util.concurrent.DelayQueue.take(DelayQueue.java:160)
at java.util.concurrent.DelayQueue.take(DelayQueue.java:39)
at org.apache.hadoop.mapred.JobEndNotifier$1.run
(JobEndNotifier.java:50)
at java.lang.Thread.run(Thread.java:619)
"Timer thread for monitoring mapred" daemon prio=10
tid=0x00007f0c36d51000 nid=0x3dc4 in Object.wait()
[0x000000004197c000..0x000000004197cb10]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.util.TimerThread.mainLoop(Timer.java:509)
- locked <0x00007f0c3ceabad8> (a java.util.TaskQueue)
at java.util.TimerThread.run(Timer.java:462)
"Timer thread for monitoring jvm" daemon prio=10
tid=0x00007f0c36d86000 nid=0x3dc3 in Object.wait()
[0x00000000407b3000..0x00000000407b3b90]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.util.TimerThread.mainLoop(Timer.java:509)
- locked <0x00007f0c3ceaeaf8> (a java.util.TaskQueue)
at java.util.TimerThread.run(Timer.java:462)
"SocketListener0-1" prio=10 tid=0x00007f0c36a4f000 nid=0x3dc2 in
Object.wait() [0x000000004187b000..0x000000004187bc10]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at org.mortbay.util.ThreadPool$PoolThread.run
(ThreadPool.java:522)
- locked <0x00007f0c3ceb1760> (a org.mortbay.util.ThreadPool
$PoolThread)
"SocketListener0-0" prio=10 tid=0x00007f0c36d69400 nid=0x3dc1 in
Object.wait() [0x000000004177a000..0x000000004177ac90]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at org.mortbay.util.ThreadPool$PoolThread.run
(ThreadPool.java:522)
- locked <0x00007f0c3ceb1a40> (a org.mortbay.util.ThreadPool
$PoolThread)
"Acceptor ServerSocket[addr=0.0.0.0/0.0.0.0,port=0,localport=50030]"
prio=10 tid=0x00007f0c36b6a800 nid=0x3dc0 runnable
[0x0000000041679000..0x0000000041679d10]
java.lang.Thread.State: RUNNABLE
at java.net.PlainSocketImpl.socketAccept(Native Method)
at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:384)
- locked <0x00007f0c3cc3b5c8> (a java.net.SocksSocketImpl)
at java.net.ServerSocket.implAccept(ServerSocket.java:453)
at java.net.ServerSocket.accept(ServerSocket.java:421)
at org.mortbay.util.ThreadedServer.acceptSocket
(ThreadedServer.java:432)
at org.mortbay.util.ThreadedServer$Acceptor.run
(ThreadedServer.java:631)
"SessionScavenger" daemon prio=10 tid=0x00007f0c303cc000 nid=0x3dbf
waiting on condition [0x0000000040e0b000..0x0000000040e0bd90]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.mortbay.jetty.servlet.AbstractSessionManager
$SessionScavenger.run(AbstractSessionManager.java:587)
"SessionScavenger" daemon prio=10 tid=0x00007f0c30346800 nid=0x3dbe
waiting on condition [0x0000000041de8000..0x0000000041de8a10]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.mortbay.jetty.servlet.AbstractSessionManager
$SessionScavenger.run(AbstractSessionManager.java:587)
"SessionScavenger" daemon prio=10 tid=0x00007f0c3031e400 nid=0x3dbd
waiting on condition [0x0000000041578000..0x0000000041578a90]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.mortbay.jetty.servlet.AbstractSessionManager
$SessionScavenger.run(AbstractSessionManager.java:587)
"Low Memory Detector" daemon prio=10 tid=0x00000000401c9400
nid=0x3d7c runnable [0x0000000000000000..0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"CompilerThread1" daemon prio=10 tid=0x00000000401c6000 nid=0x3d7b
waiting on condition [0x0000000000000000..0x00000000423302b0]
java.lang.Thread.State: RUNNABLE
"CompilerThread0" daemon prio=10 tid=0x00000000401c4000 nid=0x3d7a
waiting on condition [0x0000000000000000..0x000000004222f330]
java.lang.Thread.State: RUNNABLE
"Signal Dispatcher" daemon prio=10 tid=0x00000000401c2800 nid=0x3d79
waiting on condition [0x0000000000000000..0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"Finalizer" daemon prio=10 tid=0x000000004019d800 nid=0x3d78 in
Object.wait() [0x000000004202e000..0x000000004202eb90]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
- locked <0x00007f0c3cbfe4e0> (a java.lang.ref.ReferenceQueue
$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
at java.lang.ref.Finalizer$FinalizerThread.run
(Finalizer.java:159)
"Reference Handler" daemon prio=10 tid=0x0000000040196c00 nid=0x3d77
in Object.wait() [0x0000000041f2d000..0x0000000041f2dc10]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:485)
at java.lang.ref.Reference$ReferenceHandler.run
(Reference.java:116)
- locked <0x00007f0c3caf1408> (a java.lang.ref.Reference$Lock)
"main" prio=10 tid=0x0000000040111c00 nid=0x3d73 in Object.wait()
[0x0000000041477000..0x0000000041477ed0]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00007f0c3cba4020> (a
org.apache.hadoop.ipc.RPC$Server)
at java.lang.Object.wait(Object.java:485)
at org.apache.hadoop.ipc.Server.join(Server.java:1021)
- locked <0x00007f0c3cba4020> (a org.apache.hadoop.ipc.RPC
$Server)
at org.apache.hadoop.mapred.JobTracker.offerService
(JobTracker.java:1309)
at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:
2731)
"VM Thread" prio=10 tid=0x0000000040191c00 nid=0x3d76 runnable
"GC task thread#0 (ParallelGC)" prio=10 tid=0x000000004011c400
nid=0x3d74 runnable
"GC task thread#1 (ParallelGC)" prio=10 tid=0x000000004011d800
nid=0x3d75 runnable
"VM Periodic Task Thread" prio=10 tid=0x00000000401cb000 nid=0x3d7d
waiting on condition
JNI global references: 964
Heap
PSYoungGen total 198528K, used 68487K [0x00007f0c66540000,
0x00007f0c77830000, 0x00007f0c7b290000)
eden space 142144K, 48% used
[0x00007f0c66540000,0x00007f0c6a821f70,0x00007f0c6f010000)
from space 56384K, 0% used
[0x00007f0c6f010000,0x00007f0c6f010000,0x00007f0c72720000)
to space 57600K, 0% used
[0x00007f0c73ff0000,0x00007f0c73ff0000,0x00007f0c77830000)
PSOldGen total 682688K, used 672276K [0x00007f0c3ca90000,
0x00007f0c66540000, 0x00007f0c66540000)
object space 682688K, 98% used
[0x00007f0c3ca90000,0x00007f0c65b15068,0x00007f0c66540000)
PSPermGen total 21248K, used 16555K [0x00007f0c37690000,
0x00007f0c38b50000, 0x00007f0c3ca90000)
object space 21248K, 77% used
[0x00007f0c37690000,0x00007f0c386bae60,0x00007f0c38b50000)