Thanks for your comments, Allen. I have added mine inline.
On May 11, 2011, at 11:11 AM, Adi wrote:
By our calculations Hadoop should not exceed 70% of memory.
Allocated per node: 48 map slots (24 GB), 12 reduce slots (6 GB), 1 GB
each for DataNode/TaskTracker, and one JobTracker, totalling 33-34 GB.
It sounds like you are only taking into consideration the heap
size. There is more memory allocated than just the heap...
Our heap size for mappers is half of the memory allocated to each mapper. But
you're right, we should budget more for the DN/TT/JT.
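For what it's worth, here is the kind of quick check I've been using to compare
the configured heap against what each task JVM actually occupies on a node
(just GNU ps, nothing Hadoop-specific; the -Xmx shows up in the child command
line, RSS is what the kernel has resident, and the gap between them is the
non-heap overhead you're referring to):

$ ps -C java -o pid,rss,args --sort=-rss | grep attempt_ | head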
The queues are capped at using only 90% of their allocated capacity, so generally
10% of the slots are always kept free.
But that doesn't translate into how free the nodes are, as you've
discovered. Individual nodes should be configured based on the assumption
that *all* slots will be used.
The cluster was running a total of 33 mappers and 1 reducer, so around 8-9 mappers
per node with a 3 GB max limit, and they were utilizing around 2 GB each.
Top was showing 100% memory utilized, which our sysadmin says is OK, as the
memory is used for file caching by Linux if the processes are not using it.
Well, yes and no. What is the breakdown of that 100%? Is any of it
actually allocated to buffer cache, or is it all user space?
Here's the breakdown.
Tasks: 260 total, 1 running, 259 sleeping, 0 stopped, 0 zombie
Cpu(s): 12.7%us, 1.2%sy, 0.0%ni, 83.6%id, 2.2%wa, 0.0%hi, 0.3%si, 0.0%st
Mem: 49450772k total, 49143288k used, 307484k free, 16912k buffers
Swap: 5242872k total, 248k used, 5242624k free, 7076564k cached
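So of the ~49 GB reported as used, only about 16 MB is buffers and roughly 7 GB
is page cache; the remaining ~42 GB is process memory. For a quick per-node
breakdown without top, this is what I run (standard Linux tools, nothing
Hadoop-specific):

$ free -m
$ grep -E 'MemTotal|MemFree|Buffers|^Cached|Swap' /proc/meminfo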
No swapping on 3 nodes.
Then node4 just started swapping after the number of processes shot up
unexpectedly. The main mystery is the excess number of processes on the
node that went down: 36, as opposed to the expected 11. The other 3 nodes were
successfully executing the mappers without any memory/swap issues.
Likely speculative execution or something else. But again: don't
build machines with the assumption that only x% of the slots will get used.
There is no guarantee in the system that free slots will be
balanced across all nodes... especially when you take into consideration node
I will look into this.
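If it does turn out to be speculative execution, my understanding is that we can
switch it off for this job with something along these lines (a sketch only,
assuming the job goes through ToolRunner so -D properties are picked up; the jar
and class names are placeholders):

$ hadoop jar our-job.jar OurJob \
    -Dmapred.map.tasks.speculative.execution=false \
    -Dmapred.reduce.tasks.speculative.execution=false \
    input/ output/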
Meanwhile, I am now running the job again with three nodes and observing
that the completed tasks still show up in the process tree. In the earlier
run I was allocating the max heap size as the initial heap size (-Xms = -Xmx),
which I have disabled now. So my hunch is that this job will run out of memory
a little later.
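Concretely, the difference between the two runs is along these lines (the heap
values here are illustrative, not our exact settings):

# earlier run: initial heap pinned to the max, so each child JVM reserves its full heap up front
$ hadoop jar our-job.jar OurJob -Dmapred.child.java.opts='-Xms1536m -Xmx1536m' input/ output/
# current run: max only, so each JVM's footprint grows on demand and the node fills up more gradually
$ hadoop jar our-job.jar OurJob -Dmapred.child.java.opts='-Xmx1536m' input/ output/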
Hadoop is showing 8/10 tasks running per node, and each node right now has
25-30 Java processes.
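A quick way to see that gap on a node is to count the attempt JVMs directly and
compare with what the TaskTracker reports (the bracketed grep just keeps the
grep itself out of the count):

$ ps -ef | grep '[a]ttempt_' | wc -l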
I grepped for a completed attempt and it still shows up in the ps listing.
Non-Running Tasks / Task Attempts / Status: attempt_201104280947_1266_m_000033_0
$ ps -ef | grep hadoop | grep attempt_201104280947_1266_m_000033_0
hadoop 17315 5018 26 16:38 ? 00:13:39