For a node with M gigabytes of memory and N total child tasks (both
map + reduce) running on the node, what do people typically use for
the following parameters:
- Xmx (heap size per child task JVM)?
I.e. my question here is what percentage of the total memory node do
you use for the heaps of the tasks' JVMs. I am trying to reuse JVMs,
and there are roughly N task-JVMs on one node at any time. I 've tried
using a very large chunk of my memory of my node for heaps (i.e. close
to M/N) and I have seen better execution times without experiencing
swapping; but I am wondering if this is a job-specific behaviour. When
I 've used both -Xmx and -Xms set to the same heap size (i.e. maximum
and minum heap size the same to avoid contraction and expansion
overheads) I have run into some swapping; I guess Xms=Xmx should be
avoided if we are close to the physical memory limit.
- io.sort.mb and io.sort.factor. I understand that to answer this we
'd have to take the disk configuration into consideration. Do you
consider this only a function of disk or also a function of the heap
size? Obviously io.sort.mb < heapsize, but how much space do you leave
for non-sort buffer usage?
I am interested in small cluster setups ( 8-16 nodes), and not large
clusters, if that makes any difference.