To find the bottleneck, I tried to figure out whether some processes/threads are often blocked waiting on disk or network I/O, and why either the mappers or reducers run slowly. In my case, up to 12 mappers are allowed to run simultaneously on each slave. The CPUs are idle more than 90% of the time and spend at most about 2% in iowait. But I found that most mappers (from "top" and "jps") were sleeping, and strace shows that they (including the tasktracker and datanode) were blocked on futex(0x4035b9d0, FUTEX_WAIT, 12566, NULL,
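One quick way to see how many of those JVMs are runnable vs. sleeping vs. stuck in uninterruptible disk wait is to bucket ps output by process state. A minimal sketch (the sample lines here are made up for illustration; in practice pipe in `ps -eo state,comm` instead):

```shell
# Bucket processes by state: R=running, S=sleeping, D=uninterruptible (disk) wait.
# Real usage: ps -eo state,comm | awk '$2 == "java" {c[$1]++} END {for (s in c) print s, c[s]}'
states='S java
S java
R java
D java
S bash'
echo "$states" | awk '$2 == "java" {c[$1]++} END {for (s in c) print s, c[s]}' | sort
```

Lots of D-state processes would point at disk; mostly S with idle CPUs suggests the waits are elsewhere (network round-trips or JVM-internal locking, which also shows up as futex waits in strace).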
Here's a list of accumulated open files (including network, pipe, socket, etc.) of the datanode, grouped by type:
IPv6 15
unix 1
DIR 2
CHR 4
0000 17
REG 122
sock 1
FIFO 34
Here's a list of accumulated open files (including network, pipe, socket, etc.) of the tasktracker, grouped by type:
IPv6 24
unix 1
DIR 2
CHR 4
0000 4
REG 105
sock 1
FIFO 50
Here's a typical mapper thread:
IPv6 2
unix 1
0000 1
DIR 4
sock 1
FIFO 2
CHR 6
REG 106
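FWIW, counts like the ones above can be produced by grouping the TYPE column (5th field) of lsof output. A minimal sketch with made-up sample rows (in practice replace the sample with real `lsof -p <pid>` output):

```shell
# Group open descriptors by the TYPE column of lsof output.
# Real usage: lsof -p <pid> | awk 'NR > 1 {print $5}' | sort | uniq -c
sample='COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
java 123 hadoop 1u IPv6 a b c sock
java 123 hadoop 2u REG a b c /tmp/f1
java 123 hadoop 3u REG a b c /tmp/f2
java 123 hadoop 4r FIFO a b c pipe'
echo "$sample" | awk 'NR > 1 {print $5}' | sort | uniq -c
```

`NR > 1` skips the header row; `uniq -c` needs sorted input, hence the `sort` in between.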
A mapper would block on a futex for about a minute or so. It seems to me that the various I/O cannot keep up with the CPU. Would it help to increase some buffer parameters to handle this? Or do these stats imply something else? BTW, what is an effective way to analyze the performance of a Hadoop cluster, and what are some good tools? Any recommendations?
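As a first pass without installing anything, comparing two /proc/stat snapshots gives a rough whole-box iowait percentage, which cross-checks what top reports (a Linux-only sketch; field order is cpu, user, nice, system, idle, iowait per proc(5)):

```shell
# Rough iowait check: two /proc/stat snapshots one second apart.
# High iowait => likely disk-bound; near-zero iowait with idle CPUs and
# sleeping mappers points at waits elsewhere (network, JVM locks).
read -r _ u1 n1 s1 i1 w1 _ < /proc/stat
sleep 1
read -r _ u2 n2 s2 i2 w2 _ < /proc/stat
total=$(( (u2-u1) + (n2-n1) + (s2-s1) + (i2-i1) + (w2-w1) ))
[ "$total" -gt 0 ] || total=1   # guard against a zero-delta divide
pct=$(( 100 * (w2 - w1) / total ))
echo "iowait%: $pct"
```

iostat -x and vmstat give the same numbers broken down per device and over time, which is usually the next step.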
Thanks,
Michael