I have a 20-node Hadoop cluster processing large log files. I've seen it
said that there's never any reason to make the inputSplitSize larger than a
single HDFS block (64 MB), because you give up data locality for no benefit
when a split spans multiple blocks.
But when I kick off a job against the whole dataset with that default
splitSize, I get about 180,000 map tasks, most lasting about 9-15 seconds
each. Typically I get through about half of them before the JobTracker
freezes with OOM errors.
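For scale, here's my back-of-envelope math (assuming one split per 64 MB block, and a ~6x scale-up to get from the ~12 s midpoint of current task times to the ~72 s midpoint of my 1-1.5 minute target):

```python
# Back-of-envelope from the numbers above: dataset size implied by
# ~180,000 map tasks at one 64 MB block per split, and the split size
# that would stretch task runtime from ~12 s to ~72 s (a 6x factor).
splits = 180_000
block_mb = 64

total_tb = splits * block_mb / 1024 / 1024
print(f"dataset ~ {total_tb:.1f} TB")            # ~ 11.0 TB

scale = 72 / 12                                  # target runtime / current runtime
new_split_mb = block_mb * scale
new_tasks = splits / scale
print(f"split ~ {new_split_mb:.0f} MB -> ~{new_tasks:,.0f} tasks")
# ~ 384 MB splits -> ~30,000 tasks
```

So even a 6x larger split would still leave tens of thousands of tasks, but far fewer objects for the JobTracker to track.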
I do realize that I could just raise HADOOP_HEAPSIZE on the JobTracker
node. But it also seems like we ought to have fewer map tasks, each lasting
more like 1 or 1.5 minutes, to reduce both the JobTracker's overhead in
managing so many tasks and the cluster nodes' overhead in starting and
cleaning up after so many child JVMs.
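For reference, the knob I believe does this in the old (mapred.*) API is mapred.min.split.size — there's no property literally called inputSplitSize — which makes FileInputFormat coalesce several blocks into one split. A sketch for mapred-site.xml, where the 384 MB value is just an assumption for a ~6x reduction, not a tested number:

```xml
<!-- Hypothetical mapred-site.xml fragment: raise the minimum split size
     so each map task reads several 64 MB blocks instead of one.
     Can also be set per-job with -D mapred.min.split.size=... -->
<property>
  <name>mapred.min.split.size</name>
  <value>402653184</value> <!-- 384 * 1024 * 1024 bytes -->
</property>
```

Note this trades away some data locality, since a multi-block split can't be entirely local to one node.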
Is that not a compelling reason for upping the inputSplitSize? Or am I
missing something?