(mapred.min.split.size can be only set to larger than HDFS block size)
I haven't tried this with the new MapReduce API, but
I think it would let you set a split size smaller than the HDFS block size :)
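As a rough sketch of how a split smaller than the block size can come about: as far as I know, the new API's FileInputFormat picks the split size as max(minSize, min(maxSize, blockSize)), so a maximum split size below the block size wins. The class and method names below are made up for illustration, and I am assuming the maximum is the value configured through mapred.max.split.size; this is not a copy of the Hadoop source.

```java
// Illustrative sketch of the new-API split-size rule (hypothetical class name).
class NewApiSplitSize {
    static final long MB = 1024L * 1024L;

    // splitSize = max(minSize, min(maxSize, blockSize))
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 64 * MB;

        // No maximum configured: the split size equals the block size.
        System.out.println(computeSplitSize(blockSize, 1, Long.MAX_VALUE) / MB); // 64

        // Assumed 16 MB maximum: the split drops below the block size.
        System.out.println(computeSplitSize(blockSize, 1, 16 * MB) / MB); // 16

        // A 128 MB minimum: the split grows beyond the block size.
        System.out.println(computeSplitSize(blockSize, 128 * MB, Long.MAX_VALUE) / MB); // 128
    }
}
```

With this rule the minimum can only push the split size up and the maximum can only pull it down, which would match the observation that mapred.min.split.size has no effect below the block size.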
On 2/17/11 2:32 PM, "Jim Falgout" wrote:
Generally, if you have large files, setting the block size to 128M or larger is helpful. You can do that on a per-file basis or set the block size for the whole filesystem. The larger block size cuts down on the number of map tasks required to handle the overall data size. I've also experimented with mapred.min.split.size and have usually found that the larger the split size, the better the overall run time. Of course there is a cutoff point, especially on a very large cluster, where larger split sizes will hurt overall scalability.
On tests I've run on 10- and 20-node clusters, though, setting the split size as high as 1GB has allowed the overall Hadoop jobs to run faster, sometimes quite a bit faster. You will lose some locality, but that seems to be a trade-off against the number of files that have to be shuffled for the reduction step.
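To make the effect on task count concrete, here is a minimal sketch that estimates the number of map tasks for a single file as ceil(fileSize / splitSize). The 10 GB input size and the class and method names are assumptions for illustration only; the real FileInputFormat allows the final split to be slightly larger than the rest, so the actual count can be off by one per file.

```java
// Illustrative estimate of map-task count per file (hypothetical class name).
class MapTaskEstimate {
    static final long MB = 1024L * 1024L;
    static final long GB = 1024L * MB;

    // Rough number of splits for one file: ceil(fileSize / splitSize).
    static long estimateMapTasks(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long fileSize = 10 * GB; // assumed input size, not from the tests above

        System.out.println(estimateMapTasks(fileSize, 64 * MB));  // 160
        System.out.println(estimateMapTasks(fileSize, 128 * MB)); // 80
        System.out.println(estimateMapTasks(fileSize, 1 * GB));   // 10
    }
}
```

Fewer map tasks also means fewer map output files for the reducers to fetch, which is the shuffle side of the trade-off.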
From: Boduo Li
Sent: Thursday, February 17, 2011 12:01 PM
Subject: HDFS block size v.s. mapred.min.split.size
I've recently been benchmarking Hadoop. I know two ways to control the input data size for each map task: by changing the HDFS block size (which means reloading the data into HDFS), or by setting mapred.min.split.size.
For my benchmarking task, I need to change the input size for a map task frequently. Changing the HDFS block size and reloading the data is really painful.
But using mapred.min.split.size seems to be problematic. I ran a simple test to check whether Hadoop performs similarly in the following cases:
(1) HDFS block size = 32MB, mapred.min.split.size=64MB (mapred.min.split.size can only be set larger than the HDFS block size)
(2) HDFS block size = 64MB, mapred.min.split.size is not set
I ran the same job under these settings. Setting (1) takes 1374s to finish.
Setting (2) takes 1412s to finish.
I do understand that, with a smaller HDFS block size, the I/O is more random.
But the 50-second difference seems too much for random I/O of input data.
Does anyone have any insight into this? Or does anyone know a better way to control the input size of each map task?
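For reference, a minimal sketch of how the old (org.apache.hadoop.mapred) API sizes splits, as far as I recall: splitSize = max(minSize, min(goalSize, blockSize)), where goalSize is the total input size divided by the requested number of map tasks (mapred.map.tasks). The 10 GB input and the task counts are assumptions for illustration, and the class and method names are hypothetical rather than copied from the Hadoop source.

```java
// Illustrative sketch of the old-API split-size rule (hypothetical class name).
class OldApiSplitSize {
    static final long MB = 1024L * 1024L;

    // goalSize = totalSize / requested number of map tasks (at least 1).
    static long goalSize(long totalSize, int numSplits) {
        return totalSize / (numSplits == 0 ? 1 : numSplits);
    }

    // splitSize = max(minSize, min(goalSize, blockSize))
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    public static void main(String[] args) {
        long totalSize = 10L * 1024 * MB; // assumed 10 GB of input
        long goal = goalSize(totalSize, 2); // assumed request of 2 map tasks

        // Setting (1): 32 MB blocks, mapred.min.split.size = 64 MB.
        System.out.println(computeSplitSize(goal, 64 * MB, 32 * MB) / MB); // 64

        // Setting (2): 64 MB blocks, minimum left at 1 byte.
        System.out.println(computeSplitSize(goal, 1, 64 * MB) / MB); // 64

        // Requesting 640 map tasks brings goalSize down to 16 MB,
        // which is below the 64 MB block size.
        long smallGoal = goalSize(totalSize, 640);
        System.out.println(computeSplitSize(smallGoal, 1, 64 * MB) / MB); // 16
    }
}
```

Under this rule both settings produce 64 MB splits, but in setting (1) each split spans two 32 MB blocks that may live on different nodes. It also suggests that raising the requested number of map tasks is one way to get splits smaller than a block without reloading the data.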