By default, Linux file systems use a 4K block size, which means all I/O
happens 4K at a time. Any *updates* to data smaller than 4K result in a
read-modify-write cycle on disk; i.e., if a file is extended from 1K to 2K,
the fs will read in the full 4K block, memcpy the region from 1K-2K into the
vm page, then write the 4K back out.
If you make the block size 1M, that read-modify-write cycle will read in 1M
and write out 1M. I don't think you want that to happen (imagine the HBase
WAL writing a few hundred bytes at a time).
It also means that, on average, you will waste 512K of disk per file (vs. 2K
with a 4K block size).
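The waste figures come from internal fragmentation: the last block of each file is, on average, half full, so the expected slack per file is half a block. A quick sketch of that arithmetic:

```python
# Expected internal fragmentation per file: the final block of a file
# is on average half full, so the expected slack is block_size / 2.
def expected_waste_per_file(block_size):
    return block_size // 2

print(expected_waste_per_file(4 * 1024))       # 2K with 4K blocks
print(expected_waste_per_file(1024 * 1024))    # 512K with 1M blocks
```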
btw, MapR uses 8K as the native block size on disk.
If you insist on HDFS, try using XFS underneath; it does a much better job
than ext3 or ext4 for Hadoop in terms of how data is laid out on disk. But
its memory footprint is at least twice that of ext3, so it will gobble up
a lot more memory on your box.
On Sun, Oct 2, 2011 at 10:05 PM, Jinsong Hu wrote:
I just thought of an idea. When we format the disk, the block size is
usually 1K to 4K. For HDFS, the block size is usually 64M.
I wonder: if we change the raw file system's block size to something
significantly bigger, say 1M or 8M, will that improve
disk I/O performance for Hadoop's HDFS?
Currently, I noticed that the MapR distribution uses MFS, its own file
system. That resulted in a 4x performance gain in terms
of disk I/O. I just wonder whether, by tuning the host OS parameters, we can
achieve better disk I/O performance with just the regular
Apache Hadoop distribution.
I understand that making the block size bigger can result in some disk
space waste for small files. However, for disks dedicated
to HDFS, where most of the files are very big, I just wonder if it is a
good idea. Anybody have any comments?