I installed hadoop-0.20.2 in a Eucalyptus VM environment. The file system
is based on glusterfs, so it is a shared NAS. Although the nodes are quite
powerful (8 cores + 15 GB of memory), the Hadoop namenode and datanodes
respond very slowly. For example, after running start-all.sh, the
datanodes take more than 5 minutes to be ready, and the namenode stays in
safe mode for a very long time. Moreover, the program also runs
much slower than it did on the old physical cluster nodes. I have tried
running Hadoop on a cluster of 15 VM nodes and on a pseudo-distributed
cluster on a single VM, and both are very slow. Is it because the NAS is
an IO bottleneck? Running HDFS on top of glusterfs is like reinventing
the wheel, so I tried setting the replication factor to different
values (1 to 4), but saw no improvement. I haven't tried the CDH3 package
yet, and I wonder whether switching to it would bring any significant
improvement. Any suggestions on this issue are highly appreciated.
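
One rough way to check whether the shared storage is the IO bottleneck is
to time a synced sequential write on the glusterfs mount and compare it
with a local disk. The following is only a minimal sketch, not a proper
benchmark: the function name, the 64 MB default size, and the block size
are my own assumptions, and a single write says nothing about read or
metadata performance.

```python
import os
import tempfile
import time


def write_throughput_mb_s(directory, size_mb=64, block_kb=1024):
    """Write size_mb of data to a temp file in `directory`, fsync it,
    and return the observed throughput in MB/s.

    Hypothetical helper for illustration only.
    """
    # Random data, so the storage layer cannot trivially compress it
    # or store it as a sparse file.
    block = os.urandom(block_kb * 1024)
    blocks = (size_mb * 1024) // block_kb

    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(blocks):
                f.write(block)
            f.flush()
            # Force the data out of the page cache before stopping the clock.
            os.fsync(f.fileno())
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)

    # Guard against a zero interval on very small writes.
    return size_mb / max(elapsed, 1e-9)
```

Calling it once with a directory on the glusterfs mount and once with a
directory on a local disk of the same VM should show whether the two
differ by a large factor.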
Shi