Hi all,
we are using Hadoop 0.20.2 with HBase 0.20.6 on about 30 nodes. Our
cluster is under heavy write load, causing HBase to do a lot of
compactions, which in turn causes many files with many new blocks to be
created in HDFS. The problem is that several datanodes periodically run
out of disk space; they consume space very rapidly (we have seen 5 TiB
exhausted in a single day). We investigated this problem a bit and
created a patch to FSNamesystem.invalidateWorkForOneNode(). The patch
changes how the node to invalidate is selected: the original version
always picks the first node, and we changed this to a random node.
After applying the patch the problem seems to disappear, and all our
DNs now show steady, balanced disk usage. The question is: is our
problem a general issue, or are we missing something? What is the
reason for always taking the first node to invalidate, when this can
potentially starve the other nodes?
Thanx,
Jan