We did the same exercise a few months back. The balancer takes a while to run,
and it balances based on the percentage of disk usage on each node, so you will
end up with usage between, say, 45% and 55% on all nodes.
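
To make the percentage-based criterion concrete, here is a minimal Python sketch. It assumes the balancer treats a node as balanced when its utilization is within a threshold (10 percentage points here) of the cluster-wide utilization. The function name, the four new 8 TB nodes, and the 3 TB used on each old node are all made up for illustration; this is not the balancer's actual code.

```python
def classify_nodes(used_tb, capacity_tb, threshold_pct=10.0):
    """Group node indexes by how their utilization compares to the cluster's.

    Assumption: a node counts as balanced when its utilization is within
    threshold_pct percentage points of the cluster-wide utilization.
    """
    cluster_pct = 100.0 * sum(used_tb) / sum(capacity_tb)
    over, under, balanced = [], [], []
    for i, (used, cap) in enumerate(zip(used_tb, capacity_tb)):
        node_pct = 100.0 * used / cap
        if node_pct > cluster_pct + threshold_pct:
            over.append(i)       # candidate source: blocks should move off
        elif node_pct < cluster_pct - threshold_pct:
            under.append(i)      # candidate target: blocks should move on
        else:
            balanced.append(i)
    return cluster_pct, over, under, balanced


# Hypothetical cluster: 22 old 4 TB nodes holding 3 TB each,
# plus 4 new, empty 8 TB nodes (the node count is an assumption).
capacity = [4.0] * 22 + [8.0] * 4
used = [3.0] * 22 + [0.0] * 4
cluster_pct, over, under, balanced = classify_nodes(used, capacity)
print(f"cluster utilization: {cluster_pct:.1f}%")
print(f"over: {len(over)}, under: {len(under)}, balanced: {len(balanced)}")
```

Because the comparison is on percentages rather than absolute bytes, a balanced 8 TB node ends up holding roughly twice the data of a balanced 4 TB node.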
Sometimes the balancer does not balance well initially, in which case we
increased the rep factor to 4 and kept it that way for a few days while running
the balancer. Then we brought the rep factor back down to 3 and let the balancer
From: David Ginzburg <email@example.com>
To: HDFS USER mail list <firstname.lastname@example.org>
Sent: Thu, January 20, 2011 12:42:17 AM
Subject: Adding new data nodes to existing cluster, with different storage
Our current cluster runs with 22 data nodes, each with 4 TB.
We will be installing new data nodes on this existing cluster, but each will
have 8 TB of storage capacity.
I am wondering how the namenode will distribute the blocks. It is my
understanding that the replica placement policy chooses data nodes at
random, so an even distribution is expected. Eventually the smaller nodes
will fill up while the larger nodes reach only 50%, at which point the small
nodes will become unusable.
Am I correct?
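
The arithmetic behind this scenario can be sketched in Python as follows. It models only the uniform-random premise stated above, where every node receives the same amount of data regardless of its size. It does not model actual namenode behavior, which may also take racks and remaining space into account. The function names and the count of ten new nodes are hypothetical.

```python
def large_node_pct_when_small_full(small_tb, large_tb):
    """Utilization (%) of a large node at the moment the small nodes fill up.

    Assumption: placement is uniform, so every node holds the same
    number of terabytes at any point in time.
    """
    return 100.0 * small_tb / large_tb


def stranded_capacity_tb(n_large, small_tb, large_tb):
    """Capacity left unused on the large nodes once the small nodes are full."""
    return n_large * (large_tb - small_tb)


# 4 TB and 8 TB come from the message; 10 new nodes is a made-up figure.
pct = large_node_pct_when_small_full(4.0, 8.0)
stranded = stranded_capacity_tb(10, 4.0, 8.0)
print(f"large nodes at {pct:.0f}% when small nodes are full")
print(f"unused capacity on large nodes: {stranded:.0f} TB")
```

Under that premise the 50% figure follows directly from the 4 TB to 8 TB ratio, independent of how many nodes of either size are in the cluster.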
Is there any recommended practice in this case? Would running a balancer