I have just read about the HDFS RAID feature that was added to Hadoop 0.21
or 0.22, and I am quite curious to know whether people use it, what they use
it for, and what they think about its effect on Map/Reduce data locality.
The first big user of this technology is Facebook, which claims to save many PB
with it (see http://www.slideshare.net/ydn/hdfs-raid-facebook,
slides 4 and 5).
As I understand it, HDFS RAID has the following advantages:
- You can save space
- The system tolerates more missing blocks
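To put a rough number on the space saving, here is a back-of-the-envelope sketch. The stripe length of 10 and the replication factors below are values I am assuming for illustration (an XOR-style and a Reed-Solomon-style layout); they are not taken from the slides, and the function name is just mine.

```python
def effective_replication(stripe_len, parity_blocks, data_repl, parity_repl):
    """Physical blocks stored per logical data block, for one RAID stripe."""
    stored = stripe_len * data_repl + parity_blocks * parity_repl
    return stored / stripe_len


# Assumed layouts, for illustration only.
layouts = {
    "plain HDFS, replication=3": effective_replication(10, 0, 3, 0),
    "XOR parity, data and parity replicated x2": effective_replication(10, 1, 2, 2),
    "Reed-Solomon, 4 parity blocks, no replication": effective_replication(10, 4, 1, 1),
}

for name, factor in layouts.items():
    print(f"{name}: {factor:.1f}x raw storage per byte of data")
```

Under these assumptions the raw storage per byte of data drops from 3x to somewhere between 1.4x and 2.2x, which is where a saving of "many PB" on a large cluster would come from.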
Nonetheless, one drawback I see is M/R data locality.
As far as I understand, the advantage of having 3 replicas of each block is
not only safety if a server fails or a block is corrupted,
but also the possibility of having up to 3 tasktrackers that can execute the map
task on “local data”.
If you consider the 4th slide of the Facebook presentation, such a
setup reduces this possibility to only 1 tasktracker.
That means that if this tasktracker is busy executing other tasks, you
have the following choices:
- Waiting for this tasktracker to finish (part of) its
current tasks (freeing a map slot, for instance)
- Executing the map task for this block on another tasktracker,
transferring the block's data over the network
In both cases, you'll get an M/R penalty (please tell me if I am wrong).
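To make the locality concern concrete, here is a deliberately simplistic sketch. It assumes each tasktracker holding a copy of the block has a free map slot independently with the same probability `p_free`; real schedulers and cluster load are more complicated than that, so treat this as an illustration, not a benchmark. The names are hypothetical.

```python
def p_local(p_free, replicas):
    """Probability that at least one replica holder has a free map slot,
    assuming each holder is free independently with probability p_free."""
    return 1.0 - (1.0 - p_free) ** replicas


# Compare 3 replicas (normal HDFS) with 1 copy (RAIDed data) at various loads.
for p_free in (0.1, 0.3, 0.5, 0.8):
    three = p_local(p_free, 3)
    one = p_local(p_free, 1)
    print(f"p_free={p_free:.1f}  3 replicas: {three:.3f}  1 replica: {one:.3f}")
```

Under this toy model the gap is largest on a busy cluster (small `p_free`), which is exactly the case I am worried about.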
Has anybody measured this penalty, or does anybody have benchmarks to share?
One scenario I can think of for taking advantage of HDFS RAID
without suffering this penalty is:
- Using normal HDFS with default replication=3 for my “fresh data”
- Using HDFS RAID for my historical data (that is barely used by
And you, what are you using HDFS RAID for?