On 07/18/11 21:43, Uma Maheswara Rao G 72686 wrote:
> We have already thought about it.
No, I think we are talking about different problems. What I'm talking
about is how to reduce the number of replicas while still achieving the
same data reliability. The replicas themselves can already be compressed.
To illustrate the problem, here is a more concrete example:
The size of block A is X. After it is compressed, its size is Y. When it
is written to HDFS, it must be replicated if we want the data to be
reliable. With a replication factor of R, R*Y bytes are written to disk
and (R-1)*Y bytes are transmitted over the network.
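To make that arithmetic concrete, here is a small sketch in plain Python; the 64 MB / 40 MB sizes and R = 3 are illustrative numbers, not measurements:

```python
# Cost of storing one compressed block under R-way replication.
# Y = compressed block size in bytes, R = replication factor.

def replication_cost(Y, R):
    disk_bytes = R * Y           # every replica is written to disk
    network_bytes = (R - 1) * Y  # R-1 copies cross the network
    return disk_bytes, network_bytes

# Example: a 64 MB block compressed to 40 MB, default R = 3.
Y = 40 * 1024 * 1024
disk, net = replication_cost(Y, R=3)
print(disk, net)  # 3*Y bytes on disk, 2*Y bytes over the network
```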
Now, if we use a better encoding to achieve data reliability, then for B
blocks of data we can keep P parity blocks. For each block, we then need
only (1+P/B)*Y bytes written to disk and (P/B)*Y bytes transmitted
over the network, so it's possible to further reduce the network
and disk bandwidth.
So what Joey showed me is more relevant, even though it doesn't reduce
the data size before the data is written to the network or the disk.
To implement that, I think we would probably no longer use the write
pipeline.
About your patches: I don't know how useful they would be when we can
ask the applications to compress the data themselves. For example, we
can enable mapred.output.compress in MapReduce so that reducers compress
their output. I assume MapReduce is the major user of HDFS.
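For reference, that compression can be enabled with a configuration fragment like the following (the codec choice here is illustrative; any installed CompressionCodec works):

```xml
<!-- mapred-site.xml: compress reducer output before it is written to HDFS -->
<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
```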