Replicate data in HDFS with smarter encoding
Hello,

It seems that data replication in HDFS is simply copying data among nodes. Has
anyone considered using a better encoding to reduce the data size? Say, a block
of data is split into N pieces, and as long as M of the pieces survive in the
network, we can regenerate the original data.

There are many benefits to reducing the data size: it saves network and disk
bandwidth, and thus reduces energy consumption. Computational power might be a
concern, but we could use GPUs to encode and decode.

But maybe the idea is stupid or it's hard to reduce the data size. I would like
to hear your comments.

Thanks,
Da
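
To make the (N, M) idea above concrete, here is a minimal sketch of the simplest
such code: k data pieces plus one XOR parity piece, so any k of the N = k + 1
pieces are enough to rebuild the block. Real erasure-coding setups (for example,
the Reed-Solomon codes used by HDFS RAID) tolerate more than one lost piece; the
class name, piece sizes, and sample data below are purely illustrative.

    import java.util.Arrays;

    public class XorParityDemo {

        // XOR together k equally sized pieces; used both to create the parity
        // piece and to reconstruct a single missing piece from the survivors.
        static byte[] xorAll(byte[][] pieces) {
            byte[] out = new byte[pieces[0].length];
            for (byte[] piece : pieces) {
                for (int i = 0; i < out.length; i++) {
                    out[i] ^= piece[i];
                }
            }
            return out;
        }

        public static void main(String[] args) {
            // A "block" split into k = 3 equally sized pieces.
            byte[][] data = {
                "HDFS-blk-part-1".getBytes(),
                "HDFS-blk-part-2".getBytes(),
                "HDFS-blk-part-3".getBytes(),
            };
            byte[] parity = xorAll(data);

            // Simulate losing piece 1: only pieces 0, 2 and the parity survive.
            byte[][] survivors = { data[0], data[2], parity };

            // XOR of the survivors reproduces the lost piece.
            byte[] rebuilt = xorAll(survivors);

            System.out.println("lost:    " + new String(data[1]));
            System.out.println("rebuilt: " + new String(rebuilt));
            System.out.println("match:   " + Arrays.equals(data[1], rebuilt));
        }
    }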


  • Joey Echeverria at Jul 18, 2011 at 11:11 am
    Facebook contributed some code to do something similar called HDFS RAID:

    http://wiki.apache.org/hadoop/HDFS-RAID

    -Joey

  • Da Zheng at Jul 19, 2011 at 3:53 am
    So is this kind of feature desired by the community?

    It seems this implementation can only reduce the data size on disk via the
    background daemon RaidNode; it cannot reduce the disk bandwidth and network
    bandwidth used when the client writes data to HDFS. It might be more
    interesting to reduce the disk and network bandwidth, although that would
    probably require modifying the implementation of the write pipeline in HDFS.

    Thanks,
    Da

  • Uma Maheswara Rao G 72686 at Jul 19, 2011 at 4:44 am
    Hi,

    We have already thought about this.

    It looks like you are talking about these features:
    https://issues.apache.org/jira/browse/HDFS-1640
    https://issues.apache.org/jira/browse/HDFS-2115

    but the implementation is not yet ready in trunk.


    Regards,
    Uma

  • Da Zheng at Jul 19, 2011 at 5:38 am
    Hello,
    On 07/18/11 21:43, Uma Maheswara Rao G 72686 wrote:
    > Hi,
    > We have already thought about this.
    No, I think we are talking about different problems. What I'm talking about
    is how to reduce the number of replicas while still achieving the same data
    reliability. The replicated data can already be compressed.

    To illustrate the problem, here is a more concrete example:
    The size of block A is X. After it is compressed, its size is Y. When it is
    written to HDFS, it needs to be replicated if we want the data to be
    reliable. If the replication factor is R, then R*Y bytes will be written to
    disk, and (R-1)*Y bytes will be transmitted over the network.

    Now, if we use a better encoding to achieve data reliability, then for B
    blocks of data we can have P parity blocks. For each block, we then need
    only (1 + P/B)*Y bytes written to disk and (P/B)*Y bytes transmitted over
    the network, so it is possible to further reduce the network and disk
    bandwidth (a quick arithmetic sketch of this comparison follows the thread
    below).

    So what Joey showed me is more relevant, even though it doesn't reduce the
    data size before the data is written to the network or the disk.

    To implement that, I think we would probably not use the write pipeline any
    more. As for your patches, I don't know how useful they would be when we can
    simply ask applications to compress their data. For example, we can enable
    mapred.output.compress in MapReduce to ask reducers to compress their output
    (see the configuration sketch after the thread); I assume MapReduce is the
    major user of HDFS.

    Thanks,
    Da
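
To put numbers on the comparison in the last message, here is a back-of-the-envelope
sketch using its own formulas: per compressed block of Y bytes, R-way replication
writes R*Y bytes to disk and sends (R-1)*Y bytes over the network, while a
(B data + P parity) code writes (1 + P/B)*Y bytes and sends (P/B)*Y bytes. The
values Y = 64 MB, R = 3, B = 10 and P = 4 below are illustrative choices, not
HDFS or HDFS RAID defaults.

    public class OverheadComparison {
        public static void main(String[] args) {
            double y = 64.0;      // compressed block size Y, in MB (illustrative)
            int r = 3;            // replication factor R
            int b = 10, p = 4;    // B data blocks protected by P parity blocks

            // R-way replication: R*Y to disk, (R-1)*Y over the network.
            double replDisk = r * y;
            double replNet  = (r - 1) * y;

            // (B + P) erasure coding: (1 + P/B)*Y to disk, (P/B)*Y over the network.
            double codeDisk = (1.0 + (double) p / b) * y;
            double codeNet  = ((double) p / b) * y;

            System.out.printf("replication : %.1f MB disk, %.1f MB network%n",
                    replDisk, replNet);
            System.out.printf("erasure code: %.1f MB disk, %.1f MB network%n",
                    codeDisk, codeNet);
        }
    }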
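
On the compression point in the same message, here is one way to turn on reducer
output compression, assuming the 0.20-era org.apache.hadoop.mapred API the
message refers to; it is equivalent to setting mapred.output.compress=true and
choosing a codec, though property names and defaults can differ across Hadoop
versions, so treat this as a sketch rather than a definitive recipe.

    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobConf;

    public class CompressedOutputConf {
        public static void main(String[] args) {
            JobConf conf = new JobConf();

            // Ask reducers to compress their output (mapred.output.compress = true).
            FileOutputFormat.setCompressOutput(conf, true);

            // Choose a codec for the compressed output files.
            FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);

            System.out.println("mapred.output.compress = "
                    + conf.getBoolean("mapred.output.compress", false));
        }
    }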

Discussion Overview
group: common-user
categories: hadoop
posted: Jul 18, '11 at 7:41a
active: Jul 19, '11 at 5:38a
posts: 5
users: 3
website: hadoop.apache.org...
irc: #hadoop
