FAQ
Hi,

I am storing data on an HDFS cluster (4 machines). I have noticed that
read/write performance is not much affected by the total amount of data
stored on HDFS. I have only used 20-30% of the cluster's capacity and have
never filled it completely. Can someone explain why this is the case? Does
HDFS guarantee this behavior, or am I missing something?

Thanks,

wasim


  • Alex Loddengaard at Jun 18, 2009 at 5:55 pm
    I'm a little confused about what your question is. Are you asking why HDFS
    has consistent read/write speeds even as your cluster stores more and more
    data?

    If so, two HDFS bottlenecks that would change read/write performance as
    used capacity changes are NameNode (NN) RAM and the amount of data each of
    your DataNodes (DNs) is storing. If you have so much metadata (lots of
    files, blocks, etc.) that the NN Java process uses most of your NN's
    memory, you'll see a big decrease in performance. This bottleneck usually
    only shows itself on large clusters with tons of metadata, though a small
    cluster with a wimpy NN machine will hit the same limit. Similarly, if
    each of your DNs is storing close to its capacity, then reads/writes will
    begin to slow down, as each node will be responsible for streaming more
    and more data in and out. Does that make sense?

    Try filling your cluster up to 80-90%: I imagine you'd see a decrease in
    read/write performance, depending on the tests you're running, though I
    can't say I've done this performance test myself. I'm merely speculating.

    Hope this clears things up.

    Alex
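
To make the metadata point above concrete, here is a minimal, illustrative
sketch (not from the thread). It assumes the commonly cited rule of thumb of
roughly 150 bytes of NameNode heap per namespace object (file, directory, or
block); the real figure varies by Hadoop version, so treat the result as an
order-of-magnitude estimate. The object counts in main() are hypothetical.

public class NameNodeHeapEstimate {
    // Assumed average heap cost per file, directory, or block object
    // (rule-of-thumb figure, not exact).
    private static final long BYTES_PER_OBJECT = 150;

    public static long estimateHeapBytes(long files, long directories, long blocks) {
        return (files + directories + blocks) * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: 1 million files, 100k directories,
        // ~1.1 million blocks (mostly small files, roughly one block each).
        long estimate = estimateHeapBytes(1_000_000, 100_000, 1_100_000);
        System.out.printf("Estimated NN heap for metadata: ~%d MB%n",
                estimate / (1024 * 1024));
        // The estimate depends on object counts, not on how many terabytes
        // those blocks hold -- which is why a 20-30% full cluster with few,
        // large files barely touches NN memory.
    }
}
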
  • Todd Lipcon at Jun 18, 2009 at 6:54 pm

    On Thu, Jun 18, 2009 at 10:55 AM, Alex Loddengaard wrote:

    I'm a little confused about what your question is. Are you asking why HDFS
    has consistent read/write speeds even as your cluster stores more and more
    data?

    If so, two HDFS bottlenecks that would change read/write performance as
    used capacity changes are NameNode (NN) RAM and the amount of data each of
    your DataNodes (DNs) is storing. If you have so much metadata (lots of
    files, blocks, etc.) that the NN Java process uses most of your NN's
    memory, you'll see a big decrease in performance.
    To avoid this issue, simply watch swap usage on your NN. If your NN starts
    swapping, you will likely run into problems with metadata operation speed.
    This won't affect the throughput of reads/writes within a block, though.

    This bottleneck usually only shows itself on large clusters with tons of
    metadata, though a small cluster with a wimpy NN machine will hit the same
    limit. Similarly, if each of your DNs is storing close to its capacity,
    then reads/writes will begin to slow down, as each node will be
    responsible for streaming more and more data in and out. Does that make
    sense?

    Try filling your cluster up to 80-90%: I imagine you'd see a decrease in
    read/write performance, depending on the tests you're running, though I
    can't say I've done this performance test myself. I'm merely speculating.
    Another thing to keep in mind is that local filesystem performance begins to
    suffer once a disk is more than 80% or so full. This is due to the ways that
    filesystems endeavour to keep file fragmentation low. When there is little
    extra space on the drive, the file system has fewer options for relocating
    blocks and fighting fragmentation, so "sequential" writes and reads will
    actually incur seeks on the local disk. Since the datanodes store their
    blocks on the local file system, this is a factor worth considering.

    -Todd
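
As a minimal sketch of the "watch swap on the NN" advice above (not from the
thread, and Linux-specific by assumption), the following reads /proc/meminfo
on the NameNode host and reports swap in use. Nonzero and growing swap usage
alongside a large NN heap is the warning sign Todd describes.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class SwapUsageCheck {
    public static void main(String[] args) throws IOException {
        long swapTotalKb = -1, swapFreeKb = -1;
        // Parse the SwapTotal and SwapFree lines from /proc/meminfo (kB values).
        try (BufferedReader in = new BufferedReader(new FileReader("/proc/meminfo"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split("\\s+");
                if (line.startsWith("SwapTotal:")) swapTotalKb = Long.parseLong(parts[1]);
                if (line.startsWith("SwapFree:"))  swapFreeKb  = Long.parseLong(parts[1]);
            }
        }
        if (swapTotalKb < 0 || swapFreeKb < 0) {
            System.out.println("Could not read swap figures from /proc/meminfo");
            return;
        }
        long usedKb = swapTotalKb - swapFreeKb;
        System.out.printf("Swap in use: %d MB of %d MB%n", usedKb / 1024, swapTotalKb / 1024);
    }
}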

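As a rough companion to the ~80% local-filesystem point above, this sketch
(also not from the thread) reports how full the volume behind each DataNode
data directory is. The paths are hypothetical stand-ins for whatever
dfs.data.dir points to on your nodes.

import java.io.File;

public class DataDirUsageCheck {
    private static final double WARN_FRACTION = 0.80; // assumed threshold from the discussion

    public static void main(String[] args) {
        // Hypothetical data directories; replace with your dfs.data.dir values.
        String[] dataDirs = {"/data/1/dfs/dn", "/data/2/dfs/dn"};
        for (String path : dataDirs) {
            File dir = new File(path);
            long total = dir.getTotalSpace();
            if (total == 0) {
                System.out.println(path + ": not found or unreadable");
                continue;
            }
            // getUsableSpace() approximates free space available to this process.
            double usedFraction = 1.0 - (double) dir.getUsableSpace() / total;
            System.out.printf("%s: %.1f%% used%s%n", path, usedFraction * 100,
                    usedFraction > WARN_FRACTION ? "  <-- above ~80%, expect slower sequential I/O" : "");
        }
    }
}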