FAQ
How well does Hadoop handle multiple independent disks per node?

I have a cluster with 4 identical disks per node. I plan to use one
disk for OS and temporary storage, and dedicate the other three to
HDFS. Our IT folks have some disagreement as to whether the three disks
should be striped, or treated by HDFS as three independent disks. Could
someone with more HDFS experience comment on the relative advantages and
disadvantages to each approach?

Here are some of my thoughts. It's a bit easier to manage a 3-disk
striped partition, and we wouldn't have to worry about balancing files
between them. Single-file I/O should be considerably faster. On the
other hand, I would expect typical use to require multiple files reads
or write simultaneously. I would expect Hadoop to be able to manage
read/write to/from the disks independently. Managing 3 streams to 3
independent devices would likely result in less disk head movement, and
therefore better performance. I would expect Hadoop to be able to
balance load between the disks fairly well. Availability doesn't really
differentiate between the two approaches - if a single disk dies, the
striped array would go down, but all its data should be replicated on
another datanode, anyway. And besides, I understand that datanode will
shut down a node, even if only one of 3 independent disks crashes.

So - any one want to agree or disagree with these thoughts? Anyone have
any other ideas, or - better - benchmarks and experience with layouts
like these two?

Thanks!

David

Search Discussions

  • Jason Venner at Jan 11, 2009 at 10:56 pm
    If you put your dfs directory as a set of comma separated tokens you
    will do fine.

    <property>
    <name>dfs.data.dir</name>
    <value>${hadoop.tmp.dir}/dfs/data</value>
    <description>Determines where on the local filesystem an DFS data node
    should store its blocks. If this is a comma-delimited
    list of directories, then data will be stored in all named
    directories, typically on different devices.
    Directories that do not exist are ignored.
    </description>
    </property>

    The namenode does a lot of small writes, so raid 1, 10 is better.

    Also it having the file system mounts for the dfs.data.dir be noatime
    and nodiratime makes a significant performance difference.

    David B. Ritch wrote:
    How well does Hadoop handle multiple independent disks per node?

    I have a cluster with 4 identical disks per node. I plan to use one
    disk for OS and temporary storage, and dedicate the other three to
    HDFS. Our IT folks have some disagreement as to whether the three disks
    should be striped, or treated by HDFS as three independent disks. Could
    someone with more HDFS experience comment on the relative advantages and
    disadvantages to each approach?

    Here are some of my thoughts. It's a bit easier to manage a 3-disk
    striped partition, and we wouldn't have to worry about balancing files
    between them. Single-file I/O should be considerably faster. On the
    other hand, I would expect typical use to require multiple files reads
    or write simultaneously. I would expect Hadoop to be able to manage
    read/write to/from the disks independently. Managing 3 streams to 3
    independent devices would likely result in less disk head movement, and
    therefore better performance. I would expect Hadoop to be able to
    balance load between the disks fairly well. Availability doesn't really
    differentiate between the two approaches - if a single disk dies, the
    striped array would go down, but all its data should be replicated on
    another datanode, anyway. And besides, I understand that datanode will
    shut down a node, even if only one of 3 independent disks crashes.

    So - any one want to agree or disagree with these thoughts? Anyone have
    any other ideas, or - better - benchmarks and experience with layouts
    like these two?

    Thanks!

    David
  • David B. Ritch at Jan 12, 2009 at 12:36 pm
    Thank you - yes, I'm fairly confident that it will work either way. I'm
    trying to find out whether there is an established best practice, and
    the performance impact of the decision between RAID 0 and JBOD.
    I'll check out the noatime and nodiratime for their effect on our
    performance - thanks for that suggestion, as well.

    David
    Jason Venner wrote:
    If you put your dfs directory as a set of comma separated tokens you
    will do fine.

    <property>
    <name>dfs.data.dir</name>
    <value>${hadoop.tmp.dir}/dfs/data</value>
    <description>Determines where on the local filesystem an DFS data node
    should store its blocks. If this is a comma-delimited
    list of directories, then data will be stored in all named
    directories, typically on different devices.
    Directories that do not exist are ignored.
    </description>
    </property>

    The namenode does a lot of small writes, so raid 1, 10 is better.

    Also it having the file system mounts for the dfs.data.dir be noatime
    and nodiratime makes a significant performance difference.

    David B. Ritch wrote:
    How well does Hadoop handle multiple independent disks per node?

    I have a cluster with 4 identical disks per node. I plan to use one
    disk for OS and temporary storage, and dedicate the other three to
    HDFS. Our IT folks have some disagreement as to whether the three disks
    should be striped, or treated by HDFS as three independent disks. Could
    someone with more HDFS experience comment on the relative advantages and
    disadvantages to each approach?

    Here are some of my thoughts. It's a bit easier to manage a 3-disk
    striped partition, and we wouldn't have to worry about balancing files
    between them. Single-file I/O should be considerably faster. On the
    other hand, I would expect typical use to require multiple files reads
    or write simultaneously. I would expect Hadoop to be able to manage
    read/write to/from the disks independently. Managing 3 streams to 3
    independent devices would likely result in less disk head movement, and
    therefore better performance. I would expect Hadoop to be able to
    balance load between the disks fairly well. Availability doesn't really
    differentiate between the two approaches - if a single disk dies, the
    striped array would go down, but all its data should be replicated on
    another datanode, anyway. And besides, I understand that datanode will
    shut down a node, even if only one of 3 independent disks crashes.

    So - any one want to agree or disagree with these thoughts? Anyone have
    any other ideas, or - better - benchmarks and experience with layouts
    like these two?

    Thanks!

    David
  • Brian Vargas at Jan 12, 2009 at 1:03 pm
    David,

    As I understand it, you will theoretically get better performance from a
    JBOD configuration than a RAID configuration. In a RAID configuration,
    you have to wait for the slowest disk in the array to complete before
    the entire IO operation can complete, making the average IO time
    equivalent to the slowest disk. In a JBOD configuration, operations on
    a faster disks will complete independently of the slowest disk, making
    the average IO time for the node necessarily faster than the slowest
    disk (unless all disks are equally slow).

    Whether it would be a noticeable gain is questionable, though. I doubt
    it would make enough difference to provide a good reason to depart from
    whichever you feel is easiest to manage.

    And you don't need the redundancy of RAID, since HDFS does that using
    replication between nodes, so there's no loss there.

    Brian

    David B. Ritch wrote:
    Thank you - yes, I'm fairly confident that it will work either way. I'm
    trying to find out whether there is an established best practice, and
    the performance impact of the decision between RAID 0 and JBOD.
    I'll check out the noatime and nodiratime for their effect on our
    performance - thanks for that suggestion, as well.

    David
    Jason Venner wrote:
    If you put your dfs directory as a set of comma separated tokens you
    will do fine.

    <property>
    <name>dfs.data.dir</name>
    <value>${hadoop.tmp.dir}/dfs/data</value>
    <description>Determines where on the local filesystem an DFS data node
    should store its blocks. If this is a comma-delimited
    list of directories, then data will be stored in all named
    directories, typically on different devices.
    Directories that do not exist are ignored.
    </description>
    </property>

    The namenode does a lot of small writes, so raid 1, 10 is better.

    Also it having the file system mounts for the dfs.data.dir be noatime
    and nodiratime makes a significant performance difference.

    David B. Ritch wrote:
    How well does Hadoop handle multiple independent disks per node?

    I have a cluster with 4 identical disks per node. I plan to use one
    disk for OS and temporary storage, and dedicate the other three to
    HDFS. Our IT folks have some disagreement as to whether the three disks
    should be striped, or treated by HDFS as three independent disks. Could
    someone with more HDFS experience comment on the relative advantages and
    disadvantages to each approach?

    Here are some of my thoughts. It's a bit easier to manage a 3-disk
    striped partition, and we wouldn't have to worry about balancing files
    between them. Single-file I/O should be considerably faster. On the
    other hand, I would expect typical use to require multiple files reads
    or write simultaneously. I would expect Hadoop to be able to manage
    read/write to/from the disks independently. Managing 3 streams to 3
    independent devices would likely result in less disk head movement, and
    therefore better performance. I would expect Hadoop to be able to
    balance load between the disks fairly well. Availability doesn't really
    differentiate between the two approaches - if a single disk dies, the
    striped array would go down, but all its data should be replicated on
    another datanode, anyway. And besides, I understand that datanode will
    shut down a node, even if only one of 3 independent disks crashes.

    So - any one want to agree or disagree with these thoughts? Anyone have
    any other ideas, or - better - benchmarks and experience with layouts
    like these two?

    Thanks!

    David
  • Colin Evans at Jan 12, 2009 at 5:17 pm
    Currently, Hadoop does round-robin allocation of blocks and data
    across multiple JBOD disks. We did some testing and found that there
    weren't significant differences between RAID-0 and JBOD. We went with
    JBOD because we figured that RAID-0 has a higher failure rate than
    JBOD -- any disk failure in a 3-disk RAID-0 configuration causes the
    whole node to go down, but if there is a single disk failure in a JBOD
    configuration, Hadoop will go on serving from the other disks.


    On Jan 11, 2009, at 1:23 PM, David B. Ritch wrote:

    How well does Hadoop handle multiple independent disks per node?

    I have a cluster with 4 identical disks per node. I plan to use one
    disk for OS and temporary storage, and dedicate the other three to
    HDFS. Our IT folks have some disagreement as to whether the three
    disks
    should be striped, or treated by HDFS as three independent disks.
    Could
    someone with more HDFS experience comment on the relative advantages
    and
    disadvantages to each approach?

    Here are some of my thoughts. It's a bit easier to manage a 3-disk
    striped partition, and we wouldn't have to worry about balancing files
    between them. Single-file I/O should be considerably faster. On the
    other hand, I would expect typical use to require multiple files reads
    or write simultaneously. I would expect Hadoop to be able to manage
    read/write to/from the disks independently. Managing 3 streams to 3
    independent devices would likely result in less disk head movement,
    and
    therefore better performance. I would expect Hadoop to be able to
    balance load between the disks fairly well. Availability doesn't
    really
    differentiate between the two approaches - if a single disk dies, the
    striped array would go down, but all its data should be replicated on
    another datanode, anyway. And besides, I understand that datanode
    will
    shut down a node, even if only one of 3 independent disks crashes.

    So - any one want to agree or disagree with these thoughts? Anyone
    have
    any other ideas, or - better - benchmarks and experience with layouts
    like these two?

    Thanks!

    David
  • David Ritch at Jan 12, 2009 at 9:00 pm
    Thank you! I'm glad to hear that you have actually tested this.

    I believe that a failure of any disk - even with JBOD - will cause dataNode
    to bring the node down. Presumably, we could bring it right back up, but
    this does sort of diminish the availability argument for JBOD.

    Sounds like it's basically a toss-up. I'm a bit concerned about the
    potential for uneven distribution - both of amount of data, and of transfer
    load - across the spindles. Unless I hear otherwise, I will probably go
    with RAID-0.
    On Mon, Jan 12, 2009 at 12:17 PM, Colin Evans wrote:

    Currently, Hadoop does round-robin allocation of blocks and data across
    multiple JBOD disks. We did some testing and found that there weren't
    significant differences between RAID-0 and JBOD. We went with JBOD because
    we figured that RAID-0 has a higher failure rate than JBOD -- any disk
    failure in a 3-disk RAID-0 configuration causes the whole node to go down,
    but if there is a single disk failure in a JBOD configuration, Hadoop will
    go on serving from the other disks.
  • John Kane at Jan 12, 2009 at 9:15 pm
    We have been using 2U boxes with 12x1TB disks. The first disk is used for
    OS/Scratch/Laziness, the other 11 disks are formatted as individual (~900GB)
    volumes and then mounted separately. We have /data-[a-k] mounted and
    configured in our cluster and have not had any issues with unbalanced
    loading. We do have varied sizes (from small to really large) files and
    hadoop just seems to figure it out for us.

    We have had single drive failures. We just bounce the datanode software and
    all is happy. When we get around to replacing the failed drive (I just about
    to go do one), we format it, mount it and then bounce the datanode. That
    replacement volume is now not very well balanced, but it is not typically an
    issue for us, we add lots of data every day so it does get filled up. We
    have run the rebalancer to address huge disparites in node utilization (like
    when we add a new node). That just made us feel better more than anything
    else.

    Cheers
    On Mon, Jan 12, 2009 at 2:00 PM, David Ritch wrote:

    Thank you! I'm glad to hear that you have actually tested this.

    I believe that a failure of any disk - even with JBOD - will cause dataNode
    to bring the node down. Presumably, we could bring it right back up, but
    this does sort of diminish the availability argument for JBOD.

    Sounds like it's basically a toss-up. I'm a bit concerned about the
    potential for uneven distribution - both of amount of data, and of transfer
    load - across the spindles. Unless I hear otherwise, I will probably go
    with RAID-0.
    On Mon, Jan 12, 2009 at 12:17 PM, Colin Evans wrote:

    Currently, Hadoop does round-robin allocation of blocks and data across
    multiple JBOD disks. We did some testing and found that there weren't
    significant differences between RAID-0 and JBOD. We went with JBOD because
    we figured that RAID-0 has a higher failure rate than JBOD -- any disk
    failure in a 3-disk RAID-0 configuration causes the whole node to go down,
    but if there is a single disk failure in a JBOD configuration, Hadoop will
    go on serving from the other disks.
  • Runping Qi at Jan 14, 2009 at 9:54 pm
    Hi,

    We at Yahoo did some Hadoop benchmarking experiments on clusters with JBOD
    and RAID0. We found that under heavy loads (such as gridmix), JBOD cluster
    performed better.

    Gridmix tests:

    Load: gridmix2
    Cluster size: 190 nodes
    Test results:

    RAID0: 75 minutes
    JBOD: 67 minutes
    Difference: 10%

    Tests on HDFS writes performances

    We ran map only jobs writing data to dfs concurrently on different clusters.
    The overall dfs write throughputs on the jbod cluster are 30% (with a 58
    nodes cluster) and 50% (with an 18 nodes cluster) better than that on the
    raid0 cluster, respectively.

    To understand why, we did some file level benchmarking on both clusters.
    We found that the file write throughput on a JBOD machine is 30% higher than
    that on a comparable machine with RAID0. This performance difference may be
    explained by the fact that the throughputs of different disks can vary 30%
    to 50%. With such variations, the overall throughput of a raid0 system may
    be bottlenecked by the slowest disk.


    -- Runping




    On 1/11/09 1:23 PM, "David B. Ritch" wrote:

    How well does Hadoop handle multiple independent disks per node?

    I have a cluster with 4 identical disks per node. I plan to use one
    disk for OS and temporary storage, and dedicate the other three to
    HDFS. Our IT folks have some disagreement as to whether the three disks
    should be striped, or treated by HDFS as three independent disks. Could
    someone with more HDFS experience comment on the relative advantages and
    disadvantages to each approach?

    Here are some of my thoughts. It's a bit easier to manage a 3-disk
    striped partition, and we wouldn't have to worry about balancing files
    between them. Single-file I/O should be considerably faster. On the
    other hand, I would expect typical use to require multiple files reads
    or write simultaneously. I would expect Hadoop to be able to manage
    read/write to/from the disks independently. Managing 3 streams to 3
    independent devices would likely result in less disk head movement, and
    therefore better performance. I would expect Hadoop to be able to
    balance load between the disks fairly well. Availability doesn't really
    differentiate between the two approaches - if a single disk dies, the
    striped array would go down, but all its data should be replicated on
    another datanode, anyway. And besides, I understand that datanode will
    shut down a node, even if only one of 3 independent disks crashes.

    So - any one want to agree or disagree with these thoughts? Anyone have
    any other ideas, or - better - benchmarks and experience with layouts
    like these two?

    Thanks!

    David
  • Steve Loughran at Jan 15, 2009 at 10:42 am

    Runping Qi wrote:
    Hi,

    We at Yahoo did some Hadoop benchmarking experiments on clusters with JBOD
    and RAID0. We found that under heavy loads (such as gridmix), JBOD cluster
    performed better.

    Gridmix tests:

    Load: gridmix2
    Cluster size: 190 nodes
    Test results:

    RAID0: 75 minutes
    JBOD: 67 minutes
    Difference: 10%

    Tests on HDFS writes performances

    We ran map only jobs writing data to dfs concurrently on different clusters.
    The overall dfs write throughputs on the jbod cluster are 30% (with a 58
    nodes cluster) and 50% (with an 18 nodes cluster) better than that on the
    raid0 cluster, respectively.

    To understand why, we did some file level benchmarking on both clusters.
    We found that the file write throughput on a JBOD machine is 30% higher than
    that on a comparable machine with RAID0. This performance difference may be
    explained by the fact that the throughputs of different disks can vary 30%
    to 50%. With such variations, the overall throughput of a raid0 system may
    be bottlenecked by the slowest disk.


    -- Runping
    This is really interesting. Thank you for sharing these results!

    Presumably the servers were all set up with "nominally" homogenous
    hardware? And yet still the variations existed. That would be something
    to experiment with on new versus old clusters to see if it gets worse
    over time.

    Here we have a batch of desktop workstations all bought at the same
    time, to the same spec, but one of them, "lucky" is more prone to race
    conditions than any of the others. We don't know why, and assume its do
    with the (multiple) Xeon CPU chips being at different ends of the bell
    curve or something. all we know is: test on that box before shipping to
    find race conditions early.

    -steve
  • Runping Qi at Jan 15, 2009 at 3:27 pm
    Yes, all the machines in the tests are new, with the same spec.
    The 30% to 50% throughput variations of the disks were observed on the disks
    of the same machines.

    Runping


    On 1/15/09 2:41 AM, "Steve Loughran" wrote:

    Runping Qi wrote:
    Hi,

    We at Yahoo did some Hadoop benchmarking experiments on clusters with JBOD
    and RAID0. We found that under heavy loads (such as gridmix), JBOD cluster
    performed better.

    Gridmix tests:

    Load: gridmix2
    Cluster size: 190 nodes
    Test results:

    RAID0: 75 minutes
    JBOD: 67 minutes
    Difference: 10%

    Tests on HDFS writes performances

    We ran map only jobs writing data to dfs concurrently on different clusters.
    The overall dfs write throughputs on the jbod cluster are 30% (with a 58
    nodes cluster) and 50% (with an 18 nodes cluster) better than that on the
    raid0 cluster, respectively.

    To understand why, we did some file level benchmarking on both clusters.
    We found that the file write throughput on a JBOD machine is 30% higher than
    that on a comparable machine with RAID0. This performance difference may be
    explained by the fact that the throughputs of different disks can vary 30%
    to 50%. With such variations, the overall throughput of a raid0 system may
    be bottlenecked by the slowest disk.


    -- Runping
    This is really interesting. Thank you for sharing these results!

    Presumably the servers were all set up with "nominally" homogenous
    hardware? And yet still the variations existed. That would be something
    to experiment with on new versus old clusters to see if it gets worse
    over time.

    Here we have a batch of desktop workstations all bought at the same
    time, to the same spec, but one of them, "lucky" is more prone to race
    conditions than any of the others. We don't know why, and assume its do
    with the (multiple) Xeon CPU chips being at different ends of the bell
    curve or something. all we know is: test on that box before shipping to
    find race conditions early.

    -steve

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupcommon-user @
categorieshadoop
postedJan 11, '09 at 9:23p
activeJan 15, '09 at 3:27p
posts10
users7
websitehadoop.apache.org...
irc#hadoop

People

Translate

site design / logo © 2022 Grokbase