recommendation on HDDs
What would be a good hard drive for a 7-node cluster targeted to run a mix
of IO- and CPU-intensive Hadoop workloads? We are looking for around 1 TB
of storage on each node, distributed amongst 4 or 5 disks: so either
4 x 250GB disks or 5 x 160GB disks. It should also be less than $100
each ;)

I looked at HDD benchmark comparisons on Tom's Hardware, StorageReview, etc.,
and got overwhelmed by the number of benchmarks and the different aspects of
HDD performance.

Appreciate your help on this.

-Shrinivas


  • Ted Dunning at Feb 10, 2011 at 8:44 pm
    Get bigger disks. Data only grows and having extra is always good.

    You can get 2TB drives for <$100 and 1TB for <$75.

    As far as transfer rates are concerned, any 3Gb/s SATA drive is going to be
    about the same (ish). Seek times will vary a bit with rotation speed, but
    with Hadoop you will be doing long reads and writes.

    Your controller and backplane will have a MUCH bigger vote in getting
    acceptable performance. With only 4 or 5 drives you don't have to worry
    about a super-duper backplane, but you can still kill performance with a
    lousy controller.
  • Chris Collins at Feb 10, 2011 at 8:51 pm
    Of late we have had serious issues with Seagate drives in our Hadoop cluster. These were purchased over several purchasing cycles, and we're pretty sure it wasn't just a single "bad batch". Because of this we switched to buying 2TB Hitachi drives, which seem to have been considerably more reliable.

    Best

    C
  • Shrinivas Joshi at Feb 10, 2011 at 9:47 pm
    Hi Ted, Chris,

    Much appreciate your quick reply. The reason we are looking for
    smaller-capacity drives is that we are not anticipating huge growth in our
    data footprint, and we also read somewhere that the larger the drive's
    capacity, the more platters it has, which could affect drive performance.
    But it looks like you can get 1TB drives with only 2 platters, so
    large-capacity drives should be OK for us as long as they perform equally
    well.

    Also, the systems that we have can host up to 8 SATA drives each. In that
    case, would backplanes offer additional advantages?

    Any suggestions on 5400 vs. 7200 vs. 10000 RPM disks? I guess 10K RPM disks
    would be overkill considering their perf/cost advantage?

    Thanks for your inputs.

    -Shrinivas
  • Ted Dunning at Feb 10, 2011 at 10:12 pm
    We see well over 100MB/s off of commodity 2TB drives.
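    A rough way to sanity-check a number like that on a candidate drive is a
    plain dd run. This is only a sketch: /data1 is a hypothetical mount point
    for one data disk, and conv=fdatasync makes dd report a rate only after the
    data actually reaches the disk rather than the page cache.

        # sequential write, 1GB
        dd if=/dev/zero of=/data1/ddtest bs=1M count=1024 conv=fdatasync
        # drop the page cache so the read pass measures the disk, not RAM
        sync && echo 3 > /proc/sys/vm/drop_caches
        # sequential read
        dd if=/data1/ddtest of=/dev/null bs=1M
        rm /data1/ddtest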
  • Michael Segel at Feb 10, 2011 at 10:25 pm
    Shrinivas,

    Assuming you're in the US, I'd recommend the following:

    Go with 2TB 7200 SATA hard drives.
    (Not sure what type of hardware you have)

    What we've found is that in the data nodes, there's an optimal configuration that balances price versus performance.

    While your chassis may hold 8 drives, how many open SATA ports are on the motherboard? Since you're using JBOD, you don't want the additional expense of having to purchase a separate controller card for the additional drives.

    I'm running Seagate drives at home and I haven't had any problems for years.
    When you look at your drive, you need to know total storage, speed (rpms), and cache size.
    Looking at Microcenter's pricing: a 2TB 3.0Gb/s SATA Hitachi was $110.00, a 1TB Seagate was $70.00,
    and a 250GB SATA drive was $45.00.

    So 2TB of storage costs $110 (one 2TB drive), $140 (two 1TB drives), or $360 (eight 250GB drives): roughly $0.055, $0.070, and $0.180 per GB, respectively.

    So you get a better deal on 2TB.

    So if you go out and get more drives of lower density, you'll end up spending more money and using more energy, and I doubt you'll see a real performance difference.

    The other thing is that if you want to add more disk later, you have room to grow. (Just add more disk and restart the node, right?)
    If all of your disk slots are already filled, you're SOL. You have to take the box out, replace all of the drives, then add it back to the cluster as a 'new' node.

    Just my $0.02.

    HTH

    -Mike
  • Shrinivas Joshi at Feb 11, 2011 at 11:52 pm
    Thanks for your inputs, Michael. We have 6 open SATA ports on the
    motherboards; that is why we are thinking of 4 to 5 data disks and 1 OS
    disk.
    Are you suggesting the use of one 2TB disk instead of, let's say, four
    500GB disks? I thought that HDFS utilization/throughput increases with the
    number of disks per node (assuming that the total usable IO bandwidth
    increases proportionally).

    -Shrinivas
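    For reference, the per-node disk count shows up in HDFS as the number of
    entries in dfs.data.dir (the 0.20-era property name); the DataNode spreads
    block writes across all listed directories, which is where the
    extra-spindle bandwidth comes from. A minimal hdfs-site.xml sketch, with
    hypothetical mount points, one per physical disk:

        <property>
          <name>dfs.data.dir</name>
          <value>/data1/dfs/data,/data2/dfs/data,/data3/dfs/data,/data4/dfs/data</value>
        </property>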
  • Ted Dunning at Feb 12, 2011 at 12:15 am
    Bandwidth is definitely better with more active spindles. I would recommend
    several larger disks. The cost is very nearly the same.
  • Edward Capriolo at Feb 12, 2011 at 3:43 pm

    You also do not need a dedicated OS disk. I typically slice out partitions
    on a couple of the disks and set up a software mirror there. This gives you
    redundancy without having to sacrifice one or two disk slots on smaller
    disks.
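    A minimal sketch of that layout, assuming two hypothetical disks sda and
    sdb: a small partition on each is mirrored with mdadm for the OS, and the
    rest of each disk stays as plain JBOD space for HDFS.

        # mirror the small OS partitions (sda1/sdb1 are assumed names)
        mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
        mkfs.ext3 /dev/md0              # mount as /
        # the large partitions stay unmirrored, one filesystem per disk
        mkfs.ext3 /dev/sda2             # mount as /data1
        mkfs.ext3 /dev/sdb2             # mount as /data2; list both in dfs.data.dir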
  • Michael Segel at Feb 12, 2011 at 4:27 pm
    All,

    I'd like to clarify some things...

    First, the concept is to build out a cluster of commodity hardware, so when you do your shopping you want to get the most bang for your buck. That is the 'sweet spot' I'm talking about.
    When you look at the E5500 or E5600 chipsets, you will want to go with 4 cores per CPU, dual CPUs, and a clock speed around 2.53GHz or so.
    (Faster chips are more expensive and the performance edge falls off, so you end up paying a premium.)

    Looking at your disks, you start with the onboard SATA controller. Why? Because it means you don't have to pay for a controller card.
    If you are building a cluster for general-purpose computing, then assuming 1U boxes you have room for four 3.5" SATA drives, which still give you the best performance for your buck.
    Can you go with 2.5"? Yes, but you are going to pay a premium.

    Price-wise, a 2TB SATA II 7200 RPM drive is going to be your best deal. You could go with SATA III drives if your motherboard has SATA III ports, but you're still paying a slight premium.

    The OP felt that all he would need was 1TB of disk and was considering four 250GB drives. (More spindles... yada yada yada...)

    My suggestion is to forget that nonsense and go with one 2TB drive, because it's a better deal, and if you want to add more disk to the node, you can. (It's easier to add disk than it is to replace it.)

    Now do you need to create a spare OS drive? No. Some people who have an internal 3.5" bay sometimes do. That's OK, and you can put your Hadoop logging there. (Just make sure you have a lot of disk space...)

    The truth is that there really isn't any single *right* answer. There are a lot of options and budget constraints, as well as physical constraints like power, space, and location of the hardware.

    Also, you may be building out a cluster whose main purpose is to be a backup location for your production cluster. Your production cluster has lots of nodes; your backup cluster has lots of disks per node, because your main focus is as much storage per node as possible.

    So here you may end up buying a 4U rack box and loading it up with 3.5" drives and a couple of SATA controller cards. You care less about performance and more about storage space. Here you might say 3TB SATA drives, with 12 or more per box. (I don't know how many you can fit into a 4U chassis these days.) So you have 10 DNs backing up a 100+ DN cluster in your main data center. But that's another story.

    I think the main takeaway is that if you look at the price point, your best price per GB is on a 2TB drive, until the prices drop on 3TB drives.
    Since the OP believes that their requirement is 1TB per node, a single 2TB drive would be the best choice. It allows for additional space, and you really shouldn't be too worried about disk I/O being your bottleneck.

    HTH

    -Mike

  • Ted Dunning at Feb 12, 2011 at 7:23 pm
    The original poster also seemed somewhat interested in disk bandwidth.

    That is facilitated by having more than one disk in the box.
  • Shrinivas Joshi at Feb 15, 2011 at 4:49 pm
    Thanks much to all who shared their inputs. This really helps. It would be
    nice to have a wiki page collecting all this good information; I will look
    into that. We are definitely going with large-capacity disks (>= 1TB).

    -Shrinivas
  • Ted Dunning at Feb 15, 2011 at 5:32 pm
    Good idea!

    Would you like to create the nucleus of such a page? (There might already
    be something like that.)
  • zGreenfelder at Feb 15, 2011 at 5:49 pm
    (Un-top-posting everything.)
    I think the guidelines would be good to capture, but that seems like it'd
    be more of a footnote or subsection to a larger hardware
    notes/specs/suggestions page, with some guides for picking processors,
    memory, et al. (maybe also noting which OS flavors are known to have
    particular upsides and downsides). It was noted no less than three times
    in this thread that this is a very fluid target, and completely reasonable
    choices today (e.g. X TB SATA drives) might be viewed as silly in a year
    or six months.

    That's my personal opinion, anyway.



    --
    Even the Magic 8 ball has an opinion on email clients: Outlook not so good.
  • Shrinivas Joshi at Feb 18, 2011 at 11:30 pm
    There seems to be a wiki page already intended for capturing information on
    disks in a Hadoop environment: http://wiki.apache.org/hadoop/DiskSetup

    Do we just want to link the thread on HDD recommendations from this wiki
    page?

    -Shrinivas
  • Ted Dunning at Feb 19, 2011 at 1:14 am
    Better to provide a summary as well as the link.

  • Steve Loughran at Feb 14, 2011 at 11:24 am

    On 12/02/11 16:26, Michael Segel wrote:
    All,

    I'd like to clarify somethings...

    First the concept is to build out a cluster of commodity hardware.
    So when you do your shopping you want to get the most bang for your buck. That is the 'sweet spot' that I'm talking about.
    When you look at your E5500 or E5600 chip sets, you will want to go with 4 cores per CPU, dual CPU and a clock speed around 2.53GHz or so.
    (Faster chips are more expensive and the performance edge falls off so you end up paying a premium.)
    Interesting choice; the 7-core single-CPU option is something else to
    consider. Remember also that this is a moving target: what anyone says is
    valid now (Feb 2011) will be seen as quaint in two years' time. Even a few
    months from now, the best value for a cluster will have moved on.
    Looking at your disks, you start with the onboard SATA controller. Why? Because it means you don't have to pay for a controller card.
    If you are building a cluster for general-purpose computing, then assuming 1U boxes you have room for four 3.5" SATA drives, which still give you the best performance for your buck.
    Can you go with 2.5"? Yes, but you are going to pay a premium.

    Price-wise, a 2TB SATA II 7200 RPM drive is going to be your best deal. You could go with SATA III drives if your motherboard has SATA III ports, but you're still paying a slight premium.

    The OP felt that all he would need was 1TB of disk and was considering four 250GB drives. (More spindles... yada yada yada...)

    My suggestion is to forget that nonsense and go with one 2TB drive, because it's a better deal, and if you want to add more disk to the node, you can. (It's easier to add disk than it is to replace it.)

    Now do you need to create a spare OS drive? No. Some people who have an internal 3.5" bay sometimes do. That's OK, and you can put your Hadoop logging there. (Just make sure you have a lot of disk space...)
    One advantage of a specific drive for OS and logs (in a separate
    partition) is that you can re-image it without losing data you care about,
    and swap in a replacement fast. If you have a small cluster set up for
    hot-swap, that reduces the time a node is down: just have a spare OS HDD
    ready to put in. OS disks are the ones you care about when they fail;
    the others are more "mildly concerned about the failure rate" than
    something to page you over.
    The truth is that there really isn't any single *right* answer. There are a lot of options and budget constraints as well as physical constraints like power, space, and location of the hardware.
    +1. Don't forget weight either.
    Also, you may be building out a cluster whose main purpose is to be a backup location for your production cluster. Your production cluster has lots of nodes; your backup cluster has lots of disks per node, because your main focus is as much storage per node as possible.

    So here you may end up buying a 4U rack box and loading it up with 3.5" drives and a couple of SATA controller cards. You care less about performance and more about storage space. Here you might say 3TB SATA drives, with 12 or more per box. (I don't know how many you can fit into a 4U chassis these days.) So you have 10 DNs backing up a 100+ DN cluster in your main data center. But that's another story.
    You can get 12 HDDs in a 1U if you ask nicely, but in a small cluster
    there's a cost: that server becomes a big chunk of your filesystem, and
    if it goes down there's up to 24TB worth of replication going to take
    place over the rest of the network, so you'll need at least 24TB of
    spare capacity on the other machines, ignoring bandwidth issues.
    I think the main takeaway is that if you look at the price point, your best price per GB is on a 2TB drive, until the prices drop on 3TB drives.
    Since the OP believes that their requirement is 1TB per node, a single 2TB drive would be the best choice. It allows for additional space, and you really shouldn't be too worried about disk I/O being your bottleneck.

    One less thing to worry about is good.
  • Michael Segel at Feb 14, 2011 at 12:48 pm
    Steve is right, and to try and add more clarification...
    Interesting choice; the 7-core single-CPU option is something else to
    consider. Remember also that this is a moving target: what anyone says is
    valid now (Feb 2011) will be seen as quaint in two years' time. Even a few
    months from now, the best value for a cluster will have moved on.
    I've never seen a 7-core chip; 4, 6, and now 8? (Cores, not counting hyperthreading.)
    The point Steve is making is that the price point for picking the optimum hardware keeps moving, and what we see today doesn't mean we won't see a better optimal configuration tomorrow. More importantly, what is optimal for one user isn't going to be optimal for another.

    The other issue that we haven't even talked about is whether you want to go 'white box' and build your own, or have your IT shop pick up the phone and call Dell, HP, IBM, or whoever supplies your hardware. That too will limit your options and affect your budget.

    In addition, you have to look at what expectations are realistic. There are a lot of factors to weigh when making hardware decisions, including how clean your developers' code is going to be and what resources they will require to run on your cloud.

    And if you run HBase or Cloudbase, you add more variables.

    The key is finding out which combination of variables is going to be the most important for you to get the most out of your hardware.

    Ok...
    I'll get off my soapbox for now and go get my first cup of coffee. :-)

    -Mike

  • James Seigel at Feb 12, 2011 at 6:37 pm
    The only thing of concern is that HDFS doesn't seem to do exceptionally
    well with different-sized disks in practice.

    James

    Sent from my mobile. Please excuse the typos.
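    One era-appropriate mitigation, sketched below with an illustrative value:
    the DataNode places blocks across its volumes round-robin, so smaller disks
    tend to fill first, and dfs.datanode.du.reserved (the 0.20-era property
    name) at least keeps HDFS from filling any volume completely.

        <property>
          <name>dfs.datanode.du.reserved</name>
          <!-- bytes of non-HDFS headroom kept on each volume; 10GB here is illustrative -->
          <value>10737418240</value>
        </property>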
  • Steve Loughran at Feb 14, 2011 at 10:54 am

    On 10/02/11 22:25, Michael Segel wrote:
    While your chassis may hold 8 drives, how many open SATA ports are on the motherboard? Since you're using JBOD, you don't want the additional expense of having to purchase a separate controller card for the additional drives.
    I'm not going to disagree about cost, but I will note that a single
    controller can become a bottleneck once you add a lot of disks to it; it
    generates lots of interrupts that all go to the same core, which then ends
    up at 100% CPU and overloaded. With two controllers the work can get
    spread over two CPUs, moving the bottleneck back into the IO channels.

    For that reason I'd limit the number of disks on a single controller to
    around 4-6.
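    A quick way to look for this on a live node (a sketch; device names vary
    and the IRQ number below is hypothetical):

        # are all the SATA/AHCI interrupts landing on one core?
        grep -iE 'ahci|sata' /proc/interrupts
        # each column is a per-CPU count; if one column dominates, the IRQ can
        # be re-pinned, e.g. move hypothetical IRQ 19 onto CPU1:
        echo 2 > /proc/irq/19/smp_affinity    # hex bitmask: bit 1 = CPU1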

    Remember that as well as storage capacity, you need disk space for logs,
    spill space, temp dirs, etc. This is why 2TB HDDs are looking appealing
    these days.

    Speed? 10K RPM gives faster seek times and possibly more bandwidth, but you
    pay in capital and power. If the HDFS blocks are laid out well, seek time
    isn't so important, so consider saving the money and putting it elsewhere.

    The other big question with Hadoop is RAM and CPU, and the answer there
    is "it depends". RAM depends on the algorithm, as can the CPU:spindle
    ratio... I recommend 1 core to 1 spindle as a good starting point. In a
    large cluster, the extra capital cost of a second CPU, compared to the
    amount of extra servers and storage you could get for the same money,
    speaks in favour of more servers; in smaller clusters the spreadsheets
    say different things.
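    One place the core:spindle ratio shows up in practice is the per-node task
    slot settings. A hedged sketch for mapred-site.xml on a hypothetical
    8-core, 4-disk node, using the 0.20-era property names; the values are
    illustrative, not a recommendation:

        <property>
          <name>mapred.tasktracker.map.tasks.maximum</name>
          <!-- leave headroom for the DataNode and TaskTracker daemons -->
          <value>6</value>
        </property>
        <property>
          <name>mapred.tasktracker.reduce.tasks.maximum</name>
          <value>2</value>
        </property>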

    -Steve

    (disclaimer, I work for a server vendor :)
