FAQ
Does anyone have any recommendations for / against using a NAS / SAN system
as the underlying physical storage for a hadoop cluster, instead of local
data node DAS?

  • Brian Bockelman at Dec 22, 2009 at 4:35 pm
    Things to consider are cost, reliability, scalability, and what equipment you might already own.

    - SAN / NAS: generally less reliable than HDFS in terms of "how much data do you lose if lightning strikes a box?". Many SAN/NAS solutions start with the assumption that a given piece of hardware will never fail; I have found this to be a lousy assumption at our site.
    - At today's disk failure rates, you can expect 2 dead disks a day for a petabyte-scale solution (see the rough estimate sketched after this list). Keep this in mind for your plans. An HDFS-based solution will recover nicely from disk deaths.
    - Local DAS can be more scalable, depending on your application.
    - If you already own a SAN/NAS and it is sufficient for your install, don't throw out the equipment. Use it.
    - Local DAS comes in cheaper *if* you need to buy the computational power anyway.
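
    As a rough illustration of the disk-death planning point above, the sketch below estimates expected disk failures per day from raw capacity, disk size, and an annualized failure rate. The capacity, replication factor, disk size, and failure rate are assumptions chosen for illustration, not figures from this thread.

        # Back-of-envelope estimate of daily disk failures for a large HDFS cluster.
        # All figures below are illustrative assumptions, not numbers from this thread.

        def expected_daily_failures(raw_capacity_tb, disk_size_tb, annual_failure_rate):
            """Expected disk failures per day for a pool of raw_capacity_tb worth of disks."""
            num_disks = raw_capacity_tb / disk_size_tb
            return num_disks * annual_failure_rate / 365.0

        # Example: ~5 PB of usable data at 3x HDFS replication on 1 TB disks,
        # with an assumed ~5% annualized disk failure rate.
        raw_tb = 5000 * 3
        print("%.1f expected disk failures per day" % expected_daily_failures(raw_tb, 1.0, 0.05))

    With those assumed numbers the estimate lands at roughly two failures a day, in the same ballpark as the figure above; the practical point is that at this scale disk replacement is routine, and HDFS re-replicates the lost blocks automatically.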

    A lot of this comes down to what your operations staff is used to.
    - If you have deep experience with a vendor-supported file system (e.g., GPFS), I'd recommend continuing to use it.
    - If you have no background in this area, you would probably benefit from Hadoop support from a company like Cloudera.

    Hope this helps - you didn't give much background on your specific situation, so I can only answer in very general terms.

    Brian
  • Doopah Shaf at Dec 22, 2009 at 6:58 pm
    My setup is an existing farm based on a central NetApp; I'm looking to scale out and am considering Hadoop as a data processing / DWH alternative. Does this add any relevant details to the answer?
    Thanks.
  • Weiming Lu at Dec 23, 2009 at 11:10 am
    I have also been considering this. I want to install a Hadoop cluster on several blade servers that are connected to a disk array. Data would be stored in the disk array, and the blade servers would be used for computation.
    Can MapReduce benefit from such an architecture?
    Could the communication between the blade servers and the disk array become a bottleneck?

    Thanks,
    Weiming
  • Eli Collins at Dec 23, 2009 at 8:18 pm
    Could the communication between the blade servers and the disk array become a bottleneck?
    Yes - depending on the number of blades, the network into the array will become a bottleneck, because its bandwidth does not scale with the number of DataNodes.
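
    A quick way to see this is to compare the shared link into the array against the aggregate local-disk bandwidth of the same blades. The sketch below uses assumed numbers (link speed, blade count, disks per node, per-disk throughput) purely for illustration.

        # Per-node read bandwidth: one shared link into a disk array vs. local DAS.
        # All figures below are illustrative assumptions, not numbers from this thread.

        def shared_array_mb_per_node(link_gbps, num_nodes):
            """MB/s each node gets when every node reads through one link into the array."""
            return (link_gbps * 1000 / 8) / num_nodes   # Gb/s -> MB/s, split across nodes

        def local_das_mb_per_node(disks_per_node, disk_mb_s):
            """MB/s each node gets from its own directly attached disks."""
            return disks_per_node * disk_mb_s

        nodes = 20  # number of blades
        # Assume a single 10 Gb/s link into the array, and 4 local SATA disks
        # per blade at ~80 MB/s sequential read each.
        print("shared array: %.0f MB/s per node" % shared_array_mb_per_node(10, nodes))
        print("local DAS:    %.0f MB/s per node" % local_das_mb_per_node(4, 80))

    Under those assumptions each blade sees roughly 60 MB/s through the shared link but around 320 MB/s from its own disks, and the shared figure keeps shrinking as blades are added while the local figure does not.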

Discussion Overview
group: common-user @ hadoop.apache.org
categories: hadoop
posted: Dec 22, '09 at 4:25p
active: Dec 23, '09 at 8:18p
posts: 5
users: 4
website: hadoop.apache.org...
irc: #hadoop
