Hi all,

I am not a hardware guy, but I am about to set up a 10-node cluster for
some processing of (mostly) tab-delimited files, generating various
indexes, and researching HBase, Mahout, Pig, Hive, etc.

Could someone please sanity check that these specs look sensible?
[I know 4 drives would be better, but price is a factor (second-hand is
not an option, and neither is hosting, as there is very good bandwidth
provided)]

Something along the lines of:

Dell R200 (8GB is max memory)
Quad Core Intel® Xeon® X3360, 2.83GHz, 2x6MB Cache, 1333MHz FSB
8GB Memory, DDR2, 800MHz (4x2GB Dual Ranked DIMMs)
2x 500GB 7,200 rpm 3.5-inch SATA Hard Drive


Dell R300 (can be expanded to 24GB RAM)
Quad Core Intel® Xeon® X3363, 2.83GHz, 2x6MB Cache, 1333MHz FSB
8GB Memory, DDR2, 667MHz (2x4GB Dual Ranked DIMMs)
2x 500GB 7,200 rpm 3.5-inch SATA Hard Drive


If there is a major flaw, please let me know.

Thanks,

Tim
(not a hardware guy ;o)

  • Miles Osborne at Apr 2, 2009 at 2:16 pm
    Make sure you also have a fast switch, since you will be transmitting
    data across your network and a slow one will come back to bite you.

    (Roughly, you need one core per Hadoop-related process: each mapper,
    the task tracker, etc. The per-core memory may be too small if you are
    doing anything memory-intensive. We have 8-core boxes with 50 -- 33 GB
    RAM and 8 x 1 TB disks on each one; one box, however, has just 16 GB of
    RAM, and it routinely falls over when we run jobs on it.)

    Miles

    --
    The University of Edinburgh is a charitable body, registered in
    Scotland, with registration number SC005336.
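The one-core-per-process rule of thumb above can be turned into a rough back-of-the-envelope check. This is only a sketch: the two reserved cores (one each for a data node and a task tracker daemon) and the 1 GB figures for the OS and for each daemon heap are assumptions for illustration, not numbers from this thread.

```python
def memory_per_task_gb(cores, ram_gb, daemons=2,
                       os_reserve_gb=1.0, daemon_heap_gb=1.0):
    """Rough per-task memory under a one-core-per-process rule of thumb.

    daemons, os_reserve_gb and daemon_heap_gb are assumed values; adjust
    them to match the real node before trusting the result.
    """
    # Cores left for map/reduce tasks once the daemons have one each.
    task_slots = max(cores - daemons, 1)
    # RAM left for tasks after the OS and the daemon heaps.
    ram_for_tasks = ram_gb - os_reserve_gb - daemons * daemon_heap_gb
    return task_slots, ram_for_tasks / task_slots


# The quad-core, 8 GB boxes from the original post.
slots, per_task = memory_per_task_gb(cores=4, ram_gb=8)
print(f"{slots} task slots, {per_task} GB per task")
```

Under these assumptions a quad-core, 8 GB node ends up with two task slots. Whether the memory per slot is enough depends on the heap the jobs actually need.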
  • Tim robertson at Apr 2, 2009 at 3:33 pm
    Thanks Miles,

    Thus far most of my work has been on EC2 large instances, and *mostly*
    my code is not memory-intensive (I sometimes do joins against polygons
    and hold geospatial indexes in memory, but I am careful to keep things
    within the -Xmx for this).
    I am mostly looking to move routine data processing and
    transformation (lots of distinct, count and group-by operations) off a
    chunky MySQL DB (200 million rows and growing), which gets all locked
    up.

    We have gigabit switches.

    Cheers

    Tim
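For readers unfamiliar with how "distinct, count and group by" translate to MapReduce, here is a minimal single-process sketch of the idea. It is not Hadoop code: the function names and the sample rows are made up for illustration, and a real job would run the map and reduce steps on different nodes with a shuffle in between.

```python
from collections import defaultdict


def map_phase(lines, key_column):
    """Emit (key, 1) for every tab-delimited row, like a mapper would."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        yield fields[key_column], 1


def reduce_phase(pairs):
    """Sum the values per key, like a reducer after the shuffle."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


# Hypothetical tab-delimited rows: id, name, country.
rows = [
    "1\tPuma concolor\tUS",
    "2\tPuma concolor\tMX",
    "3\tPanthera leo\tKE",
]

counts = reduce_phase(map_phase(rows, key_column=1))
print(counts)       # group-by count per name
print(len(counts))  # number of distinct names
```

The same pair of steps gives both results: the per-key sums are the group-by counts, and the number of keys is the distinct count.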


  • Patrick Angeles at Apr 2, 2009 at 8:51 pm
    I had a similar question, but more about disk speed.
    Can I assume a linear improvement going from 7,200 rpm to 10k rpm to
    15k rpm? How much of a bottleneck is disk access?

    Another question is regarding hardware redundancy. What is the relative
    value of the following:
    - RAID / hot-swappable drives
    - dual NICs
    - redundant backplane
    - redundant power supply
    - UPS

    I've been assuming that RAID is generally a good idea (disks fail quite
    often, and it's cheaper to hotswap a drive than to rebuild an entire box).
    Dual NICs are also good, as both can be used at the same time. Everything
    else is not necessary in a Hadoop cluster.
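On the rpm question, one part can be worked out directly: average rotational latency is half a revolution, so it scales inversely with rpm. The sketch below computes only that component. Seek time and sequential transfer rate do not follow from rpm alone, so this does not by itself say how overall job throughput would change.

```python
def avg_rotational_latency_ms(rpm):
    """Average wait for a sector to come under the head, in milliseconds."""
    # One revolution takes 60,000 / rpm ms; on average we wait half of one.
    return 60_000 / rpm / 2


for rpm in (7200, 10_000, 15_000):
    print(f"{rpm} rpm: {avg_rotational_latency_ms(rpm):.2f} ms")
```

This matters most for random access. For large sequential reads and writes, the drive's sustained transfer rate is likely the more relevant figure.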
  • Philip Zeyliger at Apr 2, 2009 at 10:49 pm


    > I've been assuming that RAID is generally a good idea (disks fail quite
    > often, and it's cheaper to hotswap a drive than to rebuild an entire box).

    Hadoop data nodes are often configured without RAID (i.e., as "JBOD" =
    Just a Bunch of Disks); HDFS already provides the data redundancy. Also,
    if you stripe across disks, you're liable to be as slow as the slowest of
    your disks, so data nodes are typically configured to point to multiple
    disks.

    -- Philip
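As a rough illustration of the JBOD setup described above: in Hadoop releases of that period, a data node was pointed at several disks by giving the `dfs.data.dir` property a comma-separated list of directories, one per disk. The sketch below builds such a value and estimates usable HDFS capacity for the 10-node, 2 x 500 GB cluster from the original post. The mount points and the `hdfs/data` subdirectory are hypothetical, and the replication factor of 3 is assumed (it is the HDFS default); space needed for the OS and intermediate job output is ignored.

```python
def dfs_data_dir_value(mount_points, subdir="hdfs/data"):
    """Build a comma-separated directory list, one entry per physical disk."""
    return ",".join(f"{m.rstrip('/')}/{subdir}" for m in mount_points)


def usable_capacity_tb(nodes, disks_per_node, disk_gb, replication=3):
    """Raw capacity divided by the replication factor, in TB (1 TB = 1000 GB)."""
    raw_gb = nodes * disks_per_node * disk_gb
    return raw_gb / replication / 1000


# Hypothetical mount points for the two drives in each node.
print(dfs_data_dir_value(["/mnt/disk1", "/mnt/disk2"]))

# 10 nodes, 2 x 500 GB each, assumed replication factor of 3.
print(f"{usable_capacity_tb(10, 2, 500):.2f} TB usable")
```

With replication handled by HDFS, losing one disk costs only the blocks on that disk, which are re-replicated from copies on other nodes.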

Discussion Overview
group: common-user @
categories: hadoop
posted: Apr 2, '09 at 2:02p
active: Apr 2, '09 at 10:49p
posts: 5
users: 4
website: hadoop.apache.org...
irc: #hadoop
