hardware specs for hadoop nodes
What are reasonable hardware specifications for a Hadoop node?
Can we document this somewhere (maybe in the wiki as HowToConfigureHardware?)

Obviously this will be a moving target, but some guidance about how much
CPU vs. memory vs. disk space is typical would be helpful.

As one datapoint, we are running some boxes that are 4-core, 64-bit,
2GHz machines with 4GB of memory and [I think] 2 x 750GB disks. I
think if I could I'd put 4 x 750GB disks in these boxes. I believe this
configuration is basically the same as what came up in Yahoo!'s recent
sort benchmark.

Other datapoints anyone?

And what about, say on the namenode? People talk about it being a
memory bottleneck, but ours is underutilized.

Should we start a wiki page about this?

-John Heidemann


  • Ted Dunning at Sep 11, 2007 at 12:05 am
    We have an oddball collection of machines. Most are in the class you
    mention, although some have a single dual-core CPU and some have 12GB of
    memory. We plan to use developer workstations at night (real soon now),
    which typically have 1-3GB of memory and a single CPU.

    Our name node is very lightly used because the files we analyze are pretty
    good sized (we produce only a few consolidated files per hour).

    On 9/10/07 4:54 PM, "John Heidemann" wrote:


    What are reasonable hardware specifications for a Hadoop node?
    Can we document this somewhere (maybe in the wiki as HowToConfigureHardware?)

    Obviously this will be a moving target, but some guidance about how much
    CPU vs. memory vs. disk space is typical would be helpful.

    As one datapoint, we are running some boxes that are 4-core, 64-bit,
    2GHz machines with 4GB of memory and [I think] 2 x 750GB disks. I
    think if I could I'd put 4 x 750GB disks in these boxes. I believe this
    configuration is basically the same as what came up in Yahoo!'s recent
    sort benchmark.

    Other datapoints anyone?

    And what about, say on the namenode? People talk about it being a
    memory bottleneck, but ours is underutilized.

    Should we start a wiki page about this?

    -John Heidemann
  • Bob Futrelle at Sep 25, 2007 at 4:28 pm
    How does Hadoop handle multi-core CPUs? Does each core run a distinct copy
    of the mapped app? Is this automatic, or need some configuration, or what?

    I'm in the market to buy a few machines to set up a small cluster and am
    wondering what I should consider.

    Or should I just spread Hadoop over some friendly machines already in my
    College, buying nothing?

    - Bob


    Ted Dunning-3 wrote:

    We have an oddball collection of machines. Most are in the class you
    mention, although some have a single dual-core CPU and some have 12GB of
    memory. We plan to use developer workstations at night (real soon now),
    which typically have 1-3GB of memory and a single CPU.

    Our name node is very lightly used because the files we analyze are pretty
    good sized (we produce only a few consolidated files per hour).

    On 9/10/07 4:54 PM, "John Heidemann" wrote:


    What are reasonable hardware specifications for a Hadoop node?
    Can we document this somewhere (maybe in the wiki as
    HowToConfigureHardware?)

    Obviously this will be a moving target, but some guidance about how much
    CPU vs. memory vs. disk space is typical would be helpful.

    As one datapoint, we are running some boxes that are 4-core, 64-bit,
    2GHz machines with 4GB of memory and [I think] 2 x 750GB disks. I
    think if I could I'd put 4 x 750GB disks in these boxes. I believe this
    configuration is basically the same as what came up in Yahoo!'s recent
    sort benchmark.

    Other datapoints anyone?

    And what about, say on the namenode? People talk about it being a
    memory bottleneck, but ours is underutilized.

    Should we start a wiki page about this?

    -John Heidemann
    --
    View this message in context: http://www.nabble.com/hardware-specs-for-hadoop-nodes-tf4419439.html#a12883435
    Sent from the Hadoop Users mailing list archive at Nabble.com.
  • Ted Dunning at Sep 25, 2007 at 4:38 pm

    On 9/25/07 9:27 AM, "Bob Futrelle" wrote:


    > How does Hadoop handle multi-core CPUs? Does each core run a distinct copy
    > of the mapped app? Is this automatic, or need some configuration, or what?

    Works fine. You need to tell it how many maps to run per machine. I expect
    that this can be tuned per machine.

    > Or should I just spread Hadoop over some friendly machines already in my
    > College, buying nothing?

    Or both? You will get interesting results all three ways.
  • Michael Bieniosek at Sep 25, 2007 at 5:09 pm
    For our CPU-bound application, I set the value of mapred.tasktracker.tasks.maximum (number of map tasks per tasktracker) equal to the number of CPUs on a tasktracker. Unfortunately, I think this value has to be set per cluster, not per machine. This is okay for us because our machines have similar hardware, but it might be a problem if your machines have different numbers of CPUs. (A hadoop-site.xml sketch with an illustrative value follows at the end of this message.)

    I created HADOOP-1245 a long time ago for this problem, but I've since heard that Hadoop uses only the cluster value for maps per tasktracker, not the hybrid model I describe. In any case, I never did any work on fixing it because I don't need heterogeneous clusters.

    -Michael

    On 9/25/07 9:37 AM, "Ted Dunning" wrote:
    > On 9/25/07 9:27 AM, "Bob Futrelle" wrote:
    >
    >> How does Hadoop handle multi-core CPUs? Does each core run a distinct copy
    >> of the mapped app? Is this automatic, or need some configuration, or what?
    >
    > Works fine. You need to tell it how many maps to run per machine. I expect
    > that this can be tuned per machine.
    >
    >> Or should I just spread Hadoop over some friendly machines already in my
    >> College, buying nothing?
    >
    > Or both? You will get interesting results all three ways.
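
    For reference, a minimal hadoop-site.xml sketch of the tasktracker setting
    discussed above. The value is only an example for a quad-core box, and (as
    noted above) whether a tasktracker's own config or the cluster-wide value
    actually wins is exactly the HADOOP-1245 question:

      <!-- hadoop-site.xml on a tasktracker; the value 4 is illustrative -->
      <property>
        <name>mapred.tasktracker.tasks.maximum</name>
        <value>4</value>
        <description>Maximum number of tasks run simultaneously by this
        tasktracker, here matched to a quad-core machine.</description>
      </property>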
  • Ross Boucher at Sep 25, 2007 at 5:19 pm

    On Sep 25, 2007, at 10:09 AM, Michael Bieniosek wrote:

    > For our CPU-bound application, I set the value of
    > mapred.tasktracker.tasks.maximum (number of map tasks per
    > tasktracker) equal to the number of CPUs on a tasktracker.
    > Unfortunately, I think this value has to be set per cluster, not
    > per machine. This is okay for us because our machines have similar
    > hardware, but it might be a problem if your machines have different
    > numbers of CPUs.

    I did some experimentation with the number of tasks per machine on a
    set of quad-core boxes. I couldn't figure out how to change this
    value without stopping the cluster and restarting it, and I also
    couldn't figure out how to tune it on a per-machine basis (though it
    didn't matter much for me either).

    My test had no reduce phase, so I simply set the reduce count to 1
    per machine for all the tests. On the quad-core boxes, 5 map tasks
    per machine actually performed the best, but only marginally better
    than 4 map tasks (about 4% with just one box in the cluster, 2% with
    4 boxes). Six tasks started to trend back in the other direction.

    > I created HADOOP-1245 a long time ago for this problem, but I've
    > since heard that Hadoop uses only the cluster value for maps per
    > tasktracker, not the hybrid model I describe. In any case, I never
    > did any work on fixing it because I don't need heterogeneous clusters.
    >
    > -Michael
    >
    > On 9/25/07 9:37 AM, "Ted Dunning" wrote:
    >
    >> On 9/25/07 9:27 AM, "Bob Futrelle" wrote:
    >>
    >>> How does Hadoop handle multi-core CPUs? Does each core run a distinct
    >>> copy of the mapped app? Is this automatic, or need some configuration,
    >>> or what?
    >>
    >> Works fine. You need to tell it how many maps to run per machine. I
    >> expect that this can be tuned per machine.
    >>
    >>> Or should I just spread Hadoop over some friendly machines already in
    >>> my College, buying nothing?
    >>
    >> Or both? You will get interesting results all three ways.

  • Ted Dunning at Sep 25, 2007 at 7:35 pm
    A day or so ago somebody mentioned that you could set the reduce count to 0
    to store the output of the map directly.

    That would be a very handy trick, but I haven't done it myself. (A sketch
    of that setting follows at the end of this message.)

    On 9/25/07 10:18 AM, "Ross Boucher" wrote:

    My test had no reduce phase, so I simply set the reduce count to 1
    per machine for all the tests.
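
    A minimal sketch of the zero-reduce setting mentioned above, expressed as a
    job-level property (the same thing can be set in code with
    JobConf.setNumReduceTasks(0)); the value shown is the whole trick:

      <!-- per-job configuration; illustrative -->
      <property>
        <name>mapred.reduce.tasks</name>
        <value>0</value>
        <description>With zero reduce tasks, each map task's output is
        written directly to the job's output path and the shuffle, sort
        and reduce phases are skipped entirely.</description>
      </property>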
  • Allen Wittenauer at Sep 25, 2007 at 5:03 pm
    On 9/25/07 9:27 AM, "Bob Futrelle" wrote:

    > I'm in the market to buy a few machines to set up a small cluster and am
    > wondering what I should consider.

    If it helps, we're using quad-core x86s with anywhere from 4GB to 16GB of
    RAM. We've got 4 x 500GB SATA drives per box, no RAID, with swap and root
    taking a chunk out of each drive and the rest used for HDFS and/or MR work.

    While you can certainly go a much more heterogeneous route than we have,
    it should be noted that the more differences there are in the hardware/software
    layout, the more difficult it is going to be to maintain them. This is
    especially true for large grids where hand-tuning individual machines just
    isn't worth the return on effort.

    > Or should I just spread Hadoop over some friendly machines already in my
    > College, buying nothing?

    Given the current lack of a security model in Hadoop and the direction
    that a smattering of JIRAs are heading, "friendly" could go either way:
    either not friendly enough or too friendly. :)
  • Ross Boucher at Sep 25, 2007 at 5:11 pm

    On Sep 25, 2007, at 10:02 AM, Allen Wittenauer wrote:

    > On 9/25/07 9:27 AM, "Bob Futrelle" wrote:
    >
    >> I'm in the market to buy a few machines to set up a small cluster and
    >> am wondering what I should consider.
    >
    > If it helps, we're using quad-core x86s with anywhere from 4GB to 16GB
    > of RAM. We've got 4 x 500GB SATA drives per box, no RAID, with swap and
    > root taking a chunk out of each drive and the rest used for HDFS and/or
    > MR work.

    How many map tasks are you running on each machine, one per core?
    If so, do you have it set up such that each one is talking to its own
    dedicated drive?

    I've been testing on quad-core machines myself, but I didn't see an
    obvious way to set it up such that there was one map task per core
    talking to its own drive, and I also wasn't sure if it would really
    matter, as the performance tests I did run seemed to indicate that it
    didn't make much of a difference. (A sketch of the multi-drive settings
    follows at the end of this message.)

    > While you can certainly go a much more heterogeneous route than we
    > have, it should be noted that the more differences there are in the
    > hardware/software layout, the more difficult it is going to be to
    > maintain them. This is especially true for large grids where
    > hand-tuning individual machines just isn't worth the return on effort.
    >
    >> Or should I just spread Hadoop over some friendly machines already in
    >> my College, buying nothing?
    >
    > Given the current lack of a security model in Hadoop and the direction
    > that a smattering of JIRAs are heading, "friendly" could go either way:
    > either not friendly enough or too friendly. :)
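
    On the one-task-per-drive question above: rather than pinning a map task to
    a particular drive, the usual approach is to list one directory per drive
    and let the datanode and the tasktracker spread blocks and intermediate
    output across all of them. A sketch with hypothetical mount points:

      <!-- hadoop-site.xml; paths are hypothetical, one per physical drive -->
      <property>
        <name>dfs.data.dir</name>
        <value>/d1/hdfs/data,/d2/hdfs/data,/d3/hdfs/data,/d4/hdfs/data</value>
      </property>
      <property>
        <name>mapred.local.dir</name>
        <value>/d1/mapred/local,/d2/mapred/local,/d3/mapred/local,/d4/mapred/local</value>
      </property>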
  • Michael Bieniosek at Sep 25, 2007 at 5:14 pm
    This will also depend on your particular use - since our application is CPU-bound, it was more economical for us to buy machines with as many cores as possible. Obviously if you are IO-bound you have different considerations.

    -Michael

    On 9/25/07 10:02 AM, "Allen Wittenauer" wrote:

    > On 9/25/07 9:27 AM, "Bob Futrelle" wrote:
    >
    >> I'm in the market to buy a few machines to set up a small cluster and
    >> am wondering what I should consider.
    >
    > If it helps, we're using quad-core x86s with anywhere from 4GB to 16GB
    > of RAM. We've got 4 x 500GB SATA drives per box, no RAID, with swap and
    > root taking a chunk out of each drive and the rest used for HDFS and/or
    > MR work.
    >
    > While you can certainly go a much more heterogeneous route than we
    > have, it should be noted that the more differences there are in the
    > hardware/software layout, the more difficult it is going to be to
    > maintain them. This is especially true for large grids where
    > hand-tuning individual machines just isn't worth the return on effort.
    >
    >> Or should I just spread Hadoop over some friendly machines already in
    >> my College, buying nothing?
    >
    > Given the current lack of a security model in Hadoop and the direction
    > that a smattering of JIRAs are heading, "friendly" could go either way:
    > either not friendly enough or too friendly. :)
  • Joydeep Sen Sarma at Sep 25, 2007 at 6:31 pm
    I am curious how folks are sizing memory for task nodes.

    It didn't seem to me that either map (memory needed ~ chunk size) or
    reduce (memory needed ~ io.sort.mb - Yahoo's benchmark sort run sets it
    to the low hundreds) tasks consumed a lot of memory in the normal course
    of affairs.

    (There could be exceptions, I guess, if the reduce group size is extremely
    large - but that seems like an outlier.)

    So I am curious why we might want to configure more than 1-1.5GB per core
    for task nodes. (The relevant knobs are sketched at the end of this
    message.)


    -----Original Message-----
    From: Allen Wittenauer
    Sent: Tuesday, September 25, 2007 10:02 AM
    To: hadoop-user@lucene.apache.org
    Subject: Re: hardware specs for hadoop nodes

    > On 9/25/07 9:27 AM, "Bob Futrelle" wrote:
    >
    >> I'm in the market to buy a few machines to set up a small cluster and
    >> am wondering what I should consider.
    >
    > If it helps, we're using quad-core x86s with anywhere from 4GB to 16GB
    > of RAM. We've got 4 x 500GB SATA drives per box, no RAID, with swap and
    > root taking a chunk out of each drive and the rest used for HDFS and/or
    > MR work.
    >
    > While you can certainly go a much more heterogeneous route than we
    > have, it should be noted that the more differences there are in the
    > hardware/software layout, the more difficult it is going to be to
    > maintain them. This is especially true for large grids where
    > hand-tuning individual machines just isn't worth the return on effort.
    >
    >> Or should I just spread Hadoop over some friendly machines already in
    >> my College, buying nothing?
    >
    > Given the current lack of a security model in Hadoop and the direction
    > that a smattering of JIRAs are heading, "friendly" could go either way:
    > either not friendly enough or too friendly. :)
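
    For reference, a hadoop-site.xml sketch of the knobs behind the numbers
    above: io.sort.mb (mentioned by Joydeep) and the per-task JVM heap. The
    values are illustrative only, not taken from this thread; with one or two
    task slots per core, heaps of this size fit comfortably inside 1-1.5GB of
    RAM per core:

      <!-- hadoop-site.xml; values are illustrative -->
      <property>
        <name>io.sort.mb</name>
        <value>200</value>
        <description>In-memory buffer used for sorting map output, in the
        low hundreds of MB as in the sort benchmark mentioned above.</description>
      </property>
      <property>
        <name>mapred.child.java.opts</name>
        <value>-Xmx512m</value>
        <description>Heap for each map or reduce task JVM; io.sort.mb must
        fit within this heap.</description>
      </property>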

Discussion Overview

group: common-user
categories: hadoop
posted: Sep 10, '07 at 11:56p
active: Sep 25, '07 at 7:35p
posts: 11
users: 7
website: hadoop.apache.org...
irc: #hadoop
