I am starting an install of Hadoop on a new cluster. However, I am a little
confused about which set of instructions I should follow, having only installed
and played around with Hadoop on a single-node Ubuntu box with 2 cores (on a
single board) and 2 GB of RAM.
The new machine has 2 internal nodes, each with 4 cores. I would like Hadoop
to run in a distributed context over these 8 cores. One of my biggest issues
is the definition of the word "node". From the Hadoop wiki and documentation,
it seems that "node" means "machine", and not a board. So, by this definition,
our cluster is really one "node". Is this correct?

If this is the case, then I shouldn't be using the "cluster setup"
instructions, located here:
http://hadoop.apache.org/core/docs/r0.17.2/cluster_setup.html

But this one:
http://hadoop.apache.org/core/docs/r0.17.2/quickstart.html

That is what I've been doing. But what should the mode of operation be? I don't
think it should be standalone. Should it be pseudo-distributed? If so, how can
I guarantee that work will be spread over all 8 processors? What needs to go in
the hadoop-site.xml file?

Here are the specs of the machine:
-Mac Pro RAID Card 065-7214
-Two 3.0GHz Quad-Core Intel Xeon (8 cores total) 065-7534
-16GB RAM (4 x 4GB) 065-7179
-1TB 7200-rpm Serial ATA 3Gb/s 065-7544
-1TB 7200-rpm Serial ATA 3Gb/s 065-7546
-1TB 7200-rpm Serial ATA 3Gb/s 065-7193
-1TB 7200-rpm Serial ATA 3Gb/s 065-7548


Could someone please point me to the correct mode of operation and instructions
to install things correctly on this machine? I found some information on how to
install on an OS X machine in the archives, but it is a touch outdated and seems
to be missing some things.

Thank you very much for your time.

-SM


  • Jim Twensky at Sep 10, 2008 at 10:24 pm
    Apparently you have one node with 2 processors, where each processor has 4
    cores. What do you want to use Hadoop for? If you have a single disk drive
    and multiple cores on one node, then a pseudo-distributed environment seems
    like the best approach to me, as long as you are not dealing with large
    amounts of data. If you have a single disk drive and a huge amount of data
    to process, then the disk drive might be a bottleneck for your applications.
    Hadoop is usually used for data-intensive applications, whereas your
    hardware seems more like it is designed for CPU-intensive jobs, considering
    the 8 cores on a single node.
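
    For pseudo-distributed operation, the conf/hadoop-site.xml from the
    quickstart is roughly the following. This is only a sketch; the localhost
    ports are the usual example values from the 0.17 docs, so adjust them and
    the replication factor to your own setup:

    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>localhost:9000</value>
      </property>
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
      </property>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>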

    Tim
  • Sandy at Sep 10, 2008 at 11:14 pm
    Thanks for the swift response.

    I have 4 disk drives (please see the specs), so I'm not sure the hard disk
    will still be a bottleneck. Would you agree?
    I think we are dealing with data-intensive jobs... my input data can be as
    large as a few gigabytes (though theoretically it could be larger). I
    understand that in comparison to what some people run, this may seem small.
    I tried running something on my old machine, and it took several hours to
    complete the reduce in the first MapReduce phase before running out of
    memory (and this was after I increased the heap size).

    I'm trying to increase the max heap size on this machine in hadoop-env.sh
    past 2000, but hadoop gives me errors. Is this normal? I'm running
    hadoop-0.17.2. Is there anywhere else I need to specify a heap increase?
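
    In case it helps, here is roughly what I am changing. The
    mapred.child.java.opts part is just my guess, from the docs, at where the
    per-task heap is set, as opposed to HADOOP_HEAPSIZE, which I understand
    only covers the daemons (value in MB):

    # conf/hadoop-env.sh -- heap for the Hadoop daemons, in MB
    export HADOOP_HEAPSIZE=2000

    # conf/hadoop-site.xml -- heap for the child JVMs that run map/reduce tasks
    <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1024m</value>
    </property>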

    Lastly, I think one more modification I will need to make is increasing the
    maximum number of map/reduce tasks to 8 (one per core). I made that change
    in hadoop-site.xml by adding an additional property:

    <property>
    <name>mapred.tasktracker.tasks.maximum</name>
    <value>8</value>
    <description>The maximum number of tasks that will be run simultaneously by
    a task tracker.
    </description>
    </property>

    I don't see a mapred-default.xml file in the conf folder. I'm guessing this
    was removed in later versions? Is there anywhere else I would need to
    specify an increase in map and reduce tasks aside from
    JobConf.setNumMapTasks and JobConf.setNumReduceTasks?
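
    For reference, in the job driver I am setting the task counts with
    something like the following. MyJobDriver and MyJob are just placeholders
    for my actual classes, and the rest of the job setup is omitted:

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class MyJobDriver {                       // placeholder driver class
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(MyJob.class); // my actual job class goes here
            conf.setNumMapTasks(8);     // a hint; the real map count also depends on input splits
            conf.setNumReduceTasks(8);  // taken as given by the framework
            // ... mapper/reducer classes, input and output paths omitted ...
            JobClient.runJob(conf);
        }
    }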

    Thanks again for your time.

    -SM
    PS - I'm going to update the wiki with installation instructions for OS X as
    soon as I get everything finished up :-)



Discussion Overview
group: common-user
categories: hadoop
posted: Sep 10, '08 at 9:59p
active: Sep 10, '08 at 11:14p
posts: 3
users: 2
website: hadoop.apache.org...
irc: #hadoop

2 users in discussion: Sandy (2 posts), Jim Twensky (1 post)
