Thanks for the swift response.
I have 4 disk drives (please see specs), so I'm not sure if the hard disk
will still be a bottleneck. Would you agree?
I think we are dealing with data intensive jobs... my input data can be as
large as a few gigabytes in size (though theoretically it could be larger).
I understand that in comparison to what some people run this may seem small
though. I tried running something on my old machine, and it took several
hours to complete the reduce in the first map reduce phase before running
out of memory (and this was after I increased the heap size).
I'm trying to increase the max heap size on this machine in hadoop-env.sh
past 2000, but hadoop gives me errors. Is this normal? I'm running
hadoop-0.17.2. Is there anywhere else I need to specify a heap increase?
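In case it matters, my understanding (assumed from the old hadoop-default.xml, so please correct me) is that HADOOP_HEAPSIZE in hadoop-env.sh is in MB and only sizes the daemon JVMs, while the heap for each spawned map/reduce task JVM is a separate property, along these lines:

```xml
<!-- hadoop-site.xml -- heap for each spawned task JVM
     (assumed property name; the default in this era is -Xmx200m) -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>
```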
Lastly, I think one more modification I will need to make is increasing the
maximum number of map/reduce tasks to 8 (one per core). I made that change
in hadoop-site.xml by adding an additional property whose description reads
"The maximum number of tasks that will be run simultaneously by a task
tracker."
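For reference, the full property block I added looks something like this (the property name is my best guess for the 0.17 line; I believe later releases split it into separate map and reduce maximums, so verify against hadoop-default.xml):

```xml
<!-- hadoop-site.xml -- assumed property name; verify against hadoop-default.xml -->
<property>
  <name>mapred.tasktracker.tasks.maximum</name>
  <value>8</value>
  <description>The maximum number of tasks that will be run simultaneously by
  a task tracker.</description>
</property>
```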
I don't see a mapred-default.xml file in the conf folder. I'm guessing this
was removed in later versions? Is there anywhere else I would need to
specify an increase in map and reduce tasks aside from
JobConf.setNumMapTasks and JobConf.setNumReduceTasks?
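From what I can tell (property names assumed from hadoop-default.xml), the per-job counts can also be given cluster-wide defaults in hadoop-site.xml, though the map count is only a hint since the number of input splits usually wins:

```xml
<!-- hadoop-site.xml -- assumed job-level defaults; JobConf overrides these -->
<property>
  <name>mapred.map.tasks</name>
  <value>8</value> <!-- a hint only; actual map count follows input splits -->
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>8</value>
</property>
```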
Thanks again for your time.
PS - I'm going to update the wiki with installation instructions for OS X as
soon as I get everything finished up :-)
On Wed, Sep 10, 2008 at 5:23 PM, Jim Twensky wrote:
Apparently you have one node with 2 processors where each processor has 4
cores. What do you want to use Hadoop for? If you have a single disk drive
and multiple cores on one node then pseudo distributed environment seems
like the best approach to me as long as you are not dealing with large
amounts of data. If you have a single disk drive and huge amount of data to
process, then the disk drive might be a bottleneck for your applications.
Hadoop is usually used for data-intensive applications, whereas yours seems
more like a CPU-intensive job, considering the 8 cores on a single node.
On Wed, Sep 10, 2008 at 4:59 PM, Sandy wrote:
I am starting an install of Hadoop on a new cluster. However, I am a little
confused about which set of instructions I should follow, having only
installed and played around with Hadoop on a single-node Ubuntu box with 2
cores (on a single board) and 2 GB of RAM.
The new machine has 2 internal nodes, each with 4 cores. I would like
Hadoop to run in a distributed context over these 8 cores. One of my
issues is the definition of the word "node". From the Hadoop wiki and
documentation, it seems that "node" means "machine", and not a board. So, by
this definition, our cluster is really one "node". Is this correct?
If this is the case, then I shouldn't be using the "cluster setup"
instructions, located here: http://hadoop.apache.org/core/docs/r0.17.2/cluster_setup.html
But this one: http://hadoop.apache.org/core/docs/r0.17.2/quickstart.html
Which is what I've been doing. But what should the operation be? I don't
think it should be standalone. Should it be pseudo-distributed? If so, how
can I guarantee that it will be spread over all the 8 processors? What is
necessary for the hadoop-site.xml file?
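For reference, what I have in conf/hadoop-site.xml so far is based on the quickstart's pseudo-distributed example, something like this (ports taken from the quickstart's example, so presumably conventional rather than required):

```xml
<!-- conf/hadoop-site.xml -- pseudo-distributed sketch based on the 0.17
     quickstart example -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>localhost:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```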
Here are the specs of the machine.
-Mac Pro RAID Card 065-7214
-Two 3.0GHz Quad-Core Intel Xeon (8-core) 065-7534
-16GB RAM (4 x 4GB) 065-7179
-1TB 7200-rpm Serial ATA 3Gb/s 065-7544
-1TB 7200-rpm Serial ATA 3Gb/s 065-7546
-1TB 7200-rpm Serial ATA 3Gb/s 065-7193
-1TB 7200-rpm Serial ATA 3Gb/s 065-7548
Could someone please point me to the correct mode of operation to install
things correctly on this machine? I found some information on how to
install on an OS X machine in the archives, but it is a touch outdated and
seems to be missing some things.
Thank you very much for your time.