Hi,
On Sat, Dec 11, 2010 at 4:39 PM, Rob Stewart wrote:
> Hi,
> When trying to compare Hadoop against other parallel paradigms, it is
> important to consider heterogeneous systems. Some may have 100 nodes,
> each with a single core; some may have 100 nodes with 8 cores each;
> and others may have 5 nodes with 32 cores per node.
Hadoop's JobTracker runs "Tasks" on all TaskTrackers available to it
(a cluster). Each worker node (a machine, regardless of its hardware)
needs only one TaskTracker service running, and that service tells the
JobTracker the number of Task slots it supports for Map and for Reduce.
If I had an 8-CPU machine, I could configure the TaskTracker on that
machine to accept 8 Map Tasks and 8 Reduce Tasks, thereby "enabling"
the 100% utilization potential of that machine, provided there's a job
that will use all of those slots.
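As a sketch, the slot counts for that 8-CPU machine would go into the
worker's mapred-site.xml (property names as in Hadoop 0.20/1.x):

```xml
<!-- mapred-site.xml on the 8-CPU worker node -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>8</value>
</property>
```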
> As Hadoop runs on JVMs on each node, am I guessing correctly that
> multi-threaded performance on multi-core nodes is only as good as the
> Java runtime's ability to run on multi-core nodes?
> Or is there support in Hadoop for multi-core nodes? Or is there in
> fact no support? Or is it less definitive, i.e. "the JVM will run
> slightly quicker on 4 cores rather than 2"?
Hadoop's TaskTracker spawns a JVM per Task it receives. So if the
number of slots for that TaskTracker is set to 8, it may run up to 8
independent JVMs simultaneously to process data, thereby utilizing the
8 CPUs available on that machine in an indirect manner.
Threads are not used for running the Map/Reduce Tasks themselves, but
they can be used within the application logic. For instance, you can
choose to execute your Mapper logic in threads instead of sequentially
if you wish, using the MultithreadedMapper class.
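The general pattern that MultithreadedMapper applies inside a single
map Task can be sketched in plain Java, with no Hadoop dependency: run
the per-record map logic on a pool of threads instead of sequentially.
The class and method names below are illustrative, not Hadoop's own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch only: per-record mapper logic dispatched to a
// fixed-size thread pool, the way a multithreaded mapper would use
// the cores within one Task's JVM. Names are hypothetical.
public class ThreadedMapSketch {
    // Stand-in for per-record Mapper logic (hypothetical example).
    static int mapRecord(int value) {
        return value * value;
    }

    public static void main(String[] args) throws Exception {
        List<Integer> records = List.of(1, 2, 3, 4, 5);
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<Integer>> results = new ArrayList<>();
        // Submit each record to the pool instead of mapping sequentially.
        for (int r : records) {
            results.add(pool.submit(() -> mapRecord(r)));
        }
        // Collect outputs in submission order.
        for (Future<Integer> f : results) {
            System.out.println(f.get());
        }
        pool.shutdown();
    }
}
```

With the real class (new API), the thread count is set on the job
rather than hand-rolled as above; MultithreadedMapper provides static
setters for the delegate mapper class and number of threads.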
Also, there's no "direct" support for the detection of multi-core
nodes. You need to configure your Hadoop installation according to
each machine's details yourself (worker nodes can have a certain level
of independent configuration to customize and adapt to their
hardware/software).
HTH
--
Harsh J
www.harshj.com