FAQ
Hi,

When trying to compare Hadoop against other parallel paradigms, it is
important to consider heterogeneous systems. Some may have 100 nodes,
each with a single core; some may have 100 nodes with 8 cores each; and
others may have 5 nodes with 32 cores per node.

As Hadoop runs JVMs on each node, am I guessing correctly that
multi-threaded performance on multi-core nodes is only as good as the
Java runtime's ability to use multiple cores?

Or - is there explicit support in Hadoop for multi-core nodes? Or is
there in fact no support? Or is it less definitive - rather "the JVM
will run slightly quicker on 4 cores than on 2"?


thanks,

Rob Stewart


  • Harsh J at Dec 11, 2010 at 11:29 am
    Hi,

    On Sat, Dec 11, 2010 at 4:39 PM, Rob Stewart wrote:
    > Hi,
    >
    > When trying to compare Hadoop against other parallel paradigms, it is
    > important to consider heterogeneous systems. Some may have 100 nodes,
    > each single core. Some may have 100 nodes, with 8 cores on each, and
    > others may have 5 nodes, 32 cores per node.
    The Hadoop JobTracker runs "Tasks" on all TaskTrackers available to
    it (a cluster). Each worker node (a machine, whatever its hardware)
    needs only one TaskTracker service running, and that service tells
    the JobTracker how many Map and Reduce task slots it supports.

    If I had an 8-CPU machine, I could configure the TaskTracker on that
    machine to accept 8 Map Tasks and 8 Reduce Tasks. That "enables" the
    full utilization potential of the machine, provided there's a job
    that will use all those slots.
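    For reference, those per-node slot counts are set in mapred-site.xml
    on each worker (classic MapReduce, MRv1). A minimal sketch for the
    8-CPU machine described above; the values are illustrative and should
    be tuned to your own hardware and workload:

    ```xml
    <!-- mapred-site.xml on the worker node (MRv1 sketch) -->
    <configuration>
      <property>
        <name>mapred.tasktracker.map.tasks.maximum</name>
        <value>8</value>
      </property>
      <property>
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>8</value>
      </property>
    </configuration>
    ```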
    > As Hadoop runs JVMs on each node, am I guessing correctly that
    > multi-threaded performance on multi-core nodes is as good as the
    > Java runtime's ability to run on multi-core nodes?
    >
    > Or - is there support in Hadoop for multi-core nodes? Or is there in
    > fact no support? Or is it less definitive - rather "the JVM will run
    > slightly quicker on 4 cores, rather than 2"?
    Hadoop's TaskTracker spawns a JVM per Task it gets. So if the number
    of slots for that TaskTracker is set to 8, it may run up to 8
    independent JVMs simultaneously to process data, thereby utilizing
    the machine's 8 CPUs in an indirect manner.

    Threads are not used for running the Map/Reduce Tasks themselves, but
    they can be used within the application logic. For instance, you can
    choose to execute Mapper logic in multiple threads instead of
    sequentially, using the MultithreadedMapper class.
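    Conceptually, a multithreaded mapper feeds the records of one input
    split through a thread pool instead of a sequential loop. A
    plain-Java analogy (not Hadoop code; the record list and the
    per-record logic here are invented for illustration):

    ```java
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ThreadedMapSketch {
        // Stand-in for a mapper's per-record logic (here: record length).
        static int mapRecord(String record) {
            return record.length();
        }

        public static void main(String[] args) throws Exception {
            List<String> split = Arrays.asList("foo", "quux", "ab");

            // Analogous to the mapper's configurable thread count.
            ExecutorService pool = Executors.newFixedThreadPool(4);
            List<Future<Integer>> results = new ArrayList<>();
            for (String rec : split) {
                results.add(pool.submit(() -> mapRecord(rec)));
            }
            int total = 0;
            for (Future<Integer> f : results) {
                total += f.get();
            }
            pool.shutdown();
            System.out.println(total); // 3 + 4 + 2 = 9
        }
    }
    ```

    In Hadoop itself you would not write the pool yourself; you configure
    MultithreadedMapper on the job (including its thread count) and
    supply your ordinary Mapper logic.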

    Also, there's no "direct" support for detecting multi-core nodes.
    You need to configure your Hadoop installation yourself, based on
    each machine's details (worker nodes can carry a certain amount of
    independent configuration to adapt to their hardware/software).

    HTH

    --
    Harsh J
    www.harshj.com
  • Rob Stewart at Dec 11, 2010 at 11:41 am
    Ah,

    that is very interesting indeed.

    I am running on a homogeneous cluster, where each node has 8 cores.

    Does that mean that Hadoop would need to be carefully configured, so
    that 8-core machines had a maximum-tasks value of 8, and dual-core
    machines had the value 2?

    Very useful to know,

    Rob
  • Harsh J at Dec 11, 2010 at 12:23 pm

    On Sat, Dec 11, 2010 at 5:10 PM, Rob Stewart wrote:
    > Ah,
    >
    > that is very interesting indeed.
    >
    > I am running on a homogeneous cluster, where each node has 8 cores.
    >
    > Does that mean that Hadoop would need to be carefully configured, so
    > that 8-core machines had a maximum-tasks value of 8, and dual-core
    > machines had the value 2?
    Yes, for the best setup you'd need to create a configuration suited
    to the hardware (and your application) itself. Most of the time a
    cluster is homogeneous, so there isn't much worry about managing
    different configurations across it.


    --
    Harsh J
    www.harshj.com
  • Allen Wittenauer at Dec 13, 2010 at 9:04 pm

    On Dec 11, 2010, at 3:09 AM, Rob Stewart wrote:
    > Or - is there support in Hadoop for multi-core nodes?
    Be aware that writing a job that specifically uses multi-threaded
    tasks usually means that a) you probably aren't really doing
    map/reduce anymore, and b) the job will likely tickle bugs in the
    system, as some APIs are more thread-safe than others (see HDFS-1526,
    for example).

Discussion Overview
group: common-user
categories: hadoop
posted: Dec 11, '10 at 11:10a
active: Dec 13, '10 at 9:04p
posts: 5
users: 3
website: hadoop.apache.org...
irc: #hadoop
