Ability to thread task execution
--------------------------------

Key: HADOOP-2990
URL: https://issues.apache.org/jira/browse/HADOOP-2990
Project: Hadoop Core
Issue Type: Improvement
Components: mapred
Environment: All
Reporter: Holden Robbins
Original Estimate: 48h
Remaining Estimate: 48h


Currently Hadoop spawns a single-threaded JVM for each task. While this works well for many tasks, it does not maximize resource usage on slaves that have many cores (machines with more cores are getting more cost-effective every day) _and_ are running jobs that require many gigabytes of read-only in-memory resources to maximize throughput. Running tasks in separate JVMs requires redundantly loading large amounts of data, which reduces the number of parallel tasks that can run per machine even though more CPUs are available.

Adding this ability will give Hadoop users the flexibility to balance their need to maximize memory usage and throughput against their need for task segmentation.

Note: This is a blocking bug in porting processes over to Hadoop for my own organization. I am testing a patch for this now that leaves the existing behavior for single-threaded operation intact. All synchronization is done through wrapper classes and helper methods and should not add any overhead to non-threaded processes.
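
To make the idea concrete, here is a minimal plain-Java sketch of several worker threads in one JVM sharing a single read-only table, with output funneled through a synchronizing wrapper. It is not the patch mentioned above and does not use any Hadoop classes; `ThreadedTaskSketch`, `Collector`, and `synchronizedCollector` are hypothetical names chosen for illustration, and the tiny `Map` stands in for a multi-gigabyte resource. It assumes Java 9 or later for `Map.of` and `List.of`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class ThreadedTaskSketch {
    /** Hypothetical output sink, standing in for a task's output collector. */
    interface Collector {
        void collect(String key, int value);
    }

    /** Wrapper that serializes calls so the wrapped sink needs no locking of its own. */
    static Collector synchronizedCollector(Collector inner) {
        Object lock = new Object();
        return (key, value) -> {
            synchronized (lock) {
                inner.collect(key, value);
            }
        };
    }

    /** Runs one lookup per record on a fixed pool; every thread reads the same table. */
    static void run(List<String> records, Map<String, Integer> table,
                    Collector out, int threads) throws InterruptedException {
        Collector safeOut = synchronizedCollector(out);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String record : records) {
            pool.execute(() -> safeOut.collect(record, table.getOrDefault(record, 0)));
        }
        pool.shutdown();
        // Wait for all workers; their writes are visible once this returns true.
        if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
            throw new IllegalStateException("workers did not finish in time");
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Map<String, Integer> table = Map.of("a", 1, "b", 2); // loaded once, shared by all threads
        List<String> seen = new ArrayList<>();                // not thread-safe on its own
        run(List.of("a", "b", "a", "c"), table, (k, v) -> seen.add(k + "=" + v), 4);
        System.out.println(seen.size()); // prints 4; the order of entries may vary
    }
}
```

The table is loaded once and only read, so the threads need no locking around it. Only the shared output sink is wrapped, which keeps the synchronization out of the per-record logic.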

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

  • Owen O'Malley (JIRA) at Mar 10, 2008 at 11:34 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12577250#action_12577250 ]

    Owen O'Malley commented on HADOOP-2990:
    ---------------------------------------

    Have you tried using java.nio.MappedByteBuffer? That should give you good performance across multiple JVMs. There is also already a multi-threaded map runner that works well for mappers. Do you have the problem with reduces, also? A multi-threaded reducer class might be a good option depending on what is required.
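
As a rough illustration of the memory-mapping suggestion, the sketch below maps a file read-only and reads fixed-width records from it. When several JVMs map the same file this way, the operating system can typically back all of the mappings with the same page-cache pages instead of a separate heap copy per process. The class name `SharedLookup`, the 8-byte record layout, and the temp file are assumptions made for the example and do not come from the thread.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class SharedLookup {
    /** Maps the whole file read-only; the data lives outside the Java heap. */
    static MappedByteBuffer mapReadOnly(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping remains valid after the channel is closed.
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }

    /** Reads the index-th 8-byte record with an absolute get, leaving the buffer position untouched. */
    static long recordAt(MappedByteBuffer buf, int index) {
        return buf.getLong(index * Long.BYTES);
    }

    public static void main(String[] args) throws IOException {
        // Write three sample records to a temp file standing in for the shared data set.
        Path file = Files.createTempFile("lookup", ".bin");
        file.toFile().deleteOnExit();
        ByteBuffer out = ByteBuffer.allocate(3 * Long.BYTES);
        out.putLong(10L).putLong(20L).putLong(30L);
        Files.write(file, out.array());

        MappedByteBuffer buf = mapReadOnly(file);
        System.out.println(recordAt(buf, 1)); // prints 20
    }
}
```

A single `FileChannel.map` call cannot map more than `Integer.MAX_VALUE` bytes, so a data set of many gigabytes would have to be split across several mappings.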
  • Holden Robbins (JIRA) at Mar 11, 2008 at 12:04 am
    [ https://issues.apache.org/jira/browse/HADOOP-2990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12577256#action_12577256 ]

    Holden Robbins commented on HADOOP-2990:
    ----------------------------------------

    Useful to know, thanks.

    Is this the multi-threaded mapper you're referring to?
    http://issues.apache.org/jira/browse/HADOOP-811

    Did it not make it into the code base? Any reason why?

  • Holden Robbins (JIRA) at Mar 11, 2008 at 12:32 am
    [ https://issues.apache.org/jira/browse/HADOOP-2990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12577261#action_12577261 ]

    Holden Robbins commented on HADOOP-2990:
    ----------------------------------------

    Nm, found it under: src/java/org/apache/hadoop/mapred/lib/MultithreadedMapRunner
  • Owen O'Malley (JIRA) at Mar 11, 2008 at 2:36 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12577463#action_12577463 ]

    Owen O'Malley commented on HADOOP-2990:
    ---------------------------------------

    It was committed back in 0.10.0. Code is at http://tinyurl.com/35eaj4 .
    I would have sent the Javadoc instead, but it is clearly broken (http://tinyurl.com/3xy9gq).

Discussion Overview
group: common-dev
categories: hadoop
posted: Mar 10, '08 at 10:50p
active: Mar 11, '08 at 2:36p
posts: 5
users: 1
website: hadoop.apache.org...
irc: #hadoop