FAQ
I'm trying to run a MapReduce task against a cluster of 4 DataNodes with 4
cores each.
My input data is 4GB in size and it's split into 100MB files. Current
configuration is default so block size is 64MB.

If I understand it correctly Hadoop should be running 64 Mappers to process
the data.

I'm running a simple data counting MapReduce job and it's taking about 30 minutes to
complete. This seems like way too much, doesn't it?
Is there any tuning you would recommend to try and see an improvement
in performance?

Thanks,
Pony
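[An aside on the mapper-count estimate above: with the default 64 MB split size, each 100 MB file actually produces two input splits, so the job likely runs more than 64 map tasks. A rough back-of-the-envelope check, assuming the 4 GB is about forty 100 MB files as described:]

```python
# Rough check of how many map tasks the job above should launch.
# The sizes come from the question; the exact file count is an assumption.
MB = 1024 * 1024
input_total = 4 * 1024 * MB          # 4 GB of input
file_size = 100 * MB                 # each input file
block_size = 64 * MB                 # default dfs.block.size in 0.20.x

num_files = input_total // file_size             # ~40 files
# FileInputFormat makes roughly one split per block per file, so a
# 100 MB file yields two splits (64 MB + 36 MB), not one.
splits_per_file = -(-file_size // block_size)    # ceiling division -> 2
num_map_tasks = num_files * splits_per_file      # ~80 map tasks
```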


  • GOEKE, MATTHEW (AG/1000) at Jun 27, 2011 at 8:00 pm
    If you are running default configurations then you are only getting 2 mappers and 1 reducer per node. The rule of thumb I have gone on (and backed up by the definitive guide) is 2 processes per core, so: one slot each for the tasktracker/datanode and 6 slots left. How you break it up from there is your call, but I would suggest either 4 mappers / 2 reducers or 5 mappers / 1 reducer.

    Check out the configs below for details on what you are *most likely* running currently:
    http://hadoop.apache.org/common/docs/r0.20.2/mapred-default.html
    http://hadoop.apache.org/common/docs/r0.20.2/hdfs-default.html
    http://hadoop.apache.org/common/docs/r0.20.2/core-default.html

    HTH,
    Matt
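    [Matt's rule of thumb can be written out as a quick sanity check. A sketch only; the 4-core figure comes from the original question, and the map/reduce split is the judgment call he describes:]

```python
# 2 processes per core, minus one slot each for the DataNode and
# TaskTracker daemons, leaves the slots available for map/reduce tasks.
cores_per_node = 4
processes_per_core = 2

total_slots = cores_per_node * processes_per_core   # 8 slots per node
daemon_slots = 2                                    # tasktracker + datanode
task_slots = total_slots - daemon_slots             # 6 left for tasks

# Two reasonable ways to split the remaining slots per node:
options = [(4, 2), (5, 1)]   # (map slots, reduce slots)
assert all(m + r == task_slots for m, r in options)
```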

    This e-mail message may contain privileged and/or confidential information, and is intended to be received only by persons entitled
    to receive such information. If you have received this e-mail in error, please notify the sender immediately. Please delete it and
    all attachments from any servers, hard drives or any other media. Other use of this e-mail by you is strictly prohibited.

    All e-mails and attachments sent and received are subject to monitoring, reading and archival by Monsanto, including its
    subsidiaries. The recipient of this e-mail is solely responsible for checking for the presence of "Viruses" or other "Malware".
    Monsanto, along with its subsidiaries, accepts no liability for any damage caused by any such code transmitted by or accompanying
    this e-mail or any attachment.


    The information contained in this email may be subject to the export control laws and regulations of the United States, potentially
    including but not limited to the Export Administration Regulations (EAR) and sanctions regulations issued by the U.S. Department of
    Treasury, Office of Foreign Asset Controls (OFAC). As a recipient of this information you are obligated to comply with all
    applicable U.S. export laws and regulations.
  • Juan P. at Jun 27, 2011 at 10:34 pm
    Matt,
    Thanks for your help!
    I think I get it now, but this part is a bit confusing:

    "so: tasktracker/datanode and 6 slots left. How you break it up from there
    is your call but I would suggest either 4 mappers / 2 reducers or 5 mappers
    / 1 reducer."

    If it's 2 processes per core, then it's: 4 Nodes * 4 Cores/Node * 2
    Processes/Core = 32 Processes Total

    So my mapred-site.xml configuration should include these props:

    <property>
      <name>mapred.map.tasks</name>
      <value>28</value>
    </property>
    <property>
      <name>mapred.reduce.tasks</name>
      <value>4</value>
    </property>

    Is that correct?
  • Juan P. at Jun 28, 2011 at 3:13 am
    Ok,
    So I tried putting the following config in the mapred-site.xml of all of my
    nodes

    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>name-node:54311</value>
      </property>
      <property>
        <name>mapred.map.tasks</name>
        <value>7</value>
      </property>
      <property>
        <name>mapred.reduce.tasks</name>
        <value>1</value>
      </property>
      <property>
        <name>mapred.tasktracker.map.tasks.maximum</name>
        <value>7</value>
      </property>
      <property>
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>1</value>
      </property>
    </configuration>

    but when I start a new job it gets stuck at

    11/06/28 03:04:47 INFO mapred.JobClient: map 0% reduce 0%

    Any thoughts?
    Thanks for your help guys!
  • GOEKE, MATTHEW (AG/1000) at Jun 28, 2011 at 4:12 am
    Per node: 4 cores * 2 processes = 8 slots
    Datanode: 1 slot
    Tasktracker: 1 slot

    Therefore max of 6 slots between mappers and reducers.

    Below is part of our mapred-site.xml. The thing to keep in mind is that the number of maps is defined by the number of input splits (which is determined by your data), so you only need to worry about setting the maximum number of concurrent processes per node. In this case the properties you want to home in on are mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum. Keep in mind there are a LOT of other tuning improvements that can be made, but they require a strong understanding of your job load.

    <configuration>
      <property>
        <name>mapred.tasktracker.map.tasks.maximum</name>
        <value>2</value>
      </property>

      <property>
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>1</value>
      </property>

      <property>
        <name>mapred.child.java.opts</name>
        <value>-Xmx512m</value>
      </property>

      <property>
        <name>mapred.compress.map.output</name>
        <value>true</value>
      </property>

      <property>
        <name>mapred.output.compress</name>
        <value>true</value>
      </property>

      <property>
        <name>mapred.output.compression.type</name>
        <value>BLOCK</value>
      </property>

      <property>
        <name>mapred.output.compression.codec</name>
        <value>com.hadoop.compression.lzo.LzoCodec</value>
      </property>

      <property>
        <name>mapred.map.output.compression.codec</name>
        <value>com.hadoop.compression.lzo.LzoCodec</value>
      </property>

      <property>
        <name>mapred.reduce.slowstart.completed.maps</name>
        <value>0.75</value>
      </property>

      <property>
        <name>mapred.reduce.tasks</name>
        <value>4</value>
      </property>

      <property>
        <name>mapred.reduce.tasks.speculative.execution</name>
        <value>false</value>
      </property>

      <property>
        <name>io.sort.mb</name>
        <value>256</value>
      </property>

      <property>
        <name>io.sort.factor</name>
        <value>64</value>
      </property>

      <property>
        <name>mapred.jobtracker.taskScheduler</name>
        <value>org.apache.hadoop.mapred.FairScheduler</value>
        <final>true</final>
      </property>

      <property>
        <name>mapred.fairscheduler.poolnameproperty</name>
        <value>mapreduce.job.user.name</value>
        <final>true</final>
      </property>

      <!--
      <property>
        <name>mapred.fairscheduler.allocation.file</name>
        <value>/hadoop/hadoop/conf/fair-scheduler.xml</value>
      </property>
      -->

      <property>
        <name>mapred.fairscheduler.assignmultiple</name>
        <value>true</value>
        <final>true</final>
      </property>

      <property>
        <name>mapred.hosts</name>
        <value>/hadoop/hadoop/conf/mapred-hosts-include</value>
      </property>

      <property>
        <name>mapred.hosts.exclude</name>
        <value>/hadoop/hadoop/conf/mapred-hosts-exclude</value>
      </property>
    </configuration>
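    [For reference, the two mapred.tasktracker.*.tasks.maximum properties only cap per-node concurrency; cluster-wide capacity is just node count times the per-node maximum. A rough sketch for the 4-node cluster from the question, assuming the 4 map / 2 reduce split suggested earlier in the thread rather than the 2/1 values in the config above:]

```python
# Cluster-wide concurrency implied by the per-node tasktracker maxima.
nodes = 4
map_tasks_max = 4        # assumed mapred.tasktracker.map.tasks.maximum
reduce_tasks_max = 2     # assumed mapred.tasktracker.reduce.tasks.maximum

concurrent_maps = nodes * map_tasks_max        # 16 map tasks at once
concurrent_reduces = nodes * reduce_tasks_max  # 8 reduce tasks at once

# If the job has ~80 input splits (40 files x 2 splits each), the maps
# run in ceil(80 / 16) waves.
num_splits = 80
waves = -(-num_splits // concurrent_maps)      # 5 waves of maps
```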

  • Michel Segel at Jun 28, 2011 at 12:38 pm
    Matt,
    You have 2 threads per core, so your Linux box thinks an 8-core box has 16 cores. In my calcs I tend to take a whole core each for the TT, DN and RS, and then a thread per slot, so you end up with 10 slots per node. Of course memory is also a factor.

    Note this is only a starting point. You can always tune up.

    Sent from a remote device. Please excuse any typos...

    Mike Segel
  • GOEKE, MATTHEW (AG/1000) at Jun 28, 2011 at 2:47 pm
    Mike,

    I'm not really sure I have seen a community consensus around how to handle hyper-threading within Hadoop (although I have seen quite a few articles that discuss it). I was assuming that when Juan mentioned they were 4-core boxes he meant 4 physical cores and not HT cores. I was merely stating that the starting point should be 1 slot per thread (or hyper-threaded core); obviously reviewing the results from Ganglia, or any other monitoring solution, will help you come up with a more concrete configuration based on the load.

    My brain might not be working this morning, but how did you get the 10 slots again? That seems low for an 8-physical-core box but somewhat overextended for a 4-physical-core box.

    Matt

  • Michael Segel at Jun 28, 2011 at 6:31 pm
    Matthew,

    I understood that Juan was talking about a 2-socket quad-core box. We run boxes with the E5500 (Xeon quad-core) chips; Linux sees these as 16 cores.
    Our data nodes are 32GB RAM with 4 x 2TB SATA. It's a pretty basic configuration.

    What I was saying was that if you dedicate 1 core each to the TT, DN and RS jobs, that's 3 out of the 8 physical cores, leaving you 5 cores or 10 'hyperthread cores'.
    So you could put up 10 m/r slots on the machine. Note that for the main tasks (TT, DN, RS) I dedicate the physical core.

    Of course your mileage may vary if you're doing non-standard things. A good starting point is 6 mappers and 4 reducers.
    And of course YMMV depending on whether you're using MapR's release or Cloudera's, and whether you're running HBase or something else on the cluster.
    From our experience, we end up getting disk I/O bound first, and then network or memory becomes the next constraint. The Xeon chipsets really are good.
    HTH

    -Mike
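    [Mike's slot accounting above can be sketched as follows, under the assumptions he states: a dual-socket quad-core box with hyper-threading, and one whole physical core reserved per daemon:]

```python
# A dual-socket quad-core Xeon with hyper-threading shows up as 16
# logical cores in Linux. Reserve a whole physical core (2 threads)
# each for the TaskTracker, DataNode and RegionServer daemons, then
# give each remaining hardware thread one map/reduce slot.
physical_cores = 8       # 2 sockets x 4 cores
threads_per_core = 2     # hyper-threading

logical_cores = physical_cores * threads_per_core   # 16, what Linux sees
daemon_cores = 3                                    # TT, DN, RS
free_cores = physical_cores - daemon_cores          # 5 physical cores left
slots = free_cores * threads_per_core               # 10 m/r slots per node
```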

  • GOEKE, MATTHEW (AG/1000) at Jun 28, 2011 at 6:48 pm
    Mike,

    Somewhat of a tangent, but it is actually very informative to hear that you are getting bound by I/O with a 2:1 core-to-disk ratio. Could you share what you used to make those calls? We have been using both a local Ganglia daemon and the Hadoop Ganglia daemon to get an overall look at the cluster, and the items of interest, I would assume, would be CPU I/O wait as well as the throughput of block operations.

    Obviously the disconnect on my side was that I didn't realize you were dedicating a physical core per daemon. I am a little surprised that you found that necessary, but then again, after seeing some of the metrics from my own stress testing, I am noticing that we might be overextending with our config on heavy loads. Unfortunately I am working with lower-specced hardware at the moment, so I don't have the overhead to test that out.

    Matt

    -----Original Message-----
    From: Michael Segel
    Sent: Tuesday, June 28, 2011 1:31 PM
    To: common-user@hadoop.apache.org
    Subject: RE: Performance Tunning



    Matthew,

    I understood that Juan was talking about a 2 socket quad core box. We run boxes with the e5500 (xeon quad core ) chips. Linux sees these as 16 cores.
    Our data nodes are 32GB Ram w 4 x 2TB SATA. Its a pretty basic configuration.

    What I was saying was that if you consider 1 core for each TT, DN and RS jobs, thats 3 out of the 8 physical cores, leaving you 5 cores or 10 'hyperthread cores'.
    So you could put up 10 m/r slots on the machine. Note that on the main tasks (TT, DN, RS) I dedicate the physical core.

    Of course your mileage may vary if you're doing non-standard or normal things. A good starting point is 6 mappers and 4 reducers.
    And of course YMMV depending on if you're using MapR's release, Cloudera, and if you're running HBase or something else on the cluster.
    From our experience... we end up getting disk I/O bound first, and then network or memory becomes the next constraint. Really the xeon chipsets are really good.
    HTH

    -Mike

    From: matthew.goeke@monsanto.com
    To: common-user@hadoop.apache.org
    Subject: RE: Performance Tunning
    Date: Tue, 28 Jun 2011 14:46:40 +0000

    Mike,

    I'm not really sure I have seen a community consensus around how to handle hyper-threading within Hadoop (although I have seen quite a few articles that discuss it). I was assuming that when Juan mentioned they were 4-core boxes that he meant 4 physical cores and not HT cores. I was more stating that the starting point should be 1 slot per thread (or hyper-threaded core) but obviously reviewing the results from ganglia, or any other monitoring solution, will help you come up with a more concrete configuration based on the load.

    My brain might not be working this morning but how did you get the 10 slots again? That seems low for an 8 physical core box but somewhat overextending for a 4 physical core box.

    Matt

    -----Original Message-----
    From: im_gumby@hotmail.com On Behalf Of Michel Segel
    Sent: Tuesday, June 28, 2011 7:39 AM
    To: common-user@hadoop.apache.org
    Subject: Re: Performance Tunning

    Matt,
    You have 2 threads per core, so your Linux box thinks an 8-core box has 16 cores. In my calcs, I tend to take a whole core for the TT, DN and RS, and then a thread per slot, so you end up with 10 slots per node. Of course memory is also a factor.

    Note this is only a starting point; you can always tune up.

    Sent from a remote device. Please excuse any typos...

    Mike Segel
    On Jun 27, 2011, at 11:11 PM, "GOEKE, MATTHEW (AG/1000)" wrote:

    Per node: 4 cores * 2 processes = 8 slots
    Datanode: 1 slot
    Tasktracker: 1 slot

    Therefore max of 6 slots between mappers and reducers.
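The budget above can be written out as a quick sketch (the function name and parameters are just for illustration; the defaults are the numbers from this thread):

```python
def task_slots(cores=4, processes_per_core=2, reserved_daemons=2):
    """Rule of thumb (per the Definitive Guide): two processes per core,
    minus one slot each for the DataNode and TaskTracker daemons."""
    return cores * processes_per_core - reserved_daemons

print(task_slots())  # 6 slots to split between mappers and reducers
```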

    Below is part of our mapred-site.xml. The thing to keep in mind is that the number of maps is defined by the number of input splits (which is determined by your data), so you only need to worry about setting the maximum number of concurrent processes per node. In this case the properties you want to home in on are mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum. Keep in mind there are a LOT of other tuning improvements that can be made, but they require a strong understanding of your job load.

    <configuration>
      <property>
        <name>mapred.tasktracker.map.tasks.maximum</name>
        <value>2</value>
      </property>

      <property>
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>1</value>
      </property>

      <property>
        <name>mapred.child.java.opts</name>
        <value>-Xmx512m</value>
      </property>

      <property>
        <name>mapred.compress.map.output</name>
        <value>true</value>
      </property>

      <property>
        <name>mapred.output.compress</name>
        <value>true</value>
      </property>
    </configuration>
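Since the map count comes from the input splits rather than from any setting, a rough back-of-the-envelope sketch for Pony's job looks like this (assuming default splittable-file behaviour; the helper function is hypothetical):

```python
import math

def map_task_count(file_sizes_mb, block_size_mb=64):
    # One map task per input split; each file is split independently,
    # so a 100MB file with a 64MB block size yields 2 splits, not 1.5.
    return sum(math.ceil(size / block_size_mb) for size in file_sizes_mb)

# 4GB of input stored as forty 100MB files with the default 64MB block size:
print(map_task_count([100] * 40))  # 80 map tasks, not 4096/64 = 64
```

This is why per-file splitting matters: the 64-mapper estimate in the original question only holds if the input were one contiguous 4GB file.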

Discussion Overview
group: common-user @ hadoop.apache.org
category: hadoop
posted: Jun 27, '11 at 7:50p
active: Jun 28, '11 at 6:48p
posts: 9
users: 3
website: hadoop.apache.org...
irc: #hadoop
