controlling no. of mapper tasks
Hi there,
I know the client can send "mapred.reduce.tasks" to specify the no. of reduce tasks, and Hadoop honours it, but "mapred.map.tasks" is not honoured by Hadoop. Is there any way to control the number of map tasks? What I noticed is that Hadoop is choosing too many mappers, and there is extra overhead being added because of this. For example, when I have only 10 map tasks, my job finishes faster than when Hadoop chooses 191 map tasks. I have a 5-slave cluster, and 10 tasks can run in parallel. I want to set both map and reduce tasks to 10 for maximum efficiency.

Thanks
Praveen
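
(For reference, a minimal sketch of how a client might set these two properties with the 0.20-era mapred API discussed in this thread; the class name is made up, and the job falls back to the identity map/reduce defaults:)

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class SubmitWithTaskCounts {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SubmitWithTaskCounts.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        conf.setNumReduceTasks(10); // "mapred.reduce.tasks" -- honoured exactly
        conf.setNumMapTasks(10);    // "mapred.map.tasks"    -- only a hint; the real count
                                    // is derived from the input splits
        JobClient.runJob(conf);     // identity map/reduce by default
      }
    }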


  • David Rosenstrauch at Jun 20, 2011 at 7:39 pm

    The number of map tasks is determined dynamically based on the number of
    input chunks you have. If you want fewer map tasks, either pass fewer
    input files to your job or store the files using larger chunk sizes
    (which will result in fewer chunks per file, and thus fewer chunks total).

    HTH,

    DR
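
    (If it helps to see this ahead of time, here is a rough sketch against the 0.20-era mapred API that simply asks the input format how many splits, and hence map tasks, a given input would produce; the input path is whatever you pass on the command line:)

        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.mapred.FileInputFormat;
        import org.apache.hadoop.mapred.InputSplit;
        import org.apache.hadoop.mapred.JobConf;
        import org.apache.hadoop.mapred.TextInputFormat;

        public class CountSplits {
          public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(CountSplits.class);
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            TextInputFormat in = new TextInputFormat();
            in.configure(conf); // picks up compression codecs, so .gz files stay unsplit
            InputSplit[] splits = in.getSplits(conf, conf.getNumMapTasks());
            System.out.println("This input would launch " + splits.length + " map tasks");
          }
        }
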
  • Praveen Peddi at Jun 20, 2011 at 7:44 pm
    Hi David,
    I think Hadoop is looking at the data size, not the no. of input files. If I pass in .gz files, then yes, Hadoop chooses 1 map task per file, but if I pass in a huge text file, or the same file split into 10 files, it chooses the same no. of map tasks (191 in my case).

    Thanks
    Praveen

  • David Rosenstrauch at Jun 20, 2011 at 7:48 pm
    Yes, that is correct. It is indeed looking at the data size. Please
    read back through what I wrote - particularly the part about
    files getting broken into chunks (aka "blocks"). If you want fewer map
    tasks, then store your files in HDFS with a larger block size. They
    will then be stored in fewer blocks/chunks, which will result in fewer
    map tasks per job.

    DR
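
    (A rough sketch of doing that programmatically, assuming a 0.20-era client where the client-side "dfs.block.size" setting is what gets applied at file-creation time; the 256 MB figure and the command-line paths are just examples:)

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class CopyWithLargerBlocks {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Ask for 256 MB blocks instead of the configured default (often 64 MB),
            // so the same bytes land in fewer blocks and yield fewer input splits.
            conf.setLong("dfs.block.size", 256L * 1024 * 1024);
            FileSystem fs = FileSystem.get(conf);
            fs.copyFromLocalFile(new Path(args[0]), new Path(args[1]));
          }
        }
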
  • GOEKE, MATTHEW (AG/1000) at Jun 20, 2011 at 7:49 pm
    Praveen,

    David is correct, but we might need to use different terminology. Hadoop looks at the number of input splits, and if a file is not splittable then yes, it will only use 1 mapper for it. In the case of most files (which are splittable), Hadoop will break them into multiple splits and run one map task over each one. What you need to look at is the number of concurrent mappers / reducers that you have defined per node, so that you do not cause context switches due to too many processes per core. Take a look in mapred-site.xml and you will see a default defined (if not, take a look at the defaults documented for your version).

    Matt
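
    (Concretely, the cluster-wide capacity is roughly nodes times the per-node slot settings; a small sketch, assuming the stock 0.20-era property names, with the node count hard-coded to the 5 slaves from this thread:)

        import org.apache.hadoop.conf.Configuration;

        public class SlotCapacity {
          public static void main(String[] args) {
            Configuration conf = new Configuration();
            conf.addResource("mapred-site.xml"); // cluster overrides, if on the classpath
            int mapSlotsPerNode    = conf.getInt("mapred.tasktracker.map.tasks.maximum", 2);
            int reduceSlotsPerNode = conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2);
            int slaveNodes = 5;
            System.out.println("concurrent map tasks:    " + slaveNodes * mapSlotsPerNode);    // 10 with defaults
            System.out.println("concurrent reduce tasks: " + slaveNodes * reduceSlotsPerNode); // 10 with defaults
          }
        }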

  • Praveen Peddi at Jun 20, 2011 at 8:13 pm
    Hi David,
    Thanks for the response. I didn't specify anything for the no. of concurrent mappers, but I do see that it shows as 10 on port 50030 (for the 5-node cluster). So I believe Hadoop is defaulting to the no. of cores in the cluster, which is 10. That is why I want to set the no. of map tasks to the same as the no. of cores, so that they match the max concurrent map tasks.

    Praveen

  • GOEKE, MATTHEW (AG/1000) at Jun 20, 2011 at 8:36 pm
    Praveen,

    We use CDH3, so the link that I refer to is http://hadoop.apache.org/common/docs/r0.20.2/mapred-default.html. The reason why it is defaulting to 2 per node is not because it looks at the number of cores, but because mapred.tasktracker.map.tasks.maximum is set to 2 by default. There is a wealth of information in each of the default config references, and it might take some experimentation before you are happy with the result. Remember that you can easily overfit your parameters to a specific data set and then have horrible performance with others, so be careful when making selective changes. That being said, here are some possible ways to get it to work for you:

    - Raise your block size to something higher than your input file size and then reimport the files into HDFS
    - Change mapred.jobtracker.maxtasks.per.job to 10
    - Set up a custom record reader / input format for that data so that it doesn't split the files (see the sketch below)
    - The list goes on...

    If you are suffering a significant performance setback from the automatic splitting of files, then there is possibly another culprit behind the scenes, but those options should give you a starting point.

    HTH,
    Matt
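
    (For the "don't split it" option above: the usual lever lives in the input format rather than the record reader itself; a rough sketch against the 0.20-era mapred API:)

        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.mapred.TextInputFormat;

        // Each input file becomes exactly one split, and therefore one map task,
        // regardless of its size or the HDFS block size.
        public class WholeFileTextInputFormat extends TextInputFormat {
          @Override
          protected boolean isSplitable(FileSystem fs, Path file) {
            return false;
          }
        }

    You would then register it on the job with conf.setInputFormat(WholeFileTextInputFormat.class).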

  • Sridhar basam at Jun 20, 2011 at 8:37 pm

    The maximum number of simultaneous mappers and reducers across the cluster
    is determined by your config. You are seeing a capacity of 10 mappers since
    you probably have a default mapred-site.xml file: the default maximum number
    of map tasks per tasktracker is 2.

    http://hadoop.apache.org/common/docs/r0.20.0/mapred-default.html

    Sridhar
  • Allen Wittenauer at Jun 22, 2011 at 5:06 pm

    http://wiki.apache.org/hadoop/FAQ#How_do_I_limit_.28or_increase.29_the_number_of_concurrent_tasks_a_job_may_have_running_total_at_a_time.3F
  • Sudharsan Sampath at Jun 23, 2011 at 5:41 am
    Hi Allen,

    The number of map tasks is driven by the number of splits of the input
    provided. The configuration for the number of map tasks is only a hint and
    will be honored only if the value is more than the number of input splits.
    If it is less, then the number of splits takes precedence.

    But as a hack/workaround you can increase the block size of your input
    (only for these input files, overriding the default HDFS configuration) to
    a higher value to achieve the desired number of maps.

    Thanks
    Sudhan S
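
    (The arithmetic behind "only a hint" looks roughly like the sketch below, which mirrors, in simplified form, what the 0.20-era FileInputFormat does when planning splits; the sizes are made-up examples:)

        // Simplified model of org.apache.hadoop.mapred.FileInputFormat split planning
        // (ignores per-file remainders and unsplittable/compressed inputs).
        public class SplitMath {
          public static void main(String[] args) {
            long totalSize = 10L * 1024 * 1024 * 1024; // e.g. 10 GB of input
            long blockSize = 64L * 1024 * 1024;        // block size the files were stored with
            long minSize   = 1;                        // mapred.min.split.size default
            int  mapHint   = 10;                       // the "mapred.map.tasks" hint

            long goalSize  = totalSize / Math.max(mapHint, 1);                // what the hint asks for
            long splitSize = Math.max(minSize, Math.min(goalSize, blockSize));
            System.out.println(totalSize / splitSize + " map tasks");         // 160: the hint loses to the block size
          }
        }

    With a larger block size the same arithmetic yields fewer splits, which is the workaround described above.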
