Hi All,

I am new to Hadoop and I seem to be having a problem setting the number of
map tasks per node. I have an application that needs to load a significant
amount of data (about 1 GB) into memory to use when mapping data read from
files. I store this in a singleton and access it from my mapper. To do
this, I need to have exactly one map task running on a node at any one time,
or the memory requirements will far exceed my RAM. I am generating my own
splits using an InputFormat class. This gives me roughly 10 splits per node,
and I need the corresponding map tasks to run sequentially in the same
child JVM so that each map run does not have to reinitialize the data.
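For reference, the singleton approach described above can be sketched in plain Java along these lines (the class and method names here are hypothetical, not from the original post; the real application would load its ~1 GB dataset in the constructor and call getInstance() from the mapper's configure() or map() method):

```java
// Hypothetical sketch: a lazily initialized singleton holding the shared
// lookup data. Every map task running in the same child JVM reuses the
// already-loaded copy; a fresh JVM pays the load cost exactly once.
import java.util.HashMap;
import java.util.Map;

public class SharedData {
    private static SharedData instance;
    private final Map<String, String> lookup;

    private SharedData() {
        // In the real application this would read ~1 GB from disk/HDFS.
        lookup = new HashMap<String, String>();
        lookup.put("example-key", "example-value");
    }

    // synchronized so that, if tasks ever run concurrently in one JVM,
    // the data is still loaded only once
    public static synchronized SharedData getInstance() {
        if (instance == null) {
            instance = new SharedData();
        }
        return instance;
    }

    public String lookup(String key) {
        return lookup.get(key);
    }
}
```

A mapper would then call SharedData.getInstance() instead of loading the data itself; the catch, as the question notes, is that this only helps if successive map tasks actually share a JVM.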

I have tried the following in a single-node configuration with 2 splits:
- setting setNumMapTasks on the JobConf to 1, but Hadoop still creates 2
map tasks
- setting the mapred.tasktracker.tasks.maximum property to 1 - same result,
2 map tasks
- setting the mapred.map.tasks property to 1 - same result, 2 map tasks
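[Editor's note on the attempts above: in Hadoop of this era, setNumMapTasks()/mapred.map.tasks is only a hint - the actual number of map tasks is driven by the number of splits the InputFormat returns - and mapred.tasktracker.tasks.maximum is read by each TaskTracker daemon at startup, so setting it in a job's JobConf has no effect. A sketch of the daemon-side setting, placed in hadoop-site.xml on each worker node before restarting the TaskTracker:]

```xml
<!-- hadoop-site.xml on each worker node; the TaskTracker reads this at
     startup, so it cannot be overridden per-job from the JobConf. -->
<property>
  <name>mapred.tasktracker.tasks.maximum</name>
  <value>1</value>
  <description>Run at most one task at a time on this TaskTracker.</description>
</property>
```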

I have yet to try this in a multi-node configuration. My target is
20 AWS EC2 instances.

Can you please let me know what I should be doing, or looking at, to make
sure that I have at most 1 map task per node? Also, how can I have multiple
splits mapped in sequence within the same child JVM by different map
tasks?

Thanks in advance,
Dev


  • Derek Gottfrid at Sep 11, 2007 at 4:29 pm
You can specify the number of tasks per node in your config. I don't
think the second thing you mention is possible.

    hadoop + ec2 has worked very well for me. good luck.

    derek

    On 9/8/07, Devajyoti Sarkar wrote:

Discussion Overview
group: common-user
categories: hadoop
posted: Sep 8, '07 at 4:35p
active: Sep 11, '07 at 4:29p
posts: 2
users: 2
website: hadoop.apache.org...
irc: #hadoop
