Hi All,
I am new to hadoop and I seem to be having a problem setting the number of
map tasks per node. I have an application that needs to load a significant
amount of data (about 1 GB) in memory to use in mapping data read from
files. I store this in a singleton and access it from my mapper. In order to
do this, I need exactly one map task running on a node at any one time, or
the memory requirements will far exceed my RAM. I am generating my own
splits using an InputFormat subclass. This gives me roughly 10 splits per
node, and I need the corresponding map tasks to run sequentially in the same
child JVM so that each map run does not have to reinitialize the data.
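For context, the singleton looks roughly like this (a simplified sketch; the class and method names are illustrative, not my actual code, and the real constructor loads ~1 GB rather than a toy map):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the lazily initialized singleton described above: the lookup
// data is loaded once per JVM and then reused by every map task that runs
// in that JVM. Names here are illustrative only.
public class LookupData {
    private static LookupData instance;
    private final Map<String, String> table = new HashMap<String, String>();

    private LookupData() {
        // In the real job this would load ~1 GB of reference data;
        // the point is that the expensive load happens exactly once.
        table.put("example-key", "example-value");
    }

    public static synchronized LookupData get() {
        if (instance == null) {
            instance = new LookupData(); // expensive load, once per JVM
        }
        return instance;
    }

    public String lookup(String key) {
        return table.get(key);
    }
}
```

Each mapper calls `LookupData.get()` in its setup, so as long as consecutive map tasks share a child JVM, only the first task pays the load cost.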
I have tried the following in a single-node configuration with 2 splits:
- setting setNumMapTasks(1) on the JobConf, but Hadoop still seems to create
2 map tasks
- setting the mapred.tasktracker.tasks.maximum property to 1 - same result,
2 map tasks
- setting the mapred.map.tasks property to 1 - same result, 2 map tasks
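In case it helps, here is roughly what my configuration attempt looks like. I am not sure I have the property names right, which may be part of the problem: I notice the docs also list a per-type variant, mapred.tasktracker.map.tasks.maximum, and a mapred.job.reuse.jvm.num.tasks property that, if I understand it correctly, controls child JVM reuse; I have not yet tried either.

```xml
<!-- Sketch of the mapred-site.xml entries I believe are relevant;
     property names are my best understanding and may be wrong. -->
<configuration>
  <!-- limit each TaskTracker to one concurrent map task -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>1</value>
  </property>
  <!-- reuse the child JVM for an unlimited number of tasks of the same
       job, so the singleton would survive between map tasks (untried) -->
  <property>
    <name>mapred.job.reuse.jvm.num.tasks</name>
    <value>-1</value>
  </property>
</configuration>
```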
I have yet to try it in a multi-node configuration; my target will be
20 AWS EC2 instances.
Can you please let me know what I should be doing or looking at to make sure
that at most 1 map task runs per node? Also, how can I have multiple splits
mapped in sequence by different map tasks within the same child JVM?
Thanks in advance,
Dev