I am assuming you mean the web interface of the JobTracker, right? What I
see there is appended at the end of this email. Is there supposed to be a
counter that gives the number of data-local tasks? One obvious way to
find this would be to look at the location of the input split of each
mapper and check whether it is the same node the map task ran on.
Do I need to enable some config parameter to actually see the counter that
shows the number of data-local tasks?
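The split-location comparison described above could be sketched roughly as follows. This is only an illustration in Python, not Hadoop code: the `TaskPlacement` record, the task IDs and the host names are all hypothetical, and it assumes the host each map task ran on and the hosts storing its input split have already been collected by some other means.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskPlacement:
    """Hypothetical record: where a map task ran and where its split is stored."""
    task_id: str
    task_host: str          # node that executed the map task
    split_hosts: List[str]  # nodes holding a replica of the input split


def is_data_local(p: TaskPlacement) -> bool:
    # A task counts as data-local if it ran on a node that stores its split.
    return p.task_host in p.split_hosts


def count_data_local(placements: List[TaskPlacement]) -> int:
    # Number of tasks whose execution host also holds their input split.
    return sum(1 for p in placements if is_data_local(p))


# Made-up example data (replication factor 1, so one host per split).
placements = [
    TaskPlacement("m_000000", "node01", ["node01"]),
    TaskPlacement("m_000001", "node02", ["node07"]),
    TaskPlacement("m_000002", "node03", ["node03"]),
]
local = count_data_local(placements)
print(f"{local} of {len(placements)} map tasks were data-local")
```

With a replication factor of 1 each split has a single host, so the check reduces to comparing two host names per task.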
Kind  % Complete  Num Tasks  Pending  Running  Complete  Killed  Failed/Killed
                  1600       0        0        1600      0       3 / 46
                  20         0        0        20        0       0 / 1

Job Counters
    Launched reduce tasks      0
    Rack-local map tasks       0
    Launched map tasks         0
FileSystemCounters
    FILE_BYTES_READ            215,256,891,609
Map-Reduce Framework
    Reduce input groups        0
    Combine output records     0
    Map input records          20,443,005
    Reduce shuffle bytes       0
    Reduce output records      0
    Spilled Records            40,886,010
    Map output bytes           214,913,316,171
    Map input bytes            215,457,082,591
    Map output records         20,443,005
    Combine input records      0
    Reduce input records       0
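As a sanity check on the counters above, here is a small Python sketch of the arithmetic. The counter values are copied from the listing; treating "128MB" as 128 MiB is an assumption, and the variable names are made up for the example.

```python
# Values copied from the counters listed above.
MAP_INPUT_BYTES = 215_457_082_591
MAP_OUTPUT_RECORDS = 20_443_005
SPILLED_RECORDS = 40_886_010
NUM_MAPS = 1600

# Assumption: the 128MB block size mentioned in the thread means 128 MiB.
BLOCK_SIZE = 128 * 1024 * 1024

# Average input handled by each map task, compared with one HDFS block.
bytes_per_map = MAP_INPUT_BYTES / NUM_MAPS
ratio_to_block = bytes_per_map / BLOCK_SIZE

# How many times, on average, each map output record was spilled.
spills_per_record = SPILLED_RECORDS / MAP_OUTPUT_RECORDS

print(f"average input per map: {bytes_per_map / 2**20:.1f} MiB")
print(f"ratio to block size:   {ratio_to_block:.3f}")
print(f"spills per record:     {spills_per_record:.1f}")
```

Under that assumption each map reads very close to one block, and the spilled-record count is exactly twice the map output record count.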
On Tue, Jul 12, 2011 at 2:43 PM, Harsh J wrote:
You can see the counters for data-local vs. non-data-local tasks in the job itself.
On Tue, Jul 12, 2011 at 6:36 PM, Virajith Jalaparti wrote:
How do I find the number of data-local map tasks that are launched? I
checked the log files but didn't see any information about this. All the map
tasks are rack-local since I am running the job using just a single rack.
From the completion time per map (comparing it to the case where I have
1Gbps of bandwidth between the nodes, i.e. the case where network bandwidth
is not a bottleneck), I saw that more than 90% of the maps are actually
reading data over the network.
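To show why per-map completion time separates the two cases so clearly, here is a rough Python sketch of the time needed to pull one block over each link. It assumes a 128 MiB block, decimal link rates (50 Mbps = 50e6 bits/s), and no protocol or disk overhead; the function name is made up for the example.

```python
# Assumption: one 128MB HDFS block is 128 MiB.
BLOCK_BYTES = 128 * 1024 * 1024


def transfer_seconds(num_bytes: int, link_bits_per_sec: float) -> float:
    """Idealised transfer time: payload bits divided by link rate."""
    return num_bytes * 8 / link_bits_per_sec


slow = transfer_seconds(BLOCK_BYTES, 50e6)  # 50 Mbps link
fast = transfer_seconds(BLOCK_BYTES, 1e9)   # 1 Gbps link

print(f"50 Mbps: {slow:.1f} s per block")
print(f"1 Gbps:  {fast:.1f} s per block")
print(f"slowdown factor: {slow / fast:.0f}x")
```

Under these assumptions a remote read costs about 21 seconds per block at 50 Mbps versus about 1 second at 1 Gbps, so a map that reads its block over the network stands out in the per-map timings.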
I understand that some maps may be launched as non-data-local tasks,
but I am surprised that around 90% of the maps are
actually running as non-data-local tasks.
I have not measured how much bandwidth was being used but I think the whole
50Mbps is being used.
On Tue, Jul 12, 2011 at 1:55 PM, Harsh J wrote:
How much bandwidth did you see being utilized? What was the count
of tasks launched as data-local map tasks versus rack-local
ones?
A little bit of edge record data is always read over the network, but that
is insignificant compared to the amount of data read locally (a
whole block size, if available).
On Tue, Jul 12, 2011 at 6:15 PM, Virajith Jalaparti wrote:
I was trying to run the Sort example in Hadoop-0.20.2 over 200GB of
data using a 20-node cluster. HDFS is configured to use a 128MB block
size (so 1600 maps are created) and a replication factor of 1 is being
used.
All 20 nodes are also HDFS datanodes. I was using a bandwidth of
50Mbps between each of the nodes (this was configured using linux
see that around 90% of the map tasks are reading data over the network
most of the map tasks are not being scheduled at the nodes where the data to
be processed by them is located.
My understanding was that Hadoop tries to schedule as many data-local map tasks
as possible. But in this situation, this does not seem to happen. Any ideas
why this is happening? And is there a way to actually configure Hadoop to
ensure the maximum possible node locality?
Any help regarding this is very much appreciated.