calculated; as far as I know, the number is primarily based on the number of
splits. Say I have 5 splits and 10 TaskTrackers running in the cluster; I
will then have 5 map tasks running in my MR job, right?
But what I found is that sometimes, when the input is huge (5 GB), I still
have 5 splits, which is intentional, yet more than 40 map tasks end up
running. How does this happen? If I compress the huge input down to a
smaller size, the mapper count goes back to 5. Is something tricky happening
here related to the DFS block locations of the input?
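For what it's worth, 5 GB divided into 128 MB DFS blocks is exactly 40
blocks, which matches the map task count I see. Below is a minimal,
standalone sketch of the split arithmetic I believe the old mapred
FileInputFormat uses (the formula is from memory, and the 128 MB block size
and requested split count of 5 are assumptions matching my scenario). The
point is that the requested number of splits only sets a goal size, which is
then capped at the DFS block size:

    public class SplitMath {
        // Mirrors FileInputFormat.computeSplitSize(goalSize, minSize, blockSize)
        // as I understand it: the per-split goal is capped at the block size.
        static long computeSplitSize(long goalSize, long minSize, long blockSize) {
            return Math.max(minSize, Math.min(goalSize, blockSize));
        }

        public static void main(String[] args) {
            long totalSize = 5L * 1024 * 1024 * 1024; // 5 GB of input
            long requestedSplits = 5;                 // what the job asks for
            long minSize = 1;                         // mapred.min.split.size default
            long blockSize = 128L * 1024 * 1024;      // assumed DFS block size

            long goalSize = totalSize / requestedSplits;                     // 1 GB goal
            long splitSize = computeSplitSize(goalSize, minSize, blockSize); // capped at 128 MB
            long actualSplits = (totalSize + splitSize - 1) / splitSize;     // ceiling division

            // Prints "requested 5 splits, got 40" -- one map task per split.
            System.out.println("requested " + requestedSplits + " splits, got " + actualSplits);
        }
    }

If that is what is going on, it would also explain the compressed case,
since the smaller compressed files fit in fewer blocks.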
BTW, our InputFormat is a custom FileInputFormat that does not split
individual files; we copy each file to DFS, and the file's DFS location is
passed as the input key to the map task.
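To make it concrete, here is a minimal sketch of the kind of thing I mean,
using the old mapred API (WholeFileInputFormat and the inline record reader
are made-up illustrations, not our actual class); the key idea is that
isSplitable() returns false and each map task receives the file's DFS path
as its key:

    import java.io.IOException;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    public class WholeFileInputFormat extends FileInputFormat<Text, NullWritable> {
        // Returning false means each file becomes exactly one split,
        // no matter how many DFS blocks it spans.
        @Override
        protected boolean isSplitable(FileSystem fs, Path filename) {
            return false;
        }

        @Override
        public RecordReader<Text, NullWritable> getRecordReader(
                InputSplit split, JobConf job, Reporter reporter) throws IOException {
            final FileSplit fileSplit = (FileSplit) split;
            return new RecordReader<Text, NullWritable>() {
                private boolean done = false;

                // Emit a single record per file: the DFS path as the key.
                @Override
                public boolean next(Text key, NullWritable value) {
                    if (done) return false;
                    key.set(fileSplit.getPath().toString());
                    done = true;
                    return true;
                }

                @Override public Text createKey() { return new Text(); }
                @Override public NullWritable createValue() { return NullWritable.get(); }
                @Override public long getPos() { return done ? 1 : 0; }
                @Override public void close() {}
                @Override public float getProgress() { return done ? 1.0f : 0.0f; }
            };
        }
    }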