I am trying to understand the effects of increasing the block size or the
minimum split size. If I increase them, each mapper will process more data,
which reduces the number of mappers that are spawned. Since there is an
overhead in starting each mapper, this seems good.
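To make this concrete, here is a minimal sketch of how I understand the split size to be chosen. It assumes the rule used by the newer `FileInputFormat` API, where the split size is the block size clamped between the minimum and maximum split size. The function name and the sample values are only for illustration and are not taken from my actual job.

```python
MB = 1024 * 1024
NO_UPPER_LIMIT = 2**63 - 1  # stands in for an effectively unlimited max split size


def compute_split_size(block_size, min_size, max_size):
    """Assumed rule: clamp the block size between min and max split size."""
    return max(min_size, min(max_size, block_size))


# A 64 MB block with a tiny minimum split size gives a 64 MB split.
default_split = compute_split_size(64 * MB, 1, NO_UPPER_LIMIT)

# Raising the minimum split size to 256 MB makes each mapper read 256 MB.
raised_split = compute_split_size(64 * MB, 256 * MB, NO_UPPER_LIMIT)

print(default_split // MB, raised_split // MB)
```

Under this assumption, raising the minimum split size above the block size has the same effect on the mapper count as raising the block size itself.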
However, if I increase these values too much, what negative effects will
come up? In other words, how do I compute the best number of mappers to
start for processing a given amount of data on a cluster?
For calculations, let us assume 100 GB of data and 4 machines (dual core each).
Also, if I set the JVM reuse flag to -1, will it make a difference?
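For the numbers above, this is the rough arithmetic I have in mind, written as a small sketch. It assumes one mapper per split, 100 GB meaning 100 * 1024^3 bytes, and one concurrent map task per core (4 machines x 2 cores = 8 slots). Those are my assumptions for the calculation, not measured cluster settings, and the helper names are made up for this example.

```python
GB = 1024 ** 3
MB = 1024 ** 2

DATA_BYTES = 100 * GB
CONCURRENT_SLOTS = 4 * 2  # assumption: one map task per core


def ceil_div(a, b):
    """Integer ceiling division, avoiding floating point."""
    return (a + b - 1) // b


def estimate_mappers(data_bytes, split_bytes):
    """Assumes one mapper per split."""
    return ceil_div(data_bytes, split_bytes)


def estimate_waves(num_mappers, concurrent_slots):
    """Number of rounds needed if every slot runs one mapper at a time."""
    return ceil_div(num_mappers, concurrent_slots)


def summarize(split_sizes_mb):
    """Return (split size in MB, mappers, waves) for each candidate split size."""
    rows = []
    for size_mb in split_sizes_mb:
        mappers = estimate_mappers(DATA_BYTES, size_mb * MB)
        waves = estimate_waves(mappers, CONCURRENT_SLOTS)
        rows.append((size_mb, mappers, waves))
    return rows


for size_mb, mappers, waves in summarize([64, 128, 256, 512, 1024]):
    print(f"split={size_mb} MB  mappers={mappers}  waves={waves}")
```

With these assumptions, a 64 MB split gives 1600 mappers in 200 waves, while a 1024 MB split gives 100 mappers in 13 waves. What I cannot tell from this arithmetic alone is where the per-mapper startup overhead stops being the dominant cost.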