The input for an M/R job consists of multiple files, each smaller than a
block, and the number of map tasks ends up being the number of files.
I would like to be able to control the number of maps so that one map task
processes multiple files (for example, gluing files together up to a block
size).
I don't want to use a preliminary M/R job to consolidate the files, as that
is expensive (extra IO ops: read/write, then read/write again).
I don't want to write a COPY program either, as that is still expensive
(extra IO ops: read/write).
I know the files are not that big, but this is the common case in my
system, so either approach would increase the number of IO operations
significantly.
I'd rather have a custom InputSplit that takes multiple files up to a given
size; then I don't incur any extra IO ops.
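To make the idea concrete, here is a minimal sketch (not Hadoop API; the class and method names are made up for illustration) of the grouping logic I have in mind: greedily walk the input files and start a new group whenever adding the next file would push the group past the block size. Each resulting group would then back one "multi-file" split.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: glue files into groups whose total length stays
// within the block size, so each group could feed a single map task.
public class FileGlue {

    // Greedy first-fit-in-order grouping: start a new group whenever
    // adding the next file's length would exceed blockSize.
    public static List<List<Long>> group(long[] fileLengths, long blockSize) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long used = 0;
        for (long len : fileLengths) {
            if (!current.isEmpty() && used + len > blockSize) {
                groups.add(current);
                current = new ArrayList<>();
                used = 0;
            }
            current.add(len);
            used += len;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Five small files against a (toy) block size of 64:
        long[] lengths = {30, 40, 50, 20, 10};
        List<List<Long>> groups = group(lengths, 64);
        // Fewer map tasks than input files.
        System.out.println(groups.size());
        System.out.println(groups);
    }
}
```

The real split would of course carry (path, start, length) triples per file rather than bare lengths, but the packing decision is the same.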
Looking at InputSplit, the interfaces do not seem designed to support such
a thing (consolidating multiple files into a single split).
Am I missing something in the APIs? Or do you have another suggestion on
how to achieve the desired behavior?
Thanks.
Alejandro