at Mar 17, 2011 at 2:05 pm
On Thu, Mar 17, 2011 at 6:40 PM, Lior Schachter wrote:
If I have big gzip files (>> block size), will M/R split a single
file into multiple blocks and send them to different mappers?
The behavior I currently see is that a map is still opened per file (and not
per block).
Yes, this is true. This is the current behavior with GZip files (since
they cannot be decompressed starting from an arbitrary offset, they
can't be split). I had somehow managed to ignore the GZIP part of your
question in the previous thread!
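To illustrate why a plain gzip stream can't be split (this is just a small Python sketch of the underlying format limitation, not Hadoop code): a mapper assigned a later HDFS block would have to start decompressing mid-stream, but DEFLATE needs all the preceding stream state, so that fails.

```python
import gzip
import zlib

# Compress some sample data as a single, plain gzip stream.
data = b"x" * 100000
blob = gzip.compress(data)

# Decompressing from the very start works fine.
assert gzip.decompress(blob) == data

# But starting from an arbitrary offset -- as a mapper handed a later
# HDFS block would have to -- fails, because the bytes there are not a
# valid gzip stream start and depend on earlier decompressor state.
try:
    zlib.decompressobj(wbits=31).decompress(blob[len(blob) // 2:])
    print("unexpectedly succeeded")
except zlib.error as e:
    print("mid-stream decompress failed:", e)
```

This is exactly why Hadoop must feed an entire gzip file to a single record reader today.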
But still, ~60 files totaling 15 GB works out to roughly 250 MB per
file. And seeing how they can't really be split out right now, it
would be good to have each file occupy only a single block. Perhaps
for these files alone you could use a block size of 256 MB or more,
thereby making these file reads more local for your record readers?
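As a back-of-the-envelope sketch of the block-size suggestion (the file and block sizes below are assumptions based on the numbers in the thread, not measured values):

```python
# Illustrative sizes: ~60 gzip files, 15 GB total => ~0.25 GiB each.
GB = 1024 ** 3
file_size = int(0.25 * GB)           # assumed average gzip file size
default_block = 128 * 1024 * 1024    # an example default HDFS block size

def blocks(size, block_size):
    """Number of HDFS blocks a file of `size` bytes occupies."""
    return -(-size // block_size)    # ceiling division

# With the smaller block size, each gzip file spans multiple blocks,
# yet still gets only ONE map task -- so most reads are remote.
print(blocks(file_size, default_block))   # -> 2

# With a per-file block size >= the file size, the single map task
# reads one fully local block.
print(blocks(file_size, file_size))       # -> 1
```

The point is only that matching the block size to the (unsplittable) file size keeps each map's input on one datanode.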
In the future, HADOOP-7076 plans to add a pseudo-splitting approach
for plain GZIP files, though. 'Concatenated' GZIP files could be split
(HADOOP-6835) across mappers as well (as demonstrated in PIG-42).
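A small Python sketch of why concatenated gzip is split-friendly (illustrative only; it doesn't reproduce the HADOOP-6835 implementation): the format permits multiple independent members in one file, and each member can be decompressed on its own.

```python
import gzip

# Two independently compressed gzip members, concatenated into one file.
part1 = gzip.compress(b"first record set\n")
part2 = gzip.compress(b"second record set\n")
concatenated = part1 + part2

# A multi-member-aware decompressor recovers the whole file...
assert gzip.decompress(concatenated) == b"first record set\nsecond record set\n"

# ...and, crucially for splitting, each member also decompresses on its
# own, so different mappers could each start at a member boundary.
assert gzip.decompress(part2) == b"second record set\n"
print("ok")
```

The hard part a splitter must still solve is locating member boundaries inside an HDFS block, which is what the cited JIRAs address.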