There is a good tutorial on the wiki about this.
Your problem here is that you have conflated two concepts. The first is the
splitting of files into blocks for storage purposes. This has nothing to do
with what data a program can read, any more than splitting a file into blocks
on a disk in a conventional file system limits what you can read. The second
is the splitting that the input format does in order to allow parallelism.
Basically, the file block splits have nothing to do with what data the mapper
can read; they only determine which data will be local.
What the text input format does is start reading at a specified point,
ignoring data until it finds the beginning of a line (which might be at the
starting point). Then it passes lines to the mapper until it finishes a line
that ends at or after the specified end-point of the split. That last line
will include the data that gets ignored by the reader handling the next
split in the file.
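To make that rule concrete, here is a minimal sketch in Python (not Hadoop's actual Java implementation, just the idea): each reader skips the partial line it lands in, unless it starts at offset 0, and reads past its end offset until the current line is complete. The function name and split arithmetic are my own for illustration.

```python
def split_lines(data: bytes, start: int, end: int) -> list[bytes]:
    """Return the whole lines belonging to the byte-range split [start, end)."""
    pos = start
    if start != 0:
        # We landed mid-line (or exactly at a line start); either way,
        # discard up to and including the first newline -- the reader for
        # the previous split will have read that line to completion.
        nl = data.find(b"\n", start)
        if nl == -1:
            return []  # this split falls entirely inside one long line
        pos = nl + 1
    lines = []
    # Emit lines as long as the line *starts* at or before the split's
    # end-point; the final line may run past the boundary.
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        if nl == -1:
            lines.append(data[pos:])
            pos = len(data)
        else:
            lines.append(data[pos:nl])
            pos = nl + 1
    return lines
```

Running every split of a file through this and concatenating the results gives each line exactly once, no matter where the split boundaries fall, which is why a mapper never sees a partial line.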
Does that help?
On 5/26/08 4:32 AM, "email@example.com" wrote:
I am considering using Hadoop map/reduce but have some difficulties getting
my head around the basic concepts of chunk distribution.
How does the 'distributed' processing of large files account for the fact that
some files cannot be split at (64 MB) boundaries?
For example, large text files (many gigs) that need to be processed line by
line: splitting a line mid-way and processing an incomplete partial chunk on
some worker may be a serious error, depending on the application.
Can somebody please tell me where I am wrong in my thinking here? Links to
relevant documentation passages/tutorials are welcome too. Cheers.