HDFS splits files into chunks regardless of their content.
This is what I have understood so far:
an InputFormat object examines the file and returns a list of InputSplit
objects (an InputSplit usually contains the boundaries of the file
section your map task will read). Moreover, InputFormat has a method
that returns a RecordReader, which is able to read and interpret your
InputSplit.
Therefore, when you have a file for which none of the existing
InputFormat implementations works, you can:
- if you are able to start reading from an arbitrary offset in the file
(there are sync points in the file, such as \n or spaces in a text file),
you can just define your custom RecordReader.
- if your file can only be split into fixed-length chunks, you have to
subclass InputFormat and override its getSplits method, so that every
split supplied to the job contains only whole records.
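
To make the second case concrete, here is a minimal sketch of the
arithmetic a getSplits override would need for fixed-length records. It
uses plain Java and no Hadoop classes; the class FixedLengthSplits, the
Split holder and the parameter names are my own assumptions for
illustration, not part of the Hadoop API. The only idea is to round the
split size down to a multiple of the record length, so no record
straddles two splits.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: NOT a Hadoop class, just the boundary arithmetic.
public class FixedLengthSplits {

    /** Byte range of one split (start offset and length), similar in spirit to what a file split carries. */
    public static final class Split {
        public final long start;
        public final long length;

        Split(long start, long length) {
            this.start = start;
            this.length = length;
        }
    }

    /**
     * Cuts a file of fileLength bytes into splits whose size is the largest
     * multiple of recordLength that fits in targetSplitSize.
     */
    public static List<Split> computeSplits(long fileLength, int recordLength, long targetSplitSize) {
        if (recordLength <= 0 || targetSplitSize < recordLength) {
            throw new IllegalArgumentException("split size must hold at least one record");
        }
        // Round down so that every split boundary falls on a record boundary.
        long splitSize = (targetSplitSize / recordLength) * recordLength;

        List<Split> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            // The last split may be shorter than the others.
            splits.add(new Split(start, Math.min(splitSize, fileLength - start)));
        }
        return splits;
    }
}
```

In a real job this calculation would sit inside the getSplits method of
your InputFormat subclass, and each (start, length) pair would be turned
into an InputSplit.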
-------- Original Message --------
Subject: Re: Manually splitting files in blocks
From: Yuri K. <email@example.com>
Date: Fri Mar 26 2010 15:49:09 GMT+0100 (CET)
ok so far so good. thanks for the reply. i'm trying to implement a custom
file input format. but i can set it only in the job configuration:
how do i make hadoop implement the file format, or the custom file split
when i upload new files to the hdfs? do i need a custom upload interface for
that or is there a hadoop config option for that?
Yuri K. wrote:
i'm trying to find out how and where hadoop splits a file into blocks and
decides to send them to the datanodes.
My specific problem:
i have two types of data files.
One large file is used as a database-file where information is sorted
... lots of data 1
... lots of data 2
and so on.
and the other smaller files contain raw data and are to be compared to a
datarow in the large file.
so my question is: is it possible to manually set how hadoop splits the
large data file into blocks?
obviously i want each begin-end section to be in one block to optimize
performance. that way i can replicate the smaller files on each node, so
the nodes can work independently of each other.
You should create a CustomInputSplit and a CustomRecordReader (the
reader should look for the start and end tags).
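
A rough sketch of what such a reader has to do, in plain Java without
any Hadoop classes: given the byte range of a split, return every record
whose start tag begins inside the range, and read past the end of the
range if needed to finish the last record. The class TaggedRecordScanner
and the tags "<r>" and "</r>" in the usage note are hypothetical,
chosen only for illustration; the real tags depend on your file. The
data is held in a String here for simplicity, whereas a real
RecordReader would read from a stream.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: NOT a Hadoop class, just the record-boundary logic.
public class TaggedRecordScanner {

    /**
     * Returns the records that belong to the range [splitStart, splitEnd).
     * A record belongs to the split in which its start tag begins, even if
     * its end tag lies beyond splitEnd.
     */
    public static List<String> recordsForSplit(String data, int splitStart, int splitEnd,
                                               String startTag, String endTag) {
        List<String> records = new ArrayList<>();
        // Skip forward to the first start tag at or after the split start.
        int pos = data.indexOf(startTag, splitStart);
        while (pos >= 0 && pos < splitEnd) {
            int close = data.indexOf(endTag, pos + startTag.length());
            if (close < 0) {
                break; // unterminated record at end of data: ignore it
            }
            int recordEnd = close + endTag.length();
            records.add(data.substring(pos, recordEnd));
            pos = data.indexOf(startTag, recordEnd);
        }
        return records;
    }
}
```

With the data "<r>a</r><r>bb</r><r>c</r>" and split ranges [0,10) and
[10,25), the first call returns the first two records (the second one
crosses the split boundary) and the second call returns only the third,
so each record is processed exactly once no matter where the block
boundary falls.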