I'm trying to implement my own InputFormat and RecordReader for my
data, and I'm stuck because I can't find enough documentation about the
details.

My input is tightly packed binary data consisting of individual event
packets. Each event packet contains its own length, and the packets are
simply appended to the end of a file. The file must therefore be read
as a stream and cannot be split.

FileInputFormat seems like a reasonable place to start but I
immediately ran into problems and unanswered questions:

1) FileInputFormat.getSplits() returns an InputSplit[] array. If my
input file is 128MB and my HDFS block size is 64MB, will it return one
InputSplit or two InputSplits?

2) If my file is split into two or more filesystem blocks, how will
Hadoop handle reading those blocks? Since the file must be read in
sequence, will Hadoop first copy every block to one machine (if the
blocks aren't already there) and then start the mapper on that machine?
Do I need to handle opening and reading multiple blocks myself, or will
Hadoop provide a simple stream interface that I can use to read the
entire file without worrying about whether the file is larger than the
HDFS block size?

- Juho Mäkinen


  • Owen O'Malley at Sep 15, 2008 at 4:44 pm

    On Sep 15, 2008, at 6:13 AM, Juho Mäkinen wrote:

    1) FileInputFormat.getSplits() returns an InputSplit[] array. If my
    input file is 128MB and my HDFS block size is 64MB, will it return one
    InputSplit or two InputSplits?
    Your InputFormat needs to define:

    protected boolean isSplitable(FileSystem fs, Path filename) {
      return false;
    }

    which tells FileInputFormat.getSplits not to split the files. You will
    end up with a single split for each file.
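
    For illustration, a minimal InputFormat along these lines might look like
    the sketch below, written against the old org.apache.hadoop.mapred API that
    was current at the time. The class name and the LongWritable/BytesWritable
    record types are illustrative assumptions rather than anything from this
    thread, and the EventPacketRecordReader it returns is sketched a bit
    further down.

    import java.io.IOException;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    public class EventPacketInputFormat
        extends FileInputFormat<LongWritable, BytesWritable> {

      // Never split the event files: a split boundary could fall in the
      // middle of a length-prefixed packet.
      protected boolean isSplitable(FileSystem fs, Path filename) {
        return false;
      }

      public RecordReader<LongWritable, BytesWritable> getRecordReader(
          InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new EventPacketRecordReader((FileSplit) split, job);
      }
    }
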
    2) If my file is split into two or more filesystem blocks, how will
    Hadoop handle reading those blocks? Since the file must be read in
    sequence, will Hadoop first copy every block to one machine (if the
    blocks aren't already there) and then start the mapper on that machine?
    Do I need to handle opening and reading multiple blocks myself, or will
    Hadoop provide a simple stream interface that I can use to read the
    entire file without worrying about whether the file is larger than the
    HDFS block size?
    HDFS transparently handles the data motion for you. You can just use
    FileSystem.open(path) and HDFS will pull the file from the closest
    location. It doesn't actually move the block to your local disk, just
    gives it to the application. Basically, you don't need to worry about
    it.
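
    Under the same assumptions, a sketch of the matching RecordReader could
    look like the following. It simply uses FileSystem.open() as described
    above, and it guesses that each packet starts with a 4-byte length prefix
    followed by the payload; adjust the framing to whatever your real packet
    layout is.

    import java.io.IOException;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RecordReader;

    public class EventPacketRecordReader
        implements RecordReader<LongWritable, BytesWritable> {

      private final FSDataInputStream in;
      private final long length;   // total bytes in the (unsplit) file
      private long pos = 0;        // bytes consumed so far

      public EventPacketRecordReader(FileSplit split, JobConf job) throws IOException {
        Path path = split.getPath();
        FileSystem fs = path.getFileSystem(job);
        length = split.getLength();
        in = fs.open(path);        // HDFS streams the underlying blocks transparently
      }

      public boolean next(LongWritable key, BytesWritable value) throws IOException {
        if (pos >= length) {
          return false;            // no more packets in this file
        }
        key.set(pos);              // key: byte offset of the packet in the file
        int packetLength = in.readInt();      // assumed 4-byte length prefix
        byte[] buffer = new byte[packetLength];
        in.readFully(buffer);
        value.set(buffer, 0, packetLength);
        pos += 4 + packetLength;
        return true;
      }

      public LongWritable createKey() { return new LongWritable(); }

      public BytesWritable createValue() { return new BytesWritable(); }

      public long getPos() { return pos; }

      public float getProgress() {
        return length == 0 ? 1.0f : (float) pos / length;
      }

      public void close() throws IOException { in.close(); }
    }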

    There are two downsides to unsplittable files. The first is that if
    they are large, the map times can be very long. The second is that the
    map/reduce scheduler tries to place tasks close to the data, which it
    can't do very well if the data spans multiple blocks. Of course, if the
    data isn't splittable, you don't have a choice.

    -- Owen
