I need to process a dataset that contains text records of fixed length
in bytes. For example, each record may be 100 bytes in length, with
the first field being the first 10 bytes, the second field being the
second 10 bytes, and so on. There are no newlines in the file. Field
values have been either whitespace-padded or truncated to fit within
the specific locations in these fixed-width records.

Does Hadoop have an InputFormat to support processing of such files?
I looked but couldn't find one.

Of course, I could pre-process the file (outside of Hadoop) to put
newlines at the end of each record, but I'd prefer not to require such
a prep step.

Thanks.


  • Tom White at May 28, 2009 at 1:49 pm
    Hi Stuart,

    There isn't an InputFormat that comes with Hadoop to do this. Rather
    than pre-processing the file, it would be better to implement your own
    InputFormat. Subclass FileInputFormat and provide an implementation of
    getRecordReader() that returns your implementation of RecordReader to
    read fixed width records. In the next() method you would do something
    like:

    byte[] buf = new byte[100];
    // the third argument is the offset into buf, not the file position
    IOUtils.readFully(in, buf, 0, 100);
    pos += 100;

    You would also need to check for the end of the stream. See
    LineRecordReader for some ideas. You'll also have to handle finding
    the start of records for a split, which you can do by looking at the
    offset and seeking to the next multiple of 100.

    If the RecordReader was a RecordReader<NullWritable, BytesWritable>
    (no keys) then it would return each record as a byte array to the
    mapper, which would then break it into fields. Alternatively, you
    could do it in the RecordReader, and use your own type which
    encapsulates the fields for the value.
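
    Putting those pieces together, a rough sketch might look like the
    following (old org.apache.hadoop.mapred API; FixedWidthInputFormat,
    FixedWidthRecordReader, and RECORD_LEN are made-up names, and this is
    an untested starting point rather than a drop-in class):

    import java.io.IOException;
    import org.apache.hadoop.fs.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;

    public class FixedWidthInputFormat
        extends FileInputFormat<NullWritable, BytesWritable> {

      static final int RECORD_LEN = 100;  // record size in bytes

      public RecordReader<NullWritable, BytesWritable> getRecordReader(
          InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new FixedWidthRecordReader((FileSplit) split, job);
      }

      static class FixedWidthRecordReader
          implements RecordReader<NullWritable, BytesWritable> {

        private final FSDataInputStream in;
        private final long start;  // first record boundary at/after split start
        private final long end;    // first byte past this split
        private long pos;

        FixedWidthRecordReader(FileSplit split, JobConf job) throws IOException {
          Path file = split.getPath();
          in = file.getFileSystem(job).open(file);
          // A record that straddles a boundary belongs to the earlier split,
          // so seek forward to the next multiple of RECORD_LEN.
          start = (split.getStart() + RECORD_LEN - 1) / RECORD_LEN * RECORD_LEN;
          end = split.getStart() + split.getLength();
          pos = start;
          in.seek(pos);
        }

        public boolean next(NullWritable key, BytesWritable value)
            throws IOException {
          if (pos >= end) {
            return false;  // the next record starts in a later split
          }
          byte[] buf = new byte[RECORD_LEN];
          // Offset 0 is the offset into buf; readFully throws EOFException
          // if the file ends in the middle of a record.
          IOUtils.readFully(in, buf, 0, RECORD_LEN);
          value.set(buf, 0, RECORD_LEN);
          pos += RECORD_LEN;
          return true;
        }

        public NullWritable createKey() { return NullWritable.get(); }
        public BytesWritable createValue() { return new BytesWritable(); }
        public long getPos() { return pos; }
        public float getProgress() {
          return end == start ? 1.0f
              : Math.min(1.0f, (pos - start) / (float) (end - start));
        }
        public void close() throws IOException { in.close(); }
      }
    }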

    Hope this helps.

    Cheers,
    Tom
    On Thu, May 28, 2009 at 1:15 PM, Stuart White wrote:
    I need to process a dataset that contains text records of fixed length
    in bytes.  For example, each record may be 100 bytes in length, with
    the first field being the first 10 bytes, the second field being the
    second 10 bytes, and so on. There are no newlines in the file. Field
    values have been either whitespace-padded or truncated to fit within
    the specific locations in these fixed-width records.

    Does Hadoop have an InputFormat to support processing of such files?
    I looked but couldn't find one.

    Of course, I could pre-process the file (outside of Hadoop) to put
    newlines at the end of each record, but I'd prefer not to require such
    a prep step.

    Thanks.
  • Owen O'Malley at May 28, 2009 at 2:51 pm

    On May 28, 2009, at 5:15 AM, Stuart White wrote:

    I need to process a dataset that contains text records of fixed length
    in bytes. For example, each record may be 100 bytes in length
    The update to the terasort example has an InputFormat that does
    exactly that. The key is 10 bytes and the value is the next 90 bytes.
    It is pretty easy to write, but I should upload it soon. The output
    types are Text, but they just have the binary data in them.
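
    A rough sketch of the next() logic Owen describes (hedged: this is not
    the actual terasort code, and it assumes `in` is an open
    FSDataInputStream positioned at a record boundary, with `key` and
    `value` being Text instances):

    byte[] buf = new byte[100];
    IOUtils.readFully(in, buf, 0, 100);  // read one whole record
    key.set(buf, 0, 10);                 // first 10 bytes become the key
    value.set(buf, 10, 90);              // remaining 90 bytes become the value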

    -- Owen
  • Stuart White at May 29, 2009 at 1:28 am

    On Thu, May 28, 2009 at 9:50 AM, Owen O'Malley wrote:
    The update to the terasort example has an InputFormat that does exactly
    that. The key is 10 bytes and the value is the next 90 bytes. It is pretty
    easy to write, but I should upload it soon. The output types are Text, but
    they just have the binary data in them.
    Would you mind uploading it or sending it to the list?
  • Yabo-Arber Xu at Jun 2, 2009 at 3:06 am
    I have a follow-up question on this thread: how do we make sure that, at
    the split-generation (getFileSplit) phase, there are no records that
    cross the boundary between different file splits?

    To explain my point better: if each of my records is 100 bytes, could
    there be a case where a record's key is put in the first file split
    while its value is put in the second?

    Best,
    Arber
    On Thu, May 28, 2009 at 10:50 PM, Owen O'Malley wrote:

    On May 28, 2009, at 5:15 AM, Stuart White wrote:

    I need to process a dataset that contains text records of fixed length
    in bytes. For example, each record may be 100 bytes in length
    The update to the terasort example has an InputFormat that does exactly
    that. The key is 10 bytes and the value is the next 90 bytes. It is pretty
    easy to write, but I should upload it soon. The output types are Text, but
    they just have the binary data in them.

    -- Owen
  • Chuck Lam at Jun 2, 2009 at 4:40 am
    Yes, it's entirely possible for part of a record to land in the first
    file split and the rest in the second. It's the job of the RecordReader
    to make sure it always reads an entire record. Given a file split, your
    RecordReader has to be able to skip over the first few bytes to get to
    the first full record (if there's a partial record at the beginning).
    When it reaches the end of the split, if a record starts there but is
    incomplete, it reads the rest of that record from the next split.

    Tom's email earlier in this thread explained some of the details. Like he
    said, look at LineRecordReader for inspiration. The logic for figuring out
    the start of the first full record is in LineRecordReader itself. The
    RecordReader can read the last record (that spans two file splits) without
    any special logic because the Hadoop filesystem abstracts away file split
    boundaries when reading.
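
    To make the boundary arithmetic concrete, here is a toy calculation for
    100-byte records (the split offset is made up):

    long recordLen = 100;
    long splitStart = 33554432;  // hypothetical split start (32 MB), mid-record
    long firstRecord = (splitStart + recordLen - 1) / recordLen * recordLen;
    // firstRecord == 33554500. Bytes 33554432..33554499 are the tail of the
    // record that began at 33554400, which the previous split's reader
    // finishes by reading past its own nominal end.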


    On Mon, Jun 1, 2009 at 8:05 PM, Yabo-Arber Xu wrote:

    I have a follow-up question on this thread: how do we make sure that, at
    the split-generation (getFileSplit) phase, there are no records that
    cross the boundary between different file splits?

    To explain my point better: if each of my records is 100 bytes, could
    there be a case where a record's key is put in the first file split
    while its value is put in the second?

    Best,
    Arber
  • Yabo-Arber Xu at Jun 2, 2009 at 5:32 am
    Thanks for your reply. It clarifies a lot. The part I was not so sure
    about was how to read the last record in a split, but now it seems there
    is no problem, as the filesystem handles that for me. :-)
    On Tue, Jun 2, 2009 at 12:40 PM, Chuck Lam wrote:

    Yes, it's entirely possible for part of a record to land in the first
    file split and the rest in the second. It's the job of the RecordReader
    to make sure it always reads an entire record. Given a file split, your
    RecordReader has to be able to skip over the first few bytes to get to
    the first full record (if there's a partial record at the beginning).
    When it reaches the end of the split, if a record starts there but is
    incomplete, it reads the rest of that record from the next split.

    Tom's email earlier in this thread explained some of the details. Like he
    said, look at LineRecordReader for inspiration. The logic for figuring out
    the start of the first full record is in LineRecordReader itself. The
    RecordReader can read the last record (that spans two file splits) without
    any special logic because the Hadoop filesystem abstracts away file split
    boundaries when reading.

Discussion Overview
group: common-user
categories: hadoop
posted: May 28, '09 at 12:16p
active: Jun 2, '09 at 5:32a
posts: 7
users: 5
website: hadoop.apache.org...
irc: #hadoop
