FAQ
Hello,

I am considering using Hadoop map/reduce but am having some difficulty getting my head around the basic concepts of chunk distribution.

How does the 'distributed' processing of large files account for the fact that some files cannot be split at the (64 MB) block boundary?
For example, large text files (many gigabytes) that need to be processed line by line -- splitting a line midway and processing an incomplete partial chunk on some worker could be a serious error, depending on the application.

Can somebody please tell me where I am wrong in my thinking here? Links to relevant documentation passages/tutorials are welcome too. Cheers.

  • Doug Cutting at May 26, 2008 at 6:08 pm

    koara@atlas.cz wrote:
    How does the 'distributed' processing of large files account for the fact that some files cannot be split at the (64 MB) block boundary?
    For example, large text files (many gigabytes) that need to be processed line by line -- splitting a line midway and processing an incomplete partial chunk on some worker could be a serious error, depending on the application.
    Each split (except the first) contains the lines from the first line
    starting after its start position through the first line ending after its
    end position. So if you have a file with:

    ----------
    a b c
    d e f
    g h i
    j k l
    m n o
    ----------

    And if you split it into three evenly sized chunks, at {0,10}, {10,20} and
    {20,30}, then the first split would contain the first two lines, the
    second the next two, and the final split the last line.

    Doug
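
    As a worked illustration of the rule Doug describes (a minimal plain-Java sketch, not Hadoop's actual LineRecordReader code): for this example the rule amounts to assigning each line to the split whose byte range contains the line's first byte.

    ----------
    public class SplitLineDemo {
        public static void main(String[] args) {
            String file = "a b c\nd e f\ng h i\nj k l\nm n o\n";  // 30 bytes, 6 per line
            long[][] splits = { {0, 10}, {10, 20}, {20, 30} };     // {start, end} offsets

            for (long[] split : splits) {
                System.out.printf("split {%d,%d}:%n", split[0], split[1]);
                long lineStart = 0;
                for (String line : file.split("\n")) {
                    // A line belongs to this split if it begins inside the
                    // split's range, even when it ends past the end offset.
                    if (lineStart >= split[0] && lineStart < split[1]) {
                        System.out.println("  " + line);
                    }
                    lineStart += line.length() + 1;  // +1 for the trailing '\n'
                }
            }
        }
    }
    ----------

    Running it prints the first two lines for split {0,10}, the next two for {10,20}, and only the last line for {20,30}, matching the assignment above.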
  • Andreas Kostyrka at May 27, 2008 at 2:20 pm

    On Monday, 26.05.2008 at 11:32 +0000, koara@atlas.cz wrote:
    Hello,

    I am considering using Hadoop map/reduce but am having some difficulty getting my head around the basic concepts of chunk distribution.

    How does the 'distributed' processing of large files account for the fact that some files cannot be split at the (64 MB) block boundary?
    For example, large text files (many gigabytes) that need to be processed line by line -- splitting a line midway and processing an incomplete partial chunk on some worker could be a serious error, depending on the application.

    Can somebody please tell me where I am wrong in my thinking here? Links to relevant documentation passages/tutorials are welcome too. Cheers.
    Well, when using streaming.jar, it handles this for you.

    But basically, it follows certain rules here; don't hold me to it if these
    are not exactly correct, but it goes something like this, taking \n-delimited
    lines as the "logical units" here:

    Any map task is given an offset and length.

    Any map task reads into the following block to finish its last line.
    Any map task but the first ignores the partial first line.

    Andreas
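
    A rough sketch of the rules Andreas lists, in plain java.io rather than streaming.jar or Hadoop's own record reader (the readSplit method and the use of RandomAccessFile are illustrative choices, and the sketch glosses over the edge case where a split starts exactly on a line boundary):

    ----------
    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.ArrayList;
    import java.util.List;

    public class SplitReaderSketch {
        // Returns the newline-delimited "logical units" owned by one split.
        static List<String> readSplit(RandomAccessFile in, long start, long length)
                throws IOException {
            long end = start + length;
            in.seek(start);
            // Every map task but the first ignores the partial first line:
            // skip forward past the next newline.
            if (start != 0) {
                in.readLine();
            }
            List<String> lines = new ArrayList<>();
            // Keep reading lines that begin before the end of the split; the
            // last one may run past 'end', i.e. into the following block.
            while (in.getFilePointer() < end) {
                String line = in.readLine();
                if (line == null) {
                    break;  // end of file
                }
                lines.add(line);
            }
            return lines;
        }

        public static void main(String[] args) throws IOException {
            // Demo on the 30-byte example file from earlier in the thread.
            java.io.File f = java.io.File.createTempFile("split", ".txt");
            f.deleteOnExit();
            java.nio.file.Files.write(f.toPath(),
                    "a b c\nd e f\ng h i\nj k l\nm n o\n".getBytes("US-ASCII"));
            try (RandomAccessFile in = new RandomAccessFile(f, "r")) {
                System.out.println(readSplit(in, 0, 10));   // [a b c, d e f]
                System.out.println(readSplit(in, 10, 10));  // [g h i, j k l]
                System.out.println(readSplit(in, 20, 10));  // [m n o]
            }
        }
    }
    ----------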
  • Ted Dunning at May 27, 2008 at 5:59 pm
    There is a good tutorial on the wiki about this.

    Your problem here is that you have conflated two concepts. The first is the
    splitting of files into blocks for storage purposes. This has nothing to do
    with what data a program can read any more than splitting a file into blocks
    on a disk in a conventional file system limits what you can read. The
    second is the splitting that the input format does in order to allow
    parallelism. Basically, the file block splits have nothing to do with what
    data the mapper can read; they only determine which data will be local.

    What the text input format does is start reading at a specified point,
    ignoring data until it finds the beginning of a line (which might be at the
    starting point). Then it passes lines to the mapper until a line ends
    AFTER the specified end-point of the split. That last line will include
    data that is ignored by the reader handling the next split in the file.

    Does that help?

    On 5/26/08 4:32 AM, "koara@atlas.cz" wrote:

    Hello,

    I am considering using Hadoop map/reduce but am having some difficulty getting
    my head around the basic concepts of chunk distribution.

    How does the 'distributed' processing of large files account for the fact that
    some files cannot be split at the (64 MB) block boundary?
    For example, large text files (many gigabytes) that need to be processed line by
    line -- splitting a line midway and processing an incomplete partial chunk on
    some worker could be a serious error, depending on the application.

    Can somebody please tell me where I am wrong in my thinking here? Links to
    relevant documentation passages/tutorials are welcome too. Cheers.
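
    A small sketch of the separation Ted describes (the class and method names are illustrative, not Hadoop's InputFormat API): planning the splits handed to map tasks is simple offset arithmetic over the file length, and says nothing about which bytes the mapper's record reader will actually read.

    ----------
    import java.util.ArrayList;
    import java.util.List;

    public class SplitPlanner {
        static class Split {
            final long offset, length;
            Split(long offset, long length) { this.offset = offset; this.length = length; }
            public String toString() { return "{" + offset + "," + (offset + length) + "}"; }
        }

        // One split per block-sized range of the file; the record reader may
        // later skip bytes at the front of a split and read past its end in
        // order to respect line boundaries.
        static List<Split> planSplits(long fileLength, long blockSize) {
            List<Split> splits = new ArrayList<>();
            for (long off = 0; off < fileLength; off += blockSize) {
                splits.add(new Split(off, Math.min(blockSize, fileLength - off)));
            }
            return splits;
        }

        public static void main(String[] args) {
            // e.g. a 200 MB file with 64 MB blocks -> 4 splits, the last one short
            System.out.println(planSplits(200L << 20, 64L << 20));
        }
    }
    ----------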
  • Erik Paulson at May 28, 2008 at 7:58 pm

    On Tue, May 27, 2008 at 10:49:38AM -0700, Ted Dunning wrote:

    There is a good tutorial on the wiki about this.

    Your problem here is that you have conflated two concepts. The first is the
    splitting of files into blocks for storage purposes. This has nothing to do
    with what data a program can read any more than splitting a file into blocks
    on a disk in a conventional file system limits what you can read. The
    second is the splitting that the input format does in order to allow
    parallelism. Basically, the file block splits have nothing to do with what
    data the mapper can read; they only determine which data will be local.
    When reading from HDFS, how big are the network read requests, and what
    controls that? Or, more concretely, if I store files using 64 MB blocks
    in HDFS and run the simple word count example, and I get the default of
    one FileSplit/Map task per 64 MB block, how many bytes into the second 64 MB
    block will a mapper read before it first passes a buffer up to the record
    reader to see if it has found an end-of-line?

    Thanks,

    -Erik
  • Doug Cutting at May 29, 2008 at 5:21 pm

    Erik Paulson wrote:
    When reading from HDFS, how big are the network read requests, and what
    controls that? Or, more concretely, if I store files using 64 MB blocks
    in HDFS and run the simple word count example, and I get the default of
    one FileSplit/Map task per 64 MB block, how many bytes into the second 64 MB
    block will a mapper read before it first passes a buffer up to the record
    reader to see if it has found an end-of-line?
    This is controlled by io.file.buffer.size, which is 4 KB by default.

    Doug
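
    For reference, a short sketch of overriding that setting programmatically; it can equally be set in the site configuration file, and whether a larger buffer actually helps is workload dependent (64 KB below is only an example value):

    ----------
    import org.apache.hadoop.conf.Configuration;

    public class BufferSizeExample {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // io.file.buffer.size controls the read buffer size; the default is 4096.
            conf.setInt("io.file.buffer.size", 64 * 1024);
            System.out.println(conf.getInt("io.file.buffer.size", 4096));
        }
    }
    ----------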

Discussion Overview
Group: common-user
Categories: hadoop
Posted: May 26, '08 at 11:33a
Active: May 29, '08 at 5:21p
Posts: 6
Users: 5
Website: hadoop.apache.org...
IRC: #hadoop
