Hello, are there restrictions on the size or "width" of text files
placed in HDFS? I have a file structure like this:

<text key><tab><text data><nl>

It would be helpful if, in some circumstances, I could make the text
data really large (large meaning many KB up to one or a few MB). I may
have some rows with a very small payload and some with a very large
payload. Is this OK? When HDFS splits the file into chunks to spread
across the cluster, will it ever split a record? Total file size may
be on the order of 20-30 GB.

Thanks,

-K
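
A minimal sketch of writing records in this format through the Hadoop
FileSystem API; the path and payload sizes below are only illustrative:

    // Sketch: write <text key><tab><text data><nl> records to HDFS.
    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteTabRecords {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataOutputStream out =
            fs.create(new Path("/user/kelly/records.txt"), true); // illustrative path
        try {
          // One tiny payload and one ~2 MB payload, as described above.
          out.write("small-key\ttiny payload\n".getBytes("UTF-8"));
          char[] big = new char[2 * 1024 * 1024];
          Arrays.fill(big, 'x');
          out.write(("big-key\t" + new String(big) + "\n").getBytes("UTF-8"));
        } finally {
          out.close();
        }
      }
    }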


  • Harsh J at Mar 4, 2011 at 7:43 pm
    HDFS does not operate with records in mind. There shouldn't be much
    of a problem with having a few MB per record in text files, provided
    'a few MB' remains a (very) small fraction of the file's blocksize
    value.
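
    If the largest records get anywhere near the blocksize, one option is
    to create the file with a larger blocksize up front. A sketch using
    the FileSystem.create overload that takes an explicit block size (the
    128 MB figure is only illustrative):

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // 128 MB blocks keep even a ~2 MB record a small fraction of a block.
        long blockSize = 128L * 1024 * 1024;
        FSDataOutputStream out = fs.create(
            new Path("/user/kelly/records.txt"),      // illustrative path
            true,                                     // overwrite
            conf.getInt("io.file.buffer.size", 4096), // buffer size
            fs.getDefaultReplication(),               // replication factor
            blockSize);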


    --
    Harsh J
    www.harshj.com
  • Kelly Burkhart at Mar 4, 2011 at 8:25 pm

    On Fri, Mar 4, 2011 at 1:42 PM, Harsh J wrote:
    HDFS does not operate with records in mind.
    So does that mean that HDFS will break a file at exactly <blocksize>
    bytes? Map/Reduce *does* operate with records in mind, so what
    happens to the split record? Does HDFS put the fragments back
    together and deliver the reconstructed record to one map? Or are both
    fragments and consequently the whole record discarded?

    Thanks,

    -Kelly
  • Harsh J at Mar 4, 2011 at 8:31 pm
    The class responsible for reading records as lines off a file seeks
    into the next block in sequence until it reaches the newline. This
    behavior, and how it affects the Map tasks, is better documented here
    (see the TextInputFormat example doc):
    http://wiki.apache.org/hadoop/HadoopMapReduce
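
    A simplified sketch of that logic; this is only an illustration of
    the idea, not Hadoop's actual LineRecordReader. A reader assigned the
    byte range [start, end) skips the partial first line (it belongs to
    the previous split) and finishes its last line even when it runs past
    'end' into the next block:

        import java.io.BufferedInputStream;
        import java.io.IOException;
        import java.io.InputStream;

        // Assumes single-byte characters so byte positions line up.
        class SplitLineSketch {
          static void readSplit(InputStream raw, long start, long end) throws IOException {
            InputStream in = new BufferedInputStream(raw);
            long pos = 0;
            while (pos < start) {                       // advance to the split start
              long skipped = in.skip(start - pos);
              if (skipped <= 0) return;                 // split starts past end of file
              pos += skipped;
            }
            if (start != 0) {
              // The partial first line belongs to the previous split:
              // discard bytes up to and including the next newline.
              int b;
              while ((b = in.read()) != -1) { pos++; if (b == '\n') break; }
            }
            // Any line that *starts* before 'end' is ours, even if it
            // continues past 'end' into the next HDFS block.
            while (pos < end) {
              StringBuilder line = new StringBuilder();
              int b;
              while ((b = in.read()) != -1 && b != '\n') line.append((char) b);
              if (line.length() == 0 && b == -1) break; // end of file
              pos += line.length() + 1;                 // +1 for the newline
              handle(line.toString());                  // one whole record for map()
            }
          }
          static void handle(String record) { System.out.println(record.length()); }
        }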

    --
    Harsh J
    www.harshj.com
  • Brian Bockelman at Mar 4, 2011 at 8:46 pm
    If, for example, you have a record with 20MB in one block and 1MB in
    another, Map/Reduce will feed you the entire 21MB record. If you are
    lucky and the map is executing on a node holding the 20MB block,
    MapReduce only has to pull the remaining 1MB over the network for you.

    This is glossing over some details, but the point is that MR will feed you whole records regardless of whether they are stored on one or two blocks.
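
    So with TextInputFormat a mapper can split the key from the payload
    without worrying about block boundaries at all. A minimal sketch of
    such a mapper (the class name is illustrative):

        import java.io.IOException;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;

        // Each map() call receives one whole <key><tab><data> line, even
        // when that line physically spans two HDFS blocks.
        public class TabRecordMapper extends Mapper<LongWritable, Text, Text, Text> {
          @Override
          protected void map(LongWritable offset, Text line, Context context)
              throws IOException, InterruptedException {
            String[] parts = line.toString().split("\t", 2);
            if (parts.length == 2) {
              context.write(new Text(parts[0]), new Text(parts[1]));
            }
          }
        }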

    Brian

Discussion Overview
group: common-user
categories: hadoop
posted: Mar 4, '11 at 7:31p
active: Mar 4, '11 at 8:46p
posts: 5
users: 3
website: hadoop.apache.org...
irc: #hadoop
