Our team is still new to Hadoop, and a colleague and I are trying to
make a decision on file formats. The arguments are:

* We should use a SequenceFile (binary) format as it's faster for the
machine to read than parsing text, and the files are smaller.

* We should use a text file format as it's easier for humans to read,
easier to change, text files can be compressed quite small, and a) if
the text format is designed well and b) given the context of a
distributed system like Hadoop where you can throw more nodes at a
problem, the text parsing time will wind up being negligible/irrelevant
in the overall processing time.

I realize I'm leaving out a lot of variables and specifics that could
impact this answer, but I'm just wondering if the Hadoop community had
any general rules of thumb about this like "favor (binary) sequence
files over text files" or some such.

If anyone has any general suggestions/advice here, please post back.

Thanks,

DR

  • Alex Loddengaard at Jul 2, 2010 at 10:15 pm
    Hi David,
    SequenceFiles can also be compressed, either per record or per block. This
    is advantageous if you want to use gzip, because gzip on its own isn't
    splittable. A SequenceFile compressed by blocks is therefore splittable,
    because each block is gzipped individually rather than the entire file.
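    For illustration, a job can be set up to write block-compressed SequenceFile
    output roughly like this (a sketch against the "new" MapReduce API; the
    class name and the Text key/value types are just placeholders):

        import org.apache.hadoop.io.SequenceFile.CompressionType;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.io.compress.GzipCodec;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
        import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

        // Sketch: write block-compressed SequenceFiles with gzip, so the output
        // stays splittable even though gzip itself is not.
        public class BlockCompressedOutput {
            public static void configure(Job job) {
                job.setOutputFormatClass(SequenceFileOutputFormat.class);
                job.setOutputKeyClass(Text.class);    // placeholder key type
                job.setOutputValueClass(Text.class);  // placeholder value type
                FileOutputFormat.setCompressOutput(job, true);
                FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
                SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
            }
        }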

    As for readability, "hadoop fs -text" does for SequenceFiles what "hadoop fs
    -cat" does for text files: it decodes the records and prints them in
    readable form.

    Lastly, I promise that eventually you'll run out of space in your cluster
    and wish you had used better compression. Plus, compression usually makes
    jobs faster, because less data has to be read from disk and shipped over
    the network.

    The general recommendation is to use SequenceFiles as early in your ETL as
    possible. Usually people get their data in as text, and after the first MR
    pass they work with SequenceFiles from there on out.
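    That first pass can be as simple as a map-only job (zero reduces) that
    parses each text line and writes it straight back out through a
    SequenceFile output format configured as in the sketch above. Purely as an
    illustration, assuming tab-separated input with the first field as the key:

        import java.io.IOException;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;

        // Illustrative mapper for the "first MR pass": reads tab-separated text
        // lines (via TextInputFormat) and emits the first field as the key and
        // the rest of the line as the value.
        public class TextToSequenceFileMapper
                extends Mapper<LongWritable, Text, Text, Text> {

            private final Text outKey = new Text();
            private final Text outValue = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                String s = line.toString();
                int tab = s.indexOf('\t');
                if (tab < 0) {
                    return; // skip malformed lines (real code would count them)
                }
                outKey.set(s.substring(0, tab));
                outValue.set(s.substring(tab + 1));
                context.write(outKey, outValue);
            }
        }

    Run with zero reduce tasks and the SequenceFile output settings above, the
    conversion is a single cheap pass over the raw text.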

    Alex
  • Joe Stein at Jul 2, 2010 at 10:36 pm
    David,

    You can also turn on compression of the intermediate data that flows
    between your map and reduce tasks (this data can be large, and it is often
    quicker to compress it and transfer less than to ship it uncompressed once
    the copy phase gets going).

    *mapred.compress.map.output*

    Setting this value to *true* should speed up the reducers' copy phase
    greatly, especially when working with large data sets.

    http://allthingshadoop.com/2010/04/28/map-reduce-tips-tricks-your-first-real-cluster/
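    Roughly, on the job configuration (a sketch using the property names from
    the Hadoop 0.20-era releases; GzipCodec here is only an example, a faster
    codec such as LZO, where installed, is a common choice):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.io.compress.CompressionCodec;
        import org.apache.hadoop.io.compress.GzipCodec;

        // Sketch: compress the intermediate map output that gets shuffled to
        // the reducers.
        public class MapOutputCompression {
            public static void enable(Configuration conf) {
                conf.setBoolean("mapred.compress.map.output", true);
                conf.setClass("mapred.map.output.compression.codec",
                        GzipCodec.class, CompressionCodec.class);
            }
        }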

    When we load our data in, we use the HDFS API to write it as SequenceFiles
    (compressed by block) from the start, and we never look back from there.
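    For what it's worth, a bare-bones sketch of that kind of loader (the Text
    key/value types and the output path are placeholders, and error handling is
    omitted):

        import java.io.IOException;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.io.compress.CompressionCodec;
        import org.apache.hadoop.io.compress.GzipCodec;
        import org.apache.hadoop.util.ReflectionUtils;

        // Sketch: write a block-compressed SequenceFile straight to HDFS,
        // outside of MapReduce, using the SequenceFile.Writer API.
        public class SequenceFileLoadSketch {
            public static void main(String[] args) throws IOException {
                Configuration conf = new Configuration();
                FileSystem fs = FileSystem.get(conf);
                Path out = new Path(args[0]);
                CompressionCodec codec =
                        ReflectionUtils.newInstance(GzipCodec.class, conf);

                SequenceFile.Writer writer = SequenceFile.createWriter(
                        fs, conf, out, Text.class, Text.class,
                        SequenceFile.CompressionType.BLOCK, codec);
                try {
                    writer.append(new Text("some-key"), new Text("some-value"));
                } finally {
                    writer.close();
                }
            }
        }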

    We have a custom SequenceFileLoader so we can still use Pig against our
    SequenceFiles as well (see the sketch below). It is worth the little bit of
    engineering effort to save the space.
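    The loader itself isn't much code. What follows is not our actual
    SequenceFileLoader, just a hypothetical sketch against the Pig 0.7 LoadFunc
    API, assuming Text keys and Text values and exposing each record as a
    two-field tuple:

        import java.io.IOException;
        import java.util.Arrays;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.InputFormat;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.RecordReader;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
        import org.apache.pig.LoadFunc;
        import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
        import org.apache.pig.data.Tuple;
        import org.apache.pig.data.TupleFactory;

        // Hypothetical Pig loader for SequenceFiles with Text keys and values.
        public class SimpleSequenceFileLoader extends LoadFunc {

            private RecordReader<Text, Text> reader;
            private final TupleFactory tupleFactory = TupleFactory.getInstance();

            @Override
            public void setLocation(String location, Job job) throws IOException {
                FileInputFormat.setInputPaths(job, location);
            }

            @Override
            public InputFormat getInputFormat() throws IOException {
                return new SequenceFileInputFormat<Text, Text>();
            }

            @SuppressWarnings("unchecked")
            @Override
            public void prepareToRead(RecordReader reader, PigSplit split)
                    throws IOException {
                this.reader = reader;
            }

            @Override
            public Tuple getNext() throws IOException {
                try {
                    if (!reader.nextKeyValue()) {
                        return null; // end of input
                    }
                    String key = reader.getCurrentKey().toString();
                    String value = reader.getCurrentValue().toString();
                    return tupleFactory.newTuple(Arrays.asList(key, value));
                } catch (InterruptedException e) {
                    throw new IOException(e);
                }
            }
        }

    From Pig Latin it would be used as something like (path made up):
    records = LOAD '/data/seq/part-*' USING SimpleSequenceFileLoader();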

    /*
    Joe Stein
    http://www.linkedin.com/in/charmalloc
    Twitter: @allthingshadoop
    */
  • Aaron Kimball at Jul 5, 2010 at 7:48 am
    David,

    I think you've more-or-less outlined the pros and cons of each format
    (though do see Alex's important point regarding SequenceFiles and
    compression). If everyone who worked with Hadoop clearly favored one or the
    other, we probably wouldn't include support for both formats by default. :)
    Neither format is "right" or "wrong" in the general case. The decision will
    be application-specific.

    I would point out, though, that you may be underestimating the processing
    cost of parsing records. If you've got a really dead-simple problem like
    "each record is just a set of integers", you could probably split a line of
    text on commas/tabs/etc. into fields and then convert those to proper
    integer values in a relatively efficient fashion. But if you may have
    delimiters embedded in free-form strings, you'll need to build up a much
    more complex DFA to process the data, and it's not too hard to find yourself
    CPU-bound. (Java regular expressions can be very slow.) Yes, you can always
    throw more nodes at the problem, but you may find that your manager is
    unwilling to sign off on purchasing more nodes at some point :) Also,
    writing/maintaining parser code is its own challenge.
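    To make the contrast concrete, the dead-simple case is something like the
    sketch below (tab-separated integer fields, no quoting or embedded
    delimiters assumed). As soon as fields can contain escaped or quoted
    delimiters, this one-line split stops being correct and the parser, along
    with its CPU cost, grows quickly:

        // The cheap case: fixed, tab-separated integer fields with nothing
        // embedded in them.
        public final class SimpleLineParser {

            private SimpleLineParser() {}

            public static int[] parse(String line) {
                String[] fields = line.split("\t");
                int[] values = new int[fields.length];
                for (int i = 0; i < fields.length; i++) {
                    values[i] = Integer.parseInt(fields[i].trim());
                }
                return values;
            }
        }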

    If your data is essentially text in nature, you might just store it in text
    files and be done with it for all the reasons you've stated.

    But for complex record types, SequenceFiles will be faster. Especially if
    you have to work with raw byte arrays at any point, escaping that (e.g.,
    BASE64 encoding) into text and then back is hardly worth the trouble. Just
    store it in a binary format and be done with it. Intermediate job data
    should probably live as SequenceFiles all the time. They're only ever going
    to be read by more MapReduce jobs, right? For data at either "edge" of your
    problem--either input or final output data--you might want the greater
    ubiquity of text-based files.

    - Aaron
  • David Rosenstrauch at Jul 6, 2010 at 2:55 pm
    Thanks much for the helpful responses, everyone. This very much helped
    clarify our thinking on the code design. Sounds like, all other things
    being equal, sequence files are the way to go. Thanks again for the
    advice, all.

    DR
  • Edward Capriolo at Jul 6, 2010 at 3:14 pm

    One challenge is that support for sequence files can be incomplete.
    For example, Pig 0.6 has support for globs and delimiters with
    PigStorage, but its SequenceFile storage did not support delimiters or
    globs. (This was added in Pig 0.7, if I remember correctly.) Hive has/had
    a similar issue: with sequence files you can handle multiple files per
    directory, except Hive typically ignores the key of its input, and it
    just so happened that all our data was generated in the key.

    Thus, when you work with sequence files you tend to have to "dig in" to
    the code base of third-party utilities, e.g.:
    http://www.ivanprado.es/2008/02/reading-hadoop-sequencefile-from-pig.html
