On Tue, Jul 6, 2010 at 10:56 AM, David Rosenstrauch wrote:
Thanks much for the helpful responses, everyone. This very much helped
clarify our thinking on the code design. Sounds like, all other things being
equal, sequence files are the way to go. Thanks again for the advice, all.
DR
On 07/05/2010 03:47 AM, Aaron Kimball wrote:
David,

I think you've more-or-less outlined the pros and cons of each format
(though do see Alex's important point regarding SequenceFiles and
compression). If everyone who worked with Hadoop clearly favored one or the
other, we probably wouldn't include support for both formats by default. :)

Neither format is "right" or "wrong" in the general case. The decision will
be application-specific.

I would point out, though, that you may be underestimating the processing
cost of parsing records. If you've got a really dead-simple problem like
"each record is just a set of integers", you could probably split a line of
text on commas/tabs/etc. into fields and then convert those to proper
integer values in a relatively efficient fashion. But if you may have
delimiters embedded in free-form strings, you'll need to build up a much
more complex DFA to process the data, and it's not too hard to find yourself
CPU-bound. (Java regular expressions can be very slow.) Yes, you can always
throw more nodes at the problem, but you may find that your manager is
unwilling to sign off on purchasing more nodes at some point :) Also,
writing/maintaining parser code is its own challenge.
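
To make the contrast concrete, here is a minimal sketch of that dead-simple
case; the tab-delimited layout and the class name are just assumptions for
illustration, not anything from this thread.

    // Minimal sketch of the "dead-simple" case: each record is a
    // tab-delimited line of integers. Anything with delimiters embedded
    // in free-form strings needs a real parser instead.
    public class SimpleRecordParser {
        public static int[] parse(String line) {
            String[] fields = line.split("\t");   // cheap, single-character split
            int[] values = new int[fields.length];
            for (int i = 0; i < fields.length; i++) {
                values[i] = Integer.parseInt(fields[i].trim());
            }
            return values;
        }

        public static void main(String[] args) {
            int[] record = parse("42\t7\t1900");
            System.out.println(record[0] + " " + record[1] + " " + record[2]);
        }
    }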
If your data is essentially text in nature, you might just store it in text
files and be done with it, for all the reasons you've stated.

But for complex record types, SequenceFiles will be faster. Especially if
you have to work with raw byte arrays at any point, escaping that (e.g.,
BASE64 encoding) into text and then back is hardly worth the trouble. Just
store it in a binary format and be done with it. Intermediate job data
should probably live as SequenceFiles all the time. They're only ever going
to be read by more MapReduce jobs, right? For data at either "edge" of your
problem--either input or final output data--you might want the greater
ubiquity of text-based files.
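
A hedged sketch of keeping intermediate data as SequenceFiles between two
chained jobs, using the 0.20-era org.apache.hadoop.mapred API; the job
names, key/value types, and paths are placeholders, not anything from this
thread.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;
    import org.apache.hadoop.mapred.SequenceFileOutputFormat;

    public class TwoStagePipeline {
        public static void run(Path rawIn, Path intermediate, Path finalOut) throws Exception {
            // Stage 1: reads text input, writes its output as SequenceFiles.
            JobConf stage1 = new JobConf(TwoStagePipeline.class);
            stage1.setJobName("stage-1");
            stage1.setOutputKeyClass(Text.class);
            stage1.setOutputValueClass(IntWritable.class);
            stage1.setOutputFormat(SequenceFileOutputFormat.class);
            FileInputFormat.setInputPaths(stage1, rawIn);
            FileOutputFormat.setOutputPath(stage1, intermediate);
            JobClient.runJob(stage1);

            // Stage 2: reads those SequenceFiles directly -- no text parsing step.
            JobConf stage2 = new JobConf(TwoStagePipeline.class);
            stage2.setJobName("stage-2");
            stage2.setInputFormat(SequenceFileInputFormat.class);
            stage2.setOutputKeyClass(Text.class);
            stage2.setOutputValueClass(IntWritable.class);
            FileInputFormat.setInputPaths(stage2, intermediate);
            FileOutputFormat.setOutputPath(stage2, finalOut);
            JobClient.runJob(stage2);
        }
    }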
- Aaron
On Fri, Jul 2, 2010 at 3:35 PM, Joe Stein wrote:
David,
You can also enable compression of the data passed between your map & reduce
tasks (this intermediate data can be large, and it is often quicker to
compress and transfer it than to just transfer it uncompressed once the copy
gets going).

*mapred.compress.map.output*

Setting this value to *true* should speed up the reducers' copy phase
greatly, especially when working with large data sets.
http://allthingshadoop.com/2010/04/28/map-reduce-tips-tricks-your-first-real-cluster/
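
A hedged sketch of flipping that switch from the job driver, using the old
property name above; the typed JobConf helpers in the comments do the same
thing, and the choice of gzip as the codec is just an assumption.

    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapred.JobConf;

    public class MapOutputCompression {
        public static void enable(JobConf conf) {
            // Compress the intermediate map output before the shuffle/copy.
            conf.setBoolean("mapred.compress.map.output", true);
            conf.setClass("mapred.map.output.compression.codec",
                          GzipCodec.class, CompressionCodec.class);
            // Equivalent typed helpers:
            // conf.setCompressMapOutput(true);
            // conf.setMapOutputCompressorClass(GzipCodec.class);
        }
    }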
When we load in our data, we use the HDFS API and get the data in from the
start as SequenceFiles (compressed by block), and we never look back from
there. We have a custom SequenceFileLoader so we can still use Pig against
our SequenceFiles as well. It is worth the little bit of engineering effort
to save the space.
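
A hedged sketch of that kind of ingest, writing straight to a
block-compressed SequenceFile through the HDFS API; the key/value types and
codec are assumptions here, not Joe's actual loader.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.DefaultCodec;

    public class SequenceFileIngest {
        public static void write(String dst, Iterable<String> lines) throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, new Path(dst),
                    LongWritable.class, Text.class,
                    SequenceFile.CompressionType.BLOCK,   // compress whole blocks; stays splittable
                    new DefaultCodec());
            try {
                long i = 0;
                for (String line : lines) {
                    writer.append(new LongWritable(i++), new Text(line));
                }
            } finally {
                writer.close();
            }
        }
    }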
/*
Joe Stein
http://www.linkedin.com/in/charmalloc
Twitter: @allthingshadoop
*/
On Fri, Jul 2, 2010 at 6:14 PM, Alex Loddengaard<[email protected]>
wrote:
Hi David,
On Fri, Jul 2, 2010 at 2:54 PM, David Rosenstrauch <[email protected]> wrote:
* We should use a SequenceFile (binary) format, as it's faster for the
machine to read than parsing text, and the files are smaller.
* We should use a text file format, as it's easier for humans to read and
easier to change; text files can be compressed quite small; and, a) if the
text format is designed well and b) given the context of a distributed
system like Hadoop where you can throw more nodes at a problem, the parsing
time will wind up being negligible/irrelevant in the overall processing time.
SequenceFiles can also be compressed, either per record or per block. This
is advantageous if you want to use gzip, because gzip isn't splittable. A
SequenceFile compressed by blocks is therefore splittable, because each
block is gzipped rather than the entire file.
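
As an illustration, a hedged sketch of asking a job for exactly that kind of
output (block-compressed SequenceFiles with the gzip codec) via the old
mapred API; the surrounding job setup is assumed.

    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileOutputFormat;

    public class SplittableGzipOutput {
        public static void configure(JobConf conf) {
            conf.setOutputFormat(SequenceFileOutputFormat.class);
            FileOutputFormat.setCompressOutput(conf, true);
            FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);
            // BLOCK gzips each block rather than the whole file, so the
            // output can still be split across map tasks in later jobs.
            SequenceFileOutputFormat.setOutputCompressionType(
                    conf, SequenceFile.CompressionType.BLOCK);
        }
    }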
As for readability, "hadoop fs -text" is the same as "hadoop fs -cat" for
SequenceFiles.

Lastly, I promise that eventually you'll run out of space in your cluster
and wish you had done better compression. Plus, compression makes jobs
faster.
The general recommendation is to use SequenceFiles as early in your ETL as
possible. Usually people get their data in as text, and after the first MR
pass they work with SequenceFiles from there on out.
Alex
For example, Pig 0.6 has support for globs and delimiters with PigStorage;
however, the SequenceFile storage did not support delimiters or globs.
(This was added in Pig 0.7, if I remember correctly.) Hive has/had a similar
issue: with sequence files you could handle multiple files per directory,
except Hive typically ignores the key of its input, and it just so happened
that all our data was generated in the key.
Thus, when you work with sequence files, you tend to have to dig in to the
code base of third-party utilities:
http://www.ivanprado.es/2008/02/reading-hadoop-sequencefile-from-pig.html
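
When you do end up digging in, this is roughly what reading both the key and
the value back out of a SequenceFile looks like with SequenceFile.Reader; a
hedged sketch, with the class name chosen only for illustration.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.util.ReflectionUtils;

    public class SequenceFileDump {
        public static void dump(String src) throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(src), conf);
            try {
                // The key and value classes are recorded in the file header.
                Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
                Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
                while (reader.next(key, value)) {
                    System.out.println(key + "\t" + value);   // the key matters too
                }
            } finally {
                reader.close();
            }
        }
    }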