David,

You can also compress the data that moves between your map and reduce
tasks. This intermediate data can be large, and once the copy gets going it
is often quicker to compress and transfer it than to transfer it as is.

*mapred.compress.map.output*

Setting this value to *true* should speed up the reducers' copy phase
considerably, especially when working with large data sets.

http://allthingshadoop.com/2010/04/28/map-reduce-tips-tricks-your-first-real-cluster/

When we load our data we use the HDFS API and write it as SequenceFiles
(compressed by block) from the start, and we never look back from there.

We have a custom SequenceFileLoader so we can still run Pig against our
SequenceFiles. It is worth the little bit of engineering effort to save
space.

/*
Joe Stein
http://www.linkedin.com/in/charmalloc
Twitter: @allthingshadoop
*/
On Fri, Jul 2, 2010 at 6:14 PM, Alex Loddengaard wrote:

Hi David,

On Fri, Jul 2, 2010 at 2:54 PM, David Rosenstrauch <darose@darose.net> wrote:

* We should use a SequenceFile (binary) format as it's faster for the
machine to read than parsing text, and the files are smaller.

* We should use a text file format as it's easier for humans to read,
easier to change, text files can be compressed quite small, and a) if the
text format is designed well and b) given the context of a distributed
system like Hadoop where you can throw more nodes at a problem, the text
parsing time will wind up being negligible/irrelevant in the overall
processing time.

SequenceFiles can also be compressed, either per record or per block. This
is advantageous if you want to use gzip, because a gzipped file isn't
splittable on its own. A SequenceFile compressed by block is therefore
splittable, because each block is gzipped separately rather than the entire
file being gzipped as one stream.
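
To make the splittability point concrete, here is a minimal Java sketch that
uses only java.util.zip. It is an illustration of the idea, not the
SequenceFile on-disk format: the class name BlockGzipSketch, its two helper
methods, and the sample records are all hypothetical. It gzips each group of
records as an independent stream and then decodes one block without touching
the others, which is what lets separate tasks work on separate blocks.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Illustration only: this does NOT read or write real SequenceFiles.
final class BlockGzipSketch {

    // Gzip each group of `recordsPerBlock` records as its own independent stream.
    static List<byte[]> compressBlocks(List<String> records, int recordsPerBlock) {
        if (recordsPerBlock <= 0) {
            throw new IllegalArgumentException("recordsPerBlock must be positive");
        }
        List<byte[]> blocks = new ArrayList<>();
        for (int start = 0; start < records.size(); start += recordsPerBlock) {
            int end = Math.min(start + recordsPerBlock, records.size());
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            // Closing the gzip stream writes the trailer, so the block is complete.
            try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
                for (String record : records.subList(start, end)) {
                    gzip.write((record + "\n").getBytes(StandardCharsets.UTF_8));
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            blocks.add(buffer.toByteArray());
        }
        return blocks;
    }

    // Decode a single block; no other block is needed to do so.
    static String decompressBlock(byte[] block) {
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(block))) {
            return new String(gzip.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<String> records =
                Arrays.asList("k1\tv1", "k2\tv2", "k3\tv3", "k4\tv4", "k5\tv5");
        List<byte[]> blocks = compressBlocks(records, 2);
        // A reader handed only the second block can decode it by itself.
        System.out.print(decompressBlock(blocks.get(1)));
    }
}
```

If the same records were written through a single GZIPOutputStream instead, a
reader would have to decompress from the first byte to reach the later
records, which is why a plain gzipped text file ends up as one split.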

As for readability, "hadoop fs -text" does for SequenceFiles what "hadoop fs
-cat" does for text files: it prints the records as readable text.

Lastly, I promise that eventually you'll run out of space in your cluster
and wish you had used better compression. Plus, compression usually makes
jobs faster because there is less data to read and transfer.

The general recommendation is to use SequenceFiles as early in your ETL as
possible. Usually people get their data in as text, and after the first MR
pass they work with SequenceFiles from there on out.

Alex
