Hi Hadoop gurus, here's a question about intermediate compression.

As I understand it, the point of compressing Map output is to reduce the network traffic that occurs when map output is shuffled to Reduce tasks that do not run on the same nodes as the Map tasks. So, depending on various factors such as how the cluster is set up, the data size, the nature of the problem, and the quality of the MapReduce program (e.g. a Pig script), this reduction in network traffic may or may not compensate for the time spent compressing and decompressing. In other words, intermediate compression may not achieve its goal of reducing the overall running time of a MapReduce job.
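To make the question concrete, the settings I mean are the map-output compression flag and codec. On the 0.20-era mapred API a driver would turn them on roughly like this (the class and helper name are placeholders of mine, and GzipCodec is only an example codec that ships with Hadoop):

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.JobConf;

public class MapOutputCompression {

    // Turn on compression of the intermediate (map-side) output for a job.
    static void enableIntermediateCompression(JobConf conf) {
        // Compress map output before it is spilled and shuffled to reducers.
        conf.setCompressMapOutput(true);

        // The codec choice drives the CPU-vs-network trade-off. GzipCodec
        // ships with Hadoop; an LZO codec would come from the separate
        // hadoop-lzo package covered in the blog post linked below.
        conf.setMapOutputCompressorClass(GzipCodec.class);

        // Equivalent property settings (mapred-site.xml or -D on the command line):
        //   mapred.compress.map.output = true
        //   mapred.map.output.compression.codec = org.apache.hadoop.io.compress.GzipCodec
    }
}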

A blog post (http://blog.oskarsson.nu/2009/03/hadoop-feat-lzo-save-disk-space-and.html) gives compression and decompression ratios and speeds, and reports positive results from compressing the raw input data of a MapReduce job, but it offers no tests or insight into intermediate compression. So I am wondering whether there are any case studies or test results on when to use intermediate compression: pros and cons, settings, pitfalls, and gains.
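In case it helps frame answers, the kind of comparison I have in mind is simply running the same job twice over the same input, once with and once without intermediate compression, and comparing shuffle bytes and wall-clock time from the job counters. A rough sketch of my own (class and job names are placeholders) using ToolRunner, so the flag can be flipped with -Dmapred.compress.map.output=true|false on the command line:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Hypothetical benchmark driver: run it twice over the same input,
// toggling -Dmapred.compress.map.output, then compare shuffle bytes
// and elapsed time reported in the job counters.
public class ShuffleCompressionTest extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        JobConf conf = new JobConf(getConf(), ShuffleCompressionTest.class);
        conf.setJobName("shuffle-compression-test");

        // Identity map/reduce: almost all the work is in the shuffle,
        // which is exactly the stage intermediate compression targets.
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);  // blocks until the job completes
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new ShuffleCompressionTest(), args));
    }
}

Run as, e.g., hadoop jar test.jar ShuffleCompressionTest -Dmapred.compress.map.output=true <in> <out>, then again with false, and compare the two runs.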

Thanks,

Michael
