Hello,

What is the best way to compress several files already in Hadoop?


  • Dhruv at Jun 11, 2011 at 2:26 am
    Can you be more specific? Tom White's book has a whole section devoted to
    it.
  • Madhu Ramanna at Jun 11, 2011 at 2:34 pm
    Sure,

    I wrote a job that runs hourly / daily and produces several files. I'm
    using MultipleOutputs to generate these files. However, when compression
    is turned on (bz2), MultipleOutputs produces 0-byte files for all but one
    named output (the part files are 14 bytes). Without compression,
    MultipleOutputs does its job fine. Since the output is all text,
    compressing it would save us a ton of disk space.

    Our cluster is cdh3b3 (hadoop-0.20.2)
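
    A minimal sketch, against the old mapred API in hadoop-0.20.2, of how
    such a job might be wired up; the class and output names here are
    illustrative, not the poster's. One known pitfall that matches these
    symptoms: MultipleOutputs keeps its own record writers, and if close()
    is not called on it from the task's close() method, the named outputs'
    compressed streams are never flushed or closed (an empty bz2 stream,
    like the 14-byte part files here, is exactly 14 bytes).

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.BZip2Codec;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.TextOutputFormat;
    import org.apache.hadoop.mapred.lib.MultipleOutputs;

    public class HourlyExport {

        public static class ExportMapper extends MapReduceBase
                implements Mapper<LongWritable, Text, LongWritable, Text> {

            private MultipleOutputs mos;

            @Override
            public void configure(JobConf conf) {
                mos = new MultipleOutputs(conf);
            }

            public void map(LongWritable key, Text value,
                            OutputCollector<LongWritable, Text> out,
                            Reporter reporter) throws IOException {
                // Route the record to a named output; "logs" is illustrative.
                mos.getCollector("logs", reporter).collect(key, value);
            }

            @Override
            public void close() throws IOException {
                // Without this, the named outputs' record writers are never
                // closed and compressed streams are never flushed.
                mos.close();
            }
        }

        static void configure(JobConf conf) {
            MultipleOutputs.addNamedOutput(conf, "logs",
                    TextOutputFormat.class, LongWritable.class, Text.class);
            // Compress all file output with bz2.
            FileOutputFormat.setCompressOutput(conf, true);
            FileOutputFormat.setOutputCompressorClass(conf, BZip2Codec.class);
        }
    }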




  • Shi Yu at Jun 12, 2011 at 5:57 pm
    This is a re-post of the same message, made more specific and clear. I
    have been considering it for several days, so I would really appreciate
    any help.

    I have a question about configuring a map-side inner join for multiple
    mappers in Hadoop. Suppose I have two very large data sets A and B, and
    I use the same partition and sort algorithm to split them into smaller
    parts. For A, assume I have a(1) to a(10), and for B I have b(1) to
    b(10). It is assured that a(1) and b(1) contain the same keys, a(2) and
    b(2) have the same keys, and so on. I would like to set up 10 mappers,
    specifically mapper(1) to mapper(10). To my understanding, a map-side
    join is a pre-processing step prior to the mapper, so I would like to
    join a(1) and b(1) for mapper(1), join a(2) and b(2) for mapper(2), and
    so on.

    After reading some reference materials, it is still not clear to me how
    to configure these ten mappers. I understand that with
    CompositeInputFormat I could join two files, but that seems to configure
    only one mapper and join the 20 files pair by pair (in 10 sequential
    tasks). How can I configure all ten mappers so the ten pairs are joined
    at the same time in genuine Map/Reduce fashion (10 tasks in parallel)?
    To my understanding, ten mappers would require ten CompositeInputFormat
    settings, because the files to join are all different. I strongly
    believe this is practical and doable, but I couldn't figure out what
    exact commands I should use.
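
    For what it's worth, a single join configuration may already give the
    parallelism asked for: CompositeInputFormat pairs up corresponding
    partitions of the sources named in one join expression, creating one
    composite split (and hence one map task) per pair, so the ten joins run
    in parallel within one job. Below is a minimal sketch against the old
    mapred API; /data/A and /data/B are illustrative paths, and both inputs
    are assumed to be sorted and identically partitioned as described above.

    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.KeyValueTextInputFormat;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.join.CompositeInputFormat;
    import org.apache.hadoop.mapred.join.TupleWritable;

    public class MapSideJoin {

        // Each map task receives the already-joined records of one
        // partition pair, so a(i) JOIN b(i) runs in parallel across tasks.
        public static class JoinMapper extends MapReduceBase
                implements Mapper<Text, TupleWritable, Text, Text> {

            public void map(Text key, TupleWritable value,
                            OutputCollector<Text, Text> out,
                            Reporter reporter) throws IOException {
                Text fromA = (Text) value.get(0); // record from A
                Text fromB = (Text) value.get(1); // matching record from B
                out.collect(key, new Text(fromA + "\t" + fromB));
            }
        }

        static void configure(JobConf conf) {
            conf.setInputFormat(CompositeInputFormat.class);
            // One expression covers all partitions: the framework pairs
            // partition i of A with partition i of B and creates one
            // composite split (hence one mapper) per pair.
            conf.set("mapred.join.expr", CompositeInputFormat.compose(
                    "inner", KeyValueTextInputFormat.class,
                    new Path("/data/A"), new Path("/data/B")));
            conf.setMapperClass(JoinMapper.class);
            conf.setNumReduceTasks(0); // map-side join, no reduce needed
        }
    }

    This relies on both sources producing the same number of identically
    sorted splits, which is exactly the partitioning precondition described
    above.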

    Any hint or suggestion is highly welcome and appreciated.

    Shi
