FAQ
I'm starting to evaluate Hadoop. We are currently running Sensage and
store a lot of log files in our current environment. I've been looking
at the Hadoop forums and googling (of course) but haven't learned
whether HDFS applies any compression to the files we store.

On average we're storing about 600 GB a week in log files (more or
less). Generally we need to store about 1.5 to 2 years of logs. With
Sensage compression we can store 200+ TB of logs in our current
environment.

As I said, we're starting to evaluate whether Hadoop would be a good
replacement for our Sensage environment (or at least augment it).

Thanks a bunch!!

  • Sonal Goyal at Apr 4, 2010 at 2:06 am
    Hi,

    Please check
    http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Data+Compression

    Thanks and Regards,
    Sonal
    www.meghsoft.com
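
    In rough terms, the settings that section describes look like this in a
    job driver. This is only a sketch using the older org.apache.hadoop.mapred
    API and the stock GzipCodec; the class name is made up:

        import org.apache.hadoop.io.SequenceFile.CompressionType;
        import org.apache.hadoop.io.compress.GzipCodec;
        import org.apache.hadoop.mapred.FileOutputFormat;
        import org.apache.hadoop.mapred.JobConf;
        import org.apache.hadoop.mapred.SequenceFileOutputFormat;

        public class CompressionSettings {
          // Switch on compression for the intermediate map output and the
          // final job output -- neither is compressed by default.
          public static JobConf configure(JobConf conf) {
            // Compress the map output shuffled to the reducers.
            conf.setCompressMapOutput(true);
            conf.setMapOutputCompressorClass(GzipCodec.class);

            // Compress the job's final output files as well.
            FileOutputFormat.setCompressOutput(conf, true);
            FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);

            // If the job writes SequenceFiles, block compression keeps
            // them compact and still splittable.
            SequenceFileOutputFormat.setOutputCompressionType(conf,
                CompressionType.BLOCK);
            return conf;
          }
        }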

  • Rajesh Balamohan at Apr 4, 2010 at 3:01 am
    There is a facility in Hadoop to compress intermediate map output and
    job output. Is your question about reading compressed files themselves
    into Hadoop?

    If so, refer to SequenceFileInputFormat (
    http://developer.yahoo.com/hadoop/tutorial/module4.html ).

    The *SequenceFileInputFormat* reads special binary files that are specific
    to Hadoop. These files include many features designed to allow data to be
    rapidly read into Hadoop mappers. Sequence files are block-compressed and
    provide direct serialization and deserialization of several arbitrary data
    types (not just text). Sequence files can be generated as the output of
    other MapReduce tasks and are an efficient intermediate representation for
    data that is passing from one MapReduce job to another.
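
    For what it's worth, loading existing log lines into a block-compressed
    SequenceFile is only a few lines of code. A minimal sketch (the class
    name and paths here are made up for illustration):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IOUtils;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Text;

        public class LogsToSequenceFile {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path out = new Path(args[0]);  // e.g. /logs/2010-04-03.seq

            SequenceFile.Writer writer = null;
            try {
              // BLOCK compression batches many records together before
              // compressing, which keeps the file compact and splittable.
              writer = SequenceFile.createWriter(fs, conf, out,
                  LongWritable.class, Text.class,
                  SequenceFile.CompressionType.BLOCK);
              writer.append(new LongWritable(1L), new Text("first log line"));
              writer.append(new LongWritable(2L), new Text("second log line"));
            } finally {
              IOUtils.closeStream(writer);
            }
          }
        }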


    --
    ~Rajesh.B
  • Eric Sammer at Apr 4, 2010 at 8:46 am
    To clarify, there is no implicit compression in HDFS. In other words,
    if you want your data to be compressed, you have to write it that way.
    If you plan on writing MapReduce jobs to process the compressed data,
    you'll want to use a splittable compression format. This generally
    means LZO or block-compressed SequenceFiles, which others have
    mentioned.
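
    To make "write it that way" concrete, here is a rough sketch of copying a
    local log file into HDFS gzip-compressed (the class name and argument
    handling are illustrative only). Keep in mind that plain gzip is not
    splittable, so for files you want to process in parallel you would lean on
    bzip2, indexed LZO, or block-compressed SequenceFiles instead:

        import java.io.BufferedReader;
        import java.io.FileReader;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.compress.CompressionCodec;
        import org.apache.hadoop.io.compress.CompressionOutputStream;
        import org.apache.hadoop.io.compress.GzipCodec;
        import org.apache.hadoop.util.ReflectionUtils;

        public class CompressedHdfsWriter {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            CompressionCodec codec =
                ReflectionUtils.newInstance(GzipCodec.class, conf);

            // Wrap the raw HDFS output stream with the codec so bytes reach
            // the datanodes already compressed -- HDFS itself compresses nothing.
            Path out = new Path(args[1] + codec.getDefaultExtension());  // appends ".gz"
            CompressionOutputStream cos = codec.createOutputStream(fs.create(out));
            BufferedReader in = new BufferedReader(new FileReader(args[0]));
            try {
              String line;
              while ((line = in.readLine()) != null) {
                cos.write((line + "\n").getBytes("UTF-8"));
              }
            } finally {
              cos.close();
              in.close();
            }
          }
        }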


    --
    Eric Sammer
    phone: +1-917-287-2675
    twitter: esammer
    data: www.cloudera.com
  • U235sentinel at Apr 4, 2010 at 10:32 pm
    Ok that's what I was thinking. I was wondering if Hadoop did on-the-fly
    compression as it stored files in HDFS, like Sensage does. But it sounds
    like Hadoop will take a compressed file and store it as compressed, which
    is fine by me. Sensage will do the same.

    I believe this answers the question. Sonal's link suggests there is
    support for compression using zlib, gzip and bzip2.

    One more question though. If we store files in compressed format, are
    there any issues with searching that data? I'm curious whether there is a
    disadvantage in doing this. I could build bigger and badder servers, but
    was hoping for compression.

    Thanks



  • Eric Sammer at Apr 5, 2010 at 4:56 am
    See below.
    On Sun, Apr 4, 2010 at 3:32 PM, u235sentinel wrote:

    > Ok that's what I was thinking. I was wondering if Hadoop did on-the-fly
    > compression as it stored files in HDFS like Sensage does. But it sounds
    > like Hadoop will take a compressed file and store it as compressed, which
    > is fine by me. Sensage will do the same.

    That's correct.

    > I believe this answers the question. Sonal's link suggests there is
    > support for compression using zlib, gzip and bzip2.
    >
    > One more question though. So storing files in compressed format, any
    > issues with searching that data? I'm curious if there is a disadvantage
    > in doing this. I could build bigger and badder servers but was hoping
    > for compression.
    Just to be super specific about this, you can write data in any format
    into HDFS. If you can turn it into Java primitives (including bytes),
    you can write it to HDFS. The second half of the question is: what are
    my options for processing this data? If you plan on using Hadoop
    MapReduce to process these files, you'll want to make sure you use a
    compression format that Hadoop can "split" for parallel processing;
    only a subset of the common formats can be. If you aren't planning on
    using the MR component of Hadoop, you can do whatever you'd like. You
    can still write MapReduce jobs over non-splittable compression formats,
    but Hadoop will not be able to process a single file concurrently and
    instead will have to process each entire file in one task. The best
    option here is to dig into the docs a bit, figure out whether what you
    want to do will be possible, and take care of these details at the
    beginning.
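
    As a concrete picture of what "searching" compressed data looks like
    outside of MapReduce, here is a small sketch (names are made up) that
    streams a compressed file back out of HDFS. MapReduce's text input
    handling does the same codec lookup automatically; the only catch is that
    a gzip file is consumed by a single task, while bzip2 (and indexed LZO)
    can be split across tasks:

        import java.io.BufferedReader;
        import java.io.InputStream;
        import java.io.InputStreamReader;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.compress.CompressionCodec;
        import org.apache.hadoop.io.compress.CompressionCodecFactory;

        public class CompressedHdfsReader {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path in = new Path(args[0]);  // e.g. /logs/2010-04-03.log.gz

            // The same lookup TextInputFormat performs: pick a codec from the
            // file extension, or fall back to reading the bytes as-is.
            CompressionCodecFactory factory = new CompressionCodecFactory(conf);
            CompressionCodec codec = factory.getCodec(in);

            InputStream raw = fs.open(in);
            InputStream stream = (codec == null) ? raw : codec.createInputStream(raw);
            BufferedReader reader =
                new BufferedReader(new InputStreamReader(stream, "UTF-8"));
            try {
              String line;
              while ((line = reader.readLine()) != null) {
                System.out.println(line);  // grep/filter logic would go here
              }
            } finally {
              reader.close();
            }
          }
        }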

    --
    Eric Sammer
    phone: +1-917-287-2675
    twitter: esammer
    data: www.cloudera.com
