FAQ
If I need to compress some files or data and then put them into HDFS, are
Hadoop's own APIs the way to do it?
Also, do the classes under the hadoop.io.compress package implement the
different compression algorithms internally? If so, can I simply import them
and use them as needed, without having to look at how the algorithms are
implemented?

Lastly, which compression algorithm is the most efficient?

Any suggestions on the above would be appreciated!
--
Regards!
Sugandha
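
For reference, the codec classes under org.apache.hadoop.io.compress can be
used as ordinary stream wrappers, so there is no need to look into how the
algorithms themselves are implemented. A minimal sketch (GzipCodec and the
class/method names below are only illustrative choices, not from the thread):

import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class CodecSketch {
    // Compress a byte[] with GzipCodec; the codec hides the algorithm details.
    public static byte[] gzip(byte[] raw) throws IOException {
        Configuration conf = new Configuration();
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        CompressionOutputStream out = codec.createOutputStream(buffer);
        out.write(raw);
        out.finish();   // flush any buffered compressed data
        out.close();
        return buffer.toByteArray();
    }
}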


  • Sugandha Naolekar at Jul 9, 2009 at 12:46 pm

    ---------- Forwarded message ----------
    From: Sugandha Naolekar <sugandha.n87@gmail.com>
    Date: Thu, Jul 9, 2009 at 1:03 PM
    Subject: Compression issues..!
    To: core-user@hadoop.apache.org

    If I need to compress some files or data and then put them into HDFS, are
    Hadoop's own APIs the way to do it?
    Also, do the classes under the hadoop.io.compress package implement the
    different compression algorithms internally? If so, can I simply import them
    and use them as needed, without having to look at how the algorithms are
    implemented?

    Lastly, which compression algorithm is the most efficient?

    Any suggestions on the above would be appreciated!
    --
    Regards!
    Sugandha



  • Sugandha Naolekar at Jul 15, 2009 at 5:40 am
    Hello!

    A few days back I asked about compressing data placed in Hadoop, and the
    replies I received suggested placing the data in HDFS first and then
    compressing it, so that the data ends up in sequence files.

    But my question is: I want to compress the data before placing it in HDFS,
    so that redundancy doesn't come into the picture.

    How can I do that? Also, will I have to use an external compression
    algorithm, or will Hadoop's APIs alone serve the purpose?

    --
    Regards!
    Sugandha
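
    One way to do this with nothing but the Hadoop API is to wrap the HDFS
    output stream in a compression codec, so the bytes are compressed on the
    client before they ever reach HDFS. A rough sketch, assuming GzipCodec and
    placeholder local/HDFS paths:

    import java.io.InputStream;
    import java.io.OutputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.util.ReflectionUtils;

    public class CompressedUpload {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem local = FileSystem.getLocal(conf);
            FileSystem hdfs = FileSystem.get(conf);

            CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

            Path src = new Path("/tmp/input.log");  // placeholder local file
            Path dst = new Path("/user/hadoop/input.log" + codec.getDefaultExtension());  // placeholder HDFS target

            InputStream in = local.open(src);
            // The codec wraps the HDFS stream, so data is compressed on the
            // client before it is written into HDFS.
            OutputStream out = codec.createOutputStream(hdfs.create(dst));
            IOUtils.copyBytes(in, out, 4096, true);  // copy and close both streams
        }
    }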
  • Tarandeep Singh at Jul 15, 2009 at 6:12 am
    You can put compressed data on HDFS and run a MapReduce job on it, but you
    should use a codec that supports file splitting; otherwise the whole file
    will be read by a single mapper. If you have read about the MapReduce
    architecture, you will know that a map function processes a chunk of data
    (called a split). If a file is large and supports splitting (e.g. a plain
    text file where records are separated by newlines, or a sequence file),
    then it can be processed in parallel by multiple mappers, each handling one
    split of the file. However, if the compression codec you use does not
    support file splitting, the whole file will be processed by one mapper and
    you won't achieve any parallelism.

    Check the Hadoop wiki for the compression codecs that support file splitting.

    -Tarandeep

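
    A splitting-friendly alternative, in the spirit of the earlier "sequence
    files" suggestion, is to write the data into a block-compressed
    SequenceFile, which stays splittable regardless of the codec used inside
    it. A rough sketch using placeholder paths, DefaultCodec, and the older
    SequenceFile.createWriter overload:

    import java.io.BufferedReader;
    import java.io.FileReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.DefaultCodec;

    public class SequenceFileUpload {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path out = new Path("/user/hadoop/data.seq");  // placeholder HDFS path

            // Block compression keeps the file splittable for MapReduce.
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, out, LongWritable.class, Text.class,
                    SequenceFile.CompressionType.BLOCK, new DefaultCodec());

            BufferedReader reader = new BufferedReader(new FileReader("/tmp/input.log"));  // placeholder
            String line;
            long lineNo = 0;
            while ((line = reader.readLine()) != null) {
                writer.append(new LongWritable(lineNo++), new Text(line));
            }
            reader.close();
            writer.close();
        }
    }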
  • Jason hadoop at Jul 15, 2009 at 1:30 pm
    Particularly for highly compressible data such as web log files, the loss
    in potential data locality is more than made up for by the increase in
    network transfer speed. The other somewhat unexpected side benefit is that
    there are fewer map tasks, with less task startup overhead. If your data is
    not highly compressible, or your jobs are CPU-bound, the cost-benefit ratio
    may not be favorable.


    --
    Pro Hadoop, a book to guide you from beginner to hadoop mastery,
    http://www.amazon.com/dp/1430219424?tag=jewlerymall
    www.prohadoopbook.com a community for Hadoop Professionals
