FAQ
Will data stored in a compressed format affect MapReduce job speed? Will it increase or decrease, or is the relationship between the two more complex? Can anybody explain in detail?

2010-08-26



shangan


  • Ted Yu at Aug 26, 2010 at 3:44 am
    Compressed data increases processing time in the mapper/reducer but
    decreases the amount of data transferred between tasktracker nodes.
    Normally you should consider applying some form of compression.
  • Harsh J at Aug 26, 2010 at 4:15 am
    Logically it 'should' increase time, since it is an extra step beyond the
    mapper/reducer. But while your processing time increases slightly (very,
    very slightly), your I/O and network transfer time decreases by a large
    margin, giving you the clear impression that your total job time has
    decreased overall. The difference is between writing out, say, 10 GB
    before versus 5-7 GB now (a crude example).

    With the fast CPUs available these days, compressing and decompressing
    should hardly take a noticeable amount of extra time. It's almost
    negligible when using gzip, LZO, or plain deflate.
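    The CPU-versus-size trade-off Harsh describes can be sketched with a
    small standalone Python script using the stdlib codecs (gzip, deflate,
    bzip2); this measures the codecs themselves, not Hadoop, and the sample
    data is synthetic:

    ```python
    import bz2
    import gzip
    import time
    import zlib

    # Synthetic, log-like input: repetitive text, ~5.9 MB uncompressed.
    data = b"2010-08-26 INFO mapreduce task attempt finished in 1234 ms\n" * 100_000

    for name, compress in [("gzip", gzip.compress),
                           ("deflate", zlib.compress),
                           ("bzip2", bz2.compress)]:
        start = time.perf_counter()
        out = compress(data)
        elapsed = time.perf_counter() - start
        # Ratio of compressed to original size, and time spent compressing.
        print(f"{name:8s} {len(out) / len(data):7.2%} of original, {elapsed:.3f}s")
    ```

    On repetitive data like this, all three codecs shrink the payload by well
    over an order of magnitude for a fraction of a second of CPU, which is why
    shipping compressed bytes between nodes usually wins.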


    --
    Harsh J
    www.harshj.com
  • Shangan at Aug 26, 2010 at 5:50 am
    I agree with you for the most part, but I have some other questions. Mappers work on the local machine, so there are no network transfers during that phase; if the original data stored in HDFS is compressed, it will only decrease the I/O time. My main doubt is whether a mapper can work on just part of the data when the data is compressed, since compressed files seem not to be splittable. I tried a "select sum()" in Hive and traced the job: the .tar.gz data could only be worked on by one single machine and was stuck there for quite a long time (it seemed to be waiting for other parts of the data to be copied from other machines), while uncompressed data was processed on different machines in parallel. Do you know anything about this?

    2010-08-26



    shangan

  • Harsh J at Aug 26, 2010 at 6:09 am

    Gzip-compressed files cannot be decompressed as split blocks, so only
    one mapper runs. The bzip2 algorithm supports splitting and
    decompressing individual blocks of a file; you may try that.
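    The property that makes a format splittable can be illustrated in plain
    Python: bzip2 blocks (here modeled as separate bz2 streams, which the
    stdlib handles natively) are self-contained, so a reader handed only a
    later block can still decompress it. This is an analogy for Hadoop's
    split handling, not its actual code:

    ```python
    import bz2

    # Two record batches compressed as independent bz2 streams and
    # concatenated, mimicking a file made of self-contained blocks.
    part1 = bz2.compress(b"records 0..499\n")
    part2 = bz2.compress(b"records 500..999\n")
    combined = part1 + part2

    # One reader can decompress the whole concatenation sequentially...
    assert bz2.decompress(combined) == b"records 0..499\nrecords 500..999\n"

    # ...but a "mapper" handed only the second block can also decompress it
    # on its own, without any bytes that came before it. A single gzip
    # stream offers no such entry points mid-file.
    assert bz2.decompress(combined[len(part1):]) == b"records 500..999\n"
    print("each block decompresses independently")
    ```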

    LZO can be made to allow block splitting by indexing all the available
    files first (a program and a set of InputFormat classes for this are
    provided by the hadoop-lzo project on GitHub:
    http://github.com/kevinweil/hadoop-lzo ).

    When using compression, it's usually also suggested to use SequenceFiles
    and/or Avro data files for data storage, as these are designed with
    Hadoop's HDFS and MapReduce in mind and contain block checkpoints (sync
    markers) that let them be split into blocks with any compression codec
    applied. (Note: Avro uses deflate within its own format.)
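    Harsh's suggestion can be sketched as a job configuration fragment for
    block-compressed SequenceFile output. These are the Hadoop 0.20-era
    property names (the era of this thread); newer releases rename them, so
    check your version's documentation:

    ```xml
    <!-- mapred-site.xml or per-job configuration -->
    <property>
      <name>mapred.output.compress</name>
      <value>true</value>
    </property>
    <property>
      <!-- BLOCK-level compression keeps SequenceFile output splittable -->
      <name>mapred.output.compression.type</name>
      <value>BLOCK</value>
    </property>
    <property>
      <name>mapred.output.compression.codec</name>
      <value>org.apache.hadoop.io.compress.GzipCodec</value>
    </property>
    ```

    With BLOCK compression, the codec is applied per block between sync
    markers rather than over the whole file, which is what preserves
    splittability even with a non-splittable codec like gzip.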


    --
    Harsh J
    www.harshj.com
  • Greg Roelofs at Aug 26, 2010 at 7:58 pm

    Harsh J wrote:

    BZip2 algorithm supports splitting and decompressing
    individual blocks of a file, you may try that.
    Only on trunk (and maybe 0.21; not sure). I don't believe anyone has
    backported splittable bzip2 support to 0.20.

    Greg

Discussion Overview
group: common-user @ hadoop.apache.org
categories: hadoop
posted: Aug 26, '10 at 2:33a
active: Aug 26, '10 at 7:58p
posts: 6
users: 4
irc: #hadoop
