FAQ
Hi,
If I have big gzip files (>> block size), will M/R split a single
file into multiple blocks and send them to different mappers?
The behavior I currently see is that one map is still opened per file
(and not per block).

I would also appreciate it if you could share your experience in
defining block size (compared to HDFS size and to job processing size).


Thanks,
Lior


  • Harsh J at Mar 17, 2011 at 2:05 pm

    On Thu, Mar 17, 2011 at 6:40 PM, Lior Schachter wrote:
    Hi,
    If I have big gzip files (>> block size), will M/R split a single
    file into multiple blocks and send them to different mappers?
    The behavior I currently see is that one map is still opened per file
    (and not per block).
    Yes, this is true. This is the current behavior with GZip files, since
    they can't be split and decompressed independently of each other. I had
    somehow managed to ignore the GZIP part of your question in the
    previous thread!

    But still, ~60 files worth 15 GB total would mean at least 3 GB per
    file. And seeing how they can't really be split right now, it would be
    good to have each use up only a single block. Perhaps for these files
    alone you could use a block size of 3-4 GB, thereby making these file
    reads more local for your record readers?

    In the future, HADOOP-7076 plans to add a pseudo-splitting approach for
    plain GZIP files. 'Concatenated' GZIP files could be split across
    mappers as well (HADOOP-6835, as demonstrated in PIG-42).
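The distinction above can be sketched with Python's stdlib gzip module (an illustrative sketch only; Hadoop's input formats are Java, and HADOOP-6835/7076 are the actual implementations):

```python
import gzip
import os
import zlib

# Two independently-gzipped members, then concatenated: the layout that
# HADOOP-6835 exploits. Every member boundary is a clean resume point.
part1 = gzip.compress(b"record-1\n")
part2 = gzip.compress(b"record-2\n")
concatenated = part1 + part2

# A gzip reader consumes all members as one logical stream:
assert gzip.decompress(concatenated) == b"record-1\nrecord-2\n"

# A reader that seeks to a member boundary can also start cleanly,
# which is what makes per-member splits across mappers possible:
assert gzip.decompress(concatenated[len(part1):]) == b"record-2\n"

# By contrast, a plain single-member gzip file cannot be entered at an
# arbitrary offset, because the DEFLATE state depends on all prior bytes.
# This is why one map is opened per .gz file rather than per block:
blob = gzip.compress(os.urandom(4096))
try:
    gzip.decompress(blob[100:])       # no gzip header at this offset
    raise AssertionError("mid-stream decompression should have failed")
except (OSError, EOFError, zlib.error):
    pass                              # expected failure
```

The second assertion is the whole trick: a record reader that lands at a member boundary of a concatenated file can decompress from there without any earlier bytes.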
  • Lior Schachter at Mar 17, 2011 at 2:23 pm
    Currently each gzip file is about 250 MB (×60 files = 15 GB), so we
    have 256 MB blocks.

    However, I understand that smaller files/blocks make better use of M/R
    parallel processing.

    So maybe having 128 MB gzip files with a corresponding 128 MB block
    size would be better?

  • Harsh J at Mar 17, 2011 at 3:08 pm

    On Thu, Mar 17, 2011 at 7:51 PM, Lior Schachter wrote:
    Currently each gzip file is about 250 MB (×60 files = 15 GB), so we
    have 256 MB blocks.
    Darn, I ought to sleep a bit more. I computed files/GB and read it as
    GB/file, meh..
    However, I understand that smaller files/blocks make better use of M/R
    parallel processing.
    Yes this is true in case of text/sequence files.
    So maybe having 128 MB gzip files with a corresponding 128 MB block
    size would be better?
    Why not 256 MB for all your ~250 MB _gzip_ files, making each nearly
    one block, since they would not be split anyway?
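The tradeoff reduces to simple block arithmetic. A minimal sketch (plain Python; `blocks_needed` is an illustrative helper, not a Hadoop API): each unsplittable gzip file gets exactly one map task regardless of block size, so the only variable is how many blocks that single task must fetch.

```python
import math

def blocks_needed(file_mb: float, block_mb: int) -> int:
    # Number of HDFS blocks a file of file_mb megabytes occupies.
    return math.ceil(file_mb / block_mb)

# A ~250 MB gzip file is one map task either way (gzip is unsplittable);
# the block size only changes how many blocks that one task must read:
assert blocks_needed(250, 128) == 2   # second block may be a remote read
assert blocks_needed(250, 256) == 1   # whole file local to one datanode
```

With a 256 MB block size, every map reads a single, fully local block; with 128 MB blocks, half of each file may have to travel over the network.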
  • Lior Schachter at Mar 17, 2011 at 3:15 pm
    Yes, but wouldn't M/R work better with 128 MB gzip files and a matching
    128 MB block size?

    Anyhow, thanks for the useful information.
  • Harsh J at Mar 17, 2011 at 4:15 pm
    Not in the case of .gz files. [Since there is no splitting done, the
    mapper will read 128 MB locally from a resident DN, and could then read
    the remaining ~122 MB over the network from another DN if the next
    block does not also reside on the same DN -- thereby introducing a
    network read cost.]
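This cost argument can be quantified with a toy placement model (an assumption for illustration, not Hadoop's actual scheduler; `expected_remote_mb` and the replication/datanode numbers are hypothetical): the map task is scheduled local to its first block, and each further block is local only if a replica happens to live on the same node.

```python
def expected_remote_mb(file_mb: int, block_mb: int,
                       replication: int = 3, datanodes: int = 20) -> float:
    # Toy model: the task runs local to the first block; any other block
    # has roughly a replication/datanodes chance of a local replica.
    full, rem = divmod(file_mb, block_mb)
    sizes = [block_mb] * full + ([rem] if rem else [])
    p_local = min(1.0, replication / datanodes)
    return sum(size * (1.0 - p_local) for size in sizes[1:])

# 250 MB gzip file, 128 MB blocks: ~104 MB expected over the network.
print(round(expected_remote_mb(250, 128)))   # -> 104
# 250 MB gzip file, 256 MB blocks: the whole read stays local.
print(round(expected_remote_mb(250, 256)))   # -> 0
```

Under any such model the 256 MB block size wins for these files: a single-block file has no "other" blocks to fetch remotely.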

    --
    Harsh J
    http://harshj.com

Discussion Overview
group: hdfs-user
categories: hadoop
posted: Mar 17, '11 at 1:10p
active: Mar 17, '11 at 4:15p
posts: 6
users: 2
website: hadoop.apache.org...
irc: #hadoop

2 users in discussion

Lior Schachter: 3 posts; Harsh J: 3 posts

site design / logo © 2022 Grokbase