Hi,

I ran two MapReduce (MR) jobs with the same input files and
with 3 Reduce tasks defined. One run has the map-intermediate
files compressed, and the other has them uncompressed. Below
are some debug lines that I added to the code.

1 - With uncompressed data, the raw length is always smaller than
the partition length, but with compressed data it is not. Why is the
raw length bigger than the partition length for compressed data?

2 - If we configure the map-intermediate files to be compressed, how
are they distributed to the reduces? Since we can
split a compressed file, does this mean that each spill file is
compressed? For example, Compressed(Spill idx 0) goes to Reduce 0,
Compressed(Spill idx 1) goes to Reduce 1 and Compressed(Spill idx 2)
goes to Reduce 2.

Compressed data

Spill idx 0 - SegmentStart: 0 Part length: 10560 Raw length: 27567
Spill idx 1 - SegmentStart: 10560 Part length: 10029 Raw length: 26003
Spill idx 2 - SegmentStart: 20589 Part length: 10142 Raw length: 26459

Spill idx 0 - SegmentStart: 0 Part length: 10202 Raw length: 26785
Spill idx 1 - SegmentStart: 10202 Part length: 9932 Raw length: 26100
Spill idx 2 - SegmentStart: 20134 Part length: 9926 Raw length: 25821

Spill idx 0 - SegmentStart: 0 Part length: 9410 Raw length: 24503
Spill idx 1 - SegmentStart: 9410 Part length: 9849 Raw length: 25564
Spill idx 2 - SegmentStart: 19259 Part length: 9489 Raw length: 24716

Spill idx 0 - SegmentStart: 0 Part length: 1661 Raw length: 3440
Spill idx 1 - SegmentStart: 1661 Part length: 1527 Raw length: 3160
Spill idx 2 - SegmentStart: 3188 Part length: 1737 Raw length: 3750
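
The compressed-run figures above can be checked mechanically. The following is a minimal sketch in Python (not part of the job itself; the `LOG` string, `parse` and `starts_are_cumulative` are illustrative names) that parses the first group of debug lines and checks two things: each SegmentStart equals the sum of the preceding Part lengths, and Part length is a fairly stable fraction of Raw length. That pattern would be expected if Part length counts the bytes actually written (after compression) and Raw length counts the bytes before compression, but that reading is an assumption drawn from the numbers, not something the log states.

```python
import re

# First group of debug lines from the compressed run, copied verbatim.
LOG = """\
Spill idx 0 - SegmentStart: 0 Part length: 10560 Raw length: 27567
Spill idx 1 - SegmentStart: 10560 Part length: 10029 Raw length: 26003
Spill idx 2 - SegmentStart: 20589 Part length: 10142 Raw length: 26459
"""

LINE = re.compile(
    r"Spill idx (\d+) - SegmentStart: (\d+) Part length: (\d+) Raw length: (\d+)"
)


def parse(log):
    """Return a list of (idx, segment_start, part_length, raw_length) tuples."""
    return [tuple(int(g) for g in m.groups()) for m in LINE.finditer(log)]


def starts_are_cumulative(records):
    """True if every SegmentStart equals the sum of the earlier Part lengths."""
    offset = 0
    for _idx, start, part, _raw in records:
        if start != offset:
            return False
        offset += part
    return True


records = parse(LOG)
contiguous = starts_are_cumulative(records)
# Fraction of the raw size that each segment occupies in the file.
ratios = [part / raw for _idx, _start, part, raw in records]

print("segments are laid out back to back:", contiguous)
print("part/raw ratios:", [round(r, 3) for r in ratios])
```

Because the segments sit back to back and are addressed by SegmentStart and Part length, the numbers suggest that each partition's segment can be read on its own, without touching its neighbours.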



Non-compressed data

Spill idx 0 - SegmentStart: 0 Part length: 27571 Raw length: 27567
Spill idx 1 - SegmentStart: 27571 Part length: 26007 Raw length: 26003
Spill idx 2 - SegmentStart: 53578 Part length: 26463 Raw length: 26459

Spill idx 0 - SegmentStart: 0 Part length: 26789 Raw length: 26785
Spill idx 1 - SegmentStart: 26789 Part length: 26104 Raw length: 26100
Spill idx 2 - SegmentStart: 52893 Part length: 25825 Raw length: 25821

Spill idx 0 - SegmentStart: 0 Part length: 24507 Raw length: 24503
Spill idx 1 - SegmentStart: 24507 Part length: 25568 Raw length: 25564
Spill idx 2 - SegmentStart: 50075 Part length: 24720 Raw length: 24716

Spill idx 0 - SegmentStart: 0 Part length: 3444 Raw length: 3440
Spill idx 1 - SegmentStart: 3444 Part length: 3164 Raw length: 3160
Spill idx 2 - SegmentStart: 6608 Part length: 3754 Raw length: 3750
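
Putting the two runs side by side makes the pattern behind question 1 easier to see. The sketch below (Python; the `SEGMENTS` list is a hand-copied sample of the lines above, and the variable names are illustrative) shows that the Raw length is identical in both runs and that, without compression, Part length exceeds Raw length by the same small constant every time. A constant gap like this is consistent with a fixed-size trailer, such as a checksum, being appended to every segment; that explanation is an assumption, since the log does not say where the extra bytes come from.

```python
# (raw length, part length without compression, part length with compression)
# Values copied from the debug lines of the two runs for matching segments.
SEGMENTS = [
    (27567, 27571, 10560),
    (26003, 26007, 10029),
    (26459, 26463, 10142),
    (3440, 3444, 1661),
]

# Extra bytes per segment in the uncompressed run: part length minus raw length.
overheads = {plain - raw for raw, plain, _comp in SEGMENTS}

# With compression on, the part length drops below the raw length.
compressed_is_smaller = all(comp < raw for raw, _plain, comp in SEGMENTS)

print("distinct part-minus-raw gaps without compression:", sorted(overheads))
print("compressed part length below raw length:", compressed_is_smaller)
```

Under that reading, the raw length only looks "bigger" in the compressed run because Part length is measured after compression while Raw length is measured before it.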


Thanks,

--
Pedro


  • Pedro Costa at Feb 15, 2011 at 10:49 am
    As I understand from the log lines that I posted, since we have
    3 Reduces in this example, all spill 0 files will be merged and go to
    Reduce 0, all spill 1 files will be merged and go to Reduce 1, and all
    spill 2 files will be merged and go to Reduce 2.

    Does this mean that, if we turn compression on, it is the merged files
    that are compressed?

    Thanks,




    --
    Pedro

Discussion Overview
group: mapreduce-user @
category: hadoop
posted: Feb 15, '11 at 10:36a
active: Feb 15, '11 at 10:49a
posts: 2
users: 1
website: hadoop.apache.org...
irc: #hadoop