I am running a large streaming job that processes about 3 TB of data, and I
am seeing large jumps in disk usage during the reduce phase. I have tracked
the problem down: the job is set to compress map outputs, but the
intermediate files on the local drives are not being compressed during or
after merges. I go from, say, 2 GB of mapfile.out files to a single
intermediate.X file that is 100-350% larger than the map files. I looked at
one of these files and confirmed that it is not compressed, since I can read
the data in it. With a single merge this would not be a problem, but when
70-100 of these are merged they use many gigabytes, and my tasks start to die
as they run out of disk space and in the end kill the job.

I am running 0.19.1-dev, r744282. I have searched the issues but found
nothing about the compression.
Should the intermediate results not also be compressed if the map output
files are set to be compressed?
If not, then why do we have the map compression option? Just to save network
traffic?


  • Chris Douglas at Mar 17, 2009 at 5:34 am

    > I am running 0.19.1-dev, r744282. I have searched the issues but
    > found nothing about the compression.

    AFAIK, there are no open issues that prevent intermediate compression
    from working. The following might be useful:

    http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Data+Compression

    > Should the intermediate results not be compressed also if the map
    > output files are set to be compressed?

    These are controlled by separate options.

    FileOutputFormat::setCompressOutput enables/disables compression on
    the final output
    JobConf::setCompressMapOutput enables/disables compression of the
    intermediate output

    > If not then why do we have the map compression option just to save
    > network traffic?

    That's part of it. Also to save on disk bandwidth and intermediate
    space. -C
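
    As a rough illustration of "separate options": in 0.19-era Hadoop the
    two setters named above are, as far as I know, backed by two independent
    boolean configuration keys, mapred.compress.map.output and
    mapred.output.compress, both off by default. The sketch below uses plain
    java.util.Properties rather than JobConf so that it runs without Hadoop
    on the classpath. The class and method names are hypothetical, and the
    key names are an assumption that should be checked against the
    hadoop-default.xml of the release in use.

```java
import java.util.Properties;

// Hypothetical stand-in for a job configuration; not a Hadoop class.
class CompressionSwitches {
    // Assumed key behind JobConf::setCompressMapOutput (intermediate output).
    static final String MAP_OUTPUT_KEY = "mapred.compress.map.output";
    // Assumed key behind FileOutputFormat::setCompressOutput (final output).
    static final String FINAL_OUTPUT_KEY = "mapred.output.compress";

    // True only if the intermediate (map output) switch is set to "true".
    static boolean intermediateCompressed(Properties conf) {
        return Boolean.parseBoolean(conf.getProperty(MAP_OUTPUT_KEY, "false"));
    }

    // True only if the final job output switch is set to "true".
    static boolean finalOutputCompressed(Properties conf) {
        return Boolean.parseBoolean(conf.getProperty(FINAL_OUTPUT_KEY, "false"));
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(MAP_OUTPUT_KEY, "true"); // only the intermediate switch
        System.out.println("intermediate compressed: " + intermediateCompressed(conf));
        System.out.println("final output compressed: " + finalOutputCompressed(conf));
    }
}
```

    Setting one key leaves the other untouched, which is why a job can have
    compressed map outputs and an uncompressed final output, or the reverse.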
  • Billy Pearson at Mar 17, 2009 at 7:52 am
    I understand that. I have CompressMapOutput set and it works: the map
    outputs are compressed. But on the reduce end, the task downloads x files
    and then merges them into one intermediate file to keep the number of
    files to a minimum (<= io.sort.factor).

    My problem is that the output from merging the intermediate map output
    files is not compressed, so I lose all the benefit of compressing the map
    output to save disk space, because the merged map output files are no
    longer compressed.

    Note that there are two different types of intermediate files: the map
    outputs, and the files the reduce produces by merging the map outputs to
    meet the configured io.sort.factor.

    Billy
  • Chris Douglas at Mar 17, 2009 at 8:25 am

    > My problem is the output from merging the intermediate map output
    > files is not compressed so I lose all the benefit of compressing the
    > map file output to save disk space because the merged map output
    > files are no longer compressed.

    It should still be compressed, unless there's some bizarre regression.
    More segments will be around simultaneously (since the segments not
    yet merged are still on disk), which clearly puts pressure on
    intermediate storage, but if the map outputs are compressed, then the
    merged map outputs at the reduce must also be compressed. There's no
    place in the intermediate format to store compression metadata, so
    either all are or none are. Intermediate merges should also follow the
    compression spec of the initiating merger, too (o.a.h.mapred.Merger:
    447).

    How are you concluding that the intermediate output is compressed from
    the map, but not in the reduce? -C
  • Billy Pearson at Mar 17, 2009 at 5:16 pm
    Watching a second job with more reduce tasks running, it looks like the
    in-memory merges are working correctly with compression.

    The task I was watching had failed and was running again. It shuffled all
    the map output files and only started merging after everything was
    copied, so none were merged in memory; it was closed before the merging
    started.
    If it helps, the output files are named intermediate.x and are stored in
    the folder mapred/local/job-taskname/intermediate.x,
    while the in-memory merges are stored in
    mapred/local/taskTracker/jobcache/job-name/taskname/

    The uncompressed ones are the intermediate.x files above.

    Billy
  • Billy Pearson at Mar 19, 2009 at 3:08 am
    The intermediate.X files are not getting compressed for some reason; I am
    not sure why.
    I downloaded and built the latest branch for 0.19.

    o.a.h.mapred.Merger.class line 432
    new Writer<K, V>(conf, fs, outputFile, keyClass, valueClass, codec);

    This seems to use the codec defined above, but for some reason it is not
    working correctly: the compression is not being passed from the map output
    files to the on-disk merge of the intermediate.X files.

    Tail of the task log from one server:

    2009-03-18 19:19:02,643 INFO org.apache.hadoop.mapred.ReduceTask: Interleaved on-disk merge complete: 1730 files left.
    2009-03-18 19:19:02,645 INFO org.apache.hadoop.mapred.ReduceTask: In-memory merge complete: 3 files left.
    2009-03-18 19:19:02,650 INFO org.apache.hadoop.mapred.ReduceTask: Keeping 3 segments, 39835369 bytes in memory for intermediate, on-disk merge
    2009-03-18 19:19:03,878 INFO org.apache.hadoop.mapred.ReduceTask: Merging 1730 files, 70359998581 bytes from disk
    2009-03-18 19:19:03,909 INFO org.apache.hadoop.mapred.ReduceTask: Merging 0 segments, 0 bytes from memory into reduce
    2009-03-18 19:19:03,909 INFO org.apache.hadoop.mapred.Merger: Merging 1733 sorted segments
    2009-03-18 19:19:04,161 INFO org.apache.hadoop.mapred.Merger: Merging 22 intermediate segments out of a total of 1733
    2009-03-18 19:21:43,693 INFO org.apache.hadoop.mapred.Merger: Merging 30 intermediate segments out of a total of 1712
    2009-03-18 19:27:07,033 INFO org.apache.hadoop.mapred.Merger: Merging 30 intermediate segments out of a total of 1683
    2009-03-18 19:33:27,669 INFO org.apache.hadoop.mapred.Merger: Merging 30 intermediate segments out of a total of 1654
    2009-03-18 19:40:38,243 INFO org.apache.hadoop.mapred.Merger: Merging 30 intermediate segments out of a total of 1625
    2009-03-18 19:48:08,151 INFO org.apache.hadoop.mapred.Merger: Merging 30 intermediate segments out of a total of 1596
    2009-03-18 19:57:16,300 INFO org.apache.hadoop.mapred.Merger: Merging 30 intermediate segments out of a total of 1567
    2009-03-18 20:07:34,224 INFO org.apache.hadoop.mapred.Merger: Merging 30 intermediate segments out of a total of 1538
    2009-03-18 20:17:54,715 INFO org.apache.hadoop.mapred.Merger: Merging 30 intermediate segments out of a total of 1509
    2009-03-18 20:28:49,273 INFO org.apache.hadoop.mapred.Merger: Merging 30 intermediate segments out of a total of 1480
    2009-03-18 20:39:28,830 INFO org.apache.hadoop.mapred.Merger: Merging 30 intermediate segments out of a total of 1451
    2009-03-18 20:50:23,706 INFO org.apache.hadoop.mapred.Merger: Merging 30 intermediate segments out of a total of 1422
    2009-03-18 21:01:36,818 INFO org.apache.hadoop.mapred.Merger: Merging 30 intermediate segments out of a total of 1393
    2009-03-18 21:13:09,509 INFO org.apache.hadoop.mapred.Merger: Merging 30 intermediate segments out of a total of 1364
    2009-03-18 21:25:17,304 INFO org.apache.hadoop.mapred.Merger: Merging 30 intermediate segments out of a total of 1335
    2009-03-18 21:36:48,536 INFO org.apache.hadoop.mapred.Merger: Merging 30 intermediate segments out of a total of 1306
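
    The pass widths in this log (22 first, then 30 each time) can be
    reproduced with a small simulation. This is only a sketch: it assumes a
    merge factor of 30 and assumes the first pass is sized as
    ((segments - 1) mod (factor - 1)) + 1, so that every later pass is a full
    one. For 1733 segments that gives (1732 mod 29) + 1 = 22, which matches
    the log. The class and method names are hypothetical and this is not the
    Merger source.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of multi-pass merge planning; not Hadoop code.
class MergePassSketch {
    // Width of a pass: only the first pass may be narrower than the factor.
    static int passFactor(int factor, int passNo, int numSegments) {
        if (passNo > 1 || numSegments <= factor) {
            return factor;
        }
        int mod = (numSegments - 1) % (factor - 1);
        return mod == 0 ? factor : mod + 1;
    }

    // One {segmentsMerged, segmentsBeforePass} entry per intermediate pass.
    // Passes stop once the remaining segments fit into a single final merge.
    static List<int[]> intermediatePasses(int factor, int numSegments) {
        if (factor < 2) {
            throw new IllegalArgumentException("factor must be at least 2");
        }
        List<int[]> passes = new ArrayList<>();
        int remaining = numSegments;
        int passNo = 1;
        while (remaining > factor) {
            int width = passFactor(factor, passNo, remaining);
            passes.add(new int[] {width, remaining});
            remaining = remaining - width + 1; // inputs replaced by one new file
            passNo++;
        }
        return passes;
    }

    public static void main(String[] args) {
        for (int[] p : intermediatePasses(30, 1733)) {
            System.out.println("Merging " + p[0]
                + " intermediate segments out of a total of " + p[1]);
        }
    }
}
```

    Under these assumptions each intermediate pass rewrites its inputs into
    one new on-disk file, so if that file is written uncompressed, disk usage
    grows with every pass even though the inputs are deleted afterwards.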

    Note that the files total about 70 GB (70359998581 bytes), and these are
    compressed. At this point the merge has gone from 1733 files to 1306 left
    to merge, the intermediate.X files are well over 200 GB, and we are not
    even close to done. If compression were working, we should not see tasks
    failing at this point for lack of disk space, since we delete the merged
    files from the output folder as we merge.

    I only see this happening when too many files are left that did not get
    merged during the shuffle stage and the on-disk merging starts.
    Tasks that complete the merges and keep the count below io.sort.factor (30
    in my case) skip the on-disk merge and complete using a normal amount of
    disk space.

    Anyone care to take a look?
    This job takes two or more days to get to this point, so it is getting to
    be kind of a pain in the butt to run it and watch the reduces fail and the
    job keep failing no matter what.

    I can post the tail of this task log when it fails to show how far it gets
    before it runs out of space. Before the reduce on-disk merge starts, the
    disks are about 35-40% used, on 500 GB drives with two tasks running at
    the same time.

    Billy Pearson
  • Billy Pearson at Mar 19, 2009 at 3:14 am
    I can run head on the map.out files and I get compressed garbage, but when
    I run head on an intermediate file I can read the data in the file clearly.
    So compression is not being passed along, even though I am setting
    CompressMapOutput to true by default in my hadoop-site.xml file.

    Billy
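
    The head check described here can be turned into a rough, repeatable
    heuristic: read the first few kilobytes of a file and measure what
    fraction of the bytes are printable ASCII. This is only a sketch with
    hypothetical class and method names. It does not parse the intermediate
    file format, and any binary record framing in an uncompressed file will
    pull the fraction somewhat below 1.0, so treat the number as a hint
    rather than proof.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: guesses whether the head of a file is readable text.
class ReadableHeadCheck {
    // Fraction of buf[0..len) that is printable ASCII, tab, CR or LF.
    static double printableFraction(byte[] buf, int len) {
        if (len == 0) {
            return 0.0;
        }
        int printable = 0;
        for (int i = 0; i < len; i++) {
            int b = buf[i] & 0xff;
            if ((b >= 0x20 && b < 0x7f) || b == '\t' || b == '\n' || b == '\r') {
                printable++;
            }
        }
        return (double) printable / len;
    }

    // Reads up to maxBytes from the start of the file and scores them.
    static double printableFractionOfHead(Path file, int maxBytes) throws IOException {
        byte[] buf = new byte[maxBytes];
        int len = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while (len < maxBytes && (n = in.read(buf, len, maxBytes - len)) > 0) {
                len += n;
            }
        }
        return printableFraction(buf, len);
    }

    public static void main(String[] args) throws IOException {
        for (String name : args) {
            double f = printableFractionOfHead(Paths.get(name), 4096);
            System.out.printf("%s: %.2f printable%n", name, f);
        }
    }
}
```

    Run against a map.out file and an intermediate.X file from the same task,
    a much higher fraction for the intermediate file would be consistent with
    it having been written uncompressed.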


  • Stefan Will at Mar 19, 2009 at 6:27 pm
    I noticed this too. I think the compression only applies to the final mapper
    and reducer outputs, but not any intermediate files produced. The reducer
    will decompress the map output files after copying them, and then compress
    its own output only after it has finished.

    I wonder if this is by design, or just an oversight.

    -- Stefan
  • Billy Pearson at Mar 19, 2009 at 8:54 pm
    If CompressMapOutput is set, then it should carry all the way through to
    the reduce, including the map.out files and the intermediate files.
    I added some logging to the Merger. I have to wait until some more jobs
    finish before I can rebuild and restart to see the logging,
    but that will confirm whether or not the codec is null when it gets to
    line 432 and the writer is created for the intermediate files.

    If it is null I will open an issue.

    Billy
  • Billy Pearson at Mar 20, 2009 at 6:55 am
    I opened an issue here, if you would like to comment on it:
    https://issues.apache.org/jira/browse/HADOOP-5539

    Billy

    "Stefan Will" <stefan.will@gmx.net> wrote in message
    news:C5E7DC6D.1840D%stefan.will@gmx.net...
    I noticed this too. I think the compression only applies to the final
    mapper
    and reducer outputs, but not any intermediate files produced. The reducer
    will decompress the map output files after copying them, and then compress
    its own output only after it has finished.

    I wonder if this is by design, or just an oversight.

    -- Stefan

    From: Billy Pearson
    <sales@pearsonwholesale.com>
    Reply-To: <core-user@hadoop.apache.org>
    Date: Wed, 18 Mar 2009 22:14:07 -0500
    To: <core-user@hadoop.apache.org>
    Subject: Re: intermediate results not getting compressed

    I can run head on the map.out files and I get compressed garbish but I
    run
    head on a intermediate file and I can read the data in the file clearly
    so
    compression is not getting passed but I am setting the CompressMapOutput
    to
    true by default in my hadoop-site.conf file.

    Billy


    "Billy Pearson" <sales@pearsonwholesale.com>
    wrote in message news:gpscu3$66p$1@ger.gmane.org...
    the intermediate.X files are not getting compresses for some reason not
    sure why
    I download and build the latest branch for 0.19

    o.a.h.mapred.Merger.class line 432
    new Writer<K, V>(conf, fs, outputFile, keyClass, valueClass, codec);

    this seams to use the codec defined above but for some reasion its not
    working correctly the compression is not passing from the map output
    files
    to the on disk merge of the intermediate.X files

    tail task report from one server:

    2009-03-18 19:19:02,643 INFO org.apache.hadoop.mapred.ReduceTask: Interleaved on-disk merge complete: 1730 files left.
    2009-03-18 19:19:02,645 INFO org.apache.hadoop.mapred.ReduceTask: In-memory merge complete: 3 files left.
    2009-03-18 19:19:02,650 INFO org.apache.hadoop.mapred.ReduceTask: Keeping 3 segments, 39835369 bytes in memory for intermediate, on-disk merge
    2009-03-18 19:19:03,878 INFO org.apache.hadoop.mapred.ReduceTask: Merging 1730 files, 70359998581 bytes from disk
    2009-03-18 19:19:03,909 INFO org.apache.hadoop.mapred.ReduceTask: Merging 0 segments, 0 bytes from memory into reduce
    2009-03-18 19:19:03,909 INFO org.apache.hadoop.mapred.Merger: Merging 1733 sorted segments
    2009-03-18 19:19:04,161 INFO org.apache.hadoop.mapred.Merger: Merging 22 intermediate segments out of a total of 1733
    2009-03-18 19:21:43,693 INFO org.apache.hadoop.mapred.Merger: Merging 30 intermediate segments out of a total of 1712
    2009-03-18 19:27:07,033 INFO org.apache.hadoop.mapred.Merger: Merging 30 intermediate segments out of a total of 1683
    2009-03-18 19:33:27,669 INFO org.apache.hadoop.mapred.Merger: Merging 30 intermediate segments out of a total of 1654
    2009-03-18 19:40:38,243 INFO org.apache.hadoop.mapred.Merger: Merging 30 intermediate segments out of a total of 1625
    2009-03-18 19:48:08,151 INFO org.apache.hadoop.mapred.Merger: Merging 30 intermediate segments out of a total of 1596
    2009-03-18 19:57:16,300 INFO org.apache.hadoop.mapred.Merger: Merging 30 intermediate segments out of a total of 1567
    2009-03-18 20:07:34,224 INFO org.apache.hadoop.mapred.Merger: Merging 30 intermediate segments out of a total of 1538
    2009-03-18 20:17:54,715 INFO org.apache.hadoop.mapred.Merger: Merging 30 intermediate segments out of a total of 1509
    2009-03-18 20:28:49,273 INFO org.apache.hadoop.mapred.Merger: Merging 30 intermediate segments out of a total of 1480
    2009-03-18 20:39:28,830 INFO org.apache.hadoop.mapred.Merger: Merging 30 intermediate segments out of a total of 1451
    2009-03-18 20:50:23,706 INFO org.apache.hadoop.mapred.Merger: Merging 30 intermediate segments out of a total of 1422
    2009-03-18 21:01:36,818 INFO org.apache.hadoop.mapred.Merger: Merging 30 intermediate segments out of a total of 1393
    2009-03-18 21:13:09,509 INFO org.apache.hadoop.mapred.Merger: Merging 30 intermediate segments out of a total of 1364
    2009-03-18 21:25:17,304 INFO org.apache.hadoop.mapred.Merger: Merging 30 intermediate segments out of a total of 1335
    2009-03-18 21:36:48,536 INFO org.apache.hadoop.mapred.Merger: Merging 30 intermediate segments out of a total of 1306

    Note that the size of the files is about 70GB (70359998581 bytes), and
    these are compressed. At this point it has gone from 1733 files to 1306
    left to merge, the intermediate.X files are well over 200GB, and we are
    not even close to done. If compression were working we should not see
    tasks failing at this point because of a lack of hard drive space, since
    as we merge we delete the merged files from the output folder.
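
The segment counts in the log above can be reproduced with a short sketch. Two things here are assumptions rather than facts from the thread: that the merge factor is the 30 mentioned in this message, and that the first pass merges ((n - 1) mod (factor - 1)) + 1 segments so that every later pass is a full one. That rule was chosen because it matches the logged numbers; it has not been checked against the Merger source.

```python
def merge_passes(num_segments: int, factor: int):
    """Yield (segments_merged, total_before_pass) for each intermediate pass.

    Assumed rule: the first pass is sized so that all later passes merge
    exactly `factor` segments; each pass replaces its inputs with one file.
    """
    if factor < 2:
        raise ValueError("factor must be at least 2")
    first = True
    while num_segments > factor:
        if first:
            rem = (num_segments - 1) % (factor - 1)
            size = factor if rem == 0 else rem + 1
            first = False
        else:
            size = factor
        yield size, num_segments
        num_segments = num_segments - size + 1


passes = list(merge_passes(1733, 30))
print(passes[:3])                # first passes, to compare with the log
print(len(passes), "intermediate passes before the final merge")
```

Under these assumptions the log excerpt covers only the first 16 of 59 intermediate passes, which fits the observation that the task is not close to done when the disk fills up.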

    I only see this happening when there are too many files left that did not
    get merged during the shuffle stage and it starts on-disk merging.
    The tasks that complete the merges and keep the count below the io.sort
    size (in my case 30) skip the on-disk merge and complete using a normal
    amount of hard drive space.

    Anyone care to take a look?
    This job takes two or more days to get to this point, so it is getting to
    be kind of a pain in the butt to run and watch the reduces fail and the
    job keep failing no matter what.

    I can post the tail of this task log when it fails to show you how far it
    gets before it runs out of space. Before the reduce on-disk merge starts,
    the disks are about 35-40% used on a 500GB drive with two tasks running at
    the same time.

    Billy Pearson
  • Billy Pearson at Mar 19, 2009 at 7:40 am

    How are you concluding that the intermediate output is compressed from
    the map, but not in the reduce? -C
    My hadoop-site.xml:

    <property>
      <name>mapred.compress.map.output</name>
      <value>true</value>
      <description>Should the job outputs be compressed?</description>
    </property>
    <property>
      <name>mapred.output.compression.type</name>
      <value>BLOCK</value>
      <description>If the job outputs are to be compressed as SequenceFiles,
      how should they be compressed? Should be one of NONE, RECORD or BLOCK.
      </description>
    </property>


    from the job.xml

    mapred.output.compress = false // final output
    mapred.compress.map.output = true // map output

    Plus, I can run head on the files from the command line and read the
    keys/values in the reduce intermediate merges, but not in the map.out files.
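
The two settings above are easy to mix up, so here is a small sketch that reads them out of a config file and reports them separately. The `<configuration>` wrapper and the inline sample are assumptions for illustration; the property names and values are the ones quoted in this message. Treating a missing property as false is also an assumption about the defaults.

```python
import xml.etree.ElementTree as ET

# Hypothetical site config assembled from the values quoted in the message.
SITE_XML = """
<configuration>
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.output.compress</name>
    <value>false</value>
  </property>
  <property>
    <name>mapred.output.compression.type</name>
    <value>BLOCK</value>
  </property>
</configuration>
"""


def load_properties(xml_text: str) -> dict:
    """Return {name: value} for every <property> element under the root."""
    root = ET.fromstring(xml_text.strip())
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def is_true(props: dict, key: str) -> bool:
    """Missing keys count as false (assumed default)."""
    return props.get(key, "false").lower() == "true"


props = load_properties(SITE_XML)
print("map (intermediate) output compressed:", is_true(props, "mapred.compress.map.output"))
print("final job output compressed:         ", is_true(props, "mapred.output.compress"))
```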
  • Billy Pearson at Mar 20, 2009 at 6:54 am
    I opened an issue:
    https://issues.apache.org/jira/browse/HADOOP-5539

    Billy


Discussion Overview
group: common-user @
categories: hadoop
posted: Mar 17, '09 at 5:13a
active: Mar 20, '09 at 6:55a
posts: 12
users: 4
website: hadoop.apache.org...
irc: #hadoop
