Unless the machine/JVM has a huge allowed memory space, the reduce data
will spill to disk during the sort, which is a problem if it is expected to
be larger than the free space on the partition.
If I did my math correctly, you are trying to push ~2TB through the single
reduce.
As for the part-XXXX files, if you have the number of reduces set to zero,
you will get N part files, where N is the number of map tasks.
If you absolutely must have it all go to one reduce, you will need to
increase the free disk space. I think 0.19.1 preserves compression for the map
output, so you could try enabling compression for map output.
If you have many nodes, you can set the number of reduces to some number and
then use sort -m on the part files, to merge sort them, assuming your reduce
Try adding these parameters to your job line:
-D mapred.compress.map.output=true -D mapred.output.compression.type=BLOCK
BTW, /bin/cat works fine as an identity mapper or an identity reducer.
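
The merge step suggested above can be tried locally before running it on real output. This is only a rough sketch: the two part files are simulated in a temp directory, merge_parts is a made-up helper name, and it assumes each part file is already sorted and has been copied out of HDFS.

```shell
#!/bin/sh
# Simulate two reducer outputs, each already sorted (hypothetical contents).
workdir=$(mktemp -d)
printf 'apple\ncherry\n' > "$workdir/part-00000"
printf 'banana\ndate\n'  > "$workdir/part-00001"

# merge_parts DIR: merge the pre-sorted part files in DIR into one sorted stream.
# -m merges without re-sorting, so every input must already be sorted.
# LC_ALL=C pins byte-wise ordering so the result does not depend on the locale.
merge_parts() {
    LC_ALL=C sort -m "$1"/part-*
}

merge_parts "$workdir" > "$workdir/merged.txt"
cat "$workdir/merged.txt"
```

Because -m only interleaves the inputs, it streams through them instead of sorting everything again, which is the point of splitting the work across several reduces first.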
On Wed, Jun 10, 2009 at 5:31 PM, Todd Lipcon wrote:
It turns out that Alex's answer was mistaken - your error is actually
from lack of disk space on the TT that has been assigned the reduce task.
Specifically, there is not enough space in mapred.local.dir. You'll need to
change your mapred.local.dir to point to a partition that has enough space
to contain your reduce output.
As for why this is the case, I hope someone will pipe up. It seems to me
that reduce output can go directly to the target filesystem without using
space on mapred.local.dir.
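
Before repointing mapred.local.dir, it may help to confirm how much room a candidate partition really has. A minimal sketch, assuming a POSIX df and awk: free_bytes and has_room are hypothetical helper names, LOCAL_DIR is a placeholder for your own directory, and the byte count is the one quoted in the JobInProgress warning.

```shell
#!/bin/sh
# free_bytes DIR: bytes available on the partition holding DIR.
# df -P gives POSIX single-line output; -k reports 1024-byte blocks.
free_bytes() {
    df -Pk "$1" | awk 'NR == 2 { printf "%.0f\n", $4 * 1024 }'
}

# has_room DIR BYTES: succeed if DIR's partition has at least BYTES free.
has_room() {
    avail=$(free_bytes "$1")
    awk -v a="$avail" -v n="$2" 'BEGIN { exit !(a + 0 >= n + 0) }'
}

# LOCAL_DIR is a placeholder; substitute your actual mapred.local.dir.
LOCAL_DIR=${LOCAL_DIR:-/tmp}
EXPECTED=22138478392   # expected reduce input, from the JobInProgress warning

if has_room "$LOCAL_DIR" "$EXPECTED"; then
    echo "enough room in $LOCAL_DIR"
else
    echo "not enough room in $LOCAL_DIR"
fi
```

This only checks one directory on one machine; it would have to be run on each TaskTracker that might be assigned the reduce.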
On Wed, Jun 10, 2009 at 4:58 PM, Alex Loddengaard wrote:
What is mapred.child.ulimit set to? This configuration option specifies
how much memory child processes are allowed to have. You may want to up
this limit and see what happens.
Let me know if that doesn't get you anywhere.
On Wed, Jun 10, 2009 at 9:40 AM, Scott wrote:
Complete newbie map/reduce question here. I am using hadoop streaming
come from a Perl background, and am trying to prototype/test a process
load/clean-up ad server log lines from multiple input files into one large
file on the hdfs that can then be used as the source of a hive db
I have a perl map script that reads an input line from stdin, does the
needed cleanup/manipulation, and writes back to stdout. I don't
need a reduce step, as I don't care what order the lines are written
there is no summary data to produce. When I run the job with -reducer NONE
I get valid output, however I get multiple part-xxxxx files rather than one
So I wrote a trivial 'reduce' script that reads from stdin and simply
splits the key/value, and writes the value back to stdout.
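
For illustration only, a reducer of this kind can be approximated in shell. This is not the Perl script itself (which is not shown here); it is a stand-in that assumes streaming's default tab separator between key and value, and strip_key is a made-up name.

```shell
#!/bin/sh
# strip_key: read "key<TAB>value" lines on stdin and print only the value.
# cut -f2- keeps everything after the first tab; lines with no tab pass
# through unchanged.
strip_key() {
    cut -f2-
}

# Sample input: the second value contains a tab of its own, which is kept.
printf 'k1\tfirst log line\nk2\tsecond\tlog line\n' | strip_key
```

Only the first tab is treated as the key/value boundary, so tabs inside the log line itself survive.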
I am executing the code as follows:
./hadoop jar ../contrib/streaming/hadoop-0.19.1-streaming.jar -mapper
"/usr/bin/perl /home/hadoop/scripts/map_parse_log_r2.pl" -reducer
"/usr/bin/perl /home/hadoop/scripts/reduce_parse_log.pl" -input
The code I have works when given a small set of input files. However, I
get the following error when attempting to run the code on a large set
WARN org.apache.hadoop.mapred.JobInProgress: No room for reduce task. Node
bytes free; but we expect reduce input to take 22138478392
I assume this is because all the map output is being buffered in memory
prior to running the reduce step? If so, what can I change to stop the
buffering? I just need the map output to go directly to one large