Complete newbie map/reduce question here. I am using Hadoop Streaming, as
I come from a Perl background, and am trying to prototype/test a process
that loads and cleans up ad-server log lines from multiple input files
into one large file on HDFS, which can then be used as the source of a
Hive DB.

I have a Perl map script that reads an input line from stdin, does the
needed cleanup/manipulation, and writes it back to stdout. I don't
really need a reduce step, as I don't care what order the lines are
written in, and there is no summary data to produce. When I run the job
with -reducer NONE I get valid output, but I end up with multiple
part-xxxxx files rather than one big file.
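In case it helps, the map script is shaped roughly like this (a
simplified sketch; the actual cleanup rules and field handling in my
script are more involved, and the transforms here are just placeholders):

#!/usr/bin/perl
use strict;
use warnings;

# Sketch of the map step: read a raw ad-server log line from stdin,
# clean it up, and write it back to stdout. Streaming treats text
# before the first tab as the key and the rest as the value.
while ( my $line = <STDIN> ) {
    chomp $line;
    next if $line =~ /^\s*$/;           # drop blank lines
    $line =~ s/[\r\0]//g;               # strip stray control characters
    my @fields = split /\s+/, $line;    # placeholder: whitespace-delimited input
    print join( "\t", @fields ), "\n";  # emit tab-separated columns
}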

So I wrote a trivial 'reduce' script that reads from stdin, splits each
line into key and value, and writes the value back to stdout.
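It's essentially just this (a minimal sketch, but the real script is
barely longer):

#!/usr/bin/perl
use strict;
use warnings;

# Trivial 'reduce': streaming hands us "key<TAB>value" lines on
# stdin; split off the key and write just the value to stdout.
while ( my $line = <STDIN> ) {
    chomp $line;
    my ( $key, $value ) = split /\t/, $line, 2;
    print( ( defined $value ? $value : $key ), "\n" );
}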

I am executing the code as follows:

./hadoop jar ../contrib/streaming/hadoop-0.19.1-streaming.jar \
    -mapper "/usr/bin/perl /home/hadoop/scripts/map_parse_log_r2.pl" \
    -reducer "/usr/bin/perl /home/hadoop/scripts/reduce_parse_log.pl" \
    -input /logs/*.log \
    -output test9

The code I have works when given a small set of input files. However, I
get the following error when attempting to run the code on a large set
of input files:

15:43:00,905 WARN org.apache.hadoop.mapred.JobInProgress: No room for
reduce task. Node tracker_testdw0b00:localhost.localdomain/ has
2004049920 bytes free; but we expect reduce input to take 22138478392

I assume this is because all the map output is being buffered prior to
running the reduce step? If so, what can I change to stop the
buffering? I just need the map output to go directly to one large file.

