I've set up a cluster using Cloudera's CDH4 Beta 1 release of MapReduce
2.0 and I'd like to test some Hadoop streaming scripts I have. Before
using the real scripts, I wanted to test the simplest streaming
application: downloading a file from HDFS, compressing it, and then
loading it back into HDFS. In particular I want to test this with large
files, say 900 MB, since previous versions of Hadoop streaming had an
issue where the upload would claim to have finished before the file was
fully uploaded. This pretty much aligns with the streaming question in
http://hadoop.apache.org/common/docs/current/streaming.html#How+do+I+process+files%2C+one+per+map%3F
My test script looks like:
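while read -r fileName; do
  # Report task status to the streaming framework via stderr.
  echo "reporter:status:copying file" >&2
  echo "$fileName"
  # Pull the input file from HDFS into the task's working directory.
  hadoop fs -copyToLocal "/path/to/file/$fileName" .
  ls -l
  tar -czf smaller.tar.gz "$fileName"
  # Push the compressed archive back to HDFS.
  hadoop fs -copyFromLocal smaller.tar.gz /path/to/file/smaller.tar.gz
done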
And I run this with
hadoop jar /usr/lib/hadoop/hadoop-streaming-0.23.0-cdh4b1.jar streamjob
-input input -output output -mapper streamTest.sh -file streamTest.sh
The job looks good as it starts; however, I quickly get the following error:
Container
[pid=4181,containerID=container_1331576268297_0001_01_000003] is running
beyond virtual memory limits. Current usage: 302.6mb of 1.0gb physical
memory used; 4.9gb of 2.1gb virtual memory used. Killing container.
Dump of the process-tree for container_1331576268297_0001_01_000003 :
- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
- 4181 4070 4181 4181 (java) 322 20 581316608 52327
/usr/java/default/bin/java -Djava.net.preferIPv4Stack=true
-Dhadoop.metrics.log.level=WARN -Xmx200m
-Djava.io.tmpdir=/hdfs/dfs/yarn/usercache/stevens35/appcache/application_1331576268297_0001/container_1331576268297_0001_01_000003/tmp
-Dlog4j.configuration=container-log4j.properties
-Dyarn.app.mapreduce.container.log.dir=/hdfs/logs/yarn/application_1331576268297_0001/container_1331576268297_0001_01_000003
-Dyarn.app.mapreduce.container.log.filesize=0
-Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild
10.220.5.11 58072 attempt_1331576268297_0001_m_000001_0 3
- 4252 4246 4181 4181 (java) 334 90 4566265856 24808
/usr/java/default/bin/java -Xmx4000m -Dhadoop.log.dir=/hdfs/logs
-Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/lib/hadoop
-Dhadoop.id.str=hdfs -Dhadoop.root.logger=INFO,console
-Djava.library.path=/usr/lib/hadoop/lib/native
-Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true
-Dlog4j.configuration=container-log4j.properties
-Dyarn.app.mapreduce.container.log.dir=/hdfs/logs/yarn/application_1331576268297_0001/container_1331576268297_0001_01_000003
-Dyarn.app.mapreduce.container.log.filesize=0
-Dhadoop.root.logger=INFO,CLA -Dhadoop.security.logger=INFO,NullAppender
org.apache.hadoop.fs.FsShell -copyToLocal
/data/nyt/nyt03_NMF_500-ds.dat.transpose .
- 4246 4181 4181 4181 (streamTest.sh) 0 0 65507328 325
/bin/bash /hdfs/dfs/yarn/usercache/stevens35/appcache/application_1331576268297_0001/container_1331576268297_0001_01_000003/./streamTest.sh
The container is running out of virtual memory, but I'm not exactly sure
why this would be the case. In version 0.20.2, this job worked just
fine. What has changed that might cause this kind of streaming job to
run out of memory? Is it no longer possible to pull files from HDFS from
within a streaming job?
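
For what it's worth, the 4.9gb figure seems to be the sum of the
VMEM_USAGE(BYTES) column over the three processes in the dump above. The
sketch below just redoes that arithmetic with the values copied from the
dump. The variable names are my own, and I'm assuming that "gb" in the
log means 1024^3 bytes.

```shell
#!/usr/bin/env bash
# VMEM_USAGE(BYTES) values copied from the process-tree dump above.
yarn_child=581316608     # PID 4181: YarnChild JVM (-Xmx200m)
fs_shell=4566265856      # PID 4252: FsShell JVM from "hadoop fs -copyToLocal" (-Xmx4000m)
stream_script=65507328   # PID 4246: streamTest.sh (bash)

# Sum the virtual memory of the whole process tree.
total_bytes=$((yarn_child + fs_shell + stream_script))

# Convert to gb, assuming 1gb = 1024^3 bytes (an assumption about the log's units).
total_gb=$(LC_ALL=C awk -v b="$total_bytes" 'BEGIN { printf "%.1f", b / (1024 * 1024 * 1024) }')

echo "total virtual memory: ${total_bytes} bytes (~${total_gb}gb)"
```

If that reading is right, nearly all of the usage comes from the JVM
that "hadoop fs -copyToLocal" starts with -Xmx4000m, not from the script
or the YarnChild process.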
Also, as a secondary question, is there any documentation on reporting
the status of the job? Previously, I'd update the task's status with
"reporter:status:current status description" written to stderr.
However, looking through the new application manager UI and job history
UI, I don't see this status being reported anywhere. Is there a new
format for this? Is it reported elsewhere?
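
To be concrete about what I mean by the old format, here is a minimal
sketch of how I report status from a streaming script. The helper name
report_status is made up for illustration; the only thing taken from the
streaming convention is the "reporter:status:" prefix on a stderr line.

```shell
#!/usr/bin/env bash
# Hypothetical helper (the name is mine): writes one status line to stderr
# in the "reporter:status:<message>" form described above.
report_status() {
  echo "reporter:status:$1" >&2
}

# Example call, as it would appear inside the mapper loop.
report_status "copying file"
```

Stdout stays reserved for the mapper's real output, so the status line
never mixes with the records the job emits.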
Thanks!
--Keith