Hi,
I am having some issues with Hadoop Streaming when the size of the value is large. Here is a snippet from the mapper program, which is written in C++:
    // Build the value string (roughly 33 MB in my case).
    std::string outTif;
    generateString64(hSrcDS, outTif);

    // Emit one key/value pair on stdout: the key (url) and the value, tab-separated.
    std::cout << url << '\t' << outTif << std::endl;

    return EXIT_SUCCESS;
}
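For the truncation tests described below, the only line I change is the one that writes the value; for example, to keep just the first 20,000,000 bytes of outTif:

    // same key, but only a prefix of the value (truncation test)
    std::cout << url << '\t' << outTif.substr(0,20000000) << std::endl;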
Here the outTif string is about the same size (roughly 33 MB) for every map task. When I replace outTif with outTif.substr(0,20000000), the job completes fine, although it takes a long time; obviously it works fine for smaller values. But if I replace it with outTif.substr(0,30000000), I get the following output:
09/12/03 00:35:22 INFO streaming.StreamJob: Running job: job_200912021427_0027
09/12/03 00:35:22 INFO streaming.StreamJob: To kill this job, run:
09/12/03 00:35:22 INFO streaming.StreamJob: /home/upendra/hadoop-0.20.1/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_200912021427_0027
09/12/03 00:35:22 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_200912021427_0027
09/12/03 00:35:23 INFO streaming.StreamJob: map 0% reduce 0%
09/12/03 00:35:39 INFO streaming.StreamJob: map 50% reduce 0%
09/12/03 00:35:49 INFO streaming.StreamJob: map 100% reduce 0%
09/12/03 00:45:58 INFO streaming.StreamJob: map 50% reduce 0%
09/12/03 00:46:34 INFO streaming.StreamJob: map 100% reduce 0%
09/12/03 00:46:41 INFO streaming.StreamJob: map 50% reduce 0%
09/12/03 00:47:13 INFO streaming.StreamJob: map 100% reduce 0%
09/12/03 00:57:00 INFO streaming.StreamJob: map 50% reduce 0%
09/12/03 00:57:36 INFO streaming.StreamJob: map 0% reduce 0%
09/12/03 00:57:52 INFO streaming.StreamJob: map 50% reduce 0%
09/12/03 00:57:55 INFO streaming.StreamJob: map 100% reduce 0%
09/12/03 01:08:19 INFO streaming.StreamJob: map 50% reduce 0%
09/12/03 01:08:47 INFO streaming.StreamJob: map 0% reduce 0%
09/12/03 01:08:59 INFO streaming.StreamJob: map 50% reduce 0%
09/12/03 01:09:03 INFO streaming.StreamJob: map 100% reduce 0%
09/12/03 01:19:15 INFO streaming.StreamJob: map 50% reduce 0%
09/12/03 01:19:27 INFO streaming.StreamJob: map 0% reduce 0%
09/12/03 01:19:48 INFO streaming.StreamJob: map 100% reduce 100%
09/12/03 01:19:48 INFO streaming.StreamJob: To kill this job, run:
09/12/03 01:19:48 INFO streaming.StreamJob: /home/upendra/hadoop-0.20.1/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_200912021427_0027
09/12/03 01:19:48 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_200912021427_0027
09/12/03 01:19:49 ERROR streaming.StreamJob: Job not Successful!
09/12/03 01:19:49 INFO streaming.StreamJob: killJob...
Streaming Job Failed!
Here is a snippet from the task syslog:
2009-12-03 00:57:48,314 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2009-12-03 00:57:48,579 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 0
2009-12-03 00:57:50,876 INFO org.apache.hadoop.streaming.PipeMapRed: PipeMapRed exec [/tmp/hadoop-upendra/mapred/local/taskTracker/jobcache/job_200912021427_0027/attempt_200912021427_0027_m_000000_2/work/./gdalloadmap]
2009-12-03 00:57:51,198 INFO org.apache.hadoop.streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
2009-12-03 01:08:18,789 WARN org.apache.hadoop.mapred.TaskRunner: Parent died. Exiting attempt_200912021427_0027_m_000000_2
2009-12-03 01:08:42,071 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=CLEANUP, sessionId=
2009-12-03 01:08:42,931 INFO org.apache.hadoop.mapred.TaskRunner: Runnning cleanup for the task
2009-12-03 01:08:44,144 INFO org.apache.hadoop.mapred.TaskRunner: Task:attempt_200912021427_0027_m_000000_2 is done. And is in the process of commiting
2009-12-03 01:08:44,560 INFO org.apache.hadoop.mapred.TaskRunner: Task 'attempt_200912021427_0027_m_000000_2' done.
LOG_DIR:attempt_200912021427_0027_m_000000_3
I don't know what is happening when the size of the value is increased. There is no reducer (-D mapred.reduce.tasks=0) for this job. I am guessing there is some size limit or time limit that is causing the problem. I am running the job on a single node in pseudo-distributed mode. Any guess as to what might be going wrong? Also, what parameters should I modify to improve performance when the values are large? I have very few input key-value pairs, but each value is large. Any help is much appreciated. Thank you.
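In case it helps someone point me at the right setting: this is roughly how I would pass an extra job property on the streaming command line. Using mapred.task.timeout here is only a guess on my part (the default task timeout is 10 minutes, which roughly matches the gaps I see in the log above); I have not confirmed that this is the right property, and the value below (30 minutes, in milliseconds) is just an example:

    bin/hadoop jar contrib/streaming/hadoop-0.20.1-streaming.jar \
        -D mapred.task.timeout=1800000 \
        -D mapred.reduce.tasks=0 \
        ... (my usual -input/-output/-mapper options)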
Upendra