I've been running a program to count search terms in log files, which
is basically a small modification of the wordcount program. This
doesn't really have a reduce phase, so the only work the reduce tasks
perform is sorting the output files of the map tasks.
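For context, the map side is essentially grep-plus-count. Here is a
minimal plain-Java sketch of the per-line counting logic (the class
and method names are my own illustration, not the actual job code;
the real mapper emits <term, 1> pairs through Hadoop's
OutputCollector rather than building a local map):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of what each map task does: for every log line, bump a
// counter for each search term that appears in it. Names here are
// illustrative, not the actual job code.
public class TermCount {
    public static Map<String, Integer> countTerms(List<String> lines,
                                                  List<String> terms) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (String term : terms) {
                if (line.contains(term)) {
                    counts.merge(term, 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```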
My cluster has 4 machines, so based on the recommendations on the
wiki, I set my reduce count to 8. Unfortunately, the performance
was less than ideal: after the map tasks had finished, I had to wait
an additional 40% of the total job time just for the copying/sorting
of the map output files. I know for a fact that the sort itself is
very fast, so the only remaining question is why moving the files
around takes so long.
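For reference, the reduce count above is just the standard job
setting; I set it roughly like this (shown as a config fragment for
illustration only; calling JobConf.setNumReduceTasks(8) in the job
driver does the same thing):

```xml
<!-- illustrative fragment: 2 reduces per machine x 4 machines -->
<property>
  <name>mapred.reduce.tasks</name>
  <value>8</value>
</property>
```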
Looking at the jobtracker webapp, I noticed that the reduce->copying
phase listed under the job showed a transfer speed of 0.01 MB/s, which
is fairly slow. The machines are connected through a gigabit switch,
and uploading 5GB of files to HDFS (hadoop dfs -copyFromLocal)
only takes about a minute.
Finally, when I set the reduce count to the number of machines,
performance is good: a reduce task starts on each machine right away,
the slow transfers proceed throughout the map phase, and the data is
ready almost immediately once the maps finish.
If anyone has suggestions on how I might be able to increase
performance, or what might be going on in this scenario, I would
appreciate the tips. I'd be happy to provide more details about the
setup if needed; for the moment it's more of a testing ground to
see what my options are.