On May 9, 2007, at 12:18 PM, Steve Schlosser wrote:

> To start with, I modified the benchmark to just write out 1GB of data
> per node, rather than the default 10GB, since I don't have a whole lot
> of disk capacity at the moment.
You don't actually need to change code to do that. You could also
create a config file with test.randomwriter.maps_per_host or
test.randomwrite.bytes_per_map set to lower values than the defaults.
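If it helps, such an override file might look like the sketch below. The property names are the ones quoted above; the values are made-up examples, not recommendations.

```xml
<?xml version="1.0"?>
<!-- Hypothetical override file: shrink the randomwriter workload
     instead of editing the benchmark code. Values are examples only. -->
<configuration>
  <property>
    <name>test.randomwriter.maps_per_host</name>
    <value>2</value>
  </property>
  <property>
    <name>test.randomwrite.bytes_per_map</name>
    <!-- 128 MB per map instead of the default -->
    <value>134217728</value>
  </property>
</configuration>
```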
> 12 nodes gets me a 21x speedup over 1 node, and 13 nodes gets me a 33x
> speedup over 1 node. This seems too good to be true - what could
> Hadoop be doing? For the 13-node runs, there are only ever 13 reduce
> tasks, and mapred.tasktracker.tasks.maximum is set to 1. Can anyone
> shed some light?
I'm not sure what is going on, but it is probably related to whether
the data is kept in memory or spills to disk. The framework tries to
keep as much data in RAM as possible to avoid trips to disk.
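As a rough illustration of that buffer-and-spill behavior (a toy Python sketch, not Hadoop's actual implementation — the class name and record threshold here are invented for the example):

```python
import pickle
import tempfile

class SpillingBuffer:
    """Toy illustration: keep records in RAM until a threshold is hit,
    then spill a sorted run to a temporary file on disk."""

    def __init__(self, max_in_memory=4):
        self.max_in_memory = max_in_memory  # hypothetical record threshold
        self.records = []   # in-memory buffer
        self.spills = []    # paths of sorted runs already spilled to disk

    def add(self, key, value):
        self.records.append((key, value))
        if len(self.records) >= self.max_in_memory:
            self._spill()

    def _spill(self):
        # Sort the buffered records and write them out as one run.
        with tempfile.NamedTemporaryFile(delete=False) as f:
            pickle.dump(sorted(self.records), f)
            self.spills.append(f.name)
        self.records = []

buf = SpillingBuffer(max_in_memory=4)
for i in range(10):
    buf.add(i % 3, i)
# 10 records with a threshold of 4 -> two sorted runs spilled to disk,
# and the last two records still held in RAM.
```

The point of the sketch is the trade-off Owen describes: as long as everything fits under the threshold, nothing ever touches disk.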
> Just to make sure I understand, Sort itself does nothing but force
> Hadoop to partition the input data (1GB per node in my case) and sort
> it. Should I think of the sort as being part of the Map phase or the
> Reduce phase? That is, is there one sort per node? One sort per Map
> task? One sort per Reduce task?
*Laugh* The answer is more complicated than that. In Hadoop 0.1, the
sort was done entirely on the reduce side. Currently the map outputs are
sorted on the map side, and the reduce merges the various map outputs.
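That division of labor can be sketched in a few lines of Python (illustrative only, not Hadoop's actual code): each map task sorts its own output, and the reduce side merely merges the already-sorted runs.

```python
import heapq

def map_side_sort(records):
    # Each map task sorts its own (key, value) output before
    # the reduce side fetches it.
    return sorted(records)

def reduce_side_merge(sorted_runs):
    # The reduce does not re-sort; it merges the pre-sorted map outputs.
    return list(heapq.merge(*sorted_runs))

# Hypothetical outputs of three map tasks.
runs = [map_side_sort([("b", 2), ("a", 1)]),
        map_side_sort([("c", 3), ("a", 4)]),
        map_side_sort([("b", 5)])]
merged = reduce_side_merge(runs)
# merged is globally ordered by key without a full sort on the reduce side
```

So in this picture there is one sort per map task, and one merge per reduce task.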

-- Owen

Discussion Overview
group: common-user
posted: May 9, '07 at 7:18p
active: May 9, '07 at 8:40p
2 users in discussion: Steve Schlosser (1 post), Owen O'Malley (1 post)


