Hi Everyone:
I ran two experiments sorting 1 Gb and 10 Gb of data with Hadoop, on
(1) a single machine and (2) a 5-node cluster on a LAN.
The command is:
bin/hadoop jar hadoop-*-examples.jar sort [-m <#maps>] [-r <#reduces>] <in-dir> <out-dir>
The results are shown here:
[image: image.png]
The map phase scales well. However, the reduce phase takes much longer
than expected on the cluster.
As far as I know, the Hadoop sort example uses an identity function for
reduce, which simply writes the map output to a file. I measured the LAN
bandwidth at ~100 Mbps, and the average LAN traffic during reduce is only
about 10 Mbps (for both sending and receiving).
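For reference, here is a rough back-of-the-envelope estimate of the network
time needed to move map output between nodes, as a minimal Python sketch. It
assumes "Gb" means gigabytes (10^9 bytes), that map output is the same size as
the input, that keys are spread evenly so each node sends (n-1)/n of its map
output to other nodes, and that all nodes transfer in parallel. The function
name shuffle_seconds is hypothetical, and the estimate ignores disk I/O, merge
passes, and HDFS output replication.

```python
def shuffle_seconds(data_bytes, nodes, link_mbps):
    """Estimate the network time to exchange map output between nodes.

    Assumptions (not measured): keys are evenly distributed, every node
    holds an equal share of the map output, and all nodes send in
    parallel at the given per-node link rate.
    """
    if nodes < 2:
        return 0.0  # a single machine moves nothing over the network
    per_node_bytes = data_bytes / nodes
    # Each node keeps 1/nodes of its output locally and sends the rest.
    sent_per_node = per_node_bytes * (nodes - 1) / nodes
    link_bytes_per_sec = link_mbps * 1_000_000 / 8
    return sent_per_node / link_bytes_per_sec


GB = 10**9  # assumption: "Gb" in the experiment means gigabytes

if __name__ == "__main__":
    for size_gb in (1, 10):
        for mbps in (100, 10):
            secs = shuffle_seconds(size_gb * GB, nodes=5, link_mbps=mbps)
            print(f"{size_gb} GB, 5 nodes, {mbps} Mbps: ~{secs:.0f} s")
```

Under these assumptions the 10 Gb case would need roughly 128 s of network
transfer at the full 100 Mbps, or about 1280 s at the ~10 Mbps observed
during reduce.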
So the reduce time looks a bit odd to me.
I am quite new to Hadoop, so please forgive me if this is a stupid question.
Thanks.
Best Regards
Yours Sincerely
Jingwei Lu