Your font block size got increased dynamically, check in core-site :) :)
From: He Chen <[email protected]>
To: [email protected]
Sent: Mon, 30 May, 2011 11:39:35 AM
Subject: Re: Poor IO performance on a 10 node cluster.
I would suggest dividing the MapReduce program's execution time into 3 parts:
a) Map Stage
In this stage, wc splits the input data and generates map tasks. Each map task
processes one block (by default; you can change this in FileInputFormat.java).
As Brian said, if you have a larger block size, you will have fewer
map tasks, and then probably less overhead.
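To put rough numbers on that, here is a minimal sketch of the block-count arithmetic. It assumes one map task per block, counted per file, and binary units (1 GB = 1024 MB); the function name map_task_count is made up for illustration and is not a Hadoop API.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def map_task_count(file_sizes, block_size):
    """Estimate map tasks as one task per block, counted per file."""
    # -(-a // b) is integer ceiling division: a partial last block
    # still needs its own map task.
    return sum(-(-size // block_size) for size in file_sizes)


# Ten 10 GB files, as described in the original question.
files = [10 * GIB] * 10

for block_mib in (64, 128, 256):
    tasks = map_task_count(files, block_mib * MIB)
    print(f"{block_mib} MB blocks -> {tasks} map tasks")
```

Doubling the block size halves the number of map tasks, so the per-task startup overhead is paid half as often.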
b) Reduce Stage
2) shuffle phase
In this phase, each reduce task collects intermediate results from every node
that has executed map tasks. Each reduce task can use many concurrent threads
to fetch data (you can configure this in mapred-site.xml; the property is
"mapreduce.reduce.shuffle.parallelcopies"). But be careful about how popular
each key is in your data. For example, say you have "Hadoop, Hadoop, Hadoop, hello". The
default Hadoop partitioner will assign all 3 <Hadoop, 1> key-value pairs to one
node. Thus, if you have two nodes running reduce tasks, one of them will copy 3
times as much data as the other, which makes that node slower than the
other. You may want to rewrite the partitioner.
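A small sketch of that skew, for illustration only. It emulates Java's String.hashCode() in Python and applies the same formula the default HashPartitioner uses, (hashCode & Integer.MAX_VALUE) % numReduceTasks. The helper names are hypothetical, and the emulation assumes plain ASCII keys.

```python
from collections import Counter


def java_string_hash(s):
    """Emulate Java's String.hashCode() with 32-bit signed arithmetic."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # Convert the unsigned 32-bit value to a signed Java int.
    return h - 0x100000000 if h >= 0x80000000 else h


def hash_partition(key, num_reducers):
    """Pick a reducer the way a hash partitioner does: mask the sign bit, then mod."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers


def records_per_reducer(keys, num_reducers):
    """Count how many map output records each reducer would have to copy."""
    return Counter(hash_partition(k, num_reducers) for k in keys)


words = ["Hadoop", "Hadoop", "Hadoop", "hello"]
load = records_per_reducer(words, 2)
for reducer in sorted(load):
    print(f"reducer {reducer}: {load[reducer]} records")
```

Every record with the same key lands on the same reducer, so at least three of the four records end up in one place no matter how many reducers you add. A custom partitioner only helps when the skew comes from many keys colliding, not from one very popular key.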
3) sort and reduce phase
I think the Hadoop UI will give you some hints about how long this phase
takes.
By dividing the MapReduce application into these 3 parts, you can easily find
which one is your bottleneck and do some profiling. And I don't know why my
font changed to this type. :(
Hope it will be helpful.
On Mon, May 30, 2011 at 12:32 PM, Harsh J wrote:
Psst. The cats speak in their own language ;-)
On Mon, May 30, 2011 at 10:31 PM, James Seigel wrote:
Not sure that will help ;)
Sent from my mobile. Please excuse the typos.
On 2011-05-30, at 9:23 AM, Boris Aleksandrovsky wrote:
On May 30, 2011 5:28 AM, "Gyuribácsi" wrote:
I have a 10 node cluster (IBM blade servers, 48GB RAM, 2x500GB Disk, 16
I've uploaded 10 files to HDFS. Each file is 10GB. I used the streaming jar
with 'wc -l' as mapper and 'cat' as reducer.
I use 64MB block size and the default replication (3).
The wc on the 100 GB took about 220 seconds which translates to about
Gbit/sec processing speed. One disk can do sequential read with
I would expect something around 20 Gbit/sec (minus some overhead), and
getting only 3.5.
Is my expectation valid?
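For reference, the arithmetic behind those figures can be checked with a short sketch. It assumes decimal gigabytes (1 GB = 10^9 bytes) and uses only numbers quoted in this message: 10 files of 10 GB, 220 seconds, 20 disks and an expected 20 Gbit/sec. The helper name gbit_per_sec is made up for illustration.

```python
GB = 10**9  # decimal gigabytes; an assumption, the message does not say which


def gbit_per_sec(total_bytes, seconds):
    """Convert a byte count and a wall-clock time into Gbit/s."""
    return total_bytes * 8 / seconds / 1e9


total_bytes = 10 * 10 * GB   # ten files of 10 GB each
elapsed = 220                # seconds for the wc job
expected = 20.0              # Gbit/s the poster hoped for
disks = 20                   # disks in the cluster

measured = gbit_per_sec(total_bytes, elapsed)
per_disk_mb = measured / disks * 1000 / 8  # Gbit/s -> MB/s per disk

print(f"measured: {measured:.2f} Gbit/s")
print(f"fraction of expected: {measured / expected:.0%}")
print(f"achieved per disk: {per_disk_mb:.1f} MB/s")
```

This gives roughly 3.6 Gbit/sec overall, in line with the 3.5 quoted, which is a bit under a fifth of the expected rate and only about 23 MB/s per disk.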
I checked the jobtracker and it seems all nodes are working, each
the right blocks. I have not played with the number of mapper and
yet. It seems the number of mappers is the same as the number of blocks and
the number of reducers is 20 (there are 20 disks). This looks ok for
We also did an experiment with TestDFSIO with similar results.
read io speed is around 3.5Gbit/sec. It is just too far from my
Sent from the Hadoop core-user mailing list archive at Nabble.com.