Your font block size got increased dynamically, check in core-site :) :)
- Jagaran
________________________________
From: He Chen <[email protected]>
To: [email protected]
Sent: Mon, 30 May, 2011 11:39:35 AM
Subject: Re: Poor IO performance on a 10 node cluster.
Hi Gyuribácsi,
I would suggest you divide the MapReduce program execution time into three parts:

1) Map stage

In this stage, the wc job splits the input data and generates map tasks. Each map task processes one block (by default; you can change this in FileInputFormat.java). As Brian said, with a larger block size you may have fewer map tasks, and therefore probably less overhead.
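If the goal is simply fewer, larger map tasks without re-uploading the data at a bigger block size, one knob is the minimum split size (property names vary across Hadoop versions; on 0.20-era releases it is mapred.min.split.size, which a streaming job can set through a -D generic option). In a Java driver the same idea looks roughly like the sketch below; WordCountDriver and the argument paths are placeholders, not your actual job:

    // Rough sketch only, assuming the org.apache.hadoop.mapreduce API.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wc");   // Job.getInstance(conf, "wc") on newer releases
        job.setJarByClass(WordCountDriver.class);

        // Ask for splits of at least 256MB so one map task covers several
        // 64MB blocks: fewer map tasks, less per-task startup overhead.
        FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // ... set mapper, reducer and output types here ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }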
2) Shuffle phase (the first part of the reduce stage)

In this phase, each reduce task collects intermediate results from every node that has executed map tasks. Each reduce task can use several concurrent copy threads to fetch data (you can configure this in mapred-site.xml; the property is "mapreduce.reduce.shuffle.parallelcopies"). But be careful about your key distribution. For example, if your data is "Hadoop, Hadoop, Hadoop, hello", the default Hadoop partitioner will assign all 3 <Hadoop, 1> key-value pairs to one node. Thus, if you have two nodes running reduce tasks, one of them will copy 3 times more data than the other, which will make that node slower. You may rewrite the partitioner.
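If such skew really shows up, a custom partitioner is one option. Here is a minimal sketch (assuming the org.apache.hadoop.mapreduce API; SkewAwarePartitioner and HOT_KEY are made-up names for illustration) that scatters one known hot key across all reducers while keeping the default hash behaviour for everything else:

    import java.util.Random;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class SkewAwarePartitioner extends Partitioner<Text, IntWritable> {
      private static final String HOT_KEY = "Hadoop";   // placeholder for your dominant key
      private final Random random = new Random();

      @Override
      public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Spread the hot key over all reducers so no single node has to
        // pull most of the intermediate data during the shuffle.
        if (key.toString().equals(HOT_KEY)) {
          return random.nextInt(numPartitions);
        }
        // Everything else: same behaviour as the default HashPartitioner.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
      }
    }

You would wire it in with job.setPartitionerClass(SkewAwarePartitioner.class). Note the trade-off: scattering a key means each reducer only sees part of that key's values, so the partial results need a second pass to merge; for a plain word count, the default partitioner plus a combiner is usually enough.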
3) Sort and reduce phase
I think the Hadoop UI will give you some hints about how long this phase
takes.
By dividing the MapReduce application into these three parts, you can easily find which one is your bottleneck and do some profiling. And I don't know why my font changed to this type. :(
Hope it will be helpful.
Chen
On Mon, May 30, 2011 at 12:32 PM, Harsh J wrote:
Psst. The cats speak in their own language ;-)
On Mon, May 30, 2011 at 10:31 PM, James Seigel wrote:
Not sure that will help ;)
Sent from my mobile. Please excuse the typos.
On 2011-05-30, at 9:23 AM, Boris Aleksandrovsky wrote:
Ljddfjfjfififfifjftjiiiiiifjfjjjffkxbznzsjxodiewisshsudddudsjidhddueiweefiuftttoitfiirriifoiffkllddiririiriioerorooiieirrioeekroooeoooirjjfdijdkkduddjudiiehss
On May 30, 2011 5:28 AM, "Gyuribácsi" wrote:
Hi,
I have a 10 node cluster (IBM blade servers, 48GB RAM, 2x500GB Disk, 16 HT cores).
I've uploaded 10 files to HDFS. Each file is 10GB. I used the streaming jar
with 'wc -l' as mapper and 'cat' as reducer.
I use a 64MB block size and the default replication (3).
The wc on the 100 GB took about 220 seconds, which translates to about 3.5 Gbit/sec processing speed. One disk can do sequential reads at 1 Gbit/sec, so I would expect something around 20 Gbit/sec (minus some overhead), and I'm getting only 3.5.
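Written out (all figures taken from the numbers above), the measured versus hoped-for throughput is roughly:

    \[ \frac{100\,\text{GB} \times 8\,\text{bit/B}}{220\,\text{s}} \approx 3.6\ \text{Gbit/sec}
       \qquad \text{vs.} \qquad
       10\ \text{nodes} \times 2\ \text{disks} \times 1\ \text{Gbit/sec} = 20\ \text{Gbit/sec} \]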
Is my expectation valid?
I checked the jobtracker and it seems all nodes are working, each reading the right blocks. I have not played with the number of mappers and reducers yet. It seems the number of mappers is the same as the number of blocks and the number of reducers is 20 (there are 20 disks). This looks OK to me.
We also did an experiment with TestDFSIO with similar results. Aggregated read IO speed is around 3.5 Gbit/sec. It is just too far from my expectation :(
Please help!
Thank you,
Gyorgy
--
View this message in context:
http://old.nabble.com/Poor-IO-performance-on-a-10-node-cluster.-tp31732971p31732971.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
--
Harsh J