We are testing our cluster's MapReduce performance by specifying one
directory versus multiple directories in the property “mapred.local.dir”.
The documentation for this property says “The local directory where
MapReduce stores intermediate data files. May be a comma-separated list
of directories on different devices in order to spread disk i/o”. So I
was expecting a performance boost when specifying two local dirs instead
of one for “mapred.local.dir”. We ran a sort test and saw the opposite.

- One dir config defaults to ${hadoop.tmp.dir}/mapred/local.
“hadoop.tmp.dir” is set to "/var/lib/hadoop-0.20/cache/${user.name}"

- Two dir config explicitly sets “mapred.local.dir” in mapred-site.xml:
<property>
  <name>mapred.local.dir</name>
  <value>/data1/hadoop/mapred/local,/data2/hadoop/mapred/local</value>
</property>

/data1 and /data2 are separate drives on each Hadoop cluster node.
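
As a quick sanity check on the two-dir config, the comma-separated value can be read back and split into individual directories. This is a minimal sketch using only the Python standard library; the helper name local_dirs and the <configuration> wrapper around the fragment are assumptions for illustration, not something taken from the cluster's actual file.

```python
import xml.etree.ElementTree as ET

# The fragment from the post, wrapped in a <configuration> root element
# (assumed here so the snippet parses as a complete XML document).
SITE_XML = """
<configuration>
  <property>
    <name>mapred.local.dir</name>
    <value>/data1/hadoop/mapred/local,/data2/hadoop/mapred/local</value>
  </property>
</configuration>
"""


def local_dirs(xml_text, prop="mapred.local.dir"):
    """Return the directories listed in `prop`, or [] if it is not set."""
    root = ET.fromstring(xml_text)
    for p in root.iter("property"):
        if p.findtext("name", "").strip() == prop:
            value = p.findtext("value", "")
            # Split the comma-separated list and drop empty entries.
            return [d.strip() for d in value.split(",") if d.strip()]
    return []


print(local_dirs(SITE_XML))
```

An empty result would mean the property is not set in that file, which corresponds to the one-dir case falling back to the default.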

$ df -h
Filesystem  Size  Used  Avail  Use%  Mounted on
/dev/sda3   330G  128G  186G   41%   /
/dev/sda1   190M  18M   163M   10%   /boot
tmpfs       7.8G  0     7.8G   0%    /dev/shm
/dev/sdb1   2.2T  442G  1.8T   20%   /data1
/dev/sdc1   2.2T  484G  1.8T   22%   /data2
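
The documentation quoted above asks for directories "on different devices". One way to check that from a script, sketched here under assumptions, is to compare the st_dev field that os.stat reports for each directory. The helper names group_by_device and shared_devices are hypothetical. Note that st_dev distinguishes mounted filesystems, so two partitions on the same physical disk would still look like different devices.

```python
import os
from collections import defaultdict


def group_by_device(dirs, stat=os.stat):
    """Map each device id (st_dev) to the directories that live on it."""
    groups = defaultdict(list)
    for d in dirs:
        groups[stat(d).st_dev].append(d)
    return dict(groups)


def shared_devices(dirs, stat=os.stat):
    """Return the groups of directories that share one device id."""
    return [paths for paths in group_by_device(dirs, stat).values()
            if len(paths) > 1]


# Paths from the post; only directories that exist on this machine are checked.
candidates = ["/data1/hadoop/mapred/local", "/data2/hadoop/mapred/local"]
existing = [d for d in candidates if os.path.isdir(d)]
print("checked:", existing)
print("sharing a device:", shared_devices(existing))
```

An empty "sharing a device" list means every checked directory is on its own filesystem.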

---------- Sort test with one vs two dir --------------------

1. For 40 GB of data:

a. Results when one dir was specified in “mapred.local.dir”:
Time taken for random-data generation: 39 min 3 sec
Time taken for random-data sort: 1 hr 8 min 5 sec
Time taken for sorted-data validation: 3 min 22 sec

b. Results when two dirs were specified in “mapred.local.dir”:
Time taken for random-data generation: 36 min 51 sec
Time taken for random-data sort: 1 hr 38 min 33 sec
Time taken for sorted-data validation: 3 min 31 sec

2. For 100 GB of data:

a. Results when one dir was specified in “mapred.local.dir”:
Time taken for random-data generation: 1 hr 33 min 28 sec
Time taken for random-data sort: 3 hr 50 min 39 sec
Time taken for sorted-data validation: 7 min 33 sec

b. Results when two dirs were specified in “mapred.local.dir”:
Time taken for random-data generation: 1 hr 27 min 17 sec
Time taken for random-data sort: 6 hr 35 min 20 sec
Time taken for sorted-data validation: 8 min 52 sec

Random-data generation was slightly faster with two dirs (roughly 6% in
both runs), but the sort job, which is also I/O intensive, took much
longer: about 45% longer for 40 GB and about 71% longer for 100 GB.
Any reason why this is happening?
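
To compare the runs on a common scale, the timings reported above can be converted to seconds and expressed as a percentage change from the one-dir run to the two-dir run. This is just the arithmetic written out; the helper names seconds and change_percent are made up for this sketch.

```python
def seconds(h=0, m=0, s=0):
    """Convert hours, minutes, seconds to total seconds."""
    return h * 3600 + m * 60 + s


# (one dir, two dirs) timings copied from the results in the post.
runs = {
    "40 GB generate": (seconds(m=39, s=3), seconds(m=36, s=51)),
    "40 GB sort": (seconds(1, 8, 5), seconds(1, 38, 33)),
    "40 GB validate": (seconds(m=3, s=22), seconds(m=3, s=31)),
    "100 GB generate": (seconds(1, 33, 28), seconds(1, 27, 17)),
    "100 GB sort": (seconds(3, 50, 39), seconds(6, 35, 20)),
    "100 GB validate": (seconds(m=7, s=33), seconds(m=8, s=52)),
}


def change_percent(one_dir, two_dirs):
    """Percent change going from one dir to two dirs (positive = slower)."""
    return (two_dirs - one_dir) / one_dir * 100


for name, (one, two) in runs.items():
    print(f"{name:16s} {change_percent(one, two):+6.1f}%")
```

A positive value means the two-dir run took longer than the one-dir run for that phase.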

Group: common-user @
Category: hadoop
Posted: Aug 30, '10 at 5:44p

Sudhir Vallamkondu: 1 post
