Hi,

I am running a small cluster of 4 nodes, each node with a quad-core CPU and 8
GB of RAM. I have used the following values for parameters in
hadoop-site.xml. I want to know whether I can increase performance further by
changing one or more of these:

dfs.replication: I have set it to 2. Will I get a performance boost if I set
it to 4 (= number of nodes)? If so, how much replication do people use when
they run a cluster of, say, 1000 nodes? Do they replicate petabytes of
data?

mapred.child.java.opts: -Xms4096m -Xmx7500m. I tried different min and max
memory settings and found no improvement in performance. I was thinking that
giving more memory to the process would help it do sorting/shuffling etc.
more quickly, but it seems my thinking is not correct. Can anyone comment on
this parameter and what the optimal value should be?

fs.inmemory.size.mb: I have set it to 225. Increasing it further does not
help. Also, can someone explain in detail how this parameter affects
performance?

io.sort.mb: I have set it to 200. Increasing it further does not help, at
least in my jobs. Does anyone have more details about this parameter?

mapred.map.tasks: After reading the description, I set its value to 41 (the
nearest prime to 10 * number of nodes).
mapred.reduce.tasks: I set its value to 5 (the nearest prime to the number of
nodes).
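
In hadoop-site.xml, these values are set roughly like this (paraphrasing my
actual file):

  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xms4096m -Xmx7500m</value>
  </property>
  <property>
    <name>fs.inmemory.size.mb</name>
    <value>225</value>
  </property>
  <property>
    <name>io.sort.mb</name>
    <value>200</value>
  </property>
  <property>
    <name>mapred.map.tasks</name>
    <value>41</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>5</value>
  </property>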

However, I noticed there was not much performance gain; with the default
values I get similar performance. That said, I ran the test on a small amount
of data and have not tested with a huge data set. I would still like to know
how these parameters are going to affect performance.

I used the default values for:

mapred.tasktracker.map.tasks.maximum
mapred.tasktracker.reduce.tasks.maximum

Thanks a lot,
Taran

  • Owen O'Malley at Sep 23, 2008 at 6:48 pm

    On Tue, Sep 23, 2008 at 9:52 AM, Tarandeep Singh wrote:

    I am running a small cluster of 4 nodes, each node with a quad-core CPU and 8
    GB of RAM.

    dfs.replication: I have set it to 2.

    Probably reasonable given the small cluster.

    Will I get a performance boost if I set it to 4 (= number of nodes)?


    It will likely hurt performance. Reads will be trivially faster, but
    writes will be substantially slower.

    If so, how much replication do people use when they run a cluster of,
    say, 1000 nodes? Do they replicate petabytes of data?

    We usually use 3, because it gives us better expected reliability. The
    relevant measure is how likely it is that the second machine will crash
    before the data is replicated off.

    mapred.child.java.opts: -Xms4096m -Xmx7500m. I tried different min and max
    memory settings and found no improvement in performance. I was thinking that
    giving more memory to the process would help it do sorting/shuffling etc.
    more quickly, but it seems my thinking is not correct. Can anyone comment on
    this parameter and what the optimal value should be?

    fs.inmemory.size.mb: I have set it to 225. Increasing it further does not
    help. Also, can someone explain in detail how this parameter affects
    performance?

    Increasing it only helps up to the point where your reduce has no more
    than io.sort.factor spills. (In 0.19 there are more options, including not
    spilling at all. I assume you are talking about 0.18.) For your job, the
    amount of input data to the reduce should fit in fs.inmemory.size.mb *
    io.sort.factor. If not, the framework will run a multi-level merge, which
    is slower. In 0.19, if you give the framework enough memory, it never
    writes the reduce inputs to disk.
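
    To make that concrete: assuming io.sort.factor is still at its 0.18
    default of 10, fs.inmemory.size.mb = 225 gives a single-level merge over
    roughly 225 MB * 10 = ~2.2 GB of reduce input; raising io.sort.factor to
    100, as suggested below, stretches that to roughly 22 GB per reduce.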

    io.sort.mb: I have set it to 200. Increasing it further does not help, at
    least in my jobs. Does anyone have more details about this parameter?

    This controls the buffering of the map outputs. As long as each map's
    output fits into io.sort.mb, you are doing fine.
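
    Note that this buffer lives inside the child task's JVM heap, so
    io.sort.mb has to fit within whatever mapred.child.java.opts allows; with
    a 200 MB buffer and -Xmx7500m there is plenty of headroom. When the
    buffer fills, the map output is spilled to disk and merged afterwards.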

    mapred.map.tasks: After reading the description, I set its value to 41 (the
    nearest prime to 10 * number of nodes).
    mapred.reduce.tasks: I set its value to 5 (the nearest prime to the number
    of nodes).

    *Ugh* I seriously need to update those comments. Prime numbers don't
    matter at all. Don't set the number of map tasks; the framework does
    pretty well if left to its own devices. Reduce tasks should be set to the
    number of reduce slots on the cluster. (On a big cluster, I'd suggest
    using 98% or so, but on a 4-node cluster it isn't required.)
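
    For example, assuming mapred.tasktracker.reduce.tasks.maximum is at its
    default of 2, a 4-node cluster has 4 * 2 = 8 reduce slots, so
    mapred.reduce.tasks = 8 would be a reasonable starting point.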

    However, I noticed there was not much performance gain; with the default
    values I get similar performance. That said, I ran the test on a small
    amount of data and have not tested with a huge data set. I would still
    like to know how these parameters are going to affect performance.

    In my experience, having large maps and reduces generally helps throughput.


    I used the default values for:
    mapred.tasktracker.map.tasks.maximum
    mapred.tasktracker.reduce.tasks.maximum

    I'd suggest upping the map slots/node to 6 or so.

    I'd also suggest setting io.sort.factor to 100 to allow more files to be
    merged at once.

    Setting the default block size to 256 MB will help on large data, although
    it requires more io.sort.mb.

    You probably want io.file.buffer.size to be set to at least 128 KB.

    In terms of HDFS startup on small clusters, I'd suggest:
    dfs.safemode.threshold.pct = 1.0
    dfs.safemode.extension=0
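
    Put together, those suggestions would look roughly like this in
    hadoop-site.xml (dfs.block.size is the block-size property in this
    version):

      <property>
        <name>mapred.tasktracker.map.tasks.maximum</name>
        <value>6</value>
      </property>
      <property>
        <name>io.sort.factor</name>
        <value>100</value>
      </property>
      <property>
        <name>dfs.block.size</name>
        <value>268435456</value> <!-- 256 MB -->
      </property>
      <property>
        <name>io.file.buffer.size</name>
        <value>131072</value> <!-- 128 KB -->
      </property>
      <property>
        <name>dfs.safemode.threshold.pct</name>
        <value>1.0</value>
      </property>
      <property>
        <name>dfs.safemode.extension</name>
        <value>0</value>
      </property>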

    -- Owen
