Hadoop cluster optimization
Hi all!
How are you?

My name is Avi and I have been fascinated by Apache Hadoop for the last few
months.
I have spent the last two weeks trying to optimize my configuration files
and environment.
I have been going through many of Hadoop's configuration properties, and it
seems that none of them makes a big difference (about +/- 3 minutes of the
total job run time).

By Hadoop standards my cluster is considered extremely small (260 GB of
text files, while every job goes through only about 8 GB).
I have one server acting as NameNode and JobTracker, and another 5 servers
acting as DataNodes and TaskTrackers.
Right now Hadoop's configuration is set to the defaults, apart from the DFS
block size, which is set to 256 MB since every file on my cluster is
155-250 MB.
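In hdfs-site.xml terms that override looks roughly like this (property name
as in Hadoop 0.20 / CDH3):

    <property>
      <name>dfs.block.size</name>
      <!-- 256 MB = 256 * 1024 * 1024 bytes -->
      <value>268435456</value>
    </property>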

All of the above servers are identical and have the following hardware and
software:
1.7 GB memory
1 Intel(R) Xeon(R) CPU E5507 @ 2.27GHz
Ubuntu Server 10.10, 32-bit platform
Cloudera CDH3 manual Hadoop installation
(for those who are familiar with Amazon Web Services, I am talking about
Small EC2 instances)

Total job run time is about 15 minutes (roughly 50 files/blocks/map tasks
of up to 250 MB, and 10 reduce tasks).

Based on the above information, can anyone recommend a best-practice
configuration?
Do you think that when dealing with such a small cluster, and when
processing such a small amount of data, it is even possible to optimize
jobs so they run much faster?

By the way, it seems that none of the nodes has a hardware performance
issue (CPU/memory) while running the job.
That is true unless I have a bottleneck somewhere else (network bandwidth
does not seem to be the issue).
This is a little confusing, because the NameNode process and the JobTracker
process should each allocate 1 GB of memory, which would mean that my
hardware starting point is insufficient; in that case, why am I not seeing
full memory utilization with the 'top' command on the NameNode & JobTracker
server?
How would you recommend measuring/monitoring Hadoop's various properties to
find out where the bottleneck is?

Thanks for your help!!

Avi

  • Stanley Shi at Aug 22, 2011 at 1:47 am
    Hi Avi,

    I'm also learning Hadoop now. There's a tool named "nmon" that can track a server's resource usage. You can use it to monitor the memory, CPU, disk and network usage of your servers. It's very easy to use, and there's an nmon-analyzer that can generate Excel charts based on the nmon data.

    Hope this helps

  • Michel Segel at Aug 22, 2011 at 2:16 am
    Avi,
    First, why a 32-bit OS?
    You have a 64-bit processor with 4 hyper-threaded cores, which looks like 8 CPUs.

    With only 1.7 GB you're going to be limited in the number of slots you can configure.
    I'd say run Ganglia, but that would take resources away from you. It sounds like the default parameters are a pretty good fit.
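
    Just to illustrate (ballpark numbers for a 1.7 GB box, not a tuned recommendation), the slot limit is set in mapred-site.xml with entries like:

        <!-- one map slot and one reduce slot per node; the default 200 MB child heap
             (mapred.child.java.opts) is about all a 1.7 GB node can afford anyway -->
        <property>
          <name>mapred.tasktracker.map.tasks.maximum</name>
          <value>1</value>
        </property>
        <property>
          <name>mapred.tasktracker.reduce.tasks.maximum</name>
          <value>1</value>
        </property>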


    Sent from a remote device. Please excuse any typos...

    Mike Segel

  • Allen Wittenauer at Aug 22, 2011 at 4:06 am

    On Aug 21, 2011, at 7:17 PM, Michel Segel wrote:

    > Avi,
    > First, why a 32-bit OS?
    > You have a 64-bit processor with 4 hyper-threaded cores, which looks like 8 CPUs.

    With only 1.7 GB of memory, there likely isn't much of a reason to use a 64-bit OS. The machines (as you point out) are already tight on memory; 64-bit is only going to make it worse.

    > Based on the above information, can anyone recommend a best-practice configuration?

    How many spindles? Are your tasks spilling?

    > Do you think that when dealing with such a small cluster, and when processing such a small amount of data, it is even possible to optimize jobs so they run much faster?

    Most of the time, performance issues are with the algorithm, not Hadoop.
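
    (By "spilling" I mean map output overflowing the in-memory sort buffer and being written to disk. The relevant knobs live in mapred-site.xml; the names below are the Hadoop 0.20/CDH3 ones, and the values shown are just the defaults:)

        <!-- map-side sort buffer; once it fills past the spill threshold, records are
             written to disk, and extra spills mean extra disk I/O and merge passes -->
        <property>
          <name>io.sort.mb</name>
          <value>100</value>
        </property>
        <property>
          <name>io.sort.spill.percent</name>
          <value>0.80</value>
        </property>
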
  • Avi Vaknin at Aug 22, 2011 at 9:57 am
    Hi Allen/Michel,
    First, thanks a lot for your replies.

    I assumed that the 1.7 GB of RAM would be the bottleneck in my environment; that's
    why I am trying to change it now.
    I shut down the 4 datanodes with 1.7 GB RAM (Amazon EC2 small instances) and
    replaced them with 2 datanodes with 7.5 GB RAM (Amazon EC2 large instances).

    Is it OK that the datanodes are 64-bit while the namenode is still 32-bit?
    Based on the new hardware I'm using, are there any suggestions regarding the
    Hadoop configuration parameters?

    One more thing, you asked: "Are your tasks spilling?"
    How can I check if my tasks are spilling?

    Thanks.

    Avi


  • Ian Michael Gumby at Aug 22, 2011 at 1:35 pm
    Avi,
    You can run with 1.7 GB of RAM, but that means you're going to have 1 map/reduce slot per node.

    With 4 cores, figure 1 core for the DataNode and 1 core for the TaskTracker; with hyper-threading (2 threads per core) the remaining 2 cores look like 4 virtual cores, so you could run with 4 slots per node (3 mappers, 1 reducer).
    So that would be 1 + 1 + 4 GB, assuming your jobs are 1 GB or less in heap.
    Note this leaves some headroom on the machine for OS jobs and later fine tuning, so that you're not over-committing your machines.
    Also, if your files are 256 MB in size and your block size is 256 MB, you get one map task per file. There is no parallelism unless you're processing multiple files in an M/R job.
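
    Spelled out as a rough mapred-site.xml sketch (per node, assuming about 1 GB of heap per task as above; the numbers are illustrative, not something I've tested on your setup):

        <!-- 3 map slots + 1 reduce slot per node, ~1 GB heap each,
             on top of the DataNode and TaskTracker daemons -->
        <property>
          <name>mapred.tasktracker.map.tasks.maximum</name>
          <value>3</value>
        </property>
        <property>
          <name>mapred.tasktracker.reduce.tasks.maximum</name>
          <value>1</value>
        </property>
        <property>
          <name>mapred.child.java.opts</name>
          <value>-Xmx1024m</value>
        </property>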

    As to running a 32-bit JT/NN with 64-bit DN/TTs: I haven't tried it, and I don't think you'll have a problem, but I wouldn't recommend it. You're adding a variable that you don't need, and I'm not sure the cost difference is worth the risk. (You said you're running this on EC2...) Note: I'm not an expert on EC2. I guess I'm lucky that I have my own sandbox to play in... :-)

    HTH

    -Mike


  • Allen Wittenauer at Aug 22, 2011 at 4:19 pm

    On Aug 22, 2011, at 3:00 AM, אבי ווקנין wrote:

    > I assumed that the 1.7 GB RAM would be the bottleneck in my environment; that's why I am trying to change it now.
    > I shut down the 4 datanodes with 1.7 GB RAM (Amazon EC2 small instances) and replaced them with 2 datanodes with 7.5 GB RAM (Amazon EC2 large instances).

    This should allow you to bump up the memory and/or increase the task count.

    > Is it OK that the datanodes are 64-bit while the namenode is still 32-bit?

    I've run several instances where the NN was 64-bit and the DNs were 32-bit. I can't think of a reason the reverse wouldn't work. The only thing that really matters is whether they are the same CPU architecture (which, if you are running on EC2, will likely always be the case).

    > Based on the new hardware I'm using, are there any suggestions regarding the Hadoop configuration parameters?

    It really depends on how much memory you need per task. That's why the task spill rate is important... :)

    > One more thing, you asked: "Are your tasks spilling?"
    > How can I check if my tasks are spilling?

    Check the task logs.

    If you aren't spilling, then you'll likely want to match task count = core count - 1, unless memory is exhausted first (i.e., tasks * memory per task should be less than the available memory).
