Environment consideration for a research on scheduling
Hi,
in the first phase we are planning to set up a small cluster of a few
commodity computers (each with 1 GB of RAM and a 200 GB disk). The cluster
would run Ubuntu Server 10.10 and a Hadoop build from the 0.20.204 branch
(I had some issues with missing libraries in version 0.20.203:
http://hadoop-common.472056.n3.nabble.com/Development-enviroment-problems-eclipse-hadoop-0-20-203-td3186022.html#a3188567).
Would you suggest any other version?

In the second phase we are planning to analyse, test and modify some of
the Hadoop schedulers.

Now I am interested in the best way to deploy Ubuntu and Hadoop to these
few machines. I was thinking of configuring the system in a local VM and
then converting the image for each physical machine, but this is probably
not the best option. If you know a better way, please share.

Thank you!

  • GOEKE, MATTHEW (AG/1000) at Sep 23, 2011 at 3:10 pm
    If you are starting from scratch with no prior Hadoop install experience, I would configure stand-alone, migrate to pseudo-distributed, and then to fully distributed, verifying functionality at each step by doing a simple word-count run. Also, if you don't mind using the CDH distribution, then SCM / their RPMs will greatly simplify both the binary installs and the user creation.
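
    For reference, a minimal pseudo-distributed configuration for an 0.20.x build looks roughly like the sketch below. The property names are the stock 0.20 ones; the examples jar name and conf/ layout are assumptions that vary slightly between builds.

        <!-- conf/core-site.xml: point the default filesystem at local HDFS -->
        <configuration>
          <property>
            <name>fs.default.name</name>
            <value>hdfs://localhost:9000</value>
          </property>
        </configuration>

        <!-- conf/hdfs-site.xml: one replica is enough on a single node -->
        <configuration>
          <property>
            <name>dfs.replication</name>
            <value>1</value>
          </property>
        </configuration>

        <!-- conf/mapred-site.xml: local JobTracker -->
        <configuration>
          <property>
            <name>mapred.job.tracker</name>
            <value>localhost:9001</value>
          </property>
        </configuration>

        # then, from the Hadoop install directory (passwordless ssh to
        # localhost assumed):
        bin/hadoop namenode -format             # one-time HDFS format
        bin/start-dfs.sh && bin/start-mapred.sh
        bin/hadoop fs -put conf input           # sample input: the conf dir
        # examples jar name varies by build, e.g. hadoop-examples-0.20.204.0.jar
        bin/hadoop jar hadoop-examples-*.jar wordcount input output
        bin/hadoop fs -cat 'output/part-*'      # verify the counts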

    Your VM route will most likely work, but I imagine the number of hiccups during migration from that to the real cluster will not make it worth your time.

    Matt

  • Merto Mertek at Sep 24, 2011 at 11:49 am
    I agree, we will go the standard route. As you suggested, we will go step
    by step up to the full cluster deployment. After configuring the first
    node we will use Clonezilla to replicate it and then set the machines up
    one by one.
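
    A note on the cloning step: each clone needs its own identity before it
    joins the cluster. A minimal post-clone fixup, where all host names and
    addresses below are assumptions:

        # run on each freshly cloned worker (example: slave1)
        echo slave1 > /etc/hostname
        hostname slave1

        # every node should resolve every other node the same way;
        # append the same table to /etc/hosts on all machines:
        #   192.168.1.10  master
        #   192.168.1.11  slave1
        #   192.168.1.12  slave2

        # on the master, conf/slaves lists the workers that
        # start-dfs.sh / start-mapred.sh contact over ssh:
        #   slave1
        #   slave2

        # if a DataNode ever ran before cloning, clear its storage so the
        # cloned storage ID does not collide (path is the 0.20 default):
        rm -rf /tmp/hadoop-*/dfs/data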

    On the worker nodes I was thinking of running Ubuntu Server; the namenode
    will run Ubuntu Desktop. I am interested in how to configure the
    environment so that I can remotely monitor, analyse and configure the
    cluster. I will run jobs from outside the local network via ssh to the
    namenode, but in that situation I will not be able to access the web
    interfaces of the JobTracker and TaskTrackers. So I am wondering how to
    analyze them, and how you configured your environment to be as practical
    as possible.

    For monitoring the cluster I saw that Ganglia is one of the options, but
    at this stage of testing the job-history files will probably be enough.
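
    One standard trick for the web interfaces is plain SSH port forwarding to
    the stock 0.20 UI ports (NameNode 50070, JobTracker 50030, TaskTracker
    50060); the host name below is an assumption:

        # forward the NameNode and JobTracker web UIs over ssh
        ssh -N -L 50070:localhost:50070 -L 50030:localhost:50030 \
            user@namenode.example.org
        # then browse http://localhost:50030/ on the local machine

        # or open a SOCKS proxy, which also reaches the TaskTracker UIs
        # (port 50060) on the workers when following links between pages:
        ssh -N -D 8157 user@namenode.example.org
        # point the browser's SOCKS proxy at localhost:8157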
  • Steve Loughran at Sep 26, 2011 at 9:42 am

    On 23 September 2011, Merto Mertek wrote:
    The cluster would run Ubuntu Server 10.10 and a Hadoop build from the
    0.20.204 branch. Would you suggest any other version?
    I wouldn't rush to put Ubuntu 10.x on; it makes a good desktop, but RHEL
    and CentOS are the platforms of choice on the server side.

    In the second phase we are planning to analyse, test and modify some of
    the Hadoop schedulers.
    The main schedulers used by Y! and FB are fairly well tuned for their
    workloads, and apparently not something you'd want to play with. There
    is at least one other scheduler in the contrib/ dir to play with.
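
    For example, the contrib fair scheduler in 0.20.x is enabled through
    mapred-site.xml; in the sketch below the jar path is an assumption that
    varies by build:

        # put the contrib scheduler on the JobTracker classpath
        cp contrib/fairscheduler/*fairscheduler*.jar lib/

        <!-- conf/mapred-site.xml: which scheduler the JobTracker loads -->
        <property>
          <name>mapred.jobtracker.taskScheduler</name>
          <value>org.apache.hadoop.mapred.FairScheduler</value>
        </property>

        # restart the JobTracker; its web UI then serves a /scheduler page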

    The other thing about scheduling is that you may have a faster
    development cycle if, instead of working on a real cluster, you simulate
    one at multiples of real time, using stats collected from your own
    workload by way of the gridmix2 tools. I've never done scheduling work,
    but I think there's some stuff in there to do that; if not, it's a
    possible contribution.
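
    A rough sketch of driving gridmix2 from an 0.20.x source tree; the paths
    and script names follow the gridmix2 README, but treat them as
    assumptions against your checkout:

        cd src/benchmarks/gridmix2
        ant                           # builds gridmix.jar
        # edit the gridmix-env-2 settings file to point at your cluster
        sh generateGridmix2data.sh    # writes synthetic input data into HDFS
        sh rungridmix_2               # replays the mixed workload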

    Be aware that the changes in 0.23+ will change resource scheduling; that
    may be a better place to do development, with a plan to deploy in 2012.
    Oh, and get on the mapreduce lists, especially the -dev list, to discuss
    issues.

    The information contained in this email may be subject to the export control laws and regulations of the United States, potentially
    including but not limited to the Export Administration Regulations (EAR) and sanctions regulations issued by the U.S. Department of
    Treasury, Office of Foreign Asset Controls (OFAC). As a recipient of this information you are obligated to comply with all
    applicable U.S. export laws and regulations.
    I have no idea what that means, but I am not convinced that reading an
    email forces me to comply with a different country's rules.
  • Merto Mertek at Sep 27, 2011 at 8:03 pm
    The Desktop edition was chosen just to run the namenode and to monitor
    cluster statistics. The worker nodes were chosen to run the Ubuntu Server
    edition because we found this configuration in several research papers.
    One such configuration can be found in the paper on the LATE scheduler
    (is any source code for it available, or is it integrated into the new
    fair scheduler?).

    Thanks for the suggested tools.
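
    As far as I know, LATE never shipped as a standalone scheduler; its
    heuristics informed Hadoop's speculative-execution logic, which is
    tunable per job. A sketch, where the examples jar name is an assumption:

        # speculative execution (the mechanism LATE refines), toggled per
        # job; the two property names are the stock 0.20 ones
        bin/hadoop jar hadoop-examples-*.jar wordcount \
            -D mapred.map.tasks.speculative.execution=true \
            -D mapred.reduce.tasks.speculative.execution=false \
            input output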
