FAQ
Hi there,

I am trying to get Hadoop on Demand up and running, but am having
problems with the ringmaster not being able to communicate with HDFS.

The output from the hod allocate command ends with this, with full verbosity:

[2010-06-10 14:40:22,650] CRITICAL/50 hadoop:298 - Failed to retrieve
'hdfs' service address.
[2010-06-10 14:40:22,654] DEBUG/10 hadoop:631 - Cleaning up cluster id
34029.symphony.cs.waikato.ac.nz, as cluster could not be allocated.
[2010-06-10 14:40:22,655] DEBUG/10 hadoop:635 - Calling rm.stop()
[2010-06-10 14:40:22,665] DEBUG/10 hadoop:637 - Returning from rm.stop()
[2010-06-10 14:40:22,666] CRITICAL/50 hod:401 - Cannot allocate
cluster /home/dmilne/hadoop/cluster
[2010-06-10 14:40:23,090] DEBUG/10 hod:597 - return code: 7


I've attached the hodrc file below, but briefly HOD is supposed to
provision an HDFS cluster as well as a Map/Reduce cluster, and seems
to be failing to do so. The ringmaster log looks like this:

[2010-06-10 14:36:05,144] DEBUG/10 ringMaster:479 - getServiceAddr name: hdfs
[2010-06-10 14:36:05,145] DEBUG/10 ringMaster:487 - getServiceAddr
service: <hodlib.GridServices.hdfs.Hdfs instance at 0x8f97e8>
[2010-06-10 14:36:05,147] DEBUG/10 ringMaster:504 - getServiceAddr
addr hdfs: not found
[2010-06-10 14:36:06,195] DEBUG/10 ringMaster:479 - getServiceAddr name: hdfs
[2010-06-10 14:36:06,197] DEBUG/10 ringMaster:487 - getServiceAddr
service: <hodlib.GridServices.hdfs.Hdfs instance at 0x8f97e8>
[2010-06-10 14:36:06,198] DEBUG/10 ringMaster:504 - getServiceAddr
addr hdfs: not found

... and so on, until it gives up

Any ideas why? One red flag is that when running the allocate command,
some of the variables echoed back look dodgy:

--gridservice-hdfs.fs_port 0
--gridservice-hdfs.host localhost
--gridservice-hdfs.info_port 0

These are not what I specified in the hodrc. Are the port numbers just
set to 0 because I am not using an external HDFS, or is this a
problem?


The software versions involved are:
- Hadoop 0.20.2
- Python 2.5.2 (no Twisted)
- Java 1.6.0_20
- Torque 2.4.5


The hodrc file looks like this:

[hod]
stream = True
java-home = /opt/jdk1.6.0_20
cluster = debian5
cluster-factor = 1.8
xrs-port-range = 32768-65536
debug = 3
allocate-wait-time = 3600
temp-dir = /scratch/local/dmilne/hod

[ringmaster]
register = True
stream = False
temp-dir = /scratch/local/dmilne/hod
log-dir = /scratch/local/dmilne/hod/log
http-port-range = 8000-9000
idleness-limit = 864000
work-dirs = /scratch/local/dmilne/hod/1,/scratch/local/dmilne/hod/2
xrs-port-range = 32768-65536
debug = 4

[hodring]
stream = False
temp-dir = /scratch/local/dmilne/hod
log-dir = /scratch/local/dmilne/hod/log
register = True
java-home = /opt/jdk1.6.0_20
http-port-range = 8000-9000
xrs-port-range = 32768-65536
debug = 4

[resource_manager]
queue = express
batch-home = /opt/torque-2.4.5
id = torque
options = l:pmem=3812M,W:X="NACCESSPOLICY:SINGLEJOB"
#env-vars = HOD_PYTHON_HOME=/foo/bar/python-2.5.1/bin/python

[gridservice-mapred]
external = False
pkgs = /opt/hadoop-0.20.2
tracker_port = 8030
info_port = 50080

[gridservice-hdfs]
external = False
pkgs = /opt/hadoop-0.20.2
fs_port = 8020
info_port = 50070

Cheers,
Dave


  • David Milne at Jun 14, 2010 at 2:33 am
    Anybody? I am completely stuck here. I have no idea who else I can ask
    or where I can go for more information. Is there somewhere specific
    where I should be asking about HOD?

    Thank you,
    Dave
  • Jeff Hammerbacher at Jun 14, 2010 at 2:40 am
    Hey Dave,

    I can't speak for the folks at Yahoo!, but from watching the JIRA, I don't
    think HOD is actively used or developed anywhere these days. You're
    attempting to use a mostly deprecated project, and hence not receiving any
    support on the mailing list.

    Thanks,
    Jeff
  • David Milne at Jun 14, 2010 at 4:22 am
    Ok, thanks Jeff.

    This is pretty surprising though. I would have thought many people
    would be in my position, where they have to use Hadoop on a general
    purpose cluster and need it to play nice with a resource manager.
    What do other people do in this position, if they don't use HOD?
    Deprecated normally means there is a better alternative.

    - Dave
  • Vinod KV at Jun 14, 2010 at 6:00 am


    It isn't formally deprecated though. Maybe we'll need to do that
    explicitly; that would help with putting up proper documentation about
    what to use instead.

    A quick reply is that you start a static cluster on a set of nodes. A
    static cluster means bringing up the Hadoop daemons on a set of nodes
    using the startup scripts distributed in the bin/ directory.

    That said, there are no changes to HOD in 0.21 and beyond. Deploying
    0.21 clusters should mostly work out of the box. But beyond 0.21, it may
    not work, because HOD needs to be updated with respect to the removed or
    updated Hadoop-specific configuration parameters and environment
    variables it generates itself.

    HTH,
    +vinod
  • Amr Awadallah at Jun 14, 2010 at 3:28 pm
    Dave,

    Yes, many others have the same situation; the recommended solution is
    either to use the Fair Share Scheduler or the Capacity Scheduler. These
    schedulers are much better than HOD since they take data locality into
    consideration (they don't just spin up 20 TT nodes on machines that have
    nothing to do with your data). They also don't lock down the nodes just
    for you, so as TTs are freed other jobs can use them immediately (as
    opposed to nobody being able to use them until your entire job is done).

    Also, if you are brave and want to try something spanking new, then I
    recommend you reach out to the Mesos guys; they have a scheduler layer
    under Hadoop that is data-locality aware:

    http://mesos.berkeley.edu/

    -- amr
  • Edward Capriolo at Jun 14, 2010 at 3:26 pm

    I have not used it much, but I think HOD is pretty cool. I guess most
    people who are looking to (spin up, run job, transfer off, spin down) are
    using EC2. HOD does something like make private Hadoop clouds on your own
    hardware, and many probably do not have that use case. As schedulers
    advance and get better, HOD becomes less attractive, but I can always see
    a place for it.
  • Steve Loughran at Jun 14, 2010 at 3:50 pm

    I don't know who is using it, or maintaining it; we've been bringing up
    short-lived Hadoop clusters differently.

    I think I should write a little article on the topic; I presented about
    it at Berlin Buzzwords last week.

    Short-lived Hadoop clusters on VMs are fine if you don't have enough
    data or CPU load to justify a set of dedicated physical machines, and are
    a good way of experimenting with Hadoop at scale. You can maybe lock
    down the network better too, though that depends on your VM
    infrastructure.

    Where VMs are weak is in disk IO performance, but there's no reason why
    the VM infrastructure can't take a list of filenames/directories as a
    hint for VM placement (placement is the new scheduling, incidentally),
    and virtualized IO can only improve. If you can run Hadoop MapReduce
    directly against SAN-mounted storage then you can stop worrying about
    locality of data and still gain from parallelisation of the operations.


    -steve
  • Gang Luo at Jun 14, 2010 at 4:55 pm
    Hi,
    According to the doc, JobControl can maintain the dependencies among
    different jobs, and only jobs whose dependencies have completed can
    execute. How does JobControl maintain the dependencies, and how can we
    indicate them?

    Thanks,
    -Gang
  • Akash Deep Shakya at Jun 14, 2010 at 4:58 pm
    Use the ControlledJob class from Hadoop trunk, and run it through
    JobControl.

    Regards
    Akash Deep Shakya "OpenAK"
    FOSS Nepal Community
    akashakya at gmail dot com

    ~ Failure to prepare is preparing to fail ~


  • Jeff Zhang at Jun 15, 2010 at 1:44 am
    There's a class org.apache.hadoop.mapred.jobcontrol.Job which is a
    wrapper around JobConf. You add the jobs it depends on to it, then add it
    to JobControl.



    --
    Best Regards

    Jeff Zhang
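
    To illustrate the wiring with the old API that Jeff describes, here is a
    minimal, hypothetical sketch; the two JobConf objects stand in for real
    job configurations, and only the dependency wiring is the point:

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.jobcontrol.Job;
    import org.apache.hadoop.mapred.jobcontrol.JobControl;

    public class OldApiJobChain {
      public static void main(String[] args) throws Exception {
        JobConf firstConf = new JobConf();   // mapper/reducer/paths would be set here
        JobConf secondConf = new JobConf();  // configuration of the dependent job

        // Wrap each JobConf in a jobcontrol Job.
        Job first = new Job(firstConf);
        Job second = new Job(secondConf);

        // Declare the dependency: 'second' is not submitted until
        // 'first' has completed successfully.
        second.addDependingJob(first);

        // Hand both jobs to a JobControl and run it in a thread.
        JobControl control = new JobControl("chain");
        control.addJob(first);
        control.addJob(second);

        Thread runner = new Thread(control);
        runner.start();
        while (!control.allFinished()) {
          Thread.sleep(1000);
        }
        control.stop();
      }
    }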
  • Akash Deep Shakya at Jun 15, 2010 at 5:05 am
    @Jeff, I think JobConf is already deprecated.
    org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob and
    org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl can be used instead.

    Regards
    Akash Deep Shakya "OpenAK"
    FOSS Nepal Community
    akashakya at gmail dot com

    ~ Failure to prepare is preparing to fail ~
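
    For comparison, a minimal sketch of the same two-job chain using the
    newer classes Akash mentions; the jobs themselves are placeholders, and
    only the ControlledJob/JobControl wiring is the point:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
    import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

    public class NewApiJobChain {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Ordinary mapreduce Jobs; mappers, reducers and paths would be
        // configured here as usual.
        Job first = new Job(conf, "first");
        Job second = new Job(conf, "second");

        // Wrap them in ControlledJobs and declare the dependency:
        // 'second' waits until 'first' has completed successfully.
        ControlledJob controlledFirst = new ControlledJob(first, null);
        ControlledJob controlledSecond = new ControlledJob(second, null);
        controlledSecond.addDependingJob(controlledFirst);

        // JobControl is a Runnable; run it in a thread and poll it.
        JobControl control = new JobControl("chain");
        control.addJob(controlledFirst);
        control.addJob(controlledSecond);

        Thread runner = new Thread(control);
        runner.start();
        while (!control.allFinished()) {
          Thread.sleep(1000);
        }
        control.stop();
      }
    }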


  • David Milne at Jun 15, 2010 at 12:05 am
    Is there something else I could read about setting up short-lived
    Hadoop clusters on virtual machines? I have no experience with VMs at
    all. I see there is quite a bit of material about using them to get
    Hadoop up and running with a pseudo-cluster on a single machine, but I
    don't follow how this stretches out to using multiple machines
    allocated by Torque.

    Thanks,
    Dave

  • Steve Loughran at Jun 15, 2010 at 10:09 am

    My slides are up here
    http://www.slideshare.net/steve_l/farming-hadoop-inthecloud

    We've been bringing up Hadoop in a virtual infrastructure: first you ask
    for the master node containing a NN, a JT and a DN with almost no
    storage (just enough for the filesystem to go live, to stop the JT
    blocking). If it comes up, you then have a stable hostname for the
    filesystem, which you can use for all the real worker nodes (DN + TT) you
    want.

    Some nearby physicists are trying to get Hadoop to co-exist with the
    grid schedulers. I've added a feature request to make the reporting of
    task tracker slots something plugins can handle, so that you'd have a
    set of Hadoop workers which could be used by the grid apps or by Hadoop,
    with physical Hadoop storage. When they were doing work scheduled outside
    of Hadoop, they'd report less availability to the Job Tracker, so as not
    to overload the machines.

    Dan Templeton of Sun/Oracle has been working on getting Hadoop to
    coexist with his resource manager; he's worth contacting. Maybe we could
    persuade him to give a public online talk on the topic.

    -steve
  • David Milne at Jun 14, 2010 at 11:45 pm
    Unless I am missing something, the Fair Share and Capacity schedulers
    sound like a solution to a different problem: aren't they for a
    dedicated Hadoop cluster that needs to be shared by lots of people? I
    have a general purpose cluster that needs to be shared by lots of
    people. Only one of them (me) wants to run hadoop, and only wants to
    run it intermittently. I'm not concerned with data locality, as my
    workflow is:

    1) upload data I need to process to cluster
    2) run a chain of map-reduce tasks
    3) grab processed data from cluster
    4) clean up cluster

    Mesos sounds good, but I am definitely NOT brave about this. As I
    said, I am just one user of the cluster among many. I would want to
    stick with Torque and Maui for resource management.

    - Dave
  • Jason Stowe at Jun 15, 2010 at 7:10 pm
    Hi David,
    The original HOD project was integrated with Condor (
    http://bit.ly/CondorProject), which Yahoo! was using to schedule clusters.

    A year or two ago the Condor project, in addition to being open source
    with no licensing costs, created close integration with Hadoop (as does
    SGE), as presented by me at a prior Hadoop World and by the Condor team
    at Condor Week 2010:

    My company has solutions for deploying Hadoop Clusters on shared
    infrastructure using CycleServer and schedulers like Condor/SGE/etc. The
    general deployment strategy is to deploy head nodes (Name/Job Tracker), then
    execute nodes, and to be careful about how you deal with
    data/sizing/replication counts.

    If you're interested in this, please feel free to drop us a line at my
    e-mail or http://cyclecomputing.com/about/contact

    Thanks,
    Jason



    --

    ==================================
    Jason A. Stowe
    cell: 607.227.9686
    main: 888.292.5320

    http://twitter.com/jasonastowe/
    http://twitter.com/cyclecomputing/

    Cycle Computing, LLC
    Leader in Open Compute Solutions for Clouds, Servers, and Desktops
    Enterprise Condor Support and Management Tools

    http://www.cyclecomputing.com
    http://www.cyclecloud.com
  • Edward Capriolo at Jun 15, 2010 at 8:48 pm

    On Tue, Jun 15, 2010 at 3:10 PM, Jason Stowe wrote:

    Hi David,
    The original HOD project was integrated with Condor (
    http://bit.ly/CondorProject), which Yahoo! was using to schedule clusters.

    A year or two ago the Condor project, which in addition to being open
    source has no licensing costs, created close integration with Hadoop
    (as does SGE). I presented this at a prior Hadoop World, and the
    Condor team presented it at Condor Week 2010:
    http://bit.ly/Condor_Hadoop_CondorWeek2010

    My company has solutions for deploying Hadoop clusters on shared
    infrastructure using CycleServer and schedulers like Condor/SGE/etc.
    The general deployment strategy is to deploy the head nodes
    (NameNode/JobTracker), then the execute nodes, and to be careful
    about how you deal with data, sizing, and replication counts.

    If you're interested in this, please feel free to drop us a line at my
    e-mail or http://cyclecomputing.com/about/contact

    Thanks,
    Jason

    On Mon, Jun 14, 2010 at 7:45 PM, David Milne wrote:

    Unless I am missing something, the Fair Share and Capacity schedulers
    sound like a solution to a different problem: aren't they for a
    dedicated Hadoop cluster that needs to be shared by lots of people? I
    have a general purpose cluster that needs to be shared by lots of
    people. Only one of them (me) wants to run hadoop, and only wants to
    run it intermittently. I'm not concerned with data locality, as my
    workflow is:

    1) upload data I need to process to cluster
    2) run a chain of map-reduce tasks
    3) grab processed data from cluster
    4) clean up cluster
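    A rough sketch of that cycle with HOD's documented commands; the
    cluster directory comes from the logs earlier in this thread, while
    the node count, paths, and jar/class names are placeholders:

    # 1) allocate a cluster, 2) push data and run the job chain,
    # 3) pull the results back, 4) give the nodes back to Torque
    hod allocate -d /home/dmilne/hadoop/cluster -n 10
    hadoop --config /home/dmilne/hadoop/cluster fs -put local-input/ input
    hadoop --config /home/dmilne/hadoop/cluster jar my-job.jar MyDriver input output
    hadoop --config /home/dmilne/hadoop/cluster fs -get output local-output
    hod deallocate -d /home/dmilne/hadoop/cluster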

    Mesos sounds good, but I am definitely NOT brave about this. As I
    said, I am just one user of the cluster among many. I would want to
    stick with Torque and Maui for resource management.

    - Dave
    On Tue, Jun 15, 2010 at 12:37 AM, Amr Awadallah wrote:
    Dave,

    Yes, many others have the same situation; the recommended solution is
    either to use the Fair Share Scheduler or the Capacity Scheduler.
    These schedulers are much better than HOD since they take data
    locality into consideration (they don't just spin up 20 TaskTracker
    nodes on machines that have nothing to do with your data). They also
    don't lock down the nodes just for you, so as TaskTrackers are freed
    other jobs can use them immediately (as opposed to nobody being able
    to use them until your entire job is done).
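    For reference, a rough sketch (not from this thread) of what enabling
    the Fair Scheduler looks like on Hadoop 0.20: the scheduler ships as
    a contrib jar that goes on the JobTracker's classpath, and the
    scheduler class is named in mapred-site.xml. The install path below
    is illustrative.

    # e.g. in conf/hadoop-env.sh on the JobTracker
    export HADOOP_CLASSPATH=/opt/hadoop-0.20.2/contrib/fairscheduler/hadoop-0.20.2-fairscheduler.jar
    # and in conf/mapred-site.xml set:
    #   mapred.jobtracker.taskScheduler = org.apache.hadoop.mapred.FairScheduler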

    Also, if you are brave and want to try something spanking new, then I
    recommend you reach out to the Mesos guys; they have a scheduler
    layer under Hadoop that is data-locality aware:

    http://mesos.berkeley.edu/

    -- amr

    On Sun, Jun 13, 2010 at 9:21 PM, David Milne <d.n.milne@gmail.com>
    wrote:
    Ok, thanks Jeff.

    This is pretty surprising though. I would have thought many people
    would be in my position, where they have to use Hadoop on a general
    purpose cluster, and need it to play nice with a resource manager?
    What do other people do in this position, if they don't use HOD?
    Deprecated normally means there is a better alternative.

    - Dave

    On Mon, Jun 14, 2010 at 2:39 PM, Jeff Hammerbacher <hammer@cloudera.com> wrote:
    Hey Dave,

    I can't speak for the folks at Yahoo!, but from watching the JIRA, I don't
    think HOD is actively used or developed anywhere these days. You're
    attempting to use a mostly deprecated project, and hence not
    receiving
    any
    support on the mailing list.

    Thanks,
    Jeff

    but I don't follow how this stretches out to using multiple machines
    allocated by Torque.

    Hadoop does not have a concept of virtual hosting. The NameNode has a
    port, the JobTracker has a port, the DataNode uses a port (plus a
    port for its web interface), and the TaskTracker is the same deal.
    Running multiple copies of Hadoop on the same machine is "easy": all
    you have to do is make sure they do not step on each other. Make sure
    they do not write to the same folder locations, and make sure they do
    not use the same ports.

    Single setup
    NameNode: 9000 Web: 50070
    JobTracker: 1000 Web: 50030
    ...

    Multi setup

    Setup 1
    NameNode: 9001 Web: 50071
    JobTracker: 1001 Web: 50031
    ...

    Setup 2
    NameNode: 9002 Web: 50072
    JobTracker: 1002 Web: 50032
    ...
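    As a minimal sketch of keeping the second instance separate (not
    something from this thread): give it its own conf directory whose
    site files carry the second set of ports and a distinct
    hadoop.tmp.dir, then start its daemons against that directory. The
    paths and hostname below are illustrative.

    # conf-setup2/ would set fs.default.name=hdfs://node01:9002,
    # mapred.job.tracker=node01:1002, dfs.http.address on :50072,
    # mapred.job.tracker.http.address on :50032, and its own hadoop.tmp.dir
    export HADOOP_CONF_DIR=/opt/hadoop-0.20.2/conf-setup2
    /opt/hadoop-0.20.2/bin/hadoop-daemon.sh --config $HADOOP_CONF_DIR start namenode
    /opt/hadoop-0.20.2/bin/hadoop-daemon.sh --config $HADOOP_CONF_DIR start jobtracker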

    HOD is supposed to handle the "dirty" work for you: building the
    configuration files, installing Hadoop on the nodes, and starting the
    Hadoop components. You could theoretically accomplish similar things
    with remote SSH keys and a boatload of scripting. HOD is a deployment
    and management tool.

    It sounds like it may not meet your need. Is your goal to deploy and
    manage one instance of Hadoop, or multiple instances? HOD is designed
    to install multiple instances of Hadoop on a single set of hardware.
    It sounds like you want to deploy one cluster per group of VMs, which
    is not really the same thing.
  • Vinod KV at Jun 14, 2010 at 5:55 am

    On Monday 14 June 2010 08:03 AM, David Milne wrote:
    Anybody? I am completely stuck here. I have no idea who else I can ask
    or where I can go for more information. Is there somewhere specific
    where I should be asking about HOD?

    Thank you,
    Dave
    In the ringmaster logs, you should see which node was supposed to run
    the NameNode. This can be found above the logs that you've printed. I
    can barely remember, but I guess it reads something like
    getCommand(). Once you find out the node, check the hodring logs
    there; something must have gone wrong there.

    The return code was 7, indicating HDFS failure. See
    http://hadoop.apache.org/common/docs/r0.20.0/hod_user_guide.html#The+Exit+Codes+For+HOD+Are+Not+Getting+Into+Torque
    and check if you are hitting one of the problems listed there.
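    A hedged sketch of that check: the log directory comes from the
    hodrc in this thread, while the grep pattern and log file layout are
    only a guess based on the getCommand() hint above.

    # find which node the ringmaster asked to bring up the NameNode
    grep -i -E 'getcommand|namenode' /scratch/local/dmilne/hod/log/*
    # then inspect the hodring logs on that node (the hodring log-dir is
    # the same directory, per the hodrc)
    ssh <node-from-the-grep> 'ls -lt /scratch/local/dmilne/hod/log && tail -100 /scratch/local/dmilne/hod/log/*'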

    HTH,
    +vinod

  • David Milne at Jun 14, 2010 at 10:50 pm
    Thanks everyone for your replies.

    Even though HOD looks like a dead-end I would prefer to use it. I am
    just one user of the cluster among many, and currently the only one
    using Hadoop. The jobs I need to run are pretty much one-off: they are
    big jobs that I can't do without Hadoop, but I might need to run them
    once a month or less. The ability to provision MapReduce and HDFS when
    I need it sounds ideal.

    Following Vinod's advice, I have rolled back to Hadoop 0.20.1 (the
    last version that HOD kept up with) and taken a closer look at the
    ringmaster logs. However, I am still getting the same problems as
    before, and I can't find anything in the logs to help me identify the
    NameNode.

    The full ringmaster log is below. It's a pretty repetitive song, so
    I've identified the chorus.

    [2010-06-15 10:07:40,236] DEBUG/10 ringMaster:569 - Getting service ID.
    [2010-06-15 10:07:40,237] DEBUG/10 ringMaster:573 - Got service ID:
    34350.symphony.cs.waikato.ac.nz
    [2010-06-15 10:07:40,239] DEBUG/10 ringMaster:756 - Command to
    execute: /bin/cp /home/dmilne/hadoop/hadoop-0.20.1.tar.gz
    /scratch/local/dmilne/hod/dmilne.34350.symphony.cs.waikato.ac.nz.ringmaster
    [2010-06-15 10:07:42,314] DEBUG/10 ringMaster:762 - Completed command
    execution. Exit Code: 0.
    [2010-06-15 10:07:42,315] DEBUG/10 ringMaster:591 - Service registry @
    http://symphony.cs.waikato.ac.nz:36372
    [2010-06-15 10:07:47,503] DEBUG/10 ringMaster:726 - tarball name :
    /scratch/local/dmilne/hod/dmilne.34350.symphony.cs.waikato.ac.nz.ringmaster/hadoop-0.20.1.tar.gz
    hadoop package name : hadoop-0.20.1/
    [2010-06-15 10:07:47,505] DEBUG/10 ringMaster:716 - Returning Hadoop
    directory as: /scratch/local/dmilne/hod/dmilne.34350.symphony.cs.waikato.ac.nz.ringmaster/hadoop-0.20.1/
    [2010-06-15 10:07:47,515] DEBUG/10 util:215 - Executing command
    /scratch/local/dmilne/hod/dmilne.34350.symphony.cs.waikato.ac.nz.ringmaster/hadoop-0.20.1/bin/hadoop
    version to find hadoop version
    [2010-06-15 10:07:48,241] DEBUG/10 util:224 - Version from hadoop
    command: Hadoop 0.20.1

    [2010-06-15 10:07:48,244] DEBUG/10 ringMaster:117 - Using max-connect value 30
    [2010-06-15 10:07:48,246] INFO/20 ringMaster:61 - Twisted interface
    not found. Using hodXMLRPCServer.
    [2010-06-15 10:07:48,257] DEBUG/10 ringMaster:73 - Ringmaster RPC
    Server at 33771
    [2010-06-15 10:07:48,265] DEBUG/10 ringMaster:121 - registering:
    http://cn71:8030/hadoop-0.20.1.tar.gz
    [2010-06-15 10:07:48,275] DEBUG/10 ringMaster:658 - dmilne
    34350.symphony.cs.waikato.ac.nz cn71.symphony.cs.waikato.ac.nz
    ringmaster hod
    [2010-06-15 10:07:48,307] DEBUG/10 ringMaster:670 - Registered with
    serivce registry: http://symphony.cs.waikato.ac.nz:36372.

    //chorus start
    [2010-06-15 10:07:48,393] DEBUG/10 ringMaster:479 - getServiceAddr name: hdfs
    [2010-06-15 10:07:48,394] DEBUG/10 ringMaster:487 - getServiceAddr
    service: <hodlib.GridServices.hdfs.Hdfs instance at 0xc9e050>
    [2010-06-15 10:07:48,395] DEBUG/10 ringMaster:504 - getServiceAddr
    addr hdfs: not found
    //chorus end

    //chorus (3x)

    [2010-06-15 10:07:51,461] DEBUG/10 ringMaster:726 - tarball name :
    /scratch/local/dmilne/hod/dmilne.34350.symphony.cs.waikato.ac.nz.ringmaster/hadoop-0.20.1.tar.gz
    hadoop package name : hadoop-0.20.1/
    [2010-06-15 10:07:51,463] DEBUG/10 ringMaster:716 - Returning Hadoop
    directory as: /scratch/local/dmilne/hod/dmilne.34350.symphony.cs.waikato.ac.nz.ringmaster/hadoop-0.20.1/
    [2010-06-15 10:07:51,465] DEBUG/10 ringMaster:690 -
    hadoopdir=/scratch/local/dmilne/hod/dmilne.34350.symphony.cs.waikato.ac.nz.ringmaster/hadoop-0.20.1/,
    java-home=/opt/jdk1.6.0_20
    [2010-06-15 10:07:51,470] DEBUG/10 util:215 - Executing command
    /scratch/local/dmilne/hod/dmilne.34350.symphony.cs.waikato.ac.nz.ringmaster/hadoop-0.20.1/bin/hadoop
    version to find hadoop version

    //chorus (1x)

    [2010-06-15 10:07:52,448] DEBUG/10 util:224 - Version from hadoop
    command: Hadoop 0.20.1
    [2010-06-15 10:07:52,450] DEBUG/10 ringMaster:697 - starting jt monitor
    [2010-06-15 10:07:52,453] DEBUG/10 ringMaster:913 - Entered start method.
    [2010-06-15 10:07:52,455] DEBUG/10 ringMaster:924 -
    /home/dmilne/hadoop/hadoop-0.20.1/contrib/hod/bin/hodring
    --hodring.tarball-retry-initial-time 1.0
    --hodring.cmd-retry-initial-time 2.0 --hodring.cmd-retry-interval 2.0
    --hodring.service-id 34350.symphony.cs.waikato.ac.nz
    --hodring.temp-dir /scratch/local/dmilne/hod --hodring.http-port-range
    8000-9000 --hodring.userid dmilne --hodring.java-home /opt/jdk1.6.0_20
    --hodring.svcrgy-addr symphony.cs.waikato.ac.nz:36372
    --hodring.download-addr h:t --hodring.tarball-retry-interval 3.0
    --hodring.log-dir /scratch/local/dmilne/hod/log
    --hodring.mapred-system-dir-root /mapredsystem
    --hodring.xrs-port-range 32768-65536 --hodring.debug 4
    --hodring.ringmaster-xrs-addr cn71:33771 --hodring.register
    [2010-06-15 10:07:52,456] DEBUG/10 ringMaster:479 - getServiceAddr name: mapred
    [2010-06-15 10:07:52,458] DEBUG/10 ringMaster:487 - getServiceAddr
    service: <hodlib.GridServices.mapred.MapReduce instance at 0xc9e098>
    [2010-06-15 10:07:52,460] DEBUG/10 ringMaster:504 - getServiceAddr
    addr mapred: not found
    [2010-06-15 10:07:52,470] DEBUG/10 torque:147 - pbsdsh command:
    /opt/torque-2.4.5/bin/pbsdsh
    /home/dmilne/hadoop/hadoop-0.20.1/contrib/hod/bin/hodring
    --hodring.tarball-retry-initial-time 1.0
    --hodring.cmd-retry-initial-time 2.0 --hodring.cmd-retry-interval 2.0
    --hodring.service-id 34350.symphony.cs.waikato.ac.nz
    --hodring.temp-dir /scratch/local/dmilne/hod --hodring.http-port-range
    8000-9000 --hodring.userid dmilne --hodring.java-home /opt/jdk1.6.0_20
    --hodring.svcrgy-addr symphony.cs.waikato.ac.nz:36372
    --hodring.download-addr h:t --hodring.tarball-retry-interval 3.0
    --hodring.log-dir /scratch/local/dmilne/hod/log
    --hodring.mapred-system-dir-root /mapredsystem
    --hodring.xrs-port-range 32768-65536 --hodring.debug 4
    --hodring.ringmaster-xrs-addr cn71:33771 --hodring.register
    [2010-06-15 10:07:52,475] DEBUG/10 ringMaster:929 - Returned from runWorkers.

    //chorus (many times)

    [2010-06-15 10:12:02,852] DEBUG/10 ringMaster:530 - inside xml-rpc
    call to stop ringmaster
    [2010-06-15 10:12:02,853] DEBUG/10 ringMaster:976 - RingMaster stop
    method invoked.
    [2010-06-15 10:12:02,854] DEBUG/10 ringMaster:981 - finding exit code

    //chorus (1x)

    [2010-06-15 10:12:02,858] DEBUG/10 ringMaster:533 - returning from
    xml-rpc call to stop ringmaster
    [2010-06-15 10:12:02,859] DEBUG/10 ringMaster:949 - exit code 7
    [2010-06-15 10:12:02,859] DEBUG/10 ringMaster:983 - stopping ringmaster instance
    [2010-06-15 10:12:03,420] DEBUG/10 ringMaster:479 - getServiceAddr name: mapred
    [2010-06-15 10:12:03,421] DEBUG/10 ringMaster:487 - getServiceAddr
    service: <hodlib.GridServices.mapred.MapReduce instance at 0xc9e098>
    [2010-06-15 10:12:03,422] DEBUG/10 ringMaster:504 - getServiceAddr
    addr mapred: not found
    [2010-06-15 10:12:03,852] DEBUG/10 idleJobTracker:79 - Joining the
    monitoring thread.
    [2010-06-15 10:12:03,853] DEBUG/10 idleJobTracker:83 - Joined the
    monitoring thread.
    [2010-06-15 10:12:04,442] DEBUG/10 ringMaster:793 - Cleaned up
    temporary dir: /scratch/local/dmilne/hod/dmilne.34350.symphony.cs.waikato.ac.nz.ringmaster
    [2010-06-15 10:12:04,477] DEBUG/10 ringMaster:976 - RingMaster stop
    method invoked.
    [2010-06-15 10:12:04,478] DEBUG/10 ringMaster:1014 - returning from main





  • Vinod KV at Jun 15, 2010 at 8:11 am

    On Tuesday 15 June 2010 04:19 AM, David Milne wrote:
    [2010-06-15 10:07:52,470] DEBUG/10 torque:147 - pbsdsh command:
    /opt/torque-2.4.5/bin/pbsdsh
    /home/dmilne/hadoop/hadoop-0.20.1/contrib/hod/bin/hodring
    --hodring.tarball-retry-initial-time 1.0
    --hodring.cmd-retry-initial-time 2.0 --hodring.cmd-retry-interval 2.0
    --hodring.service-id 34350.symphony.cs.waikato.ac.nz
    --hodring.temp-dir /scratch/local/dmilne/hod --hodring.http-port-range
    8000-9000 --hodring.userid dmilne --hodring.java-home /opt/jdk1.6.0_20
    --hodring.svcrgy-addr symphony.cs.waikato.ac.nz:36372
    --hodring.download-addr h:t --hodring.tarball-retry-interval 3.0
    --hodring.log-dir /scratch/local/dmilne/hod/log
    --hodring.mapred-system-dir-root /mapredsystem
    --hodring.xrs-port-range 32768-65536 --hodring.debug 4
    --hodring.ringmaster-xrs-addr cn71:33771 --hodring.register
    [2010-06-15 10:07:52,475] DEBUG/10 ringMaster:929 - Returned from runWorkers.

    //chorus (many times)
    Did you mean the pbsdsh command itself was printed many times above?
    That should not happen.

    I previously thought the hodrings could not start the NameNode, but
    it looks like the hodrings themselves failed to start up. You can do
    two things (see the sketch after this list):
    - Check the qstat output, log into the slave nodes where your job was
    supposed to start, and look at the hodring logs there.
    - Run the above hodring command yourself directly on those slave
    nodes and see if it fails with some error.
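    A rough sketch of those two checks, using the Torque install and log
    directory from the hodrc and the job id / node name from the logs
    above; the hodring log file names themselves are a guess:

    # which nodes did Torque hand the job?
    /opt/torque-2.4.5/bin/qstat -f 34350.symphony.cs.waikato.ac.nz | grep exec_host
    # inspect the hodring logs on one of them (log-dir per the hodrc)
    ssh cn71 'ls -lt /scratch/local/dmilne/hod/log && tail -100 /scratch/local/dmilne/hod/log/*'
    # or re-run the hodring command from the pbsdsh line above by hand
    # on that node and watch where it fails
    ssh cn71 '/home/dmilne/hadoop/hadoop-0.20.1/contrib/hod/bin/hodring --hodring.ringmaster-xrs-addr cn71:33771 ...'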

    +Vinod
