Thanks everyone for your replies.
Even though HOD looks like a dead end, I would prefer to use it. I am
just one user of the cluster among many, and currently the only one
using Hadoop. The jobs I need to run are pretty much one-off: they are
big jobs that I can't do without Hadoop, but I might need to run them
once a month or less. The ability to provision MapReduce and HDFS when
I need it sounds ideal.
Following Vinod's advice, I have rolled back to Hadoop 0.20.1 (the
last version that HOD kept up with) and taken a closer look at the
ringmaster logs. However, I am still getting the same problems as
before, and I can't find anything in the logs to help me identify the
NameNode.
The full ringmaster log is below. It's a pretty repetitive song, so
I've identified the chorus.
[2010-06-15 10:07:40,236] DEBUG/10 ringMaster:569 - Getting service ID.
[2010-06-15 10:07:40,237] DEBUG/10 ringMaster:573 - Got service ID:
34350.symphony.cs.waikato.ac.nz
[2010-06-15 10:07:40,239] DEBUG/10 ringMaster:756 - Command to
execute: /bin/cp /home/dmilne/hadoop/hadoop-0.20.1.tar.gz
/scratch/local/dmilne/hod/dmilne.34350.symphony.cs.waikato.ac.nz.ringmaster
[2010-06-15 10:07:42,314] DEBUG/10 ringMaster:762 - Completed command
execution. Exit Code: 0.
[2010-06-15 10:07:42,315] DEBUG/10 ringMaster:591 - Service registry @
http://symphony.cs.waikato.ac.nz:36372
[2010-06-15 10:07:47,503] DEBUG/10 ringMaster:726 - tarball name :
/scratch/local/dmilne/hod/dmilne.34350.symphony.cs.waikato.ac.nz.ringmaster/hadoop-0.20.1.tar.gz
hadoop package name : hadoop-0.20.1/
[2010-06-15 10:07:47,505] DEBUG/10 ringMaster:716 - Returning Hadoop
directory as: /scratch/local/dmilne/hod/dmilne.34350.symphony.cs.waikato.ac.nz.ringmaster/hadoop-0.20.1/
[2010-06-15 10:07:47,515] DEBUG/10 util:215 - Executing command
/scratch/local/dmilne/hod/dmilne.34350.symphony.cs.waikato.ac.nz.ringmaster/hadoop-0.20.1/bin/hadoop
version to find hadoop version
[2010-06-15 10:07:48,241] DEBUG/10 util:224 - Version from hadoop
command: Hadoop 0.20.1
[2010-06-15 10:07:48,244] DEBUG/10 ringMaster:117 - Using max-connect value 30
[2010-06-15 10:07:48,246] INFO/20 ringMaster:61 - Twisted interface
not found. Using hodXMLRPCServer.
[2010-06-15 10:07:48,257] DEBUG/10 ringMaster:73 - Ringmaster RPC
Server at 33771
[2010-06-15 10:07:48,265] DEBUG/10 ringMaster:121 - registering:
http://cn71:8030/hadoop-0.20.1.tar.gz
[2010-06-15 10:07:48,275] DEBUG/10 ringMaster:658 - dmilne
34350.symphony.cs.waikato.ac.nz cn71.symphony.cs.waikato.ac.nz
ringmaster hod
[2010-06-15 10:07:48,307] DEBUG/10 ringMaster:670 - Registered with
serivce registry:
http://symphony.cs.waikato.ac.nz:36372.
//chorus start
[2010-06-15 10:07:48,393] DEBUG/10 ringMaster:479 - getServiceAddr name: hdfs
[2010-06-15 10:07:48,394] DEBUG/10 ringMaster:487 - getServiceAddr
service: <hodlib.GridServices.hdfs.Hdfs instance at 0xc9e050>
[2010-06-15 10:07:48,395] DEBUG/10 ringMaster:504 - getServiceAddr
addr hdfs: not found
//chorus end
//chorus (3x)
[2010-06-15 10:07:51,461] DEBUG/10 ringMaster:726 - tarball name :
/scratch/local/dmilne/hod/dmilne.34350.symphony.cs.waikato.ac.nz.ringmaster/hadoop-0.20.1.tar.gz
hadoop package name : hadoop-0.20.1/
[2010-06-15 10:07:51,463] DEBUG/10 ringMaster:716 - Returning Hadoop
directory as: /scratch/local/dmilne/hod/dmilne.34350.symphony.cs.waikato.ac.nz.ringmaster/hadoop-0.20.1/
[2010-06-15 10:07:51,465] DEBUG/10 ringMaster:690 -
hadoopdir=/scratch/local/dmilne/hod/dmilne.34350.symphony.cs.waikato.ac.nz.ringmaster/hadoop-0.20.1/,
java-home=/opt/jdk1.6.0_20
[2010-06-15 10:07:51,470] DEBUG/10 util:215 - Executing command
/scratch/local/dmilne/hod/dmilne.34350.symphony.cs.waikato.ac.nz.ringmaster/hadoop-0.20.1/bin/hadoop
version to find hadoop version
//chorus (1x)
[2010-06-15 10:07:52,448] DEBUG/10 util:224 - Version from hadoop
command: Hadoop 0.20.1
[2010-06-15 10:07:52,450] DEBUG/10 ringMaster:697 - starting jt monitor
[2010-06-15 10:07:52,453] DEBUG/10 ringMaster:913 - Entered start method.
[2010-06-15 10:07:52,455] DEBUG/10 ringMaster:924 -
/home/dmilne/hadoop/hadoop-0.20.1/contrib/hod/bin/hodring
--hodring.tarball-retry-initial-time 1.0
--hodring.cmd-retry-initial-time 2.0 --hodring.cmd-retry-interval 2.0
--hodring.service-id 34350.symphony.cs.waikato.ac.nz
--hodring.temp-dir /scratch/local/dmilne/hod --hodring.http-port-range
8000-9000 --hodring.userid dmilne --hodring.java-home /opt/jdk1.6.0_20
--hodring.svcrgy-addr symphony.cs.waikato.ac.nz:36372
--hodring.download-addr h:t --hodring.tarball-retry-interval 3.0
--hodring.log-dir /scratch/local/dmilne/hod/log
--hodring.mapred-system-dir-root /mapredsystem
--hodring.xrs-port-range 32768-65536 --hodring.debug 4
--hodring.ringmaster-xrs-addr cn71:33771 --hodring.register
[2010-06-15 10:07:52,456] DEBUG/10 ringMaster:479 - getServiceAddr name: mapred
[2010-06-15 10:07:52,458] DEBUG/10 ringMaster:487 - getServiceAddr
service: <hodlib.GridServices.mapred.MapReduce instance at 0xc9e098>
[2010-06-15 10:07:52,460] DEBUG/10 ringMaster:504 - getServiceAddr
addr mapred: not found
[2010-06-15 10:07:52,470] DEBUG/10 torque:147 - pbsdsh command:
/opt/torque-2.4.5/bin/pbsdsh
/home/dmilne/hadoop/hadoop-0.20.1/contrib/hod/bin/hodring
--hodring.tarball-retry-initial-time 1.0
--hodring.cmd-retry-initial-time 2.0 --hodring.cmd-retry-interval 2.0
--hodring.service-id 34350.symphony.cs.waikato.ac.nz
--hodring.temp-dir /scratch/local/dmilne/hod --hodring.http-port-range
8000-9000 --hodring.userid dmilne --hodring.java-home /opt/jdk1.6.0_20
--hodring.svcrgy-addr symphony.cs.waikato.ac.nz:36372
--hodring.download-addr h:t --hodring.tarball-retry-interval 3.0
--hodring.log-dir /scratch/local/dmilne/hod/log
--hodring.mapred-system-dir-root /mapredsystem
--hodring.xrs-port-range 32768-65536 --hodring.debug 4
--hodring.ringmaster-xrs-addr cn71:33771 --hodring.register
[2010-06-15 10:07:52,475] DEBUG/10 ringMaster:929 - Returned from runWorkers.
//chorus (many times)
[2010-06-15 10:12:02,852] DEBUG/10 ringMaster:530 - inside xml-rpc
call to stop ringmaster
[2010-06-15 10:12:02,853] DEBUG/10 ringMaster:976 - RingMaster stop
method invoked.
[2010-06-15 10:12:02,854] DEBUG/10 ringMaster:981 - finding exit code
//chorus (1x)
[2010-06-15 10:12:02,858] DEBUG/10 ringMaster:533 - returning from
xml-rpc call to stop ringmaster
[2010-06-15 10:12:02,859] DEBUG/10 ringMaster:949 - exit code 7
[2010-06-15 10:12:02,859] DEBUG/10 ringMaster:983 - stopping ringmaster instance
[2010-06-15 10:12:03,420] DEBUG/10 ringMaster:479 - getServiceAddr name: mapred
[2010-06-15 10:12:03,421] DEBUG/10 ringMaster:487 - getServiceAddr
service: <hodlib.GridServices.mapred.MapReduce instance at 0xc9e098>
[2010-06-15 10:12:03,422] DEBUG/10 ringMaster:504 - getServiceAddr
addr mapred: not found
[2010-06-15 10:12:03,852] DEBUG/10 idleJobTracker:79 - Joining the
monitoring thread.
[2010-06-15 10:12:03,853] DEBUG/10 idleJobTracker:83 - Joined the
monitoring thread.
[2010-06-15 10:12:04,442] DEBUG/10 ringMaster:793 - Cleaned up
temporary dir: /scratch/local/dmilne/hod/dmilne.34350.symphony.cs.waikato.ac.nz.ringmaster
[2010-06-15 10:12:04,477] DEBUG/10 ringMaster:976 - RingMaster stop
method invoked.
[2010-06-15 10:12:04,478] DEBUG/10 ringMaster:1014 - returning from main
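For anyone wading through a similarly repetitive ringmaster log, here is a minimal sketch of how the "chorus" could be tallied per service instead of read by eye. It assumes the log has been saved as plain text and that the failing lookups always read "getServiceAddr addr <service>: not found", as they do above. The function name count_not_found is made up for illustration and is not part of HOD.

```python
import re
from collections import Counter

# Pattern for the failing lookup seen in the log above, e.g.
# "getServiceAddr addr hdfs: not found". The service name is captured.
NOT_FOUND = re.compile(r"getServiceAddr\s+addr\s+(\w+):\s+not found")


def count_not_found(log_text):
    """Count failed getServiceAddr lookups per service name.

    Long log records are often wrapped across lines (by the mail client,
    for instance), so all whitespace is collapsed before matching.
    """
    joined = " ".join(log_text.split())
    return Counter(NOT_FOUND.findall(joined))


# Hypothetical usage, assuming the log was saved to ringmaster.log:
#   with open("ringmaster.log") as handle:
#       print(count_not_found(handle.read()))
```

This only summarises how often each service address was missing. It does not say which node was meant to run the NameNode.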
On Mon, Jun 14, 2010 at 5:52 PM, Vinod KV wrote:
On Monday 14 June 2010 08:03 AM, David Milne wrote:
Anybody? I am completely stuck here. I have no idea who else I can ask
or where I can go for more information. Is there somewhere specific
where I should be asking about HOD?
Thank you,
Dave
In the ringmaster logs, you should see which node was supposed to run
Namenode. This can be found above the logs that you've printed. I can barely
remember but I guess it reads something like getCommand(). Once you find out
the node, check the hodring logs there, something must have gone wrong
there.
The return code was 7 - indicating HDFS failure. See
http://hadoop.apache.org/common/docs/r0.20.0/hod_user_guide.html#The+Exit+Codes+For+HOD+Are+Not+Getting+Into+Torque
and check if you are hitting one of the problems listed there.
HTH,
+vinod
On Thu, Jun 10, 2010 at 2:56 PM, David Milne wrote:
Hi there,
I am trying to get Hadoop on Demand up and running, but am having
problems with the ringmaster not being able to communicate with HDFS.
The output from the hod allocate command ends with this, with full
verbosity:
[2010-06-10 14:40:22,650] CRITICAL/50 hadoop:298 - Failed to retrieve
'hdfs' service address.
[2010-06-10 14:40:22,654] DEBUG/10 hadoop:631 - Cleaning up cluster id
34029.symphony.cs.waikato.ac.nz, as cluster could not be allocated.
[2010-06-10 14:40:22,655] DEBUG/10 hadoop:635 - Calling rm.stop()
[2010-06-10 14:40:22,665] DEBUG/10 hadoop:637 - Returning from rm.stop()
[2010-06-10 14:40:22,666] CRITICAL/50 hod:401 - Cannot allocate
cluster /home/dmilne/hadoop/cluster
[2010-06-10 14:40:23,090] DEBUG/10 hod:597 - return code: 7
I've attached the hodrc file below, but briefly, HOD is supposed to
provision an HDFS cluster as well as a Map/Reduce cluster, and seems
to be failing to do so. The ringmaster log looks like this:
[2010-06-10 14:36:05,144] DEBUG/10 ringMaster:479 - getServiceAddr name:
hdfs
[2010-06-10 14:36:05,145] DEBUG/10 ringMaster:487 - getServiceAddr
service:<hodlib.GridServices.hdfs.Hdfs instance at 0x8f97e8>
[2010-06-10 14:36:05,147] DEBUG/10 ringMaster:504 - getServiceAddr
addr hdfs: not found
[2010-06-10 14:36:06,195] DEBUG/10 ringMaster:479 - getServiceAddr name:
hdfs
[2010-06-10 14:36:06,197] DEBUG/10 ringMaster:487 - getServiceAddr
service:<hodlib.GridServices.hdfs.Hdfs instance at 0x8f97e8>
[2010-06-10 14:36:06,198] DEBUG/10 ringMaster:504 - getServiceAddr
addr hdfs: not found
... and so on, until it gives up
Any ideas why? One red flag is that when running the allocate command,
some of the variables echoed back look dodgy:
--gridservice-hdfs.fs_port 0
--gridservice-hdfs.host localhost
--gridservice-hdfs.info_port 0
These are not what I specified in the hodrc. Are the port numbers just
set to 0 because I am not using an external HDFS, or is this a
problem?
The software versions involved are:
- Hadoop 0.20.2
- Python 2.5.2 (no Twisted)
- Java 1.6.0_20
- Torque 2.4.5
The hodrc file looks like this:
[hod]
stream = True
java-home = /opt/jdk1.6.0_20
cluster = debian5
cluster-factor = 1.8
xrs-port-range = 32768-65536
debug = 3
allocate-wait-time = 3600
temp-dir = /scratch/local/dmilne/hod
[ringmaster]
register = True
stream = False
temp-dir = /scratch/local/dmilne/hod
log-dir = /scratch/local/dmilne/hod/log
http-port-range = 8000-9000
idleness-limit = 864000
work-dirs = /scratch/local/dmilne/hod/1,/scratch/local/dmilne/hod/2
xrs-port-range = 32768-65536
debug = 4
[hodring]
stream = False
temp-dir = /scratch/local/dmilne/hod
log-dir = /scratch/local/dmilne/hod/log
register = True
java-home = /opt/jdk1.6.0_20
http-port-range = 8000-9000
xrs-port-range = 32768-65536
debug = 4
[resource_manager]
queue = express
batch-home = /opt/torque-2.4.5
id = torque
options = l:pmem=3812M,W:X="NACCESSPOLICY:SINGLEJOB"
#env-vars = HOD_PYTHON_HOME=/foo/bar/python-2.5.1/bin/python
[gridservice-mapred]
external = False
pkgs = /opt/hadoop-0.20.2
tracker_port = 8030
info_port = 50080
[gridservice-hdfs]
external = False
pkgs = /opt/hadoop-0.20.2
fs_port = 8020
info_port = 50070
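To compare what the hodrc specifies against what the allocate command echoes back, something like the following sketch might help. It uses the modern Python standard library configparser, not HOD's own config handling, and it assumes that any mail-wrapped lines in the hodrc have been rejoined so each option sits on one line. The helper names (echoed_options, hodrc_options, mismatches) are hypothetical.

```python
import configparser
import re

# Matches echoed options of the form "--<section>.<key> <value>",
# e.g. "--gridservice-hdfs.fs_port 0".
ECHOED = re.compile(r"--([\w-]+)\.(\S+)\s+(\S+)")


def echoed_options(text, section):
    """Collect the key/value pairs echoed for one section."""
    return {key: value
            for sec, key, value in ECHOED.findall(text)
            if sec == section}


def hodrc_options(hodrc_text, section):
    """Read one section of the hodrc as a plain dict of strings."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(hodrc_text)
    return dict(parser[section])


def mismatches(hodrc_text, echoed_text, section="gridservice-hdfs"):
    """Return {key: (configured, echoed)} for keys whose values differ."""
    configured = hodrc_options(hodrc_text, section)
    echoed = echoed_options(echoed_text, section)
    return {key: (configured[key], echoed[key])
            for key in configured.keys() & echoed.keys()
            if configured[key] != echoed[key]}
```

Run against the [gridservice-hdfs] section above and the three echoed lines, this would flag fs_port and info_port as differing. It cannot tell whether the zeros are expected when external = False.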
Cheers,
Dave