Hi Lyall,
Yes, that happens often. opmn keeps a state file for every process
it's supposed to manage. When a process exits and another process
reuses its PID between polls, opmn goes on thinking that the managed
process is still running, but since it's a different process the
response is not what opmn expects. Therefore the process is marked
"stop" rather than "down". When you then do a stopall, opmn actively
tries to kill the new process, but it won't succeed unless that
process runs under the same userid as opmn.
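One way to spot this situation before doing a stopall is to check who owns the PID that opmnctl status reports. The following is only a rough sketch, assuming Linux with /proc mounted and GNU stat; the function name pid_owned_by_uid is made up for illustration and is not part of any Oracle tool.

```shell
# Hypothetical helper (not an Oracle tool): succeed only if the process
# with PID $1 is owned by numeric uid $2. Assumes Linux /proc and GNU stat.
pid_owned_by_uid() {
    pid=$1
    expected_uid=$2
    # /proc/<pid> is owned by the user the process runs as;
    # if the PID does not exist, stat fails and we return non-zero.
    owner_uid=$(stat -c %u "/proc/$pid" 2>/dev/null) || return 1
    [ "$owner_uid" = "$expected_uid" ]
}

# Example use, with the PID taken from the opmnctl status output:
#   pid_owned_by_uid 2047 "$(id -u oracle)" || echo "PID 2047 is not an oracle process"
```

If the check fails for a PID that opmn lists, opmn is most likely tracking a recycled PID, as described above.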
To get out of this, after doing the stopall (and verifying that all OAS
processes are really down e.g. using ps), delete all files in
${ORACLE_HOME}/opmn/logs/states.
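The cleanup step can be sketched roughly as below. This is an illustration, not an Oracle-supplied procedure: the function name clear_opmn_states is made up, and it assumes a find that supports -maxdepth (GNU or BSD). It only removes regular files directly inside the directory you pass in and leaves the directory itself in place.

```shell
# Hypothetical helper: delete the files in an opmn states directory.
# Run it only after "opmnctl stopall" and after checking with ps that
# no OAS processes are left.
clear_opmn_states() {
    states_dir=$1
    if [ ! -d "$states_dir" ]; then
        echo "no such directory: $states_dir" >&2
        return 1
    fi
    # Remove regular files directly in the directory; keep the directory.
    find "$states_dir" -maxdepth 1 -type f -exec rm -f {} +
}

# Intended sequence (commented out so that sourcing this file is harmless):
#   opmnctl stopall
#   ps -ef | grep "$ORACLE_HOME"      # verify nothing is still running
#   clear_opmn_states "${ORACLE_HOME}/opmn/logs/states"
#   opmnctl startall
```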
Be aware that emd also keeps track of process ports in the file
${ORACLE_HOME}/sysman/emd/targets.xml. If they don't match the
ports used by opmn, you'll find that opmn reports a process as "alive"
whereas it is shown as down in the enterprise manager. To compound things
further, dcm keeps a log of the states that processes should be in. If you
manually kill and restart processes, say an OC4J process, you may get
into a situation where dcm forces the process down again, since the last
command it logged was a stop command. You may have to use dcmctl in
combination with opmnctl to repair this.
Killing system processes such as xinetd is not usually a good idea.
Hope this helps,
Tony
Around 12/09/2009 12:14 AM, Lyall Barbour said:
Has anyone ever seen this, where the iAS server console status is up and running and, in the opmnctl status output, the HTTP_Server is "Alive" but the OC4J process is "Stop" and its pid is the pid of another running process?
bss1.tri-c.edu: opmnctl status
Processes in Instance: bss1.bss1.tri-c.edu
-------------------+--------------------+---------+---------
ias-component      | process-type       |     pid | status
-------------------+--------------------+---------+---------
LogLoader          | logloaderd         |     N/A | Down
dcm-daemon         | dcm-daemon         |     N/A | Down
OC4J               | home               |   30867 | Alive
HTTP_Server        | HTTP_Server        |   30868 | Alive
DSA                | DSA                |     N/A | Down
I fixed the problem, but wanted to know if anybody has seen this, so that I can keep it from happening again. The status of OC4J was Stop and the pid was 2047, which was the same pid as the running xinetd process. I tried a stopall, waited, then a startall, but OC4J really wanted to use pid 2047. So I logged in as root and ran killall xinetd, logged in as oracle and ran opmnctl stopall, waited, then ran xinetd -stayalive -pidfile /var/run/xinetd.pid, which is what our rc3.d script runs when the server boots. Xinetd didn't use 2047 anymore, but used:
bss1.tri-c.edu: ps -eaf|grep xinetd
oracle 591 356 0 10:13 pts/2 00:00:00 grep xinetd
root 30802 1 0 09:51 ? 00:00:00 xinetd -stayalive -pidfile /var/run/xinetd.pid
and, of course, after the new opmnctl startall, OC4J is using the pid shown in the status output above.
Anyone seen that?
Thanks,
Lyall