Our applications connect to only 1 node of a 2-node 10gR2 (10.2.0.5)
RAC system on Solaris 10. ASM is configured.
TPS is mostly around 5, occasionally reaching 11.
There is hardly any load on the server. CPU usage is around 8% on an HP
DL380 box (2 CPUs with 6 cores each). Of that 8%, 7.9% is consumed by
lgwr, and I don't understand why. prstat -Lw for the lgwr pid shows
two LWPs, one of which shows around 50-60% CPU in SYS mode.
To troubleshoot, I checked the statspack and AWR reports and found that
most of the time is spent on log file sync (avg 75 msec). For log file
parallel write, the avg is 69 msec. I understand that is too high, but
iostat always shows 0.2 msec. Why the mismatch?
So I enabled 10046 tracing for lgwr. It showed only log file parallel
write waits, and ela was in line with what AWR reports.
asmiostat shows 0 msec, and then every 6-8 seconds it shows 135 msec.
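
A rough way to check whether the average from the lgwr trace hides a few very long writes is to bucket the individual waits. The sketch below is only an illustration: it assumes the usual 10g WAIT line layout (nam='...' ela= N) with ela in microseconds, and the bucket boundaries and function names are arbitrary choices, not part of any Oracle tool.

```python
import re
import sys
from collections import Counter

# Matches WAIT lines such as (layout assumed from 10g trace files):
# WAIT #0: nam='log file parallel write' ela= 69012 files=2 blocks=8 ...
WAIT_RE = re.compile(r"nam='(?P<event>[^']+)' ela= *(?P<ela>\d+)")

# Upper bounds of the latency buckets in milliseconds (arbitrary choice).
BUCKETS_MS = (1, 4, 16, 64, 256, 1024)


def bucket_label(ms):
    """Return the label of the first bucket whose upper bound covers ms."""
    for bound in BUCKETS_MS:
        if ms <= bound:
            return "<= %d ms" % bound
    return "> %d ms" % BUCKETS_MS[-1]


def histogram(lines, event="log file parallel write"):
    """Count waits for one event per bucket; ela is assumed to be microseconds."""
    counts = Counter()
    for line in lines:
        match = WAIT_RE.search(line)
        if match and match.group("event") == event:
            counts[bucket_label(int(match.group("ela")) / 1000.0)] += 1
    return counts


def report(path):
    """Print the bucket counts for one trace file, smallest bucket first."""
    with open(path) as trace:
        counts = histogram(trace)
    labels = ["<= %d ms" % b for b in BUCKETS_MS] + ["> %d ms" % BUCKETS_MS[-1]]
    for label in labels:
        print("%-12s %d" % (label, counts[label]))


# Only runs when a trace file path is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    report(sys.argv[1])
```

If most waits land in the lowest bucket and a handful land far above it, that would be consistent with iostat showing 0.2 msec while the average stays high.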
But I think the asmiostat script available on Metalink has a bug: what
it gathers is in centiseconds and what it displays is in milliseconds,
so it should multiply by 10, but it multiplies by 1000.
Please correct me if I am wrong here.
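
The unit arithmetic in question can be sketched as follows. This is not the asmiostat code itself; the function names are made up for illustration, and the sketch assumes the reading above is right (raw values in centiseconds, display in milliseconds, scaled by 1000 instead of 10).

```python
CS_TO_MS = 10  # 1 centisecond = 10 milliseconds


def cs_to_ms(centiseconds):
    """Convert a time in centiseconds to milliseconds."""
    return centiseconds * CS_TO_MS


def correct_overscaled_ms(displayed_ms, wrong_factor=1000):
    """Undo a display value that was scaled by wrong_factor instead of 10."""
    return displayed_ms / wrong_factor * CS_TO_MS


print(cs_to_ms(1))
print(correct_overscaled_ms(135))
```

Under that assumption the displayed values are inflated 100 times, so a reported 135 msec would correspond to roughly 1.35 msec.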
As a result of the above, the system is very vulnerable to contention.
At times, application sessions wait as long as 10-15 seconds on
log file sync, and as a result the apps restart.