I'm a sysadmin, not a DBA, and I inherited a legacy app after the developer
left the company. It's four app servers running early mod_perl (1.29) and
early DBI (1.43), going against postgres 7.4.6.
The DB just crapped itself a few days ago. In the postmortem, we found that
the number of processes on the server had been climbing the whole time it
ran, from around 100 to about 350, which we believe were mostly idle
postgres processes (someone else got paged for support). The uptime was
around 6 months.
In the 3 days since it died, I've been watching it, and once again the
process count is climbing slowly. They're idle postgres processes, fairly
evenly distributed across the app servers. More interestingly, if I do
lsof | grep postgres, I see a large number of lines like this (55 now):
postmaste 24521 postgres 55u REG 58,0 16777216 2899982
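For what it's worth, this is roughly how I've been counting them. A sketch only, assuming a procps-style ps and relying on the fact that 7.4 backends rewrite their argv to show their state (the `count_idle` helper name is mine):

```shell
# Count idle PostgreSQL backends from ps-style command lines.
# On 7.4 each backend's argv looks something like
# "postgres: appuser appdb 10.0.0.5 idle", so no DB access is needed.
count_idle() {
    grep -c 'idle$'    # lines whose command string ends in "idle"
}

# Live usage (assumption: procps ps; backends show up as "postmaster"):
#   ps -o args= -C postmaster | count_idle
```

Counting per client IP (awk on the fourth field) shows me the distribution across the app servers.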
I looked in the 7.4 docs about WAL. checkpoint_timeout is 300,
checkpoint_segments is 8, and there's plenty of space in pg_xlog, but there
are 18 files in there, some a few hours old, which I suspect would not be
the case if it were checkpointing properly.
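In case it helps anyone reproduce what I'm seeing, this is how I'm eyeballing that directory. A rough sketch; `wal_report` is my own name for it, and I'm assuming the segments live directly under $PGDATA/pg_xlog:

```shell
# Report how many files sit in a pg_xlog directory and which is oldest.
# Prints: "<count> <oldest-file>"
wal_report() {
    dir="$1"
    count=$(ls "$dir" | wc -l | tr -d ' ')   # number of segment files
    oldest=$(ls -t "$dir" | tail -n 1)       # ls -t lists newest first
    echo "$count $oldest"
}

# Live usage (assumption: PGDATA=/var/lib/pgsql/data):
#   wal_report /var/lib/pgsql/data/pg_xlog
```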
Am I barking up the right tree with these leaking processes/connections, or
is the WAL thing just a red herring? Sadly, I can't edit the app code, and
I doubt I could upgrade the DB unless I can business-justify that a minor
version change would fix it (I'd probably have to stay in the 7.4 series
due to timid managers). But if there are little baby tweaks I'm missing, or
if I can say "yeah, this version of postgres leaks, plan for 3-month
reboots or move to a later 7.4.x", that would work for me.
Thanks for any help!