I'm a sysadmin, not a DBA, and I inherited a legacy app after the developer
left the company. It's 4 early mod_perl (1.29) + early DBI (1.43) app
servers, going against postgres 7.4.6.

The DB just crapped itself a few days ago. In the postmortem, we found out
that the number of processes on the server had been climbing as the server
ran, going from having some 100 processes, to about 350, which we believe
were mostly idle postgres processes (someone else got the page-out for
support). The uptime was around 6 months.

In the 3 days since it died, I've been watching it and, once again, the
processes are climbing slowly again. They're idle postgres processes,
fairly evenly distributed against the app servers, and more interestingly,
if I do lsof|grep postgres, I see a large number of lines (now 55):

postmaste 24521 postgres 55u REG 58,0 16777216 2899982
/a/path/pdm/pgsql/data/pg_xlog/0000030F000000E2 (deleted)

I looked in the 7.4 docs about WAL. checkpoint_timeout is 300,
checkpoint_segments is 8, plenty of space in pg_xlog, but there's 18 files
in there, some a few hours old, which I would suspect would not be the case
if it were checkpointing properly.

Am I chasing up the right tree over these leaking processes/connections, or
is the WAL logs thing just a red herring? Sadly, I can't edit the app code,
and I doubt I could upgrade the DB unless I can really business-justify that
a minor change would do it (I'd probably have to stay in the 7.4 series due
to timid managers), but if there's little baby tweaks I'm missing, or if I
can say "yeah, this version of postgres leaks, plan for 3-month-reboots or
moving to 7.4.x", it'd work for me.

Thanks for any help!
-Greg

_________________________________________________________________
Dave vs. Carl: The Insignificant Championship Series.  Who will win?
http://clk.atdmt.com/MSN/go/msnnkwsp0070000001msn/direct/01/?href=http://davevscarl.spaces.live.com/?icid=T001MSN38C07001

Search Discussions

  • Brandon Aiken at Jan 3, 2007 at 10:27 pm
    I highly recommend moving to the latest 7.4.x. 7.4.6 is well over 2 years old, and the numerous security issues alone should be convincing enough to upgrade. Updating this far appears to be a multi-step process, though. See the release notes.

    There was a race condition prior to 7.4.10 in transaction logging:
    http://www.postgresql.org/docs/7.4/interactive/release-7-4-10.html

    There's also a pg_clog race condition in 7.4.12, but that doesn't seem obviously related to me.

    You didn't say if you're using clustering, but you may need to use this to correct this issue.
    http://www.postgresql.org/docs/7.4/interactive/app-pgresetxlog.html

    I am by no means an expert with PostgreSQL; however, it seems likely that at least some of the problems you're having have been identified and corrected.


    --
    Brandon Aiken
    CS/IT Systems Engineer
    -----Original Message-----
    From: pgsql-novice-owner@postgresql.org On Behalf Of Greg Cox
    Sent: Wednesday, January 03, 2007 4:52 PM
    To: pgsql-novice@postgresql.org
    Subject: [NOVICE] Excess idle processes, leaking WAL logs?

    I'm a sysadmin, not a DBA, and I inherited a legacy app after the developer
    left the company. It's 4 early mod_perl (1.29) + early DBI (1.43) app
    servers, going against postgres 7.4.6.

    The DB just crapped itself a few days ago. In the postmortem, we found out
    that the number of processes on the server had been climbing as the server
    ran, going from having some 100 processes, to about 350, which we believe
    were mostly idle postgres processes (someone else got the page-out for
    support). The uptime was around 6 months.

    In the 3 days since it died, I've been watching it and, once again, the
    processes are climbing slowly again. They're idle postgres processes,
    fairly evenly distributed against the app servers, and more interestingly,
    if I do lsof|grep postgres, I see a large number of lines (now 55):

    postmaste 24521 postgres 55u REG 58,0 16777216 2899982
    /a/path/pdm/pgsql/data/pg_xlog/0000030F000000E2 (deleted)

    I looked in the 7.4 docs about WAL. checkpoint_timeout is 300,
    checkpoint_segments is 8, plenty of space in pg_xlog, but there's 18 files
    in there, some a few hours old, which I would suspect would not be the case
    if it were checkpointing properly.

    Am I chasing up the right tree over these leaking processes/connections, or
    is the WAL logs thing just a red herring? Sadly, I can't edit the app code,
    and I doubt I could upgrade the DB unless I can really business-justify that
    a minor change would do it (I'd probably have to stay in the 7.4 series due
    to timid managers), but if there's little baby tweaks I'm missing, or if I
    can say "yeah, this version of postgres leaks, plan for 3-month-reboots or
    moving to 7.4.x", it'd work for me.

    Thanks for any help!
    -Greg

    _________________________________________________________________
    Dave vs. Carl: The Insignificant Championship Series.  Who will win?
    http://clk.atdmt.com/MSN/go/msnnkwsp0070000001msn/direct/01/?href=http://davevscarl.spaces.live.com/?icid=T001MSN38C07001


    ---------------------------(end of broadcast)---------------------------
    TIP 4: Have you searched our list archives?

    http://archives.postgresql.org
  • Tom Lane at Jan 3, 2007 at 11:13 pm

    "Greg Cox" <ratness@hotmail.com> writes:
    In the 3 days since it died, I've been watching it and, once again, the
    processes are climbing slowly again. They're idle postgres processes,
    fairly evenly distributed against the app servers, and more interestingly,
    This isn't a postgres bug, it's a problem in your application code; it's
    not closing sessions it doesn't need. Possibly a problem in a
    connection-pooling layer?
    I looked in the 7.4 docs about WAL. checkpoint_timeout is 300,
    checkpoint_segments is 8, plenty of space in pg_xlog, but there's 18 files
    in there, some a few hours old, which I would suspect would not be the case
    if it were checkpointing properly.
    No, that doesn't sound out of line. The steady state is to have about
    2 * checkpoint_segments + 1 WAL files. You might have one or so more
    than I'd have expected but that's not enough to evidence a problem.
    If you are concerned about whether checkpoints are really happening,
    use pg_controldata to keep an eye on the last-checkpoint time in
    pg_control.

    I concur with the other comment that you really ought to be on 7.4.latest
    but that's not going to magically fix your connection-pooling issue.

    regards, tom lane

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
grouppgsql-novice @
categoriespostgresql
postedJan 3, '07 at 9:52p
activeJan 3, '07 at 11:13p
posts3
users3
websitepostgresql.org
irc#postgresql

People

Translate

site design / logo © 2022 Grokbase