A while back I noted that we had a problem with the postmaster failing
to recognize error exit from the startup process:
http://archives.postgresql.org/pgsql-hackers/2006-07/msg01485.php
The discussion with Stephen Harris about signal response brought this
back to mind --- as things stand, the only way that the xlog.c code
could report an unrecoverable error is to elog(PANIC). The problem
noted in the above message only applied in early startup of a
subprocess, but really we've got an issue with elog(FATAL) exits at
any point in a subprocess. (Note: in the startup process, any
elog(ERROR) is auto-promoted to elog(FATAL) by elog.c, because of the
lack of a setjmp handler to return to.) So the solution I proposed
before isn't enough.

The backend code is quite littered with elog(FATAL) calls that are meant
to indicate "this backend seems hopelessly confused, but there's no
reason to suppose there's a system-wide problem". So we don't want the
postmaster to engage in a panic restart if a normal backend goes down
with elog(FATAL). I claim, however, that that *would* be a good idea
for the startup process, and probably for the bgwriter too.

Rather than try to change a lot of elog call sites, what I'm thinking
would be a good plan is to make the FATAL-exit case in elog.c always
exit with exit(1) (right now it tests a couple of different conditions
to decide what to return). Then, in the postmaster, consider an exit
code of 1 to be either OK or not OK depending on which child it came
from. I think there are a small number of exit(1) calls that might
need to be changed to exit(2) because they are trying to force the
postmaster to do a panic restart, but it should be a minimal patch.
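
To make the proposal concrete, here is a minimal sketch of the postmaster-side decision, assuming the child's exit code has already been extracted from its wait status. The names ChildKind, CHILD_BACKEND, CHILD_STARTUP, CHILD_BGWRITER and exit_needs_panic_restart are hypothetical and chosen for illustration; they are not taken from the postmaster source, and children killed by a signal are not covered here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical child categories; not the postmaster's real identifiers. */
typedef enum ChildKind
{
    CHILD_BACKEND,              /* ordinary client backend */
    CHILD_STARTUP,              /* startup (recovery) process */
    CHILD_BGWRITER              /* background writer */
} ChildKind;

/*
 * Decide whether a child's exit code should trigger a panic restart.
 * exit_code is assumed to be the plain exit status (0..255) of a child
 * that terminated via exit(), not via a signal.
 */
bool
exit_needs_panic_restart(ChildKind kind, int exit_code)
{
    assert(exit_code >= 0 && exit_code <= 255);

    if (exit_code == 0)
        return false;           /* clean exit, nothing to do */

    if (exit_code == 1)
    {
        /*
         * elog(FATAL) exit: only this one backend was confused, unless
         * the child was the startup process or the bgwriter.
         */
        return kind != CHILD_BACKEND;
    }

    return true;                /* exit(2) or anything else: force restart */
}
```

With this shape, exit_needs_panic_restart(CHILD_BACKEND, 1) is false while exit_needs_panic_restart(CHILD_STARTUP, 1) is true, which is the "OK or not OK depending on which child it came from" behavior described above.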

Comments?

regards, tom lane

  • Alvaro Herrera at Nov 21, 2006 at 4:15 am

    Tom Lane wrote:

    > Rather than try to change a lot of elog call sites, what I'm thinking
    > would be a good plan is to make the FATAL-exit case in elog.c always
    > exit with exit(1) (right now it tests a couple of different conditions
    > to decide what to return). Then, in the postmaster, consider an exit
    > code of 1 to be either OK or not OK depending on which child it came
    > from. I think there are a small number of exit(1) calls that might
    > need to be changed to exit(2) because they are trying to force the
    > postmaster to do a panic restart, but it should be a minimal patch.

    I was going to suggest using symbolic names for exit codes instead of
    hardcoding 1 or 2. We do that in Mammoth Replicator, and use the exit
    codes to determine whether the postmaster needs to take special action
    for different replication scenarios, e.g. when a master server needs
    to become a slave or vice versa.

    --
    Alvaro Herrera http://www.CommandPrompt.com/
    The PostgreSQL Company - Command Prompt, Inc.
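
The symbolic-name suggestion could look roughly like the sketch below. The identifiers ProcExitCode, PROC_EXIT_OK, PROC_EXIT_FATAL, PROC_EXIT_PANIC and proc_exit_code_name are hypothetical; neither PostgreSQL nor Mammoth Replicator is claimed to use these names. Only the numeric values 1 and 2 come from the thread, and 0 for a clean exit is the usual convention.

```c
#include <assert.h>

/* Hypothetical symbolic names for the exit codes discussed in the thread. */
enum ProcExitCode
{
    PROC_EXIT_OK = 0,           /* normal termination */
    PROC_EXIT_FATAL = 1,        /* elog(FATAL): this process gave up */
    PROC_EXIT_PANIC = 2         /* ask the postmaster for a panic restart */
};

/* Pin the values so that existing exit(1)/exit(2) call sites stay compatible. */
static_assert(PROC_EXIT_FATAL == 1, "FATAL exit code must remain 1");
static_assert(PROC_EXIT_PANIC == 2, "panic-restart exit code must remain 2");

/* Map an exit code to a short label, e.g. for a postmaster log message. */
const char *
proc_exit_code_name(int code)
{
    switch (code)
    {
        case PROC_EXIT_OK:
            return "OK";
        case PROC_EXIT_FATAL:
            return "FATAL";
        case PROC_EXIT_PANIC:
            return "PANIC";
        default:
            return "UNKNOWN";
    }
}
```

A call site would then read exit(PROC_EXIT_PANIC) rather than exit(2), and further codes (for replication scenarios, say) could be added to the enum without touching the existing values.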

Discussion Overview
group: pgsql-hackers @
categories: postgresql
posted: Nov 20, '06 at 11:14p
active: Nov 21, '06 at 4:15a
posts: 2
users: 2
website: postgresql.org...
irc: #postgresql