Stephen Harris writes:
Doing a shutdown "immediate" isn't too clever because it actually leaves
the recovery threads running
LOG: restored log file "00000001000000010000003E" from archive
LOG: received immediate shutdown request
LOG: restored log file "00000001000000010000003F" from archive
Hm, that should work --- AFAICS the startup process should abort on
SIGQUIT the same as any regular backend.

[ thinks... ] Ah-hah, "man system(3)" tells the tale:

system() ignores the SIGINT and SIGQUIT signals, and blocks the
SIGCHLD signal, while waiting for the command to terminate. If this
might cause the application to miss a signal that would have killed
it, the application should examine the return value from system() and
take whatever action is appropriate to the application if the command
terminated due to receipt of a signal.

So the SIGQUIT went to the recovery script command and was missed by the
startup process. It looks to me like your script actually ignored the
signal, which you'll need to fix, but it also looks like we are not
checking for these cases in RestoreArchivedFile(), which we'd better fix.
As the code stands, if the recovery script is killed by a signal, we'd
take that as normal termination of the recovery and proceed to come up,
which is definitely the Wrong Thing.
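
For illustration, the kind of check that is missing would look roughly like
this (a hypothetical sketch only: the function and variable names are made up,
and this is not the actual RestoreArchivedFile() code):

    #include <stdlib.h>
    #include <sys/wait.h>

    /* Run a restore command and report death-by-signal as a failure. */
    static int
    run_restore_command(const char *command)
    {
        int rc = system(command);

        if (rc == -1)
            return -1;          /* could not launch the shell at all */
        if (WIFSIGNALED(rc))
            return -1;          /* the command was killed by a signal */
        if (WIFEXITED(rc) && WEXITSTATUS(rc) != 0)
            return -1;          /* the command exited with a failure code */
        return 0;               /* normal, successful termination */
    }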

regards, tom lane

  • Stephen Harris at Nov 18, 2006 at 2:22 am

    On Fri, Nov 17, 2006 at 05:03:44PM -0500, Tom Lane wrote:
    Stephen Harris <lists@spuddy.org> writes:
    Doing a shutdown "immediate" isn't too clever because it actually leaves
    the recovery threads running
    LOG: restored log file "00000001000000010000003E" from archive
    LOG: received immediate shutdown request
    LOG: restored log file "00000001000000010000003F" from archive
    Hm, that should work --- AFAICS the startup process should abort on
    SIGQUIT the same as any regular backend.

    [ thinks... ] Ah-hah, "man system(3)" tells the tale:

    system() ignores the SIGINT and SIGQUIT signals, and blocks the
    SIGCHLD signal, while waiting for the command to terminate. If this
    might cause the application to miss a signal that would have killed
    it, the application should examine the return value from system() and
    take whatever action is appropriate to the application if the command
    terminated due to receipt of a signal.

    So the SIGQUIT went to the recovery script command and was missed by the
    startup process. It looks to me like your script actually ignored the
    signal, which you'll need to fix, but it also looks like we are not
    My script was just a ksh script and didn't do anything special with signals.
    Essentially it does
    #!/bin/ksh -p

    [...variable setup...]
    while [ ! -f $wanted_file ]
    do
      if [ -f $abort_file ]
      then
        exit 1
      fi
      sleep 5
    done
    cat $wanted_file

    I know signals can be deferred in scripts (a signal sent to the script during
    the sleep will be deferred if a trap handler had been written for the signal)
    but they _do_ get delivered.

    However, it seems the signal wasn't sent at all. Once the wanted file
    appeared the recovery thread from postmaster started a _new_ script for
    the next log. I'll rewrite the script in perl (probably Monday when
    I'm back in the office) and stick lots of signal() traps in to see if
    anything does get sent to the script.
    As the code stands, if the recovery script is killed by a signal, we'd
    take that as normal termination of the recovery and proceed to come up,
    which is definitely the Wrong Thing.
    Oh good; that means I'm not mad :-)

    --

    rgds
    Stephen
  • Gregory Stark at Nov 18, 2006 at 2:39 am

    "Stephen Harris" <lists@spuddy.org> writes:

    My script was just a ksh script and didn't do anything special with signals.
    Essentially it does
    #!/bin/ksh -p

    [...variable setup...]
    while [ ! -f $wanted_file ]
    do
    if [ -f $abort_file ]
    then
    exit 1
    fi
    sleep 5
    done
    cat $wanted_file

    I know signals can be deferred in scripts (a signal sent to the script during
    the sleep will be deferred if a trap handler had been written for the signal)
    but they _do_ get delivered.
    Sure, but it might be getting delivered to, say, your "sleep" command. You
    haven't checked the return value of sleep to handle any errors that may occur.
    As it stands you have to check for errors from every single command executed
    by your script.

    That doesn't seem terribly practical to expect of users. As long as Postgres
    is using SIGQUIT for its own communication it seems it really ought to arrange
    to block the signal while the script is running so it will receive the signals
    it expects once the script ends.

    Alternatively perhaps Postgres really ought to be using USR1/USR2 or other
    signals that library routines won't think they have any business rearranging.

    --
    Gregory Stark
    EnterpriseDB http://www.enterprisedb.com
  • Stephen Harris at Nov 18, 2006 at 3:52 am

    On Fri, Nov 17, 2006 at 09:39:39PM -0500, Gregory Stark wrote:
    "Stephen Harris" <lists@spuddy.org> writes:
    [...variable setup...]
    while [ ! -f $wanted_file ]
    do
    if [ -f $abort_file ]
    then
    exit 1
    fi
    sleep 5
    done
    cat $wanted_file
    I know signals can be deferred in scripts (a signal sent to the script during
    Sure, but it might be getting delivered to, say, your "sleep" command. You
    No. The sleep command keeps on running. I could see that using "ps".

    To the best of my knowledge, a random child process of the script wouldn't
    even get a signal. All the postmaster recovery thread knows about is the
    system() - ie "sh -c". All sh knows about is the ksh process. Neither
    postmaster nor sh knows about "sleep" and so "sleep" wouldn't receive the
    signal (unless it was sent to all processes in the process group).

    Here's an example from Solaris 10 demonstrating lack of signal propagation.

    $ uname -sr
    SunOS 5.10
    $ echo $0
    /bin/sh
    $ cat x
    #!/bin/ksh -p

    sleep 10000
    $ ./x &
    4622
    $ kill 4622
    $
    4622 Terminated
    $ ps -ef | grep sleep
    sweh 4624 4602 0 22:13:13 pts/1 0:00 grep sleep
    sweh 4623 1 0 22:13:04 pts/1 0:00 sleep 10000

    This is, in fact, what proper "job control" shells do. Doing the same
    test with ksh as the command shell will kill the sleep :-)

    $ echo $0
    -ksh
    $ ./x &
    [1] 4632
    $ kill %1
    [1] + Terminated ./x &
    $ ps -ef | grep sleep
    sweh 4635 4582 0 22:15:17 pts/1 0:00 grep sleep

    [ Aside: The only way I've been able to guarantee all processes and child
    processes and everything to be killed is to run a subprocess with
    setsid() to create a new process group and kill the whole process group.
    It's a pain ]

    If postmaster was sending a signal to the system() process then "sh -c"
    might not signal the ksh script, anyway. The ksh script might terminate,
    or it might defer until sleep had finished. Only if postmaster had
    signalled a complete process group would sleep ever see the signal.

    --

    rgds
    Stephen
  • Tom Lane at Nov 18, 2006 at 5:15 am

    Gregory Stark writes:
    Sure, but it might be getting delivered to, say, your "sleep" command. You
    haven't checked the return value of sleep to handle any errors that may occur.
    As it stands you have to check for errors from every single command executed
    by your script.
    The expectation is that something like SIGINT or SIGQUIT would be
    delivered to both the sleep command and the shell process running the
    script. So the shell should fail anyway. (Of course, a nontrivial
    archive or recovery script had better be checking for failures at each
    step, but this is not very relevant to the immediate problem.)
    Alternatively perhaps Postgres really ought to be using USR1/USR2 or other
    signals that library routines won't think they have any business rearranging.
    The existing signal assignments were all picked for what seem to me
    to be good reasons; I'm disinclined to change them. In any case, the
    important point here is that we'd really like an archive or recovery
    script, or for that matter any command executed via system() from a
    backend, to abort when the parent backend is SIGINT'd or SIGQUIT'd.

    Stephen's idea of executing setsid() at each backend start seems
    interesting, but is there a way that will work on Windows?

    regards, tom lane
  • Gregory Stark at Nov 18, 2006 at 5:38 am

    "Tom Lane" <tgl@sss.pgh.pa.us> writes:

    Gregory Stark <stark@enterprisedb.com> writes:
    Sure, but it might be getting delivered to, say, your "sleep" command. You
    haven't checked the return value of sleep to handle any errors that may occur.
    As it stands you have to check for errors from every single command executed
    by your script.
    The expectation is that something like SIGINT or SIGQUIT would be
    delivered to both the sleep command and the shell process running the
    script. So the shell should fail anyway. (Of course, a nontrivial
    archive or recovery script had better be checking for failures at each
    step, but this is not very relevant to the immediate problem.)
    Hm, I tried to test that before I sent that. But I guess my test was faulty
    since I was really testing what process the terminal handling delivered the
    signal to:


    $ cat /tmp/test.sh
    #!/bin/sh

    echo before
    sleep 5 || echo sleep failed
    echo after

    $ sh /tmp/test.sh ; echo $?
    before
    ^\
    /tmp/test.sh: line 4: 23407 Quit sleep 5
    sleep failed
    after



    --
    Gregory Stark
    EnterpriseDB http://www.enterprisedb.com
  • Tom Lane at Nov 18, 2006 at 6:02 am

    Gregory Stark writes:
    Hm, I tried to test that before I sent that. But I guess my test was faulty
    since I was really testing what process the terminal handling delivered the
    signal to:
    Interesting. I tried the same test on HPUX, and find that its /bin/sh
    seems to ignore SIGQUIT but not SIGINT:

    $ sh /tmp/test.sh ; echo $?
    before -- typed ^C here
    130
    $ sh /tmp/test.sh ; echo $?
    before -- typed ^\ here
    /tmp/test.sh[4]: 25166 Quit(coredump)
    sleep failed
    after

    $

    There is nothing in the shell man page about this :-(

    That seems to leave us back at square one. How can we ensure an archive
    or recovery script will fail on being signaled? (Obviously we can't
    prevent someone from trapping the signal, but it'd be good if the
    default behavior was this way.)

    regards, tom lane
  • Tom Lane at Nov 18, 2006 at 3:49 am

    Stephen Harris writes:
    However, it seems the signal wasn't sent at all.
    Now that I think about it, the behavior of system() is predicated on the
    assumption that SIGINT and SIGQUIT originate with the tty driver and are
    broadcast to all members of the session's process group --- so the
    called command will get them too, and there's no need for system() to
    do anything except wait to see whether the called command dies or traps
    the signal.

    This does not apply to signals originated by the postmaster --- it
    doesn't even know that the child process is doing a system(), much less
    have any way to signal the grandchild. Ugh.

    Reimplementing system() seems pretty ugly, but maybe we have no choice.
    It strikes me that system() has a race condition as defined anyway,
    because if a signal arrives between blocking the handler and issuing the
    fork(), it'll disappear into the ether; and the same at the end of the
    routine.
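
    As a rough sketch of that direction (illustration only, assuming plain POSIX
    fork/exec/waitpid; the name my_system is invented and this is not PostgreSQL
    code), the point is that the caller's SIGINT/SIGQUIT handling is left alone,
    so a signal from the postmaster still interrupts the process waiting for the
    script. It does not attempt to fix the race condition mentioned above.

        #include <sys/types.h>
        #include <sys/wait.h>
        #include <unistd.h>
        #include <errno.h>

        /* Like system(3), but does not ignore SIGINT/SIGQUIT in the caller. */
        static int
        my_system(const char *command)
        {
            pid_t child = fork();

            if (child == -1)
                return -1;
            if (child == 0)
            {
                /* child: run the command through the shell, as system() would */
                execl("/bin/sh", "sh", "-c", command, (char *) NULL);
                _exit(127);     /* exec failed */
            }
            for (;;)
            {
                int status;

                if (waitpid(child, &status, 0) == child)
                    return status;      /* wait-style status, like system() */
                if (errno != EINTR)
                    return -1;
                /* EINTR: a signal arrived and our handler ran; keep waiting */
            }
        }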

    regards, tom lane
  • Stephen Harris at Nov 18, 2006 at 3:55 am

    On Fri, Nov 17, 2006 at 10:49:39PM -0500, Tom Lane wrote:
    Stephen Harris <lists@spuddy.org> writes:
    However, it seems the signal wasn't sent at all.
    Now that I think about it, the behavior of system() is predicated on the
    assumption that SIGINT and SIGQUIT originate with the tty driver and are
    broadcast to all members of the session's process group --- so the
    This does not apply to signals originated by the postmaster --- it
    doesn't even know that the child process is doing a system(), much less
    have any way to signal the grandchild. Ugh.
    Why not, after calling fork() create a new process group with setsid() and
    then instead of killing the recovery thread, kill the whole process group
    (-PID rather than PID)? Then every process (the recovery thread, the
    system, the script, any child of the script) will all receive the signal.

    --

    rgds
    Stephen
  • Tom Lane at Nov 18, 2006 at 4:40 am

    Stephen Harris writes:
    On Fri, Nov 17, 2006 at 10:49:39PM -0500, Tom Lane wrote:
    This does not apply to signals originated by the postmaster --- it
    doesn't even know that the child process is doing a system(), much less
    have any way to signal the grandchild. Ugh.
    Why not, after calling fork() create a new process group with setsid() and
    then instead of killing the recovery thread, kill the whole process group
    (-PID rather than PID)? Then every process (the recovery thread, the
    system, the script, any child of the script) will all receive the signal.
    This seems like a good answer if setsid and/or setpgrp are universally
    available. I fear it won't work on Windows though :-(. Also, each
    backend would become its own process group leader --- does anyone know
    if adding hundreds of process groups would slow down any popular
    kernels?

    [ thinks for a bit... ] Another issue is that there'd be a race
    condition during backend start: if the postmaster tries to kill -PID
    before the backend has managed to execute setsid, it wouldn't work.

    regards, tom lane
  • Stephen Harris at Nov 18, 2006 at 2:35 pm

    On Fri, Nov 17, 2006 at 11:40:36PM -0500, Tom Lane wrote:
    Stephen Harris <lists@spuddy.org> writes:
    Why not, after calling fork() create a new process group with setsid() and
    then instead of killing the recovery thread, kill the whole process group
    (-PID rather than PID)? Then every process (the recovery thread, the
    system, the script, any child of the script) will all receive the signal.
    This seems like a good answer if setsid and/or setpgrp are universally
    available. I fear it won't work on Windows though :-(. Also, each
    It's POSIX, so I would suppose it's standard on most modern *nix
    platforms. Windows... bluh. I wonder how perl handles POSIX::setsid()
    on Windows!
    backend would become its own process group leader --- does anyone know
    if adding hundreds of process groups would slow down any popular
    kernels?
    Shouldn't hurt. This is, after all, what using "&" in a command line
    shell with job control (csh, ksh, tcsh, bash, zsh) does. Because you only
    run one archive or recovery thread at a time (which is very good and very
    clever) you won't have too many process groups at any instant in time.
    [ thinks for a bit... ] Another issue is that there'd be a race
    condition during backend start: if the postmaster tries to kill -PID
    before the backend has managed to execute setsid, it wouldn't work.
    *ponder* Bugger. Standard solutions (eg try three times with a second
    pause) would mitigate this, but.... Hmm.

    Another idea is to make the shutdown be more co-operative under control
    of the script; eg an exit code of 0 means the xlog is now available, a code
    of 1 means the log is non-existent (so recovery is complete) and an
    exit code of 255 means "failure to recover; perform database shutdown".
    In this way a solution similar to the existing trigger files ("recovery
    complete") could be used. It's a little messy in that pg_ctl wouldn't
    be used to shut down the database; the script would essentially tell
    the recovery thread to abort, which would tell the main postmaster to
    shut down. We'd have no clients connected, no child process running,
    so a smart shutdown would work.

    --

    rgds
    Stephen
  • Tom Lane at Nov 20, 2006 at 4:20 pm

    Stephen Harris writes:
    On Fri, Nov 17, 2006 at 11:40:36PM -0500, Tom Lane wrote:
    [ thinks for a bit... ] Another issue is that there'd be a race
    condition during backend start: if the postmaster tries to kill -PID
    before the backend has managed to execute setsid, it wouldn't work.
    *ponder* Bugger. Standard solutions (eg try three times with a second
    pause) would mitigate this, but.... Hmm.
    I think we could make it work if the postmaster does this:

    kill(child_pid, SIGxxx);
    #ifdef HAVE_SETSID
    kill(-child_pid, SIGxxx);
    #endif

    In the normal case where the child has already completed setsid(), the
    extra signal sent to it should do no harm. In the startup race
    condition case, the second kill() accomplishes nothing, and we are
    effectively assuming that the child will enable signals and respond to
    the first kill before it launches any grandchildren; seems safe enough.

    Of course the postmaster should only do this when sending signals that
    are likely to be interpreted the same way by random shell scripts as
    they are by backends, ie, SIGINT, SIGQUIT, SIGTERM (not, say, SIGUSR1).

    This would fix the problem on Unixen with setsid, which should be pretty
    much all of them; on any that don't have it, the behavior doesn't
    change. Still leaves Windows out in the cold, but I don't see that we
    have anything much we can do about it there --- Windows users will just
    have to know to kill the child script process instead of the postmaster.

    Comments?

    regards, tom lane
  • Magnus Hagander at Nov 20, 2006 at 4:23 pm

    This would fix the problem on Unixen with setsid, which
    should be pretty much all of them; on any that don't have it,
    the behavior doesn't change. Still leaves Windows out in the
    cold, but I don't see that we have anything much we can do
    about it there --- Windows users will just have to know to
    kill the child script process instead of the postmaster.
    I haven't really been following this thread. What exactly is it you
    want to accomplish (maybe there is a different way to fix it on win32)?
    Kill a process along with all its subprocesses?

    //Magnus
  • Tom Lane at Nov 20, 2006 at 4:30 pm

    "Magnus Hagander" <mha@sollentuna.net> writes:
    I haven't really been following this thread. What exactly is it you
    want to accomplish (maybe there is a different way to fix it on win32)?
    Kill a process along with all its subprocesses?
    Right. The problem (as seen on Unix) is that if the recovery subprocess
    is waiting for a restore_command script to complete, it doesn't react
    to SIGQUIT (because system() ignores SIGQUIT) and the script doesn't
    react either (because it never gets the signal at all). We'd like
    "pg_ctl stop -m immediate" to abort the script and then have the
    recovery process fail, allowing restart later. More generally, it
    seems like any time a backend is executing system(), the called script
    ought to be subject to SIGINT or SIGQUIT signals sent to the backend
    by the postmaster.

    regards, tom lane
  • Magnus Hagander at Nov 20, 2006 at 4:34 pm

    I haven't really been following this thread. What exactly is it you
    want to accomplish (maybe there is a different way to fix
    it on win32)?
    Kill a process along with all its subprocesses?
    Right. The problem (as seen on Unix) is that if the recovery
    subprocess is waiting for a restore_command script to
    complete, it doesn't react to SIGQUIT (because system()
    ignores SIGQUIT) and the script doesn't react either (because
    it never gets the signal at all). We'd like "pg_ctl stop -m
    immediate" to abort the script and then have the recovery
    process fail, allowing restart later. More generally, it
    seems like any time a backend is executing system(), the
    called script ought to be subject to SIGINT or SIGQUIT
    signals sent to the backend by the postmaster.
    Hm. ok.
    Well, first of all, the generic case will never really work - because we
    send our own kind of signals, and a script (batch file executing under
    cmd.exe) or anything non-postgresql-backend will never respond to that
    anyway.

    However, for this specific case we might be able to do something. pg_ctl
    creates a job object that it assigns the postmaster to when it's started,
    if that works. (Won't work if pg_ctl itself is running inside a job
    object, but normally it won't be doing that). That means that we could
    have pg_ctl look up this job object and execute TerminateJobObject() on
    it. That will kill all processes inside the job object (including
    backends, but if we run it from pg_ctl *after* the postmaster has shut down,
    there should be none of those left).
    (This will work for the generic processes as well, but it will *only*
    work for the pg_ctl-stops-the-process scenario; it can't be done from
    inside the postmaster.)

    It's not entirely pretty (to say the least), but might be worthwhile
    looking at?
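
    To make that concrete, a minimal sketch of the pg_ctl-side kill (assuming
    the standard Win32 job-object API; the job object name passed in would be
    whatever naming scheme pg_ctl actually uses, it is not specified here):

        #include <windows.h>
        #include <stdio.h>

        /* Open the job object pg_ctl created and kill everything inside it. */
        static int
        terminate_postmaster_job(const char *jobname)
        {
            HANDLE job = OpenJobObjectA(JOB_OBJECT_TERMINATE, FALSE, jobname);

            if (job == NULL)
            {
                fprintf(stderr, "could not open job object: error %lu\n",
                        (unsigned long) GetLastError());
                return -1;
            }
            /* Kills every process assigned to the job, leftover scripts included */
            if (!TerminateJobObject(job, 1))
            {
                fprintf(stderr, "TerminateJobObject failed: error %lu\n",
                        (unsigned long) GetLastError());
                CloseHandle(job);
                return -1;
            }
            CloseHandle(job);
            return 0;
        }

    As noted above, this only makes sense after the postmaster itself has
    exited, so that nothing but leftover script processes remain in the job.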

    //Magnus
  • Tom Lane at Nov 20, 2006 at 4:40 pm

    "Magnus Hagander" <mha@sollentuna.net> writes:
    However, for this specific case we might be able to do something. pg_ctl
    creates a job object that it assigns the postmaster to when it's started
    if that works. (Won't work if pg_ctl itself is running inside a job
    object, but normally it won't be doing that). That means that we could
    have pg_ctl look up this job object and execute TerminateJobObject() on
    it. That will kill all processes inside the job object (including
    backends, but if we run it from pg_ctl *after* postmaster has shut down,
    there should be none of those left).
    That would take care of cleaning up any scripts after the postmaster has
    already quit, but what about the problem of kicking the waiting backend
    off the system() call in the first place? I suppose that if we can hack
    things so system() doesn't block SIGQUIT on Windows, it might be OK, at
    least for the "-m immediate" case. I was hoping we could get Query
    Cancel to SIGINT a system()-invoked script in a regular backend, but
    that might be too much to hope for on Windows, especially if it hasn't
    even got a distinction between INT and QUIT ...

    regards, tom lane
  • Stephen Harris at Nov 20, 2006 at 4:53 pm

    On Mon, Nov 20, 2006 at 11:20:41AM -0500, Tom Lane wrote:
    Stephen Harris <lists@spuddy.org> writes:
    On Fri, Nov 17, 2006 at 11:40:36PM -0500, Tom Lane wrote:
    [ thinks for a bit... ] Another issue is that there'd be a race
    condition during backend start: if the postmaster tries to kill -PID
    before the backend has managed to execute setsid, it wouldn't work.
    *ponder* Bugger. Standard solutions (eg try three times with a second
    pause) would mitigate this, but.... Hmm.
    I think we could make it work if the postmaster does this:

    kill(child_pid, SIGxxx);
    #ifdef HAVE_SETSID
    kill(-child_pid, SIGxxx);
    #endif
    That should work. I just created a dummy test case where a process forks
    a child, the child does setsid and then forks 3 times, so 8 processes
    all in the new process group. The original parent sleeps for 3 seconds then
    kills the original child (the process group leader) then sleeps 3 more seconds
    and then kills the process group. A simple sig handler shows the signal
    being received by the remaining children.

    Tested on RHEL4 (2.6.9-42.0.3.ELsmp) and Solaris 10 (Sparc) and got the
    same results.

    #include <signal.h>
    #include <unistd.h>
    #include <stdio.h>
    #include <stdlib.h>

    void sig(int x)
    {
        int y = getpid();
        printf("Process %d got signal\n", y);
        exit(0);
    }

    void do_child()
    {
        setsid();
        fork();
        fork();
        fork();

        printf("%d sleeping for a long time!\n", (int) getpid());
        sleep(10000);
    }

    int main()
    {
        signal(SIGQUIT, &sig);
        signal(SIGCLD, SIG_IGN);
        int ch = fork();

        if (ch == 0)
        {
            do_child();
        }
        else
        {
            int e;
            sleep(3);
            printf("Killing first child\n");
            printf("Return is: %d\n", e = kill(ch, SIGQUIT));
            if (e) { perror("Bad kill"); }
            sleep(3);
            printf("Killing process group\n");
            printf("Return is: %d\n", e = kill(-ch, SIGQUIT));
            if (e) { perror("Bad kill"); }
        }

        exit(0);
    }

    --

    rgds
    Stephen
  • Stephen Harris at Nov 22, 2006 at 6:56 pm

    On Mon, Nov 20, 2006 at 11:20:41AM -0500, Tom Lane wrote:

    kill(child_pid, SIGxxx);
    #ifdef HAVE_SETSID
    kill(-child_pid, SIGxxx);
    #endif

    In the normal case where the child has already completed setsid(), the
    extra signal sent to it should do no harm. In the startup race
    Hmm. It looks like something more than this may be needed. The postgres
    recovery process appears to be ignoring it. I ran the whole database
    in its own process group (ksh runs processes in their own process group
    by default, so pg_ctl became the session leader and everything under
    pg_ctl stayed in that process group).

    % ps -o pid,ppid,pgid,args -g 29141 | sort
    PID PPID PGID COMMAND
    29145 1 29141 /local/apps/postgres/8.2.b3.0/solaris/bin/postgres
    29146 29145 29141 /local/apps/postgres/8.2.b3.0/solaris/bin/postgres
    29147 29145 29141 /local/apps/postgres/8.2.b3.0/solaris/bin/postgres
    29501 29147 29141 sh -c /export/home/swharris/rr 000000010000000100000057 pg_xlog/RECOVERYXLOG
    29502 29501 29141 /bin/ksh -p /export/home/swharris/rr 000000010000000100000057 pg_xlog/RECOVERYX
    29537 29502 29141 sleep 5

    I did
    kill -QUIT -29141 ; sleep 1 ; touch /export/home/swharris/archives/STOP_SWEH_RECOVERY

    This sent the QUIT signal to all those processes. The shell script ignores
    it and so tries to start again, so the 'touch' command tells it to exit(1)
    rather than loop again.

    The log file (the timestamp entries are from my 'rr' program so I
    can see what it's doing)...

    To start with we see a normal recovery:

    Wed Nov 22 13:41:20 EST 2006: Attempting to restore 000000010000000100000056
    Wed Nov 22 13:41:25 EST 2006: Finished 000000010000000100000056
    LOG: restored log file "000000010000000100000056" from archive
    Wed Nov 22 13:41:25 EST 2006: Attempting to restore 000000010000000100000057
    Wed Nov 22 13:41:25 EST 2006: Waiting for file to become available

    Now I send the kill signal...

    LOG: received immediate shutdown request

    We can see that the sleep process got it!
    /export/home/swharris/rr[37]: 29537 Quit(coredump)
    And my script detects the trigger file
    Wed Nov 22 13:43:51 EST 2006: End of recovery trigger file found

    Now database recovery appears to continue as normal; the postgres
    recovery processes are still running, despite having received SIGQUIT

    LOG: could not open file "pg_xlog/000000010000000100000057" (log file 1, segment 87): No such file or directory
    LOG: redo done at 1/56000070
    Wed Nov 22 13:43:51 EST 2006: Attempting to restore 000000010000000100000056
    Wed Nov 22 13:43:55 EST 2006: Finished 000000010000000100000056
    LOG: restored log file "000000010000000100000056" from archive
    LOG: archive recovery complete
    LOG: database system is ready
    LOG: logger shutting down

    pg_xlog now contains 000000010000000100000056 and 000000010000000100000057

    A similar sort of thing happens if I use SIGTERM rather than SIGQUIT

    I'm out of here in an hour, so for all you US based people, have a good
    Thanksgiving holiday!

    --

    rgds
    Stephen
  • Tom Lane at Nov 22, 2006 at 8:03 pm

    Stephen Harris writes:
    We can see that the sleep process got it!
    /export/home/swharris/rr[37]: 29537 Quit(coredump)
    And my script detects the trigger file
    Wed Nov 22 13:43:51 EST 2006: End of recovery trigger file found
    Now database recovery appears to continue as normal; the postgres
    recovery processes are still running, despite having received SIGQUIT
    Well, sure, because system() will have ignored the SIGQUIT. You have to
    make the shell script return an abort code rather than exit(1) or
    whatever it's doing in this case. (I've tested that CVS HEAD does the
    right thing when the script is just a straight invocation of cp. But
    if the script is trapping errors, it's going to have to cooperate.)

    regards, tom lane
  • Stephen Harris at Nov 23, 2006 at 6:18 pm

    On Wed, Nov 22, 2006 at 03:03:03PM -0500, Tom Lane wrote:
    Stephen Harris <lists@spuddy.org> writes:
    We can see that the sleep process got it!
    /export/home/swharris/rr[37]: 29537 Quit(coredump)
    And my script detects the trigger file
    Wed Nov 22 13:43:51 EST 2006: End of recovery trigger file found
    Now database recovery appears to continue as normal; the postgres
    recovery processes are still running, despite having received SIGQUIT
    Well, sure, because system() will have ignored the SIGQUIT. You have to
    make the shell script return an abort code rather than exit(1) or
    whatever it's doing in this case. (I've tested that CVS HEAD does the
    Hmm, on Linux I did a kill -QUIT on the process group and then rather
    than cleanly exiting my recovery script I "kill -9"d it. The database
    still completed recovery as previously described.
    right thing when the script is just a straight invocation of cp. But
    if the script is trapping errors, it's going to have to cooperate.)
    The script doesn't trap errors; ksh itself plays with SIGQUIT.

    I guess I should rewrite my script in perl and see what happens.

    --

    rgds
    Stephen
  • Stephen Harris at Nov 23, 2006 at 7:38 pm

    On Thu, Nov 23, 2006 at 01:18:06PM -0500, Stephen Harris wrote:
    Hmm, on Linux I did a kill -QUIT on the process group and then rather
    than cleanly exiting my recovery script I "kill -9"d it. The database
    still completed recovery as previously described.
    OK, I just did the same tests on the CVS tree, rather than beta3 and
    I see different results. This time recovery aborts with a PANIC. I
    did a "pg_ctl stop -m i" and then kill -9'd the recovery script and:

    LOG: restored log file "000000010000000000000006" from archive
    Looking for 000000010000000000000007
    LOG: received immediate shutdown request
    /tmp/rr: line 12: 22031 Quit sleep 5
    PANIC: could not restore file "000000010000000000000007" from archive: return code 9

    Similar behaviour was seen with "smart" and "fast" shutdowns.

    Looking for 000000010000000000000005
    Restoring 000000010000000000000005 to pg_xlog/RECOVERYXLOG
    LOG: restored log file "000000010000000000000005" from archive
    LOG: checkpoint record is at 0/5000020
    LOG: redo record is at 0/5000020; undo record is at 0/0; shutdown FALSE
    LOG: next transaction ID: 0/601; next OID: 24577
    LOG: next MultiXactId: 1; next MultiXactOffset: 0
    LOG: automatic recovery in progress
    LOG: redo starts at 0/5000068
    Looking for 000000010000000000000006
    Restoring 000000010000000000000006 to pg_xlog/RECOVERYXLOG
    LOG: restored log file "000000010000000000000006" from archive
    Looking for 000000010000000000000007
    LOG: received fast shutdown request
    PANIC: could not restore file "000000010000000000000007" from archive: return code 9
    LOG: startup process (PID 22246) was terminated by signal 6
    LOG: aborting startup due to startup process failure


    Restarting the database appears to go back to asking for archive log 5
    again. I'll have to do some testing with more archives to see how far back
    we have to go.

    --

    rgds
    Stephen
  • Tom Lane at Nov 23, 2006 at 7:47 pm

    Stephen Harris writes:
    OK, I just did the same tests on the CVS tree, rather than beta3 and
    I see different results. This time recovery aborts with a PANIC.
    Yeah, that's expected since the whole recovery process is a critical
    section. We could change that but it's cosmetic.
    Restarting the database appears to go back to asking for archive log 5
    again.
    It would back up to wherever the last recovery checkpoint was. I think
    we've got it hardwired at one recovery checkpoint per 100 source
    checkpoints ... maybe that's too far apart.

    regards, tom lane
  • Stephen Harris at Nov 24, 2006 at 12:20 pm

    On Thu, Nov 23, 2006 at 02:47:27PM -0500, Tom Lane wrote:

    It would back up to wherever the last recovery checkpoint was. I think
    we've got it hardwired at one recovery checkpoint per 100 source
    checkpoints ... maybe that's too far apart.
    I'm talking out of my arse now, since I have no idea of the internals
    of postgres recovery, but maybe a recovery checkpoint can be added each
    time it asks for a new archive file, or every 100 source checkpoints,
    whichever comes first.

    A database that is mostly idle (eg overnight) but has 1 or 2 critical
    transactions will have received lots of archive logs (from the 5 minute
    checkpoint timeout) but not necessarily enough to cause recovery to
    checkpoint. This could mean needing to keep a large number of shipped
    logfiles available on the standby system "just in case".

    --

    rgds
    Stephen
  • Simon Riggs at Nov 30, 2006 at 9:40 pm

    On Fri, 2006-11-24 at 07:20 -0500, Stephen Harris wrote:
    On Thu, Nov 23, 2006 at 02:47:27PM -0500, Tom Lane wrote:

    It would back up to wherever the last recovery checkpoint was. I think
    we've got it hardwired at one recovery checkpoint per 100 source
    checkpoints ... maybe that's too far apart.
    I'm talking out of my arse now, since I have no idea of the internals
    of postgres recovery, but maybe a recovery checkpoint can be added each
    time it asks for a new archive file, or every 100 source checkpoints,
    whichever comes first.
    We can take a restorepoint during replay only when we come across a
    checkpoint WAL record. We take the first checkpoint WAL record we come
    across after checkpoint_timeout seconds have expired, not 100 as
    described.

    History: The 100 checkpoints per restorepoint idea was the way it was
    originally following the discussions at Anniversary. That was replaced
    prior to code commit with the concept of a minimum delay between
    restorepoints equal to the checkpoint_timeout.
    A database that is mostly idle (eg overnight) but has 1 or 2 critical
    transactions will have received lots of archive logs (from the 5 minute
    checkpoint timeout) but not necessarily enough to cause recovery to
    checkpoint. This could mean needing to keep a large number of shipped
    logfiles available on the standby system "just in case".
    It doesn't work that way. There is only ever one WAL file being worked
    upon at a time, so there is no build up of log files in the case you
    mention.

    The case you mention would be captured by the current code, since after
    5 minutes worth of WAL replay we would attempt a restorepoint at the
    next checkpoint record. If replay goes at exactly the same speed as the
    original WAL writes, then there would be one restorepoint per original
    checkpoint. If replay goes 10 times faster then there would be one
    restorepoint for each 10 checkpoints, etc.. So the restorepoint is taken
    every 5 minutes in wall-clock time during replay, whereas that might be
    anywhere from 5 minutes upwards in original-WAL-write-time, possibly
    hours or days if you aren't writing any WAL at all for long periods. But
    note that will not translate into many WAL files.

    If the de-archive time is significant then the average rate of WAL apply
    will go down and so the number of restorepoints per checkpoint will
    fall. So overall, I can't see any need to change the code, but I'm
    listening for different opinions and Stephen's especially has been very
    good to have, thanks.

    Humble apologies to all for my late arrival on this thread. I missed the
    start of it during a busy period last week and didn't notice subsequent
    postings. I'll try to filter things better in future since I'm clearly
    not managing to read all postings. Please feel free to include me into
    any threads.

    If there's anything that people think I should be working on in support
    of this, please say and I'll prioritise that now ahead of my docs and
    advocacy activities around the release.

    --
    Simon Riggs
    EnterpriseDB http://www.enterprisedb.com
  • Stephen Harris at Nov 30, 2006 at 2:11 pm

    On Thu, Nov 23, 2006 at 02:47:27PM -0500, Tom Lane wrote:
    Stephen Harris <lists@spuddy.org> writes:
    OK, I just did the same tests on the CVS tree, rather than beta3 and
    I see different results. This time recovery aborts with a PANIC.
    Yeah, that's expected since the whole recovery process is a critical
    section. We could change that but it's cosmetic.
    Should this cause a coredump when it happens? With rc1 whenever I
    do a shutdown I get a core file

    core: ELF 32-bit MSB core file SPARC Version 1, from 'postgres'

    (gdb) bt
    #0 0xff0a02d4 in _libc_kill () from /usr/lib/libc.so.1
    #1 0xff035a5c in abort () from /usr/lib/libc.so.1
    #2 0x0020cd50 in errfinish ()
    #3 0x0006d258 in RestoreArchivedFile ()

    --

    rgds
    Stephen
  • Bruce Momjian at Nov 30, 2006 at 2:41 pm

    Stephen Harris wrote:
    On Thu, Nov 23, 2006 at 02:47:27PM -0500, Tom Lane wrote:
    Stephen Harris <lists@spuddy.org> writes:
    OK, I just did the same tests on the CVS tree, rather than beta3 and
    I see different results. This time recovery aborts with a PANIC.
    Yeah, that's expected since the whole recovery process is a critical
    section. We could change that but it's cosmetic.
    Should this cause a coredump when it happens? With rc1 whenever I
    do a shutdown I get a core file

    core: ELF 32-bit MSB core file SPARC Version 1, from 'postgres'

    (gdb) bt
    #0 0xff0a02d4 in _libc_kill () from /usr/lib/libc.so.1
    #1 0xff035a5c in abort () from /usr/lib/libc.so.1
    #2 0x0020cd50 in errfinish ()
    #3 0x0006d258 in RestoreArchivedFile ()
    You should never get a core file. I think we need the binary compiled
    with symbols so we can get more information from this backtrace.

    --
    Bruce Momjian bruce@momjian.us
    EnterpriseDB http://www.enterprisedb.com

    + If your life is a hard drive, Christ can be your backup. +
  • Tom Lane at Nov 30, 2006 at 2:51 pm

    Bruce Momjian writes:
    Stephen Harris wrote:
    Should this cause a coredump when it happens?
    You should never get a core file.
    elog(PANIC) => abort() => core dump. This is completely expected.

    regards, tom lane
  • Bruce Momjian at Nov 30, 2006 at 2:59 pm

    Tom Lane wrote:
    Bruce Momjian <bruce@momjian.us> writes:
    Stephen Harris wrote:
    Should this cause a coredump when it happens?
    You should never get a core file.
    elog(PANIC) => abort() => core dump. This is completely expected.
    Well, then I should have said there is no reason you should ever get a
    panic.

    --
    Bruce Momjian bruce@momjian.us
    EnterpriseDB http://www.enterprisedb.com

    + If your life is a hard drive, Christ can be your backup. +
  • Stephen Harris at Nov 30, 2006 at 3:59 pm

    On Thu, Nov 30, 2006 at 09:58:52AM -0500, Bruce Momjian wrote:
    Tom Lane wrote:
    Bruce Momjian <bruce@momjian.us> writes:
    Stephen Harris wrote:
    Should this cause a coredump when it happens?
    You should never get a core file.
    elog(PANIC) => abort() => core dump. This is completely expected.
    Well, then I should have said there is no reason you should ever get a
    panic.
    Well, as Tom said earlier in the thread
    I see different results. This time recovery aborts with a PANIC.
    Yeah, that's expected since the whole recovery process is a critical
    section. We could change that but it's cosmetic.

    Because of the changes made, we don't actually need to do a database
    shutdown. Merely killing the restore_command process (eg with kill -9)
    appears to cause the same effect.

    And now a personal opinion...

    I think this is more than cosmetic; shutting down a standby database
    cleanly is critical functionality for proper warm-standby procedures.
    What we have now "works", but should be tidied up. Probably low on the
    priority queue though :-)

    --

    rgds
    Stephen
  • Tom Lane at Nov 30, 2006 at 6:32 pm

    Stephen Harris writes:
    I think this is more than cosmetic; shutting down a standby database
    cleanly is critical functionality for proper warm-standby procedures.
    What we have now "works", but should be tidied up.
    I've made some small adjustments that make "pg_ctl stop -m fast" work
    for shutting down a warm-standby server. Please test.

    regards, tom lane
  • Stephen Harris at Nov 30, 2006 at 8:52 pm

    On Thu, Nov 30, 2006 at 01:32:31PM -0500, Tom Lane wrote:
    Stephen Harris <lists@spuddy.org> writes:
    I think this is more than cosmetic; shutting down a standby database
    cleanly is critical functionality for proper warm-standby procedures.
    What we have now "works", but should be tidied up.
    I've made some small adjustments that make "pg_ctl stop -m fast" work
    for shutting down a warm-standby server. Please test.
    That's a lot cleaner. "fast" is good because of SIGTERM, which the
    shell script will also respond to even while waiting on another process.


    LOG: restored log file "00000001000000000000003E" from archive
    Thu Nov 30 15:45:56 EST 2006: Attempting to restore 00000001000000000000003F
    Thu Nov 30 15:45:58 EST 2006: Finished 00000001000000000000003F
    LOG: restored log file "00000001000000000000003F" from archive
    Thu Nov 30 15:45:59 EST 2006: Attempting to restore 000000010000000000000040
    Thu Nov 30 15:45:59 EST 2006: Waiting for file to become available
    LOG: received fast shutdown request
    Terminated
    FATAL: could not restore file "000000010000000000000040" from archive: return code 36608
    LOG: startup process (PID 10730) exited with exit code 1
    LOG: aborting startup due to startup process failure
    LOG: logger shutting down

    (the datestamp entries are from my script)

    Starting up the standby database still goes back to earlier log files,
    but I guess that's the 100 checkpoint thing you mentioned earlier.

    The change you just made is very good and very much tidier; no core files,
    no panics, cleans up children nicely. Good stuff!


    --

    rgds
    Stephen
  • Tom Lane at Nov 30, 2006 at 8:59 pm

    Stephen Harris writes:
    Starting up the standby database still goes back to earlier log files,
    but I guess that's the 100 checkpoint thing you mentioned earlier.
    Actually I was misremembering that: the frequency of recovery
    checkpoints is time-based, and for a slave that's in sync with its
    master, they should occur about as often as checkpoints on the master.
    So the amount of log to be replayed should be comparable to what the
    master would have to replay if it crashed.

    regards, tom lane
  • Stephen Harris at Nov 30, 2006 at 9:32 pm

    On Thu, Nov 30, 2006 at 03:59:15PM -0500, Tom Lane wrote:
    Stephen Harris <lists@spuddy.org> writes:
    Starting up the standby database still goes back to earlier log files,
    but I guess that's the 100 checkpoint thing you mentioned earlier.
    Actually I was misremembering that: the frequency of recovery
    checkpoints is time-based, and for a slave that's in sync with its
    master, they should occur about as often as checkpoints on the master.
    So the amount of log to be replayed should be comparable to what the
    master would have to replay if it crashed.
    Hmm. For testing purposes I have

    archive_timeout=5

    so an archive log is created every 5 seconds if there's activity.
    In idle periods it's created every 5 minutes as the internal checkpoint
    occurs on the server. However, if I replay all those "empty" logs, then
    shut down and start up again, it will ask for some of them again.

    As a test, I started a recovery, recovered 11 archive logs
    (000000010000000000000035 to 00000001000000000000003F), some may have had
    activity; I'm not sure. I then waited 10 minutes after the last log file
    had been replayed. Then I did a fast shutdown. Starting the standby
    database again required it going back to 000000010000000000000035 again.

    --

    rgds
    Stephen
  • Tom Lane at Nov 30, 2006 at 9:37 pm

    Stephen Harris writes:
    On Thu, Nov 30, 2006 at 03:59:15PM -0500, Tom Lane wrote:
    Actually I was misremembering that: the frequency of recovery
    checkpoints is time-based, and for a slave that's in sync with its
    master, they should occur about as often as checkpoints on the master.
    So the amount of log to be replayed should be comparable to what the
    master would have to replay if it crashed.
    As a test, I started a recovery, recovered 11 archive logs
    (000000010000000000000035 to 00000001000000000000003F), some may have had
    activity; I'm not sure. I then waited 10 minutes after the last log file
    had been replayed. Then I did a fast shutdown. Starting the standby
    database again required it going back to 000000010000000000000035 again.
    That case is not what I described --- you weren't tracking a live master
    that's generating checkpoints.

    regards, tom lane
  • Tom Lane at Nov 30, 2006 at 4:25 pm

    Bruce Momjian writes:
    Tom Lane wrote:
    elog(PANIC) => abort() => core dump. This is completely expected.
    Well, then I should have said there is no reason you should ever get a
    panic.
    The reason it's a PANIC is that the whole of StartupXLOG() runs as a
    critical section. Given the recent changes to make elog(FATAL) do
    exit(1) instead of exit(0), I think it might be possible to dispense
    with that, but it's rather a ticklish thing to be changing the day
    before we wrap 8.2. How excited are you about this?

    regards, tom lane
