one of my new buildfarm boxes (an Debian/Etch based ARM box) is
sometimes failing to stop the database during the regression tests:

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=quagga&dt=2007-01-08%2003:03:03

this only seems to happen sometimes and only if --with-tcl is enabled on
quagga.

lionfish (my mipsel box) is able to trigger that on every build if I
enable --with-tcl but it is nearly impossible to debug it there because
of the low amount of memory and diskspace it has. (two consecutive
failures will run the kernel out of memory due to the resources consumed
by the still running processes).

After the stopdb failure we still have those processes running:

pgbuild 3389 0.0 1.5 39632 4112 ? S 06:14 0:03
/home/pgbuild/pgbuildfarm/HEAD/inst/bin/postgres -D data
pgbuild 3391 0.0 0.9 39632 2540 ? Ss 06:14 0:00
postgres: writer process
pgbuild 3392 0.0 0.5 11220 1348 ? Ss 06:14 0:00
postgres: stats collector process
pgbuild 3488 0.0 2.4 43640 6300 ? Ss 06:15 0:01
postgres: pgbuild pl_regression [local] idle
pgbuild 3489 0.0 0.0 0 0 ? Z 06:15 0:00
[postgres] <defunct>


Any ideas on how to debug that any further ?


Stefan

Search Discussions

  • Tom Lane at Jan 9, 2007 at 2:52 am

    Stefan Kaltenbrunner writes:
    one of my new buildfarm boxes (an Debian/Etch based ARM box) is
    sometimes failing to stop the database during the regression tests:
    http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=quagga&dt=2007-01-08%2003:03:03
    this only seems to happen sometimes and only if --with-tcl is enabled on
    quagga.
    lionfish (my mipsel box) is able to trigger that on every build if I
    enable --with-tcl but it is nearly impossible to debug it there because
    of the low amount of memory and diskspace it has.
    Hm, could pl/tcl somehow be preventing the backend from exiting once
    it's run any pl/tcl stuff? I have no idea why though, and even less
    why it wouldn't be repeatable.
    After the stopdb failure we still have those processes running:
    pgbuild 3488 0.0 2.4 43640 6300 ? Ss 06:15 0:01
    postgres: pgbuild pl_regression [local] idle
    Can you get a stack trace from this process?
    pgbuild 3489 0.0 0.0 0 0 ? Z 06:15 0:00
    [postgres] <defunct>
    This is a bit odd ... if that process is a direct child of the
    postmaster it should have been reaped promptly. Could it be a child
    of the other backend? If so, why was it started? Please try the
    ps again with whatever switch it needs to list parent process ID.

    regards, tom lane
  • Stefan Kaltenbrunner at Jan 9, 2007 at 7:24 am

    Tom Lane wrote:
    Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
    one of my new buildfarm boxes (an Debian/Etch based ARM box) is
    sometimes failing to stop the database during the regression tests:
    http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=quagga&dt=2007-01-08%2003:03:03
    this only seems to happen sometimes and only if --with-tcl is enabled on
    quagga.
    lionfish (my mipsel box) is able to trigger that on every build if I
    enable --with-tcl but it is nearly impossible to debug it there because
    of the low amount of memory and diskspace it has.
    Hm, could pl/tcl somehow be preventing the backend from exiting once
    it's run any pl/tcl stuff? I have no idea why though, and even less
    why it wouldn't be repeatable.
    After the stopdb failure we still have those processes running:
    pgbuild 3488 0.0 2.4 43640 6300 ? Ss 06:15 0:01
    postgres: pgbuild pl_regression [local] idle
    Can you get a stack trace from this process?
    (gdb) bt
    #0 0x406b9d80 in __pthread_sigsuspend () from /lib/libpthread.so.0
    #1 0x406b8a7c in __pthread_wait_for_restart_signal () from
    /lib/libpthread.so.0
    #2 0x406b91f8 in pthread_onexit_process () from /lib/libpthread.so.0
    #3 0x40438658 in exit () from /lib/libc.so.6
    #4 0x40438658 in exit () from /lib/libc.so.6
    Previous frame identical to this frame (corrupt stack?)


    pgbuild 3489 0.0 0.0 0 0 ? Z 06:15 0:00
    [postgres] <defunct>
    This is a bit odd ... if that process is a direct child of the
    postmaster it should have been reaped promptly. Could it be a child
    of the other backend? If so, why was it started? Please try the
    ps again with whatever switch it needs to list parent process ID.
    looks you are right - the defunct 3489 seems to be a child of 3488:

    PPID PID PGID SID TTY TPGID STAT UID TIME COMMAND
    1 3389 18341 18341 ? -1 S 1001 0:03
    /home/pgbuild/pgbuildfarm/HEAD/inst/bin/postgres -D data
    3389 3391 3391 3391 ? -1 Ss 1001 0:00 postgres:
    writer process
    3389 3392 3392 3392 ? -1 Ss 1001 0:00 postgres: stats
    collector process
    3389 3488 3488 3488 ? -1 Ss 1001 0:01 postgres:
    pgbuild pl_regression [local] idle
    3488 3489 3488 3488 ? -1 Z 1001 0:00 [postgres]
    <defunct>


    Stefan
  • Stefan Kaltenbrunner at Jan 24, 2007 at 6:36 pm

    Stefan Kaltenbrunner wrote:
    Tom Lane wrote:
    Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
    one of my new buildfarm boxes (an Debian/Etch based ARM box) is
    sometimes failing to stop the database during the regression tests:
    http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=quagga&dt=2007-01-08%2003:03:03
    this only seems to happen sometimes and only if --with-tcl is enabled on
    quagga.
    lionfish (my mipsel box) is able to trigger that on every build if I
    enable --with-tcl but it is nearly impossible to debug it there because
    of the low amount of memory and diskspace it has.
    Hm, could pl/tcl somehow be preventing the backend from exiting once
    it's run any pl/tcl stuff? I have no idea why though, and even less
    why it wouldn't be repeatable.
    After the stopdb failure we still have those processes running:
    pgbuild 3488 0.0 2.4 43640 6300 ? Ss 06:15 0:01
    postgres: pgbuild pl_regression [local] idle
    Can you get a stack trace from this process?
    (gdb) bt
    #0 0x406b9d80 in __pthread_sigsuspend () from /lib/libpthread.so.0
    #1 0x406b8a7c in __pthread_wait_for_restart_signal () from
    /lib/libpthread.so.0
    #2 0x406b91f8 in pthread_onexit_process () from /lib/libpthread.so.0
    #3 0x40438658 in exit () from /lib/libc.so.6
    #4 0x40438658 in exit () from /lib/libc.so.6
    Previous frame identical to this frame (corrupt stack?)


    pgbuild 3489 0.0 0.0 0 0 ? Z 06:15 0:00
    [postgres] <defunct>
    This is a bit odd ... if that process is a direct child of the
    postmaster it should have been reaped promptly. Could it be a child
    of the other backend? If so, why was it started? Please try the
    ps again with whatever switch it needs to list parent process ID.
    looks you are right - the defunct 3489 seems to be a child of 3488:

    PPID PID PGID SID TTY TPGID STAT UID TIME COMMAND
    1 3389 18341 18341 ? -1 S 1001 0:03
    /home/pgbuild/pgbuildfarm/HEAD/inst/bin/postgres -D data
    3389 3391 3391 3391 ? -1 Ss 1001 0:00 postgres:
    writer process
    3389 3392 3392 3392 ? -1 Ss 1001 0:00 postgres: stats
    collector process
    3389 3488 3488 3488 ? -1 Ss 1001 0:01 postgres:
    pgbuild pl_regression [local] idle
    3488 3489 3488 3488 ? -1 Z 1001 0:00 [postgres]
    <defunct>
    FWIW - I removed --with-tcl from quagga's configuration about two weeks
    ago and it has not failed(for that reason) again. So the issue most
    definitly looks like plptcl related ...


    Stefan
  • Tom Lane at Jan 24, 2007 at 6:54 pm

    Stefan Kaltenbrunner writes:
    FWIW - I removed --with-tcl from quagga's configuration about two weeks
    ago and it has not failed(for that reason) again. So the issue most
    definitly looks like plptcl related ...
    It sorta looks like Tcl might be installing an atexit() callback that is
    doing the Wrong Thing somehow. What Tcl version do you have installed
    exactly, and with what configure options? (The contents of tclConfig.sh
    should be reasonably complete info.)

    regards, tom lane
  • Stefan Kaltenbrunner at Jan 24, 2007 at 7:24 pm

    Tom Lane wrote:
    Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
    FWIW - I removed --with-tcl from quagga's configuration about two weeks
    ago and it has not failed(for that reason) again. So the issue most
    definitly looks like plptcl related ...
    It sorta looks like Tcl might be installing an atexit() callback that is
    doing the Wrong Thing somehow. What Tcl version do you have installed
    exactly, and with what configure options? (The contents of tclConfig.sh
    should be reasonably complete info.)
    version is 8.4.12-1.1(current debian etch package) and tclConfig.sh is at:

    http://www.kaltenbrunner.cc/files/tclConfig.sh.txt


    Stefan
  • Tom Lane at Jan 28, 2007 at 9:07 pm

    Stefan Kaltenbrunner writes:
    FWIW - I removed --with-tcl from quagga's configuration about two weeks
    ago and it has not failed(for that reason) again. So the issue most
    definitly looks like plptcl related ...
    I poked around a bit in the Tcl sources, and as far as I can tell it
    will never call atexit() on a Linux platform. So that theory isn't
    right. However, what I found out via strace'ing is that on my own
    Linux machine, Tcl launches a second thread as soon as it's loaded.
    This thread just does a select() and sits there. I am not certain
    but I think it must be the "notifier" facility seen in the Tcl sources.

    What seems likely to me is that the "child process" we see in your ps
    output is actually this notifier thread (can you confirm that it's a
    thread via additional ps arguments?), and that for some reason the
    shutdown of the notifier isn't working right, or maybe the notifier
    is dying for some reason and then the main thread can't exit because
    it expects the notifier to respond.

    Can you "strace -f" the Tcl test and see what happens? The
    way I did it was

    createdb pl_regression
    createlang pltcl pl_regression
    psql pl_regression

    -- in another window, identify PID of connected backend, then
    strace -f -p PID 2>strace.out

    psql> \i .../pltcl_setup.sql
    psql> \i .../pltcl_queries.sql
    psql> \q

    One possibility for fixing it is that maybe we should be making an
    effort to execute Tcl_Finalize() before exiting the backend. If so,
    having pltcl set up an on_proc_exit callback to do it would be the
    appropriate thing. This is all speculation though.

    regards, tom lane
  • Tom Lane at Jan 28, 2007 at 9:49 pm

    I wrote:
    One possibility for fixing it is that maybe we should be making an
    effort to execute Tcl_Finalize() before exiting the backend. If so,
    having pltcl set up an on_proc_exit callback to do it would be the
    appropriate thing. This is all speculation though.
    Just for grins I tried this, and I can see by strace'ing that it changes
    the process-shutdown-time behavior quite a lot: the secondary thread now
    exits first, apparently after being told to via a message from the
    primary. So I think this might indeed be something good to do. Would
    you try the attached patch and see if it changes the behavior on your
    systems? (This patch is very ugly and will draw compiler warnings, but
    don't worry about that yet.)

    regards, tom lane

    Index: pltcl.c
    ===================================================================
    RCS file: /cvsroot/pgsql/src/pl/tcl/pltcl.c,v
    retrieving revision 1.108
    diff -c -r1.108 pltcl.c
    *** pltcl.c 4 Oct 2006 00:30:14 -0000 1.108
    --- pltcl.c 28 Jan 2007 21:45:40 -0000
    ***************
    *** 26,31 ****
    --- 26,32 ----
    #include "fmgr.h"
    #include "nodes/makefuncs.h"
    #include "parser/parse_type.h"
    + #include "storage/ipc.h"
    #include "tcop/tcopprot.h"
    #include "utils/builtins.h"
    #include "utils/lsyscache.h"
    ***************
    *** 245,250 ****
    --- 246,253 ----
    ************************************************************/
    if (!pltcl_be_init_done)
    {
    + on_proc_exit(Tcl_Finalize, 0);
    +
    if (SPI_connect() != SPI_OK_CONNECT)
    elog(ERROR, "SPI_connect failed");
    pltcl_init_load_unknown(pltcl_norm_interp);
  • Stefan Kaltenbrunner at Jan 29, 2007 at 6:23 pm

    Tom Lane wrote:
    I wrote:
    One possibility for fixing it is that maybe we should be making an
    effort to execute Tcl_Finalize() before exiting the backend. If so,
    having pltcl set up an on_proc_exit callback to do it would be the
    appropriate thing. This is all speculation though.
    Just for grins I tried this, and I can see by strace'ing that it changes
    the process-shutdown-time behavior quite a lot: the secondary thread now
    exits first, apparently after being told to via a message from the
    primary. So I think this might indeed be something good to do. Would
    you try the attached patch and see if it changes the behavior on your
    systems? (This patch is very ugly and will draw compiler warnings, but
    don't worry about that yet.)
    this patch definitly changes behaviour but not actually for the better :-(

    after running the tcl regression tests and exiting psql I'm left with
    two(!) backends(and I'm unable to stop the postmaster short of using -m
    immedidate):


    UID PID PPID LWP C NLWP STIME TTY TIME CMD
    1000 7191 1 7191 0 1 19:02 pts/0 00:00:03
    /home/mastermind/pginst/bin/postgres -D /home/mastermind/data
    1000 7202 7191 7202 0 1 19:02 ? 00:00:00 postgres:
    writer process
    1000 7203 7191 7203 0 1 19:02 ? 00:00:00 postgres:
    stats collector process
    1000 7235 7191 7235 0 1 19:06 ? 00:00:01 postgres:
    mastermind pl_regression [local] idle
    1000 7267 7235 7267 0 1 19:08 ? 00:00:00 postgres:
    mastermind pl_regression [local] idle

    tracefile for 7235:

    http://www.kaltenbrunner.cc/files/strace2.out

    backtrace for 7235:

    (gdb) bt
    #0 0x41f04d80 in __pthread_sigsuspend () from /lib/libpthread.so.0
    #1 0x41f03a7c in __pthread_wait_for_restart_signal () from
    /lib/libpthread.so.0
    #2 0x41f05d18 in pthread_key_delete () from /lib/libpthread.so.0
    #3 0x41edbe1c in TclpFinalizeThreadDataKey () from /usr/lib/libtcl8.4.so.0
    #4 0x41ec96dc in TclFinalizeSynchronization () from /usr/lib/libtcl8.4.so.0
    #5 0x41e92040 in Tcl_Finalize () from /usr/lib/libtcl8.4.so.0
    #6 0x001ab204 in proc_exit (code=0) at ipc.c:109
    #7 0x001be3a8 in PostgresMain (argc=1074783216, argv=<value optimized
    out>, username=0x0) at postgres.c:3638
    #8 0x0018fb98 in ServerLoop () at postmaster.c:2953
    #9 0x00190898 in PostmasterMain (argc=3, argv=0x36e5c0) at postmaster.c:963
    #10 0x001468b0 in main (argc=3, argv=<value optimized out>) at main.c:188


    Stefan
  • Alvaro Herrera at Jan 29, 2007 at 6:54 pm

    Stefan Kaltenbrunner wrote:

    backtrace for 7235:

    (gdb) bt
    Please do this in GDB:

    thread apply all bt

    (or maybe it is
    threads apply all bt)

    This'll give you backtraces for all threads in the process.

    --
    Alvaro Herrera http://www.CommandPrompt.com/
    PostgreSQL Replication, Consulting, Custom Development, 24x7 support
  • Stefan Kaltenbrunner at Jan 29, 2007 at 7:00 pm

    Alvaro Herrera wrote:
    Stefan Kaltenbrunner wrote:
    backtrace for 7235:

    (gdb) bt
    Please do this in GDB:

    thread apply all bt

    (or maybe it is
    threads apply all bt)

    This'll give you backtraces for all threads in the process.
    sorry forgot to mention that - the backtrace for the other one is an
    endless loop of:

    Thread 2 (Thread 32769 (LWP 7267)):
    #0 0x4018062c in poll () from /lib/libc.so.6
    #1 0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0
    #2 0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0
    #3 0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0
    #4 0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0
    #5 0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0
    #6 0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0
    #7 0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0
    #8 0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0
    #9 0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0
    #10 0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0
    #11 0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0
    #12 0x41f023b4 in __pthread_manager () from /lib/libpthread.so.0


    Stefan
  • Tom Lane at Jan 29, 2007 at 7:00 pm

    Stefan Kaltenbrunner writes:
    this patch definitly changes behaviour but not actually for the better :-(
    Oh well, it was worth a try. At this point I think we have to suppose
    this is a Tcl bug and not our fault. Can you reproduce the problem in
    bare "tclsh"? Try

    $ tclsh
    % interp create
    interp0
    % interp create -safe
    interp1
    % exit
    $

    If tclsh doesn't quit when told then it's easy to file (but you might
    want to try the latest tcl version first --- they're up to 8.4.14)

    regards, tom lane
  • Stefan Kaltenbrunner at Jan 29, 2007 at 7:14 pm

    Tom Lane wrote:
    Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
    this patch definitly changes behaviour but not actually for the better :-(
    Oh well, it was worth a try. At this point I think we have to suppose
    this is a Tcl bug and not our fault. Can you reproduce the problem in
    bare "tclsh"? Try

    $ tclsh
    % interp create
    interp0
    % interp create -safe
    interp1
    % exit
    $

    If tclsh doesn't quit when told then it's easy to file (but you might
    want to try the latest tcl version first --- they're up to 8.4.14)
    hmm the above sequence just works fine on that box - will experiment
    with a more recent version as time permits (there is no other version
    available in the debian repo right now which makes this a bit more
    difficult).

    Stefan

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
grouppgsql-hackers @
categoriespostgresql
postedJan 8, '07 at 7:39p
activeJan 29, '07 at 7:14p
posts13
users3
websitepostgresql.org...
irc#postgresql

People

Translate

site design / logo © 2022 Grokbase