The 8.1 build for cuckoo is currently hung, with the *postmaster* taking
all the CPU it can get. The build started almost 5 hours ago.

The postmaster is stuck in the following loop, according to
ktrace/kdump:

2023 postgres RET write 59/0x3b
2023 postgres CALL close(0xffffffff)
2023 postgres RET close -1 errno 9 Bad file descriptor
2023 postgres CALL sigprocmask(0x3,0x2e6400,0)
2023 postgres RET sigprocmask 0
2023 postgres CALL select(0x8,0xbfffe194,0,0,0xbfffe16c)
2023 postgres RET select 1
2023 postgres CALL sigprocmask(0x3,0x2f0d38,0)
2023 postgres RET sigprocmask 0
2023 postgres CALL accept(0x7,0x200148c,0x200150c)
2023 postgres RET accept -1 errno 24 Too many open files
2023 postgres CALL write(0x2,0x2003928,0x3b)
2023 postgres GIO fd 2 wrote 59 bytes
"LOG: could not accept new connection: Too many open files
"
2023 postgres RET write 59/0x3b
2023 postgres CALL close(0xffffffff)
2023 postgres RET close -1 errno 9 Bad file descriptor
2023 postgres CALL sigprocmask(0x3,0x2e6400,0)
2023 postgres RET sigprocmask 0
2023 postgres CALL select(0x8,0xbfffe194,0,0,0xbfffe16c)
2023 postgres RET select 1
2023 postgres CALL sigprocmask(0x3,0x2f0d38,0)
2023 postgres RET sigprocmask 0
2023 postgres CALL accept(0x7,0x200148c,0x200150c)
2023 postgres RET accept -1 errno 24 Too many open files
2023 postgres CALL write(0x2,0x200381c,0x3b)
2023 postgres GIO fd 2 wrote 59 bytes
"LOG: could not accept new connection: Too many open files
"
2023 postgres RET write 59/0x3b

ulimit is set to 1224 open files, though I seem to keep bumping into that
(anyone know what the system-level limit is, or how to change it?)

Is there other useful info to be had about this process, or should I just kill
it?
--
Jim Nasby jim@nasby.net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)

  • Tom Lane at Feb 13, 2007 at 6:15 pm

    "Jim C. Nasby" <jim@nasby.net> writes:
    The postmaster is stuck in the following loop, according to
    ktrace/kdump:
    2023 postgres CALL select(0x8,0xbfffe194,0,0,0xbfffe16c)
    2023 postgres RET select 1
    2023 postgres CALL sigprocmask(0x3,0x2f0d38,0)
    2023 postgres RET sigprocmask 0
    2023 postgres CALL accept(0x7,0x200148c,0x200150c)
    2023 postgres RET accept -1 errno 24 Too many open files
    2023 postgres CALL write(0x2,0x2003928,0x3b)
    2023 postgres GIO fd 2 wrote 59 bytes
    "LOG: could not accept new connection: Too many open files
    "
    2023 postgres RET write 59/0x3b
    2023 postgres CALL close(0xffffffff)
    2023 postgres RET close -1 errno 9 Bad file descriptor
    2023 postgres CALL sigprocmask(0x3,0x2e6400,0)
    2023 postgres RET sigprocmask 0
    2023 postgres CALL select(0x8,0xbfffe194,0,0,0xbfffe16c)
    2023 postgres RET select 1
    Interesting. So accept() fails because it can't allocate an FD, which
    means that the select condition isn't cleared, so we keep retrying
    forever. I don't see what else we could do though. Having the
    postmaster abort on what might well be a transient condition doesn't
    sound like a hot idea. We could possibly sleep() a bit before retrying,
    just to not suck 100% CPU, but that doesn't really *fix* anything ...
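
    To sketch the shape of the loop (illustrative C only, not the actual
    ServerLoop code; listen_sock stands in for the real listen socket):
    the pending connection is never accepted, so select() reports the
    socket readable again at once, and without a pause the loop spins at
    full CPU:

        #include <errno.h>
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>
        #include <sys/select.h>
        #include <sys/socket.h>

        static void
        server_loop(int listen_sock)
        {
            for (;;)
            {
                fd_set      rmask;
                int         newsock;

                FD_ZERO(&rmask);
                FD_SET(listen_sock, &rmask);

                if (select(listen_sock + 1, &rmask, NULL, NULL, NULL) < 0)
                    continue;           /* e.g. EINTR: just retry */

                newsock = accept(listen_sock, NULL, NULL);
                if (newsock < 0)
                {
                    /* EMFILE/ENFILE: no free FD, so the pending connection
                     * stays queued and select() fires again immediately */
                    fprintf(stderr, "LOG:  could not accept new connection: %s\n",
                            strerror(errno));
                    usleep(100000);     /* the proposed 100msec pause */
                    continue;
                }

                /* hand the connection off; the real postmaster forks a backend here */
                close(newsock);
            }
        }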

    I've been meaning to bug you about increasing cuckoo's FD limit anyway;
    it keeps failing in the regression tests.
    ulimit is set to 1224 open files, though I seem to keep bumping into that
    (anyone know what the system-level limit is, or how to change it?)
    On my OS X machine, "ulimit -n unlimited" seems to set the limit to
    10240 (or so a subsequent ulimit -a reports). But you could probably
    fix it using the buildfarm parameter that cuts the number of concurrent
    regression test runs.
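
    If it helps, here's a rough standalone C sketch (just the standard
    APIs, nothing PostgreSQL-specific) of reading the per-process limit
    and, on OS X, the kernel-wide kern.maxfiles ceiling that
    "sysctl -w kern.maxfiles=N" adjusts:

        #include <stdio.h>
        #include <sys/resource.h>
        #include <sys/sysctl.h>

        /* print the open-file limits in play; illustrative only */
        int
        main(void)
        {
            struct rlimit rl;
            int         maxfiles;
            size_t      len = sizeof(maxfiles);

            /* per-process limit, i.e. what "ulimit -n" reports */
            if (getrlimit(RLIMIT_NOFILE, &rl) == 0)
                printf("per-process: soft=%llu hard=%llu\n",
                       (unsigned long long) rl.rlim_cur,
                       (unsigned long long) rl.rlim_max);

            /* kernel-wide ceiling on OS X (Darwin) */
            if (sysctlbyname("kern.maxfiles", &maxfiles, &len, NULL, 0) == 0)
                printf("kern.maxfiles = %d\n", maxfiles);

            return 0;
        }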

    regards, tom lane
  • Jim Nasby at Feb 13, 2007 at 6:59 pm

    On Feb 13, 2007, at 12:15 PM, Tom Lane wrote:
    Interesting. So accept() fails because it can't allocate an FD, which
    means that the select condition isn't cleared, so we keep retrying
    forever. I don't see what else we could do though. Having the
    postmaster abort on what might well be a transient condition doesn't
    sound like a hot idea. We could possibly sleep() a bit before retrying,
    just to not suck 100% CPU, but that doesn't really *fix* anything ...
    Well, not only that, but the machine is currently writing to the
    postmaster log at the rate of 2-3MB/s. ISTM some kind of sleep
    (perhaps growing exponentially to some limit) would be a good idea.
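
    Roughly what I'm picturing, as a sketch only (the names and numbers
    are made up, not proposed code): double the pause after each
    consecutive accept() failure up to some cap, and drop back to no
    delay as soon as an accept() succeeds:

        #include <unistd.h>

        static unsigned accept_fail_delay_us = 0;

        static void
        accept_failed_delay(void)
        {
            if (accept_fail_delay_us == 0)
                accept_fail_delay_us = 100 * 1000;      /* start at 100 msec */
            else
            {
                accept_fail_delay_us *= 2;              /* grow exponentially... */
                if (accept_fail_delay_us > 1000 * 1000)
                    accept_fail_delay_us = 1000 * 1000; /* ...capped at 1 second */
            }
            usleep(accept_fail_delay_us);
        }

        static void
        accept_succeeded(void)
        {
            accept_fail_delay_us = 0;                   /* reset on success */
        }
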
    I've been meaning to bug you about increasing cuckoo's FD limit anyway;
    it keeps failing in the regression tests.
    ulimit is set to 1224 open files, though I seem to keep bumping into that
    (anyone know what the system-level limit is, or how to change it?)
    On my OS X machine, "ulimit -n unlimited" seems to set the limit to
    10240 (or so a subsequent ulimit -a reports). But you could probably
    fix it using the buildfarm parameter that cuts the number of concurrent
    regression test runs.
    Odd... that works on my MBP (sudo bash; ulimit -n unlimited) and I
    get 12288. But the same thing doesn't work on cuckoo, which is a G4;
    the limit stays at 1224 no matter what. Perhaps because I'm setting
    maxfiles in launchd.conf.

    In any case, I've upped it to a bit over 2k; we'll see what that
    does. I find it interesting that aubrac isn't affected by this, since
    it's still running with the default of only 256 open files.

    I'm thinking we might want to change the default value for
    max_files_per_process on OS X, or have initdb test it like it does
    for other things.
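
    For reference, the knob in question lives in postgresql.conf; the
    value below is only illustrative, not a tested recommendation:

        # max_files_per_process defaults to 1000, which is optimistic on
        # an OS X box whose per-process FD limit is only 256
        max_files_per_process = 200
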
    --
    Jim Nasby jim@nasby.net
    EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
  • Tom Lane at Feb 13, 2007 at 7:21 pm

    Jim Nasby writes:
    On Feb 13, 2007, at 12:15 PM, Tom Lane wrote:
    We could possibly sleep() a bit before retrying,
    just to not suck 100% CPU, but that doesn't really *fix* anything ...
    Well, not only that, but the machine is currently writing to the
    postmaster log at the rate of 2-3MB/s. ISTM some kind of sleep
    (perhaps growing exponentially to some limit) would be a good idea.
    Well, since the code has always behaved that way and no one noticed
    before, I don't think it's worth anything as complicated as a variable
    delay. I just stuck a fixed 100msec delay into the accept-failed code
    path.

    regards, tom lane
  • Alvaro Herrera at Feb 13, 2007 at 7:59 pm

    Tom Lane wrote:
    Jim Nasby <jim@nasby.net> writes:
    On Feb 13, 2007, at 12:15 PM, Tom Lane wrote:
    We could possibly sleep() a bit before retrying,
    just to not suck 100% CPU, but that doesn't really *fix* anything ...
    Well, not only that, but the machine is currently writing to the
    postmaster log at the rate of 2-3MB/s. ISTM some kind of sleep
    (perhaps growing exponentially to some limit) would be a good idea.
    Well, since the code has always behaved that way and no one noticed
    before, I don't think it's worth anything as complicated as a variable
    delay. I just stuck a fixed 100msec delay into the accept-failed code
    path.
    Seems worth mentioning that bgwriter sleeps 1 sec in case of failure.
    (And so does the autovac code I'm currently looking at).

    --
    Alvaro Herrera http://www.CommandPrompt.com/
    PostgreSQL Replication, Consulting, Custom Development, 24x7 support
  • Andrew Dunstan at Feb 13, 2007 at 8:05 pm

    Alvaro Herrera wrote:
    Tom Lane wrote:
    Jim Nasby <jim@nasby.net> writes:
    On Feb 13, 2007, at 12:15 PM, Tom Lane wrote:

    We could possibly sleep() a bit before retrying,
    just to not suck 100% CPU, but that doesn't really *fix* anything ...
    Well, not only that, but the machine is currently writing to the
    postmaster log at the rate of 2-3MB/s. ISTM some kind of sleep
    (perhaps growing exponentially to some limit) would be a good idea.
    Well, since the code has always behaved that way and no one noticed
    before, I don't think it's worth anything as complicated as a variable
    delay. I just stuck a fixed 100msec delay into the accept-failed code
    path.
    Seems worth mentioning that bgwriter sleeps 1 sec in case of failure.
    (And so does the autovac code I'm currently looking at).
    There is probably a good case for a shorter delay in postmaster, though.

    cheers

    andrew
  • Tom Lane at Feb 13, 2007 at 8:08 pm

    Andrew Dunstan writes:
    Alvaro Herrera wrote:
    Tom Lane wrote:
    delay. I just stuck a fixed 100msec delay into the accept-failed code
    path.
    Seems worth mentioning that bgwriter sleeps 1 sec in case of failure.
    (And so does the autovac code I'm currently looking at).
    There is probably a good case for a shorter delay in postmaster, though.
    Yeah, that's what I thought. We don't really care if either bgwriter or
    autovac goes AWOL for a little while, but if the postmaster's asleep
    then nobody can connect.

    regards, tom lane

Discussion Overview
group: pgsql-hackers
categories: postgresql
posted: Feb 13, '07 at 5:25p
active: Feb 13, '07 at 8:08p
posts: 7
users: 4
website: postgresql.org...
irc: #postgresql
