This might be the last bug from my concurrent-vacuum-full testing --- at
least, I have no remaining unexplained events from about two full days
of running the tests. The ones that are left involve backends randomly
failing like this:

psql: FATAL: could not read block 0 in file "base/130532080/130545668": read only 0 of 8192 bytes

usually during startup, though I have one example of a backend being
repeatedly unable to access pg_proc due to similar errors.

I believe what this traces to is stale relfilenode information taken
from relcache init files, which contain precomputed relcache data
intended to speed up backend startup. There is a curious little dance
between write_relcache_init_file and RelationCacheInitFileInvalidate
that is intended to ensure that when a process creates a new relcache
init file that is already stale (because someone else invalidated the
information concurrently), the bad file will get unlinked and not used.
I think I invented that logic, so it's my fault that it doesn't work.

It works fine as long as you consider only the two processes directly
involved; but a third process can get fooled into using stale data.
The scenario requires two successive invalidations, as for example from
vacuum full on two system catalogs in a row, plus a stream of incoming
new backends. It goes like this:

1. Process A vacuums a system catalog, unlinks init file, sends sinval
messages, does second unlink (which does nothing).

2. Process B starts, observes lack of init file, begins to construct
a new one. It gets to the end of AcceptInvalidationMessages in
write_relcache_init_file. Since it started after A sent the first
sinvals, it sees no incoming sinval messages and has no reason to think
its new init file isn't good.

3. Meanwhile, Process A vacuums another system catalog, unlinks init
file (doing nothing), and finally sends its sinval messages just after
B looked for them. Now it will block trying to get RelCacheInitLock,
which B holds.

4. Process B renames its already-stale init file into place, then
releases RelCacheInitLock.

5. Process A gets the lock and removes the stale init file.

Now process B is okay, because it will see A's second sinvals before it
tries to make any use of the relcache data it has. And the stale init
file is definitely gone after step 5.

However, between steps 4 and 5 there is a window for Process C to start,
read the stale init file, and attempt to use it. Since C started after
A's second set of sinval messages, it doesn't see them and doesn't know
it has stale data.
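The five-step interleaving above can be replayed as a small deterministic script. This is a sketch, not PostgreSQL code: `sinval_log`, `catalog_version`, and `init_file` are illustrative stand-ins for the sinval queue, the true relfilenode state, and the on-disk init file.

```python
sinval_log = []          # global append-only stream of invalidation messages
catalog_version = 0      # "truth": current relfilenode state
init_file = None         # contents of the on-disk init file, or None

class Backend:
    def __init__(self):
        # a new backend joins the sinval ring before looking for the init file
        self.read_upto = len(sinval_log)

    def catch_up(self):
        # AcceptInvalidationMessages: read everything sent since we joined
        msgs = sinval_log[self.read_upto:]
        self.read_upto = len(sinval_log)
        return msgs

# 1. A vacuums a catalog: new truth, unlink init file, send sinval
catalog_version = 1
init_file = None
sinval_log.append("inval-1")

# 2. B starts after A's first sinval, sees no init file, builds a new one
B = Backend()
b_snapshot = catalog_version      # currently correct...
assert B.catch_up() == []         # ...and B sees no reason to distrust it

# 3. A vacuums again, just after B's final inval check
catalog_version = 2
sinval_log.append("inval-2")      # A now blocks on RelCacheInitLock (B holds it)

# 4. B renames its now-stale file into place and releases the lock
init_file = b_snapshot

# Between steps 4 and 5: C starts, joining the ring after inval-2
C = Backend()
assert C.catch_up() == []                 # C sees nothing wrong...
assert init_file != catalog_version       # ...but the file it reads is stale

# B itself is fine: it will see inval-2 before using its relcache data,
# and A's step 5 will eventually unlink the stale file -- too late for C.
assert B.catch_up() == ["inval-2"]
```

The key point the assertions capture is that C's membership in the sinval ring postdates both of A's messages, so no amount of catching up can tell C that the file it just read is bad.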

As far as I can see at the moment, the only way to make this bulletproof
is to turn both creation and deletion of the init file into atomic
operations that include sinval messaging. What I have in mind is

Creator: must take RelCacheInitLock, check for incoming invals, rename
the new file into place if none, release RelCacheInitLock. (This is
the same as what it does now.)

Destroyer: must take RelCacheInitLock, unlink the init file, send its
sinvals, release RelCacheInitLock.

This guarantees that we serialize the sending of the sinval messages
so that anyone who sees a bad init file in place *must* see the sinval
messages afterwards, so long as they join the sinval messaging ring
before looking for the init file (which they do). I don't think it's
any worse than the current scheme from a parallelism point of view: the
destroyer is holding RelCacheInitLock a bit longer than before, but that
should not be a performance critical situation.
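Continuing the same illustrative simulation, here is a sketch of the proposed protocol (again with made-up names, not PostgreSQL's actual code): the destroyer's unlink-plus-sinval becomes one critical section, so it can no longer interleave with the creator's check-and-rename.

```python
import threading

rel_cache_init_lock = threading.Lock()
sinval_log = []          # global invalidation stream, as before
init_file = None

def destroy_init_file(msg):
    # Destroyer: unlink and send sinvals as one atomic unit under the lock
    global init_file
    with rel_cache_init_lock:
        init_file = None            # unlink
        sinval_log.append(msg)      # sinval sent *before* releasing the lock

def create_init_file(snapshot, joined_at):
    # Creator: check for incoming invals and rename into place, under the
    # same lock (this half is unchanged from the current scheme)
    global init_file
    with rel_cache_init_lock:
        if len(sinval_log) > joined_at:
            return False            # invals arrived: discard our new file
        init_file = snapshot        # rename into place
        return True

# Replay the problematic interleaving from steps 2-4: B has joined the
# ring and built a file; A's second invalidation now runs to completion
# before B can take the lock (the other serialization order is symmetric).
joined_at = len(sinval_log)                     # B joins the sinval ring
destroy_init_file("inval-2")                    # A: unlink + sinval, atomically
installed = create_init_file(1, joined_at)      # B tries to install its file
assert installed is False                       # B discards the stale file
assert init_file is None                        # nothing stale for C to read
```

Under the lock, only two orders are possible: either A's sinval lands before B's check (B discards its file, as above), or B installs first and A's unlink then removes the file before A's sinvals go out. In both orders, a file that is actually in place when C looks cannot predate sinvals C missed.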

Anybody see a hole in that?

regards, tom lane


  • Robert Haas at Aug 16, 2011 at 7:17 pm

    On Mon, Aug 15, 2011 at 9:01 PM, Tom Lane wrote:
    Anybody see a hole in that?

    I don't.  Seems more robust than the old way.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company

Discussion Overview
group: pgsql-hackers; posted Aug 16, '11 at 1:01a; active Aug 16, '11 at 7:17p; 2 posts, 2 users (Tom Lane: 1 post, Robert Haas: 1 post)