I've been testing the problem reported by Dave Gould by running "make
installcheck-parallel" together with a tight loop of "vacuum full
pg_class":

while true; do psql -c "vacuum full pg_class" regression; usleep 100000; done

Even after fixing the cache-reset-recovery-order problem I described
yesterday, there are still occasional bizarre failures, for example
in this bit of truncate.sql:

BEGIN;
SELECT * FROM trunc_f;
TRUNCATE trunc_f;
SELECT * FROM trunc_f;
ROLLBACK;

BEGIN;
SELECT * FROM trunc_f;
TRUNCATE ONLY trunc_f;
ERROR: attempted to update invisible tuple

I'm not sure that there's only one bug remaining here, but I've
identified this bug at least. In the first transaction above, the
TRUNCATE has to update trunc_f's pg_class tuple with a new relfilenode
value. Say the update invalidates the tuple at tid A and inserts the
updated version at tid B. Then vacuum full gets control and rewrites
pg_class. It correctly keeps both versions of the trunc_f tuple, but
now they will be at new tid locations, say C and D. The original
transaction meanwhile was blocked trying to re-open pg_class to reload
trunc_f's pg_class tuple into its catcaches. Now it successfully does
that, and the tuple it reads has tid D. It finishes out the
transaction, and rolls back as per script. In the rollback, we know
that we have to flush catcache entries for tuple versions inserted by
the failed transaction ... but what inval.c has stored is an inval event
against pg_class tid B. There is no such entry in the catcaches, so
nothing happens, and the entry with tid D stays valid. Ooops.
The relcache entry for trunc_f does get rebuilt correctly (without
consulting the bogus catcache entry), so the next SELECT goes through
all right. But when the second TRUNCATE wants to again update the
pg_class tuple, it finds it in the catcaches ... and what it finds has
tid D, so it attempts to heap_update that TID instead of the actually
live tuple at tid C.
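
To make the failure sequence concrete, here is a tiny standalone C sketch (an illustration only, not PostgreSQL code; every name in it is invented). It models a single cache entry the way catcache invalidation targets entries today, by TID, and shows that an entry reloaded from the new tid D survives an inval event recorded against the obsolete tid B, whereas matching on the cache key's hash value (approach #2 below) would flush it:

/* Toy illustration only -- not PostgreSQL code; all names are invented. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct
{
    uint32_t    hash_value;     /* hash of the cache lookup key */
    int         tid;            /* heap TID the entry was loaded from */
    bool        valid;
} ToyCacheEntry;

/* Today's rule: flush the entry only if its TID matches the inval event. */
static void
inval_by_tid(ToyCacheEntry *e, int event_tid)
{
    if (e->valid && e->tid == event_tid)
        e->valid = false;
}

/* Alternative discussed below: flush whenever the key hash matches. */
static void
inval_by_hash(ToyCacheEntry *e, uint32_t event_hash)
{
    if (e->valid && e->hash_value == event_hash)
        e->valid = false;
}

int
main(void)
{
    uint32_t    key_hash = 0xDEADBEEF;      /* stand-in for trunc_f's key hash */
    int         stale_event_tid = 2;        /* abort-time inval logged against tid B */
    ToyCacheEntry entry = {key_hash, 4, true};  /* entry reloaded from tid D */

    inval_by_tid(&entry, stale_event_tid);
    printf("TID-based inval:  entry %s\n", entry.valid ? "survives (the bug)" : "flushed");

    inval_by_hash(&entry, key_hash);
    printf("hash-based inval: entry %s\n", entry.valid ? "survives" : "flushed");
    return 0;
}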

In this particular case, the incorrectly targeted tuple is the
just-invalidated version of trunc_f, so heap_update sees it as
XMIN_INVALID and fails with "attempted to update invisible tuple".

The potential consequences are hugely worse than that, though: with a
bit more time between the two operations, it'd be possible for someone
else to reclaim the dead tuple and replace it with something else.
As long as the TID is live when we get to it, heap_update will blindly
replace it, whether or not it has anything to do with the tuple we think
we're replacing. I suspect that some of the other bizarre cases I'm
seeing are artifacts of that type of outcome, but they're hard to
reproduce so I've not been able to pin them down yet.

In principle this same type of failure could have occurred before 9.0,
since catcache inval has always been TID-based. It turns out we were
saved by the fact that the old VACUUM FULL implementation would chicken
out and not do any tuple moving if it found any INSERT_IN_PROGRESS or
DELETE_IN_PROGRESS tuples. The pre-9.0 versions of CLUSTER reject such
tuples too, so I don't think we need to do anything in those branches.

But in 9.0 and up, we have a problem. So far I've thought of two possible
avenues to fix it:

1. When a SHAREDINVALCATALOG_ID inval message comes in (telling us a VAC
FULL or CLUSTER just finished on a system catalog), enter that message
into the transaction's local inval list as well as executing it
immediately. On transaction abort, redo the resulting cache flushes a
second time. This would ensure we flushed any bad entries even though
they were logged with obsolete TIDs elsewhere in the inval list.
(We might need to do this on commit too, though right now I don't think
so. Also, a cache reset event would a fortiori have to queue a similar
entry to blow things away a second time at transaction end.)

2. Forget about targeting catcache invals by TID, and instead just use the
key hash value to determine which cache entries to drop.

Approach #2 seems a lot less invasive and more trustworthy, but it has the
disadvantage that cache invals would become more likely to blow away
entries unnecessarily (because of chance hashvalue collisions), even
without any VACUUM FULL being done. If we could make approach #1 work
reliably, it would result in more overhead during VACUUM FULL but less at
other times --- or at least we could hope so. In an environment where
lots of sinval overflows and consequent resets happen, we might come out
behind due to doubling the number of catcache flushes forced by a reset
event.

Right at the moment I'm leaning to approach #2. I wonder if anyone
sees it differently, or has an idea for a third approach?

BTW, going forward it might be interesting to think about invalidating
the catcaches based on xmin/xmax rather than specific TIDs. That would
reduce the sinval traffic to one message per transaction that had
affected the catalogs, instead of one per TID. But that clearly is not
going to lead to something we'd dare back-patch.

regards, tom lane

  • Robert Haas at Aug 12, 2011 at 6:50 pm

    On Fri, Aug 12, 2011 at 2:09 PM, Tom Lane wrote:
    But in 9.0 and up, we have a problem.  So far I've thought of two possible
    avenues to fix it:

    1. When a SHAREDINVALCATALOG_ID inval message comes in (telling us a VAC
    FULL or CLUSTER just finished on a system catalog), enter that message
    into the transaction's local inval list as well as executing it
    immediately.  On transaction abort, redo the resulting cache flushes a
    second time.  This would ensure we flushed any bad entries even though
    they were logged with obsolete TIDs elsewhere in the inval list.
    (We might need to do this on commit too, though right now I don't think
    so.  Also, a cache reset event would a fortiori have to queue a similar
    entry to blow things away a second time at transaction end.)

    2. Forget about targeting catcache invals by TID, and instead just use the
    key hash value to determine which cache entries to drop.

    Approach #2 seems a lot less invasive and more trustworthy, but it has the
    disadvantage that cache invals would become more likely to blow away
    entries unnecessarily (because of chance hashvalue collisions), even
    without any VACUUM FULL being done.  If we could make approach #1 work
    reliably, it would result in more overhead during VACUUM FULL but less at
    other times --- or at least we could hope so.  In an environment where
    lots of sinval overflows and consequent resets happen, we might come out
    behind due to doubling the number of catcache flushes forced by a reset
    event.

    Right at the moment I'm leaning to approach #2.  I wonder if anyone
    sees it differently, or has an idea for a third approach?

    I don't think it really matters whether we occasionally blow away an
    entry unnecessarily due to a hash-value collision. IIUC, we'd only
    need to worry about hash-value collisions between rows in the same
    catalog; and the number of entries that we have cached had better be
    many orders of magnitude less than 2^32. If the cache is large enough
    that we're having hash value collisions more than once in a great
    while, we probably should have flushed some entries out of it a whole
    lot sooner and a whole lot more aggressively, because we're likely
    eating memory like crazy.
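
    As a rough back-of-the-envelope check on that point (the numbers below are
    illustrative, not from the thread itself): by the birthday approximation,
    n cached rows of one catalog yield about n(n-1)/2^33 expected colliding
    32-bit hash pairs, so collisions stay negligible until a single catalog's
    cache holds on the order of a hundred thousand entries, and even then a
    collision only costs an occasional unnecessary flush.

    /* Back-of-envelope only: expected colliding 32-bit hash pairs among n
     * cached entries of one catalog, by the birthday approximation. */
    #include <stdio.h>

    int
    main(void)
    {
        const double buckets = 4294967296.0;    /* 2^32 possible hash values */
        const long  sizes[] = {1000, 10000, 100000};

        for (int i = 0; i < 3; i++)
        {
            double      n = (double) sizes[i];

            printf("%7ld entries: ~%.4f expected colliding pairs\n",
                   sizes[i], n * (n - 1.0) / 2.0 / buckets);
        }
        return 0;
    }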

    On the other hand, having to duplicate sinval resets seems like a bad
    thing. So +1 for #2.

    BTW, going forward it might be interesting to think about invalidating
    the catcaches based on xmin/xmax rather than specific TIDs.  That would
    reduce the sinval traffic to one message per transaction that had
    affected the catalogs, instead of one per TID.  But that clearly is not
    going to lead to something we'd dare back-patch.

    To be honest, I am not real keen on back-patching either of the
    options you list above, either. I think we should probably do
    something about the problem Dave Gould reported (to wit: sinval resets
    need to be smarter about the order they do things in), but I am not
    sure it's a good idea to back-patch what seems like a fairly radical
    overhaul of the sinval mechanism to fix a problem nobody's actually
    complained about hitting in production, especially given there's no
    certainty that this is the last bug. Perhaps we should just fix this
    one in master and consider back-patching it if and when we get some
    plausibly related bug reports.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Tom Lane at Aug 12, 2011 at 7:14 pm

    Robert Haas writes:
    On Fri, Aug 12, 2011 at 2:09 PM, Tom Lane wrote:
    2. Forget about targeting catcache invals by TID, and instead just use the
    key hash value to determine which cache entries to drop.
    Right at the moment I'm leaning to approach #2.  I wonder if anyone
    sees it differently, or has an idea for a third approach?
    To be honest, I am not real keen on back-patching either of the
    options you list above, either. I think we should probably do
    something about the problem Dave Gould reported (to wit: sinval resets
    need to be smarter about the order they do things in), but I am not
    sure it's a good idea to back-patch what seems like a fairly radical
    overhaul of the sinval mechanism to fix a problem nobody's actually
    complained about hitting in production, especially given there's no
    certainty that this is the last bug.

    What radical overhaul? I'm talking about removing one if-test in
    CatalogCacheIdInvalidate, viz

        nextelt = DLGetSucc(elt);

        if (hashValue != ct->hash_value)
            continue;           /* ignore non-matching hash values */

    -   if (ct->negative ||
    -       ItemPointerEquals(pointer, &ct->tuple.t_self))
        {
            if (ct->refcount > 0 ||
                (ct->c_list && ct->c_list->refcount > 0))
            {

    In any case, it is now clear to me that this bug is capable of eating
    peoples' databases, as in "what just happened to our most critical
    table? Uh, it's not there anymore, boss, but we seem to have duplicate
    pg_class entries for this other table". This may not have happened
    yet in the field, but that's probably only because 9.0 hasn't been in
    production that long. Doing nothing about it in the released branches
    is unacceptable IMO. People that that happens to will not be satisfied
    with a response like "you shouldn't have been using VACUUM FULL on
    system catalogs, it's buggy".

    For that matter, I'm not convinced that Gould hasn't seen some form of
    this --- he's mentioned seeing bizarre errors above and beyond the
    "could not find pg_class tuple for index 2662" one.

    As for it not being the last bug, no, it's probably not. That's a
    pretty sucky excuse for not fixing it too.

    regards, tom lane
  • Kevin Grittner at Aug 12, 2011 at 7:22 pm

    Tom Lane wrote:

    In any case, it is now clear to me that this bug is capable of
    eating peoples' databases, as in "what just happened to our most
    critical table? Uh, it's not there anymore, boss, but we seem to
    have duplicate pg_class entries for this other table".

    Based on this, I don't think we want to wait for bug reports. One
    would be too many.

    +1 for backpatching a fix.

    As for it not being the last bug, no, it's probably not. That's a
    pretty sucky excuse for not fixing it too.

    I agree.

    -Kevin
  • Robert Haas at Aug 12, 2011 at 8:04 pm

    On Fri, Aug 12, 2011 at 3:14 PM, Tom Lane wrote:
    In any case, it is now clear to me that this bug is capable of eating
    peoples' databases, as in "what just happened to our most critical
    table?  Uh, it's not there anymore, boss, but we seem to have duplicate
    pg_class entries for this other table".

    OK, I take your point. That's certainly scary enough to worry about.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Kevin Grittner at Aug 12, 2011 at 7:16 pm

    Robert Haas wrote:

    Perhaps we should just fix this one in master and consider
    back-patching it if and when we get some plausibly related bug
    reports.

    I'm not completely clear on what one would do to be vulnerable to
    hitting the bug, or what the impact of hitting it would be. Tom
    said:
    The potential consequences are hugely worse than that, though:
    with a bit more time between the two operations, it'd be possible
    for someone else to reclaim the dead tuple and replace it with
    something else. As long as the TID is live when we get to it,
    heap_update will blindly replace it, whether or not it has
    anything to do with the tuple we think we're replacing.

    What would a worst-case bug report based on hitting that look like?

    -Kevin
  • Heikki Linnakangas at Aug 12, 2011 at 7:25 pm

    On 12.08.2011 21:49, Robert Haas wrote:
    On Fri, Aug 12, 2011 at 2:09 PM, Tom Lane wrote:
    2. Forget about targeting catcache invals by TID, and instead just use the
    key hash value to determine which cache entries to drop.

    Approach #2 seems a lot less invasive and more trustworthy, but it has the
    disadvantage that cache invals would become more likely to blow away
    entries unnecessarily (because of chance hashvalue collisions), even
    without any VACUUM FULL being done. If we could make approach #1 work
    reliably, it would result in more overhead during VACUUM FULL but less at
    other times --- or at least we could hope so. In an environment where
    lots of sinval overflows and consequent resets happen, we might come out
    behind due to doubling the number of catcache flushes forced by a reset
    event.

    Right at the moment I'm leaning to approach #2. I wonder if anyone
    sees it differently, or has an idea for a third approach?
    I don't think it really matters whether we occasionally blow away an
    entry unnecessarily due to a hash-value collision. IIUC, we'd only
    need to worry about hash-value collisions between rows in the same
    catalog; and the number of entries that we have cached had better be
    many orders of magnitude less than 2^32. If the cache is large enough
    that we're having hash value collisions more than once in a great
    while, we probably should have flushed some entries out of it a whole
    lot sooner and a whole lot more aggressively, because we're likely
    eating memory like crazy.

    What would suck, though, is if you have an application that repeatedly
    creates and drops a temporary table, and the hash value for that happens
    to match some other table in the database. catcache invalidation would
    keep flushing the entry for that other table too, and you couldn't do
    anything about it except for renaming one of the tables.

    Despite that, +1 for option #2. The risk of collision seems acceptable,
    and the consequence of a collision wouldn't be too bad in most
    applications anyway.

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Tom Lane at Aug 12, 2011 at 7:41 pm

    Heikki Linnakangas writes:
    On 12.08.2011 21:49, Robert Haas wrote:
    I don't think it really matters whether we occasionally blow away an
    entry unnecessarily due to a hash-value collision. IIUC, we'd only
    need to worry about hash-value collisions between rows in the same
    catalog; and the number of entries that we have cached had better be
    many orders of magnitude less than 2^32. If the cache is large enough
    that we're having hash value collisions more than once in a great
    while, we probably should have flushed some entries out of it a whole
    lot sooner and a whole lot more aggressively, because we're likely
    eating memory like crazy.
    What would suck, though, is if you have an application that repeatedly
    creates and drops a temporary table, and the hash value for that happens
    to match some other table in the database. catcache invalidation would
    keep flushing the entry for that other table too, and you couldn't do
    anything about it except for renaming one of the tables.
    Despite that, +1 for option #2. The risk of collision seems acceptable,
    and the consequence of a collision wouldn't be too bad in most
    applications anyway.

    Yeah. Also, to my mind this is only a fix that will be used in 9.0 and
    9.1 --- now that it's occurred to me that we could use tuple xmin/xmax
    to invalidate catcaches instead of recording individual TIDs, I'm
    excited about doing that instead for 9.2 and beyond. I believe that
    that could result in a significant reduction in sinval traffic, which
    would be a considerable performance win.

    regards, tom lane
  • Tom Lane at Aug 13, 2011 at 4:18 pm

    I wrote:
    Yeah. Also, to my mind this is only a fix that will be used in 9.0 and
    9.1 --- now that it's occurred to me that we could use tuple xmin/xmax
    to invalidate catcaches instead of recording individual TIDs, I'm
    excited about doing that instead for 9.2 and beyond. I believe that
    that could result in a significant reduction in sinval traffic, which
    would be a considerable performance win.

    On closer inspection this idea doesn't seem workable. I was imagining
    that after a transaction commits, we could find obsoleted catcache
    entries by looking for tuples with xmax equal to the transaction's XID.
    But a catcache entry made before the transaction had done the update
    wouldn't contain the correct xmax, so we'd fail to remove it. The only
    apparent way to fix that would be to go out to disk and consult the
    current on-disk xmax, which would hardly be any cheaper than just
    dropping the cache entry and then reloading it when/if needed.

    All is not entirely lost, however: there's still some possible
    performance benefit to be gained here, if we go to the scheme of
    identifying victim catcache entries by hashvalue only. Currently,
    each heap_update in a cached catalog has to issue two sinval messages
    (per cache!): one against the old TID and one against the new TID.
    We'd be able to reduce that to one message in the common case where the
    hashvalue remains the same because the cache key columns didn't change.
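
    A minimal sketch of that bookkeeping, assuming hash-value-based
    invalidation (hypothetical helper, not the actual inval.c code): compare
    the old and new key hash values at update time and register a second
    flush only when the cache key columns changed.

    /* Hypothetical sketch only -- not the actual inval.c code. */
    #include <stdint.h>
    #include <stdio.h>

    /* One catcache inval message if the key columns (hence the hash) are
     * unchanged by the update, two if the key hash changed. */
    static int
    catcache_inval_messages_needed(uint32_t old_key_hash, uint32_t new_key_hash)
    {
        return (old_key_hash == new_key_hash) ? 1 : 2;
    }

    int
    main(void)
    {
        /* e.g. an update that only changes relfilenode leaves the keys alone */
        printf("key unchanged: %d message(s)\n",
               catcache_inval_messages_needed(0x1234ABCD, 0x1234ABCD));
        /* e.g. a rename changes the hash for the relname-keyed cache */
        printf("key changed:   %d message(s)\n",
               catcache_inval_messages_needed(0x1234ABCD, 0x77E0FFEE));
        return 0;
    }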

    Another thing we could consider doing, if one-in-2^32 hash collisions
    seems too high, is widening the hash values to 64 bits. I'm not
    terribly excited about that, because a lot of the caches are on OID
    columns for which there'd be zero benefit.

    regards, tom lane
  • Robert Haas at Aug 13, 2011 at 7:27 pm

    On Sat, Aug 13, 2011 at 12:18 PM, Tom Lane wrote:
    I wrote:
    Yeah.  Also, to my mind this is only a fix that will be used in 9.0 and
    9.1 --- now that it's occurred to me that we could use tuple xmin/xmax
    to invalidate catcaches instead of recording individual TIDs, I'm
    excited about doing that instead for 9.2 and beyond.  I believe that
    that could result in a significant reduction in sinval traffic, which
    would be a considerable performance win.
    On closer inspection this idea doesn't seem workable.  I was imagining
    that after a transaction commits, we could find obsoleted catcache
    entries by looking for tuples with xmax equal to the transaction's XID.
    But a catcache entry made before the transaction had done the update
    wouldn't contain the correct xmax, so we'd fail to remove it.  The only
    apparent way to fix that would be to go out to disk and consult the
    current on-disk xmax, which would hardly be any cheaper than just
    dropping the cache entry and then reloading it when/if needed.

    All is not entirely lost, however: there's still some possible
    performance benefit to be gained here, if we go to the scheme of
    identifying victim catcache entries by hashvalue only.  Currently,
    each heap_update in a cached catalog has to issue two sinval messages
    (per cache!): one against the old TID and one against the new TID.
    We'd be able to reduce that to one message in the common case where the
    hashvalue remains the same because the cache key columns didn't change.

    Cool.

    Another thing we could consider doing, if one-in-2^32 hash collisions
    seems too high, is widening the hash values to 64 bits.  I'm not
    terribly excited about that, because a lot of the caches are on OID
    columns for which there'd be zero benefit.

    Yeah, and I can't get excited about the increased memory usage, either.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Simon Riggs at Aug 12, 2011 at 8:33 pm

    On Fri, Aug 12, 2011 at 7:09 PM, Tom Lane wrote:

    Right at the moment I'm leaning to approach #2.  I wonder if anyone
    sees it differently, or has an idea for a third approach?

    You are trying to solve the problem directly, which seems overkill.

    With HOT, there is very little need to perform a VACUUM FULL on any
    shared catalog table. Look at the indexes...

    I would suggest that VACUUM FULL perform only a normal VACUUM on
    shared catalog tables, then perform an actual VACUUM FULL only in dire
    need (some simple heuristic in size and density). This avoids doing a
    VACUUM FULL unless it is actually necessary to do so. That has the
    added advantage of not locking out essential tables, which is always a
    concern.
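
    A minimal sketch of the kind of size-and-density test meant here (the
    struct and thresholds are invented for illustration, not anything
    PostgreSQL actually implements):

    /* Illustrative sketch only; the thresholds and names are invented. */
    #include <stdbool.h>
    #include <stdio.h>

    typedef struct
    {
        double      pages;          /* catalog size now, in pages */
        double      packed_pages;   /* pages needed if live rows were packed */
    } CatalogStats;

    /* Escalate to a full rewrite only when the catalog is both sizable and
     * mostly dead space; otherwise a plain (lazy) VACUUM is enough. */
    static bool
    needs_full_rewrite(const CatalogStats *s)
    {
        double      bloat = s->pages / s->packed_pages;

        return s->pages > 1000.0 && bloat > 10.0;
    }

    int
    main(void)
    {
        CatalogStats mild = {500.0, 200.0};     /* 2.5x bloat: leave it alone */
        CatalogStats severe = {10619.0, 114.0}; /* ~93x, like the pg_class
                                                 * figures reported downthread */

        printf("mild:   %s\n", needs_full_rewrite(&mild) ? "VACUUM FULL" : "plain VACUUM");
        printf("severe: %s\n", needs_full_rewrite(&severe) ? "VACUUM FULL" : "plain VACUUM");
        return 0;
    }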

    In the unlikely event we do actually have to VACUUM FULL a shared
    catalog table, nuke any cache entry for the whole shared catalog. That
    way we absolutely and positively will never get any more bugs in this
    area, ever again. Sounds harsh, but these events are only actually
    needed very, very rarely and hygiene is more important than a few
    minor points of performance.

    --
    Simon Riggs                   http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training & Services
  • Daveg at Aug 12, 2011 at 10:20 pm

    On Fri, Aug 12, 2011 at 09:26:02PM +0100, Simon Riggs wrote:
    With HOT, there is very little need to perform a VACUUM FULL on any
    shared catalog table. Look at the indexes...

    I would suggest that VACUUM FULL perform only a normal VACUUM on
    shared catalog tables, then perform an actual VACUUM FULL only in dire
    need (some simple heuristic in size and density). This avoids doing a
    VACUUM FULL unless it is actually necessary to do so. That has the
    added advantage of not locking out essential tables, which is always a
    concern.

    In the unlikely event we do actually have to VACUUM FULL a shared
    catalog table, nuke any cache entry for the whole shared catalog. That
    way we absolutely and positively will never get any more bugs in this
    area, ever again. Sounds harsh, but these events are only actually
    needed very, very rarely and hygiene is more important than a few
    minor points of performance.

    This is a very optimistic view. My client makes heavy use of temp tables.
    HOT and autovacuum are not sufficient to keep catalog bloat under control.
    We run a daily script that calculates the density of the catalog and only
    vacuum fulls those that are severely bloated. Here is a result from a
    recent bloat check on one db. 'packed' is the number of pages needed for
    the rows if they were packed, 'bloat' is the multiple of pages in use over
    the number really needed.

    relation          | tuples | pages | packed | bloat
    ------------------+--------+-------+--------+-------
    pg_class; --      |   4292 | 10619 |    114 |  93.2
    pg_depend; --     |  25666 |  7665 |    217 |  35.4
    pg_attrdef; --    |   6585 |  7595 |    236 |  32.2
    pg_type; --       |   4570 |  8177 |    416 |  19.6
    pg_shdepend; --   |  52040 |  7968 |    438 |  18.2

    -dg
    --
    David Gould daveg@sonic.net 510 536 1443 510 282 0869
    If simplicity worked, the world would be overrun with insects.
  • Tom Lane at Aug 13, 2011 at 2:11 pm

    Simon Riggs writes:
    With HOT, there is very little need to perform a VACUUM FULL on any
    shared catalog table. Look at the indexes...

    This is not really a useful argument. People do do VAC FULL on
    catalogs, whether we think they should or not. Also, it's not only
    "shared" catalogs that are at stake.

    We could avoid fixing the bug if we forbade both VAC FULL and CLUSTER
    on all system catalogs, period, no exceptions. That doesn't seem like
    a workable answer though. Some usage patterns do see bloat on the
    catalogs, especially if you disable or dial down autovacuum.

    In the unlikely event we do actually have to VACUUM FULL a shared
    catalog table, nuke any cache entry for the whole shared catalog.

    You're still not getting the point. We *do* nuke all cache entries
    after a VAC FULL. That does not avoid this bug.

    regards, tom lane
