I have been finding a lot of interesting stuff while looking into
the buffer reference count/leakage issue.
It turns out that there were two specific things that were camouflaging
the existence of bugs in this area:
1. The BufferLeakCheck routine that's run at transaction commit was
only looking for nonzero PrivateRefCount to indicate a missing unpin.
It failed to notice nonzero LastRefCount --- which meant that an
error in refcount save/restore usage could leave a buffer pinned,
and BufferLeakCheck wouldn't notice.
2. The BufferIsValid macro, which you'd think just checks whether
it's handed a valid buffer identifier or not, actually did more:
it only returned true if the buffer ID was valid *and* the buffer
had positive PrivateRefCount. That meant that the common coding pattern
	if (BufferIsValid(buf))
		ReleaseBuffer(buf);
wouldn't complain if it were handed a valid but already-unpinned buffer.
And that behavior masks bugs that result in buffers being unpinned too
early. For example, consider a sequence like
1. LockBuffer (buffer now has refcount 1). Store reference to
a tuple on that buffer page in a tuple table slot.
2. Copy buffer reference to a second tuple-table slot, but forget to
increment buffer's refcount.
3. Release second tuple table slot. Buffer refcount drops to 0,
so it's unpinned.
4. Release original tuple slot. Because of BufferIsValid behavior,
no assert happens here; in fact nothing at all happens.
This is, of course, buggy code: during the interval from 3 to 4 you
still have an apparently valid tuple reference in the original slot,
which someone might try to use; but the buffer it points to is unpinned
and could be replaced at any time by another backend.
In short, we had errors that would mask both missing-pin bugs and
missing-unpin bugs. And naturally there were a few such bugs lurking
in the code:
3. The buffer refcount save/restore stuff, which I had suspected
was useless, is not only useless but also buggy. The reason it's
buggy is that it only works if used in a nested fashion. You could
save state A, pin some buffers, save state B, pin some more
buffers, restore state B (thereby unpinning what you pinned since
the save), and finally restore state A (unpinning the earlier stuff).
What you could not do is save state A, pin, save B, pin more, then
restore state A --- that might unpin some of A's buffers, or some
of B's buffers, or some unforeseen combination thereof. If you
restore A and then restore B, you do not necessarily return to a zero-
pins state, either. And it turns out the actual usage pattern was a
nearly random sequence of saves and restores, compounded by a failure to
do all of the restores reliably (which was masked by the oversight in
BufferLeakCheck).
What I have done so far is to rip out the buffer refcount save/restore
support (including LastRefCount), change BufferIsValid to a simple
validity check (so that you get an assert if you unpin something that
was pinned), change ExecStoreTuple so that it increments the refcount
when it is handed a buffer reference (for symmetry with ExecClearTuple's
decrement of the refcount), and fix about a dozen bugs exposed by these
changes.
I am still getting Buffer Leak notices in the "misc" regression test,
specifically in the queries that invoke more than one SQL function.
What I find there is that SQL functions are not always run to
completion. Apparently, when a function can return multiple tuples,
it won't necessarily be asked to produce them all. And when it isn't,
postquel_end() isn't invoked for the function's current query, so its
tuple table isn't cleared, so we have dangling refcounts if any of the
tuples involved are in disk buffers.
It may be that the save/restore code was a misguided attempt to fix
this problem. I can't tell. But I think what we really need to do is
find some way of ensuring that Postquel function execution contexts
always get shut down by the end of the query, so that they don't leak
buffer pins.
I suppose a straightforward approach would be to keep a list of open
function contexts somewhere (attached to the outer execution context,
perhaps), and clean them up at outer-plan shutdown.
What I am wondering, though, is whether this addition is actually
necessary, or is it a bug that the functions aren't run to completion
in the first place? I don't really understand the semantics of this
"nested dot notation". I suppose it is a Berkeleyism; I can't find
anything about it in the SQL92 document. The test cases shown in the
misc regress test seem peculiar, not to say wrong. For example:
regression=> SELECT p.hobbies.equipment.name, p.hobbies.name, p.name FROM person p;
name         |name       |name
-------------+-----------+-----
peet's coffee|basketball |joe
hightops     |basketball |sally
which doesn't appear to agree with the contents of the underlying
tables:
regression=> SELECT * FROM hobbies_r;
regression=> SELECT * FROM equipment_r;
I'd have expected an output along the lines of
hightops     |basketball |joe
hightops     |basketball |sally
Is the regression test's expected output wrong, or am I misunderstanding
what this query is supposed to do? Is there any documentation anywhere
about how SQL functions returning multiple tuples are supposed to
behave?
regards, tom lane