I have been making some notes about the rules for accessing shared disk
buffers, since they aren't spelled out anywhere now AFAIK. In the process
I found what seems to be a nasty bug in the code that tries to build
btree indexes that include already-dead tuples. (If memory serves,
Hiroshi added that code awhile back to help suppress the "heap tuples
!= index tuples" complaint from VACUUM.)

Would people look this over and see if they agree with my deductions?

regards, tom lane


Notes about shared buffer access rules
--------------------------------------

There are two separate access control mechanisms for shared disk buffers:
reference counts (a/k/a pin counts) and buffer locks. (Actually, there's
a third level of access control: one must hold the appropriate kind of
lock on a relation before one can legally access any page belonging to
the relation. Relation-level locks are not discussed here.)

Pins: one must "hold a pin on" a buffer (increment its reference count)
before being allowed to do anything at all with it. An unpinned buffer is
subject to being reclaimed and reused for a different page at any instant,
so touching it is unsafe. Typically a pin is acquired by ReadBuffer and
released by WriteBuffer (if one modified the page) or ReleaseBuffer (if not).
It is OK and indeed common for a single backend to pin a page more than
once concurrently; the buffer manager handles this efficiently. It is
considered OK to hold a pin for long intervals --- for example, sequential
scans hold a pin on the current page until done processing all the tuples
on the page, which could be quite a while if the scan is the outer scan of
a join. Similarly, btree index scans hold a pin on the current index
page. This is OK because there is actually no operation that waits for a
page's pin count to drop to zero. (Anything that might need to do such a
wait is instead handled by waiting to obtain the relation-level lock,
which is why you'd better hold one first.) Pins may not be held across
transaction boundaries, however.
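
As an illustration, the pin protocol comes down to something like this
(a sketch only: touch_page is a hypothetical helper, and error handling
is omitted):

    #include "postgres.h"
    #include "storage/bufmgr.h"

    /* Hypothetical illustration of the pin protocol described above. */
    static void
    touch_page(Relation relation, BlockNumber blocknum, bool modify)
    {
        Buffer      buffer = ReadBuffer(relation, blocknum);    /* acquire pin */
        Page        page = BufferGetPage(buffer);

        /* ... examine or modify the page, per the locking rules below ... */
        (void) page;

        if (modify)
            WriteBuffer(buffer);        /* mark buffer dirty and drop the pin */
        else
            ReleaseBuffer(buffer);      /* drop the pin without dirtying */
    }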

Buffer locks: there are two kinds of buffer locks, shared and exclusive,
which act just as you'd expect: multiple backends can hold shared locks
on the same buffer, but an exclusive lock prevents anyone else from
holding either shared or exclusive lock. (These can alternatively be
called READ and WRITE locks.) These locks are relatively short term:
they should not be held for long. They are implemented as per-buffer
spinlocks, so another backend trying to acquire a competing lock will
spin as long as you hold yours! Buffer locks are acquired and released by
LockBuffer(). It will *not* work for a single backend to try to acquire
multiple locks on the same buffer. One must pin a buffer before trying
to lock it.
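
In code, the lock protocol is a short bracketing pair around an
already-pinned buffer (a sketch, using the lock-mode constants from
bufmgr.h):

    /* buffer must already be pinned (see above) */
    LockBuffer(buffer, BUFFER_LOCK_SHARE);      /* or BUFFER_LOCK_EXCLUSIVE */
    /* ... brief inspection or update of the page ... */
    LockBuffer(buffer, BUFFER_LOCK_UNLOCK);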

Buffer access rules:

1. To scan a page for tuples, one must hold a pin and either shared or
exclusive lock. To examine the commit status (XIDs and status bits) of
a tuple in a shared buffer, one must likewise hold a pin and either shared
or exclusive lock.

2. Once one has determined that a tuple is interesting (visible to the
current transaction) one may drop the buffer lock, yet continue to access
the tuple's data for as long as one holds the buffer pin. This is what is
typically done by heap scans, since the tuple returned by heap_fetch
contains a pointer to tuple data in the shared buffer. Therefore the
tuple cannot go away while the pin is held (see rule #5). Its state could
change, but that is assumed not to matter after the initial determination
of visibility is made.
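
In code, this pattern looks roughly as follows (a sketch: the visibility
test is elided, and process_tuple is a hypothetical consumer):

    LockBuffer(buffer, BUFFER_LOCK_SHARE);
    /* ... locate the tuple and determine that it is visible ... */
    LockBuffer(buffer, BUFFER_LOCK_UNLOCK);

    /*
     * The lock is gone, but the pin still prevents the tuple from being
     * removed or moved (rule #5), so the pointer into the shared buffer
     * stays valid.
     */
    process_tuple(tuple);               /* hypothetical consumer */

    ReleaseBuffer(buffer);              /* done with the page: drop the pin */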

3. To add a tuple or change the xmin/xmax fields of an existing tuple,
one must hold a pin and an exclusive lock on the containing buffer.
This ensures that no one else might see a partially-updated state of the
tuple.
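
This is the exclusive-lock variant of the same bracketing (a sketch; the
actual tuple manipulation is elided):

    LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
    /* ... add a tuple, or set an existing tuple's xmin/xmax ... */
    LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
    WriteBuffer(buffer);                /* page changed: mark dirty, drop pin */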

4. It is considered OK to update tuple commit status bits (ie, OR the
values HEAP_XMIN_COMMITTED, HEAP_XMIN_INVALID, HEAP_XMAX_COMMITTED, or
HEAP_XMAX_INVALID into t_infomask) while holding only a shared lock and
pin on a buffer. This is OK because another backend looking at the tuple
at about the same time would OR the same bits into the field, so there
is little or no risk of conflicting update; what's more, if there did
manage to be a conflict it would merely mean that one bit-update would
be lost and need to be done again later.
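
A hint-bit update under this rule looks roughly like so (a sketch: the
pg_log consultation is stylized as TransactionIdDidCommit, and note that
bits are only ever OR'd in here, never cleared):

    LockBuffer(buffer, BUFFER_LOCK_SHARE);

    if (TransactionIdDidCommit(xmin))           /* stylized pg_log lookup */
    {
        tuple->t_infomask |= HEAP_XMIN_COMMITTED;
        SetBufferCommitInfoNeedsSave(buffer);   /* record the hint change */
    }

    LockBuffer(buffer, BUFFER_LOCK_UNLOCK);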

5. To physically remove a tuple or compact free space on a page, one
must hold a pin and an exclusive lock, *and* observe while holding the
exclusive lock that the buffer's shared reference count is one (ie,
no other backend holds a pin). If these conditions are met then no other
backend can perform a page scan until the exclusive lock is dropped, and
no other backend can be holding a reference to an existing tuple that it
might expect to examine again. Note that another backend might pin the
buffer (increment the refcount) while one is performing the cleanup, but
it won't be able to actually examine the page until it acquires shared
or exclusive lock.
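
There is no ready-made primitive for the refcount test today (the "mark
for cleanup" idea discussed later in this thread is one way to package
it); as a sketch, with a hypothetical BufferSharedRefCount() standing in
for a read of the buffer header's shared reference count:

    LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);

    if (BufferSharedRefCount(buffer) == 1)      /* hypothetical accessor */
    {
        /* we hold the only pin: safe to remove tuples and compact the page */
    }
    else
    {
        /* another backend holds a pin: must wait, or retry later */
    }

    LockBuffer(buffer, BUFFER_LOCK_UNLOCK);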


Currently, the only operation that removes tuples or compacts free space
is VACUUM. It does not have to implement rule #5 directly, because it
instead acquires exclusive lock at the relation level, which ensures
indirectly that no one else is accessing tuples of the relation at all.
To implement concurrent VACUUM we will need to make it obey rule #5.


I believe that nbtree.c's btbuild() code is currently in violation of
these rules, because it calls HeapTupleSatisfiesNow() while holding a
pin but no lock on the containing buffer. Not only does this risk
returning a wrong answer if it sees an intermediate state of the tuple
xmin/xmax fields, but HeapTupleSatisfiesNow() may try to update the
tuple's infomask. This could produce a wrong state of the infomask if
there is a concurrent change of the infomask by an exclusive-lock holder.
(Even if that catastrophe does not happen, SetBufferCommitInfoNeedsSave
is not called as it should be when the status bits change.)

I am also disturbed that several places in heapam.c call
HeapTupleSatisfiesUpdate without checking for infomask change and calling
SetBufferCommitInfoNeedsSave. In most paths through the code, it seems
the buffer will get marked dirty anyway, but wouldn't it be better to make
this check?
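
For reference, the shape of the fix proposed later in this thread comes
out roughly as follows (a sketch: saving and re-checking t_infomask is
how a caller can notice that the visibility routine updated the hint
bits):

    uint16      sv_infomask;
    bool        valid;

    LockBuffer(buffer, BUFFER_LOCK_SHARE);

    sv_infomask = tuple->t_data->t_infomask;
    valid = HeapTupleSatisfiesNow(tuple->t_data);
    if (sv_infomask != tuple->t_data->t_infomask)
        SetBufferCommitInfoNeedsSave(buffer);   /* hint bits were updated */

    LockBuffer(buffer, BUFFER_LOCK_UNLOCK);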

  • Hiroshi Inoue at Jul 3, 2001 at 3:26 am

    Tom Lane wrote:
    > I have been making some notes about the rules for accessing shared disk
    > buffers, since they aren't spelled out anywhere now AFAIK. In the process
    > I found what seems to be a nasty bug in the code that tries to build
    > btree indexes that include already-dead tuples. (If memory serves,
    > Hiroshi added that code awhile back to help suppress the "heap tuples
    > != index tuples" complaint from VACUUM.)
    > [snip]
    > I believe that nbtree.c's btbuild() code is currently in violation of
    > these rules, because it calls HeapTupleSatisfiesNow() while holding a
    > pin but no lock on the containing buffer.

    OK, we had better avoid using heapam routines in btbuild()?

    regards,

    Hiroshi Inoue
  • Tom Lane at Jul 3, 2001 at 5:06 pm

    Hiroshi Inoue writes:
    > Tom Lane wrote:
    >> I believe that nbtree.c's btbuild() code is currently in violation of
    >> these rules, because it calls HeapTupleSatisfiesNow() while holding a
    >> pin but no lock on the containing buffer.
    > OK, we had better avoid using heapam routines in btbuild()?

    On further thought, btbuild is not that badly broken at the moment,
    because CREATE INDEX acquires ShareLock on the relation, so there can be
    no concurrent writers at the page level. Still, it seems like it'd be a
    good idea to do "LockBuffer(buffer, BUFFER_LOCK_SHARE)" here, and
    probably also to invoke HeapTupleSatisfiesNow() via the
    HeapTupleSatisfies() macro so that infomask update is checked for.

    Vadim, what do you think?

    regards, tom lane
  • Nathan Myers at Jul 3, 2001 at 7:36 pm

    On Mon, Jul 02, 2001 at 09:40:25PM -0400, Tom Lane wrote:

    > 4. It is considered OK to update tuple commit status bits (ie, OR the
    > values HEAP_XMIN_COMMITTED, HEAP_XMIN_INVALID, HEAP_XMAX_COMMITTED, or
    > HEAP_XMAX_INVALID into t_infomask) while holding only a shared lock and
    > pin on a buffer. This is OK because another backend looking at the tuple
    > at about the same time would OR the same bits into the field, so there
    > is little or no risk of conflicting update; what's more, if there did
    > manage to be a conflict it would merely mean that one bit-update would
    > be lost and need to be done again later.

    Without looking at the code, this seems mad. Are you sure?

    Nathan Myers
    ncm@zembu.com
  • Tom Lane at Jul 3, 2001 at 9:11 pm

    Nathan Myers writes:
    > On Mon, Jul 02, 2001 at 09:40:25PM -0400, Tom Lane wrote:
    >> 4. It is considered OK to update tuple commit status bits (ie, OR the
    >> values HEAP_XMIN_COMMITTED, HEAP_XMIN_INVALID, HEAP_XMAX_COMMITTED, or
    >> HEAP_XMAX_INVALID into t_infomask) while holding only a shared lock and
    >> pin on a buffer. This is OK because another backend looking at the tuple
    >> at about the same time would OR the same bits into the field, so there
    >> is little or no risk of conflicting update; what's more, if there did
    >> manage to be a conflict it would merely mean that one bit-update would
    >> be lost and need to be done again later.
    > Without looking at the code, this seems mad. Are you sure?

    Yes. Those status bits aren't ground truth, only hints. They cache the
    results of looking up transaction status in pg_log; if they get dropped,
    the only consequence is the next visitor to the tuple has to do the
    lookup over again.

    Changing any other bits in t_infomask requires exclusive lock, however.

    regards, tom lane
  • Nathan Myers at Jul 3, 2001 at 10:59 pm

    On Tue, Jul 03, 2001 at 05:11:46PM -0400, Tom Lane wrote:
    > ncm@zembu.com (Nathan Myers) writes:
    >> On Mon, Jul 02, 2001 at 09:40:25PM -0400, Tom Lane wrote:
    >>> 4. It is considered OK to update tuple commit status bits (ie, OR the
    >>> values HEAP_XMIN_COMMITTED, HEAP_XMIN_INVALID, HEAP_XMAX_COMMITTED, or
    >>> HEAP_XMAX_INVALID into t_infomask) while holding only a shared lock and
    >>> pin on a buffer. This is OK because another backend looking at the tuple
    >>> at about the same time would OR the same bits into the field, so there
    >>> is little or no risk of conflicting update; what's more, if there did
    >>> manage to be a conflict it would merely mean that one bit-update would
    >>> be lost and need to be done again later.
    >> Without looking at the code, this seems mad. Are you sure?
    > Yes. Those status bits aren't ground truth, only hints. They cache the
    > results of looking up transaction status in pg_log; if they get dropped,
    > the only consequence is the next visitor to the tuple has to do the
    > lookup over again.
    >
    > Changing any other bits in t_infomask requires exclusive lock, however.

    Hmm, look:

                A                       B
        1       load t_infomask
        2       or XMIN_INVALID
        3                               lock
        4                               load t_infomask
        5                               or MOVED_IN
        6                               store t_infomask
        7                               unlock
        8       store t_infomask

    Here, backend B is a good citizen and locks while it makes its change.
    Backend A only means to touch the "hint" bits, but it ends up clobbering
    the MOVED_IN bit too.

    Also, as hints, would it be Bad(tm) if an attempt to clear one failed?
    That seems possible as well.

    Nathan Myers
    ncm@zembu.com
  • Tom Lane at Jul 4, 2001 at 3:29 am

    Nathan Myers writes:
    > Here, backend B is a good citizen and locks while it makes its change.

    No, backend B wasn't a good citizen: it should have been holding
    exclusive lock on the buffer.

    > Also, as hints, would it be Bad(tm) if an attempt to clear one failed?

    Clearing hint bits is also an exclusive-lock-only operation. Notice
    I specified that *setting* them is the only case allowed to be done
    with shared lock.

    regards, tom lane
  • Bill Studenmund at Jul 10, 2001 at 8:19 pm
    Sorry for the delay.
    On Tue, 3 Jul 2001, Tom Lane wrote:

    > ncm@zembu.com (Nathan Myers) writes:
    >> Also, as hints, would it be Bad(tm) if an attempt to clear one failed?
    > Clearing hint bits is also an exclusive-lock-only operation. Notice
    > I specified that *setting* them is the only case allowed to be done
    > with shared lock.

    One problem though is that if you don't have a spin lock around the flag,
    you can end up clearing it inadvertently. I.e., two backends go to update
    (different) bit flags. They each load the current value, and each sets the
    (different) bit it wants to set. They then store the new value they each
    have come up with. The second store will effectively clear the bit set in
    the first store.

    ??

    Take care,

    Bill
  • Tom Lane at Jul 10, 2001 at 8:38 pm

    Bill Studenmund writes:
    > One problem though is that if you don't have a spin lock around the flag,
    > you can end up clearing it inadvertently. I.e., two backends go to update
    > (different) bit flags. They each load the current value, and each sets the
    > (different) bit it wants to set. They then store the new value they each
    > have come up with. The second store will effectively clear the bit set in
    > the first store.

    Yes, this is the scenario we're talking about. The point is that losing
    the bit is harmless, in this particular situation, because it will be
    set again the next time the tuple is visited. It's only a hint. That
    being the case, we judge that avoiding the necessity for an exclusive
    lock during read is well worth a (very infrequent) need to do an extra
    pg_log lookup due to loss of a hint bit.

    regards, tom lane
  • Mikheev, Vadim at Jul 5, 2001 at 7:46 pm

    > What I'm wondering is if you had any other intended use for "mark for
    > cleanup" than VACUUM. The cheapest implementation would allow only
    > one process to be waiting for cleanup on a given buffer, which is OK
    > for VACUUM because we'll only allow one VACUUM at a time on a relation
    > anyway. But if you had some other uses in mind, maybe the code needs
    > to support multiple waiters.

    I was going to use it for UNDO but it seems that UNDO w/o OSMGR is not
    popular and OSMGR will require different approaches anyway, so -
    do whatever you want.

    Vadim
  • Hiroshi Inoue at Jul 6, 2001 at 1:21 am

    "Mikheev, Vadim" wrote:
    What I'm wondering is if you had any other intended use for "mark for
    cleanup" than VACUUM. The cheapest implementation would allow only
    one process to be waiting for cleanup on a given buffer, which is OK
    for VACUUM because we'll only allow one VACUUM at a time on a relation
    anyway. But if you had some other uses in mind, maybe the code needs
    to support multiple waiters.
    I was going to use it for UNDO but it seems that UNDO w/o OSMGR is not
    popular and OSMGR will require different approaches anyway, so -
    do whatever you want.
    How is UNDO now ?
    I've wanted a partial rollback functionality for a long
    time and I've never seen the practical solution other
    than UNDO.

    I'm thinking the following rstricted UNDO.

    UNDO is never invoked for an entire rollback of a transaction.
    The implicit savepoint at the beginning of a transaction isn't
    needed. If there's no savepoints, UNDO is never invoked.
    Especially UNDO is never invoked for commands outside the
    transaction block. WAL logs before savepoints could be
    discarded at CheckPoint.

    Comments ?

    regards,
    Hiroshi Inoue
