clog is handled and MVCC catalog access I thought I would write out
the ideas that I have had about speeding up snapshots in case there is
an interesting tie in with the current discussions.
To refresh your memory the basic idea is to change visibility
determination to be based on a commit sequence number (CSN for short)
- a 8 byte number incremented on every commit representing the total
ordering of commits. To take a snapshot in this scheme you only need
to know the value of last assigned CSN, all transactions with XID less
than or equal to that number were commited at the time of the
snapshots, everything above wasn't committed. Besides speeding up
snapshot taking, this scheme can also be a building block for
consistent snapshots in a multi-master environment with minimal
communication. Google's Spanner database uses snapshots based on a
The main tricky part about this scheme is finding the CSN that was
assigned to each XIDs in face of arbitrarily long transactions and
snapshots using only a bounded amount of shared memory. The secondary
tricky part is doing this in a way that doesn't need locks for
visibility determination as that would kill any hope of a performance
We need to keep around CSN slots for all currently running
transactions and CSN slots of transactions that are concurrent with
any active CSN based snapshot (xid.csn > min(snapshot.csn)). To do
this I propose the following datastructures to do the XID-to-CSN
mapping. For most recently assigned XIDs there is a ringbuffer of
slots that contain the CSN values of the XIDs or special CSN values
for transactions that haven't completed yet, aborted transactions or
subtransactions. I call this the dense CSN map. Looking up a CSN of a
XID from the ringbuffer is just a trivial direct indexing into the
For long running transactions the ringbuffer may do a full circle
before a transaction commits. Such CSN slots along with slots that are
needed for older snapshots are evicted from the dense buffer into a
sorted array of XID-CSN pairs, or the sparse mapping. For locking
purposes there are two sparse buffers, one of them being active the
other inactive, more on that later. Looking up the CSN value of a XID
that has been evicted into the sparse mapping is a matter of
performing a binary search to find the slot and reading the CSN value.
Because snapshots can remain active for an unbounded amount of time
and there can be unbounded amount of active snapshots, even the sparse
mapping can fill up. To handle that case, each backend advertises its
lowest snapshot number csn_min. When values need to be evicted from
the sparse mapping, they are evicted in CSN order and written into the
CSN log - a series of CSN-XID pairs. Backends that may still be
concerned about those values are then notified that values that they
might need to use have been evicted. Backends with invalidated
snapshots can then convert their snapshots to regular list of
concurrent XIDs snapshots at their leisure.
To convert a CSN based snapshot to XID based, a backend would first
scan the shared memory structures for xids up to snapshot.xmax for
CSNs that are concurrent to the snapshot and insert the XIDs into the
snapshot, then read in the CSN log starting from snapshots CSN,
looking for xid's less than the snapshots xmax. After this the
snapshot can be handled like current snapshots are handled.
A more detailed view of synchronization primitives required for common
Taking a new snapshot
Taking a CSN based snapshot under this scheme would consist of reading
xmin, csn and xmax from global variables, unlocked and in that order
with read barriers in between each load. If this is our oldest
snapshot we write our csn_min into pgproc, do a full memory barrier
and check from a global variable if the CSN we used is still
guaranteed to not be evicted (exceedingly unlikely but cannot be ruled
The read barrier between reading xmin and csn is needed so the
guarantee applies that no transaction with tx.xid < ss.xmin could have
committed with tx.csn >= ss.csn, so xmin can be used to safely exclude
csn lookups. Read barrier between reading csn and xmax is needed to
guarantee that if tx.xid >= ss.xmax, then it's known that tx.csn >=
ss.csn. From the write side, there needs to be at least one full
memory barrier between GetNewTransactionId updating nextXid and
CommitTransaction updating nextCsn, which is quite easily satisfied.
Updating global xmin without races is slightly trickier but doable.
XidInMVCCSnapshot will need to switch on the type of snapshot used
after checking xmin and xmax. For list-of-XIDs snapshots the current
logic applies. CSN based snapshots need to first do an unlocked read
from a backend specific global variable to check if we have been
instructed to convert our snapshots to xid based, if so the snapshot
is converted (more on that separately). Otherwise we look up the CSN
of the xid.
To map xid to csn, first the value from the csn slot corresponding to
the xid is read in from dense map (just a plain denseCSNMap[xid %
denseMapSize]), then after issuing a read barrier, the dense map
eviction horizon is checked to verify that the value that we read in
was in fact valid, if it is, it can be compared to the snapshot csn
value to get the result, if not continue to check the sparse map.
To check the sparse map, we read in the sparse map version counter
value and use it to determine which one of the two maps is currently
active (an optimistic locking scheme). We use linear or binary search
to look up the slot for the XID. If the XID is not found we know that
it wasn't committed after taking the snapshot, we then have to check
the clog if it was committed or not. Otherwise compare the value
against our snapshots csn to determine visibility. Finally after
issuing a read barrier, check the sparse map version counter to see if
the result is valid.
Checking from the dense map can omit clog checks as we can use special
values to signify aborted and uncommitted values. Even more so, we can
defer clog updates until the corresponding slots are evicted from the
Assigning XIDs and evicting from CSN buffers
XID assignment will need an additional step to allocate a CSN slot for
the transaction. If the dense map ring buffer has filled up, this
might require evicting some entries from the dense CSN map. For less
contention and better efficiency it would be a good idea to do
evictions larger blocks at a time. One 8th or 16th of the dense map at
a time might be a good balance here. Eviction will be done under a new
First we publish the point up to which we are evicting from dense to
notify committing backends of the hazard.
We then read the current nextCSN value and publish it as the largest
possible global_csn_min value we can arrive at, so if there is a
backend in the middle of taking a snapshot that has fetched a CSN but
hasn't yet updated procarray entry will notice that it is at risk and
will retry. Using nextCSN is being very conservative, but as far as I
can see the invalidations will be rare and cheap. We have to issue a
full memory barrier here so either the snapshot take sees our value or
we see its csn min.
If there is enough space, we just use global_csn_min as the sparse map
eviction horizon. If there is not enough free space in the current
active sparse map to guarantee that dense map will fit, we scan both
the active sparse array and the to-be-evicted block, collecting the
missing space worth xid-csn pairs with smallest CSN values to a heap,
reducing the heap size for every xid,csn we omit due to not being
needed either because the transaction was aborted or because it is
visible to all active snapshots. When we finish scanning and the heap
isn't empty, then the largest value in the heap is the sparse map
After arriving at the correct sparse map eviction horizon, we iterate
through the sparse map and the dense map block to be evicted, copying
all active or not-all-visible CSN slots to the inactive sparse map. We
also update clog for every committed transaction that we found in the
dense map. If we decided to evict some values, we write them to the
CSN log here and update the sparse map eviction horizon with the CSN
we arrived at. At this point the state in the current active sparse
map and the evictable dense map block are duplicated into the inactive
sparse map and CSN log. We now need to make the new state visible to
visibility checkers in a safe order, issuing a write barrier before
each step so the previous changes are visible:
* Notify all backends that have csn_min <= sparse map eviction horizon
that their snapshots are invalid and at what CSN log location they can
start to read to find concurrent xids.
* Increment sparse map version counter to switch sparse maps.
* Raise the dense map eviction horizon, to free up the dense map block.
* Overwrite the dense map block with special UncommittedCSN values
that are tagged with their XID (more on that later)
At this point we can consider the block cleared up for further use.
Because we don't want to lock shared structures for snapshotting we
need to maintain a global xmin value. To do this we acquire a spinlock
on the global xmin value, and check if it's empty, if no other
transaction is running we replace it with our xid.
At this point we know the minimum CSN of any unconverted snapshots, so
it's also a good time to clean up unnecessary CSN log.
Finally we are done, so we can release CSNEvictionLock and XIDGenLock.
Committing a transaction
Most of the commit sequence remains the same, but where we currently
do ProcArrayEndTransaction, we will acquire a new LWLock protecting
the nextCSN value, I'll call it the CSNGenLock for now. We first read
the dense array eviction horizon, then stamp all our non-overflowed or
still in dense map subtransaction slots with special SubcommittedCSN
value, then stamp the top level xid with nextCSN. We then issue a full
memory barrier, and check the soon-to-be-evicted pointer into the
dense map. If it overlaps with any of the XIDs we just tagged then the
backend doing the eviction might have missed our update, we have to
wait for CSNEvictionLock to become free and go and restamp the
relevant XIDs in the sparse map.
To stamp a XID with the commit CSN, we compare the XID to the dense
map eviction horizon. If the XID still maps to the dense array, we do
CAS and swap out UncommittedCSN(xid) with the value that we needed. If
the CAS fails, then between when we read the dense array horizon and
the actual stamping an eviction process replaced our slot with a newer
one. If we didn't tag the slots with the XID value, then we might
accidentally stamp another transactions slot. If the XID maps to the
sparse array, we have to take the CSNEvictionLock so the sparse array
doesn't get swapped out underneath us, and then look up the slot and
stamp it and then update the CLOG before releasing the lock. Lock
contention shouldn't be a problem here as only long running
transactions map to the sparse array.
When done with the stamping we will check the global xmin value, if
it's our xid, we grab the spinlock, scan through the procarray for the
next xmin value, update and release it.
Rolling back a transaction
Rolling back a transaction is basically the same as committing, but
the CSN slots need to be stamped with a AbortedCSN.
Because of limited size of the sparse array, we cannot keep open CSN
slots for all of the potentially unbounded number of subtransactions
there. I propose something similar to what is done currently with
PGPROC_MAX_CACHED_SUBXIDS. When we assign xids to subtransactions that
are above this limit, we tag them in the dense array with a special
OverflowedSubxidCSN value. When evicting subtransactions from dense
array, non-overflowed subtransaction slots are handled like regular
slots. We discard the overflowed slots when evicting from the dense
map. We also keep track of the lowest subxid that was overflowed and
the latest subxid that was overflowed, lowest overflowed subxid is
reset when before eviction the highest overflowed subxid is lower than
the smallest xid in the sparse array (i.e. we know that the XID region
convered by the sparse array doesn't contain any overflowed subxids).
When constructing the regular snapshot we can then detect that we
don't have the full information about subtransactions and can
correctly set the overflowed flag on the snapshot. Similarly
visibility checks can omit subxid lookups for xids missing from the
sparse array when they know that the xid can't be overflowed.
Prepared transactions are handled basically like regular transactions,
when starting up with prepared transactions, they are inserted into
the sparse array, when they are committed they get stamped with CSNs
and become visible like usual. We just need to account for them when
sizing the sparse array.
Crash safety and checkpoints
Because clog updating is delayed for transactions in the dense map,
checkpoints need to flush the dense array before writing out clog.
Otherwise the datastructures are fully transient and don't need any
I haven't yet worked out how CSN based snapshots best integrate with
hot standby. As far as I can tell, we could just use the current
KnownAssignedXidsGetAndSetXmin mechanism and get regular snapshots.
But I think there is an opportunity here to get rid of most of or all
of the KnownAssignedXids mechanism if we WAL log the CSNs that are
assigned to commits (or use a side channel as proposed by Robert). The
extra write is not great, but might not be a large problem after the
WAL insertion scaling work is done. Another option would be to use the
LSN of commit record as the next CSN value. The locking in that case
requires some further thought to guarantee that commits are stamped in
the same order as WAL records are inserted without holding
WALInsertLock for too long. That seems doable by inserting commiting
backends into a queue under WALInsertLock and then have them wait for
the transaction in front of them to commit when WALInsertLock has been
I won't pretend to be familiar with SSI code, but as far as I can tell
serializable transactions don't need any modifications to work with
the CSN based snapshot scheme. There actually already is a commit
sequence number in the SSI code that could be replaced by the actual
CSN. IIRC one of the problems with serializable transactions on hot
standby was that transaction visibility order on the standby is
different from the master. If we use CSNs for visibility on the slave
then we can actually provide identical visibility order.
Required atomic primitives
Besides the copious amount of memory barriers that are required for
correctness. We will need the following lockless primitives:
* 64bit atomic read
* 64bit atomic write
* 64bit CAS
Are there any supported platforms where it would be impossible to have
those? AFAIK everything from 32bit x86 through POWER and MIPS to ARM
can do it. If there are any platforms that can't handle 64bit atomics,
would it be ok to run on them with reduced concurrency/performance?
Sizing the CSN maps
CSN maps need to sized to accomodate the number of backends.
Dense array size should be picked so that most xids commit before
being evicted from the dense map and sparse array will contain slots
necessary for either long running transactions or for long snapshots
not yet converted to XID based snapshots. I did a few quick
simulations to measure the dynamics. If we assume a power law
distribution of transaction lengths and snapshots for the full length
of transactions with no think time, then 16 slots per backend is
enough to make the probability of eviction before commit less than
1e-6 and being needed at eviction due to a snapshot about 1:10000. In
the real world very long transactions are more likely than predicted
model, but at least the normal variation is mostly buffered by this
size. 16 slots = 128bytes per backend ends up at a 12.5KB buffer for
the default value of 100 backends, or 125KB for 1000 backends.
Sparse buffer needs to be at least big enough to fit CSN slots for the
xids of all active transactions and non-overflowed subtransactions. At
the current level PGPROC_MAX_CACHED_SUBXIDS=64, the minimum comes out
at 16 bytes * (64 + 1) slots * 100 = backends = 101.6KB per buffer,
or 203KB total in the default configuration.
Taking a snapshot is extremely cheap in this scheme, I think the cost
will be mostly for publishing the csnmin and rechecking validity after
it. Taking snapshots in the shadow of another snapshot (often the case
for the proposed MVCC catalog access) will be even cheaper as we don't
have to advertise the new snapshot. The delayed XID based snapshot
construction should be unlikely, but even if it isn't the costs are
similar to GetSnapshotData, but we don't need to hold a lock.
Visibility checking will also be simpler as for the cases where the
snapshot is covered by the dense array it only requires two memory
lookups and comparisons.
The main extra work for writing transactions is the eviction process.
The amortized worst case extra work per xid is dominated by copying
the sparse buffer back and forth and spooling out the csn log. We need
to write out 16 bytes per xid and copy sparse buffer size / eviction
block size sparse buffer slots. If we evict 1/8th of dense map at each
eviction it works out as 520 bytes copied per xid assigned. About the
same ballpark as GetSnapshotData is now.
With the described scheme, long snapshots will cause the sparse buffer
to quickly fill up and then spool out until the backend wakes up,
converts its snapshots and releases the eviction process to free up
the log. It would be more efficient to be slightly more proactive and
tell them to convert the snapshots earlier so if they manage to be
timely with their conversion we can avoid writing any CSN log.
I'm not particularly pleased about the fact that both xid assignment
and committing can block behind the eviction lock. On the other hand,
plugging in some numbers, with 100 backends doing 100k xid
assignments/s the lock will be acquired 1000 times per second for less
than 100us at a time. The contention might not be bad enough to
warrant extra complexity to deal with it. If it does happen to be a
problem, then I have some ideas how to cope with it.
Having to do CSN log writes while holding a LWLock might not be the
best of ideas, to combat that we can either add one more buffer so we
can do the actual write syscall after we release CSNEvictionLock, or
we can reuse the SLRU machinery to handle this.
Overall it looks to be a big win for typical workloads. Workloads
using large amounts of subtransactions might not be as well off, but I
doubt there will be a regression.
At this point I don't see any major issues with this approach. If the
ensuing discussion doesn't find any major showstoppers then I will
start to work on a patch bit-by-bit. It might take a while though as
my free hacking time has been severely cut down since we have a small
one in the family.
Cybertec Schönig & Schönig GmbH
A-2700 Wiener Neustadt
Cybertec Schönig & Schönig GmbH
A-2700 Wiener Neustadt