On 06/07/2013 12:42 AM, Ants Aasma wrote:
Given the recent ideas being thrown about changing how freezing and
clog is handled and MVCC catalog access I thought I would write out
the ideas that I have had about speeding up snapshots in case there is
an interesting tie in with the current discussions.
Thanks for this write-up.
To refresh your memory the basic idea is to change visibility
determination to be based on a commit sequence number (CSN for short)
- a 8 byte number incremented on every commit representing the total
ordering of commits. To take a snapshot in this scheme you only need
to know the value of last assigned CSN, all transactions with XID less
You mean with a *CSN* less than or equal to that number? There's
certainly no point in comparing a CSN with an XID.
than or equal to that number were commited at the time of the
snapshots, everything above wasn't committed. Besides speeding up
snapshot taking, this scheme can also be a building block for
consistent snapshots in a multi-master environment with minimal
Agreed. Postgres-R uses a CommitOrderId, which is very similar in
concept, for example.
The main tricky part about this scheme is finding the CSN that was
assigned to each XIDs in face of arbitrarily long transactions and
snapshots using only a bounded amount of shared memory. The secondary
tricky part is doing this in a way that doesn't need locks for
visibility determination as that would kill any hope of a performance
Agreed. Para-phrasing, you can also say that CSNs can only ever identify
completed transactions, where XIDs can be used to identify transactions
in progress as well.

[ We cannot possibly get rid of XIDs entirely for that reason. And the
mapping between CSNs and XIDs has some additional cost. ]
We need to keep around CSN slots for all currently running
transactions and CSN slots of transactions that are concurrent with
any active CSN based snapshot (xid.csn > min(snapshot.csn)). To do
this I propose the following datastructures to do the XID-to-CSN
mapping. For most recently assigned XIDs there is a ringbuffer of
slots that contain the CSN values of the XIDs or special CSN values
for transactions that haven't completed yet, aborted transactions or
subtransactions. I call this the dense CSN map. Looking up a CSN of a
XID from the ringbuffer is just a trivial direct indexing into the
ring buffer.

For long running transactions the ringbuffer may do a full circle
before a transaction commits. Such CSN slots along with slots that are
needed for older snapshots are evicted from the dense buffer into a
sorted array of XID-CSN pairs, or the sparse mapping. For locking
purposes there are two sparse buffers, one of them being active the
other inactive, more on that later. Looking up the CSN value of a XID
that has been evicted into the sparse mapping is a matter of
performing a binary search to find the slot and reading the CSN value.
I like this idea of dense and sparse "areas". Seems like a simple
enough, yet reasonably compact representation that might work well in
Because snapshots can remain active for an unbounded amount of time
and there can be unbounded amount of active snapshots, even the sparse
mapping can fill up.
I don't think the number of active snapshots matters - after all, they
could all refer the same CSN. So that number shouldn't have any
influence on the size of the sparse mapping.

What does matter is the number of transactions referenced by such a
sparse map. You are of course correct in that this number is equally
To handle that case, each backend advertises its
lowest snapshot number csn_min. When values need to be evicted from
the sparse mapping, they are evicted in CSN order and written into the
CSN log - a series of CSN-XID pairs. Backends that may still be
concerned about those values are then notified that values that they
might need to use have been evicted. Backends with invalidated
snapshots can then convert their snapshots to regular list of
concurrent XIDs snapshots at their leisure.

To convert a CSN based snapshot to XID based, a backend would first
scan the shared memory structures for xids up to snapshot.xmax for
CSNs that are concurrent to the snapshot and insert the XIDs into the
snapshot, then read in the CSN log starting from snapshots CSN,
looking for xid's less than the snapshots xmax. After this the
snapshot can be handled like current snapshots are handled.
Hm, I dislike the requirement to maintain two different snapshot formats.

Also mind that snapshot conversions - however unlikely you choose to
make them - may well result in bursts as multiple processes may need to
do such a conversion, all starting at the same point in time.
Rolling back a transaction

Rolling back a transaction is basically the same as committing, but
the CSN slots need to be stamped with a AbortedCSN.
Is that really necessary? After all, an aborted transaction behaves
pretty much the same as a transaction in progress WRT visibility: it's
simply not visible.

Or why do you need to tell apart aborted from in-progress transactions
by CSN?
Sizing the CSN maps

CSN maps need to sized to accomodate the number of backends.

Dense array size should be picked so that most xids commit before
being evicted from the dense map and sparse array will contain slots
necessary for either long running transactions or for long snapshots
not yet converted to XID based snapshots. I did a few quick
simulations to measure the dynamics. If we assume a power law
distribution of transaction lengths and snapshots for the full length
of transactions with no think time, then 16 slots per backend is
enough to make the probability of eviction before commit less than
1e-6 and being needed at eviction due to a snapshot about 1:10000. In
the real world very long transactions are more likely than predicted
model, but at least the normal variation is mostly buffered by this
size. 16 slots = 128bytes per backend ends up at a 12.5KB buffer for
the default value of 100 backends, or 125KB for 1000 backends.
Sounds reasonable to me.
Sparse buffer needs to be at least big enough to fit CSN slots for the
xids of all active transactions and non-overflowed subtransactions. At
the current level PGPROC_MAX_CACHED_SUBXIDS=64, the minimum comes out
at 16 bytes * (64 + 1) slots * 100 = backends = 101.6KB per buffer,
or 203KB total in the default configuration.
A CSN is 8 bytes, the XID 4, resulting in 12 bytes per slot. So I guess
the given 16 bytes includes alignment to 8 byte boundaries. Sounds good.
Performance discussion

Taking a snapshot is extremely cheap in this scheme, I think the cost
will be mostly for publishing the csnmin and rechecking validity after
it. Taking snapshots in the shadow of another snapshot (often the case
for the proposed MVCC catalog access) will be even cheaper as we don't
have to advertise the new snapshot. The delayed XID based snapshot
construction should be unlikely, but even if it isn't the costs are
similar to GetSnapshotData, but we don't need to hold a lock.

Visibility checking will also be simpler as for the cases where the
snapshot is covered by the dense array it only requires two memory
lookups and comparisons.
Keep in mind, though, that both of these lookups are into shared memory.
Especially the dense ring buffer may well turn into a point of
contention. Or at least the cache lines holding the most recent XIDs
within that ring buffer.

Where as currently, the snapshot's xip array resides in process-local
memory. (Granted, often enough, the proc array also is a point of
At this point I don't see any major issues with this approach. If the
ensuing discussion doesn't find any major showstoppers then I will
start to work on a patch bit-by-bit.
Bit-by-bit? Reminds me of punched cards. I don't think we accept patches
in that format. :-)
we have a small one in the family.
Congratulations on that one.


Markus Wanner

Search Discussions

Discussion Posts


Follow ups

Related Discussions

Discussion Navigation
viewthread | post
posts ‹ prev | 2 of 7 | next ›
Discussion Overview
grouppgsql-hackers @
postedJun 6, '13 at 10:42p
activeJun 11, '13 at 6:55a



site design / logo © 2021 Grokbase