On Fri, Jun 7, 2013 at 2:59 PM, Markus Wanner wrote:
To refresh your memory, the basic idea is to change visibility
determination to be based on a commit sequence number (CSN for short)
- an 8-byte number incremented on every commit, representing the total
ordering of commits. To take a snapshot in this scheme you only need
to know the value of the last assigned CSN; all transactions with XID less
You mean with a *CSN* less than or equal to that number? There's
certainly no point in comparing a CSN with an XID.
That was what I meant. I guess my coffee hadn't kicked in yet there.
than or equal to that number were committed at the time of the
snapshot; everything above wasn't committed. Besides speeding up
snapshot taking, this scheme can also be a building block for
consistent snapshots in a multi-master environment with minimal
Agreed. Postgres-R uses a CommitOrderId, which is very similar in
concept, for example.
Do you think having this snapshot scheme would be helpful for Postgres-R?
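Just to make the basic scheme concrete before going further, here is a
single-process toy sketch of what taking a snapshot and checking visibility
boil down to. All names (CommitSeqNo, CsnSnapshot, csn_of_xid, and so on)
are invented for illustration; the real xid-to-CSN mapping would of course
live in the shared dense/sparse structures discussed below:

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t CommitSeqNo;          /* 8-byte commit sequence number */
typedef uint32_t TransactionId;

#define InvalidCommitSeqNo ((CommitSeqNo) 0)
#define MAX_DEMO_XIDS      1024

/* Toy commit map: csn_of_xid[xid] holds the CSN the xid committed with,
 * or InvalidCommitSeqNo if it never committed. */
static CommitSeqNo csn_of_xid[MAX_DEMO_XIDS];
static CommitSeqNo last_assigned_csn = 0;

typedef struct CsnSnapshot
{
    CommitSeqNo snapshot_csn;          /* commits up to and including this are visible */
} CsnSnapshot;

/* Committing stamps the transaction with the next CSN. */
static void
CommitXid(TransactionId xid)
{
    csn_of_xid[xid] = ++last_assigned_csn;
}

/* Taking a snapshot only needs the value of the last assigned CSN. */
static CsnSnapshot
GetCsnSnapshot(void)
{
    CsnSnapshot snap = { last_assigned_csn };
    return snap;
}

/* An xid is visible iff it committed with a CSN at or below the snapshot's CSN. */
static bool
XidVisibleInSnapshot(TransactionId xid, CsnSnapshot snap)
{
    CommitSeqNo csn = csn_of_xid[xid];

    return csn != InvalidCommitSeqNo && csn <= snap.snapshot_csn;
}

The point being that GetCsnSnapshot is a single read, as opposed to
GetSnapshotData scanning the whole proc array today.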
Because snapshots can remain active for an unbounded amount of time
and there can be an unbounded number of active snapshots, even the sparse
mapping can fill up.
I don't think the number of active snapshots matters - after all, they
could all refer to the same CSN. So that number shouldn't have any
influence on the size of the sparse mapping.

What does matter is the number of transactions referenced by such a
sparse map. You are of course correct in that this number is equally unbounded.
Yes, that is what I meant to say but for some reason didn't.
To handle that case, each backend advertises its
lowest snapshot number csn_min. When values need to be evicted from
the sparse mapping, they are evicted in CSN order and written into the
CSN log - a series of CSN-XID pairs. Backends that may still be
concerned about those values are then notified that values that they
might need to use have been evicted. Backends with invalidated
snapshots can then convert their snapshots to the regular
list-of-concurrent-XIDs format at their leisure.
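Roughly, and glossing over locking and the actual notification mechanism,
I imagine the eviction path looking something like the following
single-process toy (all names invented, overflow checks omitted):

#include <stdint.h>

typedef uint64_t CommitSeqNo;
typedef uint32_t TransactionId;

typedef struct CsnSlot     { TransactionId xid; CommitSeqNo csn; } CsnSlot;
typedef struct CsnLogEntry { CommitSeqNo csn; TransactionId xid; } CsnLogEntry;

#define MAX_LOG      4096
#define MAX_BACKENDS 100

static CsnLogEntry csn_log[MAX_LOG];       /* the CSN log: CSN-XID pairs */
static int         csn_log_len;

/* csn_min advertised by each backend: the lowest snapshot CSN it still holds. */
static CommitSeqNo backend_csn_min[MAX_BACKENDS];
static int         n_backends;

/* Evict the nevict oldest slots (kept in CSN order) from the sparse mapping
 * into the CSN log, then notify every backend whose advertised csn_min means
 * it may still need those entries, so it can convert its snapshots to plain
 * XID lists at its leisure. */
static void
EvictOldestSlots(CsnSlot *slots, int nevict, void (*invalidate)(int backend))
{
    if (nevict <= 0)
        return;

    for (int i = 0; i < nevict; i++)
    {
        csn_log[csn_log_len].csn = slots[i].csn;
        csn_log[csn_log_len].xid = slots[i].xid;
        csn_log_len++;
    }

    CommitSeqNo evicted_up_to = slots[nevict - 1].csn;

    for (int b = 0; b < n_backends; b++)
        if (backend_csn_min[b] < evicted_up_to)
            invalidate(b);                 /* that backend's snapshots need conversion */
}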

To convert a CSN-based snapshot to an XID-based one, a backend would first
scan the shared memory structures for XIDs up to snapshot.xmax whose
CSNs are concurrent to the snapshot and insert those XIDs into the
snapshot, then read the CSN log starting from the snapshot's CSN,
looking for XIDs less than the snapshot's xmax. After this the
snapshot can be handled like current snapshots are handled.
Hm, I dislike the requirement to maintain two different snapshot formats.

Also mind that snapshot conversions - however unlikely you choose to
make them - may well result in bursts as multiple processes may need to
do such a conversion, all starting at the same point in time.
I agree that having two snapshot formats is not great. On the other hand, the
additional logic is confined to XidInMVCCSnapshot and is reasonably
simple. If we didn't convert the snapshots we would have to keep
reading through the CSN log and looking up XIDs for every visibility check.
We could add a cache for XIDs that were deemed concurrent, but that is in
effect just lazily constructing the same data structure. The work
needed to convert is reasonably well bounded, can be done without
holding global locks, and in most circumstances should only be
necessary for long-lived snapshots, which amortizes the cost.
I'm not worried about the bursts because the
conversion is done without locks, and multiple backends converting at
the same point in time actually leads to better cache utilization.
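For what it's worth, the conversion itself is small. A toy sketch (invented
names, fixed-size arrays standing in for the shared structures and the CSN
log, bounds checks omitted):

#include <stdint.h>

typedef uint64_t CommitSeqNo;
typedef uint32_t TransactionId;

#define InvalidCommitSeqNo ((CommitSeqNo) 0)

typedef struct CsnSlot     { TransactionId xid; CommitSeqNo csn; } CsnSlot;
typedef struct CsnLogEntry { CommitSeqNo csn; TransactionId xid; } CsnLogEntry;

typedef struct XidSnapshot
{
    CommitSeqNo   snapshot_csn;
    TransactionId xmax;          /* first XID not yet assigned at snapshot time */
    TransactionId xip[1024];     /* XIDs concurrent with the snapshot */
    int           nxip;
} XidSnapshot;

/* 1. Scan the shared CSN slots for XIDs below xmax whose CSN is missing or
 *    newer than snapshot_csn (i.e. concurrent with the snapshot).
 * 2. Scan the CSN log from snapshot_csn onwards for XIDs below xmax. */
static void
ConvertCsnSnapshot(XidSnapshot *snap,
                   const CsnSlot *slots, int nslots,
                   const CsnLogEntry *log, int nlog)
{
    snap->nxip = 0;

    for (int i = 0; i < nslots; i++)
        if (slots[i].xid < snap->xmax &&
            (slots[i].csn == InvalidCommitSeqNo || slots[i].csn > snap->snapshot_csn))
            snap->xip[snap->nxip++] = slots[i].xid;

    for (int i = 0; i < nlog; i++)
        if (log[i].csn > snap->snapshot_csn && log[i].xid < snap->xmax)
            snap->xip[snap->nxip++] = log[i].xid;
}

After that, XidInMVCCSnapshot can fall through to the existing xip-list logic.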
Rolling back a transaction

Rolling back a transaction is basically the same as committing, but
the CSN slots need to be stamped with an AbortedCSN.
Is that really necessary? After all, an aborted transaction behaves
pretty much the same as a transaction in progress WRT visibility: it's
simply not visible.

Or why do you need to tell apart aborted from in-progress transactions
by CSN?
I need to detect aborted transactions so they can be discarded during
the eviction process; otherwise the sparse array will fill up. They
could also be filtered out by cross-referencing uncommitted slots with
the procarray. Having the abort case do some additional work to make
XID assignment cheaper looks like a good tradeoff.
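In code the difference is tiny; something along these lines (invented names
and sentinel values, purely illustrative):

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t CommitSeqNo;
typedef uint32_t TransactionId;

/* Reserved sentinel CSNs (illustrative values). */
#define InvalidCommitSeqNo ((CommitSeqNo) 0)   /* not committed (yet) */
#define AbortedCSN         ((CommitSeqNo) 1)   /* rolled back */

typedef struct CsnSlot { TransactionId xid; CommitSeqNo csn; } CsnSlot;

/* Rollback is stamped just like commit, but with the AbortedCSN sentinel. */
static void
StampAborted(CsnSlot *slot)
{
    slot->csn = AbortedCSN;
}

/* At eviction time aborted slots are simply dropped instead of being written
 * to the CSN log, so they cannot clog the sparse mapping and no
 * cross-referencing with the procarray is needed. */
static bool
SlotNeedsCsnLogEntry(const CsnSlot *slot)
{
    return slot->csn != AbortedCSN;
}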
The sparse buffer needs to be at least big enough to fit CSN slots for the
XIDs of all active transactions and non-overflowed subtransactions. With
the current PGPROC_MAX_CACHED_SUBXIDS = 64, the minimum comes out
at 16 bytes * (64 + 1) slots * 100 backends = 101.6KB per buffer,
or 203KB total in the default configuration.
A CSN is 8 bytes, the XID 4, resulting in 12 bytes per slot. So I guess
the given 16 bytes includes alignment to 8 byte boundaries. Sounds good.
8-byte alignment for CSNs is needed for atomic access, if nothing else.
I think the size could be cut in half by using a base value for CSNs
if we assume that no XID is active for longer than 2B transactions, as
is currently the case. I didn't want to include the complication in
the first iteration, so I didn't verify if that would have any measurable benefit.
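Just to show the arithmetic behind those numbers (assuming the 203KB total
corresponds to two such buffers):

#include <stdio.h>

int main(void)
{
    const int  slot_size      = 16;   /* 8-byte CSN + 4-byte XID, padded to 16 */
    const int  cached_subxids = 64;   /* PGPROC_MAX_CACHED_SUBXIDS */
    const int  max_backends   = 100;  /* default max_connections */
    const long slots          = (long) (cached_subxids + 1) * max_backends;
    const long per_buffer     = slots * slot_size;

    printf("%ld slots, %.1f KB per buffer, %.1f KB for two buffers\n",
           slots, per_buffer / 1024.0, 2 * per_buffer / 1024.0);
    /* 6500 slots, 101.6 KB per buffer, 203.1 KB for two buffers */
    return 0;
}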
Performance discussion

Taking a snapshot is extremely cheap in this scheme; I think the cost
will mostly be in publishing the csn_min and rechecking validity after
it. Taking snapshots in the shadow of another snapshot (often the case
for the proposed MVCC catalog access) will be even cheaper, as we don't
have to advertise the new snapshot. The delayed XID-based snapshot
construction should be unlikely, but even if it isn't, the cost is
similar to GetSnapshotData, except that we don't need to hold a lock.
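My mental model of the snapshot-taking fast path, for what it's worth (toy
code with invented names; the retry is my reading of "rechecking validity",
covering an eviction racing past the CSN we just read):

#include <stdint.h>

typedef uint64_t CommitSeqNo;

/* Toy stand-ins for shared and backend-local state. */
static CommitSeqNo last_assigned_csn;    /* shared: bumped on every commit */
static CommitSeqNo csn_evicted_up_to;    /* shared: highest CSN already evicted to the CSN log */
static CommitSeqNo my_csn_min;           /* this backend's advertised csn_min, 0 = none */

static CommitSeqNo
TakeCsnSnapshot(void)
{
    for (;;)
    {
        CommitSeqNo csn = last_assigned_csn;

        /* Snapshots taken in the shadow of an existing one (csn_min already
         * advertised) don't have to publish anything new. */
        if (my_csn_min == 0)
            my_csn_min = csn;

        /* Recheck validity after publishing: if eviction has already moved
         * past the CSN we just read, retract and retry with a newer one. */
        if (csn >= csn_evicted_up_to)
            return csn;

        my_csn_min = 0;
    }
}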

Visibility checking will also be simpler: for the cases where the
snapshot is covered by the dense array it only requires two memory
lookups and comparisons.
Keep in mind, though, that both of these lookups are into shared memory.
Especially the dense ring buffer may well turn into a point of
contention. Or at least the cache lines holding the most recent XIDs
within that ring buffer.

Whereas currently, the snapshot's xip array resides in process-local
memory. (Granted, often enough, the proc array also is a point of contention.)
Visibility checks are done lock-free, so they don't cause any
contention. The number of times each cache line can be invalidated is
bounded by 8. Overall I think actual performance tests are needed to
see if there is a problem, or perhaps if having the data shared
actually helps with cache hit rates.
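To illustrate what I mean by two lookups, the dense-array fast path could
look roughly like this (invented layout; with 8-byte CSN slots a 64-byte
cache line holds 8 of them, which is presumably where the bound of 8
invalidations per line comes from):

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t CommitSeqNo;
typedef uint32_t TransactionId;

#define DENSE_SLOTS 8192                 /* arbitrary power of two for the toy */

typedef struct DenseCsnBuffer
{
    TransactionId base_xid;              /* oldest XID covered by the ring */
    CommitSeqNo   csn[DENSE_SLOTS];      /* CSN per XID, 0 = not (yet) committed */
} DenseCsnBuffer;

/* Returns true if the visibility check could be answered from the dense
 * buffer; otherwise the caller falls back to the sparse map / CSN log. */
static bool
XidVisibleDense(const DenseCsnBuffer *buf, TransactionId xid,
                CommitSeqNo snapshot_csn, bool *visible)
{
    /* Lookup/comparison 1: is the XID within the range the ring covers? */
    if (xid < buf->base_xid || xid - buf->base_xid >= DENSE_SLOTS)
        return false;

    /* Lookup/comparison 2: fetch the CSN stamped at commit and compare. */
    CommitSeqNo csn = buf->csn[xid % DENSE_SLOTS];

    *visible = (csn != 0 && csn <= snapshot_csn);
    return true;
}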
At this point I don't see any major issues with this approach. If the
ensuing discussion doesn't find any major showstoppers then I will
start to work on a patch bit-by-bit.
Bit-by-bit? Reminds me of punched cards. I don't think we accept patches
in that format. :-)
we have a small one in the family.
Congratulations on that one.
Ants Aasma
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
