Given the recent ideas being thrown about for changing how freezing
and the clog are handled and for MVCC catalog access, I thought I
would write out the ideas that I have had about speeding up snapshots,
in case there is an interesting tie-in with the current discussions.

To refresh your memory, the basic idea is to change visibility
determination to be based on a commit sequence number (CSN for short)
- an 8 byte number incremented on every commit, representing the total
ordering of commits. To take a snapshot in this scheme you only need
to know the value of the last assigned CSN; all transactions with XID
less than or equal to that number were committed at the time of the
snapshot, everything above wasn't committed. Besides speeding up
snapshot taking, this scheme can also be a building block for
consistent snapshots in a multi-master environment with minimal
communication. Google's Spanner database uses snapshots based on a
similar scheme.
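The core visibility rule is simple enough to sketch. Here is a toy
Python rendering (names like `xid_to_csn` are illustrative, not the
actual shared memory structure described later):

```python
def xid_visible(snapshot_csn, xid_to_csn, xid):
    """A transaction is visible iff it committed with a CSN at or
    before the snapshot's CSN; uncommitted xids have no CSN yet."""
    csn = xid_to_csn.get(xid)  # None means not committed
    return csn is not None and csn <= snapshot_csn
```

For example, with commits `{10: 1, 11: 2}`, a snapshot taken at CSN 1
sees xid 10 but not xid 11.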

The main tricky part of this scheme is finding the CSN that was
assigned to each XID in the face of arbitrarily long transactions and
snapshots, using only a bounded amount of shared memory. The secondary
tricky part is doing this in a way that doesn't need locks for
visibility determination, as that would kill any hope of a performance
gain.

We need to keep around CSN slots for all currently running
transactions and CSN slots of transactions that are concurrent with
any active CSN based snapshot (xid.csn > min(snapshot.csn)). To do
this I propose the following data structures for the XID-to-CSN
mapping. For the most recently assigned XIDs there is a ringbuffer of
slots that contain the CSN values of the XIDs, or special CSN values
for transactions that haven't completed yet, aborted transactions, or
subtransactions. I call this the dense CSN map. Looking up the CSN of
an XID from the ringbuffer is just trivial direct indexing into the
ring buffer.
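The dense map lookup really is just modular indexing. A minimal
sketch, with made-up sizes and sentinel values:

```python
DENSE_MAP_SIZE = 1024        # illustrative; the real size scales with backends
UNCOMMITTED_CSN = -1         # sentinel: transaction still in progress
ABORTED_CSN = -2             # sentinel: transaction aborted

dense_csn_map = [UNCOMMITTED_CSN] * DENSE_MAP_SIZE

def dense_lookup(xid):
    # The ringbuffer maps an XID to its slot by simple modular indexing.
    return dense_csn_map[xid % DENSE_MAP_SIZE]
```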

For long running transactions the ringbuffer may do a full circle
before a transaction commits. Such CSN slots, along with slots that
are needed for older snapshots, are evicted from the dense buffer into
a sorted array of XID-CSN pairs, the sparse mapping. For locking
purposes there are two sparse buffers, one of them active and the
other inactive; more on that later. Looking up the CSN value of an XID
that has been evicted into the sparse mapping is a matter of
performing a binary search to find the slot and reading the CSN value.
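The sparse-map lookup can be sketched with a sorted xid array and a
standard binary search (the contents here are invented for
illustration):

```python
import bisect

# The sparse mapping: a sorted array of (xid, csn) pairs.
sparse_map = [(100, 7), (105, 9), (130, 15)]

def sparse_lookup(xid):
    """Binary-search the sorted xid-csn array; None if the xid was
    never evicted into the sparse map."""
    xids = [x for x, _ in sparse_map]
    i = bisect.bisect_left(xids, xid)
    if i < len(sparse_map) and sparse_map[i][0] == xid:
        return sparse_map[i][1]
    return None
```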

Because snapshots can remain active for an unbounded amount of time
and there can be an unbounded number of active snapshots, even the
sparse mapping can fill up. To handle that case, each backend
advertises its lowest snapshot number csn_min. When values need to be
evicted from the sparse mapping, they are evicted in CSN order and
written into the CSN log - a series of CSN-XID pairs. Backends that
may still be concerned about those values are then notified that
values they might need have been evicted. Backends with invalidated
snapshots can then convert their snapshots to regular
list-of-concurrent-XIDs snapshots at their leisure.

To convert a CSN based snapshot to an XID based one, a backend would
first scan the shared memory structures for xids up to snapshot.xmax
with CSNs that are concurrent to the snapshot and insert those XIDs
into the snapshot, then read the CSN log starting from the snapshot's
CSN, looking for xids less than the snapshot's xmax. After this the
snapshot can be handled like current snapshots are handled.
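The conversion step above can be sketched roughly as follows. This is
a simplification: `shared_slots` and `csn_log` are hypothetical stand-ins
for the shared memory scan and the log read, and a `None` CSN marks a
still-in-progress transaction:

```python
def convert_to_xid_snapshot(ss_csn, ss_xmax, shared_slots, csn_log):
    """Build the list of xids concurrent with a (csn, xmax) snapshot."""
    xip = set()
    # Shared memory structures: any xid below xmax that is still in
    # progress (csn is None) or committed after the snapshot's CSN
    # was concurrent with the snapshot.
    for xid, csn in shared_slots:
        if xid < ss_xmax and (csn is None or csn > ss_csn):
            xip.add(xid)
    # CSN log: scan from the snapshot's CSN onward for xids < xmax.
    for csn, xid in csn_log:
        if csn > ss_csn and xid < ss_xmax:
            xip.add(xid)
    return sorted(xip)
```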

A more detailed view of synchronization primitives required for common
operations follows.

Taking a new snapshot
---------------------

Taking a CSN based snapshot under this scheme would consist of reading
xmin, csn and xmax from global variables, unlocked and in that order,
with read barriers between each load. If this is our oldest snapshot
we write our csn_min into pgproc, do a full memory barrier and check a
global variable to see whether the CSN we used is still guaranteed not
to be evicted (eviction at this point is exceedingly unlikely but
cannot be ruled out).
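The read sequence can be sketched as below; the barrier functions are
no-op stand-ins marking where the real CPU barriers go, and all names
(`csn_min_floor` in particular) are illustrative:

```python
def read_barrier():
    pass  # stand-in; the real code issues a CPU read barrier here

def full_memory_barrier():
    pass  # stand-in for a real full memory barrier

def take_csn_snapshot(shared, proc):
    """Sketch of the unlocked snapshot read: xmin, then csn, then
    xmax, with barriers in between; shared/proc are dicts standing
    in for shared memory and the backend's PGPROC entry."""
    xmin = shared["xmin"]
    read_barrier()
    csn = shared["next_csn"]
    read_barrier()
    xmax = shared["next_xid"]
    if proc["csn_min"] is None or csn < proc["csn_min"]:
        proc["csn_min"] = csn            # advertise our oldest snapshot
        full_memory_barrier()
        if csn < shared["csn_min_floor"]:
            # exceedingly unlikely: eviction raced past us, retry
            return take_csn_snapshot(shared, proc)
    return (xmin, csn, xmax)
```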

The read barrier between reading xmin and csn is needed so the
guarantee applies that no transaction with tx.xid < ss.xmin could have
committed with tx.csn >= ss.csn, so xmin can be used to safely exclude
csn lookups. Read barrier between reading csn and xmax is needed to
guarantee that if tx.xid >= ss.xmax, then it's known that tx.csn >=
ss.csn. From the write side, there needs to be at least one full
memory barrier between GetNewTransactionId updating nextXid and
CommitTransaction updating nextCsn, which is quite easily satisfied.
Updating global xmin without races is slightly trickier but doable.

Checking visibility
-------------------

XidInMVCCSnapshot will need to switch on the type of snapshot used
after checking xmin and xmax. For list-of-XIDs snapshots the current
logic applies. CSN based snapshots first do an unlocked read of a
backend specific global variable to check whether we have been
instructed to convert our snapshots to XID based ones; if so, the
snapshot is converted (more on that separately). Otherwise we look up
the CSN of the xid.

To map an xid to a csn, the value in the csn slot corresponding to the
xid is first read from the dense map (just a plain denseCSNMap[xid %
denseMapSize]). Then, after issuing a read barrier, the dense map
eviction horizon is checked to verify that the value we read was in
fact valid. If it was, it can be compared to the snapshot csn value to
get the result; if not, we continue on to check the sparse map.
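Putting the two-level lookup together, a sketch (the horizon check and
fallback function are illustrative names, not the real API):

```python
def csn_for_xid(xid, dense, dense_eviction_horizon, sparse_lookup):
    """Try the dense ring buffer first; fall back to the sparse map
    if the slot we read may have been recycled."""
    csn = dense[xid % len(dense)]
    # a read barrier goes here in the real implementation
    if xid >= dense_eviction_horizon:
        return csn                 # the slot we read was still valid
    return sparse_lookup(xid)      # evicted; consult the sparse map
```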

To check the sparse map, we read the sparse map version counter and
use it to determine which of the two maps is currently active (an
optimistic locking scheme). We use linear or binary search to look up
the slot for the XID. If the XID is not found, we know that it wasn't
committed after the snapshot was taken; we then have to check the clog
to see whether it was committed at all. Otherwise we compare the
slot's value against our snapshot's csn to determine visibility.
Finally, after issuing a read barrier, we check the sparse map version
counter to see if the result is still valid.
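The version-counter dance is essentially a seqlock-style optimistic
read. A toy sketch, with a dict `get` standing in for the binary
search:

```python
def lookup_sparse_optimistic(xid, state):
    """Optimistic read of the double-buffered sparse map: if the
    version counter changed while we searched, the maps were swapped
    under us and we retry. All names are illustrative."""
    while True:
        v = state["version"]
        active = state["maps"][v % 2]   # pick the active map
        result = active.get(xid)        # stands in for the binary search
        # a read barrier goes here in the real implementation
        if state["version"] == v:
            return result               # result is known consistent
        # otherwise the maps were swapped while we searched; retry
```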

Checking from the dense map can omit clog checks as we can use special
values to signify aborted and uncommitted values. Even more so, we can
defer clog updates until the corresponding slots are evicted from the
dense map.

Assigning XIDs and evicting from CSN buffers
--------------------------------------------

XID assignment will need an additional step to allocate a CSN slot for
the transaction. If the dense map ring buffer has filled up, this
might require evicting some entries from the dense CSN map. For less
contention and better efficiency it would be a good idea to do
evictions in larger blocks. One eighth or one sixteenth of the dense
map at a time might be a good balance here. Eviction will be done
under a new CSNEvictionLock.

First we publish the point up to which we are evicting from dense to
notify committing backends of the hazard.

We then read the current nextCSN value and publish it as the largest
possible global_csn_min value we can arrive at, so that any backend in
the middle of taking a snapshot that has fetched a CSN but hasn't yet
updated its procarray entry will notice that it is at risk and will
retry. Using nextCSN is very conservative, but as far as I can see the
invalidations will be rare and cheap. We have to issue a full memory
barrier here so that either the snapshot taker sees our value or we
see its csn_min.

If there is enough space, we just use global_csn_min as the sparse map
eviction horizon. If there is not enough free space in the currently
active sparse map to guarantee that the dense map block will fit, we
scan both the active sparse array and the to-be-evicted block,
collecting the missing space's worth of xid-csn pairs with the
smallest CSN values into a heap, shrinking the heap for every xid-csn
pair we can omit because the transaction was aborted or because it is
visible to all active snapshots. When we finish scanning and the heap
isn't empty, the largest value in the heap is the sparse map eviction
horizon.
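Finding "the k-th smallest CSN among evictable entries" with a bounded
heap can be sketched like this (a standard selection idiom, using a
negated min-heap as a max-heap; the function name is invented):

```python
import heapq

def sparse_eviction_horizon(pairs, slots_needed):
    """Find the CSN up to which entries must be spooled to the CSN
    log so that `slots_needed` slots are freed: keep a max-heap
    (negated values) of the `slots_needed` smallest CSNs; the largest
    CSN left in the heap is the eviction horizon."""
    heap = []
    for _xid, csn in pairs:
        if len(heap) < slots_needed:
            heapq.heappush(heap, -csn)
        elif -csn > heap[0]:
            heapq.heapreplace(heap, -csn)
    return -heap[0] if heap else None
```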

After arriving at the correct sparse map eviction horizon, we iterate
through the sparse map and the dense map block to be evicted, copying
all active or not-all-visible CSN slots to the inactive sparse map. We
also update clog for every committed transaction that we found in the
dense map. If we decided to evict some values, we write them to the
CSN log here and update the sparse map eviction horizon with the CSN
we arrived at. At this point the state in the current active sparse
map and the evictable dense map block are duplicated into the inactive
sparse map and CSN log. We now need to make the new state visible to
visibility checkers in a safe order, issuing a write barrier before
each step so the previous changes are visible:

* Notify all backends that have csn_min <= sparse map eviction horizon
that their snapshots are invalid and at what CSN log location they can
start to read to find concurrent xids.
* Increment sparse map version counter to switch sparse maps.
* Raise the dense map eviction horizon, to free up the dense map block.
* Overwrite the dense map block with special UncommittedCSN values
that are tagged with their XID (more on that later)

At this point we can consider the block cleared up for further use.
Because we don't want to lock shared structures for snapshotting, we
need to maintain a global xmin value. To do this we acquire a spinlock
on the global xmin value and check if it's empty; if no other
transaction is running, we replace it with our xid.

At this point we know the minimum CSN of any unconverted snapshots, so
it's also a good time to clean up unnecessary CSN log.

Finally we are done, so we can release CSNEvictionLock and XIDGenLock.

Committing a transaction
------------------------

Most of the commit sequence remains the same, but where we currently
do ProcArrayEndTransaction, we will acquire a new LWLock protecting
the nextCSN value; I'll call it CSNGenLock for now. We first read the
dense array eviction horizon, then stamp all our non-overflowed or
still-in-dense-map subtransaction slots with the special
SubcommittedCSN value, then stamp the top level xid with nextCSN. We
then issue a full memory barrier and check the soon-to-be-evicted
pointer into the dense map. If it overlaps with any of the XIDs we
just tagged, the backend doing the eviction might have missed our
update, so we have to wait for CSNEvictionLock to become free and go
restamp the relevant XIDs in the sparse map.

To stamp an XID with the commit CSN, we compare the XID to the dense
map eviction horizon. If the XID still maps to the dense array, we use
CAS to swap out UncommittedCSN(xid) for the value we need. If the CAS
fails, then between reading the dense array horizon and the actual
stamping an eviction process replaced our slot with a newer one; if we
didn't tag the slots with the XID value, we might accidentally stamp
another transaction's slot. If the XID maps to the sparse array, we
have to take CSNEvictionLock so the sparse array doesn't get swapped
out underneath us, then look up the slot, stamp it, and update the
clog before releasing the lock. Lock contention shouldn't be a problem
here as only long running transactions map to the sparse array.
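The dense-map stamping step, including why the slots carry their XID
tag, can be sketched as follows (the CAS is simulated; in the real
implementation it would be an atomic 64-bit compare-and-swap):

```python
UNCOMMITTED = "uncommitted"

def uncommitted_csn(xid):
    # Slots are tagged with their xid so a recycled slot never matches.
    return (UNCOMMITTED, xid)

def cas(array, idx, expected, new):
    """Stand-in for an atomic 64-bit compare-and-swap."""
    if array[idx] == expected:
        array[idx] = new
        return True
    return False

def stamp_commit(xid, csn, dense, dense_size, dense_eviction_horizon):
    """Returns True if stamped in the dense map; False means eviction
    raced us and the sparse map must be updated under the lock."""
    if xid >= dense_eviction_horizon:
        return cas(dense, xid % dense_size, uncommitted_csn(xid), csn)
    return False
```

Because the sentinel embeds the XID, a slot that was recycled for a
newer transaction makes the CAS fail instead of silently stamping the
wrong transaction.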

When done with the stamping we will check the global xmin value, if
it's our xid, we grab the spinlock, scan through the procarray for the
next xmin value, update and release it.

Rolling back a transaction
--------------------------

Rolling back a transaction is basically the same as committing, but
the CSN slots need to be stamped with an AbortedCSN.

Subtransactions
---------------

Because of limited size of the sparse array, we cannot keep open CSN
slots for all of the potentially unbounded number of subtransactions
there. I propose something similar to what is done currently with
PGPROC_MAX_CACHED_SUBXIDS. When we assign xids to subtransactions that
are above this limit, we tag them in the dense array with a special
OverflowedSubxidCSN value. When evicting subtransactions from the
dense array, non-overflowed subtransaction slots are handled like
regular slots, while overflowed slots are discarded. We also keep
track of the lowest and the latest subxid that overflowed; the lowest
overflowed subxid is reset when, before an eviction, the highest
overflowed subxid is lower than the smallest xid in the sparse array
(i.e. we know that the XID region covered by the sparse array doesn't
contain any overflowed subxids).
When constructing the regular snapshot we can then detect that we
don't have the full information about subtransactions and can
correctly set the overflowed flag on the snapshot. Similarly
visibility checks can omit subxid lookups for xids missing from the
sparse array when they know that the xid can't be overflowed.

Prepared transactions
---------------------

Prepared transactions are handled basically like regular transactions.
When starting up with prepared transactions present, they are inserted
into the sparse array; when they are committed they get stamped with
CSNs and become visible as usual. We just need to account for them
when sizing the sparse array.

Crash safety and checkpoints
----------------------------

Because clog updating is delayed for transactions in the dense map,
checkpoints need to flush the dense array before writing out the clog.
Otherwise the data structures are fully transient and don't need any
special treatment.

Hot standby
-----------

I haven't yet worked out how CSN based snapshots best integrate with
hot standby. As far as I can tell, we could just use the current
KnownAssignedXidsGetAndSetXmin mechanism and get regular snapshots.
But I think there is an opportunity here to get rid of most of or all
of the KnownAssignedXids mechanism if we WAL log the CSNs that are
assigned to commits (or use a side channel as proposed by Robert). The
extra write is not great, but might not be a large problem after the
WAL insertion scaling work is done. Another option would be to use the
LSN of commit record as the next CSN value. The locking in that case
requires some further thought to guarantee that commits are stamped in
the same order as WAL records are inserted without holding
WALInsertLock for too long. That seems doable by inserting committing
backends into a queue under WALInsertLock and then having them wait
for the transaction in front of them to commit once WALInsertLock has
been released.

Serializable transactions
-------------------------

I won't pretend to be familiar with SSI code, but as far as I can tell
serializable transactions don't need any modifications to work with
the CSN based snapshot scheme. There actually already is a commit
sequence number in the SSI code that could be replaced by the actual
CSN. IIRC one of the problems with serializable transactions on hot
standby was that transaction visibility order on the standby is
different from the master. If we use CSNs for visibility on the slave
then we can actually provide identical visibility order.

Required atomic primitives
--------------------------

Besides the copious number of memory barriers that are required for
correctness, we will need the following lockless primitives:
* 64bit atomic read
* 64bit atomic write
* 64bit CAS

Are there any supported platforms where it would be impossible to have
those? AFAIK everything from 32bit x86 through POWER and MIPS to ARM
can do it. If there are any platforms that can't handle 64bit atomics,
would it be ok to run on them with reduced concurrency/performance?

Sizing the CSN maps
-------------------

CSN maps need to be sized to accommodate the number of backends.

Dense array size should be picked so that most xids commit before
being evicted from the dense map and sparse array will contain slots
necessary for either long running transactions or for long snapshots
not yet converted to XID based snapshots. I did a few quick
simulations to measure the dynamics. If we assume a power law
distribution of transaction lengths and snapshots for the full length
of transactions with no think time, then 16 slots per backend is
enough to make the probability of eviction before commit less than
1e-6 and the probability of being needed at eviction due to a snapshot
about 1:10000. In the real world very long transactions are more
likely than the model predicts, but at least the normal variation is
mostly buffered by this size. 16 slots = 128 bytes per backend ends up
at a 12.5KB buffer for the default value of 100 backends, or 125KB for
1000 backends.

Sparse buffer needs to be at least big enough to fit CSN slots for the
xids of all active transactions and non-overflowed subtransactions. At
the current level of PGPROC_MAX_CACHED_SUBXIDS=64, the minimum comes
out at 16 bytes * (64 + 1) slots * 100 backends = 101.6KB per buffer,
or 203KB total in the default configuration.
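The sizing arithmetic above checks out:

```python
# Dense map: 16 slots of 8 bytes per backend.
dense_per_backend = 16 * 8
assert dense_per_backend == 128
assert dense_per_backend * 100 / 1024 == 12.5      # 12.5KB at 100 backends
assert dense_per_backend * 1000 / 1024 == 125.0    # 125KB at 1000 backends

# Sparse buffer: 16-byte slots (8-byte CSN + 4-byte XID, padded to
# 8-byte alignment), one slot per xid plus PGPROC_MAX_CACHED_SUBXIDS
# subxids per backend; two buffers (active + inactive) double it.
sparse_bytes = 16 * (64 + 1) * 100
assert round(sparse_bytes / 1024, 1) == 101.6      # per buffer
assert round(2 * sparse_bytes / 1024) == 203       # ~203KB total
```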

Performance discussion
----------------------

Taking a snapshot is extremely cheap in this scheme, I think the cost
will be mostly for publishing the csnmin and rechecking validity after
it. Taking snapshots in the shadow of another snapshot (often the case
for the proposed MVCC catalog access) will be even cheaper as we don't
have to advertise the new snapshot. The delayed XID based snapshot
construction should be unlikely, but even if it isn't the costs are
similar to GetSnapshotData, but we don't need to hold a lock.

Visibility checking will also be simpler as for the cases where the
snapshot is covered by the dense array it only requires two memory
lookups and comparisons.

The main extra work for writing transactions is the eviction process.
The amortized worst case extra work per xid is dominated by copying
the sparse buffer back and forth and spooling out the CSN log. We need
to write out 16 bytes per xid and copy (sparse buffer size / eviction
block size) sparse buffer slots. If we evict 1/8th of the dense map at
each eviction, it works out to about 520 bytes copied per xid
assigned, in the same ballpark as GetSnapshotData is now.
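One plausible way to arrive at the quoted 520-byte figure, assuming
the sizes from the sizing section (100 backends, a 104000-byte sparse
buffer, 16 dense slots per backend, 1/8 evicted per pass):

```python
# Each eviction pass copies the whole sparse buffer while freeing one
# eviction block's worth of dense slots, so the copy cost amortizes
# over the xids assigned into that block.
sparse_buffer_bytes = 16 * (64 + 1) * 100   # 104000, from the sizing section
xids_freed_per_eviction = (16 * 100) // 8   # 1/8 of the dense map = 200 slots
assert sparse_buffer_bytes // xids_freed_per_eviction == 520
```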

With the described scheme, long snapshots will cause the sparse buffer
to quickly fill up and then spool out until the backend wakes up,
converts its snapshots and releases the eviction process to free up
the log. It would be more efficient to be slightly more proactive and
tell backends to convert their snapshots earlier, so that if they
manage to be timely with the conversion we can avoid writing any CSN
log.

I'm not particularly pleased about the fact that both xid assignment
and committing can block behind the eviction lock. On the other hand,
plugging in some numbers, with 100 backends doing 100k xid
assignments/s the lock will be acquired 1000 times per second for less
than 100us at a time. The contention might not be bad enough to
warrant extra complexity to deal with it. If it does happen to be a
problem, then I have some ideas how to cope with it.

Having to do CSN log writes while holding an LWLock might not be the
best of ideas; to combat that we can either add one more buffer so
that the actual write syscall can happen after we release
CSNEvictionLock, or we can reuse the SLRU machinery to handle this.

Overall it looks to be a big win for typical workloads. Workloads
using large amounts of subtransactions might not be as well off, but I
doubt there will be a regression.

At this point I don't see any major issues with this approach. If the
ensuing discussion doesn't find any major showstoppers then I will
start to work on a patch bit-by-bit. It might take a while though as
my free hacking time has been severely cut down since we have a small
one in the family.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


  • Markus Wanner at Jun 7, 2013 at 12:01 pm
    Ants,
    On 06/07/2013 12:42 AM, Ants Aasma wrote:
    Given the recent ideas being thrown about changing how freezing and
    clog is handled and MVCC catalog access I thought I would write out
    the ideas that I have had about speeding up snapshots in case there is
    an interesting tie in with the current discussions.
    Thanks for this write-up.
    To refresh your memory the basic idea is to change visibility
    determination to be based on a commit sequence number (CSN for short)
    - a 8 byte number incremented on every commit representing the total
    ordering of commits. To take a snapshot in this scheme you only need
    to know the value of last assigned CSN, all transactions with XID less
    You mean with a *CSN* less than or equal to that number? There's
    certainly no point in comparing a CSN with an XID.
    than or equal to that number were commited at the time of the
    snapshots, everything above wasn't committed. Besides speeding up
    snapshot taking, this scheme can also be a building block for
    consistent snapshots in a multi-master environment with minimal
    communication.
    Agreed. Postgres-R uses a CommitOrderId, which is very similar in
    concept, for example.
    The main tricky part about this scheme is finding the CSN that was
    assigned to each XIDs in face of arbitrarily long transactions and
    snapshots using only a bounded amount of shared memory. The secondary
    tricky part is doing this in a way that doesn't need locks for
    visibility determination as that would kill any hope of a performance
    gain.
    Agreed. Para-phrasing, you can also say that CSNs can only ever identify
    completed transactions, where XIDs can be used to identify transactions
    in progress as well.

    [ We cannot possibly get rid of XIDs entirely for that reason. And the
    mapping between CSNs and XIDs has some additional cost. ]
    We need to keep around CSN slots for all currently running
    transactions and CSN slots of transactions that are concurrent with
    any active CSN based snapshot (xid.csn > min(snapshot.csn)). To do
    this I propose the following datastructures to do the XID-to-CSN
    mapping. For most recently assigned XIDs there is a ringbuffer of
    slots that contain the CSN values of the XIDs or special CSN values
    for transactions that haven't completed yet, aborted transactions or
    subtransactions. I call this the dense CSN map. Looking up a CSN of a
    XID from the ringbuffer is just a trivial direct indexing into the
    ring buffer.

    For long running transactions the ringbuffer may do a full circle
    before a transaction commits. Such CSN slots along with slots that are
    needed for older snapshots are evicted from the dense buffer into a
    sorted array of XID-CSN pairs, or the sparse mapping. For locking
    purposes there are two sparse buffers, one of them being active the
    other inactive, more on that later. Looking up the CSN value of a XID
    that has been evicted into the sparse mapping is a matter of
    performing a binary search to find the slot and reading the CSN value.
    I like this idea of dense and sparse "areas". Seems like a simple
    enough, yet reasonably compact representation that might work well in
    practice.
    Because snapshots can remain active for an unbounded amount of time
    and there can be unbounded amount of active snapshots, even the sparse
    mapping can fill up.
    I don't think the number of active snapshots matters - after all, they
    could all refer the same CSN. So that number shouldn't have any
    influence on the size of the sparse mapping.

    What does matter is the number of transactions referenced by such a
    sparse map. You are of course correct in that this number is equally
    unbounded.
    To handle that case, each backend advertises its
    lowest snapshot number csn_min. When values need to be evicted from
    the sparse mapping, they are evicted in CSN order and written into the
    CSN log - a series of CSN-XID pairs. Backends that may still be
    concerned about those values are then notified that values that they
    might need to use have been evicted. Backends with invalidated
    snapshots can then convert their snapshots to regular list of
    concurrent XIDs snapshots at their leisure.

    To convert a CSN based snapshot to XID based, a backend would first
    scan the shared memory structures for xids up to snapshot.xmax for
    CSNs that are concurrent to the snapshot and insert the XIDs into the
    snapshot, then read in the CSN log starting from snapshots CSN,
    looking for xid's less than the snapshots xmax. After this the
    snapshot can be handled like current snapshots are handled.
    Hm, I dislike the requirement to maintain two different snapshot formats.

    Also mind that snapshot conversions - however unlikely you choose to
    make them - may well result in bursts as multiple processes may need to
    do such a conversion, all starting at the same point in time.
    Rolling back a transaction
    --------------------------

    Rolling back a transaction is basically the same as committing, but
    the CSN slots need to be stamped with a AbortedCSN.
    Is that really necessary? After all, an aborted transaction behaves
    pretty much the same as a transaction in progress WRT visibility: it's
    simply not visible.

    Or why do you need to tell apart aborted from in-progress transactions
    by CSN?
    Sizing the CSN maps
    -------------------

    CSN maps need to sized to accomodate the number of backends.

    Dense array size should be picked so that most xids commit before
    being evicted from the dense map and sparse array will contain slots
    necessary for either long running transactions or for long snapshots
    not yet converted to XID based snapshots. I did a few quick
    simulations to measure the dynamics. If we assume a power law
    distribution of transaction lengths and snapshots for the full length
    of transactions with no think time, then 16 slots per backend is
    enough to make the probability of eviction before commit less than
    1e-6 and being needed at eviction due to a snapshot about 1:10000. In
    the real world very long transactions are more likely than predicted
    model, but at least the normal variation is mostly buffered by this
    size. 16 slots = 128bytes per backend ends up at a 12.5KB buffer for
    the default value of 100 backends, or 125KB for 1000 backends.
    Sounds reasonable to me.
    Sparse buffer needs to be at least big enough to fit CSN slots for the
    xids of all active transactions and non-overflowed subtransactions. At
    the current level PGPROC_MAX_CACHED_SUBXIDS=64, the minimum comes out
    at 16 bytes * (64 + 1) slots * 100 = backends = 101.6KB per buffer,
    or 203KB total in the default configuration.
    A CSN is 8 bytes, the XID 4, resulting in 12 bytes per slot. So I guess
    the given 16 bytes includes alignment to 8 byte boundaries. Sounds good.
    Performance discussion
    ----------------------

    Taking a snapshot is extremely cheap in this scheme, I think the cost
    will be mostly for publishing the csnmin and rechecking validity after
    it. Taking snapshots in the shadow of another snapshot (often the case
    for the proposed MVCC catalog access) will be even cheaper as we don't
    have to advertise the new snapshot. The delayed XID based snapshot
    construction should be unlikely, but even if it isn't the costs are
    similar to GetSnapshotData, but we don't need to hold a lock.

    Visibility checking will also be simpler as for the cases where the
    snapshot is covered by the dense array it only requires two memory
    lookups and comparisons.
    Keep in mind, though, that both of these lookups are into shared memory.
    Especially the dense ring buffer may well turn into a point of
    contention. Or at least the cache lines holding the most recent XIDs
    within that ring buffer.

    Where as currently, the snapshot's xip array resides in process-local
    memory. (Granted, often enough, the proc array also is a point of
    contention.)
    At this point I don't see any major issues with this approach. If the
    ensuing discussion doesn't find any major showstoppers then I will
    start to work on a patch bit-by-bit.
    Bit-by-bit? Reminds me of punched cards. I don't think we accept patches
    in that format. :-)
    we have a small one in the family.
    Congratulations on that one.

    Regards

    Markus Wanner
  • Ants Aasma at Jun 7, 2013 at 12:50 pm

    On Fri, Jun 7, 2013 at 2:59 PM, Markus Wanner wrote:
    To refresh your memory the basic idea is to change visibility
    determination to be based on a commit sequence number (CSN for short)
    - a 8 byte number incremented on every commit representing the total
    ordering of commits. To take a snapshot in this scheme you only need
    to know the value of last assigned CSN, all transactions with XID less
    You mean with a *CSN* less than or equal to that number? There's
    certainly no point in comparing a CSN with an XID.
    That was what I meant. I guess my coffee hadn't kicked in yet there.
    than or equal to that number were commited at the time of the
    snapshots, everything above wasn't committed. Besides speeding up
    snapshot taking, this scheme can also be a building block for
    consistent snapshots in a multi-master environment with minimal
    communication.
    Agreed. Postgres-R uses a CommitOrderId, which is very similar in
    concept, for example.
    Do you think having this snapshot scheme would be helpful for Postgres-R?
    Because snapshots can remain active for an unbounded amount of time
    and there can be unbounded amount of active snapshots, even the sparse
    mapping can fill up.
    I don't think the number of active snapshots matters - after all, they
    could all refer the same CSN. So that number shouldn't have any
    influence on the size of the sparse mapping.

    What does matter is the number of transactions referenced by such a
    sparse map. You are of course correct in that this number is equally
    unbounded.
    Yes, that is what I meant to say but for some reason didn't.
    To handle that case, each backend advertises its
    lowest snapshot number csn_min. When values need to be evicted from
    the sparse mapping, they are evicted in CSN order and written into the
    CSN log - a series of CSN-XID pairs. Backends that may still be
    concerned about those values are then notified that values that they
    might need to use have been evicted. Backends with invalidated
    snapshots can then convert their snapshots to regular list of
    concurrent XIDs snapshots at their leisure.

    To convert a CSN based snapshot to XID based, a backend would first
    scan the shared memory structures for xids up to snapshot.xmax for
    CSNs that are concurrent to the snapshot and insert the XIDs into the
    snapshot, then read in the CSN log starting from snapshots CSN,
    looking for xid's less than the snapshots xmax. After this the
    snapshot can be handled like current snapshots are handled.
    Hm, I dislike the requirement to maintain two different snapshot formats.

    Also mind that snapshot conversions - however unlikely you choose to
    make them - may well result in bursts as multiple processes may need to
    do such a conversion, all starting at the same point in time.
    I agree that two snapshot formats is not great. On the other hand, the
    additional logic is confined to XidInMVCCSnapshot and is reasonably
    simple. If we didn't convert the snapshots we would have to keep
    spooling out CSN log and look up XIDs for each visibility check. We
could add a cache for XIDs that were deemed concurrent, but that is in
effect just lazily constructing the same data structure. The work
needed to convert is reasonably well bounded, can be done without
holding global locks, and in most circumstances should only be
necessary for snapshots that are used for a long time, which will
amortize the cost. I'm not worried about the bursts because the
conversion is done lock-free, and starting at the same point in time
leads to better cache utilization.
    Rolling back a transaction
    --------------------------

Rolling back a transaction is basically the same as committing, but
the CSN slots need to be stamped with an AbortedCSN.
    Is that really necessary? After all, an aborted transaction behaves
    pretty much the same as a transaction in progress WRT visibility: it's
    simply not visible.

    Or why do you need to tell apart aborted from in-progress transactions
    by CSN?
I need to detect aborted transactions so they can be discarded during
the eviction process; otherwise the sparse array will fill up. They
could also be filtered out by cross-referencing uncommitted slots with
the procarray. Having the abort case do some additional work to make
XID assignment cheaper looks like a good tradeoff.
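The abort-filtering idea can be sketched with reserved sentinel CSN values; the sentinels and names here are illustrative only:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Illustrative sentinel values; real reserved values would be chosen
 * so they can never collide with assigned CSNs. */
#define UNCOMMITTED_CSN UINT64_MAX
#define ABORTED_CSN     (UINT64_MAX - 1)

typedef struct { uint32_t xid; uint64_t csn; } CsnSlot;

/* During eviction, slots stamped AbortedCSN are simply discarded
 * rather than written to the CSN log, keeping the sparse map from
 * filling up with entries nobody will ever look up.  Returns the new
 * entry count. */
size_t drop_aborted_slots(CsnSlot *map, size_t nentries)
{
    size_t kept = 0;
    for (size_t i = 0; i < nentries; i++)
        if (map[i].csn != ABORTED_CSN)
            map[kept++] = map[i];
    return kept;
}
```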
    Sparse buffer needs to be at least big enough to fit CSN slots for the
    xids of all active transactions and non-overflowed subtransactions. At
the current level of PGPROC_MAX_CACHED_SUBXIDS=64, the minimum comes
out at 16 bytes * (64 + 1) slots * 100 backends = 101.6KB per buffer,
or 203KB total in the default configuration.
    A CSN is 8 bytes, the XID 4, resulting in 12 bytes per slot. So I guess
    the given 16 bytes includes alignment to 8 byte boundaries. Sounds good.
8 byte alignment for CSNs is needed for atomic access, if not something else.
    I think the size could be cut in half by using a base value for CSNs
    if we assume that no xid is active for longer than 2B transactions as
    is currently the case. I didn't want to include the complication in
    the first iteration, so I didn't verify if that would have any
    gotchas.
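The sizing arithmetic above checks out (assuming two buffers account for the doubled total):

```c
#include <assert.h>

/* Worked check of the sizing above: 12 bytes of payload (8-byte CSN +
 * 4-byte XID) padded to 16 bytes for 8-byte alignment, one slot per
 * top-level xid plus 64 cached subxids, 100 backends. */
int sparse_buffer_bytes(void)
{
    const int slot_bytes = 16;            /* 12 bytes padded to 16 */
    const int slots_per_backend = 64 + 1; /* PGPROC_MAX_CACHED_SUBXIDS + 1 */
    const int backends = 100;
    return slot_bytes * slots_per_backend * backends;
}
```

That gives 104,000 bytes (101.6KB) per buffer and 208,000 bytes (203.1KB) for two.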
    Performance discussion
    ----------------------

Taking a snapshot is extremely cheap in this scheme; I think the cost
will be mostly in publishing the csn_min and rechecking validity after
it. Taking snapshots in the shadow of another snapshot (often the case
for the proposed MVCC catalog access) will be even cheaper, as we don't
have to advertise the new snapshot. The delayed XID based snapshot
construction should be unlikely, but even if it isn't, the costs are
similar to GetSnapshotData, except we don't need to hold a lock.

Visibility checking will also be simpler: for the cases where the
snapshot is covered by the dense array, it only requires two memory
lookups and comparisons.
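The dense-map fast path can be sketched as follows; this is a simplified illustration with made-up names, omitting the dense/sparse boundary check:

```c
#include <assert.h>
#include <stdint.h>

#define DENSE_MAP_SIZE 4096   /* illustrative ring size */

/* Hypothetical dense CSN map: a ring buffer indexed directly by XID.
 * A slot holding UINT64_MAX stands for "not yet committed"; a real
 * implementation would also have sentinels for aborted xids and
 * subtransactions. */
static uint64_t dense_csn_map[DENSE_MAP_SIZE];

/* Fast-path visibility check for an XID covered by the dense map:
 * one direct-indexed load of the slot and one comparison against the
 * snapshot's CSN. */
int xid_visible_in_snapshot(uint32_t xid, uint64_t snapshot_csn)
{
    uint64_t csn = dense_csn_map[xid % DENSE_MAP_SIZE];
    return csn <= snapshot_csn;  /* committed at or before the snapshot */
}
```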
    Keep in mind, though, that both of these lookups are into shared memory.
    Especially the dense ring buffer may well turn into a point of
    contention. Or at least the cache lines holding the most recent XIDs
    within that ring buffer.

Whereas currently, the snapshot's xip array resides in process-local
    memory. (Granted, often enough, the proc array also is a point of
    contention.)
Visibility checks are done lock-free so they don't cause any
    contention. The number of times each cache line can be invalidated is
    bounded by 8. Overall I think actual performance tests are needed to
    see if there is a problem, or perhaps if having the data shared
    actually helps with cache hit rates.
    At this point I don't see any major issues with this approach. If the
    ensuing discussion doesn't find any major showstoppers then I will
    start to work on a patch bit-by-bit.
    Bit-by-bit? Reminds me of punched cards. I don't think we accept patches
    in that format. :-)
    we have a small one in the family.
    Congratulations on that one.
    Thanks,
    Ants Aasma
    --
    Cybertec Schönig & Schönig GmbH
    Gröhrmühlgasse 26
    A-2700 Wiener Neustadt
    Web: http://www.postgresql-support.de
  • Markus Wanner at Jun 11, 2013 at 6:55 am
    Ants,

    the more I think about this, the more I start to like it.
    On 06/07/2013 02:50 PM, Ants Aasma wrote:
    On Fri, Jun 7, 2013 at 2:59 PM, Markus Wanner wrote:
    Agreed. Postgres-R uses a CommitOrderId, which is very similar in
    concept, for example.
    Do you think having this snapshot scheme would be helpful for Postgres-R?
    Yeah, it could help to reduce patch size, after a rewrite to use such a CSN.
    Or why do you need to tell apart aborted from in-progress transactions
    by CSN?
I need to detect aborted transactions so they can be discarded during
the eviction process; otherwise the sparse array will fill up. They
could also be filtered out by cross-referencing uncommitted slots with
the procarray. Having the abort case do some additional work to make
XID assignment cheaper looks like a good tradeoff.
    I see.
    Sparse buffer needs to be at least big enough to fit CSN slots for the
    xids of all active transactions and non-overflowed subtransactions. At
the current level of PGPROC_MAX_CACHED_SUBXIDS=64, the minimum comes
out at 16 bytes * (64 + 1) slots * 100 backends = 101.6KB per buffer,
or 203KB total in the default configuration.
    A CSN is 8 bytes, the XID 4, resulting in 12 bytes per slot. So I guess
    the given 16 bytes includes alignment to 8 byte boundaries. Sounds good.
8 byte alignment for CSNs is needed for atomic access, if not something else.
    Oh, right, atomic writes.
    I think the size could be cut in half by using a base value for CSNs
    if we assume that no xid is active for longer than 2B transactions as
    is currently the case. I didn't want to include the complication in
    the first iteration, so I didn't verify if that would have any
    gotchas.
    In Postgres-R, I effectively used a 32-bit order id which wraps around.

    In this case, I guess adjusting the base value will get tricky. Wrapping
    could probably be used as well, instead.
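A wrapping 32-bit order id can be compared without a base value by using the same modular-arithmetic trick PostgreSQL applies to xids in TransactionIdPrecedes; a sketch:

```c
#include <assert.h>
#include <stdint.h>

/* Wraparound-aware ordering for a 32-bit commit order id: a is
 * "before" b if the signed distance from a to b is positive.  This
 * stays correct as long as no two live ids are ever more than 2^31
 * apart, mirroring the xid-wraparound assumption. */
int csn32_precedes(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}
```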
    The number of times each cache line can be invalidated is
    bounded by 8.
    Hm.. good point.

    Regards

    Markus Wanner
  • Greg Stark at Jun 7, 2013 at 12:48 pm

    On Thu, Jun 6, 2013 at 11:42 PM, Ants Aasma wrote:
    To refresh your memory the basic idea is to change visibility
    determination to be based on a commit sequence number (CSN for short)
    - a 8 byte number incremented on every commit representing the total
    ordering of commits
    I think we would just use the LSN of the commit record which is
    effectively the same but doesn't require a new counter.
    I don't think this changes anything though.


    --
    greg
  • Ants Aasma at Jun 7, 2013 at 12:54 pm

    On Fri, Jun 7, 2013 at 3:47 PM, Greg Stark wrote:
    On Thu, Jun 6, 2013 at 11:42 PM, Ants Aasma wrote:
    To refresh your memory the basic idea is to change visibility
    determination to be based on a commit sequence number (CSN for short)
    - a 8 byte number incremented on every commit representing the total
    ordering of commits
    I think we would just use the LSN of the commit record which is
    effectively the same but doesn't require a new counter.
    I don't think this changes anything though.
    I briefly touched on that point in the Hot Standby section. This has
    some consequences for locking in CommitTransaction, but otherwise LSN
    is as good as any other monotonically increasing value.

    Regards,
    Ants Aasma
    --
    Cybertec Schönig & Schönig GmbH
    Gröhrmühlgasse 26
    A-2700 Wiener Neustadt
    Web: http://www.postgresql-support.de
  • Kevin Grittner at Jun 7, 2013 at 3:15 pm

    Ants Aasma wrote:

    Serializable transactions
    -------------------------

    I won't pretend to be familiar with SSI code, but as far as I can
    tell serializable transactions don't need any modifications to
    work with the CSN based snapshot scheme. There actually already
    is a commit sequence number in the SSI code that could be
    replaced by the actual CSN.
    That seems quite likely to work, and may be good for performance.
    IIRC one of the problems with serializable transactions on hot
    standby was that transaction visibility order on the standby is
    different from the master.
Pretty much.  Technically what SSI does is to ensure that every
serializable transaction's view of the data is consistent with some
serial (one-at-a-time) execution of serializable transactions.  That
"apparent order of execution" does not always match commit order.
    The problem is that without further work a hot standby query could
    see a state which would not have been possible for it to see on the
    master.  For example, a batch is closed but a transaction has not
    yet committed which is part of that batch.  For an example, see:

    http://wiki.postgresql.org/wiki/SSI#Read_Only_Transactions

    As that example demonstrates, as long as no serializable
    transaction *sees* the incomplete batch with knowledge (or at least
    potential knowledge) of the batch being closed, the pending
    transaction affecting the batch is allowed to commit.  If the
    conflicting state is exposed by even a read-only query, the
    transaction with the pending change to the batch is canceled.

    A hot standby can't cancel the pending transaction -- at least not
    without adding additional communications channels and latency.  The
    ideas which have been bandied about have had to do with allowing
    serializable transactions on a hot standby to use snapshots which
    are known to be "safe" -- that is, they cannot expose any such
    state.  It might be possible to identify known safe points in the
    commit stream on the master and pass that information along in the
    WAL stream.  The down side is that the hot standby would need to
    either remember the last such safe snapshot or wait for the next
    one, and while these usually come up fairly quickly in most
    workloads, there is no actual bounding on how long that could take.

    A nicer solution, if we can manage it, is to allow snapshots on the
    hot standby which are not based exclusively on the commit order,
    but use the apparent order of execution from the master.  It seems
    non-trivial to get that right.
    If we use CSNs for visibility on the slave then we can actually
    provide identical visibility order.
    As the above indicates, that's not really true without using
    "apparent order of execution" instead of "commit order".  In the
    absence of serializable transactions those are always the same (I
    think), but to provide a way to allow serializable transactions on
    the hot standby the order would need to be subject to rearrangement
    based on read-write conflicts among transactions on the master, or
    snapshots which could expose serialization anomalies would need to
    be prohibited.

    --
    Kevin Grittner
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company

Discussion Overview
group: pgsql-hackers
categories: postgresql
posted: Jun 6, '13 at 10:42p
active: Jun 11, '13 at 6:55a
posts: 7
users: 4
website: postgresql.org...
irc: #postgresql
