Hi,

I have been thinking about how to handle long running transactions with
Robert’s commit sequence number (CSN) idea.
http://archives.postgresql.org/message-id/CA%2BTgmoaAjiq%3Dd%3DkYt3qNj%2BUvi%2BMB-aRovCwr75Ca9egx-Ks9Ag%40mail.gmail.com

I just started to go through transaction management code, so I would
appreciate if anyone could point out any gaping holes in my ideas.

There are two reasons why XID-to-CSN values need to be kept around. One
is that long running transactions need a place to publish their CSN for
CSN based snapshots. The other is that CSN based snapshots need to know
which transactions between xmin and xmax are visible and which are
invisible.

My idea is to use hybrid storage for the XID-to-CSN map. For recent
transactions, shared memory contains a dense, XID-indexed ring buffer of
CSN values. The head of the buffer moves with nextXid. The tail is also
moved on xid assignment, but in larger chunks to reduce lock contention.
When moving the tail up, any XID that might still be invisible to some
snapshot is moved to a sparse buffer. To keep the size of the sparse
buffer within limits, snapshots that are too old are converted into
list-of-running-xids snapshots. This allows us to advance the horizon
visible to all CSN-based snapshots (csnmin) and to release xids below
that horizon from the sparse buffer. CSNs themselves can be uint32s;
this means that no transaction can be older than 2 billion writing
transactions, which is already enforced by xids being 32-bit values.
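As a rough illustration, the hybrid storage described above might be laid out like this. This is a minimal single-process sketch in C; all names, sizes, and the CSN_RUNNING sentinel are illustrative, not actual PostgreSQL identifiers, and the modulo-2^31 xid comparison is elided:

```c
#include <stdint.h>
#include <assert.h>

#define DENSE_SIZE  1024           /* e.g. num_backends * 16 */
#define SPARSE_SIZE 256

typedef uint32_t TransactionId;
typedef uint32_t CommitSeqNo;

#define CSN_RUNNING 0              /* sentinel: xid has not committed yet */

typedef struct {
    /* Dense ring buffer: CSNs for recent xids, indexed by xid mod size. */
    TransactionId dense_tail;      /* oldest xid still in the dense array */
    TransactionId next_xid;        /* head moves with xid assignment */
    CommitSeqNo   dense[DENSE_SIZE];

    /* Sparse overflow: xids evicted from the dense array that may still
       be invisible to some CSN snapshot, or were running at eviction. */
    struct { TransactionId xid; CommitSeqNo csn; } sparse[SPARSE_SIZE];
    int sparse_used;

    CommitSeqNo next_csn;          /* the global CSN counter */
} CsnMap;

/* An xid is covered by the dense array iff dense_tail <= xid < next_xid. */
static int xid_in_dense(const CsnMap *m, TransactionId xid)
{
    return xid >= m->dense_tail && xid < m->next_xid;
}
```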

Taking a snapshot using this idea would consist of an Xmin, an Xmax and
a CSN. The snapshot Xmin guarantees that no xid below Xmin can have a
CSN greater than the snapshot's. One way to obtain Xmin would be to
maintain a global value when doing the tail update of the dense array.
The snapshot Xmax guarantees that no xid above Xmax can have a CSN less
than the snapshot's; Xmax can be easily obtained from nextXid. The CSN
is simply the current value of the CSN counter. After the snapshot is
taken, the proc's csnmin is updated if necessary and the snapshot is
added to the list of acquired snapshots.

All of the guarantees of taking a snapshot can be maintained without
global locking if we have atomic reads/writes of CSN and XID values. We
need to perform the loads in the order Xmin -> CSN -> Xmax, inserting
read barriers between them. Write-side ordering is guaranteed by the CSN
lock. Updating the proc entry can be done under the per-proc lock.
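The load ordering above can be sketched with C11 atomics standing in for PostgreSQL's barrier primitives; the acquire loads double as the read barriers between the three reads. All names and types here are illustrative:

```c
#include <stdatomic.h>
#include <stdint.h>
#include <assert.h>

typedef uint32_t TransactionId;
typedef uint32_t CommitSeqNo;

typedef struct {
    TransactionId xmin;  /* no xid below this has a CSN above snap.csn */
    TransactionId xmax;  /* no xid above this has a CSN below snap.csn */
    CommitSeqNo   csn;   /* value of the CSN counter at snapshot time */
} CsnSnapshot;

/* Shared state, assumed updated under the CSN lock on the write side. */
static _Atomic uint32_t global_xmin;
static _Atomic uint32_t global_csn;
static _Atomic uint32_t next_xid;

static CsnSnapshot take_snapshot(void)
{
    CsnSnapshot snap;
    /* Load in the order Xmin -> CSN -> Xmax; each acquire load acts as
       the read barrier separating it from the next load. */
    snap.xmin = atomic_load_explicit(&global_xmin, memory_order_acquire);
    snap.csn  = atomic_load_explicit(&global_csn,  memory_order_acquire);
    snap.xmax = atomic_load_explicit(&next_xid,    memory_order_acquire);
    return snap;
}
```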

To check visibility of an xid between snapshot xmin and xmax, we first
check whether it is above the dense xid array tail. If it is in the
dense array, just read the CSN value and compare it to the snapshot's.
If the xid falls below the ring buffer tail, search the sparse buffer.
If the xid is found in the sparse buffer, read the CSN value and
compare. If it is not found, it is guaranteed to have completed before
the snapshot; just read the clog status for the transaction. This is all
done without locking. When looking something up in the dense array, we
have to issue a read barrier and recheck the tail location to ensure
that it hasn't moved out from under us. If it has, retry.
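A single-process sketch of this check, with the volatile reads and the read barrier reduced to comments (all names illustrative; a clog lookup is stubbed out as a boolean parameter):

```c
#include <stdint.h>
#include <stdbool.h>
#include <assert.h>

typedef uint32_t TransactionId;
typedef uint32_t CommitSeqNo;

#define DENSE_SIZE  1024
#define CSN_RUNNING 0
#define NOT_FOUND   ((CommitSeqNo) -1)

typedef struct {
    TransactionId dense_tail;
    CommitSeqNo   dense[DENSE_SIZE];
    struct { TransactionId xid; CommitSeqNo csn; } sparse[64];
    int sparse_used;
} CsnMap;

static CommitSeqNo sparse_lookup(const CsnMap *m, TransactionId xid)
{
    for (int i = 0; i < m->sparse_used; i++)
        if (m->sparse[i].xid == xid)
            return m->sparse[i].csn;
    return NOT_FOUND;
}

/* True iff xid committed before snapshot_csn. clog_committed stands in
   for the clog lookup used when the xid fell out of both buffers. */
static bool xid_visible(const CsnMap *m, TransactionId xid,
                        CommitSeqNo snapshot_csn, bool clog_committed)
{
    for (;;)
    {
        TransactionId tail = m->dense_tail;  /* volatile read in reality */
        if (xid >= tail)
        {
            CommitSeqNo csn = m->dense[xid % DENSE_SIZE];
            /* read barrier here; recheck that the tail hasn't moved
               past this xid while we were reading its slot */
            if (m->dense_tail != tail && xid < m->dense_tail)
                continue;                    /* evicted under us: retry */
            return csn != CSN_RUNNING && csn < snapshot_csn;
        }
        CommitSeqNo csn = sparse_lookup(m, xid);
        if (csn != NOT_FOUND)
            return csn != CSN_RUNNING && csn < snapshot_csn;
        /* Not in either buffer: completed before the snapshot. */
        return clog_committed;
    }
}
```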

To commit a transaction, we take a lock on the CSN, stamp all our XIDs
with its value, update the clog, issue a write barrier to ensure that
lock-free readers see the changes, increment the CSN, and release the
lock. For each of our XIDs we need to check whether it is above or below
the dense ring buffer tail and update the respective location.
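The commit path might look roughly like this, with the CSN lock, the clog update, and the write barrier reduced to comments (a sketch only; all names are illustrative):

```c
#include <stdint.h>
#include <assert.h>

typedef uint32_t TransactionId;
typedef uint32_t CommitSeqNo;

#define DENSE_SIZE  1024
#define CSN_RUNNING 0

typedef struct {
    TransactionId dense_tail;
    CommitSeqNo   dense[DENSE_SIZE];
    struct { TransactionId xid; CommitSeqNo csn; } sparse[64];
    int sparse_used;
    CommitSeqNo next_csn;
} CsnMap;

/* Stamp one xid: dense slot if above the tail, sparse slot otherwise. */
static void stamp_xid(CsnMap *m, TransactionId xid, CommitSeqNo csn)
{
    if (xid >= m->dense_tail)
    {
        m->dense[xid % DENSE_SIZE] = csn;
        return;
    }
    for (int i = 0; i < m->sparse_used; i++)
        if (m->sparse[i].xid == xid)
        {
            m->sparse[i].csn = csn;
            return;
        }
}

/* Commit: stamp the top xid and subxids, then advance the counter. */
static CommitSeqNo commit_xids(CsnMap *m, const TransactionId *xids,
                               int nxids)
{
    /* take the CSN lock here */
    CommitSeqNo csn = m->next_csn;
    for (int i = 0; i < nxids; i++)
        stamp_xid(m, xids[i], csn);
    /* update clog for all xids, then a write barrier so lock-free
       readers see the stamps before the counter moves */
    m->next_csn = csn + 1;
    /* release the CSN lock here */
    return csn;
}
```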

Moving the dense ring buffer tail should be done when assigning XIDs.
This allows us to enforce a hard limit on the size of the ring buffer.
To reduce contention on the CSN lock, this should be done in large
increments; 1/8 of the ring buffer at a time seems like a good
compromise between update frequency and buffer space efficiency.

To move the tail, we first find the global minimum of the snapshot CSNs
(csnmin). If every proc maintains its own csnmin under the per-proc
lock, we can just read the current csn and go through all the procs,
taking each per-proc lock and fetching its csnmin, to arrive at the
global value. Then we go through the XIDs between the old tail and the
new tail and find all that are still running or not yet visible to
csnmin. We insert those XIDs into the sparse buffer, issue a write
barrier so they are visible to everyone, and then update the tail
location. Because we are trawling through the procarray here anyway,
it's also a good time to update the global xmin.
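The csnmin computation and the tail move could be sketched as follows, with per-proc locking and the write barrier reduced to comments (illustrative names throughout):

```c
#include <stdint.h>
#include <assert.h>

typedef uint32_t TransactionId;
typedef uint32_t CommitSeqNo;

#define DENSE_SIZE  1024
#define CSN_RUNNING 0

typedef struct {
    TransactionId dense_tail;
    CommitSeqNo   dense[DENSE_SIZE];
    struct { TransactionId xid; CommitSeqNo csn; } sparse[64];
    int sparse_used;
} CsnMap;

/* Fold the per-proc csnmin values (each read under that proc's lock)
   into a global minimum, starting from the current CSN. */
static CommitSeqNo global_csnmin(const CommitSeqNo *proc_csnmin,
                                 int nprocs, CommitSeqNo current_csn)
{
    CommitSeqNo min = current_csn;
    for (int i = 0; i < nprocs; i++)
        if (proc_csnmin[i] < min)
            min = proc_csnmin[i];
    return min;
}

/* Evict still-interesting xids to the sparse buffer, then publish the
   new tail. */
static void advance_tail(CsnMap *m, TransactionId new_tail,
                         CommitSeqNo csnmin)
{
    for (TransactionId xid = m->dense_tail; xid < new_tail; xid++)
    {
        CommitSeqNo csn = m->dense[xid % DENSE_SIZE];
        /* keep xids that are still running or not visible to csnmin */
        if (csn == CSN_RUNNING || csn >= csnmin)
        {
            m->sparse[m->sparse_used].xid = xid;
            m->sparse[m->sparse_used].csn = csn;
            m->sparse_used++;
        }
    }
    /* write barrier here, so the sparse entries are visible first */
    m->dense_tail = new_tail;
}
```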

Because we want to avoid unbounded growth of the sparse buffer, we need
to get rid of old CSN-based snapshots. Tying this into XID assignment is
a good way to get some provable limits, even if subtransactions can
overflow somehow. For instance, we can keep track of the CSN during the
last few tail updates and convert any snapshots that are too old; this
caps the sparse buffer at O(num_backends*non-overflowed-sub-txs +
n_steps*tail-update-step) entries. When we go to find the new csnmin
and discover that a backend has a csnmin that is too old, we go through
the snapshots of that backend and convert every snapshot under the
desired csnmin to a traditional snapshot. To convert a snapshot to a
traditional one, we must find the list of xids that were running when
its CSN was assigned. For this, we go through the sparse buffer and find
all xids between xmin and xmax that have a CSN above that of the
snapshot or are still running. While doing that, we might as well
tighten the xmin and xmax bounds, and perhaps, if xmax - xmin is small
enough, store the list as a bitmap. When done, we assign the array to
the snapshot, issue a write barrier, and set a flag on the snapshot
telling it to use the embedded running-xids array when checking
visibility. As with updates to the tail of the dense array, the status
of this flag needs to be rechecked after any CSN-based visibility check
to ensure correctness.

After we are sure that no snapshot exists below our new csnmin, we can
go through the sparse buffer and evict xids that are visible to all.
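The conversion step described above might be sketched as follows (illustrative names; the real version would also tighten xmin/xmax, handle the bitmap case, and flip the snapshot's flag after a write barrier):

```c
#include <stdint.h>
#include <assert.h>

typedef uint32_t TransactionId;
typedef uint32_t CommitSeqNo;

#define CSN_RUNNING 0

typedef struct { TransactionId xid; CommitSeqNo csn; } SparseEntry;

/* Collect, from the sparse buffer, every xid in [xmin, xmax) that is
   invisible to snapshot_csn: still running, or committed at or after
   the snapshot's CSN. Fills xip[] and returns the count. */
static int convert_snapshot(const SparseEntry *sparse, int n,
                            TransactionId xmin, TransactionId xmax,
                            CommitSeqNo snapshot_csn,
                            TransactionId *xip)
{
    int count = 0;
    for (int i = 0; i < n; i++)
    {
        const SparseEntry *e = &sparse[i];
        if (e->xid < xmin || e->xid >= xmax)
            continue;
        if (e->csn == CSN_RUNNING || e->csn >= snapshot_csn)
            xip[count++] = e->xid;
    }
    /* the caller then assigns xip to the snapshot, issues a write
       barrier, and sets the "use embedded xid array" flag */
    return count;
}
```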

Moving the tail needn't be done under XidGenLock; a separate
denseHorizonUpdate lock would do. Commits would still block, though,
unless we are willing either to make snapshots block or to hand out
stale snapshots.

The sparse buffer needs to be a single-updater, bounded-size, lock-free
data structure. When inserting into it, we need to guarantee visibility
before the dense array tail update. When removing from it, we need to
guarantee correct traversal and that no one can read an entry after it
is reused.

The dense buffer needs to be sized so that most writing transactions
complete before falling out of the ring buffer and most snapshots are
released before being converted. Something like num_backends*16 might be
a good start; this would use 4KB of memory for every 64 connections.

Under this scheme, taking a snapshot would be O(1), committing would be
O(subtransactions), and getting a new xid would be O(1 + num_backends /
dense buffer step), which is O(1) amortized if the buffer step is scaled
with num_backends. The worst case seems to be num_backends - 1 backends
with long read/write transactions and one backend with a very high write
transaction rate.

I haven’t yet thought through how subtransactions and replication would
work with this scheme, but I feel that they don’t change any of the
basic characteristics.

Your thoughts?

--
Ants Aasma

  • Robert Haas at Sep 8, 2011 at 2:28 pm

    On Thu, Sep 8, 2011 at 9:26 AM, Ants Aasma wrote:
    When we go to find the new csnmin
    and discover that a backend has a csnmin that is too old, we go through
    the snapshots of that backend and convert every snapshot under the
    desired csnmin to a traditional snapshot.
    I thought about something along these lines (though I didn't flesh out
    the details as much as you have here), but rejected it because the
    step described above would require all snapshots to be stored in
    shared memory. That's problematic for several reasons:

    1. A backend can have lots of snapshots, potentially requiring an
    unbounded amount of shared memory. We can't accommodate that.

    2. You'd need to protect all of those snapshots with spinlocks or
    something, which would be wicked expensive, because the owning process
    would need to take and release that spinlock every time it touched the
    snapshot.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Ants Aasma at Sep 8, 2011 at 3:33 pm

    On Thu, Sep 8, 2011 at 5:28 PM, Robert Haas wrote:
    On Thu, Sep 8, 2011 at 9:26 AM, Ants Aasma wrote:
    When we go to find the new csnmin
    and discover that a backend has a csnmin that is too old, we go through
    the snapshots of that backend and convert every snapshot under the
    desired csnmin to a traditional snapshot.
    I thought about something along these lines (though I didn't flesh out
    the details as much as you have here), but rejected it because the
    step described above would require all snapshots to be stored in
    shared memory.  That's problematic for several reasons:

    1. A backend can have lots of snapshots, potentially requiring an
    unbounded amount of shared memory.  We can't accommodate that.
    If PostgreSQL gets POSIX shared memory capability at some point in the
    future, would it be enough to accommodate snapshots in shared memory?
    2. You'd need to protect all of those snapshots with spinlocks or
    something, which would be wicked expensive, because the owning process
    would need to take and release that spinlock every time it touched the
    snapshot.
    It seems to me that converting a transaction's type can be done
    lock-free. The process that does the converting needs to ensure that
    the new transaction type flag is visible before releasing any xids.
    For visibility checking, the additional cost is a read barrier, two
    volatile reads (recheck snapshot type and dense horizon) and occasional
    retry after getting a visibility result.

    Maybe I'm missing something. I'll take a deeper look at the snapshot
    handling code tonight to see if anything else might have any
    implications.

    --
    Ants Aasma
  • Tom Lane at Sep 8, 2011 at 3:46 pm

    Ants Aasma writes:
    On Thu, Sep 8, 2011 at 5:28 PM, Robert Haas wrote:
    1. A backend can have lots of snapshots, potentially requiring an
    unbounded amount of shared memory.  We can't accommodate that.
    If PostgreSQL gets POSIX shared memory capability at some point in the
    future, would it be enough to accommodate snapshots in shared memory?
    Don't hold your breath. Even if we moved over to using mmap() in place
    of shmat(), it's a *very* long way from there to being able to
    dynamically expand shared memory and have all backends able to access
    the additional space. Doing that in a way that's cheap enough and
    transparent enough that we'd actually accept it is even more challenging
    --- for instance, it seems hard to impossible to guarantee that all
    backends would be able to map the added space at the same addresses.

    Most of the interest in mmap is not about any dynamic-expansion ideas,
    it's just about getting out from under the ridiculously small limits on
    SysV shmem space that are still default on many platforms.

    regards, tom lane
  • Robert Haas at Sep 8, 2011 at 3:46 pm

    On Thu, Sep 8, 2011 at 11:33 AM, Ants Aasma wrote:
    On Thu, Sep 8, 2011 at 5:28 PM, Robert Haas wrote:
    On Thu, Sep 8, 2011 at 9:26 AM, Ants Aasma wrote:
    When we go to find the new csnmin
    and discover that a backend has a csnmin that is too old, we go through
    the snapshots of that backend and convert every snapshot under the
    desired csnmin to a traditional snapshot.
    I thought about something along these lines (though I didn't flesh out
    the details as much as you have here), but rejected it because the
    step described above would require all snapshots to be stored in
    shared memory.  That's problematic for several reasons:

    1. A backend can have lots of snapshots, potentially requiring an
    unbounded amount of shared memory.  We can't accommodate that.
    If PostgreSQL gets POSIX shared memory capability at some point in the
    future, would it be enough to accommodate snapshots in shared memory?
    By itself, that particular change would not help with this problem.
    2. You'd need to protect all of those snapshots with spinlocks or
    something, which would be wicked expensive, because the owning process
    would need to take and release that spinlock every time it touched the
    snapshot.
    It seems to me that converting a transaction's type can be done
    lock-free. The process that does the converting needs to ensure that
    the new transaction type flag is visible before releasing any xids.
    For visibility checking, the additional cost is a read barrier, two
    volatile reads (recheck snapshot type and dense horizon) and occasional
    retry after getting a visibility result.

    Maybe I'm missing something. I'll take a deeper look at the snapshot
    handling code tonight to see if anything else might have any
    implications.
    I'm not convinced it's anywhere near that easy. For one thing, on at
    least one big server I'm playing with, memory latency on shared memory
    is vastly higher (like >10x!) than on backend-local memory due to NUMA
    effects.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Ants Aasma at Sep 8, 2011 at 9:45 pm

    On Thu, Sep 8, 2011 at 6:46 PM, Robert Haas wrote:
    I'm not convinced it's anywhere near that easy.  For one thing, on at
    least one big server I'm playing with, memory latency on shared memory
    is vastly higher (like >10x!) than on backend-local memory due to NUMA
    effects.
    I wonder if both the shared-memory and non-local-memory issues can be
    circumvented by using an SLRU-like mechanism as a side channel to
    publish taken snapshots and make concurrent xids available, with a
    sinval/hasmessages-like per-proc flag in shared memory to notify of
    migrated snapshots.

    I'll have to think through the space, locking and performance
    considerations. That might take a little while, though; I just managed
    to contract the flu and can't really think straight.

    Sorry to waste your time if this whole approach is completely untenable.
    It seemed like an interesting topic to sink my teeth into, but in
    hindsight it seems a bit too much for a first try.

    --
    Ants Aasma

Thread: pgsql-hackers @ postgresql.org, posted Sep 8, 2011 (6 posts, 3 users).