On Wed, Jul 27, 2011 at 10:51 PM, Robert Haas wrote:
On Wed, Oct 20, 2010 at 10:07 PM, Tom Lane wrote:
I wonder whether we could do something involving WAL properties --- the
current tuple visibility logic was designed before WAL existed, so it's
not exploiting that resource at all.  I'm imagining that the kernel of a
snapshot is just a WAL position, ie the end of WAL as of the time you
take the snapshot (easy to get in O(1) time).  Visibility tests then
reduce to "did this transaction commit with a WAL record located before
the specified position?".  You'd need some index datastructure that made
it reasonably cheap to find out the commit locations of recently
committed transactions, where "recent" means "back to recentGlobalXmin".
That seems possibly do-able, though I don't have a concrete design in
mind.
[discussion of why I don't think an LSN will work]

But having said that an LSN can't work, I don't see why we can't just
use a 64-bit counter.  In fact, the predicate locking code already
does something much like this, using an SLRU, for serializable
transactions only.  In more detail, what I'm imagining is an array
with 4 billion entries, one per XID, probably broken up into files of
say 16MB each with 2 million entries per file.  Each entry is a 64-bit
value.  It is 0 if the XID has not yet started, is still running, or
has aborted.  Otherwise, it is the commit sequence number of the
transaction.
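
For concreteness, the lookup that scheme implies is roughly the following
(a toy sketch, not real code; the names and layout are invented):

    /* Toy sketch of the XID -> commit-sequence-number lookup described
     * above; names are hypothetical. */
    #include <stdint.h>

    typedef uint32_t TransactionId;

    #define CSN_ENTRY_SIZE        sizeof(uint64_t)     /* 8 bytes per XID */
    #define CSN_ENTRIES_PER_FILE  (2 * 1024 * 1024)    /* 16MB / 8 bytes  */

    /* Which 16MB segment file holds this XID's entry, and at what offset?
     * A commit writes its sequence number there; a reader sees 0 for
     * "not yet started, still running, or aborted". */
    static inline void
    csn_locate(TransactionId xid, uint32_t *fileno, uint64_t *offset)
    {
        *fileno = xid / CSN_ENTRIES_PER_FILE;
        *offset = (uint64_t) (xid % CSN_ENTRIES_PER_FILE) * CSN_ENTRY_SIZE;
    }
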
I've been giving this quite a bit more thought, and have decided to
abandon the scheme described above, at least for now. It has the
advantage of avoiding virtually all locking, but it's extremely
inefficient in its use of memory in the presence of long-running
transactions. For example, if there's an open transaction that's been
sitting around for 10 million transactions or so and has an XID
assigned, any new snapshot is going to need to probe into the big
array for any XID in that range. At 8 bytes per entry, that means
we're randomly accessing about ~80MB of memory-mapped data. That
seems problematic both in terms of blowing out the cache and (on small
machines) possibly even blowing out RAM. Nor is that the worst case
scenario: a transaction could sit open for 100 million transactions.

Heikki has made the suggestion a few times (and a few other people
have since made somewhat similar suggestions in different words) of
keeping an up-to-date snapshot in shared memory such that transactions
that need a snapshot can simply copy it. I've since noted that in Hot
Standby mode, that's more or less what the KnownAssignedXids stuff
already does. I objected that, first, the overhead of updating the
snapshot for every commit would be too great, and second, it didn't
seem to do a whole lot to reduce the size of the critical section, and
therefore probably wouldn't improve performance that much. But I'm
coming around to the view that these might be solvable problems rather
than reasons to give up on the idea altogether.

With respect to the first problem, what I'm imagining is that we not
do a complete rewrite of the snapshot in shared memory on every
commit. Instead, when a transaction ends, we'll decide whether to (a)
write a new snapshot or (b) just record the XIDs that ended. If we do
(b), then any backend that wants a snapshot will need to copy from
shared memory both the most recently written snapshot and the XIDs
that have subsequently ended. From there, it can figure out which
XIDs are still running. Of course, if the list of recently-ended XIDs
gets too long, then taking a snapshot will start to get expensive, so
we'll need to periodically do (a) instead. There are other ways that
this could be done as well; for example, the KnownAssignedXids stuff
just flags XIDs that should be ignored and then periodically compacts
away the ignored entries.
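
To make (b) concrete, here's a rough sketch of how a backend could derive
its snapshot from the last stored snapshot plus the recently-ended XIDs.
All of the names are invented; the point is just the shape of the computation:

    /* Sketch only: combine the last snapshot written to shared memory with
     * the XIDs that have ended since, to get the list of still-running XIDs. */
    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;

    static bool
    xid_in_list(TransactionId xid, const TransactionId *list, int n)
    {
        for (int i = 0; i < n; i++)
            if (list[i] == xid)
                return true;
        return false;
    }

    /* Copy the stored snapshot's running XIDs, dropping any that appear in
     * the recently-ended list; returns how many are still running. */
    static int
    derive_snapshot(const TransactionId *stored, int nstored,
                    const TransactionId *ended, int nended,
                    TransactionId *out)
    {
        int nout = 0;

        for (int i = 0; i < nstored; i++)
            if (!xid_in_list(stored[i], ended, nended))
                out[nout++] = stored[i];
        return nout;
    }

The cost of that filtering grows with the product of the two list lengths,
which is one reason the recently-ended list can't be allowed to get very
long before we rewrite the snapshot.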

I think the real trick is figuring out a design that can improve
concurrency. If you keep a snapshot in shared memory and periodically
overwrite it in place, I don't think you're going to gain much.
Everyone who wants a snapshot still needs a share-lock and everyone
who wants to commit still needs an exclusive-lock, and while you might
be able to make the critical section a bit shorter, I think it's still
going to be hard to make big gains that way. What I'm thinking about
instead is using a ring buffer with three pointers: a start pointer, a
stop pointer, and a write pointer. When a transaction ends, we
advance the write pointer, write the XIDs or a whole new snapshot into
the buffer, and then advance the stop pointer. If we wrote a whole
new snapshot, we advance the start pointer to the beginning of the
data we just wrote.
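
In rough pseudo-C, the write side of that might look like the sketch below.
The struct and function names are invented, commits are assumed to be
serialized by the caller, and a real version would need memory barriers
between filling the buffer and advancing the pointers:

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;

    #define RING_SLOTS 6592          /* e.g. 64 * MaxBackends */

    typedef struct
    {
        volatile uint64_t start;     /* first slot of the latest full snapshot */
        volatile uint64_t stop;      /* end of valid data */
        volatile uint64_t write;     /* end of data being written */
        TransactionId     buf[RING_SLOTS];
    } SnapshotRing;

    /* Called at end of transaction, with commits serialized by the caller. */
    static void
    ring_write(SnapshotRing *ring, const TransactionId *xids, int nxids,
               bool whole_snapshot)
    {
        uint64_t pos = ring->write;

        ring->write = pos + nxids;                  /* claim the space */
        for (int i = 0; i < nxids; i++)
            ring->buf[(pos + i) % RING_SLOTS] = xids[i];
        ring->stop = pos + nxids;                   /* publish the data */
        if (whole_snapshot)
            ring->start = pos;                      /* snapshot begins here */
    }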

Someone who wants to take a snapshot must read the data between the
start and stop pointers, and must then check that the write pointer
hasn't advanced so far in the meantime that the data they read might
have been overwritten before they finished reading it. Obviously,
that's a little risky, since we'll have to do the whole thing over if
a wraparound occurs, but if the ring buffer is large enough it
shouldn't happen very often. And a typical snapshot is pretty small
unless massive numbers of subxids are in use, so it seems like it
might not be too bad. Of course, it's pretty hard to know for sure
without coding it up and testing it.
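
Continuing the same sketch, the read side might look like this (again
hypothetical; a real version needs barriers around the pointer reads):

    /* Returns the number of XIDs copied into out[].  Retries on the (rare)
     * case where writers lapped us while we were copying. */
    static int
    ring_read_snapshot(SnapshotRing *ring, TransactionId *out, int outsize)
    {
        for (;;)
        {
            uint64_t start = ring->start;
            uint64_t stop = ring->stop;
            int      n = (int) (stop - start);

            if (n > outsize)
                return -1;          /* caller must fall back / enlarge */

            for (int i = 0; i < n; i++)
                out[i] = ring->buf[(start + i) % RING_SLOTS];

            /* If writers have advanced a full buffer past our start position,
             * some slot we just copied may have been overwritten: retry. */
            if (ring->write - start <= RING_SLOTS)
                return n;
        }
    }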

There are a couple of reasons to think that this design might be
better than what we're doing now. Writers (people trying to commit)
can overlap with readers (people trying to take a snapshot), so long
as any reader is guaranteed to see the most recently completed write
(and no portion of a write that is still in progress). Also, this
approach admits some optimizations that are hard to manage with our
current scheme. For example, you can cache the data that you most
recently removed from the array. If you go to take a new snapshot and
find that the stop pointer hasn't moved (I'm imagining it as a 64-bit
counter that never wraps), then you don't need to copy the data out of
the array again; you can just reuse what you got last time. Maybe
that doesn't matter, since contents of the ProcArray change pretty
quickly, but then again I've seen quite a few profiles where
GetSnapshotData() was very high up in the list, so it seems at least
possible that cache line contention is making access to shared memory
slow enough that there is value in trying to optimize some of it away.
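
The caching idea amounts to remembering the stop pointer you saw last time;
something like the following sketch (invented names, and minus the error
handling a real version would need):

    static uint64_t      cached_stop = UINT64_MAX;   /* nothing cached yet */
    static int           cached_nxids = 0;
    static TransactionId cached_xids[RING_SLOTS];

    /* Returns a pointer to a backend-local copy of the current snapshot. */
    static const TransactionId *
    get_snapshot_cached(SnapshotRing *ring, int *nxids)
    {
        uint64_t stop = ring->stop;

        if (stop != cached_stop)
        {
            /* Something has committed since our last copy; refresh it.  We
             * remember the stop value read *before* copying, so a commit that
             * sneaks in during the copy just forces a refresh next time. */
            cached_nxids = ring_read_snapshot(ring, cached_xids, RING_SLOTS);
            cached_stop = stop;
        }
        *nxids = cached_nxids;
        return cached_xids;
    }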

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

  • Jim Nasby at Aug 22, 2011 at 10:45 pm

    On Aug 22, 2011, at 4:25 PM, Robert Haas wrote:
    What I'm thinking about
    instead is using a ring buffer with three pointers: a start pointer, a
    stop pointer, and a write pointer. When a transaction ends, we
    advance the write pointer, write the XIDs or a whole new snapshot into
    the buffer, and then advance the stop pointer. If we wrote a whole
    new snapshot, we advance the start pointer to the beginning of the
    data we just wrote.

    Someone who wants to take a snapshot must read the data between the
    start and stop pointers, and must then check that the write pointer
    hasn't advanced so far in the meantime that the data they read might
    have been overwritten before they finished reading it. Obviously,
    that's a little risky, since we'll have to do the whole thing over if
    a wraparound occurs, but if the ring buffer is large enough it
    shouldn't happen very often. And a typical snapshot is pretty small
    unless massive numbers of subxids are in use, so it seems like it
    might not be too bad. Of course, it's pretty hard to know for sure
    without coding it up and testing it.
    Something that would be really nice to fix is our reliance on a fixed size of shared memory, and I'm wondering if this could be an opportunity to start in a new direction. My thought is that we could maintain two distinct shared memory snapshots and alternate between them. That would allow us to actually resize them as needed. We would still need something like what you suggest to allow for adding to the list without locking, but with this scheme we wouldn't need to worry about extra locking when taking a snapshot since we'd be doing that in a new segment that no one is using yet.

    The downside is such a scheme does add non-trivial complexity on top of what you proposed. I suspect it would be much better if we had a separate mechanism for dealing with shared memory requirements (shalloc?). But if it's just not practical to make a generic shared memory manager it would be good to start thinking about ways we can work around fixed shared memory size issues.
    --
    Jim C. Nasby, Database Architect jim@nasby.net
    512.569.9461 (cell) http://jim.nasby.net
  • Robert Haas at Aug 22, 2011 at 11:22 pm

    On Mon, Aug 22, 2011 at 6:45 PM, Jim Nasby wrote:
    Something that would be really nice to fix is our reliance on a fixed size of shared memory, and I'm wondering if this could be an opportunity to start in a new direction. My thought is that we could maintain two distinct shared memory snapshots and alternate between them. That would allow us to actually resize them as needed. We would still need something like what you suggest to allow for adding to the list without locking, but with this scheme we wouldn't need to worry about extra locking when taking a snapshot since we'd be doing that in a new segment that no one is using yet.

    The downside is such a scheme does add non-trivial complexity on top of what you proposed. I suspect it would be much better if we had a separate mechanism for dealing with shared memory requirements (shalloc?). But if it's just not practical to make a generic shared memory manager it would be good to start thinking about ways we can work around fixed shared memory size issues.
    Well, the system I'm proposing is actually BETTER than having two
    distinct shared memory snapshots. For example, right now we cache up
    to 64 subxids per backend. I'm imagining that going away and using
    that memory for the ring buffer. Out of the box, that would imply a
    ring buffer of 64 * 103 = 6592 slots. If the average snapshot lists
    100 XIDs, you could rewrite the snapshot dozens of times before
    the buffer wraps around, which is obviously a lot more than two. Even
    if subtransactions are being heavily used and each snapshot lists 1000
    XIDs, you still have enough space to rewrite the snapshot several
    times over before wraparound occurs. Of course, at some point the
    snapshot gets too big and you have to switch to retaining only the
    toplevel XIDs, which is more or less the equivalent of what happens
    under the current implementation when any single transaction's subxid
    cache overflows.
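
    Spelling out the arithmetic (purely illustrative; 103 is MaxBackends for
    the default max_connections plus the autovacuum workers):

        /* Back-of-the-envelope sizing; not real code. */
        #include <stdio.h>

        int main(void)
        {
            int slots = 64 * 103;                    /* 6592 ring slots */

            printf("100-XID snapshots: ~%d rewrites before wraparound\n",
                   slots / 100);                     /* ~65 */
            printf("1000-XID snapshots: ~%d rewrites before wraparound\n",
                   slots / 1000);                    /* ~6 */
            return 0;
        }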

    With respect to a general-purpose shared memory allocator, I think
    that there are cases where that would be useful to have, but I don't
    think there are as many of them as many people seem to think. I
    wouldn't choose to implement this using a general-purpose allocator
    even if we had it, both because it's undesirable to allow this or any
    subsystem to consume an arbitrary amount of memory (nor can it fail...
    especially in the abort path) and because a ring buffer is almost
    certainly faster than a general-purpose allocator. We have enough
    trouble with palloc overhead already. That having been said, I do
    think there are cases where it would be nice to have... and it
    wouldn't surprise me if I end up working on something along those
    lines in the next year or so. It turns out that memory management is
    a major issue in lock-free programming; you can't assume that it's
    safe to recycle an object once the last pointer to it has been removed
    from shared memory - because someone may have fetched the pointer just
    before you removed it and still be using it to examine the object. An
    allocator with some built-in capabilities for handling such problems
    seems like it might be very useful....

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Markus Wanner at Aug 24, 2011 at 8:30 am
    Robert, Jim,

    thanks for thinking out loud about dynamic allocation of shared memory.
    Very much appreciated.
    On 08/23/2011 01:22 AM, Robert Haas wrote:
    With respect to a general-purpose shared memory allocator, I think
    that there are cases where that would be useful to have, but I don't
    think there are as many of them as many people seem to think. I
    wouldn't choose to implement this using a general-purpose allocator
    even if we had it, both because it's undesirable to allow this or any
    subsystem to consume an arbitrary amount of memory (nor can it fail...
    especially in the abort path) and because a ring buffer is almost
    certainly faster than a general-purpose allocator.
    I'm in respectful disagreement regarding the ring-buffer approach and
    think that dynamic allocation can actually be more efficient if done
    properly, because there doesn't need to be head and tail pointers, which
    might turn into a point of contention.

    As a side note: I've been there with imessages. Those were first
    organized as a ring-buffer. The major problem with that approach was
    that imessages were consumed with varying delay. In case an imessage was
    left there for a longer amount of time, it blocked creation of new
    imessages, because the ring-buffer cycled around once and its head
    arrived back at the unconsumed imessage.

    IIUC (which might not be the case) the same issue applies for snapshots.

    Regards

    Markus Wanner
  • Robert Haas at Aug 25, 2011 at 2:59 am

    On Wed, Aug 24, 2011 at 4:30 AM, Markus Wanner wrote:
    I'm in respectful disagreement regarding the ring-buffer approach and
    think that dynamic allocation can actually be more efficient if done
    properly, because there doesn't need to be head and tail pointers, which
    might turn into a point of contention.
    True; although there are some other complications. With a
    sufficiently sophisticated allocator you can avoid mutex contention
    when allocating chunks, but then you have to store a pointer to the
    chunk somewhere or other, and that then requires some kind of
    synchronization.
    As a side note: I've been there with imessages.  Those were first
    organized as a ring-buffer.  The major problem with that approach was
    that imessages were consumed with varying delay.  In case an imessage was
    left there for a longer amount of time, it blocked creation of new
    imessages, because the ring-buffer cycled around once and its head
    arrived back at the unconsumed imessage.

    IIUC (which might not be the case) the same issue applies for snapshots.
    One difference with snapshots is that only the latest snapshot is of
    any interest.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Markus Wanner at Aug 25, 2011 at 5:55 am
    Robert,
    On 08/25/2011 04:59 AM, Robert Haas wrote:
    True; although there are some other complications. With a
    sufficiently sophisticated allocator you can avoid mutex contention
    when allocating chunks, but then you have to store a pointer to the
    chunk somewhere or other, and that then requires some kind of
    synchronization.
    Hm.. right.
    One difference with snapshots is that only the latest snapshot is of
    any interest.
    Theoretically, yes. But as far as I understood, you proposed the
    backends copy that snapshot to local memory. And copying takes some
    amount of time, possibly being interrupted by other backends which add
    newer snapshots... Or do you envision the copying to restart whenever a
    new snapshot arrives?

    Regards

    Markus
  • Robert Haas at Aug 25, 2011 at 1:24 pm

    On Thu, Aug 25, 2011 at 1:55 AM, Markus Wanner wrote:
    One difference with snapshots is that only the latest snapshot is of
    any interest.
    Theoretically, yes.  But as far as I understood, you proposed the
    backends copy that snapshot to local memory.  And copying takes some
    amount of time, possibly being interrupted by other backends which add
    newer snapshots...  Or do you envision the copying to restart whenever a
    new snapshot arrives?
    My hope (and it might turn out that I'm an optimist) is that even with
    a reasonably small buffer it will be very rare for a backend to
    experience a wraparound condition. For example, consider a buffer
    with ~6500 entries, approximately 64 * MaxBackends, the approximate
    size of the current subxip arrays taken in aggregate. I hypothesize
    that a typical snapshot on a running system is going to be very small
    - a handful of XIDs at most - because, on the average, transactions
    are going to commit in *approximately* increasing XID order and, if
    you take the regression tests as representative of a real workload,
    only a small fraction of transactions will have more than one XID. So
    it seems believable to think that the typical snapshot on a machine
    with max_connections=100 might only be ~10 XIDs, even if none of the
    backends are read-only. So the backend taking a snapshot only needs
    to be able to copy < ~64 bytes of information from the ring buffer
    before other backends write ~27k of data into that buffer, likely
    requiring hundreds of other commits. That seems vanishingly unlikely;
    memcpy() is very fast. If it does happen, you can recover by
    retrying, but it should be a once-in-a-blue-moon kind of thing. I
    hope.

    Now, as the size of the snapshot gets bigger, things will eventually
    become less good. For example if you had a snapshot with 6000 XIDs in
    it then every commit would need to write over the previous snapshot
    and things would quickly deteriorate. But you can cope with that
    situation using the same mechanism we already use to handle big
    snapshots: toss out all the subtransaction IDs, mark the snapshot as
    overflowed, and just keep the toplevel XIDs. Now you've got at most
    ~100 XIDs to worry about, so you're back in the safety zone. That's
    not ideal in the sense that you will cause more pg_subtrans lookups,
    but that's the price you pay for having a gazillion subtransactions
    floating around, and any system is going to have to fall back on some
    sort of mitigation strategy at some point. There's no useful limit on
    the number of subxids a transaction can have, so unless you're
    prepared to throw an unbounded amount of memory at the problem you're
    going to eventually have to punt.
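
    In code form, that fallback is roughly the sketch below; the structure
    is invented for illustration, but it mirrors the existing subxid-overflow
    handling:

        #include <stdbool.h>
        #include <stdint.h>

        typedef uint32_t TransactionId;

        typedef struct
        {
            TransactionId xip[100];      /* toplevel XIDs, bounded by backends */
            int           xcnt;
            TransactionId subxip[6592];  /* subxids, unbounded in principle    */
            int           subxcnt;
            bool          suboverflowed; /* true => consult pg_subtrans        */
        } SketchSnapshot;

        /* Too many subxids to keep rewriting: drop them all and remember that
         * visibility checks must map subxids to parents via pg_subtrans. */
        static void
        compact_snapshot(SketchSnapshot *snap)
        {
            snap->subxcnt = 0;
            snap->suboverflowed = true;
        }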

    It seems to me that the problem case is when you are just on the edge.
    Say you have 1400 XIDs in the snapshot. If you compact the snapshot
    down to toplevel XIDs, most of those will go away and you won't have
    to worry about wraparound - but you will pay a performance penalty in
    pg_subtrans lookups. On the other hand, if you don't compact the
    snapshot, it's not that hard to imagine a wraparound occurring - four
    snapshot rewrites could wrap the buffer. You would still hope that
    memcpy() could finish in time, but if you're rewriting 1400 XIDs with
    any regularity, it might not take that many commits to throw a spanner
    into the works. If the system is badly overloaded and the backend
    trying to take a snapshot gets descheduled for a long time at just the
    wrong moment, it doesn't seem hard to imagine a wraparound happening.

    Now, it's not hard to recover from a wraparound. In fact, we can
    pretty easily guarantee that any given attempt to take a snapshot will
    suffer a wraparound at most once. The writers (who are committing)
    have to be serialized anyway, so anyone who suffers a wraparound can
    just grab the same lock in shared mode and retry its snapshot. Now
    concurrency decreases significantly, because no one else is allowed to
    commit until that guy has got his snapshot, but right now that's true
    *every time* someone wants to take a snapshot, so falling back to that
    strategy occasionally doesn't seem prohibitively bad. However, you
    don't want it to happen very often, because even leaving aside the
    concurrency hit, it's double work: you have to try to take a snapshot,
    realize you've had a wraparound, and then retry. It seems pretty
    clear that with a big enough ring buffer the wraparound problem will
    become so infrequent as to be not worth worrying about. I'm
    theorizing that even with a quite small ring buffer the problem will
    still be infrequent enough not to worry about, but that might be
    optimistic. I think I'm going to need some kind of test case that
    generates very large, frequently changing snapshots.
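
    The recovery path I have in mind is basically the sketch below: try the
    lock-free read, and only on a wraparound take the lock that already
    serializes committers (shared mode is enough) and read again. The lock
    and helper names here are stand-ins, not real APIs:

        #include <pthread.h>
        #include <stdint.h>

        typedef uint32_t TransactionId;

        /* Stand-in for the lock-free ring read sketched upthread; returns the
         * number of XIDs copied, or -1 if a wraparound clobbered the copy. */
        extern int ring_read_once(void *ring, TransactionId *out, int outsize);

        static int
        take_snapshot(void *ring, pthread_rwlock_t *commit_lock,
                      TransactionId *out, int outsize)
        {
            int n = ring_read_once(ring, out, outsize);

            if (n >= 0)
                return n;                       /* fast path, no lock taken */

            /* Wraparound: block committers (they already take this lock
             * exclusively to serialize commits) and retry; we cannot be
             * lapped a second time. */
            pthread_rwlock_rdlock(commit_lock);
            n = ring_read_once(ring, out, outsize);
            pthread_rwlock_unlock(commit_lock);
            return n;
        }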

    Of course even if wraparound turns out not to be a problem there are
    other things that could scuttle this whole approach, but I think the
    idea has enough potential to be worth testing. If the whole thing
    crashes and burns I hope I'll at least learn enough along the way to
    design something better...

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Markus Wanner at Aug 25, 2011 at 2:19 pm
    Robert,
    On 08/25/2011 03:24 PM, Robert Haas wrote:
    My hope (and it might turn out that I'm an optimist) is that even with
    a reasonably small buffer it will be very rare for a backend to
    experience a wraparound condition.
    It certainly seems less likely than with the ring-buffer for imessages, yes.

    Note, however, that for imessages, I've also had the policy in place
    that a backend *must* consume its message before sending any. And that
    I took great care for all receivers to consume their messages as early
    as possible. None the less, I kept incrementing the buffer size (to
    multiple megabytes) to make this work. Maybe I'm overcautious because
    of that experience.
    - a handful of XIDs at most - because, on the average, transactions
    are going to commit in *approximately* increasing XID order
    This assumption quickly turns false, if you happen to have just one
    long-running transaction, I think. Or in general, if transaction
    duration varies a lot.
    So the backend taking a snapshot only needs
    to be able to copy < ~64 bytes of information from the ring buffer
    before other backends write ~27k of data into that buffer, likely
    requiring hundreds of other commits.
    You said earlier, that "only the latest snapshot" is required. It takes
    only a single commit for such a snapshot to not be the latest anymore.

    Instead, if you keep around older snapshots for some time - as your
    description here implies - readers are free to copy from those older
    snapshots while other backends are able to make progress concurrently
    (writers or readers of other snapshots).

    However, that either requires keeping track of readers of a certain
    snapshot (reference counting) or - as I understand your description -
    you simply invalidate all concurrent readers upon wrap-around, or something.
    That seems vanishingly unlikely;
    Agreed.
    Now, as the size of the snapshot gets bigger, things will eventually
    become less good.
    Also keep configurations with increased max_connections in mind. With
    that, not only do the snapshots get bigger, but more processes have to
    share CPU time, on average making memcpy slower for a single process.
    Of course even if wraparound turns out not to be a problem there are
    other things that could scuttle this whole approach, but I think the
    idea has enough potential to be worth testing. If the whole thing
    crashes and burns I hope I'll at least learn enough along the way to
    design something better...
    That's always a good motivation. In that sense: happy hacking!

    Regards

    Markus Wanner
  • Robert Haas at Aug 25, 2011 at 2:48 pm

    On Thu, Aug 25, 2011 at 10:19 AM, Markus Wanner wrote:
    Note, however, that for imessages, I've also had the policy in place
    that a backend *must* consume its message before sending any.  And that
    I took great care for all receivers to consume their messages as early
    as possible.  None the less, I kept incrementing the buffer size (to
    multiple megabytes) to make this work.  Maybe I'm overcautious because
    of that experience.
    What's a typical message size for imessages?
    - a handful of XIDs at most - because, on the average, transactions
    are going to commit in *approximately* increasing XID order
    This assumption quickly turns false, if you happen to have just one
    long-running transaction, I think.  Or in general, if transaction
    duration varies a lot.
    Well, one long-running transaction that only has a single XID is not
    really a problem: the snapshot is still small. But one very old
    transaction that also happens to have a large number of
    subtransactions all of which have XIDs assigned might be a good way to
    stress the system.
    So the backend taking a snapshot only needs
    to be able to copy < ~64 bytes of information from the ring buffer
    before other backends write ~27k of data into that buffer, likely
    requiring hundreds of other commits.
    You said earlier, that "only the latest snapshot" is required.  It takes
    only a single commit for such a snapshot to not be the latest anymore.

    Instead, if you keep around older snapshots for some time - as what your
    description here implies - readers are free to copy from those older
    snapshots while other backends are able to make progress concurrently
    (writers or readers of other snapshots).

    However, that either requires keeping track of readers of a certain
    snapshot (reference counting) or - as I understand your description -
    you simply invalidate all concurrent readers upon wrap-around, or something.
    Each reader decides which data he needs to copy from the buffer, and
    then copies it, and then checks whether any of it got overwritten
    before the copy was completed. So there's a lively possibility that
    the snapshot that was current when the reader began copying it will no
    longer be current by the time he finishes copying it, because a commit
    has intervened. That's OK: it just means that, effectively, the
    snapshot is taken at the moment the start and stop pointers are read,
    and won't take into account any commits that happen later, which is
    exactly what a snapshot is supposed to do anyway.

    There is a hopefully quite small possibility that by the time the
    reader finishes copying it so much new data will have been written to
    the buffer that it will have wrapped around and clobbered the portion
    the reader was interested in. That needs to be rare.
    Now, as the size of the snapshot gets bigger, things will eventually
    become less good.
    Also keep configurations with increased max_connections in mind.  With
    that, not only do the snapshots get bigger, but more processes have to
    share CPU time, on average making memcpy slower for a single process.
    Right. I'm imagining making the default buffer size proportional to
    max_connections.
    Of course even if wraparound turns out not to be a problem there are
    other things that could scuttle this whole approach, but I think the
    idea has enough potential to be worth testing.  If the whole thing
    crashes and burns I hope I'll at least learn enough along the way to
    design something better...
    That's always a good motivation.  In that sense: happy hacking!
    Thanks.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Tom Lane at Aug 25, 2011 at 2:59 pm

    Robert Haas writes:
    Well, one long-running transaction that only has a single XID is not
    really a problem: the snapshot is still small. But one very old
    transaction that also happens to have a large number of
    subtransactions all of which have XIDs assigned might be a good way to
    stress the system.
    That's a good point. If the ring buffer size creates a constraint on
    the maximum number of sub-XIDs per transaction, you're going to need a
    fallback path of some sort.

    regards, tom lane
  • Markus Wanner at Aug 25, 2011 at 3:15 pm
    Tom,
    On 08/25/2011 04:59 PM, Tom Lane wrote:
    That's a good point. If the ring buffer size creates a constraint on
    the maximum number of sub-XIDs per transaction, you're going to need a
    fallback path of some sort.
    I think Robert envisions the same fallback path we already have:
    subxids.overflowed.

    Regards

    Markus Wanner
  • Robert Haas at Aug 25, 2011 at 3:27 pm

    On Thu, Aug 25, 2011 at 11:15 AM, Markus Wanner wrote:
    On 08/25/2011 04:59 PM, Tom Lane wrote:
    That's a good point.  If the ring buffer size creates a constraint on
    the maximum number of sub-XIDs per transaction, you're going to need a
    fallback path of some sort.
    I think Robert envisions the same fallback path we already have:
    subxids.overflowed.
    I have a slightly more nuanced idea, but basically yes. The trouble
    is that if you're keeping the snapshot around and updating it (rather
    than scanning the ProcArray each time) you need some sort of mechanism
    for the snapshot to eventually un-overflow. Otherwise, the first
    overflow leaves you in the soup for the entire lifetime of the
    cluster.

    What I have in mind is to store the highest subxid that has been
    removed from the snapshot, or InvalidTransactionId if we know the
    snapshot is complete. Whenever the highest removed subxid falls
    behind xmin, we can reset it to InvalidTransactionId.

    It would be sensible for clients to store the exact value of
    highest_removed_subxid in their snapshots as well, instead of just a
    Boolean flag. A pg_subtrans lookup is needed only for XIDs which are
    greater than xmin and less than or equal to highest_removed_subxid.
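
    A sketch of that bookkeeping, with invented names and plain integer
    comparisons (the real thing would have to use the wraparound-aware XID
    comparison macros):

        #include <stdbool.h>
        #include <stdint.h>

        typedef uint32_t TransactionId;
        #define InvalidTransactionId ((TransactionId) 0)

        static TransactionId highest_removed_subxid = InvalidTransactionId;

        /* Called whenever subxids are compacted away from the shared snapshot. */
        static void
        note_removed_subxid(TransactionId subxid)
        {
            if (subxid > highest_removed_subxid)
                highest_removed_subxid = subxid;
        }

        /* Called as xmin advances: once every removed subxid is older than
         * xmin, the snapshot is effectively complete ("un-overflowed") again. */
        static void
        maybe_unoverflow(TransactionId xmin)
        {
            if (highest_removed_subxid != InvalidTransactionId &&
                highest_removed_subxid < xmin)
                highest_removed_subxid = InvalidTransactionId;
        }

        /* A client needs a pg_subtrans lookup only for XIDs in this window. */
        static bool
        need_subtrans_lookup(TransactionId xid, TransactionId xmin)
        {
            return highest_removed_subxid != InvalidTransactionId &&
                   xid > xmin && xid <= highest_removed_subxid;
        }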

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Markus Wanner at Aug 25, 2011 at 3:13 pm
    Robert,
    On 08/25/2011 04:48 PM, Robert Haas wrote:
    What's a typical message size for imessages?
    Most message types in Postgres-R are just a couple bytes in size.
    Others, especially change sets, can be up to 8k.

    However, I think you'll have an easier job guaranteeing that backends
    "consume" their portions of the ring-buffer in time. Plus wrap-around
    isn't that much of a problem in your case. (I couldn't drop imessages,
    but had to let senders wait).
    Well, one long-running transaction that only has a single XID is not
    really a problem: the snapshot is still small. But one very old
    transaction that also happens to have a large number of
    subtransactions all of which have XIDs assigned might be a good way to
    stress the system.
    Ah, right, that's why it's a list of transactions in progress and not a
    list of completed transactions in SnapshotData... good.
    Each reader decides which data he needs to copy from the buffer, and
    then copies it, and then checks whether any of it got overwritten
    before the copy was completed. So there's a lively possibility that
    the snapshot that was current when the reader began copying it will no
    longer be current by the time he finishes copying it, because a commit
    has intervened. That's OK: it just means that, effectively, the
    snapshot is taken at the moment the start and stop pointers are read,
    and won't take into account any commits that happen later, which is
    exactly what a snapshot is supposed to do anyway.
    Agreed, that makes sense. Thanks for explaining.
    There is a hopefully quite small possibility that by the time the
    reader finishes copying it so much new data will have been written to
    the buffer that it will have wrapped around and clobbered the portion
    the reader was interested in. That needs to be rare.
    Yeah.

    Regards

    Markus Wanner
  • Jim Nasby at Aug 25, 2011 at 10:24 pm

    On Aug 25, 2011, at 8:24 AM, Robert Haas wrote:
    My hope (and it might turn out that I'm an optimist) is that even with
    a reasonably small buffer it will be very rare for a backend to
    experience a wraparound condition. For example, consider a buffer
    with ~6500 entries, approximately 64 * MaxBackends, the approximate
    size of the current subxip arrays taken in aggregate. I hypothesize
    that a typical snapshot on a running system is going to be very small
    - a handful of XIDs at most - because, on the average, transactions
    are going to commit in *approximately* increasing XID order and, if
    you take the regression tests as representative of a real workload,
    only a small fraction of transactions will have more than one XID. So
    BTW, there's a way to actually gather some data on this by using PgQ (part of Skytools and used by Londiste). PgQ works by creating "ticks" at regular intervals, where a tick is basically just a snapshot of committed XIDs. Presumably Slony does something similar.

    I can provide you with sample data from our production systems if you're interested.
    --
    Jim C. Nasby, Database Architect jim@nasby.net
    512.569.9461 (cell) http://jim.nasby.net
  • Robert Haas at Aug 27, 2011 at 3:43 am

    On Thu, Aug 25, 2011 at 6:24 PM, Jim Nasby wrote:
    On Aug 25, 2011, at 8:24 AM, Robert Haas wrote:
    My hope (and it might turn out that I'm an optimist) is that even with
    a reasonably small buffer it will be very rare for a backend to
    experience a wraparound condition.  For example, consider a buffer
    with ~6500 entries, approximately 64 * MaxBackends, the approximate
    size of the current subxip arrays taken in aggregate.  I hypothesize
    that a typical snapshot on a running system is going to be very small
    - a handful of XIDs at most - because, on the average, transactions
    are going to commit in *approximately* increasing XID order and, if
    you take the regression tests as representative of a real workload,
    only a small fraction of transactions will have more than one XID.  So
    BTW, there's a way to actually gather some data on this by using PgQ (part of Skytools and used by Londiste). PgQ works by creating "ticks" at regular intervals, where a tick is basically just a snapshot of committed XIDs. Presumably Slony does something similar.

    I can provide you with sample data from our production systems if you're interested.
    Yeah, that would be great.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Jim Nasby at Aug 25, 2011 at 10:29 pm

    On Aug 22, 2011, at 6:22 PM, Robert Haas wrote:
    With respect to a general-purpose shared memory allocator, I think
    that there are cases where that would be useful to have, but I don't
    think there are as many of them as many people seem to think. I
    wouldn't choose to implement this using a general-purpose allocator
    even if we had it, both because it's undesirable to allow this or any
    subsystem to consume an arbitrary amount of memory (nor can it fail...
    especially in the abort path) and because a ring buffer is almost
    certainly faster than a general-purpose allocator. We have enough
    trouble with palloc overhead already. That having been said, I do
    think there are cases where it would be nice to have... and it
    wouldn't surprise me if I end up working on something along those
    lines in the next year or so. It turns out that memory management is
    a major issue in lock-free programming; you can't assume that it's
    safe to recycle an object once the last pointer to it has been removed
    from shared memory - because someone may have fetched the pointer just
    before you removed it and still be using it to examine the object. An
    allocator with some built-in capabilities for handling such problems
    seems like it might be very useful....
    Actually, I wasn't thinking about the system dynamically sizing shared memory on its own... I was only thinking of providing the ability for a user to change something like shared_buffers and allow that change to take effect with a SIGHUP instead of requiring a full restart.

    I agree that we'd have to be very careful with allowing the code to start changing shared memory size on its own...
    --
    Jim C. Nasby, Database Architect jim@nasby.net
    512.569.9461 (cell) http://jim.nasby.net
  • Robert Haas at Aug 27, 2011 at 3:44 am

    On Thu, Aug 25, 2011 at 6:29 PM, Jim Nasby wrote:
    Actually, I wasn't thinking about the system dynamically sizing shared memory on its own... I was only thinking of providing the ability for a user to change something like shared_buffers and allow that change to take effect with a SIGHUP instead of requiring a full restart.
    I agree. That would be awesome. Sadly, I don't have time to work on it. :-(

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Simon Riggs at Aug 23, 2011 at 3:06 pm

    On Mon, Aug 22, 2011 at 10:25 PM, Robert Haas wrote:

    I've been giving this quite a bit more thought, and have decided to
    abandon the scheme described above, at least for now.
    I liked your goal of O(1) snapshots and think you should go for that.

    I didn't realise you were still working on this, and had some thoughts
    at the weekend which I recorded just now. Different tack entirely.
    Heikki has made the suggestion a few times (and a few other people
    have since made somewhat similar suggestions in different words) of
    keeping an up-to-date snapshot in shared memory such that transactions
    that need a snapshot can simply copy it.  I've since noted that in Hot
    Standby mode, that's more or less what the KnownAssignedXids stuff
    already does.  I objected that, first, the overhead of updating the
    snapshot for every commit would be too great, and second, it didn't
    seem to do a whole lot to reduce the size of the critical section, and
    therefore probably wouldn't improve performance that much.  But I'm
    coming around to the view that these might be solvable problems rather
    than reasons to give up on the idea altogether.
    Sounds easy enough to just link up KnownAssignedXids and see...

    --
    Simon Riggs                   http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training & Services
  • Tom Lane at Aug 23, 2011 at 4:13 pm

    Robert Haas writes:
    With respect to the first problem, what I'm imagining is that we not
    do a complete rewrite of the snapshot in shared memory on every
    commit. Instead, when a transaction ends, we'll decide whether to (a)
    write a new snapshot or (b) just record the XIDs that ended. If we do
    (b), then any backend that wants a snapshot will need to copy from
    shared memory both the most recently written snapshot and the XIDs
    that have subsequently ended. From there, it can figure out which
    XIDs are still running. Of course, if the list of recently-ended XIDs
    gets too long, then taking a snapshot will start to get expensive, so
    we'll need to periodically do (a) instead. There are other ways that
    this could be done as well; for example, the KnownAssignedXids stuff
    just flags XIDs that should be ignored and then periodically compacts
    away the ignored entries.
    I'm a bit concerned that this approach is trying to optimize the heavy
    contention situation at the cost of actually making things worse anytime
    that you're not bottlenecked by contention for access to this shared
    data structure. In particular, given the above design, then every
    reader of the data structure has to duplicate the work of eliminating
    subsequently-ended XIDs from the latest stored snapshot. Maybe that's
    relatively cheap, but if you do it N times it's not going to be so cheap
    anymore. In fact, it looks to me like that cost would scale about as
    O(N^2) in the number of transactions you allow to elapse before storing
    a new snapshot, so you're not going to be able to let very many go by
    before you do that.

    I don't say this can't be made to work, but I don't want to blow off
    performance for single-threaded applications in pursuit of scalability
    that will only benefit people running massively parallel applications
    on big iron.

    regards, tom lane
  • Robert Haas at Aug 23, 2011 at 5:03 pm

    On Tue, Aug 23, 2011 at 12:13 PM, Tom Lane wrote:
    I'm a bit concerned that this approach is trying to optimize the heavy
    contention situation at the cost of actually making things worse anytime
    that you're not bottlenecked by contention for access to this shared
    data structure.  In particular, given the above design, then every
    reader of the data structure has to duplicate the work of eliminating
    subsequently-ended XIDs from the latest stored snapshot.  Maybe that's
    relatively cheap, but if you do it N times it's not going to be so cheap
    anymore.  In fact, it looks to me like that cost would scale about as
    O(N^2) in the number of transactions you allow to elapse before storing
    a new snapshot, so you're not going to be able to let very many go by
    before you do that.
    That's certainly a fair concern, and it might even be worse than
    O(n^2). On the other hand, the current approach involves scanning the
    entire ProcArray for every snapshot, even if nothing has changed and
    90% of the backends are sitting around playing tiddlywinks, so I don't
    think I'm giving up something for nothing except perhaps in the case
    where there is only one active backend in the entire system. On the
    other hand, you could be entirely correct that the current
    implementation wins in the uncontended case. Without testing it, I
    just don't know...

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Tom Lane at Aug 23, 2011 at 5:28 pm

    Robert Haas writes:
    That's certainly a fair concern, and it might even be worse than
    O(n^2). On the other hand, the current approach involves scanning the
    entire ProcArray for every snapshot, even if nothing has changed and
    90% of the backends are sitting around playing tiddlywinks, so I don't
    think I'm giving up something for nothing except perhaps in the case
    where there is only one active backend in the entire system. On the
    other hand, you could be entirely correct that the current
    implementation wins in the uncontended case. Without testing it, I
    just don't know...
    Sure. Like I said, I don't know that this can't be made to work.
    I'm just pointing out that we have to keep an eye on the single-backend
    case as well as the many-backends case.

    regards, tom lane
  • Dimitri Fontaine at Aug 23, 2011 at 5:04 pm

    Robert Haas writes:
    I think the real trick is figuring out a design that can improve
    concurrency.
    I'm far from familiar with the detailed concepts here, but allow me to
    comment. I have two open questions:

    - is it possible to use a distributed algorithm to produce XIDs,
    something like Vector Clocks?

    Then each backend is able to create a snapshot (well, and XID) on its
    own, and any backend is still able to compare its snapshot to any
    other snapshot (well, XID)

    - is it possible to cache the production of the next snapshots so that
    generating an XID only means getting the next in a pre-computed
    vector?

    My guess by reading the emails here is that we need to add some
    information at snapshot generation time, it's not just about getting
    a 32 bit sequence number.

    I'm not sure I'm being that helpful here, but sometimes stating the
    obviously impossible ideas helps one think about some new design, so I
    figured I would still try :)

    Regards,
    --
    Dimitri Fontaine
    http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
  • Markus Wanner at Aug 24, 2011 at 8:19 am
    Hello Dimitri,
    On 08/23/2011 06:39 PM, Dimitri Fontaine wrote:
    I'm far from familiar with the detailed concepts here, but allow me to
    comment. I have two open questions:

    - is it possible to use a distributed algorithm to produce XIDs,
    something like Vector Clocks?

    Then each backend is able to create a snapshot (well, and XID) on its
    own, and any backend is still able to compare its snapshot to any
    other snapshot (well, XID)
    Creation of snapshots and XID assignment are not as related as you imply
    here. Keep in mind that a read-only transaction has a snapshot, but no
    XID. (Not sure if it's possible for a transaction to have an XID, but
    no snapshot. If it only touches system catalogs with SnapshotNow,
    maybe? Don't think we support that, ATM).
    - is it possible to cache the production of the next snapshots so that
    generating an XID only means getting the next in a pre-computed
    vector?
    The way I look at it, what Robert proposed can be thought of as "cache
    the production of the next snapshot", with a bit of a stretch of what a
    cache is, perhaps. I'd rather call it "early snapshot creation", maybe
    "look-ahead something".

    ATM backends all scan ProcArray to generate their snapshot. Instead,
    what Robert proposes would - sometimes, somewhat - move that work from
    snapshot creation time to commit time.

    As Tom points out, the difficulty lies in the question of when it's
    worth doing that: if you have lots of commits in a row, and no
    transaction ever uses the (pre generated) snapshots of the point in time
    in between, then those were wasted. OTOH, if there are just very few
    COMMITs spread across lots of writes, the read-only backends will
    re-create the same snapshots, over and over again. Seems wasteful as
    well (as GetSnapshotData popping up high on profiles confirms somewhat).

    Hope to have cleared up things a bit.

    Regards

    Markus
  • Gokulakannan Somasundaram at Aug 27, 2011 at 5:39 am

    On Tue, Aug 23, 2011 at 5:25 AM, Robert Haas wrote:

    I've been giving this quite a bit more thought, and have decided to
    abandon the scheme described above, at least for now. It has the
    advantage of avoiding virtually all locking, but it's extremely
    inefficient in its use of memory in the presence of long-running
    transactions. For example, if there's an open transaction that's been
    sitting around for 10 million transactions or so and has an XID
    assigned, any new snapshot is going to need to probe into the big
    array for any XID in that range. At 8 bytes per entry, that means
    we're randomly accessing about ~80MB of memory-mapped data. That
    seems problematic both in terms of blowing out the cache and (on small
    machines) possibly even blowing out RAM. Nor is that the worst case
    scenario: a transaction could sit open for 100 million transactions.

    First, I respectfully disagree with you on the point of 80MB. I would say
    that it's very rare for a small system (with <1 GB RAM) to have a long
    running transaction sitting idle while 10 million transactions are sitting
    idle. Should an optimization be left out for the sake of a very small system,
    when trying to achieve high enterprise workloads?

    Second, if we make use of memory-mapped files, why should we think that
    all the 80MB of data will always reside in memory? Won't it get paged out
    by the operating system when it is in need of memory? Or do you have some
    specific OS in mind?

    Thanks,
    Gokul.
  • Robert Haas at Aug 28, 2011 at 1:47 am

    On Sat, Aug 27, 2011 at 1:38 AM, Gokulakannan Somasundaram wrote:
    First, I respectfully disagree with you on the point of 80MB. I would say
    that it's very rare for a small system (with <1 GB RAM) to have a long
    running transaction sitting idle while 10 million transactions are sitting
    idle. Should an optimization be left out for the sake of a very small system,
    when trying to achieve high enterprise workloads?
    With the design where you track commit-visibility sequence numbers
    instead of snapshots, you wouldn't need 10 million transactions that
    were all still running. You would just need a snapshot that had been
    sitting around while 10 million transactions completed meanwhile.

    That having been said, I don't necessarily think that design is
    doomed. I just think it's going to be trickier to get working than
    the design I'm now hacking on, and a bigger change from what we do
    now. If this doesn't pan out, I might try that one, or something
    else.
    Second, if we make use of memory-mapped files, why should we think that
    all the 80MB of data will always reside in memory? Won't it get paged out
    by the operating system when it is in need of memory? Or do you have some
    specific OS in mind?
    No, I don't think it will all be in memory - but that's part of the
    performance calculation. If you need to check on the status of an XID
    and find that you need to read a page of data in from disk, that's
    going to be many orders of magnitude slower than anything we do with a
    snapshot now. Now, if you gain enough elsewhere, it could still be a
    win, but I'm not going to just assume that.

    As I play with this, I'm coming around to the conclusion that, in
    point of fact, the thing that's hard about snapshots has a great deal
    more to do with memory than it does with CPU time. Sure, using the
    snapshot has to be cheap. But it already IS cheap. We don't need to
    fix that problem; we just need to not break it. What's not cheap is
    constructing the snapshot - principally because of ProcArrayLock, and
    secondarily because we're grovelling through fairly large amounts of
    shared memory to get all the XIDs we need.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Gokulakannan Somasundaram at Aug 28, 2011 at 8:33 am

    No, I don't think it will all be in memory - but that's part of the
    performance calculation. If you need to check on the status of an XID
    and find that you need to read a page of data in from disk, that's
    going to be many orders of magnitude slower than anything we do with a
    snapshot now. Now, if you gain enough elsewhere, it could still be a
    win, but I'm not going to just assume that.

    I was just suggesting this because memory costs have come down a
    lot (as you may know) and people can afford to buy more memory in enterprise
    scenarios. We may not need to worry about MBs of memory, especially with
    cloud computing being widely adopted, once we get scalability.

    Gokul.
  • Robert Haas at Aug 28, 2011 at 11:50 pm

    On Sun, Aug 28, 2011 at 4:33 AM, Gokulakannan Somasundaram wrote:
    No, I don't think it will all be in memory - but that's part of the
    performance calculation.  If you need to check on the status of an XID
    and find that you need to read a page of data in from disk, that's
    going to be many orders of magnitude slower than anything we do with a
    snapshot now.  Now, if you gain enough elsewhere, it could still be a
    win, but I'm not going to just assume that.
    I was just suggesting this because memory costs have come down a lot (as
    you may know) and people can afford to buy more memory in enterprise
    scenarios. We may not need to worry about MBs of memory, especially with
    cloud computing being widely adopted, once we get scalability.
    The proof of the pudding is in the eating, so let me finish coding up
    this approach and see how it works. Then we can decide where to go
    next...

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Amit Kapila at Sep 7, 2011 at 3:12 am
    I wanted to clarify my understanding and have some doubts.
    What I'm thinking about instead is using a ring buffer with three pointers:
    a start pointer, a stop pointer, and a write pointer. When a transaction
    ends, we
    advance the write pointer, write the XIDs or a whole new snapshot into the
    buffer, and then advance the stop pointer. If we wrote a whole new
    snapshot,
    we advance the start pointer to the beginning of the data we just wrote.
    Someone who wants to take a snapshot must read the data between the start
    and stop pointers, and must then check that the write pointer
    hasn't advanced so far in the meantime that the data they read might have
    been overwritten before they finished reading it. Obviously,
    that's a little risky, since we'll have to do the whole thing over if a
    wraparound occurs, but if the ring buffer is large enough it shouldn't
    happen very often.

    Clarification
    ------------------
    1. With the above, you want to reduce/remove the concurrency issue between
    GetSnapshotData() [used at the beginning of SQL command execution] and
    ProcArrayEndTransaction() [used at end of transaction]. The concurrency issue
    is mainly ProcArrayLock, which is taken by GetSnapshotData() in shared mode
    and by ProcArrayEndTransaction() in exclusive mode.
    There may be other instances of similar things, but this is the main thing
    which you want to resolve.

    2. You want to resolve it by using a ring buffer such that readers don't need
    to take any lock.

    Is my above understanding correct?

    Doubts
    ------------

    1. Two writers: won't two different sessions that try to commit at the same
    time get the same write pointer?
    I assume it will be protected, as indicated in one of your replies, if I
    understood correctly?

    2. One reader, one writer: it might be the case that somebody has written a new
    snapshot and advanced the stop pointer, and at that point in time one reader
    came and read between the start pointer and the stop pointer. Now the reader
    will see the following:
    snapshot, a few XIDs, snapshot

    So will it handle this situation such that it will only read the latest
    snapshot?

    3. How will you detect an overwrite?

    4. Won't it have an effect if we don't update xmin every time and just note the
    committed XIDs? The reason I am asking is that xmin is used in the tuple
    visibility check,
    so with the new idea, in some cases, instead of just returning early
    by checking xmin, it has to go through the committed XID list.
    I understand that there may be few such cases, or that the improvement from
    your idea can supersede this minimal effect. However, some cases can lose out.


    --
    With Regards,
    Amit Kapila.



  • Robert Haas at Sep 8, 2011 at 2:19 pm

    On Tue, Sep 6, 2011 at 11:06 PM, Amit Kapila wrote:
    1. With the above, you want to reduce/remove the contention between
    GetSnapshotData() [used at the beginning of SQL command execution] and
    ProcArrayEndTransaction() [used at end of transaction]. The contention is
    mainly on ProcArrayLock, which is taken by GetSnapshotData() in shared
    mode and by ProcArrayEndTransaction() in exclusive mode.
    There may be other instances of the same problem, but this is the main
    thing you want to resolve.
    Yep.
    2. You want to resolve it by using a ring buffer such that readers don't
    need to take any lock.
    Yep. Actually, they're still going to need some spinlocks at least in
    the first go round, to protect the pointers. I'm hoping those can
    eventually be eliminated on machines with 8-byte atomic reads using
    appropriate memory barrier primitives.
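    A minimal sketch, assuming invented names and layout, of the kind of
    shared structure and lock-then-copy reader being described here; the
    positions are monotonically increasing 64-bit counters indexed modulo the
    ring size, and none of this is the actual patch:

    #include <stdatomic.h>
    #include <stdint.h>

    #define RING_SIZE 8192              /* entries; size picked arbitrarily */

    typedef struct SnapshotRing
    {
        atomic_flag lock;               /* stand-in for a backend spinlock,
                                         * initialized with ATOMIC_FLAG_INIT */
        uint64_t    start;              /* where the latest snapshot summary begins */
        uint64_t    stop;               /* just past the last valid entry */
        uint64_t    write;              /* where the next writer will append */
        uint32_t    items[RING_SIZE];   /* snapshot summaries and ended XIDs */
    } SnapshotRing;

    /* Reader: copy the pointers under the spinlock, then copy the data
     * without holding any lock; overwrites are detected afterwards. */
    static void
    ring_read(SnapshotRing *ring, uint32_t *dest,
              uint64_t *start, uint64_t *stop)
    {
        while (atomic_flag_test_and_set(&ring->lock))
            ;                           /* spin */
        *start = ring->start;
        *stop = ring->stop;
        atomic_flag_clear(&ring->lock);

        for (uint64_t pos = *start; pos < *stop; pos++)
            dest[pos - *start] = ring->items[pos % RING_SIZE];
    }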
    1. Two writers: won't two different sessions that try to commit at the
    same time get the same write pointer?
    I assume this is protected, as indicated in one of your earlier replies?
    Yes, commits have to be serialized. No way around that. The best
    we'll ever be able to do is shorten the critical section.
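    Continuing the same hypothetical SnapshotRing sketch, the serialized
    writer side might look roughly like this; reusing the same spinlock to
    serialize commits is an assumption made only to keep the sketch small:

    /* Writer: commits are serialized, so the whole append happens under one
     * lock; the goal is only to keep this critical section short. */
    static void
    ring_append_ended_xid(SnapshotRing *ring, uint32_t xid)
    {
        while (atomic_flag_test_and_set(&ring->lock))
            ;                           /* spin */
        ring->items[ring->write % RING_SIZE] = xid;
        ring->write++;
        ring->stop = ring->write;       /* make the new entry visible to readers */
        atomic_flag_clear(&ring->lock);
    }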
    2. One reader, one writer: it might be the case that somebody has written
    a new snapshot and advanced the stop pointer, and at that moment a reader
    comes along and reads between the start pointer and the stop pointer. The
    reader will then see the following:
    snapshot, a few XIDs, snapshot

    So will this situation be handled such that only the latest snapshot is
    read?
    In my prototype implementation that can't happen, because the start and
    stop pointers are protected by a single spinlock and are moved
    simultaneously. But I think that on machines with 8-byte atomic writes we
    can get rid of the spinlock and just move the stop pointer first and then
    the start pointer. If you get more than one snapshot in the middle, you
    just ignore the first part of the data you read and start from the
    beginning of the last snapshot.
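    One way a reader could cope with seeing "snapshot, a few XIDs, snapshot"
    in the region it copied, assuming for illustration a richer entry
    encoding in which each entry carries a flag marking the start of a
    snapshot summary:

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct RingEntry
    {
        bool        is_snapshot_start;  /* assumed tag on each entry */
        uint32_t    xid;
    } RingEntry;

    /* Return the index where the last snapshot summary in the copied data
     * begins; everything before it is simply ignored. */
    static int
    last_snapshot_offset(const RingEntry *entries, int nentries)
    {
        int         last = 0;

        for (int i = 0; i < nentries; i++)
            if (entries[i].is_snapshot_start)
                last = i;
        return last;
    }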
    3. How will you detect an overwrite?
    If the write pointer is greater than the start pointer by more than
    the ring size, you've wrapped.
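    In the sketch above, that check is just a distance comparison against the
    writers' position, re-read after the copy; the RING_SIZE constant and the
    retry policy are assumptions:

    /* After copying, re-read the write position (under the spinlock or with
     * an atomic load) and see whether the region we copied could have been
     * overwritten while we were reading it.  If so, retry the whole copy. */
    static bool
    copy_may_be_stale(uint64_t start, uint64_t write_after_copy)
    {
        return write_after_copy - start > RING_SIZE;
    }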
    4. Won't it hurt if we don't update xmin every time and just note the
    committed XIDs? The reason I am asking is that xmin is used in the tuple
    visibility check, so with the new idea, in some cases, instead of
    returning early after checking xmin, it has to go through the committed
    XID list.
    I understand that such cases may be rare, or that the improvement from
    your idea may outweigh this small cost, but some cases could still regress.
    The snapshot xmin has to be up to date. I'm not planning to break
    that because it would be wrong.

    RecentGlobalXmin doesn't need to be completely up to date, and in fact
    recomputing it on every snapshot becomes prohibitively expensive with
    this approach. I'm still struggling with the best way to handle that.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Amit Kapila at Sep 12, 2011 at 3:08 am
    4. Won't it hurt if we don't update xmin every time and just note the
    committed XIDs? The reason I am asking is that xmin is used in the tuple
    visibility check, so with the new idea, in some cases, instead of
    returning early after checking xmin, it has to go through the committed
    XID list.
    I understand that such cases may be rare, or that the improvement from
    your idea may outweigh this small cost, but some cases could still regress.
    The snapshot xmin has to be up to date. I'm not planning to break that
    because it would be wrong.

    In the approach you described, once a snapshot has been written, usually
    only the committed XIDs are recorded afterwards, and only sometimes a new
    snapshot itself.

    So when will xmin be updated? Since the snapshot will not be rewritten
    every time, its xmin cannot always be current.
    RecentGlobalXmin doesn't need to be completely up to date, and in fact
    recomputing it on every snapshot becomes prohibitively expensive with this
    approach. I'm still struggling with the best way to handle that.

    RecentGlobalXmin and RecentXmin are mostly updated from the snapshot's
    xmin, and that happens outside ProcArrayLock, so why would it be
    expensive if you have an updated xmin?



    With Regards,

    Amit Kapila.


  • Robert Haas at Sep 12, 2011 at 2:09 pm

    On Sun, Sep 11, 2011 at 11:08 PM, Amit Kapila wrote:
    In the approach you described, once a snapshot has been written, usually
    only the committed XIDs are recorded afterwards, and only sometimes a new
    snapshot itself.

    So when will xmin be updated? Since the snapshot will not be rewritten
    every time, its xmin cannot always be current.
    If you know what transactions were running the last time a snapshot
    summary was written and what transactions have ended since then, you
    can work out the new xmin on the fly. I have working code for this
    and it's actually quite simple.
    RecentGlobalXmin doesn't need to be completely up to date, and in fact
    recomputing it on every snapshot becomes prohibitively expensive with this
    approach.  I'm still struggling with the best way to handle that.

    RecentGlobalXmin and RecentXmin are mostly updated from the snapshot's
    xmin, and that happens outside ProcArrayLock, so why would it be
    expensive if you have an updated xmin?
    Because GetSnapshotData() computes a new value for RecentGlobalXmin by
    scanning the ProcArray. This isn't costing a whole lot extra right
    now because the xmin and xid fields are normally in the same cache
    line, so once you've looked at one of them it doesn't cost that much
    extra to look at the other. If, on the other hand, you're not looking
    at (or even locking) the ProcArray, then doing so just to recompute
    RecentGlobalXmin sucks.
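    A simplified illustration of that point (this is not the real
    GetSnapshotData(); the struct is cut down and the XID comparison is not
    wraparound-aware):

    #include <stdint.h>

    typedef struct FakeProc
    {
        uint32_t    xid;                /* running XID, or 0 if none */
        uint32_t    xmin;               /* backend's snapshot xmin, or 0 */
    } FakeProc;

    /* The snapshot scan already touches procs[i].xid, and xmin sits right
     * next to it, so folding this minimum into the same loop is nearly free
     * today.  Running this scan *only* to recompute RecentGlobalXmin,
     * without otherwise touching the ProcArray, is what becomes
     * comparatively expensive. */
    static uint32_t
    scan_for_global_xmin(const FakeProc *procs, int nprocs, uint32_t my_xmin)
    {
        uint32_t    result = my_xmin;

        for (int i = 0; i < nprocs; i++)
            if (procs[i].xmin != 0 && procs[i].xmin < result)
                result = procs[i].xmin;
        return result;
    }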

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Amit Kapila at Sep 12, 2011 at 3:08 pm
    If you know what transactions were running the last time a snapshot
    summary was written and what transactions have ended since then, you can
    work out the new xmin on the fly. I have working code for this and it's
    actually quite simple.



    I believe one method to do the same is as follows:

    Let us assume that at some point of time the snapshot and the completed
    XID list are as follows:

    Snapshot { Xmin - 5, Xip[] - 8 10 12, Xmax - 15 }

    Committed XIDs - 8, 10, 12, 18, 20, 21

    So it means 16, 17, 19 are running transactions. So it will behave as
    follows:

    { Xmin - 16, Xmax - 21, Xip[] - 17, 19 }



    But if we calculate Xmin this way, we need to scan both the existing Xip
    array and the committed XID array to find Xmin. Won't this take a
    noticeable amount of time, even though it is outside the lock, if the Xip
    and committed XID lists are large?


    Because GetSnapshotData() computes a new value for RecentGlobalXmin by
    scanning the ProcArray. This isn't costing a whole lot extra right now
    because the xmin and xid fields are normally in the same cache line, so
    once you've looked at one of them it doesn't cost that much extra to
    look at the other. If, on the other hand, you're not looking at (or even
    locking) the ProcArray, then doing so just to recompute RecentGlobalXmin
    sucks.


    Yes, this takes more time than before, but if our approach to calculating
    Xmin is as in the point above, then one extra pass outside the lock
    should not matter. However, if the approach for that point is different,
    it will be costlier.



  • Robert Haas at Sep 12, 2011 at 4:01 pm

    On Mon, Sep 12, 2011 at 11:07 AM, Amit Kapila wrote:
    If you know what transactions were running the last time a snapshot
    summary was written and what transactions have ended since then, you can
    work out the new xmin on the fly. I have working code for this and it's
    actually quite simple.
    I believe one method to do the same is as follows:

    Let us assume that at some point of time the snapshot and the completed
    XID list are as follows:

    Snapshot { Xmin - 5, Xip[] - 8 10 12, Xmax - 15 }

    Committed XIDs - 8, 10, 12, 18, 20, 21

    So it means 16, 17, 19 are running transactions. So it will behave as
    follows:

    { Xmin - 16, Xmax - 21, Xip[] - 17, 19 }
    Yep, that's pretty much what it does, although xmax is actually
    defined as the XID *following* the last one that ended, and I think
    xmin needs to also be in xip, so in this case you'd actually end up
    with xmin = 15, xmax = 22, xip = { 15, 16, 17, 19 }. But you've got
    the basic idea of it.
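    A hedged sketch of that derivation, written to match the description just
    given (xmax is one past the newest XID known to have ended; xmin is the
    oldest still-running XID and stays in xip); the names, linear searches,
    and non-wraparound-aware comparisons are all simplifications. Fed the
    example above, it yields xmin = 15, xmax = 22, xip = { 15, 16, 17, 19 }:

    #include <stdbool.h>
    #include <stdint.h>

    static bool
    xid_ended(uint32_t xid, const uint32_t *ended, int nended)
    {
        for (int i = 0; i < nended; i++)
            if (ended[i] == xid)
                return true;
        return false;
    }

    static void
    derive_snapshot(uint32_t old_xmax, const uint32_t *old_xip, int old_nxip,
                    const uint32_t *ended, int nended,
                    uint32_t *xmin, uint32_t *xmax, uint32_t *xip, int *nxip)
    {
        uint32_t    last_ended = old_xmax - 1;

        /* xmax is the XID following the newest one known to have ended. */
        for (int i = 0; i < nended; i++)
            if (ended[i] > last_ended)
                last_ended = ended[i];
        *xmax = last_ended + 1;

        /* Still running: surviving members of the old xip[], plus every XID
         * handed out since the old xmax that has not ended. */
        *nxip = 0;
        for (int i = 0; i < old_nxip; i++)
            if (!xid_ended(old_xip[i], ended, nended))
                xip[(*nxip)++] = old_xip[i];
        for (uint32_t xid = old_xmax; xid < *xmax; xid++)
            if (!xid_ended(xid, ended, nended))
                xip[(*nxip)++] = xid;

        /* xmin is the oldest still-running XID (so it appears in xip too),
         * or xmax if nothing is running at all. */
        *xmin = *xmax;
        for (int i = 0; i < *nxip; i++)
            if (xip[i] < *xmin)
                *xmin = xip[i];
    }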
    But if we calculate Xmin this way, we need to scan both the existing Xip
    array and the committed XID array to find Xmin. Won't this take a
    noticeable amount of time, even though it is outside the lock, if the Xip
    and committed XID lists are large?
    Yes, Tom raised this concern earlier. I can't answer it for sure
    without benchmarking, but clearly xip[] can't be allowed to get too
    big.
    Because GetSnapshotData() computes a new value for RecentGlobalXmin by
    scanning the ProcArray. This isn't costing a whole lot extra right now
    because the xmin and xid fields are normally in the same cache line, so
    once you've looked at one of them it doesn't cost that much extra to
    look at the other. If, on the other hand, you're not looking at (or even
    locking) the ProcArray, then doing so just to recompute RecentGlobalXmin
    sucks.
    Yes, this takes more time than before, but if our approach to calculating
    Xmin is as in the point above, then one extra pass outside the lock
    should not matter. However, if the approach for that point is different,
    it will be costlier.
    It's not one extra read - you'd have to look at every PGPROC. And it
    is not outside a lock, either. You definitely need locking around
    computing RecentGlobalXmin; see src/backend/access/transam/README. In
    particular, if someone with proc->xmin = InvalidTransactionId is
    taking a snapshot while you're computing RecentGlobalXmin, and then
    stores a proc->xmin less than your newly-computed RecentGlobalXmin,
    you've got a problem. That can't happen right now because no
    transactions can commit while RecentGlobalXmin is being computed, but
    the point here is precisely to allow those operations to (mostly) run
    in parallel.
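    To spell the race out, with invented backend names and XID values:

    Backend A (computing RecentGlobalXmin)    Backend B (taking a snapshot)
    --------------------------------------    -----------------------------
    sees B->xmin = InvalidTransactionId
                                              derives its snapshot, whose
                                              xmin works out to, say, 990
    skips B, finishes the scan,
    publishes RecentGlobalXmin = 1000
                                              stores B->xmin = 990

    At that point tuples that B's snapshot still needs (xmin 990) can be
    cleaned up under RecentGlobalXmin = 1000, which is exactly the problem.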

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Amit Kapila at Sep 13, 2011 at 11:50 am

    Yep, that's pretty much what it does, although xmax is actually
    defined as the XID *following* the last one that ended, and I think
    xmin needs to also be in xip, so in this case you'd actually end up
    with xmin = 15, xmax = 22, xip = { 15, 16, 17, 19 }. But you've got
    the basic idea of it.
    Shouldn't Xmax be 21? The current check in tuple visibility says that if
    an XID is greater than or equal to Xmax, the tuple is treated as not
    visible.
    In particular, if someone with proc->xmin = InvalidTransactionId is
    taking a snapshot while you're computing RecentGlobalXmin, and then
    stores a proc->xmin less than your newly-computed RecentGlobalXmin,
    you've got a problem.
    I am assuming that here "taking a snapshot" means it has to be written to
    shared memory, because otherwise there is no need to look at the procs at
    all with your new design.

    Session-1
    updates RecentGlobalXmin during GetSnapshotData() using the shared-memory
    copy of the snapshot and the completed transactions, since
    RecentGlobalXmin can be updated once we have an xmin.

    Session-2
    writes a new snapshot into shared memory, and for that it needs to go
    through the ProcArray.
    While it is going through the ProcArray under the proc lock, it may find
    that Session-1's proc has an invalid transaction id, so it will ignore it
    and go through the remaining sessions' procs.
    Normally Session-1's proc should not end up with a smaller xmin than the
    other sessions' procs, but if it took its copy from the shared-memory
    ring buffer before the other sessions' procs were updated, its xmin can
    be lower, and that can cause a problem.
    It's not one extra read - you'd have to look at every PGPROC.
    If the above explanation is right, is this the reason that, to update
    RecentGlobalXmin, it has to go through every PGPROC?
  • Robert Haas at Sep 13, 2011 at 12:44 pm

    On Tue, Sep 13, 2011 at 7:49 AM, Amit Kapila wrote:
    Yep, that's pretty much what it does, although xmax is actually
    defined as the XID *following* the last one that ended, and I think
    xmin needs to also be in xip, so in this case you'd actually end up
    with xmin = 15, xmax = 22, xip = { 15, 16, 17, 19 }.  But you've got
    the basic idea of it.
    Shouldn't Xmax be 21? The current check in tuple visibility says that if
    an XID is greater than or equal to Xmax, the tuple is treated as not
    visible.
    No, that's not OK. You stipulated 21 as committed, so it had better be visible.
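    In terms of the usual snapshot test - a simplified, not wraparound-aware
    version is sketched below; a tuple from a committed transaction is
    visible only when its XID is not "in" the snapshot - with xmax = 21 the
    committed XID 21 would still be treated as in progress, whereas with
    xmax = 22 it falls through to the committed case:

    #include <stdbool.h>
    #include <stdint.h>

    /* True if the snapshot considers xid still in progress. */
    static bool
    xid_in_snapshot(uint32_t xid, uint32_t xmin, uint32_t xmax,
                    const uint32_t *xip, int nxip)
    {
        if (xid < xmin)
            return false;       /* ended before the snapshot was taken */
        if (xid >= xmax)
            return true;        /* not yet ended as of the snapshot */
        for (int i = 0; i < nxip; i++)
            if (xip[i] == xid)
                return true;    /* was running when the snapshot was taken */
        return false;
    }

    /* With xmin = 15, xmax = 22, xip = { 15, 16, 17, 19 }:
     * xid_in_snapshot(21, ...) is false, so committed XID 21 is visible;
     * with xmax = 21 instead, 21 >= xmax would wrongly leave it invisible. */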
    In particular, if someone with proc->xmin = InvalidTransactionId is
    taking a snapshot while you're computing RecentGlobalXmin, and then
    stores a proc->xmin less than your newly-computed RecentGlobalXmin,
    you've got a problem.
    I am assuming that here "taking a snapshot" means it has to be written to
    shared memory, because otherwise there is no need to look at the procs at
    all with your new design.

    Session-1
    updates RecentGlobalXmin during GetSnapshotData() using the shared-memory
    copy of the snapshot and the completed transactions, since
    RecentGlobalXmin can be updated once we have an xmin.

    Session-2
    writes a new snapshot into shared memory, and for that it needs to go
    through the ProcArray.
    While it is going through the ProcArray under the proc lock, it may find
    that Session-1's proc has an invalid transaction id, so it will ignore it
    and go through the remaining sessions' procs.
    Normally Session-1's proc should not end up with a smaller xmin than the
    other sessions' procs, but if it took its copy from the shared-memory
    ring buffer before the other sessions' procs were updated, its xmin can
    be lower, and that can cause a problem.
    It's not one extra read - you'd have to look at every PGPROC.
    If the above explanation is right, is this the reason that, to update
    RecentGlobalXmin, it has to go through every PGPROC?
    Your explanation isn't very clear to me. But I will post the patch
    once I have some of these details sorted out.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Amit kapila at Sep 10, 2011 at 12:04 pm
    4. Won't it hurt if we don't update xmin every time and just note the
    committed XIDs? The reason I am asking is that xmin is used in the tuple
    visibility check, so with the new idea, in some cases, instead of
    returning early after checking xmin, it has to go through the committed
    XID list.
    I understand that such cases may be rare, or that the improvement from
    your idea may outweigh this small cost, but some cases could still regress.
    The snapshot xmin has to be up to date. I'm not planning to break that
    because it would be wrong.
    In the approach you described, once a snapshot has been written, usually
    only the committed XIDs are recorded afterwards, and only sometimes a new
    snapshot itself.

    So when will xmin be updated according to your idea?
    RecentGlobalXmin doesn't need to be completely up to date, and in fact
    recomputing it on every snapshot becomes prohibitively expensive with
    this approach. I'm still struggling with the best way to handle that.
    RecentGlobalXmin and RecentXmin are mostly updated from the snapshot's
    xmin, and that happens outside ProcArrayLock, so why would it be
    expensive if you have an updated xmin?


    With Regards,

    Amit Kapila.

