Attached is the latest Serializable Snapshot Isolation (SSI) patch.

With Joe's testing and review, and with stress tests adapted from
those used by Florian for his patch, we were able to identify and
fix several bugs. Stability seems good now. We have many tests for
correct behavior which are all looking good. The only solid
benchmarks we have so far show no impact on isolation levels other
than SERIALIZABLE, and a 1.8% increase in run time for a saturation
run of small, read-only SERIALIZABLE transactions against a fully
cached database. Dan has been working on setting up some benchmarks
using DBT-2, but doesn't yet have results to publish. If we can get
more eyes on the code during this CF, I'm hoping we can get this
patch committed this round.

This patch is basically an implementation of the techniques
described in the 2008 paper by Cahill et al. and further developed
in Cahill's 2009 PhD thesis. The techniques needed to be
adapted somewhat because of differences between PostgreSQL and the
two databases used for prototype implementations for those papers
(Oracle Berkeley DB and InnoDB), and there are a few original ideas
from Dan and myself used to optimize the implementation. One reason
for hoping that this patch gets committed in this CF is that it will
leave time to try out some other, more speculative optimizations
before release.

Documentation is not included in this patch; I plan on submitting
that to a later CF as a separate patch. Changes should be almost
entirely within the Concurrency Control chapter. The current patch
has one new GUC which (if kept) will need to be documented, and one
of the potential optimizations could involve adding a new
transaction property which would then need documentation.

The premise of the patch is simple: that snapshot isolation comes so
close to supporting fully serializable transactions that S2PL is not
necessary -- the database engine can watch for rw-dependencies among
transactions, without introducing any blocking, and roll back
transactions as required to prevent serialization anomalies. This
eliminates the need for using the SELECT FOR SHARE or SELECT FOR
UPDATE clauses, the need for explicit locking, and the need for
additional updates to introduce conflict points.
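As a purely illustrative sketch of the premise (plain Python, not the patch's C code; the names and the simplified abort rule here are mine): each transaction records rw-conflicts in and out, and a transaction with both becomes a "pivot" that must be rolled back to break any potential cycle, accepting some false positives.

```python
# Illustrative sketch only, not the patch's implementation. A transaction
# with a rw-conflict both in and out is a "pivot"; SSI breaks potential
# cycles by rolling back a pivot, without any blocking.

class Txn:
    def __init__(self, name):
        self.name = name
        self.conflict_in = False   # a concurrent reader missed this txn's write
        self.conflict_out = False  # this txn missed a concurrent writer's write

def record_rw_conflict(reader, writer):
    """reader did not see writer's write, so reader appears to run first."""
    reader.conflict_out = True
    writer.conflict_in = True
    for t in (reader, writer):
        if t.conflict_in and t.conflict_out:
            return t  # pivot chosen for serialization-failure rollback
    return None

# Classic write skew: each transaction fails to see the other's write.
t1, t2 = Txn("T1"), Txn("T2")
assert record_rw_conflict(t1, t2) is None  # one edge: no cycle possible yet
assert record_rw_conflict(t2, t1) is t2    # both edges: a pivot must roll back
```

The rolled-back transaction simply retries; no SELECT FOR UPDATE or explicit locking is involved.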

While block-level locking is included in this patch for btree and
GiST indexes, an index relation lock is still used for predicate
locks when a search is made through a GIN or hash index. These
additional index types can be implemented separately. Dan is
looking at bringing btree indexes to finer granularity, but wants to
have good benchmarks first, to confirm that the net impact is a gain
in performance.

Most of the work is in the new predicate.h and predicate.c files,
which total 2,599 lines, over 39% of which are comment lines. There
are 1,626 lines in the new pg_dtester.py.in file, which uses Markus
Wanner's dtester software to implement a large number of correctness
tests. We added 79 lines to lockfuncs.c to include the new
SIReadLock entries in the pg_locks view. The rest of the patch
affects 286 lines (counting an updated line twice) across 25
existing PostgreSQL source files to implement the actual feature.

The code organization and naming issues mentioned here remain:

http://archives.postgresql.org/pgsql-hackers/2010-07/msg00383.php

-Kevin

  • Heikki Linnakangas at Sep 14, 2010 at 7:49 pm

    On 14/09/10 19:34, Kevin Grittner wrote:
    Attached is the latest Serializable Snapshot Isolation (SSI) patch.
    Great work! A year ago I thought it would be impossible to have a true
    serializable mode in PostgreSQL because of the way we do MVCC, and now
    we have a patch.

    At a quick read-through, the code looks very tidy and clear now. Some
    comments:

    Should add a citation to Cahill's work this is based on. Preferably with
    a hyperlink. A short description of how the predicate locks help to
    implement serializable mode would be nice too. I haven't read Cahill's
    papers, and I'm left wondering what the RW conflicts and dependencies
    are, when you're supposed to grab predicate locks etc.

    If a page- or relation level SILOCK is taken, is it possible to get
    false conflicts? Ie. a transaction is rolled back because it modified a
    tuple on a page where some other transaction modified another tuple,
    even though there's no dependency between the two.

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Kevin Grittner at Sep 14, 2010 at 8:42 pm

    Heikki Linnakangas wrote:

    Great work! A year ago I thought it would be impossible to have a
    true serializable mode in PostgreSQL because of the way we do
    MVCC, and now we have a patch.

    At a quick read-through, the code looks very tidy and clear now.
    Some comments:

    Should add a citation to Cahill's work this is based on.
    Preferably with a hyperlink.
    I'm planning on drawing from the current Wiki page:

    http://wiki.postgresql.org/wiki/Serializable

    to put together a README file; do you think the references should go
    in the README file, the source code, or both?

    A short description of how the predicate locks help to implement
    serializable mode would be nice too. I haven't read Cahill's
    papers, and I'm left wondering what the RW conflicts and
    dependencies are, when you're supposed to grab predicate locks
    etc.
    Again, I summarize that in the Wiki page, and was planning on
    putting it into the README. If you've read the Wiki page and it's
    not clear, then I definitely have some work to do there.
    If a page- or relation level SILOCK is taken, is it possible to
    get false conflicts?
    Yes. This technique will generate some false positive rollbacks.
    Software will need to be prepared to retry any database transaction
    which fails with a serialization failure SQLSTATE. I expect that
    proper connection pooling will be particularly important when using
    SSI, and flagging transactions which don't write to permanent tables
    as READ ONLY transactions will help reduce the rollback rate, too.

    Some of the optimizations we have sketched out will definitely
    reduce the rate of false positives; however, we don't want to
    implement them without a better performance baseline because the
    cost of tracking the required information and the contention for LW
    locks to maintain the information may hurt performance more than the
    restart of transactions which experience serialization failure.

    I don't want to steal Dan's thunder after all the hard work he's
    done to get good numbers from the DBT-2 benchmark, but suffice it to
    say that I've been quite pleased with the performance on that
    benchmark. He's pulling together the data for a post on the topic.
    Ie. a transaction is rolled back because it modified a tuple on a
    page where some other transaction modified another tuple, even
    though there's no dependency between the two.
    Well, no, because this patch doesn't do anything new with write
    conflicts. It's all about the apparent order of execution, based on
    one transaction not being able to read what was written by a
    concurrent transaction. The reading transaction must be considered
    to have run first in that case (hey, now you know what a rw-conflict
    is!), but such dependencies can create a cycle, which is the source
    of all serialization anomalies in snapshot isolation.

    -Kevin
  • Kevin Grittner at Sep 14, 2010 at 9:50 pm
    I've been thinking about these points, and reconsidered somewhat.

    Heikki Linnakangas wrote:
    Should add a citation to Cahill's work this is based on.
    Preferably with a hyperlink.
    I've been thinking that this should be mentioned in both the README
    and the source code.
    A short description of how the predicate locks help to implement
    serializable mode would be nice too. I haven't read Cahill's
    papers, and I'm left wondering what the RW conflicts and
    dependencies are, when you're supposed to grab predicate locks
    etc.
    Again -- why be stingy? Given a more complete README file, how
    about something like?:

    /*
    * A rw-conflict occurs when a read by one serializable transaction
    * does not see the write of a concurrent serializable transaction
    * when that write would have been visible had the writing
    * transaction committed before the start of the reading
    * transaction. When the write occurs first, the read can detect
    * this conflict by examining the MVCC information. When the read
    * occurs first, it must record this somewhere so that writes can
    * check for a conflict. Predicate locks are used for this.
    * Detection of such a conflict does not cause blocking, and does
    * not, in itself, cause a transaction rollback.
    *
    * Transaction rollback is required when one transaction (called a
    * "pivot") has a rw-conflict *in* (a concurrent transaction
    * couldn't see its write) as well as *out* (it couldn't see the
    * write of another transaction). In addition, the transaction on
    * the "out" side of the pivot must commit first, and if the
    * transaction on the "in" side of the pivot is read-only, it must
    * acquire its snapshot after the successful commit of the
    * transaction on the "out" side of the pivot.
    */

    Would something like that have helped?

    -Kevin
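The rollback condition in the comment block above can be sketched in miniature (illustrative Python, not code from the patch; logical timestamps stand in for commit sequence numbers and snapshot acquisition times):

```python
# Illustrative sketch of the pivot rollback condition: rollback is
# required only if the "out"-side transaction commits first, and, when
# the "in" side is read-only, only if it took its snapshot after that
# commit. Timestamps are simple integers; None means "not committed".

def pivot_requires_rollback(out_commit, pivot_commit,
                            in_read_only, in_snapshot):
    """Smaller timestamp = earlier event."""
    if out_commit is None or (pivot_commit is not None
                              and out_commit >= pivot_commit):
        return False  # the "out"-side transaction did not commit first
    if in_read_only and in_snapshot <= out_commit:
        return False  # read-only "in" side took its snapshot early enough
    return True

# "Out" side committed at 5, pivot at 8; a read-only "in" side that
# snapshotted at 3 is safe, but one that snapshotted at 6 is not.
assert not pivot_requires_rollback(5, 8, True, 3)
assert pivot_requires_rollback(5, 8, True, 6)
```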
  • Heikki Linnakangas at Sep 15, 2010 at 7:50 am

    On 15/09/10 00:49, Kevin Grittner wrote:
    Heikki Linnakangas wrote:
    A short description of how the predicate locks help to implement
    serializable mode would be nice too. I haven't read Cahill's
    papers, and I'm left wondering what the RW conflicts and
    dependencies are, when you're supposed to grab predicate locks
    etc.
    Again -- why be stingy? Given a more complete README file, how
    about something like?:
    Well, if it's explained in the readme, that's probably enough.
    /*
    * A rw-conflict occurs when a read by one serializable transaction
    * does not see the write of a concurrent serializable transaction
    * when that write would have been visible had the writing
    * transaction committed before the start of the reading
    * transaction. When the write occurs first, the read can detect
    * this conflict by examining the MVCC information. When the read
    * occurs first, it must record this somewhere so that writes can
    * check for a conflict. Predicate locks are used for this.
    * Detection of such a conflict does not cause blocking, and does
    * not, in itself, cause a transaction rollback.
    *
    * Transaction rollback is required when one transaction (called a
    * "pivot") has a rw-conflict *in* (a concurrent transaction
    * couldn't see its write) as well as *out* (it couldn't see the
    * write of another transaction). In addition, the transaction on
    * the "out" side of the pivot must commit first, and if the
    * transaction on the "in" side of the pivot is read-only, it must
    * acquire its snapshot after the successful commit of the
    * transaction on the "out" side of the pivot.
    */

    Would something like that have helped?
    Yes. An example would be very nice too; that description alone is
    pretty hard to grasp. Having read the Wiki page, and the slides from
    your presentation at PG East 2010, I think I understand it now.

    Now that I understand what the predicate locks are for, I'm now trying
    to get my head around all the data structures in predicate.c. The
    functions are well commented, but an overview at the top of the file of
    all the hash tables and other data structures would be nice. What is
    stored in each, when are they updated, etc.

    I've been meaning to look at this patch for some time, but now I'm
    actually glad I haven't because I'm now getting a virgin point of view
    on the code, seeing the problems that anyone who's not familiar with the
    approach will run into. :-)

    BTW, does the patch handle prepared transactions yet? It introduces a
    call to PreCommit_CheckForSerializationFailure() in CommitTransaction, I
    think you'll need that in PrepareTransaction as well.

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Kevin Grittner at Sep 15, 2010 at 1:16 pm

    Heikki Linnakangas wrote:

    Now that I understand what the predicate locks are for, I'm now
    trying to get my head around all the data structures in
    predicate.c. The functions are well commented, but an overview at
    the top of the file of all the hash tables and other data
    structures would be nice. What is stored in each, when are they
    updated, etc.
    It probably doesn't help that they're split between predicate.c and
    predicate.h. (They were originally all in predicate.c because
    nobody else needed to see them, but we moved some to the .h file to
    expose them to lockfuncs.c to support listing the locks.)

    I'm inclined to move everything except the function prototypes out
    of predicate.h to a new predicate_internal.h, and move the structures
    defined in predicate.c there, too. And, of course, add the overview
    comments in the new file. If that sounds good, I can probably
    post a new patch with those changes today -- would that be a good
    idea, or should I wait for more feedback before doing that? (It
    will be in the git repo either way.)
    BTW, does the patch handle prepared transactions yet? It
    introduces a call to PreCommit_CheckForSerializationFailure() in
    CommitTransaction, I think you'll need that in PrepareTransaction
    as well.
    Good point. In spite of the NB comment, I did not notice that.
    Will fix.

    Thanks for the feedback!

    -Kevin
  • Alvaro Herrera at Sep 15, 2010 at 6:38 pm

    Excerpts from Kevin Grittner's message of Wed Sep 15 09:15:53 -0400 2010:

    I'm inclined to move everything except the function prototypes out
    of predicate.h to a new predicate_internal.h, and move the structures
    defined in predicate.c there, too.
    I think that would also solve a concern that I had, which is that we
    were starting to include relcache.h (and perhaps other headers as well,
    but that's the one that triggered it for me) a bit too liberally, so +1
    from me.

    --
    Álvaro Herrera <alvherre@commandprompt.com>
    The PostgreSQL Company - Command Prompt, Inc.
    PostgreSQL Replication, Consulting, Custom Development, 24x7 support
  • Kevin Grittner at Sep 15, 2010 at 6:52 pm

    Alvaro Herrera wrote:

    I think that would also solve a concern that I had, which is that
    we were starting to include relcache.h (and perhaps other headers
    as well, but that's the one that triggered it for me) a bit too
    liberally, so +1 from me.
    Unfortunately, what I proposed doesn't solve that for relcache.h,
    although it does eliminate lock.h from almost everywhere and htup.h
    from everywhere. (The latter seemed to be left over from an
    abandoned approach, and was no longer needed in predicate.h in any
    event.)

    Most of the functions in predicate.c take a Relation as a parameter.
    I could split out the function prototypes for those which *don't*
    use it to a separate .h file if you think it is worthwhile. The
    functions would be:

    void InitPredicateLocks(void);
    Size PredicateLockShmemSize(void);
    void RegisterSerializableTransaction(const Snapshot snapshot);
    void ReleasePredicateLocks(const bool isCommit);
    void PreCommit_CheckForSerializationFailure(void);

    The files where these are used are:

    src/backend/storage/ipc/ipci.c
    src/backend/utils/time/snapmgr.c
    src/backend/utils/resowner/resowner.c
    src/backend/access/transam/xact.c

    So any of these files which don't already include relcache.h could
    remain without it if we make this split. Is there an easy way to
    check which might already include it? Is it worth adding one more
    .h file to avoid including relcache.h and snapshot.h in these four
    files?

    Let me know -- I'm happy to arrange this any way people feel is most
    appropriate. I have a profound appreciation for the organization of
    this code, and want to maintain it, even if I don't possess the
    perspective to know how to best do so. The respect comes from
    developing this patch -- every time I gave my manager an estimate of
    how long it would take to do something, I found it actually took
    about one-third of that time -- and it was entirely due to the
    organization and documentation of the code.

    -Kevin
  • Alvaro Herrera at Sep 16, 2010 at 10:58 pm

    Excerpts from Kevin Grittner's message of Wed Sep 15 14:52:36 -0400 2010:
    Alvaro Herrera wrote:
    I think that would also solve a concern that I had, which is that
    we were starting to include relcache.h (and perhaps other headers
    as well, but that's the one that triggered it for me) a bit too
    liberally, so +1 from me.
    Unfortunately, what I proposed doesn't solve that for relcache.h,
    although it does eliminate lock.h from almost everywhere and htup.h
    from everywhere.
    Now that I look at your new patch, I noticed that I was actually
    confusing relcache.h with rel.h. The latter includes a big chunk of our
    headers, but relcache.h is pretty thin. Including relcache.h in another
    header is not much of a problem.

    --
    Álvaro Herrera <alvherre@commandprompt.com>
    The PostgreSQL Company - Command Prompt, Inc.
    PostgreSQL Replication, Consulting, Custom Development, 24x7 support
  • Kevin Grittner at Sep 16, 2010 at 11:09 pm

    Alvaro Herrera wrote:

    Now that I look at your new patch, I noticed that I was actually
    confusing relcache.h with rel.h. The latter includes a big chunk
    of our headers, but relcache.h is pretty thin. Including
    relcache.h in another header is not much of a problem.
    OK, thanks for the clarification.

    With the structures all brought back together in a logical order,
    and the new comments in front of the structure declarations, do you
    think a summary at the top of the file is still needed in that
    header file?

    -Kevin
  • Kevin Grittner at Sep 16, 2010 at 10:35 pm

    Heikki Linnakangas wrote:

    The functions are well commented, but an overview at the top of
    the file of all the hash tables and other data structures would be
    nice. What is stored in each, when are they updated, etc.
    I moved all the structures from predicate.h and predicate.c to a new
    predicate_internal.h file and added comments. You can view its
    current contents here:

    http://git.postgresql.org/gitweb?p=users/kgrittn/postgres.git;a=blob;f=src/include/storage/predicate_internal.h;h=7cdb5af6eebdc148dd5ed5030847ca50d7df4fe8;hb=7f05b21bc4d846ad22ae8c160b1bf8888495e254

    Does this work for you?

    That leaves the predicate.h file with just this:

    http://git.postgresql.org/gitweb?p=users/kgrittn/postgres.git;a=blob;f=src/include/storage/predicate.h;h=7dcc2af7628b860f9cec9ded6b78f55163b58934;hb=7f05b21bc4d846ad22ae8c160b1bf8888495e254

    -Kevin
  • Heikki Linnakangas at Sep 17, 2010 at 6:11 am

    On 17/09/10 01:35, Kevin Grittner wrote:
    Heikki Linnakangas wrote:
    The functions are well commented, but an overview at the top of
    the file of all the hash tables and other data structures would be
    nice. What is stored in each, when are they updated, etc.
    I moved all the structures from predicate.h and predicate.c to a new
    predicate_internal.h file and added comments. You can view its
    current contents here:

    http://git.postgresql.org/gitweb?p=users/kgrittn/postgres.git;a=blob;f=src/include/storage/predicate_internal.h;h=7cdb5af6eebdc148dd5ed5030847ca50d7df4fe8;hb=7f05b21bc4d846ad22ae8c160b1bf8888495e254

    Does this work for you?
    Yes, thank you, that helps a lot.

    So, the purpose of SerializableXidHash is to provide quick access to the
    SERIALIZABLEXACT struct of a top-level transaction, when you know its
    transaction id or any of its subtransaction ids. To implement the "or
    any of its subtransaction ids" part, you need to have a SERIALIZABLEXID
    struct for each subtransaction in shared memory. That sounds like it can
    eat through your shared memory very quickly if you have a lot of
    subtransactions.

    Why not use SubTransGetTopmostTransaction() ?

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
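Heikki's description of SerializableXidHash can be sketched with a plain Python dict standing in for the shared-memory hash table (names are illustrative, not the patch's):

```python
# Illustrative stand-in for SerializableXidHash: every top-level xid and
# every tuple-writing subtransaction xid maps to the same top-level
# record, so each writing subxact costs one shared-memory entry.

class SerializableXact:
    def __init__(self, top_xid):
        self.top_xid = top_xid

serializable_xid_hash = {}

def register_xid(xid, sxact):
    serializable_xid_hash[xid] = sxact  # one entry per registered xid

top = SerializableXact(100)
register_xid(100, top)  # the top-level transaction itself
register_xid(101, top)  # a subtransaction that wrote tuples
register_xid(102, top)  # another writing subtransaction

# Any of the three xids resolves to the top-level transaction...
assert serializable_xid_hash[102].top_xid == 100
# ...but the table grows with every tuple-writing subtransaction.
assert len(serializable_xid_hash) == 3
```

This is the memory-consumption concern: entries accumulate per writing subtransaction rather than per top-level transaction.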
  • Kevin Grittner at Sep 17, 2010 at 11:56 am

    Heikki Linnakangas wrote:

    So, the purpose of SerializableXidHash is to provide quick access
    to the SERIALIZABLEXACT struct of a top-level transaction, when you
    know its transaction id or any of its subtransaction ids.
    Right.
    To implement the "or any of its subtransaction ids" part, you need
    to have a SERIALIZABLEXID struct for each subtransaction in shared
    memory.
    Close -- each subtransaction which writes any tuples.
    That sounds like it can eat through your shared memory very quickly
    if you have a lot of subtransactions.
    Hmmm.... I've never explicitly used subtransactions, so I don't tend
    to think of them routinely going too deep. And the struct is pretty
    small.
    Why not use SubTransGetTopmostTransaction() ?
    This needs to work when the xid of a transaction is found in the MVCC
    data of a tuple for any overlapping serializable transaction -- even
    if that transaction has completed and its connection has been
    closed. It didn't look to me like SubTransGetTopmostTransaction()
    would work after the transaction was gone.

    I guess that's something I should mention in the comments....

    -Kevin
  • Heikki Linnakangas at Sep 17, 2010 at 12:00 pm

    On 17/09/10 14:56, Kevin Grittner wrote:
    Heikki Linnakangas wrote:
    Why not use SubTransGetTopmostTransaction() ?
    This needs to work when the xid of a transaction is found in the MVCC
    data of a tuple for any overlapping serializable transaction -- even
    if that transaction has completed and its connection has been
    closed. It didn't look to me like SubTransGetTopmostTransaction()
    would work after the transaction was gone.
    You're right, it doesn't retain transactions that old. But it could
    easily be modified to do so.

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Tom Lane at Sep 17, 2010 at 2:08 pm

    "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
    Heikki Linnakangas wrote:
    That sounds like it can eat through your shared memory very quickly
    if you have a lot of subtransactions.
    Hmmm.... I've never explicitly used subtransactions, so I don't tend
    to think of them routinely going too deep. And the struct is pretty
    small.
    That assumption is absolutely, totally not going to fly.
    Why not use SubTransGetTopmostTransaction() ?
    This needs to work when the xid of a transaction is found in the MVCC
    data of a tuple for any overlapping serializable transaction -- even
    if that transaction has completed and its connection has been
    closed. It didn't look to me like SubTransGetTopmostTransaction()
    would work after the transaction was gone.
    Yes, it should work. If it doesn't, you are failing to manage the
    TransactionXmin horizon correctly.

    regards, tom lane
  • Kevin Grittner at Sep 17, 2010 at 2:22 pm

    Tom Lane wrote:

    That assumption is absolutely, totally not going to fly.
    Understood; I'm already working on it based on Heikki's input.
    This needs to work when the xid of a transaction is found in the
    MVCC data of a tuple for any overlapping serializable transaction
    -- even if that transaction has completed and its connection has
    been closed. It didn't look to me like
    SubTransGetTopmostTransaction() would work after the transaction
    was gone.
    Yes, it should work. If it doesn't, you are failing to manage the
    TransactionXmin horizon correctly.
    So far I haven't wanted to mess with the global xmin values for fear
    of the possible impact on other transactions. It actually hasn't
    been that hard to maintain a SerializableGlobalXmin value, which is
    more efficient than the existing ones for predicate lock cleanup
    purposes. That still isn't exactly what I need to modify cleanup of
    the subtransaction information, though. Once I've got my head
    around the subtrans.c code, I think I'll need to maintain a minimum
    that includes the xids for serializable transactions which *overlap*
    SerializableGlobalXmin. That doesn't seem very hard to do; I just
    haven't needed it until now. Then I'll modify the subtransaction
    cleanup to only remove entries before the earlier of the global xmin
    of all transactions and the xmin of serializable transactions which
    overlap active serializable transactions.

    Does all that sound reasonable?

    -Kevin
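The cleanup horizon Kevin describes can be sketched as follows (illustrative Python, assuming xids are plain integers):

```python
# Illustrative sketch: subtrans entries may only be truncated before the
# earlier of the global xmin of all transactions and the xmin of
# serializable transactions that overlap active serializable ones.

def subtrans_cleanup_horizon(global_xmin, overlapping_serializable_xmins):
    if not overlapping_serializable_xmins:
        return global_xmin
    return min(global_xmin, min(overlapping_serializable_xmins))

# A completed-but-overlapping serializable transaction holds the
# truncation horizon back below the ordinary global xmin:
assert subtrans_cleanup_horizon(500, [480, 510]) == 480
assert subtrans_cleanup_horizon(500, []) == 500
```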
  • Kevin Grittner at Sep 17, 2010 at 12:08 pm

    Heikki Linnakangas wrote:
    On 17/09/10 14:56, Kevin Grittner wrote:
    Heikki Linnakangas wrote:
    Why not use SubTransGetTopmostTransaction() ?
    This needs to work when the xid of a transaction is found in the
    MVCC data of a tuple for any overlapping serializable transaction
    -- even if that transaction has completed and its connection has
    been closed. It didn't look to me like
    SubTransGetTopmostTransaction() would work after the transaction
    was gone.
    You're right, it doesn't retain transactions that old. But it could
    easily be modified to do so.
    I shall look into it.

    -Kevin
  • Kevin Grittner at Sep 18, 2010 at 6:52 pm
    [Apologies for not reply-linking this; work email is down so I'm
    sending from gmail.]

    Based on feedback from Heikki and Tom I've reworked how I find the
    top-level transaction. This is in the git repo, and the changes can
    be viewed at:

    http://git.postgresql.org/gitweb?p=users/kgrittn/postgres.git;a=commitdiff;h=e29927c7966adba2443fdc4f64da9d282f95a05b

    -Kevin
  • Heikki Linnakangas at Sep 18, 2010 at 7:39 pm

    On 18/09/10 21:52, Kevin Grittner wrote:
    [Apologies for not reply-linking this; work email is down so I'm
    sending from gmail.]

    Based on feedback from Heikki and Tom I've reworked how I find the
    top-level transaction. This is in the git repo, and the changes can
    be viewed at:

    http://git.postgresql.org/gitweb?p=users/kgrittn/postgres.git;a=commitdiff;h=e29927c7966adba2443fdc4f64da9d282f95a05b
    Thanks, much simpler. Now let's simplify it some more ;-)

    ISTM you never search the SerializableXactHash table using a hash key,
    except the one call in CheckForSerializableConflictOut, but there you
    already have a pointer to the SERIALIZABLEXACT struct. You only re-find
    it to make sure it hasn't gone away while you trade the shared lock for
    an exclusive one. If we find another way to ensure that, ISTM we don't
    need SerializableXactHash at all. My first thought was to forget about
    VirtualTransactionId and use TransactionId directly as the hash key for
    SERIALIZABLEXACT. The problem is that a transaction doesn't have a
    transaction ID when RegisterSerializableTransaction is called. We could
    leave the TransactionId blank and only add the SERIALIZABLEXACT struct
    to the hash table when an XID is assigned, but there's no provision to
    insert an existing struct to a hash table in the current hash table API.

    So, I'm not sure of the details yet, but it seems like it could be made
    simpler somehow..

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Kevin Grittner at Sep 19, 2010 at 1:48 pm

    Heikki Linnakangas wrote:

    ISTM you never search the SerializableXactHash table using a hash
    key, except the one call in CheckForSerializableConflictOut, but
    there you already have a pointer to the SERIALIZABLEXACT struct.
    You only re-find it to make sure it hasn't gone away while you
    trade the shared lock for an exclusive one. If we find another way
    to ensure that, ISTM we don't need SerializableXactHash at all. My
    first thought was to forget about VirtualTransactionId and use
    TransactionId directly as the hash key for SERIALIZABLEXACT. The
    problem is that a transaction doesn't have a transaction ID when
    RegisterSerializableTransaction is called. We could leave the
    TransactionId blank and only add the SERIALIZABLEXACT struct to the
    hash table when an XID is assigned, but there's no provision to
    insert an existing struct to a hash table in the current hash table
    API.

    So, I'm not sure of the details yet, but it seems like it could be
    made simpler somehow..
    After tossing it around in my head for a bit, the only thing that I
    see (so far) which might work is to maintain a *list* of
    SERIALIZABLEXACT objects in memory rather than using a hash table.
    The recheck after releasing the shared lock and acquiring an
    exclusive lock would then go through SerializableXidHash. I think
    that can work, although I'm not 100% sure that it's an improvement.
    I'll look it over in more detail. I'd be happy to hear your thoughts
    on this or any other suggestions.

    -Kevin
  • Heikki Linnakangas at Sep 19, 2010 at 6:57 pm

    On 19/09/10 16:48, Kevin Grittner wrote:
    After tossing it around in my head for a bit, the only thing that I
    see (so far) which might work is to maintain a *list* of
    SERIALIZABLEXACT objects in memory rather than using a hash table.
    The recheck after releasing the shared lock and acquiring an
    exclusive lock would then go through SerializableXidHash. I think
    that can work, although I'm not 100% sure that it's an improvement.
    Yeah, also keep in mind that a linked list with only a few items is
    faster to scan through than sequentially scanning an almost empty hash
    table.

    Putting that aside for now, we have one very serious problem with this
    algorithm:
    While they [SIREAD locks] are associated with a transaction, they must survive
    a successful COMMIT of that transaction, and remain until all overlapping
    transactions complete.
    Long-running transactions are already nasty because they prevent VACUUM
    from cleaning up old tuple versions, but this escalates the problem to a
    whole new level. If you have one old transaction sitting idle, every
    transaction that follows consumes a little bit of shared memory, until
    that old transaction commits. Eventually you will run out of shared
    memory, and will not be able to start new transactions anymore.

    Is there anything we can do about that? Just a thought, but could you
    somehow coalesce the information about multiple already-committed
    transactions to keep down the shared memory usage? For example, if you
    have this:

    1. Transaction <slow> begins
    2. 100 other transactions begin and commit

    Could you somehow group together the 100 committed transactions and
    represent them with just one SERIALIZABLEXACT struct?

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Heikki Linnakangas at Sep 22, 2010 at 6:54 pm

    On 19/09/10 21:57, I wrote:
    Putting that aside for now, we have one very serious problem with this
    algorithm:
    While they [SIREAD locks] are associated with a transaction, they must
    survive a successful COMMIT of that transaction, and remain until all
    overlapping transactions complete.
    Long-running transactions are already nasty because they prevent VACUUM
    from cleaning up old tuple versions, but this escalates the problem to a
    whole new level. If you have one old transaction sitting idle, every
    transaction that follows consumes a little bit of shared memory, until
    that old transaction commits. Eventually you will run out of shared
    memory, and will not be able to start new transactions anymore.

    Is there anything we can do about that? Just a thought, but could you
    somehow coalesce the information about multiple already-committed
    transactions to keep down the shared memory usage? For example, if you
    have this:

    1. Transaction <slow> begins
    2. 100 other transactions begin and commit

    Could you somehow group together the 100 committed transactions and
    represent them with just one SERIALIZABLEXACT struct?
    Ok, I think I've come up with a scheme that puts an upper bound on the
    amount of shared memory used, wrt. number of transactions. You can still
    run out of shared memory if you lock a lot of objects, but that doesn't
    worry me as much.

    When a transaction commits, its predicate locks must still be held, but
    it's not important anymore *who* holds them, as long as they're held
    for long enough.

    Let's move the finishedBefore field from SERIALIZABLEXACT to
    PREDICATELOCK. When a transaction commits, set the finishedBefore field
    in all the PREDICATELOCKs it holds, and then release the
    SERIALIZABLEXACT struct. The predicate locks stay without an associated
    SERIALIZABLEXACT entry until finishedBefore expires.

    Whenever there are two predicate locks on the same target that both
    belonged to an already-committed transaction, the one with a smaller
    finishedBefore can be dropped, because the one with higher
    finishedBefore value covers it already.

    There. That was surprisingly simple, I must be missing something.
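    The scheme just described could be sketched like this (a toy Python
    model with invented names; the real structures would live in shared
    memory, but the coalescing rule is the same):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PredicateLock:
    target: str                            # lock tag (relation/page/tuple)
    finished_before: Optional[int] = None  # stamped when the owner commits

class LockTable:
    def __init__(self):
        self.locks = []

    def commit(self, xact_locks, finished_before):
        """Detach a committing transaction's locks, stamping each with its
        finishedBefore value, then drop locks covered by a later one."""
        for lock in xact_locks:
            lock.finished_before = finished_before
            self.locks.append(lock)
        self._coalesce()

    def _coalesce(self):
        # Per target, keep only the lock with the highest finishedBefore;
        # any lock with a smaller value is covered by it.
        best = {}
        for lock in self.locks:
            cur = best.get(lock.target)
            if cur is None or lock.finished_before > cur.finished_before:
                best[lock.target] = lock
        self.locks = list(best.values())

table = LockTable()
table.commit([PredicateLock("rel=t,tuple=1"), PredicateLock("rel=t,tuple=2")], 100)
table.commit([PredicateLock("rel=t,tuple=1")], 200)
```

    After the second commit, the later lock on "rel=t,tuple=1" covers the
    earlier one, so only one lock per target survives.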

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Kevin Grittner at Sep 22, 2010 at 11:14 pm

    Heikki Linnakangas wrote:

    When a transaction commits, its predicate locks must still be held,
    but it's not important anymore *who* holds them, as long as
    they're held for long enough.

    Let's move the finishedBefore field from SERIALIZABLEXACT to
    PREDICATELOCK. When a transaction commits, set the finishedBefore
    field in all the PREDICATELOCKs it holds, and then release the
    SERIALIZABLEXACT struct. The predicate locks stay without an
    associated SERIALIZABLEXACT entry until finishedBefore expires.

    Whenever there are two predicate locks on the same target that
    both belonged to an already-committed transaction, the one with a
    smaller finishedBefore can be dropped, because the one with higher
    finishedBefore value covers it already.
    I don't think this works. Gory details follow.

    The predicate locks only matter when a tuple is being written which
    might conflict with one. In the notation often used for the
    dangerous structures, the conflict only occurs if TN writes
    something which T1 can't read or T1 writes something which T0 can't
    read. When you combine this with the fact that you don't have a
    problem unless TN commits *first*, then you can't have a problem
    with TN looking up a predicate lock of a committed transaction; if
    it's still writing tuples after T1's commit, the conflict can't
    matter and really should be ignored. If T1 is looking up a
    predicate lock for T0 and finds it committed, there are two things
    which must be true for this to generate a real conflict: TN must
    have committed before T0, and T0 must have overlapped T1 -- T0 must
    not have been able to see T1's write. If we have a way to establish
    these two facts without keeping transaction level data for committed
    transactions, predicate lock *lookup* wouldn't stand in the way of
    your proposal.

    Since the writing transaction is active, if the xmin of its starting
    transaction comes before the finishedBefore value, they must have
    overlapped; so I think we have that part covered, and I can't see a
    problem with your proposed use of the earliest finishedBefore value.

    There is a rub on the other point, though. Without transaction
    information you have no way of telling whether TN committed before
    T0, so you would need to assume that it did. So on this count,
    there is bound to be some increase in false positives leading to
    transaction rollback. Without more study, and maybe some tests, I'm
    not sure how significant it is. (Actually, we might want to track
    commit sequence somehow, so we can determine this with greater
    accuracy.)

    But wait, the bigger problems are yet to come.

    The other way we can detect conflicts is a read by a serializable
    transaction noticing that a different and overlapping serializable
    transaction wrote the tuple we're trying to read. How do you
    propose to know that the other transaction was serializable without
    keeping the SERIALIZABLEXACT information? And how do you propose to
    record the conflict without it? The wheels pretty much fall off the
    idea entirely here, as far as I can see.

    Finally, this would preclude some optimizations which I *think* will
    pay off, which trade a few hundred kB more of shared memory, and
    some additional CPU to maintain more detailed conflict data, for a
    lower false positive rate -- meaning fewer transactions rolled back
    for hard-to-explain reasons. This more detailed information is also
    what seems to be desired by Dan S (on another thread) to be able to
    log the information needed to be able to reduce rollbacks.

    -Kevin
  • Heikki Linnakangas at Sep 23, 2010 at 5:21 am

    On 23/09/10 02:14, Kevin Grittner wrote:
    There is a rub on the other point, though. Without transaction
    information you have no way of telling whether TN committed before
    T0, so you would need to assume that it did. So on this count,
    there is bound to be some increase in false positives leading to
    transaction rollback. Without more study, and maybe some tests, I'm
    not sure how significant it is. (Actually, we might want to track
    commit sequence somehow, so we can determine this with greater
    accuracy.)
    I'm confused. AFAICS there is no way to tell if TN committed before T0
    in the current patch either.
    But wait, the bigger problems are yet to come.

    The other way we can detect conflicts is a read by a serializable
    transaction noticing that a different and overlapping serializable
    transaction wrote the tuple we're trying to read. How do you
    propose to know that the other transaction was serializable without
    keeping the SERIALIZABLEXACT information?
    Hmm, I see. We could record which transactions were serializable in a
    new clog-like structure that wouldn't exhaust shared memory.
    And how do you propose to record the conflict without it?
    I thought you just abort the transaction that would cause the conflict
    right there. The other transaction is committed already, so you can't do
    anything about it anymore.
    Finally, this would preclude some optimizations which I *think* will
    pay off, which trade a few hundred kB more of shared memory, and
    some additional CPU to maintain more detailed conflict data, for a
    lower false positive rate -- meaning fewer transactions rolled back
    for hard-to-explain reasons. This more detailed information is also
    what seems to be desired by Dan S (on another thread) to be able to
    log the information needed to be able to reduce rollbacks.
    Ok, I think I'm ready to hear about those optimizations now :-).

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Kevin Grittner at Sep 23, 2010 at 3:08 pm

    Heikki Linnakangas wrote:
    On 23/09/10 02:14, Kevin Grittner wrote:
    There is a rub on the other point, though. Without transaction
    information you have no way of telling whether TN committed
    before T0, so you would need to assume that it did. So on this
    count, there is bound to be some increase in false positives
    leading to transaction rollback. Without more study, and maybe
    some tests, I'm not sure how significant it is. (Actually, we
    might want to track commit sequence somehow, so we can determine
    this with greater accuracy.)
    I'm confused. AFAICS there is no way to tell if TN committed
    before T0 in the current patch either.
    Well, we can certainly infer it if the finishedBefore values differ.
    And, as I said, if we don't eliminate this structure for committed
    transactions, we could add a commitId or some such, with "precedes"
    and "follows" tests similar to TransactionId.
    The other way we can detect conflicts is a read by a serializable
    transaction noticing that a different and overlapping
    serializable transaction wrote the tuple we're trying to read.
    How do you propose to know that the other transaction was
    serializable without keeping the SERIALIZABLEXACT information?
    Hmm, I see. We could record which transactions were serializable
    in a new clog-like structure that wouldn't exhaust shared memory.
    And how do you propose to record the conflict without it?
    I thought you just abort the transaction that would cause the
    conflict right there. The other transaction is committed already,
    so you can't do anything about it anymore.
    No, it always requires a rw-conflict from T0 to T1 and a rw-conflict
    from T1 to TN, as well as TN committing first and (T0 not being READ
    ONLY or TN not overlapping T0). The number and complexity of the
    conditions which must be met to cause a serialization failure are
    what keep the failure rate reasonable. If we start rolling back
    transactions every time one transaction simply reads a row modified
    by a concurrent transaction I suspect that we'd have such a storm of
    serialization failures in most workloads that nobody would want to
    use it.
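    The full set of conditions listed above can be sketched as a small
    predicate (a hedged Python illustration with invented field names and
    commit sequence numbers, not code from the patch):

```python
import math
from dataclasses import dataclass
from typing import Optional

@dataclass
class Xact:
    begin_seq: int
    commit_seq: Optional[int] = None   # None while the transaction runs
    read_only: bool = False

def overlaps(a: Xact, b: Xact) -> bool:
    a_end = a.commit_seq if a.commit_seq is not None else math.inf
    b_end = b.commit_seq if b.commit_seq is not None else math.inf
    return a.begin_seq < b_end and b.begin_seq < a_end

def dangerous(t0: Xact, t1: Xact, tn: Xact,
              rw_t0_t1: bool, rw_t1_tn: bool) -> bool:
    # Both rw-conflicts (T0 -> T1 and T1 -> TN) must exist.
    if not (rw_t0_t1 and rw_t1_tn):
        return False
    # TN must commit first.
    if tn.commit_seq is None:
        return False
    for t in (t0, t1):
        if t.commit_seq is not None and t.commit_seq < tn.commit_seq:
            return False
    # T0 must not be READ ONLY, or TN must not overlap T0.
    if t0.read_only and overlaps(tn, t0):
        return False
    return True

# Receipts-style example: TN committed before read-only T0 took its snapshot.
tn = Xact(begin_seq=0, commit_seq=5)
t1 = Xact(begin_seq=3)                    # still running
t0 = Xact(begin_seq=10, read_only=True)
```

    Here dangerous(t0, t1, tn, True, True) holds because the read-only T0
    started after TN committed; a read-only T0 that overlapped TN would
    not trigger a rollback under these conditions.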
    Finally, this would preclude some optimizations which I *think*
    will pay off, which trade a few hundred kB more of shared memory,
    and some additional CPU to maintain more detailed conflict data,
    for a lower false positive rate -- meaning fewer transactions
    rolled back for hard-to-explain reasons. This more detailed
    information is also what seems to be desired by Dan S (on another
    thread) to be able to log the information needed to be able to
    reduce rollbacks.
    Ok, I think I'm ready to hear about those optimizations now :-).
    Dan Ports is eager to implement "next key" predicate locking for
    indexes, but wants more benchmarks to confirm the benefit. (Most of
    the remaining potential optimizations carry some risk of being
    counter-productive, so we want to go in with something conservative
    and justify each optimization separately.) That one only affects
    your proposal to the extent that the chance to consolidate locks on
    the same target by committed transactions would likely have fewer
    matches to collapse.

    One that I find interesting is the idea that we could give a
    SERIALIZABLE READ ONLY transaction some additional property (perhaps
    DEFERRED or DEFERRABLE) which would cause it to take a snapshot and
    then wait until there were no overlapping non-READ ONLY serializable
    transactions which themselves overlap a running non-READ ONLY
    SERIALIZABLE transaction. At that point it could make a valid
    snapshot which would allow it to run without taking predicate locks
    or checking for conflicts. It would have no
    chance of being rolled back with a serialization failure *or* of
    contributing to the failure of any other transaction, yet it would
    be guaranteed to see a view of the database consistent with the
    actions of all other serializable transactions.
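    The waiting behaviour described above might be sketched as follows
    (invented helper names, purely illustrative of the retry loop, not
    the patch's API):

```python
def deferrable_snapshot(get_snapshot, overlapping_rw_serializable, wait_for):
    """Keep taking snapshots until no overlapping non-READ ONLY
    serializable transaction remains; the snapshot returned is then safe
    to use without predicate locks or conflict checks."""
    while True:
        snap = get_snapshot()
        blockers = overlapping_rw_serializable(snap)
        if not blockers:
            return snap        # safe: cannot see or cause any anomaly
        wait_for(blockers)     # block until those transactions complete

# Tiny simulation: one read-write transaction finishes while we wait.
class RWXact:
    def __init__(self):
        self.finished = False

pending = [RWXact()]

def get_snapshot():
    return "snapshot"

def overlapping_rw_serializable(snap):
    return [x for x in pending if not x.finished]

def wait_for(blockers):
    for x in blockers:
        x.finished = True      # stand-in for sleeping until commit/abort

safe_snap = deferrable_snapshot(get_snapshot, overlapping_rw_serializable,
                                wait_for)
```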

    One place I'm particularly interested in using such a feature is in
    pg_dump. Without it we have the choice of using a SERIALIZABLE
    transaction, which might fail or cause failures (which doesn't seem
    good for a backup program) or using REPEATABLE READ (to get current
    snapshot isolation behavior), which might capture a view of the data
    which contains serialization anomalies. The notion of capturing a
    backup which doesn't comply with business rules enforced by
    serializable transactions gives me the willies, but it would be
    better than not getting a backup reliably, so in the absence of this
    feature, I think we need to change pg_dump to use REPEATABLE READ.
    I can't see how to do this without keeping information on committed
    transactions.

    This next paragraph is copied straight from the Wiki page:

    It appears that when a pivot is formed where T0 is flagged as a
    READ ONLY transaction, and it is concurrent with TN, we can wait to
    see whether anything really needs to roll back. If T1 commits before
    developing a rw-dependency to another transaction with a commit
    early enough to make it visible to T0, the rw-dependency between T0
    and T1 can be removed or ignored. It might even be worthwhile to
    track whether a serializable transaction *has* written to any
    permanent table, so that this optimization can be applied to de
    facto READ ONLY transactions (i.e., not flagged as such, but not
    having done any writes).

    Again, copying from the Wiki "for the record" here:

    It seems that we could guarantee that the retry of a transaction
    rolled back due to a dangerous structure could never immediately
    roll back on the very same conflicts if we always ensure that there
    is a successful commit of one of the participating transactions
    before we roll back. Is it worth it? It seems like it might be,
    because it would ensure that some progress is being made and prevent
    the possibility of endless flailing on any set of transactions. We
    could be sure of this if we:
    * use lists for inConflict and outConflict
    * never roll back until we have a pivot with a commit of the
    transaction on the "out" side
    * never roll back the transaction being committed in the PreCommit
    check
    * have some way to cause another, potentially idle, transaction to
    roll back with a serialization failure SQLSTATE

    I'm afraid this would further boost shared memory usage, but the
    payoff may well be worth it. At one point I did some "back of an
    envelope" calculations, and I think I found that with 200
    connections an additional 640kB of shared memory would allow this.
    On top of the above optimization, just having the lists would allow
    more precise recognition of dangerous structures in heavy load,
    leading to fewer false positives even before you get to the above.
    Right now, if you have two conflicts with different transactions in
    the same direction it collapses to a self-reference, which precludes
    use of optimizations involving TN committing first or T0 being READ
    ONLY.

    Also, if we go to these lists, I think we can provide more of the
    information Dan S. has been requesting for the error detail. We
    could list all transactions which participated in any failure and I
    *think* we could show the statement which triggered the failure with
    confidence that some relation accessed by that statement was
    involved in the conflicts leading to the failure.

    Less important than any of the above, but still significant in my
    book, I fear that conflict recording and dangerous structure
    detection could become very convoluted and fragile if we eliminate
    this structure for committed transactions. Conflicts among specific
    sets of transactions are the linchpin of this whole approach, and I
    think that not having an object to represent each one for the duration
    for which it is significant is dangerous. Inferring information
    from a variety of sources "feels" wrong to me.

    -Kevin
  • Heikki Linnakangas at Sep 23, 2010 at 7:30 pm

    On 23/09/10 18:08, Kevin Grittner wrote:
    Less important than any of the above, but still significant in my
    book, I fear that conflict recording and dangerous structure
    detection could become very convoluted and fragile if we eliminate
    this structure for committed transactions. Conflicts among specific
    sets of transactions are the linchpin of this whole approach, and I
    think that not having an object to represent each one for the duration
    for which it is significant is dangerous. Inferring information
    from a variety of sources "feels" wrong to me.
    Ok, so if we assume that we must keep all the information we have now,
    let me try again with that requirement. My aim is still to put an upper
    bound on the amount of shared memory required, regardless of the number
    of committed but still interesting transactions.

    Cahill's thesis mentions that the per-transaction information can be
    kept in a table like this:

    txnID  beginTime  commitTime  inConf  outConf
    100    1000       1100        N       Y
    101    1000       1500        N       N
    102    1200       N/A         Y       N

    That maps nicely to a SLRU table, truncated from the top as entries
    become old enough, and appended to the end.

    In addition to that, we need to keep track of locks held by each
    transaction, in a finite amount of shared memory. For each predicate
    lock, we need to store the lock tag, and the list of transactions
    holding the lock. The list of transactions is where the problem is,
    there is no limit on its size.

    Conveniently, we already have a way of representing an arbitrary set of
    transactions with a single integer: multi-transactions, in multixact.c.

    Now, we have a little issue in that read-only transactions don't have
    xids, and can't therefore be part of a multixid, but it could be used as
    a model to implement something similar for virtual transaction ids.

    Just a thought, not sure what the performance would be like or how much
    work such a multixid-like structure would be to implement..
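    The interning idea could look roughly like this (a toy sketch with
    invented names, ignoring shared memory, locking, and truncation of
    old entries):

```python
class VXidSetCache:
    """Map each distinct set of (virtual) transaction ids to one small
    integer, so a predicate lock can store a single int no matter how
    many transactions hold it -- analogous to multixact for xids."""

    def __init__(self):
        self._by_set = {}   # frozenset of vxids -> set id
        self._by_id = []    # set id -> frozenset of vxids

    def intern(self, vxids):
        key = frozenset(vxids)
        if key not in self._by_set:
            self._by_set[key] = len(self._by_id)
            self._by_id.append(key)
        return self._by_set[key]

    def members(self, set_id):
        return self._by_id[set_id]

cache = VXidSetCache()
a = cache.intern(["1/100", "2/57"])
b = cache.intern(["2/57", "1/100"])   # same set, regardless of order
```

    Both calls yield the same id, so a lock shared by those two
    transactions costs one integer instead of a variable-length list.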

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Kevin Grittner at Sep 23, 2010 at 8:19 pm

    Heikki Linnakangas wrote:
    On 23/09/10 18:08, Kevin Grittner wrote:
    Less important than any of the above, but still significant in my
    book, I fear that conflict recording and dangerous structure
    detection could become very convoluted and fragile if we
    eliminate this structure for committed transactions. Conflicts
    among specific sets of transactions are the linchpin of this
    whole approach, and I think that not having an object to represent
    each one for the duration for which it is significant is
    dangerous. Inferring information from a variety of sources
    "feels" wrong to me.
    Ok, so if we assume that we must keep all the information we have
    now, let me try again with that requirement. My aim is still to
    put an upper bound on the amount of shared memory required,
    regardless of the number of committed but still interesting
    transactions.

    Cahill's thesis mentions that the per-transaction information can
    be kept in a table like this:

    txnID  beginTime  commitTime  inConf  outConf
    100    1000       1100        N       Y
    101    1000       1500        N       N
    102    1200       N/A         Y       N

    That maps nicely to a SLRU table, truncated from the top as
    entries become old enough, and appended to the end.
    Well, the inConf and outConf were later converted to pointers in
    Cahill's work, and our MVCC implementation doesn't let us use times
    quite that way -- we're using xmins and such, but I assume the point
    holds regardless of such differences. (I mostly mention it to avoid
    confusion for more casual followers of the thread.)
    In addition to that, we need to keep track of locks held by each
    transaction, in a finite amount of shared memory. For each
    predicate lock, we need to store the lock tag, and the list of
    transactions holding the lock. The list of transactions is where
    the problem is, there is no limit on its size.

    Conveniently, we already have a way of representing an arbitrary
    set of transactions with a single integer: multi-transactions, in
    multixact.c.

    Now, we have a little issue in that read-only transactions don't
    have xids, and can't therefore be part of a multixid, but it could
    be used as a model to implement something similar for virtual
    transaction ids.

    Just a thought, not sure what the performance would be like or how
    much work such a multixid-like structure would be to implement..
    You're pointing toward some code I haven't yet laid eyes on, so it
    will probably take me a few days to really digest your suggestion
    and formulate an opinion. This is just to let you know I'm working
    on it.

    I really appreciate your attention to this. Thanks!

    -Kevin
  • Kevin Grittner at Sep 24, 2010 at 4:18 pm

    Heikki Linnakangas wrote:

    My aim is still to put an upper bound on the amount of shared
    memory required, regardless of the number of committed but still
    interesting transactions.
    That maps nicely to a SLRU table
    Well, that didn't take as long to get my head around as I feared.

    I think SLRU would totally tank performance if used for this, and
    would really not put much of a cap on the memory taken out of
    circulation for purposes of caching. Transactions are not
    referenced more heavily at the front of the list nor are they
    necessarily discarded more or less in order of acquisition. In
    transaction mixes where all transactions last about the same length
    of time, the upper limit of interesting transactions is about twice
    the number of active transactions, so memory demands are pretty
    light. The problems come in where you have at least one long-lived
    transaction and a lot of concurrent short-lived transactions. Since
    all transactions are scanned for cleanup every time a transaction
    completes, either they would all be taking up cache space or
    performance would drop to completely abysmal levels as it pounded
    disk. So SLRU in this case would be a sneaky way to effectively
    dynamically allocate shared memory, but about two orders of
    magnitude slower, at best.

    Here are the things which I think might be done, in some
    combination, to address your concern without killing performance:

    (1) Mitigate memory demand through more aggressive cleanup. As an
    example, a transaction which is READ ONLY (or which hasn't written
    to a relevant table as tracked by a flag in the transaction
    structure) is not of interest after commit, and can be immediately
    cleaned up, unless there is a concurrent non-read-only transaction
    which overlaps a committed transaction which wrote data. This is
    clearly not a solution to your concern in itself, but it combines
    with the other suggestions to make them more effective.

    (2) Similar to SLRU, allocate pages from shared buffers for lists,
    but pin them in memory without ever writing them to disk. A buffer
    could be freed when the last list item in it was freed and the
    buffer count for the list was above some minimum. This could deal
    with the episodic need for larger than typical amounts of RAM
    without permanently taking large quantities out of circulation.
    Obviously, we would still need some absolute cap, so this by itself
    doesn't answer your concern, either -- it just allows the impact to
    scale to the need dynamically and within bounds. It has the same
    effective impact on memory usage as SLRU for this application
    without the same performance penalty.
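    Idea (2) might be modeled like this (a toy Python sketch with
    invented constants; real pages would come from shared buffers and be
    pinned, not Python lists):

```python
PAGE_CAPACITY = 4   # invented; a real page would hold many list nodes
MIN_PAGES = 1       # floor below which empty pages are kept around

class PagedList:
    def __init__(self):
        self.pages = [[]]          # each page holds up to PAGE_CAPACITY items

    def add(self, item):
        for page in self.pages:
            if len(page) < PAGE_CAPACITY:
                page.append(item)
                return
        self.pages.append([item])  # "pin" another page on demand

    def remove(self, item):
        for page in self.pages:
            if item in page:
                page.remove(item)
                if not page and len(self.pages) > MIN_PAGES:
                    self.pages.remove(page)   # release the now-empty page
                return

lst = PagedList()
for i in range(5):
    lst.add(i)      # four items fill the first page; the fifth pins a second
lst.remove(4)       # the second page empties and is released
```

    Memory use grows with episodic demand and shrinks back as items are
    freed, which is the dynamic-within-bounds behaviour described above.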

    (3) Here's the meat of it. When the lists hit their maximum, have
    some way to gracefully degrade the accuracy of the conflict
    tracking. This is similar to your initial suggestion that once a
    transaction committed we would not track it in detail, but
    implemented "at need" when memory resources for tracking the detail
    become exhausted. I haven't worked out all the details, but I have
    a rough outline in my head. I wanted to run this set of ideas past
    you before I put the work in to fully develop it. This would be an
    alternative to just canceling the oldest running serializable
    transaction, which is the solution we could use right now to live
    within some set limit, possibly with (1) or (2) to help push back
    the point at which that's necessary. Rather than deterministically
    canceling the oldest active transaction, it would increase the
    probability of transactions being canceled because of false
    positives, with the chance we'd get through the peak without any
    such cancellations.

    Thoughts?

    -Kevin
  • Robert Haas at Sep 24, 2010 at 5:06 pm

    On Fri, Sep 24, 2010 at 12:17 PM, Kevin Grittner wrote:
    Thoughts?
    Premature optimization is the root of all evil. I'm not convinced
    that we should tinker with any of this before committing it and
    getting some real-world experience. It's not going to be perfect in
    the first version, just like any other major feature.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise Postgres Company
  • Kevin Grittner at Sep 24, 2010 at 5:36 pm

    Robert Haas wrote:
    On Fri, Sep 24, 2010 at 12:17 PM, Kevin Grittner
    wrote:
    Thoughts?
    Premature optimization is the root of all evil. I'm not convinced
    that we should tinker with any of this before committing it and
    getting some real-world experience. It's not going to be perfect
    in the first version, just like any other major feature.
    In terms of pure optimization, I totally agree -- that's why I'm
    submitting early without a number of potential optimizations. I
    think we're better off getting a solid base and then attempting to
    prove the merits of each optimization separately. The point Heikki
    is on about, however, gets into user-facing behavior issues. The
    current implementation will give users an "out of shared memory"
    error if they attempt to start a SERIALIZABLE transaction when our
    preallocated shared memory for tracking such transactions reaches
    its limit. A fairly easy alternative would be to kill running
    SERIALIZABLE transactions, starting with the oldest, until a new
    request can proceed. The question is whether either of these is
    acceptable behavior for an initial implementation, or whether
    something fancier is needed up front.

    Personally, I'd be fine with "out of shared memory" for an excess of
    SERIALIZABLE transactions for now, and leave refinement for later --
    I just want to be clear that there is user-visible behavior involved
    here.

    -Kevin
  • Robert Haas at Sep 24, 2010 at 5:49 pm

    On Fri, Sep 24, 2010 at 1:35 PM, Kevin Grittner wrote:
    Robert Haas wrote:
    On Fri, Sep 24, 2010 at 12:17 PM, Kevin Grittner
    wrote:
    Thoughts?
    Premature optimization is the root of all evil.  I'm not convinced
    that we should tinker with any of this before committing it and
    getting some real-world experience.  It's not going to be perfect
    in the first version, just like any other major feature.
    In terms of pure optimization, I totally agree -- that's why I'm
    submitting early without a number of potential optimizations.  I
    think we're better off getting a solid base and then attempting to
    prove the merits of each optimization separately.  The point Heikki
    is on about, however, gets into user-facing behavior issues.  The
    current implementation will give users an "out of shared memory"
    error if they attempt to start a SERIALIZABLE transaction when our
    preallocated shared memory for tracking such transactions reaches
    its limit.  A fairly easy alternative would be to kill running
    SERIALIZABLE transactions, starting with the oldest, until a new
    request can proceed.  The question is whether either of these is
    acceptable behavior for an initial implementation, or whether
    something fancier is needed up front.

    Personally, I'd be fine with "out of shared memory" for an excess of
    SERIALIZABLE transactions for now, and leave refinement for later --
    I just want to be clear that there is user-visible behavior involved
    here.
    Yeah, I understand, but I think the only changes we should make now
    are things that we're sure are improvements. I haven't read the code,
    but based on reading the thread so far, we're off into the realm of
    speculating about trade-offs, and I'm not sure that's a good place for
    us to be.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise Postgres Company
  • Kevin Grittner at Sep 24, 2010 at 7:23 pm

    Robert Haas wrote:

    I think the only changes we should make now are things that we're
    sure are improvements.
    In that vein, anyone who is considering reviewing the patch should
    check the latest from the git repo or request an incremental patch.
    I've committed a few things since the last patch post, but it
    doesn't seem to make sense to repost the whole thing for them. I
    fixed a bug in the new shared memory list code, fixed a misleading
    hint, and fixed some whitespace and comment issues.

    The changes I've committed to the repo so far based on Heikki's
    comments are, I feel, clear improvements. It was actually fairly
    embarrassing that I didn't notice some of that myself.
    based on reading the thread so far, we're off into the realm of
    speculating about trade-offs
    This latest issue seems that way to me. We're talking about
    somewhere around 100 kB of shared memory in a 64 bit build with the
    default number of connections, with a behavior on exhaustion which
    matches what we do on normal locks. This limit is easier to hit,
    and we should probably revisit it, but I am eager to get the feature
    as a whole in front of people, to see how well it works for them in
    other respects.

    I'll be quite surprised if we've found all the corner cases, but it
    is working, and working well, in a variety of tests. It has been
    for months, really; I've been holding back, as requested, to avoid
    distracting people from the 9.0 release.

    -Kevin
  • Greg Stark at Sep 25, 2010 at 12:06 pm

    On Thu, Sep 23, 2010 at 4:08 PM, Kevin Grittner wrote:
    One place I'm particularly interested in using such a feature is in
    pg_dump. Without it we have the choice of using a SERIALIZABLE
    transaction, which might fail or cause failures (which doesn't seem
    good for a backup program) or using REPEATABLE READ (to get current
    snapshot isolation behavior), which might capture a view of the data
    which contains serialization anomalies.
    I'm puzzled how pg_dump could possibly have serialization anomalies.
    Snapshot isolation gives pg_dump a view of the database containing all
    modifications committed before it started and no modifications which
    committed after it started. Since pg_dump makes no database
    modifications itself it can always just be taken to occur
    instantaneously before any transaction which committed after it
    started.

    --
    greg
  • Nicolas Barbier at Sep 25, 2010 at 1:31 pm
    [ Forgot the list, resending. ]

    2010/9/25 Greg Stark <gsstark@mit.edu>:
    On Thu, Sep 23, 2010 at 4:08 PM, Kevin Grittner
    wrote:
    One place I'm particularly interested in using such a feature is in
    pg_dump. Without it we have the choice of using a SERIALIZABLE
    transaction, which might fail or cause failures (which doesn't seem
    good for a backup program) or using REPEATABLE READ (to get current
    snapshot isolation behavior), which might capture a view of the data
    which contains serialization anomalies.
    I'm puzzled how pg_dump could possibly have serialization anomalies.
    Snapshot isolation gives pg_dump a view of the database containing all
    modifications committed before it started and no modifications which
    committed after it started. Since pg_dump makes no database
    modifications itself it can always just be taken to occur
    instantaneously before any transaction which committed after it
    started.
    I guess that Kevin is referring to [1], where the dump would take the
    role of T3. That would mean that the dump itself must be aborted
    because it read inconsistent data.

    AFAICS, whether that reasoning means that a dump can produce an
    "inconsistent" backup is debatable. After restoring, all transactions
    that would have been in-flight at the moment the dump took its
    snapshot are gone, so none of their effects "happened". We would be in
    exactly the same situation as if all running transactions would be
    forcibly aborted at the moment that the dump would have started.

    OTOH, if one would compare the backup with what really happened,
    things may look inconsistent. The dump would show what T3 witnessed
    (i.e., the current date is incremented and the receipts table is
    empty), although the current state of the database system shows
    otherwise (i.e., the current date is incremented and the receipts
    table has an entry for the previous date).

    IOW, one could say that the backup is consistent only if it were never
    compared against the system as it continued running after the dump
    took place.

    This stuff will probably confuse the hell out of most DBAs :-).

    Nicolas

    [1] <URL:http://archives.postgresql.org/pgsql-hackers/2010-05/msg01360.php>
  • Tom Lane at Sep 25, 2010 at 2:45 pm

    Greg Stark writes:
    On Thu, Sep 23, 2010 at 4:08 PM, Kevin Grittner
    wrote:
    One place I'm particularly interested in using such a feature is in
    pg_dump. Without it we have the choice of using a SERIALIZABLE
    transaction, which might fail or cause failures (which doesn't seem
    good for a backup program) or using REPEATABLE READ (to get current
    snapshot isolation behavior), which might capture a view of the data
    which contains serialization anomalies.
    I'm puzzled how pg_dump could possibly have serialization anomalies.
    At the moment, it can't. If this patch means that it can, that's going
    to be a mighty good reason not to apply the patch.

    regards, tom lane
  • Robert Haas at Sep 26, 2010 at 3:15 am

    On Sat, Sep 25, 2010 at 10:45 AM, Tom Lane wrote:
    Greg Stark <gsstark@mit.edu> writes:
    On Thu, Sep 23, 2010 at 4:08 PM, Kevin Grittner
    wrote:
    One place I'm particularly interested in using such a feature is in
    pg_dump. Without it we have the choice of using a SERIALIZABLE
    transaction, which might fail or cause failures (which doesn't seem
    good for a backup program) or using REPEATABLE READ (to get current
    snapshot isolation behavior), which might capture a view of the data
    which contains serialization anomalies.
    I'm puzzled how pg_dump could possibly have serialization anomalies.
    At the moment, it can't.  If this patch means that it can, that's going
    to be a mighty good reason not to apply the patch.
    It certainly can, as can any other read-only transaction. This has
    been discussed many times here before with detailed examples, mostly
    by Kevin. T0 reads A and writes B. T1 then reads B and writes C. T0
    commits. pg_dump runs. T1 commits. What is the fully serial order
    of execution consistent with this chronology? Clearly, T1 must be run
    before T0, since it doesn't see T0's update to B. But pg_dump sees
    the effects of T0 but not T1, so T0 must be run before T1. Oops. Now
    you might say that this won't be a problem for most people in
    practice, and I think that's true, but it's still unserializable. And
    pg_dump is the reason, because otherwise T1 then T0 would be a valid
    serialization.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise Postgres Company
  • Kevin Grittner at Sep 20, 2010 at 2:10 pm

    I wrote:
    Heikki Linnakangas wrote:
    ISTM you never search the SerializableXactHash table using a hash
    key, except the one call in CheckForSerializableConflictOut, but
    there you already have a pointer to the SERIALIZABLEXACT struct.
    You only re-find it to make sure it hasn't gone away while you
    trade the shared lock for an exclusive one. If we find another
    way to ensure that, ISTM we don't need SerializableXactHash at
    all. My first thought was to forget about VirtualTransactionId
    and use TransactionId directly as the hash key for
    SERIALIZABLEXACT. The problem is that a transaction doesn't have
    a transaction ID when RegisterSerializableTransaction is called.
    We could leave the TransactionId blank and only add the
    SERIALIZABLEXACT struct to the hash table when an XID is
    assigned, but there's no provision to insert an existing struct
    to a hash table in the current hash table API.

    So, I'm not sure of the details yet, but it seems like it could
    be made simpler somehow..
    After tossing it around in my head for a bit, the only thing that
    I see (so far) which might work is to maintain a *list* of
    SERIALIZABLEXACT objects in memory rather than a using a hash
    table. The recheck after releasing the shared lock and acquiring
    an exclusive lock would then go through SerializableXidHash. I
    think that can work, although I'm not 100% sure that it's an
    improvement. I'll look it over in more detail. I'd be happy to
    hear your thoughts on this or any other suggestions.
    I haven't come up with any better ideas. Pondering this one, it
    seems to me that a list would be better than a hash table if we had
    a list which would automatically allocate and link new entries, and
    would maintain a list of available entries for (re)use. I wouldn't
    want to sprinkle such an implementation in with predicate locking
    and SSI code, but if there is a feeling that such a thing would be
    worth having in shmqueue.c or some new file which uses the SHM_QUEUE
    structure to provide an API for such functionality, I'd be willing
    to write that and use it in the SSI code. Without something like
    that, I have so far been unable to envision an improvement along the
    lines Heikki is suggesting here.
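    For what it's worth, the sort of API Kevin sketches here (fixed-capacity
    allocation with automatic reuse of released entries) might look like
    the following in outline. This is Python standing in for the
    SHM_QUEUE-based C code purely for illustration; the class and names
    are invented and not part of the patch:

```python
from collections import deque

class FixedPool:
    """Fixed-size pool of reusable entries: an 'active' list plus a
    free list, mimicking in outline a shared-memory list that hands
    out and recycles SERIALIZABLEXACT-style entries."""
    def __init__(self, capacity, factory):
        self._free = deque(factory() for _ in range(capacity))
        self.active = []                  # stand-in for the linked list

    def alloc(self):
        if not self._free:
            return None                   # shared memory is fixed-size
        entry = self._free.popleft()
        self.active.append(entry)
        return entry

    def release(self, entry):
        self.active.remove(entry)
        self._free.append(entry)          # recycle, never really free
```

    The point of the free list is exactly what Kevin describes: new
    entries are linked automatically on allocation, and released entries
    go back on an available list for reuse rather than being freed.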

    Thoughts?

    -Kevin
  • Kevin Grittner at Sep 22, 2010 at 6:42 pm

    I wrote:
    Heikki Linnakangas wrote:
    ISTM you never search the SerializableXactHash table using a hash
    key, except the one call in CheckForSerializableConflictOut, but
    there you already have a pointer to the SERIALIZABLEXACT struct.
    You only re-find it to make sure it hasn't gone away while you
    trade the shared lock for an exclusive one. If we find another
    way to ensure that, ISTM we don't need SerializableXactHash at
    all.
    it seems like it could be made simpler somehow..
    After tossing it around in my head for a bit, the only thing that
    I see (so far) which might work is to maintain a *list* of
    SERIALIZABLEXACT objects in memory rather than a using a hash
    table. The recheck after releasing the shared lock and acquiring
    an exclusive lock would then go through SerializableXidHash.
    After discussion on a separate thread, I replaced that hash table
    with a home-grown shared memory list. I had to create a patch at
    that point due to the git migration, so I figured I might as well
    post it, too. There have been some non-trivial changes due to
    feedback on the prior posting.

    -Kevin
  • Kevin Grittner at Sep 25, 2010 at 3:24 pm

    Greg Stark wrote:
    Kevin Grittner wrote:
    One place I'm particularly interested in using such a feature is
    in pg_dump. Without it we have the choice of using a SERIALIZABLE
    transaction, which might fail or cause failures (which doesn't
    seem good for a backup program) or using REPEATABLE READ (to get
    current snapshot isolation behavior), which might capture a view
    of the data which contains serialization anomalies.
    I'm puzzled how pg_dump could possibly have serialization
    anomalies. Snapshot isolation gives pg_dump a view of the database
    containing all modifications committed before it started and no
    modifications which committed after it started. Since pg_dump makes
    no database modifications itself it can always just be taken to
    occur instantaneously before any transaction which committed after
    it started.
    Well, in the SQL-92 standard[1] the definition of serializable
    transactions was changed to the following:
    The execution of concurrent SQL-transactions at isolation level
    SERIALIZABLE is guaranteed to be serializable. A serializable
    execution is defined to be an execution of the operations of
    concurrently executing SQL-transactions that produces the same
    effect as some serial execution of those same SQL-transactions. A
    serial execution is one in which each SQL-transaction executes to
    completion before the next SQL-transaction begins.
    It hasn't changed since then.

    Some people in the community cling to the notion, now obsolete for
    almost two decades, that serializable transactions are defined by the
    same anomalies which define the other three levels of transaction
    isolation. (It's probably time to catch up on that one.) Note that
    the above does *not* say "it's OK for a SELECT in a transaction
    executing at the serializable isolation level to produce results
    which are not consistent with any serial execution of serializable
    transactions, as long as the database *eventually* reaches a state
    where a repeat of the same SELECT in a new transaction produces
    results consistent with such execution." Under the standard, even
    read-only transactions have to follow the rules, and I think most
    people would want that.

    Now, commit order in itself doesn't directly affect the apparent
    order of execution. It only constrains the apparent order of
    execution to the extent that multiple transactions access the same
    data. Read-read "conflicts" don't matter in this scheme, and
    write-write conflicts are already handled under snapshot isolation by
    preventing both writers from committing -- if one commits, the other
    is forced to roll back with a serialization failure. That leaves
    write-read and read-write conflicts.

    In write-read conflicts, one transaction writes data and then commits
    in time for another transaction to see it. This implies that the
    writing transaction came first in the apparent order of
    execution. Now for the tricky one: in read-write conflicts (often
    written as rw-conflict) the reading transaction cannot see the write
    of a concurrent transaction because of snapshot visibility rules.
    Since the reading transaction is unable to see the work of the writing
    transaction, it must be considered to have come first in the apparent
    order of execution.

    In order to have a serialization anomaly under snapshot isolation,
    you need a situation like this: a transaction which I'll call T0
    (matching much discussion on the topic published in recent years) has
    a rw-dependency on a transaction concurrent to T0, which I'll call
    T1. In addition, T1 has a rw-dependency on a transaction which is
    concurrent to it, which I'll call TN. The reason it's not T2 is that
    it can be the same transaction as T0 or a third transaction. So,
    because of the rw-conflicts, T0 appears to execute before T1, which
    appears to execute before TN. (At this point it should be obvious
    why you've got a problem if T0 and TN are the same transaction.)

    If T0 and TN are distinct, we still haven't quite met the conditions
    required to produce a serialization anomaly, however. The next
    requirement is that TN (the third in apparent order of execution)
    actually commits first. At this point, the third transaction's
    writes are exposed for the world to see, while there are still two
    uncommitted transactions which appear to have committed first. There
    are so many ways that this can lead to a cycle in apparent order of
    execution, some of which can happen in the client application, that
    Serializable Snapshot Isolation (SSI) doesn't pretend to track that.

    Barring one exception that I worked out myself (although I'd be
    shocked if someone didn't beat me to it and I just haven't found a
    paper describing it in my researches), the above describes the
    conditions under which one of the transactions must be rolled back to
    prevent a serialization anomaly.
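    The conditions in the preceding paragraphs can be condensed into a
    small predicate. This is an illustrative sketch only: the `Txn`
    fields and the function are invented for exposition and bear no
    relation to the patch's actual shared-memory structures:

```python
from dataclasses import dataclass, field
from typing import Optional, Set

@dataclass
class Txn:
    name: str
    rw_out: Set[str] = field(default_factory=set)  # txns whose writes this txn failed to see
    commit_seq: Optional[int] = None               # None = still uncommitted

def dangerous_structure(t0: Txn, t1: Txn, tn: Txn) -> bool:
    """True when T0 -rw-> T1 -rw-> TN and TN committed first: the
    pattern under which SSI must roll one transaction back."""
    if t1.name not in t0.rw_out or tn.name not in t1.rw_out:
        return False                               # no rw-dependency chain
    if tn.commit_seq is None:
        return False                               # TN hasn't committed
    # TN, last in apparent order of execution, must be first in commit order
    return all(t.commit_seq is None or tn.commit_seq < t.commit_seq
               for t in (t0, t1))
```

    Note that the predicate deliberately ignores what T0 and TN wrote or
    read relative to each other: as described above, completing the cycle
    can happen outside the database entirely, so SSI treats the
    structure itself as sufficient grounds for a rollback.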

    The exception is interesting here, though. It is that if T0 is a
    READ ONLY transaction, it can't participate in an anomaly unless TN
    commits before T0 acquires its snapshot. This observation is based
    on the fact that since a READ ONLY transaction can't appear to come
    after another transaction based on a rw-conflict, I can see only two
    ways that it can appear to come after TN and thereby complete the
    cycle in the apparent order of execution:
    (1) There is a wr-conflict where T0 successfully reads a write from
    TN, or
    (2) application software outside the database receives confirmation
    of the commit of TN before it starts T0.

    [If anyone can see a way to complete the cycle in apparent order of
    execution when T0 is READ ONLY without having TN commit before T0
    acquires its snapshot, please let me know. I'm basing several
    optimizations on this, and it is an innovation I have not yet found
    mentioned in the literature.]

    OK, to get back to the question -- pg_dump's transaction (T0) could
    see an inconsistent version of the database if one transaction (TN)
    writes to a table, another transaction (T1) overlaps TN and can't
    read something written by TN because they are concurrent, TN commits
    before T0 acquires its snapshot, T1 writes to a table, T0 starts
    before T1 commits, and T0 can't read something which T1 wrote (which
    is sort of a given for a database dump and overlapping transactions).

    So that's the theory. For a practical example, consider the
    receipting example which I've posted to the list multiple times
    before. (TN updates a control record with a deposit date, and T1 is
    a receipt which uses the deposit date from the control record.) In
    this not-atypical accounting system, any date before the current
    deposit date is "closed" -- the receipts for each previous date were
    set from a control record before it advanced. If pg_dump's
    transaction plays the role of T0 here, it could capture the updated
    control record, but not a receipt based on the prior value of that
    control record. That is, it captures the effects of a *later*
    transaction, but not the *earlier* one.

    One last thing -- if anyone who has read the Serializable Wiki page
    didn't already understand the reason why I had a concern about
    pg_dump here, and the above helped clarify it, feel free to draw from
    this and update the Wiki page, or let me know what was helpful, and I
    will.

    -Kevin

    [1] http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt
  • Greg Stark at Sep 25, 2010 at 4:13 pm

    On Sat, Sep 25, 2010 at 4:24 PM, Kevin Grittner wrote:
    OK, to get back to the question -- pg_dump's transaction (T0) could
    see an inconsistent version of the database if one transaction (TN)
    writes to a table, another transaction (T1) overlaps TN and can't
    read something written by TN because they are concurrent, TN commits
    before T0 acquires its snapshot, T1 writes to a table, T0 starts
    before T1 commits, and T0 can't read something which T1 wrote (which
    is sort of a given for a database dump and overlapping transactions).
    Can I collapse this into a single list of events (apologies, this
    isn't going to line up, I'm writing it in a proportional font :( )

    TN starts
    T1 starts
    TN writes
    T1 reads
    TN commits
    T0 starts (pg_dump)
    T1 writes
    T0 reads (pg_dump)
    T1 commits

    So T1 must have happened before TN because it wrote something based on
    data as it was before TN modified it. But T0 can see TN but not T1 so
    there's no complete ordering between the three transactions that makes
    them all make sense.

    The thing is that the database state is reasonable; it is as it
    would be if the ordering were T1,TN with T0 happening any time. And
    the backup state is reasonable; it's as if it occurred after TN and
    before T1. They just don't agree.
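    The timeline above can be replayed with a toy visibility model
    (purely illustrative; PostgreSQL's actual MVCC snapshot bookkeeping
    is of course more involved than a set of committed transactions):

```python
# Toy snapshot model: a snapshot is just the set of transactions
# that had committed at the moment it was taken.
committed = []

def commit(txn):
    committed.append(txn)

def take_snapshot():
    return set(committed)

# The schedule from the timeline above (reads and writes elided;
# only the commit points and T0's snapshot matter for visibility):
commit("TN")                  # TN writes, then commits
t0_snapshot = take_snapshot() # T0 (pg_dump) starts here
commit("T1")                  # T1, which read before TN's write, commits

# The dump sees TN's work but not T1's -- yet T1 logically precedes
# TN, so no serial order of the three transactions explains the dump.
assert "TN" in t0_snapshot and "T1" not in t0_snapshot
```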

    --
    greg
  • Kevin Grittner at Sep 25, 2010 at 3:35 pm

    Nicolas Barbier wrote:

    IOW, one could say that the backup is consistent only if it were
    never compared against the system as it continued running after the
    dump took place.
    Precisely. I considered making that point in the email I just sent,
    but figured I had rambled enough. I suppose I should have gone
    there; thanks for covering the omission. :-)

    -Kevin
  • Kevin Grittner at Sep 25, 2010 at 10:28 pm

    Greg Stark wrote:

    So T1 must have happened before TN because it wrote something based
    on data as it was before TN modified it. But T0 can see TN but not
    T1 so there's no complete ordering between the three transactions
    that makes them all make sense.
    Correct.
    The thing is that the database state is reasonable, the database
    state is after it would be if the ordering were T1,TN with T0
    happening any time. And the backup state is reasonable, it's as if
    it occurred after TN and before T1. They just don't agree.
    I agree that the database state eventually "settles" into a valid
    long-term condition in this particular example. The point you are
    conceding seems to be that the image captured by pg_dump is not
    consistent with that. If so, I agree. You don't see that as a
    problem; I do. I'm not sure where we go from there. Certainly that
    is better than making pg_dump vulnerable to serialization failure --
    if we don't implement the SERIALIZABLE READ ONLY DEFERRABLE
    transactions I was describing, we can change pg_dump to use
    REPEATABLE READ and we will be no worse off than we are now.

    The new feature I was proposing was that we create a SERIALIZABLE
    READ ONLY DEFERRABLE transaction style which would, rather than
    acquiring predicate locks and watching for conflicts, potentially
    wait until it could acquire a snapshot which was guaranteed to be
    conflict-free. In the example discussed on this thread, if we
    changed pg_dump to use such a mode, when it went to acquire a
    snapshot it would see that it overlapped T1, which was not READ ONLY,
    which in turn overlapped TN, which had written to a table and
    committed. It would then block until completion of the T1
    transaction and adjust its snapshot to make that transaction visible.
    You would now have a backup entirely consistent with the long-term
    state of the database, with no risk of serialization failure and no
    bloating of the predicate lock structures.

    The only down side is that there could be blocking when such a
    transaction acquires its snapshot. That seems a reasonable price to
    pay for backup integrity. Obviously, if we had such a mode, it would
    be trivial to add a switch to the pg_dump command line which would
    let the user choose between guaranteed dump integrity and guaranteed
    lack of blocking at the start of the dump.
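    In outline, the proposed behavior amounts to "wait until no dangerous
    structure can involve this snapshot, then proceed with no predicate
    locking at all." The sketch below is a deliberate simplification
    with invented names (`safe_snapshot`, the polling loop); an
    in-server implementation would presumably sleep on the relevant
    transactions rather than poll:

```python
import time

def safe_snapshot(active_txns):
    """Illustrative safety test: with only READ ONLY transactions in
    flight, no concurrent writer can place this snapshot inside a
    dangerous structure, so the snapshot is anomaly-free."""
    return all(t.get("read_only") for t in active_txns)

def acquire_deferrable_snapshot(get_active_txns, take_snapshot, poll=0.1):
    # Block (possibly repeatedly) until a safe snapshot is available.
    # Once this returns, the transaction needs no predicate locks and
    # can never be rolled back with a serialization failure.
    while not safe_snapshot(get_active_txns()):
        time.sleep(poll)
    return take_snapshot()
```

    The trade-off is exactly the one described above: possible blocking
    at snapshot acquisition, in exchange for zero overhead and zero
    rollback risk for the rest of the transaction.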

    -Kevin
  • Greg Stark at Sep 25, 2010 at 10:38 pm
    Just to be clear I wasn't saying it was or wasn't a problem, I was just
    trying to see if I understand the problem and if I do maybe help bring
    others up to speed.
    On 25 Sep 2010 23:28, "Kevin Grittner" wrote:
    Greg Stark wrote:
    So T1 must have happened before TN because it wrote something based
    on data as it was before TN modified it. But T0 can see TN but not
    T1 so there's no complete ordering between the three transactions
    that makes them all make sense.
    Correct.
    The thing is that the database state is reasonable; it is as it
    would be if the ordering were T1,TN with T0 happening any time. And
    the backup state is reasonable; it's as if it occurred after TN and
    before T1. They just don't agree.
    I agree that the database state eventually "settles" into a valid
    long-term condition in this particular example. The point you are
    conceding seems to be that the image captured by pg_dump is not
    consistent with that. If so, I agree. You don't see that as a
    problem; I do. I'm not sure where we go from there. Certainly that
    is better than making pg_dump vulnerable to serialization failure --
    if we don't implement the SERIALIZABLE READ ONLY DEFERRABLE
    transactions I was describing, we can change pg_dump to use
    REPEATABLE READ and we will be no worse off than we are now.

    The new feature I was proposing was that we create a SERIALIZABLE
    READ ONLY DEFERRABLE transaction style which would, rather than
    acquiring predicate locks and watching for conflicts, potentially
    wait until it could acquire a snapshot which was guaranteed to be
    conflict-free. In the example discussed on this thread, if we
    changed pg_dump to use such a mode, when it went to acquire a
    snapshot it would see that it overlapped T1, which was not READ ONLY,
    which in turn overlapped TN, which had written to a table and
    committed. It would then block until completion of the T1
    transaction and adjust its snapshot to make that transaction visible.
    You would now have a backup entirely consistent with the long-term
    state of the database, with no risk of serialization failure and no
    bloating of the predicate lock structures.

    The only down side is that there could be blocking when such a
    transaction acquires its snapshot. That seems a reasonable price to
    pay for backup integrity. Obviously, if we had such a mode, it would
    be trivial to add a switch to the pg_dump command line which would
    let the user choose between guaranteed dump integrity and guaranteed
    lack of blocking at the start of the dump.

    -Kevin
  • Kevin Grittner at Sep 26, 2010 at 3:52 am

    Greg Stark wrote:

    Just to be clear I wasn't saying it was or wasn't a problem, I was
    just trying to see if I understand the problem and if I do maybe
    help bring others up to speed.
    Thanks for that, and my apologies for misunderstanding you. It does
    sound like you have a firm grasp on my concern, and you expressed it
    clearly and concisely.

    To build on what you said, there are two properties which bother me
    about a backup based on snapshot isolation in an environment where
    data integrity is enforced with application code, including
    user-written triggers.

    (1) If the backup is used to make a copy of the database for running
    reports, while the main database continues to be updated, it could
    represent a state which never existed in the database, at least as
    visible to serializable transactions. This makes any reports run
    against it of dubious value.

    (2) It may not be possible to update such a copy of the database to
    the later state of the database using replay of the same transaction
    stream, in any order -- particularly if the recovery counts on firing
    of the same triggers to supply default data or enforce integrity
    rules which were in place during the initial run. To continue with
    the receipting example, replaying the receipt which was in flight
    during the pg_dump run would result in assigning the wrong deposit
    date. This could cause problems for some replication or recovery
    strategies, although it's not apparent that it breaks anything
    actually shipped in the base PostgreSQL distribution.

    FWIW, it seems premature to spend a lot of time on the concept of the
    DEFERRABLE (or whatever term we might choose) transaction snapshots.
    I think I'll update the patch to use REPEATABLE READ for pg_dump for
    now, and we can revisit this as a possible enhancement later. As you
    point out, it doesn't really impact the usefulness of the dump for
    crash recovery beyond the issue I mention in point 2 above, and
    that's only going to come into play for certain recovery strategies.
    Even then, it's only an issue if pg_dump gets its snapshot while
    certain very specific conditions exist. And we're certainly in no
    worse shape regarding these concerns with the patch than without;
    they're long-standing issues.

    -Kevin

Discussion Overview
group: pgsql-hackers
category: postgresql
posted: Sep 14, '10 at 4:35p
active: Sep 26, '10 at 3:52a
posts: 44
users: 8
website: postgresql.org...
irc: #postgresql