Due to popular request (Hey, David's popular, right?), I'm posting a
patch for Serializable Snapshot Isolation (SSI), although I don't
yet have everything in it that I was planning on submitting before
the CF. I will probably be submitting another version before the
deadline with the following, but there should be plenty here for
people to test and benchmark. We're done with the major refactoring
needed to address concerns raised in earlier reviews, and I don't
expect the remaining work to destabilize what's there or to have a
significant impact on performance.

(1) A README file needs to be prepared, mostly by copy/paste from
portions of the Wiki page at:

http://wiki.postgresql.org/wiki/Serializable

(2) The Concurrency Control chapter needs some major work:

http://developer.postgresql.org/pgdocs/postgres/mvcc.html

(3) Recent changes to index page split/combine logic are still
being tested.

(4) Two Phase Commit (2PC), used for distributed transactions,
needs work to move some of the SSI logic executed during PREPARE
TRANSACTION to COMMIT PREPARED.

(5) Index AMs other than btree need to be modified to support
fine-grained predicate locking. Currently GiST, GIN, and hash
indexes work correctly (in the sense of not allowing anomalies) by
acquiring SIRead locks at the index relation level. False positive
rw-conflicts, which could ultimately lead to transaction rollbacks,
will be more frequent when using these index access methods until
this work is done.

-Kevin

  • David Fetter at Jan 10, 2011 at 5:48 pm

    On Mon, Jan 10, 2011 at 10:03:18AM -0600, Kevin Grittner wrote:
    Due to popular request (Hey, David's popular, right?),
    Well, I'm a person, and "popular" originally refers to something like
    that ;)

    Cheers,
    David.
    --
    David Fetter <david@fetter.org> http://fetter.org/
    Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
    Skype: davidfetter XMPP: david.fetter@gmail.com
    iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

    Remember to vote!
    Consider donating to Postgres: http://www.postgresql.org/about/donate
  • Anssi Kääriäinen at Jan 11, 2011 at 12:31 pm

    On 01/10/2011 06:03 PM, Kevin Grittner wrote:
    Due to popular request (Hey, David's popular, right?), I'm posting a
    patch for Serializable Snapshot Isolation (SSI), although I don't
    yet have everything in it that I was planning on submitting before
    the CF. I will probably be submitting another version before the
    deadline with the following, but there should be plenty here for
    people to test and benchmark. We're done with the major refactoring
    needed to address concerns raised in earlier reviews, and I don't
    expect the remaining work to destabilize what's there or to have a
    significant impact on performance.
    I think I found a problem. This is using SSI v8. The table definition:

    create table test_t (id integer, val1 text, val2 integer);

    create index test_idx on test_t(id) where val2 = 1;

    The data:

    insert into test_t (select generate_series(0, 10000), 'a', 2);
    insert into test_t (select generate_series(0, 10), 'a', 1);

    The transactions:
    T1:
    hot2=> begin transaction isolation level serializable;
    BEGIN
    hot2=> select * from test_t where val2 = 1;
    id | val1 | val2
    ----+------+------
    0 | a | 1
    1 | a | 1
    2 | a | 1
    3 | a | 1
    4 | a | 1
    5 | a | 1
    6 | a | 1
    7 | a | 1
    8 | a | 1
    9 | a | 1
    10 | a | 1
    (11 rows)

    hot2=> update test_t set val2 = 2 where val2 = 1 and id = 10;
    UPDATE 1
    -- The concurrent transaction:
    T2:
    hot2=> begin transaction isolation level serializable;
    BEGIN
    hot2=> select * from test_t where val2 = 1;
    id | val1 | val2
    ----+------+------
    0 | a | 1
    1 | a | 1
    2 | a | 1
    3 | a | 1
    4 | a | 1
    5 | a | 1
    6 | a | 1
    7 | a | 1
    8 | a | 1
    9 | a | 1
    10 | a | 1
    (11 rows)

    hot2=> update test_t set val2 = 2 where val2 = 1 and id = 9;
    UPDATE 1
    hot2=> commit;
    COMMIT
    -- Now, t1 can commit, too. Even though there is a serialization anomaly
    T1:
    hot2=> commit;
    COMMIT

    If the test_idx is changed:
    (outside any transaction...)
    hot2=> drop index test_idx;
    DROP INDEX
    hot2=> create index test_idx on test_t(id, val2);
    CREATE INDEX


    T1:
    hot2=> begin transaction isolation level serializable;
    BEGIN
    hot2=> select * from test_t where val2 = 1;
    id | val1 | val2
    ----+------+------
    0 | a | 1
    1 | a | 1
    2 | a | 1
    3 | a | 1
    4 | a | 1
    5 | a | 1
    6 | a | 1
    7 | a | 1
    8 | a | 1
    (9 rows)

    hot2=> update test_t set val2 = 2 where val2 = 1 and id = 8;
    UPDATE 1

    T2:
    hot2=> select * from test_t where val2 = 1;
    id | val1 | val2
    ----+------+------
    0 | a | 1
    1 | a | 1
    2 | a | 1
    3 | a | 1
    4 | a | 1
    5 | a | 1
    6 | a | 1
    7 | a | 1
    8 | a | 1
    (9 rows)

    hot2=> update test_t set val2 = 2 where val2 = 1 and id = 7;
    UPDATE 1
    hot2=> commit;
    ERROR: could not serialize access due to read/write dependencies among
    transactions
    HINT: The transaction might succeed if retried.
    T1:
    hot2=> commit;
    COMMIT

    So, something seems to be broken when using partial indexes.

    - Anssi
  • Kevin Grittner at Jan 11, 2011 at 2:59 pm

    Anssi Kääriäinen wrote:

    I think I found a problem. This is using SSI v8.
    Thanks much for testing. You're managing to exercise some code
    paths I didn't think to test, which is great! I guess this is the
    up side of having posted yesterday. :-)
    So, something seems to be broken when using partial indexes.
    Yes, there is. The predicate locks against the heap tuples should
    have prevented that, but we're obviously missing a call to
    PredicateLockTuple on the partial-index code path that is present
    on the code path for complete indexes. I'll go looking for the
    spot to add that line of code.

    -Kevin
  • Anssi Kääriäinen at Jan 11, 2011 at 3:41 pm

    On 01/11/2011 04:53 PM, Kevin Grittner wrote:
    Thanks much for testing. You're managing to exercise some code
    paths I didn't think to test, which is great! I guess this is the
    up side of having posted yesterday. :-)
    Glad that I can help. This feature is very important to our usage
    of PostgreSQL.

    Just to add some positive feedback: I tried this patch with our
    in-house, in-development EAV database. Everything worked perfectly,
    and the time to import current and history data for 7000 employees
    (95000 attributes in total) into the DB using stored procedures
    went from 4 minutes 20 seconds to 4 minutes 30 seconds (the
    procedures do a lot of validation...). So in this case the
    real-world performance difference was barely noticeable. SSI also
    blocked serialization anomalies, and I did not see any false
    rollbacks in my (very limited) testing. So, thank you for the work
    on this feature. And, of course, I will try to break it some
    more. :-)

    - Anssi
  • Kevin Grittner at Jan 11, 2011 at 11:27 pm

    Anssi Kääriäinen wrote:

    something seems to be broken when using partial indexes.
    Boy do I feel dumb for taking all day to find the cause.

    The problem was a misdirected optimization -- on an update it was
    only checking the "after" image for conflicts, assuming that it
    would be redundant to check both the before and after images. The
    trouble is that with a partial index you might see only one of
    those tuples, and I suspect there could also be bugs with updates
    that change a value later used for access.
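
    In outline, the behavior we need is something like the sketch
    below. This is only an illustration of the idea, not the patch
    code; the names here are invented for the example.

    /*
     * Toy model: on an UPDATE, run the rw-conflict check against both
     * row versions, because with a partial index a concurrent reader
     * may have seen only one of them.
     */
    #include <stdio.h>

    typedef struct { unsigned block; unsigned offset; } TupleId;

    /* Stand-in for the real conflict-in check against one tuple. */
    static void check_conflict_in(TupleId tid)
    {
        printf("checking rw-conflicts for tuple (%u,%u)\n",
               tid.block, tid.offset);
    }

    static void on_update(TupleId old_version, TupleId new_version)
    {
        check_conflict_in(old_version);   /* the "before" image */
        check_conflict_in(new_version);   /* the "after" image  */
    }

    int main(void)
    {
        on_update((TupleId){3, 7}, (TupleId){5, 1});
        return 0;
    }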

    The evil premature optimization is eliminated here:

    http://git.postgresql.org/gitweb?p=users/kgrittn/postgres.git;a=commitdiff;h=a9c63f9c688629e2f6f92a880c5f1ac9669bf77d

    This also includes an attempt to eliminate the assertion failure Dan
    found in DBT-2 testing yesterday. I'm not sure if this change is
    radical enough, but I figured it was better to try the minimal
    change first, and see if that was sufficient. If not, I'll have to
    move some code between functions, and duplicate a bit of code.

    New patch (version 10) attached.

    -Kevin
  • Kevin Grittner at Jan 14, 2011 at 12:21 am

    Anssi Kääriäinen wrote:

    I think I found a problem. This is using SSI v8. The table
    definition:

    create table test_t (id integer, val1 text, val2 integer);
    create index test_idx on test_t(id) where val2 = 1;
    insert into test_t (select generate_series(0, 10000), 'a', 2);
    insert into test_t (select generate_series(0, 10), 'a', 1);
    T1:
    hot2=> begin transaction isolation level serializable;
    hot2=> select * from test_t where val2 = 1;
    hot2=> update test_t set val2 = 2 where val2 = 1 and id = 10;
    T2:
    hot2=> begin transaction isolation level serializable;
    hot2=> select * from test_t where val2 = 1;
    hot2=> update test_t set val2 = 2 where val2 = 1 and id = 9;
    hot2=> commit;
    T1:
    hot2=> commit;
    I hope you have no objection to having the code you wrote included
    in the test suite which is part of the patch. Well, if you do, I'll
    pull it back out and invent something similar... ;-)

    http://git.postgresql.org/gitweb?p=users/kgrittn/postgres.git;a=commitdiff;h=2502cccbdd5e5d44be469549b91fe49c0554ec3e

    -Kevin
  • Anssi Kääriäinen at Jan 14, 2011 at 10:27 am

    On 01/14/2011 02:21 AM, Kevin Grittner wrote:
    I hope you have no objection to having the code you wrote included
    in the test suite which is part of the patch. Well, if you do, I'll
    pull it back out and invent something similar... ;-)
    No objection.

    - Anssi
  • Anssi Kääriäinen at Jan 11, 2011 at 2:14 pm

    On 01/10/2011 06:03 PM, Kevin Grittner wrote:
    Due to popular request (Hey, David's popular, right?), I'm posting a
    patch for Serializable Snapshot Isolation (SSI), although I don't
    yet have everything in it that I was planning on submitting before
    the CF. I will probably be submitting another version before the
    deadline with the following, but there should be plenty here for
    people to test and benchmark. We're done with the major refactoring
    needed to address concerns raised in earlier reviews, and I don't
    expect the remaining work to destabilize what's there or to have a
    significant impact on performance.
    A speed test showing a significant drop in performance when using SSI:

    hot2=> create table test_t2 as (select generate_series(0, 1000000));
    hot2=> \timing
    begin transaction isolation level repeatable read;
    Time: 0.185 ms
    hot2=> select count(*) from test_t2;
    Time: 134.521 ms
    hot2=> select count(*) from test_t2;
    Time: 142.132 ms
    hot2=> select count(*) from test_t2;
    Time: 147.469 ms
    hot2=> commit;
    hot2=> begin transaction isolation level serializable;
    Time: 0.165 ms
    hot2=> select count(*) from test_t2;
    Time: 349.219 ms
    hot2=> select count(*) from test_t2;
    Time: 354.010 ms
    hot2=> select count(*) from test_t2;
    Time: 355.960 ms
    hot2=> commit;

    So, count(*) queries are more than twice as slow compared to the old
    serializable transaction isolation level. The relative speed difference
    stays the same if testing with more rows. Also, the same speed
    difference is there if testing with more columns:

    create table test_t4 as (select g g1, g g2, g g3, g g4, g g5, g g6 from
    (select generate_series as g from generate_series(0, 1000000)) as t1);

    repeatable read times:
    140.113 ms 134.628 ms 140.113 ms 156.166 ms

    serializable:
    353.257 ms 366.558 ms 373.914 ms 359.682 ms

    - Anssi
  • Kevin Grittner at Jan 11, 2011 at 2:28 pm

    Anssi Kääriäinen wrote:

    A speed test showing a significant drop in performance when using SSI:
    hot2=> create table test_t2 as (select generate_series(0, 1000000));
    hot2=> \timing
    begin transaction isolation level repeatable read;
    Time: 0.185 ms
    hot2=> select count(*) from test_t2;
    Time: 134.521 ms
    hot2=> select count(*) from test_t2;
    Time: 142.132 ms
    hot2=> select count(*) from test_t2;
    Time: 147.469 ms
    hot2=> commit;
    hot2=> begin transaction isolation level serializable;
    Time: 0.165 ms
    hot2=> select count(*) from test_t2;
    Time: 349.219 ms
    hot2=> select count(*) from test_t2;
    Time: 354.010 ms
    hot2=> select count(*) from test_t2;
    Time: 355.960 ms
    hot2=> commit;

    So, count(*) queries are more than twice as slow compared to the old
    serializable transaction isolation level.
    Thanks for the report. I'll profile that and see what can be done
    about it.

    -Kevin
  • Kevin Grittner at Jan 12, 2011 at 7:19 pm

    Anssi Kääriäinen wrote:

    So, count(*) queries are more than twice as slow compared to the
    old serializable transaction isolation level.
    I've looked at this enough to know that I can do something about
    that, but wanted to point out that this is a good example of why you
    should specify READ ONLY when possible. My numbers:

    begin transaction isolation level repeatable read;
    Time: 394.946 ms
    Time: 248.675 ms
    Time: 242.559 ms

    begin transaction isolation level serializable;
    Time: 494.676 ms
    Time: 494.036 ms
    Time: 491.712 ms

    begin transaction isolation level serializable, read only;
    Time: 234.075 ms
    Time: 234.050 ms
    Time: 234.057 ms

    begin transaction isolation level serializable, read only,
    deferrable;
    Time: 233.494 ms
    Time: 234.099 ms
    Time: 235.290 ms

    The slower times for REPEATABLE READ gave me pause, so I ran those
    again:

    begin transaction isolation level repeatable read;
    Time: 233.946 ms
    Time: 236.200 ms
    Time: 236.414 ms

    I guess the database just hadn't "warmed up" enough for the first
    few tests....

    -Kevin
  • Kevin Grittner at Jan 13, 2011 at 12:02 am

    Anssi Kääriäinen wrote:

    So, count(*) queries are more than twice as slow compared to the
    old serializable transaction isolation level.
    I got this down from more than twice the run time to running 33%
    longer through remembering the last relation for which a search for
    a predicate lock held by the current transaction found a match at
    the coarsest (relation) level. It's a bit of a hack and 33% isn't
    very impressive, even for a worst case (and this is one type of
    worst case) -- especially given how often people use SELECT count(*)
    FROM table_x as a performance test. :-(
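
    The shortcut itself is roughly the shape sketched below. This is
    only an illustration of the idea (a one-entry cache of the last
    relation found to be covered at relation granularity); the names
    and structure here are made up and do not match the patch.

    #include <stdbool.h>
    #include <stdio.h>

    static unsigned last_rel_known_locked = 0;  /* 0 means "nothing cached" */

    /* Stand-in for the real relation-granularity search of the shared
     * predicate lock table. */
    static bool lock_table_has_relation_lock(unsigned reloid)
    {
        printf("walking the shared lock table for relation %u\n", reloid);
        return true;   /* pretend a relation-level lock is found */
    }

    static bool relation_covered(unsigned reloid)
    {
        if (reloid != 0 && reloid == last_rel_known_locked)
            return true;                        /* cache hit: skip the walk */
        if (lock_table_has_relation_lock(reloid))
        {
            last_rel_known_locked = reloid;
            return true;
        }
        return false;
    }

    int main(void)
    {
        relation_covered(16385);   /* first tuple of the scan: table walk */
        relation_covered(16385);   /* later tuples: cache hit, no walk    */
        return 0;
    }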

    I can see a way to improve on this if there's a low-cost way to
    determine from within the heapam.c:heapgettup_pagemode function
    whether it's returning tuples for a table scan. It seems likely
    that this is somehow contained in the HeapScanDesc structure, but
    I'm not seeing it. Can anyone point me in the right direction, or
    tell me that this avenue is a dead end?

    Thanks,

    -Kevin
  • Heikki Linnakangas at Jan 13, 2011 at 9:53 am

    On 13.01.2011 02:01, Kevin Grittner wrote:
    Anssi Kääriäinen wrote:
    So, count(*) queries are more than twice as slow compared to the
    old serializable transaction isolation level.
    I got this down from more than twice the run time to running 33%
    longer through remembering the last relation for which a search for
    a predicate lock held by the current transaction found a match at
    the coarsest (relation) level. It's a bit of a hack and 33% isn't
    very impressive, even for a worst case (and this is one type of
    worst case) -- especially given how often people use SELECT count(*)
    FROM table_x as a performance test. :-(

    I can see a way to improve on this if there's a low-cost way to
    determine from within the heapam.c:heapgettup_pagemode function
    whether it's returning tuples for a table scan. It seems likely
    that this is somehow contained in the HeapScanDesc structure, but
    I'm not seeing it. Can anyone point me in the right direction, or
    tell me that this avenue is a dead end?
    Pardon my ignorance, but where exactly is the extra overhead coming
    from? Searching for a predicate lock?

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Kevin Grittner at Jan 13, 2011 at 2:52 pm

    Heikki Linnakangas wrote:

    Pardon my ignorance, but where exactly is the extra overhead
    coming from? Searching for a predicate lock?
    Right. As each tuple is read we need to ensure that there is a
    predicate lock to cover it. Since finer-grained locks can be
    combined into coarser-grained locks we need to start with the fine
    grained and move toward checking the coarser grains, to avoid
    missing a lock during promotion. So for each tuple we calculate a
    hash, find a partition, lock it, and look up the tuple as a lock
    target. When that's not found we do the same thing for the page.
    When that's not found we do the same thing for the relation.

    But we acquired a relation lock up front, when we determined that
    this would be a heap scan, so we could short-circuit this whole
    thing if within the heapgettup_pagemode function we could determine
    that this was a scan of the whole relation.
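
    To make the cost concrete, the per-tuple lookup has roughly the
    shape of the toy model below (simplified for illustration; the real
    lock table is a shared, partitioned hash table and these are not
    the real structures). The point is that even when a relation-level
    lock is what ultimately matches, every tuple read still pays for
    the tuple and page probes first.

    #include <stdbool.h>
    #include <stdio.h>

    typedef enum { LOCK_TUPLE, LOCK_PAGE, LOCK_RELATION } LockGranularity;

    typedef struct
    {
        unsigned rel;       /* relation OID              */
        unsigned page;      /* block number, if relevant */
        unsigned offset;    /* tuple offset, if relevant */
        LockGranularity g;
    } LockTarget;

    /* A tiny stand-in for the shared lock table. */
    static LockTarget held[8];
    static int nheld = 0;

    static bool target_held(LockTarget t)
    {
        for (int i = 0; i < nheld; i++)
            if (held[i].g == t.g && held[i].rel == t.rel &&
                (t.g == LOCK_RELATION || held[i].page == t.page) &&
                (t.g != LOCK_TUPLE || held[i].offset == t.offset))
                return true;
        return false;
    }

    /* Check tuple, then page, then relation, so a lock promoted to a
     * coarser granularity is never missed. */
    static bool covered_by_existing_lock(unsigned rel, unsigned page,
                                         unsigned off)
    {
        LockTarget t = { rel, page, off, LOCK_TUPLE };
        if (target_held(t)) return true;
        t.g = LOCK_PAGE;
        if (target_held(t)) return true;
        t.g = LOCK_RELATION;
        return target_held(t);
    }

    int main(void)
    {
        /* The seqscan took a relation-level SIRead lock up front. */
        held[nheld++] = (LockTarget){ 16385, 0, 0, LOCK_RELATION };

        /* Each tuple read still walks all three granularities. */
        printf("tuple (16385,3,7) covered: %d\n",
               covered_by_existing_lock(16385, 3, 7));
        return 0;
    }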

    The profiling also showed that it was spending an obscene amount of
    time calculating hash values (over 10% of total run time!). I'm
    inclined to think that a less sophisticated algorithm (like just
    adding oid, page, and tuple offset numbers) would generate very
    *real* savings with the down side being a very hypothetical
    *possible* cost to longer chains in the HTAB. But that's a separate
    issue, best settled on the basis of benchmarks rather than
    theoretical discussions.
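
    For what it's worth, the "less sophisticated" hash I have in mind
    is roughly the sketch below (illustrative only; whether this beats
    the hash function the HTAB uses now is exactly the question the
    benchmarks would have to answer).

    #include <stdint.h>
    #include <stdio.h>

    /* Just sum the numbers identifying the lock target; collisions
     * only lengthen hash chains, they don't affect correctness. */
    static uint32_t cheap_target_hash(uint32_t reloid, uint32_t blkno,
                                      uint32_t offnum)
    {
        return reloid + blkno + offnum;
    }

    int main(void)
    {
        /* Map to one of, say, 16 lock partitions. */
        printf("partition %u\n", cheap_target_hash(16385, 3, 7) % 16);
        return 0;
    }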

    -Kevin
  • Heikki Linnakangas at Jan 13, 2011 at 3:02 pm

    On 13.01.2011 16:51, Kevin Grittner wrote:
    Right. As each tuple is read we need to ensure that there is a
    predicate lock to cover it. Since finer-grained locks can be
    combined into coarser-grained locks we need to start with the fine
    grained and move toward checking the coarser grains, to avoid
    missing a lock during promotion. So for each tuple we calculate a
    hash, find a partition, lock it, and look up the tuple as a lock
    target. When that's not found we do the same thing for the page.
    When that's not found we do the same thing for the relation.

    But we acquired a relation lock up front, when we determined that
    this would be a heap scan, so we could short-circuit this whole
    thing if within the heapgettup_pagemode function we could determine
    that this was a scan of the whole relation.
    That sounds simple enough. Add a boolean field to HeapScanDesc,
    "rs_relpredicatelocked", and set it when you acquire the relation lock.

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Kevin Grittner at Jan 13, 2011 at 3:39 pm

    Heikki Linnakangas wrote:

    That sounds simple enough. Add a boolean field to HeapScanDesc,
    "rs_relpredicatelocked", and set it when you acquire the relation
    lock.
    I'll take a look at doing that. Thanks!

    -Kevin
  • Kevin Grittner at Jan 13, 2011 at 4:28 pm

    Heikki Linnakangas wrote:
    On 13.01.2011 16:51, Kevin Grittner wrote:

    But we acquired a relation lock up front, when we determined that
    this would be a heap scan, so we could short-circuit this whole
    thing if within the heapgettup_pagemode function we could
    determine that this was a scan of the whole relation.
    That sounds simple enough. Add a boolean field to HeapScanDesc,
    "rs_relpredicatelocked", and set it when you acquire the relation
    lock.
    Heikki, I can't thank you enough. The fix is here:

    http://git.postgresql.org/gitweb?p=users/kgrittn/postgres.git;a=commitdiff;h=64ca508a0e2fa9c21dc76a5d6a5f549c27f511fa

    The timings are now:

    begin transaction isolation level repeatable read;
    Time: 324.938 ms
    Time: 228.045 ms
    Time: 227.963 ms

    begin transaction isolation level serializable;
    Time: 311.954 ms
    Time: 311.928 ms
    Time: 311.848 ms

    begin transaction isolation level serializable, read only;
    Time: 227.471 ms
    Time: 228.137 ms
    Time: 227.778 ms

    begin transaction isolation level serializable, read only,
    deferrable;
    Time: 227.899 ms
    Time: 249.772 ms
    Time: 228.026 ms

    begin transaction isolation level repeatable read;
    Time: 231.173 ms
    Time: 245.041 ms
    Time: 228.149 ms

    I'm surprised the difference is still that high as a percentage,
    and will investigate, but this seems survivable. When I do the math
    ((311.848 ms - 227.963 ms) spread over the 1,000,001 rows), the
    difference comes out to 83.885 nanoseconds per row.

    -Kevin
  • Kevin Grittner at Jan 13, 2011 at 3:02 pm

    Heikki Linnakangas wrote:

    where exactly is the extra overhead coming from?
    Keep in mind that this is a sort of worst case scenario. The data
    is fully cached in shared memory and we're doing a sequential pass
    just counting the rows. In an earlier benchmark (which I should
    re-do after all this refactoring), random access queries against a
    fully cached data set only increased run time by 1.8%. Throw some
    disk access into the mix, and the overhead is likely to get lost in
    the noise.

    But, as I said, count(*) seems to be the first thing many people try
    as a benchmark, and this is a symptom of a more general issue, so
    I'd like to find a good solution.

    -Kevin
