Attached is a revamped version of the FSM rewrite. WAL-logging is now
gone. Any inconsistencies between levels of the FSM are fixed during
vacuum, and by searchers when they run into a dead end because of a
discrepancy. Corruption within FSM pages is likewise fixed by vacuum and
searchers.
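
To illustrate the self-repair idea, here is a standalone toy sketch (not
the patch code; names and layout are simplified). An FSM page stores a
binary tree in an array, with each inner node supposed to hold the
maximum free-space "category" of its children; a searcher that hits a
dead end writes the correct maximum back into the offending node and
retries, so the stale value can't mislead the next searcher either:

/* fsm_selfheal_sketch.c -- standalone illustration, NOT the patch code */
#include <stdio.h>

#define LEAVES 8
static unsigned char tree[2 * LEAVES - 1];  /* inner nodes first, leaves last */

static int
search_avail(unsigned char needed)
{
    int node = 0;

    while (node < LEAVES - 1)               /* stop when we reach a leaf */
    {
        int l = 2 * node + 1;
        int r = 2 * node + 2;

        if (tree[l] >= needed)
            node = l;
        else if (tree[r] >= needed)
            node = r;
        else
        {
            /* Dead end: this node promised space its children don't have.
             * Repair it in place and restart from the root. */
            tree[node] = (tree[l] > tree[r]) ? tree[l] : tree[r];
            if (tree[0] < needed)
                return -1;                  /* nothing on this FSM page fits */
            node = 0;
        }
    }
    return node - (LEAVES - 1);             /* leaf index == heap block no. */
}

int
main(void)
{
    unsigned char leaves[LEAVES] = {0, 3, 0, 0, 7, 0, 2, 0};
    int i;

    for (i = 0; i < LEAVES; i++)
        tree[LEAVES - 1 + i] = leaves[i];
    for (i = LEAVES - 2; i >= 0; i--)       /* build inner nodes bottom-up */
        tree[i] = tree[2 * i + 1] > tree[2 * i + 2] ? tree[2 * i + 1] : tree[2 * i + 2];

    tree[1] = 8;                            /* simulate a stale upper node */

    printf("block with category >= 5: %d\n", search_avail(5));  /* prints 4 */
    return 0;
}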

The FSM in a warm standby gets updated by replay of heap CLEAN WAL
records. That means that the FSM will reflect the situation after the
last vacuum, which is better than what we have now, but not quite
up-to-date. I'm worried that this might not be enough, because on a
large table, a lot of pages could have been filled since the last vacuum,
and the first backend that tries to insert into the table will have to
grind through all those pages, only to find that they're all full now. It
would be simple to update the FSM at every heap insert and update record,
but that might be an unacceptable amount of overhead during recovery. Also,
the index FSM is not yet updated during recovery.
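
As a rough illustration of the mechanism (a standalone sketch, not the
patch; the stubs below stand in for calls along the lines of
PageGetHeapFreeSpace() and RecordPageWithFreeSpace()), replay of a heap
CLEAN record already knows the page's state after pruning and can push
the new free space straight into the FSM. Doing the same from insert and
update records would look much the same, which is exactly where the
recovery-overhead question comes in:

typedef unsigned int BlockNumber;
typedef unsigned long Size;

/* stand-ins for the real page/freespace calls */
static Size page_free_space_stub(BlockNumber heapblk) { (void) heapblk; return 0; }
static void record_free_space_stub(BlockNumber heapblk, Size avail) { (void) heapblk; (void) avail; }

static void
redo_heap_clean_sketch(BlockNumber heapblk)
{
    /* ...replay the pruning of the heap page itself first... */

    /* then tell the standby's FSM how much space the page has now */
    record_free_space_stub(heapblk, page_free_space_stub(heapblk));
}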

The FSM is now extended lazily, so there's no explicit
FreeSpaceMapExtendRel function anymore. To avoid calling smgrnblocks all
the time, I added a field to RelationData to cache the size of the FSM fork.
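
For illustration, a minimal sketch of that caching (the field and
function names are invented for this example, and the stub stands in for
an smgrnblocks() call on the FSM fork):

#include <stdbool.h>

typedef unsigned int BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

static BlockNumber fsm_fork_nblocks_stub(void) { return 3; }  /* stand-in for smgr */

typedef struct FSMSizeCacheSketch
{
    BlockNumber fsm_nblocks;    /* cached FSM fork size, Invalid if not known yet */
} FSMSizeCacheSketch;

static BlockNumber
fsm_nblocks_cached(FSMSizeCacheSketch *rel)
{
    if (rel->fsm_nblocks == InvalidBlockNumber)
        rel->fsm_nblocks = fsm_fork_nblocks_stub();    /* ask the smgr only once */
    return rel->fsm_nblocks;
}

/*
 * Lazy extension: instead of an explicit FreeSpaceMapExtendRel(), a lookup
 * that needs an FSM block beyond the cached end would extend the fork (and
 * bump the cached size) only at that point.
 */
static bool
fsm_block_exists(FSMSizeCacheSketch *rel, BlockNumber fsm_blkno)
{
    return fsm_blkno < fsm_nblocks_cached(rel);
}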

The fsm_search_avail() function now emulates the old FSM behavior more
closely. It should now always return the next page to the right of the
next-pointer (wrapping around if necessary).
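
A toy version of that behaviour (a linear scan stands in for the real
binary-tree search within the page, and the names are made up):

static int next_pointer = 0;            /* per-page hint, like the old FSM's */

static int
search_avail_wraparound(const unsigned char *avail, int npages, unsigned char needed)
{
    int i;

    for (i = 1; i <= npages; i++)
    {
        int page = (next_pointer + i) % npages;   /* start just right of the hint */

        if (avail[page] >= needed)
        {
            next_pointer = page;        /* the next search continues from here */
            return page;
        }
    }
    return -1;                          /* no page with enough free space */
}

Starting one slot to the right of the hint and wrapping around helps
spread concurrent inserters over different pages instead of having them
all fight over the same one.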

I believe I've addressed all of Tom's other comments on the code as
well. Also, the "next-pointer" is now reset in vacuum, to encourage the
use of low-numbered pages, per Bruce's comment.

There's one known bug left. If we crash after truncating a relation, and
the truncation of the FSM doesn't hit the disk but the truncation of the
main fork does, we can end up with an FSM that shows that there's free
space on pages that don't exist anymore. The straightforward way to fix
that is to put back the WAL-logging of FSM truncations, but given that I
just ripped off all the rest of the WAL-logging, does anyone have other
ideas?

TODO:
- Performance testing, again.
- Add test case to regression suite that exercises index FSM
- Fix the crash+truncate bug
- Update index FSM during recovery
- Update FSM more frequently during recovery
- Documentation

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

  • Simon Riggs at Sep 23, 2008 at 4:43 am

    On Mon, 2008-09-22 at 20:43 +0300, Heikki Linnakangas wrote:

    Attached is a revamped version of the FSM rewrite. WAL-logging is now
    gone. Any inconsistencies between levels of the FSM are fixed during
    vacuum, and by searchers when they run into a dead end because of a
    discrepancy. Corruption within FSM pages is likewise fixed by vacuum and
    searchers.

    The FSM in a warm standby gets updated by replay of heap CLEAN WAL
    records. That means that the FSM will reflect the situation after the
    last vacuum, which is better than what we have now, but not quite
    up-to-date. I'm worried that this might not be enough...
    I hadn't realised you would remove it completely. Did you identify WAL
    as the bottleneck?

    Is there some mid-way point between every time and almost never
    (VACUUM!)?

    --
    Simon Riggs www.2ndQuadrant.com
    PostgreSQL Training, Services and Support
  • Heikki Linnakangas at Sep 24, 2008 at 7:58 am

    Simon Riggs wrote:
    On Mon, 2008-09-22 at 20:43 +0300, Heikki Linnakangas wrote:

    Attached is a revamped version of the FSM rewrite. WAL-logging is now
    gone. Any inconsistencies between levels of the FSM are fixed during
    vacuum, and by searchers when they run into a dead end because of a
    discrepancy. Corruption within FSM pages is likewise fixed by vacuum and
    searchers.

    The FSM in a warm standby gets updated by replay of heap CLEAN WAL
    records. That means that the FSM will reflect the situation after the
    last vacuum, which is better than what we have now, but not quite
    up-to-date. I'm worried that this might not be enough...
    I hadn't realised you would remove it completely. Did you identify WAL
    as the bottleneck?
    No. But it seemed like a sensible thing to do.
    Is there some mid-way point between every time and almost never
    (VACUUM!)?
    That's possible. However, if it's not fully WAL-logged, it needs to be
    self-correcting, and probably needs to be periodically vacuumed as well.
    So you'll get the code complexity of both approaches. I don't think
    VACUUM in particular needs to be separately WAL-logged, because we can
    easily piggy-back on the HEAP_CLEAN records. The question is whether we
    should do the same for every heap insert and update record, or is that
    too much overhead?

    It looks like truncations do need to be WAL-logged, though.

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Zdenek Kotala at Sep 24, 2008 at 3:15 pm

    Heikki Linnakangas wrote:

    It would be
    simple to update the FSM at every heap insert and update record, but
    that then might be an unacceptable amount of overhead at recovery. Also,
    the index FSM is not yet updated during recovery.
    Rather than slower recovery, I expect locking problems, especially on small
    tables where the FSM has only one level. Maybe the background writer could
    update the FSM info, but that breaks modularity and could bring problems
    during checkpoints.

    <snip>
    There's one known bug left. If we crash after truncating a relation, and
    the truncation of the FSM doesn't hit the disk but the truncation of the
    main fork does, we can end up with an FSM that shows that there's free
    space on pages that don't exist anymore. The straightforward way to fix
    that is to put back the WAL-logging of FSM truncations, but given that I
    just ripped off all the rest of the WAL-logging, does anyone have other
    ideas?
    What about truncating the FSM during WAL replay of the main fork truncation record?

    Zdenek


    --
    Zdenek Kotala Sun Microsystems
    Prague, Czech Republic http://sun.com/postgresql
  • Heikki Linnakangas at Sep 24, 2008 at 5:25 pm

    Zdenek Kotala wrote:
    Heikki Linnakangas wrote:
    There's one known bug left. If we crash after truncating a relation, and
    the truncation of the FSM doesn't hit the disk but the truncation of the
    main fork does, we can end up with an FSM that shows that there's free
    space on pages that don't exist anymore. The straightforward way to fix
    that is to put back the WAL-logging of FSM truncations, but given that I
    just ripped off all the rest of the WAL-logging, does anyone have other
    ideas?
    What about truncating the FSM during WAL replay of the main fork truncation record?
    That would work, except that there's no WAL record for truncating the
    main fork.

    I considered adding that, WAL-logging the truncation of the main fork,
    in vacuum.c. But it would have to be done for all indexams that use the
    FSM as well.

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Tom Lane at Sep 24, 2008 at 8:15 pm

    Heikki Linnakangas writes:
    Zdenek Kotala wrote:
    What about truncating the FSM during WAL replay of the main fork truncation record?
    That would work, except that there's no WAL record for truncating the
    main fork.
    Whaddya mean there's no WAL record for that? smgrtruncate produces one.

    regards, tom lane
  • Heikki Linnakangas at Sep 24, 2008 at 8:29 pm

    Tom Lane wrote:
    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
    Zdenek Kotala wrote:
    What about truncating the FSM during WAL replay of the main fork truncation record?
    That would work, except that there's no WAL record for truncating the
    main fork.
    Whaddya mean there's no WAL record for that? smgrtruncate produces one.
    Ooh, I didn't realize that!

    Hmm. The smgrtruncate WAL record is generated after the file is
    truncated, so there's still a small window there, where we can be left
    with a truncated main fork, but no smgrtruncate record for it, and thus
    the page of the FSM representing the truncated blocks doesn't get zeroed
    at replay.

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Tom Lane at Sep 24, 2008 at 8:37 pm

    Heikki Linnakangas writes:
    Hmm. The smgrtruncate WAL record is generated after the file is
    truncated, so there's still a small window there, where we can be left
    with a truncated main fork, but no smgrtruncate record for it, and thus
    the page of the FSM representing the truncated blocks doesn't get zeroed
    at replay.
    Hmm. I seem to recall that that ordering was intentional, but I don't
    recall why just now. The code doesn't say but maybe there's something
    in the archives. It does seem a little odd since it's an apparent
    violation of the WAL-before-data rule.

    If you wanted to be certain that the WAL record existed you'd have to
    not only generate it first but fsync it, which would be a bit of a
    performance hit. OTOH I'm not sure we care about smgrtruncate being
    really quick...

    regards, tom lane
  • Heikki Linnakangas at Sep 24, 2008 at 5:31 pm

    Zdenek Kotala wrote:
    Heikki Linnakangas wrote:
    It would be
    simple to update the FSM at every heap insert and update record, but
    that then might be an unacceptable amount of overhead at recovery. Also,
    the index FSM is not yet updated during recovery.
    Rather than slower recovery, I expect locking problems, especially on small
    tables where the FSM has only one level. Maybe the background writer could
    update the FSM info, but that breaks modularity and could bring problems
    during checkpoints.
    Recovery is single-threaded (for the lack of a better word) anyway, so
    there can't be other backends competing for the locks. Background writer
    is one option.

    Another option is to update the FSM if there's less than X% of free space on
    the page left after the insert/update. That would be similar to the rule
    we use during normal operation, which is to update the FSM whenever the
    target page fills up and we have to get a new page from the FSM.
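
    A minimal sketch of that rule (the 20% figure below is just an
    arbitrary stand-in for "X%"):

    #define BLCKSZ 8192
    #define FSM_REPLAY_THRESHOLD (BLCKSZ / 5)       /* "X%" = 20% of a page */

    /* called after replaying a heap insert/update record */
    static int
    fsm_update_needed_after_replay(unsigned int freespace_left)
    {
        return freespace_left < FSM_REPLAY_THRESHOLD;
    }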

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Zdenek Kotala at Sep 25, 2008 at 7:12 am

    Heikki Linnakangas wrote:
    Attached is a revamped version of the FSM rewrite. WAL-logging is now
    gone. Any inconsistencies between levels of the FSM are fixed during
    vacuum, and by searchers when they run into a dead end because of a
    discrepancy. Corruption within FSM pages is likewise fixed by vacuum and
    searchers.
    Hi Heikki,

    I'm testing this version now and I got a core dump in the initdb phase:

    08047558 pgstat_count_heap_insert+0x20(840b120, 842d738, 80, 8047580)
    08047608 heap_insert+0x124(840b120, 842d738, 0, 1, 1, 842d710)
    08047638 simple_heap_insert+0x23(840b120, 842d738, 8367f60, 80cc25a, b,
    8302cac)
    08047668 InsertOneTuple+0x93(4da, 80479ec, 8047b88, 80cb0b3, 0, 0)
    08047b88 boot_yyparse+0x7f1(0, feff0e68, 0, 1, 83ab990, fefcba94)
    08047bb8 AuxiliaryProcessMain+0x41d(3, 83aad04, fee88540, 816b63b,
    fef00e40, 8047bf4)
    08047bfc main+0x3a9(4, 83aad00, 8047c30)
    08047c10 _start+0x80(4, 8047d04, 8047d2c, 8047d33, 8047d37, 0)

    Let me know if you need more info. (I'm not using a fresh CVS snapshot,
    but one that's two weeks old.)

    Zdenek
  • Heikki Linnakangas at Sep 25, 2008 at 7:50 am

    Zdenek Kotala wrote:
    I'm testing this version now and I got core dump in initdb phase:

    08047558 pgstat_count_heap_insert+0x20(840b120, 842d738, 80, 8047580)
    08047608 heap_insert+0x124(840b120, 842d738, 0, 1, 1, 842d710)
    08047638 simple_heap_insert+0x23(840b120, 842d738, 8367f60, 80cc25a, b,
    8302cac)
    08047668 InsertOneTuple+0x93(4da, 80479ec, 8047b88, 80cb0b3, 0, 0)
    08047b88 boot_yyparse+0x7f1(0, feff0e68, 0, 1, 83ab990, fefcba94)
    08047bb8 AuxiliaryProcessMain+0x41d(3, 83aad04, fee88540, 816b63b,
    fef00e40, 8047bf4)
    08047bfc main+0x3a9(4, 83aad00, 8047c30)
    08047c10 _start+0x80(4, 8047d04, 8047d2c, 8047d33, 8047d37, 0)

    Let me know if you need more info. (I'm not using fresh CVS snapshot but
    two weeks old)
    pgstat_count_heap_insert is an odd place to crash. I'd suggest updating
    your CVS snapshot, and "make clean".

    Attached is a new version, now with WAL-logging of the FSM truncation. I
    decided to go with the separate WAL record for that, rather than
    piggybacking on the smgrtruncate's WAL record. It seems much better from
    a modularity point of view this way. I've also worked on the comments,
    renamed many of the internal functions, in a more coherent scheme, and I
    also started using the "struct FSMAddress" you suggested a while ago.
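
    For illustration, roughly what such a record could carry (the struct
    and function names here are invented and need not match the attached
    patch):

    typedef unsigned int BlockNumber;

    typedef struct xl_fsm_truncate_sketch
    {
        /* the relation's RelFileNode would go here in a real record */
        BlockNumber nheapblocks;    /* new length of the main fork, in blocks */
    } xl_fsm_truncate_sketch;

    static void
    fsm_redo_truncate_sketch(xl_fsm_truncate_sketch *xlrec)
    {
        /*
         * Zero the FSM slots for heap blocks >= xlrec->nheapblocks and
         * truncate the FSM fork itself, so a crash between truncating the
         * main fork and flushing the FSM can no longer leave stale free
         * space pointing past the end of the table.
         */
        (void) xlrec;               /* body omitted in this sketch */
    }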

    But I don't think I've changed anything that could explain that crash.
    Let me know if it still doesn't work.

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Zdenek Kotala at Sep 25, 2008 at 8:17 am

    Heikki Linnakangas wrote:
    Zdenek Kotala wrote:
    I'm testing this version now and I got core dump in initdb phase:

    08047558 pgstat_count_heap_insert+0x20(840b120, 842d738, 80, 8047580)
    <snip>
    Let me know if you need more info. (I'm not using fresh CVS snapshot
    but two weeks old)
    pgstat_count_heap_insert is an odd place to crash. I'd suggest updating
    your CVS snapshot, and "make clean".
    Updating is not a good idea :( - not for a performance test, because it
    would require retesting the original version, and so on ...
    Attached is a new version, now with WAL-logging of the FSM truncation. I
    decided to go with the separate WAL record for that, rather than
    piggybacking on the smgrtruncate's WAL record. It seems much better from
    a modularity point of view this way. I've also worked on the comments,
    renamed many of the internal functions, in a more coherent scheme, and I
    also started using the "struct FSMAddress" you suggested a while ago.

    But I don't think I've changed anything that could explain that crash.
    Let me know if it still doesn't work.
    OK. I'll look into it.

    Zdenek
  • Zdenek Kotala at Sep 26, 2008 at 1:05 pm

    Heikki Linnakangas wrote:
    Zdenek Kotala wrote:
    Attached is a new version, now with WAL-logging of the FSM truncation. I
    decided to go with the separate WAL record for that, rather than
    piggybacking on the smgrtruncate's WAL record. It seems much better from
    a modularity point of view this way. I've also worked on the comments,
    renamed many of the internal functions, in a more coherent scheme, and I
    also started using the "struct FSMAddress" you suggested a while ago.

    But I don't think I've changed anything that could explain that crash.
    Let me know if it still doesn't work.
    This version works on my "old" repo.

    I performed a performance test (iGEN) on a Sun x4600 with 60 concurrent
    users; here are the results:

    Original:
    ---------
    MQThL (Maximum Qualified Throughput LIGHT): 1209.60 tpm
    MQThM (Maximum Qualified Throughput MEDIUM): 2576.72 tpm
    MQThH (Maximum Qualified Throughput HEAVY): 2191.20 tpm


    TRANSACTION MIX

    Total number of transactions = 181232
    TYPE           TX. COUNT     MIX
    ----           ---------     ------
    Light:             30240     16.69%
    Medium:            64418     35.54%
    DSS:               19865     10.96%
    Heavy:             54780     30.23%
    Connection:        11929      6.58%


    RESPONSE TIMES     AVG.     MAX.    90TH

    Light             0.304    6.405    0.400
    Medium            0.317    6.533    0.400
    DSS               0.266    6.343    0.020
    Heavy             0.361    6.737    3.000
    Connections       0.264    5.983    0.400
    Number of users = 60
    Sum of Avg. RT * TPS for all Tx. Types = 32.770142


    FSM with WAL
    ------------
    MQThL (Maximum Qualified Throughput LIGHT): 1199.36 tpm
    MQThM (Maximum Qualified Throughput MEDIUM): 2569.12 tpm
    MQThH (Maximum Qualified Throughput HEAVY): 2171.64 tpm


    TRANSACTION MIX

    Total number of transactions = 180625
    TYPE           TX. COUNT     MIX
    ----           ---------     ------
    Light:             29984     16.60%
    Medium:            64228     35.56%
    DSS:               20181     11.17%
    Heavy:             54291     30.06%
    Connection:        11941      6.61%


    RESPONSE TIMES     AVG.     MAX.    90TH

    Light             0.309    6.560    0.400
    Medium            0.323    6.529    0.400
    DSS               0.268    6.327    0.020
    Heavy             0.360    6.675    3.000
    Connections       0.274    6.359    0.400
    Number of users = 60
    Sum of Avg. RT * TPS for all Tx. Types = 32.845712


    FSM no WAL last version
    -----------------------
    MQThL (Maximum Qualified Throughput LIGHT): 1207.92 tpm
    MQThM (Maximum Qualified Throughput MEDIUM): 2611.84 tpm
    MQThH (Maximum Qualified Throughput HEAVY): 2177.68 tpm


    TRANSACTION MIX

    Total number of transactions = 182222
    TYPE           TX. COUNT     MIX
    ----           ---------     ------
    Light:             30198     16.57%
    Medium:            65296     35.83%
    DSS:               20118     11.04%
    Heavy:             54442     29.88%
    Connection:        12168      6.68%


    RESPONSE TIMES     AVG.     MAX.    90TH

    Light             0.301    6.106    0.400
    Medium            0.315    6.130    0.400
    DSS               0.261    5.977    0.020
    Heavy             0.361    6.220    3.000
    Connections       0.260    6.044    0.400
    Number of users = 60
    Sum of Avg. RT * TPS for all Tx. Types = 32.696832


    -----------------------------------------

    I don't see any big difference. Throughput is similar. Only response
    time seems to be better with your last FSM version.

    I'm personally happy with the performance.

    Zdenek
  • Heikki Linnakangas at Sep 26, 2008 at 1:20 pm

    Zdenek Kotala wrote:
    I performed performance test (iGEN) on SUN x4600 with 60 concurrent
    users and see result:

    ...

    I don't see any big difference. Throughput is similar. Only response
    time seems to be better with your last FSM version.

    I'm personally happy with the performance.
    Thanks! I've been running DBT-2 tests myself, and I'm not seeing any
    difference there either:

    testno    TPM     NO90    time   comment
    36        1405    1.724   6h     fsm-nowal
    35        1421    0.761   6h     CVS HEAD
    34        1439    1.066   2h     fsm-nowal
    33        1442    0.868   2h     CVS HEAD

    NO90 is the 90th percentile of the New Order transaction response time.
    There's a big variation there in the 6h tests, because of a huge dip in
    performance near the end of the test, both with and without the patch. I
    don't have an explanation for the dips, but they throw off the response
    times for the whole tests. Given that the 2h tests, which are exactly the
    same as the 6h tests, just shorter, show no degradation in the response
    times, I'm not worried about it.

    I don't have access to the site where I used to publish the test results
    earlier, but let me know if you want to see the full test results, and
    I'll try to zip them up and FTP somewhere (~500 MB uncompressed).

    I've also tried various pgbench tests, on a RAM disk and otherwise, as
    well as the "table population" test I ran earlier, and don't see any
    difference in performance.

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Zdenek Kotala at Sep 26, 2008 at 1:44 pm

    Heikki Linnakangas wrote:
    Zdenek Kotala wrote: <snip>
    I've also tried various pgbench tests, on a RAM disk and otherwise, as
    well as the "table population" test I ran earlier, and don't see any
    difference in performance.
    I think the performance is OK.

    Zdenek
