Attached is an updated version of the 'hint bit cache'.

What's changed:
*) 'bucket' was changed to 'page' everywhere
*) the rollup array now gets added to during 'set', not 'get' (pretty
dumb the way it was before -- it wasn't really dealing with non-commit
bits yet); see the sketch just after this list
*) more source comments, including a description of the cache in the intro
*) now caching 'invalid' bits.
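
To make the rollup change concrete, here's a rough sketch of the 'set'
path (the names, sizes, and single-page layout are all invented for
illustration -- not necessarily how the patch does it):

    #include <stdint.h>

    typedef uint32_t TransactionId;

    #define ROLLUP_SIZE 100                 /* invented constant */

    static TransactionId page_base = 0;     /* first xid the cached page covers */
    static uint8_t page_bits[8192];         /* commit bits, 1 bit per xid */
    static TransactionId rollup[ROLLUP_SIZE];
    static int rollup_count = 0;

    static void
    hbcache_set_commit_bit(TransactionId xid)
    {
        /* set the bit if the xid falls on the cached page */
        if (xid - page_base < sizeof(page_bits) * 8)
            page_bits[(xid - page_base) / 8] |= 1 << ((xid - page_base) % 8);

        /* the change: the rollup array is fed here, during 'set' --
         * 'get' no longer touches it (a full array triggers the
         * refresh discussed later in the thread) */
        if (rollup_count < ROLLUP_SIZE)
            rollup[rollup_count++] = xid;
    }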

I went back and forth several times on whether to store invalid bits
in the same cache, a separate cache, or not at all. I finally settled
on storing them in the same cache, which has some pros and cons. It
makes the cache more or less exactly like the clog cache (so I could
copy/paste some code out from there), but adds some overhead because 2
bit addressing is more expensive than 1 bit addressing -- this is
showing up in profiling... I'm upping the estimate of cpu bound scan
overhead from 1% to 2%. Still fairly cheap, but I'm running into the
edge of where I can claim the cache is 'free' for most workloads --
any claim is worthless without real world testing though. Of course,
if tuple hint bits are set or PD_ALL_VISIBLE is set, you don't have to
pay that price.
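
For reference, the difference between the two addressing schemes comes
down to something like this (macro names are mine, modeled on the ones
in clog.c):

    /* 1 bit per xid: commit status only */
    #define XACTS_PER_BYTE      8
    #define XidToByte(xid)      ((xid) / XACTS_PER_BYTE)
    #define XidToBit(xid)       ((xid) % XACTS_PER_BYTE)
    #define GetCommitBit(b, xid) \
        (((b)[XidToByte(xid)] >> XidToBit(xid)) & 0x01)

    /* 2 bits per xid, clog style: commit + invalid.  Each cache page
     * now covers half as many xids and every lookup does a little more
     * arithmetic -- that's where the extra ~1% cpu shows up. */
    #define BITS_PER_XACT       2
    #define XACTS_PER_BYTE_2    4
    #define XidToByte2(xid)     ((xid) / XACTS_PER_BYTE_2)
    #define XidToShift2(xid)    (((xid) % XACTS_PER_BYTE_2) * BITS_PER_XACT)
    #define GetStatusBits(b, xid) \
        (((b)[XidToByte2(xid)] >> XidToShift2(xid)) & 0x03)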

What's not:
*) Haven't touched any satisfies routine besides
HeapTupleSatisfiesMVCC (should they be?)
*) Haven't pushed the cache data into CacheMemoryContext. I figure
this is the way to go, but it requires an extra 'if' on every cache
'get' (see the sketch after this list).
*) Didn't abstract the clog bit addressing macros. I'm leaning toward
not doing this, but maybe they should be abstracted. My reasoning is
that there is no requirement that hint bit cache pages be whatever the
block size is, and I'd like to reserve the ability to adjust the cache
page size independently.
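
The 'if' in question would look something like this (the hbcache_*
names and sizes are hypothetical; CacheMemoryContext and
MemoryContextAllocZero are the real backend facilities):

    #include "postgres.h"
    #include "utils/memutils.h"

    #define HBCACHE_BYTES 8192              /* invented size */

    static TransactionId hbcache_base = 0;  /* first xid covered */
    static char *hbcache_bits = NULL;

    static inline bool
    hbcache_get_commit_bit(TransactionId xid)
    {
        if (hbcache_bits == NULL)           /* the extra branch on every 'get' */
            hbcache_bits = MemoryContextAllocZero(CacheMemoryContext,
                                                  HBCACHE_BYTES);

        if (xid - hbcache_base >= HBCACHE_BYTES * 8)
            return false;                   /* xid not on the cached page */

        return (hbcache_bits[(xid - hbcache_base) / 8] >>
                ((xid - hbcache_base) % 8)) & 1;
    }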

I'd like to know if this is a strategy that merits further work... if
anybody has time/interest, that is. It's getting close to the point
where I can just post it to the commitfest for review. In
particular, I'm concerned about whether Tom's earlier objections can
be satisfied. If not, it's back to the drawing board...

merlin


  • Simon Riggs at May 10, 2011 at 4:59 pm

    On Mon, May 9, 2011 at 5:12 PM, Merlin Moncure wrote:

    I'd like to know if this is a strategy that merits further work... if
    anybody has time/interest, that is.  It's getting close to the point
    where I can just post it to the commitfest for review.  In
    particular, I'm concerned about whether Tom's earlier objections can
    be satisfied. If not, it's back to the drawing board...
    I'm interested in what you're doing here.

    From here, there are quite a lot of tuning possibilities. It would be
    very useful to be able to define some metrics we are interested in
    reducing and work out how to measure them.

    That way we can compare all the different variants of this to see
    which way of doing things works best.

    --
    Simon Riggs                   http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training & Services
  • Merlin Moncure at May 10, 2011 at 5:33 pm

    On Tue, May 10, 2011 at 11:59 AM, Simon Riggs wrote:
    On Mon, May 9, 2011 at 5:12 PM, Merlin Moncure wrote:

    I'd like to know if this is a strategy that merits further work... if
    anybody has time/interest, that is.  It's getting close to the point
    where I can just post it to the commitfest for review.  In
    particular, I'm concerned about whether Tom's earlier objections can
    be satisfied. If not, it's back to the drawing board...
    I'm interested in what you're doing here.

    From here, there are quite a lot of tuning possibilities. It would be
    very useful to be able to define some metrics we are interested in
    reducing and work out how to measure them.

    That way we can compare all the different variants of this to see
    which way of doing things works best.
    thanks for that! I settled on this approach because the downside
    cases should hopefully be pretty limited. The upside is a matter of
    debate although fairly trivial to demonstrate synthetically.

    I'm looking for some way of benchmarking the benefits in non-simulated
    fashion. We desperately need something like a performance farm (as
    many many others have mentioned).

    merlin
  • Merlin Moncure at May 11, 2011 at 3:38 pm

    On Tue, May 10, 2011 at 11:59 AM, Simon Riggs wrote:
    On Mon, May 9, 2011 at 5:12 PM, Merlin Moncure wrote:

    I'd like to know if this is a strategy that merits further work... if
    anybody has time/interest, that is.  It's getting close to the point
    where I can just post it to the commitfest for review.  In
    particular, I'm concerned about whether Tom's earlier objections can
    be satisfied. If not, it's back to the drawing board...
    I'm interested in what you're doing here.

    From here, there are quite a lot of tuning possibilities. It would be
    very useful to be able to define some metrics we are interested in
    reducing and work out how to measure them.
    Following are results that are fairly typical of the benefits you
    might see when the optimization kicks in. The attached benchmark just
    creates a bunch of records in a random table and scans it. This is
    more or less the scenario that causes people to gripe about hint bit
    i/o, especially in systems that are already under moderate to heavy
    i/o stress. I'm gonna call it 20%, although it could be less if
    you have an i/o system that spanks the test (try cranking -c and the
    # of records created in bench.sql in that case). Anecdotal reports of
    extreme duress caused by hint bit i/o suggest problematic or mixed
    use (OLTP + OLAP) workloads might see even more benefit. One thing I
    need to test is how much benefit you'll see with wider records.

    I think I'm gonna revert the change to cache invalid bits. I just
    don't see hint bits as a major contributor to dead tuples following
    epic rollbacks (really, the solution for that case is simply to try
    not to get into that scenario if you can). This will put the code
    back to the cheaper and simpler bit-per-transaction addressing. What
    I do plan to do, though, is to check and set xmax commit bits in the
    cache... that way deleted tuples will see the cache benefits, as in
    the sketch below.
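
    Roughly what I have in mind for the xmax side, as it might slot into
    HeapTupleSatisfiesMVCC -- the hbcache_* calls are my hypothetical
    cache API, while SetHintBits and TransactionIdDidCommit are the
    existing routines in tqual.c and transam.c:

        if (!(tuple->t_infomask & HEAP_XMAX_COMMITTED))
        {
            TransactionId xmax = HeapTupleHeaderGetXmax(tuple);

            if (hbcache_get_commit_bit(xmax))
            {
                /* cache hit: set the hint in the in-memory copy only;
                 * the buffer is not dirtied, which is the write i/o
                 * the cache is meant to save */
                tuple->t_infomask |= HEAP_XMAX_COMMITTED;
            }
            else if (TransactionIdDidCommit(xmax))
            {
                /* fault through to clog as HEAD does today, and prime
                 * the cache so later tuples with this xmax hit it */
                SetHintBits(tuple, buffer, HEAP_XMAX_COMMITTED, xmax);
                hbcache_set_commit_bit(xmax);
            }
        }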

    [hbcache]
    merlin@mmoncure-ubuntu:~$ time pgbench -c 4 -n -T 200 -f bench.sql
    transaction type: Custom query
    scaling factor: 1
    query mode: simple
    number of clients: 4
    number of threads: 1
    duration: 200 s
    number of transactions actually processed: 8
    tps = 0.037167 (including connections establishing)
    tps = 0.037171 (excluding connections establishing)

    real 3m35.549s
    user 0m0.008s
    sys 0m0.004s

    [HEAD]
    merlin@mmoncure-ubuntu:~$ time pgbench -c 4 -n -T 200 -f bench.sql
    transaction type: Custom query
    scaling factor: 1
    query mode: simple
    number of clients: 4
    number of threads: 1
    duration: 200 s
    number of transactions actually processed: 8
    tps = 0.030313 (including connections establishing)
    tps = 0.030317 (excluding connections establishing)

    real 4m24.216s
    user 0m0.000s
    sys 0m0.012s
  • Merlin Moncure at May 11, 2011 at 4:44 pm

    On Wed, May 11, 2011 at 10:38 AM, Merlin Moncure wrote:
    One thing I need to test is how much benefit you'll see with wider records.
    The results are a bit better, around 25% using a similar methodology
    on ~ 1k wide records.
    I think I'm gonna revert the change to cache invalid bits. I just
    don't see hint bits as a major contributor to dead tuples following
    epic rollbacks
    what I meant to say here was, I don't see hint bit i/o following
    rollbacks as a major issue. Point being, I don't see much use in
    optimizing management of INVALID tuple bits beyond what is already
    done.

    Anyways, demonstrating a 'good' case is obviously not the whole story.
    But what are the downsides? There are basically two:

    1) tiny cpu penalty on every heap fetch
    2) extremely widely dispersed (in terms of transaction id) unhinted
    tuples can force the cache to refresh every 100 tuples in the absolute
    worst case. A cache refresh is a 100 int sort and a loop.

    For '1', the absolute worst case I can come up with -- cpu bound
    scans of extremely narrow records -- the overall cpu usage goes up
    around 1%. '2' seems just about impossible to see in the real world
    -- and if it does happen, you are also paying for lots of clog
    lookups all the way through the slru, and having i/o and other
    issues on top of that. Even if all the stars align and it does
    happen, all the tuples get hinted and dirtied anyway, so it will
    happen at most once on that particular set of data.
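
    For the record, the refresh in '2' really is just a sort and a loop,
    something like this (names and sizes invented, not lifted from the
    patch):

        #include <stdint.h>
        #include <stdlib.h>

        typedef uint32_t TransactionId;

        #define ROLLUP_SIZE    100      /* xids remembered between refreshes */
        #define XACTS_PER_PAGE 65536    /* 8192 bytes at 1 bit per xid */

        static int
        xid_cmp(const void *a, const void *b)
        {
            TransactionId xa = *(const TransactionId *) a;
            TransactionId xb = *(const TransactionId *) b;

            return (xa > xb) - (xa < xb);
        }

        /* pick the cache page covering the most recently seen xids */
        static TransactionId
        hbcache_choose_page(TransactionId *rollup, int n)
        {
            TransactionId best_page = 0, cur_page = 0;
            int best = 0, run = 0;

            qsort(rollup, n, sizeof(TransactionId), xid_cmp); /* the sort... */

            for (int i = 0; i < n; i++)                       /* ...and the loop */
            {
                TransactionId page = rollup[i] / XACTS_PER_PAGE;

                if (i == 0 || page != cur_page)
                {
                    cur_page = page;
                    run = 0;
                }
                if (++run > best)
                {
                    best = run;
                    best_page = cur_page;
                }
            }
            return best_page;   /* caller loads this page's bits from clog */
        }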

    merlin
  • Simon Riggs at May 11, 2011 at 5:40 pm

    On Wed, May 11, 2011 at 4:38 PM, Merlin Moncure wrote:

    Following are results that are fairly typical of the benefits you
    might see when the optimization kicks in.  The attached benchmark just
    [hbcache]
    real    3m35.549s
    [HEAD]
    real    4m24.216s
    These numbers look very good. Thanks for responding to my request.

    What people have said historically at this point is "ah, but you've
    just deferred the pain from clog lookups".

    The best way to show this does what we hope is to run a normal-ish
    OLTP access to the table that would normally thrash the clog and show
    no ill effects there either. Run that immediately after the above
    tests so that the cache and hint bits are both primed.

    --
    Simon Riggs                   http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training & Services
  • Merlin Moncure at May 11, 2011 at 6:28 pm

    On Wed, May 11, 2011 at 12:40 PM, Simon Riggs wrote:
    On Wed, May 11, 2011 at 4:38 PM, Merlin Moncure wrote:

    Following are results that are fairly typical of the benefits you
    might see when the optimization kicks in.  The attached benchmark just
    [hbcache]
    real    3m35.549s
    [HEAD]
    real    4m24.216s
    These numbers look very good. Thanks for responding to my request.

    What people have said historically at this point is "ah, but you've
    just deferred the pain from clog lookups".
    Deferred, or eliminated. If any tuple on the page gets updated,
    deleted, etc., or the table itself is dropped, then you've
    'won'... the page with the hint-bit-only change was never booted out
    to the heap before another substantive change happened. This is
    exactly what happens in certain common workloads -- you insert a
    bunch of records, scan them with some batch process, then delete
    them. Let's say a million records were inserted under a single
    transaction and you are getting bounced out of the transam.c cache:
    you just made a million calls to TransactionIdIsCurrentTransactionId
    and (especially) TransactionIdIsInProgress for the *exact same*
    transaction id, over and over. That stuff adds up even before
    looking at the i/o incurred.

    Put another way, the tuple hint bits have a lot of usefulness when the
    tuples on the page are coming from all kinds of differently aged
    transactions. When all the tuples have the same or similar xid, the
    information value is quite low, and the i/o price isn't worth it. The
    cache neatly haircuts the downside case. If the cache isn't helping
    (any tuple fetch on the page faults through it), the page is dirtied
    and the next time it's fetched all the bits will be set.
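
    To put a little code behind 'deferred, or eliminated': on a cache
    hit, the xmin check for each of those million same-xid tuples
    collapses to a single memory probe (hbcache_get_commit_bit is the
    same hypothetical call as in the earlier sketch):

        if (hbcache_get_commit_bit(HeapTupleHeaderGetXmin(tuple)))
        {
            /* one memory probe replaces TransactionIdIsInProgress +
             * TransactionIdDidCommit + the eventual clog lookup; the
             * hint is set without dirtying the buffer */
            tuple->t_infomask |= HEAP_XMIN_COMMITTED;
        }
        else
        {
            /* fault through: the normal HEAD path runs, the page is
             * dirtied when the hint is set, and the next fetch of the
             * page sees all the bits already in place */
        }
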
    The best way to show this does what we hope is to run a normal-ish
    OLTP access to the table that would normally thrash the clog and show
    no ill effects there either. Run that immediately after the above
    tests so that the cache and hint bits are both primed.
    yeah. The only easy way I know of to do this is extremely long
    pgbench runs, and getting good results is harder than it sounds...
    if the tuple hint bits make it to disk (via vacuum or a cache
    fault), they stay there and that tuple is no longer interesting from
    the cache point of view.

    If you make the scale really large the test will take forever just
    to get the tables primed (like a week). Keep in mind, autovacuum can
    roll around at any time and set the bits under you (you can of
    course disable it, but who really does that on OLTP?). Small scale
    OLTP tests are not real world realistic because anybody sane would
    just let autovacuum loose on the table. Clog-thrashing systems are
    typically mature, high load OLTP databases... not fun to test on
    your single 7200 rpm drive.

    I'm going to boldly predict that with all the i/o flying around in
    cases like that, the paltry cpu cycles spent dealing with the cache
    are the least of your problems. Not discounting the need to verify
    that though.

    merlin
