Whilst looking around for stuff that could be deleted thanks to removing
old-style VACUUM FULL, I came across some code in btree that seems
rather seriously buggy. For reasons explained in nbtree/README, we
can't physically recycle a "deleted" btree index page until all
transactions open at the time of deletion are gone --- otherwise we
might re-use a page that an existing scan is about to land on, and
confuse that scan. (This condition is overly strong, of course, but
it's what's designed in at the moment.) The way this is implemented
is to label a freshly-deleted page with the current value of
ReadNewTransactionId(). Once that value is older than RecentXmin,
the page is presumed recyclable.

I think this was all right when it was designed, but isn't it rather
badly broken by our subsequent changes to have transactions not take
out an XID until/unless they write something? A read-only transaction
could easily be much older than RecentXmin, no?

The odds of an actual problem seem not very high, since to be affected
a scan would have to be already "in flight" to the problem page when
the deletion occurs. By the time RecentXmin advances and we feed the
page to the FSM and get it back, the scan's almost surely going to have
arrived. And I think the logic is such that this would not happen
before the next VACUUM in any case. Still, it seems pretty bogus.

Another issue is that it's not clear what happens in a Hot Standby
slave --- it doesn't look like Simon put any interlocking in this
area to protect slave queries against having the page disappear
from under them. The odds of an actual problem are probably a
good bit higher in an HS slave.

And there's another problem: _bt_pagedel is designed to recurse
in certain improbable cases, but I think this is flat out wrong
when doing WAL replay --- if the original process did recurse
then it will have emitted a WAL record for each deleted page,
meaning replay would try to delete twice.

That last problem is easy to fix, but I'm not at all sure what to do
about the scan interlock problem. Thoughts?

regards, tom lane

  • Tom Lane at Feb 8, 2010 at 2:59 am

    I wrote:
    > And there's another problem: _bt_pagedel is designed to recurse
    > in certain improbable cases, but I think this is flat out wrong
    > when doing WAL replay --- if the original process did recurse
    > then it will have emitted a WAL record for each deleted page,
    > meaning replay would try to delete twice.

    No, scratch that, I misread it: _bt_pagedel isn't invoked during
    WAL replay, but for cleanup of incomplete deletions at termination
    of WAL replay. So any recursing it has to do also corresponds to
    actions that weren't in WAL. So that's OK.

    I'm still concerned about the interlock against read-only
    transactions though.

    regards, tom lane
  • Heikki Linnakangas at Feb 8, 2010 at 12:02 pm

    Tom Lane wrote:
    > Whilst looking around for stuff that could be deleted thanks to removing
    > old-style VACUUM FULL, I came across some code in btree that seems
    > rather seriously buggy. For reasons explained in nbtree/README, we
    > can't physically recycle a "deleted" btree index page until all
    > transactions open at the time of deletion are gone --- otherwise we
    > might re-use a page that an existing scan is about to land on, and
    > confuse that scan. (This condition is overly strong, of course, but
    > it's what's designed in at the moment.) The way this is implemented
    > is to label a freshly-deleted page with the current value of
    > ReadNewTransactionId(). Once that value is older than RecentXmin,
    > the page is presumed recyclable.
    >
    > I think this was all right when it was designed, but isn't it rather
    > badly broken by our subsequent changes to have transactions not take
    > out an XID until/unless they write something? A read-only transaction
    > could easily be much older than RecentXmin, no?

    Yeah :-(.

    > The odds of an actual problem seem not very high, since to be affected
    > a scan would have to be already "in flight" to the problem page when
    > the deletion occurs. By the time RecentXmin advances and we feed the
    > page to the FSM and get it back, the scan's almost surely going to have
    > arrived. And I think the logic is such that this would not happen
    > before the next VACUUM in any case. Still, it seems pretty bogus.

    One idea is to change the traversal logic slightly, so that whenever you
    follow a pointer from page A to B, you keep A pinned until you've pinned
    B. Then we could just take a cleanup lock on the left, right and parent of
    the empty page, one at a time, to ensure that no one is in flight to it,
    and recycle it immediately. However, forward scans screw that up. When a
    forward scan reads a page, it saves its right pointer at the same time,
    so that if the page is subsequently split, it doesn't process tuples on
    the new right half again. Even if we take a cleanup lock on the left
    sibling of an empty page, there can be scans further left that have a
    direct pointer to the empty page. It could be salvaged if we require a
    forward scan to pin the next page too when it saves the right pointer,
    but that seems inefficient.

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Simon Riggs at Feb 11, 2010 at 11:19 am

    On Sun, 2010-02-07 at 21:33 -0500, Tom Lane wrote:
    > Another issue is that it's not clear what happens in a Hot Standby
    > slave --- it doesn't look like Simon put any interlocking in this
    > area to protect slave queries against having the page disappear
    > from under them. The odds of an actual problem are probably a
    > good bit higher in an HS slave.

    Seen.

    --
    Simon Riggs www.2ndQuadrant.com
  • Simon Riggs at Feb 11, 2010 at 2:32 pm

    On Sun, 2010-02-07 at 21:33 -0500, Tom Lane wrote:
    > That last problem is easy to fix, but I'm not at all sure what to do
    > about the scan interlock problem. Thoughts?

    AFAICS the problem doesn't exist in normal running.
    _bt_page_recyclable() tests against RecentXmin, which includes the xmins
    of read-only transactions. So it doesn't matter if a read-only
    transaction still exists that is earlier than the value of
    opaque->btpo.xact when it is set. If it is still there later, then the
    page cannot be reused.

    A basic interlock approach can be put in place for Hot Standby. We just
    WAL-log the reuse of a btree page in _bt_getbuf(), just before we
    _bt_pageinit(), using the transaction id that took that action. We can
    then conflict on that xid.

    --

    For the TODO, I'm thinking whether there's a way to allow the page to be
    reused earlier and have it all just work. That would allow us to recycle
    index blocks faster and avoid index bloat from occurring in the presence
    of long lived transactions. Otherwise fixing this for the normal case
    will accentuate index bloat.

    It seems possible that a page can be reused and end up at exactly the
    same place in the index key space, so that the left link of the new page
    matches the right link of the page the scan just left. Most likely it
    would be in a different place entirely and so ignoring the issue will
    cause scans to potentially stop earlier than they should and we give an
    incomplete answer to a query. So we can't just re-check links to
    validate the page.

    The only thing we actually need to record about the old page is the
    right link, so perhaps we can store the right link value in a central
    place, together with visibility information. Make that info WAL-logged
    so it is available on standby also. That would allow us to find out
    whether we should read the page or use the right link info to move
    right.

    We then store a recycled-by transaction id on the new page we are
    recycling. When we scan onto a new page we check to see whether the page
    has been recycled by a transaction that we consider still in progress.
    If so, we consult the page-visibility info to see what the right link of
    the page was as far as our scan is concerned, then use that to continue
    our scan.

    --
    Simon Riggs www.2ndQuadrant.com
