I'm looking at how to make queries safe during recovery, in the presence
of concurrent changes to blocks. In particular, concurrent removal of
rows that might be read by queries.

My thinking is
* we ignore LockRelationForExtension(). Normal queries never request it.
All new blocks were created with that lock held and we are replaying
changes serially, so we do not need to re-create that lock. We already
do this, so no change.
* re-create the Cleanup lock on blocks, when the original operation was
performed while a Cleanup lock was held.

So the plan is to introduce a new XLogLockBufferForCleanup() function
and then use it in all places where a cleanup lock was held during the
write of the WAL record.
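
For illustration, a minimal sketch of what such a function might look like,
building on the existing LockBufferForCleanup(); this is only a sketch of the
shape, not the final code:

#include "postgres.h"
#include "storage/bufmgr.h"

/*
 * Sketch only: acquire a cleanup lock on a buffer during WAL replay.
 * The startup process is the only writer during recovery, so we just
 * wait until no other backend holds a pin on the buffer, exactly as
 * LockBufferForCleanup() does for VACUUM on the master.
 */
void
XLogLockBufferForCleanup(Buffer buffer)
{
    Assert(BufferIsValid(buffer));
    LockBufferForCleanup(buffer);
}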

This means we will need to hold cleanup lock:

* while RM_HEAP2_ID records are applied (Freeze, Clean, CleanMove)

* while applying an XLOG_BTREE_DELETE record that was generated by
VACUUM. This record type is not always generated by VACUUM, so split
this WAL record into two types, XLOG_BTREE_DELETE and XLOG_BTREE_VACUUM,
so we can hold the Cleanup lock while applying XLOG_BTREE_VACUUM; a
sketch of the replay-side split appears after this list. (This may not
be required, but I'd rather do the full locking now and relax it later.)

* Whenever we apply a backup block that performs the same function as
any of the above WAL records. So we would hold Cleanup lock when
applying WAL records of types
all RM_HEAP2_ID types
XLOG_BTREE_VACUUM
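
To illustrate the replay-side split mentioned above, the b-tree redo
dispatch might end up looking roughly like this (a sketch only;
btree_xlog_vacuum() and btree_xlog_delete() stand for the per-record redo
helpers):

#include "postgres.h"
#include "access/nbtree.h"
#include "access/xlog.h"

/* assumed per-record redo helpers */
extern void btree_xlog_vacuum(XLogRecPtr lsn, XLogRecord *record);
extern void btree_xlog_delete(XLogRecPtr lsn, XLogRecord *record);

/*
 * Sketch of the redo dispatch after the record split: only the
 * VACUUM-generated variant is replayed under a cleanup lock.
 */
void
btree_redo(XLogRecPtr lsn, XLogRecord *record)
{
    uint8       info = record->xl_info & ~XLR_INFO_MASK;

    switch (info)
    {
        case XLOG_BTREE_VACUUM:
            btree_xlog_vacuum(lsn, record);     /* holds the cleanup lock */
            break;
        case XLOG_BTREE_DELETE:
            btree_xlog_delete(lsn, record);     /* exclusive lock is enough */
            break;
        /* ... all other b-tree record types are handled as before ... */
    }
}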

I'm most of the way through implementing the above and will post patch
as a separate item to make it easier to inspect.

Other aspects:

* For GIN indexes, we appear to not hold a Cleanup lock during
vacuuming, except on root page. That stops new scans from starting, but
it doesn't prevent progress of concurrent scans. Doesn't look correct to
me... so not sure what strength lock to acquire in each case. Probably
need to differentiate between WAL record types so we can tell which to
acquire CleanupLock for and which not.

* GIST doesn't use Cleanup locks at all. So I'm very unclear here.

Teodor has mentioned that it should be OK for GIST/GIN. Can I double
check that based upon my inspection of the code?

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

  • Gregory Stark at Oct 20, 2008 at 1:24 pm

    Simon Riggs writes:

    * re-create the Cleanup lock on blocks, when the original operation was
    performed while a Cleanup lock was held.

    So the plan is to introduce a new XLogLockBufferForCleanup() function
    and then use it in all places where a cleanup lock was held during the
    write of the WAL record.
    I'm not sure I'm following what you're talking about. Are you saying the slave
    will have to stop when it reaches a cleanup log entry and wait until it can
    obtain the exclusive lock with no other pins?

    That sounds like the right thing though it could cause a long pause in WAL
    recovery. Perhaps we should plan on being able to kill other processes holding
    pins on the buffer just as we discussed killing transactions holding up
    advancing the globalxmin.

    --
    Gregory Stark
    EnterpriseDB http://www.enterprisedb.com
    Ask me about EnterpriseDB's 24x7 Postgres support!
  • Simon Riggs at Oct 20, 2008 at 1:40 pm

    On Mon, 2008-10-20 at 14:23 +0100, Gregory Stark wrote:
    Simon Riggs <simon@2ndQuadrant.com> writes:
    * re-create the Cleanup lock on blocks, when the original operation was
    performed while a Cleanup lock was held.

    So the plan is to introduce a new XLogLockBufferForCleanup() function
    and then use it in all places where a cleanup lock was held during the
    write of the WAL record.
    I'm not sure I'm following what you're talking about. Are you saying the slave
    will have to stop when it reaches a cleanup log entry and wait until it can
    obtain the exclusive lock with no other pins?
    Yes.
    That sounds like the right thing though it could cause a long pause in WAL
    recovery.
    It could.
    Perhaps we should plan on being able to kill other processes holding
    pins on the buffer just as we discussed killing transactions holding up
    advancing the globalxmin.
    I'll copy my notes straight out of the Wiki:

    We can't cancel a query that holds a pin on a particular block since we
    don't keep track of who holds a pin. We just know a backend has a pin
    and that the startup process must wait.

    This may become a problem, or it may not. In most cases, backends hold
    5-15 pins concurrently and pins are held for relatively short times.
    The startup process will provide debug information when pin hold times become
    extended, and for monitoring the total delay from pin waits.

    Some strategies, if this becomes a problem:

    * minimise pin hold times wherever this occurs
    * deferring application of WAL for pinned blocks (requiring us to
    use conditional locking; see the sketch after this list)
    * making VACUUM FREEZE hold stronger locks on standby to prevent
    query access (but doesn't help HOT)
    * ensuring that we perform read ahead I/O for WAL records that
    have not yet reached head of apply queue
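
    The conditional-locking idea above might look something like this sketch;
    ConditionalLockBufferForCleanup() is the obvious non-blocking counterpart
    of LockBufferForCleanup(), and both it and the deferral helper should be
    treated as assumptions here, not existing code:

    #include "postgres.h"
    #include "access/xlogdefs.h"
    #include "storage/bufmgr.h"

    /*
     * Sketch: try for the cleanup lock without blocking.  If some backend
     * still holds a pin, defer this record rather than stalling recovery.
     * DeferWalRecordForBlock() is a hypothetical helper, not existing code.
     */
    extern void DeferWalRecordForBlock(Buffer buffer, XLogRecPtr lsn);

    static bool
    TryCleanupLockOrDefer(Buffer buffer, XLogRecPtr lsn)
    {
        if (ConditionalLockBufferForCleanup(buffer))
            return true;                        /* we now hold the only pin */

        DeferWalRecordForBlock(buffer, lsn);    /* hypothetical re-queue */
        return false;
    }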

    Most people on-list were comfortable with the idea of recovery waiting
    for queries to finish, up to a limit, re: discussion around
    max_standby_delay. So waiting for cleanup locks is just another source
    of potential wait time.

    I think it will take a while for people to come up with some user
    experience of whether this will be a pain or not.

    --
    Simon Riggs www.2ndQuadrant.com
    PostgreSQL Training, Services and Support
  • Teodor Sigaev at Oct 20, 2008 at 2:29 pm

    * For GIN indexes, we appear to not hold a Cleanup lock during
    vacuuming, except on root page. That stops new scans from starting, but
    it doesn't prevent progress of concurrent scans. Doesn't look correct to
    me... so not sure what strength lock to acquire in each case. Probably
    Why do you think so?
    need to differentiate between WAL record types so we can tell which to
    acquire CleanupLock for and which not.

    * GIST doesn't use Cleanup locks at all. So I'm very unclear here.
    The scan process does not hold any pointer to a page now; the insertion process
    does, but it tracks changes to the page by looking at the page's LSN.



    --
    Teodor Sigaev E-mail: teodor@sigaev.ru
    WWW: http://www.sigaev.ru/
  • Simon Riggs at Oct 20, 2008 at 2:54 pm

    On Mon, 2008-10-20 at 18:28 +0400, Teodor Sigaev wrote:
    * For GIN indexes, we appear to not hold a Cleanup lock during
    vacuuming, except on root page. That stops new scans from starting, but
    it doesn't prevent progress of concurrent scans. Doesn't look correct to
    me... so not sure what strength lock to acquire in each case. Probably
    Why do you think so?
    I have questions.

    I don't understand why in ginVacuumPostingTreeLeaves() we lock only the
    root page for Cleanup and no others. Why do we need to hold Cleanup lock
    on the root? If the index is concurrent safe for existing scans, why is
    it not safe for new scans also? And the converse: if it is not safe for
    new scans, why is it safe for existing scans?
    need to differentiate between WAL record types so we can tell which to
    acquire CleanupLock for and which not.

    * GIST doesn't use Cleanup locks at all. So I'm very unclear here.
    The scan process does not hold any pointer to a page now; the insertion process
    does, but it tracks changes to the page by looking at the page's LSN.
    Sounds good.

    --
    Simon Riggs www.2ndQuadrant.com
    PostgreSQL Training, Services and Support
  • Teodor Sigaev at Oct 20, 2008 at 4:16 pm

    I don't understand why in ginVacuumPostingTreeLeaves() we lock only the
    root page for Cleanup and no others. Why do we need to hold Cleanup lock
    on the root? If the index is concurrent safe for existing scans, why is
    it not safe for new scans also? And the converse: if it is not safe for
    new scans, why is it safe for existing scans?
    Because we wish to prevent concurrent inserts and page deletion, just to
    simplify the code. LockForCleanup guarantees that no insertion process is at
    work here (an inserter keeps the root buffer pinned for the whole insertion).
    As a side effect, new scan processes can't start either.

    Note that in most cases this keeps enough concurrency, because all of this
    concerns work on one tree in the GIN index. Usually there are a lot of such
    trees in an index - one for each lexeme, if we speak about a tsearch index. So
    there is room for improvement, but I don't believe it would give a big
    performance advantage in typical usage of GIN.

    --
    Teodor Sigaev E-mail: teodor@sigaev.ru
    WWW: http://www.sigaev.ru/
  • Simon Riggs at Oct 20, 2008 at 5:29 pm

    On Mon, 2008-10-20 at 20:12 +0400, Teodor Sigaev wrote:
    I don't understand why in ginVacuumPostingTreeLeaves() we lock only the
    root page for Cleanup and no others. Why do we need to hold Cleanup lock
    on the root? If the index is concurrent safe for existing scans, why is
    it not safe for new scans also? And the converse: if it is not safe for
    new scans, why is it safe for existing scans?
    Because we wish to prevent concurrent inserts and page deletion, just to
    simplify the code. LockForCleanup guarantees that no insertion process is at
    work here (an inserter keeps the root buffer pinned for the whole insertion).
    As a side effect, new scan processes can't start either.
    But does holding cleanup lock on root prevent an in-progress Insert from
    changing non-root pages? I assume so, just not sure how.
    Note that in most cases this keeps enough concurrency, because all of this
    concerns work on one tree in the GIN index. Usually there are a lot of such
    trees in an index - one for each lexeme, if we speak about a tsearch index. So
    there is room for improvement, but I don't believe it would give a big
    performance advantage in typical usage of GIN.
    I'm just worried about safety during Hot Standby, not trying to improve
    anything.

    --
    Simon Riggs www.2ndQuadrant.com
    PostgreSQL Training, Services and Support
  • Teodor Sigaev at Oct 20, 2008 at 5:48 pm

    But does holding cleanup lock on root prevent an in-progress Insert from
    changing non-root pages? I assume so, just not sure how.
    Yes, because the insertion process doesn't unpin the root page until the insert is done.
    --
    Teodor Sigaev E-mail: teodor@sigaev.ru
    WWW: http://www.sigaev.ru/
  • Simon Riggs at Oct 22, 2008 at 10:18 pm

    On Mon, 2008-10-20 at 13:54 +0100, Simon Riggs wrote:

    I'm looking at how to make queries safe during recovery, in the presence
    of concurrent changes to blocks. In particular, concurrent removal of
    rows that might be read by queries.

    My thinking is
    * we ignore LockRelationForExtension(). Normal queries never request it.
    All new blocks were created with that lock held and we are replaying
    changes serially, so we do not need to re-create that lock. We already
    do this, so no change.
    * re-create the Cleanup lock on blocks, when the original operation was
    performed while a Cleanup lock was held.

    So the plan is to introduce a new XLogLockBufferForCleanup() function
    and then use it in all places where a cleanup lock was held during the
    write of the WAL record.

    This means we will need to hold cleanup lock:

    * while RM_HEAP2_ID records are applied (Freeze, Clean, CleanMove)

    * while applying an XLOG_BTREE_DELETE record that was generated by
    VACUUM. This record type is not always generated by VACUUM, so split
    this WAL record into two types, XLOG_BTREE_DELETE and XLOG_BTREE_VACUUM,
    so we can hold the Cleanup lock while applying XLOG_BTREE_VACUUM. (This
    may not be required, but I'd rather do the full locking now and relax it
    later.)

    * Whenever we apply a backup block that performs the same function as
    any of the above WAL records. So we would hold Cleanup lock when
    applying WAL records of types
    all RM_HEAP2_ID types
    XLOG_BTREE_VACUUM

    I'm most of the way through implementing the above and will post patch
    as a separate item to make it easier to inspect.
    Here's the code to implement this portion.

    I'm undecided as to whether this is too much code or too little. I'm
    somewhat uncertain as to the locking requirements for the physical scan
    during a vacuum. I've worked out various options if we need to change
    this.

    Also, the way I take a cleanup lock during RestoreBkpBlocks() seems very
    ugly, but too much additional code seems over the top also.
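
    For reference, what that boils down to is choosing the lock strength when a
    backup block is restored; factored out, it could look like this (a sketch,
    not the patch itself):

    #include "postgres.h"
    #include "storage/bufmgr.h"

    /*
     * Sketch: lock a just-restored backup block.  The 'cleanup' flag would be
     * passed down by redo routines replaying VACUUM-style records
     * (RM_HEAP2_ID, XLOG_BTREE_VACUUM); everything else keeps the plain
     * exclusive lock.
     */
    static void
    LockRestoredBuffer(Buffer buffer, bool cleanup)
    {
        if (cleanup)
            LockBufferForCleanup(buffer);   /* wait until we hold the only pin */
        else
            LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
    }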

    I'm integrating into rest of patch now.

    --
    Simon Riggs www.2ndQuadrant.com
    PostgreSQL Training, Services and Support
  • Heikki Linnakangas at Oct 23, 2008 at 6:10 am

    Simon Riggs wrote:
    I'm undecided as to whether this is too much code or too little. I'm
    somewhat uncertain as to the locking requirements for the physical scan
    during a vacuum. I've worked out various options if we need to change
    this.
    For the B-tree, an exclusive lock is enough to modify the contents of
    the page. A cleanup lock needs to be taken on every page to ensure that
    the vacuum doesn't finish and delete a heap tuple that's about to be
    accessed by an index scan. That could be handled in WAL replay by
    dropping the exclusive lock and re-acquiring the cleanup lock, but we
    can't do that for heap vacuums.
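
    In outline, the b-tree redo routine would then do something like this
    (just a sketch of the locking sequence):

    #include "postgres.h"
    #include "storage/bufmgr.h"

    /*
     * Sketch: apply the record's changes under the usual exclusive lock,
     * then trade it for a cleanup lock so that no scan still has the page
     * pinned by the time we move on.
     */
    static void
    btree_apply_then_wait_for_pins(Buffer buffer)
    {
        LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
        /* ... remove the index tuples named in the WAL record ... */
        LockBuffer(buffer, BUFFER_LOCK_UNLOCK);     /* drop the lock, keep the pin */
        LockBufferForCleanup(buffer);               /* wait for other pins to drain */
        UnlockReleaseBuffer(buffer);
    }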

    However, we require that in b-tree vacuum, you take a cleanup lock on
    *every* leaf page of the index, not only those that you modify. That's a
    problem, because there's no trace of such pages in the WAL.

    PS. FSM doesn't need cleanup locks.

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Simon Riggs at Oct 23, 2008 at 8:55 am

    On Thu, 2008-10-23 at 09:09 +0300, Heikki Linnakangas wrote:

    However, we require that in b-tree vacuum, you take a cleanup lock on
    *every* leaf page of the index, not only those that you modify. That's a
    problem, because there's no trace of such pages in the WAL.
    OK, good. Thanks for the second opinion. I'm glad you said that, cos I
    felt sure anybody reading the patch would say "what the hell does this
    bit do?". Now I can add it.

    My solution is fairly simple:

    As we pass through the index we keep track of which blocks need
    visiting, then append that information onto the next WAL record. If the
    last block doesn't contain removed rows, then we send a no-op message
    saying which blocks to visit.

    I'd already invented the XLOG_BTREE_VACUUM record, so now we just need
    to augment it further with two fields: ordered array of blocks to visit,
    and a doit flag.

    Say we have a 10 block index, with rows to be removed on blocks 3, 4 and 8.
    As we visit all 10 blocks in sequence we would issue these WAL records:

    XLOG_BTREE_VACUUM block 3 visitFirst {1, 2} doit = true
    XLOG_BTREE_VACUUM block 4 visitFirst {} doit = true
    XLOG_BTREE_VACUUM block 8 visitFirst {5,6,7} doit = true
    XLOG_BTREE_VACUUM block 10 visitFirst {9} doit = false

    So that allows us to issue the same number of WAL messages yet include
    all the required information to repeat the process correctly.

    (The blocks can be visited out of sequence in some cases, hence the
    ordered array of blocks to visit rather than just a first block value).
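
    In struct form the augmented record might look roughly like this; the field
    names are invented for illustration and the final layout may well differ:

    #include "postgres.h"
    #include "storage/block.h"
    #include "storage/relfilenode.h"

    /*
     * Hypothetical layout of the proposed XLOG_BTREE_VACUUM record body.
     * visitFirst[] lists the blocks that must be cleanup-locked (and then
     * released) before 'block' is processed; doit = false marks the trailing
     * no-op record that exists only to visit the remaining blocks.
     */
    typedef struct xl_btree_vacuum
    {
        RelFileNode node;           /* the index being vacuumed */
        BlockNumber block;          /* block modified by this record */
        bool        doit;           /* actually remove tuples from 'block'? */
        uint16      nVisitFirst;    /* number of entries in visitFirst[] */
        BlockNumber visitFirst[1];  /* ordered, variable-length array */
    } xl_btree_vacuum;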

    It would also be possible to introduce a special tweak there, which is
    that if the block is not in cache, don't read it in at all. If it's not
    in cache we know that nobody has a pin on it, so we don't need to read it
    in just to say "got the lock". That's icing for later.

    --
    Simon Riggs www.2ndQuadrant.com
    PostgreSQL Training, Services and Support
  • Heikki Linnakangas at Oct 23, 2008 at 4:22 pm

    Simon Riggs wrote:
    It would also be possible to introduce a special tweak there, which is
    that if the block is not in cache, don't read it in at all. If it's not
    in cache we know that nobody has a pin on it, so we don't need to read it
    in just to say "got the lock". That's icing for later.
    Heh, that's clever :-). We could actually use a method like that to
    solve the whole problem. After the (replay of the) b-tree vacuum is
    finished, scan through all shared buffers, and get a vacuum lock on
    every page of that index that's in cache. Groveling through all shared
    buffers would be slower for small indexes, though.
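
    Roughly, I mean something along these lines; the buffer-manager and replay
    details are only approximate, so treat it as an outline rather than code:

    #include "postgres.h"
    #include "access/xlogutils.h"
    #include "storage/buf_internals.h"
    #include "storage/bufmgr.h"

    /*
     * Outline: after replaying the b-tree vacuum, take and release a cleanup
     * lock on every page of the index that is currently in shared buffers.
     * A real version would need proper locking around the descriptor checks.
     */
    static void
    cleanup_lock_cached_index_pages(RelFileNode rnode)
    {
        int         i;

        for (i = 0; i < NBuffers; i++)
        {
            BufferDesc *bufHdr = &BufferDescriptors[i];
            Buffer      buf;

            if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
                continue;

            /* pin the page through the normal replay path */
            buf = XLogReadBuffer(rnode, bufHdr->tag.blockNum, false);
            if (BufferIsValid(buf))
            {
                LockBuffer(buf, BUFFER_LOCK_UNLOCK);    /* keep only the pin */
                LockBufferForCleanup(buf);              /* wait out other pins */
                UnlockReleaseBuffer(buf);
            }
        }
    }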

    I believe the "vacuum lock every leaf page" behavior is only needed for
    system catalogs. You have other issues with those still, namely that
    table locks are not yet taken appropriately, so I'm not sure if this is
    worth worrying about until that's done.

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Simon Riggs at Oct 23, 2008 at 4:48 pm

    On Thu, 2008-10-23 at 19:21 +0300, Heikki Linnakangas wrote:
    Simon Riggs wrote:
    It would also be possible to introduce a special tweak there, which is
    that if the block is not in cache, don't read it in at all. If it's not
    in cache we know that nobody has a pin on it, so we don't need to read it
    in just to say "got the lock". That's icing for later.
    Heh, that's clever :-). We could actually use a method like that to
    solve the whole problem. After the (replay of the) b-tree vacuum is
    finished, scan through all shared buffers, and get a vacuum lock on
    every page of that index that's in cache. Groveling through all shared
    buffers would be slower for small indexes, though.
    Well, re-examining all of the assumptions in the code seems to have been
    worthwhile so far. I think that makes 4 significant tweaks that have
    come out of the Search For Hot Standby.
    I believe the "vacuum lock every leaf page" behavior is only needed for
    system catalogs.
    So we will still need it then. Which is good 'cos I just wrote it.
    You have other issues with those still, namely that
    table locks are not yet taken appropriately, so I'm not sure if this is
    worth worrying about until that's done.
    Please explain. If you know of a correctness issue, please say.

    --
    Simon Riggs www.2ndQuadrant.com
    PostgreSQL Training, Services and Support
  • Heikki Linnakangas at Oct 23, 2008 at 5:24 pm

    Simon Riggs wrote:
    On Thu, 2008-10-23 at 19:21 +0300, Heikki Linnakangas wrote:
    You have other issues with those still, namely that
    table locks are not yet taken appropriately, so I'm not sure if this is
    worth worrying about until that's done.
    Please explain. If you know of a correctness issue, please say.
    Things like DROP TABLE need to take an AccessExclusiveLock, and you
    don't have that yet, no?

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Simon Riggs at Oct 23, 2008 at 5:30 pm

    On Thu, 2008-10-23 at 20:24 +0300, Heikki Linnakangas wrote:
    Simon Riggs wrote:
    On Thu, 2008-10-23 at 19:21 +0300, Heikki Linnakangas wrote:
    You have other issues with those still, namely that
    table locks are not yet taken appropriately, so I'm not sure if this is
    worth worrying about until that's done.
    Please explain. If you know of a correctness issue, please say.
    Things like DROP TABLE need to take an AccessExclusiveLock, and you
    don't have that yet, no?
    Oh right, I thought you meant something not already in the plan.

    --
    Simon Riggs www.2ndQuadrant.com
    PostgreSQL Training, Services and Support
  • Simon Riggs at Nov 3, 2008 at 1:49 pm

    On Thu, 2008-10-23 at 09:57 +0100, Simon Riggs wrote:
    On Thu, 2008-10-23 at 09:09 +0300, Heikki Linnakangas wrote:

    However, we require that in b-tree vacuum, you take a cleanup lock on
    *every* leaf page of the index, not only those that you modify. That's a
    problem, because there's no trace of such pages in the WAL.
    OK, good. Thanks for the second opinion. I'm glad you said that, cos I
    felt sure anybody reading the patch would say "what the hell does this
    bit do?". Now I can add it.
    Heikki,

    When we discussed this before, I was glad that you'd mentioned that
    aspect since it allowed me to say "if two of us think that then it must
    be true".

    I didn't include that in the final patch because it felt wrong. I didn't
    have a rational explanation for that then, just a bad feeling. So, after
    lots of sleep, here's my rational explanation of why we do *not* need
    that during hot standby queries:

    VACUUM with a btree index proceeds like this:
    1. Scan table
    2. Remove rows from btree identified in (1)
    3. Remove rows from heap identified in (1)

    The purpose of the additional locking requirements during (2) for btrees
    is to ensure that we do not fail to find the rows identified in (1),
    because the rows can move after (1) and during (2) because of block
    splits.

    Requoting verbatim from the README: "The tricky part of this is to avoid
    missing any deletable tuples in the presence of concurrent page splits:
    a page split could easily move some tuples from a page not yet passed
    over by the sequential scan to a lower-numbered page already passed
    over." In recovery there are no concurrent page splits and the WAL
    records represent already successfully identified deletable tuples.

    On a standby server the rows will not move other than via WAL records.
    So there is no possibility that a WAL record will fail to find the row
    it was looking for. On the master we were looking for a tuple that
    pointed to a htid, whereas in WAL replay we look directly at the index
    tuple via its tid, not via the htid it points to. Therefore we do not
    need the additional locking.

    That seems logical to me, so I will leave that out.

    Any alternative views?

    --
    Simon Riggs www.2ndQuadrant.com
    PostgreSQL Training, Services and Support
  • Tom Lane at Nov 3, 2008 at 3:08 pm

    Simon Riggs writes:
    VACUUM with a btree index proceeds like this:
    1. Scan table
    2. Remove rows from btree identified in (1)
    3. Remove rows from heap identified in (1)
    The purpose of the additional locking requirements during (2) for btrees
    is to ensure that we do not fail to find the rows identified in (1),
    because the rows can move after (1) and during (2) because of block
    splits.
    No, you are missing the point. One purpose of the additional locking
    requirements is to ensure that there is not a concurrent process that
    has read a btree index entry just before you removed it but arrives at
    the heap page only after you removed the heap entry (and, perhaps,
    replaced it with some other row that doesn't match the index entry at
    all). This is clearly still a risk in a hot-standby environment.

    regards, tom lane
  • Simon Riggs at Nov 3, 2008 at 4:01 pm

    On Mon, 2008-11-03 at 10:07 -0500, Tom Lane wrote:
    Simon Riggs <simon@2ndQuadrant.com> writes:
    VACUUM with a btree index proceeds like this:
    1. Scan table
    2. Remove rows from btree identified in (1)
    3. Remove rows from heap identified in (1)
    The purpose of the additional locking requirements during (2) for btrees
    is to ensure that we do not fail to find the rows identified in (1),
    because the rows can move after (1) and during (2) because of block
    splits.
    No, you are missing the point. One purpose of the additional locking
    requirements is to ensure that there is not a concurrent process that
    has read a btree index entry just before you removed it but arrives at
    the heap page only after you removed the heap entry (and, perhaps,
    replaced it with some other row that doesn't match the index entry at
    all). This is clearly still a risk in a hot-standby environment.
    OK, I think I get it now. Thanks for putting me straight.

    So I will implement the lock-every-page approach discussed upthread: I
    will keep note of the blocks touched, in exactly the order they are
    visited, and store that information onto the WAL records accordingly.

    Are you happy with my optimisation that if a page needs to be read in,
    we just skip it and pretend we did read-pin-unpin on it? I would
    implement that as a new ReadBuffer mode (in Heikki's new API
    terminology).
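
    For illustration, the replay-side handling of a "visit only" block might
    then look like this; XLogReadBufferIfCached() is a made-up name standing
    in for the proposed new ReadBuffer mode, not an existing call:

    #include "postgres.h"
    #include "storage/bufmgr.h"
    #include "storage/relfilenode.h"

    /*
     * Sketch: visit one block named in an XLOG_BTREE_VACUUM record.
     * XLogReadBufferIfCached() is assumed to return a pinned, unlocked
     * buffer, or InvalidBuffer if the page is not in shared buffers.
     */
    extern Buffer XLogReadBufferIfCached(RelFileNode rnode, BlockNumber blkno);

    static void
    btree_xlog_visit_block(RelFileNode rnode, BlockNumber blkno)
    {
        Buffer      buffer = XLogReadBufferIfCached(rnode, blkno);

        if (BufferIsValid(buffer))
        {
            LockBufferForCleanup(buffer);   /* wait out any existing pins */
            UnlockReleaseBuffer(buffer);
        }
        /* else: not cached, hence unpinned; equivalent to read-pin-unpin */
    }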

    If you know/can see any other missing correctness requirements please
    let me know. I've not had trouble understanding any of the other
    correctness requirements, but I'll leave it to review to judge whether
    I've implemented them all correctly.

    --
    Simon Riggs www.2ndQuadrant.com
    PostgreSQL Training, Services and Support
  • Tom Lane at Nov 3, 2008 at 4:10 pm

    Simon Riggs writes:
    Are you happy with my optimisation that if a page needs to be read in,
    we just skip it and pretend we did read-pin-unpin on it?
    If it's not in buffers then it cannot be pinned, so I concur that that
    is safe.

    regards, tom lane
