FAQ
The attached patch refactors heap_hot_search_buffer() so that
index_getnext() can use it, and modifies index_getnext() to do so.

The idea is based on one of Heikki's index-only scan patches, from 2009:

http://archives.postgresql.org/pgsql-hackers/2009-07/msg00676.php

I believe, however, that this portion of that patch stands alone,
completely independently of index-only scans. At present, we have two
copies of the logic to traverse HOT chains floating around, which
means they can get out of sync, possibly resulting in bugs. This has
already happened once:

http://archives.postgresql.org/pgsql-hackers/2011-05/msg00733.php

As a nice bonus, the new logic is both simpler and, I believe, more
correct than the current logic. In IndexScanDescData, xs_hot_dead,
xs_next_hot, and xs_prev_xmax get replaced by a single boolean
xs_continue_hot. There is a small behavioral difference: in the
current code, when use a non-MVCC snapshot with index_getnext() and
walk a HOT chain, each call remembers the *next* TID that should be
examined. With this change, we instead remember the TID that we most
recently returned, and compute the next TID when index_getnext() is
called again. It seems to me that this is really what we should have
been doing all along. Imagine, for example, that we have a HOT chain
A -> B, where B has not yet committed. We return A and remember that
we next intend to look at B. Before index_getnext() is called, B
aborts and a new HOT update produces a HOT chain A -> C. The next
call to index_getnext() will nonetheless look at B and conclude that
it's reached the end of the HOT chain. This doesn't actually matter a
bit for current uses of index_getnext(), because - at least according
to Heikki's old notes - the only place we use a non-MVCC snapshot with
this function is during CLUSTER. And at that point, we have the whole
table locked down, so nothing is happening concurrently. I'm not sure
there's any possibility of us ever using this function in a way that
would break under the current implementation, but what I'm proposing
here does seem more robust to me.

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Search Discussions

  • Jeff Davis at Jun 19, 2011 at 5:49 pm

    On Mon, 2011-06-06 at 14:03 -0400, Robert Haas wrote:
    The attached patch refactors heap_hot_search_buffer() so that
    index_getnext() can use it, and modifies index_getnext() to do so.
    Attached is a version of the patch that applies cleanly to master.

    Regards,
    Jeff Davis
  • Jeff Davis at Jun 19, 2011 at 6:01 pm

    On Sun, 2011-06-19 at 10:50 -0700, Jeff Davis wrote:
    On Mon, 2011-06-06 at 14:03 -0400, Robert Haas wrote:
    The attached patch refactors heap_hot_search_buffer() so that
    index_getnext() can use it, and modifies index_getnext() to do so.
    Attached is a version of the patch that applies cleanly to master.
    In heap_hot_search_buffer:

    + /* If this is not the first call, previous call returned
    a (live!) tuple */
    if (all_dead)
    - *all_dead = true;
    + *all_dead = !first_call;

    I think that's a typo: it should be:

    + *all_dead = first_call;

    Right?

    Regards,
    Jeff Davis
  • Robert Haas at Jun 19, 2011 at 6:17 pm

    On Sun, Jun 19, 2011 at 2:01 PM, Jeff Davis wrote:
    On Sun, 2011-06-19 at 10:50 -0700, Jeff Davis wrote:
    On Mon, 2011-06-06 at 14:03 -0400, Robert Haas wrote:
    The attached patch refactors heap_hot_search_buffer() so that
    index_getnext() can use it, and modifies index_getnext() to do so.
    Attached is a version of the patch that applies cleanly to master.
    In heap_hot_search_buffer:

    +       /* If this is not the first call, previous call returned
    a (live!) tuple */
    if (all_dead)
    -               *all_dead = true;
    +               *all_dead = !first_call;

    I think that's a typo: it should be:

    +               *all_dead = first_call;
    Yikes. I think you are right. It's kind of scary that the regression
    tests passed with that mistake.

    New patch attached, with that one-line change.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Tom Lane at Jun 19, 2011 at 6:20 pm

    Robert Haas writes:
    Yikes. I think you are right. It's kind of scary that the regression
    tests passed with that mistake.
    Can we add a test that exposes that mistake?

    regards, tom lane
  • Robert Haas at Jun 19, 2011 at 6:42 pm

    On Sun, Jun 19, 2011 at 2:20 PM, Tom Lane wrote:
    Robert Haas <robertmhaas@gmail.com> writes:
    Yikes.  I think you are right.  It's kind of scary that the regression
    tests passed with that mistake.
    Can we add a test that exposes that mistake?
    Not sure. We'd have to figure out how to reliably tickle it.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Robert Haas at Jun 19, 2011 at 8:51 pm

    On Sun, Jun 19, 2011 at 2:41 PM, Robert Haas wrote:
    On Sun, Jun 19, 2011 at 2:20 PM, Tom Lane wrote:
    Robert Haas <robertmhaas@gmail.com> writes:
    Yikes.  I think you are right.  It's kind of scary that the regression
    tests passed with that mistake.
    Can we add a test that exposes that mistake?
    Not sure.  We'd have to figure out how to reliably tickle it.
    *thinks a bit*

    When using an MVCC snapshot, we always have first_call = true, so the
    effect of this mistake was just to disable the opportunistic killing
    of dead tuples, which doesn't affect correctness.

    When using a non-MVCC snapshot, we call heap_hot_search_buffer()
    repeatedly until it returns false. For so long as it returns true, it
    does not matter how all_dead is set, because index_getnext() will
    return the tuple without examining all_dead. So only the final call
    matters. If the final call happens to also be the first call, then
    all_dead might end up being false when it really ought to be true, but
    that will once again just miss killing a dead tuple. If the final
    call isn't the first call, then we've got a problem, because now
    all_dead will be true when it really ought to be false, and we'll nuke
    an index tuple that we shouldn't nuke.

    But if this is happening in the context of CLUSTER, then there might
    still be no user-visible failure, because we're going to rebuild the
    indexes anyway. There could be a problem if CLUSTER aborts part-way
    though.

    A system catalog might get scanned with SnapshotNow, but to exercise
    the bug you'd need to HOT update a system catalog and then have the
    updating transaction commit between the time it sees the first row and
    the time it sees the second one.

    So I don't quite see how to construct a test case, ATM.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Robert Haas at Jun 24, 2011 at 7:32 pm

    On Sun, Jun 19, 2011 at 2:16 PM, Robert Haas wrote:
    New patch attached, with that one-line change.
    Jeff, are you planning to review this further? Do you think it's OK to commit?

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Jeff Davis at Jun 25, 2011 at 10:24 am

    On Fri, 2011-06-24 at 15:32 -0400, Robert Haas wrote:
    On Sun, Jun 19, 2011 at 2:16 PM, Robert Haas wrote:
    New patch attached, with that one-line change.
    Jeff, are you planning to review this further? Do you think it's OK to commit?
    1. Patch does not apply to master cleanly, and it's in unified format
    (so I can't compare it against the old patch very easily). This review
    is for the first patch, disregarding the "skip = !first_call" issue that
    you already fixed. If you had other changes in the latest version,
    please repost the patch.

    2. Comment above heap_hot_search_buffer should be updated to document
    that heapTuple is an out-parameter and document the behavior of
    first_call

    3. The logic around "skip" is slightly confusing to me. Here's my
    description: if it's not an MVCC snapshot and it's not the first call,
    then you don't actually want to fetch the tuple with the given tid or a
    later one in the chain -- you want to fetch the _next_ tuple in the
    chain or a later one in the chain. Some wording of that description in a
    comment (either in the function's comment or near the use of "skip")
    would help a lot. Also, if skip is true, then the tid _must_ be visible
    according to the (non-MVCC) snapshot, correct? It might help if that was
    apparent from the code/comments.

    Other than that, it looks good.

    Regards,
    Jeff Davis
  • Robert Haas at Jun 27, 2011 at 2:32 pm

    On Sat, Jun 25, 2011 at 6:24 AM, Jeff Davis wrote:
    On Fri, 2011-06-24 at 15:32 -0400, Robert Haas wrote:
    On Sun, Jun 19, 2011 at 2:16 PM, Robert Haas wrote:
    New patch attached, with that one-line change.
    Jeff, are you planning to review this further?  Do you think it's OK to commit?
    1. Patch does not apply to master cleanly, and it's in unified format
    (so I can't compare it against the old patch very easily). This review
    is for the first patch, disregarding the "skip = !first_call" issue that
    you already fixed. If you had other changes in the latest version,
    please repost the patch.
    That is strange, because it applies for me. But I had no other changes.
    2. Comment above heap_hot_search_buffer should be updated to document
    that heapTuple is an out-parameter and document the behavior of
    first_call

    3. The logic around "skip" is slightly confusing to me. Here's my
    description: if it's not an MVCC snapshot and it's not the first call,
    then you don't actually want to fetch the tuple with the given tid or a
    later one in the chain -- you want to fetch the _next_ tuple in the
    chain or a later one in the chain. Some wording of that description in a
    comment (either in the function's comment or near the use of "skip")
    would help a lot. Also, if skip is true, then the tid _must_ be visible
    according to the (non-MVCC) snapshot, correct? It might help if that was
    apparent from the code/comments.

    Other than that, it looks good.
    OK, I've applied this with some additional comment changes.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
grouppgsql-hackers @
categoriespostgresql
postedJun 6, '11 at 6:03p
activeJun 27, '11 at 2:32p
posts10
users3
websitepostgresql.org...
irc#postgresql

People

Translate

site design / logo © 2022 Grokbase