Currently, we have a problem with index-only scans in Hot Standby
mode: the xmin horizon on the standby might lag the master, and thus
an index-only scan might mistakenly conclude that no heap fetch is
needed when in fact it is. I suggested that we handle this by
suppressing generation of index-only scan plans in Hot Standby mode,
but Simon, Noah, and Dimitri were arguing that we should instead do
the following, which is now on the open items list:

* Make XLOG_HEAP2_VISIBLE records generate recovery snapshot conflicts
so that IndexOnlyScans work on Hot Standby

Ideally, this would also allow us to remove the following hack from
heapgetpage(), which would improve sequential-scan performance for Hot
Standby.

/*
 * If the all-visible flag indicates that all tuples on the page are
 * visible to everyone, we can skip the per-tuple visibility tests. But
 * not in hot standby mode. A tuple that's already visible to all
 * transactions in the master might still be invisible to a read-only
 * transaction in the standby.
 */
all_visible = PageIsAllVisible(dp) && !snapshot->takenDuringRecovery;

As far as I can see, to make this work, we'd need to have
lazy_scan_heap() compute the latest xmin on any potentially
all-visible page and treat that as a cutoff XID, cancelling queries
with snapshots whose xmins equal or precede that value, just as we
currently do when freezing tuples (cf xlog_heap_freeze). There are
two things to worry about: correctness, and more query cancellations.
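The cutoff computation described above might be sketched like this. This is a minimal standalone illustration, not PostgreSQL's actual API: the function names and the flat array of xmins are hypothetical, and real code would live in lazy_scan_heap() and use TransactionIdFollows() to cope with XID wraparound.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

/*
 * While scanning a page about to be marked all-visible, track the newest
 * xmin on the page; that value would become the conflict cutoff carried
 * in the XLOG_HEAP2_VISIBLE record.  Simplified: a plain > comparison
 * ignores XID wraparound, which real code must handle.
 */
static TransactionId
visibility_cutoff_xid(const TransactionId *xmins, int ntuples)
{
    TransactionId cutoff = InvalidTransactionId;

    for (int i = 0; i < ntuples; i++)
        if (xmins[i] > cutoff)
            cutoff = xmins[i];
    return cutoff;
}

/* A standby snapshot conflicts when its xmin equals or precedes the cutoff. */
static int
snapshot_conflicts(TransactionId snapshot_xmin, TransactionId cutoff)
{
    return snapshot_xmin <= cutoff;
}
```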

In the department of query cancellations, I believe Noah argued
previously that this wasn't really going to cause a problem. And,
indeed, if the master has a mix of inserts, updates, and deletes, then
it seems likely that any recovery conflicts generated this way would
be hitting about the same place in the XID space as the ones caused by
pruning away dead tuples. What will be different, if we go this
route, is that an insert-only master, which right now only generates
conflicts at freezing time, will instead generate conflicts much
sooner, as soon as the relation is vacuumed. I do not use Hot Standby
enough to know whether this is a problem, and I'm not trying to block
this approach, but I do want to make sure that everyone agrees that
it's acceptable before we go do it, because inevitably somebody is
going to get bit. In making our decision, I think we should keep in
mind that we are currently pretty laid-back about marking pages
all-visible, and that's probably something we're going to want to
change over time: for example, we recently discussed the fact that a
page with one dead tuple in it currently needs *two* vacuums to become
all-visible, because only the first vacuum pass is smart enough to
mark things all-visible, and the first heap pass only truncates the
dead tuple to a dead line pointer, so the page isn't all-visible at
that point. When we fix that, which I think we're almost certainly
going to want to do at some point, then that means these conflicts
will occur that much sooner. I think we will likely also want to
consider marking things all-visible on the fly at some point in the
future; for example, if we pull a buffer in for an index-only scan and
dirty it by setting a hint bit, we might want to take the plunge and
mark it all-visible before evicting it, to save effort the next time.
Again, not trying to dissuade anyone, just making sure we've thought
through it enough to avoid being unhappy later. It's worth noting
that none of this will normally matter if hot_standby_feedback=on, so
part of the analysis here may depend on how many people we think have
flipped that switch.

As to correctness, after further review of lazy_scan_heap(), I think
there are some residual bugs here that need to be fixed whatever we
decide to do about the Hot Standby case:

1. I noticed that when we do PageSetAllVisible() today we follow it up
with SetBufferCommitInfoNeedsSave(). PD_ALL_VISIBLE is not merely a
hint, so I think these should be changed to MarkBufferDirty(), which
shouldn't make much difference given current code, but proposals to
handle hint changes differently than non-hint changes abound, so it's
probably not good to rely on those meaning the same thing forever.

2. More seriously, lazy_scan_heap() releases the buffer lock before
setting the all-visible bit on the heap. This looks just plain wrong
(and it's my fault, to be clear). Somebody else can come along after
we've released the buffer lock and mark the page not-all-visible.
That concurrent process will check the visibility map bit, find it
already clear, and go on its merry way. Then, we set the visibility
map bit and go on our merry way. Now we have a not-all-visible page
with the all-visible bit set in the visibility map. Oops. I think
this probably needs to be rewritten so that lazy_scan_heap() acquires
a pin on the visibility map page before locking the buffer for cleanup
and holds onto that pin until either we exit the range of heap pages
covered by that visibility map page or trigger index vac due to
maintenance_work_mem exhaustion. With that infrastructure in place,
we can set the visibility map bit at the same time we set the
page-level bit without risking holding the buffer lock across a buffer
I/O (we might have to hold the buffer lock across a WAL flush in the
worst case, but that hazard exists in a number of other places as well
and fixing it here is well out of scope).
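The invariant at stake in bug #2 can be modeled in a few lines. This is a toy sketch with illustrative names, not the buffer-manager code itself: the point is that the visibility map bit may be set only while PD_ALL_VISIBLE is set on the heap page, and the fix (pinning the VM page up front) lets both bits be set while the heap buffer lock is still held, leaving no window for a concurrent clearer.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: the two bits that must stay consistent. */
typedef struct
{
    bool page_all_visible;      /* PD_ALL_VISIBLE on the heap page */
    bool vm_all_visible;        /* corresponding visibility map bit */
} VisState;

/* Fixed path: both bits set together, conceptually under the buffer
 * lock, which is only possible if the VM page was pinned in advance. */
static void
set_all_visible(VisState *s)
{
    s->page_all_visible = true;
    s->vm_all_visible = true;
}

/* What a concurrent insert/update/delete does under the lock. */
static void
clear_all_visible(VisState *s)
{
    s->page_all_visible = false;
    s->vm_all_visible = false;
}

/* The invariant: VM bit set implies page-level bit set. */
static bool
invariant_holds(const VisState *s)
{
    return !s->vm_all_visible || s->page_all_visible;
}
```

The buggy ordering in current lazy_scan_heap() is exactly a state where the VM bit gets set after a concurrent clearer has already run, i.e. vm_all_visible without page_all_visible.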

Now, suppose we fix the above bugs and also add logic to generate
query-conflicts as described above. How far can the standby now trust
the visibility information? Setting the visibility map bit is a fully
WAL-logged operation, so anyone for whom seeing the bit as set would
potentially be a problem is certain to be killed before they get the
chance to become confused. And when we first open for read-only
connections after a new base backup, the initial snapshot had better
be able to see all XIDs which have committed prior to that point, and
only such things should be marked all-visible, so AFAICT the door is
nailed shut pretty tight here. I am a little less certain about the
page-level bit. On the master, it's possible for the PD_ALL_VISIBLE
bit to make it to disk before the WAL record that sets the visibility
map bit; the LSN interlock only protects the visibility map page
itself, as part of a complicated strategy to avoid emitting FPWs for
the entire heap when we vacuum an insert-only table. On the standby,
those same bits are going to get set when the xlog records covering
the setting of the visibility map bit get replayed (which would be OK, in
the sense that the page-level bit would be trustworthy in this case),
or when a copy of the buffer makes it from master to standby by some
other means (which is the potentially problematic case). This could
happen either as part of a base backup, or via a FPW. I don't think
the base backup case is a problem for the same reasons that it's OK
for the visibility map bit case. If an FPW turns the bit on, then
somehow the page-level bit got set without the visibility map bit
getting set. I think the only way for that to happen is:

1. vacuum on master sets the page-level bit and the visibility map bit
2. the heap page with the bit is written to disk BEFORE the WAL entry
generated in (1) gets flushed; this is allowable, since it's not an
error for the page-level bit to be set while the visibility-map bit is
unset, only the other way around
3. crash (still before the WAL entry is flushed)
4. some operation on the master emits an FPW for the page without
first rendering it not-all-visible

At present, I'm not sure there's any real problem here, since normally
an all-visible heap page is only going to get another WAL entry if
somebody does an insert, update, or delete on it, in which case the
visibility map bit is going to get cleared anyway. Freezing the page
might emit a new FPW, but that's going to generate conflicts anyway,
so no problem there. So I think there's no real problem here, but it
doesn't seem totally future-proof - any future type of WAL record that
might modify the page without rendering it not-all-visible would
create a very subtle hazard for Hot Standby. We could dodge the whole
issue by leaving the hack in heapgetpage() intact and just be happy
with making index-only scans work, or maybe somebody has a more clever
idea.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

  • Simon Riggs at Apr 15, 2012 at 10:02 am

    On Fri, Apr 13, 2012 at 5:33 PM, Robert Haas wrote:
    Currently, we have a problem with index-only scans in Hot Standby
    mode: the xmin horizon on the standby might lag the master, and thus
    an index-only scan might mistakenly conclude that no heap fetch is
    needed when in fact it is.  I suggested that we handle this by
    suppressing generation of index-only scan plans in Hot Standby mode,
    but Simon, Noah, and Dimitri were arguing that we should instead do
    the following, which is now on the open items list:

    * Make XLOG_HEAP2_VISIBLE records generate recovery snapshot conflicts
    so that IndexOnlyScans work on Hot Standby
    ...
    <snip> very long email </snip>

    Luckily it's much simpler than all of that suggests. It'll take a few
    hours for me to write a short reply but it's Sunday today, so that will
    happen later.

    --
    Simon Riggs                   http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training & Services
  • Noah Misch at Apr 16, 2012 at 7:03 am

    On Fri, Apr 13, 2012 at 12:33:06PM -0400, Robert Haas wrote:
    In the department of query cancellations, I believe Noah argued
    previously that this wasn't really going to cause a problem. And,
    indeed, if the master has a mix of inserts, updates, and deletes, then
    it seems likely that any recovery conflicts generated this way would
    be hitting about the same place in the XID space as the ones caused by
    pruning away dead tuples. What will be different, if we go this
    route, is that an insert-only master, which right now only generates
    conflicts at freezing time, will instead generate conflicts much
    sooner, as soon as the relation is vacuumed. I do not use Hot Standby
    enough to know whether this is a problem, and I'm not trying to block
    this approach, but I do want to make sure that everyone agrees that
    it's acceptable before we go do it, because inevitably somebody is
    going to get bit.
    Given sufficient doubt, we could add a GUC, say hot_standby_use_all_visible.
    A standby with the GUC "off" would ignore all-visible indicators and also
    decline to generate conflicts when flipping them on.
    As to correctness, after further review of lazy_scan_heap(), I think
    there are some residual bugs here that need to be fixed whatever we
    decide to do about the Hot Standby case:

    1. I noticed that when we do PageSetAllVisible() today we follow it up
    with SetBufferCommitInfoNeedsSave(). PD_ALL_VISIBLE is not merely a
    hint, so I think these should be changed to MarkBufferDirty(), which
    shouldn't make much difference given current code, but proposals to
    handle hint changes differently than non-hint changes abound, so it's
    probably not good to rely on those meaning the same thing forever.
    Do you refer to PD_ALL_VISIBLE as "not merely a hint" due to the requirement
    to prevent a page from simultaneously having a negative PD_ALL_VISIBLE and a
    positive visibility map bit? That is to say, PD_ALL_VISIBLE is fully a hint
    in its role as a witness of tuple visibility, but it is not a hint in its role
    as a witness of the visibility map status? In any event, I agree that those
    call sites should use MarkBufferDirty().

    The large comment in SetBufferCommitInfoNeedsSave() seems incorrect. On
    systems with weaker memory ordering, the FlushBuffer() process's clearing of
    BM_JUST_DIRTIED may not be visible to the SetBufferCommitInfoNeedsSave()
    process until some time after the former retrieved buffer contents from shared
    memory. Memory barriers could make the comment true, but we should probably
    just update the comment to explain that a race condition may eat the update
    entirely. Incidentally, is there a good reason for MarkBufferDirty() to
    increment pgBufferUsage.shared_blks_dirtied and SetBufferCommitInfoNeedsSave()
    not to do so? Looks like an oversight.
    2. More seriously, lazy_scan_heap() releases the buffer lock before
    setting the all-visible bit on the heap. This looks just plain wrong
    (and it's my fault, to be clear). Somebody else can come along after
    we've released the buffer lock and mark the page not-all-visible.
    That concurrent process will check the visibility map bit, find it
    already clear, and go on its merry way. Then, we set the visibility
    map bit and go on our merry way. Now we have a not-all-visible page
    with the all-visible bit set in the visibility map. Oops.
    Hmm, indeed. This deserves its own open item.
    I think
    this probably needs to be rewritten so that lazy_scan_heap() acquires
    a pin on the visibility map page before locking the buffer for cleanup
    and holds onto that pin until either we exit the range of heap pages
    covered by that visibility map page or trigger index vac due to
    maintenance_work_mem exhaustion. With that infrastructure in place,
    we can set the visibility map bit at the same time we set the
    page-level bit without risking holding the buffer lock across a buffer
    I/O (we might have to hold the buffer lock across a WAL flush in the
    worst case, but that hazard exists in a number of other places as well
    and fixing it here is well out of scope).
    Looks reasonable at a glance.
    1. vacuum on master sets the page-level bit and the visibility map bit
    2. the heap page with the bit is written to disk BEFORE the WAL entry
    generated in (1) gets flushed; this is allowable, since it's not an
    error for the page-level bit to be set while the visibility-map bit is
    unset, only the other way around
    3. crash (still before the WAL entry is flushed)
    4. some operation on the master emits an FPW for the page without
    first rendering it not-all-visible

    At present, I'm not sure there's any real problem here, since normally
    an all-visible heap page is only going to get another WAL entry if
    somebody does an insert, update, or delete on it, in which case the
    visibility map bit is going to get cleared anyway. Freezing the page
    might emit a new FPW, but that's going to generate conflicts anyway,
    so no problem there. So I think there's no real problem here, but it
    doesn't seem totally future-proof - any future type of WAL record that
    might modify the page without rendering it not-all-visible would
    create a very subtle hazard for Hot Standby. We could dodge the whole
    issue by leaving the hack in heapgetpage() intact and just be happy
    with making index-only scans work, or maybe somebody has a more clever
    idea.
    Good point. We could finagle things so the standby notices end-of-recovery
    checkpoints and blocks the optimization for older snapshots. For example,
    maintain an integer count of end-of-recovery checkpoints seen and store that
    in each Snapshot instead of takenDuringRecovery; use 0 on a master. When the
    global value is greater than the snapshot's value, disable the optimization
    for that snapshot. I don't know whether the optimization is powerful enough
    to justify that level of trouble, but it's an option to consider.
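    The counter scheme described above might be sketched as follows. All
    names here are hypothetical, not PostgreSQL's actual structures: the
    idea is just that each Snapshot remembers the end-of-recovery
    checkpoint count at the time it was taken, and the optimization is
    disabled once the global count has advanced past it.

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Global count of end-of-recovery checkpoints replayed; always 0 on
     * a master, so master snapshots are never invalidated this way. */
    static uint32_t end_of_recovery_ckpts;

    typedef struct
    {
        uint32_t ckpts_at_snapshot;   /* replaces takenDuringRecovery */
    } Snapshot;

    static Snapshot
    take_snapshot(void)
    {
        return (Snapshot) { .ckpts_at_snapshot = end_of_recovery_ckpts };
    }

    /* All-visible bits are trusted only while no end-of-recovery
     * checkpoint has intervened since the snapshot was taken. */
    static int
    all_visible_trustworthy(const Snapshot *snap)
    {
        return end_of_recovery_ckpts <= snap->ckpts_at_snapshot;
    }
    ```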

    For a different approach, targeting the fragility, we could add assertions to
    detect unexpected cases of a page bearing PD_ALL_VISIBLE submitted as a
    full-page image.

    Thanks,
    nm
  • Simon Riggs at Apr 16, 2012 at 7:39 am

    On Mon, Apr 16, 2012 at 8:02 AM, Noah Misch wrote:
    On Fri, Apr 13, 2012 at 12:33:06PM -0400, Robert Haas wrote:
    In the department of query cancellations, I believe Noah argued
    previously that this wasn't really going to cause a problem.  And,
    indeed, if the master has a mix of inserts, updates, and deletes, then
    it seems likely that any recovery conflicts generated this way would
    be hitting about the same place in the XID space as the ones caused by
    pruning away dead tuples.  What will be different, if we go this
    route, is that an insert-only master, which right now only generates
    conflicts at freezing time, will instead generate conflicts much
    sooner, as soon as the relation is vacuumed.  I do not use Hot Standby
    enough to know whether this is a problem, and I'm not trying to block
    this approach, but I do want to make sure that everyone agrees that
    it's acceptable before we go do it, because inevitably somebody is
    going to get bit.
    Given sufficient doubt, we could add a GUC, say hot_standby_use_all_visible.
    A standby with the GUC "off" would ignore all-visible indicators and also
    decline to generate conflicts when flipping them on.
    I'm against adding a new GUC, especially for that fairly thin reason.

    1. vacuum on master sets the page-level bit and the visibility map bit
    2. the heap page with the bit is written to disk BEFORE the WAL entry
    generated in (1) gets flushed; this is allowable, since it's not an
    error for the page-level bit to be set while the visibility-map bit is
    unset, only the other way around
    3. crash (still before the WAL entry is flushed)
    4. some operation on the master emits an FPW for the page without
    first rendering it not-all-visible

    At present, I'm not sure there's any real problem here, since normally
    an all-visible heap page is only going to get another WAL entry if
    somebody does an insert, update, or delete on it, in which case the
    visibility map bit is going to get cleared anyway.  Freezing the page
    might emit a new FPW, but that's going to generate conflicts anyway,
    so no problem there.  So I think there's no real problem here, but it
    doesn't seem totally future-proof - any future type of WAL record that
    might modify the page without rendering it not-all-visible would
    create a very subtle hazard for Hot Standby.  We could dodge the whole
    issue by leaving the hack in heapgetpage() intact and just be happy
    with making index-only scans work, or maybe somebody has a more clever
    idea.
    Good point.  We could finagle things so the standby notices end-of-recovery
    checkpoints and blocks the optimization for older snapshots.  For example,
    maintain an integer count of end-of-recovery checkpoints seen and store that
    in each Snapshot instead of takenDuringRecovery; use 0 on a master.  When the
    global value is greater than the snapshot's value, disable the optimization
    for that snapshot.  I don't know whether the optimization is powerful enough
    to justify that level of trouble, but it's an option to consider.

    For a different approach, targeting the fragility, we could add assertions to
    detect unexpected cases of a page bearing PD_ALL_VISIBLE submitted as a
    full-page image.
    We don't need a future proof solution, especially not at this stage of
    the release cycle. Every time you add a WAL record, we need to
    consider the possible conflict impact.

    We just need a simple and clear solution. I'll work on that.

    --
    Simon Riggs                   http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training & Services
  • Heikki Linnakangas at Apr 16, 2012 at 8:26 am

    On 16.04.2012 10:38, Simon Riggs wrote:
    On Mon, Apr 16, 2012 at 8:02 AM, Noah Misch wrote:
    On Fri, Apr 13, 2012 at 12:33:06PM -0400, Robert Haas wrote:
    In the department of query cancellations, I believe Noah argued
    previously that this wasn't really going to cause a problem. And,
    indeed, if the master has a mix of inserts, updates, and deletes, then
    it seems likely that any recovery conflicts generated this way would
    be hitting about the same place in the XID space as the ones caused by
    pruning away dead tuples. What will be different, if we go this
    route, is that an insert-only master, which right now only generates
    conflicts at freezing time, will instead generate conflicts much
    sooner, as soon as the relation is vacuumed. I do not use Hot Standby
    enough to know whether this is a problem, and I'm not trying to block
    this approach, but I do want to make sure that everyone agrees that
    it's acceptable before we go do it, because inevitably somebody is
    going to get bit.
    Given sufficient doubt, we could add a GUC, say hot_standby_use_all_visible.
    A standby with the GUC "off" would ignore all-visible indicators and also
    decline to generate conflicts when flipping them on.
    I'm against adding a new GUC, especially for that fairly thin reason.
    Same here.

    Can we have a "soft" hot standby conflict that doesn't kill the query,
    but disables index-only-scans?

    In the long run, perhaps we need to store XIDs in the visibility map
    instead of just a bit. If we only stored one xid per 100 pages or
    something like that, the storage overhead would not be much higher than
    what we have at the moment. But that's obviously not going to happen for
    9.2...

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Tom Lane at Apr 16, 2012 at 2:19 pm

    Heikki Linnakangas writes:
    Can we have a "soft" hot standby conflict that doesn't kill the query,
    but disables index-only-scans?
    Well, there wouldn't be any way for the planner to know whether an
    index-only scan would be safe or not. I think this would have to look
    like a run-time fallback. Could it be structured as "return that the
    page's all-visible bit is not set, instead of failing?" Or am I
    confused about where the conflict is coming from?

    regards, tom lane
  • Simon Riggs at Apr 16, 2012 at 5:58 pm

    On Mon, Apr 16, 2012 at 3:19 PM, Tom Lane wrote:
    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
    Can we have a "soft" hot standby conflict that doesn't kill the query,
    but disables index-only-scans?
    Well, there wouldn't be any way for the planner to know whether an
    index-only scan would be safe or not.  I think this would have to look
    like a run-time fallback.  Could it be structured as "return that the
    page's all-visible bit is not set, instead of failing?"  Or am I
    confused about where the conflict is coming from?
    The starting point is that HS now offers 4 different mechanisms for
    avoiding conflicts, probably the best of which is hot standby
    feedback. So we only need to improve things if those techniques don't
    work for people. So initially, my attitude is let's throw a conflict
    and wait for feedback during beta.

    If we do need to do something, then introduce concept of a visibility conflict.

    On replay:
    If feedback not set, set LSN of visibility conflict on PROCs that
    conflict, if not already set.

    On query:
    If feedback not set, check conflict LSN against page, if page is
    later, check visibility.
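    Simon's replay/query rules might be sketched like this. The struct
    and function names are illustrative stand-ins, not the real PGPROC
    machinery: replay stamps a conflicting backend with the conflict LSN
    instead of cancelling it, and at scan time a page whose LSN is at or
    after the stamp falls back to per-tuple visibility checks.

    ```c
    #include <assert.h>
    #include <stdint.h>

    typedef uint64_t XLogRecPtr;
    #define InvalidXLogRecPtr ((XLogRecPtr) 0)

    typedef struct
    {
        XLogRecPtr visibility_conflict_lsn;   /* 0 = no conflict stamped */
    } BackendState;

    /* On replay: stamp the conflict LSN, but only if not already set,
     * so the earliest conflict wins. */
    static void
    replay_visibility_conflict(BackendState *proc, XLogRecPtr record_lsn)
    {
        if (proc->visibility_conflict_lsn == InvalidXLogRecPtr)
            proc->visibility_conflict_lsn = record_lsn;
    }

    /* On query: trust the page's all-visible bit only when the page
     * predates the stamped conflict (or no conflict was stamped). */
    static int
    page_all_visible_trusted(const BackendState *proc, XLogRecPtr page_lsn)
    {
        return proc->visibility_conflict_lsn == InvalidXLogRecPtr ||
               page_lsn < proc->visibility_conflict_lsn;
    }
    ```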

    In terms of optimisation, I wouldn't expect to have to adjust costs at
    all. The difference would only show for long running queries and
    typically they don't touch too much new data as a fraction of total.
    The cost model for index only is pretty crude anyhow.

    --
    Simon Riggs                   http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training & Services
  • Robert Haas at Apr 16, 2012 at 8:14 pm

    On Mon, Apr 16, 2012 at 1:58 PM, Simon Riggs wrote:
    If we do need to do something, then introduce concept of a visibility conflict.

    On replay:
    If feedback not set, set LSN of visibility conflict on PROCs that
    conflict, if not already set.

    On query:
    If feedback not set, check conflict LSN against page, if page is
    later, check visibility.
    Hmm, should have read the whole thread before replying. This is similar
    to what I just proposed in response to Heikki's message, but using LSN
    in lieu of (or maybe you mean in addition to) XID.

    I don't think we can ignore the need to throw conflicts just because
    hot_standby_feedback is set; there are going to be corner cases, for
    example, when it's just recently been turned on and the master has
    already done cleanup; or if the master and standby have recently
    gotten disconnected for even just a few seconds.

    But fundamentally we all seem to be converging on some variant of the
    "soft conflict" idea.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Robert Haas at Apr 27, 2012 at 12:03 am

    On Mon, Apr 16, 2012 at 4:13 PM, Robert Haas wrote:
    But fundamentally we all seem to be converging on some variant of the
    "soft conflict" idea.
    So, as a first step, I've committed a patch that just throws a hard
    conflict. I think we probably want to optimize this further, and I'm
    going to work on investigating that next. But it seemed productive to get
    this much out of the way first, so I did.

    In studying this, it strikes me that it would be rather nice if
    recovery conflicts could somehow be arranged to roll back the active
    transaction by some means short of a FATAL error. I think there are
    some protocol issues with doing that, but I still wish we could figure
    out a way.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Robert Haas at May 2, 2012 at 12:41 pm

    On Thu, Apr 26, 2012 at 8:03 PM, Robert Haas wrote:
    So, as a first step, I've committed a patch that just throws a hard
    conflict.  I think we probably want to optimize this further, and I'm
    going to work on investigating that next.  But it seemed productive to get
    this much out of the way first, so I did.
    I've been thinking about this some more. What's worrying me is that a
    visibility conflict, however we implement it, could be *worse* from
    the user's point of view than just killing the query. After all,
    there's a reasonable likelihood that a single visibility map page
    covers the whole relation (or all the blocks that the user is
    interested in), so any sort of conflict is basically going to turn the
    index-only scan into an index-scan plus some extra overhead. And if
    the planner had known that the user was going to get an index-only
    scan rather than just a plain index scan, it might well have picked
    some other plan in the first place.

    Another problem is that, if we add a test for visibility conflicts
    into visibilitymap_test(), I'm afraid we're going to drive up the cost
    of that function very significantly. Previous testing suggests that
    the efficiency, or lack thereof, of that function is already a
    performance problem for index-only scans, which kinda makes me not
    that excited about adding another branch in there somewhere (and even
    less excited about any proposed implementation that would add an
    lwlock acquire/release or similar).

    So on further reflection I'm thinking it may be best just to stick
    with a hard conflict for now and see what feedback we get from beta
    testers.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Simon Riggs at May 4, 2012 at 8:57 am

    On 2 May 2012 13:41, Robert Haas wrote:

    So on further reflection I'm thinking it may be best just to stick
    with a hard conflict for now and see what feedback we get from beta
    testers.
    Which is what I was expecting y'all to conclude once you'd looked at
    the task in more detail.

    And I'm happy with the concept of beta being a period where we listen
    to feedback, not just bugs, and decide whether further refinements are
    required.

    What I think is possible is to alter the conflict as it hits the
    backend. If a backend has enable_indexonly = off then it wouldn't be
    affected by those conflicts anyhow. So if we simply record whether we
    are using an index only plan then we'll know whether to ignore it or
    abort. I think we can handle that by marking the snapshot at executor
    startup time. Needs a few other pushups but not that hard.

    The likelihood is that SQL that uses index only won't run for long
    enough to be cancelled anyway.

    --
    Simon Riggs                   http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training & Services
  • Robert Haas at Apr 16, 2012 at 7:56 pm

    On Mon, Apr 16, 2012 at 4:26 AM, Heikki Linnakangas wrote:
    Can we have a "soft" hot standby conflict that doesn't kill the query, but
    disables index-only-scans?
    Yeah, something like that seems possible.

    For example, suppose the master includes, in each
    mark-heap-page-all-visible record, the newest XID on the page. On the
    standby, we keep track of the most recent such XID ever seen in shared
    memory. After noting that a page is all-visible, the standby
    cross-checks its snapshot against this XID and does the heap fetch
    anyway if it's too new. Conceivably this can be done with memory
    barriers but not without locks: we must update the XID in shared
    memory *before* marking the page all-visible, and we must check the
    visibility map or page-level bit *before* fetching the XID from shared
    memory - but the actual reads and writes of the XID are atomic.
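    The ordering described above can be expressed with C11 acquire/release
    atomics standing in for PostgreSQL's barrier primitives. This is a
    hypothetical sketch, with "latest_all_visible_xid" standing in for the
    shared-memory XID and "page_all_visible" for the bit being checked;
    none of these names exist in the actual source.

    ```c
    #include <assert.h>
    #include <stdatomic.h>
    #include <stdint.h>

    static _Atomic uint32_t latest_all_visible_xid;
    static _Atomic int page_all_visible;

    /* Replay side: publish the page's newest XID *before* setting the
     * all-visible bit; the release store orders the two writes. */
    static void
    replay_set_all_visible(uint32_t newest_xid_on_page)
    {
        atomic_store_explicit(&latest_all_visible_xid, newest_xid_on_page,
                              memory_order_relaxed);
        atomic_store_explicit(&page_all_visible, 1, memory_order_release);
    }

    /* Scan side: check the bit *before* fetching the XID (acquire load
     * orders the two reads); do the heap fetch anyway if the XID is too
     * new for our snapshot — the "soft conflict". */
    static int
    can_skip_heap_fetch(uint32_t snapshot_xmin)
    {
        if (!atomic_load_explicit(&page_all_visible, memory_order_acquire))
            return 0;                 /* not all-visible: must fetch */
        uint32_t xid = atomic_load_explicit(&latest_all_visible_xid,
                                            memory_order_relaxed);
        return xid < snapshot_xmin;   /* too new: fetch anyway */
    }
    ```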

    Now, if an index-only scan suffers a very high number of these "soft
    conflicts" and consequently does a lot more heap fetches than
    expected, performance might suck. But that should be rare, and could
    be mitigated by turning on hot_standby_feedback. Also, there might be
    some hit for repeatedly reading that XID from memory.

    Alternatively, we could try to forbid index-only scan plans from being
    created in the first place *only when* the underlying snapshot is too
    old. But then what happens if a conflict happens later, after the
    plan has been created but before it's fully executed? At that point
    it's too late to switch gears, so we'd still need something like the
    above. And the above might be adequate all by itself, since in
    practice index-only scans are mostly going to be useful when the data
    isn't changing all that fast, so most of the queries that would be
    cancelled by "hard" conflicts wouldn't have actually had a problem
    anyway.
    In the long run, perhaps we need to store XIDs in the visibility map instead
    of just a bit. If we only stored one xid per 100 pages or something like
    that, the storage overhead would not be much higher than what we have at the
    moment. But that's obviously not going to happen for 9.2...
    Well, that would also have the undesirable side effect of greatly
    reducing the granularity of the map. As it is, updating a tiny
    fraction of the tuples in the table could result in the entire table
    ceasing to be all-visible if you happen to update exactly one
    tuple per page through the entire heap. Or more generally, updating
    X% of the rows in the table can cause Y% of the rows in the table to
    no longer be visible for index-only-scan purposes, where Y >> X. What
    you're proposing would make that much worse.

    I think we're actually going to find that the current system isn't
    tight enough; my suspicion is that users are going to complain that
    we're not aggressive enough about marking pages all-visible, because
    autovac won't kick in until updates+deletes>20%, but (1) index-only
    scans also care about inserts and (2) by the time we've got 20% dead
    tuples index-only scans may well be worthless.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Robert Haas at Apr 16, 2012 at 8:02 pm

    On Mon, Apr 16, 2012 at 3:02 AM, Noah Misch wrote:
    Do you refer to PD_ALL_VISIBLE as "not merely a hint" due to the requirement
    to prevent a page from simultaneously having a negative PD_ALL_VISIBLE and a
    positive visibility map bit?  That is to say, PD_ALL_VISIBLE is fully a hint
    in its role as a witness of tuple visibility, but it is not a hint in its role
    as a witness of the visibility map status?
    Exactly.
    The large comment in SetBufferCommitInfoNeedsSave() seems incorrect.  On
    systems with weaker memory ordering, the FlushBuffer() process's clearing of
    BM_JUST_DIRTIED may not be visible to the SetBufferCommitInfoNeedsSave()
    process until some time after the former retrieved buffer contents from shared
    memory.
    True.
    Memory barriers could make the comment true, but we should probably
    just update the comment to explain that a race condition may eat the update
    entirely.
    Agreed.
    Incidentally, is there a good reason for MarkBufferDirty() to
    increment pgBufferUsage.shared_blks_dirtied and SetBufferCommitInfoNeedsSave()
    not to do so?  Looks like an oversight.
    Also agreed.
    2. More seriously, lazy_scan_heap() releases the buffer lock before
    setting the all-visible bit on the heap.  This looks just plain wrong
    (and it's my fault, to be clear).  Somebody else can come along after
    we've released the buffer lock and mark the page not-all-visible.
    That concurrent process will check the visibility map bit, find it
    already clear, and go on its merry way.  Then, we set the visibility
    map bit and go on our merry way.  Now we have a not-all-visible page
    with the all-visible bit set in the visibility map.   Oops.
    Hmm, indeed.  This deserves its own open item.
    Also agreed.
    Good point.  We could finagle things so the standby notices end-of-recovery
    checkpoints and blocks the optimization for older snapshots.  For example,
    maintain an integer count of end-of-recovery checkpoints seen and store that
    in each Snapshot instead of takenDuringRecovery; use 0 on a master.  When the
    global value is greater than the snapshot's value, disable the optimization
    for that snapshot.  I don't know whether the optimization is powerful enough
    to justify that level of trouble, but it's an option to consider.
    I suspect not. It seems we can make index-only scans work without
    doing something like this; it's only the sequential-scan optimization
    we might lose. But nobody's ever really complained about not having
    that, to my knowledge.
    For a different approach, targeting the fragility, we could add assertions to
    detect unexpected cases of a page bearing PD_ALL_VISIBLE submitted as a
    full-page image.
    The holes are narrow enough that I fear such cases would be detected
    only on high-velocity production systems, which is not exactly where
    you want to find out about such problems.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company

Discussion Overview
group: pgsql-hackers @ postgresql
posted: Apr 13, '12 at 4:33p
active: May 4, '12 at 8:57a
posts: 13
users: 5
website: postgresql.org...
irc: #postgresql
