I'm planning to tackle the problem of killing index tuples for dead rows
during normal processing (ie, before VACUUM). We've noticed repeatedly
that visits to dead heap rows consume a lot of time during indexscans
of heavily-updated tables. This problem has been discussed before,
so the general outline of the solution is apparent, but several design
decisions remain to be made. Here are my present thoughts:
1. The basic idea is for index_getnext, when it retrieves a tuple that
turns out to be invisible to the current transaction, to test whether
the tuple is dead to *all* transactions; if so, tell the index AM to mark
that index tuple as killed. Subsequently the index tuple will be ignored
until it's finally vacuumed. (We cannot try to remove the index tuple
immediately, because of concurrency issues; but not returning it out of
the index AM during an indexscan should largely solve the performance
problem.) Under normal circumstances the time window between "dead to
my transaction" and "dead to all transactions" should not be very large,
so this approach should not cause very many extra tuple-visibility tests
to be performed.
2. The second visibility test is the same as VACUUM's: is the tuple
committed dead (or never good) and older than any running transaction's
xmin? To call HeapTupleSatisfiesVacuum we need an idea of the global
xmin, but we surely don't want index_getnext calling GetOldestXmin()
every time it does this. (Quite aside from the speed penalty, I'm worried
about possible deadlocks due to having to grab SInvalLock there.) Instead
I propose that we modify GetSnapshotData() to compute the current global
xmin as a byproduct of its existing computation (which it can do almost
for free) and stash that in a global variable someplace. index_getnext
can then use the global variable to call HeapTupleSatisfiesVacuum. This
will effectively mean that we do index-tuple killing on the basis of the
global xmin as it stood when we started the current transaction. In some
cases that might be a little out of date, but using an old xmin cannot
cause incorrect behavior; at worst an index entry will survive a little
longer than it really needs to.
3. What should the API to the index AMs look like? I propose adding
two fields to the IndexScanDesc data structure:
bool kill_prior_tuple; /* true if previously returned tuple is dead */
bool ignore_killed_tuples; /* true to not return killed entries */
kill_prior_tuple is always set false during RelationGetIndexScan and at
the start of index_getnext. It's set true when index_getnext detects
a dead tuple and loops around to call the index AM again. So the index
AM may interpret it as "kill the tuple you last returned, ie, the one
indicated by currentItemData". Performing this action as part of
amgetnext minimizes the extra overhead needed to kill a tuple --- we don't
need an extra cycle of re-locking the current index page and re-finding
ignore_killed_tuples will be set true in RelationGetIndexScan, but could
be set false by callers that want to see the killed index tuples.
(Offhand I think only VACUUM would want to do that.)
Within the index AMs, both kill_prior_tuple and ignore_killed_tuples would
be examined only by the topmost amgetnext routine. A "killed" entry
behaves normally with respect to all internal operations of the index AM;
we just don't return it to callers when ignore_killed_tuples is true.
This will minimize the risk of introducing bugs into the index AMs.
As long as we can loop around for the next index tuple before we've
released page locks inside the AM, we should get most of the possible
performance benefit with just a localized change.
4. How exactly should a killed index tuple be marked on-disk? While there
is one free bit available in IndexTupleData.t_info, I would prefer to use
that bit to expand the index tuple size field to 14 bits instead of 13.
(This would allow btree index entries to be up to 10K when BLCKSZ is 32K,
rather than being hard-limited to 8K.) What I am thinking of doing is
using the LP_DELETE bit in ItemIdData.lp_flags --- this appears to be
unused for index tuples. (I'm not sure it's ever set for heap tuples
either, actually, but it definitely looks free for index tuples.)
Whichever bit we use, the index AM can simply set it and mark the buffer
dirty with SetBufferCommitInfoNeedsSave. We do not need to WAL-log the
action, just as we do not WAL-log marking heap tuple commit status bits,
because the action could be done over by someone else if it were lost.
Comments? Anyone see any flaws or better ideas?
regards, tom lane