I'm planning to tackle the problem of killing index tuples for dead rows
during normal processing (ie, before VACUUM). We've noticed repeatedly
that visits to dead heap rows consume a lot of time during indexscans
of heavily-updated tables. This problem has been discussed before,
so the general outline of the solution is apparent, but several design
decisions remain to be made. Here are my present thoughts:

1. The basic idea is for index_getnext, when it retrieves a tuple that
turns out to be invisible to the current transaction, to test whether
the tuple is dead to *all* transactions; if so, tell the index AM to mark
that index tuple as killed. Subsequently the index tuple will be ignored
until it's finally vacuumed. (We cannot try to remove the index tuple
immediately, because of concurrency issues; but not returning it out of
the index AM during an indexscan should largely solve the performance
problem.) Under normal circumstances the time window between "dead to
my transaction" and "dead to all transactions" should not be very large,
so this approach should not cause very many extra tuple-visibility tests
to be performed.

2. The second visibility test is the same as VACUUM's: is the tuple
committed dead (or never good) and older than any running transaction's
xmin? To call HeapTupleSatisfiesVacuum we need an idea of the global
xmin, but we surely don't want index_getnext calling GetOldestXmin()
every time it does this. (Quite aside from the speed penalty, I'm worried
about possible deadlocks due to having to grab SInvalLock there.) Instead
I propose that we modify GetSnapshotData() to compute the current global
xmin as a byproduct of its existing computation (which it can do almost
for free) and stash that in a global variable someplace. index_getnext
can then use the global variable to call HeapTupleSatisfiesVacuum. This
will effectively mean that we do index-tuple killing on the basis of the
global xmin as it stood when we started the current transaction. In some
cases that might be a little out of date, but using an old xmin cannot
cause incorrect behavior; at worst an index entry will survive a little
longer than it really needs to.

3. What should the API to the index AMs look like? I propose adding
two fields to the IndexScanDesc data structure:

bool kill_prior_tuple; /* true if previously returned tuple is dead */
bool ignore_killed_tuples; /* true to not return killed entries */

kill_prior_tuple is always set false during RelationGetIndexScan and at
the start of index_getnext. It's set true when index_getnext detects
a dead tuple and loops around to call the index AM again. So the index
AM may interpret it as "kill the tuple you last returned, ie, the one
indicated by currentItemData". Performing this action as part of
amgetnext minimizes the extra overhead needed to kill a tuple --- we don't
need an extra cycle of re-locking the current index page and re-finding
our place.

ignore_killed_tuples will be set true in RelationGetIndexScan, but could
be set false by callers that want to see the killed index tuples.
(Offhand I think only VACUUM would want to do that.)

Within the index AMs, both kill_prior_tuple and ignore_killed_tuples would
be examined only by the topmost amgetnext routine. A "killed" entry
behaves normally with respect to all internal operations of the index AM;
we just don't return it to callers when ignore_killed_tuples is true.
This will minimize the risk of introducing bugs into the index AMs.
As long as we can loop around for the next index tuple before we've
released page locks inside the AM, we should get most of the possible
performance benefit with just a localized change.

4. How exactly should a killed index tuple be marked on-disk? While there
is one free bit available in IndexTupleData.t_info, I would prefer to use
that bit to expand the index tuple size field to 14 bits instead of 13.
(This would allow btree index entries to be up to 10K when BLCKSZ is 32K,
rather than being hard-limited to 8K.) What I am thinking of doing is
using the LP_DELETE bit in ItemIdData.lp_flags --- this appears to be
unused for index tuples. (I'm not sure it's ever set for heap tuples
either, actually, but it definitely looks free for index tuples.)

Whichever bit we use, the index AM can simply set it and mark the buffer
dirty with SetBufferCommitInfoNeedsSave. We do not need to WAL-log the
action, just as we do not WAL-log marking heap tuple commit status bits,
because the action could be done over by someone else if it were lost.

Comments? Anyone see any flaws or better ideas?

regards, tom lane

  • Andreas Zeugswetter at May 22, 2002 at 10:29 am

    > 4. How exactly should a killed index tuple be marked on-disk? While there
    > is one free bit available in IndexTupleData.t_info, I would prefer to use
    > that bit to expand the index tuple size field to 14 bits instead of 13.
    > (This would allow btree index entries to be up to 10K when BLCKSZ is 32K,
    > rather than being hard-limited to 8K.)

    While I agree that it might be handy to save this bit for future use,
    I do not see any value in increasing the max key length from 8K,
    especially when the new limit is then 10K. The limit is already 32 times
    the max key size of some other DBs, and even those 256 bytes are usually
    sufficient.

    Andreas
  • Hannu Krosing at May 22, 2002 at 12:18 pm

    On Wed, 2002-05-22 at 12:28, Zeugswetter Andreas SB SD wrote:
    > > 4. How exactly should a killed index tuple be marked on-disk? While there
    > > is one free bit available in IndexTupleData.t_info, I would prefer to use
    > > that bit to expand the index tuple size field to 14 bits instead of 13.
    > > (This would allow btree index entries to be up to 10K when BLCKSZ is 32K,
    > > rather than being hard-limited to 8K.)
    > While I agree that it might be handy to save this bit for future use,
    > I do not see any value in increasing the max key length from 8K,
    > especially when the new limit is then 10K. The limit is already 32 times
    > the max key size of some other DBs, and even those 256 bytes are usually
    > sufficient.
    I'm not sure if it applies here, but key length for GIST indexes may
    benefit from 2x increase (14bits = 16k). IIRC limited key length is one
    reason for intarray indexes being 'lossy'.

    And we can even make it bigger if we start measuring keys in words or
    dwords instead of bytes - 16k x dword = 64kb

    --------------
    Hannu
  • Tom Lane at May 22, 2002 at 1:49 pm

    Hannu Krosing writes:
    > On Wed, 2002-05-22 at 12:28, Zeugswetter Andreas SB SD wrote:
    > > While I agree that it might be handy to save this bit for future use,
    > > I do not see any value in increasing the max key length from 8K,
    > I'm not sure if it applies here, but key length for GIST indexes may
    > benefit from 2x increase (14 bits = 16K). IIRC limited key length is one
    > reason for intarray indexes being 'lossy'.
    Since there seems to be some dissension about that, I'll leave the
    t_info bit unused for now, instead of absorbing it into the length
    field.

    Since 13 bits is sufficient for 8K, people would not see any benefit
    anyway unless they use a nonstandard BLCKSZ. So I'm not that concerned
    about raising it --- just wanted to throw out the idea and see if people
    liked it.

    In the long run it'd be possible to not store length in IndexTupleData
    at all, but rely on the length from the item header, same as we do for
    heap tuples. So if we ever need more bits in IndexTupleData, there's
    a way out.

    regards, tom lane
  • Jan Wieck at May 22, 2002 at 3:40 pm

    Tom Lane wrote:
    > Hannu Krosing <hannu@tm.ee> writes:
    > > On Wed, 2002-05-22 at 12:28, Zeugswetter Andreas SB SD wrote:
    > > > While I agree that it might be handy to save this bit for future use,
    > > > I do not see any value in increasing the max key length from 8K,
    > > I'm not sure if it applies here, but key length for GIST indexes may
    > > benefit from 2x increase (14 bits = 16K). IIRC limited key length is one
    > > reason for intarray indexes being 'lossy'.
    > Since there seems to be some dissension about that, I'll leave the
    > t_info bit unused for now, instead of absorbing it into the length
    > field.
    >
    > Since 13 bits is sufficient for 8K, people would not see any benefit
    > anyway unless they use a nonstandard BLCKSZ. So I'm not that concerned
    > about raising it --- just wanted to throw out the idea and see if people
    > liked it.

    Also, in btree haven't we had some problems with index page
    splits when using entries large enough that not at least
    3 of them fit on a page?


    Jan

    --

    #======================================================================#
    # It's easier to get forgiveness for being wrong than for being right. #
    # Let's break this rule - forgive me.                                  #
    #================================================== JanWieck@Yahoo.com #
  • Tom Lane at May 22, 2002 at 3:56 pm

    Jan Wieck writes:
    > Also, in btree haven't we had some problems with index page
    > splits when using entries large enough that not at least
    > 3 of them fit on a page?
    Right, that's why I said that the limit would only go up to ~10K anyway;
    btree won't take keys > 1/3 page.

    regards, tom lane
  • Jan Wieck at May 22, 2002 at 5:16 pm

    Tom Lane wrote:
    > Jan Wieck <janwieck@yahoo.com> writes:
    > > Also, in btree haven't we had some problems with index page
    > > splits when using entries large enough that not at least
    > > 3 of them fit on a page?
    > Right, that's why I said that the limit would only go up to ~10K anyway;
    > btree won't take keys > 1/3 page.
    What's the point then? I mean, someone who needs more than 8K
    will outgrow 10K in no time, and those cases are topics for
    comp.databases.abuse.brutal ...


    Jan

    --

    #======================================================================#
    # It's easier to get forgiveness for being wrong than for being right. #
    # Let's break this rule - forgive me.                                  #
    #================================================== JanWieck@Yahoo.com #
  • Tom Lane at May 22, 2002 at 5:29 pm

    Jan Wieck writes:
    > Tom Lane wrote:
    > > Right, that's why I said that the limit would only go up to ~10K anyway;
    > > btree won't take keys > 1/3 page.
    > What's the point then?
    Well, btree's not the only index access method we have. I'm not sure
    whether gist or rtree allow larger tuples...

    regards, tom lane
  • Manfred Koizar at May 22, 2002 at 8:01 pm

    On Tue, 21 May 2002 12:48:39 -0400, Tom Lane wrote:
    > 4. How exactly should a killed index tuple be marked on-disk? While there
    > is one free bit available in IndexTupleData.t_info, I would prefer to use
    > that bit to expand the index tuple size field to 14 bits instead of 13.
    > (This would allow btree index entries to be up to 10K when BLCKSZ is 32K,
    > rather than being hard-limited to 8K.) What I am thinking of doing is
    > using the LP_DELETE bit in ItemIdData.lp_flags --- this appears to be
    > unused for index tuples. (I'm not sure it's ever set for heap tuples
    > either, actually, but it definitely looks free for index tuples.)
    AFAICS LP_DELETE is not used at all. The only place where something
    seems to happen to it is in PageRepairFragmentation() in bufpage.c:

        if ((*lp).lp_flags & LP_DELETE) /* marked for deletion */
            (*lp).lp_flags &= ~(LP_USED | LP_DELETE);

    but there is no place where this bit is set. There's also a macro
    definition in itemid.h:

        #define ItemIdDeleted(itemId) \
            (((itemId)->lp_flags & LP_DELETE) != 0)

    which is *always* used in this context:

        if (!ItemIdIsUsed(lp) || ItemIdDeleted(lp))

    So it looks safe to use this bit for marking dead tuples. Wouldn't it
    be even possible to simply reset LP_USED instead of setting
    LP_DELETE?

    If you do not use LP_DELETE I'd vote for cleaning up the source and
    removing it completely.

    Yet another idea: set ItemIdData.lp_len = 0 for killed index tuples.

    Will this free space be used by subsequent inserts?

    Servus
    Manfred
  • Tom Lane at May 22, 2002 at 9:56 pm

    Manfred Koizar writes:
    > So it looks safe to use this bit for marking dead tuples. Wouldn't it
    > be even possible to simply reset LP_USED instead of setting
    > LP_DELETE?

    Mmmm ... I don't think so. That would cause the tuple to actually
    disappear from the perspective of the index AM internals, which seems
    like a bad idea. (For example, if another backend had an indexscan
    stopped on that same tuple, it would fail to re-find its place when it
    tried to continue the indexscan.)

    > Yet another idea: set ItemIdData.lp_len = 0 for killed index tuples.

    See above. This is *not* a substitute for vacuuming.

    regards, tom lane

Discussion Overview
group: pgsql-hackers @ postgresql
posted: May 21, '02 at 5:43p
active: May 22, '02 at 9:56p
posts: 10
users: 5

People

Translate

site design / logo © 2021 Grokbase