Hi,

I am currently working on getting HOT and VACUUM FULL to work
together. I hit upon a bug which I initially thought was
something that HOT had introduced, but I can reproduce
it with CVS HEAD as well.

Here is what I do: create a simple table with three columns
and one index, and insert a single row into the table.

CREATE TABLE test (a int, b int, c char(512));
CREATE UNIQUE INDEX testindx ON test (a);
INSERT INTO test VALUES (1, 2, 'test');

Now, I continuously UPDATE the same row with a simple
SQL script using pgbench with two clients.

$ cat test.sql

UPDATE test set b = b + 10 WHERE a = 1;
SELECT *, ctid FROM test;

$ ./pgbench -c 2 -t 50000 -f ./test.sql postgres

Now, I run VACUUM FULL on the table in a loop, with a 20-second
sleep between runs.

$ while (true); do
echo "VACUUM FULL START";
./install/bin/psql -c 'vacuum full verbose test;' postgres;
echo "VACUUM FULL COMPLETE";
sleep 20;
done;


After a few seconds, pgbench fails with the following output:

starting vacuum...end.
Client 1 aborted in state 0: ERROR: could not read block 650 of
relation 1663/11467/16401: read only 0 of 8192 bytes
Client 0 aborted in state 0: ERROR: could not read block 649 of
relation 1663/11467/16401: read only 0 of 8192 bytes
transaction type: Custom query
scaling factor: 1
number of clients: 2
number of transactions per client: 50000
number of transactions actually processed: 29445/100000
tps = 459.980394 (including connections establishing)
tps = 460.040423 (excluding connections establishing)


Is this something that has been reported earlier? My first
guess would be some kind of race condition between the FSM
updates and the relation truncation.

It's too late for me to look into this now. If someone wants
to look into this, that would be great; otherwise I will
work on it tomorrow. Any clues/pointers are appreciated.


Thanks,
Pavan


EnterpriseDB http://www.enterprisedb.com


  • Tom Lane at Mar 7, 2007 at 8:09 pm

    "Pavan Deolasee" <pavan.deolasee@enterprisedb.com> writes:
    I am right now working on to get HOT and VACUUM FULL work
    together. I hit upon a bug which I initially thought is
    something that HOT has introduced. But I can reproduce
    it with CVS HEAD as well.
    I think we broke this in 8.2: vac_update_relstats needs to ensure that
    it always sends a relcache invalidation event, but as of 8.2 it's set
    up to conditionally update the pg_class entry only if it wrote new
    values into the tuple. If vacuum full is truncating the rel back to
    the same length that it was on the previous cycle (as is always the
    case in this test), then no update, hence no relcache flush, hence
    clients still think their cached rd_targblock is good. I think the
    code in vac_update_relstats

    if (dirty)
        heap_inplace_update(rd, ctup);

    needs to look more like what index_update_stats does:

    if (dirty)
    {
        heap_inplace_update(pg_class, tuple);
        /* the above sends a cache inval message */
    }
    else
    {
        /* no need to change tuple, but force relcache inval anyway */
        CacheInvalidateRelcacheByTuple(tuple);
    }

    Please check if this makes it go away for you --- I'm a bit busy
    at the moment.

    regards, tom lane
  • Pavan Deolasee at Mar 8, 2007 at 7:36 am
    Tom Lane wrote:
    > Please check if this makes it go away for you --- I'm a bit busy
    > at the moment.

    Thanks a lot, Tom. It seems to work fine for me. I will do some
    more tests and report if I see any issue. Btw, the patch as per
    your suggestion is attached.

    Thanks,
    Pavan
  • Pavan Deolasee at Mar 8, 2007 at 12:43 pm
    Pavan Deolasee wrote:
    > Thanks a lot, Tom. It seems to work fine for me. I will do some
    > more tests and report if I see any issue.

    The problem mentioned before is hard to reproduce with the
    suggested change, but it has not completely gone away. I have
    seen it again on CVS HEAD with the patch applied.

    I am facing another issue with VACUUM FULL. This problem is
    reproduced very easily with HOT, and takes a few attempts to
    reproduce with CVS HEAD, but it certainly exists.

    This time I am using pgbench. All tables but "history" are
    created with fillfactor=50.

    Now, I start running pgbench with a scale factor of 10, 40
    clients, and 10000 transactions per client. After a few minutes,
    I start running VACUUM FULL on tellers and branches every 10 seconds.
    After a while, all pgbench clients fail with the following
    errors:


    Client 1 aborted in state 11: ERROR: duplicate key violates unique
    constraint "branches_pkey"
    Client 30 aborted in state 11: ERROR: duplicate key violates unique
    constraint "branches_pkey"
    Client 39 aborted in state 11: ERROR: duplicate key violates unique
    constraint "branches_pkey"
    Client 7 aborted in state 11: ERROR: duplicate key violates unique
    constraint "branches_pkey"

    Next run of VACUUM FULL gives the following error:

    WARNING: index "branches_pkey" contains 15 row versions, but table
    contains 12 row versions
    HINT: Rebuild the index with REINDEX.


    Has this been reported earlier? IIRC, Tom mentioned in one of the
    emails that Merlin had reported some problem related
    to "duplicate key violation". Tom, could this be related?


    Thanks,
    Pavan
  • Tom Lane at Mar 9, 2007 at 9:40 pm

    "Pavan Deolasee" <pavan.deolasee@enterprisedb.com> writes:
    The problem mentioned before is hard to reproduce with the
    suggested change, but its not completely gone away. I have
    seen that again on CVS HEAD with the patch applied.
    I am facing another issue with VACUUM FULL. This
    problem gets reproduced with HOT very easily, but takes
    few attempts to reproduce with CVS HEAD, but it
    certainly exists.
    I've been banging away on this since yesterday, and I think I've
    achieved a full understanding of what's going on. There are three or
    four different-looking pathologies but they all seem to arise from
    the same problem: the update-chain-moving code assumes that
    RECENTLY_DEAD tuples will never have update successors that are entirely
    DEAD (according to HeapTupleSatisfiesVacuum). When faced with an
    update chain in which that does happen, it can move the chain multiple
    times, neglect to remove index entries for tuples that get truncated
    away, crash on an Assert, or other interesting stuff. Here's an example
    from some debug printouts I inserted into repair_frag:

    chain forward branches 17/174 to 17/183 (x 1993046 1993057) RECENTLY_DEAD
    chain forward branches 17/183 to 15/109 (x 1993057 1993055) RECENTLY_DEAD
    chain forward branches 15/109 to 15/111 (x 1993055 1993045) DEAD
    chain forward branches 15/111 to 15/114 (x 1993045 1993025) DEAD
    chain forward branches 15/114 to 15/116 (x 1993025 1993096) RECENTLY_DEAD
    chain forward branches 15/116 to 15/119 (x 1993096 1993107) RECENTLY_DEAD
    chain forward branches 15/119 to 15/121 (x 1993107 1993120) RECENTLY_DEAD
    chain forward branches 15/121 to 15/125 (x 1993120 1993121) RECENTLY_DEAD
    chain forward branches 15/125 to 15/128 (x 1993121 1993122) RECENTLY_DEAD
    chain forward branches 15/128 to 15/131 (x 1993122 1993092) RECENTLY_DEAD
    chain forward branches 15/131 to 15/133 (x 1993092 1993145) RECENTLY_DEAD
    chain forward branches 15/133 to 15/139 (x 1993145 1993182) RECENTLY_DEAD
    chain forward branches 15/139 to 15/141 (x 1993182 1993183) RECENTLY_DEAD
    chain forward branches 15/141 to 15/147 (x 1993183 1993155) RECENTLY_DEAD
    chain forward branches 15/147 to 15/150 (x 1993155 1993167) LIVE
    chain back stops at branches 15/114: xmin 1993025 < 1993050
    moved branches 15/150 to 0/69; next 0/69
    moved branches 15/147 to 0/70; next 0/69
    moved branches 15/141 to 0/71; next 0/70
    moved branches 15/139 to 0/72; next 0/71
    moved branches 15/133 to 0/73; next 0/72
    moved branches 15/131 to 0/74; next 0/73
    moved branches 15/128 to 0/75; next 0/74
    moved branches 15/125 to 0/76; next 0/75
    moved branches 15/121 to 0/77; next 0/76
    moved branches 15/119 to 0/78; next 0/77
    moved branches 15/116 to 0/79; next 0/78
    moved branches 15/114 to 0/80; next 0/79

    Since TIDs 17/174 and 17/183 didn't get moved, when the repair_frag
    search arrives at 17/183 it will copy this chain again, leading to
    duplicate copies of the LIVE tuple at the chain end, leading to trouble.

    It's not surprising that tuples could have xmax less than xmin, since
    transactions can commit in orders different than they start; when using
    READ COMMITTED updates it's not at all surprising that a transaction
    might update rows after a later-numbered transaction does. However, in
    looking at this code previously I'd assumed that the OldestXmin cutoff
    could never fall between two such transactions, and so the above
    scenario wouldn't happen. I'm not real sure why I thought that.
    For the cases that VACUUM FULL is interested in, both XIDs mentioned
    in a DEAD tuple must have committed before OldestXmin was computed, but
    there doesn't seem to be a compelling reason why OldestXmin might not
    have been determined by an unrelated third transaction with a number
    between those two.

    I believe it's the case that any update chain members appearing before
    a DEAD entry must in fact also be dead (ie, not listed as live in any
    active snapshot) but a test based on OldestXmin hasn't got enough
    resolution to prove that they are dead.

    Does anyone want to argue that this is an error in the calculation of
    OldestXmin (and if so, how would you propose to fix it)? If not, I'll
    set to work on fixing the chain-moving logic. I think the correct
    behavior is that we shouldn't consider DEAD tuples part of a movable
    chain, and so in the above example there are two separate chains to move
    not one. Alternatively we could try to recognize that the older part of
    the chain is really dead and removable, but that seems complicated and
    likely to introduce new bugs.
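
    For illustration, here is a minimal standalone sketch of that rule
    (hypothetical types and a made-up chain, not PostgreSQL's repair_frag
    code): walking an abbreviated version of the chain above and starting
    a new movable piece whenever a DEAD tuple is hit splits the sequence
    into two movable chains, matching the "two separate chains" reading.

    #include <stdio.h>

    typedef enum { TUPLE_LIVE, TUPLE_RECENTLY_DEAD, TUPLE_DEAD } TupleState;

    int
    main(void)
    {
        /* abbreviated chain: RD RD D D RD RD LIVE */
        TupleState  chain[] = {
            TUPLE_RECENTLY_DEAD, TUPLE_RECENTLY_DEAD,
            TUPLE_DEAD, TUPLE_DEAD,
            TUPLE_RECENTLY_DEAD, TUPLE_RECENTLY_DEAD,
            TUPLE_LIVE
        };
        int         n = sizeof(chain) / sizeof(chain[0]);
        int         piece = 0;
        int         in_piece = 0;

        for (int i = 0; i < n; i++)
        {
            if (chain[i] == TUPLE_DEAD)
            {
                /* DEAD tuples terminate the current piece and are not moved */
                in_piece = 0;
                printf("tuple %d: DEAD, not part of any movable chain\n", i);
                continue;
            }
            if (!in_piece)
            {
                piece++;
                in_piece = 1;
            }
            printf("tuple %d: %s, member of movable piece %d\n", i,
                   chain[i] == TUPLE_LIVE ? "LIVE" : "RECENTLY_DEAD", piece);
        }
        printf("chain splits into %d movable piece(s)\n", piece);
        return 0;
    }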

    I wonder whether this has any implications for HOT ...

    regards, tom lane
  • Simon Riggs at Mar 9, 2007 at 10:30 pm

    On Fri, 2007-03-09 at 16:40 -0500, Tom Lane wrote:
    > I wonder whether this has any implications for HOT ...

    My general feeling, expressed in a number of recent posts, was that the
    VACUUM FULL code really isn't worth the trouble it causes. Especially
    when CLUSTER does a better job anyway?

    I've made a number of different proposals for changing VACUUM FULL,
    and Hannu posted some really cool ideas.

    Please can we spend time doing something useful, rather than trying to
    fix up a bag of worms that nobody ever runs? C'mon guys, this isn't a
    challenge, it's a lost cause. I don't really mean to be radical, but I
    just think VACUUM FULL's time has come. A better utility could be
    written in the time it takes to fix this and be certain of the fix.

    Yes, we need a utility that compacts a table, but isn't there a faster,
    safer way of doing that than the current VACUUM FULL algorithm and code?
    We can still *call* it VACUUM FULL. Modular replacement has been done
    numerous times over the years with great success, e.g. tuplesort, index
    build... let's do the same thing now and kiss goodbye to some code whose
    time has come.

    Put it another way: if anybody submitted a patch that does what VACUUM
    FULL does, coded in the way it is, it would never be applied, now.

    --
    Simon Riggs
    EnterpriseDB http://www.enterprisedb.com
  • Tom Lane at Mar 9, 2007 at 11:00 pm

    "Simon Riggs" <simon@2ndquadrant.com> writes:
    On Fri, 2007-03-09 at 16:40 -0500, Tom Lane wrote:
    I wonder whether this has any implications for HOT ...
    My general feeling, expressed in a number of recent posts was that the
    VACUUM FULL code really isn't worth the trouble it causes. Especially
    when CLUSTER does a better job anyway?
    Point A: we have to fix the back branches anyway.
    Point B: until we have an MVCC-safe CLUSTER, that is not a substitute.

    regards, tom lane
  • Simon Riggs at Mar 10, 2007 at 8:08 am

    On Fri, 2007-03-09 at 18:00 -0500, Tom Lane wrote:
    > "Simon Riggs" <simon@2ndquadrant.com> writes:
    >> On Fri, 2007-03-09 at 16:40 -0500, Tom Lane wrote:
    >>> I wonder whether this has any implications for HOT ...
    >> My general feeling, expressed in a number of recent posts, was that the
    >> VACUUM FULL code really isn't worth the trouble it causes. Especially
    >> when CLUSTER does a better job anyway?
    > Point A: we have to fix the back branches anyway.

    OK, my thoughts were too forward-looking.

    > Point B: until we have an MVCC-safe CLUSTER, that is not a substitute.

    Well, I wasn't actually suggesting we use CLUSTER instead, but there
    have been two other viable suggestions made that were MVCC-safe and with
    much better characteristics (online, faster, etc.). A proposal for making
    CLUSTER MVCC-safe was posted also.

    --
    Simon Riggs
    EnterpriseDB http://www.enterprisedb.com
  • Joshua D. Drake at Mar 9, 2007 at 11:02 pm

    > Put it another way: if anybody submitted a patch that does what VACUUM
    > FULL does, coded in the way it is, it would never be applied, now.
    Have an opinion, do we? How about we just alias VACUUM FULL to CLUSTER
    and add the reporting stuff from VERBOSE?

    Joshua D. Drake




    --

    === The PostgreSQL Company: Command Prompt, Inc. ===
    Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
    Providing the most comprehensive PostgreSQL solutions since 1997
    http://www.commandprompt.com/

    Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
    PostgreSQL Replication: http://www.commandprompt.com/products/
  • Gregory Stark at Mar 10, 2007 at 12:43 am

    "Tom Lane" <tgl@sss.pgh.pa.us> writes:

    It's not surprising that tuples could have xmax less than xmin, since
    transactions can commit in orders different than they start; when using
    READ COMMITTED updates it's not at all surprising that a transaction
    might update rows after a later-numbered transaction does. However, in
    looking at this code previously I'd assumed that the OldestXmin cutoff
    could never fall between two such transactions, and so the above
    scenario wouldn't happen. I'm not real sure why I thought that.
    For the cases that VACUUM FULL is interested in, both XIDs mentioned
    in a DEAD tuple must have committed before OldestXmin was computed, but
    there doesn't seem to be a compelling reason why OldestXmin might not
    have been determined by an unrelated third transaction with a number
    between those two.
    No commentary, but in case anyone else is having trouble following: I
    had to make the following diagram (I think this is what you're
    describing?) before I fully understood it:

    TXN 1      TXN 2      TXN 3      TXN 4      VACUUM

    START
      .                   START
      .        START        .
      .        UPDATE       .
      .        COMMIT       .
    DELETE                  .
    COMMIT                  .
                            .        START
                          COMMIT       .
                                       .         START

    So txn 4's xmin is txn 3, leaving the global OldestXmin = txn 3 which lies
    between txn 1 and txn 2.

    And the tuple chain consists of two tuples. The original which has xmax
    younger than OldestXmin and so is RECENTLY_DEAD. And the updated tuple which
    has xmax older than OldestXmin and so is DEAD even though it has xmin younger
    than OldestXmin.
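
    For illustration, the arithmetic above can be checked with a tiny
    standalone sketch (the XIDs are invented, every xmin/xmax is assumed
    committed, and the classification rule is simplified to "committed xmax
    older than OldestXmin means DEAD, otherwise RECENTLY_DEAD"; this is not
    PostgreSQL's HeapTupleSatisfiesVacuum):

    #include <stdio.h>

    typedef unsigned int Xid;

    /* simplified visibility verdict for an updated/deleted tuple */
    static const char *
    classify(Xid xmax, Xid oldest_xmin)
    {
        return (xmax < oldest_xmin) ? "DEAD" : "RECENTLY_DEAD";
    }

    int
    main(void)
    {
        /* start order in the diagram: TXN 1, TXN 3, TXN 2 (TXN 4 observes) */
        Xid     txn1 = 100, txn3 = 101, txn2 = 102;
        Xid     oldest_xmin = txn3;     /* VACUUM's cutoff, via TXN 4's snapshot */

        /* original tuple: updated (hence xmax set) by TXN 2 */
        printf("original tuple (xmax=%u): %s\n",
               txn2, classify(txn2, oldest_xmin));
        /* updated tuple: created by TXN 2, deleted by TXN 1 */
        printf("updated tuple (xmin=%u, xmax=%u): %s\n",
               txn2, txn1, classify(txn1, oldest_xmin));
        /* prints RECENTLY_DEAD for the original and DEAD for its successor */
        return 0;
    }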


    Hm, I wonder if you could just notice that xmin is younger than OldestXmin.
    In a more complex example you could have lots of DEAD tuples in the chain and
    some RECENTLY_DEAD mixed in randomly. But I think all the DEAD tuples
    following a RECENTLY_DEAD would have to have xmin younger than OldestXmin.
    Or maybe I'm making the same mistake again. Gosh, this is confusing.


    --
    Gregory Stark
    EnterpriseDB http://www.enterprisedb.com
  • Tom Lane at Mar 10, 2007 at 12:49 am

    Gregory Stark writes:
    > So txn 4's xmin is txn 3, leaving the global OldestXmin = txn 3 which lies
    > between txn 1 and txn 2.
    > And the tuple chain consists of two tuples. The original which has xmax
    > younger than OldestXmin and so is RECENTLY_DEAD. And the updated tuple which
    > has xmax older than OldestXmin and so is DEAD even though it has xmin younger
    > than OldestXmin.

    Right.

    > Hm, I wonder if you could just notice that xmin is younger than OldestXmin.
    You can see that at the newer tuple, but the problem is to propagate the
    knowledge back to the older tuple(s). Or were you suggesting that we
    treat the newer tuple as RECENTLY_DEAD instead of DEAD? That seems a
    step backwards in space-reclamation ability. It'd be hard to implement
    in any case, because one of the problem cases is where VACUUM has
    already recycled the DEAD tuple before visiting the RECENTLY_DEAD tuple
    that chains to it.

    regards, tom lane
  • Gregory Stark at Mar 10, 2007 at 1:05 am

    "Tom Lane" <tgl@sss.pgh.pa.us> writes:

    Hm, I wonder if you could just notice that xmin is younger than OldestXmin.
    You can see that at the newer tuple, but the problem is to propagate the
    knowledge back to the older tuple(s). Or were you suggesting that we
    treat the newer tuple as RECENTLY_DEAD instead of DEAD? That seems a
    step backwards in space-reclamation ability. It'd be hard to implement
    in any case, because one of the problem cases is where VACUUM has
    already recycled the DEAD tuple before visiting the RECENTLY_DEAD tuple
    that chains to it.
    I think I was suggesting treating the newer tuple as RECENTLY_DEAD. Ie, not
    vacuuming a tuple if either xmin or xmax is younger than OldestXmin. It's a
    step backwards in space-reclamation but really, how often can it happen?

    But I'm not entirely sure that's enough really. Any number of old transactions
    could come along and update the head of the tuple chain, setting both xmin and
    xmax to old values. I guess only the one tuple immediately following the
    RECENTLY_DEAD tuple would have the young xmin. That doesn't really help you
    identify the later DEAD tuples.

    Breaking the chain up into pieces seems weird. It seems like it's obviously
    bogus and only works because we're sure the tuples are dead anyways so it
    doesn't really matter what we do with them. If we're not sure they're dead it
    seems like the right thing to do is either to keep the whole chain or to
    relink the chain after removing the dead intervening tuples.

    --
    Gregory Stark
    EnterpriseDB http://www.enterprisedb.com
  • Tom Lane at Mar 10, 2007 at 1:34 am

    Gregory Stark writes:
    > Breaking the chain up into pieces seems weird. It seems like it's obviously
    > bogus and only works because we're sure the tuples are dead anyways so it
    > doesn't really matter what we do with them.
    Yup, exactly. If we wanted to be tense about this we'd try to get rid
    of the nominally RECENTLY_DEAD tuples that precede any DEAD tuple in the
    chain. However, I concur with Simon to the extent that I don't want to
    do any more work to fix this bug than necessary, and trying to recognize
    such tuples seems like a lot more work than necessary.

    Also, we know this case works because it already is working: in the
    situation where VACUUM happens to visit and remove the DEAD tuple(s)
    before reaching the RECENTLY_DEAD tuples that link forward to them,
    it treats the RECENTLY_DEAD tuples as a disconnected chain and moves
    them as-is. I saw tons of this in the traces I was making today, and
    it doesn't seem to create any bad effects. (My attention was drawn to
    it because I saw move_chain_tuple being used to move single-member
    chains, which looks impossible when you first look at the code --- the
    is-it-a-chain test seems to ensure that we can link either forward or
    backward. But not so if t_ctid points to an already-removed tuple.)

    regards, tom lane
  • Pavan Deolasee at Mar 10, 2007 at 11:42 am

    On 3/10/07, Tom Lane wrote:
    > Also, we know this case works because it already is working: in the
    > situation where VACUUM happens to visit and remove the DEAD tuple(s)
    > before reaching the RECENTLY_DEAD tuples that link forward to them,
    > it treats the RECENTLY_DEAD tuples as a disconnected chain and moves
    > them as-is. I saw tons of this in the traces I was making today, and
    > it doesn't seem to create any bad effects. (My attention was drawn to
    > it because I saw move_chain_tuple being used to move single-member
    > chains, which looks impossible when you first look at the code --- the
    > is-it-a-chain test seems to ensure that we can link either forward or
    > backward. But not so if t_ctid points to an already-removed tuple.)
    Oh. So that's the corner case which I missed. This would probably
    explain how we could miss marking an offset free and thus fail to
    remove the corresponding index entry.

    Thanks,
    Pavan
  • Pavan Deolasee at Mar 10, 2007 at 11:37 am

    On 3/10/07, Tom Lane wrote:
    > I've been banging away on this since yesterday, and I think I've
    > achieved a full understanding of what's going on. There are three or
    > four different-looking pathologies but they all seem to arise from
    > the same problem: the update-chain-moving code assumes that
    > RECENTLY_DEAD tuples will never have update successors that are entirely
    > DEAD (according to HeapTupleSatisfiesVacuum). When faced with an
    > update chain in which that does happen, it can move the chain multiple
    > times, neglect to remove index entries for tuples that get truncated
    > away, crash on an Assert, or other interesting stuff. Here's an example
    > from some debug printouts I inserted into repair_frag:
    >
    > It's not surprising that tuples could have xmax less than xmin, since
    > transactions can commit in orders different than they start; when using
    > READ COMMITTED updates it's not at all surprising that a transaction
    > might update rows after a later-numbered transaction does. However, in
    > looking at this code previously I'd assumed that the OldestXmin cutoff
    > could never fall between two such transactions, and so the above
    > scenario wouldn't happen. I'm not real sure why I thought that.
    > For the cases that VACUUM FULL is interested in, both XIDs mentioned
    > in a DEAD tuple must have committed before OldestXmin was computed, but
    > there doesn't seem to be a compelling reason why OldestXmin might not
    > have been determined by an unrelated third transaction with a number
    > between those two.


    Sorry for replying late; I was away from email since yesterday. I
    arrived at the same conclusion yesterday while fixing this for HOT.
    It was much more common with HOT and the heavy concurrent
    UPDATE test that I was trying. It was very surprising for me
    as well. I always believed that we can *never* have a chain like:

    RD - RD - D - D - RD - L

    But that's not true. Thankfully, ISTM that this does not have any
    implication for HOT, but I shall think more and confirm.

    I had to introduce a hard-pruning mechanism to deal with this in HOT,
    where we remove any DEAD_CHAIN tuples from the chain
    before doing scan_heap.

    I thought this should not be a problem in CVS HEAD though, because
    we vacuum_page() the block and mark DEAD tuples ~LP_USED.
    And later, while moving backward in the chain, we break whenever
    we find a tuple whose xmin < OldestXmin, assuming that to be
    the start of the chain. My understanding is that these two things
    together anyway break the chain into pieces and we move these
    pieces together. But maybe I am missing something, or there
    is some other corner case.

    In general, I believe that the most likely cause for earlier reported
    errors is that we are failing to clean up one or more index entries
    in VACUUM FULL, thus causing all sorts of errors. I had a hard
    time fixing this case for HOT.


    > I believe it's the case that any update chain members appearing before
    > a DEAD entry must in fact also be dead (ie, not listed as live in any
    > active snapshot) but a test based on OldestXmin hasn't got enough
    > resolution to prove that they are dead.

    I agree.


    > Does anyone want to argue that this is an error in the calculation of
    > OldestXmin (and if so, how would you propose to fix it)? If not, I'll
    > set to work on fixing the chain-moving logic. I think the correct
    > behavior is that we shouldn't consider DEAD tuples part of a movable
    > chain, and so in the above example there are two separate chains to move
    > not one. Alternatively we could try to recognize that the older part of
    > the chain is really dead and removable, but that seems complicated and
    > likely to introduce new bugs.

    Seems like a good idea. But as I said above, my guess is we already
    do that while moving backward in the chain. But I might be wrong.

    > I wonder whether this has any implications for HOT ...
    Fortunately that doesn't seem to be the case. VACUUM LAZY is fine
    because we anyway don't declare a tuple DEAD if it's in the
    middle of a HOT-update chain. For VACUUM FULL, I remove any
    intermediate DEAD tuples, fix the chain and then move the chain.
    We can actually do the same even for the current implementation.
    This would require changing the xmin of the next tuple (when we
    remove a DEAD tuple in the chain) so that the xmax/xmin chain is
    preserved, but we are only changing from one committed xmin to
    another committed xmin and hence it should not have any
    correctness implication.

    Thanks,
    Pavan
  • Tom Lane at Mar 10, 2007 at 5:05 pm

    "Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
    In general, I believe that the most likely cause for earlier reported
    errors is that we are failing to clean up one or more index entries
    in VACUUM FULL, thus causing all sorts of errors. I had a hard
    time fixing this case for HOT.
    Yeah, the case I saw was that the chain-moving code just assumes,
    without checking, that a successful chain move will have gotten rid of
    the tuple it started from. If in fact that does not happen (because
    chaining back stops before a RECENTLY_DEAD tuple) then the tuple is left
    where it is, and if the page is then truncated away, we are left with
    dangling index entries. That causes the WARNINGs about too many index
    entries, and if the TID is later repopulated then you have visible index
    corruption and/or bogus duplicate-key failures.

    Although this shouldn't happen anymore after fixing the chaining
    conditions, I'm inclined to introduce an additional test to verify that
    the starting tuple is actually MOVED_OFF after we finish the chain move.
    If not, give up on repair_frag the same as in some other corner cases.
    >> I wonder whether this has any implications for HOT ...
    > Fortunately that doesn't seem to be the case. VACUUM LAZY is fine
    > because we anyway don't declare a tuple DEAD if it's in the
    > middle of a HOT-update chain. For VACUUM FULL, I remove any
    > intermediate DEAD tuples, fix the chain and then move the chain.
    > We can actually do the same even for the current implementation.
    > This would require changing the xmin of the next tuple (when we
    > remove a DEAD tuple in the chain) so that the xmax/xmin chain is
    > preserved, but we are only changing from one committed xmin to
    > another committed xmin and hence it should not have any
    > correctness implication.
    [ raised eyebrow... ] You sure about that? If you replace an XID
    before OldestXmin with one after, or vice versa, ISTM you *could*
    be creating a problem. "Committed" is not good enough. So it looks
    to me like you can't remove a DEAD tuple whose predecessor is only
    RECENTLY_DEAD.

    regards, tom lane
  • Pavan Deolasee at Mar 10, 2007 at 6:06 pm

    On 3/10/07, Tom Lane wrote:
    > Although this shouldn't happen anymore after fixing the chaining
    > conditions, I'm inclined to introduce an additional test to verify that
    > the starting tuple is actually MOVED_OFF after we finish the chain move.
    > If not, give up on repair_frag the same as in some other corner cases.

    scan_heap() would usually have collected the DEAD tuple in the
    offsets_free list. How do you plan to check whether the tuple is in
    the middle of a chain which has a RECENTLY_DEAD tuple before the
    tuple under check? Don't we need to collect the TID of the DEAD
    tuple in vtlinks[] as well to establish the backward chains?


    > [ raised eyebrow... ] You sure about that? If you replace an XID
    > before OldestXmin with one after, or vice versa, ISTM you *could*
    > be creating a problem. "Committed" is not good enough. So it looks
    > to me like you can't remove a DEAD tuple whose predecessor is only
    > RECENTLY_DEAD.
    I also thought about it. For HOT, since we are only talking about the
    HOT-update chain within the page, ISTM it should be OK to change
    the xmin of the next tuple. I can see the following two cases:

    Case 1:

    Say, OldestXmin = 17

    10,20   20,18   18,16   16,14   14,22   22,0
    RD      RD      D       D       RD      L

    After we remove the two DEAD tuples, the HOT-chain would
    look like:

    10,20   20,18   18,22   22,0
    RD      RD      RD      L

    Case 2:

    Say, OldestXmin = 17

    10,20   20,18   18,16   16,14   14,0
    RD      RD      D       D       L

    After we remove the two DEAD tuples, the HOT-chain would
    look like:

    10,20   20,18   18,0
    RD      RD      L

    In both cases, the modified HOT-chain looks sane to me.
    I understand that when we walk backward in the chain, we would
    "break" at the second-last tuple in Case 1 and at the last tuple
    in Case 2, but with the modified chain we shall walk backward
    to the first tuple in the chain.
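
    For illustration, a minimal standalone sketch of that pruning step
    (hypothetical types, not the actual HOT patch): DEAD members are
    dropped and each surviving tuple's xmin is rewritten to its new
    predecessor's xmax, which reproduces the Case 1 result above.

    #include <stdio.h>

    #define OLDEST_XMIN 17
    #define NTUPLES     6

    typedef struct { unsigned xmin, xmax; } TupleVersion;

    int
    main(void)
    {
        /* Case 1: RD RD D D RD LIVE (xmax == 0 means "not yet updated") */
        TupleVersion    chain[NTUPLES] = {
            {10, 20}, {20, 18}, {18, 16}, {16, 14}, {14, 22}, {22, 0}
        };
        TupleVersion    pruned[NTUPLES];
        int             nkept = 0;

        for (int i = 0; i < NTUPLES; i++)
        {
            int dead = (chain[i].xmax != 0 && chain[i].xmax < OLDEST_XMIN);

            if (dead)
                continue;               /* drop DEAD members entirely */
            pruned[nkept] = chain[i];
            if (nkept > 0)              /* keep the xmax -> xmin linkage intact */
                pruned[nkept].xmin = pruned[nkept - 1].xmax;
            nkept++;
        }

        for (int i = 0; i < nkept; i++)
            printf("(%u,%u) ", pruned[i].xmin, pruned[i].xmax);
        printf("\n");                   /* prints: (10,20) (20,18) (18,22) (22,0) */
        return 0;
    }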

    Do you see any issue here? Anyway, if you plan to fix the problem
    in a different way for the current implementation, I can use the same
    approach for HOT.

    Thanks,
    Pavan
  • Pavan Deolasee at Mar 10, 2007 at 6:33 pm

    On 3/10/07, Pavan Deolasee wrote:
    > scan_heap() would usually have collected the DEAD tuple in the
    > offsets_free list. How do you plan to check whether the tuple is in
    > the middle of a chain which has a RECENTLY_DEAD tuple before the
    > tuple under check? Don't we need to collect the TID of the DEAD
    > tuple in vtlinks[] as well to establish the backward chains?
    Now that I read your first mail more carefully, I think you are suggesting
    that we move the tuple chains in pieces, where each piece is terminated
    when we see a DEAD tuple. In that case, we don't need any of what
    I said above. Also, ISTM that HOT would work fine with this change
    and we may not need the xmin-hack I described earlier. So it makes
    me comfortable. Well, at least until I take your modified code, merge
    the HOT changes and rerun the crazy UPDATE/VACUUM FULL intensive
    tests :)

    Thanks,
    Pavan
  • Gregory Stark at Mar 10, 2007 at 6:38 pm

    "Tom Lane" <tgl@sss.pgh.pa.us> writes:

    [ raised eyebrow... ] You sure about that? If you replace an XID
    before OldestXmin with one after, or vice versa, ISTM you *could*
    be creating a problem. "Committed" is not good enough. So it looks
    to me like you can't remove a DEAD tuple whose predecessor is only
    RECENTLY_DEAD.
    Except you're replacing it with the xmax of a tuple that, while you can't
    prove it using OldestXmin, you're convinced is in fact really DEAD. So
    everyone should be able to see the txn id you're substituting.

    It seems safer to remove all the tuples you think are DEAD rather than leave
    them and have hidden assumptions that you're right. It would be far easier to
    test that you're not removing tuples that aren't dead than that you aren't
    breaking the chain or substituting a live xmin in cases where it might matter.

    --
    Gregory Stark
    EnterpriseDB http://www.enterprisedb.com
  • Tom Lane at Mar 13, 2007 at 9:22 pm

    "Pavan Deolasee" <pavan.deolasee@enterprisedb.com> writes:
    The problem mentioned before is hard to reproduce with the
    suggested change, but its not completely gone away. I have
    seen that again on CVS HEAD with the patch applied.
    I am facing another issue with VACUUM FULL. This
    problem gets reproduced with HOT very easily, but takes
    few attempts to reproduce with CVS HEAD, but it
    certainly exists.
    I've developed the attached patch against HEAD, and no longer see any
    funny behavior. Would appreciate it if you'd test some more, though.

    regards, tom lane
  • Pavan Deolasee at Mar 14, 2007 at 7:41 am
    Tom Lane wrote:
    > I've developed the attached patch against HEAD, and no longer see any
    > funny behavior. Would appreciate it if you'd test some more, though.

    The patch works for me. With the patch applied, I don't see the
    weird errors in the pgbench and other customized tests that I
    used to see earlier.

    I looked at the patch as well. ISTM that we are now moving chains
    in pieces, where each piece is terminated by a DEAD tuple. That
    implies that the MOVED_OFF chain is actually broken. This
    should not be a problem as long as our assumption holds that all
    RECENTLY_DEAD tuples preceding a DEAD tuple must also be DEAD,
    and that it's only the way OldestXmin is calculated that makes us
    see them as RECENTLY_DEAD.

    If that assumption is true (and it must be true for us to move
    the chain in pieces), doesn't that mean we don't really need to
    move the RECENTLY_DEAD tuples preceding a DEAD tuple? One way
    to do so would be to collect the target tuple (the tuple from
    which we started following the t_ctid chain) in free_offsets
    if a DEAD or INSERT_IN_PROGRESS tuple is found while following
    the t_ctid chain. One by one, we would collect all the
    RECENTLY_DEAD tuples preceding a DEAD tuple in the truncated
    pages.

    Not that I am suggesting we do this, just wanted to check if
    there is a flaw in my thinking. I agree that we should not be
    spending too much time on fixing this corner case and the
    patch that you have developed is good enough.


    Thanks,
    Pavan
  • Tom Lane at Mar 14, 2007 at 2:58 pm

    "Pavan Deolasee" <pavan.deolasee@enterprisedb.com> writes:
    If that assumption is true (and it must be true for us to move
    the chain in pieces), doesn't that mean we don't really need to
    move the RECENTLY_DEAD tuples preceding a DEAD tuple ?
    As I've already said several times: they are dead, but at least for
    VACUUM FULL's purposes it seems unreasonably difficult to determine that
    and remove them. The point at which we'd figure this out is after we've
    already performed dead-tuple removal (at least for some of the pages
    involved). It would make for a significant increase in logical
    complexity, and that code is too damn complicated already. Since we
    know this is a seldom-seen corner case, I'm not going to risk
    introducing new bugs to recycle a few tuples a bit sooner.

    regards, tom lane
