FAQ
While testing the bug from the "Assertion failure at standby promotion",
I bumped into a different bug in fast promotion. When the first
checkpoint after fast promotion is performed, there is no guarantee that
the checkpointer process is running with the correct, new,
ThisTimeLineID. In CreateCheckPoint(), we have this:
/*
* An end-of-recovery checkpoint is created before anyone is allowed to
* write WAL. To allow us to write the checkpoint record, temporarily
* enable XLogInsertAllowed. (This also ensures ThisTimeLineID is
* initialized, which we need here and in AdvanceXLInsertBuffer.)
*/
if (flags & CHECKPOINT_END_OF_RECOVERY)
LocalSetXLogInsertAllowed();
That ensures that ThisTimeLineID is updated when performing an
end-of-recovery checkpoint, but it doesn't get executed with fast
promotion. The consequence is that the checkpoint is created with the
old timeline, and subsequent recovery from it will fail.

I ran into this with the attached script. It sets up a master (M), a
standby (B), and a cascading standby (C). I'm not sure why, but when I
tried to simplify the script by removing the cascading standby, it
started to work. The bug occurs in standby B, so I'm not sure why the
presence of the cascading standby makes any difference. Maybe it just
affects the timing.

- Heikki

Search Discussions

  • Fujii Masao at May 9, 2013 at 12:59 am

    On Tue, May 7, 2013 at 6:57 PM, Heikki Linnakangas wrote:
    While testing the bug from the "Assertion failure at standby promotion", I
    bumped into a different bug in fast promotion. When the first checkpoint
    after fast promotion is performed, there is no guarantee that the
    checkpointer process is running with the correct, new, ThisTimeLineID. In
    CreateCheckPoint(), we have this:
    /*
    * An end-of-recovery checkpoint is created before anyone is
    allowed to
    * write WAL. To allow us to write the checkpoint record,
    temporarily
    * enable XLogInsertAllowed. (This also ensures ThisTimeLineID is
    * initialized, which we need here and in AdvanceXLInsertBuffer.)
    */
    if (flags & CHECKPOINT_END_OF_RECOVERY)
    LocalSetXLogInsertAllowed();

    That ensures that ThisTimeLineID is updated when performing an
    end-of-recovery checkpoint, but it doesn't get executed with fast promotion.
    The consequence is that the checkpoint is created with the old timeline, and
    subsequent recovery from it will fail.

    I ran into this with the attached script. It sets up a master (M), a standby
    (B), and a cascading standby (C). I'm not sure why, but when I tried to
    simplify the script by removing the cascading standby, it started to work.
    The bug occurs in standby B, so I'm not sure why the presence of the
    cascading standby makes any difference. Maybe it just affects the timing.
    Can this really happen? ISTM that the checkpointer should detect that
    the recovery mode ends and call RecoveryInProgress()->InitXLOGAccess()
    before calling CreateCheckPoint().

    Regards,

    --
    Fujii Masao
  • Amit Kapila at May 9, 2013 at 5:42 am

    On Thursday, May 09, 2013 6:29 AM Fujii Masao wrote:
    On Tue, May 7, 2013 at 6:57 PM, Heikki Linnakangas
    wrote:
    While testing the bug from the "Assertion failure at standby
    promotion", I
    bumped into a different bug in fast promotion. When the first
    checkpoint
    after fast promotion is performed, there is no guarantee that the
    checkpointer process is running with the correct, new,
    ThisTimeLineID. In
    CreateCheckPoint(), we have this:
    /*
    * An end-of-recovery checkpoint is created before anyone is
    allowed to
    * write WAL. To allow us to write the checkpoint record,
    temporarily
    * enable XLogInsertAllowed. (This also ensures
    ThisTimeLineID is
    * initialized, which we need here and in
    AdvanceXLInsertBuffer.)
    */
    if (flags & CHECKPOINT_END_OF_RECOVERY)
    LocalSetXLogInsertAllowed();

    That ensures that ThisTimeLineID is updated when performing an
    end-of-recovery checkpoint, but it doesn't get executed with fast
    promotion.
    The consequence is that the checkpoint is created with the old
    timeline, and
    subsequent recovery from it will fail.

    I ran into this with the attached script. It sets up a master (M), a standby
    (B), and a cascading standby (C). I'm not sure why, but when I tried to
    simplify the script by removing the cascading standby, it started to work.
    The bug occurs in standby B, so I'm not sure why the presence of the
    cascading standby makes any difference. Maybe it just affects the
    timing.

    Can this really happen? ISTM that the checkpointer should detect that
    the recovery mode ends and call RecoveryInProgress()->InitXLOGAccess()
    before calling CreateCheckPoint().
    Without fast-promotion, it will request/perform End of Recovery checkpoint
    while still in recovery (before setting xlogctl->SharedRecoveryInProgress),
    So I think before any new operation can start, it can make sure that
    Checkpoint with new timeline is performed.

    However with fast promotion, the request for checkpoint is done after
    recovery; so some operations can happen before checkpoint with new timeline.
    I think it can so happen that last checkpoint is with old timeline and there
    are operations with new timeline which might have caused the problem Heikki
    has seen.

    With Regards,
    Amit Kapila.
  • Kyotaro HORIGUCHI at May 9, 2013 at 8:06 am
    Hello,
    I think it can so happen that last checkpoint is with old timeline and there
    are operations with new timeline which might have caused the problem Heikki
    has seen.
    I suppose to have seen that.

    After adding an SQL command to modify the DB on standby-B after
    passive(propagated?) promotion of standby-C,

    < pg_ctl -D data-standbyB promote
    ---
    pg_ctl -D data-standbyB -w promote
    sleep 20
    psql -p 5433 postgres -c 'create table t (a int); checkpoint;'
    I saw following complaint from standby-C.
    B 2013-05-09 17:00:06.019 JST 30268 LOG: checkpoint starting: immediate force wait
    C 2013-05-09 17:00:06.211 JST 30277 LOG: invalid magic number 0000 in log segment 000000010000000000000005, offset 4169728
    C 2013-05-09 17:00:06.211 JST 30295 FATAL: terminating walreceiver process due to administrator command
    B 2013-05-09 17:00:06.219 JST 30268 LOG: checkpoint complete: wrote 18 buffers (0.1%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.000 s, sync=0.157 s, total=0.199 s; sync files=13, longest=0.041 s, average=0.012 s
    CHECKPOINT
    C 2013-05-09 16:41:33.974 JST 29624 LOG: record with zero length at 0/53F8B90
    C 2013-05-09 16:41:33.974 JST 29624 LOG: record with zero length at 0/53F8B90
    C 2013-05-09 16:41:38.980 JST 29624 LOG: record with zero length at 0/53F8B90
    .. repeats forever ..

    This seems to be caused from a kind of timeline inconsistency at
    first glance.. Explicit checkpoint before the modification on
    Sby-B does not help.
    psql -p 5433 postgres -c 'checkpoint;create table t (a int); checkpoint;'
    regards,

    --
    Kyotaro Horiguchi
    NTT Open Source Software Center
  • Kyotaro HORIGUCHI at May 9, 2013 at 8:44 am
    With printing some additinal logs, the situation should be more
    clear..

    It seems that Sby-B failes to promote to TLI= 2; nevertheless the
    history file for TLI = 2 is somehow sent to sby-C. So sby-B
    remains on TLI=1 but sby-C solely switches onto TLI=2.

    # Come to think of this, I suspect that the additional logs is not so useful :(
    B 2013-05-09 17:29:53.380 JST 32258 ERROR: server switched off timeline 1 at 0/53F8B60, but walsender already streamed up to 0/53FA000
    C 2013-05-09 17:29:53.380 JST 32257 FATAL: could not receive data from WAL stream: ERROR: server switched off timeline 1 at 0/53F8B60, but walsender already streamed up to 0/53FA000

    B 2013-05-09 17:29:53.380 JST 32244 LOG: database system is ready to accept connections ..
    C 2013-05-09 17:30:08.395 JST 32256 LOG: Reading page on Timeline ID = 1
    C 2013-05-09 17:30:08.398 JST 32274 LOG: fetching timeline history file for timeline 2 from primary server
    C 2013-05-09 17:30:08.448 JST 32274 LOG: started streaming WAL from primary at 0/5000000 on timeline 1
    C 2013-05-09 17:30:08.452 JST 32274 LOG: replication terminated by primary server
    C 2013-05-09 17:30:08.452 JST 32274 DETAIL: End of WAL reached on timeline 1 at 0/53F8B60
    C 2013-05-09 17:30:08.452 JST 32256 LOG: new target timeline is 2
    C 2013-05-09 17:30:08.452 JST 32256 LOG: Reading page on Timeline ID = 1
    C 2013-05-09 17:30:08.452 JST 32256 LOG: Reading page on Timeline ID = 1
    C 2013-05-09 17:30:08.453 JST 32274 LOG: restarted WAL streaming at 0/5000000 on timeline 2
    B 2013-05-09 17:30:10.913 JST 32248 LOG: This checkpoint record is on TimelineID = 1, loc is about 0/53F8C30
    B 2013-05-09 17:30:10.953 JST 32248 LOG: checkpoint complete: wrote 637 buffers (3.9%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=13.502 s, sync=0.105 s, total=13.733 s; sync files=2, longest=0.089 s, average=0.052 s
    B 2013-05-09 17:30:10.953 JST 32248 LOG: checkpoint starting: immediate force wait
    B 2013-05-09 17:30:10.963 JST 32248 LOG: This checkpoint record is on TimelineID = 1, loc is about 0/53F8CD0
    B 2013-05-09 17:30:11.003 JST 32248 LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.000 s, sync=0.000 s, total=0.049 s; sync files=0, longest=0.000 s, average=0.000 s
    B 2013-05-09 17:30:11.096 JST 32248 LOG: checkpoint starting: immediate force wait
    B 2013-05-09 17:30:11.909 JST 32248 LOG: This checkpoint record is on TimelineID = 1, loc is about 0/540BEF8
    C 2013-05-09 17:30:11.929 JST 32256 LOG: invalid magic number 0000 in log segment 000000010000000000000005, offset 4169728
    C 2013-05-09 17:30:11.929 JST 32274 FATAL: terminating walreceiver process due to administrator command
    B 2013-05-09 17:30:11.951 JST 32248 LOG: checkpoint complete: wrote 18 buffers (0.1%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.017 s, sync=0.785 s, total=0.855 s; sync files=13, longest=0.235 s, average=0.060 s
    CHECKPOINT
    C 2013-05-09 17:30:13.931 JST 32256 LOG: Reading page on Timeline ID = 2
    C 2013-05-09 17:30:13.931 JST 32256 LOG: record with zero length at 0/53F8B90
    C 2013-05-09 17:30:13.931 JST 32256 LOG: Reading page on Timeline ID = 2
    C 2013-05-09 17:30:13.931 JST 32256 LOG: record with zero length at 0/53F8B90
    C 2013-05-09 17:30:18.936 JST 32256 LOG: Reading page on Timeline ID = 2
    C 2013-05-09 17:30:18.936 JST 32256 LOG: Reading page on Timeline ID = 2
    C 2013-05-09 17:30:18.936 JST 32256 LOG: record with zero length at 0/53F8B90
    regards,

    --
    Kyotaro Horiguchi
    NTT Open Source Software Center
  • Amit Kapila at May 9, 2013 at 1:09 pm

    On Thursday, May 09, 2013 2:14 PM Kyotaro HORIGUCHI wrote:
    With printing some additinal logs, the situation should be more
    clear..

    It seems that Sby-B failes to promote to TLI= 2; nevertheless the
    history file for TLI = 2 is somehow sent to sby-C. So sby-B
    remains on TLI=1 but sby-C solely switches onto TLI=2.

    # Come to think of this, I suspect that the additional logs is not so
    useful :(
    It seems to me, it is the same problem as discussed and fixed in below
    thread.
    http://www.postgresql.org/message-id/51894942.4080500@vmware.com

    Could you try with fixes given by heikki.

    With Regards,
    Amit Kapila.
  • Kyotaro HORIGUCHI at May 10, 2013 at 8:37 am
    Thank you for noticing me of that.
    It seems to me, it is the same problem as discussed and fixed in below
    thread.
    http://www.postgresql.org/message-id/51894942.4080500@vmware.com

    Could you try with fixes given by heikki.
    The first one settles the timeline transition problem for the
    present. Besides, I have no longer found any recovery failure
    except unrepeatable one shown below.
    C 2013-05-10 15:30:29.589 JST 9242 LOG: restarted WAL streaming at 0/5000000 on timeline 2
    B 2013-05-10 15:30:33.789 JST 9233 LOG: restartpoint complete: (snip.)
    B 2013-05-10 15:30:33.789 JST 9233 LOG: recovery restart point at 0/3000028
    B 2013-05-10 15:30:33.789 JST 9233 LOG: checkpoint starting:
    C 2013-05-10 15:32:32.170 JST 9242 FATAL: could not receive data from WAL stream:
    I'm get confused, the patch seems to me ensureing the "first
    checkpoint after fast promotion is performed" to use the
    "correct, new, ThisTimeLineID".

    At Thu, 9 May 2013 11:10:23 +0530, Amit Kapila <amit.kapila@huawei.com> wrote in <005501ce4c77$b45f0b70$1d1d2250$@kapila@huawei.com>
    Without fast-promotion, it will request/perform End of Recovery checkpoint
    while still in recovery (before setting xlogctl->SharedRecoveryInProgress),
    So I think before any new operation can start, it can make sure that
    Checkpoint with new timeline is performed.

    However with fast promotion, the request for checkpoint is done after
    recovery; so some operations can happen before checkpoint with new timeline.
    I think it can so happen that last checkpoint is with old timeline and there
    are operations with new timeline which might have caused the problem Heikki
    has seen.
    I don't see any problem (in my view :) that something writes WAL
    runs before the first checkpoint starts - as described in the
    comment just above the checkpoint request..

    regards,

    --
    Kyotaro Horiguchi
    NTT Open Source Software Center
  • Amit Kapila at May 10, 2013 at 11:01 am

    On Friday, May 10, 2013 2:07 PM Kyotaro HORIGUCHI wrote:
    Thank you for noticing me of that.
    It seems to me, it is the same problem as discussed and fixed in below
    thread.
    http://www.postgresql.org/message-id/51894942.4080500@vmware.com

    Could you try with fixes given by heikki.
    The first one settles the timeline transition problem for the
    present. Besides, I have no longer found any recovery failure
    except unrepeatable one shown below.
    C 2013-05-10 15:30:29.589 JST 9242 LOG: restarted WAL streaming at
    0/5000000 on timeline 2
    B 2013-05-10 15:30:33.789 JST 9233 LOG: restartpoint complete: (snip.)
    B 2013-05-10 15:30:33.789 JST 9233 LOG: recovery restart point at 0/3000028
    B 2013-05-10 15:30:33.789 JST 9233 LOG: checkpoint starting:
    C 2013-05-10 15:32:32.170 JST 9242 FATAL: could not receive data
    from WAL stream:
    Is there any chance, that there is any network glitch caused this one time
    error.
    I'm get confused, the patch seems to me ensureing the "first
    checkpoint after fast promotion is performed" to use the
    "correct, new, ThisTimeLineID".
    What is your confusion?

    With Regards,
    Amit Kapila.
  • Kyotaro HORIGUCHI at May 13, 2013 at 12:24 am

    2013/05/10 20:01 "Amit Kapila" <amit.kapila@huawei.com>:
    C 2013-05-10 15:32:32.170 JST 9242 FATAL: could not receive data
    from WAL stream:
    Is there any chance, that there is any network glitch caused this one time
    error.
    Unix domam sockets are hardly likely to have such troubles. This
    test ran within single host.
    I'm get confused, the patch seems to me ensureing the "first
    checkpoint after fast promotion is performed" to use the
    "correct, new, ThisTimeLineID".
    What is your confusion?
    Heikki said in the fist message in this thread that he suspected
    the cause of the failure he had seen to be wrong TLI on whitch
    checkpointer runs. Nevertheless, the patch you suggested for me
    looks fixing it. Moreover (one of?) the failure from the same
    cause looks fixed with the patch.

    Is the point of this discussion that the patch may leave out some
    glich about timing of timeline-related changing and Heikki saw an
    egress of that?

    regards,

    --
    Kyotaro Horiguchi
    NTT Open Source Software Center
  • Amit Kapila at May 13, 2013 at 3:07 am

    On Monday, May 13, 2013 5:54 AM Kyotaro HORIGUCHI wrote:
    2013/05/10 20:01 "Amit Kapila" <amit.kapila@huawei.com>:
    C 2013-05-10 15:32:32.170 JST 9242 FATAL: could not receive data
    from WAL stream:
    Is there any chance, that there is any network glitch caused this one time
    error.
    Unix domam sockets are hardly likely to have such troubles. This
    test ran within single host.
    I'm get confused, the patch seems to me ensureing the "first
    checkpoint after fast promotion is performed" to use the
    "correct, new, ThisTimeLineID".
    What is your confusion?
    Heikki said in the fist message in this thread that he suspected
    the cause of the failure he had seen to be wrong TLI on whitch
    checkpointer runs. Nevertheless, the patch you suggested for me
    looks fixing it. Moreover (one of?) the failure from the same
    cause looks fixed with the patch.
    There were 2 problems:
    1. There was some issue in walsender logic due to which after promotion in
    some cases it hits assertion or error
    2. During fast promotion, checkpoint gets created with wrong TLI

    He has provided 2 different patches
    fix-standby-promotion-assert-fail-2.patch and
    fast-promotion-quick-fix.patch.
    Among 2, he has already committed fix-standby-promotion-assert-fail-2.patch
    (http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=2ffa66f49
    75c99e52984f7ee81b47d137b5b4751)
    Is the point of this discussion that the patch may leave out some
    glich about timing of timeline-related changing and Heikki saw an
    egress of that?
    AFAIU, the committed patch has some gap in overall scenario which is the
    fast promotion issue.

    With Regards,
    Amit Kapila.
  • Heikki Linnakangas at May 13, 2013 at 7:43 am

    On 13.05.2013 06:07, Amit Kapila wrote:
    On Monday, May 13, 2013 5:54 AM Kyotaro HORIGUCHI wrote:
    Heikki said in the fist message in this thread that he suspected
    the cause of the failure he had seen to be wrong TLI on whitch
    checkpointer runs. Nevertheless, the patch you suggested for me
    looks fixing it. Moreover (one of?) the failure from the same
    cause looks fixed with the patch.
    There were 2 problems:
    1. There was some issue in walsender logic due to which after promotion in
    some cases it hits assertion or error
    2. During fast promotion, checkpoint gets created with wrong TLI

    He has provided 2 different patches
    fix-standby-promotion-assert-fail-2.patch and
    fast-promotion-quick-fix.patch.
    Among 2, he has already committed fix-standby-promotion-assert-fail-2.patch
    (http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=2ffa66f49
    75c99e52984f7ee81b47d137b5b4751)
    That's correct.
    Is the point of this discussion that the patch may leave out some
    glich about timing of timeline-related changing and Heikki saw an
    egress of that?
    AFAIU, the committed patch has some gap in overall scenario which is the
    fast promotion issue.
    Right, the fast promotion issue is still there.

    Just to get us all on the same page again: Does anyone see a problem
    with a fresh git checkout, with the fast-promotion-quick-fix.patch
    applied?
    (http://www.postgresql.org/message-id/51894942.4080500@vmware.com). If
    you do, please speak up. As far as I know, the already-committed patch,
    together with fast-promotion-quick-fix.patch, should fix all known
    issues (*).

    I haven't committed a fix for the issue I reported in this thread,
    because I'm not 100% on what the right fix for it would be.
    fast-promotion-quick-fix.patch seems to do the trick, but at least the
    comments need to be updated, and I'm not sure if there some related
    corner cases that it doesn't handle. Simon?

    (*) Well, almost. This one is still pending:
    http://www.postgresql.org/message-id/CAB7nPqRhuCuuD012GCB_tAAFrixx2WioN_zfXQcvLuRab8DN2g@mail.gmail.com

    - Heikki
  • Amit Kapila at May 13, 2013 at 9:42 am

    On Monday, May 13, 2013 1:13 PM Heikki Linnakangas wrote:
    On 13.05.2013 06:07, Amit Kapila wrote:
    On Monday, May 13, 2013 5:54 AM Kyotaro HORIGUCHI wrote:
    Heikki said in the fist message in this thread that he suspected
    the cause of the failure he had seen to be wrong TLI on whitch
    checkpointer runs. Nevertheless, the patch you suggested for me
    looks fixing it. Moreover (one of?) the failure from the same
    cause looks fixed with the patch.
    There were 2 problems:
    1. There was some issue in walsender logic due to which after
    promotion in
    some cases it hits assertion or error
    2. During fast promotion, checkpoint gets created with wrong TLI

    He has provided 2 different patches
    fix-standby-promotion-assert-fail-2.patch and
    fast-promotion-quick-fix.patch.
    Among 2, he has already committed fix-standby-promotion-assert-fail- 2.patch
    (http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=2ffa
    66f49
    75c99e52984f7ee81b47d137b5b4751)
    That's correct.
    Is the point of this discussion that the patch may leave out some
    glich about timing of timeline-related changing and Heikki saw an
    egress of that?
    AFAIU, the committed patch has some gap in overall scenario which is the
    fast promotion issue.
    Right, the fast promotion issue is still there.

    Just to get us all on the same page again: Does anyone see a problem
    with a fresh git checkout, with the fast-promotion-quick-fix.patch
    applied?
    (http://www.postgresql.org/message-id/51894942.4080500@vmware.com). If
    you do, please speak up. As far as I know, the already-committed patch,
    together with fast-promotion-quick-fix.patch, should fix all known
    issues (*).

    I haven't committed a fix for the issue I reported in this thread,
    because I'm not 100% on what the right fix for it would be.
    fast-promotion-quick-fix.patch seems to do the trick, but at least the
    comments need to be updated, and I'm not sure if there some related
    corner cases that it doesn't handle. Simon?
    The patch provided will un-necessarily call InitXLOGAccess() 2 times for End
    of recovery checkpoint, it doesn't matter w.r.t performance but actually the
    purpose will
    be almost same for calling LocalSetXLogInsertAllowed() and InitXLOGAccess(),
    or am I missing something.

    One more thing, I think after fast promotion, either it should set timeline
    or give error in CreateCheckPoint() function before it reaches the check
    mentioned by you in your initial mail.
    if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
                     elog(ERROR, "can't create a checkpoint during recovery");
    Shouldn't it set timeline in above check (RecoveryInProgress()) or when
    RecoveryInProgress() is called before CreateCheckPoint()?

    With Regards,
    Amit Kapila.
  • Kyotaro HORIGUCHI at May 16, 2013 at 6:03 am
    Hello,
    Is the point of this discussion that the patch may leave out some
    glich about timing of timeline-related changing and Heikki saw an
    egress of that?
    AFAIU, the committed patch has some gap in overall scenario which is the
    fast promotion issue.
    Right, the fast promotion issue is still there.

    Just to get us all on the same page again: Does anyone see a problem
    with a fresh git checkout, with the fast-promotion-quick-fix.patch
    applied?
    (http://www.postgresql.org/message-id/51894942.4080500@vmware.com). If
    you do, please speak up. As far as I know, the already-committed patch,
    together with fast-promotion-quick-fix.patch, should fix all known
    issues (*).
    Shared XLogCtl->ThisTimeLineID is written and read without
    fencing by spinlock unlike some other XLogCtl members. Can this
    break coherency of its memory between different processors? It
    is quite reasonable that I cannot find the trouble if it is the
    cause. I didn't see the issue even without
    fast-promotion-quick-fix.patch.
    The patch provided will un-necessarily call InitXLOGAccess() 2 times for End
    of recovery checkpoint, it doesn't matter w.r.t performance but actually the
    purpose will
    be almost same for calling LocalSetXLogInsertAllowed() and InitXLOGAccess(),
    or am I missing something.

    One more thing, I think after fast promotion, either it should set timeline
    or give error in CreateCheckPoint() function before it reaches the check
    mentioned by you in your initial mail.
    if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
    elog(ERROR, "can't create a checkpoint during recovery");
    Shouldn't it set timeline in above check (RecoveryInProgress()) or when
    RecoveryInProgress() is called before CreateCheckPoint()?
    Thinking of checkpointer, it does RecoveryInProgress() far
    earlier to there, in waiting loop in CheckPointerMain where to
    decide which to do between checkpoint and restartpoint. So
    InitXLogAccess() has been already done when checkpoint is choosed
    there for the first time. And before that, ThisTimeLineID in the
    startup process gets incremented and is copied onto
    XLogCtl->ThisTimeLineID before xlogctl->SharedRecoveryInProgress
    becomes false in StartupXLog(). I don't think it is possible for
    checkpointer to run on older timeline id on codition that all
    processes looks exactly the same memory image.

    regards,

    --
    Kyotaro Horiguchi
    NTT Open Source Software Center
  • Simon Riggs at May 16, 2013 at 7:13 am

    On 16 May 2013 07:02, Kyotaro HORIGUCHI wrote:

    fast promotion issue.
    Excuse me for not joining the thread earlier. I'm not available today,
    but will join in later in my evening.

    --
      Simon Riggs http://www.2ndQuadrant.com/
      PostgreSQL Development, 24x7 Support, Training & Services
  • Amit Kapila at May 16, 2013 at 9:17 am

    On Thursday, May 16, 2013 11:33 AM Kyotaro HORIGUCHI wrote:
    Hello,
    Is the point of this discussion that the patch may leave out
    some
    glich about timing of timeline-related changing and Heikki saw
    an
    egress of that?
    AFAIU, the committed patch has some gap in overall scenario which
    is
    the
    fast promotion issue.
    Right, the fast promotion issue is still there.

    Just to get us all on the same page again: Does anyone see a
    problem
    with a fresh git checkout, with the fast-promotion-quick-fix.patch
    applied?
    (http://www.postgresql.org/message-id/51894942.4080500@vmware.com).
    If
    you do, please speak up. As far as I know, the already-committed
    patch,
    together with fast-promotion-quick-fix.patch, should fix all known
    issues (*).
    Shared XLogCtl->ThisTimeLineID is written and read without
    fencing by spinlock unlike some other XLogCtl members. Can this
    break coherency of its memory between different processors? It
    is quite reasonable that I cannot find the trouble if it is the
    cause. I didn't see the issue even without
    fast-promotion-quick-fix.patch.
    The patch provided will un-necessarily call InitXLOGAccess() 2 times for End
    of recovery checkpoint, it doesn't matter w.r.t performance but
    actually the
    purpose will
    be almost same for calling LocalSetXLogInsertAllowed() and
    InitXLOGAccess(),
    or am I missing something.

    One more thing, I think after fast promotion, either it should set timeline
    or give error in CreateCheckPoint() function before it reaches the check
    mentioned by you in your initial mail.
    if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
    elog(ERROR, "can't create a checkpoint during
    recovery");
    Shouldn't it set timeline in above check (RecoveryInProgress()) or when
    RecoveryInProgress() is called before CreateCheckPoint()?
    Thinking of checkpointer, it does RecoveryInProgress() far
    earlier to there, in waiting loop in CheckPointerMain where to
    decide which to do between checkpoint and restartpoint. So
    InitXLogAccess() has been already done when checkpoint is choosed
    there for the first time. And before that, ThisTimeLineID in the
    startup process gets incremented and is copied onto
    XLogCtl->ThisTimeLineID before xlogctl->SharedRecoveryInProgress
    becomes false in StartupXLog(). I don't think it is possible for
    checkpointer to run on older timeline id on codition that all
    processes looks exactly the same memory image.
    I also think the same, but now one difference with fast promotion is the
    request for checkpoint is done after recovery; so some operations can happen
    before checkpoint with new timeline.

    With Regards,
    Amit Kapila.
  • Simon Riggs at May 19, 2013 at 2:25 pm

    On 7 May 2013 10:57, Heikki Linnakangas wrote:
    While testing the bug from the "Assertion failure at standby promotion", I
    bumped into a different bug in fast promotion. When the first checkpoint
    after fast promotion is performed, there is no guarantee that the
    checkpointer process is running with the correct, new, ThisTimeLineID. In
    CreateCheckPoint(), we have this:
    /*
    * An end-of-recovery checkpoint is created before anyone is
    allowed to
    * write WAL. To allow us to write the checkpoint record,
    temporarily
    * enable XLogInsertAllowed. (This also ensures ThisTimeLineID is
    * initialized, which we need here and in AdvanceXLInsertBuffer.)
    */
    if (flags & CHECKPOINT_END_OF_RECOVERY)
    LocalSetXLogInsertAllowed();

    That ensures that ThisTimeLineID is updated when performing an
    end-of-recovery checkpoint, but it doesn't get executed with fast promotion.
    The consequence is that the checkpoint is created with the old timeline, and
    subsequent recovery from it will fail.
    LocalSetXLogInsertAllowed() is called by CreateEndOfRecoveryRecord(),
    but in fast promotion this is called by Startup process, not
    Checkpointer process. So there is no longer an explicit call to
    InitXLOGAccess() in the checkpointer process via a checkpoint with
    flag CHECKPOINT_END_OF_RECOVERY.

    However, there is a call to RecoveryInProgress() at the top of the
    main loop of the checkpointer, which does explicitly state that it
    "initializes TimeLineID if it's not set yet." The checkpointer makes
    the decision about whether to run a restartpoint or a checkpoint
    directly from the answer to that, modified only in the case of an
    CHECKPOINT_END_OF_RECOVERY.

    So the appropiate call is made and I don't agree with the statement
    that this "doesn't get executed with fast promotion", or the title of
    the thread.

    So while I believe that the checkpointer might have an incorrect TLI
    and that you've seen a bug, what isn't clear is that the checkpointer
    is the only process that would see an incorrect TLI, or why such
    processes see an incorrect TLI. It seems equally likely at this point
    that the TLI may be set incorrectly somehow and that is why it is
    being read incorrectly.

    I see that the comment in InitXLOGAccess() is incorrect
    "ThisTimeLineID doesn't change so we need no lock to copy it", which
    seems worth correcting since there's no need to save time in a once
    only function.

    Continuing to investigate further.

    --
      Simon Riggs http://www.2ndQuadrant.com/
      PostgreSQL Development, 24x7 Support, Training & Services
  • Heikki Linnakangas at May 20, 2013 at 5:47 pm

    On 19.05.2013 17:25, Simon Riggs wrote:
    However, there is a call to RecoveryInProgress() at the top of the
    main loop of the checkpointer, which does explicitly state that it
    "initializes TimeLineID if it's not set yet." The checkpointer makes
    the decision about whether to run a restartpoint or a checkpoint
    directly from the answer to that, modified only in the case of an
    CHECKPOINT_END_OF_RECOVERY.

    So the appropiate call is made and I don't agree with the statement
    that this "doesn't get executed with fast promotion", or the title of
    the thread.
    Hmm, I see.
    So while I believe that the checkpointer might have an incorrect TLI
    and that you've seen a bug, what isn't clear is that the checkpointer
    is the only process that would see an incorrect TLI, or why such
    processes see an incorrect TLI. It seems equally likely at this point
    that the TLI may be set incorrectly somehow and that is why it is
    being read incorrectly.
    Yeah. I spent some more time debugging this, and found the culprit. In
    CreateRestartPoint, we do this:
    /*
    * Update ThisTimeLineID to the timeline we're currently replaying,
    * so that we install any recycled segments on that timeline.
    *
    * There is no guarantee that the WAL segments will be useful on the
    * current timeline; if recovery proceeds to a new timeline right
    * after this, the pre-allocated WAL segments on this timeline will
    * not be used, and will go wasted until recycled on the next
    * restartpoint. We'll live with that.
    */
    (void) GetXLogReplayRecPtr(&ThisTimeLineID);
    Here's what happens:

    0. System is in recovery
    1. checkpointer process starts creating a restartpoint.
    2. Startup process ends recovery, creates end-of-recovery record, sets
    the shared ThisTimeLineID value to 2, and exits.
    3. checkpointer process calls RecoveryInProgess() while performing the
    restartpoint (there are RecoveryInProgress() calls in many places, e.g
    in XLogNeedsFlush()). It sets LocalRecoveryInProgress = true, and
    initializes ThisTimeLineID to 2.
    4. At the end of restartpoint, the checkpointer process does the above
    GetXLogReplayRecPtr(&ThisTimeLineID). That overwrites ThisTimeLineID
    back to 1.
    5. Checkpointer calls RecoveryInProgress, which returns true
    immediately, as LocalRecoveryInProgress is already set.
    5. Checkpointer performs the first checkpoint, with ThisTimeLineID set
    incorrectly to 1.

    Not sure what the best fix would be. Perhaps change the code in
    CreateRestartPoint() to do something like this instead:

    GetXLogReplayRecPtr(&replayTLI);
    if (RecoveryInProgress())
        ThisTimeLineID = replayTLI;

    I see that the comment in InitXLOGAccess() is incorrect
    "ThisTimeLineID doesn't change so we need no lock to copy it", which
    seems worth correcting since there's no need to save time in a once
    only function.
    Hmm, the code seems right to me, XLogCtl->ThisTimeLineID indeed doesn't
    change, once it's set. And InitXLOGAccess() isn't called until it is
    set. The comment could explain that better, though.

    - Heikki
  • Simon Riggs at May 20, 2013 at 7:18 pm

    On 20 May 2013 18:47, Heikki Linnakangas wrote:
    On 19.05.2013 17:25, Simon Riggs wrote:

    So while I believe that the checkpointer might have an incorrect TLI
    and that you've seen a bug, what isn't clear is that the checkpointer
    is the only process that would see an incorrect TLI, or why such
    processes see an incorrect TLI. It seems equally likely at this point
    that the TLI may be set incorrectly somehow and that is why it is
    being read incorrectly.

    Yeah. I spent some more time debugging this, and found the culprit. In
    CreateRestartPoint, we do this:
    /*
    * Update ThisTimeLineID to the timeline we're currently
    replaying,
    * so that we install any recycled segments on that
    timeline.
    *
    * There is no guarantee that the WAL segments will be
    useful on the
    * current timeline; if recovery proceeds to a new
    timeline right
    * after this, the pre-allocated WAL segments on this
    timeline will
    * not be used, and will go wasted until recycled on the
    next
    * restartpoint. We'll live with that.
    */
    (void) GetXLogReplayRecPtr(&ThisTimeLineID);

    Here's what happens:

    0. System is in recovery
    1. checkpointer process starts creating a restartpoint.
    2. Startup process ends recovery, creates end-of-recovery record, sets the
    shared ThisTimeLineID value to 2, and exits.
    3. checkpointer process calls RecoveryInProgess() while performing the
    restartpoint (there are RecoveryInProgress() calls in many places, e.g in
    XLogNeedsFlush()). It sets LocalRecoveryInProgress = true, and initializes
    ThisTimeLineID to 2.
    4. At the end of restartpoint, the checkpointer process does the above
    GetXLogReplayRecPtr(&ThisTimeLineID). That overwrites ThisTimeLineID back to
    1.
    5. Checkpointer calls RecoveryInProgress, which returns true immediately, as
    LocalRecoveryInProgress is already set.
    6. Checkpointer performs the first checkpoint, with ThisTimeLineID set
    incorrectly to 1.
    That description looks correct to me from the code.

    Not sure what the best fix would be. Perhaps change the code in
    CreateRestartPoint() to do something like this instead:

    GetXLogReplayRecPtr(&replayTLI);
    if (RecoveryInProgress())
    ThisTimeLineID = replayTLI;
    Installing a few extra WAL files doesn't seem to be a good reason to
    mess with such an important variable as ThisTimeLineID, especially
    since its such a rare event and hardly worth optimising for.

    I would prefer it if we didn't set ThisTimeLineID at all in that way,
    or anywhere else. If we do, it needs to be done safely, otherwise any
    caller could make the same mistake.

    Suggested patch attached.

    --
      Simon Riggs http://www.2ndQuadrant.com/
      PostgreSQL Development, 24x7 Support, Training & Services
  • Heikki Linnakangas at May 20, 2013 at 7:40 pm

    On 20.05.2013 22:18, Simon Riggs wrote:
    On 20 May 2013 18:47, Heikki Linnakangaswrote:
    Not sure what the best fix would be. Perhaps change the code in
    CreateRestartPoint() to do something like this instead:

    GetXLogReplayRecPtr(&replayTLI);
    if (RecoveryInProgress())
    ThisTimeLineID = replayTLI;
    Installing a few extra WAL files doesn't seem to be a good reason to
    mess with such an important variable as ThisTimeLineID, especially
    since its such a rare event and hardly worth optimising for.

    I would prefer it if we didn't set ThisTimeLineID at all in that way,
    or anywhere else. If we do, it needs to be done safely, otherwise any
    caller could make the same mistake.
    Suggested patch attached.
    @@ -7466,14 +7468,6 @@ CreateRestartPoint(int flags)
    * not be used, and will go wasted until recycled on the next
    * restartpoint. We'll live with that.
    */
    - (void) GetXLogReplayRecPtr(&ThisTimeLineID);
    -
    - RemoveOldXlogFiles(_logSegNo, endptr);
    -
    - /*
    - * Make more log segments if needed. (Do this after recycling old log
    - * segments, since that may supply some of the needed files.)
    - */
    PreallocXlogFiles(endptr);
    }
    That's essentially reverting commit 343ee00b7, resurrecting the bug that
    that fixed.
    @@ -9098,10 +9092,10 @@ GetXLogReplayRecPtr(TimeLineID *replayTLI)
    SpinLockAcquire(&xlogctl->info_lck);
    recptr = xlogctl->lastReplayedEndRecPtr;
    tli = xlogctl->lastReplayedTLI;
    + if (replayTLI && xlogctl->SharedRecoveryInProgress)
    + *replayTLI = tli;
    SpinLockRelease(&xlogctl->info_lck);

    - if (replayTLI)
    - *replayTLI = tli;
    return recptr;
    }

    That breaks the only remaining caller that passes a replayTLI pointer,
    GetStandbyFlushRecPtr(); *replayTLI points to a local variable in that
    function, which is left uninitialized if !xlogctl->SharedRecoveryInProgress.

    - Heikki
  • Simon Riggs at May 20, 2013 at 9:01 pm

    On 20 May 2013 20:40, Heikki Linnakangas wrote:
    On 20.05.2013 22:18, Simon Riggs wrote:

    On 20 May 2013 18:47, Heikki Linnakangaswrote:
    Not sure what the best fix would be. Perhaps change the code in

    CreateRestartPoint() to do something like this instead:

    GetXLogReplayRecPtr(&replayTLI);
    if (RecoveryInProgress())
    ThisTimeLineID = replayTLI;

    Installing a few extra WAL files doesn't seem to be a good reason to
    mess with such an important variable as ThisTimeLineID, especially
    since its such a rare event and hardly worth optimising for.

    I would prefer it if we didn't set ThisTimeLineID at all in that way,
    or anywhere else. If we do, it needs to be done safely, otherwise any
    caller could make the same mistake.
    Suggested patch attached.
    @@ -7466,14 +7468,6 @@ CreateRestartPoint(int flags)

    * not be used, and will go wasted until recycled on the
    next
    * restartpoint. We'll live with that.
    */
    - (void) GetXLogReplayRecPtr(&ThisTimeLineID);
    -
    - RemoveOldXlogFiles(_logSegNo, endptr);
    -
    - /*
    - * Make more log segments if needed. (Do this after
    recycling old log
    - * segments, since that may supply some of the needed
    files.)
    - */
    PreallocXlogFiles(endptr);
    }

    That's essentially reverting commit 343ee00b7, resurrecting the bug that
    that fixed.
    Hmm, OK.

    Then we should using the correct value, not the one that patch set. It
    certainly worked at that time, but not in a principled way.

    When we set the new timeline we should be updating all values that
    might be used elsewhere. If we do that, then no matter when or how we
    run GetXLogReplayRecPtr, it can't ever get it wrong in any backend.

    Patch attached.

    --
      Simon Riggs http://www.2ndQuadrant.com/
      PostgreSQL Development, 24x7 Support, Training & Services
  • Heikki Linnakangas at May 21, 2013 at 6:46 am

    On 21.05.2013 00:00, Simon Riggs wrote:
    When we set the new timeline we should be updating all values that
    might be used elsewhere. If we do that, then no matter when or how we
    run GetXLogReplayRecPtr, it can't ever get it wrong in any backend.

    --- a/src/backend/access/transam/xlog.c
    +++ b/src/backend/access/transam/xlog.c
    @@ -5838,8 +5838,16 @@ StartupXLOG(void)
    }

    /* Save the selected TimeLineID in shared memory, too */
    - XLogCtl->ThisTimeLineID = ThisTimeLineID;
    - XLogCtl->PrevTimeLineID = PrevTimeLineID;
    + {
    + /* use volatile pointer to prevent code rearrangement */
    + volatile XLogCtlData *xlogctl = XLogCtl;
    +
    + SpinLockAcquire(&xlogctl->info_lck);
    + XLogCtl->ThisTimeLineID = ThisTimeLineID;
    + XLogCtl->lastReplayedTLI = ThisTimeLineID;
    + XLogCtl->PrevTimeLineID = PrevTimeLineID;
    + SpinLockRelease(&xlogctl->info_lck);
    + }
    Hmm. lastReplayedTLI is supposed to be the timeline of the last record
    that was replayed, ie. the timeline corresponding lastReplayedEndRecPtr.
    I think it would work, but it feels like it could be an easy source of
    bugs in the future.

    It might be a good idea to change walsender to not store that in
    ThisTimeLineID, though. It used to make sense for ThisTimeLineID to
    track the current recovery timeline in 9.2 and below, but now that
    walsender can be sending any older timeline, using ThisTimeLineID to
    store the latest one seems confusing.

    - Heikki
  • Simon Riggs at May 21, 2013 at 8:26 am

    On 21 May 2013 07:46, Heikki Linnakangas wrote:
    On 21.05.2013 00:00, Simon Riggs wrote:

    When we set the new timeline we should be updating all values that
    might be used elsewhere. If we do that, then no matter when or how we
    run GetXLogReplayRecPtr, it can't ever get it wrong in any backend.

    --- a/src/backend/access/transam/xlog.c
    +++ b/src/backend/access/transam/xlog.c
    @@ -5838,8 +5838,16 @@ StartupXLOG(void)
    }

    /* Save the selected TimeLineID in shared memory, too */
    - XLogCtl->ThisTimeLineID = ThisTimeLineID;
    - XLogCtl->PrevTimeLineID = PrevTimeLineID;
    + {
    + /* use volatile pointer to prevent code rearrangement */
    + volatile XLogCtlData *xlogctl = XLogCtl;
    +
    + SpinLockAcquire(&xlogctl->info_lck);
    + XLogCtl->ThisTimeLineID = ThisTimeLineID;
    + XLogCtl->lastReplayedTLI = ThisTimeLineID;
    + XLogCtl->PrevTimeLineID = PrevTimeLineID;
    + SpinLockRelease(&xlogctl->info_lck);
    + }

    Hmm. lastReplayedTLI is supposed to be the timeline of the last record that
    was replayed, ie. the timeline corresponding lastReplayedEndRecPtr. I think
    it would work, but it feels like it could be an easy source of bugs in the
    future.
    I'm OK with that principle, as long as we don't touch ThisTimeLineID,
    which has been the source of multiple bugs.

    So we should set the TLI explicitly before installing, like attached patch.

    Otherwise we'd need multiple permanent TLIs which would be overkill.

    I feel there are problems because we set the newly selected TLI from
    startup process into shared memory, then some time later we set
    SharedRecoveryInProgress = false. That timing window isn't good, but I
    don't see a different way.
    It might be a good idea to change walsender to not store that in
    ThisTimeLineID, though. It used to make sense for ThisTimeLineID to track
    the current recovery timeline in 9.2 and below, but now that walsender can
    be sending any older timeline, using ThisTimeLineID to store the latest one
    seems confusing.
    Agreed, but looks like too much code to touch that lightly.

    --
      Simon Riggs http://www.2ndQuadrant.com/
      PostgreSQL Development, 24x7 Support, Training & Services
  • Simon Riggs at May 21, 2013 at 8:33 pm

    On 21 May 2013 09:26, Simon Riggs wrote:

    I'm OK with that principle...
    Well, after fighting some more with that, I've gone with the, er,
    principle of slightly less ugliness.

    --
      Simon Riggs http://www.2ndQuadrant.com/
      PostgreSQL Development, 24x7 Support, Training & Services

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
grouppgsql-hackers @
categoriespostgresql
postedMay 7, '13 at 9:57a
activeMay 21, '13 at 8:33p
posts23
users5
websitepostgresql.org...
irc#postgresql

People

Translate

site design / logo © 2021 Grokbase