Joachim Wieland has diligently and sensibly pointed out a potential for
user error with the current PITR implementation. This is not a bug *per
se*, but is a design flaw that more than one person could fall into. It
is a minor issue and not that likely, since the manual describes what to
do...but I'm proposing a possible solution nonetheless, since this error
is likely to occur whilst learning the PITR functionality and might
potentially dissuade users from using the approach.

Following restoration of a base backup, archive recovery is entered by
placing a recovery.conf file in the data directory and then restarting
the server using pg_ctl start. If the recovery.conf file is misnamed,
e.g. Recovery.conf, or if it is misplaced, or simply absent, then the
server start will not enter archive recovery, yet will start normally.
The precise difference might well not be apparent. Subsequent server
activity could potentially overwrite archived log files in a poorly
managed archive. Both of those situations are likely to occur
simultaneously with inexperienced users.

The first fix to this is clearly to document the possibility and warn
the user of this possibility - i.e. describe what NOT to do when
invoking archive recovery. This will be submitted shortly.

A further fix is to implement a fail safe mode for invoking recovery.
Rather than making the recovery a normal server start, which then
searches for recovery.conf, it would be better to indicate that the
server start is expected to be a recovery and so the absence of a
recovery.conf should be regarded as an error.

If a further pg_ctl mode, recover, were implemented, this would allow a
fail safe mode for recovery.

e.g. pg_ctl -D datadir recover

pg_ctl could then check for the existence of a recovery.conf file and
return an error if none were found. This mode would not then need to be
passed through to the postmaster as an option, which could accidentally
be re-invoked later should a crash recovery occur (the main reason to
avoid adding recovery.conf options to postgresql.conf in the first
place).

This mode would do nothing more than:
- check for recovery.conf, if not found, return error
- invoke a start normally, as if mode=start had been requested

The doc for invoking recovery could then be changed to include this new
mode, and the potential for error would be removed.

This is a change requested for 8.0, to ensure that PITR doesn't fall
into disrepute by novice users shooting themselves in the foot. It is a
minor change, effecting only PITR, and only to the extent of a
documentation change and a file existence check in pg_ctl. No server
code would be changed.

Alternative suggestions are welcome. Thoughts?

--
Best Regards, Simon Riggs

Search Discussions

  • Tom Lane at Nov 6, 2004 at 11:29 pm

    Simon Riggs writes:
    If a further pg_ctl mode, recover, were implemented, this would allow a
    fail safe mode for recovery.
    e.g. pg_ctl -D datadir recover
    pg_ctl could then check for the existence of a recovery.conf file and
    return an error if none were found.
    I can't get very excited about this approach, because it only protects
    those people who (a) use pg_ctl to start the postmaster (not everyone)
    and (b) carefully follow the recovery directions (which the people you
    are worried about are very bad at, by hypothesis).

    A possibly more reliable interlock would involve having the postmaster
    probe during normal startup to see if there is already an archived WAL
    segment for what it thinks is the current segment. However there are
    several issues here: one is that if you're doing partial-log-file
    shipping, that isn't necessarily an error condition; another is that
    we don't know how to do such a probe unless more information is added
    to postgresql.conf. We could imagine adding another shell command
    string (something like "test -f ..." perhaps) but if the user gets it
    wrong he may still be left with no protection.

    regards, tom lane
  • Tom Lane at Nov 7, 2004 at 12:17 am

    I wrote:
    A possibly more reliable interlock would involve having the postmaster
    probe during normal startup to see if there is already an archived WAL
    segment for what it thinks is the current segment.
    Another and simpler way is to recommend that people use archive_command
    strings that won't overwrite an existing archive file.

    For instance instead of showing the example
    archive_command = 'cp %p /mnt/server/archivedir/%f'
    we could show
    archive_command = 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'
    or on some machines
    archive_command = 'cp -i %p /mnt/server/archivedir/%f </dev/null'

    One tricky point is to make sure that the command will return nonzero
    status if the target exists. In some quick testing I found that
    the "cp -i" method doesn't have portable behavior on this point.
    I think that the "test" method is portable across Unixen, but no idea
    what to do on Windows.

    regards, tom lane
  • Mark Kirkwood at Nov 7, 2004 at 12:51 am
    While this is nice, it will not help you if the restoration directory is
    different from your archive directory. That is : restore_command in
    recovery.conf fetches from somewhere other than where archive_command in
    postgresql.conf copied.

    I am not sure how likely this situation is, but setting up log shipping,
    or maybe recovering from disk failure *might* mean you need to bring the
    saved archive files "back from somewhere else".

    regards

    Mark

    Tom Lane wrote:
    Another and simpler way is to recommend that people use archive_command
    strings that won't overwrite an existing archive file.

    For instance instead of showing the example
    archive_command = 'cp %p /mnt/server/archivedir/%f'
    we could show
    archive_command = 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'

  • Simon Riggs at Nov 7, 2004 at 8:56 am

    On Sun, 2004-11-07 at 00:54, Mark Kirkwood wrote:
    While this is nice, it will not help you if the restoration directory is
    different from your archive directory. That is : restore_command in
    recovery.conf fetches from somewhere other than where archive_command in
    postgresql.conf copied.
    ...hmmm, this doesn't look like an issue to me.

    If the archive is protected when it's written, then it will be safe to
    use later for restoration. The path to the archive may differ, but the
    archive itself *needs* to be the same archive.

    --
    Best Regards, Simon Riggs
  • Simon Riggs at Nov 7, 2004 at 8:38 am

    On Sun, 2004-11-07 at 00:16, Tom Lane wrote:
    I wrote:
    A possibly more reliable interlock would involve having the postmaster
    probe during normal startup to see if there is already an archived WAL
    segment for what it thinks is the current segment.
    Another and simpler way is to recommend that people use archive_command
    strings that won't overwrite an existing archive file.
    Yes, exactly what I meant by saying I'd document the warning.

    --
    Best Regards, Simon Riggs
  • Mark Kirkwood at Nov 7, 2004 at 1:00 am
    I was thinking that even mildly experienced folks could benefit from a
    helpful sanity check. Typically the need to recover a system never comes
    at a good time, and features that help prevent silly mistakes are a
    great stress saver.

    As an aside, while testing recovery during pre beta, I think I probably
    "forgot" to put in a recovery.conf about 1 time in 10. Now I was using a
    small database cluster tar'ed up in /tmp, so no big deal, but if it had
    been a 100G beast that had to come off tape ....

    regards

    Mark

    Tom Lane wrote:
    I can't get very excited about this approach, because it only protects
    those people who (a) use pg_ctl to start the postmaster (not everyone)
    and (b) carefully follow the recovery directions (which the people you
    are worried about are very bad at, by hypothesis).

  • Simon Riggs at Nov 7, 2004 at 9:04 am

    On Sat, 2004-11-06 at 23:29, Tom Lane wrote:
    Simon Riggs <simon@2ndquadrant.com> writes:
    If a further pg_ctl mode, recover, were implemented, this would allow a
    fail safe mode for recovery.
    e.g. pg_ctl -D datadir recover
    pg_ctl could then check for the existence of a recovery.conf file and
    return an error if none were found.
    ...
    A possibly more reliable interlock would involve having the postmaster
    probe during normal startup to see if there is already an archived WAL
    segment for what it thinks is the current segment. However there are
    several issues here: one is that if you're doing partial-log-file
    shipping, that isn't necessarily an error condition; another is that
    we don't know how to do such a probe unless more information is added
    to postgresql.conf. We could imagine adding another shell command
    string (something like "test -f ..." perhaps) but if the user gets it
    wrong he may still be left with no protection.
    Yes, checking the archive is the safe way, but we don't know how to do
    that unless restore_command has been successfully read in (currently
    from recovery.conf). Putting it in postgresql.conf is the wrong place,
    because that will likely be wrongly set when we restore, and I'm against
    editing that file as part of a recovery...once edited, you could lose
    all context and thus completely screw the recovery.

    All the suggested change is about is trying to find a safe way to fail
    if restore_command has not been set because recovery.conf is missing for
    whatever reason.

    I can't get very excited about this approach, because it only protects
    those people who (a) use pg_ctl to start the postmaster (not everyone)
    and (b) carefully follow the recovery directions (which the people you
    are worried about are very bad at, by hypothesis).
    Well, I was trying to find a least-risk approach. Touching pg_ctl code
    at this stage of beta-ness seemed more reliable than touching postmaster
    code. pg_ctl doesn't catch everyone.

    Placing the test in postmaster is the natural place for it, yes.

    This additional keyword should never be placed in any startup script, so
    the commonly used interface would not change, just the one used at
    recovery time.

    Documentation changes are one thing - and as you point out, if they miss
    one thing in the manual, they'll miss three. I'm not expecting many
    people to actually use an archive created by using copy or cp.

    I'm with Mark on this: you need all the help you can get at 2am when you
    just got called out of bed with a phone call from your boss *requesting*
    you to recover your incredibly important production system.

    I'll cut the doc changes first, then produce a slim patch on postmaster.

    Best Regards, Simon Riggs
  • Tom Lane at Nov 7, 2004 at 7:14 pm

    Simon Riggs writes:
    On Sat, 2004-11-06 at 23:29, Tom Lane wrote:
    A possibly more reliable interlock would involve having the postmaster
    probe during normal startup to see if there is already an archived WAL
    segment for what it thinks is the current segment.
    Yes, checking the archive is the safe way, but we don't know how to do
    that unless restore_command has been successfully read in (currently
    from recovery.conf). Putting it in postgresql.conf is the wrong place,
    Agreed; we left it out of postgresql.conf for good reasons. I was
    thinking in terms of adding a third command string, perhaps like

    test_archive_file = 'test -f /mnt/server/archive/%f'

    But it's probably best just to tell people to write their
    archive_commands in a non-overwrite style.

    regards, tom lane
  • Mark Kirkwood at Nov 6, 2004 at 11:43 pm
    I like it - nice and simple, but targets a large (and likely) foot gun
    situation.

    regards

    Mark

    Simon Riggs wrote:
    If a further pg_ctl mode, recover, were implemented, this would allow a
    fail safe mode for recovery.

    e.g. pg_ctl -D datadir recover

    pg_ctl could then check for the existence of a recovery.conf file and
    return an error if none were found. This mode would not then need to be
    passed through to the postmaster as an option, which could accidentally
    be re-invoked later should a crash recovery occur (the main reason to
    avoid adding recovery.conf options to postgresql.conf in the first
    place).

    This mode would do nothing more than:
    - check for recovery.conf, if not found, return error
    - invoke a start normally, as if mode=start had been requested

    The doc for invoking recovery could then be changed to include this new
    mode, and the potential for error would be removed.

    This is a change requested for 8.0, to ensure that PITR doesn't fall
    into disrepute by novice users shooting themselves in the foot. It is a
    minor change, effecting only PITR, and only to the extent of a
    documentation change and a file existence check in pg_ctl. No server
    code would be changed.

    Alternative suggestions are welcome. Thoughts?

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
grouppgsql-hackers @
categoriespostgresql
postedNov 6, '04 at 10:08p
activeNov 7, '04 at 7:14p
posts10
users3
websitepostgresql.org...
irc#postgresql

People

Translate

site design / logo © 2022 Grokbase