PITR Functional Design v2 for 7.5
pgsql-hackers@postgresql.org

Currently, PostgreSQL provides Crash Recovery but not yet full Point In
Time
Recovery (PITR). The following document provides a design which enhances
the existing robustness features to include full PITR. Since one of the
primary objectives for PITR is robustness, this design is provided in
advance of patches to allow those features and behaviours to be
subjected to the rigours of [HACKERS] before final coding is attempted.
We're really not that far now from making this work, hence the attention
on up-front planning.

Thanks for your comments, Best Regards, Simon Riggs, 2nd Quadrant


Review of current Crash Recovery

Crash recovery is catered for by the use of WAL logging, or xlogs. xlogs
are
written to disk immediately before a transaction is acknowledged as
committed. xlogs contain REDO information sufficient to rollforward any
changes from a known starting position. The known starting position is
also recorded by keeping track of which transactions have completed in a
file structure known as the clog. Clogs are also written to disk as
transactions commit.

The changed data pages are not written immediately back to disk. They do
not need to be, because the entries in the xlog and clog taken together are
sufficient to recover from a crash. Every so often a full checkpoint
process is created that will perform a full synchronisation of changed (or
"dirty") data pages back to disk. When a checkpoint is complete, it will
write the last transaction id to the xlog as a marker, and it will trim the
clog files to the last transaction id. The frequency of checkpoints is
controllable. Changed data pages are also written back to disk by a
background process called the bgwriter (or "lazy" writer), reducing the
effect of checkpoints on busy workloads.

In crash recovery, the database files are presumed to be intact, but not
necessarily up to date. When the postmaster comes up again, it checks clog
to discover what the last checkpointed transaction id was. Using this, it
then scans through the available xlog files to the marker written by the
checkpoint at that time. The REDO entries that follow are then reapplied to
the data pages, as far as possible, until the system is brought to the best
available point.

If the appropriate xlogs are not available, no recovery is possible.

Following initdb, there will be at least one xlog. As new data is written
to xlog, new files will be allocated as required. As a result of
checkpointing, there will come a time when some xlogs are no longer
required for crash recovery. At each checkpoint, any xlog that is no longer
required will be recycled or removed. Xlogs are recycled back to the "front
of the queue", so that we do not need to delete and create files
constantly. A certain maximum number of files will be kept as preallocated
logs; this limit is controllable. When the limit is reached, xlogs are
removed rather than recycled. As a result, the number of xlogs may vary
considerably over time, but mostly they will cycle around a roughly
steady-state number of xlogs, and therefore with predictably constant space
utilisation.

If an xlog cannot be written because the available space is full, then the
transaction that depended upon the xlog write will not be able to commit,
nor will any subsequent transactions until the space situation alleviates.
Currently, this imposes a limit on the size of any transaction based upon
the available disk space in the pg_xlog directory.

Xlogs are relatively high volume; clogs are relatively low. An out-of-space
condition on clog is typically unlikely.

Failure analysis:
- If a transaction fails, no changes are committed to xlog and the clog
entry shows the transaction aborted.
- If a transaction succeeds, its changes are committed to xlog and the clog
entry shows the transaction succeeded.
- If the xlog directory fills or is otherwise unwritable, a PANIC is raised.
- If the clog directory fills or is otherwise unwritable, a PANIC is raised.

Point in Time Recovery (PITR)

PITR features are designed to extend the existing Crash Recovery
features so
that a recovery can take place in situations where a crash recovery
would
not have been possible. These situations are:
- database objects have been dropped
- xlogs do not go back far enough in time to allow rollforward recovery
- the database files are not intact and need to be completely replaced
before rollforward

To do this, a full physical backup of the system is required for
recovery.
When tablespaces are available, it should be possible to restore and
recover
individual tablespaces. In addition, xlogs will need to be moved out of
the normal xlog filesystem to an archive destination.

PITR Proposed Solution

The proposed solution is to allow the existing crash recovery detection
and rollforward logic to be utilised directly to perform PITR, which
should allow the minimum number of changes and additional code.

To allow this to occur, the full backup *must* occur while the database is
open or "hot", i.e. a standard file-level copy (cp, tar etc.) taken while
the cluster is in use. This backup must include all data and clogs (and any
tablespaces or logical links utilised). A continuous sequence of xlogs must
also be available, stretching from the last checkpoint prior to the start
of the backup through to whatever time is specified for the "recovery
point" or until the end of the xlogs.

The full PITR solution consists of a number of components:
1. xlog archival
2. recovery-to-point-in-time (RPIT)

1. Xlog Archival
There are a wide range of Backup and Recovery (BAR) products on the
market, both open source and commercially licensed programs that provide
facilities to perform full physical backups and individual file
archives. The best way to foster wide adoption of PostgreSQL is to allow
it to work in conjunction with any of these products. To this end, a
PostgreSQL archival API is specified that will allow both PostgreSQL and
an external archiving program to work together in a coordinated manner
to achieve the backup of the xlogs.

The archival API will need to be implemented directly in the PostgreSQL
server, though a reference implementation of the API will also be required
so that it can be copied and more widely used. The reference implementation
is also needed so that the workings of the API can be tested sufficiently
well to allow its release into the mainstream PostgreSQL code. Together
these require the following two sub-components:
1.1 XLogArchive API
1.2 pg_arch: simple xlog archiving tool

1.1 XLogArchive API
1.1.1 XLogArchive Initiation

The API assumes that all xlogs produced by PostgreSQL will need to be
archived. This is a requirement, since any break in the sequence of
xlogs will render the total archive useless for restoring forward from
the last backup.

When the PostgreSQL server starts, it will check the value of the parameter
wal_archive_policy and enable/disable archiving accordingly. This parameter
can only be changed at server start. (This is required because the initial
step of archiving each xlog is performed by the backend; if this were
changeable after boot, then it might be possible for an individual backend
to override the wal_archive_policy and choose not to archive - which would
then affect the whole system and all users, not just the user making that
choice.) It is considered less desirable to utilise a compiler directive,
since the archival policy is an operational/business decision for a
particular database, not a developer activity on the DBMS executable.
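
Purely for illustration, the setting might appear in postgresql.conf roughly
as follows; the parameter name is the one proposed above, but the value
syntax shown is an assumption, since this design does not yet fix the set of
legal values:

    # illustrative only - wal_archive_policy is proposed in this design,
    # but its legal values are not defined here
    wal_archive_policy = on        # read at server start only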

It is not defined whether the external archiver starts before PostgreSQL,
or soon afterwards. Clearly, it is intended that the two should work
together at the direction of the administrator. This slight lack of clarity
is intended to allow for the situation where start-up is invoked within an
automated boot sequence, where sub-system start-up order may not be
guaranteed by the OS. It also allows for variation between the start-up
times for PostgreSQL and the archiver; the archiver might be ready in an
instant or require manual intervention such as a new tape load.

There is no requirement for the archiver to halt when PostgreSQL shuts
down, though it may choose to do so, e.g. it may be desirable to have one
archiver operate for multiple postmasters simultaneously. The archiver
knows many things about PostgreSQL, including its data directory, so it is
easily able to read the PID file and monitor the postmaster if it chooses
to do so. No additions to the API are required in that area.

As a result there is no "connection" concept between PostgreSQL and the
archiver, as there is in other client-server APIs (libpq, TCP/IP, JDBC
etc.), so there is no connection and no disconnection. Similarly, there is
no environment set-up/tear-down at this time.

1.1.1 XLogArchive API specification

(PostgreSQL ->) XLogArchiveNotify(xlog)
(<- Archiver) XLogArchiveXlogs()
(<- Archiver) XLogArchiveComplete(xlog)
(PostgreSQL ->) XLogArchiveBusy(xlog) returns ARCHIVE_OK or BUSY
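
The design fixes only the call names, directions and result values listed
above. Purely as an illustration, the four calls might be declared roughly
as follows; the header name, argument types and the single shared result
enum are assumptions of this sketch, not part of the design:

    /* xlog_archive_api.h -- illustrative declarations only */
    #ifndef XLOG_ARCHIVE_API_H
    #define XLOG_ARCHIVE_API_H

    #include <stdbool.h>

    typedef enum XLogArchiveRC
    {
        ARCHIVE_OK,         /* XLogArchiveBusy: archiving complete, xlog removable */
        BUSY,               /* XLogArchiveBusy: archiver has not yet finished */
        SUCCESS,            /* XLogArchiveComplete: completion recorded */
        ALREADY_NOTIFIED,   /* XLogArchiveComplete: this xlog was already marked done */
        SEVERE_FAILURE      /* either call: state could not be determined */
    } XLogArchiveRC;

    /* PostgreSQL -> : called by a backend after an xlog file switch */
    bool XLogArchiveNotify(const char *xlog);

    /* PostgreSQL -> : called by the checkpoint process before recycling an xlog */
    XLogArchiveRC XLogArchiveBusy(const char *xlog);

    /* <- Archiver : returns the next xlog name ready for archiving, or NULL
     *               if none is waiting (the caller frees the returned string) */
    char *XLogArchiveXlogs(void);

    /* <- Archiver : called once the archiver's copy of the xlog is known safe */
    XLogArchiveRC XLogArchiveComplete(const char *xlog);

    #endif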

When writing to xlog switches to the next file, the older file will be
closed. At this point, the PostgreSQL backend which caused the xlog file
switch will then call

XLogArchiveNotify(xlog) returns TRUE or FALSE.

TRUE indicates successful notification, though not necessarily receipt of
that notification by the archiver.
FALSE indicates unsuccessful notification, which is a PANIC condition,
since this situation should not occur given that the administrator has
requested that archiving take place.

Since the call is made by a particular user's backend, it is important that
this call can be made in minimum time and is not dependent upon the
external archiver, i.e. the call is asynchronous. No two backends will call
this at exactly the same time, though it is possible that one call will not
have completed before another call executes. Should multiple calls be in
progress at the same time, they will be notifying that separate xlogs are
ready for archiving, so there is no reason to require logical locks. The
notify call should be written in such a way that allows multiple calls to
be active simultaneously, i.e. no critical sections or single-threading.

The archiver initially starts in a wait loop, waking up regularly to
issue

XLogArchiveXlogs() returns a single XLOG filename, or NULL

If an xlog file is waiting to be archived, then the archiver will discover
the name of the xlog by using this API call. If more than one file is
available to be archived, only one filename is returned per call; the
remaining files will be returned by later calls, and the same xlog will
never be returned twice. If the archiver is multi-threaded, it need not
wait until it has executed XLogArchiveComplete before it executes
XLogArchiveXlogs again.

The archiver can now use the name of the xlog retrieved to visit the
pg_xlog directory and copy that xlog away to a place that it considers
safe. When this has been done to its satisfaction, the archiver will call

XLogArchiveComplete(xlog) returns SUCCESS, ALREADY_NOTIFIED or
SEVERE_FAILURE

SUCCESS indicates successful notification, though not necessarily receipt
of that notification by PostgreSQL.
ALREADY_NOTIFIED indicates an error: XLogArchiveComplete had already been
called for that xlog. This tells the archiver either that multiple
archivers are active, or that this archiver has already called
XLogArchiveComplete for that xlog, which it should not do twice.
SEVERE_FAILURE indicates unsuccessful notification. The archiver is
requested to retry this operation a number of times to ensure that the
condition is certain, then raise a priority human alert to correct the
situation. Allowance must be made to retry this call again following
intervention.

This is an asynchronous call, so there is no expectation that PostgreSQL
will immediately receive this notification. There is no assumption that
archive copying must be single-threaded, or that the archiver must copy
files in the order that they become available. It is presumed that the
archiver has been granted read-only access by the administrator; no xlogs
should be available for copy other than as a result of direct security
authorisation. No xlogs may be altered or deleted by the archiver in any
way. There is no assumption that archival time is bounded, though it is
strongly desirable that the archiver make best efforts to copy away the
files and then call XLogArchiveComplete as quickly and as consistently as
possible. Recognition is made that copying to tape or across networks may
have considerable time variances, caused by physical tape media changes,
bandwidth prioritisation etc. If there are any known planned delays, or
regular delays occur in this process, then the archiver is strongly
encouraged to implement a two-stage process: copy files to a more
consistently performing location, such as another directory on the same
system, before external archival occurs.

At the normal point when xlogs are considered for deletion, i.e. after a
checkpoint, the PostgreSQL checkpoint process will call

XLogArchiveBusy(xlog) returns ARCHIVE_OK, BUSY or SEVERE_FAILURE

ARCHIVE_OK indicates successful archival completion, and that the xlog can
now be removed by PostgreSQL.
BUSY indicates that the archiver has not yet called XLogArchiveComplete and
the xlog should not yet be removed. It is possible that calling this
function against a particular xlog may return BUSY many times in a row.
Once ARCHIVE_OK has been returned for an xlog, there is no meaning attached
to calling it again for the same xlog: the result in that situation is
undefined.
SEVERE_FAILURE indicates that no information about the archival state could
be obtained.
This is an asynchronous call, so PostgreSQL will not wait for that xlog to
complete archiving. The call is not expected to be executed concurrently,
because it is made by the checkpoint process. XLogArchiveBusy should not
itself remove xlogs from the pg_xlog directory; the existing mechanisms for
xlog removal/recycling will be used, so that PITR does not interfere with
the existing crash recovery facilities.
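
As an illustration of the checkpoint-time behaviour just described, a
minimal sketch follows. RecycleOrRemoveXlog() is a hypothetical stand-in
for the existing removal/recycling code, and a real patch would report
problems through PostgreSQL's usual logging rather than stderr:

    #include <stdio.h>

    typedef enum { ARCHIVE_OK, BUSY, SEVERE_FAILURE } XLogArchiveBusyResult;

    extern XLogArchiveBusyResult XLogArchiveBusy(const char *xlog);
    extern void RecycleOrRemoveXlog(const char *xlog);  /* hypothetical stand-in */

    static void
    ConsiderXlogForRemoval(const char *xlog)
    {
        switch (XLogArchiveBusy(xlog))
        {
            case ARCHIVE_OK:
                RecycleOrRemoveXlog(xlog);  /* archive copy confirmed safe */
                break;

            case BUSY:
                /* not yet archived: keep the file, retry at the next checkpoint */
                break;

            default:
                /* state unknown: keep the file and warn the administrator */
                fprintf(stderr, "archive state of \"%s\" is unknown; keeping it\n",
                        xlog);
                break;
        }
    }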

The archival API is designed to work with only one archiver per postmaster.
If there were more than one archiver acting independently of one another,
whichever called XLogArchiveComplete first for a particular xlog would
allow PostgreSQL to remove/recycle the file. If multiple archive copies of
xlog files are required, a single archiver must coordinate the production
of those multiple copies.

1.1.2 XLogArchive API Failure analysis
Failure conditions are mostly noted above.
Some behavioural notes may also be helpful:
If it regularly takes longer to archive an xlog than it does to switch
xlogs, then there will be a build-up of xlog data. Timing analysis (a short
worked example follows below):
- denote the time between PostgreSQL notifying that an xlog can now be
archived and the attempt to recycle that same xlog as Tc (XLogArchiveNotify
to XLogArchiveBusy returning ARCHIVE_OK)
- denote the time between the archiver receiving notification and
completing the archival as Ta (XLogArchiveNotify to XLogArchiveComplete)
- denote the number of xlogs on disk as Nx
- denote the capacity of the xlog filesystem, in terms of number of xlogs,
as Nf

If Ta > Tc then Nx will increase. However, we expect that as Nx increases,
Tc will also increase, given a constant xlog write rate (very roughly the
same as a constant transaction rate). There should be a point at which Tc
increases such that Ta = Tc, at which time Nx reaches a constant limit or
steady state, Nc. If Nc < Nf then everything will be fine; if however
Nc > Nf, then we will get an out-of-space condition. (Put another way,
there may not be a steady state before we hit the out-of-space condition.)
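
As a purely illustrative worked example (none of these figures come from
the design): suppose the system writes one xlog every 2 minutes and the
archiver can sustain only one xlog every 3 minutes. The backlog then grows
by 1/2 - 1/3 = 1/6 of an xlog per minute, i.e. one extra xlog every 6
minutes. If the xlog filesystem has headroom for 100 xlogs beyond the
normal steady-state set (about 1.6GB at the default 16MB segment size), it
will fill in roughly 600 minutes, or 10 hours, unless the archiver catches
up. If the archiver is at least as fast as the writer over any sustained
period, no such backlog accumulates.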

The out-of-space condition could therefore occur in two ways:
1. there is a single delay during which the xlog filesystem fills
2. there is a systematic delay which builds slowly until the xlog
filesystem fills
(1) is only preventable by the archival program, or by the processes by
which that program is operated/administered.
(2) is possibly preventable by either:
i) keeping track of any delay and reporting it
ii) raising a WARNING when XLogArchiveBusy returns BUSY more than a certain
number of times for any particular xlog, or returns BUSY on the first call
for multiple consecutive xlogs.
Solving (2) is somewhat complicated in that the PostgreSQL checkpoint
process is spawned once per checkpoint, so it cannot maintain context/state
information between checkpoints. Another mechanism may be possible, such as
a shared memory area or a disk file that can be read by subsequent
checkpoint processes.

PITR will act identically to Crash recovery when it hits an out-of-space
condition on the xlog directory, since it is exactly the same code. The
behaviour is to operate in "Fail Safe" mode.

It is possible that an administrator may wish to choose to keep
PostgreSQL up and to begin dropping log files rather than eventually
crash. If that choice was made AND a full physical backup was not yet
available, then there is a window of risk during which if a catastrophic
failure occurred then some committed transactions would not be
recoverable. It is not considered appropriate for anybody other than the
administrator to make this choice and so an option is planned to allow
"Fail Operational" behaviour (i.e. dropping logs) to be added.

It is not yet certain whether this scheme will work successfully if the
full backup spans multiple checkpoints. It is expected, however, that this
would work if individual tablespaces were synchronised to different
checkpoints.

1.1.3 XLogArchive API Implementation:
The API definition has been separated from the implementation. This should
allow a better implementation to be applied more easily in the future,
and/or specific customisation to take place for particular ports or
architectures.

The initial proposal is a simple scheme that uses file existence & file
extension to pass information between PostgreSQL and the archiver. This
would take place in a peer directory of pg_xlog and pg_clog which has
been named the pg_rlog directory. (r as in the strong first syllable
"ar" in English pronunciation of "archive")

The use of a separate directory allows control over the security and
behaviour of the archiver: the archiver never has cause to create/delete
any files in critical PostgreSQL directories even if security isn't
enforced correctly. Only PostgreSQL will ever delete old xlog data by
recycling/removal.

XLogArchiveNotify(xlog) returns TRUE or FALSE.
This writes a file called <XLOG>.full to the pg_rlog directory, where
<XLOG> is a filename in the pattern currently used by PostgreSQL xlogs. The
file will contain <XLOG> plus date/time information. If correctly written,
the call returns TRUE, else FALSE.

XLogArchiveXlogs() returns a single XLOG filename, or NULL.
The archiver scans the pg_rlog directory. If it sees an rlog entry ending
in .full, it renames that entry to <XLOG>.busy and returns <XLOG>; the
archiver then copies the xlog of that name from the pg_xlog directory away
to the archive location.

XLogArchiveComplete(xlog) returns SUCCESS, ALREADY_NOTIFIED or
SEVERE_FAILURE.
When the copy is complete, the archiver renames the rlog entry to
<XLOG>.done. If all is OK, this returns SUCCESS. If the rlog entry has
already been renamed to <XLOG>.done, the archiver receives ALREADY_NOTIFIED.

XLogArchiveBusy(xlog) returns ARCHIVE_OK, BUSY or SEVERE_FAILURE.
If <XLOG>.done exists, this returns ARCHIVE_OK, which then allows <XLOG> to
be recycled/removed. If <XLOG>.busy or <XLOG>.full still exists, it returns
BUSY. If no rlog entry for <XLOG> is available at all, it returns
SEVERE_FAILURE.
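
To make the protocol concrete, here is a small standalone sketch of the
PostgreSQL-side calls. It is not the actual patch: the fixed-size buffers,
the exact contents of the rlog entry, and the treatment of a still-present
.full entry as BUSY are assumptions of the example.

    #include <stdbool.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    #define RLOGDIR "pg_rlog"

    typedef enum { ARCHIVE_OK, BUSY, SEVERE_FAILURE } XLogArchiveBusyResult;

    /* Backend side: mark <xlog> as ready for archiving by creating <xlog>.full */
    bool
    XLogArchiveNotify(const char *xlog)
    {
        char    path[1024];
        FILE   *fp;
        time_t  now = time(NULL);

        snprintf(path, sizeof(path), "%s/%s.full", RLOGDIR, xlog);
        fp = fopen(path, "w");
        if (fp == NULL)
            return false;       /* the caller treats FALSE as a PANIC condition */

        /* the rlog entry records the xlog name plus date/time information */
        fprintf(fp, "%s %s", xlog, ctime(&now));
        return (fclose(fp) == 0);
    }

    /* Checkpoint side: may <xlog> now be recycled/removed? */
    XLogArchiveBusyResult
    XLogArchiveBusy(const char *xlog)
    {
        char    path[1024];

        snprintf(path, sizeof(path), "%s/%s.done", RLOGDIR, xlog);
        if (access(path, F_OK) == 0)
            return ARCHIVE_OK;      /* archiver has a safe copy */

        snprintf(path, sizeof(path), "%s/%s.busy", RLOGDIR, xlog);
        if (access(path, F_OK) == 0)
            return BUSY;            /* archiver is still copying */

        snprintf(path, sizeof(path), "%s/%s.full", RLOGDIR, xlog);
        if (access(path, F_OK) == 0)
            return BUSY;            /* notified, but not yet picked up */

        return SEVERE_FAILURE;      /* no rlog entry found at all */
    }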


src/backend/utils/guc.c will be modified to add any config parameters.

src/backend/access/transam/xlog.c will be modified to implement the
PostgreSQL-side calls: XLogArchiveNotify(xlog) and XLogArchiveBusy(xlog).

libpgarch.c
    C implementation of the archiver-side API calls: XLogArchiveXlogs() and
    XLogArchiveComplete(). A sketch of how these might look against the
    pg_rlog protocol follows.
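
The following is again only an illustration under the same assumptions as
the sketch above, not the reference libpgarch.c itself:

    #include <dirent.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define RLOGDIR "pg_rlog"

    typedef enum { SUCCESS, ALREADY_NOTIFIED, SEVERE_FAILURE } XLogArchiveCompleteResult;

    /*
     * XLogArchiveXlogs: find an rlog entry ending ".full", claim it by renaming
     * it to ".busy", and return the bare xlog name (malloc'd), or NULL if no
     * xlog is currently waiting to be archived.
     */
    char *
    XLogArchiveXlogs(void)
    {
        DIR            *dir = opendir(RLOGDIR);
        struct dirent  *de;
        char           *xlog = NULL;

        if (dir == NULL)
            return NULL;

        while (xlog == NULL && (de = readdir(dir)) != NULL)
        {
            char   *suffix = strstr(de->d_name, ".full");
            char    oldpath[1024], newpath[1024];
            size_t  len;

            if (suffix == NULL || suffix[5] != '\0')
                continue;                   /* ".full" must end the name */

            len = (size_t) (suffix - de->d_name);
            xlog = malloc(len + 1);
            if (xlog == NULL)
                break;
            memcpy(xlog, de->d_name, len);
            xlog[len] = '\0';

            snprintf(oldpath, sizeof(oldpath), "%s/%s.full", RLOGDIR, xlog);
            snprintf(newpath, sizeof(newpath), "%s/%s.busy", RLOGDIR, xlog);
            if (rename(oldpath, newpath) != 0)
            {
                free(xlog);                 /* lost a race; keep looking */
                xlog = NULL;
            }
        }
        closedir(dir);
        return xlog;
    }

    /* XLogArchiveComplete: record that the copy of <xlog> is known to be safe */
    XLogArchiveCompleteResult
    XLogArchiveComplete(const char *xlog)
    {
        char    busypath[1024], donepath[1024];

        snprintf(busypath, sizeof(busypath), "%s/%s.busy", RLOGDIR, xlog);
        snprintf(donepath, sizeof(donepath), "%s/%s.done", RLOGDIR, xlog);

        if (rename(busypath, donepath) == 0)
            return SUCCESS;
        if (access(donepath, F_OK) == 0)
            return ALREADY_NOTIFIED;        /* already marked done */
        return SEVERE_FAILURE;
    }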


1.2 pg_arch: simple xlog archiving tool

src/tools/ will add:
pg_arch.c
    a single-threaded program that uses libpgarch.c to drive the API, but
    offers only a simple copy facility from pg_xlog to another directory.
    The program waits and watches for xlogs that are ready to be archived;
    it is not a file-filter type of program. It may be run as a foreground
    process (for testing etc.), though it is also designed to be run as a
    background process, typically started at the same time as the postmaster
    (through a mechanism such as a service autostart following system boot).
    A sketch of its main loop is given below the parameter list.
pg_arch has two parameters:
-D data-file root for a particular instance of PostgreSQL
-A archive directory
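
A minimal sketch of what the pg_arch main loop could look like, using the
archiver-side calls sketched earlier. copy_file() is a hypothetical helper
(a real tool would copy carefully and fsync the result), and the option
handling shown is the bare minimum:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    extern char *XLogArchiveXlogs(void);                /* NULL if nothing waiting */
    extern int   XLogArchiveComplete(const char *xlog); /* 0 on SUCCESS */
    extern int   copy_file(const char *src, const char *dst);  /* not shown */

    int
    main(int argc, char **argv)
    {
        const char *datadir = NULL;     /* -D: data-file root of the instance */
        const char *archdir = NULL;     /* -A: archive directory */
        int         c;

        while ((c = getopt(argc, argv, "D:A:")) != -1)
        {
            if (c == 'D')
                datadir = optarg;
            else if (c == 'A')
                archdir = optarg;
        }
        if (datadir == NULL || archdir == NULL)
        {
            fprintf(stderr, "usage: pg_arch -D datadir -A archivedir\n");
            return 1;
        }
        if (chdir(datadir) != 0)        /* pg_xlog and pg_rlog live under -D */
        {
            perror(datadir);
            return 1;
        }

        for (;;)                        /* wait and watch; not a file filter */
        {
            char   *xlog = XLogArchiveXlogs();
            char    src[1024], dst[1024];

            if (xlog == NULL)
            {
                sleep(10);              /* nothing ready; poll again shortly */
                continue;
            }

            snprintf(src, sizeof(src), "pg_xlog/%s", xlog);
            snprintf(dst, sizeof(dst), "%s/%s", archdir, xlog);

            if (copy_file(src, dst) == 0)
                XLogArchiveComplete(xlog);  /* copy is safe; PostgreSQL may recycle */
            /* on failure, leave the rlog entry .busy and let the operator intervene */

            free(xlog);
        }
    }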

2. Recovery to Point-in-Time (RPIT)

Recovery will offer these options:

2.1 Recovery to end of logs (last time)
2.2 Recovery of all available on-line logs
2.3 Point-in-time recovery to the checkpoint at, or the last checkpoint
before, the time specified.

The administrator is expected to be responsible for placing archived xlogs
back into the pg_xlog directory. This may be a facility provided by the
external archiver, or a manual or other automated process. If any mistakes
are made at this point, the administrator can reselect the appropriate
xlogs and try again. There is no enforced limit to the number of recovery
attempts possible.

2.1 Recovery to end of logs
The default option requires no additional/changed PostgreSQL code. The
archive API will be tested using this option.

2.2 Recovery of all available on-line logs
This will be made available as a command-line switch on the postmaster. It
will allow roll-forward through the xlogs until all available logs are
recovered, then the postmaster will shut down. This can be used, for
example, when the xlog archive exceeds available disk space: the
administrator would recover PostgreSQL in batches, and when the last batch
is reached the command switch would no longer be used.

2.3 RPIT
Add a feature to accept a recovery-target parameter and to halt recovery
when that time is reached.
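
One plausible shape for that stop test is sketched below, assuming the
recovery parameter is held in a variable recoveryTargetTime and that replay
passes in the timestamp carried on each checkpoint xlog record; none of
these names exist in the current code:

    #include <stdbool.h>
    #include <time.h>

    static time_t recoveryTargetTime;   /* set from the recovery parameter */

    static bool
    PastRecoveryTarget(time_t checkpointTime)
    {
        /* stop replay at the first checkpoint record later than the requested
         * recovery point; the database is then consistent as of the last
         * checkpoint at or before that time, as described in 2.3 */
        return checkpointTime > recoveryTargetTime;
    }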

3. Possible future extensions
The implementation in 1.1.3 may well be improved upon, or it may be
implemented differently altogether according to the architecture of the
archiving program.

Suggestions have been made to introduce a generalised notification
interface. If such an interface were available, it would be straightforward
to re-write the archival API to utilise it; considering that is outside the
aims of this development.

It is foreseen that the API could be used to form the basis of an XBSA or
NDMP client application that could then work easily with the leading
enterprise storage management products.

----end


  • Richard Huxton at Mar 9, 2004 at 10:47 am

    On Monday 08 March 2004 23:28, Simon Riggs wrote:
    PITR Functional Design v2 for 7.5
    Blimey - that's a long post :-) Thanks for explaining things simply enough
    that a non-hacker like me can understand. Well - almost understand ;-)
    Review of current Crash Recovery
    Is there any value in putting this section on techdocs or similar? We do get a
    small but steady trickle of people asking for details on internals, and I
    think this covers things in a different way to the WAL section of the
    manuals.
    PITR Proposed Solution
    To allow this to occur, the full backup *must* occur while the database
    is open or "hot". This backup must include all data and clogs (and any
    tablespaces or logical links utilised). A continuous sequence of xlogs
    must also be available, stretching from the last checkpoint prior to the
    start of the backup through to whatever time is specified for the
    "recovery point" or until the end of the xlogs.
    So this is a standard cp/tar etc while the cluster is actually in use?

    XLogArchiveXlogs() returns a single XLOG filename, or NULL

    If an xlog file is waiting to be archived, then the archiver will
    discover
    the name of the xlog by using this API call. If more than one file is
    available to be archived, then it will be ignored. If the archiver is
    multi-threaded, it need not wait until it has executed
    XLogArchiveComplete
    before it executes XLogArchiveXlogs again.
    So this means:
    1. The archiver is responsible for noticing that it is already archiving the
    filename returned (if it repeats the call too quickly).
    2. The archiver can only ever archive one XLOG file at a time.

    The initial proposal is a simple scheme that uses file existence & file
    extension to pass information between PostgreSQL and the archiver. This
    would take place in a peer directory of pg_xlog and pg_clog which has
    been named the pg_rlog directory. (r as in the strong first syllable
    "ar" in English pronunciation of "archive")
    Any reason why not "pg_pitr" or "pg_pitr_log"?
    1.2 pg_arch: simple xlog archiving tool

    Src/tools/ will add:
    pg_arch.c
    a single-threaded program that uses libpgarch.c to use API, but
    offers
    a simple copy facility from pg_xlog to another directory. The program
    will
    continue to wait and watch for archived files: it is not a file-filter
    type
    of program. It may be run as a foreground process (for testing etc),
    though
    is also designed to be run as a background process, typically executed
    at
    the same time as postmaster startup (through a mechanism such as service
    autostart mechanisms following system boot).
    pg_arch has two parameters:
    -D data-file root for particular instance of PostgreSQL
    -A archive directory
    Does the specification of these parameters (and any others) need to be part of
    the API? I'm thinking about the ability to "drop in" different archivers with
    each using the same pre-defined settings.
    2. Recovery to Point-in-Time (RPIT)

    Recovery to will offer these options:

    2.1 Recovery to end of logs (last time)
    2.2 Recovery of all available on-line logs
    2.3 Point in time recovery to the checkpoint AT or the last checkpoint
    before the time specified.

    The administrator is expected to be responsible for placing archived
    xlogs back into the pg_xlog directory. This may be a facility provided
    by the external archiver, a manual or other automated process. If any
    mistakes are made at this point then the administrator can then reselect
    appropriate xlogs and try again. There is no enforced limit to the
    number of recovery attempts possible.
    Just to clarify:
    1. I can identify which XLOG files I need based on their timestamp?
    2. Can I force a checkpoint using standard PG client APIs? So I can do "close
    weekly payroll, force checkpoint".
    3. We're restoring an entire cluster here, not just one database? How
    difficult would it be to strip out information for a single db - I'm thinking
    about the case where you may have limited backup storage and want to save an
    orders db but not a catalogue db. Or perhaps a hosting company with
    "platinum" customers getting PITR.

    --
    Richard Huxton
    Archonet Ltd
  • Simon Riggs at Mar 9, 2004 at 10:38 pm

    Richard Huxton
    On Monday 08 March 2004 23:28, Simon Riggs wrote:
    PITR Functional Design v2 for 7.5
    Review of current Crash Recovery
    Is there any value in putting this section on techdocs or similar? We do
    get a
    small but steady trickle of people asking for details on internals, and I
    think this covers things in a different way to the WAL section of the
    manuals.
    Certainly, though I would like to do all of that after it actually
    works!
    PITR Proposed Solution
    To allow this to occur, the full backup *must* occur while the
    database
    is open or "hot". This backup must include all data and clogs (and
    any
    tablespaces or logical links utilised). A continuous sequence of
    xlogs
    must also be available, stretching from the last checkpoint prior to
    the
    start of the backup through to whatever time is specified for the
    "recovery point" or until the end of the xlogs.
    So this is a standard cp/tar etc while the cluster is actually in use?
    Yes. I will add a line in to clarify that.
    XLogArchiveXlogs() returns a single XLOG filename, or NULL

    If an xlog file is waiting to be archived, then the archiver will
    discover
    the name of the xlog by using this API call. If more than one file
    is
    available to be archived, then it will be ignored. If the archiver
    is
    multi-threaded, it need not wait until it has executed
    XLogArchiveComplete
    before it executes XLogArchiveXlogs again.
    So this means:
    1. The archiver is responsible for noticing that it is already archiving
    the
    filename returned (if it repeats the call too quickly).
    2. The archiver can only ever archive one XLOG file at a time.
    1. No: I notice I missed a line saying
    " XLogArchiveXlogs()" in section 1.1.3 (corrected).
    Clarification: The archiver will not need to keep track of whether it is
    already archiving the same file (though sounds reasonable programming to
    do so anyway). The API call will never return the same log file twice to
    this call (by definition). That is implemented in my proposal by
    renaming the rlog entry to .busy, so it won't show up on subsequent
    calls.
    2.
    a) There is no restriction on threading in the archiver; it can if it
    wishes archive many files simultaneously. Since PostgreSQL produces them
    one at a time, this implies a build up of xlogs, which is specifically
    not encouraged. An archiver would be encouraged to multi-thread to avoid
    peaks of demand where the archive process was occurring slower than
    xlogs were being written.
    b) The reference implementation won't be multi-threaded in its first
    incarnation (if I write it!!!....be my guest, you have the API
    definition).

    You have also made me realise another failure condition which I have
    also added, to do with a failure of the copy process after this API call.
    The initial proposal is a simple scheme that uses file existence &
    file
    extension to pass information between PostgreSQL and the archiver.
    This
    would take place in a peer directory of pg_xlog and pg_clog which
    has
    been named the pg_rlog directory. (r as in the strong first syllable
    "ar" in English pronunciation of "archive")
    Any reason why not "pg_pitr" or "pg_pitr_log"?
    None. I like pg_pitr...
    Let's wait for other feedback to come in...
    1.2 pg_arch: simple xlog archiving tool
    Does the specification of these parameters (and any others) need to be
    part of
    the API? I'm thinking about the ability to "drop in" different archivers
    with
    each using the same pre-defined settings.
    Those parameters ARE NOT part of the API. The parameters mentioned are
    command line switches on the simple external archiving program pg_arch.

    pg_arch is intended to be a simple archiver-side testing tool. It makes
    sense to make it available also. Basically, you can do whatever you like
    on the archiver side of the API...contrib beckons....
    2. Recovery to Point-in-Time (RPIT)
    Just to clarify:
    1. I can identify which XLOG files I need based on their timestamp?
    2. Can I force a checkpoint using standard PG client APIs? So I can do
    "close
    weekly payroll, force checkpoint".
    3. We're restoring an entire cluster here, not just one database? How
    difficult would it be to strip out information for a single db - I'm
    thinking
    about the case where you may have limited backup storage and want to save
    an
    orders db but not a catalogue db. Or perhaps a hosting company with
    "platinum" customers getting PITR.
    1. Yes, the external timestamp gives that I think. Checking detail...
    2. CHECKPOINT is a PostgreSQL SQL command which can be executed from any
    client. Yes, your scenario fits.
    3. I tried to avoid that issue, but it rears its head. You seem to be
    specifying what you want though, so I'll have a think.

    More response required on 1 & 3...later!

    Best Regards, Simon Riggs
  • Robert Treat at Mar 12, 2004 at 4:52 pm

    On Tuesday 09 March 2004 17:38, Simon Riggs wrote:
    Richard Huxton
    On Monday 08 March 2004 23:28, Simon Riggs wrote:
    PITR Functional Design v2 for 7.5
    Review of current Crash Recovery
    Is there any value in putting this section on techdocs or similar? We do
    get a
    small but steady trickle of people asking for details on internals, and I
    think this covers things in a different way to the WAL section of the
    manuals.
    Certainly, though I would like to do all of that after it actually
    works!
    Just getting caught up on this thread and had similar thoughts to Richard's.
    If there are no objections, I'd like to put the first part of this email up
    on techdocs as an explanation of our current crash recovery system.

    Robert Treat
    --
    Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL
  • Josh Berkus at Mar 9, 2004 at 5:47 pm
    Simon,

    First off, let me compliment you on such a thorough proposal. I'm feeling
    very enthusiastic about 7.5 PITR based on the amount of thought you've given
    the project.

    In the spirit of thoroughness, I wanted to make a few comments and suggestions.
    All of the below are in the category of "extras it would be nice to have but
    are not essential to implementing PITR." Possibly they are all features to
    consider for 7.6/8.0/whatever; the next version. These comments are based
    on my personal experience as a professional contract DBA for PostgreSQL, MS
    SQL Server and SQL Anywhere.
    PITR features are designed to extend the existing Crash Recovery
    features so
    that a recovery can take place in situations where a crash recovery
    would
    not have been possible. These situations are:
    In my personal experience, the *primary* use of PITR is recovery from User
    Error. For example, with one SQL Server 7.0 installation for a law firm,
    I've made use of PITR 4 times over the last 4 years: once was because an HDD
    failed, the other three were all because of IT dept. staff running
    unconstrained UPDATE queries against the back end. For recovery with
    minimal loss of data, there are existing solutions, such as replication
    servers, in addition to PITR; for recovery from User Error, only PITR will
    suffice.
    There are a wide range of Backup and Recovery (BAR) products on the
    market, both open source and commercially licensed programs that provide
    facilities to perform full physical backups and individual file
    archives. The best way to foster wide adoption of PostgreSQL is to allow
    it to work in conjunction with any of these products. To this end, a
    Very perceptive of you. Good idea!
    wal_archive_policy and enable/disable archiving accordingly. This
    parameter can only be changed at server start. (This is required because
    the initial step of archiving each xlog is performed by the backend; if
    this were changeable after boot, then it might be possible for an
    individual backend to override the wal_archive_policy and choose not to
    archive - which would then effect the whole system and all users, not
    just the user making that choice). It is considered less desirable to
    Let me voice a real-world exception to this policy. Imagine that you are
    running an OLAP or decision-support database that analyzes data coming from
    an external source. Once a day you load 250MB of data via COPY and then
    do transformations on that data. While doing the load, you do *not* want
    the archiver running, as it would quickly fill up the WAL partition and
    backlog the archive tape.
    Under the proposed PITR spec, the only way to handle this would be to:
    1) Full back up
    2) Shut down PG
    3) Restart PG without archiving
    4) Load the data
    5) Shut down PG again
    6) Restart PG with archiving
    7) Full back-up again.
    DBAs would like it much more if starting/stopping the archiver was possible
    via a superuser (not regular user) GUC. This would allow a much faster
    cycle:
    1) Full back up
    2) Stop archiving
    3) Load the data
    4) Restart archiving
    5) Full back-up

    Related to the above, what I don't see in your paper or the proposed API is a
    way to coordinate full back-ups and WAL archiving. Obviously, the PITR
    Archive is only useful in reference to an existing full backup, so it is
    important to be able to associate a set of PITR archives with a particular
    full backup, or with some kind of "backup checkpoint". I'm sure that you
    have a solution for this, I just didn't see it explained in your proposal, or
    didn't understand it.

    FWIW, I find the MSSQL PITR system awkward in the extreme and harrowing in its
    unreliability. So it's not a good model to copy ....
    There is no requirement for the archiver to halt when PostgreSQL shuts
    down, though may choose to do so or not, e.g. it may be desirable to
    have one archiver operate for multiple postmasters simultaneously. The
    I see that you've chosen the "One archiver, many databases/clusters"
    architecture. I can also see how this strategy will be easier than the
    "many archivers" strategy. Be prepared that, based on the needs of DBAs,
    you will get the following requests:
    A) Will it be possible to have the archiver process run on a separate machine
    from PostgreSQL and access it over the network, via NFS or some other means?
    B) Will it be possible to define multiple output streams, so that database X
    can be archived to device Y and database N to device M?
    The out of space condition could therefore occur in two ways:
    1. there is a single delay during which xlog filesystem fills
    2. there could be a systematic delay which builds slowly until the xlog
    filesystem fills
    Given how PITR, and Tablespaces, both substantially increase the risk of
    running out of space on the xlog partition(s), it would be very nice to be
    able to arrange a WARNING whenever any PostgreSQL disk resource drops below a
    pre-defined percentage of availability. This could be done through a
    simple asynchronous process; heck, I think even I could write it in Perl.
    Maybe I'll try.
    The idea would be that all of the dirs defined in Tablespaces and PGData would
    be checked every X seconds for available space, and when it drops below Y% as
    reported by the filesystem, a WARNING is issued (X and Y% would be
    configurable at start time). Given the fluctuation of the xlog size and the
    size of the data files, this is not 100% reliable, but as a DBA it would be
    useful to be warned that xlog is more than, say, 80% full at least some of
    the time. It would let me know that I need to look at re-partitioning
    sooner rather than later.
    This asynchronous daemon would be optional, like the statistics daemon,
    allowing DBAs with large disks and small DBs not to run it.
    It is possible that an administrator may wish to choose to keep
    PostgreSQL up and to begin dropping log files rather than eventually
    crash. If that choice was made AND a full physical backup was not yet
    available, then there is a window of risk during which if a catastrophic
    Frankly, the only reason I can see for keeping the DB up after out-of-space is
    to allow a full backup to be made. In fact, I would favor a solution that
    immediately ran a full backup and then shut down the db whenever out-of-space
    happened. If continued operation with dropping logs proves hard to
    implement, I'd say skip it.
    Also, if you allow halting the archiver process on running databases, it would
    allow the DBA to gracefully extricate themselves from circumstances where
    rlog or xlog is out of space but the rest of the DB system is not.

    --
    -Josh Berkus
    Aglio Database Solutions
    San Francisco
  • Andreas Pflug at Mar 9, 2004 at 9:18 pm

    Josh Berkus wrote:
    5) Full back-up

    Related to the above, what I don't see in your paper or the proposed API is a
    way to coordinate full back-ups and WAL archiving. Obviously, the PITR
    Archive is only useful in reference to an existing full backup, so it is
    important to be able to associate a set of PITR archives with a particular
    full backup, or with some kind of "backup checkpoint". I'm sure that you
    have a solution for this, I just didn't see it explained in your proposal, or
    didn't understand it.
    As far as I understand, full backup in the sense of pgsql means all
    data files including clog where all transactions before the checkpoint
    are completely written to the data files. AFAICS there is a small detail
    missing so far.

    When I'm doing a file level hot backup, I can't be sure about the backup
    order. To be sure the cluster is in a consistent state regarding
    checkpoints, pg_clog must be the first directory backed up. If this
    isn't ensured, the situation could arise that the backed-up clog
    version contains a checkpoint which marks as completed a transaction
    whose data went to a file that was backed up before that data write
    took place.

    This could be ensured by doing the backup in two steps: first backing up
    pg_clog, and then the rest, with restore being performed in the opposite
    order. But this seems not very fail-safe - what if the admin doesn't
    know this/forgot about it? So IMHO a mechanism ensuring this would be
    better. I could think of a solution where a second pg_clog directory is
    used, and a pgsql API for that which is called right before performing
    the file backup. Josh calls this second pg_clog the "backup checkpoint".

    At the moment, a restart is done from clog + WAL, where clog might be
    too new in a hot backup situation as mentioned above. There should be a
    second pgsql restart mode, where checkpoints are not taken from the
    current clog, but from the "backup checkpoint clog" which was created
    explicitly at backup time. This is somewhat similar to MSSQL's backup
    behaviour, where the transaction log (=WAL) keeps growing until a full
    backup has been performed successfully.

    Regards,
    Andreas
  • Simon Riggs at Mar 9, 2004 at 11:17 pm

    Andreas Pflug
    Josh Berkus wrote:
    Related to the above, what I don't see in your paper or the proposed
    API
    is a
    way to coordinate full back-ups and WAL archiving. Obviously, the
    PITR
    Archive is only useful in reference to an existing full backup, so it
    is
    important to be able to associate a set of PITR archives with a
    particular
    full backup, or with some kind of "backup checkpoint". I'm sure
    that
    you
    have a solution for this, I just didn't see it explained in your
    proposal, or
    didn't understand it.
    AFAICS there is a small detail
    missing so far.

    When I'm doing a file level hot backup, I can't be sure about the backup
    order. To be sure the cluster is in a consistent state regarding
    checkpoints, pg_clog must be the first directory backed up. If this
    isn't made sure, the situation could arise that the backed up clog
    version contains a checkpoint which marks a transaction completed that
    has been written to a file which was backed up earlier than the data
    write took place.

    This could be insured by doing the backup in two steps; first backing up
    pg_clog, and then the rest, restore being performed in the opposite
    order.
    Good spot. I'll add this to the design.

    Will think more on the "backup checkpoint". Don't let me off the hook...

    Best Regards, Simon
  • Tom Lane at Mar 9, 2004 at 11:20 pm

    Andreas Pflug writes:
    When I'm doing a file level hot backup, I can't be sure about the backup
    order. To be sure the cluster is in a consistent state regarding
    checkpoints, pg_clog must be the first directory backed up.
    You are going off in the wrong direction entirely.

    Any hot-backup design that thinks safety can be ensured by "back up file A
    before file B" considerations is wrong. That's because Postgres doesn't
    necessarily dump dirty blocks into the data area (or clog area) at any
    particular time. Therefore, a filesystem-level backup taken while the
    postmaster is running is basically certain to be inconsistent. You can
    *not* avoid this by being careful about the order you dump files in.
    Heck, you can't even be certain that a file you dump is internally
    consistent.

    The only way we can support file-level hot backup is in conjunction with
    PITR-style WAL log archiving. It is okay for the data area dump to be
    inconsistent, so long as your recovery process includes replay of WAL
    starting at some checkpoint before the filesystem dump started, and
    extending to some point after the filesystem dump finished. Replaying
    WAL will correct the inconsistencies.

    regards, tom lane
  • Andreas Pflug at Mar 10, 2004 at 11:16 am

    Tom Lane wrote:
    Andreas Pflug <pgadmin@pse-consulting.de> writes:

    When I'm doing a file level hot backup, I can't be sure about the backup
    order. To be sure the cluster is in a consistent state regarding
    checkpoints, pg_clog must be the first directory backed up.
    You are going off in the wrong direction entirely.

    Any hot-backup design that thinks safety can be ensured by "back up file A
    before file B" considerations is wrong. That's because Postgres doesn't
    necessarily dump dirty blocks into the data area (or clog area) at any
    particular time. Therefore, a filesystem-level backup taken while the
    postmaster is running is basically certain to be inconsistent. You can
    *not* avoid this by being careful about the order you dump files in.
    Heck, you can't even be certain that a file you dump is internally
    consistent.
    Maybe my wording was misleading; it seems Simon understood me as it was meant.
    With "consistent state regarding checkpoints" I meant that all
    transactions that are marked as committed with the checkpoint are really
    present in the data files. Of course, there might be even more
    transactions which haven't been checkpointed so far, they'll need WAL
    replay.
    To clarify:
    I'd expect a cluster to be workable, if I
    - disable VACUUM until backup completed
    - issue CHECKPOINT
    - backup clog (CHECKPOINT and backup clog are the "backup checkpoint")
    - backup all datafiles (which include at least all completed transaction
    data at checkpoint time)
    and then
    - restore datafiles and clog
    - bring up pgsql.
    Certainly, all transactions after the backup checkpoint are lost. There
    might be fragments of newer transactions in data files, but they were
    never committed according to clog and thus rolled back.
    WAL replay would add more completed transactions, making the cluster
    more up-to-date, but omitting this would be sufficient in many disaster
    recovery scenarios.
    Did I miss something? If so, not only an API to get WAL data ordered out
    of pgsql is needed, but for the whole cluster.

    Regards,
    Andreas
  • Tom Lane at Mar 10, 2004 at 3:04 pm

    Andreas Pflug writes:
    To clarify:
    I'd expect a cluster to be workable, if I
    - disable VACUUM until backup completed
    - issue CHECKPOINT
    - backup clog (CHECKPOINT and backup clog are the "backup checkpoint")
    - backup all datafiles (which include at least all completed transaction
    data at checkpoint time)
    and then
    - restore datafiles and clog
    - bring up pgsql.
    Why is that a useful approach? You might as well shut down the
    postmaster and do a cold filesystem backup, because you are depending on
    the data files (including clog) not to change after the checkpoint. You
    cannot make such an assumption in a running database.

    Preventing VACUUM does not help btw.

    regards, tom lane
  • Andreas Pflug at Mar 10, 2004 at 4:37 pm

    Tom Lane wrote:
    Andreas Pflug <pgadmin@pse-consulting.de> writes:

    To clarify:
    I'd expect a cluster to be workable, if I
    - disable VACUUM until backup completed
    - issue CHECKPOINT
    - backup clog (CHECKPOINT and backup clog are the "backup checkpoint")
    - backup all datafiles (which include at least all completed transaction
    data at checkpoint time)
    and then
    - restore datafiles and clog
    - bring up pgsql.
    Why is that a useful approach? You might as well shut down the
    postmaster and do a cold filesystem backup,
    We're talking about *hot* backup, aren't we?

    Regards,
    Andreas
  • Tom Lane at Mar 10, 2004 at 4:56 pm

    Andreas Pflug writes:
    Tom Lane wrote:
    Why is that a useful approach? You might as well shut down the
    postmaster and do a cold filesystem backup,
    We're talking about *hot* backup, aren't we?
    Exactly. The approach you're sketching can't work for hot backup,
    because it effectively assumes that the database won't be changing.

    regards, tom lane
  • Andreas Pflug at Mar 10, 2004 at 5:33 pm

    Tom Lane wrote:
    Andreas Pflug <pgadmin@pse-consulting.de> writes:

    Tom Lane wrote:

    Why is that a useful approach? You might as well shut down the
    postmaster and do a cold filesystem backup,

    We're talking about *hot* backup, aren't we?
    Exactly. The approach you're sketching can't work for hot backup,
    because it effectively assumes that the database won't be changing.
    Well in the first place my posting was to express my suspicion that WAL
    replay relies on clog/pg_control being accurate, i.e. transactions
    marked as flushed must be on disk. AFAICS this is the consequence of WAL
    replay implementation. In case of hot backup, this means that data files
    must not be older than clog. Do you agree? So PITR needs a mechanism to
    insure this at backup time.

    Next question would be: If the point in time I'd like to recover is that
    very backup checkpoint time, do I need xlog at all?

    Regards,
    Andreas
  • Simon Riggs at Mar 9, 2004 at 10:38 pm

    Josh Berkus
    First off, let me compliment you on such a thorough proposal. I'm
    feeling very enthusiastic about 7.5 PITR
    Thank you, though please realise that I am in many ways summarising a
    wide range of suggestions and earlier work into a coherent whole.

    Me too! I'm trying to pace myself through release and into subsequent
    maintenance of the new features.

    Best Regards, Simon
  • Simon Riggs at Mar 9, 2004 at 10:38 pm

    Josh Berkus
    wal_archive_policy and enable/disable archiving accordingly. This
    parameter can only be changed at server start. (This is required
    because
    the initial step of archiving each xlog is performed by the backend;
    if
    this were changeable after boot, then it might be possible for an
    individual backend to override the wal_archive_policy and choose not
    to
    archive - which would then effect the whole system and all users,
    not
    just the user making that choice). It is considered less desirable
    to
    Let me voice a real-world exception to this policy. Imagine that you are
    running an OLAP or decision-support database that analyzes data coming
    from
    an external source. Once a day you load 250MB of data via COPY and then
    does transformations on that data. While doing the load, you do *not*
    want
    the archiver running, as it would quickly fill up the WAL partition and
    backlog the archive tape.
    Under the proposed PITR spec, the only way to handle this would be to:
    1) Full back up
    2) Shut down PG
    3) Restart PG without archiving
    4) Load the data
    5) Shut down PG again
    6) Restart PG with archiving
    7) Full back-up again.
    DBAs would like it much more if starting/stopping the archiver was
    possible
    via a superuser (not regular user) GUC. This would allow a much faster
    cycle:
    1) Full back up
    2) Stop archiving
    3) Load the data
    4) Restart archiving
    5) Full back-up
    The scenario you mention is what I'd like to do, but don't just yet see
    how.

    I'd welcome input on this point, since I don't fully understand GUCs:

    Thinking about this:
    1. Since the backends run XLogArchiveNotify(), they must all do so
    identically. One slip invalidates all the work of all the others.
    GUC Options are:
    INTERNAL - not appropriate
    POSTMASTER - what I originally envisaged, but not what you want
    SIGHUP - seems to allow different parameter settings in each backend
    BACKEND - not appropriate
    SUSET - maybe what you're looking for???????
    USERLIMIT - no
    USERSET - absolutely no

    2. Maybe have Postmaster run something every so often that looks for
    full xlogs and then executes XLogArchiveNotify() for them?

    Thoughts anyone?

    Best Regards, Simon Riggs
  • Josh Berkus at Mar 9, 2004 at 10:43 pm
    Simon,
    SIGHUP - seems to allow different parameter settings in each backend
    Nope. SIGHUP means that you need to send a HUP to the postmaster, such as
    you would with changes to pg_hba.conf.
    SUSET - maybe what you're looking for???????
    Yes. This means that it can be changed, at runtime, but by the Superuser
    only. This is used for several settings which are possible to change at
    runtime but take effect system-wide.

    --
    -Josh Berkus
    Aglio Database Solutions
    San Francisco
  • Simon Riggs at Mar 9, 2004 at 11:17 pm

    From: Josh Berkus
    SIGHUP - seems to allow different parameter settings in each backend
    Nope. SIGHUP means that you need to send a HUP to the postmaster, such
    as
    you would with changes to pg_hba.conf.
    SUSET - maybe what you're looking for???????
    Yes. This means that it can be changed, at runtime, but by the Superuser
    only. This is used for several settings which are possible to change at
    runtime but take effect system-wide.
    SUSET it is then. I'll update the design doc

    Thanks, Simon
  • Simon Riggs at Mar 9, 2004 at 11:10 pm

    what I don't see in your paper or the proposed API
    is a
    way to coordinate full back-ups and WAL archiving. Obviously, the PITR
    Archive is only useful in reference to an existing full backup, so it is
    important to be able to associate a set of PITR archives with a
    particular
    full backup, or with some kind of "backup checkpoint". I'm sure that you
    have a solution for this, I just didn't see it explained in your proposal,
    or
    didn't understand it.
    You are perceptive, and generous in imagining I have a good solution. I
    will document this, assuming no better ideas emerge:

    My understanding is this:
    When crash recovery occurs, recovery starts from this last checkpoint,
    not from the earliest log. Existing function caters for locating start
    of xlogs for recovery purposes.
    Relying upon that, it should be possible to have a backup coordinate
    with a stream of xlogs at recovery time, but not EXACTLY at backup time.
    Best practice would indicate that you should always maintain 2-3 full
    backups, so deleting xlogs immediately after a backup is not a very good
    idea at all.

    In general, the usage is going to be:
    - start archiver
    - start PostgreSQL
    - interval during which xlogs are backed up
    - start backup

    Best Regards, Simon Riggs
  • Hannu Krosing at Mar 10, 2004 at 9:01 pm

    Josh Berkus kirjutas T, 09.03.2004 kell 19:46:

    In my personal experience, the *primary* use of PITR is recovery from User
    Error. For example, with one SQL Server 7.0 installation for a law firm,
    I've made use of PITR 4 times over the last 4 years: once was because and HDD
    failed, the other three were all becuase of IT dept. staff running
    unconstrained UPDATE queries against the back end. For recovery with
    minimal loss of data, there are existing solutions, such as replication
    servers, in addition to PITR; for recovery from User Error, only PITR will
    suffice.
    Actually PostgreSQL used to have very good support for this until some
    time in Postgres95 development by supporting additional temporal
    qualifiers for queries. I.e. one could ask for 'SELECT SALARY FROM
    EMPLOYEES AS IT WAS AT YESTERDAY NOON' (the syntax was not exactly this
    :)

    It was a very simple and logical result of PostgreSQL's MVCC storage and
    was supported by VACUUM allowing dropping only deleted tuples older than
    some specified time.

    Oracle has recently added something similar (using their WAL's) to ver.
    9.0 or 10.x of their DBMS exactly for recovery from user errors.

    The support for this was dropped from PostgreSQL citing performance
    reasons at that time, but I still hope that it can be restored some time.

    --------------
    Hannu
  • Simon Riggs at Mar 12, 2004 at 10:46 am

    Hannu Krosing
    Josh Berkus kirjutas T, 09.03.2004 kell 19:46:
    In my personal experience, the *primary* use of PITR is recovery
    from
    User
    Error. For example, with one SQL Server 7.0 installation for a law firm,
    I've made use of PITR 4 times over the last 4 years: once was
    because
    and HDD
    failed, the other three were all becuase of IT dept. staff running
    unconstrained UPDATE queries against the back end. For recovery
    with
    minimal loss of data, there are existing solutions, such as
    replication
    servers, in addition to PITR; for recovery from User Error, only
    PITR
    will
    suffice.
    Actually PostgreSQL used to have very good support for this until some
    time in Postgres95 development by supporting additional temporal
    qualifiers for queries. I.e. one could ask for 'SELECT SALARY FROM
    EMPLOYEES AS IT WAS AT YESTERDAY NOON' (the syntax was not exacly this
    :)
    Shame we can't run cc -as was 4 years ago.
    Or even better cc -as will be in 3 weeks; that would save me loads :)
    !!
    It was a very simple and logical result of PostgreSQL's MVCC storage and
    was supported by VACUUM allowing dropping only deleted tuples older than
    some specified time.

    Oracle has recently added something similar (using their WAL's) to ver.
    9.0 or 10.x of their DBMS exactly for recovery from user errors.

    The support for this was dropped from postgreSQL citing performance
    reasons at that time, but I still hope that it can restored some time.
    Flashback query is a new feature in Oracle 9i. I believe it is regarded
    with some horror by the DBA community...

    I get your idea though, though I think it is a different thing.

    PITR is about the avoidance of risk, not really about fixing any
    particular classes of problem. If you have PITR you can recover from
    "all" problems, frequent or not, depending upon how carefully and for
    how long you protect your backups.

    Reading old MVCC copies won't take that away, though is a feature that
    would be useful within a particular time window. It's also hard to tell
    whether VACUUM has been run, and if so on which tables, since that will
    also change the answer you get from those MVCC-usage type queries.

    Oracle recognise this also. Flashback hasn't replaced backup/restore.
    Neither has it prevented them from enhancing log miner.

    Temporal support is a different issue anyway. It is up to you to come up
    with a database design that supports being able to ask that question, if
    that is a business requirement.

    Best regards, Simon Riggs
  • Andreas Zeugswetter at Mar 10, 2004 at 5:37 pm

    The only way we can support file-level hot backup is in conjunction with
    PITR-style WAL log archiving. It is okay for the data area dump to be
    inconsistent, so long as your recovery process includes replay of WAL
    starting at some checkpoint before the filesystem dump started, and
    extending to some point after the filesystem dump finished. Replaying
    WAL will correct the inconsistencies.
    And the "last checkpoint" info resides in pg_control, and not in pg_clog, no ?
    So basically a PITR restore would need to adjust the pg_control file
    after filesystem restore and before starting recovery. Maybe it can take that
    info from the oldest available WAL ? The OP would only need to ensure,
    that only such logs that need to be rolled forward are visible (in the
    correct directory) to the recovery.

    Andreas
  • Andreas Zeugswetter at Mar 10, 2004 at 6:58 pm

    To clarify:
    I'd expect a cluster to be workable, if I
    - disable VACUUM until backup completed
    - issue CHECKPOINT
    - backup clog (CHECKPOINT and backup clog are the "backup checkpoint")
    - backup all datafiles (which include at least all completed transaction
    data at checkpoint time)
    and then
    - restore datafiles and clog
    - bring up pgsql.
    Why is that a useful approach? You might as well shut down the
    postmaster and do a cold filesystem backup, because you are depending on
    the data files (including clog) not to change after the checkpoint. You
    cannot make such an assumption in a running database.
    I think there is a misunderstanding here.

    What I think is possible is the following (continuous backup of WAL assumed):
    - disable VACUUM
    - issue CHECKPOINT "C1"
    - backup all files
    - reenable VACUUM

    - restore files
    - adapt pg_control (checkpoint "C1")
    - recover WAL until at least end of backup

    The db is inconsistent until you have recovered all the WAL (PITR) that
    accumulated during the file backup.

    I am not sure about clog, isn't clog logged in xlog ?

    Andreas
  • Tom Lane at Mar 11, 2004 at 6:57 am

    "Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:
    What I think is possible is the following (continuous backup of WAL assumed):
    - disable VACUUM
    - issue CHECKPOINT "C1"
    - backup all files
    - reenable VACUUM
    - restore files
    - adapt pg_control (checkpoint "C1")
    - recover WAL until at least end of backup
    I do not understand this fixation on "disable VACUUM". What are you
    thinking that will buy you? AFAICS it would make no difference.
    I am not sure about clog, isn't clog logged in xlog ?
    Right. For the purposes of PITR, clog behaves the same as regular
    data files. You gotta back it up, but a filesystem copy will be
    inconsistent until fixed by WAL replay.

    regards, tom lane
