I've now completed the coding of Phase 1 of PITR.

This allows a backup to be recovered and then rolled forward (all the
way) on transaction logs. This proves that the code and the design work,
and it also validates many of the earlier assumptions that were the
subject of much debate.

As noted in the previous designs, PostgreSQL talks to an external
archiver using the XLogArchive API.
I've now completed:
- the changes to PostgreSQL
- a simple archiving utility, pg_arch

Using both of these together, I have successfully:
- started pg_arch
- started postgres
- taken a backup using tar
- run pgbench for an extended period, so that the transaction logs taken
at the start have long since been recycled
- killed the postmaster
- waited for completion
- removed the data directory (rm -R $PGDATA)
- restored the backup using tar
- restored the xlogs from the archive directory
- started the postmaster and watched it recover to the end of the logs
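For reference, the drill above can be sketched as a single script. Everything
in it is an assumption on my part - the paths, the archive location, and the
exact pg_arch/pg_ctl invocations are illustrative, not the patch's actual
interface:

```shell
# Hypothetical recovery drill; all paths and flags are illustrative.
export PGDATA=/usr/local/pgsql/data
ARCHIVE=/mnt/archive/xlog               # assumed archive destination

pg_arch &                               # start the archiver (real flags omitted)
pg_ctl start                            # start postgres
tar -cf /mnt/backup/base.tar "$PGDATA"  # base backup
pgbench -t 100000 bench                 # enough WAL that the early logs recycle
pg_ctl stop -m immediate                # kill the postmaster
rm -R "$PGDATA"                         # simulate losing the data directory
tar -xf /mnt/backup/base.tar -C /       # restore the base backup
cp "$ARCHIVE"/* "$PGDATA"/pg_xlog/      # restore xlogs from the archive
pg_ctl start                            # recovery rolls forward to end of logs
```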

This has been run through a number of non-trivial tests, and I've sat
and watched the beast at work to make sure nothing weird was happening
with the timing.

At this stage:
Missing Functions -
- recovery does NOT yet stop at a specified point in time (that was
always planned for Phase 2)
- a few more log messages are required to report progress
- a debug mode is required to allow most of them to be turned off

Wrinkles
- the code is system-testable, but not as cute as it could be
- input from committers is now sought to complete the work
- you are strongly advised not to treat any of the patches as usable in
any real-world situation YET - that bit comes next

Bugs
- two bugs currently occur during some tests:
1. the notification mechanism as originally designed causes ALL backends
to report that a log file has closed. That works most of the time,
though it does give rise to occasional timing errors - nothing too
serious, but this inexactness could lead to later errors.
2. after a restore, the notification system doesn't recover fully - this
one is straightforward

I'm building a full patchset for this code and will upload it soon. As
you might expect over the time it's taken me to develop this, some
bitrot has set in, so I'm rebuilding it against the latest dev version
now, and will complete fixes for the two bugs mentioned above.

I'm sure some will say "no words, show me the code"... I thought you all
would appreciate some advance warning of this, to plan time to
investigate and comment upon the coding.

Best Regards, Simon Riggs, 2ndQuadrant
http://www.2ndquadrant.com


  • Bruce Momjian at Apr 26, 2004 at 4:48 pm
    I want to come hug you --- where do you live? !!!

    :-)

    ---------------------------------------------------------------------------

    Simon Riggs wrote:
    I've now completed the coding of Phase 1 of PITR.
    [...]



    ---------------------------(end of broadcast)---------------------------
    TIP 9: the planner will ignore your desire to choose an index scan if your
    joining column's datatypes do not match
    --
    Bruce Momjian | http://candle.pha.pa.us
    pgman@candle.pha.pa.us | (610) 359-1001
    + If your life is a hard drive, | 13 Roberts Road
    + Christ can be your backup. | Newtown Square, Pennsylvania 19073
  • Simon Riggs at Apr 26, 2004 at 5:01 pm
    Well, I guess I was fairly happy too :-)

    I'd be more comfortable if I'd found more bugs though, but I'm sure the
    kind folk on this list will see that wish of mine comes true!

    The code is in a "needs more polishing" state - which is just the right
    time for some last discussions before everything sets too solid.

    Regards, Simon
    On Mon, 2004-04-26 at 17:48, Bruce Momjian wrote:
    I want to come hug you --- where do you live? !!!

    :-)

    ---------------------------------------------------------------------------

    Simon Riggs wrote:
    I've now completed the coding of Phase 1 of PITR.
    [...]
  • Bruce Momjian at Apr 26, 2004 at 5:08 pm

    Simon Riggs wrote:

    Well, I guess I was fairly happy too :-) YES!
    I'd be more comfortable if I'd found more bugs though, but I'm sure the
    kind folk on this list will see that wish of mine comes true!

    The code is in a "needs more polishing" state - which is just the right
    time for some last discussions before everything sets too solid.
    Once we see the patch, we will be able to eyeball all the code paths and
    interface to existing code and will be able to spot a lot of stuff, I am
    sure.

    It might take a few passes over it but you will get all the support and
    ideas we have.

  • Simon Riggs at Apr 26, 2004 at 11:46 pm

    On Mon, 2004-04-26 at 18:08, Bruce Momjian wrote:
    Simon Riggs wrote:
    Well, I guess I was fairly happy too :-) YES!
    I'd be more comfortable if I'd found more bugs though, but I'm sure the
    kind folk on this list will see that wish of mine comes true!

    The code is in a "needs more polishing" state - which is just the right
    time for some last discussions before everything sets too solid.
    Once we see the patch, we will be able to eyeball all the code paths and
    interface to existing code and will be able to spot a lot of stuff, I am
    sure.

    It might take a few passes over it but you will get all the support and
    ideas we have.
    Thanks very much.

    Code will be there in full tomorrow now (oh it is tomorrow...)

    Fixed the bugs that I spoke of earlier though. They all make sense when
    you try to tell someone else about them...

    Best Regards, Simon
  • Glen Parker at Apr 26, 2004 at 9:03 pm
    I want to come hug you --- where do you live? !!!
    You're not the only one. But we don't want to smother the poor guy, at
    least not before he completes his work :-)
  • Simon Riggs at Apr 26, 2004 at 9:11 pm

    On Mon, 2004-04-26 at 16:37, Simon Riggs wrote:
    I've now completed the coding of Phase 1 of PITR.

    This allows a backup to be recovered and then rolled forward (all the
    way) on transaction logs. This proves the code and the design works, but
    also validates a lot of the earlier assumptions that were the subject of
    much earlier debate.

    As noted in the previous designs, PostgreSQL talks to an external
    archiver using the XLogArchive API.
    I've now completed:
    - changes to PostgreSQL
    - written a simple archiving utility, pg_arch
    This will be on HACKERS not PATCHES for a while...


    OVERVIEW :

    Various code changes. Not all included here...but I want to prove this
    is real, rather than have you waiting for my patch release skills to
    improve.

    PostgreSQL changes include:
    ============================
    - guc.c
    New GUC called wal_archive to control whether archival logging is
    active.

    - xlog.h
    GUC added here

    - xlog.c
    The most critical parts of the code live here. The way things currently
    work can be thought of as a circular set of logs, with the current log
    position sweeping around the circle like a clock. In order to archive an
    xlog, you must start just AFTER the file has been closed and BEFORE the
    pointer sweeps round again.
    The code here tries to spot the right moment to notify the archiver
    that it's time to archive. That point is critical: too early and the
    archive may yet be incomplete; too late and a window of failure
    creeps into the system.
    Finding that point is more complicated than it seems because every
    backend has the same file open and decides to close it at different
    times - nearly the same time if you're running pgbench, but could vary
    considerably otherwise. That timing difference is the source of Bug#1.
    My solution is to use the piece of code that first updates pg_control,
    since there is a similar need to only-do-it-once. My understanding is
    that the other backends eventually discover they are supposed to be
    looking at a different file now and reset themselves - so that the xlog
    gets fsynced only once.
    It's taken me a week to consider the alternatives...this point is
    critical, so please suggest if you know/think differently.
    When the pointer sweeps round again, if we are still archiving, we
    simply increase the number of logs in the cycle to defer when we can
    recycle the xlog. The code doesn't yet handle a failure condition we
    discussed previously: running out of disk space and how we handle that
    (there was detailed debate, noted for future implementation).

    New utility aimed at being located in src/bin/pg_arch
    =======================================================
    - pg_arch.c
    The idea of pg_arch is that it is a functioning archival tool and at the
    same time is the reference implementation of the XLogArchive API. The
    API is all wrapped up in the same file currently, to make it easier to
    implement, but I envisage separating these out into two parts after it
    passes initial inspection - shouldn't take too much work given that was
    its design goal. This will then allow the API to be used for wider
    applications that want to backup PostgreSQL.

    - src/bin/Makefile has been updated to include pg_arch, so that this
    then gets made as part of the full system rather than an add-on. I'm
    sure somebody has feelings on this...my thinking was that it ought to be
    available without too much effort.

    What's NOT included (YET!)
    ==========================
    -changes to initdb
    -changes to postgresql.conf
    -changes to wal_debug
    -related changes
    -user documentation

    - changes to initdb
    XLogArchive API implementation relies on the existence of
    $PGDATA/pg_rlog

    That would be relatively simple to add to initdb, but it's also a
    no-brainer to add without it, so I thought I'd leave it for
    discussion in case anybody has good reasons to put it elsewhere,
    rename it, etc.

    More importantly, this affects the security model used by XLogArchive.
    The way I had originally envisaged this, the directory permissions would
    be opened up for group level read/write thus:
    pg_xlog rwxr-x---
    pg_rlog rwxrwx---
    though this of course relies on $PGDATA being opened up also. That then
    would allow the archiving tool to be in its own account also, yet with a
    shared group. (Thinking that a standard Legato install (for instance) is
    unlikely to recommend sharing a UNIX userid with PostgreSQL). I was
    unaware that PostgreSQL checks the permissions of PGDATA before it
    starts and does not allow you to proceed if group permissions exist.
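    A minimal sketch of that permission split, using a scratch directory
    in place of a live $PGDATA (the modes are the ones proposed above;
    the paths are simulated):

```shell
# Simulate the proposed group-readable/group-writable split on a temp tree;
# a real setup would apply these modes under $PGDATA itself.
pgdata=$(mktemp -d)
mkdir "$pgdata/pg_xlog" "$pgdata/pg_rlog"

chmod 750 "$pgdata/pg_xlog"   # rwxr-x---: archiver group may read closed xlogs
chmod 770 "$pgdata/pg_rlog"   # rwxrwx---: archiver group may write notifications
```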

    We have two options:

    i) alter all things that rely on security being userlevel-only
    - initdb
    - startup
    - most other security features?
    ii) encourage (i.e. force) people using XLogArchive API to run as the
    PostgreSQL owning-user (postgres).

    I've avoided this issue in the general implementation, thinking that
    there'll be some strong feelings either way, or an alternative that I
    haven't thought of yet (please...)

    -changes to postgresql.conf
    The parameter setting
    wal_archive=true
    needs to be added to make XLogArchive work or not.
    I've not added this to the install template (yet), in case we had some
    further suggestions for what this might be called.

    -changes to wal_debug
    The XLOG_DEBUG flag is set as a value between 1 and 16, though the
    code only ever treats this as a boolean. For my development, I
    partially implemented an earlier suggestion of mine: set the flag to
    1 in the config file, then set the more verbose portions of debug
    output to trigger only when it's set to 16. That affected a couple
    of places in xlog.c. That may not be needed, so that's not included
    either.

    -user documentation
    Not yet...but it will be.
  • Peter Eisentraut at Apr 27, 2004 at 5:10 pm

    Simon Riggs wrote:
    New utility aimed at being located in src/bin/pg_arch
    Why isn't the archiver process integrated into the server?
  • Bruce Momjian at Apr 27, 2004 at 5:59 pm

    Peter Eisentraut wrote:
    Simon Riggs wrote:
    New utility aimed at being located in src/bin/pg_arch
    Why isn't the archiver process integrated into the server?
    I think it is because the archiver process has to be started/stopped
    independently of the server.

  • Peter Eisentraut at Apr 28, 2004 at 3:14 pm

    On Tuesday 27 April 2004 at 19:59, Bruce Momjian wrote:
    Peter Eisentraut wrote:
    Simon Riggs wrote:
    New utility aimed at being located in src/bin/pg_arch
    Why isn't the archiver process integrated into the server?
    I think it is because the archiver process has to be started/stopped
    independently of the server.
    When the server is not running there is nothing to archive, so I don't follow
    this argument.
  • Simon Riggs at Apr 28, 2004 at 4:47 pm

    On Wed, 2004-04-28 at 16:14, Peter Eisentraut wrote:
    Am Tuesday 27 April 2004 19:59 schrieb Bruce Momjian:
    Peter Eisentraut wrote:
    Simon Riggs wrote:
    New utility aimed at being located in src/bin/pg_arch
    Why isn't the archiver process integrated into the server?
    I think it is because the archiver process has to be started/stopped
    independently of the server.
    When the server is not running there is nothing to archive, so I don't follow
    this argument.
    The running server creates xlogs, which are still available for archive
    even when the server is not running...

    Overall, your point is taken, with many additional comments in my other
    posts in reply to you.

    I accept that this may be desirable in the future, for some simple
    implementations. The pg_autovacuum evolution path is a good model - if
    it works and the code is stable, bring it under the postmaster at a
    later time.

    Best Regards, Simon Riggs
  • Bruce Momjian at Apr 29, 2004 at 4:19 am

    Simon Riggs wrote:
    When the server is not running there is nothing to archive, so I don't follow
    this argument.
    The running server creates xlogs, which are still available for archive
    even when the server is not running...

    Overall, your point is taken, with many additional comments in my other
    posts in reply to you.

    I accept that this may be desirable in the future, for some simple
    implementations. The pg_autovacuum evolution path is a good model - if
    it works and the code is stable, bring it under the postmaster at a
    later time.
    [ This email isn't focused because I haven't resolved all my ideas yet.]

    OK, I looked over the code. Basically it appears pg_arch is a
    client-side program that copies files from pg_xlog to a specified
    directory, and marks completion in a new pg_rlog directory.

    The driving part of the program seems to be:

    while ((n = read(xlogfd, buf, BLCKSZ)) > 0)
        if (write(archfd, buf, n) != n)
            return false;

    The program basically sleeps and when it awakes checks to see if new WAL
    files have been created.

    There is an additional GUC variable to prevent WAL from being recycled
    until it has been archived, but the posted patch only had pg_arch.c, its
    Makefile, and a patch to update bin/Makefile.

    Simon (the submitter) specified he was providing an API to archive, but
    it is really just a set of C routines to call that do copies. It is not
    a wire protocol or anything like that.

    The program has a mode where it archives all available wal files and
    exits, but by default it has to remain running to continue archiving.

    I am wondering if this is the way to approach the situation. I
    apologize for not considering this earlier. Archives of PITR postings
    of interest are at:

    http://momjian.postgresql.org/cgi-bin/pgtodo?pitr

    It seems the backend is the one that knows right away when a new WAL
    file has been created and needs to be archived.

    Also, are folks happy with archiving only full WAL files? This will not
    restore all transactions up to the point of failure, but might lose
    perhaps 2-5 minutes of transactions before the failure.

    Also, a client application is a separate process that must remain
    running. With Informix, there is a separate utility to do PITR logging.
    It is a pain to have to make sure a separate process is always running.

    Here is an idea. What if we add two GUC settings:

    pitr = true/false;
    pitr_path = 'filename or |program';

    In this way, you would basically specify your path to dump all WAL logs
    into (just keep appending 16MB chunks) or call a program that you pipe
    all the WAL logs into.
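    As a concrete illustration, the proposal would look something like
    this in postgresql.conf (hypothetical - neither setting exists in
    the posted patch):

```
pitr = true
pitr_path = '/mnt/backup/pgwal.archive'     # or '|/usr/local/bin/wal-ship'
```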

    You can't change pitr_path while pitr is on. Each backend opens the
    file in append mode before writing. One problem is that this slows
    down the backend, because it has to do the archive write itself, and
    that write might be slow.

    We also need the ability to write to a tape drive, and you can't
    open/close those like a file. Different backends will be doing the
    WAL file additions; there isn't a central process to keep a tape
    drive file descriptor open.

    Seems pg_arch should at least use libpq to connect to a database and
    do a LISTEN, and have the backends NOTIFY when they create a new WAL
    file or something. Polling for new WAL files seems non-optimal, but
    maybe a database connection is overkill.

    Then, you start the backend, specify the path, turn on pitr, do the tar,
    and you are on your way.

    Also, pg_arch should only be run by the install user. No need to
    allow other users to run it.

    Another idea is to have a client program like pg_ctl that controls PITR
    logging (start, stop, location), but does its job and exits, rather than
    remains running.

    I apologize for not bringing up these issues earlier. I didn't
    realize the direction it was going; I wasn't focused on it. Sorry.

  • Alvaro Herrera at Apr 29, 2004 at 1:52 pm

    On Thu, Apr 29, 2004 at 12:18:38AM -0400, Bruce Momjian wrote:

    OK, I looked over the code. Basically it appears pg_arch is a
    client-side program that copies files from pg_xlog to a specified
    directory, and marks completion in a new pg_rlog directory.

    The driving part of the program seems to be:

    while ((n = read(xlogfd, buf, BLCKSZ)) > 0)
        if (write(archfd, buf, n) != n)
            return false;

    The program basically sleeps and when it awakes checks to see if new WAL
    files have been created.
    Is the API able to indicate a written but not-yet-filled WAL segment?
    So an archiver could copy the filled part, and refill it later. This
    may be needed because a segment could take a while to be filled.

    --
    Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
    "Hoy es el primer día del resto de mi vida"
  • Bruce Momjian at Apr 29, 2004 at 2:07 pm

    Alvaro Herrera wrote:
    Is the API able to indicate a written but not-yet-filled WAL segment?
    So an archiver could copy the filled part, and refill it later. This
    may be needed because a segment could take a while to be filled.
    I couldn't figure that out, but I don't think it does. It would have to
    lock the WAL writes so it could get a good copy, I think, and I didn't
    see that.

  • Alvaro Herrera at Apr 29, 2004 at 2:11 pm

    On Thu, Apr 29, 2004 at 10:07:01AM -0400, Bruce Momjian wrote:
    Alvaro Herrera wrote:
    Is the API able to indicate a written but not-yet-filled WAL segment?
    So an archiver could copy the filled part, and refill it later. This
    may be needed because a segment could take a while to be filled.
    I couldn't figure that out, but I don't think it does. It would have to
    lock the WAL writes so it could get a good copy, I think, and I didn't
    see that.
    I'm not sure but I don't think so. You don't have to lock the WAL for
    writing, because it will always write later in the file than you are
    allowed to read. (If you read more than you were told to, it's your
    fault as an archiver.)

    --
    Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
    "Et put se mouve" (Galileo Galilei)
  • Bruce Momjian at Apr 29, 2004 at 2:23 pm

    Alvaro Herrera wrote:
    I'm not sure but I don't think so. You don't have to lock the WAL for
    writing, because it will always write later in the file than you are
    allowed to read. (If you read more than you were told to, it's your
    fault as an archiver.)
    My point was that without locking the WAL, we might get part of a WAL
    write in our file, but I now realize that during a crash the same thing
    might happen, so it would be OK to just copy it even if it is being
    written to.

    Simon posted the rest of his patch that shows changes to the backend,
    and a comment reads:

    + * The name of the notification file is the message that will be picked up
    + * by the archiver, e.g. we write RLogDir/00000001000000C6.full
    + * and the archiver then knows to archive XLogDir/00000001000000C6,
    + * while it is doing so it will rename RLogDir/00000001000000C6.full
    + * to RLogDir/00000001000000C6.busy, then when complete, rename it again
    + * to RLogDir/00000001000000C6.done

    so it is only archiving full logs.
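    The rename protocol in that comment can be acted out in shell; the
    segment name comes from the comment itself, while the scratch
    directories stand in for $PGDATA and the archive destination:

```shell
# Act out the .full -> .busy -> .done notification protocol on a scratch tree.
base=$(mktemp -d)
mkdir "$base/pg_xlog" "$base/pg_rlog" "$base/archive"

seg=00000001000000C6
echo "dummy wal contents" > "$base/pg_xlog/$seg"   # stand-in for a 16MB segment

# Backend side: the segment has been closed, so drop a notification file.
touch "$base/pg_rlog/$seg.full"

# Archiver side: claim the segment, copy it, then mark it archived.
mv "$base/pg_rlog/$seg.full" "$base/pg_rlog/$seg.busy"
cp "$base/pg_xlog/$seg" "$base/archive/$seg"
mv "$base/pg_rlog/$seg.busy" "$base/pg_rlog/$seg.done"
```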

    Also, I think this archiver should be able to log to a local drive,
    network drive (trivial), tape drive, ftp, or use an external script to
    transfer the logs somewhere. (ftp would probably be an external script
    with 'expect').

  • Simon Riggs at Apr 29, 2004 at 6:35 pm

    On Thu, 2004-04-29 at 15:22, Bruce Momjian wrote:
    My point was that without locking the WAL, we might get part of a WAL
    write in our file, but I now realize that during a crash the same thing
    might happen, so it would be OK to just copy it even if it is being
    written to.

    Simon posted the rest of his patch that shows changes to the backend,
    and a comment reads:

    + * The name of the notification file is the message that will be picked up
    + * by the archiver, e.g. we write RLogDir/00000001000000C6.full
    + * and the archiver then knows to archive XLogDir/00000001000000C6,
    + * while it is doing so it will rename RLogDir/00000001000000C6.full
    + * to RLogDir/00000001000000C6.busy, then when complete, rename it again
    + * to RLogDir/00000001000000C6.done

    so it is only archiving full logs.

    Also, I think this archiver should be able to log to a local drive,
    network drive (trivial), tape drive, ftp, or use an external script to
    transfer the logs somewhere. (ftp would probably be an external script
    with 'expect').
    Bruce is correct: the API waits for an xlog file to be full before
    archiving it.

    I had thought about the case for partial archiving: basically, if
    you want to archive in smaller chunks, make your log files smaller -
    this is currently a compile-time option. Possibly there is an
    argument for making the xlog file size configurable, as a way of
    doing what you suggest.

    Taking multiple copies of the same file, yet trying to work out which
    one to apply sounds complex and error prone to me. It also increases the
    cost of the archival process and thus drains other resources.

    The archiver should be able to do a whole range of things. Basically,
    that point was discussed and the agreed approach was to provide an API
    that would allow anybody and everybody to write whatever they wanted.
    The design included pg_arch since it was clear that there would be a
    requirement in the basic product to have those facilities - and in any
    case any practically focused API has a reference port as a way of
    showing how to use it and exposing any bugs in the server side
    implementation.

    The point is...everybody is now empowered to write tape drive code,
    whatever you fancy.... go do.

    Best regards, Simon Riggs
  • Bruce Momjian at Apr 29, 2004 at 7:25 pm

    Simon Riggs wrote:
    Also, I think this archiver should be able to log to a local drive,
    network drive (trivial), tape drive, ftp, or use an external script to
    transfer the logs somewhere. (ftp would probably be an external script
    with 'expect').
    Bruce is correct, the API waits for the archive to be full before
    archiving.

    I had thought about the case for partial archiving: basically, if you
    want to archive in smaller chunks, make your log files smaller...this is
    now a compile time option. Possibly there is an argument to make the
    xlog file size configurable, as a way of doing what you suggest.

    Taking multiple copies of the same file, yet trying to work out which
    one to apply sounds complex and error prone to me. It also increases the
    cost of the archival process and thus drains other resources.

    The archiver should be able to do a whole range of things. Basically,
    that point was discussed and the agreed approach was to provide an API
    that would allow anybody and everybody to write whatever they wanted.
    The design included pg_arch since it was clear that there would be a
    requirement in the basic product to have those facilities - and in any
    case any practically focused API has a reference port as a way of
    showing how to use it and exposing any bugs in the server side
    implementation.

    The point is...everybody is now empowered to write tape drive code,
    whatever you fancy.... go do.
    Agreed we want to allow the superuser control over writing of the
    archive logs. The question is how do they get access to that. Is it by
    running a client program continuously or calling an interface script
    from the backend?

    My point was that having the backend call the program improves
    reliability and control over when to write, and eases administration.

    How are people going to run pg_arch? Via nohup? In virtual screens? If
    I am at the console and I want to start it, do I use "&"? If I want to
    stop it, do I do a 'ps' and issue a 'kill'? This doesn't seem like a
    good user interface to me.

    To me the problem isn't pg_arch itself but the idea that a client
    program is going to be independently finding (polling) and copying the
    archive logs.

    I am thinking the client program is called with two arguments, the xlog
    file name, and the arch location defined in GUC. Then the client
    program does the write. The problem there though is who gets the write
    error since the backend will not wait around for completion?

    Another case is server start/stop. You want to start/stop the archive
    logger to match the database server, particularly if you reboot the
    server. I know Informix used a client program for logging, and it was a
    pain to administer.

    I would be happy with an external program if it was started/stopped by the
    postmaster (or via GUC change) and received a signal when a WAL file was
    written. But if we do that, it isn't really an external program anymore
    but another child process like our stats collector.

    I am willing to work on this if folks think this is a better approach.

    --
    Bruce Momjian | http://candle.pha.pa.us
    pgman@candle.pha.pa.us | (610) 359-1001
    + If your life is a hard drive, | 13 Roberts Road
    + Christ can be your backup. | Newtown Square, Pennsylvania 19073
  • Simon Riggs at Apr 29, 2004 at 8:56 pm

    On Thu, 2004-04-29 at 20:24, Bruce Momjian wrote:
    Simon Riggs wrote:
    The archiver should be able to do a whole range of things. Basically,
    that point was discussed and the agreed approach was to provide an API
    that would allow anybody and everybody to write whatever they wanted.
    The design included pg_arch since it was clear that there would be a
    requirement in the basic product to have those facilities - and in any
    case any practically focused API has a reference port as a way of
    showing how to use it and exposing any bugs in the server side
    implementation.

    The point is...everybody is now empowered to write tape drive code,
    whatever you fancy.... go do.
    Agreed we want to allow the superuser control over writing of the
    archive logs. The question is how do they get access to that. Is it by
    running a client program continuously or calling an interface script
    from the backend?

    My point was that having the backend call the program improves
    reliability and control over when to write, and eases administration.
    Agreed. We've both suggested ways that can occur, though I suggest this
    is much less of a priority, for now. Not "no", just not "now".
    How are people going to run pg_arch? Via nohup? In virtual screens? If
    I am at the console and I want to start it, do I use "&"? If I want to
    stop it, do I do a 'ps' and issue a 'kill'? This doesn't seem like a
    good user interface to me.

    To me the problem isn't pg_arch itself but the idea that a client
    program is going to be independently finding (polling) and copying the
    archive logs.

    I am thinking the client program is called with two arguments, the xlog
    file name, and the arch location defined in GUC. Then the client
    program does the write. The problem there though is who gets the write
    error since the backend will not wait around for completion?

    Another case is server start/stop. You want to start/stop the archive
    logger to match the database server, particularly if you reboot the
    server. I know Informix used a client program for logging, and it was a
    pain to administer.
    pg_arch is just icing on top of the API. The API is the real deal here.
    I'm not bothered if pg_arch is not accepted, as long as we can adopt the
    API. As noted previously, my original intention was to split the API away
    from the pg_arch application to make it clearer what was what. Once that
    has been done, I encourage others to improve pg_arch - but also to use
    the API to interface with other BAR products.

    If you're using PostgreSQL for serious business then you will be using a
    serious BAR product as well. There are many FOSS alternatives...

    The API's purpose is to allow larger, pre-existing BAR products to know
    when and how to retrieve data from PostgreSQL. Those products don't and
    won't run underneath postmaster, so although I agree with Peter's
    original train of thought, I also agree with Tom's suggestion that we
    need an API more than we need an archiver process.

    I would be happy with an external program if it was started/stopped by the
    postmaster (or via GUC change) and received a signal when a WAL file was
    written.
    That is exactly what has been written.

    The PostgreSQL side of the API is written directly into the backend, in
    xlog.c and is therefore activated by postmaster controlled code. That
    then sends "a signal" to the process that will do the archiving - the
    archiver side of the XLogArchive API is an in-process library.
    (The "signal" is, in fact, a zero-length file written to disk because
    there are many reasons why an external archiver may not be ready to
    archive or even up and running to receive a signal).
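
    The zero-length-file "signal" described above is attractively simple
    on the backend side. As a hypothetical sketch (illustrative shell, not
    the actual xlog.c code, which is C):

    ```shell
    # Hypothetical sketch of the backend side of the notification, not the
    # actual xlog.c code: the "signal" is just the creation of a zero-length
    # file named after the segment. It persists on disk until an archiver
    # (running now, or started later) discovers and consumes it.
    notify_segment_full() {
        rlogdir=$1; seg=$2
        : > "$rlogdir/$seg.full"    # the zero-length file is the entire message
    }
    ```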

    The only difference is that there is some confusion as to the role and
    importance of pg_arch.

    Best Regards, Simon Riggs
  • Bruce Momjian at Apr 30, 2004 at 3:02 am

    Simon Riggs wrote:
    Agreed we want to allow the superuser control over writing of the
    archive logs. The question is how do they get access to that. Is it by
    running a client program continuously or calling an interface script
    from the backend?

    My point was that having the backend call the program improves
    reliability and control over when to write, and eases administration.
    Agreed. We've both suggested ways that can occur, though I suggest this
    is much less of a priority, for now. Not "no", just not "now".
    Another case is server start/stop. You want to start/stop the archive
    logger to match the database server, particularly if you reboot the
    server. I know Informix used a client program for logging, and it was a
    pain to administer.
    pg_arch is just icing on top of the API. The API is the real deal here.
    I'm not bothered if pg_arch is not accepted, as long as we can adopt the
    API. As noted previously, my original intention was to split the API away
    from the pg_arch application to make it clearer what was what. Once that
    has been done, I encourage others to improve pg_arch - but also to use
    the API to interface with other BAR products.

    If you're using PostgreSQL for serious business then you will be using a
    serious BAR product as well. There are many FOSS alternatives...

    The API's purpose is to allow larger, pre-existing BAR products to know
    when and how to retrieve data from PostgreSQL. Those products don't and
    won't run underneath postmaster, so although I agree with Peter's
    original train of thought, I also agree with Tom's suggestion that we
    need an API more than we need an archiver process.

    I would be happy with an external program if it was started/stopped by the
    postmaster (or via GUC change) and received a signal when a WAL file was
    written.
    That is exactly what has been written.

    The PostgreSQL side of the API is written directly into the backend, in
    xlog.c and is therefore activated by postmaster controlled code. That
    then sends "a signal" to the process that will do the archiving - the
    archiver side of the XLogArchive API is an in-process library.
    (The "signal" is, in fact, a zero-length file written to disk because
    there are many reasons why an external archiver may not be ready to
    archive or even up and running to receive a signal).

    The only difference is that there is some confusion as to the role and
    importance of pg_arch.
    OK, I have finalized my thinking on this.

    We both agree that a pg_arch client-side program certainly works for
    PITR logging. The big question in my mind is whether a client-side
    program is what we want to use long-term, and whether we want to release
    a 7.5 that uses it and then change it in 7.6 to something more
    integrated into the backend.

    Let me add this is a little different from pg_autovacuum. With that,
    you could put it in cron and be done with it. With pg_arch, there is a
    routine that has to be used to do PITR, and if we change the process in
    7.6, I am afraid there will be confusion.

    Let me also add that I am not terribly worried about having the feature
    to restore to an arbitrary point in time for 7.5. I would much rather
    have a good PITR solution that works cleanly in 7.5 and add it to 7.6,
    than to have restore to an arbitrary point but have a strained
    implementation that we have to revisit for 7.6.

    Here are my ideas. (I talked to Tom about this and am including his
    ideas too.) Basically, the archiver that scans the xlog directory to
    identify files to be archived should be a subprocess of the postmaster.
    You already have that code and it can be moved into the backend.

    Here is my implementation idea. First, your pg_arch code runs in the
    backend and is started just like the statistics process. It has to be
    started whether PITR is being used or not, but will be inactive if PITR
    isn't enabled. This must be done because we can't have a backend start
    this process later in case they turn on PITR after server start.

    The process id of the archive process is stored in shared memory. When
    PITR is turned on, each backend that completes a WAL file sends a signal
    to the archiver process. The archiver wakes up on the signal and scans
    the directory, finds files that need archiving, and either does a 'cp'
    or runs a user-defined program (like scp) to transfer the file to the
    archive location.

    In GUC we add:

    pitr = true/false
    pitr_location = 'directory, user@host:/dir, etc'
    pitr_transfer = 'cp, scp, etc'

    The archiver program updates its config values when someone changes
    these values via postgresql.conf (and uses pg_ctl reload). These can
    only be modified from postgresql.conf. Changing them via SET has to be
    disabled because they are cluster-level settings, not per session, like
    port number or checkpoint_segments.

    Basically, I think that we need to push user-level control of this
    process down beyond the directory scanning code (that is pretty
    standard), and allow them to call an arbitrary program to transfer the
    logs. My idea is that the pitr_transfer program will get $1=WAL file
    name and $2=pitr_location and the program can use those arguments to do
    the transfer. We can even put a pitr_transfer.sample program in share
    and document $1 and $2.
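
    A pitr_transfer.sample along those lines might look like this
    (hypothetical sketch of the $1/$2 convention proposed above; the
    remote-vs-local pattern matching is purely illustrative):

    ```shell
    #!/bin/sh
    # Hypothetical pitr_transfer.sample, following the convention proposed
    # above: $1 = WAL file name, $2 = pitr_location from GUC.
    pitr_transfer() {
        walfile=$1
        dest=$2
        case $dest in
            *@*:*) scp "$walfile" "$dest" ;;   # user@host:/dir -> copy via scp
            *)     cp  "$walfile" "$dest" ;;   # plain directory -> local cp
        esac
    }
    ```

    The function simply returns cp/scp's exit status, so the open question
    above - who sees a write error if the backend does not wait around -
    is left to the caller.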

    --
    Bruce Momjian | http://candle.pha.pa.us
    pgman@candle.pha.pa.us | (610) 359-1001
    + If your life is a hard drive, | 13 Roberts Road
    + Christ can be your backup. | Newtown Square, Pennsylvania 19073
  • Simon Riggs at Apr 30, 2004 at 7:04 am

    On Fri, 2004-04-30 at 04:02, Bruce Momjian wrote:

    Let me also add that I am not terribly worried about having the feature
    to restore to an arbitrary point in time for 7.5. I would much rather
    have a good PITR solution that works cleanly in 7.5 and add it to 7.6,
    than to have restore to an arbitrary point but have a strained
    implementation that we have to revisit for 7.6.
    Interesting thought, I see now your priorities.

    Will read and digest over next few days.

    Thanks for your help and attention,

    Best regards, Simon Riggs
  • Simon Riggs at May 4, 2004 at 9:50 pm

    On Fri, 2004-04-30 at 04:02, Bruce Momjian wrote:
    Simon Riggs wrote:
    Agreed we want to allow the superuser control over writing of the
    archive logs. The question is how do they get access to that. Is it by
    running a client program continuously or calling an interface script
    from the backend?

    My point was that having the backend call the program improves
    reliability and control over when to write, and eases administration.
    Agreed. We've both suggested ways that can occur, though I suggest this
    is much less of a priority, for now. Not "no", just not "now".
    Another case is server start/stop. You want to start/stop the archive
    logger to match the database server, particularly if you reboot the
    server. I know Informix used a client program for logging, and it was a
    pain to administer.
    pg_arch is just icing on top of the API. The API is the real deal here.
    I'm not bothered if pg_arch is not accepted, as long as we can adopt the
    API. As noted previously, my original intention was to split the API away
    from the pg_arch application to make it clearer what was what. Once that
    has been done, I encourage others to improve pg_arch - but also to use
    the API to interface with other BAR products.

    If you're using PostgreSQL for serious business then you will be using a
    serious BAR product as well. There are many FOSS alternatives...

    The API's purpose is to allow larger, pre-existing BAR products to know
    when and how to retrieve data from PostgreSQL. Those products don't and
    won't run underneath postmaster, so although I agree with Peter's
    original train of thought, I also agree with Tom's suggestion that we
    need an API more than we need an archiver process.

    I would be happy with an external program if it was started/stopped by the
    postmaster (or via GUC change) and received a signal when a WAL file was
    written.
    That is exactly what has been written.

    The PostgreSQL side of the API is written directly into the backend, in
    xlog.c and is therefore activated by postmaster controlled code. That
    then sends "a signal" to the process that will do the archiving - the
    archiver side of the XLogArchive API is an in-process library.
    (The "signal" is, in fact, a zero-length file written to disk because
    there are many reasons why an external archiver may not be ready to
    archive or even up and running to receive a signal).

    The only difference is that there is some confusion as to the role and
    importance of pg_arch.
    OK, I have finalized my thinking on this.

    We both agree that a pg_arch client-side program certainly works for
    PITR logging. The big question in my mind is whether a client-side
    program is what we want to use long-term, and whether we want to release
    a 7.5 that uses it and then change it in 7.6 to something more
    integrated into the backend.

    Let me add this is a little different from pg_autovacuum. With that,
    you could put it in cron and be done with it. With pg_arch, there is a
    routine that has to be used to do PITR, and if we change the process in
    7.6, I am afraid there will be confusion.

    Let me also add that I am not terribly worried about having the feature
    to restore to an arbitrary point in time for 7.5. I would much rather
    have a good PITR solution that works cleanly in 7.5 and add it to 7.6,
    than to have restore to an arbitrary point but have a strained
    implementation that we have to revisit for 7.6.

    Here are my ideas. (I talked to Tom about this and am including his
    ideas too.) Basically, the archiver that scans the xlog directory to
    identify files to be archived should be a subprocess of the postmaster.
    You already have that code and it can be moved into the backend.

    Here is my implementation idea. First, your pg_arch code runs in the
    backend and is started just like the statistics process. It has to be
    started whether PITR is being used or not, but will be inactive if PITR
    isn't enabled. This must be done because we can't have a backend start
    this process later in case they turn on PITR after server start.

    The process id of the archive process is stored in shared memory. When
    PITR is turned on, each backend that completes a WAL file sends a signal
    to the archiver process. The archiver wakes up on the signal and scans
    the directory, finds files that need archiving, and either does a 'cp'
    or runs a user-defined program (like scp) to transfer the file to the
    archive location.

    In GUC we add:

    pitr = true/false
    pitr_location = 'directory, user@host:/dir, etc'
    pitr_transfer = 'cp, scp, etc'

    The archiver program updates its config values when someone changes
    these values via postgresql.conf (and uses pg_ctl reload). These can
    only be modified from postgresql.conf. Changing them via SET has to be
    disabled because they are cluster-level settings, not per session, like
    port number or checkpoint_segments.

    Basically, I think that we need to push user-level control of this
    process down beyond the directory scanning code (that is pretty
    standard), and allow them to call an arbitrary program to transfer the
    logs. My idea is that the pitr_transfer program will get $1=WAL file
    name and $2=pitr_location and the program can use those arguments to do
    the transfer. We can even put a pitr_transfer.sample program in share
    and document $1 and $2.
    ...Bruce and I have just discussed this in some detail and reached a
    good understanding of the design proposals as a whole. It looks like all
    of this can happen in the next few weeks, with a worst case time
    estimate of mid-June. TGFT!

    I'll write this up and post this shortly, with a rough roadmap for
    further development of recovery-related features.

    Best Regards,

    Simon Riggs
    2nd Quadrant
  • Simon Riggs at Apr 29, 2004 at 8:57 pm

    On Thu, 2004-04-29 at 20:24, Bruce Momjian wrote:
    I am willing to work on this...
    There is much work still to be done to make PITR work, accepting all of
    the many comments made.

    If anybody wants this by 1 June, I think we'd better look sharp. My aim
    has been to knock one of the URGENT items on the TODO list into touch,
    however that was to be achieved.

    The following work remains...from all that has been said...
    - halt restore at particular condition (point in time, txnid etc)
    - archive policy to control whether to halt database should archiving
    fail and space run out (as Oracle, Db2 do), or not (as discussed)
    - cope with restoring a stream of logs larger than the disk space on the
    restoration target system
    - integrate restore with tablespace code, to allow tablespace backups
    - build XLogSpy mechanism to allow DBA to better know when to recover to
    - extend logging mechanism to allow recovery time prediction
    - publicise the API with BAR open source teams, to get feedback and to
    encourage them to use the API to allow PostgreSQL support for their BAR
    - use the API to build interfaces to the 100+ BAR products on the market
    - performance tuning of xlogs, to ensure minimum xlog volume written
    - performance tuning of recovery, to ensure wasted effort avoided
    - allow archiver utility to be managed by postmaster
    - write some good documentation
    - comprehensive crash testing
    - really comprehensive crash testing
    - very comprehensive crash testing

    It seems worth working on things in some kind of priority order.

    I claim these, by the way, but many others look important and
    interesting to me:
    - halt restore at particular condition (point in time, txnid etc)
    - cope with restoring a stream of logs larger than the disk space on the
    restoration target system
    - write some good documentation

    Best Regards, Simon Riggs
  • Alvaro Herrera at Apr 29, 2004 at 8:24 pm

    On Thu, Apr 29, 2004 at 07:34:47PM +0100, Simon Riggs wrote:

    Bruce is correct, the API waits for the archive to be full before
    archiving.

    I had thought about the case for partial archiving: basically, if you
    want to archive in smaller chunks, make your log files smaller...this is
    now a compile time option. Possibly there is an argument to make the
    xlog file size configurable, as a way of doing what you suggest.

    Taking multiple copies of the same file, yet trying to work out which
    one to apply sounds complex and error prone to me. It also increases the
    cost of the archival process and thus drains other resources.
    My idea was basically that the archiver could be told "I've finished
    writing XLog segment 1 until byte 9000", so the archiver would

    dd if=xlog-1 seek=0 skip=0 bs=1c count=9000c of=archive-1

    And later, when it gets a notification "segment 1 until byte 18000", it does

    dd if=xlog-1 seek=0 skip=0 bs=1c count=18000c of=archive-1

    Or, if it's smart enough,

    dd if=xlog-1 seek=9000c skip=9000c bs=1c count=9000c of=archive-1

    Basically it is updating the logs as soon as it receives the
    notifications. Writing 16 MB of xlogs could take some time.

    When a full xlog segment has been written, a different kind of
    notification can be issued. A dumb archiver could just ignore the
    incremental ones and copy the files only upon receiving this other kind.


    I think that if log files are too small, maybe it will be a waste of
    resources (which ones?). Anyway, it's just an idea.

    --
    Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
  • Simon Riggs at Apr 27, 2004 at 8:21 pm

    On Tue, 2004-04-27 at 18:10, Peter Eisentraut wrote:
    Simon Riggs wrote:
    New utility aimed at being located in src/bin/pg_arch
    Why isn't the archiver process integrated into the server?
    Number of reasons....

    Overall, I initially favoured the archiver as another special backend,
    like checkpoint. That is exactly the same architecture as Oracle uses,
    so is a good starting place for thought.

    We discussed the design in detail on the list and the suggestion was
    made to implement PITR using an API to send notification to an archiver.
    In Oracle7, it was considered OK to just dump the files in some
    directory and call them archived. Later, most DBMSs have gone to some
    trouble to integrate with generic or at least market leading backup and
    recovery (BAR) software products. Informix and DB2 provide open
    interfaces to BARs; Oracle does not, but then it figured it already
    had the market share, so it could just do things its own way.

    The XLogArchive design allows ANY external archiver to work with
    PostgreSQL. The pg_arch program supplied is really to show how that
    might be implemented. This leaves the door open for any BAR product to
    interface through to PostgreSQL, whether this be your favourite open
    source BAR or the leading proprietary vendors.

    Wide adoption is an important design feature and the design presented
    offers this.

    The other reason is to do with how and when archival takes place. An
    asynchronous communication mechanism is required between PostgreSQL and
    the archiver, to allow for such situations as tape mounts or simple
    failure of the archiver. The method chosen for implementing this
    asynchronous comms mechanism lends itself to being an external API -
    there were other designs but these were limited to internal use only.

    You ask a reasonable question however. If pg_autovacuum exists, why
    should pg_autoarch not work also? My focus on external
    connectivity may have overshadowed my thinking there.

    It would not require too much additional work to add another GUC which
    gives the name of the external archiver, so we can confirm it is running
    and start/restart it if it fails. At this point, such a feature is a nice to
    have in comparison with the goal of being able to recover to a PIT, so I
    will defer this issue to Phase 3....

    Best regards, Simon Riggs
  • Peter Eisentraut at Apr 28, 2004 at 3:03 pm

    Am Tuesday 27 April 2004 22:21 schrieb Simon Riggs:
    Why isn't the archiver process integrated into the server?
    You ask a reasonable question however. If pg_autovacuum exists, why
    should pg_autoarch not work also?
    pg_autovacuum is going away to be integrated as a backend process.
  • Peter Eisentraut at Apr 28, 2004 at 3:16 pm

    Am Monday 26 April 2004 23:11 schrieb Simon Riggs:
    ii) encourage (i.e. force) people using XLogArchive API to run as the
    PostgreSQL owning-user (postgres).
    I think this is perfectly reasonable.
  • Andreas Zeugswetter at Apr 30, 2004 at 10:35 am

    Basically it is updating the logs as soon as it receives the
    notifications. Writing 16 MB of xlogs could take some time.
    In my experience with archiving logs, 16 MB is, on the contrary, way too
    small for a single log. The overhead of starting e.g. a tape session
    is so high (a few seconds) that you cannot keep up. Once the tape is
    streaming it is usually quite fast. So imho it is not really practical to
    have logs so small that they can fill in less than 20 seconds.

    Andreas
  • Simon Riggs at May 10, 2004 at 11:00 pm
    Further design plans for PITR...as posted previously, Bruce and I had a
    long discussion recently to iron out the major thinking and a good deal
    of the detail also.
    In overview, the major change is introducing an ARCHIVE process running
    under control of the Postmaster, similar to Stats collector.

    Due to personal commitments in late May and early June, these changes
    will not be complete until mid/late June - best I can do - including the
    time required for the fair amount of documentation needed for the code
    to be usefully tested during beta. The good news is there is little
    speculation in this design now; it is just a matter of hanging the code
    in the right place - about half the code is waiting to be remerged into
    this latest design.

    I'll submit the code in pieces as well, so we can view progress, whether
    or not those are incrementally committed.

    Committers & all others interested: pls check this out and make any
    comments or questions now...time for rework is now slipping fast.

    Best Regards, Simon Riggs, 2nd Quadrant


    ...detail chatter follows
    On Thu, 2004-05-06 at 05:38, Bruce Momjian wrote:
    Simon Riggs wrote:
    Bruce, was this OK with you...shall I post?

    Some items occurred to me during write up...are you OK with those? Do
    you want to alter anything before I post?
    Looks good with a few adjustments:
    Some additions and backtracks...
    These choices should be offered as a single GUC, with mutually
    exclusive values of
    - CIRCULAR (named same as DB2 to illustrate that some xlogging does take
    place, just not archive logging)
    - ARCHIVE
    - EXTERNAL
    It would be nice to allow the external program to work if you specified
    the program as '', but external isn't the same as running no program
    because the external program will also do the flag file removal once it
    is archived. I am a little worried about adding an external capability
    when we don't have anyone ready to actually show someone wanting such an
    external program. Not sure how to handle that -- add it in 7.5 and
    see, or go with a boolean and see if we can get an external thing
    working for 7.6.
    OK, EXTERNAL will not be included in the 7.5 drop; I'm not certain it is
    necessary now because of other changes in the design (below).
    We always spawn an ARCHIVER process under postmaster, no matter what the
    setting of the main GUC. That way, it can be started up if required.
    Archive process id is stored in shared memory (or on disk as
    postmaster?)
    I think shared memory, but I am not positive. I think shared memory
    because the postmaster could potentially have to stop/restart it. I
    will have to look at how the stats process is done.
    Looks to me like this would have to be a disk file, e.g. archiver.pid
    but I'll isolate that piece of code in case someone has a bright idea.
    The archiver program updates its config values when someone changes
    these values via postgresql.conf (and uses pg_ctl reload). These can
    only be modified from postgresql.conf.
    This would be PGC_SIGHUP. However, we need to make sure the archiver
    sees those changes like the backends see such changes now.
    Agreed.
    Basically, I think that we need to push user-level control of this
    process down beyond the directory scanning code (that is pretty
    standard), and allow them to call an arbitrary program to transfer the
    logs. My idea is that the pitr_transfer program will get $1=WAL file
    name and $2=pitr_location and the program can use those arguments to do
    the transfer. We can even put a pitr_transfer.sample program in share
    and document $1 and $2.
    Agreed.
    - initdb needs to be altered to add the pg_rlog directory
    Should we put the rlog directory as subdirectory of xlog? Seems so.
    Agreed.
    - code also required to note when xlog file switches occur during
    extended recovery across a number of xlog files
    ...was accepted
    - didn't discuss when we test for archive_dest and what happens then. We
    know Informix, DB2 and Oracle all freeze if archive_dest is not
    available. That's not an option at the moment...for the future. Right
    now we can choose to either PANIC, ERROR or WARNING and so need a
    GUC-specified policy to control that behaviour. (Suggest naming options
    SHUTDOWN(=PANIC) or WARNING)
    Yep, we can allow the admin to specify what happens if we can't archive.
    Summary of additional GUCs required (names not discussed...still open!)
    SUSET...NOT SET!
    - wal_archive_mode = CIRCULAR (default) | ARCHIVE | EXTERNAL
    - archive_dest = 'directory, user@host:/dir, etc' (no default)
    - archive_program = 'cp, scp, etc' (no default, or scp?)
    - wal_archive_error_policy = WARNING (default) | SHUTDOWN
    I would remove the wal_ part because, though it is implemented via WAL,
    the actual process is archiving. WAL is just an implementation
    detail.
    So, in summary, we have 5 GUCs, all PGC_SIGHUP
    - archive_mode = CIRCULAR (default) (==off)| ARCHIVE (==on)
    - archive_dest = 'directory, user@host:/dir, etc' (no default)
    - archive_program = 'cp, scp, etc' (no default)
    - archive_error_policy = WARNING (default) | SHUTDOWN
    - archive_debug
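    For illustration, the five GUCs above might appear in postgresql.conf
    like this (names still open and values purely illustrative):

    ```
    # All PGC_SIGHUP -- reloadable via pg_ctl reload; names not final
    archive_mode = ARCHIVE                   # CIRCULAR (default) | ARCHIVE
    archive_dest = '/mnt/backup/walarchive'  # no default
    archive_program = 'cp'                   # no default
    archive_error_policy = WARNING           # WARNING (default) | SHUTDOWN
    archive_debug = off
    ```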
    - The GUC for the recovery target should maybe be a postmaster
    command-line switch? That way we wouldn't need to edit postgresql.conf
    before recovery, and we also wouldn't need to give it a name...
    I like centralizing it all in GUC. Command-line parameters are pretty
    hard to specify for one-time usage like this. However, if you set it
    via GUC and don't modify the value before restarting the postmaster,
    is it going to honor that old xid? That would be a strange problem. I
    guess we could fail to start if we don't find the specified xid in the
    WAL files.
    Postmaster startup only, applies only if enters recovery
    - recovery_target = 12345262 (default is NOT SET)
    recovery_xid?

    Does it stop before that xid or after that xid?
    A recovery target supplied at a recovery-time start of the postmaster
    cannot easily be passed as a GUC or postmaster startup switch. The
    suggestion is to test for a file called:
    pgrecovery.conf
    which has something in it like this
    ROLLFORWARD UNTIL TRANSACTIONID 0x2343D4 INCLUSIVE;
    That looks over-cooked, but I'll make it simple (believe me!)
    After recovery completes, the file is renamed to:
    pgrecovery.done
    This then avoids complications with interactions of crash recovery and
    rollforward recovery. If we crash during recovery, it will restart
    cleanly and continue. Once recovery completes, if we then crash, we
    don't go back into rollforward recovery (unless we want to), which would
    not be the case if we put a GUC in the postgresql.conf file directly
    because we would need to re-edit it and send out a SIGHUP via pg_ctl
    reload - which is guaranteed not to happen under stress at 4am.
    No changes to postgresql.conf are required.
    [No capability, for now, to rollforward when logspace > available disk,
    but that can be a later addition]
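    The pgrecovery.conf lifecycle described above amounts to a small state
    machine, which can be sketched as below. Only the file names come from
    the proposal; the function names are hypothetical.

    ```shell
    # Illustrative sketch of the pgrecovery.conf lifecycle: rollforward
    # recovery runs only while pgrecovery.conf exists, and the rename on
    # completion means a later crash does plain crash recovery only.
    recovery_mode() {
        # $1 = data directory
        if [ -f "$1/pgrecovery.conf" ]; then
            echo rollforward
        else
            echo crash-only
        fi
    }
    finish_recovery() {
        # rename so rollforward recovery is not re-entered after a crash
        mv "$1/pgrecovery.conf" "$1/pgrecovery.done"
    }
    ```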

    The ARCHIVER architecture is very similar to the stats collector's. It
    starts up just before the stats collector, and the postmaster will
    restart it. I'll put all the code in one place, as we have with the
    stats collector.

    At startup, ARCHIVER will test archive capability: we write a test file
    called [pgarch_startup_$pid_$date] to the xlog directory, then execute
    the archive_program command once using that name as a parameter, which
    should copy the file to the archive location. At startup, failure of
    the archive_program will be a PANIC condition, whereas once started,
    PostgreSQL will act according to archive_error_policy.
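    The startup self-test just described could be sketched roughly as
    below; the file-name pattern comes from the post, while the function
    and variable names are assumptions.

    ```shell
    # Sketch of the ARCHIVER startup self-test. A failure here would be
    # PANIC rather than deferring to archive_error_policy.
    archiver_selftest() {
        xlogdir="$1"
        archive_program="$2"
        archive_dest="$3"
        testfile="pgarch_startup_$$_$(date +%Y%m%d)"
        echo "archiver self-test" > "$xlogdir/$testfile"
        # Run the user-supplied command once with the test file
        if $archive_program "$xlogdir/$testfile" "$archive_dest"; then
            rm -f "$xlogdir/$testfile"
            return 0
        fi
        echo "PANIC: archive_program failed during startup self-test" >&2
        return 1
    }
    ```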

    If ARCHIVER fails, it will be restarted by the postmaster.
    Archive_program runs in its own process, so it shouldn't be able to
    touch PostgreSQL. It will run in the (postgres) security context, so no
    permissions changes are needed. archive_error_policy only comes into
    effect once the archive directory runs out of space - after
    archive_program has failed and the WARNING to restart it has been
    ignored by admins.

    Since EXTERNAL is not being supported, originally posted program called
    pg_arch lives no more...c'est la vie

    Final issues:
    - need to know which signal to use from backend->ARCHIVER when an xlog
    fills. Somebody let me know - not bothered which...?

    ===
  • Simon Riggs at May 11, 2004 at 8:51 pm
    A few questions may help to speed up my work

    I need to send a signal from a backend to the archiver process.

    1. What signal should I use?

    2. How do I give the processid of the archiver to the backend? The
    archiver may restart at any time, so its pid could change after a
    backend is forked.

    I have answers, but I strive for the best answer.

    Thanks very much, Best regards, Simon Riggs
  • Bruce Momjian at May 11, 2004 at 8:55 pm

    Simon Riggs wrote:
    A few questions may help to speed up my work

    I need to send a signal from a backend to the archiver process.

    1. What signal should I use?
    You can use any unused signal. I would suggest looking at what the
    stats processes uses, and use something else like SIGUSR1.
    2. How do I give the processid of the archiver to the backend? The
    archiver may restart at any time, so its pid could change after a
    backend is forked.
    I was thinking of having it be in shared memory. I am going to work on
    that part, but I need to finish the relocatable install stuff for Win32
    first.

    --
    Bruce Momjian | http://candle.pha.pa.us
    pgman@candle.pha.pa.us | (610) 359-1001
    + If your life is a hard drive, | 13 Roberts Road
    + Christ can be your backup. | Newtown Square, Pennsylvania 19073
  • Tom Lane at May 11, 2004 at 9:16 pm

    Simon Riggs writes:
    I need to send a signal from a backend to the archiver process.
    1. What signal should I use?
    SIGUSR1 or SIGUSR2 would be the safest choices.
    2. How do I give the processid of the archiver to the backend? The
    archiver may restart at any time, so its pid could change after a
    backend is forked.
    My answer would be "don't". Send a signal to the postmaster and
    let it signal the current archiver child. Use the existing
    SendPostmasterSignal() code for the first part of this.

    regards, tom lane
  • Simon Riggs at May 11, 2004 at 9:59 pm

    On Tue, 2004-05-11 at 22:15, Tom Lane wrote:
    Simon Riggs <simon@2ndquadrant.com> writes:
    I need to send a signal from a backend to the archiver process.
    1. What signal should I use?
    SIGUSR1 or SIGUSR2 would be the safest choices.
    2. How do I give the processid of the archiver to the backend? The
    archiver may restart at any time, so its pid could change after a
    backend is forked.
    My answer would be "don't". Send a signal to the postmaster and
    let it signal the current archiver child. Use the existing
    SendPostmasterSignal() code for the first part of this.
    Brilliant - very clean. Many thanks. Best Regards, Simon Riggs

Discussion Overview
group: pgsql-hackers
category: postgresql
posted: Apr 26, '04 at 3:38p
active: May 11, '04 at 9:59p
posts: 33
users: 7