As has been previously mentioned a couple of times, it should be
perfectly possible to use streaming replication to get around the
limitations of archive_command/archive_timeout to do log archiving for
PITR (being that you either keep archive_timeout high and risk data
loss or you set it very low and generate a huge log archive without
need).

I've put together a tool to do this. The basic idea is to just stream
down replication and write it to regular WAL files, which can then be
used for recovery. You'll still need to use archive_command together
with it to ensure that the backups are complete. Streaming replication
doesn't guarantee that - in fact, regular replication will fall back to
using whatever archive_command created when wal_keep_segments isn't
enough.
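
For reference, the tradeoff I mean is the usual archiving setup, roughly
like this in postgresql.conf (paths and values purely illustrative):

    archive_mode = on
    archive_command = 'cp %p /mnt/wal_archive/%f'
    archive_timeout = 3600   # high: up to an hour of un-archived WAL at risk
                             # low: a stream of mostly-empty 16MB segments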

I've put up an early version of the tool at
http://github.com/mhagander/pg_streamrecv

Comments and contributions are most welcome. And frankly, a good
review is very much required before I'd trust it ;) Hopefully, I
didn't overlook something critical :D

  • Fujii Masao at Sep 28, 2010 at 4:25 am

    On Mon, Sep 27, 2010 at 9:07 PM, Magnus Hagander wrote:
    As has been previously mentioned a couple of times, it should be
    perfectly possible to use streaming replication to get around the
    limitations of archive_command/archive_timeout to do log archiving for
    PITR (being that you either keep archive_timeout high and risk data
    loss or you set it very low and generate a huge log archive without
    need).

    I've put together a tool to do this. The basic idea is to just stream
    down replication and write it to regular WAL files, which can then be
    used for recovery. You'll still need to use archive_command together
    with it to ensure that the backups are complete. Streaming replication
    doesn't guarantee that - in fact, regular replication will fall back to
    using whatever archive_command created when wal_keep_segments isn't
    enough.

    I've put up an early version of the tool at
    http://github.com/mhagander/pg_streamrecv
    Great! This also might be useful for users who want something like
    Oracle redo log mirroring.
    Comments and contributions are most welcome. And frankly, a good
    review is very much required before I'd trust it ;) Hopefully, I
    didn't overlook something critical :D
    When I ran that, the size of the WAL file in the inprogress directory
    became more than 16MB. Obviously something isn't right.

    When I requested an immediate shutdown of the master, a segmentation
    fault occurred in pg_streamrecv. I guess that the return value 0
    of PQgetCopyData is not handled correctly.

    After repeating Ctrl+C and restarting pg_streamrecv a few times,
    I encountered the following error and pg_streamrecv would not come up.
    Is this intentional?

    In progress directory contains more than one file!

    $ ls foo/inprogress/
    00000001000000000000000D 00000001000000000000000D.save

    When there is an inprogress or archived WAL file, pg_streamrecv should
    not execute pg_current_xlog_location, because that result is not used?

    Regards,

    --
    Fujii Masao
    NIPPON TELEGRAPH AND TELEPHONE CORPORATION
    NTT Open Source Software Center
  • Magnus Hagander at Sep 28, 2010 at 8:23 am

    On Tue, Sep 28, 2010 at 06:25, Fujii Masao wrote:
    On Mon, Sep 27, 2010 at 9:07 PM, Magnus Hagander wrote:
    As has been previously mentioned a couple of times, it should be
    perfectly possible to use streaming replication to get around the
    limitations of archive_command/archive_timeout to do log archiving for
    PITR (being that you either keep archive_timeout high and risk data
    loss or you set it very low and generate a huge log archive without
    need).

    I've put together a tool to do this. The basic idea is to just stream
    down replication and write it to regular WAL files, which can then be
    used for recovery. You'll still need to use archive_command together
    with it to ensure that the backups are complete. Streaming replication
    doesn't guarantee that - in fact, regular replication will fall back to
    using whatever archive_command created when wal_keep_segments isn't
    enough.

    I've put up an early version of the tool at
    http://github.com/mhagander/pg_streamrecv
    Great! This also might be useful for users who want something like
    Oracle redo log mirroring.
    Thanks.
    Comments and contributions are most welcome. And frankly, a good
    review is very much required before I'd trust it ;) Hopefully, I
    didn't overlook something critical :D
    When I ran that, the size of the WAL file in the inprogress directory
    became more than 16MB. Obviously something isn't right.
    Wow, that's weird. I'm unable to reproduce that here - can you try to
    figure out why that happened?

    When I requested an immediate shutdown of the master, a segmentation
    fault occurred in pg_streamrecv. I guess that the return value 0
    of PQgetCopyData is not handled correctly.
    Almost right - it actually returns -2 - which isn't handled. I've
    added a fix for that - and while at it, covering anything that's so
    small it doesn't contain the streaming replication header.

    After repeating Ctrl+C and restarting pg_streamrecv a few times,
    I encountered the following error and pg_streamrecv would not come up.
    Is this intentional?

    In progress directory contains more than one file!

    $ ls foo/inprogress/
    00000001000000000000000D  00000001000000000000000D.save
    Yes, that is actually intentional.

    When it finds the ..0D file there the first time, it gets renamed to
    ".save", and it retries the transmission from the beginning of that
    segment. As soon as the retransmission has passed the point that 0D
    was at, the .save file is removed. If you Ctrl-C the process again
    *before* it has reached that point, it will leave both files around -
    it's up to you to clean them up. This is to make sure we don't
    overwrite a file that contains more log data than is currently
    available on the master.

    When there is an inprogress or archived WAL file, pg_streamrecv should
    not execute pg_current_xlog_location, because that result is not used?
    Yeah, that's just a bit of lazy programming that I should fix :-)
  • Fujii Masao at Sep 29, 2010 at 3:41 am

    On Tue, Sep 28, 2010 at 5:23 PM, Magnus Hagander wrote:
    When I ran that, the size of the WAL file in the inprogress directory
    became more than 16MB. Obviously something isn't right.
    Wow, that's weird. I'm unable to reproduce that here - can you try to
    figure out why that happened?
    Sorry, I misread the file size in the result of "ls -l".
    To be exact, the size of the WAL file in the inprogress directory can be
    less than 16MB. Here is the result of "ls -l inprogress".

    $ ls -l inprogress/
    total 1724
    -rw-rw-r-- 1 postgres postgres 1757352 Sep 29 12:03 000000010000000000000003

    This also would be a problem, since a WAL file smaller than 16MB cannot
    be used for recovery. I think that pg_streamrecv should create a 16MB
    file filled with zeros first, and write the received WAL records into
    that, as walreceiver does.

    Regards,

    --
    Fujii Masao
    NIPPON TELEGRAPH AND TELEPHONE CORPORATION
    NTT Open Source Software Center
  • Magnus Hagander at Sep 29, 2010 at 8:47 am

    On Wed, Sep 29, 2010 at 05:40, Fujii Masao wrote:
    On Tue, Sep 28, 2010 at 5:23 PM, Magnus Hagander wrote:
    When I ran that, the size of the WAL file in the inprogress directory
    became more than 16MB. Obviously something isn't right.
    Wow, that's weird. I'm unable to reproduce that here - can you try to
    figure out why that happened?
    Sorry, I misread the file size in the result of "ls -l".
    Aha, that explains it.
    To be exact, the size of the WAL file in the inprogress directory can be
    less than 16MB. Here is the result of "ls -l inprogress".

    $ ls -l inprogress/
    total 1724
    -rw-rw-r-- 1 postgres postgres 1757352 Sep 29 12:03 000000010000000000000003

    This also would be a problem, since a WAL file smaller than 16MB cannot
    be used for recovery. I think that pg_streamrecv should create a 16MB
    file filled with zeros first, and write the received WAL records into
    that, as walreceiver does.
    It's actually intentional. If we create a file at first, there is no
    way to figure out exactly how far through a partial segment we are
    without parsing the details of the log. This is useful both for the
    admin (who can look at the directory and watch the file grow) and the
    tool itself (to know when the .save file can be rotated away, when
    recovering from a partial segment receive).

    My idea was to just have the admin pad the file when it's time to do
    the restore. I could perhaps even add an option to the tool to do it -
    the idea being it's a manual step still.
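
    Padding would just mean extending the partial segment to the full 16MB
    with zeros, e.g. with GNU coreutils truncate (file name taken from
    your example above):

    $ truncate -s 16M inprogress/000000010000000000000003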

    Do you have another suggestion for how to provide those two things?
  • Fujii Masao at Oct 1, 2010 at 9:13 am

    On Wed, Sep 29, 2010 at 5:47 PM, Magnus Hagander wrote:
    It's actually intentional. If we create a file at first, there is no
    way to figure out exactly how far through a partial segment we are
    without parsing the details of the log. This is useful both for the
    admin (who can look at the directory and watch the file grow) and the
    tool itself (to know when the .save file can be rotated away, when
    recovering from a partial segment receive).

    My idea was to just have the admin pad the file when it's time to do
    the restore. I could perhaps even add an option to the tool to do it -
    the idea being it's a manual step still.

    Do you have another suggestion for how to provide those two things?
    My idea is to implement something like xlogdump in contrib and use it
    for those two things, though it's harder to implement that than a
    padding tool.

    BTW, implementing something like xlogdump is already in TODO list:

    ---
    Create dump tool for write-ahead logs for use in determining transaction
    id for point-in-time recovery. This is useful for checking PITR recovery.
    http://wiki.postgresql.org/wiki/TODO#Point-In-Time_Recovery_.28PITR.29
    ---

    Regards,

    --
    Fujii Masao
    NIPPON TELEGRAPH AND TELEPHONE CORPORATION
    NTT Open Source Software Center
  • Magnus Hagander at Oct 1, 2010 at 11:01 am

    On Fri, Oct 1, 2010 at 11:13, Fujii Masao wrote:
    On Wed, Sep 29, 2010 at 5:47 PM, Magnus Hagander wrote:
    It's actually intentional. If we create a file at first, there is no
    way to figure out exactly how far through a partial segment we are
    without parsing the details of the log. This is useful both for the
    admin (who can look at the directory and watch the file grow) and the
    tool itself (to know when the .save file can be rotated away, when
    recovering from a partial segment receive).

    My idea was to just have the admin pad the file when it's time to do
    the restore. I could perhaps even add an option to the tool to do it -
    the idea being it's a manual step still.

    Do you have another suggestion for how to provide those two things?
    My idea is to implement something like xlogdump in contrib and use it
    for those two things, though it's harder to implement that than a
    padding tool.
    Yes, much harder. I prefer keeping the tool simple when possible ;)
    Doesn't mean we couldn't do both, I guess.
    BTW, implementing something like xlogdump is already in TODO list:
    Yes, obviously such a tool would be very useful independent of the
    streaming thing.
  • Kevin Grittner at Sep 29, 2010 at 9:45 pm

    Magnus Hagander wrote:

    Comments and contributions are most welcome.
    This is probably too esoteric to be worked on yet, but for this to
    be useful for us we would need to pass the resulting files through
    pg_clearxlogtail and gzip in an automated fashion. And we would
    need to do regular log file archiving in parallel with it.

    As background, our databases around the state archive to a directory
    which is then pushed via rsync to a "dumb" backup location in the
    same room as the database server (we're lucky to have rsync on the
    target of this copy; any other executable is out of the question),
    and the same directory is pulled via rsync to a central location.
    We would be interested in using streaming replication to a tool such
    as you describe for the copy to the central location, but since we
    would still be forcing a wal-file switch once per hour we would need
    the current capability to shrink an "empty" file from 16MB to 16kB
    using the above-mentioned tools.

    Also, the ability to limit bandwidth would be a nice feature for
    us, preferably in a way which could be changed on the fly.

    If you could keep the development "friendly" to such features, I may
    get around to adding them to support our needs....

    -Kevin
  • Magnus Hagander at Sep 30, 2010 at 9:21 am

    On Wed, Sep 29, 2010 at 23:45, Kevin Grittner wrote:
    Magnus Hagander wrote:
    Comments and contributions are most welcome.
    This is probably too esoteric to be worked on yet, but for this to
    be useful for us we would need to pass the resulting files through
    pg_clearxlogtail and gzip in an automated fashion.  And we would
    need to do regular log file archiving in parallel with it.

    As background, our databases around the state archive to a directory
    which is then pushed via rsync to a "dumb" backup location in the
    same room as the database server (we're lucky to have rsync on the
    target of this copy; any other executable is out of the question),
    and the same directory is pulled via rsync to a central location.
    We would be interested in using streaming replication to a tool such
    as you describe for the copy to the central location, but since we
    would still be forcing a wal-file switch once per hour we would need
    the current capability to shrink an "empty" file from 16MB to 16kB
    using the above-mentioned tools.
    You could just have one stream going local and one stream going to the
    other location in parallel, though?

    Or use the stream to the local directory and rsync that off? While I
    haven't tested it, rsyncing the partial WAL files *should* be fine, I
    think...
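
    Something along these lines from cron, say (host name and paths are
    just placeholders):

    $ rsync -a /var/lib/pgsql/streamrecv/ backuphost:/wal_archive/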

    Also, the ability to limit bandwidth would be a nice feature for
    us, preferably in a way which could be changed on the fly.

    If you could keep the development "friendly" to such features, I may
    get around to adding them to support our needs....
    Would it be enough to have kind of an "archive_command" switch that
    says "whenever you've finished a complete wal segment, run this
    command on it"? Then that command could clear the tail, compress, and
    send off?
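
    Roughly like this, in other words (the switch, its %f placeholder and
    the ship_segment script are all hypothetical, modeled on
    archive_command; nothing like it exists in the tool yet):

    # hypothetical option -- does not exist yet:
    $ pg_streamrecv [usual options] \
        --segment-command='/usr/local/bin/ship_segment %f'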

    And in that case, should it run inline or in the background with the
    streaming? I would assume just forking it off and leaving it to its own
    business would be best?
  • Kevin Grittner at Sep 30, 2010 at 1:45 pm

    Magnus Hagander wrote:

    If you could keep the development "friendly" to such features, I
    may get around to adding them to support our needs....
    Would it be enough to have kind of an "archive_command" switch
    that says "whenever you've finished a complete wal segment, run
    this command on it"?
    That would allow some nice options. I've been thinking what would
    be the ideal use of this with our backup scheme, and the best I've
    thought up would be that each WAL segment file would be a single
    output stream, with the option of calling an executable (which could
    be a script) with the target file name and then piping the stream to
    it. At 16MB or a forced xlog switch, it would close the stream and
    call the executable again with a new file name. You could have a
    default executable for the default behavior, or just build in a
    default if no executable is specified.

    The reason I like this is that I could pipe the stream through
    pg_clearxlogtail and gzip pretty much "as is" to the locations on
    the database server currently used for rsync to the two targets, and
    the rsync commands would send the incremental changes once per
    minute to both targets. I haven't thought of another solution which
    provides incremental transmission of the WAL to the local backup
    location, which would be a nice thing to have, since this is most
    crucial when the WAN is down and not only is WAL data not coming
    back to our central location, but our application framework based
    replication stream isn't making it back, either.
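
    Concretely, the per-segment command I have in mind would be something
    along these lines (assuming pg_clearxlogtail is used as a stdin/stdout
    filter, the way we run it today; file names and paths are illustrative):

    $ pg_clearxlogtail < 000000010000000000000003 \
        | gzip > /backup/staging/000000010000000000000003.gz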

    -Kevin
  • Magnus Hagander at Sep 30, 2010 at 2:24 pm

    On Thu, Sep 30, 2010 at 15:45, Kevin Grittner wrote:
    Magnus Hagander wrote:
    If you could keep the development "friendly" to such features, I
    may get around to adding them to support our needs....
    Would it be enough to have kind of an "archive_command" switch
    that says "whenever you've finished a complete wal segment, run
    this command on it"?
    That would allow some nice options.  I've been thinking what would
    be the ideal use of this with our backup scheme, and the best I've
    thought up would be that each WAL segment file would be a single
    output stream, with the option of calling an executable (which could
    be a script) with the target file name and then piping the stream to
    it.  At 16MB or a forced xlog switch, it would close the stream and
    call the executable again with a new file name.  You could have a
    default executable for the default behavior, or just build in a
    default if no executable is specified.
    The problem with that one (which I'm sure is solvable somehow) is how
    to deal with restarts. Both restarts in the middle of a segment
    (happens all the time if you don't have an archive_timeout set), and
    really also restarts between segments. How would the tool know where
    to begin streaming again? Right now, it looks at the files - but doing
    it by your suggestion there are no files to look at. We'd need a
    second script/command to call to figure out where to restart from in
    that case, no?

    The reason I like this is that I could pipe the stream through
    pg_clearxlogtail and gzip pretty much "as is" to the locations on
    the database server currently used for rsync to the two targets, and
    the rsync commands would send the incremental changes once per
    minute to both targets.  I haven't thought of another solution which
    provides incremental transmission of the WAL to the local backup
    location, which would be a nice thing to have, since this is most
    crucial when the WAN is down and not only is WAL data not coming
    back to our central location, but our application framework based
    replication stream isn't making it back, either.
    It should be safe to just rsync the archive directory as it's being
    written by pg_streamrecv. Doesn't that give you the property you're
    looking for - local machine gets data streamed in live, remote machine
    gets it rsynced every minute?
  • Aidan Van Dyk at Sep 30, 2010 at 2:39 pm

    On Thu, Sep 30, 2010 at 10:24 AM, Magnus Hagander wrote:

    That would allow some nice options.  I've been thinking what would
    be the ideal use of this with our backup scheme, and the best I've
    thought up would be that each WAL segment file would be a single
    output stream, with the option of calling an executable (which could
    be a script) with the target file name and then piping the stream to
    it.  At 16MB or a forced xlog switch, it would close the stream and
    call the executable again with a new file name.  You could have a
    default executable for the default behavior, or just build in a
    default if no executable is specified.
    The problem with that one (which I'm sure is solvable somehow) is how
    to deal with restarts. Both restarts in the middle of a segment
    (happens all the time if you don't have an archive_timeout set), and
    really also restarts between segments. How would the tool know where
    to begin streaming again? Right now, it looks at the files - but doing
    it by your suggestion there are no files to look at. We'd need a
    second script/command to call to figure out where to restart from in
    that case, no?
    And then think of the future, when sync rep is in... I'm hoping to be
    able to use something like this to do synchronous replication to my
    archive (instead of to a live server).
    It should be safe to just rsync the archive directory as it's being
    written by pg_streamrecv. Doesn't that give you the property you're
    looking for - local machine gets data streamed in live, remote machine
    gets it rsynced every minute?
    When the "being written to" segmnt copmletes moves to the final
    location, he'll get an extra whole "copy" of the file. But of the
    "move" can be an exec of his scritpt, the compressed/gzipped final
    result shouldn't be that bad. Certainly no worse then what he's
    currently getting with archive command ;-) And he's got the
    uncompressed incimental updates as they are happening.

    a.
  • Magnus Hagander at Sep 30, 2010 at 3:01 pm

    On Thu, Sep 30, 2010 at 16:39, Aidan Van Dyk wrote:
    On Thu, Sep 30, 2010 at 10:24 AM, Magnus Hagander wrote:

    That would allow some nice options.  I've been thinking what would
    be the ideal use of this with our backup scheme, and the best I've
    thought up would be that each WAL segment file would be a single
    output stream, with the option of calling an executable (which could
    be a script) with the target file name and then piping the stream to
    it.  At 16MB or a forced xlog switch, it would close the stream and
    call the executable again with a new file name.  You could have a
    default executable for the default behavior, or just build in a
    default if no executable is specified.
    The problem with that one (which I'm sure is solvable somehow) is how
    to deal with restarts. Both restarts in the middle of a segment
    (happens all the time if you don't have an archive_timeout set), and
    really also restarts between segments. How would the tool know where
    to begin streaming again? Right now, it looks at the files - but doing
    it by your suggestion there are no files to look at. We'd need a
    second script/command to call to figure out where to restart from in
    that case, no?
    And then think of the future, when sync rep is in... I'm hoping to be
    able to use something like this to do synchronous replication to my
    archive (instead of to a live server).
    Right, that could be a future enhancement. Doesn't mean we shouldn't
    still do our best with the async mode of course :P

    It should be safe to just rsync the archive directory as it's being
    written by pg_streamrecv. Doesn't that give you the property you're
    looking for - local machine gets data streamed in live, remote machine
    gets it rsynced every minute?
    When the "being written to" segmnt copmletes moves to the final
    location, he'll get an extra whole "copy" of the file.  But of the
    Ah, good point.
    "move" can be an exec of his scritpt, the compressed/gzipped final
    result shouldn't be that bad.  Certainly no worse then what he's
    currently getting with archive command ;-)  And he's got the
    uncompressed incimental updates as they are happening.
    Yeah, it would be trivial to replace the rename() call with a call to
    a script that gets to do whatever is suitable to the file. Actually,
    it'd probably be better to rename() it *and* call the script, so that
    we can continue properly if the script fails.
  • Kevin Grittner at Sep 30, 2010 at 3:25 pm

    Aidan Van Dyk wrote:

    When the "being written to" segmnt copmletes moves to the final
    location, he'll get an extra whole "copy" of the file. But of the
    "move" can be an exec of his scritpt, the compressed/gzipped final
    result shouldn't be that bad. Certainly no worse then what he's
    currently getting with archive command ;-) And he's got the
    uncompressed incimental updates as they are happening.
    Hmmm... As long as streaming replication doesn't send the "tail" of
    an incomplete WAL segment file, the only thing we'd be missing on
    the send to the central location is the compression. That's
    typically reducing the size of the transmission by 50% to 75% (e.g.,
    the gzipped "full" files are usually in the range of 4MB to 8MB).
    At our WAN speeds, that is significant. I don't suppose that
    streaming replication uses (or offers as an option) a compressed
    stream?

    -Kevin
  • Magnus Hagander at Oct 1, 2010 at 8:08 am

    On Thu, Sep 30, 2010 at 17:25, Kevin Grittner wrote:
    Aidan Van Dyk wrote:
    When the "being written to" segmnt copmletes moves to the final
    location, he'll get an extra whole "copy" of the file.  But of the
    "move" can be an exec of his scritpt, the compressed/gzipped final
    result shouldn't be that bad.  Certainly no worse then what he's
    currently getting with archive command ;-)  And he's got the
    uncompressed incimental updates as they are happening.
    Hmmm...  As long as streaming replication doesn't send the "tail" of
    an incomplete WAL segment file, the only thing we'd be missing on
    the send to the central location is the compression.  That's
    typically reducing the size of the transmission by 50% to 75% (e.g.,
    the gzipped "full" files are usually in the range of 4MB to 8MB).
    At our WAN speeds, that is significant.  I don't suppose that
    streaming replication uses (or offers as an option) a compressed
    stream?
    No, it sends a regular COPY stream with the raw transaction log data.
    So the files generated will eventually be exactly 16MB, just like they
    are in pg_xlog.
  • Kevin Grittner at Sep 30, 2010 at 3:13 pm

    Magnus Hagander wrote:

    We'd need a second script/command to call to figure out where to
    restart from in that case, no?
    I see your point; I guess we would need that.
    It should be safe to just rsync the archive directory as it's
    being written by pg_streamrecv. Doesn't that give you the property
    you're looking for - local machine gets data streamed in live,
    remote machine gets it rsynced every minute?
    Well, the local target can't run pg_streamrecv -- it's a backup
    machine where we pretty much have rsync and nothing else. We could
    run pg_streamrecv on the database server itself and rsync to the
    local machine every minute.

    I just checked with the DBA who monitors space issues for such
    things, and it would be OK to rsync the uncompressed file to the
    local backup as it is written (we have enough space for it without
    compression) as long as we compress it before sending it to the
    central location. For that, your idea to fire a script on
    completion of the file would work -- we could maintain both raw and
    compressed files on the database server for rsync to the two
    locations.

    You can probably see the appeal of filtering it as it is written,
    though, if that is feasible. :-)

    -Kevin
