Attached is an updated streaming base backup patch, based on the work
that Heikki
started. It includes support for tablespaces, permissions, progress
reporting and
some actual documentation of the protocol changes (user interface
documentation will
depend on exactly what the frontend client ends up looking like, so I'm
holding off on that one for a while).

The basic implementation is: add a new command, BASE_BACKUP, to the walsender
replication mode that will initiate a base backup, stream the contents (in
tar-compatible format) of the data directory and all tablespaces, and then end
the base backup, all in a single operation.
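
For illustration, here is roughly what driving that command from a libpq
client could look like. This is a hedged sketch: the replication=true
connection keyword and the COPY handling are standard libpq, but the
BASE_BACKUP option syntax and the single-stream handling here are assumptions
based on the description above, not the final protocol (the patch actually
sends one tar stream per tablespace).

/*
 * Hypothetical client sketch: open a walsender connection and stream
 * the BASE_BACKUP output to stdout.  Compile with -lpq.
 */
#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn     *conn;
    PGresult   *res;
    char       *buf;
    int         len;

    /* replication=true selects the walsender command set */
    conn = PQconnectdb("host=master user=rep replication=true");
    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connect: %s", PQerrorMessage(conn));
        return 1;
    }

    /* Option syntax here is illustrative only */
    res = PQexec(conn, "BASE_BACKUP LABEL 'example'");
    if (PQresultStatus(res) != PGRES_COPY_OUT)
    {
        fprintf(stderr, "BASE_BACKUP: %s", PQerrorMessage(conn));
        return 1;
    }
    PQclear(res);

    /* Each CopyData message carries a chunk of the tar stream */
    while ((len = PQgetCopyData(conn, &buf, 0)) > 0)
    {
        fwrite(buf, 1, len, stdout);
        PQfreemem(buf);
    }

    PQfinish(conn);
    return 0;
}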

Other than the basic implementation, there is a small refactoring of
pg_start_backup() and pg_stop_backup(), splitting each into a "backend function"
that is easier to call internally and a "user-facing function" that remains
identical to the previous one. I've also added a pg_abort_backup()
internal-only function to get out of backup mode safely when something crashes
mid-backup (so it can be called from error handlers). Also, the walsender needs
a resource owner in order to call pg_start_backup().
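
The shape of that split, as a rough sketch; the function names follow the
pattern described above, but treat the exact signatures as assumptions rather
than what the patch finally commits:

#include "postgres.h"
#include "access/xlog.h"
#include "access/xlog_internal.h"
#include "fmgr.h"
#include "utils/builtins.h"

/* Backend-internal entry points, callable from the walsender without
 * going through the SQL function machinery. */
extern XLogRecPtr do_pg_start_backup(const char *backupidstr, bool fast);
extern XLogRecPtr do_pg_stop_backup(void);

/* Internal-only: get out of backup mode safely; intended to be callable
 * from error handlers when a backup aborts mid-stream. */
extern void pg_abort_backup(void);

/* The user-facing function keeps its old contract and just wraps the
 * internal one. */
Datum
pg_start_backup(PG_FUNCTION_ARGS)
{
    text       *backupid = PG_GETARG_TEXT_P(0);
    bool        fast = PG_GETARG_BOOL(1);
    char       *backupidstr = text_to_cstring(backupid);
    XLogRecPtr  startpoint;
    char        location[MAXFNAMELEN];

    startpoint = do_pg_start_backup(backupidstr, fast);

    snprintf(location, sizeof(location), "%X/%X",
             startpoint.xlogid, startpoint.xrecoff);
    PG_RETURN_TEXT_P(cstring_to_text(location));
}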

I've implemented a frontend for this in pg_streamrecv, based on the assumption
that we wanted to include this in bin/ for 9.1 - and that it seems like a
reasonable place to put it. This can obviously be moved elsewhere if we want to.
That code needs a lot more cleanup, but I wanted to make sure I got the backend
patch out for review quickly. You can find the current WIP branch for
pg_streamrecv on my github page at https://github.com/mhagander/pg_streamrecv,
in the branch "baserecv". I'll be posting that as a separate patch once it's
been a bit more cleaned up (it does work now if you want to test it, though).


Some remaining thoughts and must-dos:

* Compression: Do we want to be able to compress the backups server-side? Or
defer that to whenever we get compression in libpq? (You can still tunnel it
through, for example, SSH to get compression if you want.) My thinking is to
defer it.
* Compression: We could still implement compression of the tar files in
pg_streamrecv (probably easier, possibly more useful?)
* Windows support (need to implement readlink; see the sketch after this list)
* Tar code is copied from pg_dump and modified. Should we try to factor it out
into port/? There are changes in the middle of it, so it can't be done with
the current calling points; it would need a refactor. I think it's not worth
it, given how simple it is.
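
For the readlink item, a rough sketch of what a src/port/ implementation might
look like on Windows, reading the target of a junction point. The Win32 calls
are real, but the buffer layout and error handling are simplified and untested:

#include <windows.h>
#include <winioctl.h>

/* Layout of the mount-point (junction) reparse data, per the Win32 docs;
 * written out here rather than taken from a particular SDK header. */
typedef struct
{
    DWORD ReparseTag;
    WORD  ReparseDataLength;
    WORD  Reserved;
    WORD  SubstituteNameOffset;
    WORD  SubstituteNameLength;
    WORD  PrintNameOffset;
    WORD  PrintNameLength;
    WCHAR PathBuffer[1];
} JUNCTION_DATA;

int
pgreadlink(const char *path, char *buf, size_t size)
{
    char    data[16384];        /* MAXIMUM_REPARSE_DATA_BUFFER_SIZE */
    JUNCTION_DATA *jd = (JUNCTION_DATA *) data;
    HANDLE  h;
    DWORD   len;
    int     r;

    h = CreateFile(path, GENERIC_READ,
                   FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                   NULL, OPEN_EXISTING,
                   FILE_FLAG_OPEN_REPARSE_POINT | FILE_FLAG_BACKUP_SEMANTICS,
                   NULL);
    if (h == INVALID_HANDLE_VALUE)
        return -1;

    if (!DeviceIoControl(h, FSCTL_GET_REPARSE_POINT, NULL, 0,
                         data, sizeof(data), &len, NULL))
    {
        CloseHandle(h);
        return -1;
    }
    CloseHandle(h);

    if (jd->ReparseTag != IO_REPARSE_TAG_MOUNT_POINT)
        return -1;              /* not a junction point */

    /* Convert the UTF-16 substitute name to the local code page. */
    r = WideCharToMultiByte(CP_ACP, 0,
                            jd->PathBuffer + jd->SubstituteNameOffset / sizeof(WCHAR),
                            jd->SubstituteNameLength / sizeof(WCHAR),
                            buf, (int) size, NULL, NULL);
    return (r > 0) ? r : -1;
}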

Improvements I want to add, but that aren't required for basic operation:

* Stefan mentioned it might be useful to put some
posix_fadvise(POSIX_FADV_DONTNEED)
in the process that streams all the files out. Seems useful, as long as that
doesn't kick them out of the cache *completely*, for other backends as well.
Do we know if that is the case? (A rough sketch follows this list.)
* Include all the necessary WAL files in the backup. This way we could generate
a tar file that would work on its own - right now, you still need to set up
log archiving (or use streaming replication) to get the remaining log files
from the master. This is fine for replication setups, but not for backups.
This would also require us to block recycling of WAL files during the backup,
of course.
* Suggestion from Heikki: don't put backup_label in $PGDATA during the backup.
Rather, include it just in the tar file. That way if you crash during the
backup, the master doesn't start recovery from the backup_label, leading
to failure to start up in the worst case.
* Suggestion from Heikki: perhaps at some point we're going to need a full
bison grammar for walsender commands.
* Relocation of tablespaces (can at least partially be done client-side)
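
For the posix_fadvise item above, a minimal sketch of what the server-side
file-streaming loop might do, assuming Linux semantics; send_chunk_to_client()
is a made-up stand-in for however a chunk actually goes out over the COPY
stream:

#define _XOPEN_SOURCE 600
#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>

extern void send_chunk_to_client(const char *buf, size_t len);  /* stand-in */

static void
stream_file(const char *path)
{
    FILE   *fp = fopen(path, "rb");
    char    buf[131072];
    size_t  n;
    off_t   done = 0;

    if (fp == NULL)
        return;

    while ((n = fread(buf, 1, sizeof(buf), fp)) > 0)
    {
        send_chunk_to_client(buf, n);

#ifdef POSIX_FADV_DONTNEED
        /* Hint that we won't re-read what we just streamed.  Whether
         * this also evicts pages other backends still want is exactly
         * the open question above. */
        posix_fadvise(fileno(fp), done, n, POSIX_FADV_DONTNEED);
#endif
        done += n;
    }
    fclose(fp);
}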


  • Stefan Kaltenbrunner at Jan 5, 2011 at 3:33 pm
    On 01/05/2011 02:54 PM, Magnus Hagander wrote:
    [..]
    Some remaining thoughts and must-dos:

    * Compression: Do we want to be able to compress the backups server-side? Or
    defer that to whenever we get compression in libpq? (you can still tunnel it
    through for example SSH to get compression if you want to) My thinking is
    defer it.
    * Compression: We could still implement compression of the tar files in
    pg_streamrecv (probably easier, possibly more useful?)
    Hmm, compression would be nice, but I don't think it is required for this
    initial implementation.

    * Windows support (need to implement readlink)
    * Tar code is copied from pg_dump and modified. Should we try to factor it out
    into port/? There are changes in the middle of it so it can't be done with
    the current calling points, it would need a refactor. I think it's not worth
    it, given how simple it is.

    Improvements I want to add, but that aren't required for basic operation:

    * Stefan mentioned it might be useful to put some
    posix_fadvise(POSIX_FADV_DONTNEED)
    in the process that streams all the files out. Seems useful, as long as that
    doesn't kick them out of the cache *completely*, for other backends as well.
    Do we know if that is the case?
    Well, my main concern is that a basebackup done that way might blow up
    the buffer cache of the OS, causing temporary performance issues.
    This might be more serious with an in-core solution than with what
    people use now, because a number of backup tools (like some
    of the commercial backup solutions) employ various tricks to avoid that.
    One interesting tidbit I found was:

    http://insights.oetiker.ch/linux/fadvise/

    which is very Linux-specific but interesting nevertheless...




    Stefan
  • Dimitri Fontaine at Jan 5, 2011 at 9:58 pm

    Magnus Hagander writes:
    Attached is an updated streaming base backup patch, based on the work
    Thanks! :)
    * Compression: Do we want to be able to compress the backups server-side? Or
    defer that to whenever we get compression in libpq? (you can still tunnel it
    through for example SSH to get compression if you want to) My thinking is
    defer it.
    Compression in libpq would be a nice way to solve it, later.
    * Compression: We could still implement compression of the tar files in
    pg_streamrecv (probably easier, possibly more useful?)
    What about pg_streamrecv | gzip > …, which has the big advantage of
    being friendly to *any* compression command-line tool, whatever the patents
    and licenses?
    * Stefan mentioned it might be useful to put some
    posix_fadvise(POSIX_FADV_DONTNEED)
    in the process that streams all the files out. Seems useful, as long as that
    doesn't kick them out of the cache *completely*, for other backends as well.
    Do we know if that is the case?
    Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
    not already in SHM?
    * include all the necessary WAL files in the backup. This way we could generate
    a tar file that would work on its own - right now, you still need to set up
    log archiving (or use streaming repl) to get the remaining logfiles from the
    master. This is fine for replication setups, but not for backups.
    This would also require us to block recycling of WAL files during the backup,
    of course.
    Well, I would guess that if you're streaming the WAL files in parallel
    while the base backup is taken, then you're able to have it all without
    an archiving setup, and the server could still recycle them.

    Regards,
    --
    Dimitri Fontaine
    http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
  • Magnus Hagander at Jan 5, 2011 at 10:04 pm

    On Wed, Jan 5, 2011 at 22:58, Dimitri Fontaine wrote:
    Magnus Hagander <magnus@hagander.net> writes:
    Attached is an updated streaming base backup patch, based on the work
    Thanks! :)
    * Compression: Do we want to be able to compress the backups server-side? Or
    defer that to whenever we get compression in libpq? (you can still tunnel it
    through for example SSH to get compression if you want to) My thinking is
    defer it.
    Compression in libpq would be a nice way to solve it, later.
    Yeah, I'm pretty much set on postponing that one.

    * Compression: We could still implement compression of the tar files in
    pg_streamrecv (probably easier, possibly more useful?)
    What about pg_streamrecv | gzip > …, which has the big advantage of
    being friendly to *any* compression command-line tool, whatever the patents
    and licenses?
    That's part of what I meant by "easier and more useful".

    Right now though, pg_streamrecv will output one tar file for each
    tablespace, so you can't get it on stdout. But that can be changed of
    course. The easiest step 1 is to just use gzopen() from zlib on the
    files and use the same code as now :-)
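
    Something like this, say - the zlib calls are the real gzFile API, but the
    surrounding function names and the file naming are made up for the sketch:

    #include <stdio.h>
    #include <zlib.h>

    /* Open one compressed tar per tablespace instead of a plain FILE;
     * "wb9" selects maximum compression. */
    static gzFile
    open_tablespace_tar(const char *basedir, int tblspcnum)
    {
        char fn[1024];

        snprintf(fn, sizeof(fn), "%s/%d.tar.gz", basedir, tblspcnum);
        return gzopen(fn, "wb9");
    }

    /* Drop-in replacement for the existing fwrite() of each chunk. */
    static int
    write_tar_chunk(gzFile f, const char *buf, unsigned len)
    {
        return gzwrite(f, buf, len) == (int) len ? 0 : -1;
    }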

    * Stefan mentioned it might be useful to put some
    posix_fadvise(POSIX_FADV_DONTNEED)
    in the process that streams all the files out. Seems useful, as long as that
    doesn't kick them out of the cache *completely*, for other backends as well.
    Do we know if that is the case?
    Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
    not already in SHM?
    I think that's way more complex than we want to go here.

    * include all the necessary WAL files in the backup. This way we could generate
    a tar file that would work on its own - right now, you still need to set up
    log archiving (or use streaming repl) to get the remaining logfiles from the
    master. This is fine for replication setups, but not for backups.
    This would also require us to block recycling of WAL files during the backup,
    of course.
    Well, I would guess that if you're streaming the WAL files in parallel
    while the base backup is taken, then you're able to have it all without
    an archiving setup, and the server could still recycle them.
    Yes, this was mostly for the use-case of "getting a single tarfile
    that you can actually use to restore from without needing the log
    archive at all".
  • Dimitri Fontaine at Jan 5, 2011 at 10:27 pm

    Magnus Hagander writes:
    Compression in libpq would be a nice way to solve it, later.
    Yeah, I'm pretty much set on postponing that one.
    +1, in case it was not clear for whoever's counting the votes :)
    What about pg_streamrecv | gzip > …, which has the big advantage of
    That's part of what I meant by "easier and more useful".
    Well…
    Right now though, pg_streamrecv will output one tar file for each
    tablespace, so you can't get it on stdout. But that can be changed of
    course. The easiest step 1 is to just use gzopen() from zlib on the
    files and use the same code as now :-)
    Oh if integrating it is easier :)
    Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
    not already in SHM?
    I think that's way more complex than we want to go here.
    Yeah.
    Well, I would guess that if you're streaming the WAL files in parallel
    while the base backup is taken, then you're able to have it all without
    an archiving setup, and the server could still recycle them.
    Yes, this was mostly for the use-case of "getting a single tarfile
    that you can actually use to restore from without needing the log
    archive at all".
    It also allows for a simpler kick-start procedure for preparing a
    standby, and lets you stop worrying too much about wal_keep_segments
    and archive servers.

    When does the standby launch its walreceiver? It would be extra-nice for
    the base backup tool to optionally continue streaming WALs until the
    standby starts doing it itself, so that wal_keep_segments is really
    deprecated. No idea how feasible that is, though.

    Regards,
    --
    Dimitri Fontaine
    http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
  • Heikki Linnakangas at Jan 6, 2011 at 9:09 am

    On 06.01.2011 00:27, Dimitri Fontaine wrote:
    Magnus Hagander<magnus@hagander.net> writes:
    What about pg_streamrecv | gzip > …, which has the big advantage of
    That's part of what I meant by "easier and more useful".
    Well…
    One thing to keep in mind is that if you do compression in libpq for the
    transfer, and gzip the tar file in the client, that's quite inefficient.
    You compress the data once in the server, decompress in the client, then
    compress it again in the client. If you're going to write the backup to
    a compressed file, and you want to transfer it compressed to save
    bandwidth, you want to gzip it in the server to begin with.

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Magnus Hagander at Jan 6, 2011 at 4:30 pm

    On Wed, Jan 5, 2011 at 23:27, Dimitri Fontaine wrote:
    Magnus Hagander <magnus@hagander.net> writes:
    Well, I would guess that if you're streaming the WAL files in parallel
    while the base backup is taken, then you're able to have it all without
    an archiving setup, and the server could still recycle them.
    Yes, this was mostly for the use-case of "getting a single tarfile
    that you can actually use to restore from without needing the log
    archive at all".
    It also allows for a simpler kick-start procedure for preparing a
    standby, and lets you stop worrying too much about wal_keep_segments
    and archive servers.

    When does the standby launch its walreceiver? It would be extra-nice for
    the base backup tool to optionally continue streaming WALs until the
    standby starts doing it itself, so that wal_keep_segments is really
    deprecated.  No idea how feasible that is, though.
    I think we're inventing a whole lot of complexity that may not
    be necessary at all. Let's do it the simple way and see how far we can
    get with that - we can always improve this for 9.2.
  • Tatsuo Ishii at Jan 16, 2011 at 1:32 am

    When does the standby launch its walreceiver? It would be extra-nice for
    the base backup tool to optionally continue streaming WALs until the
    standby starts doing it itself, so that wal_keep_segments is really
    deprecated. No idea how feasible that is, though.
    Good point. I have always been wondering why we can't use the existing WAL
    transport infrastructure for sending/receiving WAL archive
    segments in streaming replication.
    If my memory serves, Fujii already proposed such an idea but it was
    rejected for some reason I don't understand.
    --
    Tatsuo Ishii
    SRA OSS, Inc. Japan
    English: http://www.sraoss.co.jp/index_en.php
    Japanese: http://www.sraoss.co.jp
  • Robert Haas at Jan 17, 2011 at 2:15 am

    On Sat, Jan 15, 2011 at 8:33 PM, Tatsuo Ishii wrote:
    When does the standby launch its walreceiver? It would be extra-nice for
    the base backup tool to optionally continue streaming WALs until the
    standby starts doing it itself, so that wal_keep_segments is really
    deprecated.  No idea how feasible that is, though.
    Good point. I have always been wondering why we can't use the existing WAL
    transport infrastructure for sending/receiving WAL archive
    segments in streaming replication.
    If my memory serves, Fujii already proposed such an idea but it was
    rejected for some reason I don't understand.
    I must be confused, because you can use archive_command/restore_command
    to transport WAL segments, in conjunction with streaming replication.

    What Fujii-san unsuccessfully proposed was to have the master restore
    segments from the archive and stream them to clients, on request. It
    was deemed better to have the slave obtain them from the archive
    directly.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Tatsuo Ishii at Jan 17, 2011 at 2:31 am

    Good point. I have always been wondering why we can't use the existing WAL
    transport infrastructure for sending/receiving WAL archive
    segments in streaming replication.
    If my memory serves, Fujii already proposed such an idea but it was
    rejected for some reason I don't understand.
    I must be confused, because you can use archive_command/restore_command
    to transport WAL segments, in conjunction with streaming replication.
    Yes, but using restore_command is not terribly convenient. On
    Linux/UNIX systems you have to enable ssh access, which is extremely
    hard on Windows.

    IMO streaming replication is not yet easy enough for ordinary users to
    set up. Making base backups easier has already been proposed, and I
    think that's good. Why don't we go a step beyond that?
    What Fujii-san unsuccessfully proposed was to have the master restore
    segments from the archive and stream them to clients, on request. It
    was deemed better to have the slave obtain them from the archive
    directly.
    Did Fujii-san agree with the conclusion?
    --
    Tatsuo Ishii
    SRA OSS, Inc. Japan
    English: http://www.sraoss.co.jp/index_en.php
    Japanese: http://www.sraoss.co.jp
  • Fujii Masao at Jan 17, 2011 at 3:00 am

    On Mon, Jan 17, 2011 at 11:32 AM, Tatsuo Ishii wrote:
    Good point. I have always been wondering why we can't use the existing WAL
    transport infrastructure for sending/receiving WAL archive
    segments in streaming replication.
    If my memory serves, Fujii already proposed such an idea but it was
    rejected for some reason I don't understand.
    I must be confused, because you can use archive_command/restore_command
    to transport WAL segments, in conjunction with streaming replication.
    Yes, but using restore_command is not terribly convenient. On
    Linux/UNIX systems you have to enable ssh access, which is extremely
    hard on Windows.
    Agreed.
    IMO streaming replication is not yet easy enough for ordinary users to
    set up. Making base backups easier has already been proposed, and I
    think that's good. Why don't we go a step beyond that?
    What Fujii-san unsuccessfully proposed was to have the master restore
    segments from the archive and stream them to clients, on request.  It
    was deemed better to have the slave obtain them from the archive
    directly.
    Did Fujii-san agree with the conclusion?
    No. If that conclusion were true, we would not need a streaming backup feature.

    Regards,

    --
    Fujii Masao
    NIPPON TELEGRAPH AND TELEPHONE CORPORATION
    NTT Open Source Software Center
  • Magnus Hagander at Jan 17, 2011 at 6:52 am

    On Mon, Jan 17, 2011 at 03:32, Tatsuo Ishii wrote:
    Good point. I have always been wondering why we can't use the existing WAL
    transport infrastructure for sending/receiving WAL archive
    segments in streaming replication.
    If my memory serves, Fujii already proposed such an idea but it was
    rejected for some reason I don't understand.
    I must be confused, because you can use archive_command/restore_command
    to transport WAL segments, in conjunction with streaming replication.
    Yes, but using restore_command is not terribly convenient. On
    Linux/UNIX systems you have to enable ssh access, which is extremely
    hard on Windows.
    Agreed.

    IMO streaming replication is not yet easy enough for ordinary users to
    set up. Making base backups easier has already been proposed, and I
    think that's good. Why don't we go a step beyond that?
    With pg_basebackup, you can set up streaming replication in what's
    basically a single command (run the base backup, copy in a
    recovery.conf file). In my first version I even had a switch that
    would create the recovery.conf file for you - should we bring that
    back?

    It does require you to set a "reasonable" wal_keep_segments,
    but that's really all you need to do on the master side.

    What Fujii-san unsuccessfully proposed was to have the master restore
    segments from the archive and stream them to clients, on request.  It
    was deemed better to have the slave obtain them from the archive
    directly.
    Did Fujii-san agree with the conclusion?
    I can see the point of the master being able to do this, but it
    seems like a pretty narrow use case, really. I think we invented
    wal_keep_segments partially to solve this problem in a neater way?
  • Dimitri Fontaine at Jan 17, 2011 at 5:15 pm

    Magnus Hagander writes:
    With pg_basebackup, you can set up streaming replication in what's
    basically a single command (run the base backup, copy in a
    recovery.conf file). In my first version I even had a switch that
    would create the recovery.conf file for you - should we bring that
    back?
    +1. Well, make it optional maybe?
    It does require you to set a "reasonable" wal_keep_segments,
    but that's really all you need to do on the master side.
    Until we get integrated WAL streaming while the base backup is ongoing.
    We don't know when that is (9.1 or future), but that's what we're aiming
    for now, right?
    What Fujii-san unsuccessfully proposed was to have the master restore
    segments from the archive and stream them to clients, on request.  It
    was deemed better to have the slave obtain them from the archive
    directly.
    Did Fujii-san agree with the conclusion?
    I can see the point of the master being able to do this, but it
    seems like a pretty narrow use case, really. I think we invented
    wal_keep_segments partially to solve this problem in a neater way?
    Well, I still think that the easiest setup we can offer here is to ship
    with integrated libpq-based archive and restore commands. Those could
    be bin/pg_walsender and bin/pg_walreceiver. They would have some
    switches to make them suitable for running in subprocesses of either the
    base backup utility or the default libpq-based archive daemon.

    Again, all of that is not necessarily material for 9.1, despite having all
    the pieces already coded and tested, mainly in Magnus's hands. But could
    we get agreement on going this route?

    Regards,
    --
    Dimitri Fontaine
    http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
  • Magnus Hagander at Jan 17, 2011 at 5:24 pm

    On Mon, Jan 17, 2011 at 11:18, Dimitri Fontaine wrote:
    Magnus Hagander <magnus@hagander.net> writes:
    With pg_basebackup, you can set up streaming replication in what's
    basically a single command (run the base backup, copy in a
    recovery.conf file). In my first version I even had a switch that
    would create the recovery.conf file for you - should we bring that
    back?
    +1.  Well, make it optional maybe?
    It has always been optional. Basically it just creates a recovery.conf file with
    primary_conninfo=<whatever pg_streamrecv was using>
    standby_mode=on

    It does require you to set a "reasonable" wal_keep_segments,
    but that's really all you need to do on the master side.
    Until we get integrated WAL streaming while the base backup is ongoing.
    We don't know when that is (9.1 or future), but that's what we're aiming
    for now, right?
    Yeah, it does sound like a plan. But to still allow both - streaming
    it in parallel will eat two connections, and I'm sure some people
    might consider that a higher cost.

    What Fujii-san unsuccessfully proposed was to have the master restore
    segments from the archive and stream them to clients, on request.  It
    was deemed better to have the slave obtain them from the archive
    directly.
    Did Fujii-san agree with the conclusion?
    I can see the point of the master being able to do this, but it
    seems like a pretty narrow use case, really. I think we invented
    wal_keep_segments partially to solve this problem in a neater way?
    Well, I still think that the easiest setup we can offer here is to ship
    with integrated libpq-based archive and restore commands. Those could
    be bin/pg_walsender and bin/pg_walreceiver. They would have some
    switches to make them suitable for running in subprocesses of either the
    base backup utility or the default libpq-based archive daemon.
    Not sure why they'd run as an archive command and not like now as a
    replication client - but let's keep that out of this thread and in a
    new one :)
  • Dimitri Fontaine at Jan 17, 2011 at 7:31 pm

    Magnus Hagander writes:
    Until we get integrated WAL streaming while the base backup is ongoing.
    We don't know when that is (9.1 or future), but that's what we're aiming
    for now, right?
    Yeah, it does sound like a plan. But to still allow both - streaming
    it in parallel will eat two connections, and I'm sure some people
    might consider that a higher cost.
    Sure. Ah, tradeoffs :)
    Well, I still think that the easiest setup we can offer here is to ship
    with integrated libpq-based archive and restore commands. Those could
    be bin/pg_walsender and bin/pg_walreceiver. They would have some
    switches to make them suitable for running in subprocesses of either the
    base backup utility or the default libpq-based archive daemon.
    Not sure why they'd run as an archive command and not like now as a
    replication client - but let's keep that out of this thread and in a
    new one :)
    On the archive side you're right that it's not necessary, but it would
    be for the restore side. Sure enough, thinking about it some
    more, what we would like here is for the standby to be able to talk to
    the archive server (pg_streamsendrecv) rather than the primary, in order
    to offload it. OK, scratch all that and get cascading support instead :)

    Regards,
    --
    Dimitri Fontaine
    http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
  • Cédric Villemain at Jan 7, 2011 at 12:48 am

    2011/1/5 Magnus Hagander <magnus@hagander.net>:
    On Wed, Jan 5, 2011 at 22:58, Dimitri Fontaine wrote:
    Magnus Hagander <magnus@hagander.net> writes:
    * Stefan mentioned it might be useful to put some
    posix_fadvise(POSIX_FADV_DONTNEED)
    in the process that streams all the files out. Seems useful, as long as that
    doesn't kick them out of the cache *completely*, for other backends as well.
    Do we know if that is the case?
    Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
    not already in SHM?
    I think that's way more complex than we want to go here.
    DONTNEED will remove the block from the OS buffer every time.

    It should not be that hard to implement a snapshot (it needs mincore())
    and to restore the previous state. I don't know how basebackup is
    performed exactly... so perhaps I am wrong.

    posix_fadvise support is already in PostgreSQL core... we can start by
    just taking a snapshot of the files before starting, or at some point
    during the basebackup; it will need only 256kB per GB of data...
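
    A sketch of that snapshot idea, assuming Linux mincore() semantics (one
    byte per page, which is where the 256kB per GB comes from); in a real
    patch the DONTNEED calls should be coalesced into runs rather than issued
    page by page:

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Before streaming: record which pages of the file are already
     * resident.  Returns a malloc'd vector of one byte per page. */
    static unsigned char *
    snapshot_cache(int fd, size_t filesize)
    {
        long            pagesize = sysconf(_SC_PAGESIZE);
        size_t          npages = (filesize + pagesize - 1) / pagesize;
        unsigned char  *vec = malloc(npages);
        void           *map = mmap(NULL, filesize, PROT_READ, MAP_SHARED, fd, 0);

        if (map == MAP_FAILED || mincore(map, filesize, vec) != 0)
        {
            free(vec);
            vec = NULL;
        }
        if (map != MAP_FAILED)
            munmap(map, filesize);
        return vec;
    }

    /* After streaming: evict only the pages we pulled in ourselves. */
    static void
    restore_cache_state(int fd, size_t filesize, const unsigned char *vec)
    {
        long    pagesize = sysconf(_SC_PAGESIZE);
        size_t  npages = (filesize + pagesize - 1) / pagesize;
        size_t  i;

        for (i = 0; i < npages; i++)
            if (!(vec[i] & 1))
                posix_fadvise(fd, (off_t) i * pagesize, pagesize,
                              POSIX_FADV_DONTNEED);
    }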
    --
    Cédric Villemain               2ndQuadrant
    http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
  • Magnus Hagander at Jan 7, 2011 at 2:47 pm

    On Fri, Jan 7, 2011 at 01:47, Cédric Villemain wrote:
    2011/1/5 Magnus Hagander <magnus@hagander.net>:
    On Wed, Jan 5, 2011 at 22:58, Dimitri Fontaine wrote:
    Magnus Hagander <magnus@hagander.net> writes:
    * Stefan mentioned it might be useful to put some
    posix_fadvise(POSIX_FADV_DONTNEED)
    in the process that streams all the files out. Seems useful, as long as that
    doesn't kick them out of the cache *completely*, for other backends as well.
    Do we know if that is the case?
    Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
    not already in SHM?
    I think that's way more complex than we want to go here.
    DONTNEED will remove the block from the OS buffer every time.
    Then we definitely don't want to use it - because some other backend
    might well want the file. Better leave it up to the standard logic in
    the kernel.
    It should not be that hard to implement a snapshot (it needs mincore())
    and to restore the previous state. I don't know how basebackup is
    performed exactly... so perhaps I am wrong.
    Uh, it just reads the files out of the filesystem. Just like you'd do
    today, except it's now integrated and streams the data across a
    regular libpq connection.
  • Cédric Villemain at Jan 9, 2011 at 10:34 pm

    2011/1/7 Magnus Hagander <magnus@hagander.net>:
    On Fri, Jan 7, 2011 at 01:47, Cédric Villemain
    wrote:
    2011/1/5 Magnus Hagander <magnus@hagander.net>:
    On Wed, Jan 5, 2011 at 22:58, Dimitri Fontaine wrote:
    Magnus Hagander <magnus@hagander.net> writes:
    * Stefan mentioned it might be useful to put some
    posix_fadvise(POSIX_FADV_DONTNEED)
    in the process that streams all the files out. Seems useful, as long as that
    doesn't kick them out of the cache *completely*, for other backends as well.
    Do we know if that is the case?
    Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
    not already in SHM?
    I think that's way more complex than we want to go here.
    DONTNEED will remove the block from the OS buffer every time.
    Then we definitely don't want to use it - because some other backend
    might well want the file. Better leave it up to the standard logic in
    the kernel.
    Looking at the patch, it is (very) easy to add support for that in
    basebackup.c.
    That supposes allowing mincore(), and thus mmap(), so probably switching
    the fopen() to an open() (or adding an open() just for the mmap
    requirement...)

    Let's go?
    It should not be that hard to implement a snapshot (it needs mincore())
    and to restore the previous state. I don't know how basebackup is
    performed exactly... so perhaps I am wrong.
    Uh, it just reads the files out of the filesystem. Just like you'd do
    today, except it's now integrated and streams the data across a
    regular libpq connection.

    --
    Magnus Hagander
    Me: http://www.hagander.net/
    Work: http://www.redpill-linpro.com/


    --
    Cédric Villemain               2ndQuadrant
    http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
  • Magnus Hagander at Jan 10, 2011 at 2:09 pm

    On Sun, Jan 9, 2011 at 23:33, Cédric Villemain wrote:
    2011/1/7 Magnus Hagander <magnus@hagander.net>:
    On Fri, Jan 7, 2011 at 01:47, Cédric Villemain
    wrote:
    2011/1/5 Magnus Hagander <magnus@hagander.net>:
    On Wed, Jan 5, 2011 at 22:58, Dimitri Fontaine wrote:
    Magnus Hagander <magnus@hagander.net> writes:
    * Stefan mentioned it might be useful to put some
    posix_fadvise(POSIX_FADV_DONTNEED)
    in the process that streams all the files out. Seems useful, as long as that
    doesn't kick them out of the cache *completely*, for other backends as well.
    Do we know if that is the case?
    Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
    not already in SHM?
    I think that's way more complex than we want to go here.
    DONTNEED will remove the block from the OS buffer every time.
    Then we definitely don't want to use it - because some other backend
    might well want the file. Better leave it up to the standard logic in
    the kernel.
    Looking at the patch, it is (very) easy to add support for that in
    basebackup.c.
    That supposes allowing mincore(), and thus mmap(), so probably switching
    the fopen() to an open() (or adding an open() just for the mmap
    requirement...)

    Let's go?
    Per above, I still don't think we *should* do this. We don't want to
    kick things out of the cache underneath other backends, and we
    can't control that. Either way, it shouldn't happen in the beginning,
    and if it does, it should be backed with proper benchmarks.

    I've committed the backend side of this, without that. Still working
    on the client, and on cleaning up Heikki's patch for grammar/parser
    support.
  • Cédric Villemain at Jan 10, 2011 at 7:13 pm

    2011/1/10 Magnus Hagander <magnus@hagander.net>:
    On Sun, Jan 9, 2011 at 23:33, Cédric Villemain
    wrote:
    2011/1/7 Magnus Hagander <magnus@hagander.net>:
    On Fri, Jan 7, 2011 at 01:47, Cédric Villemain
    wrote:
    2011/1/5 Magnus Hagander <magnus@hagander.net>:
    On Wed, Jan 5, 2011 at 22:58, Dimitri Fontaine wrote:
    Magnus Hagander <magnus@hagander.net> writes:
    * Stefan mentioned it might be useful to put some
    posix_fadvise(POSIX_FADV_DONTNEED)
    in the process that streams all the files out. Seems useful, as long as that
    doesn't kick them out of the cache *completely*, for other backends as well.
    Do we know if that is the case?
    Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
    not already in SHM?
    I think that's way more complex than we want to go here.
    DONTNEED will remove the block from the OS buffer every time.
    Then we definitely don't want to use it - because some other backend
    might well want the file. Better leave it up to the standard logic in
    the kernel.
    Looking at the patch, it is (very) easy to add support for that in
    basebackup.c.
    That supposes allowing mincore(), and thus mmap(), so probably switching
    the fopen() to an open() (or adding an open() just for the mmap
    requirement...)

    Let's go?
    Per above, I still don't think we *should* do this. We don't want to
    kick things out of the cache underneath other backends, and we
    can't control that.
    We are dropping stuff underneath other backends anyway, but I
    understand your point.
    Either way, it shouldn't happen in the beginning, and if it does, it
    should be backed with proper benchmarks.
    I agree.
    I've committed the backend side of this, without that. Still working
    on the client, and on cleaning up Heikki's patch for grammar/parser
    support.
    --
    Cédric Villemain               2ndQuadrant
    http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
  • Stefan Kaltenbrunner at Jan 10, 2011 at 8:48 pm

    On 01/10/2011 08:13 PM, Cédric Villemain wrote:
    2011/1/10 Magnus Hagander<magnus@hagander.net>:
    On Sun, Jan 9, 2011 at 23:33, Cédric Villemain
    wrote:
    2011/1/7 Magnus Hagander<magnus@hagander.net>:
    On Fri, Jan 7, 2011 at 01:47, Cédric Villemain
    wrote:
    2011/1/5 Magnus Hagander<magnus@hagander.net>:
    On Wed, Jan 5, 2011 at 22:58, Dimitri Fontainewrote:
    Magnus Hagander<magnus@hagander.net> writes:
    * Stefan mentioned it might be useful to put some
    posix_fadvise(POSIX_FADV_DONTNEED)
    in the process that streams all the files out. Seems useful, as long as that
    doesn't kick them out of the cache *completely*, for other backends as well.
    Do we know if that is the case?
    Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
    not already in SHM?
    I think that's way more complex than we want to go here.
    DONTNEED will remove the block from the OS buffer every time.
    Then we definitely don't want to use it - because some other backend
    might well want the file. Better leave it up to the standard logic in
    the kernel.
    Looking at the patch, it is (very) easy to add support for that in
    basebackup.c.
    That supposes allowing mincore(), and thus mmap(), so probably switching
    the fopen() to an open() (or adding an open() just for the mmap
    requirement...)

    Let's go?
    Per above, I still don't think we *should* do this. We don't want to
    kick things out of the cache underneath other backends, and we
    can't control that.
    We are dropping stuff underneath other backends anyway, but I
    understand your point.
    Either way, it shouldn't happen in the beginning, and if it does, it
    should be backed with proper benchmarks.
    I agree.
    Well, I want to point out that the link I provided upthread actually
    provides a (Linux-centric) way to get the property of interest for this:

    * if the data blocks are in the OS buffer cache, just leave them alone; if
    they are NOT, tell the OS that "this current user" is not interested in
    having them there

    I would like to see something like that implemented in the backend
    sometime, and maybe even as a GUC of some sort; that way we actually
    could use it for, say, a pg_dump run as well. I have seen the
    response times of big boxes tank not because of the CPU and lock load
    pg_dump imposes, but because of the way that it can cause the
    OS buffer cache to get spoiled with not-really-important data.



    Anyway, I agree that the (positive and/or negative) effect of something
    like that needs to be measured, but this effect is not too easy to see in
    very simple setups...


    Stefan
  • Cédric Villemain at Jan 10, 2011 at 11:26 pm

    2011/1/10 Stefan Kaltenbrunner <stefan@kaltenbrunner.cc>:
    On 01/10/2011 08:13 PM, Cédric Villemain wrote:

    2011/1/10 Magnus Hagander<magnus@hagander.net>:
    On Sun, Jan 9, 2011 at 23:33, Cédric Villemain
    wrote:
    2011/1/7 Magnus Hagander<magnus@hagander.net>:
    On Fri, Jan 7, 2011 at 01:47, Cédric Villemain
    wrote:
    2011/1/5 Magnus Hagander<magnus@hagander.net>:
    On Wed, Jan 5, 2011 at 22:58, Dimitri
    Fontainewrote:
    Magnus Hagander<magnus@hagander.net>  writes:
    * Stefan mentioned it might be useful to put some
    posix_fadvise(POSIX_FADV_DONTNEED)
    in the process that streams all the files out. Seems useful, as
    long as that
    doesn't kick them out of the cache *completely*, for other
    backends as well.
    Do we know if that is the case?
    Maybe have a look at pgfincore to only tag DONTNEED for blocks that
    are
    not already in SHM?
    I think that's way more complex than we want to go here.
    DONTNEED will remove the block from the OS buffer every time.
    Then we definitely don't want to use it - because some other backend
    might well want the file. Better leave it up to the standard logic in
    the kernel.
    Looking at the patch, it is (very) easy to add support for that in
    basebackup.c.
    That supposes allowing mincore(), and thus mmap(), so probably switching
    the fopen() to an open() (or adding an open() just for the mmap
    requirement...)

    Let's go?
    Per above, I still don't think we *should* do this. We don't want to
    kick things out of the cache underneath other backends, and we
    can't control that.
    We are dropping stuff underneath other backends anyway, but I
    understand your point.
    Either way, it shouldn't happen in the beginning, and if it does, it
    should be backed with proper benchmarks.
    I agree.
    Well, I want to point out that the link I provided upthread actually provides
    a (Linux-centric) way to get the property of interest for this:
    Yes, it is exactly what we are talking about here:
    mincore and posix_fadvise.

    FreeBSD should allow that later; at least it is on the todo list.
    Windows may allow that too, with a different API.
    * if the data blocks are in the OS buffer cache, just leave them alone; if
    they are NOT, tell the OS that "this current user" is not interested in
    having them there
    My experience is that posix_fadvise on a specific block behaves more
    brutally than flagging a whole file. In the latter case it may not do
    what you want if the kernel estimates the advice is not welcome (because
    of other IO requests).

    What Magnus points out is that other backends execute queries and
    request blocks (and load them into PostgreSQL's shared buffers), and it
    is *hard* to be sure we don't remove blocks just loaded by another
    backend (the worst case being flushing prefetched blocks not yet in
    shared buffers, cf. effective_io_concurrency).
    I would like to see something like that implemented in the backend sometime,
    and maybe even as a GUC of some sort; that way we actually could use it
    for, say, a pg_dump run as well. I have seen the response times of big boxes
    tank not because of the CPU and lock load pg_dump imposes, but because of the
    way that it can cause the OS buffer cache to get spoiled with
    not-really-important data.
    Glad to hear that; pgfincore is also a POC about those topics.
    The best solution is to mmap in postgres, but it is not possible, so we
    have to take snapshots of objects and restore them afterwards (again, *it
    is* what Tobias does with his rsync). Side note: because of readahead,
    inspecting block by block while you read the file gives bad results (or
    you need to fadvise POSIX_FADV_RANDOM to remove the readahead behavior,
    which is not good at all).
    Anyway, I agree that the (positive and/or negative) effect of something like
    that needs to be measured, but this effect is not too easy to see in very
    simple setups...
    Yes. And with pg_basebackup, copying 1GB over the network takes longer
    than 2 seconds, so we will probably need to have a specific strategy.


    --
    Cédric Villemain               2ndQuadrant
    http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
  • Cédric Villemain at Jan 11, 2011 at 12:29 am

    2011/1/10 Magnus Hagander <magnus@hagander.net>:
    On Sun, Jan 9, 2011 at 23:33, Cédric Villemain
    wrote:
    2011/1/7 Magnus Hagander <magnus@hagander.net>:
    On Fri, Jan 7, 2011 at 01:47, Cédric Villemain
    wrote:
    2011/1/5 Magnus Hagander <magnus@hagander.net>:
    On Wed, Jan 5, 2011 at 22:58, Dimitri Fontaine wrote:
    Magnus Hagander <magnus@hagander.net> writes:
    * Stefan mentioned it might be useful to put some
    posix_fadvise(POSIX_FADV_DONTNEED)
    in the process that streams all the files out. Seems useful, as long as that
    doesn't kick them out of the cache *completely*, for other backends as well.
    Do we know if that is the case?
    Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
    not already in SHM?
    I think that's way more complex than we want to go here.
    DONTNEED will remove the block from the OS buffer every time.
    Then we definitely don't want to use it - because some other backend
    might well want the file. Better leave it up to the standard logic in
    the kernel.
    Looking at the patch, it is (very) easy to add support for that in
    basebackup.c.
    That supposes allowing mincore(), and thus mmap(), so probably switching
    the fopen() to an open() (or adding an open() just for the mmap
    requirement...)

    Let's go?
    Per above, I still don't think we *should* do this. We don't want to
    kick things out of the cache underneath other backends, and we
    can't control that. Either way, it shouldn't happen in the beginning,
    and if it does, it should be backed with proper benchmarks.

    I've committed the backend side of this, without that. Still working
    on the client, and on cleaning up Heikki's patch for grammar/parser
    support.
    Attached is a small patch fixing "-d basedir" when it's called with an
    absolute path.
    Maybe we can use pg_mkdir_p() instead of mkdir?


    --
    Cédric Villemain               2ndQuadrant
    http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
  • Magnus Hagander at Jan 11, 2011 at 8:44 am

    On Tue, Jan 11, 2011 at 01:28, Cédric Villemain wrote:
    2011/1/10 Magnus Hagander <magnus@hagander.net>:
    On Sun, Jan 9, 2011 at 23:33, Cédric Villemain
    wrote:
    I've committed the backend side of this, without that. Still working
    on the client, and on cleaning up Heikki's patch for grammar/parser
    support.
    Attached is a small patch fixing "-d basedir" when it's called with an
    absolute path.
    Maybe we can use pg_mkdir_p() instead of mkdir?
    Heh, that was actually a hack to be able to run pg_basebackup on the
    same machine as the database with the tablespaces. It will be removed
    before commit :-) (It was also in the wrong place to work; I realize I
    managed to break it in a refactor.) I've put in a big ugly comment to
    make sure it gets removed :-)

    And yes, using pg_mkdir_p() is good. I used to do that; I think I
    removed it by mistake when it was supposed to be removed elsewhere.
    I've put it back.
  • Garick Hamlin at Jan 11, 2011 at 3:55 pm

    On Mon, Jan 10, 2011 at 09:09:28AM -0500, Magnus Hagander wrote:
    On Sun, Jan 9, 2011 at 23:33, Cédric Villemain
    wrote:
    2011/1/7 Magnus Hagander <magnus@hagander.net>:
    On Fri, Jan 7, 2011 at 01:47, Cédric Villemain
    wrote:
    2011/1/5 Magnus Hagander <magnus@hagander.net>:
    On Wed, Jan 5, 2011 at 22:58, Dimitri Fontaine wrote:
    Magnus Hagander <magnus@hagander.net> writes:
    * Stefan mentioned it might be useful to put some
    posix_fadvise(POSIX_FADV_DONTNEED)
    in the process that streams all the files out. Seems useful, as long as that
    doesn't kick them out of the cache *completely*, for other backends as well.
    Do we know if that is the case?
    Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
    not already in SHM?
    I think that's way more complex than we want to go here.
    DONTNEED will remove the block from the OS buffer every time.
    Then we definitely don't want to use it - because some other backend
    might well want the file. Better leave it up to the standard logic in
    the kernel.
    Looking at the patch, it is (very) easy to add support for that in
    basebackup.c.
    That supposes allowing mincore(), and thus mmap(), so probably switching
    the fopen() to an open() (or adding an open() just for the mmap
    requirement...)

    Let's go?
    Per above, I still don't think we *should* do this. We don't want to
    kick things out of the cache underneath other backends, and we
    can't control that. Either way, it shouldn't happen in the beginning,
    and if it does, it should be backed with proper benchmarks.
    Another option that occurs to me is to use direct IO (or another
    means as needed) to bypass the cache. So rather than kicking pages out of
    the cache, we just attempt not to pollute it, by bypassing it for cold
    pages and using either normal IO for 'hot' pages, or a 'read()' to "heat"
    the cache afterward.
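
    A rough sketch of what that could mean in code, assuming Linux-style
    O_DIRECT (which needs an aligned buffer) with a fallback to cached reads
    where it isn't supported; per downthread, whether this works at all is
    platform- and filesystem-dependent:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BUF_ALIGN 4096      /* typical O_DIRECT alignment requirement */

    /* Open a data file for streaming without polluting the OS cache.
     * Returns the fd and an aligned buffer of bufsize bytes in *bufp. */
    static int
    open_bypassing_cache(const char *path, void **bufp, size_t bufsize)
    {
        int fd = open(path, O_RDONLY | O_DIRECT);

        if (fd < 0)
            fd = open(path, O_RDONLY);      /* fall back to cached I/O */
        if (fd < 0)
            return -1;

        if (posix_memalign(bufp, BUF_ALIGN, bufsize) != 0)
        {
            close(fd);
            return -1;
        }
        return fd;
    }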

    Garick
    I've committed the backend side of this, without that. Still working
    on the client, and on cleaning up Heikki's patch for grammar/parser
    support.

    --
    Magnus Hagander
    Me: http://www.hagander.net/
    Work: http://www.redpill-linpro.com/

  • Cédric Villemain at Jan 11, 2011 at 4:39 pm

    2011/1/11 Garick Hamlin <ghamlin@isc.upenn.edu>:
    On Mon, Jan 10, 2011 at 09:09:28AM -0500, Magnus Hagander wrote:
    On Sun, Jan 9, 2011 at 23:33, Cédric Villemain
    wrote:
    2011/1/7 Magnus Hagander <magnus@hagander.net>:
    On Fri, Jan 7, 2011 at 01:47, Cédric Villemain
    wrote:
    2011/1/5 Magnus Hagander <magnus@hagander.net>:
    On Wed, Jan 5, 2011 at 22:58, Dimitri Fontaine wrote:
    Magnus Hagander <magnus@hagander.net> writes:
    * Stefan mentioned it might be useful to put some
    posix_fadvise(POSIX_FADV_DONTNEED)
    in the process that streams all the files out. Seems useful, as long as that
    doesn't kick them out of the cache *completely*, for other backends as well.
    Do we know if that is the case?
    Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
    not already in SHM?
    I think that's way more complex than we want to go here.
    DONTNEED will remove the block from the OS buffer every time.
    Then we definitely don't want to use it - because some other backend
    might well want the file. Better leave it up to the standard logic in
    the kernel.
    Looking at the patch, it is (very) easy to add support for that in
    basebackup.c.
    That supposes allowing mincore(), and thus mmap(), so probably switching
    the fopen() to an open() (or adding an open() just for the mmap
    requirement...)

    Let's go?
    Per above, I still don't think we *should* do this. We don't want to
    kick things out of the cache underneath other backends, and we
    can't control that. Either way, it shouldn't happen in the beginning,
    and if it does, it should be backed with proper benchmarks.
    Another option that occurs to me is to use direct IO (or another
    means as needed) to bypass the cache. So rather than kicking pages out of
    the cache, we just attempt not to pollute it, by bypassing it for cold
    pages and using either normal IO for 'hot' pages, or a 'read()' to "heat"
    the cache afterward.
    AFAIR, even Linus has rejected the idea of using it seriously, unless
    I'm misremembering.
    Garick
    I've committed the backend side of this, without that. Still working
    on the client, and on cleaning up Heikki's patch for grammar/parser
    support.

    --
    Magnus Hagander
    Me: http://www.hagander.net/
    Work: http://www.redpill-linpro.com/



    --
    Cédric Villemain               2ndQuadrant
    http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
  • Garick Hamlin at Jan 11, 2011 at 5:10 pm

    On Tue, Jan 11, 2011 at 11:39:20AM -0500, Cédric Villemain wrote:
    2011/1/11 Garick Hamlin <ghamlin@isc.upenn.edu>:
    On Mon, Jan 10, 2011 at 09:09:28AM -0500, Magnus Hagander wrote:
    On Sun, Jan 9, 2011 at 23:33, Cédric Villemain
    wrote:
    2011/1/7 Magnus Hagander <magnus@hagander.net>:
    On Fri, Jan 7, 2011 at 01:47, Cédric Villemain
    wrote:
    2011/1/5 Magnus Hagander <magnus@hagander.net>:
    On Wed, Jan 5, 2011 at 22:58, Dimitri Fontaine wrote:
    Magnus Hagander <magnus@hagander.net> writes:
    * Stefan mentioned it might be useful to put some
    posix_fadvise(POSIX_FADV_DONTNEED)
    in the process that streams all the files out. Seems useful, as long as that
    doesn't kick them out of the cache *completely*, for other backends as well.
    Do we know if that is the case?
    Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
    not already in SHM?
    I think that's way more complex than we want to go here.
    DONTNEED will remove the block from the OS buffer every time.
    Then we definitely don't want to use it - because some other backend
    might well want the file. Better leave it up to the standard logic in
    the kernel.
    Looking at the patch, it is (very) easy to add support for that in
    basebackup.c.
    That supposes allowing mincore(), and thus mmap(), so probably switching
    the fopen() to an open() (or adding an open() just for the mmap
    requirement...)

    Let's go?
    Per above, I still don't think we *should* do this. We don't want to
    kick things out of the cache underneath other backends, and we
    can't control that. Either way, it shouldn't happen in the beginning,
    and if it does, it should be backed with proper benchmarks.
    Another option that occurs to me is to use direct IO (or another
    means as needed) to bypass the cache. So rather than kicking pages out of
    the cache, we just attempt not to pollute it, by bypassing it for cold
    pages and using either normal IO for 'hot' pages, or a 'read()' to "heat"
    the cache afterward.
    AFAIR, even Linus has rejected the idea of using it seriously, unless
    I'm misremembering.
    Direct IO is generally a pain.

    POSIX_FADV_NOREUSE is an alternative (I think). Realistically I wasn't sure which
    way(s) actually worked. My gut was that direct IO would likely work right on Linux
    and Solaris, at least. If POSIX_FADV_NOREUSE works then maybe that is the answer
    instead, but I haven't tested either.

    Garick

    Garick
    I've committed the backend side of this, without that. Still working
    on the client, and on cleaning up Heikki's patch for grammar/parser
    support.

    --
    Magnus Hagander
    Me: http://www.hagander.net/
    Work: http://www.redpill-linpro.com/



    --
    Cédric Villemain               2ndQuadrant
    http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
  • Florian Pflug at Jan 11, 2011 at 5:27 pm

    On Jan11, 2011, at 18:09 , Garick Hamlin wrote:
    My gut was that direct io would likely work right on Linux
    and Solaris, at least.
    Didn't we discover recently that O_DIRECT fails for ext4 on Linux
    if ordered=data, or something like that?

    best regards,
    Florian Pflug
  • Tom Lane at Jan 11, 2011 at 5:45 pm

    Florian Pflug writes:
    On Jan11, 2011, at 18:09 , Garick Hamlin wrote:
    My gut was that direct io would likely work right on Linux
    and Solaris, at least.
    Didn't we discover recently that O_DIRECT fails for ext4 on Linux
    if ordered=data, or something like that?
    Quite. Blithe assertions that something like this "should work" aren't
    worth the electrons they're written on.

    regards, tom lane
  • Garick Hamlin at Jan 11, 2011 at 6:26 pm

    On Tue, Jan 11, 2011 at 12:45:02PM -0500, Tom Lane wrote:
    Florian Pflug <fgp@phlo.org> writes:
    On Jan11, 2011, at 18:09 , Garick Hamlin wrote:
    My gut was that direct io would likely work right on Linux
    and Solaris, at least.
    Didn't we discover recently that O_DIRECT fails for ext4 on Linux
    if ordered=data, or something like that?
    Quite. Blithe assertions that something like this "should work" aren't
    worth the electrons they're written on.
    Indeed. I wasn't making such a claim, in case that wasn't clear. I believe,
    in fact, there is no single way that will work everywhere. This isn't
    needed for correctness, of course; it is merely a performance tweak, as
    long as the 'not working' case on platform + filesystem X degrades to
    something close to what would have happened if we hadn't tried. I expected
    POSIX_FADV_NOREUSE not to work on Linux, but I haven't looked at it recently,
    and not all systems are Linux, so I mentioned it. This was why I thought
    direct IO might be more realistic.

    I did not have a chance to test before I wrote this email, so I attempted to
    make my uncertainty clear. I _know_ it will not work in some environments,
    but I thought it was worth looking at whether it worked on more than one
    sane, common setup, though I can understand if you feel differently about that.

    Garick
    regards, tom lane
  • Cédric Villemain at Jan 11, 2011 at 5:28 pm

    2011/1/11 Garick Hamlin <ghamlin@isc.upenn.edu>:
    On Tue, Jan 11, 2011 at 11:39:20AM -0500, Cédric Villemain wrote:
    2011/1/11 Garick Hamlin <ghamlin@isc.upenn.edu>:
    On Mon, Jan 10, 2011 at 09:09:28AM -0500, Magnus Hagander wrote:
    On Sun, Jan 9, 2011 at 23:33, Cédric Villemain
    wrote:
    2011/1/7 Magnus Hagander <magnus@hagander.net>:
    On Fri, Jan 7, 2011 at 01:47, Cédric Villemain
    wrote:
    2011/1/5 Magnus Hagander <magnus@hagander.net>:
    On Wed, Jan 5, 2011 at 22:58, Dimitri Fontaine wrote:
    Magnus Hagander <magnus@hagander.net> writes:
    * Stefan mentioned it might be useful to put some
    posix_fadvise(POSIX_FADV_DONTNEED)
    in the process that streams all the files out. Seems useful, as long as that
    doesn't kick them out of the cache *completely*, for other backends as well.
    Do we know if that is the case?
    Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
    not already in SHM?
    I think that's way more complex than we want to go here.
    DONTNEED will remove the block from the OS buffer every time.
    Then we definitely don't want to use it - because some other backend
    might well want the file. Better leave it up to the standard logic in
    the kernel.
    Looking at the patch, it is (very) easy to add support for that in
    basebackup.c.
    That supposes allowing mincore(), and thus mmap(), so probably switching
    the fopen() to an open() (or adding an open() just for the mmap
    requirement...)

    Let's go?
    Per above, I still don't think we *should* do this. We don't want to
    kick things out of the cache underneath other backends, and we
    can't control that. Either way, it shouldn't happen in the beginning,
    and if it does, it should be backed with proper benchmarks.
    Another option that occurs to me is to use direct IO (or another
    means as needed) to bypass the cache. So rather than kicking pages out of
    the cache, we just attempt not to pollute it, by bypassing it for cold
    pages and using either normal IO for 'hot' pages, or a 'read()' to "heat"
    the cache afterward.
    AFAIR, even Linus has rejected the idea of using it seriously, unless
    I'm misremembering.
    Direct IO is generally a pain.

    POSIX_FADV_NOREUSE is an alternative (I think). Realistically, I wasn't sure
    which way(s) actually worked. My gut was that direct IO would likely work
    right on Linux and Solaris, at least. If POSIX_FADV_NOREUSE works, then
    maybe that is the answer instead, but I haven't tested either.
    Yes, it should be the best option; unfortunately it is a ghost flag, it
    doesn't do anything.
    At some point there were a libprefetch library and a Linux fincore()
    syscall in the air. Unfortunately, the actors behind those items stopped
    communicating with the open source community, AFAICS. (I didn't get
    answers myself, and neither did the Linux ML.)

    Garick
    I've committed the backend side of this, without that. Still working
    on the client, and on cleaning up Heikki's patch for grammar/parser
    support.

    --
    Magnus Hagander
    Me: http://www.hagander.net/
    Work: http://www.redpill-linpro.com/

    --
    Cédric Villemain               2ndQuadrant
    http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
  • Fujii Masao at Jan 12, 2011 at 9:39 am

    On Mon, Jan 10, 2011 at 11:09 PM, Magnus Hagander wrote:
    I've committed the backend side of this, without that. Still working
    on the client, and on cleaning up Heikki's patch for grammar/parser
    support.
    Great work!

    I have some comments:

    While walsender is sending a base backup, WalSndWakeup should
    not send the signal to that walsender?

    In sendFile or elsewhere, we should periodically check whether
    postmaster is alive and whether the flag was set by the signal?

    At the end of the backup by walsender, it forces a switch to a new
    WAL file and waits until the last WAL file has been archived. So we
    should change postmaster so that it doesn't cause the archiver to
    end before walsender ends when shutdown is requested?

    Also, when shutdown is requested, the walsender which is
    streaming WAL should not end before another walsender which
    is sending a backup ends, to stream the backup-end WAL?

    Regards,

    --
    Fujii Masao
    NIPPON TELEGRAPH AND TELEPHONE CORPORATION
    NTT Open Source Software Center
  • Magnus Hagander at Jan 13, 2011 at 7:13 pm

    On Wed, Jan 12, 2011 at 10:39, Fujii Masao wrote:
    On Mon, Jan 10, 2011 at 11:09 PM, Magnus Hagander wrote:
    I've committed the backend side of this, without that. Still working
    on the client, and on cleaning up Heikki's patch for grammar/parser
    support.
    Great work!

    I have some comments:

    While walsender is sending a base backup, WalSndWakeup should
    not send the signal to that walsender?
    True, it's not necessary. How badly does it actually hurt things, though?
    Given that the walsender running the backup isn't actually waiting on
    the latch, it doesn't actually send a signal, does it?

    In sendFile or elsewhere, we should periodically check whether
    postmaster is alive and whether the flag was set by the signal?
    That, however, we probably should.
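
    As a sketch of the kind of check being agreed to here (assuming the
    walsender's then-current PostmasterIsAlive() signature and its existing
    shutdown-request flag; the error message and message plumbing are
    simplified):

        /*
         * Inside the per-file read-and-send loop of sendFile(): abort the
         * backup promptly if the postmaster died or shutdown was requested,
         * instead of streaming gigabytes to a dying cluster.
         */
        while ((cnt = fread(buf, 1, sizeof(buf), fp)) > 0)
        {
            if (!PostmasterIsAlive(true) || walsender_shutdown_requested)
                ereport(ERROR,
                        (errmsg("base backup aborted: server is shutting down")));

            /* send the chunk to the client as a CopyData message */
            pq_putmessage('d', buf, cnt);
        }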

    At the end of the backup by walsender, it forces a switch to a new
    WAL file and waits until the last WAL file has been archived. So we
    should change postmaster so that it doesn't cause the archiver to
    end before walsender ends when shutdown is requested?
    Um. I have to admit I'm not entirely following what you mean enough to
    confirm it, but it *sounds* correct :-)

    What scenario exactly is the problematic one?

    Also, when shutdown is requested, the walsender which is
    streaming WAL should not end before another walsender which
    is sending a backup ends, to stream the backup-end WAL?
    Not sure I see the reason for that. If we're shutting down in the
    middle of the base backup, we don't have any support for continuing
    that one after we're back up - you have to start over.
  • Fujii Masao at Jan 14, 2011 at 6:46 am

    On Fri, Jan 14, 2011 at 4:13 AM, Magnus Hagander wrote:
    While walsender is sending a base backup, WalSndWakeup should
    not send the signal to that walsender?
    True, it's not necessary. How bad does it actually hurt things though?
    Given that the walsender running the backup isn't actually waiting on
    the latch, it doesn't actually send a signal, does it?
    Yeah, you are right. Once WalSndWakeup sends the signal to a walsender,
    latch->is_set is set, and WalSndWakeup does nothing further against that
    walsender until latch->is_set is reset. Since ResetLatch is not called
    while a walsender is sending a base backup, that would be harmless.
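
    Simplified from the latch implementation, this is why the repeated
    wakeups are harmless (a sketch, not the verbatim source):

        /*
         * Sketch of SetLatch(): once is_set is already true, further
         * wakeups are no-ops, so repeated WalSndWakeup() calls cannot hurt
         * a walsender that is busy sending a base backup and never calls
         * ResetLatch().
         */
        void
        SetLatch(volatile Latch *latch)
        {
            if (latch->is_set)
                return;             /* already set: nothing to do */

            latch->is_set = true;

            /* wake the owning process only on the first transition */
            if (latch->owner_pid != 0 && latch->owner_pid != MyProcPid)
                kill(latch->owner_pid, SIGUSR1);
        }
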
    At the end of the backup by walsender, it forces a switch to a new
    WAL file and waits until the last WAL file has been archived. So we
    should change postmaster so that it doesn't cause the archiver to
    end before walsender ends when shutdown is requested?
    Um. I have to admit I'm not entirely following what you mean enough to
    confirm it, but it *sounds* correct :-)

    What scenario exactly is the problematic one?
    1. Smart shutdown is requested while a walsender is sending a backup.
    2. Shutdown causes the archiver to end.
    (Though shutdown sends SIGUSR2 to tell the walsender to exit, a
    walsender running a backup doesn't respond to it at the moment.)
    3. At the end of the backup, the walsender calls do_pg_stop_backup,
    which forces a switch to a new WAL file and waits until the last WAL
    file has been archived.
    *BUT*, since the archiver is already dead, the walsender waits for
    that forever.
    Also, when shutdown is requested, the walsender which is
    streaming WAL should not end before another walsender which
    is sending a backup ends, to stream the backup-end WAL?
    Not sure I see the reason for that. If we're shutting down in the
    middle of the base backup, we don't have any support for continuing
    that one after we're back up - you have to start over.
    For now, shutdown is designed to cause the walsender to end after
    sending all the WAL records. So that was my thinking.

    Regards,

    --
    Fujii Masao
    NIPPON TELEGRAPH AND TELEPHONE CORPORATION
    NTT Open Source Software Center
  • Heikki Linnakangas at Jan 14, 2011 at 10:19 am

    On 14.01.2011 08:45, Fujii Masao wrote:
    On Fri, Jan 14, 2011 at 4:13 AM, Magnus Hagander wrote:
    At the end of the backup by walsender, it forces a switch to a new
    WAL file and waits until the last WAL file has been archived. So we
    should change postmaster so that it doesn't cause the archiver to
    end before walsender ends when shutdown is requested?
    Um. I have to admit I'm not entirely following what you mean enough to
    confirm it, but it *sounds* correct :-)

    What scenario exactly is the problematic one?
    1. Smart shutdown is requested while a walsender is sending a backup.
    2. Shutdown causes the archiver to end.
    (Though shutdown sends SIGUSR2 to tell the walsender to exit, a
    walsender running a backup doesn't respond to it at the moment.)
    3. At the end of the backup, the walsender calls do_pg_stop_backup,
    which forces a switch to a new WAL file and waits until the last WAL
    file has been archived.
    *BUT*, since the archiver is already dead, the walsender waits for
    that forever.
    Not only does it wait forever, but it writes the end-of-backup WAL
    record after bgwriter has already exited and written the shutdown
    checkpoint record.

    I think the postmaster should treat a walsender as a regular backend
    until it has started streaming.

    We can achieve that by starting up the child as PM_CHILD_ACTIVE and
    changing the state to PM_CHILD_WALSENDER later, when streaming has
    started. Looking at postmaster.c, that should be safe: the postmaster
    treats a backend as a regular backend anyway until it has connected to
    shared memory. It is *not* safe to switch a walsender back to a regular
    process, but we have no need to do that.
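
    A sketch of that one-way transition, following the existing PMChildFlags
    machinery in pmsignal.c (names as proposed; details simplified):

        /*
         * Promote this child from "regular backend" to "walsender" in the
         * postmaster's bookkeeping, once streaming has started.  The
         * reverse transition is never made.
         */
        void
        MarkPostmasterChildWalSender(void)
        {
            int     slot = MyPMChildSlot - 1;

            Assert(PMSignalState->PMChildFlags[slot] == PM_CHILD_ACTIVE);
            PMSignalState->PMChildFlags[slot] = PM_CHILD_WALSENDER;
        }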

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Magnus Hagander at Jan 14, 2011 at 11:38 am

    On Fri, Jan 14, 2011 at 11:19, Heikki Linnakangas wrote:
    On 14.01.2011 08:45, Fujii Masao wrote:

    On Fri, Jan 14, 2011 at 4:13 AM, Magnus Hagander <magnus@hagander.net> wrote:
    At the end of the backup by walsender, it forces a switch to a new
    WAL file and waits until the last WAL file has been archived. So we
    should change postmaster so that it doesn't cause the archiver to
    end before walsender ends when shutdown is requested?
    Um. I have to admit I'm not entirely following what you mean enough to
    confirm it, but it *sounds* correct :-)

    What scenario exactly is the problematic one?
    1. Smart shutdown is requested while a walsender is sending a backup.
    2. Shutdown causes the archiver to end.
    (Though shutdown sends SIGUSR2 to tell the walsender to exit, a
    walsender running a backup doesn't respond to it at the moment.)
    3. At the end of the backup, the walsender calls do_pg_stop_backup,
    which forces a switch to a new WAL file and waits until the last WAL
    file has been archived.
    *BUT*, since the archiver is already dead, the walsender waits for
    that forever.
    Not only does it wait forever, but it writes the end-of-backup WAL record
    after bgwriter has already exited and written the shutdown checkpoint
    record.

    I think the postmaster should treat a walsender as a regular backend
    until it has started streaming.

    We can achieve that by starting up the child as PM_CHILD_ACTIVE and
    changing the state to PM_CHILD_WALSENDER later, when streaming has
    started. Looking at postmaster.c, that should be safe: the postmaster
    treats a backend as a regular backend anyway until it has connected to
    shared memory. It is *not* safe to switch a walsender back to a regular
    process, but we have no need to do that.
    Seems reasonable to me.

    I've applied a patch that exits base backups when the postmaster is
    shutting down - I'm happily waiting for Heikki to submit one that
    changes the shutdown logic in the postmaster :-)
  • Heikki Linnakangas at Jan 15, 2011 at 2:42 pm

    On 14.01.2011 13:38, Magnus Hagander wrote:
    On Fri, Jan 14, 2011 at 11:19, Heikki Linnakangas
    wrote:
    On 14.01.2011 08:45, Fujii Masao wrote:
    1. Smart shutdown is requested while a walsender is sending a backup.
    2. Shutdown causes the archiver to end.
    (Though shutdown sends SIGUSR2 to tell the walsender to exit, a
    walsender running a backup doesn't respond to it at the moment.)
    3. At the end of the backup, the walsender calls do_pg_stop_backup,
    which forces a switch to a new WAL file and waits until the last WAL
    file has been archived.
    *BUT*, since the archiver is already dead, the walsender waits for
    that forever.
    Not only does it wait forever, but it writes the end-of-backup WAL record
    after bgwriter has already exited and written the shutdown checkpoint
    record.

    I think the postmaster should treat a walsender as a regular backend
    until it has started streaming.

    We can achieve that by starting up the child as PM_CHILD_ACTIVE and
    changing the state to PM_CHILD_WALSENDER later, when streaming has
    started. Looking at postmaster.c, that should be safe: the postmaster
    treats a backend as a regular backend anyway until it has connected to
    shared memory. It is *not* safe to switch a walsender back to a regular
    process, but we have no need to do that.
    Seems reasonable to me.

    I've applied a patch that exits base backups when the postmaster is
    shutting down - I'm happily waiting for Heikki to submit one that
    changes the shutdown logic in the postmaster :-)
    Ok, committed a fix for that.

    BTW, I just spotted a small race condition between creating a new
    tablespace and base backup. We take a snapshot of all the tablespaces in
    pg_tblspc before calling pg_start_backup(). If someone creates a new
    tablespace and puts some data in it in the window between the base
    backup acquiring the list of tablespaces and starting the backup, the
    new tablespace won't be included in the backup.

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Tom Lane at Jan 15, 2011 at 3:30 pm

    Heikki Linnakangas writes:
    BTW, I just spotted a small race condition between creating a new
    tablespace and base backup. We take a snapshot of all the tablespaces in
    pg_tblspc before calling pg_start_backup(). If someone creates a new
    tablespace and puts some data in it in the window between the base
    backup acquiring the list of tablespaces and starting the backup, the
    new tablespace won't be included in the backup.
    So what? The needed actions will be covered by WAL replay.

    regards, tom lane
  • Heikki Linnakangas at Jan 15, 2011 at 3:35 pm

    On 15.01.2011 17:30, Tom Lane wrote:
    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
    BTW, I just spotted a small race condition between creating a new
    tablespace and base backup. We take a snapshot of all the tablespaces in
    pg_tblspc before calling pg_start_backup(). If someone creates a new
    tablespace and puts some data in it in the window between the base
    backup acquiring the list of tablespaces and starting the backup, the
    new tablespace won't be included in the backup.
    So what? The needed actions will be covered by WAL replay.
    No, they won't, if pg_start_backup() is called *after* getting the list
    of tablespaces.

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Tom Lane at Jan 15, 2011 at 3:55 pm

    Heikki Linnakangas writes:
    On 15.01.2011 17:30, Tom Lane wrote:
    So what? The needed actions will be covered by WAL replay.
    No, they won't, if pg_start_backup() is called *after* getting the list
    of tablespaces.
    Ah. Then the fix is to change the order in which those things are done.

    regards, tom lane
  • Magnus Hagander at Jan 15, 2011 at 6:21 pm

    On Sat, Jan 15, 2011 at 16:54, Tom Lane wrote:
    Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
    On 15.01.2011 17:30, Tom Lane wrote:
    So what?  The needed actions will be covered by WAL replay.
    No, they won't, if pg_start_backup() is called *after* getting the list
    of tablespaces.
    Ah.  Then the fix is to change the order in which those things are done.
    Grumble. It used to be that way; for some reason I can't recall, I broke it.

    Something like this to fix it? Or is this going to put those "warnings by
    stupid versions of gcc" back?
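
    As an illustration of the reordering in question (simplified; variable
    names and the do_pg_start_backup signature are approximations, not the
    actual patch):

        XLogRecPtr      startptr;
        DIR            *dir;
        struct dirent  *de;

        /*
         * Start the backup *before* scanning pg_tblspc, so a tablespace
         * created after the start checkpoint is covered by WAL replay
         * instead of being silently missed.
         */
        startptr = do_pg_start_backup(backup_label, fast_checkpoint);

        dir = AllocateDir("pg_tblspc");
        while ((de = ReadDir(dir, "pg_tblspc")) != NULL)
        {
            /* collect tablespace paths and sizes for the header resultset */
        }
        FreeDir(dir);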
  • Tom Lane at Jan 15, 2011 at 6:27 pm

    Magnus Hagander writes:
    Something like this to fix it? Or is this going to put those "warnings by
    stupid versions of gcc" back?
    Possibly. If so, I'll fix it --- I have an old gcc to test against
    here.

    regards, tom lane
  • Magnus Hagander at Jan 15, 2011 at 6:30 pm

    On Sat, Jan 15, 2011 at 19:27, Tom Lane wrote:
    Magnus Hagander <magnus@hagander.net> writes:
    Something like this to fix it? Or is this going to put those "warnings by
    stupid versions of gcc" back?
    Possibly.  If so, I'll fix it --- I have an old gcc to test against
    here.
    Ok, thanks, I'll commit this one then.
  • Garick Hamlin at Jan 7, 2011 at 3:31 pm

    On Thu, Jan 06, 2011 at 07:47:39PM -0500, Cédric Villemain wrote:
    2011/1/5 Magnus Hagander <magnus@hagander.net>:
    On Wed, Jan 5, 2011 at 22:58, Dimitri Fontaine wrote:
    Magnus Hagander <magnus@hagander.net> writes:
    * Stefan mentiond it might be useful to put some
    posix_fadvise(POSIX_FADV_DONTNEED)
    in the process that streams all the files out. Seems useful, as long as that
    doesn't kick them out of the cache *completely*, for other backends as well.
    Do we know if that is the case?
    Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
    not already in SHM?
    I think that's way more complex than we want to go here.
    DONTNEED will remove the block from the OS cache every time.

    It should not be that hard to implement a snapshot (it needs mincore())
    and to restore the previous state. I don't know how basebackup is
    performed exactly... so perhaps I am wrong.

    posix_fadvise support is already in PostgreSQL core... we can start by
    just doing a snapshot of the files before starting, or at some point
    in the basebackup; it will need only 256kB per GB of data...
    It is actually possible to be more scalable than the simple solution you
    outline here (although that solution works pretty well).

    I've written a program that synchronizes the OS cache state between two
    computers using mmap()/mincore(). I haven't actually tested its impact
    on performance yet, but I was surprised by how fast it actually runs
    and how compact the cache maps can be.

    If one encodes the data by remembering the number of zeros between 1s,
    the storage scales by the number of pages resident in memory rather
    than the total dataset size. I actually played with doing that, then
    doing Huffman encoding of that. I get around 1.2-1.3 bits per page of
    _physical memory_ in my tests.

    I don't have my notes handy, but here are some numbers from memory...

    The obvious worst cases are 1 bit per page of _dataset_, or 19 bits per
    page of physical memory in the machine. The latter limit gets better,
    however, since there are < 1024 symbols possible for the encoder (since
    in this case the symbols are spans of zeros that need to fit in a file
    that is 1 GB in size). So the actual worst case is much closer to 1 bit
    per page of the dataset, or ~10 bits per page of physical memory. The
    real performance I see with Huffman is more like 1.3 bits per page of
    physical memory. All the encoding and decoding is actually very fast.
    zlib would actually compress even better than Huffman, but the Huffman
    encoder/decoder is pretty good and very straightforward code.
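
    A minimal sketch of the encoding idea, assuming Linux-style mmap()/
    mincore() semantics: build the residency vector, then emit the length
    of each run of non-resident pages (the Huffman stage is left out):

        #include <fcntl.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/mman.h>
        #include <sys/stat.h>
        #include <unistd.h>

        int
        main(int argc, char **argv)
        {
            int             fd = open(argv[1], O_RDONLY);
            struct stat     st;
            long            pagesize = sysconf(_SC_PAGESIZE);
            size_t          npages, i, gap = 0;
            unsigned char  *vec;
            void           *map;

            if (fd < 0 || fstat(fd, &st) < 0 || st.st_size == 0)
                return 1;
            npages = (st.st_size + pagesize - 1) / pagesize;

            map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
            vec = malloc(npages);
            if (map == MAP_FAILED || mincore(map, st.st_size, vec) < 0)
                return 1;

            /* print the number of non-resident pages before each resident one */
            for (i = 0; i < npages; i++)
            {
                if (vec[i] & 1)
                {
                    printf("%zu\n", gap);
                    gap = 0;
                }
                else
                    gap++;
            }
            return 0;
        }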

    I would like to integrate something like this into PG, or perhaps even
    into something like rsync, but it was written as a proof of concept and
    I haven't had time to work on it recently.

    Garick
  • Garick Hamlin at Jan 7, 2011 at 3:48 pm

    On Fri, Jan 07, 2011 at 10:26:29AM -0500, Garick Hamlin wrote:
    On Thu, Jan 06, 2011 at 07:47:39PM -0500, Cédric Villemain wrote:
    2011/1/5 Magnus Hagander <magnus@hagander.net>:
    On Wed, Jan 5, 2011 at 22:58, Dimitri Fontaine wrote:
    Magnus Hagander <magnus@hagander.net> writes:
    * Stefan mentiond it might be useful to put some
    posix_fadvise(POSIX_FADV_DONTNEED)
    in the process that streams all the files out. Seems useful, as long as that
    doesn't kick them out of the cache *completely*, for other backends as well.
    Do we know if that is the case?
    Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
    not already in SHM?
    I think that's way more complex than we want to go here.
    DONTNEED will remove the block from the OS cache every time.

    It should not be that hard to implement a snapshot (it needs mincore())
    and to restore the previous state. I don't know how basebackup is
    performed exactly... so perhaps I am wrong.

    posix_fadvise support is already in PostgreSQL core... we can start by
    just doing a snapshot of the files before starting, or at some point
    in the basebackup; it will need only 256kB per GB of data...
    It is actually possible to be more scalable than the simple solution you
    outline here (although that solution works pretty well).

    I've written a program that synchronizes the OS cache state between two
    computers using mmap()/mincore(). I haven't actually tested its impact
    on performance yet, but I was surprised by how fast it actually runs
    and how compact the cache maps can be.

    If one encodes the data so one remembers the number of zeros between 1s
    one, storage scale by the amount of memory in each size rather than the
    Sorry for the typos, that should read:

    the storage scales by the number of pages resident in memory rather than the
    total dataset size.
    dataset size. I actually played with doing that, then doing Huffman
    encoding of that. I get around 1.2-1.3 bits per page of _physical memory_
    in my tests.

    I don't have my notes handy, but here are some numbers from memory...

    The obvious worst cases are 1 bit per page of _dataset_, or 19 bits per
    page of physical memory in the machine. The latter limit gets better,
    however, since there are < 1024 symbols possible for the encoder (since
    in this case the symbols are spans of zeros that need to fit in a file
    that is 1 GB in size). So the actual worst case is much closer to 1 bit
    per page of the dataset, or ~10 bits per page of physical memory. The
    real performance I see with Huffman is more like 1.3 bits per page of
    physical memory. All the encoding and decoding is actually very fast.
    zlib would actually compress even better than Huffman, but the Huffman
    encoder/decoder is pretty good and very straightforward code.

    I would like to integrate something like this into PG, or perhaps even
    into something like rsync, but it was written as a proof of concept and
    I haven't had time to work on it recently.

    Garick
  • Cédric Villemain at Jan 9, 2011 at 2:21 pm

    2011/1/7 Garick Hamlin <ghamlin@isc.upenn.edu>:
    On Thu, Jan 06, 2011 at 07:47:39PM -0500, Cédric Villemain wrote:
    2011/1/5 Magnus Hagander <magnus@hagander.net>:
    On Wed, Jan 5, 2011 at 22:58, Dimitri Fontaine wrote:
    Magnus Hagander <magnus@hagander.net> writes:
    * Stefan mentiond it might be useful to put some
    posix_fadvise(POSIX_FADV_DONTNEED)
    in the process that streams all the files out. Seems useful, as long as that
    doesn't kick them out of the cache *completely*, for other backends as well.
    Do we know if that is the case?
    Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
    not already in SHM?
    I think that's way more complex than we want to go here.
    DONTNEED will remove the block from the OS cache every time.

    It should not be that hard to implement a snapshot (it needs mincore())
    and to restore the previous state. I don't know how basebackup is
    performed exactly... so perhaps I am wrong.

    posix_fadvise support is already in PostgreSQL core... we can start by
    just doing a snapshot of the files before starting, or at some point
    in the basebackup; it will need only 256kB per GB of data...
    It is actually possible to be more scalable than the simple solution you
    outline here (although that solution works pretty well).
    Yes, I suggested something pretty simple to start with, as a first shot.
    I've written a program that synchronizes the OS cache state between two
    computers using mmap()/mincore(). I haven't actually tested its impact
    on performance yet, but I was surprised by how fast it actually runs
    and how compact the cache maps can be.

    If one encodes the data by remembering the number of zeros between 1s,
    the storage scales by the number of pages resident in memory rather
    than the total dataset size. I actually played with doing that, then
    doing Huffman encoding of that. I get around 1.2-1.3 bits per page of
    _physical memory_ in my tests.

    I don't have my notes handy, but here are some numbers from memory...
    That is interesting. Even if I haven't had issues with the size of the
    maps so far, I thought that a simple zlib compression would be
    enough.
    The obvious worst cases are 1 bit per page of _dataset_, or 19 bits per
    page of physical memory in the machine. The latter limit gets better,
    however, since there are < 1024 symbols possible for the encoder (since
    in this case the symbols are spans of zeros that need to fit in a file
    that is 1 GB in size). So the actual worst case is much closer to 1 bit
    per page of the dataset, or ~10 bits per page of physical memory. The
    real performance I see with Huffman is more like 1.3 bits per page of
    physical memory. All the encoding and decoding is actually very fast.
    zlib would actually compress even better than Huffman, but the Huffman
    encoder/decoder is pretty good and very straightforward code.
    pgfincore currently holds that information in a flat file. The ongoing
    development is simpler and provides the data as bits, so you can store
    it in a table, restore it on your slave thanks to SR, and use it on
    the slave.
    I would like to integrate something like this into PG, or perhaps even
    into something like rsync, but it was written as a proof of concept and
    I haven't had time to work on it recently.

    Garick


    --
    Cédric Villemain               2ndQuadrant
    http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
  • Marti Raudsepp at Jan 6, 2011 at 11:54 am

    On Wed, Jan 5, 2011 at 23:58, Dimitri Fontaine wrote:
    * Stefan mentiond it might be useful to put some
    posix_fadvise(POSIX_FADV_DONTNEED)
    in the process that streams all the files out. Seems useful, as long as that
    doesn't kick them out of the cache *completely*, for other backends as well.
    Do we know if that is the case?
    Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
    not already in SHM?
    It's not much of an improvement. For pages that we already have in
    shared memory, the OS cache is mostly useless. The OS cache matters for
    pages that *aren't* in shared memory.

    Regards,
    Marti
  • Heikki Linnakangas at Jan 6, 2011 at 10:57 pm

    On 05.01.2011 15:54, Magnus Hagander wrote:
    Attached is an updated streaming base backup patch, based off the work
    that Heikki started.
    ...
    I've implemented a frontend for this in pg_streamrecv, based on the assumption
    that we wanted to include this in bin/ for 9.1 - and that it seems like a
    reasonable place to put it. This can obviously be moved elsewhere if we want to.
    Hmm, is there any point in keeping the two functionalities in the same
    binary, taking the base backup and streaming WAL to an archive
    directory? It looks like the only options common to the two modes are
    the connection string and the verbose flag. A separate
    pg_basebackup binary would probably make more sense.
    That code needs a lot more cleanup, but I wanted to make sure I got the backend
    patch out for review quickly. You can find the current WIP branch for
    pg_streamrecv on my github page at https://github.com/mhagander/pg_streamrecv,
    in the branch "baserecv". I'll be posting that as a separate patch once it's
    been a bit more cleaned up (it does work now if you want to test it, though).
    Looks like pg_streamrecv creates the pg_xlog and pg_tblspc directories
    itself, because they're not included in the streamed tar. Wouldn't it be
    better to include them in the tar as empty directories on the server
    side? Otherwise, if you write the tar file to disk and untar it later,
    you have to manually create them.
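
    A sketch of what the server side could do; _tarWriteDirHeader() is
    hypothetical shorthand for the tar header code basebackup.c adapted
    from pg_dump (a directory entry is just a 512-byte header with
    typeflag '5' and size 0):

        /*
         * Send an empty directory as a tar member, so untarring the backup
         * recreates pg_xlog and pg_tblspc without manual steps.
         */
        static void
        send_empty_directory(const char *name)
        {
            char    header[512];

            _tarWriteDirHeader(header, name, 0700);     /* typeflag '5', size 0 */
            pq_putmessage('d', header, sizeof(header)); /* CopyData to client */
        }

    This would be called once for pg_xlog and once for pg_tblspc while
    streaming the main data directory.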

    It would be nice to have an option in pg_streamrecv to specify the
    backup label to use.

    An option to stream the tar to stdout instead of a file would be very
    handy too, so that you could pipe it directly to gzip for example. I
    realize you get multiple tar files if tablespaces are used, but even if
    you just throw an error in that case, it would be handy.
    * Suggestion from Heikki: perhaps at some point we're going to need a full
    bison grammar for walsender commands.
    Maybe we should at least start using the lexer; we don't quite need a
    full-blown grammar yet, but even a lexer might help.


    BTW, looking at the WAL-streaming side of pg_streamrecv: if you start it
    from scratch with an empty target directory, it needs to connect to the
    "postgres" database to run pg_current_xlog_location(), and then
    reconnect in replication mode. That's a bit awkward; there might not be
    a "postgres" database, and even if there is, you might not have
    permission to connect to it. It would be much better to have a variant
    of the START_REPLICATION command on the server side that begins
    streaming from the current location. Maybe just by leaving out the
    start-location parameter.
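
    Sketched, the server-side change could be small; haveStartPoint is a
    hypothetical flag for "a start location was given", and GetFlushRecPtr()
    is an assumption for "the current location":

        /* in the walsender, when handling START_REPLICATION */
        static XLogRecPtr
        resolve_start_location(StartReplicationCmd *cmd)
        {
            if (cmd->haveStartPoint)
                return cmd->startpoint;

            /* no location given: stream from the current WAL position */
            return GetFlushRecPtr();
        }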

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Magnus Hagander at Jan 6, 2011 at 11:02 pm

    On Thu, Jan 6, 2011 at 23:57, Heikki Linnakangas wrote:
    On 05.01.2011 15:54, Magnus Hagander wrote:

    Attached is an updated streaming base backup patch, based off the work
    that Heikki started.
    ...
    I've implemented a frontend for this in pg_streamrecv, based on the
    assumption
    that we wanted to include this in bin/ for 9.1 - and that it seems like a
    reasonable place to put it. This can obviously be moved elsewhere if we
    want to.
    Hmm, is there any point in keeping the two functionalities in the same
    binary, taking the base backup and streaming WAL to an archive directory?
    It looks like the only options common to the two modes are the
    connection string and the verbose flag. A separate pg_basebackup binary
    would probably make more sense.
    Yeah, once I broke things apart for better readability, I started
    leaning in that direction as well.

    However, if you consider the things that Dimitri mentioned about
    streaming at the same time as downloading, having them in the same
    binary would make more sense. I don't think that's something for now,
    though...

    That code needs a lot more cleanup, but I wanted to make sure I got the
    backend
    patch out for review quickly. You can find the current WIP branch for
    pg_streamrecv on my github page at
    https://github.com/mhagander/pg_streamrecv,
    in the branch "baserecv". I'll be posting that as a separate patch once
    it's
    been a bit more cleaned up (it does work now if you want to test it,
    though).
    Looks like pg_streamrecv creates the pg_xlog and pg_tblspc directories
    itself, because they're not included in the streamed tar. Wouldn't it be
    better to include them in the tar as empty directories on the server
    side? Otherwise, if you write the tar file to disk and untar it later,
    you have to manually create them.
    Yeah, good point. Originally, the tar code (your tar code, btw :P)
    didn't create *any* directories, so I stuck it in there. I agree it
    should be moved to the backend patch now.

    It would be nice to have an option in pg_streamrecv to specify the backup
    label to use.
    Agreed.

    An option to stream the tar to stdout instead of a file would be very handy
    too, so that you could pipe it directly to gzip for example. I realize you
    get multiple tar files if tablespaces are used, but even if you just throw
    an error in that case, it would be handy.
    Makes sense.

    * Suggestion from Heikki: perhaps at some point we're going to need a full
    bison grammar for walsender commands.
    Maybe we should at least start using the lexer; we don't quite need a
    full-blown grammar yet, but even a lexer might help.
    Might. I don't speak flex very well, so I'm not really sure what that
    would mean.

    BTW, looking at the WAL-streaming side of pg_streamrecv: if you start it
    from scratch with an empty target directory, it needs to connect to the
    "postgres" database to run pg_current_xlog_location(), and then reconnect
    in replication mode. That's a bit awkward; there might not be a "postgres"
    database, and even if there is, you might not have permission to connect
    to it. It would be much better to have a variant of the START_REPLICATION
    command on the server side that begins streaming from the current location.
    Maybe just by leaving out the start-location parameter.
    Agreed. That part is unchanged from the version that runs against 9.0,
    though, where that wasn't a possibility. But adding something like
    that to the walsender in 9.1 would be good.
  • Magnus Hagander at Jan 8, 2011 at 4:34 pm

    On Thu, Jan 6, 2011 at 23:57, Heikki Linnakangas wrote:

    Looks like pg_streamrecv creates the pg_xlog and pg_tblspc directories,
    because they're not included in the streamed tar. Wouldn't it be better to
    include them in the tar as empty directories at the server-side? Otherwise
    if you write the tar file to disk and untar it later, you have to manually
    create them.
    Attached is an updated patch that does this.

    It also collects all the header records as a single resultset at the
    beginning. This made for cleaner code, but more importantly it makes it
    possible to get the total size of the backup even if there are
    multiple tablespaces.
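
    For illustration, that header resultset has one row per tablespace, with
    the base data directory identified by NULL spcoid/spclocation (the paths
    and sizes below are made up):

        spcoid |  spclocation  |  size
        -------+---------------+--------
               |               | 182345
         16385 | /data/ts1     |  93812
         16386 | /data/ts2     |  11207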

    It also changes the tar members to use relative paths instead of
    absolute ones, since we send the root of the directory in the header
    anyway. That also removes the "./" portion from all tar members.

    git branch on github updated as well, of course.
  • Simon Riggs at Jan 7, 2011 at 1:15 am

    On Wed, 2011-01-05 at 14:54 +0100, Magnus Hagander wrote:

    The basic implementation is: Add a new command to the replication mode called
    BASE_BACKUP, that will initiate a base backup, stream the contents (in tar
    compatible format) of the data directory and all tablespaces, and then end
    the base backup in a single operation.
    I'm a little dubious about the performance of that approach for some
    users, though it does seem a popular idea.

    One very useful feature would be some way of confirming the number and
    size of the files to transfer, so that the base backup client can track
    progress.

    It would also be good to avoid writing a backup_label file on the
    master at all, so that there would be no reason why multiple concurrent
    backups could not be taken. The current coding allows for the start and
    stop to be in different sessions, whereas here we know we are in one
    session.

    --
    Simon Riggs http://www.2ndQuadrant.com/books/
    PostgreSQL Development, 24x7 Support, Training and Services
