Greetings,
I've got three Linux systems (each with Fedora15-x86_64 running
PostgreSQL-9.0.4). I'm attempting to get a basic streaming
replication setup going with one master & two standby servers. At
this point, the replication portion appears to be working. I can run
an 'update' statement on the master, and view the result with a
'SELECT' on both standby servers.

I've mostly been working off the information presented here:
http://wiki.postgresql.org/wiki/Binary_Replication_Tutorial
http://wiki.postgresql.org/wiki/Streaming_Replication

plus the official Postgresql website documentation, with copious
googling for other sources when something didn't make sense.

First the problem. On *only* one of the two standby servers, I'm
seeing errors like the following whenever I issue any SQL commands on
the master which write (insert, update, etc) to the database:
LOG: invalid record length at 8/7A000020
FATAL: terminating walreceiver process due to administrator command
LOG: invalid record length at 8/7A0000B0
LOG: streaming replication successfully connected to primary
LOG: invalid record length at 8/7B000020
FATAL: terminating walreceiver process due to administrator command
LOG: record with zero length at 8/7B0000B0
LOG: streaming replication successfully connected to primary
LOG: record with incorrect prev-link 8/79000058 at 8/7D0000B0
LOG: streaming replication successfully connected to primary

The thing that makes this even more confusing is that the data seems
to remain synchronized on both standby servers even with that error,
so I can't tell if I'm merely missing the implication of the errors,
of if they're somehow harmless? Maybe its able to resume streaming
replication only because i have wal_keep_segments=128 ? I googled a
bit on this, and found a few other references to these errors,
including this recent one which suggested that too much network
latency might be the problem:
http://permalink.gmane.org/gmane.comp.db.postgresql.general/153445

I should note that the standby that is exhibiting this problem is
running inside of a virtual machine, while the standby without the
problem is running on real HW. Whenever I get to the point where I
want to push the entire setup into production, it will all be running
on real HW.

Now a few unrelated questions:
0) I've successfully setup the WAL archiving on the master, and set
archive_timeout=61. However, what I'm seeing is that new files are
not getting generated every 61 seconds, but instead only when some
kind of SQL is invoked which writes to the database, or every 305
seconds (whichever comes first). I know this is the case because I
wrote a script which is being invoked via the archive_command option,
and that script is logging both the timestamp and filenames each time
it is invoked. Can I debug this?
1) Both of the wiki links above comment that the restore_command may
not be necessary if wal_keep_segments is large enough (mine is set to
128). I was going to setup the restore_command anyway, as I'm not yet
confident enough about streaming replication and failover with
postgresql to take chances, although the fact that i have two standby
servers makes this setup a bit more complex. However, can anyone
comment about whether its ever truly safe 100% of the time to run
without a restore_command ?

thanks!

Search Discussions

  • Pedro Sam at Aug 11, 2011 at 3:17 pm
    Do your machines have the same architecture? (64 bit vs 32 bit)

    ---------------------------------------------------------------------
    This transmission (including any attachments) may contain confidential information, privileged material (including material protected by the solicitor-client or other applicable privileges), or constitute non-public information. Any use of this information by anyone other than the intended recipient is prohibited. If you have received this transmission in error, please immediately reply to the sender and delete this information from your system. Use, dissemination, distribution, or reproduction of this transmission by unintended recipients is not authorized and may be unlawful.
  • Lonni J Friedman at Aug 11, 2011 at 3:20 pm

    On Thu, Aug 11, 2011 at 8:17 AM, Pedro Sam wrote:
    Do your machines have the same architecture?  (64 bit vs 32 bit)
    Yes, they're all Fedora15-x86_64.


    --
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    L. Friedman                                    netllama@gmail.com
    LlamaLand                       https://netllama.linux-sxs.org
  • Rockwell at Aug 14, 2011 at 8:15 pm
    Hi.

    I've experienced very similar behavior and log message. I do not know what
    is the problem in detail but my guess is that standby system has data which
    is incompatible to accept streaming replication of the master.

    The way I solved is the following on Ubuntu11.04. Suppose postgresql.conf
    and recover.conf (on standby) are ready.

    0. Stop postgresql on standby
    standby# /etc/init.d/postgresql stop

    1. invoke pg_start_backup() on master
    master# sudo -u postgres psql -c "SELECT pg_start_backup('2011-08-15_04:49)"

    2. remove data files on standby
    standby# cd /var/lib/postgresql/9.0/main
    standby# /bin/rm -rf base pg_*
    standby# mkdir pg_xlog; chown postgres.postgres pg_xlog; chmod 700 pg_xlog

    3. copy data files
    master# rsync -av --delete /var/lib/postgresql/9.0/main --exclude=pg_xlog
    --exclude=postmaster.pid --exclude=server.crt --exclude=server.key
    /path/to/standby/data/directory

    4. invoke pg_stop_backup() on master
    master# sudo -u postgres psql -c "SELECT pg_stop_backup()"

    5. start postgresql on standby
    standby# /etc/init.d/postgresql start

    Then, I found streaming replication just started to work.

    --
    View this message in context: http://postgresql.1045698.n5.nabble.com/streaming-replication-one-problem-several-questions-tp4687602p4698911.html
    Sent from the PostgreSQL - general mailing list archive at Nabble.com.
  • Fujii Masao at Aug 16, 2011 at 4:34 am

    On Thu, Aug 11, 2011 at 7:19 AM, Lonni J Friedman wrote:
    First the problem.  On *only* one of the two standby servers, I'm
    seeing errors like the following whenever I issue any SQL commands on
    the master which write (insert, update, etc) to the database:
    LOG:  invalid record length at 8/7A000020
    FATAL:  terminating walreceiver process due to administrator command
    LOG:  invalid record length at 8/7A0000B0
    LOG:  streaming replication successfully connected to primary
    LOG:  invalid record length at 8/7B000020
    FATAL:  terminating walreceiver process due to administrator command
    LOG:  record with zero length at 8/7B0000B0
    LOG:  streaming replication successfully connected to primary
    LOG:  record with incorrect prev-link 8/79000058 at 8/7D0000B0
    LOG:  streaming replication successfully connected to primary
    Did you use gcc4.6 or later to build PostgreSQL9.0? If yes, you would
    face the same problem reported before;
    http://archives.postgresql.org/pgsql-hackers/2011-06/msg00661.php

    This problem was fixed, and the fix will be included in next minor update
    (i.e., 9.0.5).
    http://archives.postgresql.org/pgsql-committers/2011-06/msg00101.php

    Of course, you can avoid the problem by building PostgreSQL with
    pre-4.6 gcc.
    0) I've successfully setup the WAL archiving on the master, and set
    archive_timeout=61.  However, what I'm seeing is that new files are
    not getting generated every 61 seconds, but instead only when some
    kind of SQL is invoked which writes to the database, or every 305
    seconds (whichever comes first).
    First of all, you don't need to set archive_timeout. Without archive_timeout,
    streaming replication transfers WAL records from the master to the standby
    in almost real time.

    archive_timeout doesn't always generate new WAL file for each timeout.
    If there is no write workload, WAL file generation by archive_timeout is
    skipped.

    OTOH, checkout generates write workload, so archive_timeout after checkpoint
    always creates new WAL file. Since (I guess) you set checkpoint_timeout
    to 5min, you observed WAL file generation for each about 5min.
    1) Both of the wiki links above comment that the restore_command may
    not be necessary if wal_keep_segments is large enough (mine is set to
    128).  I was going to setup the restore_command anyway, as I'm not yet
    confident enough about streaming replication and failover with
    postgresql to take chances, although the fact that i have two standby
    servers makes this setup a bit more complex.  However, can anyone
    comment about whether its ever truly safe 100% of the time to run
    without a restore_command ?
    Specifically, what problem are you concerned about?

    Regards,

    --
    Fujii Masao
    NIPPON TELEGRAPH AND TELEPHONE CORPORATION
    NTT Open Source Software Center
  • Lonni J Friedman at Aug 17, 2011 at 7:26 pm

    On Mon, Aug 15, 2011 at 9:34 PM, Fujii Masao wrote:
    On Thu, Aug 11, 2011 at 7:19 AM, Lonni J Friedman wrote:
    First the problem.  On *only* one of the two standby servers, I'm
    seeing errors like the following whenever I issue any SQL commands on
    the master which write (insert, update, etc) to the database:
    LOG:  invalid record length at 8/7A000020
    FATAL:  terminating walreceiver process due to administrator command
    LOG:  invalid record length at 8/7A0000B0
    LOG:  streaming replication successfully connected to primary
    LOG:  invalid record length at 8/7B000020
    FATAL:  terminating walreceiver process due to administrator command
    LOG:  record with zero length at 8/7B0000B0
    LOG:  streaming replication successfully connected to primary
    LOG:  record with incorrect prev-link 8/79000058 at 8/7D0000B0
    LOG:  streaming replication successfully connected to primary
    Did you use gcc4.6 or later to build PostgreSQL9.0? If yes, you would
    face the same problem reported before;
    http://archives.postgresql.org/pgsql-hackers/2011-06/msg00661.php

    This problem was fixed, and the fix will be included in next minor update
    (i.e., 9.0.5).
    http://archives.postgresql.org/pgsql-committers/2011-06/msg00101.php
    Yes, that was the issue. I thought that I had replied earlier to
    someone else speculating that this was the issue, but perhaps I had
    not.

    1) Both of the wiki links above comment that the restore_command may
    not be necessary if wal_keep_segments is large enough (mine is set to
    128).  I was going to setup the restore_command anyway, as I'm not yet
    confident enough about streaming replication and failover with
    postgresql to take chances, although the fact that i have two standby
    servers makes this setup a bit more complex.  However, can anyone
    comment about whether its ever truly safe 100% of the time to run
    without a restore_command ?
    Specifically, what problem are you concerned about?
    I wish I knew. All the documentation out there always focuses on
    setting up a restore command, as if there would be a huge disaster if
    it wasn't done. Is it safe to simply make wal_keep_segments really
    large, and skip the restore_command altogether?
  • Fujii Masao at Aug 18, 2011 at 1:26 am

    On Thu, Aug 18, 2011 at 4:26 AM, Lonni J Friedman wrote:
    I wish I knew.  All the documentation out there always focuses on
    setting up a restore command, as if there would be a huge disaster if
    it wasn't done.  Is it safe to simply make wal_keep_segments really
    large, and skip the restore_command altogether?
    There are pros and cons of replication setting NOT using restore_command.
    Please evaluate whether it's safe or not according to them.

    Pros;
    * You don't need to prepare the archive area shared between the master
    and standby. Don't need to purchase new server for that.

    * If you use restore_command and have the shared archive area,
    archive_command is a bit more likely to fail because it copies WAL files
    via network. Failure of archive_command might fill up the pg_xlog
    directory on the master, which might cause PANIC error. So you need
    to consider how to handle this failure case. OTOH, you don't need to
    do that if you don't use restore_command.

    Cons;
    * When setting up the standby, if the backup takes very long because
    the database is quite large, some WAL files required to the backup
    might be deleted from the master during the backup. If this happens,
    the standby starting from that backup will fail to start replication.
    To avoid such an unexpected deletion of WAL files from the master,
    you need to increase wal_keep_segments enough. But it might not
    be easy to determine the appropriate value of it.

    * You need to prepare large disk space for pg_xlog directory
    if wal_keep_segments is large. Because, in that case, a large number
    of WAL files can accumulate in pg_xlog.

    * When replication connection is terminated, no WAL data is streamed
    to the standby, so the standby cannot advance recovery at all. OTOH,
    if you set restore_command on the standby and have the shared
    archive area, the standby can read new WAL file from it by using
    restore_command and advance recovery.

    Regards,

    --
    Fujii Masao
    NIPPON TELEGRAPH AND TELEPHONE CORPORATION
    NTT Open Source Software Center

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
grouppgsql-general @
categoriespostgresql
postedAug 10, '11 at 10:19p
activeAug 18, '11 at 1:26a
posts7
users4
websitepostgresql.org
irc#postgresql

People

Translate

site design / logo © 2022 Grokbase