I am not sure the following pg_ctl behaviour is really a bug, but I find it unexpected enough to
report.

I was testing synchronous replication in a test setup on a single machine. (After all, one could
have different instances on different arrays, right? If you think this is an unlikely use-case,
perhaps the following is not important.)

There are two installations of 9.1devel (git as of today):
primary: /var/data1/pg_stuff/pg_installations/pgsql.vanilla_1
standby: /var/data1/pg_stuff/pg_installations/pgsql.vanilla_2

The standby's data_directory is generated by pg_basebackup from vanilla_1.

The problem is the very first run of pg_ctl restart:

pg_ctl first correctly decides that the standby instance (=vanilla_2) isn't yet running:

pg_ctl: PID file "/var/data1/pg_stuff/pg_installations/pgsql.vanilla_2/data/postmaster.pid" does
not exist

This is OK and expected. But then it continues (in the logfile) with:

FATAL: lock file "postmaster.pid" already exists
HINT: Is another postmaster (PID 20519) running in data directory
"/var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/data"?

So, complaints about the *other* instance. It doesn't happen once a successful start (with pg_ctl
start) has happened.

It starts fine when started right away with 'start' instead of 'restart'.

Also, if it has been started once, it will react to 'pg_ctl restart' without the errors.

I'll attach a shell script that provokes the error; see the 'restart' on the line with the
comment 'HERE'.

It would seem (see below) that pg_ctl's final decision about the standby (that it has started up)
is wrong; the standby does *not* eventually start.


Below is the output of the attached shell script (careful: it deletes stuff).
It still contains some debug lines, but I didn't want to change it too much.


$ clear; ./split_vanilla.sh

PGPASSFILE=/home/rijkers/.pg_rijkers
waiting for server to shut down.... done
server stopped
waiting for server to shut down.... done
server stopped
waiting for server to start.... done
server started
removed `/var/data1/pg_stuff/archive_dir/000000010000000000000018'
removed `/var/data1/pg_stuff/archive_dir/000000010000000000000019'
removed `/var/data1/pg_stuff/archive_dir/000000010000000000000019.00000020.backup'
removed `/var/data1/pg_stuff/archive_dir/00000001000000000000001A'
/var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/bin/pg_basebackup
NOTICE: pg_stop_backup complete, all required WAL segments have been archived

BINDIR = /var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/bin
PGPORT=6564
PGPASSFILE=/home/rijkers/.pg_rijkers
PGDATA=/var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/data
/var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/bin/pg_ctl
waiting for server to shut down.... done
server stopped
waiting for server to start.... done
server started
UID        PID  PPID  C STIME TTY      STAT   TIME CMD
rijkers  20519     1 20 17:19 pts/25   S+     0:00 /var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/bin/postgres -D /var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/data
rijkers  20521 20519  0 17:19 ?        Ss     0:00  \_ postgres: writer process
rijkers  20522 20519  0 17:19 ?        Ss     0:00  \_ postgres: wal writer process
rijkers  20523 20519  0 17:19 ?        Ss     0:00  \_ postgres: autovacuum launcher process
rijkers  20524 20519  0 17:19 ?        Ss     0:00  \_ postgres: archiver process
rijkers  20525 20519  0 17:19 ?        Ss     0:00  \_ postgres: stats collector process

BINDIR = /var/data1/pg_stuff/pg_installations/pgsql.vanilla_2/bin
PGPORT=6664
PGPASSFILE=/home/rijkers/.pg_rijkers
PGDATA=/var/data1/pg_stuff/pg_installations/pgsql.vanilla_2/data
/var/data1/pg_stuff/pg_installations/pgsql.vanilla_2/bin/pg_ctl
pg_ctl: PID file "/var/data1/pg_stuff/pg_installations/pgsql.vanilla_2/data/postmaster.pid" does
not exist
Is server running?
starting server anyway
waiting for server to start............................................................... done
server started

-- logfile 1:
LOG: database system is shut down
LOG: database system was shut down at 2011-03-18 17:19:54 CET
LOG: autovacuum launcher started
LOG: database system is ready to accept connections

-- logfile 2:
LOG: shutting down
LOG: database system is shut down
FATAL: lock file "postmaster.pid" already exists
HINT: Is another postmaster (PID 20519) running in data directory
"/var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/data"?




thanks,

Erik Rijkers
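
The report notes that the standby starts fine with 'start' and only fails on a first-ever 'restart'. A wrapper script could sidestep that first-run case by choosing the pg_ctl mode from the presence of postmaster.pid. The following is a minimal sketch under that assumption; it is not the attached script and not pg_ctl's own logic, the function name pg_ctl_mode is hypothetical, and a stale PID file would still select 'restart'.

```shell
#!/bin/sh
# Hypothetical helper (not part of pg_ctl): print the pg_ctl mode to use
# for the data directory given as $1.
pg_ctl_mode() {
    datadir=$1
    if [ -f "$datadir/postmaster.pid" ]; then
        # A PID file exists, so a server was started from this directory
        # (or a stale PID file was left behind): ask for a restart.
        echo restart
    else
        # No PID file: there is nothing to stop, so a plain start suffices.
        echo start
    fi
}

# Example invocation (assumes pg_ctl is on PATH and PGDATA is set):
#   pg_ctl -D "$PGDATA" -w "$(pg_ctl_mode "$PGDATA")"
```

For a freshly created standby directory, which has no postmaster.pid, this prints 'start', matching the case the report describes as working.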


  • Robert Haas at Mar 19, 2011 at 1:22 am

    On Fri, Mar 18, 2011 at 1:19 PM, Erik Rijkers wrote:
    > This is OK and expected.  But then it continues (in the logfile) with:
    >
    > FATAL:  lock file "postmaster.pid" already exists
    > HINT:  Is another postmaster (PID 20519) running in data directory
    > "/var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/data"?
    >
    > So, complaints about the *other* instance.  It doesn't happen once a successful start (with pg_ctl
    > start) has happened.

    I'm guessing that leftover postmaster.pid contents might be
    responsible for this?

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Fujii Masao at Mar 23, 2011 at 5:48 am

    On Sat, Mar 19, 2011 at 10:20 AM, Robert Haas wrote:
    > On Fri, Mar 18, 2011 at 1:19 PM, Erik Rijkers wrote:
    >> This is OK and expected.  But then it continues (in the logfile) with:
    >>
    >> FATAL:  lock file "postmaster.pid" already exists
    >> HINT:  Is another postmaster (PID 20519) running in data directory
    >> "/var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/data"?
    >>
    >> So, complaints about the *other* instance.  It doesn't happen once a successful start (with pg_ctl
    >> start) has happened.
    > I'm guessing that leftover postmaster.pid contents might be
    > responsible for this?

    The cause is that "pg_ctl restart" uses the postmaster.opts which was
    created in the primary. Since its content was something like
    "pg_ctl -D vanilla_1/data", vanilla_1/data/postmaster.pid was checked
    wrongly.

    The simple workaround is to exclude postmaster.opts from the backup
    as well as postmaster.pid. But when postmaster.opts doesn't exist,
    "pg_ctl restart" cannot start up the server. We might also need to change
    the code of "pg_ctl restart" so that it does just "pg_ctl start" when
    postmaster.opts doesn't exist.

    Regards,

    --
    Fujii Masao
    NIPPON TELEGRAPH AND TELEPHONE CORPORATION
    NTT Open Source Software Center
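
The diagnosis above suggests a quick check before the first 'pg_ctl restart' on a copied data directory: does postmaster.opts mention the directory it lives in? The sketch below assumes the file records the server command line as space-separated, optionally double-quoted arguments that include the data directory path exactly as passed; that format assumption and the function name opts_look_foreign are illustrative, and paths containing spaces, or servers started via PGDATA without -D, are not handled.

```shell
#!/bin/sh
# Hypothetical check (not part of PostgreSQL): succeed (exit 0) when
# postmaster.opts in data directory $1 exists but none of its arguments
# equals that directory's own path.
opts_look_foreign() {
    datadir=$1
    opts=$datadir/postmaster.opts
    # No opts file: nothing to flag.
    [ -f "$opts" ] || return 1
    # Split the recorded command line into one argument per line, strip
    # the double quotes, and look for an exact match of the directory.
    if tr -s ' ' '\n' < "$opts" | tr -d '"' | grep -qxF -- "$datadir"; then
        return 1    # the file refers to this directory
    fi
    return 0        # the file refers to some other directory (or none)
}

# Example (the path is a placeholder):
#   opts_look_foreign /path/to/standby/data && echo "postmaster.opts was copied from elsewhere"
```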
  • Robert Haas at Mar 23, 2011 at 11:53 am

    On Wed, Mar 23, 2011 at 1:48 AM, Fujii Masao wrote:
    > On Sat, Mar 19, 2011 at 10:20 AM, Robert Haas wrote:
    >> On Fri, Mar 18, 2011 at 1:19 PM, Erik Rijkers wrote:
    >>> This is OK and expected.  But then it continues (in the logfile) with:
    >>>
    >>> FATAL:  lock file "postmaster.pid" already exists
    >>> HINT:  Is another postmaster (PID 20519) running in data directory
    >>> "/var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/data"?
    >>>
    >>> So, complaints about the *other* instance.  It doesn't happen once a successful start (with pg_ctl
    >>> start) has happened.
    >> I'm guessing that leftover postmaster.pid contents might be
    >> responsible for this?
    > The cause is that "pg_ctl restart" uses the postmaster.opts which was
    > created in the primary. Since its content was something like
    > "pg_ctl -D vanilla_1/data", vanilla_1/data/postmaster.pid was checked
    > wrongly.
    >
    > The simple workaround is to exclude postmaster.opts from the backup
    > as well as postmaster.pid. But when postmaster.opts doesn't exist,
    > "pg_ctl restart" cannot start up the server. We might also need to change
    > the code of "pg_ctl restart" so that it does just "pg_ctl start" when
    > postmaster.opts doesn't exist.

    Sounds reasonable.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Bruce Momjian at Sep 6, 2011 at 2:14 am

    Robert Haas wrote:
    > On Wed, Mar 23, 2011 at 1:48 AM, Fujii Masao wrote:
    >> On Sat, Mar 19, 2011 at 10:20 AM, Robert Haas wrote:
    >>> On Fri, Mar 18, 2011 at 1:19 PM, Erik Rijkers wrote:
    >>>> This is OK and expected.  But then it continues (in the logfile) with:
    >>>>
    >>>> FATAL:  lock file "postmaster.pid" already exists
    >>>> HINT:  Is another postmaster (PID 20519) running in data directory
    >>>> "/var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/data"?
    >>>>
    >>>> So, complaints about the *other* instance.  It doesn't happen once a successful start (with pg_ctl
    >>>> start) has happened.
    >>> I'm guessing that leftover postmaster.pid contents might be
    >>> responsible for this?
    >> The cause is that "pg_ctl restart" uses the postmaster.opts which was
    >> created in the primary. Since its content was something like
    >> "pg_ctl -D vanilla_1/data", vanilla_1/data/postmaster.pid was checked
    >> wrongly.
    >>
    >> The simple workaround is to exclude postmaster.opts from the backup
    >> as well as postmaster.pid. But when postmaster.opts doesn't exist,
    >> "pg_ctl restart" cannot start up the server. We might also need to change
    >> the code of "pg_ctl restart" so that it does just "pg_ctl start" when
    >> postmaster.opts doesn't exist.
    > Sounds reasonable.

    Has this been handled?

    --
    Bruce Momjian <[email protected]> http://momjian.us
    EnterpriseDB http://enterprisedb.com

    + It's impossible for everything to be true. +
  • Bruce Momjian at Oct 11, 2011 at 9:35 pm

    Robert Haas wrote:
    > On Wed, Mar 23, 2011 at 1:48 AM, Fujii Masao wrote:
    >> On Sat, Mar 19, 2011 at 10:20 AM, Robert Haas wrote:
    >>> On Fri, Mar 18, 2011 at 1:19 PM, Erik Rijkers wrote:
    >>>> This is OK and expected.  But then it continues (in the logfile) with:
    >>>>
    >>>> FATAL:  lock file "postmaster.pid" already exists
    >>>> HINT:  Is another postmaster (PID 20519) running in data directory
    >>>> "/var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/data"?
    >>>>
    >>>> So, complaints about the *other* instance.  It doesn't happen once a successful start (with pg_ctl
    >>>> start) has happened.
    >>> I'm guessing that leftover postmaster.pid contents might be
    >>> responsible for this?
    >> The cause is that "pg_ctl restart" uses the postmaster.opts which was
    >> created in the primary. Since its content was something like
    >> "pg_ctl -D vanilla_1/data", vanilla_1/data/postmaster.pid was checked
    >> wrongly.
    >>
    >> The simple workaround is to exclude postmaster.opts from the backup
    >> as well as postmaster.pid. But when postmaster.opts doesn't exist,
    >> "pg_ctl restart" cannot start up the server. We might also need to change
    >> the code of "pg_ctl restart" so that it does just "pg_ctl start" when
    >> postmaster.opts doesn't exist.
    > Sounds reasonable.

    I looked over this issue and I don't think having pg_ctl restart fall
    back to 'start' is a good solution. I am concerned about cases where we
    start a different server without shutting down the old server, for some
    reason. When they say 'restart', I think we have to assume they want a
    restart.

    What I did do was to document that not backing up postmaster.pid and
    postmaster.opts might help prevent pg_ctl from getting confused.

    Patch applied and backpatched to 9.1.X.

    --
    Bruce Momjian <[email protected]> http://momjian.us
    EnterpriseDB http://enterprisedb.com

    + It's impossible for everything to be true. +
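
The documented advice, not backing up postmaster.pid and postmaster.opts, can be illustrated for a plain file-level copy. This is only a sketch under stated assumptions: the function name clone_datadir is hypothetical, and the copy is assumed to be taken from a stopped server or as part of a proper base backup procedure, since copying a live data directory on its own is not a consistent backup.

```shell
#!/bin/sh
# Hypothetical helper: copy data directory $1 to $2, leaving out the two
# per-instance files that can confuse pg_ctl on the copy.
clone_datadir() {
    src=$1
    dst=$2
    mkdir -p "$dst" || return 1
    # Copy the directory contents, preserving modes and timestamps.
    cp -Rp "$src"/. "$dst"/ || return 1
    # PostgreSQL refuses to start if the data directory is accessible to
    # group or others, so make the permissions explicit.
    chmod 700 "$dst" || return 1
    # Drop the files that describe the source instance, not the copy.
    rm -f "$dst/postmaster.pid" "$dst/postmaster.opts"
}

# Example (paths are placeholders):
#   clone_datadir /path/to/primary/data /path/to/standby/data
```

Removing the files from the copy rather than the source keeps the running primary untouched.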
  • Magnus Hagander at Oct 12, 2011 at 4:45 pm

    On Tue, Oct 11, 2011 at 23:35, Bruce Momjian wrote:
    > Robert Haas wrote:
    >> On Wed, Mar 23, 2011 at 1:48 AM, Fujii Masao wrote:
    >>> On Sat, Mar 19, 2011 at 10:20 AM, Robert Haas wrote:
    >>>> On Fri, Mar 18, 2011 at 1:19 PM, Erik Rijkers wrote:
    >>>>> This is OK and expected.  But then it continues (in the logfile) with:
    >>>>>
    >>>>> FATAL:  lock file "postmaster.pid" already exists
    >>>>> HINT:  Is another postmaster (PID 20519) running in data directory
    >>>>> "/var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/data"?
    >>>>>
    >>>>> So, complaints about the *other* instance.  It doesn't happen once a successful start (with pg_ctl
    >>>>> start) has happened.
    >>>> I'm guessing that leftover postmaster.pid contents might be
    >>>> responsible for this?
    >>> The cause is that "pg_ctl restart" uses the postmaster.opts which was
    >>> created in the primary. Since its content was something like
    >>> "pg_ctl -D vanilla_1/data", vanilla_1/data/postmaster.pid was checked
    >>> wrongly.
    >>>
    >>> The simple workaround is to exclude postmaster.opts from the backup
    >>> as well as postmaster.pid. But when postmaster.opts doesn't exist,
    >>> "pg_ctl restart" cannot start up the server. We might also need to change
    >>> the code of "pg_ctl restart" so that it does just "pg_ctl start" when
    >>> postmaster.opts doesn't exist.
    >> Sounds reasonable.
    > I looked over this issue and I don't think having pg_ctl restart fall
    > back to 'start' is a good solution.  I am concerned about cases where we
    > start a different server without shutting down the old server, for some
    > reason.  When they say 'restart', I think we have to assume they want a
    > restart.
    >
    > What I did do was to document that not backing up postmaster.pid and
    > postmaster.opts might help prevent pg_ctl from getting confused.

    Should we exclude postmaster.opts from streaming base backups? We
    already exclude postmaster.pid...
  • Bruce Momjian at Oct 12, 2011 at 6:11 pm

    Magnus Hagander wrote:
    >> I looked over this issue and I don't think having pg_ctl restart fall
    >> back to 'start' is a good solution.  I am concerned about cases where we
    >> start a different server without shutting down the old server, for some
    >> reason.  When they say 'restart', I think we have to assume they want a
    >> restart.
    >>
    >> What I did do was to document that not backing up postmaster.pid and
    >> postmaster.opts might help prevent pg_ctl from getting confused.
    > Should we exclude postmaster.opts from streaming base backups? We
    > already exclude postmaster.pid...

    Uh, I think so, unless my analysis was wrong.

    --
    Bruce Momjian <[email protected]> http://momjian.us
    EnterpriseDB http://enterprisedb.com

    + It's impossible for everything to be true. +
  • Magnus Hagander at Oct 18, 2011 at 2:02 pm

    On Wednesday, October 12, 2011, Bruce Momjian wrote:
    > Magnus Hagander wrote:
    >>> I looked over this issue and I don't think having pg_ctl restart fall
    >>> back to 'start' is a good solution.  I am concerned about cases where we
    >>> start a different server without shutting down the old server, for some
    >>> reason.  When they say 'restart', I think we have to assume they want a
    >>> restart.
    >>>
    >>> What I did do was to document that not backing up postmaster.pid and
    >>> postmaster.opts might help prevent pg_ctl from getting confused.
    >> Should we exclude postmaster.opts from streaming base backups? We
    >> already exclude postmaster.pid...
    > Uh, I think so, unless my analysis was wrong.

    Ok, fixed and applied.

    //Magnus
  • Fujii Masao at Oct 18, 2011 at 4:19 pm

    On Tue, Oct 18, 2011 at 11:02 PM, Magnus Hagander wrote:
    > Ok, fixed and applied.

    You seem to have forgotten to change protocol.sgml.
    Patch attached.

    Regards,

    --
    Fujii Masao
    NIPPON TELEGRAPH AND TELEPHONE CORPORATION
    NTT Open Source Software Center
  • Robert Haas at Oct 19, 2011 at 4:20 am

    On Tue, Oct 18, 2011 at 12:18 PM, Fujii Masao wrote:
    > On Tue, Oct 18, 2011 at 11:02 PM, Magnus Hagander wrote:
    >> Ok, fixed and applied.
    > You seem to have forgotten to change protocol.sgml.
    > Patch attached.

    Committed.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Fujii Masao at Oct 18, 2011 at 5:03 pm

    Oh, sorry for repeating the same posts. Gmail seems not to have worked
    properly... :(

    On Wed, Oct 19, 2011 at 1:24 AM, Fujii Masao wrote:
    > On Tue, Oct 18, 2011 at 11:02 PM, Magnus Hagander wrote:
    >> Ok, fixed and applied.
    > You seem to have forgotten to change protocol.sgml.
    > Patch attached.
    >
    > Regards,
    >
    > --
    > Fujii Masao
    > NIPPON TELEGRAPH AND TELEPHONE CORPORATION
    > NTT Open Source Software Center

    --
    Fujii Masao
    NIPPON TELEGRAPH AND TELEPHONE CORPORATION
    NTT Open Source Software Center
  • Bruce Momjian at Apr 26, 2011 at 7:53 pm

    Fujii Masao wrote:
    > On Sat, Mar 19, 2011 at 10:20 AM, Robert Haas wrote:
    >> On Fri, Mar 18, 2011 at 1:19 PM, Erik Rijkers wrote:
    >>> This is OK and expected.  But then it continues (in the logfile) with:
    >>>
    >>> FATAL:  lock file "postmaster.pid" already exists
    >>> HINT:  Is another postmaster (PID 20519) running in data directory
    >>> "/var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/data"?
    >>>
    >>> So, complaints about the *other* instance.  It doesn't happen once a successful start (with pg_ctl
    >>> start) has happened.
    >> I'm guessing that leftover postmaster.pid contents might be
    >> responsible for this?
    > The cause is that "pg_ctl restart" uses the postmaster.opts which was
    > created in the primary. Since its content was something like
    > "pg_ctl -D vanilla_1/data", vanilla_1/data/postmaster.pid was checked
    > wrongly.

    FYI, my "The Magic of Hot Streaming Replication" talk shows this exact
    issue on slide 16:

    http://momjian.us/main/presentations/features.html#hot_streaming

    Remove /data2/postmaster.pid so the standby server does not see the
    primary server's PID as its own:

    rm /u/pg/data2/postmaster.pid

    This is because my demo creates the standby on the same machine as the
    master, so the PID is still valid and owned by 'postgres', which is what
    the user is reporting.

    --
    Bruce Momjian <[email protected]> http://momjian.us
    EnterpriseDB http://enterprisedb.com

    + It's impossible for everything to be true. +

Discussion Overview
group: pgsql-hackers
categories: postgresql
posted: Mar 18, '11 at 5:19p
active: Oct 19, '11 at 4:20a
posts: 16
users: 5
website: postgresql.org...
irc: #postgresql
