My database crashed a couple of days ago during an upgrade several seconds
after committing a large transaction to the database. Eventually we found
out that this was due to the disk being full as the transaction had created
several gigs of data. A day or so later the disk is full again and
PostgreSQL crashes due to the pg_xlog file taking up all of the disk space.
I have cleaned up the drive to have so extra space which allows PostgreSQL
to start again but the xlogs are still increasing. I have two errors in my
pg_log:

"WARNING: transaction log file "00000001000000A800000078" could not be
archived: too many failures" and

"LOG: archive command failed with exit code 1
DETAIL: The failed archive command was: test ! -f
/opt/postgres/remote_pgsql/wal_archive/00000001000000A800000078 && cp
pg_xlog/00000001000000A800000078
/opt/postgres/remote_pgsql/wal_archive/00000001000000A800000078"

Postgres version 9.0.3 conf:

    -
    wal_level = hot_standby
    archive_mode = true
    archive_command = 'test ! -f /opt/postgres/remote_pgsql/wal_archive/%f
    && cp %p /opt/postgres/remote_pgsql/wal_archive/%f' # command to use to
    archive a logfile segment
    archive_timeout = 1800
    max_wal_senders = 1
    max_standby_archive_delay = 900s
    max_standby_streaming_delay = 900s
    default_statistics_target = 50 # pgtune wizard 2010-11-18
    maintenance_work_mem = 480MB # pgtune wizard 2010-11-18
    constraint_exclusion = on # pgtune wizard 2010-11-18
    checkpoint_completion_target = 0.9 # pgtune wizard 2010-11-18
    effective_cache_size = 5632MB # pgtune wizard 2010-11-18
    work_mem = 48MB # pgtune wizard 2010-11-18
    wal_buffers = 8MB # pgtune wizard 2010-11-18
    checkpoint_segments = 16 # pgtune wizard 2010-11-18
    shared_buffers = 1920MB # pgtune wizard 2010-11-18
    max_connections = 80 # pgtune wizard 2010-11-18


I've tried stopping postgres and then deleting the
00000001000000A800000078.ready file and starting postgres but this appears
to be recreated instantly and the error is still in the log file.

I've read about the pg_reset_xlog() command but with having to pg_dump our
db with a large amount of blobs and restoring it again is highly
problematic as the pg_restore has struggled to restore.

Will setting zero_damaged_pages (true) work in 9.0.1 and would this resolve
the issue?

Would creating an empty file and replacing the offending xlog work, would
this need to be a specific size?

Any ideas?

Search Discussions

  • Adrian Klaver at Jan 22, 2013 at 2:48 pm

    On 01/22/2013 03:12 AM, Cliff de Carteret wrote:
    My database crashed a couple of days ago during an upgrade several
    seconds after committing a large transaction to the database. Eventually
    we found out that this was due to the disk being full as the transaction
    had created several gigs of data. A day or so later the disk is full
    again and PostgreSQL crashes due to the pg_xlog file taking up all of
    the disk space. I have cleaned up the drive to have so extra space which
    allows PostgreSQL to start again but the xlogs are still increasing. I
    have two errors in my pg_log:

    "WARNING: transaction log file "00000001000000A800000078" could not be
    archived: too many failures" and

    "LOG: archive command failed with exit code 1
    DETAIL: The failed archive command was: test ! -f
    /opt/postgres/remote_pgsql/wal_archive/00000001000000A800000078 && cp
    pg_xlog/00000001000000A800000078
    /opt/postgres/remote_pgsql/wal_archive/00000001000000A800000078"
    I am not seeing it below, but just to be complete is wal_keep_segments
    set to something greater than 0?


    Just to be clear which xlogs are filling up, the ones in the original
    data directory or in the archive directory?


    Postgres version 9.0.3 conf:

    *


    wal_level = hot_standby
    archive_mode = true
    archive_command = 'test ! -f
    /opt/postgres/remote_pgsql/wal_archive/%f && cp %p
    /opt/postgres/remote_pgsql/wal_archive/%f' # command to use to
    archive a logfile segment
    archive_timeout = 1800
    max_wal_senders = 1
    max_standby_archive_delay = 900s
    max_standby_streaming_delay = 900s
    default_statistics_target = 50 # pgtune wizard 2010-11-18
    maintenance_work_mem = 480MB # pgtune wizard 2010-11-18
    constraint_exclusion = on # pgtune wizard 2010-11-18
    checkpoint_completion_target = 0.9 # pgtune wizard 2010-11-18
    effective_cache_size = 5632MB # pgtune wizard 2010-11-18
    work_mem = 48MB # pgtune wizard 2010-11-18
    wal_buffers = 8MB # pgtune wizard 2010-11-18
    checkpoint_segments = 16 # pgtune wizard 2010-11-18
    shared_buffers = 1920MB # pgtune wizard 2010-11-18
    max_connections = 80 # pgtune wizard 2010-11-18
    --
    Adrian Klaver
    adrian.klaver@gmail.com
  • Cliff de Carteret at Jan 22, 2013 at 2:59 pm
    Hello, thanks for your response.

    The xlogs that are filling up are in the original data directory and the
    wal_keep_segments is commented out on my master and slave postgresql.conf

    Cliff

    On 22 January 2013 14:48, Adrian Klaver wrote:
    On 01/22/2013 03:12 AM, Cliff de Carteret wrote:

    My database crashed a couple of days ago during an upgrade several
    seconds after committing a large transaction to the database. Eventually
    we found out that this was due to the disk being full as the transaction
    had created several gigs of data. A day or so later the disk is full
    again and PostgreSQL crashes due to the pg_xlog file taking up all of
    the disk space. I have cleaned up the drive to have so extra space which
    allows PostgreSQL to start again but the xlogs are still increasing. I
    have two errors in my pg_log:

    "WARNING: transaction log file "00000001000000A800000078" could not be
    archived: too many failures" and

    "LOG: archive command failed with exit code 1
    DETAIL: The failed archive command was: test ! -f
    /opt/postgres/remote_pgsql/**wal_archive/**00000001000000A800000078 && cp
    pg_xlog/**00000001000000A800000078
    /opt/postgres/remote_pgsql/**wal_archive/**00000001000000A800000078"
    I am not seeing it below, but just to be complete is wal_keep_segments set
    to something greater than 0?


    Just to be clear which xlogs are filling up, the ones in the original data
    directory or in the archive directory?



    Postgres version 9.0.3 conf:


    *


    wal_level = hot_standby
    archive_mode = true
    archive_command = 'test ! -f
    /opt/postgres/remote_pgsql/**wal_archive/%f && cp %p
    /opt/postgres/remote_pgsql/**wal_archive/%f' # command to use to
    archive a logfile segment
    archive_timeout = 1800
    max_wal_senders = 1
    max_standby_archive_delay = 900s
    max_standby_streaming_delay = 900s
    default_statistics_target = 50 # pgtune wizard 2010-11-18
    maintenance_work_mem = 480MB # pgtune wizard 2010-11-18
    constraint_exclusion = on # pgtune wizard 2010-11-18
    checkpoint_completion_target = 0.9 # pgtune wizard 2010-11-18
    effective_cache_size = 5632MB # pgtune wizard 2010-11-18
    work_mem = 48MB # pgtune wizard 2010-11-18
    wal_buffers = 8MB # pgtune wizard 2010-11-18
    checkpoint_segments = 16 # pgtune wizard 2010-11-18
    shared_buffers = 1920MB # pgtune wizard 2010-11-18
    max_connections = 80 # pgtune wizard 2010-11-18

    --
    Adrian Klaver
    adrian.klaver@gmail.com
  • Adrian Klaver at Jan 22, 2013 at 3:34 pm

    On 01/22/2013 06:59 AM, Cliff de Carteret wrote:
    Hello, thanks for your response.

    The xlogs that are filling up are in the original data directory and the
    wal_keep_segments is commented out on my master and slave postgresql.conf
    It is expected that there may a certain increase in WAL files.
       In particular: checkpoint_segments = 16

    http://www.postgresql.org/docs/9.2/static/runtime-config-wal.html#GUC-CHECKPOINT-SEGMENTS
    "
    checkpoint_segments (integer)
    Maximum number of log file segments between automatic WAL checkpoints
    (each segment is normally 16 megabytes). The default is three segments.
    Increasing this parameter can increase the amount of time needed for
    crash recovery. This parameter can only be set in the postgresql.conf
    file or on the server command line.
    "

    For more information that explains the above see:
    http://www.postgresql.org/docs/9.2/static/wal-configuration.html

    More questions:

    Does it look like any WAL files in the pg_xlog directory are being recycled?

    Re your archive error from your previous post, is your archive directory
    full?

          Assuming the archive directory is remote, is it reachable?
    Cliff

    --
    Adrian Klaver
    adrian.klaver@gmail.com
  • Cliff de Carteret at Jan 22, 2013 at 3:57 pm
    There are no files in the pg_xlog directory from before the database crash
    except one file which ends in .backup, there are 759 files. In the
    pg_xlog/archive_status folder there are 755 files.
    The local ../wal_archive folder has no files in it at all whereas the
    remote wal_archive folder has 147 files present.

    The remote archive folder is not full and has ~14G whereas the local
    archive folder was previously full but now it has been moved so it has 100G+

    The archive directory is a mount and as the user postgres I can copy a file
    into it successfully.

    The current setup has been working successfully for several years until the
    recent database crash

    On 22 January 2013 15:34, Adrian Klaver wrote:
    On 01/22/2013 06:59 AM, Cliff de Carteret wrote:

    Hello, thanks for your response.

    The xlogs that are filling up are in the original data directory and the
    wal_keep_segments is commented out on my master and slave postgresql.conf
    It is expected that there may a certain increase in WAL files.
    In particular: checkpoint_segments = 16

    http://www.postgresql.org/**docs/9.2/static/runtime-**config-wal.html#GUC-
    **CHECKPOINT-SEGMENTS<http://www.postgresql.org/docs/9.2/static/runtime-config-wal.html#GUC-CHECKPOINT-SEGMENTS>
    "
    checkpoint_segments (integer)
    Maximum number of log file segments between automatic WAL checkpoints
    (each segment is normally 16 megabytes). The default is three segments.
    Increasing this parameter can increase the amount of time needed for crash
    recovery. This parameter can only be set in the postgresql.conf file or on
    the server command line.
    "

    For more information that explains the above see:
    http://www.postgresql.org/**docs/9.2/static/wal-**configuration.html<http://www.postgresql.org/docs/9.2/static/wal-configuration.html>

    More questions:

    Does it look like any WAL files in the pg_xlog directory are being
    recycled?

    Re your archive error from your previous post, is your archive directory
    full?

    Assuming the archive directory is remote, is it reachable?

    Cliff

    --
    Adrian Klaver
    adrian.klaver@gmail.com
  • Adrian Klaver at Jan 22, 2013 at 4:15 pm

    On 01/22/2013 07:57 AM, Cliff de Carteret wrote:
    There are no files in the pg_xlog directory from before the database
    crash except one file which ends in .backup, there are 759 files. In the
    pg_xlog/archive_status folder there are 755 files.
    The local ../wal_archive folder has no files in it at all whereas the
    remote wal_archive folder has 147 files present.
    What is this local wal_archive directory?
      From a previous post:

    "LOG: archive command failed with exit code 1
    DETAIL: The failed archive command was: test ! -f
    /opt/postgres/remote_pgsql/wal_archive/00000001000000A800000078 && cp
    pg_xlog/00000001000000A800000078
    /opt/postgres/remote_pgsql/wal_archive/00000001000000A800000078

    I see only /opt/postgres/remote_pgsql/wal_archive which I assumed was
    the remote. I am not sure where the local one fits in?
    The remote archive folder is not full and has ~14G whereas the local
    archive folder was previously full but now it has been moved so it has 100G+

    The archive directory is a mount and as the user postgres I can copy a
    file into it successfully.

    The current setup has been working successfully for several years until
    the recent database crash
    So what did you do to get the database running again?


    --
    Adrian Klaver
    adrian.klaver@gmail.com
  • Cliff de Carteret at Jan 22, 2013 at 4:28 pm
    The local wal_archive directory is a directory named "wal_archive" which is
    at the same file system level as the data directory. The time stamp states
    that it has not been used for several years so it looks to be redundant.

    To get the database running again I moved an old postgres installation to a
    backups folder which gave me ~2G of space to start up again until I was
    able to mount a new disk and move the pg_xlog to a separate partition using
    a symbolic link.

    On 22 January 2013 16:15, Adrian Klaver wrote:
    On 01/22/2013 07:57 AM, Cliff de Carteret wrote:

    There are no files in the pg_xlog directory from before the database
    crash except one file which ends in .backup, there are 759 files. In the
    pg_xlog/archive_status folder there are 755 files.
    The local ../wal_archive folder has no files in it at all whereas the
    remote wal_archive folder has 147 files present.
    What is this local wal_archive directory?
    From a previous post:


    "LOG: archive command failed with exit code 1
    DETAIL: The failed archive command was: test ! -f
    /opt/postgres/remote_pgsql/**wal_archive/**00000001000000A800000078 && cp
    pg_xlog/**00000001000000A800000078 /opt/postgres/remote_pgsql/**
    wal_archive/**00000001000000A800000078

    I see only /opt/postgres/remote_pgsql/**wal_archive which I assumed was
    the remote. I am not sure where the local one fits in?


    The remote archive folder is not full and has ~14G whereas the local
    archive folder was previously full but now it has been moved so it has
    100G+

    The archive directory is a mount and as the user postgres I can copy a
    file into it successfully.

    The current setup has been working successfully for several years until
    the recent database crash
    So what did you do to get the database running again?


    --
    Adrian Klaver
    adrian.klaver@gmail.com
  • Kevin Grittner at Jan 22, 2013 at 4:07 pm

    Cliff de Carteret wrote:

    The current setup has been working successfully for several years
    until the recent database crash
    What file does the server log say it is trying to archive? What
    error are you getting? Does that filename already exist on the
    archive (or some intermediate location used by the archive command
    or script)?

    -Kevin
  • Kevin Grittner at Jan 22, 2013 at 4:43 pm
    [Please keep the list copied, and put your reply in-line instead
    of at the top.]

    Cliff de Carteret wrote:
    On 22 January 2013 16:07, Kevin Grittner wrote:

    Cliff de Carteret wrote:
    The current setup has been working successfully for several years
    until the recent database crash
    What file does the server log say it is trying to archive? What
    error are you getting? Does that filename already exist on the
    archive (or some intermediate location used by the archive command
    or script)?
    The sever log is (repeated constantly):

    LOG: archive command failed with exit code 1
    DETAIL: The failed archive command was: test ! -f
    /opt/postgres/remote_pgsql/wal_archive/00000001000000A800000078 && cp
    pg_xlog/00000001000000A800000078
    /opt/postgres/remote_pgsql/wal_archive/00000001000000A800000078
    WARNING: transaction log file "00000001000000A800000078" could not be
    archived: too many failures

    The file 00000001000000A800000078 exists in the remote archive's
    wal_archive directory. I read a post saying to copy the file over to the
    archive and then delete the .ready file to get postgres to move onto the
    next file but this ended up logging out saying that a log file was missing.
    There are more recent files in this directory but they end at the point
    where I reverted all of the changes I made last night when time was running
    out and the database had to be put back to a known state.
    I would have deleted (or renamed) the copy in the archive
    directory. Archiving should have then resumed and cleaned up the
    pg_xlog directory.

    -Kevin
  • Cliff de Carteret at Jan 22, 2013 at 5:23 pm

    On 22 January 2013 16:43, Kevin Grittner wrote:

    [Please keep the list copied, and put your reply in-line instead
    of at the top.]

    Cliff de Carteret wrote:
    On 22 January 2013 16:07, Kevin Grittner wrote:

    Cliff de Carteret wrote:
    The current setup has been working successfully for several years
    until the recent database crash
    What file does the server log say it is trying to archive? What
    error are you getting? Does that filename already exist on the
    archive (or some intermediate location used by the archive command
    or script)?
    The sever log is (repeated constantly):

    LOG: archive command failed with exit code 1
    DETAIL: The failed archive command was: test ! -f
    /opt/postgres/remote_pgsql/wal_archive/00000001000000A800000078 && cp
    pg_xlog/00000001000000A800000078
    /opt/postgres/remote_pgsql/wal_archive/00000001000000A800000078
    WARNING: transaction log file "00000001000000A800000078" could not be
    archived: too many failures

    The file 00000001000000A800000078 exists in the remote archive's
    wal_archive directory. I read a post saying to copy the file over to the
    archive and then delete the .ready file to get postgres to move onto the
    next file but this ended up logging out saying that a log file was missing.
    There are more recent files in this directory but they end at the point
    where I reverted all of the changes I made last night when time was running
    out and the database had to be put back to a known state.
    I would have deleted (or renamed) the copy in the archive
    directory. Archiving should have then resumed and cleaned up the
    pg_xlog directory.
    I have now deleted the copy on the remote wal_archive folder and the
    archiving is now functioning and sending the logs from the local to the
    remote folder. The remote database does not startup and the following is in
    the log:

    LOG: database system was shut down in recovery at 2013-01-22 10:54:48 GMT
    LOG: entering standby mode
    LOG: restored log file "00000001000000AB00000051" from archive
    LOG: invalid resource manager ID in primary checkpoint record
    PANIC: could not locate a valid checkpoint record
    LOG: startup process (PID 22350) was terminated by signal 6: Aborted
    LOG: aborting startup due to startup process failure

    00000001000000AB00000051 is in my remote database's pg_xlog folder

    Thanks for your help already!


    -Kevin
  • Kevin Grittner at Jan 22, 2013 at 6:34 pm

    Cliff de Carteret wrote:

    I have now deleted the copy on the remote wal_archive folder and the
    archiving is now functioning and sending the logs from the local to the
    remote folder. The remote database does not startup and the following is in
    the log:

    LOG: database system was shut down in recovery at 2013-01-22 10:54:48 GMT
    LOG: entering standby mode
    LOG: restored log file "00000001000000AB00000051" from archive
    LOG: invalid resource manager ID in primary checkpoint record
    PANIC: could not locate a valid checkpoint record
    LOG: startup process (PID 22350) was terminated by signal 6: Aborted
    LOG: aborting startup due to startup process failure

    00000001000000AB00000051 is in my remote database's pg_xlog folder
    Any chance that there was a pg_start_backup() call on the master without
    a matching pg_stop_backup() call?

    -Kevin
  • Cliff de Carteret at Jan 23, 2013 at 9:14 am
    I had not run these commands on the master as I was only doing sql updates
    ~1mil of them

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
grouppgsql-general @
categoriespostgresql
postedJan 22, '13 at 11:13a
activeJan 23, '13 at 9:14a
posts12
users3
websitepostgresql.org
irc#postgresql

People

Translate

site design / logo © 2021 Grokbase