  • Peter Bauer at Nov 25, 2007 at 10:28 am
    Hi all,

    I have a system here with two 2.4GHz Xeon processors, 2GB RAM, ONE disk on
    a battery-backed write cache SCSI controller, and PostgreSQL 8.1.4
    running with the data on a DRBD device for high availability. The
    database is also replicated to two similar machines with slony1.

    Since the load average is between 1 (most of the time) and 10 (peaks), I
    am worried about the load, so I ran vmstat and iostat, which show that
    1000-6000 blocks are written per second. Please check the attached output
    for further details.
    top shows that the CPUs are at least 80% idle most of the time, so I
    think there is an I/O bottleneck. I'm aware that this hardware setup is
    probably not sufficient, but I would like to investigate how critical
    the situation is.

    thanks,
    Peter
  • Pavel Stehule at Nov 25, 2007 at 10:47 am
    Hello

    Maybe you need to tune the bgwriter. Please read this article:
    http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm

    Regards
    Pavel Stehule
    On 25/11/2007, Peter Bauer wrote:

    Hi all,

    I have a system here with two 2.4GHz Xeon processors, 2GB RAM, ONE disk on
    a battery-backed write cache SCSI controller, and PostgreSQL 8.1.4
    running with the data on a DRBD device for high availability. The
    database is also replicated to two similar machines with slony1.

    Since the load average is between 1 (most of the time) and 10 (peaks), I
    am worried about the load, so I ran vmstat and iostat, which show that
    1000-6000 blocks are written per second. Please check the attached output
    for further details.
    top shows that the CPUs are at least 80% idle most of the time, so I
    think there is an I/O bottleneck. I'm aware that this hardware setup is
    probably not sufficient, but I would like to investigate how critical
    the situation is.

    thanks,
    Peter



    procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
    r b swpd free buff cache si so bi bo in cs us sy id wa
    0 0 398256 78328 140612 1063556 0 0 0 1472 2029 5081 4 3 92 0
    0 2 398256 78268 140612 1063576 0 0 0 2304 1928 4216 0 2 98 0
    1 2 398256 78100 140612 1063576 0 0 0 1464 1716 3994 1 3 96 0
    0 0 398256 78704 140612 1063592 0 0 0 916 1435 3570 5 3 91 0
    0 0 398256 77876 140612 1063616 0 0 0 0 305 1169 3 1 96 0
    0 1 398256 79200 140612 1063628 0 0 0 2404 2787 6555 7 6 86 0
    0 2 398256 79256 140612 1063660 0 0 0 1564 1640 3577 4 2 94 0
    0 0 398256 79324 140612 1063692 0 0 0 1660 2322 5615 5 2 93 0
    0 0 398256 78668 140612 1063724 0 0 0 1048 1390 4197 16 13 72 0
    1 0 398256 79344 140612 1063748 0 0 0 1896 2416 5675 5 5 89 0
    0 0 398256 79336 140612 1063784 0 0 0 1176 1720 4436 5 3 92 0
    1 2 398256 79280 140612 1063812 0 0 0 1800 1697 3931 3 3 94 0
    0 1 398256 79048 140612 1063832 0 0 0 1384 1733 4137 4 2 93 0
    0 1 398256 78792 140612 1063868 0 0 0 1672 2163 5241 3 2 94 0
    0 0 398256 77608 140612 1063880 0 0 0 1088 1638 3484 4 2 93 0
    0 0 398256 79108 140612 1063884 0 0 0 1568 2103 5382 7 5 88 0
    0 0 398256 79100 140612 1063892 0 0 0 1556 1394 3135 2 1 97 0
    0 2 398256 79084 140612 1063900 0 0 0 1644 2072 4953 2 0 97 0
    0 0 398256 79060 140612 1063932 0 0 0 1240 1714 3888 5 2 92 0
    0 2 398256 79032 140612 1063940 0 0 0 1328 1694 4135 4 2 94 0
    0 1 398256 78452 140612 1063944 0 0 0 620 925 2824 6 7 87 0
    0 0 398256 79036 140612 1063956 0 0 0 1196 1293 2954 6 7 87 0
    0 2 398256 79136 140612 1063964 0 0 0 1736 1959 4494 4 2 94 0
    0 0 398256 79132 140612 1063964 0 0 0 4 260 1039 1 1 98 0
    0 0 398256 79052 140612 1063980 0 0 0 2444 3084 6955 6 5 89 0
    0 2 398256 79060 140612 1063988 0 0 0 948 1146 3616 3 1 96 0
    0 1 398256 78268 140612 1064056 0 0 0 1908 1809 4086 6 5 88 0
    0 1 398256 76728 140612 1064056 0 0 0 6256 6637 15472 5 5 90 0
    0 2 398256 77000 140612 1064064 0 0 0 4916 5840 12107 1 4 95 0
    0 2 398256 76956 140612 1064068 0 0 0 6468 7432 15211 1 3 96 0
    0 6 398256 77388 140612 1064072 0 0 0 8116 7826 18265 1 8 91 0
    0 2 398256 74312 140612 1064076 0 0 0 7032 6886 16136 2 7 91 0
    0 2 398256 74264 140612 1064076 0 0 0 5680 7143 13411 0 5 95 0
    0 2 398256 72980 140612 1064140 0 0 0 5396 6377 13251 6 6 88 0
    0 3 398256 76972 140612 1064148 0 0 0 5652 6793 14079 4 9 87 0
    0 2 398256 77836 140612 1064148 0 0 0 3968 5321 14187 10 8 82 0
    1 0 398256 77280 140612 1064148 0 0 0 1608 3188 8974 21 12 67 0
    1 0 398256 77832 140612 1064152 0 0 0 236 834 2625 7 5 87 0
    0 0 398256 77464 140612 1064152 0 0 0 244 505 1378 2 4 94 0
    1 0 398256 77828 140612 1064164 0 0 0 316 580 1954 7 2 91 0
    0 0 398256 77804 140612 1064180 0 0 0 740 673 2248 2 2 96 0
    0 0 398256 77000 140612 1064180 0 0 0 304 589 1739 1 3 96 0
    0 0 398256 77000 140612 1064184 0 0 0 0 216 886 0 1 99 0
    0 0 398256 75452 140612 1064184 0 0 0 432 755 2032 6 1 93 0
    0 0 398256 76964 140616 1064200 0 0 0 1980 2722 6452 7 2 91 0
    0 0 398256 77040 140616 1064212 0 0 0 1524 1536 3635 5 1 93 0
    0 2 398256 77028 140616 1064216 0 0 0 1136 1489 3342 1 2 97 0
    0 0 398256 76976 140616 1064236 0 0 0 1392 1831 4468 3 2 94 0
    0 0 398256 76944 140616 1064256 0 0 0 932 1374 3318 7 8 85 0
    0 0 398256 76856 140616 1064276 0 0 0 1392 1773 3985 5 4 91 0
    1 4 398256 76536 140616 1064448 0 0 0 1432 1537 3950 4 2 94 0
    0 0 398256 76640 140616 1064308 0 0 0 560 865 2577 6 3 91 0
    0 0 398256 76368 140616 1064328 0 0 0 1752 1973 4424 3 3 94 0
    0 0 398256 74732 140616 1064340 0 0 0 1244 1807 3758 4 6 90 0
    1 2 398256 76244 140616 1064352 0 0 0 876 1189 3579 4 2 94 0
    0 0 398256 76192 140616 1064360 0 0 0 1856 1623 3727 5 4 91 0
    0 0 398256 76236 140616 1064372 0 0 0 1292 1748 3664 5 4 91 0
    0 0 398256 76548 140616 1064392 0 0 0 0 338 1810 11 5 84 0
    0 0 398256 76540 140616 1064392 0 0 0 0 278 1055 2 4 93 0


    Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
    dev8-0 241.00 0.00 6344.00 0 6344
    dev8-1 0.00 0.00 0.00 0 0

    Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
    dev8-0 39.00 0.00 1240.00 0 1240
    dev8-1 0.00 0.00 0.00 0 0

    Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
    dev8-0 170.00 0.00 3944.00 0 3944
    dev8-1 0.00 0.00 0.00 0 0

    Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
    dev8-0 173.00 0.00 3880.00 0 3880
    dev8-1 0.00 0.00 0.00 0 0

    Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
    dev8-0 73.00 0.00 2656.00 0 2656
    dev8-1 0.00 0.00 0.00 0 0

    Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
    dev8-0 120.00 0.00 2656.00 0 2656
    dev8-1 0.00 0.00 0.00 0 0

    Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
    dev8-0 81.00 0.00 2184.00 0 2184
    dev8-1 0.00 0.00 0.00 0 0

    Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
    dev8-0 133.00 0.00 3144.00 0 3144
    dev8-1 0.00 0.00 0.00 0 0

    Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
    dev8-0 115.00 0.00 2936.00 0 2936
    dev8-1 0.00 0.00 0.00 0 0

    Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
    dev8-0 64.00 0.00 2080.00 0 2080
    dev8-1 0.00 0.00 0.00 0 0

    Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
    dev8-0 176.00 0.00 4384.00 0 4384
    dev8-1 0.00 0.00 0.00 0 0

    Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
    dev8-0 172.00 0.00 3296.00 0 3296
    dev8-1 0.00 0.00 0.00 0 0

    Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
    dev8-0 191.00 0.00 3880.00 0 3880
    dev8-1 0.00 0.00 0.00 0 0

    Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
    dev8-0 144.00 0.00 2464.00 0 2464
    dev8-1 0.00 0.00 0.00 0 0



    #---------------------------------------------------------------------------
    # FILE LOCATIONS
    #---------------------------------------------------------------------------

    # The default values of these variables are driven from the -D command line
    # switch or PGDATA environment variable, represented here as ConfigDir.

    data_directory = '/clstsql/data' # btaf: Data Directory on the drbd device, APUS

    #hba_file = 'ConfigDir/pg_hba.conf' # host-based authentication file
    #ident_file = 'ConfigDir/pg_ident.conf' # IDENT configuration file

    # If external_pid_file is not explicitly set, no extra pid file is written.
    #external_pid_file = '(none)' # write an extra pid file


    #---------------------------------------------------------------------------
    # CONNECTIONS AND AUTHENTICATION
    #---------------------------------------------------------------------------

    # - Connection Settings -

    listen_addresses = '*' # btaf: '*' = Listen to all addresses
    port = 5432 # btaf: Use default port for PostgreSQL, APUS

    max_connections = 400 # btaf: APUS
    # note: increasing max_connections costs ~400 bytes of shared memory per
    # connection slot, plus lock space (see max_locks_per_transaction). You
    # might also need to raise shared_buffers to support more connections.
    #superuser_reserved_connections = 2
    unix_socket_directory = '/var/run/postgresql'
    #unix_socket_group = ''
    #unix_socket_permissions = 0777 # octal
    #bonjour_name = '' # defaults to the computer name

    # - Security & Authentication -

    #authentication_timeout = 60 # 1-600, in seconds

    ssl = false # btaf: Do not use ssl for connections, APUS

    #password_encryption = on
    #db_user_namespace = off

    # Kerberos
    #krb_server_keyfile = ''
    #krb_srvname = 'postgres'
    #krb_server_hostname = '' # empty string matches any keytab entry
    #krb_caseins_users = off

    # - TCP Keepalives -
    # see 'man 7 tcp' for details

    #tcp_keepalives_idle = 0 # TCP_KEEPIDLE, in seconds;
    # 0 selects the system default
    #tcp_keepalives_interval = 0 # TCP_KEEPINTVL, in seconds;
    # 0 selects the system default
    #tcp_keepalives_count = 0 # TCP_KEEPCNT;
    # 0 selects the system default


    #---------------------------------------------------------------------------
    # RESOURCE USAGE (except WAL)
    #---------------------------------------------------------------------------

    # - Memory -

    shared_buffers = 40000 # min 16 or max_connections*2, 8KB each, APUS
    temp_buffers = 8000 # min 100, 8KB each, APUS
    #max_prepared_transactions = 5 # can be 0 or more
    # note: increasing max_prepared_transactions costs ~600 bytes of shared memory
    # per transaction slot, plus lock space (see max_locks_per_transaction).
    work_mem = 20480 # min 64, size in KB, APUS
    maintenance_work_mem = 51200 # min 1024, size in KB, APUS
    #max_stack_depth = 2048 # min 100, size in KB

    # - Free Space Map -

    max_fsm_pages = 40000 # min max_fsm_relations*16, 6 bytes each, APUS
    #max_fsm_relations = 1000 # min 100, ~70 bytes each

    # - Kernel Resource Usage -

    #max_files_per_process = 1000 # min 25
    #preload_libraries = ''

    # - Cost-Based Vacuum Delay -

    #vacuum_cost_delay = 0 # 0-1000 milliseconds
    #vacuum_cost_page_hit = 1 # 0-10000 credits
    #vacuum_cost_page_miss = 10 # 0-10000 credits
    #vacuum_cost_page_dirty = 20 # 0-10000 credits
    #vacuum_cost_limit = 200 # 0-10000 credits

    # - Background writer -

    bgwriter_delay = 100 # 10-10000 milliseconds between rounds, APUS
    bgwriter_lru_percent = 2.0 # 0-100% of LRU buffers scanned/round, APUS
    bgwriter_lru_maxpages = 10 # 0-1000 buffers max written/round, APUS
    bgwriter_all_percent = 1 # 0-100% of all buffers scanned/round, APUS
    bgwriter_all_maxpages = 10 # 0-1000 buffers max written/round, APUS


    #---------------------------------------------------------------------------
    # WRITE AHEAD LOG
    #---------------------------------------------------------------------------

    # - Settings -

    #fsync = on # turns forced synchronization on or off
    #wal_sync_method = fsync # the default is the first option
    # supported by the operating system:
    # open_datasync
    # fdatasync
    # fsync
    # fsync_writethrough
    # open_sync
    #full_page_writes = on # recover from partial page writes
    #wal_buffers = 8 # min 4, 8KB each
    #commit_delay = 0 # range 0-100000, in microseconds
    #commit_siblings = 5 # range 1-1000

    # - Checkpoints -

    checkpoint_segments = 32 # in logfile segments, min 1, 16MB each, APUS
    #checkpoint_timeout = 300 # range 30-3600, in seconds
    checkpoint_warning = 100 # in seconds, 0 is off, APUS

    # - Archiving -

    #archive_command = '' # command to use to archive a logfile
    # segment


    #---------------------------------------------------------------------------
    # QUERY TUNING
    #---------------------------------------------------------------------------

    # - Planner Method Configuration -

    #enable_bitmapscan = on
    #enable_hashagg = on
    #enable_hashjoin = on
    #enable_indexscan = on
    #enable_mergejoin = on
    #enable_nestloop = on
    #enable_seqscan = on
    #enable_sort = on
    #enable_tidscan = on

    # - Planner Cost Constants -

    effective_cache_size = 89600 # typically 8KB each, APUS
    #random_page_cost = 4 # units are one sequential page fetch
    # cost
    #cpu_tuple_cost = 0.01 # (same)
    #cpu_index_tuple_cost = 0.001 # (same)
    #cpu_operator_cost = 0.0025 # (same)

    # - Genetic Query Optimizer -

    #geqo = on
    #geqo_threshold = 12
    #geqo_effort = 5 # range 1-10
    #geqo_pool_size = 0 # selects default based on effort
    #geqo_generations = 0 # selects default based on effort
    #geqo_selection_bias = 2.0 # range 1.5-2.0

    # - Other Planner Options -

    #default_statistics_target = 10 # range 1-1000
    #constraint_exclusion = off
    #from_collapse_limit = 8
    #join_collapse_limit = 8 # 1 disables collapsing of explicit
    # JOINs


    #---------------------------------------------------------------------------
    # ERROR REPORTING AND LOGGING
    #---------------------------------------------------------------------------

    # - Where to Log -

    #log_destination = 'stderr' # Valid values are combinations of
    # stderr, syslog and eventlog,
    # depending on platform.

    # This is used when logging to stderr:
    #redirect_stderr = off # Enable capturing of stderr into log
    # files

    # These are only used if redirect_stderr is on:
    #log_directory = 'pg_log' # Directory where log files are written
    # Can be absolute or relative to PGDATA
    #log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log' # Log file name pattern.
    # Can include strftime() escapes
    #log_truncate_on_rotation = off # If on, any existing log file of the same
    # name as the new log file will be
    # truncated rather than appended to. But
    # such truncation only occurs on
    # time-driven rotation, not on restarts
    # or size-driven rotation. Default is
    # off, meaning append to existing files
    # in all cases.
    #log_rotation_age = 1440 # Automatic rotation of logfiles will
    # happen after so many minutes. 0 to
    # disable.
    #log_rotation_size = 10240 # Automatic rotation of logfiles will
    # happen after so many kilobytes of log
    # output. 0 to disable.

    # These are relevant when logging to syslog:
    #syslog_facility = 'LOCAL0'
    #syslog_ident = 'postgres'


    # - When to Log -

    #client_min_messages = notice # Values, in order of decreasing detail:
    # debug5
    # debug4
    # debug3
    # debug2
    # debug1
    # log
    # notice
    # warning
    # error

    #log_min_messages = notice # Values, in order of decreasing detail:
    # debug5
    # debug4
    # debug3
    # debug2
    # debug1
    # info
    # notice
    # warning
    # error
    # log
    # fatal
    # panic

    #log_error_verbosity = default # terse, default, or verbose messages

    #log_min_error_statement = panic # Values in order of increasing severity:
    # debug5
    # debug4
    # debug3
    # debug2
    # debug1
    # info
    # notice
    # warning
    # error
    # panic(off)

    log_min_duration_statement = 1000 # -1 is disabled, 0 logs all statements
    # and their durations, in milliseconds.

    #silent_mode = off # DO NOT USE without syslog or
    # redirect_stderr

    # - What to Log -

    #debug_print_parse = off
    #debug_print_rewritten = off
    #debug_print_plan = off
    #debug_pretty_print = off
    #log_connections = off
    #log_disconnections = off
    #log_duration = off
    log_line_prefix = '[%d %m]' # Special values:
    # %u = user name
    # %d = database name
    # %r = remote host and port
    # %h = remote host
    # %p = PID
    # %t = timestamp (no milliseconds)
    # %m = timestamp with milliseconds
    # %i = command tag
    # %c = session id
    # %l = session line number
    # %s = session start timestamp
    # %x = transaction id
    # %q = stop here in non-session
    # processes
    # %% = '%'
    # e.g. '<%u%%%d> '
    #log_statement = 'none' # none, mod, ddl, all
    #log_hostname = off


    #---------------------------------------------------------------------------
    # RUNTIME STATISTICS
    #---------------------------------------------------------------------------

    # - Statistics Monitoring -

    #log_parser_stats = off
    #log_planner_stats = off
    #log_executor_stats = off
    #log_statement_stats = off

    # - Query/Index Statistics Collector -

    #stats_start_collector = on
    stats_command_string = on
    #stats_block_level = off
    stats_row_level = on
    #stats_reset_on_server_start = off


    #---------------------------------------------------------------------------
    # AUTOVACUUM PARAMETERS
    #---------------------------------------------------------------------------

    autovacuum = on # enable autovacuum subprocess?
    autovacuum_naptime = 30 # time between autovacuum runs, in secs, APUS
    autovacuum_vacuum_threshold = 500 # min # of tuple updates before, APUS
    # vacuum
    autovacuum_analyze_threshold = 250 # min # of tuple updates before, APUS
    # analyze
    autovacuum_vacuum_scale_factor = 0.2 # fraction of rel size before, APUS
    # vacuum
    autovacuum_analyze_scale_factor = 0.1 # fraction of rel size before, APUS
    # analyze
    #autovacuum_vacuum_cost_delay = -1 # default vacuum cost delay for
    # autovac, -1 means use
    # vacuum_cost_delay
    #autovacuum_vacuum_cost_limit = -1 # default vacuum cost limit for
    # autovac, -1 means use
    # vacuum_cost_limit


    #---------------------------------------------------------------------------
    # CLIENT CONNECTION DEFAULTS
    #---------------------------------------------------------------------------

    # - Statement Behavior -

    #search_path = '$user,public' # schema names
    #default_tablespace = '' # a tablespace name, '' uses
    # the default
    #check_function_bodies = on
    #default_transaction_isolation = 'read committed'
    #default_transaction_read_only = off
    statement_timeout = 60000 # 0 is disabled, in milliseconds, APUS

    # - Locale and Formatting -

    #datestyle = 'iso, mdy'
    #timezone = unknown # actually, defaults to TZ
    # environment setting
    #australian_timezones = off
    #extra_float_digits = 0 # min -15, max 2
    #client_encoding = sql_ascii # actually, defaults to database
    # encoding

    # These settings are initialized by initdb -- they might be changed
    lc_messages = 'C' # locale for system error message
    # strings
    lc_monetary = 'C' # locale for monetary formatting
    lc_numeric = 'C' # locale for number formatting
    lc_time = 'C' # locale for time formatting

    # - Other Defaults -

    #explain_pretty_print = on
    #dynamic_library_path = '$libdir'


    #---------------------------------------------------------------------------
    # LOCK MANAGEMENT
    #---------------------------------------------------------------------------

    #deadlock_timeout = 1000 # in milliseconds
    #max_locks_per_transaction = 64 # min 10
    # note: each lock table slot uses ~220 bytes of shared memory, and there are
    # max_locks_per_transaction * (max_connections + max_prepared_transactions)
    # lock table slots.


    #---------------------------------------------------------------------------
    # VERSION/PLATFORM COMPATIBILITY
    #---------------------------------------------------------------------------

    # - Previous Postgres Versions -

    add_missing_from = on # btaf: Add missing FROM clauses to improve compatibility with 7.4.x versions

    #backslash_quote = safe_encoding # on, off, or safe_encoding
    #default_with_oids = off
    #escape_string_warning = off
    #regex_flavor = advanced # advanced, extended, or basic
    #sql_inheritance = on

    # - Other Platforms & Clients -

    #transform_null_equals = off


    #---------------------------------------------------------------------------
    # CUSTOMIZED OPTIONS
    #---------------------------------------------------------------------------

    #custom_variable_classes = '' # list of custom variable class names



  • Scott Marlowe at Nov 27, 2007 at 5:19 pm

    On Nov 24, 2007 10:57 AM, Peter Bauer wrote:
    I have a system here with two 2.4GHz Xeon processors, 2GB RAM, ONE disk on
    a battery-backed write cache SCSI controller, and PostgreSQL 8.1.4
    running with the data on a DRBD device for high availability. The
    database is also replicated to two similar machines with slony1.
    Why are you running a version of PostgreSQL with known data eating
    bugs? If you care for your data, you will keep up to date on releases.
    8.1.10 was released on 2007-09-17. 8.1.4 was released on 2006-05-23.
    That's 16 months of bug fixes you're missing. Go here:
    http://www.postgresql.org/docs/8.1/static/release.html and read up on
    the fixes you're missing. Then update. Or just update.
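
    For what it's worth, a minor update within the 8.1 branch does not change
    the on-disk format, so no dump/restore is needed. A rough sketch, assuming
    the server is managed directly with pg_ctl rather than the distribution's
    init scripts:

    $ pg_ctl -D /clstsql/data stop -m fast
    # install the 8.1.10 binaries (package names and build steps vary by system)
    $ pg_ctl -D /clstsql/data start

    The standby and the slony1 replicas would want the same treatment.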

    OK, on to the issue at hand.
    Since the load average is between 1 (most of the time) and 10 (peaks), I
    am worried about the load, so I ran vmstat and iostat, which show that
    1000-6000 blocks are written per second. Please check the attached output
    for further details.
    top shows that the CPUs are at least 80% idle most of the time, so I
    think there is an I/O bottleneck. I'm aware that this hardware setup is
    probably not sufficient, but I would like to investigate how critical
    the situation is.
    Yes. Battery-backed cache can only do so much; it's not magic pixie
    dust. Once it's full, the drive becomes the bottleneck. Real db
    servers have more than one disk drive. They usually have at least 4
    or so, and often dozens to hundreds. Also, not all battery backed
    caching RAID controllers are created equal.
    procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
    r b swpd free buff cache si so bi bo in cs us sy id wa
    0 0 398256 78328 140612 1063556 0 0 0 1472 2029 5081 4 3 92 0
    0 2 398256 78268 140612 1063576 0 0 0 2304 1928 4216 0 2 98 0
    1 2 398256 78100 140612 1063576 0 0 0 1464 1716 3994 1 3 96 0
    0 0 398256 78704 140612 1063592 0 0 0 916 1435 3570 5 3 91 0
    0 0 398256 77876 140612 1063616 0 0 0 0 305 1169 3 1 96 0
    See that dip in the last line above where the blocks in drop to 0,
    idle jumps to 96, and blocks out drops, and context switches drop?
    That's most likely where PostgreSQL is checkpointing. Checkpointing
    is where it writes out all the dirty buffers to disk. If the bgwriter
    is not tuned aggressively enough, checkpoints happen and make the whole
    database slow down for a few seconds. If it's tuned too aggressively
    then the db spends too much CPU time tracking the dirty buffers and
    then writing them. If tuned just right, it will write out the dirty
    buffers just fast enough that a checkpoint is never needed.

    You tune the bgwriter to your machine and I/O subsystem. If you're
    planning on getting more hard drives, do that first. Then tune the
    bgwriter.

    By the way, if this is "vmstat 1" running, it's showing a checkpoint every 20
    or so seconds, I think:
    0 2 398256 79136 140612 1063964 0 0 0 1736 1959 4494 4 2 94 0
    checkpoint here:
    0 0 398256 79132 140612 1063964 0 0 0 4 260 1039 1 1 98 0
    0 0 398256 79052 140612 1063980 0 0 0 2444 3084 6955 6 5 89 0
    0 2 398256 79060 140612 1063988 0 0 0 948 1146 3616 3 1 96 0
    0 1 398256 78268 140612 1064056 0 0 0 1908 1809 4086 6 5 88 0
    0 1 398256 76728 140612 1064056 0 0 0 6256 6637 15472 5 5 90 0
    0 2 398256 77000 140612 1064064 0 0 0 4916 5840 12107 1 4 95 0
    0 2 398256 76956 140612 1064068 0 0 0 6468 7432 15211 1 3 96 0
    0 6 398256 77388 140612 1064072 0 0 0 8116 7826 18265 1 8 91 0
    0 2 398256 74312 140612 1064076 0 0 0 7032 6886 16136 2 7 91 0
    0 2 398256 74264 140612 1064076 0 0 0 5680 7143 13411 0 5 95 0
    0 2 398256 72980 140612 1064140 0 0 0 5396 6377 13251 6 6 88 0
    0 3 398256 76972 140612 1064148 0 0 0 5652 6793 14079 4 9 87 0
    0 2 398256 77836 140612 1064148 0 0 0 3968 5321 14187 10 8 82 0
    1 0 398256 77280 140612 1064148 0 0 0 1608 3188 8974 21 12 67 0
    1 0 398256 77832 140612 1064152 0 0 0 236 834 2625 7 5 87 0
    0 0 398256 77464 140612 1064152 0 0 0 244 505 1378 2 4 94 0
    1 0 398256 77828 140612 1064164 0 0 0 316 580 1954 7 2 91 0
    0 0 398256 77804 140612 1064180 0 0 0 740 673 2248 2 2 96 0
    0 0 398256 77000 140612 1064180 0 0 0 304 589 1739 1 3 96 0
    20 rows later, checkpoint here:
    0 0 398256 77000 140612 1064184 0 0 0 0 216 886 0 1 99 0
    0 0 398256 75452 140612 1064184 0 0 0 432 755 2032 6 1 93 0
    max_fsm_pages = 40000 # min max_fsm_relations*16, 6 bytes each, APUS
    This seems a little low for a busy server.
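
    One way to sanity-check that on 8.1 (just a sketch, not something from the
    thread) is to run a database-wide VACUUM VERBOSE and read the free space map
    summary it prints at the very end:

    # "yourdb" is a placeholder; run as a database superuser
    $ psql -d yourdb -c "VACUUM VERBOSE;" 2>&1 | tail

    The closing INFO/DETAIL lines report how many FSM page slots are in use and
    how many would be needed to track all free space; if the latter exceeds
    max_fsm_pages, raise the setting (that needs a server restart).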
    # - Background writer -

    bgwriter_delay = 100 # 10-10000 milliseconds between rounds, APUS
    bgwriter_lru_percent = 2.0 # 0-100% of LRU buffers scanned/round, APUS
    bgwriter_lru_maxpages = 10 # 0-1000 buffers max written/round, APUS
    bgwriter_all_percent = 1 # 0-100% of all buffers scanned/round, APUS
    bgwriter_all_maxpages = 10 # 0-1000 buffers max written/round, APUS
    So, bgwriter wakes up 10 times a second, and each time it processes 2%
    of the Least Recently Used pages for writing, and writes up to 10 of
    those pages. And it only checks 1% of the total pages and writes 10
    of those at the most. This is not aggressive enough, and given how
    much spare CPU you have left over, you can be a fair bit more
    aggressive. The main thing to increase is the maxes. Try changing
    them to the 100 to 300 range, and maybe increase your percentages to
    5% or so. What we're shooting for is to see those checkpoints go
    away.
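
    To put rough numbers on that: with bgwriter_delay = 100 there are ten rounds
    per second, and with both maxpages settings at 10 the background writer can
    push out at most about 200 buffers (~1.6MB) per second, which is modest next
    to the 1-6MB/s of writes in the vmstat output above. A possible starting
    point in the suggested ranges (illustrative values, not from the thread):

    bgwriter_delay = 100 # unchanged, 10 rounds/second
    bgwriter_lru_percent = 5.0 # was 2.0
    bgwriter_lru_maxpages = 200 # was 10
    bgwriter_all_percent = 5 # was 1
    bgwriter_all_maxpages = 200 # was 10

    These settings take effect on a reload (pg_ctl reload); no restart is needed.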

    Then, when running your benchmark, after a few minutes, run a
    checkpoint by hand and see if you get one of those slow downs like we
    saw in vmstat above. If your bgwriter is tuned properly, you should
    get an almost instant response from the checkpoint and no noticeable
    slow down in the vmstat numbers for context switches per second.
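
    For concreteness, that test could look something like this (psql superuser
    access assumed; "yourdb" is a placeholder):

    $ vmstat 1 > vmstat.log &              # keep collecting numbers during the test
    $ psql -U postgres -d yourdb -c "CHECKPOINT;"

    If the bgwriter is keeping up, the CHECKPOINT should return almost
    immediately and the bo and cs columns in vmstat.log should not show a
    multi-second dip or spike right afterwards.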

    Once you reach the point where the bgwriter is just keeping ahead of
    check points, there's little to be gained in more aggressive tuning of
    the bgwriter and you'll just be chewing up memory and cpu bandwidth if
    you do get too aggressive with it.
  • Peter Bauer at Nov 28, 2007 at 7:19 am

    Hi Scott,

    thank you for the great suggestions, I will keep the list informed.

    br,
    Peter

    --
    Peter Bauer
    APUS Software G.m.b.H.
    A-8074 Raaba, Bahnhofstrasse 1/1
    Email: peter.bauer@apus.co.at
    Tel: +43 316 401629 24
    Fax: +43 316 401629 9
  • Greg Smith at Nov 28, 2007 at 9:03 am

    On Sat, 24 Nov 2007, Peter Bauer wrote:

    top shows that the CPUs are at least 80% idle most of the time so i
    think there is an I/O bottleneck.
    top also shows that you're never waiting for I/O which is usually evidence
    there isn't an I/O bottleneck. You passed along most of the right data,
    but some useful additional things to know are:

    -Actual brand/model of SCSI controller
    -Operating system
    -What time interval the vmstat and iostat information you gave was
    produced at.
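
    For reference, that kind of information can usually be pulled straight off
    the machine (standard tools, assuming a Debian-like system):

    $ lspci | grep -i -E 'raid|scsi'       # controller brand/model
    $ uname -a                             # kernel version and architecture
    $ cat /etc/debian_version              # distribution release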

    I agree with Scott that checkpoints should be considered as a possibility
    here. I'd suggest you set checkpoint_warning to a high value so you get a
    note in the logs every time one happens, then see if those happen at the
    same time as your high load average. More on that topic and how to adjust
    the background writer if that proves to be the cause of your slowdown is
    at http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm
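
    A sketch of what that could look like in postgresql.conf (the value is just
    illustrative; anything much larger than the observed interval between
    checkpoints will do):

    checkpoint_warning = 3600 # warn in the log when segment-driven checkpoints
                              # come less than an hour apart

    The warnings can then be lined up against the load average spikes by
    timestamp.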

    --
    * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
  • Peter Bauer at Nov 28, 2007 at 3:03 pm
    Hi Greg,

    On Wednesday, 28 November 2007, Greg Smith wrote:
    On Sat, 24 Nov 2007, Peter Bauer wrote:
    top shows that the CPUs are at least 80% idle most of the time so i
    think there is an I/O bottleneck.
    top also shows that you're never waiting for I/O which is usually evidence
    there isn't an I/O bottleneck. You passed along most of the right data,
    but some useful additional things to know are:

    -Actual brand/model of SCSI controller
    -Operating system
    -What time interval the vmstat and iostat information you gave was
    produced at.
    here are the hardware specs:
    2x POWEREDGE 2850 - XEON 3.0GHZ/2MB, 800FSB
    2048MB SINGLE RANK DDR2
    73 GB SCSI-Disk , 15.000 rpm, UL
    PERC4E/DI DC ULTRA320 SCSI RAID, 256MB

    It's Debian sarge with kernel 2.4.26.050719-686 #1 SMP.

    vmstat and iostat were running with 1 second intervals.
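
    That is, something along the lines of the following (the exact iostat
    options weren't given):

    $ vmstat 1
    $ iostat -d 1                          # per-device transfer statistics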
    thank you,
    Peter




    --
    Peter Bauer
    APUS Software G.m.b.H.
    A-8074 Raaba, Bahnhofstrasse 1/1
    Email: peter.bauer@apus.co.at
    Tel: +43 316 401629 24
    Fax: +43 316 401629 9
  • Greg Smith at Nov 29, 2007 at 6:28 am

    On Wed, 28 Nov 2007, Peter Bauer wrote:

    PERC4E/DI DC ULTRA320 SCSI RAID, 256MB
    Its Debian sarge with kernel 2.4.26.050719-686 #1 SMP.
    OK, so I'd expect you're using the megaraid2 driver. That and kernel
    2.4.26 are a few years behind current at this point, and there have been
    plenty of performance improvements in Linux and that driver since then.
    Since you're asking about this before it's a serious problem and don't
    need an immediate fix, you may want to consider whether upgrading to a
    more modern 2.6 kernel is in your future; that change alone may resolve
    some of the possibly too-high load you're having. I doubt any 2.4 kernel
    is really getting the best from your fairly modern server hardware.

    Also, the PERC4E and similar Megaraid cards are known to be generally
    sluggish on write throughput compared with some of the competing products
    out there. I don't know that I'd replace it though, as spending the same
    amount of money adding disks would probably be more useful.

    There's actually a way to combine these two ideas and get an upgrade
    transition plan. Buy a second disk, find a downtime window, install a
    newer Linux onto it and test. If that works well switch to it. If it
    doesn't you still have the original unmodified system around and you can
    at least split the disk load between the two drives.
    vmstat and iostat were running with 1 second intervals.
    Good, that means the responses you've already gotten made the right
    assumptions. There are some additional fun statistics you can collect with
    the Linux iostat, but sadly that needs a newer kernel as well.
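
    For example, on a 2.6 kernel the extended per-device report adds average
    wait time and utilization figures:

    $ iostat -x 1                          # adds await, svctm, %util per device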

    --
    * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
