Hi,

There's a lot here, so skip to the middle, starting from my WAL
settings, if you like.

I'm currently investigating the performance of a large database which
consumes email designated as spam, for the perusal of customers wishing
to check it. This incorporates a number of subprocesses - several
delivery daemons, an expiry daemon and a UI which performs large
selects. A considerable number of UPDATEs, SELECTs and DELETEs are
performed continually.

Starting with a stock pg config, I've come to understand the importance
of increased shared memory, effective cache size and a low
random_page_cost, as detailed in
http://www.varlena.com/varlena/GeneralBits/Tidbits/perf.html. After some
system analysis with vmstat and sar, we've been able to determine that
the main problem is that we're IO bound, and IMO this is due to lots of
updates causing high drive contention - the array is a RAID0 mirror and
the dataset originally 79GB. A lot of spam is being sent from our mail
scanners, and coupled with the UI load this is creating an increasingly
lagging system.

Typically all our db servers have these sorts of enhancements - 1GB RAM,
SMP boxen with SCSI 160 disks:
effective_cache_size = 95694
random_page_cost = 0.5
sort_mem=65536
max_connections = 128
shared_buffers = 15732
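
(For scale, assuming the default 8kB page size, those settings translate
roughly as follows - back-of-envelope arithmetic, not measured figures:

    shared_buffers        15732 x 8kB  ~ 123MB
    effective_cache_size  95694 x 8kB  ~ 747MB
    sort_mem              65536kB      = 64MB per sort, per backend

so on a 1GB box the cache estimate plus buffers already account for most
of RAM, and a few concurrent large sorts could push things into swap.)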

My focus today has been on WAL - I've not looked at WAL before. By
increasing the settings thus :

wal_buffers = 64 # need to determine WAL usage
wal_files = 64 # range 0-64
wal_sync_method = fsync # the default varies across platforms:
wal_debug = 0 # range 0-16

# hopefully this should see fewer LogFlushes per LogInsert - uses more
# WAL though.
commit_delay = 10000 # range 0-100000
commit_siblings = 2 # range 1-1000
checkpoint_segments = 16 # in logfile segments (16MB each), min 1
checkpoint_timeout = 300 # in seconds, range 30-3600
fsync = true
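
For reference, the checkpoint spacing these values imply (simple
arithmetic, at 16MB per segment):

    16 segments x 16MB = 256MB of WAL written, or 300s elapsed -
    whichever comes first triggers a checkpoint.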

Great improvements have been seen. A vacuumdb -f -a -z went from
processing 1 table in 10 minutes to 10 tables in 1 minute. :) I had
actually stopped the previous run after 80 tables (48 hours runtime)
because the projected end time would have been next week. Once I
restarted the postmaster with the above WAL settings, vacuumdb -f -a -z
completed all 650 tables by the following day.

My thinking is therefore to reduce disk contention as far as possible
within the current hardware limitations. I'm aiming at keeping the
checkpoint subprocess satisfied that other backends are about to commit
- hence keeping commit_siblings low at 2 - and at creating a sufficient
gap between internal commits (commit_delay is in microseconds, so 10000
gives a 10ms window) so that many commits can be flushed in a single
sync. From the above config, I believe I've gone some way towards
achieving this, and the performance I'm now seeing suggests as much.

But I think we can get more out of this, as the above settings were
picked from thin air. My concern here is being able to determine WAL
file usage, and whether the system is caught out at the other extreme -
that we're not committing fast enough. I've read that WAL files
shouldn't number more than 2*checkpoint_segments + 1 (33, with my
settings), yet my pg_xlog directory contains 74 files, which suggests
I'm using more logfiles than I should. Also I'm not sure what
wal_buffers really should be set to.
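
A crude way to keep an eye on it from the shell (the path is whatever
your PGDATA is - a guessed default shown here):

    $ ls /var/lib/pgsql/data/pg_xlog | wc -l
    $ du -sh /var/lib/pgsql/data/pg_xlog

run periodically, to see whether the segment count is still growing or
has settled.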

Can I get any feedback on this? How to look into pg's WAL usage is what
I'm really after. BTW this is an old install I'm afraid - 7.2.2 - it's
been impossible to upgrade until now because it's been too slow. I have
moved pg_xlog onto the root SCSI disk - it doesn't appear to have made a
huge difference, but then it could be on the same cable.

Additional information as a bit of background:
I can supply sar output if required. I'm currently running our expiry
daemon, which scans all mail for each domain (i.e. each table), and this
seems to take a few hours to run on a 26GB archive. It's a lot faster
than it ever was. Load gets to about 8 as backends are all busy doing
selects, updates and deletes. This process has already been run
recently, so it shouldn't be doing too much deleting. It still seems IO
bound, and I don't think I'm going to solve that without a better disk
arrangement, but this is essentially what I'm doing now - exhausting
other possibilities.

$ sar -B -s 16:00:00

16:35:55 pgpgin/s pgpgout/s activepg inadtypg inaclnpg inatarpg
16:36:00 3601.60 754.40 143492 87791 10230 48302
16:36:05 5766.40 552.80 143947 88039 10170 48431
16:36:10 3663.20 715.20 144578 88354 9075 48401
16:36:15 3634.40 412.00 144335 88405 9427 48433
16:36:20 5578.40 447.20 143626 88545 9817 48397
16:36:25 4154.40 469.60 143640 88654 10388 48536
16:36:30 3504.00 635.20 143538 88763 9992 48458
16:36:35 3540.80 456.00 142515 88949 10444 48381
16:36:40 3334.40 1067.20 143268 89244 9832 48468

$ vmstat 5
   procs                       memory    swap          io     system        cpu
 r  b  w   swpd   free   buff   cache  si  so    bi    bo    in    cs  us  sy  id
 0  7  1  29588  10592  15700  809060   1   0    97    75     0   103  13   9  79
 3  8  0  29588  11680  15736  807620   0   0  3313   438  1838  3559  19  13  68
 2 13  1  29588  12808  15404  800328   0   0  4470   445  1515  1752   7   7  86
 0  9  1  29588  10992  15728  806476   0   0  2933   781  1246  2686  14  10  76
 2  5  1  29588  11336  15956  807884   0   0  3354   662  1773  5211  27  17  57
 4  5  0  29696  13072  16020  813872   0  24  4282   306  2632  7862  45  25  31
 4  6  1  29696  10400  16116  815084   0   0  5086   314  2668  7893  47  26  27
 9  2  1  29696  13060  16308  814232  27   0  3927   748  2586  7836  48  29  23
 3  8  1  29696  10444  16232  812816   3   0  4015   433  2443  7180  47  28  25
 8  4  0  29696  10904  16432  812488   0   0  4537   500  2616  8418  46  30  24
 4  6  2  29696  11048  16320  810276   0   0  6076   569  1893  3919  20  14  66
 0  5  0  29696  10480  16600  813788   0   0  4595   435  2400  6215  33  21  46
 3  6  0  29696  10536  16376  812248   0   0  3802   504  2417  7921  43  25  32
 1  6  1  29696  11236  16500  809636   0   0  3691   357  2171  5199  24  15  61
 0 14  1  29696  10228  16036  801368   0   0  4038   561  1566  3288  16  12  72

Sorry it's so long but I thought some brief info would be better than
not. Thanks for reading,

--

Rob Fielding
Development
Designer Servers Ltd

  • Richard Huxton at Feb 28, 2004 at 7:37 pm
    Rob

    Sir - I have to congratulate you on having the most coherently summarised and
    yet complex list query I have ever seen.

    I fear that I will be learning from this problem rather than helping, but one
    thing did puzzle me - you've set your random_page_cost to 0.5? I'm not sure
    this is sensible - you may be compensating for some other parameter
    out-of-range.


    --
    Richard Huxton
  • Rod Taylor at Feb 29, 2004 at 12:52 am
    random_page_cost = 0.5
    Not likely. The lowest this value should ever be is 1, and that's if
    you're using something like a RAM drive.

    If your drives are doing a ton of extra random IO due to the above
    (rather than sequential reads) it would lower the throughput quite a
    bit.

    Try a value of 2 for a while.
  • Rob Fielding at Feb 29, 2004 at 1:06 pm

    Rod Taylor wrote:

    random_page_cost = 0.5
    Try a value of 2 for a while.

    OK, thanks Richard and Rod. I've upped this to 2. I think I left this
    value over from a previous play with settings on my IDE RAID 0
    workstation. It seemed to have a good effect set as a low float, so
    it stuck.

    I've set it to 2.

    From another post off list, I've also bumped up

    max_fsm_relations = 1000 # min 10, fsm
    max_fsm_pages = 20000 # min 1000, fs
    vacuum_mem = 32768 # min 1024

    as they did seem a little low. I'm hesitant to set them too high at
    this stage, as I'd prefer to keep as much RAM as possible available
    for runtime.
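
    For scale (assuming 8kB pages - my arithmetic, not from the thread):

        max_fsm_pages  20000 x 8kB ~ 156MB of tracked free space

    which is still small against a dataset that started at 79GB, so
    these may deserve another look once RAM allows.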

    I'm still hoping that perhaps the uber-pgadmin Mr Lane might reply about
    my WAL issue :) However, I'm getting the feeling now that the server is
    running at a much higher level of performance than it has been. Won't
    know until tomorrow though.

    Cheers,

    --
    Rob Fielding
    Development
    Designer Servers Ltd
  • Rob Fielding at Mar 1, 2004 at 5:30 pm
    Further update on my WAL experimentation. pg_xlog files have increased
    to 81, and checking today, up to 84. Currently nothing much is going on
    with the server, save a background process running a select every 30
    seconds with almost no impact (according to IO from vmstat).

    This in itself is a good sign - an improvement on last week's running -
    but I'd still like to get clarification on WAL file usage if possible.

    Tailing the log shows nothing more interesting than a whole set of
    "recycled transaction log file" entries:

    2004-03-01 16:01:55 DEBUG: recycled transaction log file 0000007100000017
    2004-03-01 16:07:01 DEBUG: recycled transaction log file 0000007100000018
    2004-03-01 16:17:14 DEBUG: recycled transaction log file 0000007100000019
    2004-03-01 16:22:20 DEBUG: recycled transaction log file 000000710000001A
    2004-03-01 16:32:31 DEBUG: recycled transaction log file 000000710000001B
    2004-03-01 16:37:36 DEBUG: recycled transaction log file 000000710000001C
    2004-03-01 16:47:48 DEBUG: recycled transaction log file 000000710000001D
    2004-03-01 16:52:54 DEBUG: recycled transaction log file 000000710000001E
    2004-03-01 17:03:05 DEBUG: recycled transaction log file 000000710000001F

    Looks kind of automated, but the times aren't quite even - around 6-10
    minutes apart.

    cheers,
    --

    Rob Fielding
    rob@dsvr.net

    www.dsvr.co.uk Development Designer Servers Ltd
  • Simon Riggs at Mar 2, 2004 at 12:28 am

    Rob Fielding wrote:
    My focus today has been on WAL - I've not looked at WAL before. By
    increasing the settings thus :

    wal_buffers = 64 # need to determine WAL usage
    wal_files = 64 # range 0-64
    wal_sync_method = fsync # the default varies across platforms:
    wal_debug = 0 # range 0-16

    # hopefully this should see fewer LogFlushes per LogInsert - uses more
    # WAL though.
    commit_delay = 10000 # range 0-100000
    commit_siblings = 2 # range 1-1000
    checkpoint_segments = 16 # in logfile segments (16MB each), min 1
    checkpoint_timeout = 300 # in seconds, range 30-3600
    fsync = true
    But I think we can get more out of this, as the above settings were
    picked from thin air. My concern here is being able to determine WAL
    file usage, and whether the system is caught out at the other extreme -
    that we're not committing fast enough. I've read that WAL files
    shouldn't number more than 2*checkpoint_segments + 1 (33, with my
    settings), yet my pg_xlog directory contains 74 files, which suggests
    I'm using more logfiles than I should. Also I'm not sure what
    wal_buffers really should be set to.
    As Richard Huxton says, we're all learning...I'm looking at WAL logic
    for other reasons right now...

    This is based upon my reading of the code; I think the manual contains
    at least one confusion that has not assisted your understanding (or
    mine):

    The WAL files limit of 2*checkpoint_segments+1 refers to the number of
    files allocated-in-advance of the current log, not the total number of
    files in use. pg uses a cycle of logs, reusing older ones when all the
    transactions in those log files have been checkpointed. The limit is set
    to allow checkpoint to release segments and have them all be reused at
    once. Pg stores them up for use again later when workload hots up again.

    If it cannot recycle a file because there is a still-current txn on the
    end of the cycle, then it will allocate a new file and use this instead,
    but still keeping everything in a cycle. Thus if transactions are
    particularly long running, then the number of files in the cycle will
    grow. So overall, normal behaviour so far. I don't think there's
    anything to worry about in having that many files in your xlog cycle.

    That behaviour is usually seen with occasional long-running txns. When a
    long-running transaction is over, pg will try to reduce the number of
    files in the cycle until it's back to target.

    You seem to be reusing one file in the cycle every 10 mins - this is
    happening as the result of a checkpoint timeout - "kinda automated" as
    you say. [A checkpoint is the only time you can get the messages you're
    getting.] At one file per checkpoint, it will take 16*2+1 = 33
    checkpoints * 10 mins = about 5.5 hours before it hits the
    advance-allocation file limit and then starts to reduce the number of
    files. That's why they appear to stay constant...

    If you want to check whether this is correct, manually issue a number of
    CHECKPOINT statements. The messages should change from "recycled" to
    "removing" transaction log file once you've got to 33 checkpoints, and
    the number of WAL log files should start to go down as well. If so, then
    there's nothing too strange going on - just pg being a little slow in
    reducing the number of WAL log files.
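
    From the shell, that test could look something like this (database
    name is a placeholder; 35 comfortably exceeds the 33-checkpoint
    limit):

        $ for i in `seq 1 35`; do
        >   psql -d yourdb -c "CHECKPOINT;"
        >   ls pg_xlog | wc -l     # run from $PGDATA; watch the count
        > done

    while tailing the server log for the recycled/removing messages.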

    So, it seems that you are running occasional very long transactions.
    During that period you run up to 60-80 wal files. That's just on the
    edge of your wal_buffers limit, which means you start to write wal
    quicker than you'd like past that point. Your checkpoint_timeout is 300
    seconds, but a checkpoint will also be called every checkpoint_segments,
    or currently every 16 wal files. Since you go as high as 60-80 then you
    are checkpointing 4-5 times during the heavy transaction period -
    assuming it's all one block of work. In the end, each checkpoint is
    causing a huge I/O storm, during which not much work happens.

    I would suggest that you reduce the effect of checkpointing by either:
    - re-writing the app to do scan deletes in smaller chunks in quieter
    periods, or
    - increasing checkpoint_segments to 128, though this may affect your
    recoverability (a sketch of that change follows below).
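
    For the second option, the relevant postgresql.conf lines would look
    something like this (illustrative values only):

        checkpoint_segments = 128   # was 16; permits up to 2*128+1 = 257
                                    # segments, i.e. ~4GB of pg_xlog
        checkpoint_timeout = 300    # unchanged

    so check that the pg_xlog disk has the headroom first.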

    You can of course only do so much with the memory available to you. If
    you increase one allocation of memory, you may have to reduce another
    parameter, and that may be counterproductive.

    [An alternative view is that you should go for more frequent, not less
    frequent checkpoints in this situation, smoothing out the effect of the
    checkpoints, rather than trying to avoid them at all. On the other hand,
    that approach also increases total WAL log volume, which means you'll
    make poor use of I/O and memory buffering. I'd stay high.]

    However, I'm not sure:
    - why a checkpoint interval of 300 secs causes them to happen every 10
    mins in quieter periods; is that an occasional update occurring?
    - why checkpoint only releases a single WAL file each time - but that
    may be me just reading the code incorrectly.

    Please set WAL_DEBUG to 1 so we can see a bit more info: thanks.
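
    (In postgresql.conf that's just:

        wal_debug = 1

    then restart the postmaster - the safe assumption on a 7.2 install.)
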
    Can I get any feedback on this ? How to look into pg's WAL usage would
    be what I'm looking for. BTW this is an old install I'm afraid 7.2.2 -
    it's been impossible to upgrade up until now because it's been too slow.
    I have moved the pg_xlog onto the root SCSI disk - it doesn't appear to
    have made a huge difference but it could be on the same cable.
    My advice is don't touch WAL_SYNC_METHOD...

    I **think** the WAL behaviour is still the same in 7.4.1, so no rush to
    upgrade on that account - unless you're using temporary tables....

    Best Regards, Simon Riggs
  • Tom Lane at Mar 2, 2004 at 1:01 am

    "Simon Riggs" <simon@2ndquadrant.com> writes:
    - why a checkpoint interval of 300 secs causes them to happen every 10
    mins in quieter periods; is that an occasional update occurring?
    There is code in there to suppress a checkpoint if no WAL-loggable
    activity has happened since the last checkpoint. Not sure if that's
    relevant to the issue or not though.

    regards, tom lane
  • Simon Riggs at Mar 2, 2004 at 10:16 pm

    Tom Lane
    "Simon Riggs" <simon@2ndquadrant.com> writes:
    - why a checkpoint interval of 300 secs causes them to happen every 10
    mins in quieter periods; is that an occasional update occurring?
    There is code in there to suppress a checkpoint if no WAL-loggable
    activity has happened since the last checkpoint. Not sure if that's
    relevant to the issue or not though.
    Thanks Tom, at least that clears up why the checkpoints are off.

    I must admit, I'm taken aback though:

    I'd prefer it if it DIDN'T do that. If the system is quiet, the odd
    checkpoint doesn't matter that much - however, taking a long time to
    return the xlog files to the *desired* state of having many
    pre-allocated log files is not a great thing.

    What do you think about continuing to checkpoint normally until the
    number of xlog files has returned to 2*checkpoint_segments+1, then
    allowing checkpoints to slow down when quiet? It would be easy enough
    to set a variable true while rearranging files towards the limit, then
    set it false when the limit has been hit, and use that to activate the
    slow-down code (not that I know where that is, mind...). However, that
    would require some backend-to-postmaster IPC, which may be a problem.

    Or perhaps the real problem is only recycling one file at a time - if
    we're running this as the checkpoint process, it wouldn't be a problem
    to recycle more than one at the same time, would it?

    The reason for my interest is: when archiving logs for PITR, there may
    be occasional long pauses while waiting for tape mounts (typically 30
    minutes from notification to change). These pauses could (and therefore
    eventually will, for some people) cause severe log file build-up, and
    I'm interested in making sure this build-up doesn't take too long to
    clear. Forgetting the archival API stuff for a second, this is roughly
    the same situation as Rob is experiencing (or at least one causing him
    to pause and think).

    Best Regards, Simon Riggs
  • Josh Berkus at Mar 2, 2004 at 1:10 am
    Simon,
    Please set WAL_DEBUG to 1 so we can see a bit more info: thanks.
    I'm pretty sure that WAL_DEBUG requires a compile-time option.

    --
    -Josh Berkus
    Aglio Database Solutions
    San Francisco
  • Simon Riggs at Mar 2, 2004 at 10:16 pm

    Josh Berkus wrote
    Simon Riggs wrote
    Please set WAL_DEBUG to 1 so we can see a bit more info: thanks.
    I'm pretty sure that WAL_DEBUG requires a compile-time option.
    In my naiveté, I just set and use it. I discovered it in the code, then
    set it to take advantage.

    I'm surprised, but you are right: the manual does SAY this requires a
    compile-time option; it is unfortunately not correct. Maybe this was
    once the case, as you say? If Rob is using 7.2.2, maybe this still
    applies to him? I don't know.

    Setting wal_debug > 0 in postgresql.conf sets a variable XLOGDEBUG,
    which, although it is all capitals, is not in fact a compiler directive
    as it looks. This variable is used within xlog.c to output **too much**
    information to the log. However, it is the only option at present.

    This prompts me however to consider the idea of having various levels of
    WAL debug output, or using some kind of log_checkpoint mechanism to
    better understand what is going on.

    Best Regards, Simon Riggs
  • Neil Conway at Mar 3, 2004 at 12:01 am

    Simon Riggs wrote:
    Josh Berkus wrote
    Simon Riggs wrote
    Please set WAL_DEBUG to 1 so we can see a bit more info: thanks.
    I'm pretty sure that WAL_DEBUG requires a compile-time option.
    I'm surprised, but you are right: the manual does SAY this requires a
    compile-time option; it is unfortunately not correct.
    Actually, the manual is correct: in 7.4 and earlier releases, enabling
    wal_debug can be done without also setting a compile-time #ifdef. As
    of current CVS HEAD, the WAL_DEBUG #ifdef must be defined before this
    variable is available.

    -Neil
  • Josh Berkus at Mar 3, 2004 at 8:46 pm
    Neil,
    Actually, the manual is correct: in 7.4 and earlier releases, enabling
    wal_debug can be done without also setting a compile-time #ifdef. As
    of current CVS HEAD, the WAL_DEBUG #ifdef must be defined before this
    variable is available.
    Hmmm. I was told that it was this way for 7.4 as well; that's why it's in
    the docs that way.


    --
    -Josh Berkus
    Aglio Database Solutions
    San Francisco
  • Neil Conway at Mar 3, 2004 at 11:44 pm

    Josh Berkus wrote:
    Hmmm. I was told that it was this way for 7.4 as well; that's why it's in
    the docs that way.
    No such statement is made in the docs AFAIK: they merely say "If
    nonzero, turn on WAL-related debugging output."

    I invented a new #ifdef symbol when making this change in CVS HEAD, so
    I think you are misremembering.

    -Neil
  • Simon Riggs at Mar 3, 2004 at 9:40 pm

    Neil Conway
    Simon Riggs wrote:
    Josh Berkus wrote
    Simon Riggs wrote
    Please set WAL_DEBUG to 1 so we can see a bit more info: thanks.
    I'm pretty sure that WAL_DEBUG requires a compile-time option.
    I'm surprised, but you are right: the manual does SAY this requires a
    compile-time option; it is unfortunately not correct.
    Actually, the manual is correct: in 7.4 and earlier releases, enabling
    wal_debug can be done without also setting a compile-time #ifdef. As
    of current CVS HEAD, the WAL_DEBUG #ifdef must be defined before this
    variable is available.
    Touché! I stand corrected; thank you both. My suggestion does work for
    Rob, then.

    [This also implies I have a screwed version on my machine, so thank you
    also for flushing that lurking issue out for me. I'd had a suspicion for
    a few weeks. Lucky I'm still just prototyping.]

    On the other hand, I was just about to change the wal_debug behaviour to
    allow better debugging of PITR features as they're added. I think it is
    very important to be able to put the system fairly easily into debug
    mode; a recompile is easy enough, but it would be even better to avoid
    this completely. This would mean reversing the change you describe:
    here's the design:

    The behaviour I wish to add is:
    Keep wal_debug as a value between 0 and 16.
    If =0 then no debug output (default).
    Use following bitmasks against the value
    Mask 1 = XLOG Checkpoints get logged
    Mask 2 = Archive API calls get logged
    Mask 4 = Transaction - commits get logged
    Mask 8 = Flush & INSERTs get logged
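
    So, to illustrate how the masks would combine (my examples, using the
    values proposed above):

        wal_debug = 5    # 1 (checkpoints) + 4 (commits)
        wal_debug = 3    # 1 (checkpoints) + 2 (archive API calls)
        wal_debug = 15   # everything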

    That way it should be fairly straightforward to control the amount and
    type of information available to administrators. The existing design
    produces too much info to be easily usable, mostly requiring a perl
    program to filter out the overload and do record counts. This suggested
    design lets you control the volume of messages, since the bitmasks are
    arranged in volume/frequency order, and it brings the wal_debug option
    back to something useful for problem diagnosis on live systems, not
    just for hacking the code.

    Anybody object to these mods, or have better/different ideas? Getting
    the diagnostics right is fairly important, IMHO, to making PITR become
    real.

    Best regards, Simon Riggs
  • Tom Lane at Mar 3, 2004 at 11:47 pm

    "Simon Riggs" <simon@2ndquadrant.com> writes:
    The behaviour I wish to add is:
    Keep wal_debug as a value between 0 and 16.
    If =0 then no debug output (default).
    Use following bitmasks against the value
    Mask 1 = XLOG Checkpoints get logged
    Mask 2 = Archive API calls get logged
    Mask 4 = Transaction - commits get logged
    Mask 8 = Flush & INSERTs get logged
    I see no value in reverting Neil's change. The above looks way too much
    like old-line assembler-programmer thinking to me, anyway. Why not
    invent a separate, appropriately named boolean variable for each thing
    you want to control? Even C programmers manage to avoid doing the sort
    of mental arithmetic that the above would force onto DBAs.

    As for whether it should be #ifdef'd or not, I'd have no objection to
    turning WAL_DEBUG on by default in pg_config_manual.h for the duration
    of PITR development. One should not however confuse short-term
    debugging needs with features that the average user is going to need
    indefinitely. (It was not too long ago that there was still debugging
    code for btree index building in there, for crissakes.)
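
    Concretely, that switch is a one-liner (sketch; file and symbol as
    named in this thread):

        /* src/include/pg_config_manual.h */
        #define WAL_DEBUG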

    regards, tom lane
  • Neil Conway at Mar 3, 2004 at 11:49 pm

    Simon Riggs wrote:
    On the other hand, I was just about to change the wal_debug behaviour to
    allow better debugging of PITR features as they're added.
    That's a development activity. Enabling the WAL_DEBUG #ifdef by
    default during the 7.5 development cycle would be uncontroversial, I
    think.
    I think it is very important to be able to put the system fairly
    easily into debug mode
    It is? Why would this be useful for non-development activities?

    (It may well be the case that we ought to report more or better
    information about the status of the WAL subsystem; but WAL_DEBUG is
    surely not the right mechanism for emitting information intended for
    an administrator.)

    -Neil
  • Simon Riggs at Mar 9, 2004 at 8:27 pm

    Neil Conway
    Simon Riggs wrote:
    On the other hand, I was just about to change the wal_debug
    behaviour to
    allow better debugging of PITR features as they're added.
    That's a development activity. Enabling the WAL_DEBUG #ifdef by
    default during the 7.5 development cycle would be uncontroversial, I
    think.
    Yes that's the best proposal. Can I leave that with you?
    I think it is very important to be able to put the system fairly
    easily into debug mode
    It is? Why would this be useful for non-development activities?

    (It may well be the case that we ought to report more or better
    information about the status of the WAL subsystem; but WAL_DEBUG is
    surely not the right mechanism for emitting information intended for
    an administrator.)
    Right again. I guess my proposal amounted to quick-and-dirty logging.

    I'll think some more.

    Best Regards, Simon Riggs
