We have three patches in the hopper that all have the same goal:
reduce ProcArrayLock contention. They are:

[1] Pavan's patch (subsequently revised by Heikki) to put the "hot"
members of the PGPROC structure into a separate array
http://archives.postgresql.org/message-id/4EB7C4C9.9070309@enterprisedb.com

[2] my FlexLocks patch, and
http://archives.postgresql.org/message-id/CA+Tgmoax_14rbx8Y6mmgvW64gCQL4ZviDzwEObXEMuiV=TwmxQ@mail.gmail.com

[3] my patch to eliminate some snapshot taking (I think this also gives
better semantics, but at any rate it also improves performance)
http://archives.postgresql.org/message-id/CA+TgmoYDe3dx7xuK_rCPLWy7P67hp96ozyGe_K6W87kfx3YCGw@mail.gmail.com

Interestingly, these all try to reduce ProcArrayLock contention in
different ways: [1] does it by making snapshot-taking scan fewer cache
lines, [2] does it by reducing contention for the spinlock protecting
ProcArrayLock, and [3] does it by taking fewer snapshots. So you
might think that the effects of these patches would add, at least to
some degree.
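
To put the shared bottleneck in concrete terms, every snapshot does
something roughly like the loop below (a deliberately simplified sketch
with made-up names, not the real GetSnapshotData): [1] shrinks the data
this loop has to touch by packing the hot per-backend fields densely,
[2] makes taking and releasing the lock around it cheaper, and [3]
simply runs it less often.

typedef struct ProcEntrySketch
{
    int         xid;        /* in-progress transaction id, 0 if none */
    int         xmin;
    /* ...many other fields sharing the same cache lines... */
} ProcEntrySketch;

static void
take_snapshot_sketch(ProcEntrySketch *procs, int nprocs,
                     int *xip, int *xcnt)
{
    int         n = 0;

    /* every snapshot acquires ProcArrayLock in shared mode around this scan */
    for (int i = 0; i < nprocs; i++)
    {
        int         xid = procs[i].xid; /* typically a cache miss per backend */

        if (xid != 0)
            xip[n++] = xid;             /* remember as still in progress */
    }
    *xcnt = n;
    /* ...and releases ProcArrayLock here */
}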

Now the first two patches are the ones that seem to show the most
performance improvement, so I tested both patches individually and
also a combination of the two patches (the combined patch for this is
attached, as there were numerous conflicts). I tested them on two
different machines with completely different architectures: Nate
Boley's AMD 6128 box (which has 32 cores) and an HP Integrity server
(also with 32 cores). On Integrity, I compiled using the aCC
compiler, adjusted the resulting binary with chatr +pi L +pd L, and
ran both pgbench and the server with rtsched -s SCHED_NOAGE -p 178,
which are settings that seem to be necessary for good performance on
that platform. pgbench was run locally on the AMD box but from
another server over a high-speed network interconnect on the Integrity
server. Both servers were configured with shared_buffers=8GB,
checkpoint_segments=300, wal_writer_delay=20ms, and
synchronous_commit=off. Some of the other settings were different; on
the Integrity server, I had effective_cache_size=340GB,
checkpoint_timeout=30min, and wal_buffers=16MB, while on the AMD box I
had checkpoint_completion_target=0.9 and maintenance_work_mem=1GB. I
doubt that these settings differences were material (except that they
probably made reinitializing the database between tests take longer on
the Integrity system, since I forgot to set maintenance_work_mem), but
I could double-check that if anyone is concerned about it.

The results are below. In a nutshell, either patch by itself is very,
very good; and both patches together are somewhat better. Which one
helps more individually is somewhat variable. Lines marked "m" are
unpatched master as of commit
ff4fd4bf53c5512427f8ecea08d6ca7777efa2c5. "p" is Pavan's PGPROC patch
(maybe I should have said ppp...) as revised by Heikki; "f" is the
latest version of my FlexLocks patch, and "b" is the combination patch
attached herewith. The number immediately following is the number of
clients used, each with its own pgbench thread (i.e. -c N -j N). As
usual, each number is the median of three five-minute runs at scale
factor 100.

Since Pavan's patch has the advantage of being quite simple, I'm
thinking we should push that one through to completion first, and then
test all the other possible improvements in this area relative to that
new baseline.

== AMD Opteron 6128, 32 cores, Permanent Tables ==

m01 tps = 631.208073 (including connections establishing)
p01 tps = 631.182923 (including connections establishing)
f01 tps = 636.308562 (including connections establishing)
b01 tps = 629.295507 (including connections establishing)
m08 tps = 4516.479854 (including connections establishing)
p08 tps = 4614.772650 (including connections establishing)
f08 tps = 4652.454768 (including connections establishing)
b08 tps = 4679.363474 (including connections establishing)
m16 tps = 7788.615240 (including connections establishing)
p16 tps = 7824.025406 (including connections establishing)
f16 tps = 7841.876146 (including connections establishing)
b16 tps = 7859.334650 (including connections establishing)
m24 tps = 11720.145052 (including connections establishing)
p24 tps = 12782.696214 (including connections establishing)
f24 tps = 12559.765555 (including connections establishing)
b24 tps = 12891.945766 (including connections establishing)
m32 tps = 10223.015618 (including connections establishing)
p32 tps = 11585.902050 (including connections establishing)
f32 tps = 11626.542744 (including connections establishing)
b32 tps = 11866.969986 (including connections establishing)
m80 tps = 7540.482189 (including connections establishing)
p80 tps = 11598.446238 (including connections establishing)
f80 tps = 11529.752081 (including connections establishing)
b80 tps = 11714.364294 (including connections establishing)

== AMD Opteron 6128, 32 cores, Unlogged Tables ==

m01 tps = 680.398630 (including connections establishing)
p01 tps = 673.293390 (including connections establishing)
f01 tps = 679.993953 (including connections establishing)
b01 tps = 679.377600 (including connections establishing)
m08 tps = 4760.964292 (including connections establishing)
p08 tps = 4870.037842 (including connections establishing)
f08 tps = 5028.719509 (including connections establishing)
b08 tps = 4893.439824 (including connections establishing)
m16 tps = 7997.051705 (including connections establishing)
p16 tps = 8218.884377 (including connections establishing)
f16 tps = 8160.373682 (including connections establishing)
b16 tps = 8144.707958 (including connections establishing)
m24 tps = 13066.867858 (including connections establishing)
p24 tps = 14523.109116 (including connections establishing)
f24 tps = 14098.978673 (including connections establishing)
b24 tps = 14526.330294 (including connections establishing)
m32 tps = 10800.711985 (including connections establishing)
p32 tps = 19159.131614 (including connections establishing)
f32 tps = 22224.839905 (including connections establishing)
b32 tps = 23373.672552 (including connections establishing)
m80 tps = 7885.663468 (including connections establishing)
p80 tps = 17760.149440 (including connections establishing)
f80 tps = 19960.356205 (including connections establishing)
b80 tps = 18665.581069 (including connections establishing)

== HP Integrity, 32 cores, Permanent Tables ==

m01 tps = 883.732295 (including connections establishing)
p01 tps = 866.449154 (including connections establishing)
f01 tps = 924.364403 (including connections establishing)
b01 tps = 926.797302 (including connections establishing)
m08 tps = 6098.047731 (including connections establishing)
p08 tps = 6293.537855 (including connections establishing)
f08 tps = 6059.635731 (including connections establishing)
b08 tps = 6250.132288 (including connections establishing)
m16 tps = 9995.755003 (including connections establishing)
p16 tps = 10654.562946 (including connections establishing)
f16 tps = 10258.008496 (including connections establishing)
b16 tps = 10712.776806 (including connections establishing)
m24 tps = 11646.915026 (including connections establishing)
p24 tps = 13483.345338 (including connections establishing)
f24 tps = 12815.456128 (including connections establishing)
b24 tps = 13506.218109 (including connections establishing)
m32 tps = 10433.315312 (including connections establishing)
p32 tps = 14111.719739 (including connections establishing)
f32 tps = 13990.284158 (including connections establishing)
b32 tps = 14697.189751 (including connections establishing)
m80 tps = 8177.428209 (including connections establishing)
p80 tps = 11343.667289 (including connections establishing)
f80 tps = 11651.244256 (including connections establishing)
b80 tps = 12523.308466 (including connections establishing)

== HP Integrity, 32 cores, Unlogged Tables ==

m01 tps = 949.594327 (including connections establishing)
p01 tps = 958.753925 (including connections establishing)
f01 tps = 931.276655 (including connections establishing)
b01 tps = 943.836646 (including connections establishing)
m08 tps = 6211.621726 (including connections establishing)
p08 tps = 6412.267441 (including connections establishing)
f08 tps = 5843.870591 (including connections establishing)
b08 tps = 6428.415940 (including connections establishing)
m16 tps = 10341.538889 (including connections establishing)
p16 tps = 11161.425798 (including connections establishing)
f16 tps = 10545.954472 (including connections establishing)
b16 tps = 11235.441290 (including connections establishing)
m24 tps = 11859.831632 (including connections establishing)
p24 tps = 14380.766878 (including connections establishing)
f24 tps = 13489.351324 (including connections establishing)
b24 tps = 14579.649665 (including connections establishing)
m32 tps = 10716.208372 (including connections establishing)
p32 tps = 15497.819188 (including connections establishing)
f32 tps = 14590.406972 (including connections establishing)
b32 tps = 15991.920288 (including connections establishing)
m80 tps = 8465.159253 (including connections establishing)
p80 tps = 11945.494890 (including connections establishing)
f80 tps = 14676.324769 (including connections establishing)
b80 tps = 15623.109737 (including connections establishing)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

  • Kevin Grittner at Nov 18, 2011 at 4:26 pm

    Robert Haas wrote:

    Nate Boley's AMD 6128 box (which has 32 cores) and an HP Integrity
    server (also with 32 cores).
    [clear improvement with flexlock patch]
    Hmm. We have a 32-core Intel box (4 x X7560 @ 2.27GHz) with 256 GB
    RAM. It's about a week from going into production, at which point
    it will be extremely hard to schedule such tests, but for a few days
    more I've got shots at it. The flexlock patch doesn't appear to be
    such a clear win here.

    I started from Robert's tests, but used these settings so that I
    could go to higher client counts and better test serializable
    transactions. Everything is fully cached.

    max_connections = 200
    max_pred_locks_per_transaction = 256
    shared_buffers = 8GB
    maintenance_work_mem = 1GB
    checkpoint_segments = 30
    checkpoint_timeout = 15min
    checkpoint_completion_target = 0.9
    seq_page_cost = 0.1
    random_page_cost = 0.1
    cpu_tuple_cost = 0.05
    effective_cache_size = 40GB
    default_transaction_isolation = '$iso'

    Serializable results not shown here -- that's to gather information
    for trying to improve SSI locking.

    m1 tps = 7847.834544 (including connections establishing)
    f1 tps = 7917.225382 (including connections establishing)
    m2 tps = 18672.145526 (including connections establishing)
    f2 tps = 17486.435322 (including connections establishing)
    m4 tps = 34371.278253 (including connections establishing)
    f4 tps = 34465.898173 (including connections establishing)
    m8 tps = 68228.261694 (including connections establishing)
    f8 tps = 68505.285830 (including connections establishing)
    m16 tps = 127449.815100 (including connections establishing)
    f16 tps = 127208.939670 (including connections establishing)
    m32 tps = 201738.209348 (including connections establishing)
    f32 tps = 201637.237903 (including connections establishing)
    m64 tps = 380326.800557 (including connections establishing)
    f64 tps = 380628.429408 (including connections establishing)
    m80 tps = 366628.197546 (including connections establishing)
    f80 tps = 162594.012051 (including connections establishing)
    m96 tps = 360922.948775 (including connections establishing)
    f96 tps = 366728.987041 (including connections establishing)
    m128 tps = 352159.631878 (including connections establishing)
    f128 tps = 355475.129448 (including connections establishing)

    I did five runs each and took the median. In most cases, the values
    were pretty close to one another in a group, so confidence is pretty
    high that this is meaningful. There were a few anomalies where
    performance for one or more samples was horrid. This seems
    consistent with the theory of pathological pileups on the LW locks
    (or also flexlocks?).

    The problem groups:

    m64 tps = 380407.768906 (including connections establishing)
    m64 tps = 79197.470389 (including connections establishing)
    m64 tps = 381112.194105 (including connections establishing)
    m64 tps = 378579.036542 (including connections establishing)
    m64 tps = 380326.800557 (including connections establishing)

    m96 tps = 360582.945291 (including connections establishing)
    m96 tps = 363021.805138 (including connections establishing)
    m96 tps = 362468.870516 (including connections establishing)
    m96 tps = 59614.322351 (including connections establishing)
    m96 tps = 360922.948775 (including connections establishing)

    f80 tps = 158905.149822 (including connections establishing)
    f80 tps = 157192.460599 (including connections establishing)
    f80 tps = 370757.790443 (including connections establishing)
    f80 tps = 162594.012051 (including connections establishing)
    f80 tps = 372170.638516 (including connections establishing)

    f96 tps = 366804.733788 (including connections establishing)
    f96 tps = 366728.987041 (including connections establishing)
    f96 tps = 365490.380848 (including connections establishing)
    f96 tps = 366770.193305 (including connections establishing)
    f96 tps = 125225.371140 (including connections establishing)

    So the lows don't seem to be as low when they happen with the
    flexlock patch, but they still happen -- possibly more often?

    -Kevin
  • Robert Haas at Nov 18, 2011 at 4:33 pm

    On Fri, Nov 18, 2011 at 11:26 AM, Kevin Grittner wrote:
    Robert Haas wrote:
    Nate Boley's AMD 6128 box (which has 32 cores) and an HP Integrity
    server (also with 32 cores).
    [clear improvement with flexlock patch]
    Hmm.  We have a 32-core Intel box (4 x X7560 @ 2.27GHz) with 256 GB
    RAM.  It's about a week from going into production, at which point
    it will be extremely hard to schedule such tests, but for a few days
    more I've got shots at it.  The flexlock patch doesn't appear to be
    such a clear win here.

    I started from Robert's tests, but used these settings so that I
    could go to higher client counts and better test serializable
    transactions.  Everything is fully cached.

    max_connections = 200
    max_pred_locks_per_transaction = 256
    shared_buffers = 8GB
    maintenance_work_mem = 1GB
    checkpoint_segments = 30
    checkpoint_timeout = 15min
    checkpoint_completion_target = 0.9
    seq_page_cost = 0.1
    random_page_cost = 0.1
    cpu_tuple_cost = 0.05
    effective_cache_size = 40GB
    default_transaction_isolation = '$iso'
    I had a dismaying benchmarking experience recently that involved
    settings very similar to the ones you've got there - in particular, I
    also had checkpoint_segments set to 30. When I raised it to 300,
    performance improved dramatically at 8 clients and above.

    Then again, is this a regular pgbench test or is this SELECT-only?
    Because the absolute numbers you're posting are vastly higher than
    anything I've ever seen on a write test.

    Can you by any chance check top or vmstat during the 32-client test
    and see what percentage you have of user time/system time/idle time?

    What OS are you running?

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Kevin Grittner at Nov 18, 2011 at 5:04 pm

    Robert Haas wrote:

    Then again, is this a regular pgbench test or is this SELECT-only?
    SELECT-only
    Can you by any chance check top or vmstat during the 32-client
    test and see what percentage you have of user time/system
    time/idle time?
    You didn't say whether you wanted master or flexlock, but it turned
    out that any difference was way too far into the noise to show.
    They both looked like this:

    procs -----------memory----------- --swap-- ---io--- ---system---- ----cpu-----
     r  b swpd    free      buff     cache  si so bi bo    in      cs us sy id wa st
    38  0  352 1157400 207177020 52360472   0  0  0 16 13345 1190230 40  7 53  0  0
    37  0  352 1157480 207177020 52360472   0  0  0  0 12953 1263310 40  8 52  0  0
    36  0  352 1157484 207177020 52360472   0  0  0  0 13411 1233365 38  7 54  0  0
    37  0  352 1157476 207177020 52360472   0  0  0  0 12780 1193575 41  7 51  0  0

    Keep in mind that while there are really 32 cores, the cpu
    percentages seem to be based on the "threads" from hyperthreading.
    Top showed pgbench (running on the same machine) as eating a pretty
    steady 5.2 of the cores, leaving 26.8 cores to actually drive the 32
    postgres processes.
    What OS are you running?
    Linux new-CIR 2.6.32.43-0.4-default #1 SMP 2011-07-14 14:47:44 +0200
    x86_64 x86_64 x86_64 GNU/Linux

    SUSE Linux Enterprise Server 11 (x86_64)
    VERSION = 11
    PATCHLEVEL = 1

    -Kevin
  • Robert Haas at Nov 18, 2011 at 5:26 pm

    On Fri, Nov 18, 2011 at 12:03 PM, Kevin Grittner wrote:
    Then again, is this a regular pgbench test or is this SELECT-only?
    SELECT-only
    Ah, OK. I would not expect flexlocks to help with that; Pavan's patch
    might, though.
    Can you by any chance check top or vmstat during the 32-client
    test and see what percentage you have of user time/system
    time/idle time?
    You didn't say whether you wanted master or flexlock, but it turned
    out that any difference was way too far into the noise to show.
    They both looked like this:

    procs -----------memory----------- --swap-- ---io--- ---system---- ----cpu-----
     r  b swpd    free      buff     cache  si so bi bo    in      cs us sy id wa st
    38  0  352 1157400 207177020 52360472   0  0  0 16 13345 1190230 40  7 53  0  0
    37  0  352 1157480 207177020 52360472   0  0  0  0 12953 1263310 40  8 52  0  0
    36  0  352 1157484 207177020 52360472   0  0  0  0 13411 1233365 38  7 54  0  0
    37  0  352 1157476 207177020 52360472   0  0  0  0 12780 1193575 41  7 51  0  0

    Keep in mind that while there are really 32 cores, the cpu
    percentages seem to be based on the "threads" from hyperthreading.
    Top showed pgbench (running on the same machine) as eating a pretty
    steady 5.2 of the cores, leaving 26.8 cores to actually drive the 32
    postgres processes.
    It doesn't make any sense for PostgreSQL master to be using only 50%
    of the CPU and leaving the rest idle on a lots-of-clients SELECT-only
    test. That could easily happen on 9.1, but my lock manager changes
    eliminated the only place where anything gets put to sleep in that
    path (except for the emergency sleeps done by s_lock, when a spinlock
    is really badly contended). So I'm confused by these results. Are we
    sure that the processes are being scheduled across all 32 physical
    cores?

    At any rate, I do think it's likely that you're being bitten by
    spinlock contention, but we'd need to do some legwork to verify that
    and work out the details. Any chance you can run oprofile (on either
    branch, don't really care) against the 32 client test and post the
    results? If it turns out s_lock is at the top of the heap, I can put
    together a patch to help figure out which spinlock is the culprit.
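
    (For reference: time attributed to s_lock in a profile is essentially
    time spent in a busy-wait loop along the lines of the sketch below --
    illustrative constants and names, not the actual implementation -- so
    a tall s_lock entry means many CPUs fighting over one cache line.)

    #include <stdatomic.h>
    #include <unistd.h>

    #define SPINS_BEFORE_SLEEP 100          /* illustrative value */

    static void
    spin_lock_sketch(atomic_flag *lock)
    {
        int         spins = 0;
        int         delay_us = 1000;

        /*
         * The test-and-set hammers a single cache line; under heavy
         * contention most of the CPU time shows up right here.
         */
        while (atomic_flag_test_and_set(lock))
        {
            if (++spins < SPINS_BEFORE_SLEEP)
                continue;                   /* keep busy-waiting a while */

            /* the "emergency sleep": back off so the holder can finish */
            usleep(delay_us);
            delay_us *= 2;                  /* crude exponential backoff */
            spins = 0;
        }
    }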

    Anyway, this is probably a digression as it relates to FlexLocks:
    those are not optimizing for a read-only workload.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Kevin Grittner at Nov 18, 2011 at 5:46 pm

    Robert Haas wrote:
    Kevin Grittner wrote:
    Then again, is this a regular pgbench test or is this
    SELECT-only?
    SELECT-only
    Ah, OK. I would not expect flexlocks to help with that; Pavan's
    patch might, though.
    OK. Sorry for misunderstanding that. I haven't gotten around to a
    deep reading of the patch yet. :-( I based this on the test script
    you posted here (with slight modifications for my preferred
    directory structures):

    http://archives.postgresql.org/pgsql-hackers/2011-10/msg00605.php

    If I just drop the -S switch will I have a good test, or are there
    other adjustments I should make (besides increasing checkpoint
    segments)? (Well, for the SELECT-only test I didn't bother putting
    pg_xlog on a separate RAID 10 on its own BBU controller as we
    normally would for this machine; I'll cover that, too.)
    It doesn't make any sense for PostgreSQL master to be using only
    50% of the CPU and leaving the rest idle on a lots-of-clients
    SELECT-only test. That could easily happen on 9.1, but my lock
    manager changes eliminated the only place where anything gets put
    to sleep in that path (except for the emergency sleeps done by
    s_lock, when a spinlock is really badly contended). So I'm
    confused by these results. Are we sure that the processes are
    being scheduled across all 32 physical cores?
    I think so. My take was that it was showing 32 of 64 *threads*
    active -- the hyperthreading funkiness. Is there something in
    particular you'd like me to check?
    At any rate, I do think it's likely that you're being bitten by
    spinlock contention, but we'd need to do some legwork to verify
    that and work out the details. Any chance you can run oprofile
    (on either branch, don't really care) against the 32 client test
    and post the results? If it turns out s_lock is at the top of the
    heap, I can put together a patch to help figure out which spinlock
    is the culprit.
    oprofile isn't installed on this machine. I'll take care of that
    and post results when I can.

    -Kevin
  • Robert Haas at Nov 18, 2011 at 5:51 pm

    On Fri, Nov 18, 2011 at 12:45 PM, Kevin Grittner wrote:
    OK.  Sorry for misunderstanding that.  I haven't gotten around to a
    deep reading of the patch yet.  :-(  I based this on the test script
    you posted here (with slight modifications for my preferred
    directory structures):

    http://archives.postgresql.org/pgsql-hackers/2011-10/msg00605.php

    If I just drop the -S switch will I have a good test, or are there
    other adjustments I should make (besides increasing checkpoint
    segments)?  (Well, for the SELECT-only test I didn't bother putting
    pg_xlog on a separate RAID 10 on its own BBU controller as we
    normally would for this machine; I'll cover that, too.)
    Yeah, I'd just drop -S. Make sure to use -c N -j N with pgbench, or
    you'll probably not be able to saturate it. I've also had good luck
    with wal_writer_delay=20ms, although if you have synchronous_commit=on
    that might not matter, and it's much less important since Simon's
    recent patch in that area went in.

    What scale factor are you testing at?
    It doesn't make any sense for PostgreSQL master to be using only
    50% of the CPU and leaving the rest idle on a lots-of-clients
    SELECT-only test.  That could easily happen on 9.1, but my lock
    manager changes eliminated the only place where anything gets put
    to sleep in that path (except for the emergency sleeps done by
    s_lock, when a spinlock is really badly contended).  So I'm
    confused by these results. Are we sure that the processes are
    being scheduled across all 32 physical cores?
    I think so.  My take was that it was showing 32 of 64 *threads*
    active -- the hyperthreading funkiness.  Is there something in
    particular you'd like me to check?
    Not really, just don't understand the number.
    At any rate, I do think it's likely that you're being bitten by
    spinlock contention, but we'd need to do some legwork to verify
    that and work out the details.  Any chance you can run oprofile
    (on either branch, don't really care) against the 32 client test
    and post the results?  If it turns out s_lock is at the top of the
    heap, I can put together a patch to help figure out which spinlock
    is the culprit.
    oprofile isn't installed on this machine.  I'll take care of that
    and post results when I can.
    OK.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Kevin Grittner at Nov 18, 2011 at 6:15 pm

    Robert Haas wrote:

    Yeah, I'd just drop -S.
    Easily done.
    Make sure to use -c N -j N with pgbench, or you'll probably not be
    able to saturate it.
    Yeah, that's part of the script I copied from you.
    I've also had good luck with wal_writer_delay=20ms, although if
    you have synchronous_commit=on that might not matter, and it's
    much less important since Simon's recent patch in that area went
    in.
    What the heck; will do.
    What scale factor are you testing at?
    100. Perhaps I should boost that since I'm going as far as 128
    clients?

    -Kevin
  • Kevin Grittner at Nov 18, 2011 at 7:56 pm

    Robert Haas wrote:
    I think so. My take was that it was showing 32 of 64 *threads*
    active -- the hyperthreading funkiness. Is there something in
    particular you'd like me to check?
    Not really, just don't understand the number.
    I'm having trouble resolving the vmstat numbers I got during the
    32-client pgbench runs which modified data.

    -M simple:

    procs -----------memory----------- --swap-- -----io----- ---system---- ----cpu-----
     r  b swpd    free      buff     cache  si so bi     bo     in      cs us sy id wa st
    30  1 4464  513492 205564572 54472124   0  0  0  78170 621724 1246300 30  8 61  1  0
    27  1 4464  509288 205564572 54474600   0  0  0 125620 599403 1192046 29  8 63  1  0
    35  1 4464  508368 205564572 54476996   0  0  0  89801 595939 1186496 29  8 63  0  0
    25  0 4464  506088 205564572 54478668   0  0  0  90121 594800 1189649 28  8 63  0  0

    -M prepared:

    procs -----------memory----------- --swap-- -----io----- ---system---- ----cpu-----
     r  b swpd    free      buff     cache  si so bi     bo     in      cs us sy id wa st
    28  0 5612 1204404 205107344 54230536   0  0  0  93212 527284 1456417 22  9 69  0  0
     8  1 5612 1202044 205107344 54233336   0  0  0  93217 512819 1417457 21  9 70  1  0
    17  1 5612 1201892 205107344 54236048   0  0  0 132699 502333 1412878 21  9 70  0  0
    19  1 5612 1199208 205107344 54238936   0  0  0  93612 519113 1484386 21  9 69  0  0

    So 60% or 70% idle without any I/O wait time. I don't know how to
    explain that.

    -Kevin
  • Kevin Grittner at Nov 18, 2011 at 7:06 pm

    Robert Haas wrote:

    Any chance you can run oprofile (on either branch, don't really
    care) against the 32 client test and post the results?
    Besides the other changes we discussed, I boosted scale to 150 and
    ran at READ COMMITTED isolation level (because all threads promptly
    crashed and burned at REPEATABLE READ -- we desperately need a
    pgbench option to retry a transaction on serialization failure).
    The oprofile hot spots at half a percent or higher:

    CPU: Intel Core/i7, speed 2262 MHz (estimated)
    Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with
    a unit mask of 0x00 (No unit mask) count 100000
    samples % image name symbol name
    933394 4.9651 postgres AllocSetAlloc
    848476 4.5134 postgres base_yyparse
    719515 3.8274 postgres SearchCatCache
    461275 2.4537 postgres hash_search_with_hash_value
    426411 2.2682 postgres GetSnapshotData
    322938 1.7178 postgres LWLockAcquire
    322236 1.7141 postgres core_yylex
    305471 1.6249 postgres MemoryContextAllocZeroAligned
    281543 1.4976 postgres expression_tree_walker
    270241 1.4375 postgres XLogInsert
    234899 1.2495 postgres MemoryContextAlloc
    210137 1.1178 postgres ScanKeywordLookup
    184857 0.9833 postgres heap_page_prune
    173608 0.9235 postgres hash_any
    153011 0.8139 postgres _bt_compare
    144538 0.7689 postgres nocachegetattr
    131466 0.6993 postgres fmgr_info_cxt_security
    131001 0.6968 postgres grouping_planner
    130808 0.6958 postgres LWLockRelease
    124112 0.6602 postgres PinBuffer
    120745 0.6423 postgres LockAcquireExtended
    112992 0.6010 postgres ExecInitExpr
    112830 0.6002 postgres lappend
    112311 0.5974 postgres new_list
    110368 0.5871 postgres check_stack_depth
    106036 0.5640 postgres AllocSetFree
    102565 0.5456 postgres MemoryContextAllocZero
    94689 0.5037 postgres SearchSysCache

    Do you want line numbers or lower percentages?

    Two runs:

    tps = 21946.961196 (including connections establishing)
    tps = 22911.873227 (including connections establishing)

    For write transactions, that seems pretty respectable.

    -Kevin
  • Anarazel at Nov 18, 2011 at 7:11 pm

    Kevin Grittner wrote:
    Robert Haas wrote:
    Any chance you can run oprofile (on either branch, don't really
    care) against the 32 client test and post the results?
    Besides the other changes we discussed, I boosted scale to 150 and
    ran at READ COMMITTED isolation level (because all threads promptly
    crashed and burned at REPEATABLE READ -- we desperately need a
    pgbench option to retry a transaction on serialization failure).
    The oprofile hot spots at half a percent or higher:

    CPU: Intel Core/i7, speed 2262 MHz (estimated)
    Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with
    a unit mask of 0x00 (No unit mask) count 100000
    samples % image name symbol name
    933394 4.9651 postgres AllocSetAlloc
    848476 4.5134 postgres base_yyparse
    719515 3.8274 postgres SearchCatCache
    461275 2.4537 postgres hash_search_with_hash_value
    426411 2.2682 postgres GetSnapshotData
    322938 1.7178 postgres LWLockAcquire
    322236 1.7141 postgres core_yylex
    305471 1.6249 postgres MemoryContextAllocZeroAligned
    281543 1.4976 postgres expression_tree_walker
    270241 1.4375 postgres XLogInsert
    234899 1.2495 postgres MemoryContextAlloc
    210137 1.1178 postgres ScanKeywordLookup
    184857 0.9833 postgres heap_page_prune
    173608 0.9235 postgres hash_any
    153011 0.8139 postgres _bt_compare
    144538 0.7689 postgres nocachegetattr
    131466 0.6993 postgres fmgr_info_cxt_security
    131001 0.6968 postgres grouping_planner
    130808 0.6958 postgres LWLockRelease
    124112 0.6602 postgres PinBuffer
    120745 0.6423 postgres LockAcquireExtended
    112992 0.6010 postgres ExecInitExpr
    112830 0.6002 postgres lappend
    112311 0.5974 postgres new_list
    110368 0.5871 postgres check_stack_depth
    106036 0.5640 postgres AllocSetFree
    102565 0.5456 postgres MemoryContextAllocZero
    94689 0.5037 postgres SearchSysCache
    That profile looks like you ran pgbench with -m simple. How does it look with prepared instead?

    Andres
  • Kevin Grittner at Nov 18, 2011 at 7:24 pm

    "anarazel@anarazel.de" wrote:
    Kevin Grittner <Kevin.Grittner@wicourts.gov> schrieb:
    samples % image name symbol name
    933394 4.9651 postgres AllocSetAlloc
    848476 4.5134 postgres base_yyparse
    719515 3.8274 postgres SearchCatCache
    That profile looks like you ran pgbench with -m simple. How does
    it look with prepared instead?
    samples % image name symbol name
    495463 3.6718 postgres hash_search_with_hash_value
    490971 3.6385 postgres GetSnapshotData
    443965 3.2902 postgres LWLockAcquire
    443566 3.2872 postgres AllocSetAlloc
    302388 2.2409 postgres XLogInsert
    286889 2.1261 postgres SearchCatCache
    246417 1.8262 postgres PostgresMain
    235018 1.7417 postgres heap_page_prune
    198442 1.4706 postgres _bt_compare
    181446 1.3447 postgres hash_any
    177131 1.3127 postgres ExecInitExpr
    175775 1.3026 postgres LWLockRelease
    152324 1.1288 postgres PinBuffer
    150285 1.1137 postgres exec_bind_message
    145214 1.0762 postgres fmgr_info_cxt_security
    140493 1.0412 postgres s_lock
    124162 0.9201 postgres LockAcquireExtended
    120429 0.8925 postgres MemoryContextAlloc
    117076 0.8676 postgres pfree
    116493 0.8633 postgres AllocSetFree
    105027 0.7783 postgres pgstat_report_activity
    101407 0.7515 postgres ProcArrayLockAcquire
    100797 0.7470 postgres MemoryContextAllocZeroAligned
    98360 0.7289 postgres ProcArrayLockRelease
    86938 0.6443 postgres heap_hot_search_buffer
    82635 0.6124 postgres hash_search
    79902 0.5921 postgres errstart
    79465 0.5889 postgres HeapTupleSatisfiesVacuum
    78709 0.5833 postgres ResourceOwnerReleaseInternal
    76068 0.5637 postgres ExecModifyTable
    73043 0.5413 postgres heap_update
    72175 0.5349 postgres strlcpy
    71253 0.5280 postgres MemoryContextAllocZero

    tps = 27392.219364 (including connections establishing)

    -Kevin
  • Kevin Grittner at Nov 18, 2011 at 7:37 pm

    "Kevin Grittner" wrote:

    samples % image name symbol name
    495463 3.6718 postgres hash_search_with_hash_value
    When lines like these show up in the annotated version, I'm
    impressed that we're still finding gains as big as we are:

    44613 0.3306 : if (segp == NULL)
    : hash_corrupted(hashp);

    101910 0.7552 : keysize = hashp->keysize; /* ditto */

    There goes over 1% of my server run time, right there!

    Of course, these make no sense unless there is cache line
    contention, which is why that area is bearing fruit.

    -Kevin
  • Andres Freund at Nov 18, 2011 at 8:03 pm

    On Friday, November 18, 2011 08:36:59 PM Kevin Grittner wrote:
    "Kevin Grittner" wrote:
    samples % image name symbol name
    495463 3.6718 postgres hash_search_with_hash_value
    When lines like these show up in the annotated version, I'm
    impressed that we're still finding gains as big as we are:

    44613 0.3306 : if (segp == NULL)

    : hash_corrupted(hashp);

    101910 0.7552 : keysize = hashp->keysize; /* ditto */
    When doing line-level profiles I would suggest looking at the instructions.
    Quite often the line shown doesn't have much to do with what's actually
    executed, as the compiler tries to schedule instructions cleverly.
    Also, in many situations the cost shown doesn't actually lie in the
    instruction shown but in some previous one: the shown instruction e.g. has
    to wait for the result of earlier instructions, and pipelining makes that
    hard to observe correctly.

    A simplified example would be something like:

    bool func(int a, int b, int c)
    {
        int res = a / b;

        if (res == c)
            return true;
        return false;
    }

    Likely the instruction showing up in the profile would be the comparison,
    which obviously is not the really expensive part...

    There goes over 1% of my server run time, right there!

    Of course, these make no sense unless there is cache line
    contention, which is why that area is bearing fruit.
    I don't think cache line contention is the most likely candidate here.
    Simple cache misses seem far more likely, in combination with pipeline
    stalls...

    Newer CPUs (Nehalem+) can measure stalled cycles, which can be really
    useful when analyzing performance. I don't remember how to do that with
    oprofile right now though, as I use perf these days (it's
    -e stalled-cycles-{frontend|backend} there).

    Andres
  • Kevin Grittner at Nov 18, 2011 at 8:16 pm

    Andres Freund wrote:

    When doing line-level profiles I would suggest looking at the
    instructions.
    What's the best way to do that?
    I don't think cache line contention is the most likely candidate
    here. Simple cache-misses seem far more likely. In combination
    with pipeline stalls...

    Newer cpus (nehalem+) can measure stalled cycles which can be
    really useful when analyzing performance. I don't remember how to
    do that with oprofile right now though as I use perf these days
    (its -e stalled-cycles{frontend|backend} there}).
    When I run oprofile, I still always go back to this post by Tom:

    http://archives.postgresql.org/pgsql-performance/2009-06/msg00154.php

    Can anyone provide such a "cheat sheet" for perf? I could give that
    a try if I knew how.

    -Kevin
  • Andres Freund at Nov 18, 2011 at 10:31 pm

    On Friday, November 18, 2011 09:16:01 PM Kevin Grittner wrote:
    Andres Freund wrote:
    When doing line-level profiles I would suggest looking at the
    instructions.
    What's the best way to do that?
    I think opannotate -a -s produces output with instructions/code intermingled.
    I don't think cache line contention is the most likely candidate
    here. Simple cache-misses seem far more likely. In combination
    with pipeline stalls...

    Newer cpus (nehalem+) can measure stalled cycles which can be
    really useful when analyzing performance. I don't remember how to
    do that with oprofile right now though as I use perf these days
    (its -e stalled-cycles{frontend|backend} there}).
    When I run oprofile, I still always go back to this post by Tom:
    http://archives.postgresql.org/pgsql-performance/2009-06/msg00154.php
    Hrm. I am on the train and for unknown reasons the only sensible working
    protocols are smtp + pop.... Waiting.... Waiting....
    Sorry, too slow/high latency atm. I wrote everything below and another mail
    and the page still hasn't loaded.

    oprofile can produce graphs as well (--callgraph). For both tools you need
    -fno-omit-frame-pointer to get usable graphs.
    Can anyone provide such a "cheat sheet" for perf? I could give that
    a try if I knew how.
    Unfortunately for sensible results the kernel needs to be rather new.
    I would say > 2.6.28 or so (just guessed).

    # to record activity
    perf record [-g|--call-graph] program|-p pid

    # to view a summation
    perf report

    graph:
    # Overhead Command Shared Object Symbol
    # ........ ........ ................. .........................................
    #
    4.09% postgres postgres [.] slab_alloc_dyn

    --- slab_alloc_dyn
    --18.52%-- new_list
    --63.79%-- lappend
    --13.40%-- find_usable_indexes
    create_index_paths
    set_rel_pathlist
    make_one_rel
    flat:

    # Overhead Command Shared Object Symbol
    # ........ ........ ................. .........................................
    #
    5.10% postgres [vdso] [.] 0x7ffff3d8d770
    4.26% postgres postgres [.] base_yyparse
    3.88% postgres postgres [.] slab_alloc_dyn
    2.82% postgres postgres [.] core_yylex
    2.37% postgres postgres [.] SearchCatCache
    1.85% postgres libc-2.13.so [.] __memcpy_ssse3
    1.66% postgres libc-2.13.so [.] __GI___strcmp_ssse3
    1.23% postgres postgres [.] MemoryContextAlloc


    # to view a line/source/instruction level view
    perf annotate -l symbol

    ...
    :
    : /*
    : * one-time startup overhead for each cache
    : */
    : if (cache->cc_tupdesc == NULL)
    0.35 : 6e81fd: 48 83 7f 28 00 cmpq $0x0,0x28(%rdi)
    /home/andres/src/postgresql/build/optimize/../../src/backend/utils/cache/catcache.c:1070
    4.15 : 6e8202: 0f 84 54 04 00 00 je 6e865c <SearchCatCache+0x47c>
    : #endif
    :
    : /*
    : * initialize the search key information
    : */
    : memcpy(cur_skey, cache->cc_skey, sizeof(cur_skey));
    0.00 : 6e8208: 48 8d bd a0 fe ff ff lea -0x160(%rbp),%rdi
    0.17 : 6e820f: 49 8d 77 70 lea 0x70(%r15),%rsi
    0.00 : 6e8213: b9 24 00 00 00 mov $0x24,%ecx
    /home/andres/src/postgresql/build/optimize/../../src/backend/utils/cache/catcache.c:1080
    33.22 : 6e8218: f3 48 a5 rep movsq %ds:(%rsi),%es:(%rdi)
    : cur_skey[0].sk_argument = v1;
    /home/andres/src/postgresql/build/optimize/../../src/backend/utils/cache/catcache.c:1081
    1.56 : 6e821b: 48 89 9d e0 fe ff ff mov %rbx,-0x120(%rbp)
    ...

    # get heaps of stats from something
    perf stat -ddd someprogram|-p pid

    1242.409965 task-clock # 0.824 CPUs utilized [100.00%]
    14,572 context-switches # 0.012 M/sec [100.00%]
    264 CPU-migrations # 0.000 M/sec [100.00%]
    0 page-faults # 0.000 M/sec
    2,854,775,135 cycles # 2.298 GHz [26.28%]
    <not supported> stalled-cycles-frontend
    <not supported> stalled-cycles-backend
    2,024,997,785 instructions # 0.71 insns per cycle [25.25%]
    387,240,903 branches # 311.685 M/sec [26.51%]
    21,756,886 branch-misses # 5.62% of all branches [26.26%]
    753,912,137 L1-dcache-loads # 606.814 M/sec [13.24%]
    52,733,007 L1-dcache-load-misses # 6.99% of all L1-dcache hits [14.72%]
    35,006,406 LLC-loads # 28.176 M/sec [15.46%]
    26,673 LLC-load-misses # 0.08% of all LL-cache hits [13.38%]
    1,855,654,347 L1-icache-loads # 1493.593 M/sec [12.63%]
    52,169,033 L1-icache-load-misses # 2.81% of all L1-icache hits [12.88%]
    761,475,250 dTLB-loads # 612.902 M/sec [13.37%]
    4,457,558 dTLB-load-misses # 0.59% of all dTLB cache hits [13.12%]
    2,049,753,137 iTLB-loads # 1649.820 M/sec [20.09%]
    4,139,394 iTLB-load-misses # 0.20% of all iTLB cache hits [19.31%]
    3,705,429 L1-dcache-prefetches # 2.982 M/sec [19.64%]
    <not supported> L1-dcache-prefetch-misses

    1.507855345 seconds time elapsed

    -r can repeat a command and gives you the standard deviation...

    # show whats the system executing overall
    perf top -az

    # get help
    perf help (record|report|annotate|stat|...)

    In new versions many commands (those that produce pageable text) take --stdio
    and --tui to select between two interfaces. I personally find --tui unusable.

    I am not really sure how good the results are compared to oprofile; I
    just prefer the UI by far... Also, the overhead seems to be measurably
    smaller, and it's usable by every user, not just root...

    Hope that suffices? I have no problem answering further questions, so ...

    Andres
  • Andres Freund at Nov 18, 2011 at 10:36 pm

    On Friday, November 18, 2011 11:12:02 PM Andres Freund wrote:
    On Friday, November 18, 2011 09:16:01 PM Kevin Grittner wrote:
    Andres Freund wrote:
    When doing line-level profiles I would suggest looking at the
    instructions.
    What's the best way to do that?
    I think opannotate -a -s produces output with instructions/code
    intermingled.
    I don't think cache line contention is the most likely candidate
    here. Simple cache-misses seem far more likely. In combination
    with pipeline stalls...

    Newer cpus (nehalem+) can measure stalled cycles which can be
    really useful when analyzing performance. I don't remember how to
    do that with oprofile right now though as I use perf these days
    (its -e stalled-cycles{frontend|backend} there}).
    When I run oprofile, I still always go back to this post by Tom:
    http://archives.postgresql.org/pgsql-performance/2009-06/msg00154.php
    Hrm. I am on the train and for unknown reasons the only sensible working
    protocols are smtp + pop.... Waiting.... Waiting....
    Sorry, too slow/high latency atm. I wrote everything below and another mail
    and the page still hasn't loaded.

    oprofile can produces graphes as well (--callgraph). for both tools you
    need -fno-omit-frame-pointers to get usable graphs.
    Can anyone provide such a "cheat sheet" for perf? I could give that
    a try if I knew how.
    Unfortunately for sensible results the kernel needs to be rather new.
    I would say > 2.6.28 or so (just guessed).

    # to record activity
    perf record [-g|--call-graph] program|-p pid

    # to view a summation
    perf report
    # get heaps of stats from something
    perf stat -ddd someprogram|-p pid
    # show whats the system executing overall
    perf top -az

    # get help
    perf help (record|report|annotate|stat|...)
    ...
    I forgot that there is also

    # get a list of event types
    perf list

    # measure something for a specific event
    perf (record|stat|top) -e some_event_type



    Andres
  • Kevin Grittner at Nov 18, 2011 at 11:18 pm

    Andres Freund wrote:

    I think opannotate -a -s produces output with instructions/code
    intermingled.
    Thanks. I'll check out perf later (thanks for the tips!), but for
    now, here's the function which was at the top of my oprofile
    results, annotated with those options. I'm afraid it's a bit
    intimidating to me -- the last time I did much with X86 assembly
    language was in the mid-80s, on an 80286. :-/ Hopefully, since
    this is at the top of the oprofile results when running with
    prepared statements, it will be of use to somebody.

    The instructions which are shown as having that 1% still seem odd to
    me, but as you say, they were probably actually waiting for some
    previous operation to finish:

    43329 0.3211 : 70b56a: test %rbp,%rbp

    99903 0.7404 : 70b58a: mov %rax,0x18(%rsp)

    If anyone wants any other detail from what I captured, let me know.

    -Kevin
  • Andres Freund at Nov 19, 2011 at 3:08 am

    On Saturday, November 19, 2011 12:18:07 AM Kevin Grittner wrote:
    Andres Freund wrote:
    I think opannotate -a -s produces output with instructions/code
    intermingled.
    Thanks. I'll check out perf later (thanks for the tips!), but for
    now, here's the function which was at the top of my oprofile
    results, annotated with those options. I'm afraid it's a bit
    intimidating to me -- the last time I did much with X86 assembly
    language was in the mid-80s, on an 80286. :-/
    While my assembly knowledge surely isn't from the 80s, be assured that I
    find it intimidating as well ;)
    Hopefully, since
    this is at the top of the oprofile results when running with
    prepared statements, it will be of use to somebody.
    I think in quite many situations hash_search_with_hash_value is rather
    noticeable in the profiles. Even without concurrency...

    Looking at your annotation output the code seems to be almost entirely stalled
    waiting for memory.
    The first stall is after the first reading memory access which is likely to be
    uncached (the first cacheline of the HTAB is accessed before but that will be
    in the cache). The interesting thing is that I would have expected a higher
    likelihood for this to stay in the cache.
    2225 0.0165 : 70b543: mov (%rdi),%r15
    :static inline uint32
    :calc_bucket(HASHHDR *hctl, uint32 hash_val)
    :{
    : uint32 bucket;
    :
    : bucket = hash_val & hctl->high_mask;
    4544 0.0337 : 70b546: and 0x2c(%r15),%ebx
    : if (bucket > hctl->max_bucket)
    53409 0.3958 : 70b54a: cmp 0x28(%r15),%ebx
    : 70b54e: jbe 70b554 <hash_search_with_hash_value+0x34>

    So a stall here is not that surprising.


    Here we fetch data from memory which is unlikely to be prefetchable, and
    then soon need the result of that fetch. Note how
    segp = hashp->dir[segment_num]; is distributed over lines 52, 64, and 83.

    : segp = hashp->dir[segment_num];
    2062 0.0153 : 70b562: shr %cl,%eax
    309 0.0023 : 70b564: mov %eax,%eax
    643 0.0048 : 70b566: mov (%rdx,%rax,8),%rbp
    :
    : if (segp == NULL)
    43329 0.3211 : 70b56a: test %rbp,%rbp





    The next cacheline is referenced here: again a fetch from memory whose
    result is needed soon after in order to continue.
    Unless I misunderstood the code-flow, this disproves my theory that we
    might have many collisions, as that test seems to be outside the loop:
    : prevBucketPtr = &segp[segment_ndx];
    : currBucket = *prevBucketPtr;
    122 9.0e-04 : 70b586: mov 0x0(%rbp),%rbx
    :
    : /*
    : * Follow collision chain looking for matching key
    : */
    : match = hashp->match; /* save one fetch in inner
    loop */
    : keysize = hashp->keysize; /* ditto */
    99903 0.7404 : 70b58a: mov %rax,0x18(%rsp)
    :
    : while (currBucket != NULL)
    1066 0.0079 : 70b58f: test %rbx,%rbx




    Line 136 is the first time the contents of the current bucket are needed.
    That's why the test is so noticeable.
    : currBucket = *prevBucketPtr;
    655 0.0049 : 70b5a3: mov (%rbx),%rbx
    : * Follow collision chain looking for matching key
    : */
    : match = hashp->match; /* save one fetch in inner
    loop */
    : keysize = hashp->keysize; /* ditto */
    :
    : while (currBucket != NULL)
    608 0.0045 : 70b5a6: test %rbx,%rbx
    : 70b5a9: je 70b5d0 <hash_search_with_hash_value+0xb0>
    : {
    : if (currBucket->hashvalue == hashvalue &&
    3504 0.0260 : 70b5ab: cmp %r12d,0x8(%rbx)
    98486 0.7299 : 70b5af: nop
    1233 0.0091 : 70b5b0: jne 70b5a0 <hash_search_with_hash_value+0x80>


    That covers all the slow points in the function. And unless I am missing
    something, those are all the fetched cachelines of that function... for
    HASH_FIND, that is.

    So I think that reinforces my belief that ordinary cache misses are the
    culprit here, which is to be expected in a hashtable...
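
    To make the chain of dependent loads concrete, here is a deliberately
    simplified sketch of a HASH_FIND-style lookup (structure and field names
    are only approximations, not the real dynahash code); each of the numbered
    loads can stall on a cache miss before the next one can even be issued:

    typedef struct EntrySketch
    {
        struct EntrySketch *link;           /* collision chain */
        unsigned int        hashvalue;
        /* key and payload follow */
    } EntrySketch;

    typedef struct HashSketch
    {
        unsigned int    high_mask;
        unsigned int    max_bucket;
        unsigned int    low_mask;
        unsigned int    sshift;             /* log2(bucket slots per segment) */
        unsigned int    ssize_mask;         /* bucket slots per segment - 1 */
        EntrySketch   ***dir;               /* directory of segments */
    } HashSketch;

    static EntrySketch *
    hash_find_sketch(HashSketch *hashp, unsigned int hashvalue)
    {
        unsigned int    bucket = hashvalue & hashp->high_mask;
        EntrySketch   **segp;
        EntrySketch    *cur;

        if (bucket > hashp->max_bucket)
            bucket = bucket & hashp->low_mask;

        /* load 1: directory entry -> pointer to the segment */
        segp = hashp->dir[bucket >> hashp->sshift];

        /* load 2: bucket slot within the segment -> head of the chain */
        cur = segp[bucket & hashp->ssize_mask];

        /* load 3..n: walk the collision chain, one dependent load per hop */
        while (cur != NULL)
        {
            if (cur->hashvalue == hashvalue)
                return cur;                 /* caller still compares the key */
            cur = cur->link;
        }
        return NULL;
    }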


    Andres


    PS: No idea whether that rambling made sense to anyone... But I looked at
    that function for the first time ;)
  • Robert Haas at Nov 18, 2011 at 11:19 pm

    On Fri, Nov 18, 2011 at 2:05 PM, Kevin Grittner wrote:
    Robert Haas wrote:
    Any chance you can run oprofile (on either branch, don't really
    care) against the 32 client test and post the results?
    [ oprofile results ]
    Hmm. That looks a lot like a profile with no lock contention at all.
    Since I see XLogInsert in there, I assume this must be a pgbench write
    test on unlogged tables? How close am I?

    I was actually thinking it would be interesting to oprofile the
    read-only test; see if we can figure out where those slowdowns are
    coming from.
    Two runs:

    tps = 21946.961196 (including connections establishing)
    tps = 22911.873227 (including connections establishing)

    For write transactions, that seems pretty respectable.
    Very. What do you get without the patch?

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Kevin Grittner at Nov 18, 2011 at 11:47 pm

    Robert Haas wrote:

    Hmm. That looks a lot like a profile with no lock contention at
    all. Since I see XLogInsert in there, I assume this must be a
    pgbench write test on unlogged tables? How close am I?
    Not unless pgbench on HEAD does that by default. Here are the
    relevant statements:

    $prefix/bin/pgbench -i -s 150
    $prefix/bin/pgbench -T $time -c $clients -j $clients >>$resultfile

    Perhaps the Intel cores implement the relevant primitives better?
    Maybe I didn't run the profile or the reports the right way?
    I was actually thinking it would be interesting to oprofile the
    read-only test; see if we can figure out where those slowdowns are
    coming from.
    I'll plan on doing that this weekend.
    tps = 21946.961196 (including connections establishing)
    tps = 22911.873227 (including connections establishing)

    For write transactions, that seems pretty respectable.
    Very. What do you get without the patch?
    [quick runs a couple tests that way]

    Single run with -M simple:

    tps = 23018.314292 (including connections establishing)

    Single run with -M prepared:

    tps = 27910.621044 (including connections establishing)

    So, the patch appears to hinder performance in this environment,
    although certainty is quite low with so few samples. I'll schedule
    a spectrum of runs before I leave this evening (very soon).

    -Kevin
  • Robert Haas at Nov 19, 2011 at 1:03 am

    On Fri, Nov 18, 2011 at 6:46 PM, Kevin Grittner wrote:
    tps = 21946.961196 (including connections establishing)
    tps = 22911.873227 (including connections establishing)

    For write transactions, that seems pretty respectable.
    Very.  What do you get without the patch?
    [quick runs a couple tests that way]

    Single run with -M simple:

    tps = 23018.314292 (including connections establishing)

    Single run with -M prepared:

    tps = 27910.621044 (including connections establishing)

    So, the patch appears to hinder performance in this environment,
    although certainty is quite low with so few samples.  I'll schedule
    a spectrum of runs before I leave this evening (very soon).
    Hmm. There's obviously something that's different in your environment
    or configuration from what I tested, but I don't know what it is. The
    fact that your scale factor is larger than shared_buffers might
    matter; or Intel vs. AMD. Or maybe you're running with
    synchronous_commit=on?

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Kevin Grittner at Nov 20, 2011 at 4:34 pm

    Robert Haas wrote:

    Hmm. There's obviously something that's different in your
    environment or configuration from what I tested, but I don't know
    what it is. The fact that your scale factor is larger than
    shared_buffers might matter; or Intel vs. AMD. Or maybe you're
    running with synchronous_commit=on?
    Yes, I had synchronous_commit = on for these runs. Here are the
    settings:

    cat >> $PGDATA/postgresql.conf <<EOM;
    max_connections = 200
    max_pred_locks_per_transaction = 256
    shared_buffers = 10GB
    maintenance_work_mem = 1GB
    checkpoint_segments = 300
    checkpoint_timeout = 15min
    checkpoint_completion_target = 0.9
    wal_writer_delay = 20ms
    seq_page_cost = 0.1
    random_page_cost = 0.1
    cpu_tuple_cost = 0.05
    effective_cache_size = 40GB
    default_transaction_isolation = '$iso'
    EOM

    Is there any chance that having pg_xlog on a separate RAID 10 set of
    drives with its own BBU controller would explain anything? I mean,
    I always knew that was a good idea for a big, heavily-loaded box,
    but I remember being surprised at how *big* a difference that made
    when a box accidentally went into production without moving the
    pg_xlog directory there.

    There is one other thing which might matter: I didn't use the -n
    pgbench option, and in the sample you showed, you were using it.

    Here is the median of five from the latest runs. On these
    read/write tests there was very little spread within each set of
    five samples, with no extreme outliers like I had on the SELECT-only
    tests. In the first position s means simple protocol and p means
    prepared protocol. In the second position m means master, f means
    with the flexlock patch.

    sm1 tps = 1092.269228 (including connections establishing)
    sf1 tps = 1090.511552 (including connections establishing)
    sm2 tps = 2171.867100 (including connections establishing)
    sf2 tps = 2158.609189 (including connections establishing)
    sm4 tps = 4278.541453 (including connections establishing)
    sf4 tps = 4269.921594 (including connections establishing)
    sm8 tps = 8472.257182 (including connections establishing)
    sf8 tps = 8476.150588 (including connections establishing)
    sm16 tps = 15905.074160 (including connections establishing)
    sf16 tps = 15937.372689 (including connections establishing)
    sm32 tps = 22331.817413 (including connections establishing)
    sf32 tps = 22861.258757 (including connections establishing)
    sm64 tps = 26388.391614 (including connections establishing)
    sf64 tps = 26529.152361 (including connections establishing)
    sm80 tps = 25617.651194 (including connections establishing)
    sf80 tps = 26560.541237 (including connections establishing)
    sm96 tps = 24105.455175 (including connections establishing)
    sf96 tps = 26569.244384 (including connections establishing)
    sm128 tps = 21467.530210 (including connections establishing)
    sf128 tps = 25883.023093 (including connections establishing)

    pm1 tps = 1629.265970 (including connections establishing)
    pf1 tps = 1619.024905 (including connections establishing)
    pm2 tps = 3164.061963 (including connections establishing)
    pf2 tps = 3137.469377 (including connections establishing)
    pm4 tps = 6114.787505 (including connections establishing)
    pf4 tps = 6061.750200 (including connections establishing)
    pm8 tps = 11884.534375 (including connections establishing)
    pf8 tps = 11870.670086 (including connections establishing)
    pm16 tps = 20575.737107 (including connections establishing)
    pf16 tps = 20437.648809 (including connections establishing)
    pm32 tps = 27664.381103 (including connections establishing)
    pf32 tps = 28046.846479 (including connections establishing)
    pm64 tps = 26764.294547 (including connections establishing)
    pf64 tps = 26631.589294 (including connections establishing)
    pm80 tps = 27716.198263 (including connections establishing)
    pf80 tps = 28393.642871 (including connections establishing)
    pm96 tps = 26616.076293 (including connections establishing)
    pf96 tps = 28055.921427 (including connections establishing)
    pm128 tps = 23282.912620 (including connections establishing)
    pf128 tps = 23072.766829 (including connections establishing)

    Note that on this 32-core box, performance on the read/write pgbench
    is peaking at 64 clients, but without a lot of variance between 32
    and 96 clients. And with the patch, performance still hasn't fallen
    off too badly at 128 clients. This is good news in terms of not
    having to sweat connection pool sizing quite as much as in earlier
    releases.

    Next I will get the profile for the SELECT-only runs. It seems to
    make sense to profile at the peak performance level, which was 64
    clients.

    I can run one more set of tests tonight before I have to give it
    back to the guy who's putting it into production. It sounds like a
    set like the above except with synchronous_commit = off might be
    desirable?

    -Kevin
  • Kevin Grittner at Nov 21, 2011 at 5:14 pm

    "Kevin Grittner" wrote:

    I can run one more set of tests tonight before I have to give it
    back to the guy who's putting it into production. It sounds like
    a set like the above except with synchronous_commit = off might be
    desirable?
    OK, that's what I did. This gave me my best numbers yet for an
    updating run of pgbench: tps = 38039.724212 for prepared statements
    using the flexlock patch. This patch is a clear win when you get to
    16 clients or more.
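
    (Same configuration as the previous set, with just one line changed
    in postgresql.conf:)

        synchronous_commit = off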

    sm1 tps = 1312.501168 (including connections establishing)
    sf1 tps = 1376.678293 (including connections establishing)
    sm2 tps = 2705.571856 (including connections establishing)
    sf2 tps = 2689.577938 (including connections establishing)
    sm4 tps = 5461.403557 (including connections establishing)
    sf4 tps = 5447.363103 (including connections establishing)
    sm8 tps = 10524.695338 (including connections establishing)
    sf8 tps = 10448.012069 (including connections establishing)
    sm16 tps = 18952.968472 (including connections establishing)
    sf16 tps = 18969.505631 (including connections establishing)
    sm32 tps = 27392.393850 (including connections establishing)
    sf32 tps = 29225.974112 (including connections establishing)
    sm64 tps = 28947.675549 (including connections establishing)
    sf64 tps = 31417.536816 (including connections establishing)
    sm80 tps = 28053.684182 (including connections establishing)
    sf80 tps = 29970.555401 (including connections establishing)
    sm96 tps = 25885.679957 (including connections establishing)
    sf96 tps = 28581.271436 (including connections establishing)
    sm128 tps = 22261.902571 (including connections establishing)
    sf128 tps = 24537.566960 (including connections establishing)

    pm1 tps = 2082.958841 (including connections establishing)
    pf1 tps = 2052.328339 (including connections establishing)
    pm2 tps = 4287.257860 (including connections establishing)
    pf2 tps = 4228.770795 (including connections establishing)
    pm4 tps = 8653.196863 (including connections establishing)
    pf4 tps = 8592.091631 (including connections establishing)
    pm8 tps = 16071.432101 (including connections establishing)
    pf8 tps = 16196.992207 (including connections establishing)
    pm16 tps = 27146.441216 (including connections establishing)
    pf16 tps = 27441.966562 (including connections establishing)
    pm32 tps = 34983.352396 (including connections establishing)
    pf32 tps = 38039.724212 (including connections establishing)
    pm64 tps = 33182.643501 (including connections establishing)
    pf64 tps = 34193.732669 (including connections establishing)
    pm80 tps = 30686.712607 (including connections establishing)
    pf80 tps = 33336.011769 (including connections establishing)
    pm96 tps = 24692.015615 (including connections establishing)
    pf96 tps = 32907.472665 (including connections establishing)
    pm128 tps = 24164.441954 (including connections establishing)
    pf128 tps = 25742.670928 (including connections establishing)

    At lower client numbers the tps values within each set of five
    samples were very tightly grouped. With either protocol, and
    whether or not the patch was applied, the higher concurrency groups
    tended to be bifurcated within a set of five samples between "good"
    and "bad" numbers. The patch seemed to increase the number of
    clients which could be handled without collapsing into the bad
    numbers. It really looks like there's some sort of performance
    "collapse" at higher concurrency which may or may not happen in any
    particular five-minute run. Just as one example, running the simple
    protocol with the flexlock patch:

    tps = 24491.653873 (including connections establishing)
    tps = 24537.566960 (including connections establishing)
    tps = 28462.276323 (including connections establishing)
    tps = 24403.373002 (including connections establishing)
    tps = 28458.902549 (including connections establishing)

    -Kevin
  • Pavan Deolasee at Nov 21, 2011 at 5:20 pm

    On Mon, Nov 21, 2011 at 10:44 PM, Kevin Grittner wrote:
    "Kevin Grittner" wrote:
    I can run one more set of tests tonight before I have to give it
    back to the guy who's putting it into production.  It sounds like
    a set like the above except with synchronous_commit = off might be
    desirable?
    OK, that's what I did.  This gave me my best numbers yet for an
    updating run of pgbench: tps = 38039.724212 for prepared statements
    using the flexlock patch.  This patch is a clear win when you get to
    16 clients or more.
    It will be a great help if you could spare a few minutes to also test
    the patch that moves the frequently accessed PGPROC members into a
    separate array. We are seeing good improvements on the HPUX IA platform
    and on AMD Opteron, and it will be interesting to see what happens on
    the Intel platform too.

    http://archives.postgresql.org/message-id/4EB7C4C9.9070309@enterprisedb.com

    Thanks,
    Pavan

    --
    Pavan Deolasee
    EnterpriseDB     http://www.enterprisedb.com
  • Kevin Grittner at Nov 21, 2011 at 5:31 pm

    Pavan Deolasee wrote:

    It will be a great help if you could spare a few minutes to also
    test the patch that moves the frequently accessed PGPROC members
    into a separate array. We are seeing good improvements on the HPUX
    IA platform and on AMD Opteron, and it will be interesting to see
    what happens on the Intel platform too.
    http://archives.postgresql.org/message-id/4EB7C4C9.9070309@enterprisedb.com

    It's going to be hard to arrange more of the 20-hour runs I've been
    doing, but I can work in some more abbreviated tests. What would be
    the best test for this? (I would hate to try and find out I didn't
    exercise the right code path.)

    -Kevin
  • Pavan Deolasee at Nov 21, 2011 at 5:38 pm

    On Mon, Nov 21, 2011 at 11:01 PM, Kevin Grittner wrote:
    Pavan Deolasee wrote:
    It will be a great help if you could spare a few minutes to also
    test the patch that moves the frequently accessed PGPROC members
    into a separate array. We are seeing good improvements on the HPUX
    IA platform and on AMD Opteron, and it will be interesting to see
    what happens on the Intel platform too.
    http://archives.postgresql.org/message-id/4EB7C4C9.9070309@enterprisedb.com

    It's going to be hard to arrange more of the 20-hour runs I've been
    doing, but I can work in some more abbreviated tests.  What would be
    the best test for this?  (I would hate to try and find out I didn't
    exercise the right code path.)
    I think 2-3 runs with 32 and 128 clients each with prepared statements
    should suffice to quickly compare with the other numbers you posted
    for the master.
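
    (Something along these lines, matching the earlier five-minute runs;
    the exact flags are illustrative:)

        pgbench -M prepared -T 300 -c 32 -j 32
        pgbench -M prepared -T 300 -c 128 -j 128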

    Thanks,
    Pavan

    --
    Pavan Deolasee
    EnterpriseDB     http://www.enterprisedb.com
  • Kevin Grittner at Nov 21, 2011 at 11:11 pm

    Pavan Deolasee wrote:

    It will be a great help if you could spare a few minutes to also
    test the patch that moves the frequently accessed PGPROC members
    into a separate array. We are seeing good improvements on the HPUX
    IA platform and on AMD Opteron, and it will be interesting to see
    what happens on the Intel platform too.
    For a read-only comparison (which was run using the simple
    protocol), using identical settings to the previous master run, but
    with the PGPROC split patch:

    m32 tps = 201738.209348 (including connections establishing)
    p32 tps = 201620.966988 (including connections establishing)

    m128 tps = 352159.631878 (including connections establishing)
    p128 tps = 363998.703900 (including connections establishing)

    Clearly a win at 128 clients; not at 32.

    For updates:

    sm32 tps = 27392.393850 (including connections establishing)
    sp32 tps = 27995.784333 (including connections establishing)

    sm128 tps = 22261.902571 (including connections establishing)
    sp128 tps = 23690.408272 (including connections establishing)

    pm32 tps = 34983.352396 (including connections establishing)
    pp32 tps = 36076.373389 (including connections establishing)

    pm128 tps = 24164.441954 (including connections establishing)
    pp128 tps = 27070.824588 (including connections establishing)

    That's a pretty decisive win all around.

    -Kevin
  • Pavan Deolasee at Nov 22, 2011 at 6:05 pm

    On Tue, Nov 22, 2011 at 4:40 AM, Kevin Grittner wrote:
    Pavan Deolasee wrote:
    It will be a great help if you could spare a few minutes to also
    test the patch that moves the frequently accessed PGPROC members
    into a separate array. We are seeing good improvements on the HPUX
    IA platform and on AMD Opteron, and it will be interesting to see
    what happens on the Intel platform too.
    For a read-only comparison (which was run using the simple
    protocol), using identical settings to the previous master run, but
    with the PGPROC split patch:

    m32 tps = 201738.209348 (including connections establishing)
    p32 tps = 201620.966988 (including connections establishing)

    m128 tps = 352159.631878 (including connections establishing)
    p128 tps = 363998.703900 (including connections establishing)

    Clearly a win at 128 clients; not at 32.

    For updates:

    sm32 tps = 27392.393850 (including connections establishing)
    sp32 tps = 27995.784333 (including connections establishing)

    sm128 tps = 22261.902571 (including connections establishing)
    sp128 tps = 23690.408272 (including connections establishing)

    pm32 tps = 34983.352396 (including connections establishing)
    pp32 tps = 36076.373389 (including connections establishing)

    pm128 tps = 24164.441954 (including connections establishing)
    pp128 tps = 27070.824588 (including connections establishing)

    That's a pretty decisive win all around.
    Thanks for running those tests. The numbers are not that bad, but
    definitely not as good as we saw on some other platforms. But it's
    possible that they may improve in percentage terms with even more
    clients on this box. And given that we are seeing big gains on
    other platforms, hopefully it will give us the confidence to proceed
    with the patch.

    Thanks,
    Pavan

    --
    Pavan Deolasee
    EnterpriseDB     http://www.enterprisedb.com
  • Kevin Grittner at Nov 22, 2011 at 7:04 pm

    Pavan Deolasee wrote:

    The numbers are not that bad, but definitely not as good as we saw
    on some other platforms.
    Well, this machine is definitely designed to hold up under high
    concurrency. As I understand it, each core is the memory manager
    for two 4GB DIMMs, with two channels to them, each with two buffers.
    The way the cores are connected, a core never needs to go through
    more than one other core to get to memory not directly managed, and
    that uses "snoop" technology which hands the cached data right over
    from one core to the other when possible, rather than making the
    core which now owns the cache line pull it from RAM. It seems the
    2.6.32 kernel is able to manage that technology in a reasonable
    fashion.

    At first I was surprised to see performance top out on the update
    tests between 80 and 96 clients. But then, that lands almost
    exactly where my old reliable ((2 * core count) + effective spindle
    count) formula would predict. The SELECT-only tests peaked at 64 clients,
    but those were fully cached, so effective spindle count was zero,
    again fitting the formula. So these optimizations seem to me to
    break down the barriers which had previously capped the number of
    clients which could be handled, letting them peak at their "natural"
    levels.
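
    For reference, plugging this box into that formula (the effective
    spindle count is an assumption here, not something that was
    measured):

        SELECT-only, fully cached:    (2 * 32) + 0       = 64 clients
        read/write, ~16-32 spindles:  (2 * 32) + 16..32  = 80..96 clients
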
    But it's possible that they may improve in percentage terms with
    even more clients on this box.
    I think so; I think this box is just so scalable that at 128
    clients we were just barely getting past the "knee" in the
    performance graphs to where these patches help most.

    -Kevin
  • Kevin Grittner at Nov 20, 2011 at 7:38 pm

    Robert Haas wrote:

    I was actually thinking it would be interesting to oprofile the
    read-only test; see if we can figure out where those slowdowns are
    coming from.
    CPU: Intel Core/i7, speed 2262 MHz (estimated)
    Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with
    a unit mask of 0x00 (No unit mask) count 100000
    samples % image name symbol name
    3124242 5.7137 postgres s_lock
    2555554 4.6737 postgres AllocSetAlloc
    2403412 4.3954 postgres GetSnapshotData
    1967132 3.5975 postgres SearchCatCache
    1872176 3.4239 postgres base_yyparse
    1327256 2.4273 postgres hash_search_with_hash_value
    1040131 1.9022 postgres _bt_compare
    1038976 1.9001 postgres LWLockAcquire
    817122 1.4944 postgres MemoryContextAllocZeroAligned
    738321 1.3503 postgres core_yylex
    622613 1.1386 postgres MemoryContextAlloc
    597054 1.0919 postgres PinBuffer
    556138 1.0171 postgres ScanKeywordLookup
    552318 1.0101 postgres expression_tree_walker
    494279 0.9039 postgres LWLockRelease
    488628 0.8936 postgres hash_any
    472906 0.8649 postgres nocachegetattr
    396482 0.7251 postgres grouping_planner
    382974 0.7004 postgres LockAcquireExtended
    375186 0.6861 postgres AllocSetFree
    375072 0.6859 postgres ProcArrayLockRelease
    373668 0.6834 postgres new_list
    365917 0.6692 postgres fmgr_info_cxt_security
    301398 0.5512 postgres ProcArrayLockAcquire
    300647 0.5498 postgres LockReleaseAll
    292073 0.5341 postgres DirectFunctionCall1Coll
    285745 0.5226 postgres MemoryContextAllocZero
    284684 0.5206 postgres FunctionCall2Coll
    282701 0.5170 postgres SearchSysCache
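
    (The profile above came from oprofile; with the legacy opcontrol
    tooling the collection looks roughly like this -- the exact options
    used here weren't recorded, so treat this as a sketch:)

        opcontrol --no-vmlinux --start   # start system-wide sampling
        # ... run the pgbench command shown below ...
        opcontrol --dump                 # flush the sample buffers
        opcontrol --shutdown             # stop the daemon
        opreport -l `which postgres`     # per-symbol report like the one above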

    max_connections = 100
    max_pred_locks_per_transaction = 64
    shared_buffers = 8GB
    maintenance_work_mem = 1GB
    checkpoint_segments = 300
    checkpoint_timeout = 15min
    checkpoint_completion_target = 0.9
    wal_writer_delay = 20ms
    seq_page_cost = 0.1
    random_page_cost = 0.1
    cpu_tuple_cost = 0.05
    effective_cache_size = 40GB
    default_transaction_isolation = '$iso'

    pgbench -i -s 100
    pgbench -S -M simple -T 300 -c 80 -j 80

    transaction type: SELECT only
    scaling factor: 100
    query mode: simple
    number of clients: 80
    number of threads: 80
    duration: 300 s
    number of transactions actually processed: 104391011
    tps = 347964.636256 (including connections establishing)
    tps = 347976.389034 (excluding connections establishing)

    vmstat 1 showed something different this time -- no clue why.

    procs --------------memory------------- ---swap-- -----io---- ---system---- -----cpu------
     r  b  swpd    free      buff     cache  si so bi bo    in      cs   us sy id wa st
    91  0  8196 4189436 203925700 52314492    0  0  0  0  32255 1522807  85 13  1  0  0
    92  0  8196 4189404 203925700 52314492    0  0  0  0  32796 1525463  85 14  1  0  0
    67  0  8196 4189404 203925700 52314488    0  0  0  0  32343 1527988  85 13  1  0  0
    93  0  8196 4189404 203925700 52314488    0  0  0  0  32701 1535827  85 13  1  0  0

    -Kevin
  • Kevin Grittner at Nov 18, 2011 at 5:20 pm

    "Kevin Grittner" wrote:

    We have a 32-core Intel box (4 x X7560 @ 2.27GHz) with 256 GB
    RAM.
    In case anyone cares, this is the same box for which I posted STREAM
    test results a while back. The PostgreSQL tests seem to peak on
    this 32-core box at 64 clients, while the STREAM test of raw RAM
    speed kept increasing up to 128 clients. Overall, though, it's
    impressive how close PostgreSQL is now coming to the raw RAM access
    speed curve.

    http://archives.postgresql.org/pgsql-hackers/2011-08/msg01306.php

    -Kevin
