For the last week or so, in between various other tasks, I've been
trying to understand why performance drops off when you run pgbench -n
-S -c $CLIENTS -j $CLIENTS -T $A_FEW_MINUTES at very high client
counts. The answer, in a word, is SIGetDataEntries(). I believe we
need to bite the bullet and rewrite this using a lock-free algorithm,
using memory barriers on processors with weak memory ordering.
Perhaps there is another way to do it, but nothing I've tried has
really worked so far, and I've tried quite a few things. Here's the
data.

On unpatched master, performance scales pretty much linearly out to 32
clients. As you add more clients, it drops off:

[1 client]
tps = 4459.203159 (including connections establishing)
tps = 4488.456559 (including connections establishing)
tps = 4449.570530 (including connections establishing)
[8 clients]
tps = 33524.668762 (including connections establishing)
tps = 33501.833907 (including connections establishing)
tps = 33382.380908 (including connections establishing)
[32 clients]
tps = 178394.863906 (including connections establishing)
tps = 178457.706972 (including connections establishing)
tps = 179365.310179 (including connections establishing)
[80 clients]
tps = 132518.586371 (including connections establishing)
tps = 130968.749747 (including connections establishing)
tps = 132574.338942 (including connections establishing)

With the lazy vxid locks patch, all of those numbers get better by a
few percentage points (the more clients, the more points) except at 80
clients, where the performance is sometimes better and sometimes
worse. These are 5-minute runs:

[80 clients, with lazy vxid locks]
tps = 119215.958372 (including connections establishing)
tps = 113056.859871 (including connections establishing)
tps = 160562.770998 (including connections establishing)

I can't explain in detail why there is so much variation between the
5-minute runs, or why some runs perform worse than without the lazy
vxid locks, but I can say that applying the first of the two attached
patches (sinval-per-backend-mutex.patch) fixes it. The patch gets rid
of SInvalReadLock and instead gives each backend its own spinlock.
SICleanupQueue() must therefore get somewhat fuzzy answers, but it
shouldn't matter. The only real problem is if you need to do the
super-duper-cleanup where you subtract a constant from all the values
in unison. I just got rid of that completely, by widening the
counters to 64 bits. Assuming I haven't botched the implementation,
this is clearly a win. There is still some performance drop-off going
from 32 clients to 80 clients, but it is much less.

[80 clients, with lazy vxid locks and sinval-per-backend-mutex]
tps = 167392.042393 (including connections establishing)
tps = 171336.145020 (including connections establishing)
tps = 170500.529303 (including connections establishing)

Profiling this combination of patches reveals that there is still some
pretty ugly spinlock contention on sinval's msgNumLock. And it occurs
to me that on x86, we really don't need this lock ... or
SInvalReadLock ... or a per-backend mutex. The whole of
SIGetDataEntries() can pretty easily be made lock-free. The only real
changes that seem to be needed are (1) to use a 64-bit counter, so
you never need to decrement and (2) to recheck resetState after
reading the entries from the queue, to see if we got reset while we
were reading those entries. Since x86 guarantees that writes will
become visible in the order they are executed, we only need to make
sure that the compiler doesn't rearrange things. As long as we first
read the maxMsgNum and then read the messages, we can't read garbage.
As long as we read the messages before we check resetState, we will be
sure to notice if we got reset before we read all the messages (which
is the only way that we can have read garbage messages).

Now this implementation (sinval-unlocked.patch) is obviously not a
serious proposal as it stands, because on processors with weak memory
ordering it will presumably blow up all over the place. But here are
the results on x86:

[80 clients, with lazy vxid locks and sinval-unlocked]
tps = 203256.701227 (including connections establishing)
tps = 190637.957571 (including connections establishing)
tps = 190228.617178 (including connections establishing)

Now, that's actually *faster* than the above 32-client numbers. Turns
out, with this approach, there's essentially no degradation between 32
clients and 80 clients. It appears that even one spinlock acquisition
in SIGetDataEntries() is too many. Currently, we have *three*: one to
get SInvalReadLock, one to get msgNumLock, and one to release
SInvalReadLock.

For contrast, on a two-core MacBook Pro, I can't measure any
difference at all between lazy-vxid,
lazy-vxid+sinval-per-backend-mutex, and lazy-vxid+sinval-unlocked.
The last might even be a touch slower, although there's enough noise
that it's hard to tell for sure, and the effect is very small if there
is one. But on the 32-core machine, the differences are dramatic.

Thoughts? Comments? Ideas?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


  • Kevin Grittner at Jul 21, 2011 at 4:16 pm

    Robert Haas wrote:

    SIGetDataEntries(). I believe we need to bite the bullet and
    rewrite this using a lock-free algorithm, using memory barriers on
    processors with weak memory ordering.
    [32 processors; 80 clients]
    On unpatched master
    tps = 132518.586371 (including connections establishing)
    tps = 130968.749747 (including connections establishing)
    tps = 132574.338942 (including connections establishing)
    With the lazy vxid locks patch
    tps = 119215.958372 (including connections establishing)
    tps = 113056.859871 (including connections establishing)
    tps = 160562.770998 (including connections establishing)
    gets rid of SInvalReadLock and instead gives each backend its own
    spinlock.
    tps = 167392.042393 (including connections establishing)
    tps = 171336.145020 (including connections establishing)
    tps = 170500.529303 (including connections establishing)
    SIGetDataEntries() can pretty easily be made lock-free.
    tps = 203256.701227 (including connections establishing)
    tps = 190637.957571 (including connections establishing)
    tps = 190228.617178 (including connections establishing)
    Thoughts? Comments? Ideas?
    Very impressive! Those numbers definitely justify some #ifdef code
    to provide alternatives for weak memory ordering machines versus
    others. With the number of CPUs climbing as it is, this is very
    important work!

    -Kevin
  • Robert Haas at Jul 21, 2011 at 6:31 pm

    On Thu, Jul 21, 2011 at 12:16 PM, Kevin Grittner wrote:
    Very impressive!  Those numbers definitely justify some #ifdef code
    to provide alternatives for weak memory ordering machines versus
    others.  With the number of CPUs climbing as it is, this is very
    important work!
    Thanks. I'm not thinking so much about #ifdef (although that could
    work, too) as I am about providing some primitives to allow this sort
    of code-writing to be done in a somewhat less ad-hoc fashion. It
    seems like there are basically two categories of machines we need to
    worry about.

    1. Machines with strong memory ordering. On this category of machines
    (which include x86), the CPU basically does not perform loads or
    stores out of order. On some of these machines, it is apparently
    possible for there to be some reordering of stores relative to loads,
    but if the program stores two values or loads two values, those
    operations will be performed in the same order they appear in the
    program. The main thing you need to make your code work reliably on
    these machines is a primitive that keeps the compiler from reordering
    your code during optimization. On x86, certain categories of exotic
    instructions do require

    2. Machines with weak memory ordering. On this category of machines
    (which includes PowerPC, Dec Alpha, and maybe some others), the CPU
    reorders memory accesses arbitrarily unless you explicitly issue
    instructions that enforce synchronization. You still need to keep the
    compiler from moving things around, too. Alpha is particularly
    pernicious, because something like a->b can fetch the pointed-to value
    before loading the pointer itself. This is otherwise known as "we
    have basically no cache coherency circuits on this chip at all". On
    these machines, you need to issue an explicit memory barrier
    instruction at each sequence point, or just acquire and release a
    spinlock.

    So you can imagine a primitive that is defined to be a compiler
    barrier on machines with strong memory ordering, and as a memory
    fencing instruction on machines with weak memory ordering.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Tom Lane at Jul 21, 2011 at 6:50 pm

    Robert Haas writes:
    ... On these machines, you need to issue an explicit memory barrier
    instruction at each sequence point, or just acquire and release a
    spinlock.
    Right, and the reason that a spinlock fixes it is that we have memory
    barrier instructions built into the spinlock code sequences on machines
    where it matters.

    To get to the point where we could do the sort of optimization Robert
    is talking about, someone will have to build suitable primitives for
    all the platforms we support. In the cases where we use gcc ASM in
    s_lock.h, it shouldn't be too hard to pull out the barrier
    instruction(s) ... but on platforms where we rely on OS-supplied
    functions, some research is going to be needed.

    regards, tom lane
  • Robert Haas at Jul 21, 2011 at 7:16 pm

    On Thu, Jul 21, 2011 at 2:50 PM, Tom Lane wrote:
    Robert Haas <robertmhaas@gmail.com> writes:
    ... On these machines, you need to issue an explicit memory barrier
    instruction at each sequence point, or just acquire and release a
    spinlock.
    Right, and the reason that a spinlock fixes it is that we have memory
    barrier instructions built into the spinlock code sequences on machines
    where it matters.

    To get to the point where we could do the sort of optimization Robert
    is talking about, someone will have to build suitable primitives for
    all the platforms we support.  In the cases where we use gcc ASM in
    s_lock.h, it shouldn't be too hard to pull out the barrier
    instruction(s) ... but on platforms where we rely on OS-supplied
    functions, some research is going to be needed.
    Yeah, although falling back to SpinLockAcquire() and SpinLockRelease()
    on a backend-private slock_t should work anywhere that PostgreSQL
    works at all[1]. That will probably be slower than a memory fence
    instruction and certainly slower than a compiler barrier, but the
    point is that - right now - we're doing it the slow way everywhere.

    I think the real challenge is going to be testing. If anyone has a
    machine with weak memory ordering they can give me access to, that
    would be really helpful for flushing the bugs out of this stuff.
    Getting it to work on x86 is not the hard part.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company

    [1] This was a suggestion from Noah Misch. I wasn't quite convinced
    when he initially made it, but having studied the issue a lot more, I
    now am. The CPU doesn't know how many processes have the memory
    mapped into their address space.
  • Dave Page at Jul 21, 2011 at 7:22 pm

    On Thu, Jul 21, 2011 at 8:15 PM, Robert Haas wrote:
    I think the real challenge is going to be testing.  If anyone has a
    machine with weak memory ordering they can give me access to, that
    would be really helpful for flushing the bugs out of this stuff.
    Getting it to work on x86 is not the hard part.
    I believe there's a PPC box in our storage facility in NJ that we
    might be able to dig out for you. There's also a couple in our India
    office. Let me know if they'd be of help.

    --
    Dave Page
    Blog: http://pgsnake.blogspot.com
    Twitter: @pgsnake

    EnterpriseDB UK: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Robert Haas at Jul 21, 2011 at 7:26 pm

    On Thu, Jul 21, 2011 at 3:22 PM, Dave Page wrote:
    On Thu, Jul 21, 2011 at 8:15 PM, Robert Haas wrote:
    I think the real challenge is going to be testing.  If anyone has a
    machine with weak memory ordering they can give me access to, that
    would be really helpful for flushing the bugs out of this stuff.
    Getting it to work on x86 is not the hard part.
    I believe there's a PPC box in our storage facility in NJ that we
    might be able to dig out for you. There's also a couple in our India
    office. Let me know if they'd be of help.
    Yes!

    More processors is better, of course, but having anything at all to
    test on would be an improvement.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Dave Page at Jul 21, 2011 at 7:29 pm

    On Thu, Jul 21, 2011 at 8:25 PM, Robert Haas wrote:
    On Thu, Jul 21, 2011 at 3:22 PM, Dave Page wrote:
    On Thu, Jul 21, 2011 at 8:15 PM, Robert Haas wrote:
    I think the real challenge is going to be testing.  If anyone has a
    machine with weak memory ordering they can give me access to, that
    would be really helpful for flushing the bugs out of this stuff.
    Getting it to work on x86 is not the hard part.
    I believe there's a PPC box in our storage facility in NJ that we
    might be able to dig out for you. There's also a couple in our India
    office. Let me know if they'd be of help.
    Yes!

    More processors is better, of course, but having anything at all to
    test on would be an improvement.
    OK, will check with India first, as it'll be easier for them to deploy.

    --
    Dave Page
    Blog: http://pgsnake.blogspot.com
    Twitter: @pgsnake

    EnterpriseDB UK: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Tom Lane at Jul 21, 2011 at 7:54 pm

    Robert Haas writes:
    I think the real challenge is going to be testing. If anyone has a
    machine with weak memory ordering they can give me access to, that
    would be really helpful for flushing the bugs out of this stuff.
    There are multi-CPU PPCen in the buildfarm, or at least there were last
    time I broke the sinval code ;-). Note that testing on a single-core
    PPC will prove nothing.

    regards, tom lane
  • Robert Haas at Jul 21, 2011 at 8:02 pm

    On Thu, Jul 21, 2011 at 3:53 PM, Tom Lane wrote:
    Robert Haas <robertmhaas@gmail.com> writes:
    I think the real challenge is going to be testing.  If anyone has a
    machine with weak memory ordering they can give me access to, that
    would be really helpful for flushing the bugs out of this stuff.
    There are multi-CPU PPCen in the buildfarm, or at least there were last
    time I broke the sinval code ;-).  Note that testing on a single-core
    PPC will prove nothing.
    Yeah, I was just thinking about that.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Florian Pflug at Jul 21, 2011 at 10:22 pm

    On Jul21, 2011, at 21:15 , Robert Haas wrote:
    On Thu, Jul 21, 2011 at 2:50 PM, Tom Lane wrote:
    Robert Haas <robertmhaas@gmail.com> writes:
    ... On these machines, you need to issue an explicit memory barrier
    instruction at each sequence point, or just acquire and release a
    spinlock.
    Right, and the reason that a spinlock fixes it is that we have memory
    barrier instructions built into the spinlock code sequences on machines
    where it matters.

    To get to the point where we could do the sort of optimization Robert
    is talking about, someone will have to build suitable primitives for
    all the platforms we support. In the cases where we use gcc ASM in
    s_lock.h, it shouldn't be too hard to pull out the barrier
    instruction(s) ... but on platforms where we rely on OS-supplied
    functions, some research is going to be needed.
    Yeah, although falling back to SpinLockAcquire() and SpinLockRelease()
    on a backend-private slock_t should work anywhere that PostgreSQL
    works at all[1]. That will probably be slower than a memory fence
    instruction and certainly slower than a compiler barrier, but the
    point is that - right now - we're doing it the slow way everywhere.
    As I discovered while playing with various lockless algorithms to
    improve our LWLocks, spin locks aren't actually a replacement for
    a (full) barrier.

    Lock acquisition only really needs to guarantee that loads and stores
    which come after the acquisition operation in program order (i.e., in
    the instruction stream) aren't globally visible before that operation
    completes. This kind of barrier behaviour is often fittingly called
    "acquire barrier".

    Similarly, a lock release operation only needs to guarantee that loads
    and stores which occur before that operation in program order are
    globally visible before the release operation completes. This, again,
    is fittingly called "release barrier".

    Now assume the following code fragment

    global1 = 1;
    SpinLockAcquire();
    SpinLockRelease();
    global2 = 1;

    If SpinLockAcquire() has "acquire barrier" semantics, and SpinLockRelease()
    has "release barrier" semantics, then it's possible for the store to global1
    to be delayed until after SpinLockAcquire(), and similarly for the store
    to global2 to be executed before SpinLockRelease() completes. In other
    words, what happens is

    SpinLockAcquire();
    global1 = 1;
    global2 = 1;
    SpinLockRelease();

    But once that can happen, there's no reason that it couldn't also be

    SpinLockAcquire();
    global2 = 1;
    global1 = 1;
    SpinLockRelease();

    I didn't check if any of our spin lock implementations is actually affected
    by this, but it doesn't seem wise to rely on them being full barriers, even
    if it may be true today.

    best regards,
    Florian Pflug
  • Robert Haas at Jul 21, 2011 at 11:35 pm

    On Thu, Jul 21, 2011 at 6:22 PM, Florian Pflug wrote:
    On Jul21, 2011, at 21:15 , Robert Haas wrote:
    On Thu, Jul 21, 2011 at 2:50 PM, Tom Lane wrote:
    Robert Haas <robertmhaas@gmail.com> writes:
    ... On these machines, you need to issue an explicit memory barrier
    instruction at each sequence point, or just acquire and release a
    spinlock.
    Right, and the reason that a spinlock fixes it is that we have memory
    barrier instructions built into the spinlock code sequences on machines
    where it matters.

    To get to the point where we could do the sort of optimization Robert
    is talking about, someone will have to build suitable primitives for
    all the platforms we support.  In the cases where we use gcc ASM in
    s_lock.h, it shouldn't be too hard to pull out the barrier
    instruction(s) ... but on platforms where we rely on OS-supplied
    functions, some research is going to be needed.
    Yeah, although falling back to SpinLockAcquire() and SpinLockRelease()
    on a backend-private slock_t should work anywhere that PostgreSQL
    works at all[1]. That will probably be slower than a memory fence
    instruction and certainly slower than a compiler barrier, but the
    point is that - right now - we're doing it the slow way everywhere.
    As I discovered while playing with various lockless algorithms to
    improve our LWLocks, spin locks aren't actually a replacement for
    a (full) barrier.

    Lock acquisition only really needs to guarantee that loads and stores
    which come after the acquisition operation in program order (i.e., in
    the instruction stream) aren't globally visible before that operation
    completes. This kind of barrier behaviour is often fittingly called
    "acquire barrier".

    Similarly, a lock release operation only needs to guarantee that loads
    and stores which occur before that operation in program order are
    globally visible before the release operation completes. This, again,
    is fittingly called "release barrier".

    Now assume the following code fragment

    global1 = 1;
    SpinLockAcquire();
    SpinLockRelease();
    global2 = 1;

    If SpinLockAcquire() has "acquire barrier" semantics, and SpinLockRelease()
    has "release barrier" semantics, then it's possible for the store to global1
    to be delayed until after SpinLockAcquire(), and similarly for the store
    to global2 to be executed before SpinLockRelease() completes. In other
    words, what happens is

    SpinLockAcquire();
    global1 = 1;
    global2 = 1;
    SpinLockRelease();

    But once that can happen, there's no reason that it couldn't also be

    SpinLockAcquire();
    global2 = 1;
    global1 = 1;
    SpinLockRelease();

    I didn't check if any of our spin lock implementations is actually affected
    by this, but it doesn't seem wise to rely on them being full barriers, even
    if it may be true today.
    Hmm. I'm not worried about that. AFAIK, only IA64 has such an
    implementation, and our existing spinlock implementation doesn't use
    it. If we were to add something like that in the future, we'd
    presumably know that we were doing it, and would add the appropriate
    memory barrier primitive at the same time.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Dan Ports at Jul 21, 2011 at 10:38 pm

    On Thu, Jul 21, 2011 at 02:31:15PM -0400, Robert Haas wrote:
    1. Machines with strong memory ordering. On this category of machines
    (which include x86), the CPU basically does not perform loads or
    stores out of order. On some of these machines, it is apparently
    possible for there to be some reordering of stores relative to loads,
    but if the program stores two values or loads two values, those
    operations will be performed in the same order they appear in the
    program.
    This is all correct, but...
    The main thing you need to make your code work reliably on
    these machines is a primitive that keeps the compiler from reordering
    your code during optimization.
    If you're suggesting that hardware memory barriers aren't going to be
    needed to implement lock-free code on x86, that isn't true. Because a
    read can be reordered with respect to a write to a different memory
    location, you can still have problems. So you do still need memory
    barriers, just fewer of them.

    Dekker's algorithm is the classic example: two threads each set a flag
    and then check whether the other thread's flag is set. In any
    sequential execution, at least one should see the other's flag set, but
    on the x86 that doesn't always happen. One thread's read might be
    reordered before its write.
    2. Machines with weak memory ordering. On this category of machines
    (which includes PowerPC, Dec Alpha, and maybe some others), the CPU
    reorders memory accesses arbitrarily unless you explicitly issue
    instructions that enforce synchronization. You still need to keep the
    compiler from moving things around, too. Alpha is particularly
    pernicious, because something like a->b can fetch the pointed-to value
    before loading the pointer itself. This is otherwise known as "we
    have basically no cache coherency circuits on this chip at all". On
    these machines, you need to issue an explicit memory barrier
    instruction at each sequence point, or just acquire and release a
    spinlock.
    The Alpha is pretty much unique (thankfully!) in allowing dependent
    reads to be reordered. That makes it even weaker than the typical
    weak-ordering machine. Since reading a pointer and then dereferencing
    it is a pretty reasonable thing to do regularly in RCU code, you
    probably don't want to emit barriers in between on architectures where
    it's not actually necessary. That argues for another operation that's
    defined to be a barrier (mb) on the Alpha but a no-op elsewhere.
    Certainly the Linux kernel found it useful to do so
    (read_barrier_depends)

    Alternatively, one might question how important it is to support the
    Alpha these days...

    Dan

    --
    Dan R. K. Ports MIT CSAIL http://drkp.net/
  • Robert Haas at Jul 22, 2011 at 12:25 am

    On Thu, Jul 21, 2011 at 6:44 PM, Dan Ports wrote:
    If you're suggesting that hardware memory barriers aren't going to be
    needed to implement lock-free code on x86, that isn't true. Because a
    read can be reordered with respect to a write to a different memory
    location, you can still have problems. So you do still need memory
    barriers, just fewer of them.

    Dekker's algorithm is the classic example: two threads each set a flag
    and then check whether the other thread's flag is set. In any
    sequential execution, at least one should see the other's flag set, but
    on the x86 that doesn't always happen. One thread's read might be
    reordered before its write.
    In the case of sinval, what we need to do for SIGetDataEntries() is,
    approximately, a bunch of loads, followed by a store to one of the
    locations we loaded (which no one else can have written meanwhile).
    So I think that's OK.

    In SIInsertDataEntries(), what we need to do is, approximately, take a
    lwlock, load from a location which can only be written while holding
    the lwlock, do a bunch of stores, ending with a store to that first
    location, and release the lwlock. I think that's OK, too.
    2. Machines with weak memory ordering.  On this category of machines
    (which includes PowerPC, Dec Alpha, and maybe some others), the CPU
    reorders memory accesses arbitrarily unless you explicitly issue
    instructions that enforce synchronization.  You still need to keep the
    compiler from moving things around, too.  Alpha is particularly
    pernicious, because something like a->b can fetch the pointed-to value
    before loading the pointer itself.  This is otherwise known as "we
    have basically no cache coherency circuits on this chip at all".  On
    these machines, you need to issue an explicit memory barrier
    instruction at each sequence point, or just acquire and release a
    spinlock.
    The Alpha is pretty much unique (thankfully!) in allowing dependent
    reads to be reordered. That makes it even weaker than the typical
    weak-ordering machine. Since reading a pointer and then dereferencing
    it is a pretty reasonable thing to do regularly in RCU code, you
    probably don't want to emit barriers in between on architectures where
    it's not actually necessary. That argues for another operation that's
    defined to be a barrier (mb) on the Alpha but a no-op elsewhere.
    Certainly the Linux kernel found it useful to do so
    (read_barrier_depends)

    Alternatively, one might question how important it is to support the
    Alpha these days...
    Well, currently, we do, so we probably don't want to drop support for
    that without some careful thought. I searched the archive and found
    someone trying to compile 8.3.something on Alpha just a few years ago,
    so it's apparently not totally dead yet.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Noah Misch at Jul 21, 2011 at 8:54 pm

    On Wed, Jul 20, 2011 at 09:46:33PM -0400, Robert Haas wrote:
    For the last week or so, in between various other tasks, I've been
    trying to understand why performance drops off when you run pgbench -n
    -S -c $CLIENTS -j $CLIENTS -T $A_FEW_MINUTES at very high client
    counts. The answer, in a word, is SIGetDataEntries(). I believe we
    need to bite the bullet and rewrite this using a lock-free algorithm,
    using memory barriers on processors with weak memory ordering.
    Perhaps there is another way to do it, but nothing I've tried has
    really worked so far, and I've tried quite a few things. Here's the
    data.

    On unpatched master, performance scales pretty much linearly out to 32
    clients. As you add more clients, it drops off:
    [80 clients]
    tps = 132518.586371 (including connections establishing)
    tps = 130968.749747 (including connections establishing)
    tps = 132574.338942 (including connections establishing)
    [80 clients, with lazy vxid locks and sinval-unlocked]
    tps = 203256.701227 (including connections establishing)
    tps = 190637.957571 (including connections establishing)
    tps = 190228.617178 (including connections establishing)
    Nice numbers. The sinval-unlocked.patch implementation looks like it's taking a
    sound direction.

    In
    http://archives.postgresql.org/message-id/CA+TgmobbxMh_9zjudheSWO6m8sBMb5hdZt+3ChCLuv5eztv8Ug@mail.gmail.com,
    you quoted 210k TPS when you stubbed out AcceptInvalidationMessages(). Is it
    correct to conclude that AcceptInvalidationMessages() still reduces the
    transaction rate by 5-10% with this stack of patches?

    --
    Noah Misch http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training & Services
  • Robert Haas at Jul 21, 2011 at 10:17 pm

    On Thu, Jul 21, 2011 at 4:54 PM, Noah Misch wrote:
    On Wed, Jul 20, 2011 at 09:46:33PM -0400, Robert Haas wrote:
    For the last week or so, in between various other tasks, I've been
    trying to understand why performance drops off when you run pgbench -n
    -S -c $CLIENTS -j $CLIENTS -T $A_FEW_MINUTES at very high client
    counts.  The answer, in a word, is SIGetDataEntries().  I believe we
    need to bite the bullet and rewrite this using a lock-free algorithm,
    using memory barriers on processors with weak memory ordering.
    Perhaps there is another way to do it, but nothing I've tried has
    really worked so far, and I've tried quite a few things.  Here's the
    data.

    On unpatched master, performance scales pretty much linearly out to 32
    clients.  As you add more clients, it drops off:
    [80 clients]
    tps = 132518.586371 (including connections establishing)
    tps = 130968.749747 (including connections establishing)
    tps = 132574.338942 (including connections establishing)
    [80 clients, with lazy vxid locks and sinval-unlocked]
    tps = 203256.701227 (including connections establishing)
    tps = 190637.957571 (including connections establishing)
    tps = 190228.617178 (including connections establishing)
    Nice numbers.  The sinval-unlocked.patch implementation looks like it's taking a
    sound direction.

    In
    http://archives.postgresql.org/message-id/CA+TgmobbxMh_9zjudheSWO6m8sBMb5hdZt+3ChCLuv5eztv8Ug@mail.gmail.com,
    you quoted 210k TPS when you stubbed out AcceptInvalidationMessages().  Is it
    correct to conclude that AcceptInvalidationMessages() still reduces the
    transaction rate by 5-10% with this stack of patches?
    Good question - I have not tested.

    One idea I just had... if we use a 64-bit counter for maxMsgNum, maybe
    we could make AcceptInvalidationMessages() a macro, something like
    this:

    if (MyProcState->nextMsgNum != shmInvalState->maxMsgNum)
        ReallyAcceptInvalidationMessages();

    That ought to be extremely cheap and - if we use 64-bit counters for
    the message-number counters - safe. You might object that the load of
    maxMsgNum might migrate backward, but it can't possibly back up any
    further than the preceding lock acquisition, since that's required to
    be a full memory barrier on every architecture. And if we haven't
    acquired a relevant lock, then a relevant sinval message could show up
    an instant after we check, regardless of the implementation.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Noah Misch at Jul 21, 2011 at 10:43 pm

    On Wed, Jul 20, 2011 at 09:46:33PM -0400, Robert Haas wrote:
    Profiling this combination of patches reveals that there is still some
    pretty ugly spinlock contention on sinval's msgNumLock. And it occurs
    to me that on x86, we really don't need this lock ... or
    SInvalReadLock ... or a per-backend mutex. The whole of
    SIGetDataEntries() can pretty easily be made lock-free. The only real
    changes that seem to be needed are (1) to use a 64-bit counter, so
    you never need to decrement
    On second thought, won't this be inadequate on 32-bit systems, where updating
    the 64-bit counter produces two stores? You must avoid reading it between those
    stores.

    --
    Noah Misch http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training & Services
  • Robert Haas at Jul 22, 2011 at 12:25 am

    On Thu, Jul 21, 2011 at 6:43 PM, Noah Misch wrote:
    On Wed, Jul 20, 2011 at 09:46:33PM -0400, Robert Haas wrote:
    Profiling this combination of patches reveals that there is still some
    pretty ugly spinlock contention on sinval's msgNumLock.  And it occurs
    to me that on x86, we really don't need this lock ... or
    SInvalReadLock ... or a per-backend mutex.  The whole of
    SIGetDataEntries() can pretty easily be made lock-free.  The only real
    changes that seem to be needed are (1) to use a 64-bit counter, so
    you never need to decrement
    On second thought, won't this be inadequate on 32-bit systems, where updating
    the 64-bit counter produces two stores?  You must avoid reading it between those
    stores.
    Now that is a potentially big problem.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Tom Lane at Jul 22, 2011 at 1:20 am

    Robert Haas writes:
    On Thu, Jul 21, 2011 at 6:43 PM, Noah Misch wrote:
    On Wed, Jul 20, 2011 at 09:46:33PM -0400, Robert Haas wrote:
    SIGetDataEntries() can pretty easily be made lock-free. The only real
    changes that seem to be needed are (1) to use a 64-bit counter, so
    you never need to decrement
    On second thought, won't this be inadequate on 32-bit systems, where updating
    the 64-bit counter produces two stores? You must avoid reading it between those stores.
    Now that is a potentially big problem.
    Could we do something similar to the xxid hacks? That is, we have a lot
    of counters that should be fairly close to each other, so we store only
    the low-order 32 bits of each notional value, and separately maintain a
    common high-order word. You probably would need some additional
    overhead each time the high-order word bumps, but that's reasonably
    infrequent.

    regards, tom lane
  • Robert Haas at Jul 22, 2011 at 3:37 am

    On Thu, Jul 21, 2011 at 9:19 PM, Tom Lane wrote:
    Robert Haas <robertmhaas@gmail.com> writes:
    On Thu, Jul 21, 2011 at 6:43 PM, Noah Misch wrote:
    On Wed, Jul 20, 2011 at 09:46:33PM -0400, Robert Haas wrote:
    SIGetDataEntries() can pretty easily be made lock-free.  The only real
    changes that seem to be needed are (1) to use a 64-bit counter, so
    you never need to decrement
    On second thought, won't this be inadequate on 32-bit systems, where updating
    the 64-bit counter produces two stores?  You must avoid reading it between those stores.
    Now that is a potentially big problem.
    Could we do something similar to the xxid hacks?  That is, we have a lot
    of counters that should be fairly close to each other, so we store only
    the low-order 32 bits of each notional value, and separately maintain a
    common high-order word.  You probably would need some additional
    overhead each time the high-order word bumps, but that's reasonably
    infrequent.
    Well, the trouble is figuring out what that additional overhead
    needs to look like. I think I have a simpler idea, though:
    before acquiring any locks, just have SIGetDataEntries() do this:

    +       if (stateP->nextMsgNum == segP->maxMsgNum && !stateP->resetState)
    +               return 0;

    Patch (with comment explaining why I think this is OK) attached. If
    the message numbers happen to be equal only because the counter has
    wrapped, then stateP->resetState will be true, so we'll still realize
    we need to do some work.

    Test results, with the lazy vxid patch plus this patch, at 8 clients:

    tps = 34028.144439 (including connections establishing)
    tps = 34079.085935 (including connections establishing)
    tps = 34125.295938 (including connections establishing)

    And at 32 clients:

    tps = 185521.605364 (including connections establishing)
    tps = 188250.700451 (including connections establishing)
    tps = 186077.847215 (including connections establishing)

    And at 80 clients:

    tps = 188568.886569 (including connections establishing)
    tps = 191035.971512 (including connections establishing)
    tps = 189363.019377 (including connections establishing)

    Not quite as good as the unlocked version, but better than the
    per-backend mutex, and a whole lot simpler.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Noah Misch at Jul 22, 2011 at 7:28 pm

    On Thu, Jul 21, 2011 at 11:37:27PM -0400, Robert Haas wrote:
    I think I have a simpler idea, though:
    before acquiring any locks, just have SIGetDataEntries() do this:

    +       if (stateP->nextMsgNum == segP->maxMsgNum && !stateP->resetState)
    +               return 0;

    Patch (with comment explaining why I think this is OK) attached. If
    the message numbers happen to be equal only because the counter has
    wrapped, then stateP->resetState will be true, so we'll still realize
    we need to do some work.
    This is attractive, and I don't see any problems with it. (In theory, you could
    hit a case where the load of resetState gives an ancient "false" just as the
    counters wrap to match. Given that the wrap interval is 1000000x as long as the
    reset interval, I'm not worried about problems on actual silicon.)

    +1 for doing this and moving on.

    --
    Noah Misch http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training & Services
  • Robert Haas at Jul 22, 2011 at 7:54 pm

    On Fri, Jul 22, 2011 at 3:28 PM, Noah Misch wrote:
    On Thu, Jul 21, 2011 at 11:37:27PM -0400, Robert Haas wrote:
    I think I have a simpler idea, though:
    before acquiring any locks, just have SIGetDataEntries() do this:

    +       if (stateP->nextMsgNum == segP->maxMsgNum && !stateP->resetState)
    +               return 0;

    Patch (with comment explaining why I think this is OK) attached.  If
    the message numbers happen to be equal only because the counter has
    wrapped, then stateP->resetState will be true, so we'll still realize
    we need to do some work.
    This is attractive, and I don't see any problems with it.  (In theory, you could
    hit a case where the load of resetState gives an ancient "false" just as the
    counters wrap to match.  Given that the wrap interval is 1000000x as long as the
    reset interval, I'm not worried about problems on actual silicon.)
    It's actually 262,144 times as long - see MSGNUMWRAPAROUND.

    It would be pretty easy to eliminate even the theoretical possibility
    of a race by getting rid of resetState altogether and using nextMsgNum
    = -1 to mean that. Maybe I should go ahead and do that.
    +1 for doing this and moving on.
    Yeah, I think I'll go ahead and commit something along these lines if
    no one objects. We can always fine-tune it more if needed (but it
    probably isn't).

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Noah Misch at Jul 25, 2011 at 10:24 pm

    On Fri, Jul 22, 2011 at 03:54:03PM -0400, Robert Haas wrote:
    On Fri, Jul 22, 2011 at 3:28 PM, Noah Misch wrote:
    This is attractive, and I don't see any problems with it.  (In theory, you could
    hit a case where the load of resetState gives an ancient "false" just as the
    counters wrap to match.  Given that the wrap interval is 1000000x as long as the
    reset interval, I'm not worried about problems on actual silicon.)
    It's actually 262,144 times as long - see MSGNUMWRAPAROUND.
    Ah, so it is.
    It would be pretty easy to eliminate even the theoretical possibility
    of a race by getting rid of resetState altogether and using nextMsgNum
    = -1 to mean that. Maybe I should go ahead and do that.
    Seems like a nice simplification.

    --
    Noah Misch http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training & Services
  • Robert Haas at Jul 26, 2011 at 5:36 pm

    On Mon, Jul 25, 2011 at 6:24 PM, Noah Misch wrote:
    On Fri, Jul 22, 2011 at 03:54:03PM -0400, Robert Haas wrote:
    On Fri, Jul 22, 2011 at 3:28 PM, Noah Misch wrote:
    This is attractive, and I don't see any problems with it.  (In theory, you could
    hit a case where the load of resetState gives an ancient "false" just as the
    counters wrap to match.  Given that the wrap interval is 1000000x as long as the
    reset interval, I'm not worried about problems on actual silicon.)
    It's actually 262,144 times as long - see MSGNUMWRAPAROUND.
    Ah, so it is.
    It would be pretty easy to eliminate even the theoretical possibility
    of a race by getting rid of resetState altogether and using nextMsgNum
    = -1 to mean that.  Maybe I should go ahead and do that.
    Seems like a nice simplification.
    On further reflection, I don't see that this helps: it just moves the
    problem around. With resetState as a separate variable, nextMsgNum is
    never changed by anyone other than the owner, so we can never have a
    stale load. But if we overload nextMsgNum to also indicate whether
    our state has been reset, then there's a race between when we load
    nextMsgNum and when we load maxMsgNum (instead of the code I posted
    previously, which has a race between when we load resetState and when
    we load maxMsgNum). Now, as you say, it seems really, really
    difficult to hit that in practice, but I don't see a way of getting
    rid of the theoretical possibility without either (1) a spinlock or
    (2) a fence. (Of course, on x86, the fence could be optimized down to
    a compiler barrier.) I guess the question is "should we worry about
    that?".

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Tom Lane at Jul 26, 2011 at 5:43 pm

    Robert Haas writes:
    On further reflection, I don't see that this helps: it just moves the
    problem around. With resetState as a separate variable, nextMsgNum is
    never changed by anyone other than the owner, so we can never have a
    stale load. But if we overload nextMsgNum to also indicate whether
    our state has been reset, then there's a race between when we load
    nextMsgNum and when we load maxMsgNum (instead of the code I posted
    previously, which has a race between when we load resetState and when
    we load maxMsgNum). Now, as you say, it seems really, really
    difficult to hit that in practice, but I don't see a way of getting
    rid of the theoretical possibility without either (1) a spinlock or
    (2) a fence. (Of course, on x86, the fence could be optimized down to
    a compiler barrier.) I guess the question is "should we worry about
    that?".
    Uh, yes. I've lost count of the number of times I've seen people hit
    one-instruction-wide race condition windows, SIGSEGV crash based on
    accessing only one byte past the valid data structure, etc etc.
    The worst thing about this type of bug is that you can't reproduce the
    failure when you try; doesn't mean your users won't see it.

    regards, tom lane
  • Alvaro Herrera at Jul 26, 2011 at 5:52 pm

    Excerpts from Tom Lane's message of Tue Jul 26 13:43:08 -0400 2011:

    Uh, yes. I've lost count of the number of times I've seen people hit
    one-instruction-wide race condition windows, SIGSEGV crash based on
    accessing only one byte past the valid data structure, etc etc.
    I think you need a better locking protocol on that counter of yours.

    --
    Álvaro Herrera <alvherre@commandprompt.com>
    The PostgreSQL Company - Command Prompt, Inc.
    PostgreSQL Replication, Consulting, Custom Development, 24x7 support
  • Simon Riggs at Jul 26, 2011 at 6:11 pm

    On Tue, Jul 26, 2011 at 6:36 PM, Robert Haas wrote:

    Now, as you say, it seems really, really
    difficult to hit that in practice, but I don't see a way of getting
    rid of the theoretical possibility without either (1) a spinlock or
    (2) a fence.  (Of course, on x86, the fence could be optimized down to
    a compiler barrier.)  I guess the question is "should we worry about
    that?".
    Perhaps the answer lies in a different direction altogether?

    Let me ask a few questions to stimulate a different solution

    * Can we do this using an active technique (e.g. signals) rather than
    a passive one (reading a counter?)

    * Can we partition the sinval lock, so we have multiple copies? That
    increases the task for those who trigger an invalidation, but will
    relieve the pressure for most readers.

    * Can we put the sinval info in a different place? e.g. inside each
    lock partition.

    * Why do we have a different mechanism for cache invalidation
    internally (sinval) to the one we offer externally (LISTEN/NOTIFY)?
    Why don't we have just one?

    --
    Simon Riggs                   http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training & Services
  • Alvaro Herrera at Jul 26, 2011 at 6:24 pm

    Excerpts from Simon Riggs's message of Tue Jul 26 14:11:19 -0400 2011:

    Let me ask a few questions to stimulate a different solution

    * Can we do this using an active technique (e.g. signals) rather than
    a passive one (reading a counter?)
    Signals are already in use for special cases (queue is full), and I
    think going through the kernel to achieve much more will lower
    performance significantly.
    * Can we partition the sinval lock, so we have multiple copies? That
    increases the task for those who trigger an invalidation, but will
    relieve the pressure for most readers.
    Not sure there's a way to meaningfully partition the queue. In any
    case, I think the problem being dealt with here is how to update the
    read heads of the queue, not its contents.
    * Can we put the sinval info in a different place? e.g. inside each
    lock partition.
    I don't think you can relate sinval messages to particular locks, which
    would make this infeasible. I think this bit is directly related to the
    point above.
    * Why do we have a different mechanism for cache invalidation
    internally (sinval) to the one we offer externally (LISTEN/NOTIFY)?
    Why don't we have just one?
    Performance. Not sure there are other reasons as well; but just this
    one is significant enough.

    --
    Álvaro Herrera <alvherre@commandprompt.com>
    The PostgreSQL Company - Command Prompt, Inc.
    PostgreSQL Replication, Consulting, Custom Development, 24x7 support
  • Simon Riggs at Jul 26, 2011 at 6:56 pm

    On Tue, Jul 26, 2011 at 7:24 PM, Alvaro Herrera wrote:
    Excerpts from Simon Riggs's message of Tue Jul 26 14:11:19 -0400 2011:
    Let me ask a few questions to stimulate a different solution

    * Can we do this using an active technique (e.g. signals) rather than
    a passive one (reading a counter?)
    Signals are already in use for special cases (queue is full), and I
    think going through the kernel to achieve much more will lower
    performance significantly.
    If there are no invalidations, there would be no signals. How would
    zero signals decrease performance?

    * Can we partition the sinval lock, so we have multiple copies? That
    increases the task for those who trigger an invalidation, but will
    relieve the pressure for most readers.
    Not sure there's a way to meaningfully partition the queue.  In any
    case, I think the problem being dealt with here is how to update the
    read heads of the queue, not its contents.
    I agree there's no meaningful way to partition the queue, but we can
    store the information in more than one place to reduce the contention
    of people requesting it.

    Both those ideas are still on the table.

    --
    Simon Riggs                   http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training & Services
  • Robert Haas at Jul 26, 2011 at 7:11 pm

    On Tue, Jul 26, 2011 at 2:56 PM, Simon Riggs wrote:
    On Tue, Jul 26, 2011 at 7:24 PM, Alvaro Herrera
    wrote:
    Excerpts from Simon Riggs's message of Tue Jul 26 14:11:19 -0400 2011:
    Let me ask a few questions to stimulate a different solution

    * Can we do this using an active technique (e.g. signals) rather than
    a passive one (reading a counter?)
    Signals are already in use for special cases (queue is full), and I
    think going through the kernel to achieve much more will lower
    performance significantly.
    If there are no invalidations, there would be no signals. How would
    zero signals decrease performance?
    It wouldn't, although it might be bad in the case where there are lots
    of temp tables being created and dropped. I think the biggest problem
    with signals is that they don't provide any meaningful synchronization
    guarantees. When you send somebody a signal, you don't really know
    how long it's going to take for them to receive it.
    * Can we partition the sinval lock, so we have multiple copies? That
    increases the task for those who trigger an invalidation, but will
    relieve the pressure for most readers.
    Not sure there's a way to meaningfully partition the queue.  In any
    case, I think the problem being dealt with here is how to update the
    read heads of the queue, not its contents.
    I agree there's no meaningful way to partition the queue, but we can
    store the information in more than one place to reduce the contention
    of people requesting it.
    I thought about that. Basically, that saves you a factor of N in
    contention on the read side (where N is the number of copies) and
    costs you a factor of N on the write side (you have to update N
    copies, taking a spinlock or lwlock for each). In the limit, you
    could do one copy of the counter per backend.

    I think, though, that a lock-free implementation using memory barriers
    is going to end up being the only real solution. We might possibly
    convince ourselves that we're OK with increasing the cost of
    SIInsertDataEntries(), but any solution that involves replicating the
    data is still based on doing at least some locking. And I am pretty
    well convinced that even one spinlock acquisition in
    SIInsertDataEntries() is too many.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Simon Riggs at Jul 26, 2011 at 7:25 pm

    On Tue, Jul 26, 2011 at 8:11 PM, Robert Haas wrote:

    It wouldn't, although it might be bad in the case where there are lots
    of temp tables being created and dropped.
    Do temp tables cause relcache invalidations?

    That seems like something we'd want to change in itself.

    --
    Simon Riggs                   http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training & Services
  • Robert Haas at Jul 27, 2011 at 1:36 am

    On Tue, Jul 26, 2011 at 3:25 PM, Simon Riggs wrote:
    On Tue, Jul 26, 2011 at 8:11 PM, Robert Haas wrote:
    It wouldn't, although it might be bad in the case where there are lots
    of temp tables being created and dropped.
    Do temp tables cause relcache invalidations?

    That seems like something we'd want to change in itself.
    I agree. Unfortunately, I think it's a non-trivial fix.

    I've also been wondering if we could avoid taking an
    AccessExclusiveLock on a newly created (temporary?) table. It seems
    like no one should be able to see it until commit, at which point we'd
    be releasing the lock anyway.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Simon Riggs at Jul 26, 2011 at 7:34 pm

    On Tue, Jul 26, 2011 at 8:11 PM, Robert Haas wrote:

    * Can we partition the sinval lock, so we have multiple copies? That
    increases the task for those who trigger an invalidation, but will
    relieve the pressure for most readers.
    Not sure there's a way to meaningfully partition the queue.  In any
    case, I think the problem being dealt with here is how to update the
    read heads of the queue, not its contents.
    I agree there's no meaningful way to partition the queue, but we can
    store the information in more than one place to reduce the contention
    of people requesting it.
    I thought about that.  Basically, that saves you a factor of N in
    contention on the read side (where N is the number of copies) and
    costs you a factor of N on the write side (you have to update N
    copies, taking a spinlock or lwlock for each).  In the limit, you
    could do one copy of the counter per backend.

    I think, though, that a lock-free implementation using memory barriers
    is going to end up being the only real solution.  We might possibly
    convince ourselves that we're OK with increasing the cost of
    SIInsertDataEntries(), but any solution that involves replicating the
    data is still based on doing at least some locking.  And I am pretty
    well convinced that even one spinlock acquisition in
    SIInsertDataEntries() is too many.
    You might be right, but I think we have little knowledge of how some
    memory barrier code you haven't written yet affects performance on
    various architectures.

    A spinlock per backend would cache very nicely, now you mention it. So
    my money would be on the multiple copies.

    It's not completely clear to me that updating N copies would be more
    expensive. Accessing N low contention copies rather than 1
    high-contention value might actually be a win.

    --
    Simon Riggs                   http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training & Services
  • Robert Haas at Jul 27, 2011 at 1:33 am

    On Tue, Jul 26, 2011 at 3:34 PM, Simon Riggs wrote:
    You might be right, but I think we have little knowledge of how some
    memory barrier code you haven't written yet affects performance on
    various architectures.

    A spinlock per backend would cache very nicely, now you mention it. So
    my money would be on the multiple copies.
    Maybe so, but you can see from the numbers in my OP that the results
    still leave something to be desired.
    It's not completely clear to me that updating N copies would be more
    expensive. Accessing N low contention copies rather than 1
    high-contention value might actually be a win.
    Yeah, I haven't tested that approach.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Tom Lane at Jul 26, 2011 at 7:25 pm

    Simon Riggs writes:
    On Tue, Jul 26, 2011 at 7:24 PM, Alvaro Herrera
    wrote:
    Signals are already in use for special cases (queue is full), and I
    think going through the kernel to achieve much more will lower
    performance significantly.
    If there are no invalidations, there would be no signals. How would
    zero signals decrease performance?
    But if there *is* an invalidation (which is not a negligible case),
    it'd get significantly slower.

    regards, tom lane
  • Noah Misch at Jul 26, 2011 at 8:38 pm

    On Tue, Jul 26, 2011 at 01:36:26PM -0400, Robert Haas wrote:
    On Mon, Jul 25, 2011 at 6:24 PM, Noah Misch wrote:
    On Fri, Jul 22, 2011 at 03:54:03PM -0400, Robert Haas wrote:
    On Fri, Jul 22, 2011 at 3:28 PM, Noah Misch wrote:
    This is attractive, and I don't see any problems with it.  (In theory, you could
    hit a case where the load of resetState gives an ancient "false" just as the
    counters wrap to match.  Given that the wrap interval is 1000000x as long as the
    reset interval, I'm not worried about problems on actual silicon.)
    It's actually 262,144 times as long - see MSGNUMWRAPAROUND.
    Ah, so it is.
    It would be pretty easy to eliminate even the theoretical possibility
    of a race by getting rid of resetState altogether and using nextMsgNum
    = -1 to mean that.  Maybe I should go ahead and do that.
    Seems like a nice simplification.
    On further reflection, I don't see that this helps: it just moves the
    problem around.
    Alas, yes.
    With resetState as a separate variable, nextMsgNum is
    never changed by anyone other than the owner, so we can never have a
    stale load.
    It's also changed when SICleanupQueue() decides to wrap everyone. This could
    probably be eliminated by using uint32 and letting overflow take care of wrap
    implicitly, but ...
    But if we overload nextMsgNum to also indicate whether
    our state has been reset, then there's a race between when we load
    nextMsgNum and when we load maxMsgNum (instead of the code I posted
    previously, which has a race between when we load resetState and when
    we load maxMsgNum). Now, as you say, it seems really, really
    difficult to hit that in practice, but I don't see a way of getting
    rid of the theoretical possibility without either (1) a spinlock or
    (2) a fence. (Of course, on x86, the fence could be optimized down to
    a compiler barrier.)
    No new ideas come to mind, here. We could migrate back toward your original
    proposal of making the counter a non-wrapping uint64; Florian described some
    nice techniques for doing atomic 64-bit reads on x86:

    http://archives.postgresql.org/message-id/7C94C386-122F-4918-8624-A14352E8DBC5@phlo.org

    I'm not personally acquainted with those approaches, but they sound promising.
    Since the overlap between 32-bit installations and performance-sensitive
    installations is all but gone, we could even just use a spinlock under 32-bit.
    I guess the question is "should we worry about that?".
    I'd lean toward "no". I share Tom's unease about writing off a race condition
    as being too improbable, but this is quite exceptionally improbable. On the
    other hand, some of the fixes don't look too invasive.

    --
    Noah Misch http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training & Services
  • Robert Haas at Jul 27, 2011 at 1:57 am

    On Tue, Jul 26, 2011 at 4:38 PM, Noah Misch wrote:
    No new ideas come to mind, here.
    OK, I have a new idea. :-)

    1. Add a new flag to each procState called something like "timeToPayAttention".
    2. Each call to SIGetDataEntries() iterates over all the ProcStates
    whose index is < lastBackend and sets stateP->timeToPayAttention =
    TRUE for each.
    3. At the beginning of SIGetDataEntries(), we do an unlocked if
    (!stateP->timeToPayAttention) return 0.
    4. Immediately following that if statement and before acquiring any
    locks, we set stateP->timeToPayAttention = FALSE.

    The LWLockRelease() in SIGetDataEntries() will be sufficient to force
    all of the stateP->timeToPayAttention writes out to main memory, and
    the read side is OK because we either just took a lock (which acted as
    a fence) or else there's a race anyway. But unlike my previous
    proposal, it doesn't involve *comparing* anything. We needn't worry
    about whether we could read two different values that are through
    great misfortune out of sync because we're only reading one value.

    If, by chance, the value is set to true just after we set it to false,
    that won't mess us up either: we'll still read all the messages after
    acquiring the lock.

    Now, there's some downside to this approach - it involves doing O(N)
    work for each invalidation message we receive. But as long as it's
    only O(N) stores and not O(N) lock acquisitions and releases, I think
    that might be OK.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Robert Haas at Jul 27, 2011 at 4:12 pm

    On Tue, Jul 26, 2011 at 9:57 PM, Robert Haas wrote:
    On Tue, Jul 26, 2011 at 4:38 PM, Noah Misch wrote:
    No new ideas come to mind, here.
    OK, I have a new idea.  :-)

    1. Add a new flag to each procState called something like "timeToPayAttention".
    2. Each call to SIGetDataEntries() iterates over all the ProcStates
    whose index is < lastBackend and sets stateP->timeToPayAttention =
    TRUE for each.
    3. At the beginning of SIGetDataEntries(), we do an unlocked if
    (!stateP->timeToPayAttention) return 0.
    4. Immediately following that if statement and before acquiring any
    locks, we set stateP->timeToPayAttention = FALSE.

    The LWLockRelease() in SIGetDataEntries() will be sufficient to force
    all of the stateP->timeToPayAttention writes out to main memory, and
    the read side is OK because we either just took a lock (which acted as
    a fence) or else there's a race anyway.  But unlike my previous
    proposal, it doesn't involve *comparing* anything.  We needn't worry
    about whether we could read two different values that are through
    great misfortune out of sync because we're only reading one value.

    If, by chance, the value is set to true just after we set it to false,
    that won't mess us up either: we'll still read all the messages after
    acquiring the lock.
    There turned out to be a little bit of further subtlety to this, but
    it seems to work. Patch attached.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Tom Lane at Jul 27, 2011 at 4:34 pm

    Robert Haas writes:
    On Tue, Jul 26, 2011 at 9:57 PM, Robert Haas wrote:
    1. Add a new flag to each procState called something like "timeToPayAttention".
    2. Each call to SIGetDataEntries() iterates over all the ProcStates
    whose index is < lastBackend and sets stateP->timeToPayAttention =
    TRUE for each.
    3. At the beginning of SIGetDataEntries(), we do an unlocked if
    (!stateP->timeToPayAttention) return 0.
    4. Immediately following that if statement and before acquiring any
    locks, we set stateP->timeToPayAttention = FALSE.
    There turned out to be a little bit of further subtlety to this, but
    it seems to work. Patch attached.
    And?

    It didn't sound to me like this could possibly be a performance win,
    but I await some numbers ...

    regards, tom lane
  • Robert Haas at Jul 27, 2011 at 4:52 pm

    On Wed, Jul 27, 2011 at 12:34 PM, Tom Lane wrote:
    Robert Haas <robertmhaas@gmail.com> writes:
    On Tue, Jul 26, 2011 at 9:57 PM, Robert Haas wrote:
    1. Add a new flag to each procState called something like "timeToPayAttention".
    2. Each call to SIGetDataEntries() iterates over all the ProcStates
    whose index is < lastBackend and sets stateP->timeToPayAttention =
    TRUE for each.
    3. At the beginning of SIGetDataEntries(), we do an unlocked if
    (!stateP->timeToPayAttention) return 0.
    4. Immediately following that if statement and before acquiring any
    locks, we set stateP->timeToPayAttention = FALSE.
    There turned out to be a little bit of further subtlety to this, but
    it seems to work.  Patch attached.
    And?

    It didn't sound to me like this could possibly be a performance win,
    but I await some numbers ...
    On master, with the patch, scale factor 100, pgbench -S -c $CLIENTS -j
    $CLIENTS -T 300 results for different client counts, on a 32-core
    machine with 128GB RAM, shared_buffers = 8GB:

    01 tps = 4444.280459 (including connections establishing)
    01 tps = 4438.365576 (including connections establishing)
    01 tps = 4423.718466 (including connections establishing)
    08 tps = 33187.827872 (including connections establishing)
    08 tps = 33288.247330 (including connections establishing)
    08 tps = 33304.307835 (including connections establishing)
    32 tps = 178876.350559 (including connections establishing)
    32 tps = 177293.082295 (including connections establishing)
    32 tps = 175577.058885 (including connections establishing)
    80 tps = 159202.449993 (including connections establishing)
    80 tps = 151541.717088 (including connections establishing)
    80 tps = 150454.658132 (including connections establishing)

    Without the patch:

    01 tps = 4447.660101 (including connections establishing)
    01 tps = 4440.830713 (including connections establishing)
    01 tps = 4411.610348 (including connections establishing)
    08 tps = 33135.773476 (including connections establishing)
    08 tps = 33365.387051 (including connections establishing)
    08 tps = 33364.859705 (including connections establishing)
    32 tps = 175834.280471 (including connections establishing)
    32 tps = 176713.118271 (including connections establishing)
    32 tps = 176830.687087 (including connections establishing)
    80 tps = 135211.036587 (including connections establishing)
    80 tps = 130642.264016 (including connections establishing)
    80 tps = 133621.549513 (including connections establishing)

    I'm fairly certain the results will be even more dramatic with the
    lazy-vxid patch applied, since master still has to fight with the lock
    manager on this workload. I haven't tested that yet, but there's not
    much reason to suppose that the results will be dramatically different
    from any of the other methods of reducing the sinval overhead that
    I've benchmarked, many of which I've already posted about. See, for
    example, the OP on this thread.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Kevin Grittner at Jul 27, 2011 at 4:39 pm

    Robert Haas wrote:

    There turned out to be a little bit of further subtlety to this,
    but it seems to work. Patch attached.
    Stylistic question: Why is stateP->hasMessages set to false in one
    place and FALSE and TRUE in others? It seems like it would be less
    confusing to be consistent.

    A quick count of the usage of both forms in the PostgreSQL source
    code shows the lowercase forms are used about 10 times more often:

    kgrittn@kgrittn-desktop:~/git/postgresql/kgrittn$ find -name '*.h'
    -or -name '*.c' | egrep -v '/tmp_check/' | xargs cat | egrep -c
    '\b(FALSE|TRUE)\b'
    1670

    kgrittn@kgrittn-desktop:~/git/postgresql/kgrittn$ find -name '*.h'
    -or -name '*.c' | egrep -v '/tmp_check/' | xargs cat | egrep -c
    '\b(false|true)\b'
    17349

    -Kevin
  • Noah Misch at Jul 27, 2011 at 4:55 pm

    On Tue, Jul 26, 2011 at 09:57:10PM -0400, Robert Haas wrote:
    On Tue, Jul 26, 2011 at 4:38 PM, Noah Misch wrote:
    No new ideas come to mind, here.
    OK, I have a new idea. :-)

    1. Add a new flag to each procState called something like "timeToPayAttention".
    2. Each call to SIGetDataEntries() iterates over all the ProcStates
    whose index is < lastBackend and sets stateP->timeToPayAttention =
    TRUE for each.
    3. At the beginning of SIGetDataEntries(), we do an unlocked if
    (!stateP->timeToPayAttention) return 0.
    4. Immediately following that if statement and before acquiring any
    locks, we set stateP->timeToPayAttention = FALSE.

    The LWLockRelease() in SIGetDataEntries() will be sufficient to force
    all of the stateP->timeToPayAttention writes out to main memory, and
    the read side is OK because we either just took a lock (which acted as
    a fence) or else there's a race anyway. But unlike my previous
    proposal, it doesn't involve *comparing* anything. We needn't worry
    about reading two different values that are, through great
    misfortune, out of sync, because we're only reading one value.

    If, by chance, the value is set to true just after we set it to false,
    that won't mess us up either: we'll still read all the messages after
    acquiring the lock.
    This approach would work if a spinlock release constrained the global stores
    timeline. It makes a weaker guarantee: all stores preceding the lock release
    in program order will precede it globally. Consequently, no processor will
    observe the spinlock to be available without also observing all stores made by
    the last holding processor before that processor released it. That guarantee
    is not enough for this sequence of events:

    Backend 0:
    add a message for table "a"
    store 5 => maxMsgNum
    store true => timeToPayAttention
    LWLockRelease(SInvalWriteLock)
    <plenty of time passes>
    Backend 2:
    LOCK TABLE a;
    load timeToPayAttention => true
    Backend 1:
    add a message for table "b"
    store 6 => maxMsgNum
    SpinLockRelease(&vsegP->msgnumLock);
    store true => timeToPayAttention
    Backend 2:
    store false => timeToPayAttention
    LWLockAcquire(SInvalReadLock, LW_SHARED)
    load maxMsgNum => 5 [!]
    Backend 1:
    LWLockRelease(SInvalWriteLock);
    Backend 2:
    LOCK TABLE b;
    load timeToPayAttention => false
    <use b without processing updates>

    The "SpinLockRelease(&vsegP->msgnumLock)" guarantees that any process
    observing the backend 1 store of timeToPayAttention will also observe
    maxMsgNum = 6. However, nothing constrains which timeToPayAttention store
    will "win" in main memory. Backend 2 can observe neither update from backend
    1, yet still have its store appear later than the backend 1 stores in the
    global timeline.
    Now, there's some downside to this approach - it involves doing O(N)
    work for each invalidation message we receive. But as long as it's
    only O(N) stores and not O(N) lock acquisitions and releases, I think
    that might be OK.
    I think a benchmark is in order, something like 900 idle connections and 80
    connections running small transactions that create a few temporary tables. If
    that shows no statistically significant regression, then we're probably fine
    here. I'm not sure what result to expect, honestly.

    What did you think of making the message number a uint64, removing wraparound
    handling, and retaining SISeg.msgnumLock for 32-bit only? You could isolate the
    variant logic in READ_MSGNUM and WRITE_MSGNUM macros.

    --
    Noah Misch http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training & Services
  • Robert Haas at Jul 27, 2011 at 5:30 pm

    On Wed, Jul 27, 2011 at 12:55 PM, Noah Misch wrote:
    This approach would work if a spinlock release constrained the global stores
    timeline.  It makes a weaker guarantee: all stores preceding the lock release
    in program order will precede it globally.  Consequently, no processor will
    observe the spinlock to be available without also observing all stores made by
    the last holding processor before that processor released it.  That guarantee
    is not enough for this sequence of events:

    Backend 0:
    add a message for table "a"
    store 5 => maxMsgNum
    store true => timeToPayAttention
    LWLockRelease(SInvalWriteLock)
    <plenty of time passes>
    Backend 2:
    LOCK TABLE a;
    load timeToPayAttention => true
    Backend 1:
    add a message for table "b"
    store 6 => maxMsgNum
    SpinLockRelease(&vsegP->msgnumLock);
    store true => timeToPayAttention
    Backend 2:
    store false => timeToPayAttention
    LWLockAcquire(SInvalReadLock, LW_SHARED)
    load maxMsgNum => 5 [!]
    Eh, how can this possibly happen? You have to hold msgNumLock to
    set maxMsgNum and msgNumLock to read maxMsgNum. If that's not enough
    to guarantee that we never read a stale value, what is?
    I think a benchmark is in order, something like 900 idle connections and 80
    connections running small transactions that create a few temporary tables.  If
    that shows no statistically significant regression, then we're probably fine
    here.  I'm not sure what result to expect, honestly.
    That's setting the bar pretty high. I don't mind doing the
    experiment, but I'm not sure that's the case we should be optimizing
    for.
    What did you think of making the message number a uint64, removing wraparound
    handling, and retaining SISeg.msgnumLock for 32-bit only?  You could isolate the
    variant logic in READ_MSGNUM and WRITE_MSGNUM macros.
    Well, what you really need to know is whether the platform has 8-byte
    atomic stores, which doesn't seem trivial to figure out, plus you need
    a memory fence. If that's the only method of fixing this problem we
    can agree on, I'm willing to work on it, but an
    architecture-independent fix would be nicer.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Noah Misch at Jul 27, 2011 at 5:58 pm

    On Wed, Jul 27, 2011 at 01:30:47PM -0400, Robert Haas wrote:
    On Wed, Jul 27, 2011 at 12:55 PM, Noah Misch wrote:
    [wrong objection]
    Eh, how can this possibly happen? You have to hold msgNumLock to
    set maxMsgNum and msgNumLock to read maxMsgNum. If that's not enough
    to guarantee that we never read a stale value, what is?
    Indeed, my analysis was all wrong.
    I think a benchmark is in order, something like 900 idle connections and 80
    connections running small transactions that create a few temporary tables.  If
    that shows no statistically significant regression, then we're probably fine
    here.  I'm not sure what result to expect, honestly.
    That's setting the bar pretty high. I don't mind doing the
    experiment, but I'm not sure that's the case we should be optimizing
    for.
    Granted. How about 32 clients running the temporary table transaction, no idle
    connections? Given the meager benefit of this patch compared to your previous
    version, it would be hard to justify a notable performance drop in return.
    What did you think of making the message number a uint64, removing wraparound
    handling, and retaining SISeg.msgnumLock for 32-bit only?  You could isolate the
    variant logic in READ_MSGNUM and WRITE_MSGNUM macros.
    Well, what you really need to know is whether the platform has 8-byte
    atomic stores, which doesn't seem trivial to figure out, plus you need
    a memory fence. If that's the only method of fixing this problem we
    can agree on, I'm willing to work on it, but an
    architecture-independent fix would be nicer.
    Fair enough.

    Thanks,
    nm

    --
    Noah Misch http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training & Services
  • Robert Haas at Jul 27, 2011 at 6:29 pm

    On Wed, Jul 27, 2011 at 1:58 PM, Noah Misch wrote:
    I think a benchmark is in order, something like 900 idle connections and 80
    connections running small transactions that create a few temporary tables.  If
    that shows no statistically significant regression, then we're probably fine
    here.  I'm not sure what result to expect, honestly.
    That's setting the bar pretty high.  I don't mind doing the
    experiment, but I'm not sure that's the case we should be optimizing
    for.
    Granted.  How about 32 clients running the temporary table transaction, no idle
    connections?  Given the meager benefit of this patch compared to your previous
    version, it would be hard to justify a notable performance drop in return.
    The reason the benefit is smaller is, I believe, because the previous
    numbers were generated with the lazy vxid locks patch applied, and
    these were generated against master. With the lock manager as a
    bottleneck, the sinval stuff doesn't get hit quite as hard, so the
    benefit is less. I can regenerate the numbers with the lazy vxid
    patch applied; I suspect they'll be similar to what we saw before.
    I'll also test out creating and dropping some tables.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Robert Haas at Jul 28, 2011 at 2:05 pm

    On Wed, Jul 27, 2011 at 2:29 PM, Robert Haas wrote:
    The reason the benefit is smaller is, I believe, because the previous
    numbers were generated with the lazy vxid locks patch applied, and
    these were generated against master.  With the lock manager as a
    bottleneck, the sinval stuff doesn't get hit quite as hard, so the
    benefit is less.  I can regenerate the numbers with the lazy vxid
    patch applied; I suspect they'll be similar to what we saw before.
    Yep. Here's with both lazy-vxid and sinval-hasmessages:

    01 tps = 4470.133776 (including connections establishing)
    01 tps = 4471.450650 (including connections establishing)
    01 tps = 4490.833194 (including connections establishing)
    32 tps = 191416.960956 (including connections establishing)
    32 tps = 190653.742400 (including connections establishing)
    32 tps = 191832.231458 (including connections establishing)
    80 tps = 189348.509378 (including connections establishing)
    80 tps = 191080.641878 (including connections establishing)
    80 tps = 191366.728275 (including connections establishing)

    And with just lazy vxids:

    01 tps = 4458.667013 (including connections establishing)
    01 tps = 4526.922638 (including connections establishing)
    01 tps = 4480.415099 (including connections establishing)
    32 tps = 193273.434028 (including connections establishing)
    32 tps = 190661.279391 (including connections establishing)
    32 tps = 189526.560031 (including connections establishing)
    80 tps = 150572.020250 (including connections establishing)
    80 tps = 118643.970622 (including connections establishing)
    80 tps = 119211.643930 (including connections establishing)

    Same select-only, scale-factor-100 pgbench test, same 32 core machine,
    as I've been using for my other recent tests.
    I'll also test out creating and dropping some tables.
    Still need to work on this one.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Robert Haas at Jul 28, 2011 at 7:03 pm

    On Thu, Jul 28, 2011 at 10:05 AM, Robert Haas wrote:
    I'll also test out creating and dropping some tables.
    Still need to work on this one.
    And the results are in. I set up the following sophisticated test
    script for pgbench:

    CREATE TEMP TABLE foo (a int);
    DROP TABLE foo;

    And then did 5-minute test runs with varying numbers of clients on
    Nate Boley's 32-core machine with (a) master, (b) master +
    sinval-hasmessages, (c) master + lazy-vxid, and (d) master + lazy-vxid
    + sinval-hasmessages. I held my breath until the results started
    popping out, then felt much better. In the table below, results with
    L are with the lazy-vxid patch, S is with the sinval-hasmessages
    patch, LS with both, and results with no letters are neither. The
    numbers are the client count. Long story short, it seems that the
    patch actually makes this test case faster at any client count, though
    the improvement at 1 and 8 clients might be in the noise:

    01L tps = 514.880290 (including connections establishing)
    01L tps = 525.097199 (including connections establishing)
    01L tps = 508.319588 (including connections establishing)
    08L tps = 1834.259810 (including connections establishing)
    08L tps = 1846.846089 (including connections establishing)
    08L tps = 1841.402433 (including connections establishing)
    32L tps = 1463.822936 (including connections establishing)
    32L tps = 1481.169483 (including connections establishing)
    32L tps = 1393.780335 (including connections establishing)
    80L tps = 1192.768020 (including connections establishing)
    80L tps = 1165.545010 (including connections establishing)
    80L tps = 1169.776066 (including connections establishing)

    01LS tps = 517.624068 (including connections establishing)
    01LS tps = 524.507723 (including connections establishing)
    01LS tps = 507.847622 (including connections establishing)
    08LS tps = 1831.248178 (including connections establishing)
    08LS tps = 1873.932133 (including connections establishing)
    08LS tps = 1863.048113 (including connections establishing)
    32LS tps = 1851.143407 (including connections establishing)
    32LS tps = 1754.683356 (including connections establishing)
    32LS tps = 1785.926527 (including connections establishing)
    80LS tps = 1510.778084 (including connections establishing)
    80LS tps = 1484.423486 (including connections establishing)
    80LS tps = 1480.692051 (including connections establishing)

    01 tps = 511.572832 (including connections establishing)
    01 tps = 499.389527 (including connections establishing)
    01 tps = 495.697080 (including connections establishing)
    08 tps = 1832.762548 (including connections establishing)
    08 tps = 1819.884564 (including connections establishing)
    08 tps = 1835.608561 (including connections establishing)
    32 tps = 1417.168790 (including connections establishing)
    32 tps = 1447.478971 (including connections establishing)
    32 tps = 1427.489879 (including connections establishing)
    80 tps = 1154.272515 (including connections establishing)
    80 tps = 1168.805881 (including connections establishing)
    80 tps = 1173.971801 (including connections establishing)

    01S tps = 519.860218 (including connections establishing)
    01S tps = 510.759087 (including connections establishing)
    01S tps = 517.159276 (including connections establishing)
    08S tps = 1880.179600 (including connections establishing)
    08S tps = 1829.693454 (including connections establishing)
    08S tps = 1886.168401 (including connections establishing)
    32S tps = 1809.950929 (including connections establishing)
    32S tps = 1809.474070 (including connections establishing)
    32S tps = 1798.620842 (including connections establishing)
    80S tps = 1483.037788 (including connections establishing)
    80S tps = 1481.059504 (including connections establishing)
    80S tps = 1487.215748 (including connections establishing)

    So, apparently, the extra work in SIInsertDataEntries() is more than
    paid for by the speedup in SIGetDataEntries(). I'm guessing that at
    high client counts you win because of reduced spinlock contention, and
    at low client counts you still win a little bit because
    SIGetDataEntries() is called multiple times per transaction, whereas
    SIInsertDataEntries() is only called once. I could be all wet on the
    reason, but at any rate the numbers look pretty good.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Noah Misch at Jul 29, 2011 at 3:05 am

    On Thu, Jul 28, 2011 at 03:03:05PM -0400, Robert Haas wrote:
    On Thu, Jul 28, 2011 at 10:05 AM, Robert Haas wrote:
    I'll also test out creating and dropping some tables.
    Still need to work on this one.
    And the results are in. I set up the following sophisticated test
    script for pgbench:

    CREATE TEMP TABLE foo (a int);
    DROP TABLE foo;

    And then did 5-minute test runs with varying numbers of clients on
    Nate Boley's 32-core machine with (a) master, (b) master +
    sinval-hasmessages, (c) master + lazy-vxid, and (d) master + lazy-vxid
    + sinval-hasmessages.
    The comparison I had in mind was (a) master + lazy-vxid + [1]sinval-fastpath
    vs. (b) master + lazy-vxid + [2]sinval-hasmessages. The only claimed benefit of
    [2] over [1], as far as I can see, is invulnerability to the hazard described in
    [3]. Before selecting [2] over [1], it would be instructive to know how much
    performance we exchange for its benefit.

    I did a bit of (relatively unrigorous) poking at this workload with oprofile,
    and I never saw SIInsertDataEntries take more than 0.26% of CPU time. It's
    perhaps too cheap relative to the workload's other costs to matter. That's good
    enough for me.

    [1] http://archives.postgresql.org/message-id/CA+TgmoZc8Z_JTj44xvpWpXKQt2jGjB5YGCZ3T9u5-QZVdBmyCA@mail.gmail.com
    [2] http://archives.postgresql.org/message-id/CA+TgmoZjo1bkP6eX=xOX3aHuPYbmJT9+PKW6qubQzN7sukK3Dg@mail.gmail.com
    [3] http://archives.postgresql.org/message-id/20110727033537.GB18910@tornado.leadboat.com
    80L tps = 1192.768020 (including connections establishing)
    80L tps = 1165.545010 (including connections establishing)
    80L tps = 1169.776066 (including connections establishing)
    80LS tps = 1510.778084 (including connections establishing)
    80LS tps = 1484.423486 (including connections establishing)
    80LS tps = 1480.692051 (including connections establishing)
    80 tps = 1154.272515 (including connections establishing)
    80 tps = 1168.805881 (including connections establishing)
    80 tps = 1173.971801 (including connections establishing)
    80S tps = 1483.037788 (including connections establishing)
    80S tps = 1481.059504 (including connections establishing)
    80S tps = 1487.215748 (including connections establishing)

    So, apparently, the extra work in SIInsertDataEntries() is more than
    paid for by the speedup in SIGetDataEntries(). I'm guessing that at
    high client counts you win because of reduced spinlock contention, and
    at low client counts you still win a little bit because
    SIGetDataEntries() is called multiple times per transaction, whereas
    SIInsertDataEntries() is only called once. I could be all wet on the
    reason, but at any rate the numbers look pretty good.
    Interesting. I had hypothesized that at two clients per core, nearly every call
    to SIGetDataEntries() would find work to do, making the patch a strict loss. A
    bit of instrumented testing showed otherwise: at two clients per core,
    sinval-hasmessages optimized away 93% of the calls. At fifty clients per core,
    it still optimized away more than half of the calls.

    --
    Noah Misch http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training & Services
  • Robert Haas at Jul 29, 2011 at 4:37 am

    On Thu, Jul 28, 2011 at 11:05 PM, Noah Misch wrote:
    The comparison I had in mind was (a) master + lazy-vxid + [1]sinval-fastpath
    vs. (b) master + lazy-vxid + [2]sinval-hasmessages.  The only claimed benefit of
    [2] over [1], as far as I can see, is invulnerability to the hazard described in
    [3].  Before selecting [2] over [1], it would be instructive to know how much
    performance we exchange for its benefit.

    I did a bit of (relatively unrigorous) poking at this workload with oprofile,
    and I never saw SIInsertDataEntries take more than 0.26% of CPU time.  It's
    perhaps too cheap relative to the workload's other costs to matter.  That's good
    enough for me.
    Me, too. There's another possible benefit, though: in
    sinval-fastpath, everybody's got to read maxMsgNum every time through;
    whereas in sinval-hasmessages, everyone's reading their own flag. I'm
    not sure how much it costs to have memory that gets read by multiple
    readers rather than just a single reader, but I think I've seen some
    things that indicate it might not always be free.
    Interesting.  I had hypothesized that at two clients per core, nearly every call
    to SIGetDataEntries() would find work to do, making the patch a strict loss.  A
    bit of instrumented testing showed otherwise: at two clients per core,
    sinval-hasmessages optimized away 93% of the calls.  At fifty clients per core,
    it still optimized away more than half of the calls.
    Wow. That's better than I expected. Great idea for a test, too.

    Unless someone has another concern here, I think we've got this one nailed.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Tom Lane at Jul 26, 2011 at 7:40 pm

    Noah Misch writes:
    On Thu, Jul 21, 2011 at 11:37:27PM -0400, Robert Haas wrote:
    I think I have a simpler idea, though:
    before acquiring any locks, just have SIGetDataEntries() do this:

    + if (stateP->nextMsgNum == segP->maxMsgNum && !stateP->resetState)
    + return 0;

    Patch (with comment explaining why I think this is OK) attached. If
    the message numbers happen to be equal only because the counter has
    wrapped, then stateP->resetState will be true, so we'll still realize
    we need to do some work.
    This is attractive, and I don't see any problems with it. (In theory, you could
    hit a case where the load of resetState gives an ancient "false" just as the
    counters wrap to match. Given that the wrap interval is 1000000x as long as the
    reset interval, I'm not worried about problems on actual silicon.)
    +1 for doing this and moving on.
    After some further reflection I believe this patch actually is pretty
    safe, although Noah's explanation of why seems a bit confused. The key
    points are that (1) each of the reads is atomic, and (2) we should not
    be able to see a value that is older than our last acquisition/release
    of a shared memory lock. These being granted, we will make a decision
    that is correct, or at least was correct as of some time after that last
    lock action. As stated in the patch comments, we are not required to
    promptly assimilate sinval actions on objects we don't hold any lock on,
    so this seems sufficient. In essence, we are relying on an assumed
    prior visit to the lock manager to provide memory access synchronization
    for the sinval no-work-to-do test.

    The case Noah speculates about above amounts to supposing that this
    reasoning fails to hold for the resetState value, and I don't see why
    that would be true. Even if the above scenario did manage to happen,
    it would not mean that we missed the reset, just that we hadn't noticed
    it *yet*. And by hypothesis, none of the as-yet-not-seen catalog
    updates are a problem for us.

    BTW, the patch really ought to add something to the comment around
    line 100.

    regards, tom lane
  • Noah Misch at Jul 26, 2011 at 8:17 pm

    On Tue, Jul 26, 2011 at 03:40:32PM -0400, Tom Lane wrote:
    Noah Misch <noah@2ndQuadrant.com> writes:
    On Thu, Jul 21, 2011 at 11:37:27PM -0400, Robert Haas wrote:
    I think I have a simpler idea, though:
    before acquiring any locks, just have SIGetDataEntries() do this:

    + if (stateP->nextMsgNum == segP->maxMsgNum && !stateP->resetState)
    + return 0;

    Patch (with comment explaining why I think this is OK) attached. If
    the message numbers happen to be equal only because the counter has
    wrapped, then stateP->resetState will be true, so we'll still realize
    we need to do some work.
    This is attractive, and I don't see any problems with it. (In theory, you could
    hit a case where the load of resetState gives an ancient "false" just as the
    counters wrap to match. Given that the wrap interval is 1000000x as long as the
    reset interval, I'm not worried about problems on actual silicon.)
    +1 for doing this and moving on.
    After some further reflection I believe this patch actually is pretty
    safe, although Noah's explanation of why seems a bit confused. The key
    points are that (1) each of the reads is atomic, and (2) we should not
    be able to see a value that is older than our last acquisition/release
    of a shared memory lock. These being granted, we will make a decision
    that is correct, or at least was correct as of some time after that last
    lock action. As stated in the patch comments, we are not required to
    promptly assimilate sinval actions on objects we don't hold any lock on,
    so this seems sufficient. In essence, we are relying on an assumed
    prior visit to the lock manager to provide memory access synchronization
    for the sinval no-work-to-do test.

    The case Noah speculates about above amounts to supposing that this
    reasoning fails to hold for the resetState value, and I don't see why
    that would be true. Even if the above scenario did manage to happen,
    it would not mean that we missed the reset, just that we hadn't noticed
    it *yet*. And by hypothesis, none of the as-yet-not-seen catalog
    updates are a problem for us.
    Here's the way it can fail:

    1. Backend enters SIGetDataEntries() with main memory bearing stateP->resetState
    = false, stateP->nextMsgNum = 500, segP->maxMsgNum = 505. The CPU has those
    latest stateP values in cache, but segP->maxMsgNum is *not* in cache.
    2. Backend stalls for <long time>. Meanwhile, other backends issue
    MSGNUMWRAPAROUND - 5 invalidation messages. Main memory bears
    stateP->resetState = true, stateP->nextMsgNum = 500 - MSGNUMWRAPAROUND,
    segP->maxMsgNum = 500.
    3. Backend wakes up, uses its cached stateP values and reads segP->maxMsgNum =
    500 from main memory. The patch's test finds no need to reset or process
    invalidation messages.

    That's the theoretical risk I wished to illustrate. Though this appears
    possible on an abstract x86_64 system, I think it's unrealistic to suppose that
    a dirty cache line could persist *throughout* the issuance of more than 10^9
    invalidation messages on a concrete implementation.

    A way to think about the problem is that our read of segP->maxMsgNum is too new.
    If we had used a snapshot of values as of the most recent lock
    acquisition/release, there would be no problem. It's the mix of a new-enough
    stateP with an all-too-new segP that yields the anomaly.

    --
    Noah Misch http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training & Services

Discussion Overview
group: pgsql-hackers @ postgresql
posted: Jul 21, '11 at 1:46a
active: Jul 29, '11 at 4:37a
posts: 56
users: 9
website: postgresql.org...
irc: #postgresql