I have thought of a few new TODO performance items:

1) Someone at O'Reilly suggested that we order our duplicate index
entries by tid so if we are hitting the heap for lots of duplicates, the
hits will be on sequential pages. Seems like a nice idea.

2) After Tatsuo's report of running 1000 backends on pgbench, and a
Solaris report, I think we should have a queue of backends waiting for a
spinlock, rather than having them sleep and try again. This is
particularly important for multi-processor machines.

3) I am reading the Solaris Internals book and there is mention of a
"free behind" capability with large sequential scans. When a large
sequential scan happens that would wipe out all the old cache entries,
the kernel detects this and places its previous pages first on the free
list. For our code, if we do a sequential scan of a table that is
larger than our buffer cache size, I think we should detect this and do
the same. See http://techdocs.postgresql.org for my performance paper
for an example.
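
A rough sketch of what that check might look like, assuming a hypothetical
buffer manager with an LRU free list (BufferDesc, FreeListPutHead, and
FreeListPutTail are invented names, not our actual buffer code):

    /* Illustration only: "free behind" for scans bigger than the cache. */
    typedef struct BufferDesc BufferDesc;

    extern void FreeListPutHead(BufferDesc *buf);   /* evict this buffer first */
    extern void FreeListPutTail(BufferDesc *buf);   /* normal LRU behavior */

    void
    ReleaseScanBuffer(BufferDesc *buf, long scan_pages, long cache_pages)
    {
        if (scan_pages > cache_pages)
            FreeListPutHead(buf);       /* big scan: recycle its pages right away */
        else
            FreeListPutTail(buf);       /* small scan: cache pages as usual */
    }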

New TODO entries are:

* Order duplicate index entries by tid
* Add queue of backends waiting for spinlock
* Add free-behind capability for large sequential scans

I will modify them based on any comments people have.

--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026

  • Mikheev, Vadim at Jul 30, 2001 at 5:12 pm

    New TODO entries are:

    * Order duplicate index entries by tid
    In other words - add tid to index key: very old idea.
    * Add queue of backends waiting for spinlock
    We shouldn't mix two different approaches for different
    kinds of short-lived internal locks - in some cases we need a
    light lmgr (when we're going to hold the lock long enough, e.g. for IO),
    and in other cases we'd be better off with POSIX mutexes
    or semaphores instead of spinlocks. Queueing backends waiting
    for a spinlock sounds like nonsense - how are you going to protect
    such a queue? With spinlocks? -:)

    Vadim
  • Bruce Momjian at Jul 30, 2001 at 5:15 pm

    New TODO entries are:

    * Order duplicate index entries by tid
    In other words - add tid to index key: very old idea.
    I was thinking during index creation, it would be nice to order them by
    tid, but not do lots of work to keep it that way.
    * Add queue of backends waiting for spinlock
    We shouldn't mix two different approaches for different
    kinds of short-lived internal locks - in some cases we need a
    light lmgr (when we're going to hold the lock long enough, e.g. for IO),
    and in other cases we'd be better off with POSIX mutexes
    or semaphores instead of spinlocks. Queueing backends waiting
    for a spinlock sounds like nonsense - how are you going to protect
    such a queue? With spinlocks? -:)
    Yes, I guess so but hopefully we can spin waiting for the queue lock
    rather than sleep. We could use POSIX spinlocks/semaphores now but we
    don't because of performance, right?

    Should we be spinning waiting for spinlock on multi-cpu machines? Is
    that the answer?

    --
    Bruce Momjian | http://candle.pha.pa.us
    pgman@candle.pha.pa.us | (610) 853-3000
    + If your life is a hard drive, | 830 Blythe Avenue
    + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
  • Tom Lane at Jul 30, 2001 at 7:29 pm

    Bruce Momjian writes:
    Should we be spinning waiting for spinlock on multi-cpu machines? Is
    that the answer?
    A multi-CPU machine is actually the only case where a true spinlock
    *does* make sense. On a single CPU you might as well yield the CPU
    immediately, because you have no chance of getting the lock until the
    current holder is allowed to run again. On a multi-CPU machine it's a
    reasonable guess that the current holder is running on one of the other
    CPUs and may release the lock soon ("soon" == less than a process
    dispatch cycle, hence busy-wait is better than release CPU).

    We are currently using spinlocks for a lot of situations where the mean
    time spent holding the lock is probably larger than "soon" as defined
    above. We should have a different lock implementation for those cases.
    True spinlocks should be reserved for protecting code where the maximum
    time spent holding the lock is guaranteed to be very short.
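
    A bare-bones illustration of that rule, using GCC atomic builtins and
    sched_yield() rather than our actual s_lock code (spinlock_t and
    SPINS_BEFORE_YIELD are invented for the example):

        #include <sched.h>              /* sched_yield() */
        #include <stdbool.h>

        #define SPINS_BEFORE_YIELD 100  /* arbitrary; only useful on SMP */

        typedef volatile char spinlock_t;

        void
        acquire_lock(spinlock_t *lock, bool multi_cpu)
        {
            int spins = 0;

            while (__atomic_test_and_set(lock, __ATOMIC_ACQUIRE))
            {
                /* Single CPU: the holder can't run while we spin, so give up
                 * the CPU at once.  Multi CPU: busy-wait briefly in case the
                 * holder (on another CPU) releases the lock "soon". */
                if (!multi_cpu || ++spins > SPINS_BEFORE_YIELD)
                {
                    sched_yield();
                    spins = 0;
                }
            }
        }

        void
        release_lock(spinlock_t *lock)
        {
            __atomic_clear(lock, __ATOMIC_RELEASE);
        }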

    regards, tom lane
  • Mikheev, Vadim at Jul 30, 2001 at 5:45 pm

    * Order duplicate index entries by tid
    In other words - add tid to index key: very old idea.
    I was thinking during index creation, it would be nice to
    order them by tid, but not do lots of work to keep it that way.
    I hear this "not do lots of work" so often from you -:)
    Days of simplicity are gone, Bruce. To continue, this project
    requires more and more complex solutions.
    * Add queue of backends waiting for spinlock
    We shouldn't mix two different approaches for different
    kinds of short-lived internal locks - in some cases we need a
    light lmgr (when we're going to hold the lock long enough, e.g. for IO),
    and in other cases we'd be better off with POSIX mutexes
    or semaphores instead of spinlocks. Queueing backends waiting
    for a spinlock sounds like nonsense - how are you going to protect
    such a queue? With spinlocks? -:)
    Yes, I guess so but hopefully we can spin waiting for the queue lock
    rather than sleep. We could use POSIX spinlocks/semaphores now but we
    don't because of performance, right?
    No. As long as no one has proved with a test that mutexes are bad for
    performance...
    Funny, such a test would require ~ 1 day of work.
    Should we be spinning waiting for spinlock on multi-cpu machines? Is
    that the answer?
    What do you mean?

    Vadim
  • Bruce Momjian at Jul 30, 2001 at 5:58 pm

    * Order duplicate index entries by tid
    In other words - add tid to index key: very old idea.
    I was thinking during index creation, it would be nice to
    order them by tid, but not do lots of work to keep it that way.
    I hear this "not do lots of work" so often from you -:)
    Days of simplicity are gone, Bruce. To continue, this project
    requires more and more complex solutions.
    Yep. I can dream. :-)
    * Add queue of backends waiting for spinlock
    We shouldn't mix two different approaches for different
    kinds of short-lived internal locks - in some cases we need a
    light lmgr (when we're going to hold the lock long enough, e.g. for IO),
    and in other cases we'd be better off with POSIX mutexes
    or semaphores instead of spinlocks. Queueing backends waiting
    for a spinlock sounds like nonsense - how are you going to protect
    such a queue? With spinlocks? -:)
    Yes, I guess so but hopefully we can spin waiting for the queue lock
    rather than sleep. We could use POSIX spinlocks/semaphores now but we
    don't because of performance, right?
    No. As long as no one has proved with a test that mutexes are bad for
    performance...
    Funny, such a test would require ~ 1 day of work.
    Good question. I know the number of function calls to spinlock stuff is
    huge. Seems real semaphores may be a big win on multi-cpu boxes.
    Should we be spinning waiting for spinlock on multi-cpu machines? Is
    that the answer?
    What do you mean?
    On a single-cpu machine, if you can't get the spinlock, you should just
    sleep and let the process who had it continue. On a multi-cpu machine,
    you perhaps should just keep trying, hoping that the process who holds
    it finishes soon. I wonder if we should find some kind of test at
    postmaster startup that would check whether we have multiple CPUs and
    change from spinlock sleeping to spinlock spin-waiting.

    Add to this the fact that we can't sleep for less than one clock tick on
    most machines, 10ms, and the spinlock stuff needs work.
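
    The startup test itself could be as simple as this; sysconf's
    _SC_NPROCESSORS_ONLN is not available everywhere, so treat it as a
    sketch for the platforms that do have it:

        #include <stdio.h>
        #include <unistd.h>

        int
        main(void)
        {
        #ifdef _SC_NPROCESSORS_ONLN
            long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
        #else
            long ncpus = 1;             /* can't tell: assume uniprocessor */
        #endif

            if (ncpus > 1)
                printf("%ld CPUs: spin briefly before sleeping on a busy lock\n",
                       ncpus);
            else
                printf("1 CPU: yield immediately when a spinlock is busy\n");
            return 0;
        }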

    TODO updated:

    * Improve spinlock code, perhaps with OS semaphores, sleeper queue, or
    spinning to obtain lock on multi-cpu systems

    --
    Bruce Momjian | http://candle.pha.pa.us
    pgman@candle.pha.pa.us | (610) 853-3000
    + If your life is a hard drive, | 830 Blythe Avenue
    + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
  • Matthew Kirkwood at Jul 31, 2001 at 9:12 am

    On Mon, 30 Jul 2001, Bruce Momjian wrote:

    * Improve spinlock code, perhaps with OS semaphores, sleeper queue, or
    spinning to obtain lock on multi-cpu systems
    You may be interested in a discussion which happened over on
    linux-kernel a few months ago.

    Quite a lot of people want a lightweight userspace semaphore,
    and for pretty much the same reasons.

    Linus proposed a pretty interesting solution which has the
    same minimal overhead as the current spinlocks in the non-
    contention case, but avoids the spin where there's contention:

    http://www.mail-archive.com/linux-kernel%40vger.kernel.org/msg39615.html
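
    The idea - a cheap atomic fast path in user space, with a kernel sleep
    only under contention - is roughly what Linux later shipped as futexes.
    A sketch of that split, using today's futex() system call rather than
    anything from the linked proposal:

        #include <stdint.h>
        #include <unistd.h>
        #include <sys/syscall.h>
        #include <linux/futex.h>

        static long
        futex(uint32_t *uaddr, int op, uint32_t val)
        {
            return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
        }

        /* lock word: 0 = free, 1 = taken */
        void
        lock(uint32_t *word)
        {
            uint32_t expected = 0;

            /* uncontended path: one atomic compare-and-swap, no system call */
            while (!__atomic_compare_exchange_n(word, &expected, 1, 0,
                                                __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
            {
                /* contended path: sleep in the kernel instead of spinning */
                futex(word, FUTEX_WAIT, 1);
                expected = 0;
            }
        }

        void
        unlock(uint32_t *word)
        {
            __atomic_store_n(word, 0, __ATOMIC_RELEASE);
            /* A real implementation tracks waiters so this wake-up call can
             * be skipped when nobody is sleeping. */
            futex(word, FUTEX_WAKE, 1);
        }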

    Matthew.
  • Bruce Momjian at Jul 31, 2001 at 1:19 pm

    On Mon, 30 Jul 2001, Bruce Momjian wrote:
    * Improve spinlock code, perhaps with OS semaphores, sleeper queue, or
    spinning to obtain lock on multi-cpu systems
    You may be interested in a discussion which happened over on
    linux-kernel a few months ago.

    Quite a lot of people want a lightweight userspace semaphore,
    and for pretty much the same reasons.

    Linus proposed a pretty interesting solution which has the
    same minimal overhead as the current spinlocks in the non-
    contention case, but avoids the spin where there's contention:

    http://www.mail-archive.com/linux-kernel%40vger.kernel.org/msg39615.html
    Yes, many OS's have user-space spinlocks, for the same performance
    reasons (no kernel call).

    --
    Bruce Momjian | http://candle.pha.pa.us
    pgman@candle.pha.pa.us | (610) 853-3000
    + If your life is a hard drive, | 830 Blythe Avenue
    + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
  • Mikheev, Vadim at Jul 30, 2001 at 6:17 pm

    We could use POSIX spinlocks/semaphores now but we
    don't because of performance, right?
    No. As long as no one has proved with a test that mutexes are bad for
    performance...
    Funny, such a test would require ~ 1 day of work.
    Good question. I know the number of function calls to spinlock stuff
    is huge. Seems real semaphores may be a big win on multi-cpu boxes.
    Ok, being tired of endless discussions I'll try to use mutexes instead
    of spinlocks and run pgbench on my Solaris WS 10 and E4500 (4 CPU) boxes.
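
    A much smaller test than pgbench, but in the same spirit: time a few
    threads hammering a POSIX mutex versus a plain test-and-set spinlock.
    Numbers from a toy loop like this prove little about real backends;
    it only shows how the comparison could be set up:

        #include <pthread.h>
        #include <stdio.h>
        #include <time.h>

        #define NTHREADS 4
        #define ITERS    1000000L

        static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
        static volatile char spin;      /* 0 = free, 1 = held */
        static long counter;
        static int use_mutex;

        static void *
        worker(void *arg)
        {
            (void) arg;
            for (long i = 0; i < ITERS; i++)
            {
                if (use_mutex)
                    pthread_mutex_lock(&mutex);
                else
                    while (__atomic_test_and_set(&spin, __ATOMIC_ACQUIRE))
                        ;               /* busy-wait */

                counter++;              /* the protected "work" */

                if (use_mutex)
                    pthread_mutex_unlock(&mutex);
                else
                    __atomic_clear(&spin, __ATOMIC_RELEASE);
            }
            return NULL;
        }

        static double
        run(int mutex_mode)
        {
            pthread_t threads[NTHREADS];
            struct timespec t0, t1;

            use_mutex = mutex_mode;
            counter = 0;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (int i = 0; i < NTHREADS; i++)
                pthread_create(&threads[i], NULL, worker, NULL);
            for (int i = 0; i < NTHREADS; i++)
                pthread_join(threads[i], NULL);
            clock_gettime(CLOCK_MONOTONIC, &t1);
            return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        }

        int
        main(void)
        {
            printf("spinlock: %.2f s\n", run(0));
            printf("mutex:    %.2f s\n", run(1));
            return 0;
        }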

    Vadim
  • Bruce Momjian at Sep 6, 2001 at 4:40 am

    We could use POSIX spinlocks/semaphores now but we
    don't because of performance, right?
    No. As long as no one has proved with a test that mutexes are bad for
    performance...
    Funny, such a test would require ~ 1 day of work.
    Good question. I know the number of function calls to spinlock stuff
    is huge. Seems real semaphores may be a big win on multi-cpu boxes.
    Ok, being tired of endless discussions I'll try to use mutexes instead
    of spinlocks and run pgbench on my Solaris WS 10 and E4500 (4 CPU) boxes.
    I have updated the TODO list with:

    * Improve spinlock code
    o use SysV semaphores or queue of backends waiting on the lock
    o wakeup sleeper or sleep for less than one clock tick
    o spin for lock on multi-cpu machines, yield on single cpu machines
    o read/write locks

    --
    Bruce Momjian | http://candle.pha.pa.us
    pgman@candle.pha.pa.us | (610) 853-3000
    + If your life is a hard drive, | 830 Blythe Avenue
    + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
  • Tom Lane at Jul 30, 2001 at 7:24 pm

    Bruce Momjian writes:
    New TODO entries are:
    * Add queue of backends waiting for spinlock
    I already see:

    * Create spinlock sleepers queue so everyone doesn't wake up at once


    BTW, I agree with Vadim's opinion that we should add a new type of lock
    (intermediate between spinlocks and lockmanager locks) rather than try
    to add new semantics onto spinlocks. For example, it'd be very nice to
    distinguish read-only and read-write access in this new kind of lock,
    but we can't expect spinlocks to be able to do that.

    regards, tom lane
  • Bruce Momjian at Jul 30, 2001 at 9:10 pm

    Bruce Momjian writes:
    New TODO entries are:
    * Add queue of backends waiting for spinlock
    I already see:

    * Create spinlock sleepers queue so everyone doesn't wake up at once
    That is an old copy of the TODO. I reworded it. You will only see this
    now:

    * Improve spinlock code, perhaps with OS semaphores, sleeper queue, or
    spinning to obtain lock on multi-cpu systems
    BTW, I agree with Vadim's opinion that we should add a new type of lock
    (intermediate between spinlocks and lockmanager locks) rather than try
    to add new semantics onto spinlocks. For example, it'd be very nice to
    distinguish read-only and read-write access in this new kind of lock,
    but we can't expect spinlocks to be able to do that.
    Yes, I agree too. On a uniprocessor machine, if I can't get the
    spinlock, I want to yield the cpu, ideally for less than 10ms. On a
    multi-cpu machine, if the lock is held by another CPU and that process
    is running, we want to spin waiting for the lock to free. If not, we
    can sleep. We basically need some more sophisticated semantics around
    these locks, or move to OS semaphores.

    In fact, can't we sleep on an interruptible system call, and send
    signals to processes when we release the lock? That way we don't have
    the 10ms sleep limitation. One idea is to have a byte for each backend
    in shared memory and have each backend set its byte if it is waiting for
    the semaphore. There would be no contention with multiple backends
    registering their sleep at the same time.
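
    Very roughly, something like this (all names invented for illustration;
    installing the SIGUSR1 handler, setting up the signal mask, and the race
    between checking the lock and setting the flag are left out):

        #include <signal.h>
        #include <sys/types.h>
        #include <unistd.h>

        #define MAX_BACKENDS 128

        typedef struct
        {
            volatile char waiting[MAX_BACKENDS];    /* one byte per backend */
            pid_t         pid[MAX_BACKENDS];        /* backend process ids  */
        } WaitTable;

        static void
        wait_for_lock(WaitTable *wt, int my_slot, const sigset_t *wake_mask)
        {
            wt->waiting[my_slot] = 1;   /* only we write our own byte: no contention */
            sigsuspend(wake_mask);      /* interruptible sleep, no 10ms tick limit */
            wt->waiting[my_slot] = 0;
        }

        static void
        release_lock(WaitTable *wt)
        {
            for (int i = 0; i < MAX_BACKENDS; i++)
                if (wt->waiting[i])
                    kill(wt->pid[i], SIGUSR1);      /* wake the sleeper */
        }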

    We have seen reports of 4-cpu systems having poor performance while the
    system is only 12% busy, perhaps because the processes are all sleeping
    waiting for the next tick.

    I think multi-cpu machines are going to give us new challenges.

    --
    Bruce Momjian | http://candle.pha.pa.us
    pgman@candle.pha.pa.us | (610) 853-3000
    + If your life is a hard drive, | 830 Blythe Avenue
    + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
  • Darren King at Jul 30, 2001 at 7:35 pm

    3) I am reading the Solaris Internals book and there is mention of a
    "free behind" capability with large sequential scans. When a large
    sequential scan happens that would wipe out all the old cache entries,
    the kernel detects this and places its previous pages first
    on the free list. For our code, if we do a sequential scan of a table
    that is larger than our buffer cache size, I think we should detect
    this and do the same. See http://techdocs.postgresql.org for my
    performance paper for an example.

    New TODO entries are:

    * Order duplicate index entries by tid
    * Add queue of backends waiting for spinlock
    * Add free-behind capability for large sequential scans
    So why do we cache sequentially-read pages? Or at least, why not have an
    option to control it?

    Oracle (to the best of my knowledge) does NOT cache pages read by a
    sequential index scan for at least two reasons/assumptions (two being
    all that I can recall):

    1. Caching pages for sequential scans over sufficiently large tables
    will just cycle the cache. The pages that will be cached at the end of
    the query will be the last N pages of the table, so when the same
    sequential query is run again, the scan from the beginning of the table
    will start flushing the oldest cached pages which are more than likely
    going to be the ones that will be needed at the end of the scan, etc,
    etc. In a multi-user environment, the effect is worse.

    2. Concurrent or consecutive queries in a dynamic database will not
    generate plans that use the same sequential scans, so they will tend to
    thrash the cache.

    Now there are some databases where the same general queries are run time
    after time and caching the pages from sequential scans does make sense,
    but in larger, enterprise-type systems, indices are created to help
    speed up the most used queries and the sequential cache entries only
    serve to clutter the cache and flush the useful pages.

    Is there any way that caching pages read in by a sequential scan could
    be made a configurable option?

    Any chance someone could run pgbench on a test system set up to not
    cache sequential reads?

    Darren
  • Bruce Momjian at Jul 30, 2001 at 9:14 pm

    So why do we cache sequentially-read pages? Or at least, why not have an
    option to control it?

    Oracle (to the best of my knowledge) does NOT cache pages read by a
    sequential index scan for at least two reasons/assumptions (two being
    all that I can recall):

    1. Caching pages for sequential scans over sufficiently large tables
    will just cycle the cache. The pages that will be cached at the end of
    the query will be the last N pages of the table, so when the same
    sequential query is run again, the scan from the beginning of the table
    will start flushing the oldest cached pages which are more than likely
    going to be the ones that will be needed at the end of the scan, etc,
    etc. In a multi-user environment, the effect is worse.

    2. Concurrent or consecutive queries in a dynamic database will not
    generate plans that use the same sequential scans, so they will tend to
    thrash the cache.

    Now there are some databases where the same general queries are run time
    after time and caching the pages from sequential scans does make sense,
    but in larger, enterprise-type systems, indices are created to help
    speed up the most used queries and the sequential cache entries only
    serve to clutter the cache and flush the useful pages.

    Is there any way that caching pages read in by a sequential scan could
    be made a configurable option?

    Any chance someone could run pgbench on a test system set up to not
    cache sequential reads?
    Yep, that is the issue. If the whole table fits in the cache, it is
    great. If not, it is useless or worse because it forces out other
    pages. Right now the cache is oldest-out and doesn't know anything
    about access patterns. We would need to get that info passed into the
    cache, probably via some function parameter.
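
    For example, a hint argument threaded down to the buffer code, in the
    same spirit as the free-behind item above (every name below is made up
    for illustration, not our buffer-manager API):

        typedef enum
        {
            ACCESS_NORMAL,              /* random access: cache as usual        */
            ACCESS_SEQSCAN_LARGE        /* big sequential scan: recycle buffers */
        } AccessPattern;

        typedef struct Relation Relation;
        typedef int Buffer;
        typedef unsigned int BlockNumber;

        extern Buffer ReadBufferInternal(Relation *rel, BlockNumber blkno);
        extern void   MarkBufferRecycleFirst(Buffer buf);

        Buffer
        ReadBufferWithHint(Relation *rel, BlockNumber blkno, AccessPattern hint)
        {
            Buffer      buf = ReadBufferInternal(rel, blkno);

            if (hint == ACCESS_SEQSCAN_LARGE)
                MarkBufferRecycleFirst(buf);    /* put it at the head of the free list */

            return buf;
        }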

    --
    Bruce Momjian | http://candle.pha.pa.us
    pgman@candle.pha.pa.us | (610) 853-3000
    + If your life is a hard drive, | 830 Blythe Avenue
    + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
  • Tom Lane at Jul 30, 2001 at 9:53 pm

    Bruce Momjian writes:
    I have thought of a few new TODO performance items:
    1) Someone at O'Reilly suggested that we order our duplicate index
    entries by tid so if we are hitting the heap for lots of duplicates, the
    hits will be on sequential pages. Seems like a nice idea.
    A more general solution is for indexscan to collect up a bunch of TIDs
    from the index, sort them in-memory by TID order, and then probe into
    the heap with those TIDs. This is better than the above because you get
    nice ordering of the heap accesses across multiple key values, not just
    among the tuples with the same key. (In a unique or near-unique index,
    the above idea is nearly worthless.)

    In the best case there are few enough TIDs retrieved from the index that
    you can just do this once, but even if there are lots of TIDs, it should
    be a win to do this in batches of a few thousand TIDs. Essentially we
    decouple indexscans into separate index-access and heap-access phases.
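
    A sketch of the batch step, with a simplified TID layout and plain
    qsort(); it only shows the shape of the idea, not backend code:

        #include <stdint.h>
        #include <stdlib.h>

        typedef struct
        {
            uint32_t block;             /* heap block number   */
            uint16_t offset;            /* line pointer offset */
        } TidItem;

        static int
        tid_cmp(const void *a, const void *b)
        {
            const TidItem *ta = a;
            const TidItem *tb = b;

            if (ta->block != tb->block)
                return (ta->block < tb->block) ? -1 : 1;
            return (int) ta->offset - (int) tb->offset;
        }

        /* Sort one batch of TIDs and visit the heap in block order, so that
         * duplicate (and neighbouring) keys hit the same pages together. */
        void
        fetch_batch(TidItem *tids, size_t ntids,
                    void (*fetch_heap_tuple)(uint32_t block, uint16_t offset))
        {
            qsort(tids, ntids, sizeof(TidItem), tid_cmp);
            for (size_t i = 0; i < ntids; i++)
                fetch_heap_tuple(tids[i].block, tids[i].offset);
        }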

    One big problem is that this doesn't interact well with concurrent VACUUM:
    our present solution for concurrent VACUUM assumes that indexscans hold
    a pin on an index page until they've finished fetching the pointed-to
    heap tuples. Another objection is that we'd have a harder time
    implementing the TODO item of marking an indextuple dead when its
    associated heaptuple is dead. Anyone see a way around these problems?

    regards, tom lane
  • Bruce Momjian at Jul 30, 2001 at 10:11 pm

    A more general solution is for indexscan to collect up a bunch of TIDs
    from the index, sort them in-memory by TID order, and then probe into
    the heap with those TIDs. This is better than the above because you get
    nice ordering of the heap accesses across multiple key values, not just
    among the tuples with the same key. (In a unique or near-unique index,
    the above idea is nearly worthless.)

    In the best case there are few enough TIDs retrieved from the index that
    you can just do this once, but even if there are lots of TIDs, it should
    be a win to do this in batches of a few thousand TIDs. Essentially we
    decouple indexscans into separate index-access and heap-access phases.

    One big problem is that this doesn't interact well with concurrent VACUUM:
    our present solution for concurrent VACUUM assumes that indexscans hold
    a pin on an index page until they've finished fetching the pointed-to
    heap tuples. Another objection is that we'd have a harder time
    implementing the TODO item of marking an indextuple dead when its
    associated heaptuple is dead. Anyone see a way around these problems?
    Interesting. I figured the cache could keep most pages in such a case.
    I was thinking more of helping file system readahead by requesting the
    earlier block first in a multi-block request. Not sure how valuable that
    would be.

    --
    Bruce Momjian | http://candle.pha.pa.us
    pgman@candle.pha.pa.us | (610) 853-3000
    + If your life is a hard drive, | 830 Blythe Avenue
    + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
