Here's my roadmap for the "scan-resistant buffer cache" and
"synchronized scans" patches.

1. Fix the current vacuum behavior of throwing dirty buffers to the
freelist, forcing a lot of WAL flushes. Instead, use a backend-private
ring of shared buffers that are recycled. This is what Simon's
"scan-resistant buffer manager" did.

The theory here is that if a page is read in by vacuum, it's unlikely to
be accessed in the near future, therefore it should be recycled. If
vacuum doesn't dirty the page, it's best to reuse the buffer immediately
for the next page. However, if the buffer is dirty (and not just because
we set hint bits), we ought to delay writing it to disk until the
corresponding WAL record has been flushed to disk.

Simon's patch used a fixed size ring of buffers that are recycled, but I
think the ring should be dynamically sized. Start with a small ring, and
whenever you need to do a WAL flush to write a dirty buffer, increase
the ring size. On every full iteration through the ring, decrease its
size to trim down an unnecessarily large ring.
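
To make that concrete, here's a rough sketch of the resizing policy I have
in mind (the names and constants are invented for illustration; this is not
actual bufmgr code):

#define RING_MIN_SIZE 8
#define RING_MAX_SIZE 256

typedef struct
{
    int size;                       /* current number of slots in use */
    int next;                       /* next slot to recycle */
    int buffers[RING_MAX_SIZE];     /* buffer IDs owned by this ring */
} BufferRing;

/* Recycling the next slot forced a WAL flush: grow the ring, so dirty
 * buffers get more time for their WAL to be flushed by someone else
 * (commits, bgwriter) before we have to write them ourselves. */
static void
ring_saw_wal_flush(BufferRing *ring)
{
    if (ring->size * 2 <= RING_MAX_SIZE)
        ring->size *= 2;
}

/* Completed a full pass around the ring: trim it by one slot, so an
 * unnecessarily large ring shrinks back down over time. */
static void
ring_completed_cycle(BufferRing *ring)
{
    if (ring->size > RING_MIN_SIZE)
        ring->size--;
}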

This only alters the behavior of vacuums, and it's pretty safe to say it
won't get worse than what we have now. In the future, we can use the
buffer ring for seqscans as well; more on that in step 3.

2. Implement the list/table of last/ongoing seq scan positions. This is
Jeff's "synchronized scans" patch. When a seq scan starts on a table
larger than some threshold, it starts from where the previous seq scan
is currently, or where it ended. This will synchronize the scans so that
for two concurrent scans the total I/O is halved in the best case. There
should be no other effect on performance.

If you have a partitioned table, a union of multiple tables, or any
other plan where multiple seq scans are performed in arbitrary order,
this change won't change the order in which the partitions are scanned,
and therefore won't ensure that they are synchronized.


Now that we have both pieces of the puzzle in place, it's time to
consider what more we can do with them:


3A. To take advantage of the "cache trail" of a previous seq scan, scan
backwards from where the previous seq scan ended, until you hit a
buffer that's not in cache.
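
Roughly, the starting point could be chosen like this (ProbeBufferCache is
a hypothetical helper standing in for a cheap "is this block resident in
shared buffers" check; no such function exists today):

extern int ProbeBufferCache(int blkno);     /* hypothetical cache probe */

static int
choose_start_block(int prev_scan_end)
{
    int blkno = prev_scan_end;

    /* Walk backwards over the previous scan's cache trail. */
    while (blkno > 0 && ProbeBufferCache(blkno - 1))
        blkno--;

    /* Start the scan here: blocks from here up to prev_scan_end come
     * from cache, and the rest of the table is read from disk. */
    return blkno;
}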

This will allow taking advantage of the buffer cache even if the table
doesn't fit completely in RAM. That can make a big difference if the
table size is just slightly bigger than RAM, and can avoid the nasty
surprise when a table grows beyond RAM size and queries start taking
minutes instead of seconds.

This should be a non-controversial change on its own from a performance
point of view. No query should get slower, and some will become faster.
But see step 3B:

3B. Currently, a sequential scan on a large table spoils the buffer cache
by evicting other pages from it. In CVS HEAD, as soon as the
table is larger than shared_buffers, the pages in the buffer cache won't be
used to speed up running the same query again, and there's no reason to
believe the pages read in would be more useful than any other page in
the database, and in particular the pages that were in the buffer cache
before the huge seq scan. If the table being scanned is > 5 *
shared_buffers, the scan will evict every other page from the cache if
there's no other activity in the database (max usage_count is 5).

If the table is much larger than shared_buffers, say 10 times as large,
then even with the change 3A to read the pages that are in cache first,
using all of shared_buffers to cache the table will only speed up the query
by 10%. We should not spoil the cache for such a small gain, and should use
the local buffer ring strategy instead. It's better to make queries that
are slow anyway a little bit slower than to make queries that are normally
really fast slow.


As you may notice, 3A and 3B are at odds with each other. We can
implement both, but you can't use both strategies in the same scan.
Therefore we need to have decision logic of some kind to figure out
which strategy is optimal.

A simple heuristic is to decide based on the table size:

< 0.1 * shared_buffers -> start from page 0, keep in cache (like we do now)
< 5 * shared_buffers   -> start from last read page, keep in cache
> 5 * shared_buffers   -> start from last read page, use buffer ring

I'm not sure about the constants, we might need to make them GUC
variables as Simon argued, but that would be the general approach.
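
In code, the decision could look roughly like this (NBuffers is the size of
shared_buffers in pages; last_reported_position is a placeholder for
whatever interface step 2 ends up providing):

extern int NBuffers;                           /* shared_buffers, in pages */
extern int last_reported_position(int relid);  /* hypothetical, from step 2 */

static void
choose_scan_strategy(int relid, int table_blocks,
                     int *start_block, int *use_ring)
{
    if (table_blocks < 0.1 * NBuffers)
    {
        *start_block = 0;                             /* behave as we do today */
        *use_ring = 0;
    }
    else if (table_blocks < 5 * NBuffers)
    {
        *start_block = last_reported_position(relid); /* join the ongoing scan */
        *use_ring = 0;                                /* still worth caching */
    }
    else
    {
        *start_block = last_reported_position(relid);
        *use_ring = 1;                                /* don't spoil the cache */
    }
}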

In the future, I'm envisioning a smarter algorithm to size the local
buffer ring. Take all buffers with usage_count=0, plus a few with
usage_count=1 from the clock sweep. That way, if there are a lot of buffers
in the buffer cache that are seldom used, we'll use more buffers to
cache the large scan, and vice versa. And no matter how large the scan,
it wouldn't blow all the buffers out of the cache. But if you execute the
same query again, the buffers left in the cache from the last scan were
apparently useful, so we use a bigger ring this time.
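
Sketched in code, just to show the shape of it (GetUsageCount is a made-up
accessor; the real logic would live next to the clock sweep in freelist.c):

extern int NBuffers;
extern int GetUsageCount(int buf_id);      /* hypothetical accessor */

static int
estimate_ring_size(void)
{
    int zero = 0;
    int one = 0;
    int i;
    int size;

    for (i = 0; i < NBuffers; i++)
    {
        int usage = GetUsageCount(i);

        if (usage == 0)
            zero++;
        else if (usage == 1)
            one++;
    }

    /* All the usage_count=0 buffers, plus a few of the usage_count=1
     * ones, but never below a small fixed minimum. */
    size = zero + one / 8;
    return (size < 32) ? 32 : size;
}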

I'm going to do this incrementally, and we'll see how far we get for
8.3. We might push 3A and/or 3B to 8.4. First, I'm going to finish up
Simon's patch (step 1), run some performance tests with vacuum, and
submit a patch for that. Then I'll move to Jeff's patch (step 2).

Thoughts? Everyone happy with the roadmap?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


  • Luke Lonergan at May 8, 2007 at 11:47 am
    Heikki,

    On 3A: In practice, the popular modern OS'es (BSD/Linux/Solaris/etc)
    implement dynamic I/O caching. Experiments have shown that the benefit
    of re-using the PG buffer cache on large sequential scans is vanishingly
    small when the buffer cache size is small compared to the system memory.
    Since this is a normal and recommended situation (OS I/O cache is
    auto-tuning and easy to administer, etc), IMO the effort to optimize
    buffer cache reuse for seq scans > 1 x buffer cache size is not
    worthwhile.

    On 3B: The scenario described is "multiple readers seq scanning large
    table and sharing bufcache", but in practice this is not a common
    situation. The common situation is "multiple queries joining several
    small tables to one or more large tables that are >> 1 x bufcache". In
    the common scenario, the dominant factor is the ability to keep the
    small tables in bufcache (or I/O cache for that matter) while running
    the I/O bound large table scans as fast as possible.

    To that point - an important factor in achieving max I/O rate for large
    tables (> 1 x bufcache) is avoiding the pollution of the CPU L2 cache.
    This is commonly in the range of 512KB -> 2MB, which is only important
    when considering a bound on the size of the ring buffer. The effect has
    been demonstrated to be significant - in the 20%+ range. Another thing
    to consider is the use of readahead inside the heapscan, in which case
    sizes >= 32KB are very effective.

    We've implemented both ideas (ring buffer, readahead) and see very
    significant improvements in I/O and query speeds on DSS workloads. I
    would expect benefits on OLTP as well.

    The modifications you suggest here may not have the following
    properties:
    - don't pollute bufcache for seqscan of tables > 1 x bufcache
    - for tables > 1 x bufcache use a ringbuffer for I/O that is ~ 32KB to
    minimize L2 cache pollution

    - Luke
  • Heikki Linnakangas at May 8, 2007 at 12:29 pm

    Luke Lonergan wrote:
    On 3A: In practice, the popular modern OS'es (BSD/Linux/Solaris/etc)
    implement dynamic I/O caching. The experiments have shown that benefit
    of re-using PG buffer cache on large sequential scans is vanishingly
    small when the buffer cache size is small compared to the system memory.
    Since this is a normal and recommended situation (OS I/O cache is
    auto-tuning and easy to administer, etc), IMO the effort to optimize
    buffer cache reuse for seq scans > 1 x buffer cache size is not
    worthwhile.
    That's interesting. Care to share the results of the experiments you
    ran? I was thinking of running tests of my own with varying table sizes.

    The main motivation here is to avoid the sudden drop in performance when
    a table grows big enough not to fit in RAM. See attached diagram for
    what I mean. Maybe you're right and the effect isn't that bad in practice.

    I'm thinking of attacking 3B first anyway, because it seems much simpler
    to implement.
    On 3B: The scenario described is "multiple readers seq scanning large
    table and sharing bufcache", but in practice this is not a common
    situation. The common situation is "multiple queries joining several
    small tables to one or more large tables that are >> 1 x bufcache". In
    the common scenario, the dominant factor is the ability to keep the
    small tables in bufcache (or I/O cache for that matter) while running
    the I/O bound large table scans as fast as possible.
    How is that different from what I described?
    To that point - an important factor in achieving max I/O rate for large
    tables (> 1 x bufcache) is avoiding the pollution of the CPU L2 cache.
    This is commonly in the range of 512KB -> 2MB, which is only important
    when considering a bound on the size of the ring buffer. The effect has
    been demonstrated to be significant - in the 20%+ range. Another thing
    to consider is the use of readahead inside the heapscan, in which case
    sizes >= 32KB are very effective.
    Yeah I remember the discussion on the L2 cache a while back.

    What do you mean with using readahead inside the heapscan? Starting an
    async read request?
    The modifications you suggest here may not have the following
    properties:
    - don't pollute bufcache for seqscan of tables > 1 x bufcache
    - for tables > 1 x bufcache use a ringbuffer for I/O that is ~ 32KB to
    minimize L2 cache pollution
    So the difference is that you don't want 3A (the take advantage of pages
    already in buffer cache) strategy at all, and want the buffer ring
    strategy to kick in earlier instead. Am I reading you correctly?

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Luke Lonergan at May 8, 2007 at 12:44 pm
    Heikki,
    That's interesting. Care to share the results of the
    experiments you ran? I was thinking of running tests of my
    own with varying table sizes.
    Yah - it may take a while - you might get there faster.

    There are some interesting effects to look at between I/O cache
    performance and PG bufcache, and at those speeds the only tool I've
    found that actually measures scan rate in PG is VACUUM. "SELECT
    COUNT(*)" measures CPU consumption in the aggregation node, not scan
    rate.

    Note that the copy from I/O cache to PG bufcache is where the L2 effect
    is seen.
    The main motivation here is to avoid the sudden drop in
    performance when a table grows big enough not to fit in RAM.
    See attached diagram for what I mean. Maybe you're right and
    the effect isn't that bad in practice.
    There are going to be two performance drops, first when the table
    doesn't fit into PG bufcache, the second when it doesn't fit in bufcache
    + I/O cache. The second is severe, the first is almost insignificant
    (for common queries).
    How is that different from what I described?
    My impression of your descriptions is that they overvalue the case where
    there are multiple scanners of a large (> 1x bufcache) table such that
    they can share the "first load" of the bufcache, e.g. your 10% benefit
    for table = 10x bufcache argument. I think this is an uncommon
    workload; rather, there are normally many small tables and several large
    tables, such that sharing the PG bufcache is irrelevant to the query
    speed.
    Yeah I remember the discussion on the L2 cache a while back.

    What do you mean with using readahead inside the heapscan?
    Starting an async read request?
    Nope - just reading N buffers ahead for seqscans. Subsequent calls use
    previously read pages. The objective is to issue contiguous reads to
    the OS in sizes greater than the PG page size (which is much smaller
    than what is needed for fast sequential I/O).
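
    For illustration, something along these lines (not the actual Greenplum
    code; BLCKSZ and the 256KB request size are just the numbers discussed
    in this thread, and error handling is omitted):

    #include <sys/types.h>
    #include <unistd.h>

    #define BLCKSZ   8192
    #define NAHEAD   32                  /* 32 x 8K = 256KB per I/O request */

    static char ahead_buf[NAHEAD * BLCKSZ];
    static long ahead_first = -1;        /* first block currently buffered */

    /* Return a pointer to the requested block, issuing one large
     * contiguous read whenever we step past the prefetched range. */
    static char *
    read_block(int fd, long blkno)
    {
        if (ahead_first < 0 || blkno < ahead_first ||
            blkno >= ahead_first + NAHEAD)
        {
            pread(fd, ahead_buf, sizeof(ahead_buf), (off_t) blkno * BLCKSZ);
            ahead_first = blkno;
        }
        return ahead_buf + (blkno - ahead_first) * BLCKSZ;
    }
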
    The modifications you suggest here may not have the following
    properties:
    - don't pollute bufcache for seqscan of tables > 1 x bufcache
    - for tables > 1 x bufcache use a ringbuffer for I/O that
    is ~ 32KB to
    minimize L2 cache pollution
    So the difference is that you don't want 3A (the take
    advantage of pages already in buffer cache) strategy at all,
    and want the buffer ring strategy to kick in earlier instead.
    Am I reading you correctly?
    Yes, I think the ring buffer strategy should be used when the table size
    is > 1 x bufcache and the ring buffer should be of a fixed size smaller
    than L2 cache (32KB - 128KB seems to work well).

    - Luke
  • Heikki Linnakangas at May 8, 2007 at 12:57 pm

    Luke Lonergan wrote:
    What do you mean with using readahead inside the heapscan?
    Starting an async read request?
    Nope - just reading N buffers ahead for seqscans. Subsequent calls use
    previously read pages. The objective is to issue contiguous reads to
    the OS in sizes greater than the PG page size (which is much smaller
    than what is needed for fast sequential I/O).
    Are you filling multiple buffers in the buffer cache with a single
    read-call? The OS should be doing readahead for us anyway, so I don't
    see how just issuing multiple ReadBuffers one after each other helps.
    Yes, I think the ring buffer strategy should be used when the table size
    is > 1 x bufcache and the ring buffer should be of a fixed size smaller
    than L2 cache (32KB - 128KB seems to work well).
    I think we want to let the ring grow larger than that for updating
    transactions and vacuums, though, to avoid the WAL flush problem.

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Andreas Zeugswetter at May 8, 2007 at 3:13 pm

    What do you mean with using readahead inside the heapscan?
    Starting an async read request?
    Nope - just reading N buffers ahead for seqscans. Subsequent calls
    use previously read pages. The objective is to issue
    contiguous reads
    to the OS in sizes greater than the PG page size (which is much
    smaller than what is needed for fast sequential I/O).
    Are you filling multiple buffers in the buffer cache with a
    single read-call?
    yes, needs vector or ScatterGather IO.
    The OS should be doing readahead for us
    anyway, so I don't see how just issuing multiple ReadBuffers
    one after each other helps.
    Last time I looked, OS readahead was only comparable to 32k blocked
    reads. 256k blocked reads still perform way better. Also, when the OS is
    confronted with an IO storm, the 256k reads perform way better than OS
    readahead.

    Andreas
  • Gregory Stark at May 8, 2007 at 4:12 pm

    "Zeugswetter Andreas ADI SD" <ZeugswetterA@spardat.at> writes:

    Are you filling multiple buffers in the buffer cache with a
    single read-call?
    yes, needs vector or ScatterGather IO.
    I would expect that to get only moderate improvement. To get the full benefit
    I would think you would want to either fire off a separate thread to do the
    read-ahead, use libaio, or funnel the read-ahead requests to a separate thread
    like our bgwriter only it would be a bgreader or something like that.
    The OS should be doing readahead for us
    anyway, so I don't see how just issuing multiple ReadBuffers
    one after each other helps.
    Last time I looked OS readahead was only comparable to 32k blocked reads.
    256k blocked reads still perform way better. Also when the OS is confronted
    with an IO storm the 256k reads perform way better than OS readahead.
    Well that's going to depend on the OS. Last I checked Linux's readahead logic
    is pretty straightforward and doesn't try to do any better than 32k readahead
    and is easily fooled. However I wouldn't be surprised if that's changed.

    --
    Gregory Stark
    EnterpriseDB http://www.enterprisedb.com
  • Andreas Zeugswetter at May 9, 2007 at 8:48 am

    Are you filling multiple buffers in the buffer cache with a single
    read-call?
    yes, needs vector or ScatterGather IO.
    I would expect that to get only moderate improvement.
    The vast improvement comes from 256k blocksize.
    To get
    the full benefit I would think you would want to either fire
    off a separate thread to do the read-ahead, use libaio, or
    funnel the read-ahead requests to a separate thread like our
    bgwriter only it would be a bgreader or something like that.
    I like bgreader :-) But that looks even more difficult than grabbing 32
    [scattered or contiguous] buffers at once.
    Especially in a situation where there is no concurrent load, it would be
    nice to do CPU work while waiting for the next read-ahead IO. If there
    is enough parallel CPU load it is actually not so important. So I'd argue
    that on a high-load server you get nearly all of the benefit without any
    sort of aio.
    The OS should be doing readahead for us anyway, so I don't see how
    just issuing multiple ReadBuffers one after each other helps.
    Last time I looked OS readahead was only comparable to 32k blocked
    reads.
    256k blocked reads still perform way better. Also when the OS is
    confronted with an IO storm the 256k reads perform way better than
    OS readahead.
    Well that's going to depend on the OS. Last I checked Linux's
    readahead logic is pretty straightforward and doesn't try to
    do any better than 32k readahead and is easily fooled.
    However I wouldn't be surprised if that's changed.
    My test was on AIX, 32 or 64k seem quite common, at least as default
    setting.
    Also on some OS's (like HPUX) OS readahead and writebehind strategy
    changes with large IO blocksizes, imho beneficially.

    Andreas
  • CK Tan at May 10, 2007 at 3:52 am
    Hi,

    In reference to the seq scans roadmap, I have just submitted a patch
    that addresses some of the concerns.

    The patch does this:

    1. for small relations (smaller than 60% of the bufferpool), use the
    current logic
    2. for big relations (a rough sketch follows below):
    - use a ring buffer in the heap scan
    - pin the first 12 pages when the scan starts
    - every time 4 pages are consumed, read and pin the next 4 pages
    - invalidate used pages in the scan so they do not force out
    other useful pages
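
    A rough sketch of that pin-ahead scheme (illustrative only; the real
    patch works against the buffer manager API in bufmgr.c, not these
    made-up helpers):

    #define PIN_AHEAD_INIT   12
    #define PIN_AHEAD_BATCH   4

    static void
    scan_with_pin_ahead(int nblocks)
    {
        int next_to_pin = 0;
        int consumed = 0;
        int blkno;

        /* Pin the first 12 pages before the scan starts. */
        while (next_to_pin < PIN_AHEAD_INIT && next_to_pin < nblocks)
            pin_page(next_to_pin++);              /* hypothetical */

        for (blkno = 0; blkno < nblocks; blkno++)
        {
            process_page(blkno);                  /* hypothetical */
            invalidate_and_unpin(blkno);          /* don't evict other pages */

            /* After every 4 pages consumed, read and pin the next 4. */
            if (++consumed % PIN_AHEAD_BATCH == 0)
            {
                int i;

                for (i = 0; i < PIN_AHEAD_BATCH && next_to_pin < nblocks; i++)
                    pin_page(next_to_pin++);
            }
        }
    }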

    4 files changed:
    bufmgr.c, bufmgr.h, heapam.c, relscan.h

    If there is interest, I can submit another scan patch that returns
    N tuples at a time, instead of the current one-at-a-time interface. This
    improves code locality and further improves performance by another
    10-20%.

    For TPCH 1G tables, we are seeing more than 20% improvement in scans
    on the same hardware.

    -------------------------------------------------------------------------
    ----- PATCHED VERSION
    -------------------------------------------------------------------------
    gptest=# select count(*) from lineitem;
      count
    ---------
     6001215
    (1 row)

    Time: 2117.025 ms

    -------------------------------------------------------------------------
    ----- ORIGINAL CVS HEAD VERSION
    -------------------------------------------------------------------------
    gptest=# select count(*) from lineitem;
      count
    ---------
     6001215
    (1 row)

    Time: 2722.441 ms


    Suggestions for improvement are welcome.

    Regards,
    -cktan
    Greenplum, Inc.
  • Andreas Zeugswetter at May 10, 2007 at 10:14 am

    In reference to the seq scans roadmap, I have just submitted
    a patch that addresses some of the concerns.

    The patch does this:

    1. for small relation (smaller than 60% of bufferpool), use
    the current logic 2. for big relation:
    - use a ring buffer in heap scan
    - pin first 12 pages when scan starts
    - on consumption of every 4-page, read and pin the next 4-page
    - invalidate used pages of in the scan so they do not
    force out other useful pages
    A few comments regarding the effects:

    I do not see how this speedup could be caused by readahead, so what are
    the effects?
    (It should make no difference to do the CPU work for count(*) in between
    reading each block when the pages are not dirtied.)
    Is the improvement solely reduced CPU, because no search for a free
    buffer is needed, and/or L2 cache locality?

    What effect does the advance pinning have, avoid vacuum?

    A 16 x 8k page ring is too small to allow the needed IO blocksize of
    256k.
    The readahead is done 4 x one page at a time (=32k).
    What is the reasoning behind 1/4 ring for readahead (why not 1/2), is
    3/4 the trail for followers and bgwriter ?

    I think in anticipation of doing a single IO call for more than one
    page, the KillAndReadBuffer function should be split into two parts: one
    that does the killing for n pages, and one that does the reading for n
    pages.
    Killing n before reading n would also have the positive effect of
    grouping the writes that may be needed (not interleaving them with the
    reads).

    I think the 60% Nbuffers is a very good starting point. I would only
    introduce a GUC when we see evidence that it is needed (I agree with
    Simon's partitioning comments, but I'd still wait and see).

    Andreas
  • Heikki Linnakangas at May 10, 2007 at 10:22 am

    Zeugswetter Andreas ADI SD wrote:
    In reference to the seq scans roadmap, I have just submitted
    a patch that addresses some of the concerns.

    The patch does this:

    1. for small relation (smaller than 60% of bufferpool), use
    the current logic 2. for big relation:
    - use a ring buffer in heap scan
    - pin first 12 pages when scan starts
    - on consumption of every 4-page, read and pin the next 4-page
    - invalidate used pages of in the scan so they do not
    force out other useful pages
    A few comments regarding the effects:

    I do not see how this speedup could be caused by readahead, so what are
    the effects ?
    I was wondering that as well. We'd really need to test all the changes
    separately to see where the improvements are really coming from.

    Also, that patch doesn't address the VACUUM issue at all. And using a
    small fixed-size ring with scans that do updates can be devastating. I'm
    experimenting with different ring sizes for COPY at the moment. Too
    small a ring leads to a lot of WAL flushes; it's basically the same
    problem we have with VACUUM in CVS HEAD.

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Andreas Zeugswetter at May 10, 2007 at 10:52 am

    Also, that patch doesn't address the VACUUM issue at all. And
    using a small fixed size ring with scans that do updates can
    be devastating. I'm experimenting with different ring sizes
    for COPY at the moment. Too small ring leads to a lot of WAL
    flushes, it's basically the same problem we have with VACUUM
    in CVS HEAD.
    My first take on that would be to simply abandon any dirty (and actually
    also any still pinned) buffer from the ring and replace the ring slot
    with a buffer from the freelist.
    If the freelist is empty and LSN allows writing the buffer, write it
    (and maybe try to group these).
    If the LSN does not allow the write, replace the slot with a buffer from
    LRU.

    Andreas
  • Heikki Linnakangas at May 10, 2007 at 11:33 am

    Zeugswetter Andreas ADI SD wrote:
    Also, that patch doesn't address the VACUUM issue at all. And
    using a small fixed size ring with scans that do updates can
    be devastating. I'm experimenting with different ring sizes
    for COPY at the moment. Too small ring leads to a lot of WAL
    flushes, it's basically the same problem we have with VACUUM
    in CVS HEAD.
    My first take on that would be to simply abandon any dirty (and actually
    also any still pinned) buffer from the ring and replace the ring slot
    with a buffer from the freelist.
    If the freelist is empty and LSN allows writing the buffer, write it
    (and maybe try to group these).
    If the LSN does not allow the write, replace the slot with a buffer from
    LRU.
    That would effectively disable the ring for COPY and the 2nd phase of
    VACUUM.

    One problem with looking at the LSN is that you need the content lock to
    read it, and I wouldn't want to add any new locking. It could be done
    inside FlushBuffer when we hold the lock anyway, but I'm afraid the
    changes would be pretty invasive.

    I'm struggling to get a grip of what the optimal ring size is under
    various circumstances. Some thoughts I have this far:
    - a small ring gives better L2 cache behavior
    - for read-only queries, and for queries that just set hint bits, 1 buffer
    is enough
    - a small ring with a query that writes WAL (COPY, mass updates, 2nd phase
    of VACUUM) leads to a lot of WAL flushes, which can become a bottleneck.

    But all these assumptions need to be validated. I'm setting up tests
    with different ring sizes and queries to get a clear picture of this:
    - VACUUM on a clean table
    - VACUUM on a table with 1 dead tuple per page
    - read-only scan, large table
    - read-only scan, table fits in OS cache
    - COPY

    In addition, I'm going to run VACUUM in a DBT-2 test to see the effect
    on other queries running concurrently.

    I think a ring that grows when WAL flushes occur covers all the use
    cases reasonably well, but I need to do the testing...

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Heikki Linnakangas at May 10, 2007 at 4:03 pm

    Heikki Linnakangas wrote:
    But all these assumptions need to be validated. I'm setting up tests
    with different ring sizes and queries to get a clear picture of this:
    - VACUUM on a clean table
    - VACUUM on a table with 1 dead tuple per page
    - read-only scan, large table
    - read-only scan, table fits in OS cache
    - COPY
    Just to keep you guys informed, here are my results for a read-only scan
    on a table bigger than shared_buffers but smaller than RAM:

    select-1 | 00:00:10.853831
    select-1 | 00:00:10.380667
    select-1 | 00:00:11.530528
    select-2 | 00:00:08.634105
    select-2 | 00:00:02.674084
    select-4 | 00:00:02.65664
    select-8 | 00:00:02.662922
    select-16 | 00:00:02.682475
    select-32 | 00:00:02.693163
    select-64 | 00:00:02.722031
    select-128 | 00:00:02.873645
    select-256 | 00:00:03.185586
    select-512 | 00:00:03.534285
    select-1024 | 00:00:03.741867

    lshw utility tells me that this server has 32KB of L1 cache and 4MB of
    L2 cache. The performance starts to drop between 64-128 buffers, which
    is 512 - 1024 KB, so I'm not sure how it's related to cache size but
    using a small number of buffers is clearly better than using a large number.

    However, it caught me by total surprise that the performance with 1
    buffer is so horrible. Using 2 buffers is enough to avoid whatever the
    issue is with just 1 buffer. I have no idea what's causing that. There
    must be some interaction that I don't understand.

    All the numbers are quite repeatable, I ran the same test script many
    times. The runtime of the first select-2 test however varied between
    3-10 seconds, somehow the bad karma from using just 1 buffer in the
    earlier test carries over to the next test.

    I'm not sure what to think about this, but I'll set up more test
    scenarios with VACUUM and COPY.

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Heikki Linnakangas at May 10, 2007 at 5:33 pm

    Heikki Linnakangas wrote:
    However, it caught me by total surprise that the performance with 1
    buffer is so horrible. Using 2 buffers is enough to avoid whatever the
    issue is with just 1 buffer. I have no idea what's causing that. There
    must be some interaction that I don't understand.
    Ok, I found the reason for that. I was using this query for the selects:
    SELECT COUNT(*) FROM (SELECT 1 FROM stock_copytest LIMIT 10000000) AS a;

    Stock_copytest is larger than RAM size, that's why I used the LIMIT to
    make the result set memory resident. That had the side effect that
    apparently the limit-node kept the single buffer pinned which defeated
    the buffer ring completely. To avoid issues like that we apparently want
    to use 2-4 buffers instead of just 1.

    I'll review my test methodology and keep testing...

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Heikki Linnakangas at May 11, 2007 at 11:29 pm

    I wrote:
    I'll review my test methodology and keep testing...
    I ran a set of tests on a 100 warehouse TPC-C stock table that is ~3.2
    GB in size and the server has 4 GB of memory. IOW the table fits in OS
    cache, but not in shared_buffers (set at 1 GB).

    copy - COPY from a file
    select - SELECT COUNT(*) FROM stock
    vacuum - VACUUM on a clean table, effectively a read-only operation
    vacuum_hintbits - VACUUM on a table with no dead tuples, but hint bits
    need to be set on every page
    vacuum_dirty - VACUUM with exactly 1 dead tuple per page

    The number after the test name is the ring size used.

    There were no indexes on the table, which means that the vacuum tests
    only had to do one pass. The 1st vacuum phase of a real-world table is
    like a mixture of vacuum- and vacuum_hintbits-tests, and 2nd phase is
    like the vacuum_dirty test.

    copy-1 | 00:31:47.042365
    copy-2 | 00:17:57.630772
    copy-4 | 00:17:55.041794
    copy-8 | 00:08:31.014009
    copy-16 | 00:05:38.39848
    copy-32 | 00:05:52.295512
    copy-64 | 00:06:08.404646
    copy-128 | 00:05:05.032448
    copy-256 | 00:05:48.573146
    copy-512 | 00:04:56.098752
    copy-1024 | 00:05:27.05316
    select-4 | 00:00:04.344873
    select-4 | 00:00:02.2498
    select-1 | 00:00:08.754011
    select-1 | 00:00:10.521174
    select-1 | 00:00:10.819376
    select-1 | 00:00:14.818831
    select-1 | 00:00:14.893562
    select-1 | 00:00:16.973934
    select-2 | 00:00:15.722776
    select-2 | 00:00:02.291078
    select-2 | 00:00:02.230167
    select-4 | 00:00:02.232935
    select-8 | 00:00:02.238791
    select-16 | 00:00:02.245566
    select-32 | 00:00:02.267158
    select-64 | 00:00:02.311878
    select-128 | 00:00:02.487086
    select-256 | 00:00:02.764085
    select-512 | 00:00:03.161025
    select-1024 | 00:00:03.387246
    vacuum-1 | 00:00:01.843337
    vacuum-2 | 00:00:01.612738
    vacuum-4 | 00:00:01.6304
    vacuum-8 | 00:00:01.655126
    vacuum-16 | 00:00:01.641808
    vacuum-32 | 00:00:01.664108
    vacuum-64 | 00:00:01.729106
    vacuum-128 | 00:00:01.879023
    vacuum-256 | 00:00:02.218303
    vacuum-512 | 00:00:02.569571
    vacuum-1024 | 00:00:02.791995
    vacuum_dirty-1 | 00:24:15.424337
    vacuum_dirty-2 | 00:13:26.981835
    vacuum_dirty-4 | 00:08:07.260113
    vacuum_dirty-8 | 00:05:24.1476
    vacuum_dirty-16 | 00:03:52.690336
    vacuum_dirty-32 | 00:02:40.759203
    vacuum_dirty-64 | 00:02:45.14425
    vacuum_dirty-128 | 00:02:46.718922
    vacuum_dirty-256 | 00:02:43.797785
    vacuum_dirty-512 | 00:02:36.363763
    vacuum_dirty-1024 | 00:02:32.767481
    vacuum_hintbits-1 | 00:00:37.847935
    vacuum_hintbits-2 | 00:00:38.788662
    vacuum_hintbits-4 | 00:00:43.554029
    vacuum_hintbits-8 | 00:00:42.040379
    vacuum_hintbits-16 | 00:00:44.187508
    vacuum_hintbits-32 | 00:00:38.252052
    vacuum_hintbits-64 | 00:00:37.920379
    vacuum_hintbits-128 | 00:00:38.463007
    vacuum_hintbits-256 | 00:00:38.157724
    vacuum_hintbits-512 | 00:00:38.309285
    vacuum_hintbits-1024 | 00:00:39.178738

    I ran some of the select tests multiple times because the behavior
    changed when the test was repeated. I don't know what's going on in the
    select-1 test, it looks like the same effect I had with the more complex
    query involving a LIMIT-node, but this time I'm just doing a plain
    SELECT COUNT(*). I ran the test script multiple times; the results shown
    above are copy-pasted from one particular run but the numbers didn't
    change much from run to run. In particular, the run times for the
    select-1 test really do increase as you repeat the test many times. The
    copy results seem to vary quite a bit, though.

    For comparison, here's the test results with vanilla CVS HEAD:

    copy-head | 00:06:21.533137
    copy-head | 00:05:54.141285
    select-head | 00:00:16.213693
    select-head | 00:00:18.500792
    vacuum-head | 00:00:12.843479
    vacuum-head | 00:00:08.719845
    vacuum_dirty-head | 00:22:02.533553
    vacuum_dirty-head | 00:22:02.852786
    vacuum_hintbits-head | 00:00:38.278701
    vacuum_hintbits-head | 00:00:35.226191

    Looking at the results, it seems that using a fixed sized ring of 32
    pages hits the sweet spot on all tests. I wonder if that holds on other
    hardware.

    The test scripts I used are attached. I used a modified DBT-2 schema and
    dump file, so you'll need to replace that with some other large table to
    run it. I would appreciate it if others would repeat the tests on other
    hardware to get a bigger sample.

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Simon Riggs at May 12, 2007 at 7:36 am

    On Fri, 2007-05-11 at 22:59 +0100, Heikki Linnakangas wrote:
    For comparison, here's the test results with vanilla CVS HEAD:

    copy-head | 00:06:21.533137
    copy-head | 00:05:54.141285
    I'm slightly worried that the results for COPY aren't anywhere near as
    good as the SELECT and VACUUM results. It isn't clear from those numbers
    that the benefit really is significant.

    Are you thinking that having COPY avoid cache spoiling is a benefit just
    of itself? Or do you see a pattern of benefit from your other runs?

    (BTW what was wal_buffers set to? At least twice the ring buffer size,
    hopefully).

    --
    Simon Riggs
    EnterpriseDB http://www.enterprisedb.com
  • Luke Lonergan at May 12, 2007 at 3:43 pm
    Hi Simon,
    On 5/12/07 12:35 AM, "Simon Riggs" wrote:

    I'm slightly worried that the results for COPY aren't anywhere near as
    good as the SELECT and VACUUM results. It isn't clear from those numbers
    that the benefit really is significant.
    COPY is bottlenecked on datum formation and format translation with very low
    performance, so I don't think we should expect the ring buffer to make much
    of a dent.

    - Luke
  • CK Tan at May 13, 2007 at 11:42 pm
    Hi All,

    COPY/INSERT are also bottlenecked on record-at-a-time insertion into the
    heap, and on checking for pre-insert triggers, post-insert triggers and
    constraints.

    To speed things up, we really need to special-case insertions without
    triggers and constraints, [probably allow for unique constraints],
    and make these insertions go into the heap N tuples at a time. With
    this change comes the benefit of optimizing the REDO log to log multiple
    inserts, or even logging a whole new heap page that gets filled in a
    single WAL record.

    Those with triggers and other constraints would still have to go in
    one at a time because of the trigger/constraints semantics.

    It seems to me that dirty pages should be written out by the bgwriter
    instead of circumventing it using the ring buffer. If it is slow,
    we should change the bgwriter.

    -cktan
  • Tom Lane at May 13, 2007 at 11:54 pm

    "CK Tan" <cktan@greenplum.com> writes:
    COPY/INSERT are also bottlenecked on record at a time insertion into
    heap, and in checking for pre-insert trigger, post-insert trigger and
    constraints.
    To speed things up, we really need to special case insertions without
    triggers and constraints, [probably allow for unique constraints],
    Do you have any profiling data to back up these assertions? I haven't
    noticed that firing zero triggers takes any visible percentage of COPY
    time.

    regards, tom lane
  • CK Tan at May 14, 2007 at 2:38 am
    Sorry, I should have been clearer. I meant that because we need to check
    for trigger firing pre/post insertion, and the trigger definitions
    expect tuples to be inserted one by one, we cannot insert N
    tuples at a time into the heap. Checking for triggers itself is not
    taking up much CPU at all. If we could predetermine that there aren't
    any triggers for a relation, inserts into that relation could then
    follow a different path that inserts N tuples at a time.

    Regards,
    -cktan
  • Heikki Linnakangas at May 14, 2007 at 11:42 am

    Simon Riggs wrote:
    On Fri, 2007-05-11 at 22:59 +0100, Heikki Linnakangas wrote:
    For comparison, here's the test results with vanilla CVS HEAD:

    copy-head | 00:06:21.533137
    copy-head | 00:05:54.141285
    I'm slightly worried that the results for COPY aren't anywhere near as
    good as the SELECT and VACUUM results. It isn't clear from those numbers
    that the benefit really is significant.
    Agreed, the benefit isn't clear.
    Are you thinking that having COPY avoid cache spoiling is a benefit just
    of itself? Or do you see a pattern of benefit from your other runs?
    I think it's worth having just to avoid cache spoiling. I wouldn't
    bother otherwise, but since we have the infrastructure for vacuum and
    large seqscans, we might as well use it for COPY as well.
    (BTW what was wal_buffers set to? At least twice the ring buffer size,
    hopefully).
    Good question. [checks]. wal_buffers was set to 128KB. I tried raising
    it to 1MB, but it didn't make any difference.

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Heikki Linnakangas at May 15, 2007 at 9:32 am
    Just to keep you guys informed, I've been busy testing and pondering
    over different buffer ring strategies for vacuum, seqscans and copy.
    Here's what I'm going to do:

    Use a fixed size ring. Fixed as in doesn't change after the ring is
    initialized, however different kinds of scans use differently sized rings.

    I said earlier that it'd be an invasive change to see if a buffer needs a
    WAL flush and choose another victim if that's the case. I looked at it
    again and found a pretty clean way of doing that, so I took that
    approach for seq scans.

    1. For VACUUM, use a ring of 32 buffers. 32 buffers is small enough to
    give the L2 cache benefits and keep cache pollution low, but at the same
    time it's large enough that it keeps the need to WAL flush reasonable
    (1/32 of what we do now).

    2. For sequential scans, also use a ring of 32 buffers, but whenever a
    buffer in the ring would need a WAL flush to recycle, we throw it out of
    the buffer ring instead. On read-only scans (and scans that only update
    hint bits) this gives the L2 cache benefits and doesn't pollute the
    buffer cache. On bulk updates, it's effectively the current behavior. On
    scans that do some updates, it's something in between. In all cases it
    should be no worse than what we have now. 32 buffers should be large
    enough to leave a "cache trail" for Jeff's synchronized scans to work.

    3. For COPY that doesn't write WAL, use the same strategy as for
    sequential scans. This keeps the cache pollution low and gives the L2
    cache benefits.

    4. For COPY that writes WAL, use a large ring of 2048-4096 buffers. We
    want to use a ring that can accommodate 1 WAL segment worth of data, to
    avoid having to do any extra WAL flushes, and the WAL segment size is
    2048 pages in the default configuration.
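
    Summarized as a little table-in-code (the names are invented for this
    mail; the actual patch would attach the ring size to the scan's buffer
    access strategy rather than to a standalone function):

    typedef enum
    {
        SCAN_VACUUM,
        SCAN_SEQSCAN,
        SCAN_COPY_NO_WAL,
        SCAN_COPY_WAL
    } ScanKind;

    static int
    ring_size_for(ScanKind kind)
    {
        switch (kind)
        {
            case SCAN_VACUUM:
                return 32;      /* WAL flushes only 1/32 as often as now */
            case SCAN_SEQSCAN:
            case SCAN_COPY_NO_WAL:
                return 32;      /* buffers needing a WAL flush are thrown
                                 * out of the ring instead of recycled */
            case SCAN_COPY_WAL:
                return 2048;    /* roughly one WAL segment worth of pages */
        }
        return 32;
    }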

    Some alternatives I considered but rejected:

    * Instead of throwing away dirtied buffers in seq scans, accumulate them
    in another fixed sized list. When the list gets full, do a WAL flush and
    put them to the shared freelist or a backend-private freelist. That
    would eliminate the cache pollution of bulk DELETEs and bulk UPDATEs,
    and it could be used for vacuum as well. I think this would be the
    optimal algorithm but I don't feel like inventing something that
    complicated at this stage anymore. Maybe for 8.4.

    * Using a different sized ring for 1st and 2nd vacuum phase. Decided
    that it's not worth the trouble, the above is already an order of
    magnitude better than the current behavior.


    I'm going to rerun the performance tests I ran earlier with the new patch,
    tidy it up a bit, and submit it in the next few days. This turned out to
    be an even more laborious patch to review than I thought. While the patch
    is short and in the end turned out to be very close to Simon's original
    patch, there's many different usage scenarios that need to be catered
    for and tested.

    I still need to check the interaction with Jeff's patch. This is close
    enough to Simon's original patch that I believe the results of the tests
    Jeff ran earlier are still valid.

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Luke Lonergan at May 15, 2007 at 9:40 am
    Heikki,

    32 buffers = 1MB with 32KB blocksize, which spoils the CPU L2 cache
    effect.

    How about using 256/blocksize?

    - Luke
  • Heikki Linnakangas at May 15, 2007 at 9:42 am

    Luke Lonergan wrote:
    32 buffers = 1MB with 32KB blocksize, which spoils the CPU L2 cache
    effect.

    How about using 256/blocksize?
    Sounds reasonable. We need to check the effect on the synchronized
    scans, though.

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Jeff Davis at May 15, 2007 at 5:25 pm

    On Tue, 2007-05-15 at 10:42 +0100, Heikki Linnakangas wrote:
    Luke Lonergan wrote:
    32 buffers = 1MB with 32KB blocksize, which spoils the CPU L2 cache
    effect.

    How about using 256/blocksize?
    Sounds reasonable. We need to check the effect on the synchronized
    scans, though.
    I am a little worried that there will be greater differences in position
    as the number of scans increases. If we have only 8 buffers and several
    scans progressing, will they all be able to stay within a few buffers of
    each other at any given time?

    Also, with 8 buffers, that means each scan must report every 4 pages at
    most (and maybe every page), which increases lock contention (the new
    design Heikki and I discussed requires a lock every time a backend
    reports its position).

    Regards,
    Jeff Davis
  • Jim C. Nasby at May 15, 2007 at 10:34 pm

    On Tue, May 15, 2007 at 10:25:35AM -0700, Jeff Davis wrote:
    On Tue, 2007-05-15 at 10:42 +0100, Heikki Linnakangas wrote:
    Luke Lonergan wrote:
    32 buffers = 1MB with 32KB blocksize, which spoils the CPU L2 cache
    effect.

    How about using 256/blocksize?
    Sounds reasonable. We need to check the effect on the synchronized
    scans, though.
    I am a little worried that there will be greater differences in position
    as the number of scans increases. If we have only 8 buffers and several
    scans progressing, will they all be able to stay within a few buffers of
    each other at any given time?

    Also, with 8 buffers, that means each scan must report every 4 pages at
    most (and maybe every page), which increases lock contention (the new
    design Heikki and I discussed requires a lock every time a backend
    reports its position).
    Given that spoiling the L2 cache is a trivial cost compared to extra
    physical IO, ISTM we should go with a largish ring for sync scans. What
    do you think would be the ideal size? 32 buffers?
    --
    Jim Nasby decibel@decibel.org
    EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
  • Andreas Zeugswetter at May 16, 2007 at 8:31 am

    32 buffers = 1MB with 32KB blocksize, which spoils the CPU L2
    cache effect.
    I'd say in a scenario where 32k pages are indicated you will also want
    larger than average L2 caches.
    How about using 256/blocksize?
    The reading ahead uses 1/4 ring size. To the best of our knowledge, this
    1/4 needs to be 128k for reading.
    So I'd say we need 512/blocksize.

    Andreas
  • Luke Lonergan at May 16, 2007 at 5:31 pm
    I think the analysis on syncscan needs to take the external I/O cache into
    account. I believe it is not necessary or desirable to keep the scans in
    lock step within the PG bufcache.

    The main benefit of a sync scan will be the ability to start the scan where
    other scans have already filled the I/O cache with useful blocks. This will
    require some knowledge of the size of the I/O cache by the syncscan logic,
    but that's where the largest amount of I/O cache memory (by far) is located.

    - Luke

    On 5/15/07 3:34 PM, "Jim C. Nasby" wrote:
    On Tue, May 15, 2007 at 10:25:35AM -0700, Jeff Davis wrote:
    On Tue, 2007-05-15 at 10:42 +0100, Heikki Linnakangas wrote:
    Luke Lonergan wrote:
    32 buffers = 1MB with 32KB blocksize, which spoils the CPU L2 cache
    effect.

    How about using 256/blocksize?
    Sounds reasonable. We need to check the effect on the synchronized
    scans, though.
    I am a little worried that there will be greater differences in position
    as the number of scans increases. If we have only 8 buffers and several
    scans progressing, will they all be able to stay within a few buffers of
    each other at any given time?

    Also, with 8 buffers, that means each scan must report every 4 pages at
    most (and maybe every page), which increases lock contention (the new
    design Heikki and I discussed requires a lock every time a backend
    reports its position).
    Given that spoiling the L2 cache is a trivial cost compared to extra
    physical IO, ISTM we should go with a largish ring for sync scans. What
    do you think would be the ideal size? 32 buffers?
  • Jeff Davis at May 16, 2007 at 11:57 pm

    On Wed, 2007-05-16 at 10:31 -0700, Luke Lonergan wrote:
    I think the analysis on syncscan needs to take the external I/O cache into
    account. I believe it is not necessary or desirable to keep the scans in
    lock step within the PG bufcache.
    I partially agree. I don't think we need any huge amount of shared
    buffers for the scans to use, and the scans should be able to catch up
    by using the OS buffer cache (which is still faster than fetching from
    disk).

    However, as I said before, I'm worried that, if the ring is too small,
    it might not work to my expectations. I will try to test this to
    eliminate my doubts.
    The main benefit of a sync scan will be the ability to start the scan where
    other scans have already filled the I/O cache with useful blocks. This will
    require some knowledge of the size of the I/O cache by the syncscan logic,
    but that's where the largest amount of I/O cache memory (by far) is located.
    I don't think it's necessarily the largest "by far". However, it may be
    the largest.

    If you mean that the main benefit of sync scans is to make use of blocks
    that happen to be in cache before the scan began, I disagree. I think
    that's a possible benefit, but I was unable to show any huge benefit in
    my tests (someone may be able to on different hardware with different
    test cases).

    The main benefits that I see are:
    (1) reduce total number of blocks read from disk by making use of
    blocks as they are read by another concurrent seqscan.
    (2) eliminate the need for random I/O on concurrent sequential scans.

    I like the idea of using already-in-cache blocks. The code is there and
    it works, but I just didn't see the benefits yet. After I get any issues
    with this patch resolved and the reviewers are satisfied, I'd like to
    work on it.

    Regards,
    Jeff Davis
  • Luke Lonergan at May 17, 2007 at 2:46 pm
    Hi Jeff,
    On 5/16/07 4:56 PM, "Jeff Davis" wrote:

    The main benefit of a sync scan will be the ability to start the scan where
    other scans have already filled the I/O cache with useful blocks. This will
    require some knowledge of the size of the I/O cache by the syncscan logic,
    but that's where the largest amount of I/O cache memory (by far) is located.
    I don't think it's necessarily the largest "by far". However, it may be
    the largest.
    Compare the size of a 32 block ring buffer (between 256KB and 1024KB) and
    16,000,000 KB of RAM on a common server, automatically used to maximum
    extent by the OS dynamic I/O caching.
    If you mean that the main benefit of sync scans is to make use of blocks
    that happen to be in cache before the scan began, I disagree.
    That's not what I meant.
    I think
    that's a possible benefit, but I was unable to show any huge benefit in
    my tests (someone may be able to on different hardware with different
    test cases).
    I agree, I don't think this is worth pursuing.
    The main benefits that I see are:
    (1) reduce total number of blocks read from disk by making use of
    blocks as they are read by another concurrent seqscan.
    (2) eliminate the need for random I/O on concurrent sequential scans.
    Yes on (1), but with (2), again, the OS prefetch reduces the seeking to a
    minimal level.

    With (1), we just have to define the meaning of "concurrent" to be "within
    the span of the OS I/O cache" and we're describing the same effect.

    - Luke
  • Heikki Linnakangas at May 15, 2007 at 10:47 pm

    Jeff Davis wrote:
    On Tue, 2007-05-15 at 10:42 +0100, Heikki Linnakangas wrote:
    Luke Lonergan wrote:
    32 buffers = 1MB with 32KB blocksize, which spoils the CPU L2 cache
    effect.

    How about using 256/blocksize?
    Sounds reasonable. We need to check the effect on the synchronized
    scans, though.
    I am a little worried that there will be greater differences in position
    as the number of scans increases. If we have only 8 buffers and several
    scans progressing, will they all be able to stay within a few buffers of
    each other at any given time?
    I'm not sure. Needs testing. Assuming the scan leaves behind a cache
    trail in the OS cache as well, it might not be that bad if a scan
    joining the party starts a little bit behind.
    Also, with 8 buffers, that means each scan must report every 4 pages at
    most (and maybe every page), which increases lock contention (the new
    design Heikki and I discussed requires a lock every time a backend
    reports its position).
    Keep in mind that processing a 32K page takes longer than processing an
    8K page.

    But we'll see..

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • CK Tan at May 10, 2007 at 6:12 pm
    The patch has no effect on scans that do updates. The
    KillAndReadBuffer routine does not force out a buffer if the dirty
    bit is set. So updated pages revert to the current performance
    characteristics.

    -cktan
    GreenPlum, Inc.
    On May 10, 2007, at 5:22 AM, Heikki Linnakangas wrote:

    Zeugswetter Andreas ADI SD wrote:
    In reference to the seq scans roadmap, I have just submitted a
    patch that addresses some of the concerns.

    The patch does this:

    1. for small relations (smaller than 60% of the buffer pool), use the
    current logic
    2. for big relations:
    - use a ring buffer in heap scan
    - pin first 12 pages when scan starts
    - on consumption of every 4 pages, read and pin the next 4 pages
    - invalidate used pages of the scan so they do not force out
    other useful pages
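    A much-simplified sketch of that ring behaviour, with the 12-page and
    4-page constants taken from the description above and the buffer manager
    reduced to print statements, purely for illustration:

        #include <stdio.h>

        #define RING_SIZE   12   /* pages pinned ahead of the scan */
        #define GROUP_SIZE   4   /* pages fetched per read request */

        static void
        read_and_pin(int first_block, int nblocks)
        {
            /* The real patch would issue one contiguous read here. */
            printf("read blocks %d..%d\n", first_block, first_block + nblocks - 1);
        }

        int
        main(void)
        {
            const int table_blocks = 40;
            int next_to_read = 0;

            /* Prime the ring: pin the first RING_SIZE pages. */
            read_and_pin(next_to_read, RING_SIZE);
            next_to_read += RING_SIZE;

            for (int blk = 0; blk < table_blocks; blk++)
            {
                /* ... process tuples on block blk ... */

                /* After a full group is consumed, recycle it and refill. */
                if ((blk + 1) % GROUP_SIZE == 0 && next_to_read < table_blocks)
                {
                    printf("invalidate blocks %d..%d\n", blk + 1 - GROUP_SIZE, blk);
                    read_and_pin(next_to_read, GROUP_SIZE);
                    next_to_read += GROUP_SIZE;
                }
            }
            return 0;
        }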
    A few comments regarding the effects:
    I do not see how this speedup could be caused by readahead, so what are
    the effects?
    I was wondering that as well. We'd really need to test all the
    changes separately to see where the improvements are really coming
    from.

    Also, that patch doesn't address the VACUUM issue at all. And using
    a small fixed-size ring with scans that do updates can be
    devastating. I'm experimenting with different ring sizes for COPY
    at the moment. Too small a ring leads to a lot of WAL flushes; it's
    basically the same problem we have with VACUUM in CVS HEAD.

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • CK Tan at May 10, 2007 at 6:36 pm
    Sorry, a 16 x 8K page ring is too small indeed. The reason we selected 16
    is that Greenplum DB runs on a 32K page size, so we are indeed
    reading 128K at a time. The number of pages in the ring should be made
    relative to the page size, so you achieve 128K per read.

    Also agree that KillAndReadBuffer could be split into a
    KillPinDontRead(), and ReadThesePinnedPages() functions. However, we
    are thinking of AIO and would rather see a ReadNPagesAsync() function.

    -cktan
    Greenplum, Inc.
    On May 10, 2007, at 3:14 AM, Zeugswetter Andreas ADI SD wrote:

    In reference to the seq scans roadmap, I have just submitted
    a patch that addresses some of the concerns.

    The patch does this:

    1. for small relations (smaller than 60% of the buffer pool), use
    the current logic
    2. for big relations:
    - use a ring buffer in heap scan
    - pin first 12 pages when scan starts
    - on consumption of every 4 pages, read and pin the next 4 pages
    - invalidate used pages of the scan so they do not
    force out other useful pages
    A few comments regarding the effects:

    I do not see how this speedup could be caused by readahead, so what are
    the effects?
    (It should make no difference to do the CPU work for count(*) in between
    reading each block when the pages are not dirtied.)
    Is the improvement solely reduced CPU because no search for a free
    buffer is needed, and/or L2 cache locality?

    What effect does the advance pinning have? Avoid vacuum?

    A 16 x 8K page ring is too small to allow the needed IO blocksize of
    256K; the readahead is done 4 x one page at a time (= 32K).
    What is the reasoning behind 1/4 of the ring for readahead (why not 1/2)?
    Is 3/4 the trail for followers and bgwriter?

    I think that in anticipation of doing a single IO call for more than one
    page, the KillAndReadBuffer function should be split into two parts:
    one that does the killing for n pages, and one that does the reading
    for n pages.
    Killing n before reading n would also have the positive effect of
    grouping any needed writes (not interleaving them with the reads).

    I think the 60% Nbuffers is a very good starting point. I would only
    introduce a GUC when we see evidence that it is needed (I agree with
    Simon's partitioning comments, but I'd still wait and see).

    Andreas
  • Andreas Zeugswetter at May 11, 2007 at 10:48 am

    Sorry, a 16 x 8K page ring is too small indeed. The reason we
    selected 16 is that Greenplum DB runs on a 32K page size, so
    we are indeed reading 128K at a time. The number of pages in the ring
    should be made relative to the page size, so you achieve 128K
    per read.
    Ah, ok. New disks here also have a peak at 128k with no other concurrent
    IO.
    Writes benefit from larger blocksizes though, 512k and more.
    Reads with other concurrent IO might also benefit from larger
    blocksizes.

    Comment to all: to test optimal blocksizes make sure you have other
    concurrent IO on the disk.
    Also agree that KillAndReadBuffer could be split into a
    KillPinDontRead(), and ReadThesePinnedPages() functions.
    However, we are thinking of AIO and would rather see a
    ReadNPagesAsync() function.
    Yes, you could start the AIO and return an already-read buffer to allow
    concurrent CPU work.
    However, you would still want to do blocked aio_readv calls to make sure
    the physical read uses the large blocksize.
    So I'd say AIO would benefit from the same split.
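    A rough sketch of that split (the two function names come from this
    thread; the bodies, the RingSlot structure and the constants are invented
    for illustration):

        #include <stdio.h>

        #define GROUP_SIZE 4

        typedef struct RingSlot
        {
            int blockno;
            int dirty;
        } RingSlot;

        /* Phase 1: recycle n slots; write the dirty ones, grouped together. */
        static void
        KillPinDontRead(RingSlot *slots, int n)
        {
            for (int i = 0; i < n; i++)
                if (slots[i].dirty)
                    printf("write block %d\n", slots[i].blockno);
        }

        /* Phase 2: one contiguous (or async) read refills the emptied slots. */
        static void
        ReadThesePinnedPages(RingSlot *slots, int n, int first_block)
        {
            printf("read blocks %d..%d\n", first_block, first_block + n - 1);
            for (int i = 0; i < n; i++)
            {
                slots[i].blockno = first_block + i;
                slots[i].dirty = 0;
            }
        }

        int
        main(void)
        {
            RingSlot group[GROUP_SIZE] = {{0, 1}, {1, 0}, {2, 1}, {3, 0}};

            KillPinDontRead(group, GROUP_SIZE);          /* writes grouped first */
            ReadThesePinnedPages(group, GROUP_SIZE, 12); /* then one large read */
            return 0;
        }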

    In another posting you wrote:
    The patch has no effect on scans that do updates.
    The KillAndReadBuffer routine does not force out a buffer if
    the dirty bit is set. So updated pages revert to the current
    performance characteristics.
    Yes I see, the ring slot is replaced by a standard ReadBuffer in that
    case, looks good.

    I still think it would be better to write out the buffers and keep them
    in the ring when possible, but that seems to need locks and some sort of
    synchronization with the new walwriter, so looks like a nice project for
    after 8.3.

    Andreas
  • Andreas Zeugswetter at May 8, 2007 at 3:09 pm

    Nope - just reading N buffers ahead for seqscans. Subsequent
    calls use previously read pages. The objective is to issue
    contiguous reads to the OS in sizes greater than the PG page
    size (which is much smaller than what is needed for fast
    sequential I/O).
    The problem here is that either you issue the large read into throwaway
    private memory and hope that when you later read 8K you get the page
    from the OS buffer cache, or you need scatter/gather IO and a way to grab
    32 buffers at once.
    Yes, I think the ring buffer strategy should be used when the
    table size is > 1 x bufcache and the ring buffer should be of
    a fixed size smaller than L2 cache (32KB - 128KB seems to work well).
    How do you merge those two objectives? It seems the ring buffer needs to
    be at least as large as the contiguous read size.
    Thus you would need at least a 256K ring buffer. Better yet, make the
    ring buffer twice the IO size, so two sessions can alternately take the
    lead while the other session still holds on to a previous page. Modern L2
    caches are 8 MB, so 512K seems no problem?

    Andreas
  • Jeff Davis at May 8, 2007 at 9:27 pm

    On Tue, 2007-05-08 at 07:47 -0400, Luke Lonergan wrote:
    Heikki,

    On 3A: In practice, the popular modern OS'es (BSD/Linux/Solaris/etc)
    implement dynamic I/O caching. The experiments have shown that the benefit
    of re-using the PG buffer cache on large sequential scans is vanishingly
    small when the buffer cache size is small compared to the system memory.
    Since this is a normal and recommended situation (OS I/O cache is
    auto-tuning and easy to administer, etc), IMO the effort to optimize
    buffer cache reuse for seq scans > 1 x buffer cache size is not
    worthwhile.
    I think 3A is still an interesting idea, but I agree that it needs some
    results to back it up. Let's save 3A for the next release so that we
    have more time to see.
    To that point - an important factor in achieving max I/O rate for large
    tables (> 1 x bufcache) is avoiding the pollution of the CPU L2 cache.
    This is commonly in the range of 512KB -> 2MB, which is only important
    when considering a bound on the size of the ring buffer. The effect has
    been demonstrated to be significant - in the 20%+ range. Another thing
    to consider is the use of readahead inside the heapscan, in which case
    sizes >= 32KB are very effective.
    One thing I'd like us to keep in mind is to have a reasonable number of
    buffers active for a sequential scan. If the number is too small, my
    sync scans might not even work. Right now my patch only reports every 16
    pages, so 32KB (=4 pages) is not going to work for sync scans (I suppose
    only testing will tell).

    Regards,
    Jeff Davis
  • Jeff Davis at May 8, 2007 at 8:56 pm

    On Tue, 2007-05-08 at 11:40 +0100, Heikki Linnakangas wrote:
    I'm going to do this incrementally, and we'll see how far we get for
    8.3. We might push 3A and/or 3B to 8.4. First, I'm going to finish up
    Simon's patch (step 1), run some performance tests with vacuum, and
    submit a patch for that. Then I'll move to Jeff's patch (step 2).
    I think it's best to postpone 3A (one aspect of my patch). There are a
    couple reasons:

    1) The results didn't show enough benefit yet.
    2) Complex interactions with Simon's patch.

    I think it's an area of research for the future, but for now I just want
    to concentrate on the most important aspect of my patch: the
    synchronized scanning ( #2 in your list ).

    Regards,
    Jeff Davis
  • Simon Riggs at May 9, 2007 at 9:28 am

    On Tue, 2007-05-08 at 11:40 +0100, Heikki Linnakangas wrote:
    Here's my roadmap for the "scan-resistant buffer cache" and
    "synchronized scans" patches.

    1. Fix the current vacuum behavior of throwing dirty buffers to the
    freelist, forcing a lot of WAL flushes. Instead, use a backend-private
    ring of shared buffers that are recycled. This is what Simon's
    "scan-resistant buffer manager" did.

    The theory here is that if a page is read in by vacuum, it's unlikely to
    be accessed in the near future, therefore it should be recycled. If
    vacuum doesn't dirty the page, it's best to reuse the buffer immediately
    for the next page. However, if the buffer is dirty (and not just because
    we set hint bits), we ought to delay writing it to disk until the
    corresponding WAL record has been flushed to disk.

    Simon's patch used a fixed size ring of buffers that are recycled, but I
    think the ring should be dynamically sized. Start with a small ring, and
    whenever you need to do a WAL flush to write a dirty buffer, increase
    the ring size. On every full iteration through the ring, decrease its
    size to trim down an unnecessarily large ring.

    This only alters the behavior of vacuums, and it's pretty safe to say it
    won't get worse than what we have now.
    I think that's too much code; why not just leave it as it is? Would a
    dynamic buffer be substantially better? If not, why bother?
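    A rough illustration of the dynamic sizing being debated (the names,
    growth factor and bounds are assumptions, not patch code): grow the ring
    whenever writing a dirty buffer would force a WAL flush, and trim it a
    little on every complete pass.

        #define MIN_RING_SIZE   8
        #define MAX_RING_SIZE 256

        static int ring_size = MIN_RING_SIZE;

        /* Called when recycling a buffer would first require a WAL flush. */
        static void
        on_wal_flush_needed(void)
        {
            if (ring_size < MAX_RING_SIZE)
                ring_size *= 2;          /* grow to avoid future flushes */
        }

        /* Called once per complete pass through the ring. */
        static void
        on_ring_cycle_complete(void)
        {
            if (ring_size > MIN_RING_SIZE)
                ring_size--;             /* slowly trim an oversized ring */
        }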
    In the future, we can use the
    buffer ring for seqscans as well; more on that on step 3.
    There was clear benefit for that. You sound like you are suggesting
    removing the behaviour for Seq Scans, which wouldn't make much sense?
    2. Implement the list/table of last/ongoing seq scan positions. This is
    Jeff's "synchronized scans" patch. When a seq scan starts on a table
    larger than some threshold, it starts from where the previous seq scan
    is currently, or where it ended. This will synchronize the scans so that
    for two concurrent scans the total I/O is halved in the best case. There
    should be no other effect on performance.

    If you have a partitioned table, or union of multiple tables or any
    other plan where multiple seq scans are performed in arbitrary order,
    this change won't change the order the partitions are scanned and won't
    therefore ensure they will be synchronized.


    Now that we have both pieces of the puzzle in place, it's time to
    consider what more we can do with them:


    3A. To take advantage of the "cache trail" of a previous seq scan, scan
    backwards from where the previous seq scan ended, until a you hit a
    buffer that's not in cache.

    This will allow taking advantage of the buffer cache even if the table
    doesn't fit completely in RAM. That can make a big difference if the
    table size is just slightly bigger than RAM, and can avoid the nasty
    surprise when a table grows beyond RAM size and queries start taking
    minutes instead of seconds.

    This should be a non-controversial change on its own from performance
    point of view. No query should get slower, and some will become faster.
    But see step 3B:

    3B. Currently, sequential scans on a large table spoils the buffer cache
    by evicting other pages from the cache. In CVS HEAD, as soon as the
    table is larger than shared_buffers, the pages in the buffer won't be
    used to speed up running the same query again, and there's no reason to
    believe the pages read in would be more useful than any other page in
    the database, and in particular the pages that were in the buffer cache
    before the huge seq scan. If the table being scanned is > 5 *
    shared_buffers, the scan will evict every other page from the cache if
    there's no other activity in the database (max usage_count is 5).

    If the table is much larger than shared_buffers, say 10 times as large,
    even with the change 3B to read the pages that are in cache first, using
    all shared_buffers to cache the table will only speed up the query by
    10%. We should not spoil the cache for such a small gain, and use the
    local buffer ring strategy instead. It's better to make queries that are
    slow anyway a little bit slower, than making queries that are normally
    really fast, slow.


    As you may notice, 3A and 3B are at odds with each other. We can
    implement both, but you can't use both strategies in the same scan.
    Not sure I've seen any evidence of that.

    Most scans will be solo and so should use the ring buffer, since there
    is clear evidence of that. If there were evidence to suggest the two
    patches conflict then we should turn off the ring buffer only when
    concurrent scans are in progress (while remembering that concurrent
    scans will not typically overlap as much as the synch scan tests show
    and so for much of their execution they too will be solo).
    Therefore we need to have decision logic of some kind to figure out
    which strategy is optimal.

    A simple heuristic is to decide based on the table size:

    < 0.1*shared_buffers -> start from page 0, keep in cache (like we do now)
    < 5 * shared_buffers -> start from last read page, keep in cache
    > 5 * shared_buffers -> start from last read page, use buffer ring
    I'm not sure about the constants, we might need to make them GUC
    variables as Simon argued, but that would be the general approach.
    If you want to hardcode it, I'd say use the ring buffer on scans of 1000
    blocks or more, or we have a GUC. Sizing things to shared_buffers isn't
    appropriate because of the effects of partitioning, as I argued in my
    last post, which I think is still valid.
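    For concreteness, a sketch of the proposed table-size heuristic (the enum,
    the function and the constants are invented for illustration; Simon's
    alternative would compare against a fixed block count or a GUC instead of
    shared_buffers):

        typedef enum ScanStrategy
        {
            SCAN_NORMAL,        /* start at block 0, use the whole cache */
            SCAN_SYNC,          /* start at the last reported position */
            SCAN_SYNC_RING      /* synchronized start plus ring buffer */
        } ScanStrategy;

        static ScanStrategy
        choose_scan_strategy(long table_blocks, long shared_buffers)
        {
            if (table_blocks < shared_buffers / 10)
                return SCAN_NORMAL;
            if (table_blocks < 5 * shared_buffers)
                return SCAN_SYNC;
            return SCAN_SYNC_RING;
        }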
    Thoughts? Everyone happy with the roadmap?
    I think separating the patches is now the best way forward, though both
    are good.

    --
    Simon Riggs
    EnterpriseDB http://www.enterprisedb.com
