"synchronized scans" patches.
1. Fix the current vacuum behavior of throwing dirty buffers to the
freelist, forcing a lot of WAL flushes. Instead, use a backend-private
ring of shared buffers that are recycled. This is what Simon's
"scan-resistant buffer manager" did.
The theory here is that if a page is read in by vacuum, it's unlikely to
be accessed in the near future, therefore it should be recycled. If
vacuum doesn't dirty the page, it's best to reuse the buffer immediately
for the next page. However, if the buffer is dirty (and not just because
we set hint bits), we ought to delay writing it to disk until the
corresponding WAL record has been flushed to disk.
Simon's patch used a fixed size ring of buffers that are recycled, but I
think the ring should be dynamically sized. Start with a small ring, and
whenever you need to do a WAL flush to write a dirty buffer, increase
the ring size. On every full iteration through the ring, decrease its
size to trim down an unnecessarily large ring.
This only alters the behavior of vacuums, and it's pretty safe to say it
won't get worse than what we have now. In the future, we can use the
buffer ring for seqscans as well; more on that on step 3.
2. Implement the list/table of last/ongoing seq scan positions. This is
Jeff's "synchronized scans" patch. When a seq scan starts on a table
larger than some threshold, it starts from where the previous seq scan
is currently, or where it ended. This will synchronize the scans so that
for two concurrent scans the total I/O is halved in the best case. There
should be no other effect on performance.
If you have a partitioned table, or union of multiple tables or any
other plan where multiple seq scans are performed in arbitrary order,
this change won't change the order the partitions are scanned and won't
therefore ensure they will be synchronized.
Now that we have both pieces of the puzzle in place, it's time to
consider what more we can do with them:
3A. To take advantage of the "cache trail" of a previous seq scan, scan
backwards from where the previous seq scan ended, until a you hit a
buffer that's not in cache.
This will allow taking advantage of the buffer cache even if the table
doesn't fit completely in RAM. That can make a big difference if the
table size is just slightly bigger than RAM, and can avoid the nasty
surprise when a table grows beyond RAM size and queries start taking
minutes instead of seconds.
This should be a non-controversial change on its own from performance
point of view. No query should get slower, and some will become faster.
But see step 3B:
3B. Currently, sequential scans on a large table spoils the buffer cache
by evicting other pages from the cache. In CVS HEAD, as soon as the
table is larger than shared_buffers, the pages in the buffer won't be
used to speed up running the same query again, and there's no reason to
believe the pages read in would be more useful than any other page in
the database, and in particular the pages that were in the buffer cache
before the huge seq scan. If the table being scanned is > 5 *
shared_buffers, the scan will evict every other page from the cache if
there's no other activity in the database (max usage_count is 5).
If the table is much larger than shared_buffers, say 10 times as large,
even with the change 3B to read the pages that are in cache first, using
all shared_buffers to cache the table will only speed up the query by
10%. We should not spoil the cache for such a small gain, and use the
local buffer ring strategy instead. It's better to make queries that are
slow anyway a little bit slower, than making queries that are normally
really fast, slow.
As you may notice, 3A and 3B are at odds with each other. We can
implement both, but you can't use both strategies in the same scan.
Therefore we need to have decision logic of some kind to figure out
which strategy is optimal.
A simple heuristic is to decide based on the table size:
< 0.1*shared_buffers -> start from page 0, keep in cache (like we do now)
< 5 * shared_buffers -> start from last read page, keep in cache
5 * shared_buffers -> start from last read page, use buffer ringI'm not sure about the constants, we might need to make them GUC
variables as Simon argued, but that would be the general approach.
In the future, I'm envisioning a smarter algorithm to size the local
buffer ring. Take all buffers with usage_count=0, plus a few with
usage_count=1 from the clock sweep. That way if there's a lot of buffers
in the buffer cache that are seldomly used, we'll use more buffers to
cache the large scan, and vice versa. And no matter how large the scan,
it wouldn't blow all buffers from the cache. But if you execute the same
query again, the buffers left in the cache from the last scan were
apparently useful, so we use a bigger ring this time.
I'm going to do this incrementally, and we'll see how far we get for
8.3. We might push 3A and/or 3B to 8.4. First, I'm going to finish up
Simon's patch (step 1), run some performance tests with vacuum, and
submit a patch for that. Then I'll move to Jeff's patch (step 2).
Thoughts? Everyone happy with the roadmap?