On Mon, 2006-12-18 at 11:55 +0900, ITAGAKI Takahiro wrote:
> I'm testing the recent changes to WAL entries for freezing tuples.
> VACUUM FREEZE took more time. The cause seems to be flushing WAL buffers.
> Vacuuming processes free buffers into the freelist. The buffers in the
> freelist are preferentially used on the next allocation of buffers. If
> such a buffer is dirty, the allocator must write it before reuse. Since
> there are typically few buffers in the freelist, buffers made dirty
> recently are reused too soon -- the WAL entries for the dirty buffer
> have not been flushed yet, so the allocator flushes WAL, writes the
> buffer, and finally reuses it.
I think what you are saying is: VACUUM places blocks so that they are
immediately reused. This stops shared_buffers from being polluted by
vacuumed blocks, but it also means that almost every write becomes a
backend dirty write while VACUUM is working, bgwriter or not. It also
means that we flush WAL more often than we otherwise would.
> One solution is to always keep some buffers in the freelist. If there
> were N buffers in the freelist, the need for WAL flushing would be
> reduced to 1/N, because one flush covers the WAL entries for all of them.
That sounds very similar to an idea I'd been working on which I'd called
cache looping. There is a related (but opposite) problem with sequential
scans - they don't move through the cache fast enough. A solution to
both issues is to have the Vacuum/SeqScans continually reuse a small
pool of buffers, rather than request the next one from the buffer
manager in the normal way.
> The attached patch is an experimental implementation of the above.
> Keeping 32 buffers seems to be enough when the vacuum runs alone. With
> some background jobs, other numbers may be better.
>
>  N | time  | XLogWrite/XLogFlush
>  1 | 68.2s | 25.6%
>  8 | 57.4s | 10.8%
> 32 | 54.0s |  3.4%
I think this is good proof; well done.
Following on from that, my thinking would be to have a more general
implementation: each backend keeps a list of cache buffers to reuse in
its own local loop, rather than using the freelist as a global list.
That way the technique would work even when we have multiple Vacuums
working concurrently, and it would also be possible to use it for the
SeqScan case as well.
Cache looping would be implemented by a modified BufferAlloc routine,
say BufferScanAlloc(), that is called only after a StrategyUseCacheLoop()
has been called during a SeqScan or VacuumScan.
Each backend would keep a list of the previous N buffers it touched.
When N = Nmax, we would link back to the oldest buffer to form a ring.
Each time we need the next buffer we take it from the ring rather than
from the main clock sweep. If the buffer identified is pinned, we drop
it from the ring, request a new buffer in the normal way, and keep that
in the ring instead. At the end of the scan, we simply forget the
buffer ring.
Another connected thought is the idea of having a FullBufferList - the
opposite of a free buffer list. When VACUUM/INSERT/COPY fills a block,
we notify the buffer manager that this block needs writing ahead of
other buffers, so that the bgwriter can work more effectively. That
seems like it would help with both the current patch and the additional
thoughts above.
> $ pgbench -s 40 -i;
> # VACUUM FREEZE
> # UPDATE accounts SET aid=aid WHERE random() < 0.005;
> # VACUUM FREEZE accounts;
>
> I cannot see the above problem with a non-freeze vacuum. The number of
> buffers in the freelist increases during the index-vacuuming phase:
> when the vacuum finds seldom-used buffers (refcount==0 and
> usage_count==0), they are added to the freelist. So the WAL entries
> generated in the index-vacuuming or heap-vacuuming phases are not such
> a problem. However, the entries for FREEZE are generated in the
> heap-scanning phase, which comes before index-vacuuming.
The same thing happens when hint bits are set during normal operation,
though that might occur only once in most test situations. In practice
this can happen each time we touch a row and then VACUUM, so we end up
re-writing the block many times in the way you describe.
IIRC Heikki was thinking of altering the way VACUUM works to avoid it
writing out blocks that it was going to come back to in the second phase
anyway. That would go some way to alleviating the problem you describe,
but wouldn't go as far as the technique you suggest.