current full speed, we could wrap the xid counter every 72 hours.
This is a concern for very large tables, since currently we have to
rescan the whole table. This has my interest for change in the next
We could solve that with a freeze map. Avoiding wraparound vacuums
will get harder and harder for us and an approach which is a minor
improvement on existing freezing won't help for long. People often ask
about 8 byte xids directly, to which I reply: been discussed, major
downsides. So I've been trying to rethink this from first principles
to come up with a better idea.
A much better idea is to hold the xmin epoch on the tuple, in addition
to the xid, if there was a good place to hold this.
CommandId is 32 bits, but very few commands need 2 billion commands in
one transaction. So the suggestion is that we optionally store the xid
epoch in 16 bits of the cmdid field. We would still allow very large
commandIds and combocids, but we optimise the common usage. By doing
this, we will be able to reduce xid wraparounds from every 72 hours to
once every 538 years, at current transaction rates, in most cases. And
in much less busy databases, this will drop to essentially never.
In detail: Whenever we write a tuple with a t_cid of less than 65536
we set a new flag, infomask2 HEAP_HAS_EPOCH and write the xid epoch to
the remaining 16 bits. (Or at least, the epoch modulo 65536). So this
approach is binary compatible with current storage format (even though
I was happy to break it if necessary, I don't think that is helpful).
We hold the relnextfreezeepoch on pg_class, which starts at the epoch
when the table is created -1 (or 65535 at bootstrap). i.e. we next do
a freeze scan in 65536 epochs. Anytime we write a t_cid that is >65535
we reset the relnextfreezeepoch for the table to currentepoch+1 using
a non-transactional update. Autovacuum looks at the relnextfreezeepoch
when deciding whether to kick off a wraparound freeze vacuum. Setting
relnextfreezeepoch needs good interlocks to avoid resetting it when a
long running transaction is running. Various designs, but suggest
using something similar to vacuum cycleid, for when a long running
transaction and an wraparound freeze vacuum occur concurrently. That
way we only need to take special care when we have a very long
Very long transactions, that is transactions with >65536 commands will
need freezing within 1 cycle just as we currently do, but only for
changes made in the later part of the transaction. However, those are
typically pg_dump reloads and we should be able to use COPY FREEZE
with those so they get loaded frozen and don't then need later
freezing at all.
Hang on! Which xid is the epoch for? Well, for xmin of course. If xmax
is set, its either a lock or a delete. And either way we get to clean
things up with regular vacuums, which we would still need to do each
cycle. It's only the xmin that is annoying, because there isn't much
other need to revisit those data blocks. If we don't like that, we
could use a few extra bits from the epoch field to determine which xid
it applies to, but I think that's too much.
This seemed a bit radical when I first thought of this, but it still
seems solid now.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services