Since we're bashing around ideas about freezing, let me write down the
idea I've been pondering and discussing with various people for years. I
don't think I invented this myself; apologies to whoever did for not
giving credit.

The reason we have to freeze is that otherwise our 32-bit XIDs wrap
around and become ambiguous. The obvious solution is to extend XIDs to
64 bits, but that would waste a lot of space. The trick is to add a field
to the page header indicating the 'epoch' of the XID, while keeping the
XIDs in the tuple header 32 bits wide (*).

The other reason we freeze is to truncate the clog. But with 64-bit
XIDs, we wouldn't actually need to change old XIDs on disk to FrozenXid.
Instead, we could implicitly treat anything older than relfrozenxid as
frozen.

That's the basic idea. A freezing vacuum still needs to remove dead
tuples, but it doesn't need to dirty pages that contain no dead tuples.

Since we're not storing 64-bit wide XIDs on every tuple, we'd still need
to replace the XIDs with FrozenXid whenever the difference between the
smallest and largest XID on a page exceeds 2^31. But that would only
happen when you're updating the page, in which case the page is dirtied
anyway, so it wouldn't cause any extra I/O.

This would also be the first step in allowing the clog to grow larger
than 2 billion transactions, eliminating the need for anti-wraparound
freezing altogether. You'd still want to truncate the clog eventually,
but it would be nice to not be pressed against the wall with "run vacuum
freeze now, or the system will shut down".

(*) "Adding an epoch" is inaccurate, but I like to use that as my mental
model. If you just add a 32-bit epoch field, then you cannot have xids
from different epochs on the page, which would be a problem. In reality,
you would store one 64-bit XID value in the page header, and use that as
the "reference point" for all the 32-bit XIDs on the tuples. See
existing convert_txid() function for how that works. Another method is
to store the 32-bit xid values in tuple headers as offsets from the
per-page 64-bit value, but then you'd always need to have the 64-bit
value at hand when interpreting the XIDs, even if they're all recent.
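
To illustrate the reference-point approach, here is a rough sketch (made-up
names, not from any patch) of how a 32-bit tuple XID could be mapped back to
a full 64-bit value, along the same lines as the existing convert_txid():

    /*
     * Rough sketch, not from any patch. Assumes PostgreSQL's usual
     * typedefs (uint64, int32, TransactionId) and that all XIDs on the
     * page are within 2^31 of the page's 64-bit reference XID. Special
     * XIDs (Bootstrap/FrozenTransactionId) would need to be returned
     * as-is, as convert_txid() does.
     */
    static uint64
    page_xid_to_xid64(uint64 page_ref_xid, TransactionId xid)
    {
        uint64          epoch = page_ref_xid >> 32;
        TransactionId   ref = (TransactionId) page_ref_xid;

        /* near the reference point, the xid can fall on either side of
         * an epoch boundary, so adjust the epoch accordingly */
        if (xid > ref && (int32) (xid - ref) < 0)
            epoch--;            /* xid logically precedes the reference */
        else if (xid < ref && (int32) (xid - ref) > 0)
            epoch++;            /* xid logically follows the reference */

        return (epoch << 32) | xid;
    }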

- Heikki

  • Josh Berkus at May 30, 2013 at 5:26 pm
    Heikki,

    This sounds a lot like my idea for 9.3, which didn't go anywhere.
    You've worked out the issues I couldn't, I think.
    Another method is
    to store the 32-bit xid values in tuple headers as offsets from the
    per-page 64-bit value, but then you'd always need to have the 64-bit
    value at hand when interpreting the XIDs, even if they're all recent.
    Yeah, -1 on the latter, not least because it would require a 100%
    rewrite of the tables in order to upgrade.

    --
    Josh Berkus
    PostgreSQL Experts Inc.
    http://pgexperts.com
  • Robert Haas at May 30, 2013 at 6:40 pm

    On Thu, May 30, 2013 at 9:33 AM, Heikki Linnakangas wrote:
    The reason we have to freeze is that otherwise our 32-bit XIDs wrap around
    and become ambiguous. The obvious solution is to extend XIDs to 64 bits, but
    that would waste a lot space. The trick is to add a field to the page header
    indicating the 'epoch' of the XID, while keeping the XIDs in tuple header
    32-bit wide (*).
    Check.
    The other reason we freeze is to truncate the clog. But with 64-bit XIDs, we
    wouldn't actually need to change old XIDs on disk to FrozenXid. Instead, we
    could implicitly treat anything older than relfrozenxid as frozen.
    Check.
    That's the basic idea. Vacuum freeze only needs to remove dead tuples, but
    doesn't need to dirty pages that contain no dead tuples.
    Check.
    Since we're not storing 64-bit wide XIDs on every tuple, we'd still need to
    replace the XIDs with FrozenXid whenever the difference between the smallest
    and largest XID on a page exceeds 2^31. But that would only happen when
    you're updating the page, in which case the page is dirtied anyway, so it
    wouldn't cause any extra I/O.
    It would cause some extra WAL activity, but it wouldn't dirty the page
    an extra time.
    This would also be the first step in allowing the clog to grow larger than 2
    billion transactions, eliminating the need for anti-wraparound freezing
    altogether. You'd still want to truncate the clog eventually, but it would
    be nice to not be pressed against the wall with "run vacuum freeze now, or
    the system will shut down".
    Interesting. That seems like a major advantage.
    (*) "Adding an epoch" is inaccurate, but I like to use that as my mental
    model. If you just add a 32-bit epoch field, then you cannot have xids from
    different epochs on the page, which would be a problem. In reality, you
    would store one 64-bit XID value in the page header, and use that as the
    "reference point" for all the 32-bit XIDs on the tuples. See existing
    convert_txid() function for how that works. Another method is to store the
    32-bit xid values in tuple headers as offsets from the per-page 64-bit
    value, but then you'd always need to have the 64-bit value at hand when
    interpreting the XIDs, even if they're all recent.
    As I see it, the main downsides of this approach are:

    (1) It breaks binary compatibility (unless you do something to
    provide for it, like put the epoch in the special space).

    (2) It consumes 8 bytes per page. I think it would be possible to get
    this down to say 5 bytes per page pretty easily; we'd simply decide
    that the low-order 3 bytes of the reference XID must always be 0.
    Possibly you could even do with 4 bytes, or 4 bytes plus some number
    of extra bits.

    (3) You still need to periodically scan the entire relation, or else
    have a freeze map as Simon and Josh suggested.

    The upsides of this approach as compared with what Andres and I are
    proposing are:

    (1) It provides a stepping stone towards allowing indefinite expansion
    of CLOG, which is quite appealing as an alternative to a hard
    shut-down.

    (2) It doesn't place any particular requirements on PD_ALL_VISIBLE. I
    don't personally find this much of a benefit as I want to keep
    PD_ALL_VISIBLE, but I know Jeff and perhaps others disagree.

    Random thought: Could you compute the reference XID based on the page
    LSN? That would eliminate the storage overhead.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Merlin Moncure at May 30, 2013 at 6:46 pm

    On Thu, May 30, 2013 at 1:39 PM, Robert Haas wrote:
    On Thu, May 30, 2013 at 9:33 AM, Heikki Linnakangas
    wrote:
    The reason we have to freeze is that otherwise our 32-bit XIDs wrap around
    and become ambiguous. The obvious solution is to extend XIDs to 64 bits, but
    that would waste a lot space. The trick is to add a field to the page header
    indicating the 'epoch' of the XID, while keeping the XIDs in tuple header
    32-bit wide (*).
    (3) You still need to periodically scan the entire relation, or else
    have a freeze map as Simon and Josh suggested.
    Why is this scan required?

    Also, what happens if you delete a tuple on a page while another tuple
    on the same page, with age > 2^32, is still part of an open transaction?

    merlin
  • Heikki Linnakangas at May 30, 2013 at 7:04 pm

    On 30.05.2013 21:46, Merlin Moncure wrote:
    On Thu, May 30, 2013 at 1:39 PM, Robert Haas wrote:
    On Thu, May 30, 2013 at 9:33 AM, Heikki Linnakangas
    wrote:
    The reason we have to freeze is that otherwise our 32-bit XIDs wrap around
    and become ambiguous. The obvious solution is to extend XIDs to 64 bits, but
    that would waste a lot space. The trick is to add a field to the page header
    indicating the 'epoch' of the XID, while keeping the XIDs in tuple header
    32-bit wide (*).
    (3) You still need to periodically scan the entire relation, or else
    have a freeze map as Simon and Josh suggested.
    Why is this scan required?
    To find all the dead tuples and remove them, and advance relfrozenxid.
    That in turn is required so that you can truncate the clog. This scheme
    relies on assuming that everything older than relfrozenxid committed, so
    if there are any aborted XIDs present in the table, you can't advance
    relfrozenxid past them.

    Come to think of it, if there are no aborted XIDs in a range of XIDs,
    only commits, then you could just advance relfrozenxid past that range
    and truncate away the clog, without scanning the table. But that's quite
    a special case - generally there would be at least a few aborted XIDs -
    so it's probably not worth adding any special code to take advantage of
    that.
    Also, what happens if you delete a tuple on a page while another tuple
    on the same page, with age > 2^32, is still part of an open transaction?
    Can't let that happen. Same as today.

    - Heikki
  • Andres Freund at May 30, 2013 at 7:22 pm

    On 2013-05-30 14:39:46 -0400, Robert Haas wrote:
    Since we're not storing 64-bit wide XIDs on every tuple, we'd still need to
    replace the XIDs with FrozenXid whenever the difference between the smallest
    and largest XID on a page exceeds 2^31. But that would only happen when
    you're updating the page, in which case the page is dirtied anyway, so it
    wouldn't cause any extra I/O.
    It would cause some extra WAL activity, but it wouldn't dirty the page
    an extra time.
    You probably could do it similarly to how we currently do
    XLOG_HEAP_ALL_VISIBLE_CLEARED and just recheck the page on replay. The
    insert/update/delete record will already contain a FPI if necessary, so
    that should be safe.
    This would also be the first step in allowing the clog to grow larger than 2
    billion transactions, eliminating the need for anti-wraparound freezing
    altogether. You'd still want to truncate the clog eventually, but it would
    be nice to not be pressed against the wall with "run vacuum freeze now, or
    the system will shut down".
    Interesting. That seems like a major advantage.
    Hm. Why? If freezing gets notably cheaper I don't really see much need
    for keeping that much clog around? If we still run into anti-wraparound
    areas, there has to be some major operational problem.

    Greetings,

    Andres Freund

    --
      Andres Freund http://www.2ndQuadrant.com/
      PostgreSQL Development, 24x7 Support, Training & Services
  • Robert Haas at May 31, 2013 at 2:04 am

    On Thu, May 30, 2013 at 3:22 PM, Andres Freund wrote:
    On 2013-05-30 14:39:46 -0400, Robert Haas wrote:
    Since we're not storing 64-bit wide XIDs on every tuple, we'd still need to
    replace the XIDs with FrozenXid whenever the difference between the smallest
    and largest XID on a page exceeds 2^31. But that would only happen when
    you're updating the page, in which case the page is dirtied anyway, so it
    wouldn't cause any extra I/O.
    It would cause some extra WAL activity, but it wouldn't dirty the page
    an extra time.
    You probably could do it similarly to how we currently do
    XLOG_HEAP_ALL_VISIBLE_CLEARED and just recheck the page on replay. The
    insert/update/delete record will already contain a FPI if necessary, so
    that should be safe.
    Ah, good point.
    This would also be the first step in allowing the clog to grow larger than 2
    billion transactions, eliminating the need for anti-wraparound freezing
    altogether. You'd still want to truncate the clog eventually, but it would
    be nice to not be pressed against the wall with "run vacuum freeze now, or
    the system will shut down".
    Interesting. That seems like a major advantage.
    Hm. Why? If freezing gets notably cheaper I don't really see much need
    for keeping that much clog around? If we still run into anti-wraparound
    areas, there has to be some major operational problem.
    That is true, but we have a decent number of customers who do in fact
    have such problems. I think that's only going to get more common. As
    hardware gets faster and PostgreSQL improves, people are going to
    process more and more transactions in shorter and shorter periods of
    time. Heikki's benchmark results for the XLOG scaling patch show
    rates of >80,000 tps. Even at a more modest 10,000 tps, with default
    settings, you'll do anti-wraparound vacuums of the entire cluster
    about every 8 hours. That's not fun.

    Being able to do such scans only of the not-all-visible pages would be
    a huge step forward, of course. But not having to do them on any
    particular deadline would be a whole lot better.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Bruce Momjian at May 31, 2013 at 5:27 pm

    On Thu, May 30, 2013 at 10:04:23PM -0400, Robert Haas wrote:
    Hm. Why? If freezing gets notably cheaper I don't really see much need
    for keeping that much clog around? If we still run into anti-wraparound
    areas, there has to be some major operational problem.
    That is true, but we have a decent number of customers who do in fact
    have such problems. I think that's only going to get more common. As
    hardware gets faster and PostgreSQL improves, people are going to
    process more and more transactions in shorter and shorter periods of
    time. Heikki's benchmark results for the XLOG scaling patch show
    rates of >80,000 tps. Even at a more modest 10,000 tps, with default
    settings, you'll do anti-wraparound vacuums of the entire cluster
    about every 8 hours. That's not fun.
    Are you assuming these are all write transactions, hence consuming xids?

    --
       Bruce Momjian <bruce@momjian.us> http://momjian.us
       EnterpriseDB http://enterprisedb.com

       + It's impossible for everything to be true. +
  • Robert Haas at Jun 1, 2013 at 3:32 am

    On Fri, May 31, 2013 at 1:26 PM, Bruce Momjian wrote:
    On Thu, May 30, 2013 at 10:04:23PM -0400, Robert Haas wrote:
    Hm. Why? If freezing gets notably cheaper I don't really see much need
    for keeping that much clog around? If we still run into anti-wraparound
    areas, there has to be some major operational problem.
    That is true, but we have a decent number of customers who do in fact
    have such problems. I think that's only going to get more common. As
    hardware gets faster and PostgreSQL improves, people are going to
    process more and more transactions in shorter and shorter periods of
    time. Heikki's benchmark results for the XLOG scaling patch show
    rates of >80,000 tps. Even at a more modest 10,000 tps, with default
    settings, you'll do anti-wraparound vacuums of the entire cluster
    about every 8 hours. That's not fun.
    Are you assuming these are all write transactions, hence consuming xids?
    Well, there might be read-only transactions as well, but the point is
    about how many write transactions there can be. 10,000 tps or more is
    not out of the question even today, and progressively higher numbers
    are only going to become more and more common.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Greg Stark at Jun 6, 2013 at 12:17 pm

    On Fri, May 31, 2013 at 3:04 AM, Robert Haas wrote:
    Even at a more modest 10,000 tps, with default
    settings, you'll do anti-wraparound vacuums of the entire cluster
    about every 8 hours. That's not fun.
    I've forgotten now. What happens if you have a long-lived transaction
    still alive from > 2B xid ago?


    --
    greg
  • Heikki Linnakangas at Jun 6, 2013 at 12:39 pm

    On 06.06.2013 15:16, Greg Stark wrote:
    On Fri, May 31, 2013 at 3:04 AM, Robert Haas wrote:
    Even at a more modest 10,000 tps, with default
    settings, you'll do anti-wraparound vacuums of the entire cluster
    about every 8 hours. That's not fun.
    I've forgotten now. What happens if you have a long-lived transaction
    still alive from > 2B xid ago?
    That will keep OldestXmin from advancing. Which will keep vacuum from
    advancing relfrozenxid/datfrozenxid. Which will first trigger the
    warnings about wrap-around, then stop new XIDs from being generated,
    and finally force a shutdown.

    The forced shutdown will actually happen some time before going beyond 2
    billion XIDs. So it is not possible to have a long-lived transaction,
    older than 2 B XIDs, still live in the system. But let's imagine that
    you somehow bypass the safety mechanism:

    After wraparound, old tuples will look like they are in the future, and
    will become invisible to new transactions. That happens even if there
    are no old transactions around. I'm not sure what exactly will happen if
    there is still a transaction alive with an XID and/or snapshots older
    than 2^31 XIDs. New tuples that are not supposed to be visible to the
    old snapshot would suddenly become visible, I guess.

    - Heikki
  • Greg Stark at Jun 6, 2013 at 10:28 pm

    On Thu, Jun 6, 2013 at 1:39 PM, Heikki Linnakangas wrote:
    That will keep OldestXmin from advancing. Which will keep vacuum from
    advancing relfrozenxid/datfrozenxid. Which will first trigger the warnings
    about wrap-around, then stops new XIDs from being generated, and finally a
    forced shutdown.

    The forced shutdown will actually happen some time before going beyond 2
    billion XIDs. So it is not possible to have a long-lived transaction, older
    than 2 B XIDs, still live in the system. But let's imagine that you somehow
    bypass the safety mechanism:
    Ah, so if you do the epoch in the page header thing or Robert's LSN
    trick that I didn't follow then you'll need a new safety check against
    this. Since relfrozenxid/datfrozenxid will no longer be necessary.

    --
    greg
  • Robert Haas at Jun 7, 2013 at 5:54 pm

    On Thu, Jun 6, 2013 at 6:28 PM, Greg Stark wrote:
    On Thu, Jun 6, 2013 at 1:39 PM, Heikki Linnakangas
    wrote:
    That will keep OldestXmin from advancing. Which will keep vacuum from
    advancing relfrozenxid/datfrozenxid. Which will first trigger the warnings
    about wrap-around, then stops new XIDs from being generated, and finally a
    forced shutdown.

    The forced shutdown will actually happen some time before going beyond 2
    billion XIDs. So it is not possible to have a long-lived transaction, older
    than 2 B XIDs, still live in the system. But let's imagine that you somehow
    bypass the safety mechanism:
    Ah, so if you do the epoch in the page header thing or Robert's LSN
    trick that I didn't follow then you'll need a new safety check against
    this. Since relfrozenxid/datfrozenxid will no longer be necessary.
    Nothing proposed here gets rid of either of those, that I can see.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Heikki Linnakangas at Jun 7, 2013 at 6:08 pm

    On 07.06.2013 20:54, Robert Haas wrote:
    On Thu, Jun 6, 2013 at 6:28 PM, Greg Stark wrote:
    On Thu, Jun 6, 2013 at 1:39 PM, Heikki Linnakangas
    wrote:
    That will keep OldestXmin from advancing. Which will keep vacuum from
    advancing relfrozenxid/datfrozenxid. Which will first trigger the warnings
    about wrap-around, then stops new XIDs from being generated, and finally a
    forced shutdown.

    The forced shutdown will actually happen some time before going beyond 2
    billion XIDs. So it is not possible to have a long-lived transaction, older
    than 2 B XIDs, still live in the system. But let's imagine that you somehow
    bypass the safety mechanism:
    Ah, so if you do the epoch in the page header thing or Robert's LSN
    trick that I didn't follow then you'll need a new safety check against
    this. Since relfrozenxid/datfrozenxid will no longer be necessary.
    Nothing proposed here gets rid of either of those, that I can see.
    Right. The meaning of relfrozenxid/datfrozenxid changes somewhat; it no
    longer means that all XIDs older than frozenxid are replaced with
    FrozenXid. Instead, it will mean that all XIDs older than frozenxid are
    committed, ie. all dead tuples older than that have been vacuumed away.

    - Heikki
  • Simon Riggs at Jun 7, 2013 at 6:33 pm

    On 7 June 2013 19:08, Heikki Linnakangas wrote:
    On 07.06.2013 20:54, Robert Haas wrote:

    On Thu, Jun 6, 2013 at 6:28 PM, Greg Stark wrote:
    On Thu, Jun 6, 2013 at 1:39 PM, Heikki Linnakangas
    wrote:
    That will keep OldestXmin from advancing. Which will keep vacuum from
    advancing relfrozenxid/datfrozenxid. Which will first trigger the
    warnings
    about wrap-around, then stops new XIDs from being generated, and finally
    a
    forced shutdown.

    The forced shutdown will actually happen some time before going beyond 2
    billion XIDs. So it is not possible to have a long-lived transaction,
    older
    than 2 B XIDs, still live in the system. But let's imagine that you
    somehow
    bypass the safety mechanism:

    Ah, so if you do the epoch in the page header thing or Robert's LSN
    trick that I didn't follow then you'll need a new safety check against
    this. Since relfrozenxid/datfrozenxid will no longer be necessary.

    Nothing proposed here gets rid of either of those, that I can see.

    Right. The meaning of relfrozenxid/datfrozenxid changes somewhat; it no
    longer means that all XIDs older than frozenxid are replaced with FrozenXid.
    Instead, it will mean that all XIDs older than frozenxid are committed, ie.
    all dead tuples older than that have been vacuumed away.
    Now that I consider Greg's line of thought, the idea we focused on
    here was about avoiding freezing. But Greg makes me think that we may
    also wish to look at allowing queries to run longer than one epoch as
    well, if the epoch wrap time is likely to come down substantially.

    To do that I think we'll need to hold epoch for relfrozenxid as well,
    amongst other things.

    --
      Simon Riggs http://www.2ndQuadrant.com/
      PostgreSQL Development, 24x7 Support, Training & Services
  • Heikki Linnakangas at Jun 7, 2013 at 6:56 pm

    On 07.06.2013 21:33, Simon Riggs wrote:
    Now that I consider Greg's line of thought, the idea we focused on
    here was about avoiding freezing. But Greg makes me think that we may
    also wish to look at allowing queries to run longer than one epoch as
    well, if the epoch wrap time is likely to come down substantially.

    To do that I think we'll need to hold epoch for relfrozenxid as well,
    amongst other things.
    The biggest problem I see with that is that if a snapshot can be older
    than 2 billion XIDs, it must be possible to store XIDs on the same page
    that are more than 2 billion XIDs apart. All the discussed schemes where
    we store the epoch at the page level, either explicitly or derived from
    the LSN, rely on the fact that it's not currently necessary to do that.
    Currently, when one XID on a page is older than 2 billion XIDs, that old
    XID can always be replaced with FrozenXid, because there cannot be a
    snapshot old enough to not see it.

    I agree that it would be nice if you could find a way around that. You
    had a suggestion on making room on the tuple header for the epoch. I'm
    not sure I like that particular proposal, but we would need something
    like that. If we settle for snapshots that can be at most, say, 512
    billion transactions old, instead of 2 billion, then we would only need
    one byte to store an epoch "offset" in the tuple header. Ie. deduce the
    epoch of tuples on the page from the LSN on the page header, but allow
    individual tuples to specify an offset from that deduced epoch.

    In practice, I think we're still quite far from people running into that
    2 billion XID limit on snapshot age. But maybe in a few years, after
    we've solved all the more pressing vacuum and wraparound issues that
    people currently run into before reaching that stage...

    - Heikki
  • Simon Riggs at Jun 7, 2013 at 7:11 pm

    On 7 June 2013 19:56, Heikki Linnakangas wrote:
    On 07.06.2013 21:33, Simon Riggs wrote:

    Now that I consider Greg's line of thought, the idea we focused on
    here was about avoiding freezing. But Greg makes me think that we may
    also wish to look at allowing queries to run longer than one epoch as
    well, if the epoch wrap time is likely to come down substantially.

    To do that I think we'll need to hold epoch for relfrozenxid as well,
    amongst other things.
    The biggest problem I see with that is that if a snapshot can be older than
    2 billion XIDs, it must be possible to store XIDs on the same page that are
    more than 2 billion XIDs apart. All the discussed schemes where we store the
    epoch at the page level, either explicitly or derived from the LSN, rely on
    the fact that it's not currently necessary to do that. Currently, when one
    XID on a page is older than 2 billion XIDs, that old XID can always be
    replaced with FrozenXid, because there cannot be a snapshot old enough to
    not see it.
    It does seem that there are two problems: avoiding freezing AND long
    running queries

    The long running query problem hasn't ever been looked at, it seems,
    until here and now.
    I agree that it would be nice if you could find a way around that. You had a
    suggestion on making room on the tuple header for the epoch. I'm not sure I
    like that particular proposal, but we would need something like that. If we
    settle for snapshots that can be at most, say, 512 billion transactions old,
    instead of 2 billion, then we would only need one byte to store an epoch
    "offset" in the tuple header. Ie. deduce the epoch of tuples on the page
    from the LSN on the page header, but allow individual tuples to specify an
    offset from that deduced epoch.
    I like the modification you propose. And I like it even better because
    it uses just 1 byte, which is even more easily squeezed into the
    existing tuple header, whether we go with my proposed squeezing route
    or not.
    In practice, I think we're still quite far from people running into that 2
    billion XID limit on snapshot age. But maybe in a few years, after we've
    solved all the more pressing vacuum and wraparound issues that people
    currently run into before reaching that stage...
    Your WALInsert lock patch will fix that. ;-)

    --
      Simon Riggs http://www.2ndQuadrant.com/
      PostgreSQL Development, 24x7 Support, Training & Services
  • Robert Haas at Jun 7, 2013 at 7:16 pm

    On Fri, Jun 7, 2013 at 3:10 PM, Simon Riggs wrote:
    The long running query problem hasn't ever been looked at, it seems,
    until here and now.
    For what it's worth (and that may not be much), I think most people
    will die a horrible death due to bloat after holding a transaction
    open for a tiny fraction of 2B XIDs. :-(

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Heikki Linnakangas at Jun 7, 2013 at 7:45 pm

    On 07.06.2013 22:15, Robert Haas wrote:
    On Fri, Jun 7, 2013 at 3:10 PM, Simon Riggs wrote:
    The long running query problem hasn't ever been looked at, it seems,
    until here and now.
    For what it's worth (and that may not be much), I think most people
    will die a horrible death due to bloat after holding a transaction
    open for a tiny fraction of 2B XIDs. :-(
    Yeah, we should fix that too ;-).

    While we're at it: I've been thinking that we should try harder to
    vacuum dead tuples that are no longer visible to any snapshot, even if
    there's an even older snapshot. The stereotypical scenario is a table
    with a single row that's updated very, very frequently. Like a counter.
    Autovacuum can normally keep it in check, but if you have a long-running
    transaction, it will bloat horrendously. But if you only have one
    long-running transaction with one really old snapshot, and everything
    else is recent, you'd really only need to keep one old tuple around for
    the old snapshot to see, and a recent version or two for the rest. At
    worst, the database needs to bloat to double the size, but not more than
    that.

    To know which tuples are dead at such a fine-grained level, vacuum would
    need to know in more detail what snapshots the backends have. I'm really
    excited about Ants Aasma's proposal to use a CSN for snapshots, or more
    precisely the variant using the commit record's LSN for that. If a
    snapshot is just a single integer, it becomes easier for backends to
    share their snapshots in a limited amount of shared memory.

    - Heikki
  • Andres Freund at Jun 7, 2013 at 7:16 pm

    On 2013-06-07 20:10:55 +0100, Simon Riggs wrote:
    On 7 June 2013 19:56, Heikki Linnakangas wrote:
    On 07.06.2013 21:33, Simon Riggs wrote:

    Now that I consider Greg's line of thought, the idea we focused on
    here was about avoiding freezing. But Greg makes me think that we may
    also wish to look at allowing queries to run longer than one epoch as
    well, if the epoch wrap time is likely to come down substantially.

    To do that I think we'll need to hold epoch for relfrozenxid as well,
    amongst other things.
    The biggest problem I see with that is that if a snapshot can be older than
    2 billion XIDs, it must be possible to store XIDs on the same page that are
    more than 2 billion XIDs apart. All the discussed schemes where we store the
    epoch at the page level, either explicitly or derived from the LSN, rely on
    the fact that it's not currently necessary to do that. Currently, when one
    XID on a page is older than 2 billion XIDs, that old XID can always be
    replaced with FrozenXid, because there cannot be a snapshot old enough to
    not see it.
    It does seem that there are two problems: avoiding freezing AND long
    running queries

    The long running query problem hasn't ever been looked at, it seems,
    until here and now.
    I'd say that's because it's prohibitive to run such long transactions
    anyway, since they cause too much unremovable bloat. 2 billion
    transactions really is quite a bit; I don't think it's a relevant
    restriction. Yet.

    Let's discuss it if we have solved the other problems ;)

    Greetings,

    Andres Freund

    --
      Andres Freund http://www.2ndQuadrant.com/
      PostgreSQL Development, 24x7 Support, Training & Services
  • Simon Riggs at Jun 7, 2013 at 7:30 pm

    On 7 June 2013 20:16, Andres Freund wrote:
    On 2013-06-07 20:10:55 +0100, Simon Riggs wrote:
    On 7 June 2013 19:56, Heikki Linnakangas wrote:
    On 07.06.2013 21:33, Simon Riggs wrote:

    Now that I consider Greg's line of thought, the idea we focused on
    here was about avoiding freezing. But Greg makes me think that we may
    also wish to look at allowing queries to run longer than one epoch as
    well, if the epoch wrap time is likely to come down substantially.

    To do that I think we'll need to hold epoch for relfrozenxid as well,
    amongst other things.
    The biggest problem I see with that is that if a snapshot can be older than
    2 billion XIDs, it must be possible to store XIDs on the same page that are
    more than 2 billion XIDs apart. All the discussed schemes where we store the
    epoch at the page level, either explicitly or derived from the LSN, rely on
    the fact that it's not currently necessary to do that. Currently, when one
    XID on a page is older than 2 billion XIDs, that old XID can always be
    replaced with FrozenXid, because there cannot be a snapshot old enough to
    not see it.
    It does seem that there are two problems: avoiding freezing AND long
    running queries

    The long running query problem hasn't ever been looked at, it seems,
    until here and now.
    I'd say that's because it's prohibitive to run so long transactions
    anyway since it causes too much unremovable bloat. 2bio transactions
    really is a quite a bit, I don't think it's a relevant restriction. Yet.

    Let's discuss it if we have solved the other problems ;)
    Let me say that I think that problem is solvable also.

    At the moment we allow all visible tuple versions to be linked
    together, so that the last visible and latest update are linked by a
    chain. If we break that assumption and say that we will never follow
    an update chain from a snapshot in the distant past, then we can
    remove intermediate dead rows. We currently regard those as recently
    dead, but that just requires some extra thought. We still keep all
    *visible* tuple versions, we just don't bother to keep all the
    intermediate ones as well.

    Perhaps another day, but one day.

    --
      Simon Riggs http://www.2ndQuadrant.com/
      PostgreSQL Development, 24x7 Support, Training & Services
  • Hannu Krosing at Jun 7, 2013 at 8:28 pm

    On 06/07/2013 09:16 PM, Andres Freund wrote:
    On 2013-06-07 20:10:55 +0100, Simon Riggs wrote:
    On 7 June 2013 19:56, Heikki Linnakangas wrote:
    On 07.06.2013 21:33, Simon Riggs wrote:
    Now that I consider Greg's line of thought, the idea we focused on
    here was about avoiding freezing. But Greg makes me think that we may
    also wish to look at allowing queries to run longer than one epoch as
    well, if the epoch wrap time is likely to come down substantially.

    To do that I think we'll need to hold epoch for relfrozenxid as well,
    amongst other things.
    The biggest problem I see with that is that if a snapshot can be older than
    2 billion XIDs, it must be possible to store XIDs on the same page that are
    more than 2 billion XIDs apart. All the discussed schemes where we store the
    epoch at the page level, either explicitly or derived from the LSN, rely on
    the fact that it's not currently necessary to do that. Currently, when one
    XID on a page is older than 2 billion XIDs, that old XID can always be
    replaced with FrozenXid, because there cannot be a snapshot old enough to
    not see it.
    It does seem that there are two problems: avoiding freezing AND long
    running queries

    The long running query problem hasn't ever been looked at, it seems,
    until here and now.
    I'd say that's because it's prohibitive to run so long transactions
    anyway since it causes too much unremovable bloat. 2bio transactions
    really is a quite a bit, I don't think it's a relevant restriction. Yet.
    2 billion transactions means at least 2G rows inserted, updated or
    deleted, which may or may not be "too much".

    In my simple key/value update tests I was able to do 40k single-transaction
    updates per second on a moderate-sized server. At that rate, 2G transactions
    take about 15 hours, which is not really very much.

    One may for example have an unfinished 2PC transaction lingering over the
    weekend, and while of course one should have monitoring in place to
    avoid this, it still would be nice if it did not cause a database shutdown.



    --
    Hannu Krosing
    PostgreSQL Consultant
    Performance, Scalability and High Availability
    2ndQuadrant Nordic OÜ
  • Hannu Krosing at Jun 7, 2013 at 8:21 pm

    On 06/07/2013 08:56 PM, Heikki Linnakangas wrote:
    On 07.06.2013 21:33, Simon Riggs wrote:
    Now that I consider Greg's line of thought, the idea we focused on
    here was about avoiding freezing. But Greg makes me think that we may
    also wish to look at allowing queries to run longer than one epoch as
    well, if the epoch wrap time is likely to come down substantially.

    To do that I think we'll need to hold epoch for relfrozenxid as well,
    amongst other things.
    The biggest problem I see with that is that if a snapshot can be older
    than 2 billion XIDs, it must be possible to store XIDs on the same
    page that are more than 2 billion XIDs apart.
    It could be possible to recognise the situation and save the new XIDs on
    *another* page?


    --
    Hannu Krosing
    PostgreSQL Consultant
    Performance, Scalability and High Availability
    2ndQuadrant Nordic OÜ
  • Robert Haas at May 31, 2013 at 3:02 am

    On Thu, May 30, 2013 at 2:39 PM, Robert Haas wrote:
    Random thought: Could you compute the reference XID based on the page
    LSN? That would eliminate the storage overhead.
    After mulling this over a bit, I think this is definitely possible.
    We begin a new "half-epoch" every 2 billion transactions. We remember
    the LSN at which the current half-epoch began and the LSN at which the
    previous half-epoch began. When a new half-epoch begins, the first
    backend that wants to stamp a tuple with an XID from the new
    half-epoch must first emit a "new half-epoch" WAL record, which
    becomes the starting LSN for the new half-epoch.

    We define a new page-level bit, something like PD_RECENTLY_FROZEN.
    When this bit is set, it means there are no unfrozen tuples on the
    page with XIDs that predate the current half-epoch. Whenever we know
    this to be true, we set the bit. If the page LSN crosses more than
    one half-epoch boundary at a time, we freeze the page and set the bit.
      If the page LSN crosses exactly one half-epoch boundary, then (1) if
    the bit is set, we clear it and (2) if the bit is not set, we freeze
    the page and set the bit. The advantage of this is that we avoid an
    epidemic of freezing right after a half-epoch change. Immediately
    after a half-epoch change, many pages will mix tuples from the current
    and previous half-epoch - but relatively few pages will have tuples
    from the current half-epoch and a half-epoch more than one in the
    past.
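
    In pseudo-C, the page-modification-time decision might look something
    like this (hand-wavy sketch; half_epoch_boundaries_crossed() and
    freeze_all_tuples() are imaginary helpers, and the PD_RECENTLY_FROZEN
    page macros are hypothetical):

        void
        maybe_freeze_on_modify(Page page)
        {
            /* how many recorded half-epoch start LSNs lie between the
             * page's current LSN and the current WAL insert position */
            int     crossed = half_epoch_boundaries_crossed(PageGetLSN(page));

            if (crossed == 0)
                return;                         /* same half-epoch, nothing to do */

            if (crossed == 1 && PageIsRecentlyFrozen(page))
                PageClearRecentlyFrozen(page);  /* crossed exactly one boundary */
            else
            {
                /* crossed more than one boundary, or the bit wasn't set */
                freeze_all_tuples(page);
                PageSetRecentlyFrozen(page);
            }
        }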

    As things stand today, we really only need to remember the last two
    half-epoch boundaries; they could be stored, for example, in the
    control file. But if we someday generalize CLOG to allow indefinite
    retention as you suggest, we could instead remember all half-epoch
    boundaries that have ever occurred; just maintain a file someplace
    with 8 bytes of data for every 2 billion XIDs consumed over the
    lifetime of the cluster. In fact, we might want to do it that way
    anyhow, just to keep our options open, and perhaps for forensics.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Heikki Linnakangas at Jun 1, 2013 at 6:49 pm

    On 31.05.2013 06:02, Robert Haas wrote:
    On Thu, May 30, 2013 at 2:39 PM, Robert Haas wrote:
    Random thought: Could you compute the reference XID based on the page
    LSN? That would eliminate the storage overhead.
    After mulling this over a bit, I think this is definitely possible.
    We begin a new "half-epoch" every 2 billion transactions. We remember
    the LSN at which the current half-epoch began and the LSN at which the
    previous half-epoch began. When a new half-epoch begins, the first
    backend that wants to stamp a tuple with an XID from the new
    half-epoch must first emit a "new half-epoch" WAL record, which
    becomes the starting LSN for the new half-epoch.
    Clever! Pages in unlogged tables need some extra treatment, as they
    don't normally have a valid LSN, but that shouldn't be too hard.
    We define a new page-level bit, something like PD_RECENTLY_FROZEN.
    When this bit is set, it means there are no unfrozen tuples on the
    page with XIDs that predate the current half-epoch. Whenever we know
    this to be true, we set the bit. If the page LSN crosses more than
    one half-epoch boundary at a time, we freeze the page and set the bit.
    If the page LSN crosses exactly one half-epoch boundary, then (1) if
    the bit is set, we clear it and (2) if the bit is not set, we freeze
    the page and set the bit.
    Yep, I think that would work. Want to write the patch, or should I? ;-)

    - Heikki
  • Simon Riggs at Jun 1, 2013 at 7:23 pm

    On 1 June 2013 19:48, Heikki Linnakangas wrote:
    On 31.05.2013 06:02, Robert Haas wrote:

    On Thu, May 30, 2013 at 2:39 PM, Robert Haas <robertmhaas@gmail.com>
    wrote:
    Random thought: Could you compute the reference XID based on the page
    LSN? That would eliminate the storage overhead.

    After mulling this over a bit, I think this is definitely possible.
    We begin a new "half-epoch" every 2 billion transactions. We remember
    the LSN at which the current half-epoch began and the LSN at which the
    previous half-epoch began. When a new half-epoch begins, the first
    backend that wants to stamp a tuple with an XID from the new
    half-epoch must first emit a "new half-epoch" WAL record, which
    becomes the starting LSN for the new half-epoch.

    Clever! Pages in unlogged tables need some extra treatment, as they don't
    normally have a valid LSN, but that shouldn't be too hard.
    I like the idea of using the LSN to indicate the epoch. It saves any
    other work we might consider, such as setting page or tuple level
    epochs.

    We define a new page-level bit, something like PD_RECENTLY_FROZEN.
    When this bit is set, it means there are no unfrozen tuples on the
    page with XIDs that predate the current half-epoch. Whenever we know
    this to be true, we set the bit. If the page LSN crosses more than
    one half-epoch boundary at a time, we freeze the page and set the bit.
    If the page LSN crosses exactly one half-epoch boundary, then (1) if
    the bit is set, we clear it and (2) if the bit is not set, we freeze
    the page and set the bit.

    Yep, I think that would work. Want to write the patch, or should I? ;-)
    If we set a bit, surely we need to write the page. Isn't that what we
    were trying to avoid?

    Why set a bit at all? If we know the LSN of the page and we know the
    epoch boundaries, then we can work out when the page was last written
    to and infer that the page is "virtually frozen".

    As soon as we make a change to a "virtually frozen" page, we can
    actually freeze it and then make the change.

    But we still have the problem of knowing which pages have been frozen
    and which haven't.

    Can we clear up those points first? Or at least my understanding of them.

    --
      Simon Riggs http://www.2ndQuadrant.com/
      PostgreSQL Development, 24x7 Support, Training & Services
  • Robert Haas at Jun 1, 2013 at 8:26 pm

    On Sat, Jun 1, 2013 at 3:22 PM, Simon Riggs wrote:
    If we set a bit, surely we need to write the page. Isn't that what we
    were trying to avoid?
    No, the bit only gets set in situations when we were going to dirty
    the page for some other reason anyway. Specifically, if a page
    modification discovers that we've switched epochs (but just once) and
    the bit isn't already set, we can set it in lieu of scanning the
    entire page for tuples that need freezing.

    Under this proposal, pages that don't contain any dead tuples needn't
    be dirtied for freezing, ever. Smells like awesome.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Simon Riggs at Jun 1, 2013 at 10:56 pm

    On 1 June 2013 21:26, Robert Haas wrote:
    On Sat, Jun 1, 2013 at 3:22 PM, Simon Riggs wrote:
    If we set a bit, surely we need to write the page. Isn't that what we
    were trying to avoid?
    No, the bit only gets set in situations when we were going to dirty
    the page for some other reason anyway. Specifically, if a page
    modification discovers that we've switched epochs (but just once) and
    the bit isn't already set, we can set it in lieu of scanning the
    entire page for tuples that need freezing.

    Under this proposal, pages that don't contain any dead tuples needn't
    be dirtied for freezing, ever. Smells like awesome.
    Agreed, well done both.

    What I especially like about it is how little logic it will require,
    and no page format changes.

    --
      Simon Riggs http://www.2ndQuadrant.com/
      PostgreSQL Development, 24x7 Support, Training & Services
  • Robert Haas at Jun 1, 2013 at 8:21 pm

    On Sat, Jun 1, 2013 at 2:48 PM, Heikki Linnakangas wrote:
    We define a new page-level bit, something like PD_RECENTLY_FROZEN.
    When this bit is set, it means there are no unfrozen tuples on the
    page with XIDs that predate the current half-epoch. Whenever we know
    this to be true, we set the bit. If the page LSN crosses more than
    one half-epoch boundary at a time, we freeze the page and set the bit.
    If the page LSN crosses exactly one half-epoch boundary, then (1) if
    the bit is set, we clear it and (2) if the bit is not set, we freeze
    the page and set the bit.
    Yep, I think that would work. Want to write the patch, or should I? ;-)
    Have at it. I think the tricky part is going to be figuring out the
    synchronization around half-epoch boundaries.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Heikki Linnakangas at Jun 10, 2013 at 6:58 pm

    On 01.06.2013 23:21, Robert Haas wrote:
    On Sat, Jun 1, 2013 at 2:48 PM, Heikki Linnakangas
    wrote:
    We define a new page-level bit, something like PD_RECENTLY_FROZEN.
    When this bit is set, it means there are no unfrozen tuples on the
    page with XIDs that predate the current half-epoch. Whenever we know
    this to be true, we set the bit. If the page LSN crosses more than
    one half-epoch boundary at a time, we freeze the page and set the bit.
    If the page LSN crosses exactly one half-epoch boundary, then (1) if
    the bit is set, we clear it and (2) if the bit is not set, we freeze
    the page and set the bit.
    Yep, I think that would work. Want to write the patch, or should I? ;-)
    Have at it.
    Here's a first draft. A lot of stuff missing and broken, but "make
    check" passes :-).

    In the patch, instead of working with "half-epochs", there are "XID-LSN
    ranges", which can be of arbitrary size. An XID-LSN range consists of
    three values:

    minlsn: The point in WAL where this range begins.
    minxid - maxxid: The range of XIDs allowed in this range.

    Every point in WAL belongs to exactly one range. The minxid-maxxid of
    the ranges can overlap. For example:

    1. XIDs 25000942 - 27000003 LSN 0/3BB9938
    2. XIDs 23000742 - 26000003 LSN 0/2AB9288
    3. XIDs 22000721 - 25000003 LSN 0/1AB8BE0
    4. XIDs 22000002 - 24000003 LSN 0/10B1550

    The invariant with the ranges is that a page with a certain LSN is only
    allowed to contain XIDs that belong to the range specified by that LSN.
    For example, if a page has LSN 0/3500000, it belongs to the 2nd range,
    and can only contain XIDs between 23000742 - 26000003. If a backend
    updates the page, so that its LSN is updated to, say, 0/3D12345, all
    XIDs on the page older than 25000942 must be frozen first, to avoid
    violating the rule.
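
    In simplified form (not the actual patch code; freeze_xids_older_than()
    is an imaginary helper), the check before a modification moves a page's
    LSN into the latest range is:

        typedef struct XidLSNRange
        {
            XLogRecPtr      minlsn;     /* WAL position where the range begins */
            TransactionId   minxid;     /* oldest XID allowed under this range */
            TransactionId   maxxid;     /* newest XID allowed under this range */
        } XidLSNRange;

        static void
        prepare_page_for_update(Page page, XidLSNRange *latest)
        {
            /* page was last modified before the latest range began, so
             * freeze everything older than the latest range's minxid
             * before stamping it with a new LSN */
            if (PageGetLSN(page) < latest->minlsn)
                freeze_xids_older_than(page, latest->minxid);   /* imaginary */
        }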

    The system keeps track of a small number of these XID-LSN ranges. Where
    we currently truncate clog, we can also truncate the ranges with maxxid
    < the clog truncation point. Vacuum removes any dead tuples and updates
    relfrozenxid as usual, to make sure that there are no dead tuples or
    aborted XIDs older than the minxid of the oldest tracked XID-LSN range.
    It no longer needs to freeze old committed XIDs, however - that's the
    gain from this patch (except to uphold the invariant, if it has to
    remove some dead tuples on the page and update its LSN).

    A new range is created whenever we reach the maxxid on the current one.
    The new range's minxid is set to the current global oldest xmin value,
    and maxxid is just the old maxxid plus a fixed number (1 million in the
    patch, but we probably want a higher value in reality). This ensures
    that if you modify a page and update its LSN, all the existing XIDs on
    the page that cannot be frozen yet are greater than the minxid of the
    latest range. In other words, you can always freeze old XIDs on a page,
    so that any remaining non-frozen XIDs are within the minxid-maxxid of
    the latest range.

    The HeapTupleSatisfies functions are modified to look at the page's LSN
    first. If it's old enough, they don't look at the XIDs on the page at
    all, and just consider everything on the page to be visible to everyone
    (I'm calling this state a "mature page").
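
    Schematically, that shortcut is just (simplified, not the patch as-is):

        /* A page whose LSN predates the oldest tracked XID-LSN range is
         * "mature": its clog has been truncated away, so every tuple on
         * it is known to be committed and visible to everyone. */
        static bool
        page_is_mature(Page page, XLogRecPtr oldest_tracked_minlsn)
        {
            return PageGetLSN(page) < oldest_tracked_minlsn;
        }
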
    I think the tricky part is going to be figuring out the
    synchronization around half-epoch boundaries.
    Yep. I skipped all those difficult parts in this first version. There
    are two race conditions that need to be fixed:

    1. When a page is updated, we check whether it needs to be frozen, by
    checking whether its LSN is greater than the latest range's LSN - IOW,
    whether we've already modified the page, and thus frozen all older
    tuples, within the current range. However, it's possible that a new
    range is created immediately after we've checked that. When we then
    proceed to do the actual update on the page and WAL-log that, the new
    LSN falls within the next range, and we should've frozen the page. I'm
    planning to fix that by adding a "parity bit" on the page header. Each
    XID-LSN range is assigned a parity bit, 0 or 1. When we check if a page
    needs to be frozen on update, we make note of the latest range's parity
    bit, and write it in the page header. Later, when we look at the page's
    LSN to determine which XID-LSN range it belongs to, we compare the
    parity. If the parity doesn't match, we know that the race condition
    happened, so we treat the page as belonging to the previous range, not
    the one it would normally belong to, per the LSN.

    2. When we look at a page, and determine that it's not old enough to be
    "matured", we then check the clog as usual. A page is considered mature,
    if the XID-LSN range (and corresponding clog pages) has already been
    truncated away. It's possible that between those steps, the XID-LSN
    range and clog is truncated away, so that the backend tries to access a
    clog page that doesn't exist anymore. To fix that, the XID-LSN range and
    clog truncation needs to be done in two steps. First, mark the
    truncation point in shared memory. Then somehow wait until all backends
    see the new value, and go ahead with actually truncating the clog only
    after that.


    Aside from those two race conditions, there are plenty of scalability
    issues remaining. Currently, the shared XID-LSN range array is checked
    every time a page is accessed, so it could quickly become a bottleneck.
    Need to cache that information in each backend. Oh, and I didn't
    implement the PD_RECENTLY_FROZEN bit in the page header yet, so you will
    get a freezing frenzy right after a new XID-LSN range is created.

    I'll keep hacking away on those things, but please let me know if you
    see some fatal flaw with this plan.

    - Heikki
  • Simon Riggs at Jun 10, 2013 at 8:48 pm

    On 10 June 2013 19:58, Heikki Linnakangas wrote:
    On 01.06.2013 23:21, Robert Haas wrote:

    On Sat, Jun 1, 2013 at 2:48 PM, Heikki Linnakangas
    wrote:
    We define a new page-level bit, something like PD_RECENTLY_FROZEN.
    When this bit is set, it means there are no unfrozen tuples on the
    page with XIDs that predate the current half-epoch. Whenever we know
    this to be true, we set the bit. If the page LSN crosses more than
    one half-epoch boundary at a time, we freeze the page and set the bit.
    If the page LSN crosses exactly one half-epoch boundary, then (1) if
    the bit is set, we clear it and (2) if the bit is not set, we freeze
    the page and set the bit.

    Yep, I think that would work. Want to write the patch, or should I? ;-)

    Have at it.

    Here's a first draft. A lot of stuff missing and broken, but "make check"
    passes :-).
    Well done, looks like good progress.

    --
      Simon Riggs http://www.2ndQuadrant.com/
      PostgreSQL Development, 24x7 Support, Training & Services
  • Robert Haas at Jun 10, 2013 at 10:10 pm

    On Mon, Jun 10, 2013 at 4:48 PM, Simon Riggs wrote:
    Well done, looks like good progress.
    +1.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Heikki Linnakangas at Aug 27, 2013 at 3:56 pm

    On 10.06.2013 21:58, Heikki Linnakangas wrote:
    On 01.06.2013 23:21, Robert Haas wrote:
    On Sat, Jun 1, 2013 at 2:48 PM, Heikki Linnakangas
    wrote:
    We define a new page-level bit, something like PD_RECENTLY_FROZEN.
    When this bit is set, it means there are no unfrozen tuples on the
    page with XIDs that predate the current half-epoch. Whenever we know
    this to be true, we set the bit. If the page LSN crosses more than
    one half-epoch boundary at a time, we freeze the page and set the bit.
    If the page LSN crosses exactly one half-epoch boundary, then (1) if
    the bit is set, we clear it and (2) if the bit is not set, we freeze
    the page and set the bit.
    Yep, I think that would work. Want to write the patch, or should I? ;-)
    Have at it.
    Here's a first draft. A lot of stuff missing and broken, but "make
    check" passes :-).

    In the patch, instead of working with "half-epochs", there are "XID-LSN
    ranges", which can be of arbitrary size. An XID-LSN range consists of
    three values:

    minlsn: The point in WAL where this range begins.
    minxid - maxxid: The range of XIDs allowed in this range.

    Every point in WAL belongs to exactly one range. The minxid-maxxid of
    the ranges can overlap. For example:

    1. XIDs 25000942 - 27000003 LSN 0/3BB9938
    2. XIDs 23000742 - 26000003 LSN 0/2AB9288
    3. XIDs 22000721 - 25000003 LSN 0/1AB8BE0
    4. XIDs 22000002 - 24000003 LSN 0/10B1550

    The invariant with the ranges is that a page with a certain LSN is only
    allowed to contain XIDs that belong to the range specified by that LSN.
    For example, if a page has LSN 0/3500000, it belongs to the 2nd range,
    and can only contain XIDs between 23000742 - 26000003. If a backend
    updates the page, so that it's LSN is updated to, say, 0/3D12345, all
    XIDs on the page older than 25000942 must be frozen first, to avoid
    violating the rule.

    The system keeps track of a small number of these XID-LSN ranges. Where
    we currently truncate clog, we can also truncate the ranges with maxxid
    < the clog truncation point. Vacuum removes any dead tuples and updates
    relfrozenxid as usual, to make sure that there are no dead tuples or
    aborted XIDs older than the minxid of the oldest tracked XID-LSN range.
    It no longer needs to freeze old committed XIDs, however - that's the
    gain from this patch (except to uphold the invariant, if it has to
    remove some dead tuples on the page and update its LSN).

    A new range is created whenever we reach the maxxid on the current one.
    The new range's minxid is set to the current global oldest xmin value,
    and maxxid is just the old maxxid plus a fixed number (1 million in the
    patch, but we probably want a higher value in reality). This ensures
    that if you modify a page and update its LSN, all the existing XIDs on
    the page that cannot be frozen yet are greater than the minxid of the
    latest range. In other words, you can always freeze old XIDs on a page,
    so that any remaining non-frozen XIDs are within the minxid-maxxid of
    the latest range.

    The HeapTupleSatisfies functions are modified to look at the page's LSN
    first. If it's old enough, it doesn't look at the XIDs on the page level
    at all, and just considers everything on the page is visible to everyone
    (I'm calling this state a "mature page").
    I think the tricky part is going to be figuring out the
    synchronization around half-epoch boundaries.
    Yep. I skipped all those difficult parts in this first version. There
    are two race conditions that need to be fixed:

    1. When a page is updated, we check if it needs to be frozen. If its LSN
    is greater than the latest range's LSN. IOW, if we've already modified
    the page, and thus frozen all older tuples, within the current range.
    However, it's possible that a new range is created immediately after
    we've checked that. When we then proceed to do the actual update on the
    page and WAL-log that, the new LSN falls within the next range, and we
    should've frozen the page. I'm planning to fix that by adding a "parity
    bit" on the page header. Each XID-LSN range is assigned a parity bit, 0
    or 1. When we check if a page needs to be frozen on update, we make note
    of the latest range's parity bit, and write it in the page header.
    Later, when we look at the page's LSN to determine which XID-LSN range
    it belongs to, we compare the parity. If the parity doesn't match, we
    know that the race condition happened, so we treat the page to belong to
    the previous range, not the one it would normally belong to, per the LSN.
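
    Something along these lines - a simplified sketch with made-up field and
    function names, not the real page header layout or the patch code:

    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t XLogRecPtr;

    /* Just the fields relevant to the race. */
    typedef struct PageSketch
    {
        XLogRecPtr lsn;
        uint8_t    xid_parity;  /* parity of the range the page was frozen under */
    } PageSketch;

    typedef struct RangeSketch
    {
        XLogRecPtr minlsn;
        uint8_t    parity;      /* alternates 0/1 with every new range */
    } RangeSketch;

    /*
     * Update path: after freezing whatever the latest range requires, stamp
     * the page with that range's parity - this happens before the WAL record
     * (and hence the new LSN) exists.
     */
    static void
    stamp_parity(PageSketch *page, const RangeSketch *latest)
    {
        /* ... freeze all XIDs older than latest range's minxid here ... */
        page->xid_parity = latest->parity;
    }

    /*
     * Read path: the page's LSN says it belongs to 'by_lsn', but if a new
     * range was created between the freeze check and the WAL insertion, the
     * stored parity won't match, and the page really belongs to the
     * previous range.
     */
    static const RangeSketch *
    effective_range(const PageSketch *page,
                    const RangeSketch *by_lsn,
                    const RangeSketch *previous)
    {
        return (page->xid_parity == by_lsn->parity) ? by_lsn : previous;
    }

    int
    main(void)
    {
        RangeSketch old_range = { 0x2AB9288, 0 };
        RangeSketch new_range = { 0x3BB9938, 1 };
        PageSketch  page = { 0x3BC0000, 0 };   /* LSN lands in the new range... */

        stamp_parity(&page, &old_range);       /* ...but was frozen under the old one */
        printf("treated as the range starting at %lX\n",
               (unsigned long) effective_range(&page, &new_range, &old_range)->minlsn);
        return 0;
    }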

    2. When we look at a page and determine that it's not old enough to be
    "mature", we then check the clog as usual. (A page is considered mature
    if its XID-LSN range, and the corresponding clog pages, have already been
    truncated away.) It's possible that between those two steps the XID-LSN
    range and clog are truncated away, so that the backend tries to access a
    clog page that doesn't exist anymore. To fix that, the XID-LSN range and
    clog truncation needs to be done in two steps. First, mark the
    truncation point in shared memory. Then somehow wait until all backends
    see the new value, and go ahead with actually truncating the clog only
    after that.
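
    In outline, something like this - the helpers are placeholders, and the
    actual wait mechanism is exactly the open question here:

    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t TransactionId;

    /* The advertised truncation point; in the real patch this would live in
     * shared memory with proper locking/barriers. */
    static volatile TransactionId shared_truncation_xid = 22000002;

    static void
    wait_until_all_backends_have_seen_it(void)
    {
        /* placeholder: e.g. wait for every backend that might still be using
         * the old value to move past it */
    }

    static void
    physically_truncate_clog_and_ranges(TransactionId up_to)
    {
        printf("truncating clog and XID-LSN ranges below XID %u\n", up_to);
    }

    static void
    advance_truncation_point(TransactionId new_oldest_xid)
    {
        /* Step 1: publish the new truncation point (the new oldest tracked
         * XID-LSN range) in shared memory. */
        shared_truncation_xid = new_oldest_xid;

        /* Step 2: wait until no backend can still be relying on the old value. */
        wait_until_all_backends_have_seen_it();

        /* Step 3: only now remove the clog pages and ranges, so that nobody
         * can try to read a clog page that has just disappeared. */
        physically_truncate_clog_and_ranges(new_oldest_xid);
    }

    int
    main(void)
    {
        advance_truncation_point(23000742);
        return 0;
    }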

    Aside from those two race conditions, there are plenty of scalability
    issues remaining. Currently, the shared XID-LSN range array is checked
    every time a page is accessed, so it could quickly become a bottleneck.
    Need to cache that information in each backend. Oh, and I didn't
    implement the PD_RECENTLY_FROZEN bit in the page header yet, so you will
    get a freezing frenzy right after a new XID-LSN range is created.

    I'll keep hacking away on those things, but please let me know if you
    see some fatal flaw with this plan.
    Here's an updated patch. The race conditions I mentioned above have been
    fixed.

    This is still definitely work-in-progress, but overall I think this is
    quite promising. The patch basically works, although there are a bunch
    of TODO items like handling unlogged tables.

    This patch is also available in my git repository at
    git://git.postgresql.org/git/users/heikki/postgres.git, branch
    "freeze-by-xid-lsn-ranges".

    - Heikki
  • Heikki Linnakangas at Aug 27, 2013 at 4:37 pm

    On 27.08.2013 18:56, Heikki Linnakangas wrote:
    Here's an updated patch.
    Ah, forgot one thing:

    Here's a little extension I've been using to test this. It contains two
    functions; one to simply consume N xids, making it faster to hit the
    creation of new XID ranges and wraparound. The other,
    print_xidlsnranges(), prints out the contents of the current XID-LSN
    range array.

    Also, just ran into two silly bugs in the patch version I posted, while
    checking that the xidtest extension hasn't bitrotted. A fixed version
    has been pushed to the git repository, so make sure you use that version
    if you want to actually run it.

    - Heikki
  • Andres Freund at Aug 30, 2013 at 6:34 pm
    Hi Heikki,
    On 2013-08-27 18:56:15 +0300, Heikki Linnakangas wrote:
    Here's an updated patch. The race conditions I mentioned above have been
    fixed.
    Thanks for posting the new version!

    I have a quick question: The reason I'd asked about the status of the
    patch was that I was thinking about the state of the "forensic freezing"
    patch. After a quick look at your proposal, it seems we still need to
    freeze in some situations (old & new data on the same page, basically),
    so I'd say it still makes sense to apply the forensic freezing patch, right?

    Differing Opinions?

    Greetings,

    Andres Freund

    --
      Andres Freund http://www.2ndQuadrant.com/
      PostgreSQL Development, 24x7 Support, Training & Services
  • Simon Riggs at Jun 1, 2013 at 2:17 pm

    On 30 May 2013 19:39, Robert Haas wrote:
    On Thu, May 30, 2013 at 9:33 AM, Heikki Linnakangas
    wrote:
    The reason we have to freeze is that otherwise our 32-bit XIDs wrap around
    and become ambiguous. The obvious solution is to extend XIDs to 64 bits, but
    that would waste a lot space. The trick is to add a field to the page header
    indicating the 'epoch' of the XID, while keeping the XIDs in tuple header
    32-bit wide (*). Check.
    The other reason we freeze is to truncate the clog. But with 64-bit XIDs, we
    wouldn't actually need to change old XIDs on disk to FrozenXid. Instead, we
    could implicitly treat anything older than relfrozenxid as frozen. Check.
    That's the basic idea. Vacuum freeze only needs to remove dead tuples, but
    doesn't need to dirty pages that contain no dead tuples.
    Check.
    Yes, this is the critical point. Large insert-only tables don't need
    to be completely re-written twice.

    Since we're not storing 64-bit wide XIDs on every tuple, we'd still need to
    replace the XIDs with FrozenXid whenever the difference between the smallest
    and largest XID on a page exceeds 2^31. But that would only happen when
    you're updating the page, in which case the page is dirtied anyway, so it
    wouldn't cause any extra I/O.
    It would cause some extra WAL activity, but it wouldn't dirty the page
    an extra time.
    This would also be the first step in allowing the clog to grow larger than 2
    billion transactions, eliminating the need for anti-wraparound freezing
    altogether. You'd still want to truncate the clog eventually, but it would
    be nice to not be pressed against the wall with "run vacuum freeze now, or
    the system will shut down".
    Interesting. That seems like a major advantage.
    (*) "Adding an epoch" is inaccurate, but I like to use that as my mental
    model. If you just add a 32-bit epoch field, then you cannot have xids from
    different epochs on the page, which would be a problem. In reality, you
    would store one 64-bit XID value in the page header, and use that as the
    "reference point" for all the 32-bit XIDs on the tuples. See existing
    convert_txid() function for how that works. Another method is to store the
    32-bit xid values in tuple headers as offsets from the per-page 64-bit
    value, but then you'd always need to have the 64-bit value at hand when
    interpreting the XIDs, even if they're all recent.
    As I see it, the main downsides of this approach are:

    (1) It breaks binary compatibility (unless you do something to
    provided for it, like put the epoch in the special space).

    (2) It consumes 8 bytes per page. I think it would be possible to get
    this down to say 5 bytes per page pretty easily; we'd simply decide
    that the low-order 3 bytes of the reference XID must always be 0.
    Possibly you could even do with 4 bytes, or 4 bytes plus some number
    of extra bits.
    Yes, the idea of having a "base Xid" on every page is complicated and
    breaks compatibility. Same idea can work well if we do this via tuple
    headers.

    (3) You still need to periodically scan the entire relation, or else
    have a freeze map as Simon and Josh suggested.
    I don't think that is needed with this approach.

    (The freeze map was Andres' idea, not mine. I just accepted it as what
    I thought was the only way forwards. Now I see other ways)
    The upsides of this approach as compared with what Andres and I are
    proposing are:

    (1) It provides a stepping stone towards allowing indefinite expansion
    of CLOG, which is quite appealing as an alternative to a hard
    shut-down.
    I would be against expansion of the CLOG beyond its current size. If
    we have removed all aborted rows and marked hints, then we don't need
    the CLOG values and can trim that down.

    I don't mind the hints; it's the freezing we don't need.

    convert_txid() function for how that works. Another method is to store the
    32-bit xid values in tuple headers as offsets from the per-page 64-bit
    value, but then you'd always need to have the 64-bit value at hand when
    interpreting the XIDs, even if they're all recent.
    You've touched here on the idea of putting the epoch in the tuple
    header, which is where the proposal I posted comes in. We don't need
    anything at page level; we just need something on each tuple.

    Please can you look at my recent post on how to put this in the tuple header?

    --
      Simon Riggs http://www.2ndQuadrant.com/
      PostgreSQL Development, 24x7 Support, Training & Services
  • Bruce Momjian at May 30, 2013 at 9:06 pm

    On Thu, May 30, 2013 at 04:33:50PM +0300, Heikki Linnakangas wrote:
    This would also be the first step in allowing the clog to grow
    larger than 2 billion transactions, eliminating the need for
    anti-wraparound freezing altogether. You'd still want to truncate
    the clog eventually, but it would be nice to not be pressed against
    the wall with "run vacuum freeze now, or the system will shut down".
    Keep in mind that autovacuum_freeze_max_age is 200M to allow faster clog
    truncation. Does this help that?

    --
       Bruce Momjian <bruce@momjian.us> http://momjian.us
       EnterpriseDB http://enterprisedb.com

       + It's impossible for everything to be true. +
  • Heikki Linnakangas at May 31, 2013 at 11:12 am

    On 31.05.2013 00:06, Bruce Momjian wrote:
    On Thu, May 30, 2013 at 04:33:50PM +0300, Heikki Linnakangas wrote:
    This would also be the first step in allowing the clog to grow
    larger than 2 billion transactions, eliminating the need for
    anti-wraparound freezing altogether. You'd still want to truncate
    the clog eventually, but it would be nice to not be pressed against
    the wall with "run vacuum freeze now, or the system will shut down".
    Keep in mind that autovacuum_freeze_max_age is 200M to allow faster clog
    truncation. Does this help that?
    No. If you want to keep autovacuum_freeze_max_age set at less than 2
    billion, you don't need support for more than 2 billion transactions.
    But for those who would like to set autovacuum_freeze_max_age higher
    than 2B, it would be nice to allow it.

    Actually, even with autovacuum_freeze_max_age = 200M, it would be nice
    to not have the hard stop at 2 billion, in case autovacuum falls behind
    really badly. With autovacuum_freeze_max_age = 200M, there's a lot of
    safety margin, but with 1000M or so, not so much.

    - Heikki
  • Simon Riggs at Jun 1, 2013 at 2:03 pm

    On 30 May 2013 14:33, Heikki Linnakangas wrote:
    Since we're bashing around ideas around freezing, let me write down the idea
    I've been pondering and discussing with various people for years. I don't
    think I invented this myself, apologies to whoever did for not giving
    credit.

    The reason we have to freeze is that otherwise our 32-bit XIDs wrap around
    and become ambiguous. The obvious solution is to extend XIDs to 64 bits, but
    that would waste a lot space. The trick is to add a field to the page header
    indicating the 'epoch' of the XID, while keeping the XIDs in tuple header
    32-bit wide (*).

    The other reason we freeze is to truncate the clog. But with 64-bit XIDs, we
    wouldn't actually need to change old XIDs on disk to FrozenXid. Instead, we
    could implicitly treat anything older than relfrozenxid as frozen.

    That's the basic idea. Vacuum freeze only needs to remove dead tuples, but
    doesn't need to dirty pages that contain no dead tuples.
    I have to say this is pretty spooky. I'd not read hackers all week, so
    I had no idea so many other people were thinking about freezing as
    well. This idea is damn near identical to what I've suggested. My
    suggestion came about because I was looking to remove fields from the
    tuple header, which didn't come to much. The good news is that it's
    complete chance, so it must mean we're on the right track.

    --
      Simon Riggs http://www.2ndQuadrant.com/
      PostgreSQL Development, 24x7 Support, Training & Services
