We currently use 4 int32s to store xmin, xmax, cmin, cmax and xvac on
every heap tuple. That's a lot of overhead, especially on tables with
narrow rows. Reduction in header size would give us considerable space
and I/O savings.

I'm thinking of removing cmin and cmax, and keeping that information in
backend-private memory instead. cmin and cmax are only interesting to
the inserting/deleting transaction, so using precious tuple header space
for that is a waste. This has been discussed before, and Manfred Koizar
even had a patch for 7.4 but didn't submit it because it was incomplete
(http://archives.postgresql.org/pgsql-hackers/2005-09/msg00172.php).
BTW: Manfred, do you still have the patch? It'd be interesting to look
at, even if it's not finished.

This reduces the tuple header size by 4 bytes, or 8 bytes if we can
later get rid of xvac as well.

There are some interesting properties we can exploit in implementing the
backend-private storage:

1. in small OLTP transactions that touch few rows, the cmin/cmax
information will easily fit in memory.

2. if current commandid == 1, every tuple with xmin == current xid is
not visible. Similarly, every tuple with xmax == current xid is visible.
So we don't need to store anything for the first command in a transaction.

3. we can forget modifications by command X as soon as there are no live
snapshots with curcid <= X.

4. we don't need to record the information at all if there are no live
snapshots and we know that the current command is not going to read
existing rows.

These optimizations take care of bulk inserts nicely. In particular,
pg_restore and similar applications wouldn't need to keep track of
inserted records.
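Point 2 can be stated as a pure function of xmin and the current command
id. A minimal C sketch (names and the commandid numbering follow the text
above; this is illustrative, not PostgreSQL source):

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;
typedef uint32_t CommandId;

#define FIRST_COMMAND_ID 1      /* numbering as used in point 2 above */

/* Does deciding visibility of this tuple require consulting the
 * backend-private cmin storage at all? Per point 2, the first command
 * of a transaction never needs it: our own inserts are simply not
 * visible yet, so xmin alone answers the question. */
static bool
need_cmin_lookup(TransactionId tuple_xmin, TransactionId cur_xid,
                 CommandId cur_cid)
{
    if (tuple_xmin != cur_xid)
        return false;   /* inserted by some other xact: plain xmin rules */
    if (cur_cid == FIRST_COMMAND_ID)
        return false;   /* our own insert, cannot be visible yet */
    return true;        /* must look up cmin < cur_cid in local memory */
}
```

This is why a pg_restore-style stream of commands never has to record
anything: each command only inserts, and never needs to re-read its own
transaction's rows.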

By choosing a clever data structure, we might be able to get away
without spilling to disk. For example, a hash table works great for
small transactions. If a transaction does bulk modifications, a bitmap
per relation could be used. And to optimize for the cases when the
information is not actually used, we can just collect the information to
an array, and turn it into a more lookup-friendly data structure the
first time it's needed.
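The "collect into an array, convert on first lookup" shape could look
roughly like this C sketch. All names are hypothetical, and a real
implementation would need to handle duplicate tids and spilling; this
only shows the lazy conversion:

```c
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t CommandId;
typedef struct { uint32_t blkno; uint16_t offno; } TupleId;
typedef struct { TupleId tid; CommandId cmin; } CidEntry;

typedef struct {
    CidEntry *entries;      /* append-only log of this xact's inserts */
    size_t    nentries;
    size_t    capacity;
    int       sorted;       /* lookup-friendly form built lazily */
} CidStore;

static int
cid_entry_cmp(const void *a, const void *b)
{
    const CidEntry *x = a, *y = b;
    if (x->tid.blkno != y->tid.blkno)
        return x->tid.blkno < y->tid.blkno ? -1 : 1;
    if (x->tid.offno != y->tid.offno)
        return x->tid.offno < y->tid.offno ? -1 : 1;
    return 0;
}

/* O(1) append; nothing clever happens while we only write. */
static void
cidstore_record(CidStore *s, TupleId tid, CommandId cmin)
{
    if (s->nentries == s->capacity) {
        s->capacity = s->capacity ? s->capacity * 2 : 16;
        s->entries = realloc(s->entries, s->capacity * sizeof(CidEntry));
    }
    s->entries[s->nentries++] = (CidEntry){ tid, cmin };
    s->sorted = 0;
}

/* The first lookup pays one qsort; later lookups are O(log n). A
 * transaction that never reads its own modifications never sorts. */
static int
cidstore_lookup(CidStore *s, TupleId tid, CommandId *cmin)
{
    CidEntry key = { tid, 0 }, *hit;

    if (!s->sorted) {
        qsort(s->entries, s->nentries, sizeof(CidEntry), cid_entry_cmp);
        s->sorted = 1;
    }
    hit = bsearch(&key, s->entries, s->nentries,
                  sizeof(CidEntry), cid_entry_cmp);
    if (hit == NULL)
        return 0;
    *cmin = hit->cmin;
    return 1;
}
```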

Even if we do have to spill to disk on large transactions that do
massive updates and selects, it would still be a win for most databases.
And it penalizes the transactions that do massive updates, instead of
every transaction in the system as we do now.

Comments?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


  • Tom Lane at Sep 19, 2006 at 2:41 pm

    Heikki Linnakangas writes:
    I'm thinking of removing cmin and cmax, and keeping that information in
    backend-private memory instead.
    I don't believe you can remove *both*. What's been discussed is
    removing one of them, by letting the field represent a lookup key for an
    in-memory structure in the infrequent case that xmin and xmax are both
    the current xact. You solve the table size problem by only having one
    entry for each unique cmin/cmax pair in use.

    regards, tom lane
  • Heikki Linnakangas at Sep 19, 2006 at 2:59 pm

    Tom Lane wrote:
    Heikki Linnakangas <heikki@enterprisedb.com> writes:
    I'm thinking of removing cmin and cmax, and keeping that information in
    backend-private memory instead.
    I don't believe you can remove *both*. What's been discussed is
    removing one of them, by letting the field represent a lookup key for an
    in-memory structure in the infrequent case that xmin and xmax are both
    the current xact. You solve the table size problem by only having one
    entry for each unique cmin/cmax pair in use.
    That's another possibility, but removing both cmin and cmax has also
    been discussed. It's also mentioned in the TODO item:
    One possible solution is to create a phantom cid which represents a
    cmin/cmax pair and is stored in local memory. *Another idea is to store
    both cmin and cmax only in local memory.*

    Saving 4 bytes per tuple with the phantom cid is nice, but saving 8
    bytes (assuming we get rid of xvac in the future, or overlay it with
    xmin for example) is even better.

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Tom Lane at Sep 19, 2006 at 3:57 pm

    Heikki Linnakangas writes:
    Saving 4 bytes per tuple with the phantom cid is nice, but saving 8
    bytes (assuming we get rid of xvac in the future, or overlay it with
    xmin for example) is even better.
    xvac is not going away, so that argument is unconvincing, and I don't
    believe you can avoid blowing out local memory if you have to remember
    each tuple's cmin/cmax separately. (Notice that Manfred gave up on his
    patch for lack of a spill-to-disk mechanism.)

    I'm also concerned about loss of debug traceability if these fields
    disappear entirely from disk --- it's been handy more than once to be
    able to tell where in a complex transaction something happened.

    Lastly, at least on machines with 8-byte MAXALIGN, removing four more
    bytes from heap headers would save nothing. So I'm not excited about
    going through enormous pushups to get rid of both fields, when a far
    simpler and better-performing mechanism suffices to remove one.

    regards, tom lane
  • Gregory Stark at Sep 19, 2006 at 4:25 pm

    Tom Lane writes:

    Heikki Linnakangas <heikki@enterprisedb.com> writes:
    Saving 4 bytes per tuple with the phantom cid is nice, but saving 8
    bytes (assuming we get rid of xvac in the future, or overlay it with
    xmin for example) is even better.
    xvac is not going away, so that argument is unconvincing,
The use case for vacuum full has been narrowing steadily over time. I'm not
sure there's much left of it these days. In every case where it comes up
on-list, people are inevitably just confused and don't need it. In the few
cases where it's actually suggested, we invariably recommend one of the
various commands that rewrite the entire table instead.

    The main fundamental problem with rewriting the entire table is that you may
    not have enough storage to do so and I think there may be better ways of
    avoiding that than always storing 4 bytes in every tuple of every table just
    in case we want to run vacuum full one day.
    and I don't believe you can avoid blowing out local memory if you have to
    remember each tuple's cmin/cmax separately. (Notice that Manfred gave up on
    his patch for lack of a spill-to-disk mechanism.)
Spill to disk would certainly be a requirement. In the cases where it's
needed, the data has to be stored somewhere; it just doesn't have to go
through WAL and take up table space when no other backend will ever be
interested in it.
    I'm also concerned about loss of debug traceability if these fields
    disappear entirely from disk --- it's been handy more than once to be
    able to tell where in a complex transaction something happened.
That's an interesting thought. I'm not sure what scenario you're picturing
though. You would still have xmin/xmax, so when do you need to look at cmin
to know when something happened?

You could certainly imagine a guc setting that would force the spill to
disk to always happen, even when the data isn't needed. It seems like a
poor substitute for some dedicated tracing mechanism targeted specifically
at the info needed for debugging.
    Lastly, at least on machines with 8-byte MAXALIGN, removing four more
    bytes from heap headers would save nothing.
Well, there still exist plenty of 4-byte MAXALIGN machines out there. But
the more serious problem with this argument is that it comes up repeatedly.
If we pass up 4 bytes here and 4 bytes there, pretty soon we're talking
about real data. Well, at least 8 bytes.

    Getting rid of cmin, cmax, xvac and in some cases xmin (as discussed at the
    code sprint) leaves us in much better shape even though any one of those
    doesn't necessarily buy us much.
    So I'm not excited about going through enormous pushups to get rid of both
    fields, when a far simpler and better-performing mechanism suffices to
    remove one.
    Frankly the whole phantom commandid thing sounds more complicated. You *still*
    need a local state data structure that *still* has to spill to disk and now
    it's much harder to characterize how large it will grow since it depends on
    arbitrary combinations of cmin and cmax.

    --
    Gregory Stark
    EnterpriseDB http://www.enterprisedb.com
  • Tom Lane at Sep 19, 2006 at 4:40 pm

    Gregory Stark writes:
    Frankly the whole phantom commandid thing sounds more complicated. You *still*
    need a local state data structure that *still* has to spill to disk and now
    it's much harder to characterize how large it will grow since it depends on
    arbitrary combinations of cmin and cmax.
    Yeah, but it requires only one entry when a command processes
    arbitrarily large numbers of tuples, so in practice it's not going to
    need to spill to disk. What Heikki wants to do will require an entry in
    local memory for *each tuple* modified by a transaction. That will ruin
    performance on a regular basis.

    regards, tom lane
  • Gregory Stark at Sep 19, 2006 at 5:00 pm

    Tom Lane writes:

    Gregory Stark <stark@enterprisedb.com> writes:
    Frankly the whole phantom commandid thing sounds more complicated. You *still*
    need a local state data structure that *still* has to spill to disk and now
    it's much harder to characterize how large it will grow since it depends on
    arbitrary combinations of cmin and cmax.
    Yeah, but it requires only one entry when a command processes
    arbitrarily large numbers of tuples, so in practice it's not going to
    need to spill to disk.
    Well there's a reason we support commandids up to 4 billion. One of the common
    use cases of bulk loading data in a series of individual inserts would cause
    such a structure to spill to disk. As would ISAM style programming that steps
    through a large data set and updates rows one by one.
    What Heikki wants to do will require an entry in local memory for *each
    tuple* modified by a transaction. That will ruin performance on a regular
    basis.
Sure, but that's the same amount of data all those useless cmin/cmaxes are
taking up now; actually it's less, only 6 bytes instead of 8, even assuming
no clever data structures compress it. And that data doesn't have to be
fsynced, so it can sit in filesystem cache and get spooled out to disk
lazily. If you touch a million records in your transaction in one of the
peculiar situations that require keeping the data, you're talking about a
few megs of cache sacrificed during that one operation, versus extra i/o on
every operation.

I should probably let Heikki defend his idea though. I guess I was just
feeling argumentative. I'm sure he's thought through the same things.

    --
    Gregory Stark
    EnterpriseDB http://www.enterprisedb.com
  • Tom Lane at Sep 19, 2006 at 5:05 pm

    Gregory Stark writes:
    Well there's a reason we support commandids up to 4 billion. One of the common
    use cases of bulk loading data in a series of individual inserts would cause
    such a structure to spill to disk. As would ISAM style programming that steps
    through a large data set and updates rows one by one.
    You're missing the point though, which is that no memory entry is needed
    at all unless the same tuple has been both inserted and deleted in the
    current transaction. Bulk data loads will incur zero entries in this
    scheme, whereas what Heikki has in mind will incur an entry per tuple.

    regards, tom lane
  • Heikki Linnakangas at Sep 19, 2006 at 5:55 pm

Tom Lane wrote:
    Gregory Stark <stark@enterprisedb.com> writes:
    Frankly the whole phantom commandid thing sounds more complicated.
    You *still*
    need a local state data structure that *still* has to spill to disk
    and now
    it's much harder to characterize how large it will grow since it
    depends on
    arbitrary combinations of cmin and cmax.
    Yeah, but it requires only one entry when a command processes
    arbitrarily large numbers of tuples, so in practice it's not going to
    need to spill to disk. What Heikki wants to do will require an entry in
    local memory for *each tuple* modified by a transaction. That will ruin
    performance on a regular basis.
    As I tried to say in the first post, I believe we can actually get away
    without an entry in local memory in typical scenarios, including bulk
    data loads. Bulk updates are probably the biggest problem, but I think
    we could handle even them just fine with the right data structure.

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Tom Lane at Sep 19, 2006 at 6:04 pm

    Heikki Linnakangas writes:
    As I tried to say in the first post, I believe we can actually get away
    without an entry in local memory in typical scenarios, including bulk
    data loads.
    I didn't find that argument very credible, particularly not the part
    that assumes we know what the oldest snapshot is. I remain of the
    opinion that this is going to be a large, complicated (ie buggy),
    poorly performing mechanism to hypothetically someday save 4 bytes
    that, even if we do save them, are just going to disappear into
    alignment padding on most newer servers.

    regards, tom lane
  • Bruce Momjian at Sep 19, 2006 at 6:46 pm

    Tom Lane wrote:
    Gregory Stark <stark@enterprisedb.com> writes:
    Frankly the whole phantom commandid thing sounds more complicated. You *still*
    need a local state data structure that *still* has to spill to disk and now
    it's much harder to characterize how large it will grow since it depends on
    arbitrary combinations of cmin and cmax.
    Yeah, but it requires only one entry when a command processes
    arbitrarily large numbers of tuples, so in practice it's not going to
    need to spill to disk. What Heikki wants to do will require an entry in
    local memory for *each tuple* modified by a transaction. That will ruin
    performance on a regular basis.
    Agreed. TODO has:

    * Merge xmin/xmax/cmin/cmax back into three header fields

    Before subtransactions, there used to be only three fields needed to
    store these four values. This was possible because only the current
    transaction looks at the cmin/cmax values. If the current transaction
created and expired the row, the fields stored were xmin (same as
xmax), cmin, and cmax; and if the transaction was expiring a row from
another transaction, the fields stored were xmin (cmin was not
needed), xmax, and cmax. Such a system worked because a transaction
    could only see rows from another completed transaction. However,
    subtransactions can see rows from outer transactions, and once the
    subtransaction completes, the outer transaction continues, requiring
    the storage of all four fields. With subtransactions, an outer
    transaction can create a row, a subtransaction expire it, and when the
    subtransaction completes, the outer transaction still has to have
    proper visibility of the row's cmin, for example, for cursors.

    One possible solution is to create a phantom cid which represents a
    cmin/cmax pair and is stored in local memory. Another idea is to
    store both cmin and cmax only in local memory.

I can see both the phantom idea and local memory for all cmin/cmax
values working. I believe the phantom idea has the most merit.
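The phantom-cid bookkeeping is small by construction: one backend-local
entry per distinct (cmin, cmax) pair in use, keyed by a synthetic command
id stored on the tuple. A rough C sketch of that mapping (illustrative
names only; a hash table would replace the linear scan in practice --
PostgreSQL 8.3 later shipped essentially this idea as "combo command IDs"):

```c
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t CommandId;
typedef struct { CommandId cmin, cmax; } ComboPair;

static ComboPair *combo_pairs = NULL;   /* backend-local, per transaction */
static size_t ncombos = 0, combo_cap = 0;

/* Map a (cmin, cmax) pair to a synthetic command id; reuse the existing
 * slot if the same pair was seen before, so a command that deletes a
 * million rows it just inserted still costs only one entry. */
static CommandId
get_combo_cid(CommandId cmin, CommandId cmax)
{
    for (size_t i = 0; i < ncombos; i++)
        if (combo_pairs[i].cmin == cmin && combo_pairs[i].cmax == cmax)
            return (CommandId) i;
    if (ncombos == combo_cap) {
        combo_cap = combo_cap ? combo_cap * 2 : 8;
        combo_pairs = realloc(combo_pairs, combo_cap * sizeof(ComboPair));
    }
    combo_pairs[ncombos] = (ComboPair){ cmin, cmax };
    return (CommandId) ncombos++;
}

/* Visibility checks recover the real values from the synthetic id. */
static CommandId combo_get_cmin(CommandId id) { return combo_pairs[id].cmin; }
static CommandId combo_get_cmax(CommandId id) { return combo_pairs[id].cmax; }
```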

    --
    Bruce Momjian bruce@momjian.us
    EnterpriseDB http://www.enterprisedb.com

    + If your life is a hard drive, Christ can be your backup. +
  • Heikki Linnakangas at Sep 19, 2006 at 6:23 pm

Tom Lane wrote:
    I'm also concerned about loss of debug traceability if these fields
    disappear entirely from disk --- it's been handy more than once to be
    able to tell where in a complex transaction something happened.
Sure. We'll just have to try to compensate for that with debug messages
etc., whatever scheme we choose.
    Lastly, at least on machines with 8-byte MAXALIGN, removing four more
    bytes from heap headers would save nothing. So I'm not excited about
    going through enormous pushups to get rid of both fields, when a far
    simpler and better-performing mechanism suffices to remove one.
    It would be a win on 32-bit architectures. And there has been discussion of
    storing at least some data types unaligned.

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Tom Lane at Sep 19, 2006 at 6:44 pm

    Heikki Linnakangas writes:
Tom Lane wrote:
    I'm also concerned about loss of debug traceability if these fields
    disappear entirely from disk --- it's been handy more than once to be
    able to tell where in a complex transaction something happened.
Sure. We'll just have to try to compensate for that with debug messages
etc., whatever scheme we choose.
    I think you completely misunderstand the context in which I'm concerned
    about that --- handwaving about "better debug messages" doesn't assuage
    the concern. In fact, since I wrote that message I've had another
    example of what stored cmin is good for: a few minutes ago, in
    connection with Marc Evan's issue here,
    http://archives.postgresql.org/pgsql-general/2006-09/msg00741.php
    we were able to eliminate a theory about an FK trigger having modified a
    row after its insertion by observing that the stored row still had cmin
    = 0. I've made use of cmin data in many prior cases to help identify
    what's what: in lots of real applications, the cmin value tells you
    exactly which kind of transaction inserted or modified the row, because
    different transactions have different numbers of steps. If cmin
    vanishes into transient storage then after-the-fact forensic analysis
    will be severely handicapped. No amount of "debug messages" will make
    up for data that's not there anymore when you realize you need it.

    regards, tom lane

Discussion Overview
group: pgsql-hackers
categories: postgresql
posted: Sep 19, '06 at 1:37p
active: Sep 19, '06 at 6:46p
posts: 13
users: 4
website: postgresql.org...
irc: #postgresql

site design / logo © 2021 Grokbase