
Florian Helmberger writes:
On 25.05.11 04:47, Tom Lane wrote:
Florian Helmberger <fh@25th-floor.com> writes:
I'm running a production database with PostgreSQL 9.0.3 (64-bit) on
Debian 5.0.4 and have an issue with a TOAST table and far too frequent
autovacuum runs.

I think I've pinned the problem down to the values pg_class holds for
the affected TOAST table:

relpages | 433596
reltuples | 1868538

These values are significantly too low. Interestingly, the autovacuum
log output reports the correct values:

pages: 0 removed, 34788136 remain
tuples: 932487 removed, 69599038 remain

but these aren't stored in pg_class after each run.
That's exceedingly weird. Do the pg_stat_all_tables columns update
after autovacuums on that table?
Yes they do:
I think I see what must be going on here: that toast table must contain
a long run of all-visible-according-to-the-VM pages (which is hardly a
surprising situation). This results in VACUUM choosing not to update
the pg_class entry:

    /*
     * Update statistics in pg_class. But only if we didn't skip any pages;
     * the tuple count only includes tuples from the pages we've visited, and
     * we haven't frozen tuples in unvisited pages either. The page count is
     * accurate in any case, but because we use the reltuples / relpages ratio
     * in the planner, it's better to not update relpages either if we can't
     * update reltuples.
     */
    if (vacrelstats->scanned_all)
        vac_update_relstats(onerel,
                            vacrelstats->rel_pages, vacrelstats->rel_tuples,
                            vacrelstats->hasindex,
                            FreezeLimit);

For an ordinary table this wouldn't be fatal because we'll still do an
ANALYZE from time to time, and that will update the entries with new
(approximate) values. But we never run ANALYZE on toast tables.

And this would *still* be okay, because as noted in the comment, the
planner only depends on the ratio being roughly correct, not on either
individual value being current. But autovacuum didn't get the memo;
it thinks it can use reltuples to make decisions.
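
For reference, the check in autovacuum's relation_needs_vacanalyze() boils down
to comparing the dead-tuple count from the stats collector against a threshold
derived from reltuples. A simplified sketch of that test (not the actual
backend code; parameter names are illustrative):

#include <stdbool.h>

/* Simplified sketch of autovacuum's vacuum-threshold decision. */
static bool
needs_vacuum_sketch(double reltuples, double dead_tuples,
                    double base_thresh, double scale_factor)
{
    /* the trigger point scales with the (possibly stale) pg_class.reltuples */
    double vacthresh = base_thresh + scale_factor * reltuples;

    return dead_tuples > vacthresh;
}

With reltuples stuck far below reality, the threshold is far too low as well,
which is why Florian's toast table keeps getting vacuumed.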

I can see two basic approaches we might take here:

1. Modify autovacuum to use something from the stats collector, rather
than reltuples, to make its decisions. I'm not too clear on why
reltuples is being used there anyway; is there some good algorithmic or
statistical reason why AV should be looking at a number from the last
vacuum?

2. Revise the vacuum code so that it doesn't skip updating the pg_class
entries. We could have it count the number of pages it skipped, rather
than just keeping a bool, and then scale up the rel_tuples count to be
approximately right by assuming the skipped pages have tuple density
similar to the scanned ones.

Thoughts?

regards, tom lane


  • Alvaro Herrera at May 25, 2011 at 4:00 pm

    Excerpts from Tom Lane's message of Wed May 25 11:47:52 -0400 2011:

    I think I see what must be going on here: that toast table must contain
    a long run of all-visible-according-to-the-VM pages (which is hardly a
    surprising situation). This results in VACUUM choosing not to update
    the pg_class entry:

    /*
    * Update statistics in pg_class. But only if we didn't skip any pages;
    * the tuple count only includes tuples from the pages we've visited, and
    * we haven't frozen tuples in unvisited pages either. The page count is
    * accurate in any case, but because we use the reltuples / relpages ratio
    * in the planner, it's better to not update relpages either if we can't
    * update reltuples.
    */
    if (vacrelstats->scanned_all)
    vac_update_relstats(onerel,
    vacrelstats->rel_pages, vacrelstats->rel_tuples,
    vacrelstats->hasindex,
    FreezeLimit);

    For an ordinary table this wouldn't be fatal because we'll still do an
    ANALYZE from time to time, and that will update the entries with new
    (approximate) values. But we never run ANALYZE on toast tables. Ouch.
    I can see two basic approaches we might take here:

    1. Modify autovacuum to use something from the stats collector, rather
    than reltuples, to make its decisions. I'm not too clear on why
    reltuples is being used there anyway; is there some good algorithmic or
    statistical reason why AV should be looking at a number from the last
    vacuum?
    It uses reltuples simply because it was what the original contrib code
    was using. Since pgstat was considerably weaker at the time, reltuples
    might have been the only thing available. It's certainly the case that
    pgstat has improved a lot since autovacuum got in, and some things have
    been revised but not this one.
    2. Revise the vacuum code so that it doesn't skip updating the pg_class
    entries. We could have it count the number of pages it skipped, rather
    than just keeping a bool, and then scale up the rel_tuples count to be
    approximately right by assuming the skipped pages have tuple density
    similar to the scanned ones.
    Hmm, interesting idea. This would be done only for toast tables, or all
    tables?

    At this point I only wonder why we store tuples & pages rather than just
    live tuple density.

    --
    Álvaro Herrera <alvherre@commandprompt.com>
    The PostgreSQL Company - Command Prompt, Inc.
    PostgreSQL Replication, Consulting, Custom Development, 24x7 support
  • Tom Lane at May 25, 2011 at 4:24 pm

    Alvaro Herrera writes:
    Excerpts from Tom Lane's message of Wed May 25 11:47:52 -0400 2011:
    I can see two basic approaches we might take here:

    1. Modify autovacuum to use something from the stats collector, rather
    than reltuples, to make its decisions. I'm not too clear on why
    reltuples is being used there anyway; is there some good algorithmic or
    statistical reason why AV should be looking at a number from the last
    vacuum?
    It uses reltuples simply because it was what the original contrib code
    was using. Since pgstat was considerably weaker at the time, reltuples
    might have been the only thing available. It's certainly the case that
    pgstat has improved a lot since autovacuum got in, and some things have
    been revised but not this one.
    On reflection I'm hesitant to do this, especially for a backpatched bug
    fix, because it would be changing the feedback loop behavior for
    autovacuum scheduling. That could have surprising consequences.
    2. Revise the vacuum code so that it doesn't skip updating the pg_class
    entries. We could have it count the number of pages it skipped, rather
    than just keeping a bool, and then scale up the rel_tuples count to be
    approximately right by assuming the skipped pages have tuple density
    similar to the scanned ones.
    Hmm, interesting idea. This would be done only for toast tables, or all
    tables?
    I'm thinking just do it for all. The fact that these numbers don't
    necessarily update after a vacuum is certainly surprising in and of
    itself, and it did not work that way before the VM patch went in.
    I'm concerned about other stuff besides AV not dealing well with
    obsolete values.
    At this point I only wonder why we store tuples & pages rather than just
    live tuple density.
    It's just for backwards compatibility. I've thought about doing that in
    the past, but I don't know what client-side code might be looking at
    relpages/reltuples. It's not like collapsing them into one field would
    save much, anyway.

    regards, tom lane
  • Kevin Grittner at May 25, 2011 at 4:37 pm

    Tom Lane wrote:

    I don't know what client-side code might be looking at
    relpages/reltuples.
    I know that I find reltuples useful for getting an "accurate enough"
    sense of rows in a table (or set of tables) without resorting to
    count(*). I'd be OK with any two of pages, tuples, and density; but
    dropping to one would make databases harder to manage, IMV.

    Personally, I don't have much code that uses those columns;
    eliminating an existing column wouldn't involve much pain for me as
    long as it could still be derived.

    -Kevin
  • Alvaro Herrera at May 25, 2011 at 4:55 pm

    Excerpts from Kevin Grittner's message of Wed May 25 12:37:24 -0400 2011:
    Tom Lane wrote:
    I don't know what client-side code might be looking at
    relpages/reltuples.
    I know that I find reltuples useful for getting an "accurate enough"
    sense of rows in a table (or set of tables) without resorting to
    count(*). I'd be OK with any two of pages, tuples, and density; but
    dropping to one would make databases harder to manage, IMV.

    Personally, I don't have much code that uses those columns;
    eliminating an existing column wouldn't involve much pain for me as
    long as it could still be derived.
    Well, we only actually need to store one number, because you can figure
    out a much more precise number-of-pages figure with pg_relation_size()
    divided by configured page size.

    (We feel free to hack around catalogs in other places, and we regularly
    break pgAdmin and the like when we drop columns -- people just live with
    it and update their tools. I don't think it's such a big deal in this
    particular case.)

    --
    Álvaro Herrera <alvherre@commandprompt.com>
    The PostgreSQL Company - Command Prompt, Inc.
    PostgreSQL Replication, Consulting, Custom Development, 24x7 support
  • Cédric Villemain at May 25, 2011 at 5:32 pm

    2011/5/25 Alvaro Herrera <alvherre@commandprompt.com>:
    Excerpts from Kevin Grittner's message of Wed May 25 12:37:24 -0400 2011:
    Tom Lane wrote:
    I don't know what client-side code might be looking at
    relpages/reltuples.
    I know that I find reltuples useful for getting an "accurate enough"
    sense of rows in a table (or set of tables) without resorting to
    count(*).  I'd be OK with any two of pages, tuples, and density; but
    dropping to one would make databases harder to manage, IMV.

    Personally, I don't have much code that uses those columns;
    eliminating an existing column wouldn't involve much pain for me as
    long as it could still be derived.
    Well, we only actually need to store one number, because you can figure
    out a much more precise number-of-pages figure with pg_relation_size()
    divided by configured page size.

    (We feel free to hack around catalogs in other places, and we regularly
    break pgAdmin and the like when we drop columns -- people just live with
    it and update their tools.  I don't think it's such a big deal in this
    particular case.)
    I may be missing something, but we need the relation size in costsize.c even
    if we have a reldensity (or we need reltuples). Otherwise, what values should
    be used to estimate the relation size? (pg_relation_size() goes down
    to the kernel/fs to ask for stat.st_size -- is that really what we want?)

    --
    Cédric Villemain               2ndQuadrant
    http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
  • Alvaro Herrera at May 25, 2011 at 5:37 pm

    Excerpts from Cédric Villemain's message of Wed May 25 13:24:01 -0400 2011:

    Well, we only actually need to store one number, because you can figure
    out a much more precise number-of-pages figure with pg_relation_size()
    divided by configured page size.
    I may be missing something, but we need the relation size in costsize.c even
    if we have a reldensity (or we need reltuples). Otherwise, what values should
    be used to estimate the relation size? (pg_relation_size() goes down
    to the kernel/fs to ask for stat.st_size -- is that really what we want?)
    Actually yes, the planner does go to kernel to determine the current
    relation size, and then multiplies by density as computed from catalog
    data to figure out current reasonably accurate number of tuples.
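
    A simplified sketch of that estimate (the real logic lives in
    estimate_rel_size() in plancat.c and handles more corner cases, such as
    relpages being zero for a never-yet-vacuumed table):

    /* curpages: what the kernel reports right now;
     * relpages/reltuples: the stored pg_class values. */
    static double
    planner_tuple_estimate(double curpages, double relpages, double reltuples)
    {
        double density;

        if (relpages > 0)
            density = reltuples / relpages;
        else
            density = 100.0;    /* arbitrary fallback, for the sketch only */

        return density * curpages;
    }

    So stale pg_class values only hurt to the extent that the stored ratio is
    off; the current page count is always re-read.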

    --
    Álvaro Herrera <alvherre@commandprompt.com>
    The PostgreSQL Company - Command Prompt, Inc.
    PostgreSQL Replication, Consulting, Custom Development, 24x7 support
  • Cédric Villemain at May 25, 2011 at 7:14 pm

    2011/5/25 Alvaro Herrera <alvherre@commandprompt.com>:
    Excerpts from Cédric Villemain's message of Wed May 25 13:24:01 -0400 2011:
    Well, we only actually need to store one number, because you can figure
    out a much more precise number-of-pages figure with pg_relation_size()
    divided by configured page size.
    I may be missing something, but we need the relation size in costsize.c even
    if we have a reldensity (or we need reltuples). Otherwise, what values should
    be used to estimate the relation size? (pg_relation_size() goes down
    to the kernel/fs to ask for stat.st_size -- is that really what we want?)
    Actually yes, the planner does go to kernel to determine the current
    relation size, and then multiplies by density as computed from catalog
    data to figure out current reasonably accurate number of tuples.
    Okay! I just read that part. Interesting.
    (If I'm reading it correctly, we find our last segment and then seek
    to the end of that segment to get the size.)

    It suddenly makes more sense :)

    --
    Cédric Villemain               2ndQuadrant
    http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
  • Robert Haas at May 25, 2011 at 4:23 pm

    On Wed, May 25, 2011 at 11:47 AM, Tom Lane wrote:
    2. Revise the vacuum code so that it doesn't skip updating the pg_class
    entries.  We could have it count the number of pages it skipped, rather
    than just keeping a bool, and then scale up the rel_tuples count to be
    approximately right by assuming the skipped pages have tuple density
    similar to the scanned ones.
    This approach doesn't seem like a good idea to me. The skipped
    portions of the table are the ones that haven't been modified in a
    while, so this embeds an assumption that the tuples in the hot
    and cold portions of the table are the same size. That might be true,
    but it isn't hard to think of cases where it might not be. There
    could also very easily be sampling error, because it's easy to imagine
    a situation where 99% of the table is getting skipped. Any error that
    creeps into the estimate is going to get scaled up along with the
    estimate itself.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Tom Lane at May 25, 2011 at 4:41 pm

    Robert Haas writes:
    On Wed, May 25, 2011 at 11:47 AM, Tom Lane wrote:
    2. Revise the vacuum code so that it doesn't skip updating the pg_class
    entries.  We could have it count the number of pages it skipped, rather
    than just keeping a bool, and then scale up the rel_tuples count to be
    approximately right by assuming the skipped pages have tuple density
    similar to the scanned ones.
    This approach doesn't seem like a good idea to me. The skipped
    portions of the table are the ones that haven't been modified in a
    while, so this embeds an assumption that the tuples in the hot
    and cold portions of the table are the same size. That might be true,
    but it isn't hard to think of cases where it might not be. There
    could also very easily be sampling error, because it's easy to imagine
    a situation where 99% of the table is getting skipped.
    Yeah, I had been thinking about the latter point. We could be
    conservative and just keep the reported tuple density the same (ie,
    update relpages to the new correct value, while setting reltuples so
    that the density ratio doesn't change). But that has its own problems
    when the table contents *do* change. What I'm currently imagining is
    to do a smoothed moving average, where we factor in the new density
    estimate with a weight dependent on the percentage of the table we did
    scan. That is, the calculation goes something like

    old_density = old_reltuples / old_relpages
    new_density = counted_tuples / scanned_pages
    reliability = scanned_pages / new_relpages
    updated_density = old_density + (new_density - old_density) * reliability
    new_reltuples = updated_density * new_relpages

    We could slow the moving-average convergence even further when
    reliability is small; for instance if you were really paranoid you might
    want to use the square of reliability in line 4. That might be
    overdoing it, though.
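
    In C, the update could look something like this (a sketch of the arithmetic
    above only; names are illustrative, and a real implementation would need to
    special-case old_relpages == 0 and a full-table scan, where counted_tuples
    can be used directly):

    /* Sketch: smoothed update of reltuples after a partial scan.
     * old_* come from pg_class; scanned_pages/counted_tuples describe the
     * scan just completed; new_relpages is the current table size. */
    static double
    estimate_new_reltuples(double old_relpages, double old_reltuples,
                           double new_relpages,
                           double scanned_pages, double counted_tuples)
    {
        double old_density = old_reltuples / old_relpages;
        double new_density = counted_tuples / scanned_pages;
        double reliability = scanned_pages / new_relpages;
        double updated_density =
            old_density + (new_density - old_density) * reliability;

        return updated_density * new_relpages;
    }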

    regards, tom lane
  • Robert Haas at May 25, 2011 at 4:54 pm

    On Wed, May 25, 2011 at 12:41 PM, Tom Lane wrote:
    Yeah, I had been thinking about the latter point.  We could be
    conservative and just keep the reported tuple density the same (ie,
    update relpages to the new correct value, while setting reltuples so
    that the density ratio doesn't change).  But that has its own problems
    when the table contents *do* change.  What I'm currently imagining is
    to do a smoothed moving average, where we factor in the new density
    estimate with a weight dependent on the percentage of the table we did
    scan.  That is, the calculation goes something like

    old_density = old_reltuples / old_relpages
    new_density = counted_tuples / scanned_pages
    reliability = scanned_pages / new_relpages
    updated_density = old_density + (new_density - old_density) * reliability
    new_reltuples = updated_density * new_relpages

    We could slow the moving-average convergence even further when
    reliability is small; for instance if you were really paranoid you might
    want to use the square of reliability in line 4.  That might be
    overdoing it, though.
    I don't know. That's maybe better, but I'd be willing to wager that
    in some cases it will just slow down the rate at which we converge to
    a completely incorrect value, while in other cases it'll fail to
    update the data when it really has changed.

    I am wondering, though, why we're not just inserting a special-purpose
    hack for TOAST tables. Your email seems to indicate that regular
    tables are already handled well enough, and certainly if we only whack
    around the TOAST behavior it's much less likely to fry anything.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Tom Lane at May 25, 2011 at 5:04 pm

    Robert Haas writes:
    On Wed, May 25, 2011 at 12:41 PM, Tom Lane wrote:
    Yeah, I had been thinking about the latter point.  We could be
    conservative and just keep the reported tuple density the same (ie,
    update relpages to the new correct value, while setting reltuples so
    that the density ratio doesn't change).  But that has its own problems
    when the table contents *do* change.  What I'm currently imagining is
    to do a smoothed moving average, where we factor in the new density
    estimate with a weight dependent on the percentage of the table we did
    scan.  That is, the calculation goes something like

    old_density = old_reltuples / old_relpages
    new_density = counted_tuples / scanned_pages
    reliability = scanned_pages / new_relpages
    updated_density = old_density + (new_density - old_density) * reliability
    new_reltuples = updated_density * new_relpages

    We could slow the moving-average convergence even further when
    reliability is small; for instance if you were really paranoid you might
    want to use the square of reliability in line 4.  That might be
    overdoing it, though.
    I don't know. That's maybe better, but I'd be willing to wager that
    in some cases it will just slow down the rate at which we converge to
    a completely incorrect value, while in other cases it'll fail to
    update the data when it really has changed.
    [ shrug... ] When you don't have complete information, it's *always*
    the case that you will sometimes make a mistake. That's not
    justification for paralysis, especially not when the existing code is
    demonstrably broken.

    What occurs to me after thinking a bit more is that the existing tuple
    density is likely to be only an estimate, too (one coming from the last
    ANALYZE, which could very well have scanned even less of the table than
    VACUUM did). So what I now think is that both VACUUM and ANALYZE ought
    to use a calculation of the above form to arrive at a new value for
    pg_class.reltuples. In both cases it would be pretty easy to track the
    number of pages we looked at while counting tuples, so the same raw
    information is available.
    I am wondering, though, why we're not just inserting a special-purpose
    hack for TOAST tables.
    Because the problem is not specific to TOAST tables. As things
    currently stand, we will accept the word of an ANALYZE as gospel even if
    it scanned 1% of the table, and completely ignore the results from a
    VACUUM even if it scanned 99% of the table. This is not sane.

    regards, tom lane
  • Robert Haas at May 25, 2011 at 5:16 pm

    On Wed, May 25, 2011 at 1:04 PM, Tom Lane wrote:
    [ shrug... ]  When you don't have complete information, it's *always*
    the case that you will sometimes make a mistake.  That's not
    justification for paralysis, especially not when the existing code is
    demonstrably broken.

    What occurs to me after thinking a bit more is that the existing tuple
    density is likely to be only an estimate, too (one coming from the last
    ANALYZE, which could very well have scanned even less of the table than
    VACUUM did).  So what I now think is that both VACUUM and ANALYZE ought
    to use a calculation of the above form to arrive at a new value for
    pg_class.reltuples.  In both cases it would be pretty easy to track the
    number of pages we looked at while counting tuples, so the same raw
    information is available.
    I am wondering, though, why we're not just inserting a special-purpose
    hack for TOAST tables.
    Because the problem is not specific to TOAST tables.  As things
    currently stand, we will accept the word of an ANALYZE as gospel even if
    it scanned 1% of the table, and completely ignore the results from a
    VACUUM even if it scanned 99% of the table.  This is not sane.
    I agree that if VACUUM scanned 99% of the table, it's probably fine to
    use its numbers. It's also fine to use the numbers from ANALYZE,
    because those pages are chosen randomly. What bothers me is the idea
    of using a small *non-random* sample, and I'm not sure that
    incorporating possibly-bogus results slowly is any better than
    incorporating them quickly.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Robert Haas at May 25, 2011 at 5:31 pm

    On Wed, May 25, 2011 at 1:15 PM, Robert Haas wrote:
    I agree that if VACUUM scanned 99% of the table, it's probably fine to
    use its numbers.  It's also fine to use the numbers from ANALYZE,
    because those pages are chosen randomly.  What bothers me is the idea
    of using a small *non-random* sample, and I'm not sure that
    incorporating possibly-bogus results slowly is any better than
    incorporating them quickly.
    In particular, unless I'm misremembering, VACUUM *always* scans the
    first few pages of the table, until it sees enough consecutive
    all-visible bits that it decides to start skipping. If I'm right
    about that, then those pages could easily end up being overweighted
    when VACUUM does the counting; especially if ANALYZE or an actual
    full-table vacuum aren't allowed to snap the count back to reality.
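
    The heuristic in question looks roughly like this (an approximate sketch of
    lazy_scan_heap()'s current skip logic, not a verbatim quote -- the streak
    counter's name is from memory):

    /* Inside the per-block loop: only start skipping once a long enough run
     * of consecutive all-visible pages has been seen, so the first
     * SKIP_PAGES_THRESHOLD-1 pages always get scanned. */
    if (all_visible_according_to_vm && !scan_all)
    {
        all_visible_streak++;
        if (all_visible_streak >= SKIP_PAGES_THRESHOLD)
        {
            vacrelstats->scanned_all = false;
            continue;           /* skip this page */
        }
    }
    else
        all_visible_streak = 0;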

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Tom Lane at May 25, 2011 at 9:54 pm

    Robert Haas writes:
    On Wed, May 25, 2011 at 1:04 PM, Tom Lane wrote:
    Because the problem is not specific to TOAST tables.  As things
    currently stand, we will accept the word of an ANALYZE as gospel even if
    it scanned 1% of the table, and completely ignore the results from a
    VACUUM even if it scanned 99% of the table.  This is not sane.
    I agree that if VACUUM scanned 99% of the table, it's probably fine to
    use its numbers. It's also fine to use the numbers from ANALYZE,
    because those pages are chosen randomly. What bothers me is the idea
    of using a small *non-random* sample, and I'm not sure that
    incorporating possibly-bogus results slowly is any better than
    incorporating them quickly.
    The above is simply fuzzy thinking. The fact that ANALYZE looked at a
    random subset of pages is *no guarantee whatsoever* that its results are
    highly accurate. They might be more trustworthy than VACUUM's nonrandom
    sample of a similar number of pages, but it doesn't hold even a little
    bit of water to claim that we should believe ANALYZE completely and
    VACUUM not at all even when the latter has looked at a significantly
    larger sample of pages.

    In any case, your line of thought doesn't help us for fixing the problem
    with toast tables, because we aren't going to start doing ANALYZEs on
    toast tables.

    The bottom line here is that making use of stats we have is a lot better
    than not making use of them, even if they aren't entirely trustworthy.

    regards, tom lane
  • Robert Haas at May 25, 2011 at 11:11 pm

    On Wed, May 25, 2011 at 5:54 PM, Tom Lane wrote:
    Robert Haas <robertmhaas@gmail.com> writes:
    On Wed, May 25, 2011 at 1:04 PM, Tom Lane wrote:
    Because the problem is not specific to TOAST tables.  As things
    currently stand, we will accept the word of an ANALYZE as gospel even if
    it scanned 1% of the table, and completely ignore the results from a
    VACUUM even if it scanned 99% of the table.  This is not sane.
    I agree that if VACUUM scanned 99% of the table, it's probably fine to
    use its numbers.  It's also fine to use the numbers from ANALYZE,
    because those pages are chosen randomly.  What bothers me is the idea
    of using a small *non-random* sample, and I'm not sure that
    incorporating possibly-bogus results slowly is any better than
    incorporating them quickly.
    The above is simply fuzzy thinking.  The fact that ANALYZE looked at a
    random subset of pages is *no guarantee whatsoever* that its results are
    highly accurate.  They might be more trustworthy than VACUUM's nonrandom
    sample of a similar number of pages, but it doesn't hold even a little
    bit of water to claim that we should believe ANALYZE completely and
    VACUUM not at all even when the latter has looked at a significantly
    larger sample of pages.
    I think you're arguing against a straw-man.
    In any case, your line of thought doesn't help us for fixing the problem
    with toast tables, because we aren't going to start doing ANALYZEs on
    toast tables.
    Can we simply use a constant for tuple density on TOAST tables?
    The bottom line here is that making use of stats we have is a lot better
    than not making use of them, even if they aren't entirely trustworthy.
    http://dilbert.com/strips/comic/2008-05-07/

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Alvaro Herrera at May 25, 2011 at 5:13 pm

    Excerpts from Robert Haas's message of Wed May 25 12:54:28 -0400 2011:

    I don't know. That's maybe better, but I'd be willing to wager that
    in some cases it will just slow down the rate at which we converge to
    a completely incorrect value, while in other cases it'll fail to
    update the data when it really has changed.
    For regular tables I don't think there's a real problem because it'll
    get fixed next time a full scan happens anyway. For toast tables, I
    think the set of operations is limited enough that this is easy to prove
    correct (or to fix so that it is) -- no HOT updates (in fact no updates
    at all), etc.

    BTW one thing we haven't fixed at all is how HOT updates affect
    vacuuming behavior ...
    I am wondering, though, why we're not just inserting a special-purpose
    hack for TOAST tables. Your email seems to indicate that regular
    tables are already handled well enough, and certainly if we only whack
    around the TOAST behavior it's much less likely to fry anything.
    Well, having two paths, one of which is uncommonly used, means that the
    uncommon one will get all the bugs. As with the xl_commit WAL record comment from
    Simon, I'd rather stick with having a single one.

    --
    Álvaro Herrera <alvherre@commandprompt.com>
    The PostgreSQL Company - Command Prompt, Inc.
    PostgreSQL Replication, Consulting, Custom Development, 24x7 support
  • Greg Stark at May 26, 2011 at 5:14 am

    On Wed, May 25, 2011 at 9:41 AM, Tom Lane wrote:
    Yeah, I had been thinking about the latter point.  We could be
    conservative and just keep the reported tuple density the same (ie,
    update relpages to the new correct value, while setting reltuples so
    that the density ratio doesn't change).  But that has its own problems
    when the table contents *do* change.  What I'm currently imagining is
    to do a smoothed moving average, where we factor in the new density
    estimate with a weight dependent on the percentage of the table we did
    scan.  That is, the calculation goes something like

    old_density = old_reltuples / old_relpages
    new_density = counted_tuples / scanned_pages
    reliability = scanned_pages / new_relpages
    updated_density = old_density + (new_density - old_density) * reliability
    new_reltuples = updated_density * new_relpages
    This amounts to assuming that the pages observed during the vacuum have
    the density that was observed, and the pages that weren't seen have the
    density that was previously recorded in the reltuples/relpages stats.
    That seems like a pretty solid approach to me. If the numbers were sane
    before, it follows that they should be sane after the update.

    --
    greg
  • Greg Stark at May 26, 2011 at 5:13 am

    On Wed, May 25, 2011 at 10:05 PM, Greg Stark wrote:
    updated_density = old_density + (new_density - old_density) * reliability
    new_reltuples = updated_density * new_relpages
    This amounts to assuming that the pages observed during the vacuum have
    the density that was observed, and the pages that weren't seen have the
    density that was previously recorded in the reltuples/relpages stats.
    In case it's not clear, Tom's expression for updated_density is
    equivalent by simple algebra to:

    updated_density = (old_density * (1-reliability)) + (new_density * reliability)
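
    Spelling the algebra out:

    old_density + (new_density - old_density) * reliability
        = old_density - old_density * reliability + new_density * reliability
        = old_density * (1 - reliability) + new_density * reliability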

    --
    greg
  • Tom Lane at May 26, 2011 at 3:25 pm

    Greg Stark writes:
    On Wed, May 25, 2011 at 9:41 AM, Tom Lane wrote:
    ... What I'm currently imagining is
    to do a smoothed moving average, where we factor in the new density
    estimate with a weight dependent on the percentage of the table we did
    scan. That is, the calculation goes something like

    old_density = old_reltuples / old_relpages
    new_density = counted_tuples / scanned_pages
    reliability = scanned_pages / new_relpages
    updated_density = old_density + (new_density - old_density) * reliability
    new_reltuples = updated_density * new_relpages
    This amounts to assuming that the pages observed in the vacuum have
    the density observed and the pages that weren't seen have the density
    that were previously in the reltuples/relpages stats. That seems like
    a pretty solid approach to me. If the numbers were sane before it
    follows that they should be sane after the update.
    Hm, that's an interesting way of looking at it, but I was coming at it
    from a signal-processing point of view. What Robert is concerned about
    is that if VACUUM is cleaning a non-representative sample of pages, and
    repeated VACUUMs examine pretty much the same sample each time, then
    over repeated applications of the above formula the estimated density
    will eventually converge to what we are seeing in the sample. The speed
    of convergence depends on the moving-average multiplier, ie the
    "reliability" number above, and what I was after was just to slow down
    convergence for smaller samples. So I wouldn't have any problem with
    including a fudge factor to make the convergence even slower. But your
    analogy makes it seem like this particular formulation is actually
    "right" in some sense.

    One other point here is that Florian's problem is really only with our
    failing to update relpages. I don't think there is any part of the
    system that particularly cares about reltuples for a toast table. So
    even if the value did converge to some significantly-bad estimate over
    time, it's not really an issue AFAICS. We do care about having a sane
    reltuples estimate for regular tables, but for those we should have a
    mixture of updates from ANALYZE and updates from VACUUM. Also, for both
    regular and toast tables we will have an occasional vacuum-for-wraparound
    that is guaranteed to scan all pages and hence do a hard reset of
    reltuples to the correct value.

    I'm still of the opinion that an incremental estimation process like
    the above is a lot saner than what we're doing now, snarky Dilbert
    references notwithstanding. The only thing that seems worthy of debate
    from here is whether we should trust ANALYZE's estimates a bit more than
    VACUUM's estimates, on the grounds that the former are more likely to be
    from a random subset of pages. We could implement that by applying a
    fudge factor when folding a VACUUM estimate into the moving average (ie,
    multiply its reliability by something less than one). I don't have any
    principled suggestion for just what the fudge factor ought to be, except
    that I don't think "zero" is the best value, which AFAICT is what Robert
    is arguing. I think Greg's argument shows that "one" is the right value
    when dealing with an ANALYZE estimate, if you believe that ANALYZE saw a
    random set of pages ... but using that for VACUUM does seem
    overoptimistic.

    regards, tom lane
  • Robert Haas at May 26, 2011 at 3:56 pm

    On Thu, May 26, 2011 at 11:25 AM, Tom Lane wrote:
    I'm still of the opinion that an incremental estimation process like
    the above is a lot saner than what we're doing now, snarky Dilbert
    references notwithstanding.  The only thing that seems worthy of debate
    from here is whether we should trust ANALYZE's estimates a bit more than
    VACUUM's estimates, on the grounds that the former are more likely to be
    from a random subset of pages.  We could implement that by applying a
    fudge factor when folding a VACUUM estimate into the moving average (ie,
    multiply its reliability by something less than one).  I don't have any
    principled suggestion for just what the fudge factor ought to be, except
    that I don't think "zero" is the best value, which AFAICT is what Robert
    is arguing.  I think Greg's argument shows that "one" is the right value
    when dealing with an ANALYZE estimate, if you believe that ANALYZE saw a
    random set of pages ... but using that for VACUUM does seem
    overoptimistic.
    The problem is that it's quite difficult to predict what the relative
    frequency of full-relation vacuums, vacuums-with-skips, and ANALYZE
    operations on the table will be. It matters how fast the table is
    being inserted into vs. updated/deleted; and it also matters how fast
    the table is being updated compared with the system's rate of XID
    consumption. So in general it seems hard to say, well, we know this
    number might drift off course a little bit, but there will be a
    freezing vacuum or analyze or something coming along soon enough to
    fix the problem. There might be, but it's difficult to be sure. My
    argument isn't so much that using a non-zero value here is guaranteed
    to have bad effects, but that we really have no idea what will work
    out well in practice, and therefore it seems dangerous to whack the
    behavior around ... especially in stable branches.

    If we changed this in 9.1, and that's the last time we ever get a
    complaint about it, problem solved. But I would feel bad if we
    changed this in the back-branches and then found that, while solving
    this particular problem, we had created others. It also seems likely
    that the replacement problems would be more subtle and more difficult
    to diagnose, because they'd depend in a very complicated way on the
    workload, and having, say, the latest table contents would not
    necessarily enable us to reproduce the problem.

    I would feel a lot better about something that is deterministic, like,
    I dunno, if VACUUM visits more than 25% of the table, we use its
    estimate. And we always use ANALYZE's estimate. Or something.

    Another thought: Couldn't relation_needs_vacanalyze() just scale up
    reltuples by the ratio of the current number of pages in the relation
    to relpages, just as the query planner does?
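
    In other words, something like this inside relation_needs_vacanalyze()
    (sketch only; cur_pages would come from something like
    RelationGetNumberOfBlocks(), and the relpages == 0 case is ignored here):

    /* Scale the possibly-stale reltuples by the relation's current size
     * before plugging it into the base + scale_factor * reltuples test. */
    reltuples = classForm->reltuples *
        ((double) cur_pages / (double) classForm->relpages);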

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Tom Lane at May 26, 2011 at 4:24 pm

    Robert Haas writes:
    I would feel a lot better about something that is deterministic, like,
    I dunno, if VACUUM visits more than 25% of the table, we use its
    estimate. And we always use ANALYZE's estimate. Or something.
    This argument seems to rather miss the point. The data we are working
    from is fundamentally not deterministic, and you can't make it so by
    deciding to ignore what data we do have. That leads to a less useful
    estimate, not a more useful estimate.
    Another thought: Couldn't relation_needs_vacanalyze() just scale up
    reltuples by the ratio of the current number of pages in the relation
    to relpages, just as the query planner does?
    Hmm ... that would fix Florian's immediate issue, and it does seem like
    a good change on its own merits. But it does nothing for the problem
    that we're failing to put the best available information into pg_class.

    Possibly we could compromise on doing just that much in the back
    branches, and the larger change for 9.1?

    regards, tom lane
  • Robert Haas at May 26, 2011 at 5:08 pm

    On Thu, May 26, 2011 at 12:23 PM, Tom Lane wrote:
    Another thought: Couldn't relation_needs_vacanalyze() just scale up
    reltuples by the ratio of the current number of pages in the relation
    to relpages, just as the query planner does?
    Hmm ... that would fix Florian's immediate issue, and it does seem like
    a good change on its own merits.  But it does nothing for the problem
    that we're failing to put the best available information into pg_class.

    Possibly we could compromise on doing just that much in the back
    branches, and the larger change for 9.1?
    Do you think we need to worry about the extra overhead of determining
    the current size of every relation as we sweep through pg_class? It's
    not a lot, but OTOH I think we'd be doing it once a minute... not sure
    what would happen if there were tons of tables.

    Going back to your thought upthread, I think we should really consider
    replacing reltuples with reltupledensity at some point. I continue to
    be afraid that using a decaying average in this case is going to end
    up overweighting the values from some portion of the table that's
    getting scanned repeatedly, at the expense of other portions of the
    table that are not getting scanned at all. Now, perhaps experience
    will prove that's not a problem. But storing relpages and
    reltupledensity separately would give us more flexibility, because we
    could feel free to bump relpages even when we're not sure what to do
    about reltupledensity. That would help Florian's problem quite a lot,
    even if we did nothing else.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Kevin Grittner at May 26, 2011 at 5:28 pm

    Robert Haas wrote:

    I think we should really consider replacing reltuples with
    reltupledensity at some point. I continue to be afraid that using
    a decaying average in this case is going to end up overweighting
    the values from some portion of the table that's getting scanned
    repeatedly, at the expense of other portions of the table that are
    not getting scanned at all. Now, perhaps experience will prove
    that's not a problem. But storing relpages and reltupledensity
    separately would give us more flexibility, because we could feel
    free to bump relpages even when we're not sure what to do about
    reltupledensity. That would help Florian's problem quite a lot,
    even if we did nothing else.
    Given how trivial it would be to adjust reltuples to keep its ratio
    to relpages about the same when we don't have a new "hard" number,
    but some evidence that we should fudge our previous value, I don't
    see where this change buys us much. It seems to mostly obfuscate
    the fact that we're changing our assumption about how many tuples we
    have. I would rather that we did that explicitly with code comments
    about why it's justified than to slip it in the way you suggest.

    -Kevin
  • Robert Haas at May 26, 2011 at 5:50 pm

    On Thu, May 26, 2011 at 1:28 PM, Kevin Grittner wrote:
    Robert Haas wrote:
    I think we should really consider replacing reltuples with
    reltupledensity at some point.  I continue to be afraid that using
    a decaying average in this case is going to end up overweighting
    the values from some portion of the table that's getting scanned
    repeatedly, at the expense of other portions of the table that are
    not getting scanned at all.  Now, perhaps experience will prove
    that's not a problem.  But storing relpages and reltupledensity
    separately would give us more flexibility, because we could feel
    free to bump relpages even when we're not sure what to do about
    reltupledensity.  That would help Florian's problem quite a lot,
    even if we did nothing else.
    Given how trivial it would be to adjust reltuples to keep its ratio
    to relpages about the same when we don't have a new "hard" number,
    but some evidence that we should fudge our previous value, I don't
    see where this change buys us much.  It seems to mostly obfuscate
    the fact that we're changing our assumption about how many tuples we
    have.  I would rather that we did that explicitly with code comments
    about why it's justified than to slip it in the way you suggest.
    I'm a bit confused by this - what the current design obfuscates is the
    fact that reltuples and relpages are not really independent columns;
    you can't update one without updating the other, unless you want
    screwy behavior. Replacing reltuples by reltupledensity would fix
    that problem - it would be logical and non-damaging to update either
    column independently.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Kevin Grittner at May 26, 2011 at 6:06 pm

    Robert Haas wrote:
    Kevin Grittner wrote:
    Given how trivial it would be to adjust reltuples to keep its
    ratio to relpages about the same when we don't have a new "hard"
    number, but some evidence that we should fudge our previous
    value, I don't see where this change buys us much. It seems to
    mostly obfuscate the fact that we're changing our assumption
    about how many tuples we have. I would rather that we did that
    explicitly with code comments about why it's justified than to
    slip it in the way you suggest.
    I'm a bit confused by this - what the current design obfuscates is
    the fact that reltuples and relpages are not really independent
    columns; you can't update one without updating the other, unless
    you want screwy behavior. Replacing reltuples by reltupledensity
    would fix that problem - it would be logical and non-damaging to
    update either column independently.
    They don't always move in tandem. Certainly there can be available
    space in those pages from which tuples can be allocated or which
    increases as tuples are vacuumed. Your proposed change would make
    neither more nor less information available, because we've got
    two numbers which can be observed as raw counts, and a ratio between
    them. By storing the ratio and one count you make changes to the
    other count implied and less visible. It seems more understandable
    and less prone to error (to me, anyway) to keep the two "raw"
    numbers and calculate the ratio -- and when you observe a change in
    one raw number which you believe should force an adjustment to the
    other raw number before its next actual value is observed, to
    comment on why that's a good idea, and do the trivial arithmetic at
    that time.

    As a thought exercise, what happens each way if a table is loaded
    with a low fillfactor and then a lot of inserts are done? What
    happens if mass deletes are done from a table which has a high
    density?

    -Kevin
  • Robert Haas at May 26, 2011 at 7:48 pm

    On Thu, May 26, 2011 at 2:05 PM, Kevin Grittner wrote:
    I'm a bit confused by this - what the current design obfuscates is
    the fact that reltuples and relpages are not really independent
    columns; you can't update one without updating the other, unless
    you want screwy behavior.  Replacing reltuples by reltupledensity
    would fix that problem - it would be logical and non-damaging to
    update either column independently.
    They don't always move in tandem.  Certainly there can be available
    space in those pages from which tuples can be allocated or which
    increases as tuples are vacuumed.  Your proposed change would make
    neither more nor less information available, because we've got
    two numbers which can be observed as raw counts, and a ratio between
    them.
    So far I agree.
    By storing the ratio and one count you make changes to the
    other count implied and less visible.  It seems more understandable
    and less prone to error (to me, anyway) to keep the two "raw"
    numbers and calculate the ratio -- and when you observe a change in
    one raw number which you believe should force an adjustment to the
    other raw number before its next actual value is observed, to
    comment on why that's a good idea, and do the trivial arithmetic at
    that time.
    Except that's not how it works. At least in the case of ANALYZE, we
    *aren't* counting all the tuples in the table. We're selecting a
    random sample of pages and inferring a tuple density, which we then
    extrapolate to the whole table and store. Then when we pull it back
    out of the table, we convert it back to a tuple density. The real
    value we are computing and using almost everywhere is tuple density;
    storing a total number of tuples in the table appears to be just
    confusing the issue.

    Unless, of course, I am misunderstanding, which is possible.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Tom Lane at May 26, 2011 at 8:30 pm

    Robert Haas writes:
    Except that's not how it works. At least in the case of ANALYZE, we
    *aren't* counting all the tuples in the table. We're selecting a
    random sample of pages and inferring a tuple density, which we then
    extrapolate to the whole table and store. Then when we pull it back
    out of the table, we convert it back to a tuple density. The real
    value we are computing and using almost everywhere is tuple density;
    storing a total number of tuples in the table appears to be just
    confusing the issue.
    If we were starting in a green field we might choose to store tuple
    density. However, the argument for changing it now is at best mighty
    thin; IMO it is not worth the risk of breaking client code.

    regards, tom lane
  • Kevin Grittner at May 26, 2011 at 9:24 pm

    Robert Haas wrote:
    Kevin Grittner wrote:
    By storing the ratio and one count you make changes to the
    other count implied and less visible. It seems more
    understandable and less prone to error (to me, anyway) to keep
    the two "raw" numbers and calculate the ratio -- and when you
    observe a change in one raw number which you believe should force
    an adjustment to the other raw number before its next actual
    value is observed, to comment on why that's a good idea, and do
    the trivial arithmetic at that time.
    Except that's not how it works. At least in the case of ANALYZE,
    we *aren't* counting all the tuples in the table. We're selecting
    a random sample of pages and inferring a tuple density, which we
    then extrapolate to the whole table and store. Then when we pull
    it back out of the table, we convert it back to a tuple density.
    The real value we are computing and using almost everywhere is
    tuple density; storing a total number of tuples in the table
    appears to be just confusing the issue.
    Well, if tuple density is the number which is most heavily used, it
    might shave a few nanoseconds doing the arithmetic in enough places
    to justify the change, but I'm skeptical. Basically I'm with Tom on
    the fact that this change would store neither more nor less
    information (and for that matter would not really change what
    information you can easily retrieve); and slightly changing the
    manner in which it is stored doesn't solve any of the problems you
    assert that it does.

    When we prune or vacuum a page, I don't suppose we have enough
    information about that page's previous state to calculate a tuple
    count delta, do we? That would allow a far more accurate number to
    be maintained than anything suggested so far, as long as we tweak
    autovacuum to count inserts toward the need to vacuum. (It seems to
    me I saw a post giving some reason why that would have benefits anyway.)
    Except for the full pass during transaction wrap-around protection,
    where it could just set a new actual count, autovacuum would be
    skipping pages where the bit is set to indicate that all tuples are
    visible, right?

    -Kevin
  • Tom Lane at May 26, 2011 at 9:50 pm

    "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
    When we prune or vacuum a page, I don't suppose we have enough
    information about that page's previous state to calculate a tuple
    count delta, do we? That would allow a far more accurate number to
    be maintained than anything suggested so far, as long as we tweak
    autovacuum to count inserts toward the need to vacuum.
    Well, that was the other direction that was suggested upthread: stop
    relying on reltuples at all, but use the stats collector's counts.
    That might be a good solution in the long run, but there are some
    issues:

    1. It's not clear how using a current count, as opposed to
    time-of-last-vacuum count, would affect the behavior of the autovacuum
    control logic. At first glance I think it would break it, since the
    basic logic there is "how much of the table changed since it was last
    vacuumed?". Even if the equations could be modified to still work,
    I remember enough feedback control theory from undergrad EE to think that
    this is something to be seriously scared of tweaking without extensive
    testing. IMO it is far more risky than what Robert is worried about.

    2. You still have the problem that we're exposing inaccurate (or at
    least less accurate than they could be) counts to the planner and to
    onlooker clients. We could change the planner to also depend on the
    stats collector instead of reltuples, but at that point you just removed
    the option for people to turn off the stats collector. The implications
    for plan stability might be unpleasant, too.

    So that's not a direction I want to go without a significant amount
    of work and testing.

    regards, tom lane
  • Robert Haas at May 27, 2011 at 3:07 pm

    On Thu, May 26, 2011 at 5:50 PM, Tom Lane wrote:
    "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
    When we prune or vacuum a page, I don't suppose we have enough
    information about that page's previous state to calculate a tuple
    count delta, do we?  That would allow a far more accurate number to
    be maintained than anything suggested so far, as long as we tweak
    autovacuum to count inserts toward the need to vacuum.
    Well, that was the other direction that was suggested upthread: stop
    relying on reltuples at all, but use the stats collector's counts.
    That might be a good solution in the long run, but there are some
    issues:

    1. It's not clear how using a current count, as opposed to
    time-of-last-vacuum count, would affect the behavior of the autovacuum
    control logic.  At first glance I think it would break it, since the
    basic logic there is "how much of the table changed since it was last
    vacuumed?".  Even if the equations could be modified to still work,
    I remember enough feedback control theory from undergrad EE to think that
    this is something to be seriously scared of tweaking without extensive
    testing.  IMO it is far more risky than what Robert is worried about.
    Yeah, I think that would be broken.
    2. You still have the problem that we're exposing inaccurate (or at
    least less accurate than they could be) counts to the planner and to
    onlooker clients.  We could change the planner to also depend on the
    stats collector instead of reltuples, but at that point you just removed
    the option for people to turn off the stats collector.  The implications
    for plan stability might be unpleasant, too.

    So that's not a direction I want to go without a significant amount
    of work and testing.
    FWIW, I agree. Your proposed solution is certainly better than trying
    to do this; but it still seems a bit shaky to me.

    Still, maybe we don't have a better option. If it were me, I'd add an
    additional safety valve: use your formula if the percentage of the
    relation scanned is above some threshold where there's unlikely to be
    too much skew. But if the percentage scanned is too small, then don't
    use that formula. Instead, only update relpages/reltuples if the
    relation is now larger; set relpages to the new actual value, and
    scale up reltuples proportionately.

    However, I just work here. It's possible that I'm worrying about a
    problem that won't materialize in practice.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Tom Lane at May 27, 2011 at 6:34 pm

    Robert Haas writes:
    Still, maybe we don't have a better option. If it were me, I'd add an
    additional safety valve: use your formula if the percentage of the
    relation scanned is above some threshold where there's unlikely to be
    too much skew. But if the percentage scanned is too small, then don't
    use that formula. Instead, only update relpages/reltuples if the
    relation is now larger; set relpages to the new actual value, and
    scale up reltuples proportionately.
    Ah, progress: now we're down to arguing about the size of the fudge
    factor ;-). I'll do something involving derating the reliability
    when the number is coming from VACUUM.

    regards, tom lane
  • Tom Lane at May 28, 2011 at 7:01 pm

    Robert Haas writes:
    Still, maybe we don't have a better option. If it were me, I'd add an
    additional safety valve: use your formula if the percentage of the
    relation scanned is above some threshold where there's unlikely to be
    too much skew. But if the percentage scanned is too small, then don't
    use that formula. Instead, only update relpages/reltuples if the
    relation is now larger; set relpages to the new actual value, and
    scale up reltuples proportionately.
    However, I just work here. It's possible that I'm worrying about a
    problem that won't materialize in practice.
    Attached is a proposed patch to fix these issues. Experimentation
    convinced me that including a fudge factor for VACUUM's results made
    things *less* accurate, not more so. The reason seems to be bound up in
    Greg Stark's observation that the unmodified calculation is equivalent
    to assuming that the old average tuple density still applies to the
    unscanned pages. In a VACUUM, we know that the unscanned pages are
    exactly those that have had no changes since (at least) the last vacuum,
    which means that indeed the old density ought to be a good estimate.
    Now, this reasoning can break down if the table's tuple density is
    nonuniform, but what I found in my testing is that if you vacuum after a
    significant change in a table (such as deleting a lot of rows), and you
    don't apply the full unfudged correction, you get a badly wrong result.
    I think that's a more significant issue than the possibility of drift
    over time.
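
    (Put in code form, the unfudged calculation being defended here amounts to
    something like the sketch below; the identifiers are invented for
    illustration rather than lifted from the patch:)

    static double
    estimate_reltuples_after_vacuum(double old_pages, double old_tuples,
                                    double total_pages, double scanned_pages,
                                    double scanned_tuples)
    {
        double  old_density;

        if (scanned_pages >= total_pages)
            return scanned_tuples;      /* full scan: the count is exact */

        old_density = (old_pages > 0) ? old_tuples / old_pages : 0.0;

        /* exact count for scanned pages, old density for the skipped ones */
        return scanned_tuples + old_density * (total_pages - scanned_pages);
    }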

    I also found that Greg was right in thinking that it would help if we
    tweaked lazy_scan_heap to not always scan the first
    SKIP_PAGES_THRESHOLD-1 pages even if they were
    all_visible_according_to_vm. That seemed to skew the results if those
    pages weren't representative. And, for the case of a useless manual
    vacuum on a completely clean table, it would cause the reltuples value
    to drift when there was no reason to change it at all.

    Lastly, this patch removes a bunch of grotty interconnections between
    VACUUM and ANALYZE that were meant to prevent ANALYZE from updating the
    stats if VACUUM had done a full-table scan in the same command. With
    the new logic it's relatively harmless if ANALYZE does that, and
    autovacuum frequently fires the two cases independently anyway, making
    all that logic quite useless in the normal case. (This simplification
    accounts for the bulk of the diff, actually.)

    Comments?

    regards, tom lane
  • Greg Stark at May 29, 2011 at 2:43 am

    On Sat, May 28, 2011 at 12:01 PM, Tom Lane wrote:
    I also found that Greg was right in thinking that it would help if we
    tweaked lazy_scan_heap to not always scan the first
    SKIP_PAGES_THRESHOLD-1 pages even if they were
    all_visible_according_to_vm.  That seemed to skew the results if those
    pages weren't representative.  And, for the case of a useless manual
    vacuum on a completely clean table, it would cause the reltuples value
    to drift when there was no reason to change it at all.
    You fixed the logic only for the first 32 pages which helps with the
    skew. But really the logic is backwards in general. Instead of
    counting how many missed opportunities for skipped pages we've seen in
    the past we should read the bits for the next 32 pages in advance and
    decide what to do before we read those pages.

    --
    greg
  • Tom Lane at May 29, 2011 at 5:11 am

    Greg Stark writes:
    On Sat, May 28, 2011 at 12:01 PM, Tom Lane wrote:
    I also found that Greg was right in thinking that it would help if we
    tweaked lazy_scan_heap to not always scan the first
    SKIP_PAGES_THRESHOLD-1 pages even if they were
    all_visible_according_to_vm.
    You fixed the logic only for the first 32 pages which helps with the
    skew. But really the logic is backwards in general. Instead of
    counting how many missed opportunities for skipped pages we've seen in
    the past we should read the bits for the next 32 pages in advance and
    decide what to do before we read those pages.
    OK, do you like the attached version of that logic? (Other fragments
    of the patch as before.)

    regards, tom lane

    diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
    index 9393fa0727aaad7508e1163623322b4066412257..231447b31223bc5350ce49a136cffafaa53bc5fb 100644
    *** a/src/backend/commands/vacuumlazy.c
    --- b/src/backend/commands/vacuumlazy.c
    *************** lazy_scan_heap(Relation onerel, LVRelSta
    *** 311,317 ****
    int i;
    PGRUsage ru0;
    Buffer vmbuffer = InvalidBuffer;
    ! BlockNumber all_visible_streak;

    pg_rusage_init(&ru0);

    --- 305,312 ----
    int i;
    PGRUsage ru0;
    Buffer vmbuffer = InvalidBuffer;
    ! BlockNumber next_not_all_visible_block;
    ! bool skipping_all_visible_blocks;

    pg_rusage_init(&ru0);

    *************** lazy_scan_heap(Relation onerel, LVRelSta
    *** 329,340 ****

    nblocks = RelationGetNumberOfBlocks(onerel);
    vacrelstats->rel_pages = nblocks;
    vacrelstats->nonempty_pages = 0;
    vacrelstats->latestRemovedXid = InvalidTransactionId;

    lazy_space_alloc(vacrelstats, nblocks);

    ! all_visible_streak = 0;
    for (blkno = 0; blkno < nblocks; blkno++)
    {
    Buffer buf;
    --- 324,369 ----

    nblocks = RelationGetNumberOfBlocks(onerel);
    vacrelstats->rel_pages = nblocks;
    + vacrelstats->scanned_pages = 0;
    vacrelstats->nonempty_pages = 0;
    vacrelstats->latestRemovedXid = InvalidTransactionId;

    lazy_space_alloc(vacrelstats, nblocks);

    ! /*
    ! * We want to skip pages that don't require vacuuming according to the
    ! * visibility map, but only when we can skip at least SKIP_PAGES_THRESHOLD
    ! * consecutive pages. Since we're reading sequentially, the OS should be
    ! * doing readahead for us, so there's no gain in skipping a page now and
    ! * then; that's likely to disable readahead and so be counterproductive.
    ! * Also, skipping even a single page means that we can't update
    ! * relfrozenxid, so we only want to do it if we can skip a goodly number
    ! * of pages.
    ! *
    ! * Before entering the main loop, establish the invariant that
    ! * next_not_all_visible_block is the next block number >= blkno that's
    ! * not all-visible according to the visibility map, or nblocks if there's
    ! * no such block. Also, we set up the skipping_all_visible_blocks flag,
    ! * which is needed because we need hysteresis in the decision: once we've
    ! * started skipping blocks, we may as well skip everything up to the next
    ! * not-all-visible block.
    ! *
    ! * Note: if scan_all is true, we won't actually skip any pages; but we
    ! * maintain next_not_all_visible_block anyway, so as to set up the
    ! * all_visible_according_to_vm flag correctly for each page.
    ! */
    ! for (next_not_all_visible_block = 0;
    ! next_not_all_visible_block < nblocks;
    ! next_not_all_visible_block++)
    ! {
    ! if (!visibilitymap_test(onerel, next_not_all_visible_block, &vmbuffer))
    ! break;
    ! }
    ! if (next_not_all_visible_block >= SKIP_PAGES_THRESHOLD)
    ! skipping_all_visible_blocks = true;
    ! else
    ! skipping_all_visible_blocks = false;
    !
    for (blkno = 0; blkno < nblocks; blkno++)
    {
    Buffer buf;
    *************** lazy_scan_heap(Relation onerel, LVRelSta
    *** 347,387 ****
    OffsetNumber frozen[MaxOffsetNumber];
    int nfrozen;
    Size freespace;
    ! bool all_visible_according_to_vm = false;
    bool all_visible;
    bool has_dead_tuples;

    ! /*
    ! * Skip pages that don't require vacuuming according to the visibility
    ! * map. But only if we've seen a streak of at least
    ! * SKIP_PAGES_THRESHOLD pages marked as clean. Since we're reading
    ! * sequentially, the OS should be doing readahead for us and there's
    ! * no gain in skipping a page now and then. You need a longer run of
    ! * consecutive skipped pages before it's worthwhile. Also, skipping
    ! * even a single page means that we can't update relfrozenxid or
    ! * reltuples, so we only want to do it if there's a good chance to
    ! * skip a goodly number of pages.
    ! */
    ! if (!scan_all)
    {
    ! all_visible_according_to_vm =
    ! visibilitymap_test(onerel, blkno, &vmbuffer);
    ! if (all_visible_according_to_vm)
    {
    ! all_visible_streak++;
    ! if (all_visible_streak >= SKIP_PAGES_THRESHOLD)
    ! {
    ! vacrelstats->scanned_all = false;
    ! continue;
    ! }
    }
    else
    ! all_visible_streak = 0;
    }

    vacuum_delay_point();

    ! scanned_pages++;

    /*
    * If we are close to overrunning the available space for dead-tuple
    --- 376,419 ----
    OffsetNumber frozen[MaxOffsetNumber];
    int nfrozen;
    Size freespace;
    ! bool all_visible_according_to_vm;
    bool all_visible;
    bool has_dead_tuples;

    ! if (blkno == next_not_all_visible_block)
    {
    ! /* Time to advance next_not_all_visible_block */
    ! for (next_not_all_visible_block++;
    ! next_not_all_visible_block < nblocks;
    ! next_not_all_visible_block++)
    {
    ! if (!visibilitymap_test(onerel, next_not_all_visible_block,
    ! &vmbuffer))
    ! break;
    }
    +
    + /*
    + * We know we can't skip the current block. But set up
    + * skipping_all_visible_blocks to do the right thing at the
    + * following blocks.
    + */
    + if (next_not_all_visible_block - blkno > SKIP_PAGES_THRESHOLD)
    + skipping_all_visible_blocks = true;
    else
    ! skipping_all_visible_blocks = false;
    ! all_visible_according_to_vm = false;
    ! }
    ! else
    ! {
    ! /* Current block is all-visible */
    ! if (skipping_all_visible_blocks && !scan_all)
    ! continue;
    ! all_visible_according_to_vm = true;
    }

    vacuum_delay_point();

    ! vacrelstats->scanned_pages++;

    /*
    * If we are close to overrunning the available space for dead-tuple
  • Cédric Villemain at May 29, 2011 at 8:41 am

    2011/5/29 Tom Lane <tgl@sss.pgh.pa.us>:
    Greg Stark <gsstark@mit.edu> writes:
    On Sat, May 28, 2011 at 12:01 PM, Tom Lane wrote:
    I also found that Greg was right in thinking that it would help if we
    tweaked lazy_scan_heap to not always scan the first
    SKIP_PAGES_THRESHOLD-1 pages even if they were
    all_visible_according_to_vm.
    You fixed the logic only for the first 32 pages which helps with the
    skew. But really the logic is backwards in general. Instead of
    counting how many missed opportunities for skipped pages we've seen in
    the past we should read the bits for the next 32 pages in advance and
    decide what to do before we read those pages.
    OK, do you like the attached version of that logic?  (Other fragments
    of the patch as before.)
    The idea was that removing even one page from the VACUUM will prevent
    the relfrozenxid update and the reltuples (and relpages) update.
    Now, I believe that once we've skipped at least one page thanks to
    SKIP_PAGES_THRESHOLD, we should be more aggressive and remove as
    many pages as possible from the VACUUM, thanks to the VM.



    --
    Cédric Villemain               2ndQuadrant
    http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
  • Tom Lane at May 29, 2011 at 2:40 pm

    Cédric Villemain <cedric.villemain.debian@gmail.com> writes:
    2011/5/29 Tom Lane <tgl@sss.pgh.pa.us>:
    OK, do you like the attached version of that logic?  (Other fragments
    of the patch as before.)
    The idea was that removing even one page from the VACUUM will prevent
    the relfrozenxid update and the reltuples (and relpages) update.
    Now, I believe that once we've skipped at least one page thanks to
    SKIP_PAGES_THRESHOLD, we should be more aggressive and remove as
    many pages as possible from the VACUUM, thanks to the VM.
    That would require proof, not just suggestion. Skipping pages will
    defeat the OS read-ahead algorithm, and so could easily cost more than
    reading them.

    regards, tom lane
  • Pavan Deolasee at May 29, 2011 at 3:49 pm

    On Sun, May 29, 2011 at 8:10 PM, Tom Lane wrote:
    Cédric Villemain <cedric.villemain.debian@gmail.com> writes:
    2011/5/29 Tom Lane <tgl@sss.pgh.pa.us>:
    OK, do you like the attached version of that logic?  (Other fragments
    of the patch as before.)
    The idea was that removing even one page from the VACUUM will prevent
    the relfrozenxid update and the reltuples (and relpages) update.
    Now, I believe that once we've skipped at least one page thanks to
    SKIP_PAGES_THRESHOLD, we should be more aggressive and remove as
    many pages as possible from the VACUUM, thanks to the VM.
    That would require proof, not just suggestion.  Skipping pages will
    defeat the OS read-ahead algorithm, and so could easily cost more than
    reading them.
    My worry is what we have right now is also based on just assumptions
    and gut feelings rather than any numbers.

    Thanks,
    Pavan

    --
    Pavan Deolasee
    EnterpriseDB     http://www.enterprisedb.com
  • Tom Lane at May 29, 2011 at 3:57 pm

    Pavan Deolasee writes:
    On Sun, May 29, 2011 at 8:10 PM, Tom Lane wrote:
    That would require proof, not just suggestion.  Skipping pages will
    defeat the OS read-ahead algorithm, and so could easily cost more than
    reading them.
    My worry is what we have right now is also based on just assumptions
    and gut feelings rather than any numbers.
    So go collect some numbers.

    regards, tom lane
  • Pavan Deolasee at May 29, 2011 at 4:27 pm

    On Sun, May 29, 2011 at 9:27 PM, Tom Lane wrote:
    Pavan Deolasee <pavan.deolasee@gmail.com> writes:
    On Sun, May 29, 2011 at 8:10 PM, Tom Lane wrote:
    That would require proof, not just suggestion.  Skipping pages will
    defeat the OS read-ahead algorithm, and so could easily cost more than
    reading them.
    My worry is what we have right now is also based on just assumptions
    and gut feelings rather than any numbers.
    So go collect some numbers.
    I am sorry if I sounded terse above. But my gripe is that sometimes we
    are too reluctant to listen to ideas and insist on producing some hard
    numbers first, which might take significant effort. But we are not
    equally strict when such changes are introduced initially. For
    example, in this particular case, the change was introduced after this
    discussion:

    http://archives.postgresql.org/pgsql-hackers/2008-12/msg01316.php

    Heikki suggested 20, and Simon proposed 32 to make it a power of 2. But
    why not 16? Twenty is closer to 16 than to 32. And Greg said yesterday
    that 8 is a better number, based on his testing.

    Maybe a performance build farm, as is being discussed, is the solution:
    somewhere we can throw simple patches and see whether something helps or
    not.

    Thanks,
    Pavan

    --
    Pavan Deolasee
    EnterpriseDB     http://www.enterprisedb.com
  • Tom Lane at May 29, 2011 at 5:06 pm

    Pavan Deolasee writes:
    I am sorry if I sounded terse above. But my gripe is that sometimes we
    are too reluctant to listen to ideas and insist on producing some hard
    numbers first, which might take significant effort. But we are not
    equally strict when such changes are introduced initially.
    The reason for not wanting to change it without some actual evidence
    is that there is already evidence: the code has been in the field with
    this setting since 8.4, and nobody's vacuum performance has fallen off a
    cliff. So while I'd agree that there was little testing done before the
    code went in, there is more than zero reason to leave it where it is.
    Without some positive evidence showing that another value is better,
    I'm disinclined to change it. I also think that you're not helping
    by complaining about the code without being willing to do some work
    to try to collect such evidence.
    Heikki suggested 20, and Simon proposed 32 to make it a power of 2. But
    why not 16? Twenty is closer to 16 than to 32. And Greg said yesterday
    that 8 is a better number, based on his testing.
    Greg said he had found that the read speed was the same for reading
    every page vs. reading every 8th page. That's not the same as concluding
    that 8 is the optimal skip distance for vacuum; or at least, he didn't
    say that's what he had concluded. VACUUM isn't just reading ...

    regards, tom lane
  • Pavan Deolasee at May 29, 2011 at 5:19 pm

    On Sun, May 29, 2011 at 10:35 PM, Tom Lane wrote:
    Pavan Deolasee <pavan.deolasee@gmail.com> writes:
    I am sorry if I sounded terse above. But my gripe is that sometimes we
    are too reluctant to listen to ideas and insist on producing some hard
    numbers first which might take significant efforts. But we are not
    equally strict when such changes are introduced initially.
    The reason for not wanting to change it without some actual evidence
    is that there is already evidence: the code has been in the field with
    this setting since 8.4, and nobody's vacuum performance has fallen off a
    cliff.
    Well, that's probably because there was definitely much improvement
    over what existed before. But that does not mean we can't make it
    better. IOW there are no complaints because there is no regression.
    So while I'd agree that there was little testing done before the
    code went in, there is more than zero reason to leave it where it is.
    Without some positive evidence showing that another value is better,
    I'm disinclined to change it.  I also think that you're not helping
    by complaining about the code without being willing to do some work
    to try to collect such evidence.
    I am not complaining about the code. I am suggesting we can be more
    receptive to ideas, especially when we know what we have today was not
    backed by any evidence either. I will do some tests anyway and post
    numbers when I work on the single-pass vacuum patch. I'll try to
    experiment with this stuff at that time.

    Thanks,
    Pavan

    --
    Pavan Deolasee
    EnterpriseDB     http://www.enterprisedb.com
  • Cédric Villemain at May 29, 2011 at 11:33 pm

    2011/5/29 Tom Lane <tgl@sss.pgh.pa.us>:
    Cédric Villemain <cedric.villemain.debian@gmail.com> writes:
    2011/5/29 Tom Lane <tgl@sss.pgh.pa.us>:
    OK, do you like the attached version of that logic?  (Other fragments
    of the patch as before.)
    The idea was that removing even one page from the VACUUM will prevent
    the relfrozenxid update and the reltuples (and relpages) update.
    Now, I believe that once we've skipped at least one page thanks to
    SKIP_PAGES_THRESHOLD, we should be more aggressive and remove as
    many pages as possible from the VACUUM, thanks to the VM.
    That would require proof, not just suggestion.  Skipping pages will
    defeat the OS read-ahead algorithm, and so could easily cost more than
    reading them.
    Correct, it needs proof. As an aside: I also learned that reading
    the first block of a file makes read-ahead start with its larger window
    from the beginning (the same one that posix_fadvise_sequential sets),
    so removing those initial reads might be counter-productive as well.
    But this is very hard to benchmark because read-ahead is also
    influenced by memory pressure, for example.

    In theory: 1. the readahead algorithm is a bit smarter and can cope
    with reads-with-holes (as long as the holes are not larger than the
    read-ahead window), and 2. if the holes are that large, then maybe it
    is not so good to keep a larger read-ahead window (which keeps
    trashing our buffer cache).
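
    (For readers who haven't met the hint being alluded to: the sequential
    read-ahead advice looks roughly like the standalone example below. It is
    shown only to illustrate the larger read-ahead window mentioned above and
    is not part of any patch in this thread.)

    #define _POSIX_C_SOURCE 200112L
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(int argc, char **argv)
    {
        int     fd;
        int     err;

        if (argc < 2)
            return 1;

        fd = open(argv[1], O_RDONLY);
        if (fd < 0)
            return 1;

        /* offset 0, length 0 means "the whole file" */
        err = posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
        if (err != 0)
            fprintf(stderr, "posix_fadvise failed: error %d\n", err);

        close(fd);
        return 0;
    }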




    --
    Cédric Villemain               2ndQuadrant
    http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
  • Tom Lane at May 26, 2011 at 8:37 pm

    Robert Haas writes:
    On Thu, May 26, 2011 at 12:23 PM, Tom Lane wrote:
    Another thought: Couldn't relation_needs_vacanalyze() just scale up
    reltuples by the ratio of the current number of pages in the relation
    to relpages, just as the query planner does?
    Hmm ... that would fix Florian's immediate issue, and it does seem like
    a good change on its own merits.  But it does nothing for the problem
    that we're failing to put the best available information into pg_class.

    Possibly we could compromise on doing just that much in the back
    branches, and the larger change for 9.1?
    Do you think we need to worry about the extra overhead of determining
    the current size of every relation as we sweep through pg_class? It's
    not a lot, but OTOH I think we'd be doing it once a minute... not sure
    what would happen if there were tons of tables.
    Ugh ... that is a mighty good point, since the RelationGetNumberOfBlocks
    call would have to happen for each table, even the ones we then decide
    not to vacuum. We've already seen people complain about the cost of the
    AV launcher once they have a lot of databases, and this would probably
    increase it quite a bit.
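
    (For reference, the scaling suggested upthread is essentially the
    extrapolation the planner already performs: keep the stored
    tuples-per-page density and apply it to the relation's current size. A
    standalone sketch, with invented names, might look like this; the
    overhead worried about above is that current_pages would have to be
    fetched for every relation examined:)

    static double
    scale_reltuples_to_current_size(double reltuples, double relpages,
                                    double current_pages)
    {
        if (relpages <= 0)
            return reltuples;       /* no stored density to extrapolate from */

        /* keep the stored tuples-per-page density, apply it to today's size */
        return reltuples * (current_pages / relpages);
    }
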
    Going back to your thought upthread, I think we should really consider
    replacing reltuples with reltupledensity at some point. I continue to
    be afraid that using a decaying average in this case is going to end
    up overweighting the values from some portion of the table that's
    getting scanned repeatedly, at the expense of other portions of the
    table that are not getting scanned at all.
    Changing the representation of the information would change that issue
    not in the slightest. The fundamental point here is that we have new,
    possibly partial, information which we ought to somehow merge with the
    old, also possibly partial, information. Storing the data a little bit
    differently doesn't magically eliminate that issue.
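
    (To picture that merge in the simplest possible terms, purely as an
    illustration of the idea and not the logic that was eventually
    committed:)

    static double
    merge_tuple_density(double old_density, double new_density,
                        double scanned_fraction)
    {
        if (scanned_fraction >= 1.0)
            return new_density;     /* full scan: trust the new figure entirely */

        return old_density * (1.0 - scanned_fraction)
             + new_density * scanned_fraction;
    }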

    regards, tom lane

Discussion Overview
group: pgsql-hackers
categories: postgresql
posted: May 25, '11 at 3:48p
active: May 29, '11 at 11:33p
posts: 44
users: 7
website: postgresql.org...
irc: #postgresql
