For historical reasons, what we now think of as VACUUM is referred to
in some portions of the code as "lazy vacuum", to distinguish it from
pre-9.0 VACUUM FULL. As I understand it, VACUUM works like this:

- Scan the relation, accumulating a list of tuples to kill.
- When you get to the end of the relation or when you fill up
maintenance_work_mem, scan each index and kill all index entries
pointing to those tuples.
- Scan the relation a second time and kill the heap tuples.
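To make the flow concrete, here is a compressed toy model of those three
steps (my own illustration, not PostgreSQL code): a small array stands in
for the heap, another for a single index, and a tiny cap stands in for
maintenance_work_mem.

    /* Toy model of the current two-pass ("lazy") VACUUM flow, not PostgreSQL
     * code: the heap is an array of tuples, the index is an array of
     * (key, tid) entries, and MAX_DEAD stands in for maintenance_work_mem. */
    #include <stdio.h>
    #include <stdbool.h>

    #define NTUPLES   16
    #define MAX_DEAD  4             /* stand-in for maintenance_work_mem */

    typedef struct { int key; bool dead; bool unused; } HeapTuple;
    typedef struct { int key; int tid; bool killed; } IndexEntry;

    static HeapTuple  heap[NTUPLES];
    static IndexEntry index_[NTUPLES];

    static void vacuum_indexes(const int *dead, int ndead)
    {
        /* Phase 2: scan the WHOLE index, killing entries for dead TIDs. */
        for (int i = 0; i < NTUPLES; i++)
            for (int j = 0; j < ndead; j++)
                if (index_[i].tid == dead[j])
                    index_[i].killed = true;
    }

    static void vacuum_heap(const int *dead, int ndead)
    {
        /* Phase 3: revisit the remembered dead TIDs and reclaim their slots. */
        for (int j = 0; j < ndead; j++)
            heap[dead[j]].unused = true;
    }

    int main(void)
    {
        for (int i = 0; i < NTUPLES; i++) {
            heap[i]   = (HeapTuple){ .key = i, .dead = (i % 3 == 0) };
            index_[i] = (IndexEntry){ .key = i, .tid = i };
        }

        int dead[MAX_DEAD], ndead = 0;

        /* Phase 1: scan the heap, remembering dead tuples; whenever the list
         * fills up, run the index pass and the second heap pass. */
        for (int tid = 0; tid < NTUPLES; tid++) {
            if (heap[tid].dead) {
                dead[ndead++] = tid;
                if (ndead == MAX_DEAD) {
                    vacuum_indexes(dead, ndead);
                    vacuum_heap(dead, ndead);
                    ndead = 0;
                }
            }
        }
        if (ndead > 0) {
            vacuum_indexes(dead, ndead);
            vacuum_heap(dead, ndead);
        }

        for (int i = 0; i < NTUPLES; i++)
            printf("tid %2d: heap %s, index entry %s\n", i,
                   heap[i].unused ? "reclaimed" : "kept",
                   index_[i].killed ? "killed" : "kept");
        return 0;
    }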

I'm wondering if there might be some benefit in having an even lazier
type of vacuum that makes only a single scan over the relation and
ignores the indexes. If it hits a tuple that could otherwise be
killed, it marks it LP_DEAD and defragments the page. If it hits a
page with only all-visible tuples, it marks the page PD_ALL_VISIBLE
and sets the visibility map bit. This would be significantly cheaper
than what we normally do right now because it wouldn't touch the
indexes at all, and it would only scan the heap once rather than
twice. The first saving is particularly significant for relations where
a high percentage of the visibility map bits are already set, because
we always scan every index in its entirety even if we only need to
kill a handful of tuples, whereas we use the visibility map to avoid
scanning portions of the heap where no dead tuples can exist. A further
advantage of this approach is that it is very easy to do incrementally
- for example, it'd be perfectly reasonable to scan a 128MB chunk of
the relation and then stop, expecting to do the rest later. That's a
lot less reasonable with our existing approach because you have to
scan all the indexes in their entirety every time. On the downside,
this approach doesn't get rid of any index tuples, nor does it allow
any CTIDs to be reclaimed. But it does reclaim most of the heap
space, and it allows index tuples to be reclaimed opportunistically.
Also, it gets PD_ALL_VISIBLE bits set, which makes scans cheaper and
reduces the cost of later vacuuming.
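For contrast, a matching toy model of the single-pass variant described
above (again my own illustration, with the all-visible check simplified to
"no dead tuples on the page"): one sweep that truncates dead tuples to
LP_DEAD stubs, leaves line pointers and indexes alone, and sets an
all-visible flag where nothing dead was found.

    /* Toy model of a single-pass "really lazy" vacuum: reclaim dead tuple
     * space but keep the line pointers (so index entries stay valid), and
     * mark fully-visible pages, without ever touching an index. */
    #include <stdio.h>
    #include <stdbool.h>

    #define NPAGES          4
    #define TUPLES_PER_PAGE 4

    typedef enum { LP_NORMAL, LP_DEAD } LpState;

    typedef struct {
        LpState lp[TUPLES_PER_PAGE];
        bool    dead[TUPLES_PER_PAGE]; /* toy visibility: dead or visible to all */
        bool    all_visible;           /* stand-in for PD_ALL_VISIBLE / VM bit */
    } Page;

    static Page pages[NPAGES];

    int main(void)
    {
        pages[1].dead[2] = true;       /* one dead tuple on page 1 */
        pages[3].dead[0] = true;       /* two on page 3 */
        pages[3].dead[3] = true;

        for (int p = 0; p < NPAGES; p++) {
            bool any_dead = false;
            for (int i = 0; i < TUPLES_PER_PAGE; i++) {
                if (pages[p].dead[i]) {
                    pages[p].lp[i] = LP_DEAD;  /* reclaim space, keep pointer */
                    any_dead = true;
                }
            }
            /* page defragmentation would happen here; tuple storage is
             * implicit in this model, so only the flags are shown */
            if (!any_dead)
                pages[p].all_visible = true;
        }

        for (int p = 0; p < NPAGES; p++)
            printf("page %d: all_visible=%d, dead line pointers=%d%d%d%d\n", p,
                   pages[p].all_visible,
                   pages[p].lp[0] == LP_DEAD, pages[p].lp[1] == LP_DEAD,
                   pages[p].lp[2] == LP_DEAD, pages[p].lp[3] == LP_DEAD);
        return 0;
    }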

I'm not quite sure how we'd decide whether to do a "really lazy"
vacuum or the kind we do now. The case where this approach wins big
is when there are few or no dead tuples. In that case, we do a lot of
work looking at the indexes and we don't get much out of it; plus we
scan the heap twice instead of just once. If there are a lot of dead
tuples, then we have to bite the bullet and do the whole thing. It'd
be really nice to find a way to avoid needing to scan the entire index
to reclaim the dead tuples, but unless we're willing to assume that we
can always refind the relevant index tuples based on the heap tuple
(an assumption I believe we have not been willing to make in the
past), it doesn't work - and even if we did make that assumption, it's
not going to be ideal when cleaning out large numbers of index tuples,
because the index I/O will be random rather than sequential as it is
currently.
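Just to make the kind of policy I have in mind concrete, a hypothetical
sketch (the names and the threshold are invented, not a worked-out
proposal):

    /* Hypothetical policy sketch: prefer the "really lazy" single-pass
     * vacuum when the dead-tuple count is small relative to the index size,
     * since a full pass over every index would then cost far more than it
     * reclaims. */
    #include <stdio.h>
    #include <stdbool.h>

    static bool use_really_lazy_vacuum(long dead_tuples, long index_pages)
    {
        /* made-up cutoff: roughly "less than one dead tuple per index page" */
        return dead_tuples < index_pages;
    }

    int main(void)
    {
        printf("100 dead, 10000 index pages -> %s\n",
               use_really_lazy_vacuum(100, 10000) ? "really lazy" : "full vacuum");
        printf("500000 dead, 10000 index pages -> %s\n",
               use_really_lazy_vacuum(500000, 10000) ? "really lazy" : "full vacuum");
        return 0;
    }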

Thoughts? Does this sound at all feasible/useful? Any ideas on how to tune it?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

  • Tom Lane at Mar 14, 2011 at 8:18 pm

    Robert Haas writes:
    I'm not quite sure how we'd decide whether to do a "really lazy"
    vacuum or the kind we do now. The case where this approach wins big
    is when there are few or no dead tuples. In that case, we do a lot of
    work looking at the indexes and we don't get much out of it; plus we
    scan the heap twice instead of just once.
    Um, if there are *no* dead tuples then we don't look at the indexes
    anyway, except for the vacuum-cleanup pass which I don't think you get
    to decide you don't need. (You certainly don't get to decide that
    unilaterally without the index AM's cooperation.) I'm less than
    convinced that there's much gold to be mined here.

    regards, tom lane
  • Robert Haas at Mar 14, 2011 at 8:33 pm

    On Mon, Mar 14, 2011 at 4:18 PM, Tom Lane wrote:
    Robert Haas <robertmhaas@gmail.com> writes:
    I'm not quite sure how we'd decide whether to do a "really lazy"
    vacuum or the kind we do now.  The case where this approach wins big
    is when there are few or no dead tuples.  In that case, we do a lot of
    work looking at the indexes and we don't get much out of it; plus we
    scan the heap twice instead of just once.
    Um, if there are *no* dead tuples then we don't look at the indexes
    anyway ...
    But you do still have to scan the heap twice.
    except for the vacuum-cleanup pass which I don't think you get
    to decide you don't need.  (You certainly don't get to decide that
    unilaterally without the index AM's cooperation.)  I'm less than
    convinced that there's much gold to be mined here.
    I'm not sure about that either, although I'm not sure of the reverse
    either. But before I invest any time in it, do you have any other
    good ideas for addressing the "it stinks to scan the entire index
    every time we vacuum" problem? Or for generally making vacuum
    cheaper?

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Greg Stark at Mar 14, 2011 at 11:41 pm

    On Mon, Mar 14, 2011 at 8:33 PM, Robert Haas wrote:
    I'm not sure about that either, although I'm not sure of the reverse
    either.  But before I invest any time in it, do you have any other
    good ideas for addressing the "it stinks to scan the entire index
    every time we vacuum" problem?  Or for generally making vacuum
    cheaper?
    You could imagine an index am that instead of scanning the index just
    accumulated all the dead tuples in a hash table and checked it before
    following any index link. Whenever the hash table gets too big it
    could do a sequential scan and prune any pointers to those tuples and
    start a new hash table.

    That would work well if there are frequent vacuums finding a few
    tuples per vacuum. It might even allow us to absorb dead tuples from
    "retail" vacuums so we could get rid of line pointers earlier. But it
    would involve more WAL-logged operations and incur an extra overhead
    on each index lookup.
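
    A stand-alone model of that idea (an illustration only, not an actual
    index AM; every name and threshold is invented): a small open-addressed
    hash set holds the reported dead TIDs, lookups consult it before
    following an index pointer, and once it grows past a limit the index is
    swept and the set is reset.

        #include <stdio.h>
        #include <stdbool.h>
        #include <string.h>

        #define NENTRIES 32
        #define TABSIZE  64        /* open-addressed set of dead TIDs */
        #define SWEEP_AT 8         /* sweep the index when this many TIDs queue up */

        typedef struct { int key; int tid; bool present; } IndexEntry;

        static IndexEntry index_[NENTRIES];
        static int dead_tab[TABSIZE];   /* 0 = empty slot, otherwise tid + 1 */
        static int dead_count;

        static unsigned slot_for(int tid) { return (unsigned)(tid * 2654435761u) % TABSIZE; }

        static bool dead_contains(int tid)
        {
            for (unsigned h = slot_for(tid); ; h = (h + 1) % TABSIZE) {
                if (dead_tab[h] == 0)       return false;
                if (dead_tab[h] == tid + 1) return true;
            }
        }

        static void sweep_index(void)
        {
            /* the occasional full pass: drop entries pointing at queued dead
             * TIDs, then start a fresh (empty) set */
            for (int i = 0; i < NENTRIES; i++)
                if (index_[i].present && dead_contains(index_[i].tid))
                    index_[i].present = false;
            memset(dead_tab, 0, sizeof(dead_tab));
            dead_count = 0;
        }

        static void report_dead_tid(int tid)
        {
            unsigned h = slot_for(tid);
            while (dead_tab[h] != 0 && dead_tab[h] != tid + 1)
                h = (h + 1) % TABSIZE;
            if (dead_tab[h] == 0) {
                dead_tab[h] = tid + 1;
                dead_count++;
            }
            if (dead_count >= SWEEP_AT)
                sweep_index();
        }

        static void lookup(int key)
        {
            /* every lookup consults the dead-TID set before following the pointer */
            for (int i = 0; i < NENTRIES; i++)
                if (index_[i].present && index_[i].key == key && !dead_contains(index_[i].tid))
                    printf("key %d -> heap tid %d\n", key, index_[i].tid);
        }

        int main(void)
        {
            for (int i = 0; i < NENTRIES; i++)
                index_[i] = (IndexEntry){ .key = i, .tid = i, .present = true };

            for (int tid = 0; tid < 10; tid++)   /* vacuum reports some dead tuples */
                report_dead_tid(tid);

            lookup(3);    /* prints nothing: entry already removed by the sweep */
            lookup(8);    /* prints nothing: still in the index, filtered by the set */
            lookup(20);   /* still returned */
            return 0;
        }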




    --
    greg
  • Robert Haas at Mar 15, 2011 at 3:13 am

    On Mon, Mar 14, 2011 at 7:40 PM, Greg Stark wrote:
    On Mon, Mar 14, 2011 at 8:33 PM, Robert Haas wrote:
    I'm not sure about that either, although I'm not sure of the reverse
    either.  But before I invest any time in it, do you have any other
    good ideas for addressing the "it stinks to scan the entire index
    every time we vacuum" problem?  Or for generally making vacuum
    cheaper?
    You could imagine an index am that instead of scanning the index just
    accumulated all the dead tuples in a hash table and checked it before
    following any index link. Whenever the hash table gets too big it
    could do a sequential scan and prune any pointers to those tuples and
    start a new hash table.
    Hmm. For something like a btree, you could also remove each TID from
    the hash table when you kill the corresponding index tuple.
    That would work well if there are frequent vacuums finding a few
    tuples per vacuum. It might even allow us to absorb dead tuples from
    "retail" vacuums so we could get rid of line pointers earlier.  But it
    would involve more WAL-logged operations and incur an extra overhead
    on each index lookup.
    Yeah, that seems deeply unfortunate. It's hard to imagine us wanting
    to go there.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Tom Lane at Mar 15, 2011 at 12:00 am

    Robert Haas writes:
    On Mon, Mar 14, 2011 at 4:18 PM, Tom Lane wrote:
    Um, if there are *no* dead tuples then we don't look at the indexes
    anyway ...
    But you do still have to scan the heap twice.
    Seems like that should be fixable ... is the second pass actually
    going to do anything?

    regards, tom lane
  • Robert Haas at Mar 15, 2011 at 5:02 am

    On Mon, Mar 14, 2011 at 8:00 PM, Tom Lane wrote:
    Robert Haas <robertmhaas@gmail.com> writes:
    On Mon, Mar 14, 2011 at 4:18 PM, Tom Lane wrote:
    Um, if there are *no* dead tuples then we don't look at the indexes
    anyway ...
    But you do still have to scan the heap twice.
    Seems like that should be fixable ... is the second pass actually
    going to do anything?
    Err, I'm wrong. There's no second pass if there are exactly zero
    tuples to prune. So the possible problem case is when there are some
    dead tuples, but not all that many compared to the size of the
    indexes. In theory, you could be better off just reclaiming free
    space, and waiting for dead tuples to accumulate before cleaning the
    indexes. I'm not certain whether that target is wide enough to be
    worth aiming at, but it seems plausible that at least in some
    workloads the opportunistic index cleaning stuff would keep index
    bloat under control, so the only reason to ever make a full scan of
    the indexes during vacuuming would be to avoid leaking CTIDs.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Jim Nasby at Mar 16, 2011 at 10:36 pm

    On Mar 14, 2011, at 2:36 PM, Robert Haas wrote:
    I'm not quite sure how we'd decide whether to do a "really lazy"
    vacuum or the kind we do now. The case where this approach wins big
    is when there are few or no dead tuples. In that case, we do a lot of
    work looking at the indexes and we don't get much out of it; plus we
    scan the heap twice instead of just once. If there are a lot of dead
    tuples, then we have to bite the bullet and do the whole thing. <snip>
    Thoughts? Does this sound at all feasible/useful? Any ideas on how to tune it?
    One way to look at this is that any system will have a limit on how quickly it can vacuum everything. If it's having trouble dedicating enough IO to vacuum, then autovac is going to have a long list of tables that it wants to vacuum. When you're in that situation, you want to get to the next table that needs vacuuming as quickly as possible, so if you've run through the first heap scan and found only a limited number of dead tuples, it doesn't make sense to spend a bunch of time scanning indexes and making a second heap scan (though, IIRC the second scan doesn't hit the entire heap; it only hits the tuples that were remembered as being dead).

    Of course, going along the lines of an autovac-based tuning mechanism, you have to question how a table would show up for autovac if there's not actually a number of dead tuples. One scenario is freezing (though I'm not sure if your super-lazy vacuum could freeze tuples or not). Another is inserts. That might become a big win; you might want to aggressively scan a table that gets data loaded into it in order to set hint/all visible bits.

    From a manual standpoint, ISTM that super-lazy vac would be extremely useful for dealing with hint bits after a bulk insert to a table that also has some update activity. Using a regular vacuum in that case would result in a lot of extra work to deal with the small number of dead tuples.

    Perhaps it would be useful to write a script that analyzed the output of vacuum verbose looking for tables where a super-lazy vacuum would have made sense (assuming vacuum verbose provides the needed info). If we had such a script we could ask folks to run it and see how much super-lazy vacuuming would help in the real world.
    --
    Jim C. Nasby, Database Architect jim@nasby.net
    512.569.9461 (cell) http://jim.nasby.net
  • Robert Haas at Mar 17, 2011 at 12:44 am

    On Wed, Mar 16, 2011 at 6:36 PM, Jim Nasby wrote:
    One way to look at this is that any system will have a limit on how quickly it can vacuum everything. If it's having trouble dedicating enough IO to vacuum, then autovac is going to have a long list of tables that it wants to vacuum. When you're in that situation, you want to get to the next table that needs vacuuming as quickly as possible, so if you've run through the first heap scan and found only a limited number of dead tuples, it doesn't make sense to spend a bunch of time scanning indexes and making a second heap scan (though, IIRC the second scan doesn't hit the entire heap; it only hits the tuples that were remembered as being dead).
    I mostly agree with this, but you also can't postpone vacuuming
    indefinitely just because you're too busy; that's going to blow up in
    your face.
    Of course, going along the lines of an autovac-based tuning mechanism, you have to question how a table would show up for autovac if there's not actually a number of dead tuples. One scenario is freezing (though I'm not sure if your super-lazy vacuum could freeze tuples or not). Another is inserts. That might become a big win; you might want to aggressively scan a table that gets data loaded into it in order to set hint/all visible bits.
    Right. Really-lazy vacuum could freeze tuples. Unlike regular
    vacuum, it can also sensibly be done incrementally. One thing I was
    thinking about is counting the number of times that we fetched a tuple
    that was older than RecentGlobalXmin and had a committed xmin and an
    invalid xmax, but where the page was not PD_ALL_VISIBLE. If that's
    happening a lot, it probably means that some vacuuming would speed
    things up, by getting those PD_ALL_VISIBLE bits set. Perhaps you
    could work out some formula where you do a variable amount of
    super-lazy vacuuming depending on the number of such tuple fetches.
    The trick would be to avoid overdoing it (so that you swamp the I/O
    system) or underdoing it (so that the system never converges). It
    would be really nice (for this and for other things) if we had some
    way of measuring the I/O saturation of the system, so that we could
    automatically adjust the aggressiveness of background processes
    accordingly.
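
    A rough sketch of how that feedback loop might look (purely hypothetical
    names and constants, just to make the idea concrete): the executor bumps
    a counter whenever it pays a visibility check that PD_ALL_VISIBLE would
    have avoided, and the counter is periodically converted into a bounded
    amount of super-lazy work.

        /* Hypothetical feedback loop: turn "visibility checks that
         * PD_ALL_VISIBLE would have avoided" into a bounded number of pages
         * to super-lazy-vacuum. */
        #include <stdio.h>

        #define MIN_PAGES_PER_ROUND 0      /* may decide to do nothing */
        #define MAX_PAGES_PER_ROUND 1024   /* cap so we never swamp the I/O system */
        #define FETCHES_PER_PAGE    50     /* made-up conversion factor */

        static long wasted_visibility_checks;

        /* Called (in this sketch) each time we fetch a tuple that is older
         * than RecentGlobalXmin, has a committed xmin and an invalid xmax,
         * but whose page is not marked PD_ALL_VISIBLE. */
        static void note_wasted_visibility_check(void)
        {
            wasted_visibility_checks++;
        }

        /* Called periodically: decide how much super-lazy vacuuming to do. */
        static long pages_to_vacuum_this_round(void)
        {
            long pages = wasted_visibility_checks / FETCHES_PER_PAGE;

            if (pages < MIN_PAGES_PER_ROUND) pages = MIN_PAGES_PER_ROUND;
            if (pages > MAX_PAGES_PER_ROUND) pages = MAX_PAGES_PER_ROUND;
            wasted_visibility_checks = 0;   /* start a new observation window */
            return pages;
        }

        int main(void)
        {
            for (int i = 0; i < 100000; i++)
                note_wasted_visibility_check();
            printf("super-lazy vacuum %ld pages this round\n",
                   pages_to_vacuum_this_round());
            return 0;
        }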

    Note also that if and when we get index-only scans, making sure the
    PD_ALL_VISIBLE bits (and thus the visibility map bits) actually get
    set is going to be a lot more important.
    From a manual standpoint, ISTM that super-lazy vac would be extremely useful for dealing with hint bits after a bulk insert to a table that also has some update activity. Using a regular vacuum in that case would result in a lot of extra work to deal with the small number of dead tuples.
    I can see that.
    Perhaps it would be useful to write a script that analyzed the output of vacuum verbose looking for tables where a super-lazy vacuum would have made sense (assuming vacuum verbose provides the needed info). If we had such a script we could ask folks to run it and see how much super-lazy vacuuming would help in the real world.
    I'm a bit doubtful about this part.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Jesper Krogh at Mar 17, 2011 at 8:18 am

    Robert Haas wrote:
    Right. Really-lazy vacuum could freeze tuples. Unlike regular
    vacuum, it can also sensibly be done incrementally. One thing I was
    thinking about is counting the number of times that we fetched a tuple
    that was older than RecentGlobalXmin and had a committed xmin and an
    invalid xmax, but where the page was not PD_ALL_VISIBLE. If that's
    happening a lot, it probably means that some vacuuming would speed
    things up, by getting those PD_ALL_VISIBLE bits set. Perhaps you
    could work out some formula where you do a variable amount of
    super-lazy vacuuming depending on the number of such tuple fetches.
    The trick would be to avoid overdoing it (so that you swamp the I/O
    system) or underdoing it (so that the system never converges). It
    would be really nice (for this and for other things) if we had some
    way of measuring the I/O saturation of the system, so that we could
    automatically adjust the aggressiveness of background processes
    accordingly.

    Note also that if and when we get index-only scans, making sure the
    PD_ALL_VISIBLE bits (and thus the visibility map bits) actually get
    set is going to be a lot more important.
    Is it obvious that the visibility map bits should track complete
    pages and not individual tuples? If the visibility map tracks at
    page level, the benefit falls off on "slim tables" where you squeeze
    200 tuples into each page, and an update rate of 1% would
    lower the likelihood of a set bit even more. (It may be that for
    slim tables index-only scans are not as beneficial as for wide tables.)

    On a related vacuuming note, I don't know if it exists
    already, but how about "opportunistic vacuuming"? Say
    you have a page that, due to changes in one of its tuples, is
    being written out anyway: could the other tuples on the page
    get vacuumed at the same time?

    It doesn't actually have to hook into that process directly to benefit
    I/O usage; if it just gets the opportunity to do the work before
    the page is evicted from the OS cache, it would save a
    second read of that page, but it seems much harder to do something
    sane around that assumption.

    Wouldn't really lazy vacuums "only" benefit really static tables, where
    vacuuming is not that big a problem in the first place?


    --
    Jesper - Demonstrating a total lack of insight, I would assume.
  • Robert Haas at Mar 17, 2011 at 2:02 pm

    On Thu, Mar 17, 2011 at 4:17 AM, Jesper Krogh wrote:
    Is it obvious that the visibility map bits should track complete
    pages and not individual tuples? If the visibility map tracks at
    page level, the benefit falls off on "slim tables" where you squeeze
    200 tuples into each page, and an update rate of 1% would
    lower the likelihood of a set bit even more. (It may be that for
    slim tables index-only scans are not as beneficial as for wide tables.)
    I'm not sure exactly what MaxHeapTuplesPerPage works out to be, but
    say it's 200. If you track visibility info per tuple rather than per
    page, then the size of the visibility map is going to expand by a
    factor of 200. That might decrease contention, but otherwise it's a
    bad thing - the whole point of having the visibility map in the first
    place is that it's much, much smaller than the heap. If it were the
    same size as the heap, we could just read the heap. What the map
    attempts to accomplish is to allow us, by reading a small number of
    pages, to check whether the tuples we're thinking of reading are
    likely to be all-visible without actually looking at them.
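
    To put rough numbers on that, a small stand-alone calculation (my own
    illustration, assuming 8KB pages and 200 tuple slots per page):

        /* Rough size comparison of a per-page vs. per-tuple visibility map,
         * assuming 8KB pages and 200 tuple slots per page. */
        #include <stdio.h>

        int main(void)
        {
            const double rel_bytes      = 10.0 * 1024 * 1024 * 1024; /* 10GB table */
            const double page_bytes     = 8192;
            const double slots_per_page = 200;

            double npages        = rel_bytes / page_bytes;
            double per_page_map  = npages / 8;                    /* 1 bit per page */
            double per_tuple_map = npages * slots_per_page / 8;   /* 1 bit per slot */

            printf("pages: %.0f\n", npages);
            printf("per-page map:  %.0f KB\n", per_page_map / 1024);
            printf("per-tuple map: %.0f KB (%.0fx larger)\n",
                   per_tuple_map / 1024, per_tuple_map / per_page_map);
            return 0;
        }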
    On a related vacuuming note, I don't know if it exists
    already, but how about "opportunistic vacuuming"? Say
    you have a page that, due to changes in one of its tuples, is
    being written out anyway: could the other tuples on the page
    get vacuumed at the same time?
    The really lazy kind of vacuuming I'm talking about could be done this
    way. Regular vacuuming cannot, because you can't actually prune a
    dead tuple until you've scanned all the indexes for references to that
    CTID. The obvious place to do this would be the background writer: if
    the page is dirty anyway and we're about to evict it, we could decide
    to (1) set hint bits, (2) set visibility map bits, (3) freeze tuples
    that need freezing, and (4) identify dead tuples and reclaim the space
    they use, but not the associated line pointer.

    Sadly, I'm not sure this would help much. If we have, say, a 4MB
    relation, you're not even going to notice it when vacuum comes along
    and does its thing. Even a vacuum freeze is chump change. The
    problem is with big relations, like say 1GB+. Processing the whole
    table at once can have a material impact on system performance, so
    it'd be nice to do some work incrementally. But it's likely that
    doing it opportunistically as you evict things from shared buffers is
    only going to help here and there. Even if you optimistically assumed
    that we could opportunistically do 10% of the vacuuming that way,
    that's still not much of a dent. And I think it'd probably be less
    than that in most real-world cases. A further problem is that the
    background writer is already a pretty busy process, and giving it more
    things to do isn't very appealing from the standpoint of overall
    system performance.
    It doesn't actually have to hook into that process directly to benefit
    I/O usage; if it just gets the opportunity to do the work before
    the page is evicted from the OS cache, it would save a
    second read of that page, but it seems much harder to do something
    sane around that assumption.
    Yeah.
    Wouldn't really lazy vacuums "only" benefit really static tables, where
    vacuuming is not that big a problem in the first place?
    I think the benefit would be tables that are either quite large (where
    the ability to do this incrementally would be an advantage over
    regular VACUUM) or insert-only (where we currently have no way to get
    PD_ALL_VISIBLE bits set without user intervention).

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Jesper Krogh at Mar 17, 2011 at 8:03 pm

    On 2011-03-17 15:02, Robert Haas wrote:
    On Thu, Mar 17, 2011 at 4:17 AM, Jesper Krogh wrote:
    Is it obvious that the visibility map bits should track complete
    pages and not individual tuples? If the visibility map tracks at
    page level, the benefit falls off on "slim tables" where you squeeze
    200 tuples into each page, and an update rate of 1% would
    lower the likelihood of a set bit even more. (It may be that for
    slim tables index-only scans are not as beneficial as for wide tables.)
    I'm not sure exactly what MaxHeapTuplesPerPage works out to be, but
    say it's 200. If you track visibility info per tuple rather than per
    page, then the size of the visibility map is going to expand by a
    factor of 200. That might decrease contention, but otherwise it's a
    bad thing - the whole point of having the visibility map in the first
    place is that it's much, much smaller than the heap. If it were the
    same size as the heap, we could just read the heap. What the map
    attempts to accomplish is to allow us, by reading a small number of
    pages, to check whether the tuples we're thinking of reading are
    likely to be all-visible without actually looking at them.
    Yes, that was sort of the math I was trying to do. I believe
    you have a much better feeling for it than I do. But according
    to this:
    http://wiki.postgresql.org/wiki/FAQ#How_much_database_disk_space_is_required_to_store_data_from_a_typical_text_file.3F

    The bulk per-row overhead is around 24 bytes, which with 1 bit per row
    gives a size reduction of 1:(24x8) ~ 1:192, worst case. That gives at
    best 341 tuples/page (where each tuple contains no data at all). At that
    ratio, the visibility map of a 10GB relation would fill 52MB on disk
    (still worst case), and that by itself would by all means be awesome.
    (With tuples that small, a 10GB relation would hold roughly 400 million
    tuples.)

    With 1 bit per page, the best case would be 341 times better than the
    above, reducing the size of the visibility map on a 10GB table to around
    152KB, which is extremely small (and thus also awesome). But the
    consequence of a single update would be losing the visibility map
    benefit for 341 tuples in one shot.

    The worst case is where we approach 4 tuples per page, just before we
    hit TOAST, where the ratio of space reduction for 1 bit per tuple would
    be 1:(2048x8) ~ 1:16384, and 1 bit per page is only 4 times better.
    With 1 bit per tuple, the visibility map of a 10GB relation would be
    around 610KB; 1 bit per page would then drop it to around 160KB.


    Can we drag out some average-case numbers on row-size in the heap
    from some real production systems?

    I may have gotten something hugely wrong in the above calculations
    and/or missed some important points.

    --
    Jesper
  • Robert Haas at Mar 17, 2011 at 9:23 pm

    On Thu, Mar 17, 2011 at 4:02 PM, Jesper Krogh wrote:
    With 1 bit per page, the best case would be 341 times better than the
    above, reducing the size of the visibility map on a 10GB table to around
    152KB, which is extremely small (and thus also awesome). But the
    consequence of a single update would be losing the visibility map
    benefit for 341 tuples in one shot.
    True, but I'm not really sure that matters very much. Keep in mind
    also that it would increase the frequency with which visibility map bits
    would need to be flipped, which would carry its own costs.
    The worst case is where we approach 4 tuples per page, just before we
    hit TOAST, where the ratio of space reduction for 1 bit per tuple would
    be 1:(2048x8) ~ 1:16384, and 1 bit per page is only 4 times better.
    With 1 bit per tuple, the visibility map of a 10GB relation would be
    around 610KB; 1 bit per page would then drop it to around 160KB.


    Can we drag out some average-case numbers on row-size in the heap
    from some real production systems?
    Well, unless you went to some significantly more complicated data
    structure, you'd need to allow room for the maximum number of tuples
    per page on every page, whether the slots were all in use or not.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Cédric Villemain at Mar 17, 2011 at 1:26 pm

    2011/3/17 Robert Haas <robertmhaas@gmail.com>:
    On Wed, Mar 16, 2011 at 6:36 PM, Jim Nasby wrote:
    One way to look at this is that any system will have a limit on how quickly it can vacuum everything. If it's having trouble dedicating enough IO to vacuum, then autovac is going to have a long list of tables that it wants to vacuum. When you're in that situation, you want to get to the next table that needs vacuuming as quickly as possible, so if you've run through the first heap scan and found only a limited number of dead tuples, it doesn't make sense to spend a bunch of time scanning indexes and making a second heap scan (though, IIRC the second scan doesn't hit the entire heap; it only hits the tuples that were remembered as being dead).
    I mostly agree with this, but you also can't postpone vacuuming
    indefinitely just because you're too busy; that's going to blow up in
    your face.
    Of course, going along the lines of an autovac-based tuning mechanism, you have to question how a table would show up for autovac if there's not actually a number of dead tuples. One scenario is freezing (though I'm not sure if your super-lazy vacuum could freeze tuples or not). Another is inserts. That might become a big win; you might want to aggressively scan a table that gets data loaded into it in order to set hint/all visible bits.
    Right.  Really-lazy vacuum could freeze tuples.  Unlike regular
    vacuum, it can also sensibly be done incrementally.  One thing I was
    thinking about is counting the number of times that we fetched a tuple
    that was older than RecentGlobalXmin and had a committed xmin and an
    invalid xmax, but where the page was not PD_ALL_VISIBLE.  If that's
    happening a lot, it probably means that some vacuuming would speed
    things up, by getting those PD_ALL_VISIBLE bits set.  Perhaps you
    could work out some formula where you do a variable amount of
    super-lazy vacuuming depending on the number of such tuple fetches.
    The trick would be to avoid overdoing it (so that you swamp the I/O
    system) or underdoing it (so that the system never converges).  It
    would be really nice (for this and for other things) if we had some
    way of measuring the I/O saturation of the system, so that we could
    automatically adjust the aggressiveness of background processes
    Yes. I am thinking of something like that (the IO saturation
    measurement) to let the background writer try to work on hint bits when
    it does not have much else to do, if IO resources are OK.
    Note also that if and when we get index-only scans, making sure the
    PD_ALL_VISIBLE bits (and thus the visibility map bits) actually get
    set is going to be a lot more important.
    From a manual standpoint, ISTM that super-lazy vac would be extremely useful for dealing with hint bits after a bulk insert to a table that also has some update activity. Using a regular vacuum in that case would result in a lot of extra work to deal with the small number of dead tuples.
    I can see that.
    Perhaps it would be useful to write a script that analyzed the output of vacuum verbose looking for tables where a super-lazy vacuum would have made sense (assuming vacuum verbose provides the needed info). If we had such a script we could ask folks to run it and see how much super-lazy vacuuming would help in the real world.
    I'm a bit doubtful about this part.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company



    --
    Cédric Villemain               2ndQuadrant
    http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
  • Jim Nasby at Mar 21, 2011 at 8:59 pm

    On Mar 16, 2011, at 7:44 PM, Robert Haas wrote:
    It
    would be really nice (for this and for other things) if we had some
    way of measuring the I/O saturation of the system, so that we could
    automatically adjust the aggressiveness of background processes
    accordingly.
    Has anyone looked at the overhead of measuring how long IO requests to the kernel take? If we did that not only could we get an idea of what our IO workload looked like, we could also figure out whether a block came out of cache or not. That information could potentially be useful to the planner, but even if the database couldn't use that knowledge itself it would be a damn useful statistic to have... IMHO, far more useful than our current hit rate statistics.
    --
    Jim C. Nasby, Database Architect jim@nasby.net
    512.569.9461 (cell) http://jim.nasby.net
  • Greg Stark at Mar 21, 2011 at 11:37 pm

    On Mon, Mar 21, 2011 at 6:08 PM, Jim Nasby wrote:
    Has anyone looked at the overhead of measuring how long IO requests to the kernel take? If we did that not only could we get an idea of what our IO workload looked like, we could also figure out whether a block came out of cache or not. That information could potentially be useful to the planner, but even if the database couldn't use that knowledge itself it would be a damn useful statistic to have... IMHO, far more useful than our current hit rate statistics.
    I've done this -- actually better, I used mincore to actually check
    whether the block was in cache before issuing the read -- but it turns
    out you can't get what you're looking for this way.

    It turns out when you do this you see one block being read from disk
    followed by n blocks that all appear to be cache hits. Because they've
    been prefetched by the kernel.

    What you end up with is actually something like the number of iops
    which is also an interesting measure but not really what you were
    looking for.
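
    For reference, a stand-alone illustration of the mmap() + mincore()
    approach on Linux (my own example, not the patch being described): it
    reports how many pages of a file are already resident in the OS cache
    before any reads are issued.

        /* Check which pages of a file are resident in the OS page cache
         * using mmap() + mincore(), before actually reading anything. */
        #define _DEFAULT_SOURCE          /* for mincore() on glibc */
        #include <stdio.h>
        #include <stdlib.h>
        #include <fcntl.h>
        #include <unistd.h>
        #include <sys/mman.h>
        #include <sys/stat.h>

        int main(int argc, char **argv)
        {
            if (argc != 2) {
                fprintf(stderr, "usage: %s <file>\n", argv[0]);
                return 1;
            }

            int fd = open(argv[1], O_RDONLY);
            if (fd < 0) { perror("open"); return 1; }

            struct stat st;
            if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }
            if (st.st_size == 0) { printf("empty file\n"); return 0; }

            void *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
            if (map == MAP_FAILED) { perror("mmap"); return 1; }

            long pagesize = sysconf(_SC_PAGESIZE);
            size_t npages = (st.st_size + pagesize - 1) / pagesize;
            unsigned char *vec = malloc(npages);

            if (mincore(map, st.st_size, vec) < 0) { perror("mincore"); return 1; }

            size_t resident = 0;
            for (size_t i = 0; i < npages; i++)
                if (vec[i] & 1)          /* low bit set = page is in cache */
                    resident++;

            printf("%zu of %zu pages resident in cache\n", resident, npages);

            free(vec);
            munmap(map, st.st_size);
            close(fd);
            return 0;
        }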

    My getrusage patch, which I should still dig out though it's rather
    too late to be committing now unless someone tells me otherwise, would
    tell you how much i/o a plan node actually did. But you won't know
    which blocks did the i/o since I was only tracking totals for the plan
    node. That's probably what you're looking for here.


    --
    greg
  • Cédric Villemain at Mar 22, 2011 at 4:50 pm

    2011/3/22 Greg Stark <gsstark@mit.edu>:
    On Mon, Mar 21, 2011 at 6:08 PM, Jim Nasby wrote:
    Has anyone looked at the overhead of measuring how long IO requests to the kernel take? If we did that not only could we get an idea of what our IO workload looked like, we could also figure out whether a block came out of cache or not. That information could potentially be useful to the planner, but even if the database couldn't use that knowledge itself it would be a damn useful statistic to have... IMHO, far more useful than our current hit rate statistics.
    I've done this -- actually better, I used mincore to actually check
    whether the block was in cache before issuing the read -- but it turns
    out you can't get what you're looking for this way.
    The Linux fincore() syscall never got into the kernel; maybe something
    to revive...
    It turns out when you do this you see one block being read from disk
    followed by n blocks that all appear to be cache hits. Because they've
    been prefetched by the kernel.
    I did the same; I now believe that it is not very important to have
    exact numbers.
    Prefetched blocks *are* in memory when we request them; the first read
    access reads more than one block because the cost is the same.
    What you end up with is actually something like the number of iops
    which is also an interesting measure but not really what you were
    looking for.

    My getrusage patch, which I should still dig out though it's rather
    too late to be committing now unless someone tells me otherwise, would
    tell you how much i/o a plan node actually did. But you won't know
    which blocks did the i/o since I was only tracking totals for the plan
    node. That's probably what you're looking for here.
    Please show us the patch :)


    --
    Cédric Villemain               2ndQuadrant
    http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
  • Jim Nasby at Mar 24, 2011 at 5:33 pm

    On Mar 22, 2011, at 11:46 AM, Cédric Villemain wrote:
    2011/3/22 Greg Stark <gsstark@mit.edu>:
    On Mon, Mar 21, 2011 at 6:08 PM, Jim Nasby wrote:
    Has anyone looked at the overhead of measuring how long IO requests to the kernel take? If we did that not only could we get an idea of what our IO workload looked like, we could also figure out whether a block came out of cache or not. That information could potentially be useful to the planner, but even if the database couldn't use that knowledge itself it would be a damn useful statistic to have... IMHO, far more useful than our current hit rate statistics.
    I've done this -- actually better, I used mincore to actually check
    whether the block was in cache before issuing the read -- but it turns
    out you can't get what you're looking for this way.
    The linux fincore() syscall never get in the kernel, maybe something
    to revive...
    Is there an equivalent in other OSes? Could we use time measurement as an alternative if not?
    It turns out when you do this you see one block being read from disk
    followed by n blocks that all appear to be cache hits. Because they've
    been prefetched by the kernel.
    I did the same, I now believe that it is not very important to have
    the very exact numbers.
    Prefetech blocks *are* in memory when we request them, the first read
    access read more than one block because the cost is the same.
    Yeah... there are places in the planner where we make guesses as to the likelihood of something being in cache. If we could actually track the complete hit rate over time (PG buffers + FS cache), then we wouldn't have to guess at things anymore.

    And having this info in pg_stats would be extremely valuable.
    What you end up with is actually something like the number of iops
    which is also an interesting measure but not really what you were
    looking for.

    My getrusage patch, which I should still dig out though it's rather
    too late to be committing now unless someone tells me otherwise, would
    tell you how much i/o a plan node actually did. But you won't know
    which blocks did the i/o since I was only tracking totals for the plan
    node. That's probably what you're looking for here.
    Please show us the patch :)
    --
    Jim C. Nasby, Database Architect jim@nasby.net
    512.569.9461 (cell) http://jim.nasby.net
  • Cédric Villemain at Mar 25, 2011 at 10:51 am

    2011/3/24 Jim Nasby <jim@nasby.net>:
    On Mar 22, 2011, at 11:46 AM, Cédric Villemain wrote:
    2011/3/22 Greg Stark <gsstark@mit.edu>:
    On Mon, Mar 21, 2011 at 6:08 PM, Jim Nasby wrote:
    Has anyone looked at the overhead of measuring how long IO requests to the kernel take? If we did that not only could we get an idea of what our IO workload looked like, we could also figure out whether a block came out of cache or not. That information could potentially be useful to the planner, but even if the database couldn't use that knowledge itself it would be a damn useful statistic to have... IMHO, far more useful than our current hit rate statistics.
    I've done this -- actually better, I used mincore to actually check
    whether the block was in cache before issuing the read -- but it turns
    out you can't get what you're looking for this way.
    The linux fincore() syscall never get in the kernel, maybe something
    to revive...
    Is there an equivalent in other OSes? Could we use time measurement as an alternative if not?
    fincore() syscall is a shortcut for mmap+mincore calls, suggested by
    people working on libprefetch.
    see http://lwn.net/Articles/371538/

    The alternative via time measurement is interesting; it should be easy to
    output both measures in pg_statio_* and see what happens...
    It turns out when you do this you see one block being read from disk
    followed by n blocks that all appear to be cache hits. Because they've
    been prefetched by the kernel.
    I did the same, I now believe that it is not very important to have
    the very exact numbers.
    Prefetech blocks *are* in memory when we request them, the first read
    access read more than one block because the cost is the same.
    Yeah... there's places in the planner where we make guesses as to the likelyhood of something being in-cache. If we could actually track complete hit rate over time (PG buffers + FS cache), then we wouldn't have to guess at things anymore.

    And having this info in pg_stats would be extremely valuable.
    Yes; Robert also wrote up some interesting items to keep in mind when
    thinking about that, in another thread, recently.
    An FS-cache snapshot, or just a 'percent_in_cache' per relation/file (?),
    is easy to add to some auto-analyze daemon.

    *But* making good use of it in the planner is not as trivial as it
    looks (i.e., without breaking what already works well).

    Once I get time to add hooks in costsize.c, a simple extension can do
    the trick. (just need some shared_buffers to keep FS-pg_stats and
    hooks to use it in some places).
    What you end up with is actually something like the number of iops
    which is also an interesting measure but not really what you were
    looking for.

    My getrusage patch, which I should still dig out though it's rather
    too late to be committing now unless someone tells me otherwise, would
    tell you how much i/o a plan node actually did. But you won't know
    which blocks did the i/o since I was only tracking totals for the plan
    node. That's probably what you're looking for here.
    Please show us the patch :)
    --
    Jim C. Nasby, Database Architect                   jim@nasby.net
    512.569.9461 (cell)                         http://jim.nasby.net



    --
    Cédric Villemain               2ndQuadrant
    http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
  • Cédric Villemain at Apr 9, 2011 at 12:10 pm

    2011/3/24 Jim Nasby <jim@nasby.net>:
    On Mar 22, 2011, at 11:46 AM, Cédric Villemain wrote:
    2011/3/22 Greg Stark <gsstark@mit.edu>:
    On Mon, Mar 21, 2011 at 6:08 PM, Jim Nasby wrote:
    Has anyone looked at the overhead of measuring how long IO requests to the kernel take? If we did that not only could we get an idea of what our IO workload looked like, we could also figure out whether a block came out of cache or not. That information could potentially be useful to the planner, but even if the database couldn't use that knowledge itself it would be a damn useful statistic to have... IMHO, far more useful than our current hit rate statistics.
    I've done this -- actually better, I used mincore to actually check
    whether the block was in cache before issuing the read -- but it turns
    out you can't get what you're looking for this way.
    The linux fincore() syscall never get in the kernel, maybe something
    to revive...
    Is there an equivalent in other OSes? Could we use time measurement as an alternative if not?
    I made a quick test with time measurement, and quickly found the main
    problem with this strategy: how do you know whether a block was fetched
    from OS memory, SAN cache, a fast RAID array, or slow SATA disk?
    I just added a gettimeofday() around the read() call, and adjusted the
    threshold (in ms or µs) used to separate disk fetches from memory fetches.
    By manually adjusting this duration I get good results, but I wonder how
    it could be adjusted automatically on other systems; the measuring method
    itself may also affect the measurement.

    Maybe we could use it just to track 'slow' accesses, and define 'slow access' in a GUC...
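
    A bare-bones version of that measurement (an illustration only; the
    "slow access" cutoff is an arbitrary stand-in for the GUC mentioned
    above): time each read() with gettimeofday() and classify it against a
    microsecond threshold.

        /* Time each read() and classify it as a likely cache hit or a likely
         * disk fetch, using an arbitrary microsecond threshold. */
        #include <stdio.h>
        #include <fcntl.h>
        #include <unistd.h>
        #include <sys/time.h>

        #define BLOCK_SIZE     8192
        #define SLOW_ACCESS_US 300      /* made-up threshold; would be a GUC */

        int main(int argc, char **argv)
        {
            if (argc != 2) {
                fprintf(stderr, "usage: %s <file>\n", argv[0]);
                return 1;
            }

            int fd = open(argv[1], O_RDONLY);
            if (fd < 0) { perror("open"); return 1; }

            char buf[BLOCK_SIZE];
            long fast = 0, slow = 0;

            for (;;) {
                struct timeval t0, t1;
                gettimeofday(&t0, NULL);
                ssize_t n = read(fd, buf, sizeof(buf));
                gettimeofday(&t1, NULL);
                if (n <= 0)
                    break;

                long us = (t1.tv_sec - t0.tv_sec) * 1000000L +
                          (t1.tv_usec - t0.tv_usec);
                if (us >= SLOW_ACCESS_US)
                    slow++;             /* probably came from disk (or SAN, ...) */
                else
                    fast++;             /* probably served from the OS cache */
            }

            printf("fast (cache-like) reads: %ld, slow (disk-like) reads: %ld\n",
                   fast, slow);
            close(fd);
            return 0;
        }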
    It turns out when you do this you see one block being read from disk
    followed by n blocks that all appear to be cache hits. Because they've
    been prefetched by the kernel.
    I did the same, I now believe that it is not very important to have
    the very exact numbers.
    Prefetech blocks *are* in memory when we request them, the first read
    access read more than one block because the cost is the same.
    Yeah... there's places in the planner where we make guesses as to the likelyhood of something being in-cache. If we could actually track complete hit rate over time (PG buffers + FS cache), then we wouldn't have to guess at things anymore.

    And having this info in pg_stats would be extremely valuable.
    What you end up with is actually something like the number of iops
    which is also an interesting measure but not really what you were
    looking for.

    My getrusage patch, which I should still dig out though it's rather
    too late to be committing now unless someone tells me otherwise, would
    tell you how much i/o a plan node actually did. But you won't know
    which blocks did the i/o since I was only tracking totals for the plan
    node. That's probably what you're looking for here.
    Please show us the patch :)
    --
    Jim C. Nasby, Database Architect                   jim@nasby.net
    512.569.9461 (cell)                         http://jim.nasby.net



    --
    Cédric Villemain               2ndQuadrant
    http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
  • Andres Freund at Apr 9, 2011 at 12:14 pm

    On Thursday, March 24, 2011 06:32:10 PM Jim Nasby wrote:
    Is there an equivalent in other OSes?
    Some have mincore which can be used for that in combination with mmap.

    Andres
