This problem has been discussed before. Those familiar with the subject
please skip the next paragraph.

When autovacuum finds a substantial number of empty pages at the end of
a relation, it attempts to truncate it in lazy_truncate_heap(). Because
all of the scanning was done in parallel with normal DB activity, it
needs to verify that all those blocks are still empty. To do that,
autovacuum grabs an AccessExclusiveLock on the relation, then scans
backwards to the last non-empty page. If any other backend needs to
access the table during this time, the deadlock detection code will kill
the autovacuum, by default after a 1000ms timeout. The autovacuum
launcher will start another vacuum after 60 seconds (the default), which
will most likely be killed again, and again, and again. The net result
is that the table is never truncated, and every 60 seconds there is a
1 second hiccup before the autovacuum is killed.

Proposal:

Add functions to lmgr that are derived from the lock release code, but
instead of releasing the lock and waking up waiters, just return a
boolean telling whether there are any waiters that would be woken up if
the lock were released.
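
As a rough sketch (the name and signature below are illustrative
assumptions, not necessarily what the patch uses), such an lmgr entry
point and its call site could look like this:

    /*
     * Sketch only: an lmgr wrapper reporting whether any backend is
     * queued behind a lock we hold on the given relation.
     */
    extern bool LockHasWaitersRelation(Relation relation, LOCKMODE lockmode);

    /* caller side, while autovacuum holds the AccessExclusiveLock */
    if (LockHasWaitersRelation(onerel, AccessExclusiveLock))
    {
        /* somebody is blocked behind us: truncate what we have and yield */
    }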

Use this lmgr feature inside count_nondeletable_pages() of vacuumlazy.c
to periodically check whether there is a conflicting lock request
waiting. If not, keep going. If there is a waiter, truncate the relation
to the point checked thus far, release the AccessExclusiveLock, then
loop back to where we acquire this lock in the first place and continue
checking/truncating.
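
A minimal sketch of the resulting control flow in lazy_truncate_heap()
(assuming count_nondeletable_pages() gains an out-parameter that reports
it stopped early because of a waiter; that parameter is an illustrative
assumption):

    /* Sketch only: simplified flow, error handling and size checks omitted. */
    BlockNumber new_rel_pages;
    bool        lock_waiter_detected;

    do
    {
        /* (re)acquire the exclusive lock; give up if we cannot get it */
        if (!ConditionalLockRelation(onerel, AccessExclusiveLock))
            break;

        /* re-verify that the tail pages are still empty */
        new_rel_pages = count_nondeletable_pages(onerel, vacrelstats,
                                                 &lock_waiter_detected);

        /* truncate what has been verified so far, then get out of the way */
        RelationTruncate(onerel, new_rel_pages);
        UnlockRelation(onerel, AccessExclusiveLock);
    } while (lock_waiter_detected);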

I have a working patch here:

https://github.com/wieck/postgres/tree/autovacuum-truncate-lock

This patch does introduce three new postgresql.conf parameters, which I
would be happy to get rid of if we could derive them from something
else. Something based on the deadlock timeout may be possible.

autovacuum_truncate_lock_check = 100ms # how frequent to check
# for conflicting locks
autovacuum_truncate_lock_retry = 50 # how often to try acquiring
# the exclusive lock
autovacuum_truncate_lock_wait = 20ms # nap in between attempts
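
(If these were instead derived from the deadlock timeout, the check
interval might be computed roughly as follows; this is only a sketch,
using the backend's DeadlockTimeout variable, which holds
deadlock_timeout in milliseconds.)

    /* Sketch: derive the check interval instead of adding a GUC for it. */
    int check_interval_ms = Max(DeadlockTimeout / 10, 10);  /* 1000ms -> 100ms */
    int lock_retries      = 50;   /* could stay hardcoded */
    int retry_nap_ms      = 20;   /* or reuse the cost-based vacuum nap time */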

With these settings, I see the truncate of a bloated table progressing
at a rate of 3 minutes per GB, while that table is accessed 20 times per
second.

The original "kill autovacuum" mechanism in the deadlock code is still
there. All this code really does is perform 10 lmgr lookups per second
and release the AccessExclusiveLock if there are any waiters. I don't
think it can get any cheaper than this.

I am attaching a script that uses pgbench to demonstrate the actual
problem of a bloated table with a significant number of empty pages at
the end.


Comments?


Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


  • Jan Wieck at Oct 24, 2012 at 10:20 pm
    Here is the patch for it.


    Jan

    --
    Anyone who trades liberty for security deserves neither
    liberty nor security. -- Benjamin Franklin
  • Stephen Frost at Oct 25, 2012 at 2:22 am
    Jan,

    * Jan Wieck (janwieck@yahoo.com) wrote:
    This problem has been discussed before. Those familiar with the
    subject please skip the next paragraph.
    Apologies if this was already thought-of and ruled out for some reason,
    but...
    Because all the scanning had been done in parallel to normal DB
    activity, it needs to verify that all those blocks are still empty.
    Would it be possible to use the FSM to figure out if things have changed
    since the last scan..? Does that scan update the FSM, which would then
    be updated by another backend in the event that it decided to write
    something there? Or do we consider the FSM to be completely
    untrustworthy wrt this (and if so, I don't suppose there's any hope of
    using the visibility map...)?

    The notion of having to double-scan and the AccessExclusiveLock on the
    relation are telling me this work-around, while completely possible,
    isn't exactly ideal...

    Perhaps another option would be a page-level lock, or something which
    is larger than per-row (which strikes me as a lot of overhead for this
    and it's not clear how we'd do it) but smaller than an entire
    relation, though there are certainly pain points there too.

    Thanks,

    Stephen
  • Jan Wieck at Oct 25, 2012 at 1:04 pm
    Stephen,
    On 10/24/2012 10:46 PM, Stephen Frost wrote:
    Jan,

    * Jan Wieck (janwieck@yahoo.com) wrote:
    This problem has been discussed before. Those familiar with the
    subject please skip the next paragraph.
    Apologies if this was already thought-of and ruled out for some reason,
    but...
    Because all the scanning had been done in parallel to normal DB
    activity, it needs to verify that all those blocks are still empty.
    Would it be possible to use the FSM to figure out if things have changed
    since the last scan..? Does that scan update the FSM, which would then
    be updated by another backend in the event that it decided to write
    something there? Or do we consider the FSM to be completely
    untrustworthy wrt this (and if so, I don't suppose there's any hope to
    using the visibility map...)?
    I honestly don't know if we can trust the FSM enough when it comes to
    throwing away heap pages. Can we?
    The notion of having to double-scan and the AccessExclusiveLock on the
    relation are telling me this work-around, while completely possible,
    isn't exactly ideal...
    Under normal circumstances, with just a few pages to trim off the end,
    this is no problem. Those pages were the last pages just scanned by this
    very autovacuum, so they are found in the shared buffers anyway. All the
    second scan does in that case is fetch the pages once more from shared
    buffers to be 100% sure we are not truncating off new tuples. We
    definitely need the AccessExclusiveLock to prevent someone from
    extending the relation at the end between our check of the relation size
    and the truncate. Fetching 50 empty blocks from the buffer cache while
    at it isn't that big of a deal, and that is what it normally looks like.

    The problem case this patch is dealing with is rolling window tables
    that have experienced some bloat. The typical example is a log table
    that has new data constantly added and the oldest data constantly purged
    out. This data normally rotates through some blocks like a rolling
    window. If for some reason (purging turned off, for example) this table
    bloats by several GB and later shrinks back to its normal content, soon
    all the used blocks are at the beginning of the heap and we find tens of
    thousands of empty pages at the end. Only then does the second scan take
    more than 1000ms, and autovacuum is at risk of being killed while at it.

    Since we have experienced this problem several times now on our
    production systems, something clearly needs to be done. But IMHO it
    doesn't happen often enough to take any risk here.


    Jan

    --
    Anyone who trades liberty for security deserves neither
    liberty nor security. -- Benjamin Franklin
  • Tom Lane at Oct 25, 2012 at 1:46 pm

    Jan Wieck writes:
    On 10/24/2012 10:46 PM, Stephen Frost wrote:
    Would it be possible to use the FSM to figure out if things have changed
    since the last scan..? Does that scan update the FSM, which would then
    be updated by another backend in the event that it decided to write
    something there? Or do we consider the FSM to be completely
    untrustworthy wrt this (and if so, I don't suppose there's any hope to
    using the visibility map...)?
    I honestly don't know if we can trust the FSM enough when it comes to
    throwing away heap pages. Can we?
    No. Backends are under no obligation to update FSM for each individual
    tuple insertion, and typically don't do so.

    More to the point, you have to take AccessExclusiveLock *anyway*,
    because this is interlocking not only against new insertions but plain
    read-only seqscans: if a seqscan falls off the end of the table it will
    be very unhappy. So I don't see where we'd buy anything by consulting
    the FSM.

    regards, tom lane
  • Jan Wieck at Oct 25, 2012 at 1:57 pm

    On 10/25/2012 9:45 AM, Tom Lane wrote:
    Jan Wieck <janwieck@yahoo.com> writes:
    On 10/24/2012 10:46 PM, Stephen Frost wrote:
    Would it be possible to use the FSM to figure out if things have changed
    since the last scan..? Does that scan update the FSM, which would then
    be updated by another backend in the event that it decided to write
    something there? Or do we consider the FSM to be completely
    untrustworthy wrt this (and if so, I don't suppose there's any hope to
    using the visibility map...)?
    I honestly don't know if we can trust the FSM enough when it comes to
    throwing away heap pages. Can we?
    No. Backends are under no obligation to update FSM for each individual
    tuple insertion, and typically don't do so.

    More to the point, you have to take AccessExclusiveLock *anyway*,
    because this is interlocking not only against new insertions but plain
    read-only seqscans: if a seqscan falls off the end of the table it will
    be very unhappy. So I don't see where we'd buy anything by consulting
    the FSM.
    Thank you.

    One thing that I haven't mentioned yet is that with this patch, we could
    actually insert a vacuum_delay_point() into the loop in
    count_nondeletable_pages(). We no longer cling to the exclusive lock but
    rather get out of the way as soon as somebody needs the table. Under
    this condition we no longer need to do the second scan full bore.
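
    Illustratively (a sketch only, not the patch itself), the per-block work
    in count_nondeletable_pages() could then look about like this, with the
    delay point added:

        /* Sketch: backward scan over the tail pages, now with a delay point. */
        while (blkno > vacrelstats->nonempty_pages)
        {
            Buffer      buf;

            vacuum_delay_point();   /* safe now: we yield the lock to waiters */

            blkno--;
            buf = ReadBufferExtended(onerel, MAIN_FORKNUM, blkno,
                                     RBM_NORMAL, vac_strategy);
            /* ... inspect the page, stop if it contains anything live ... */
            ReleaseBuffer(buf);
        }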


    Jan

    --
    Anyone who trades liberty for security deserves neither
    liberty nor security. -- Benjamin Franklin
  • Stephen Frost at Oct 25, 2012 at 2:12 pm
    Jan,

    * Jan Wieck (janwieck@yahoo.com) wrote:
    The problem case this patch is dealing with is rolling window tables
    that experienced some bloat. The typical example is a log table,
    that has new data constantly added and the oldest data constantly
    purged out. This data normally rotates through some blocks like a
    rolling window. If for some reason (purging turned off for example)
    this table bloats by several GB and later shrinks back to its normal
    content, soon all the used blocks are at the beginning of the heap
    and we find tens of thousands of empty pages at the end. Only now
    does the second scan take more than 1000ms and autovacuum is at risk
    to get killed while at it.
    My concern is that this could certainly also happen to a heavily updated
    table in an OLTP type of environment, where the requirement to take a
    heavy lock to clean it up might prevent it from ever happening. I was
    simply hoping we could find a mechanism to lock just those pages we're
    getting ready to nuke rather than the entire relation. Perhaps we can
    consider how to make those changes alongside changes to eliminate or
    reduce the extent locking that has been painful (for me at least) when
    doing massive parallel loads into a table.
    Since we have experienced this problem several times now on our
    production systems, something clearly needs to be done. But IMHO it
    doesn't happen often enough to take any risk here.
    I'm not advocating a 'do-nothing' approach; I was just looking for
    another option that might allow this work to happen on the heap in
    parallel with regular access. Since we haven't got any way to do that
    currently, +1 for moving forward with this as it clearly improves the
    current situation.

    Thanks,

    Stephen
  • Jan Wieck at Oct 25, 2012 at 4:16 pm

    On 10/25/2012 10:12 AM, Stephen Frost wrote:
    Jan,

    * Jan Wieck (janwieck@yahoo.com) wrote:
    The problem case this patch is dealing with is rolling window tables
    that experienced some bloat. The typical example is a log table,
    that has new data constantly added and the oldest data constantly
    purged out. This data normally rotates through some blocks like a
    rolling window. If for some reason (purging turned off for example)
    this table bloats by several GB and later shrinks back to its normal
    content, soon all the used blocks are at the beginning of the heap
    and we find tens of thousands of empty pages at the end. Only now
    does the second scan take more than 1000ms and autovacuum is at risk
    to get killed while at it.
    My concern is that this could certainly also happen to a heavily updated
    table in an OLTP type of environment where the requirement to take a
    heavy lock to clean it up might prevent it from ever happening.. I was
    simply hoping we could find a mechanism to lock just those pages we're
    getting ready to nuke rather than the entire relation. Perhaps we can
    consider how to make those changes alongside of changes to eliminate or
    reduce the extent locking that has been painful (for me at least) when
    doing massive parallel loads into a table.
    I've been testing this with loads of 20 writes/s to that bloated table.
    Preventing not only the cleanup, but the following ANALYZE as well, is
    precisely what happens. There may be multiple ways to get into this
    situation, but once you're there the symptoms are the same. Vacuum fails
    to truncate the table, causing a 1 second hiccup every minute while
    vacuum holds the exclusive lock until the deadlock detection code of
    another transaction kills it.

    My patch doesn't change the logic of how we ensure that we don't zap any
    data by accident with the truncate, and Tom's comments suggest we should
    stick to it. It only makes autovacuum check frequently whether the
    AccessExclusiveLock is actually blocking anyone, and then get out of the
    way.

    I would rather like to discuss ideas for how to do all this without 3
    new GUCs.

    In the original code, the maximum delay that autovacuum can cause by
    holding the exclusive lock is one deadlock_timeout (default 1s). It
    would appear reasonable to me to use max(deadlock_timeout/10,10ms) as
    the interval to check for a conflicting lock request. For another
    transaction that needs to access the table this is 10 times faster than
    it is now and still guarantees that autovacuum will make some progress
    with the truncate.

    The other two GUCs control how often and how fast autovacuum tries to
    acquire the exclusive lock in the first place. Since we actively release
    the lock *because someone needs it*, it is pretty much guaranteed that
    the immediate next lock attempt will fail. We do a
    ConditionalLockRelation() on purpose because there is a chance of
    deadlocking. The current code only tries one lock attempt and gives up
    immediately. I don't know what to derive a good retry duration from, but
    the nap time in between tries could be a hardcoded 20ms, or the
    cost-based vacuum nap time (which defaults to 20ms).
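
    As a sketch of that retry idea (the retry count and nap time are the
    open questions; the numbers below are just the proposed defaults):

        /* Sketch: bounded, conditional re-acquisition of the exclusive lock. */
        int     tries;

        for (tries = 0; tries < 50; tries++)
        {
            if (ConditionalLockRelation(onerel, AccessExclusiveLock))
                break;                  /* got it, resume checking/truncating */
            pg_usleep(20 * 1000L);      /* 20ms nap between attempts */
        }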

    Any other ideas are welcome.


    Thanks,
    Jan

    --
    Anyone who trades liberty for security deserves neither
    liberty nor security. -- Benjamin Franklin
  • Alvaro Herrera at Oct 25, 2012 at 4:24 pm

    Jan Wieck wrote:

    In the original code, the maximum delay that autovacuum can cause by
    holding the exclusive lock is one deadlock_timeout (default 1s). It
    would appear reasonable to me to use max(deadlock_timeout/10,10ms)
    as the interval to check for a conflicting lock request. For another
    transaction that needs to access the table this is 10 times faster
    than it is now and still guarantees that autovacuum will make some
    progress with the truncate.
    So you would be calling GetCurrentTimestamp() continuously? Since you
    mentioned adding a vacuum delay point I wonder if it would make sense to
    test for lockers each time it would consider going to sleep, instead.
    (One hazard to keep in mind is the case where no vacuum delay is
    configured.)

    --
    Álvaro Herrera http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training & Services
  • Jan Wieck at Oct 25, 2012 at 5:48 pm

    On 10/25/2012 12:24 PM, Alvaro Herrera wrote:
    Jan Wieck wrote:
    In the original code, the maximum delay that autovacuum can cause by
    holding the exclusive lock is one deadlock_timeout (default 1s). It
    would appear reasonable to me to use max(deadlock_timeout/10,10ms)
    as the interval to check for a conflicting lock request. For another
    transaction that needs to access the table this is 10 times faster
    than it is now and still guarantees that autovacuum will make some
    progress with the truncate.
    So you would be calling GetCurrentTimestamp() continuously? Since you
    mentioned adding a vacuum delay point I wonder if it would make sense to
    test for lockers each time it would consider going to sleep, instead.
    (One hazard to keep in mind is the case where no vacuum delay is
    configured.)
    Depends on your definition of "continuously". If doing one
    INSTR_TIME_SET_CURRENT(), which on Unix boils down to a gettimeofday(),
    every 32 ReadBufferExtended() calls counts as continuous, then yes.

    Adding a vacuum_delay_point() is something we should consider. However,
    the vacuum_delay_point() call simply naps when enough cost has been
    racked up. You don't know if the next call will nap or not. We would
    have to extend that functionality with some vacuum_delay_would_nap()
    call to do what you suggest.


    Jan

    --
    Anyone who trades liberty for security deserves neither
    liberty nor security. -- Benjamin Franklin
  • Amit Kapila at Oct 26, 2012 at 5:29 am

    On Thursday, October 25, 2012 9:46 PM Jan Wieck wrote:
    On 10/25/2012 10:12 AM, Stephen Frost wrote:
    Jan,

    * Jan Wieck (janwieck@yahoo.com) wrote:
    The problem case this patch is dealing with is rolling window tables
    that experienced some bloat. The typical example is a log table,
    that has new data constantly added and the oldest data constantly
    purged out. This data normally rotates through some blocks like a
    rolling window. If for some reason (purging turned off for example)
    this table bloats by several GB and later shrinks back to its normal
    content, soon all the used blocks are at the beginning of the heap
    and we find tens of thousands of empty pages at the end. Only now
    does the second scan take more than 1000ms and autovacuum is at risk
    to get killed while at it.
    My concern is that this could certainly also happen to a heavily updated
    table in an OLTP type of environment where the requirement to take a
    heavy lock to clean it up might prevent it from ever happening.. I was
    simply hoping we could find a mechanism to lock just those pages we're
    getting ready to nuke rather than the entire relation. Perhaps we can
    consider how to make those changes alongside of changes to eliminate or
    reduce the extent locking that has been painful (for me at least) when
    doing massive parallel loads into a table.
    I've been testing this with loads of 20 writes/s to that bloated table.
    Preventing not only the clean up, but the following ANALYZE as well is
    precisely what happens. There may be multiple ways how to get into this
    situation, but once you're there the symptoms are the same. Vacuum fails
    to truncate it and causing a 1 second hiccup every minute, while vacuum
    is holding the exclusive lock until the deadlock detection code of
    another transaction kills it.

    My patch doesn't change the logic how we ensure that we don't zap any
    data by accident with the truncate and Tom's comments suggest we should
    stick to it. It only makes autovacuum check frequently if the
    AccessExclusiveLock is actually blocking anyone and then get out of the
    way.

    I would rather like to discuss any ideas how to do all this without 3
    new GUCs.

    In the original code, the maximum delay that autovacuum can cause by
    holding the exclusive lock is one deadlock_timeout (default 1s). It
    would appear reasonable to me to use max(deadlock_timeout/10,10ms) as
    the interval to check for a conflicting lock request. For another
    transaction that needs to access the table this is 10 times faster than
    it is now and still guarantees that autovacuum will make some progress
    with the truncate.
    One other way could be to check after every few pages for a conflicting
    lock request.
    The other two GUCs control how often and how fast autovacuum tries to
    acquire the exclusive lock in the first place. Since we actively release
    the lock *because someone needs it* it is pretty much guaranteed that
    the immediate next lock attempt fails. We on purpose do a
    ConditionalLockRelation() because there is a chance to deadlock. The
    current code only tries one lock attempt and gives up immediately. I
    don't know from what to derive a good value for how long to retry,
    Can't we do something like: after the nap, try the conditional lock
    again, and if we don't get it, take the lock unconditionally?
    The reason it might be okay, with your implementation, to take the lock
    unconditionally after one try is that it will release the lock anyway
    after every few pages or after a short time if there is any waiter.
    but
    the nap time in between tries could be a hardcoded 20ms or using the
    cost based vacuum nap time (which defaults to 20ms).
    I think using the cost-based vacuum nap time or the default value is good.

    Adding new parameters adds user/administrator overhead; it is always
    better if the database can decide this intelligently by itself. However,
    if you feel these are parameters that can vary based on different kinds
    of usage, then I think it is better to expose them to users as
    configuration parameters.

    With Regards,
    Amit Kapila.
  • Jan Wieck at Oct 26, 2012 at 6:19 am

    On 10/26/2012 1:29 AM, Amit Kapila wrote:
    On Thursday, October 25, 2012 9:46 PM Jan Wieck wrote:
    On 10/25/2012 10:12 AM, Stephen Frost wrote:
    Jan,

    * Jan Wieck (janwieck@yahoo.com) wrote:
    The problem case this patch is dealing with is rolling window tables
    that experienced some bloat. The typical example is a log table,
    that has new data constantly added and the oldest data constantly
    purged out. This data normally rotates through some blocks like a
    rolling window. If for some reason (purging turned off for example)
    this table bloats by several GB and later shrinks back to its normal
    content, soon all the used blocks are at the beginning of the heap
    and we find tens of thousands of empty pages at the end. Only now
    does the second scan take more than 1000ms and autovacuum is at risk
    to get killed while at it.
    My concern is that this could certainly also happen to a heavily updated
    table in an OLTP type of environment where the requirement to take a
    heavy lock to clean it up might prevent it from ever happening.. I was
    simply hoping we could find a mechanism to lock just those pages we're
    getting ready to nuke rather than the entire relation. Perhaps we can
    consider how to make those changes alongside of changes to eliminate or
    reduce the extent locking that has been painful (for me at least) when
    doing massive parallel loads into a table.
    I've been testing this with loads of 20 writes/s to that bloated table.
    Preventing not only the clean up, but the following ANALYZE as well is
    precisely what happens. There may be multiple ways how to get into this
    situation, but once you're there the symptoms are the same. Vacuum fails
    to truncate it and causing a 1 second hiccup every minute, while vacuum
    is holding the exclusive lock until the deadlock detection code of
    another transaction kills it.

    My patch doesn't change the logic how we ensure that we don't zap any
    data by accident with the truncate and Tom's comments suggest we should
    stick to it. It only makes autovacuum check frequently if the
    AccessExclusiveLock is actually blocking anyone and then get out of the
    way.

    I would rather like to discuss any ideas how to do all this without 3
    new GUCs.

    In the original code, the maximum delay that autovacuum can cause by
    holding the exclusive lock is one deadlock_timeout (default 1s). It
    would appear reasonable to me to use max(deadlock_timeout/10,10ms) as
    the interval to check for a conflicting lock request. For another
    transaction that needs to access the table this is 10 times faster than
    it is now and still guarantees that autovacuum will make some progress
    with the truncate.
    One other way could be to check after every few pages for a conflicting
    lock request.
    How is this any different from what my patch does? Did you even look at
    the code?
    The other two GUCs control how often and how fast autovacuum tries to
    acquire the exclusive lock in the first place. Since we actively release
    the lock *because someone needs it* it is pretty much guaranteed that
    the immediate next lock attempt fails. We on purpose do a
    ConditionalLockRelation() because there is a chance to deadlock. The
    current code only tries one lock attempt and gives up immediately. I
    don't know from what to derive a good value for how long to retry,
    Can't we do something like, after nap check for conditional lock and if it
    didn't get
    then get lock unconditionally.
    No, we cannot. This is also well documented in the code.


    Jan

    --
    Anyone who trades liberty for security deserves neither
    liberty nor security. -- Benjamin Franklin
  • Amit Kapila at Oct 26, 2012 at 10:37 am

    On Friday, October 26, 2012 11:50 AM Jan Wieck wrote:
    On 10/26/2012 1:29 AM, Amit Kapila wrote:
    On Thursday, October 25, 2012 9:46 PM Jan Wieck wrote:
    On 10/25/2012 10:12 AM, Stephen Frost wrote:
    Jan,

    * Jan Wieck (janwieck@yahoo.com) wrote:
    The problem case this patch is dealing with is rolling window tables
    that experienced some bloat. The typical example is a log table, that
    has new data constantly added and the oldest data constantly purged
    out. This data normally rotates through some blocks like a rolling
    window. If for some reason (purging turned off for example) this
    table bloats by several GB and later shrinks back to its normal
    content, soon all the used blocks are at the beginning of the heap
    and we find tens of thousands of empty pages at the end. Only now
    does the second scan take more than 1000ms and autovacuum is at risk
    to get killed while at it.
    My concern is that this could certainly also happen to a heavily
    updated table in an OLTP type of environment where the requirement to
    take a heavy lock to clean it up might prevent it from ever
    happening.. I was simply hoping we could find a mechanism to lock
    just those pages we're getting ready to nuke rather than the entire
    relation. Perhaps we can consider how to make those changes alongside
    of changes to eliminate or reduce the extent locking that has been
    painful (for me at least) when doing massive parallel loads into a
    table.
    I've been testing this with loads of 20 writes/s to that bloated
    table. Preventing not only the clean up, but the following ANALYZE as
    well is precisely what happens. There may be multiple ways how to get
    into this situation, but once you're there the symptoms are the same.
    Vacuum fails to truncate it and causing a 1 second hiccup every
    minute, while vacuum is holding the exclusive lock until the deadlock
    detection code of another transaction kills it.

    My patch doesn't change the logic how we ensure that we don't zap any
    data by accident with the truncate and Tom's comments suggest we
    should stick to it. It only makes autovacuum check frequently if the
    AccessExclusiveLock is actually blocking anyone and then get out of
    the way.

    I would rather like to discuss any ideas how to do all this without 3
    new GUCs.

    In the original code, the maximum delay that autovacuum can cause by
    holding the exclusive lock is one deadlock_timeout (default 1s). It
    would appear reasonable to me to use max(deadlock_timeout/10,10ms) as
    the interval to check for a conflicting lock request. For another
    transaction that needs to access the table this is 10 times faster
    than it is now and still guarantees that autovacuum will make some
    progress with the truncate.
    One other way could be to check after every few pages for a
    conflicting lock request.
    How is this any different from what my patch does?
    The difference is that the patch checks for waiters using 2 parameters,
    autovacuum_truncate_lock_check and blkno%32, whereas what I had
    mentioned was to check based only on blkno.
    Will it affect things too much if we directly check for waiters after
    every 32 (or any other feasible number of) blocks?
    Did you even look at the code?
    I hadn't looked at the code when I replied to your previous mail. But
    now I have checked it.

    With Regards,
    Amit Kapila.
  • Jan Wieck at Oct 26, 2012 at 1:31 pm

    On 10/26/2012 6:35 AM, Amit Kapila wrote:
    On Friday, October 26, 2012 11:50 AM Jan Wieck wrote:
    On 10/26/2012 1:29 AM, Amit Kapila wrote:
    One other way could be to check after every few pages for a
    conflicting
    lock request.
    How is this any different from what my patch does?
    The difference is that in the patch it checks for waiters by using 2
    parameters autovacuum_truncate_lock_check and blkno%32 and what I
    had mentioned was to check only based on blkno.
    Will it effect too much if we directly check for waiters after every 32
    (any feasible number) blocks?
    The blkno%32 is there to avoid doing the gettimeofday() call too often.
    But relying on the blkno alone is IMHO not a good idea. It would have to
    be a number small enough that, even on a busy system and when the pages
    have to be read from disk, vacuum checks and releases the lock quickly,
    but large enough that it doesn't create a significant number of spinlock
    calls in the lmgr. We would end up with another parameter, a number of
    blocks, that is a lot harder to estimate a good value for.
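
    To make that concrete, the check cadence could be sketched like this
    (starttime and check_interval_ms are assumed locals of the surrounding
    scan loop, and LockHasWaitersRelation() stands for the proposed lmgr
    call; this is illustrative only):

        /* Sketch: consult the clock every 32 blocks, check waiters on a timer. */
        if ((blkno % 32) == 0)
        {
            instr_time  currenttime;
            instr_time  elapsed;

            INSTR_TIME_SET_CURRENT(currenttime);
            elapsed = currenttime;
            INSTR_TIME_SUBTRACT(elapsed, starttime);
            if (INSTR_TIME_GET_MILLISEC(elapsed) >= check_interval_ms)
            {
                if (LockHasWaitersRelation(onerel, AccessExclusiveLock))
                    break;          /* stop scanning, let the waiter in */
                starttime = currenttime;
            }
        }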


    Jan

    --
    Anyone who trades liberty for security deserves neither
    liberty nor security. -- Benjamin Franklin
  • Alvaro Herrera at Nov 23, 2012 at 3:44 pm
    Jan,

    Are you posting an updated patch?

    --
    Álvaro Herrera http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training & Services
  • Amit Kapila at Oct 26, 2012 at 7:05 am

    On Friday, October 26, 2012 10:59 AM Amit Kapila wrote:
    On Thursday, October 25, 2012 9:46 PM Jan Wieck wrote:
    On 10/25/2012 10:12 AM, Stephen Frost wrote:
    Jan,

    * Jan Wieck (janwieck@yahoo.com) wrote:
    The problem case this patch is dealing with is rolling window tables
    that experienced some bloat. The typical example is a log table, that
    has new data constantly added and the oldest data constantly purged
    out. This data normally rotates through some blocks like a rolling
    window. If for some reason (purging turned off for example) this
    table bloats by several GB and later shrinks back to its normal
    content, soon all the used blocks are at the beginning of the heap
    and we find tens of thousands of empty pages at the end. Only now
    does the second scan take more than 1000ms and autovacuum is at risk
    to get killed while at it.
    My concern is that this could certainly also happen to a heavily
    updated table in an OLTP type of environment where the requirement to
    take a heavy lock to clean it up might prevent it from ever
    happening.. I was simply hoping we could find a mechanism to lock
    just those pages we're getting ready to nuke rather than the entire
    relation. Perhaps we can consider how to make those changes alongside
    of changes to eliminate or reduce the extent locking that has been
    painful (for me at least) when doing massive parallel loads into a
    table.
    I've been testing this with loads of 20 writes/s to that bloated table.
    Preventing not only the clean up, but the following ANALYZE as well is
    precisely what happens. There may be multiple ways how to get into this
    situation, but once you're there the symptoms are the same. Vacuum fails
    to truncate it and causing a 1 second hiccup every minute, while vacuum
    is holding the exclusive lock until the deadlock detection code of
    another transaction kills it.

    My patch doesn't change the logic how we ensure that we don't zap any
    data by accident with the truncate and Tom's comments suggest we should
    stick to it. It only makes autovacuum check frequently if the
    AccessExclusiveLock is actually blocking anyone and then get out of the
    way.

    I would rather like to discuss any ideas how to do all this without 3
    new GUCs.

    In the original code, the maximum delay that autovacuum can cause by
    holding the exclusive lock is one deadlock_timeout (default 1s). It
    would appear reasonable to me to use max(deadlock_timeout/10,10ms) as
    the interval to check for a conflicting lock request. For another
    transaction that needs to access the table this is 10 times faster than
    it is now and still guarantees that autovacuum will make some progress
    with the truncate.
    One other way could be to check after every few pages for a
    conflicting lock request.
    The other two GUCs control how often and how fast autovacuum tries to
    acquire the exclusive lock in the first place. Since we actively release
    the lock *because someone needs it* it is pretty much guaranteed that
    the immediate next lock attempt fails. We on purpose do a
    ConditionalLockRelation() because there is a chance to deadlock. The
    current code only tries one lock attempt and gives up immediately. I
    don't know from what to derive a good value for how long to retry,
    Can't we do something like, after nap check for conditional lock and
    if it didn't get then get lock unconditionally.
    The reason why after your implementation it might be okay to have
    lock unconditionally after one try is that anyway after every few
    pages or after small time, it will release the lock if there is any
    waiter.
    I am sorry, at this point trying to take unconditional X lock can lead to
    deadlock, so above is not possible.
    but
    the nap time in between tries could be a hardcoded 20ms or using the
    cost based vacuum nap time (which defaults to 20ms).
    I think using cost based vacuum nap time or default value is good.

    Adding new parameters might have user/administrator overhead, it is
    always
    better if it can be intelligently decided by database itself.
    However if you feel these are parameters which can vary based on
    different
    kind of usage, then I think it is better to expose it through
    configuration
    parameters to users.

    With Regards,
    Amit Kapila.



  • Robert Haas at Oct 29, 2012 at 11:20 am

    On Wed, Oct 24, 2012 at 4:20 PM, Jan Wieck wrote:
    This patch does introduce three new postgresql.conf parameters, which I
    would be happy to get rid of if we could derive them from something else.
    Something based on the deadlock timeout may be possible.

    autovacuum_truncate_lock_check = 100ms # how frequent to check
    # for conflicting locks
    autovacuum_truncate_lock_retry = 50 # how often to try acquiring
    # the exclusive lock
    autovacuum_truncate_lock_wait = 20ms # nap in between attempts
    +1 for this general approach.

    As you suggested downthread, I think that hard-coding
    autovacuum_truncate_lock_check to one-tenth of the deadlock timeout
    should be just fine. For the other two parameters, I doubt we need to
    make them configurable at all. It's not exactly clear what to set
    them to, but it does seem clear that the down side of setting them
    incorrectly isn't very much as long as the defaults are roughly sane.
    Personally, I'd be inclined to retry less frequently but over a
    slightly longer time period - say twenty retries, one after every
    100ms. But I wouldn't be upset if we settled on what you've got here,
    either. We just don't want to let the total time we spend waiting for
    the lock get too long, because that means pinning down an auto-vacuum
    worker that might be critically needed elsewhere. So the product of
    autovacuum_truncate_lock_retry and autovacuum_truncate_lock_wait
    probably should not be more than a couple of seconds.
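
    For scale: the proposed defaults amount to 50 retries * 20ms = 1 second
    of total waiting, while twenty retries at 100ms each would be
    20 * 100ms = 2 seconds; either stays within that couple-of-seconds
    bound.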

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Dimitri Fontaine at Nov 15, 2012 at 10:32 pm

    Jan Wieck writes:
    Use this lmgr feature inside count_nondeletable_pages() of vacuumlazy.c to
    periodically check, if there is a conflicting lock request waiting. If not,
    keep going. If there is a waiter, truncate the relation to the point checked
    thus far, release the AccessExclusiveLock, then loop back to where we
    acquire this lock in the first place and continue checking/truncating.
    I think that maybe we could just bail out after releasing the
    AccessExclusiveLock and trust autovacuum to get back to truncating that
    relation later. That would allow removing 2 of the 3 GUCs below:
    autovacuum_truncate_lock_check = 100ms # how frequent to check
    # for conflicting locks
    This is the one remaining. Could we maybe check for lock conflicts
    after every page moved backward, or some multiple thereof? The goal
    would be to ensure that progress is made, while also being aware of
    concurrent activity, a la CHECK_FOR_INTERRUPTS().

    Regards,
    --
    Dimitri Fontaine
    http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
  • Alvaro Herrera at Nov 15, 2012 at 10:39 pm

    Dimitri Fontaine wrote:
    Jan Wieck <janwieck@yahoo.com> writes:
    Use this lmgr feature inside count_nondeletable_pages() of vacuumlazy.c to
    periodically check, if there is a conflicting lock request waiting. If not,
    keep going. If there is a waiter, truncate the relation to the point checked
    thus far, release the AccessExclusiveLock, then loop back to where we
    acquire this lock in the first place and continue checking/truncating.
    I think that maybe we could just bail out after releasing the
    AccessExclusiveLock and trust autovacuum to get back to truncating that
    relation later.
    That doesn't work, because the truncating code is not reached unless
    vacuuming actually took place. So if you interrupt it, it will just not
    get called again later. Maybe we could have autovacuum somehow invoke
    that separately, but that would require that the fact that truncation
    was aborted is kept track of somewhere.

    --
    Álvaro Herrera http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training & Services
  • Amit Kapila at Nov 16, 2012 at 5:44 am

    On Friday, November 16, 2012 4:09 AM Alvaro Herrera wrote:
    Dimitri Fontaine wrote:
    Jan Wieck <janwieck@yahoo.com> writes:
    Use this lmgr feature inside count_nondeletable_pages() of
    vacuumlazy.c to periodically check, if there is a conflicting lock
    request waiting. If not, keep going. If there is a waiter, truncate
    the relation to the point checked thus far, release the
    AccessExclusiveLock, then loop back to where we acquire this lock in
    the first place and continue checking/truncating.
    I think that maybe we could just bail out after releasing the
    AccessExclusiveLock and trust autovacuum to get back to truncating that
    relation later.
    That doesn't work, because the truncating code is not reached unless
    vacuuming actually took place. So if you interrupt it, it will just not
    get called again later. Maybe we could have autovacuum somehow invoke
    that separately, but that would require that the fact that truncation
    was aborted is kept track of somewhere.
    Won't it have a chance to be handled the next time vacuum triggers due
    to updates/deletes on some other pages?

    OTOH, maybe next time the same thing happens again and it is still not
    able to complete the truncate.
    So I think it's better to complete it the first time, but maybe using
    some heuristic wait-and-retry time rather than configuration variables.

    With Regards,
    Amit Kapila
  • Kevin Grittner at Nov 23, 2012 at 9:42 pm

    Alvaro Herrera wrote:

    Are you posting an updated patch?
    Well, there wasn't exactly a consensus on what should change, so I'll
    throw some initial review comments out even though I still need to
    check some things in the code and do more testing.

    The patch applied cleanly, compiled without warnings, and passed all
    tests in `make check-world`.

    The previous discussion seemed to me to achieve consensus on the need
    for the feature, but a general dislike for adding so many new GUC
    settings to configure it. I concur on that. I agree with the feelings
    of others that just using deadlock_timeout / 10 (probably clamped to
    a minimum of 10ms) would be good instead of
    autovacuum_truncate_lock_check. I agree that the values of the other
    two settings probably aren't too critical as long as they are fairly
    reasonable, which I would define as being 20ms to 100ms with
    retries lasting no more than 2 seconds.  I'm inclined to suggest a
    total time of deadlock_timeout * 2 clamped to 2 seconds, but maybe
    that is over-engineering it.

    There was also a mention of possibly inserting a vacuum_delay_point()
    call outside of the AccessExclusiveLock. I don't feel strongly about
    it, but if the page scanning cost can be tracked easily, I guess it
    is better to do that.

    Other than simplifying the code based on eliminating the new GUCs,
    the coding issues I found were:

    TRUE and FALSE were used as literals.  Since true and false are used
    about 10 times as often and current code in the modified files is
    mixed, I would recommend the lowercase form. We decided it wasn't
    worth the trouble to convert the minority usage over, but I don't see
    any reason to add new instances of the minority case.

    In LockHasWaiters() the partition lock is acquired using
    LW_EXCLUSIVE. I don't see why that can't be LW_SHARED. Was there a
    reason for this, or was it just not changed after copy/paste?
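
    For reference, the pattern in question looks roughly like this (a
    sketch of the usual lock-manager partition locking; hashcode would
    come from LockTagHashCode() on the lock's tag):

        /* Sketch: read-only inspection of a lock's wait queue. */
        LWLockId    partitionLock = LockHashPartitionLock(hashcode);

        LWLockAcquire(partitionLock, LW_SHARED);    /* shared suffices: we only read */
        /* ... look up the LOCK and see whether its wait queue is non-empty ... */
        LWLockRelease(partitionLock);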

    I still need to review the timing calls, since I'm not familiar with
    them so it wasn't immediately obvious to me whether they were being
    used correctly. I have no reason to believe that they aren't, but
    feel I should check.

    Also, I want to do another pass looking just for off-by-one errors on
    blkno. Again, I have no reason to believe that there is a problem; it
    just seems like it would be a particularly easy type of mistake to
    make and miss when a key structure has this field:

    BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */

    And I want to test more.

    But if a new version could be created based on the above, that would
    be great. Chances seem relatively slim at this point that there will
    be much else to change, if anything.

    -Kevin
  • Kevin Grittner at Nov 28, 2012 at 8:33 pm

    Kevin Grittner wrote:

    I still need to review the timing calls, since I'm not familiar
    with them so it wasn't immediately obvious to me whether they
    were being used correctly. I have no reason to believe that they
    aren't, but feel I should check.
    It seems odd to me that assignment of one instr_time to another is
    done with INSTR_TIME_SET_ZERO() of the target followed by
    INSTR_TIME_ADD() with the target and the source. It seems to me
    that simple assignment would be more readable, and I can't see any
    down side.

    Why shouldn't:

    INSTR_TIME_SET_ZERO(elapsed);
    INSTR_TIME_ADD(elapsed, currenttime);
    INSTR_TIME_SUBTRACT(elapsed, starttime);

    instead be?:

    elapsed = currenttime;
    INSTR_TIME_SUBTRACT(elapsed, starttime);

    And why shouldn't:

    INSTR_TIME_SET_ZERO(starttime);
    INSTR_TIME_ADD(starttime, currenttime);

    instead be?:

    starttime = currenttime;

    Resetting starttime this way seems especially odd.
    Also, I want to do another pass looking just for off-by-one
    errors on blkno. Again, I have no reason to believe that there is
    a problem; it just seems like it would be a particularly easy
    type of mistake to make and miss when a key structure has this
    field:

    BlockNumber nonempty_pages;
    /* actually, last nonempty page + 1 */
    No problems found with that.
    And I want to test more.
    The patch worked as advertised in all my tests, but I became
    uncomfortable with the games being played with the last autovacuum
    timestamp and the count of dead tuples. Sure, that causes
    autovacuum to kick back in until it has dealt with the truncation,
    but it makes it hard for a human looking at the stat views to see
    what's going on, and I'm not sure it wouldn't lead to bad plans due
    to stale statistics.

    Personally, I would rather see us add a boolean to indicate that
    autovacuum was needed (regardless of the math on the other columns)
    and use that to control the retries -- leaving the other columns
    free to reflect reality.

    -Kevin
  • Jan Wieck at Nov 29, 2012 at 2:42 pm

    On 11/28/2012 3:33 PM, Kevin Grittner wrote:
    Kevin Grittner wrote:
    I still need to review the timing calls, since I'm not familiar
    with them so it wasn't immediately obvious to me whether they
    were being used correctly. I have no reason to believe that they
    aren't, but feel I should check.
    It seems odd to me that assignment of one instr_time to another is
    done with INSTR_TIME_SET_ZERO() of the target followed by
    INSTR_TIME_ADD() with the target and the source. It seems to me
    that simple assignment would be more readable, and I can't see any
    down side.

    Why shouldn't:

    INSTR_TIME_SET_ZERO(elapsed);
    INSTR_TIME_ADD(elapsed, currenttime);
    INSTR_TIME_SUBTRACT(elapsed, starttime);

    instead be?:

    elapsed = currenttime;
    INSTR_TIME_SUBTRACT(elapsed, starttime);

    And why shouldn't:

    INSTR_TIME_SET_ZERO(starttime);
    INSTR_TIME_ADD(starttime, currenttime);

    instead be?:

    starttime = currenttime;

    Resetting starttime this way seems especially odd.
    instr_time is LARGE_INTEGER on Win32 but struct timeval on Unix. Is

    starttime = currenttime;

    portable if those are structs?
    Also, I want to do another pass looking just for off-by-one
    errors on blkno. Again, I have no reason to believe that there is
    a problem; it just seems like it would be a particularly easy
    type of mistake to make and miss when a key structure has this
    field:

    BlockNumber nonempty_pages;
    /* actually, last nonempty page + 1 */
    No problems found with that.
    And I want to test more.
    The patch worked as advertised in all my tests, but I became
    uncomforatable with the games being played with the last autovacuum
    timestamp and the count of dead tuples. Sure, that causes
    autovacuum to kick back in until it has dealt with the truncation,
    but it makes it hard for a human looking at the stat views to see
    what's going on, and I'm not sure it wouldn't lead to bad plans due
    to stale statistics.

    Personally, I would rather see us add a boolean to indicate that
    autovacuum was needed (regardless of the math on the other columns)
    and use that to control the retries -- leaving the other columns
    free to reflect reality.
    You mean to extend the stats by yet another column? The thing is that
    this whole case happens rarely. We see this every other month or so and
    only on a rolling window table after it got severely bloated due to some
    abnormal use (purging disabled for some operational reason). The whole
    situation resolves itself after a few minutes to maybe an hour or two.

    Personally I don't think living with a few wrong stats on one table for
    that time is so bad that it warrants changing that much more code.

    Lower casing TRUE/FALSE will be done.

    If the LW_SHARED is enough in LockHasWaiters(), that's fine too.

    I think we have a consensus that the check interval should be derived
    from deadlock_timeout/10 clamped to 10ms. I'm going to remove that GUC.

    About the other two, I'm just not sure. Maybe use half the value of the
    lock waiter check interval as the lock retry interval, again clamped to
    10ms, and simply leave one GUC that controls how long autovacuum should
    retry this. I'm not too worried about the retry interval. However, the
    problem with that overall retry duration is that this value very much
    depends on the usage pattern of the database. If long running
    transactions (like >5 seconds) are the norm, then a hardcoded value of
    2 seconds before autovacuum gives up isn't going to help much.


    Jan

    --
    Anyone who trades liberty for security deserves neither
    liberty nor security. -- Benjamin Franklin
  • Tom Lane at Nov 29, 2012 at 2:47 pm

    Jan Wieck writes:
    On 11/28/2012 3:33 PM, Kevin Grittner wrote:
    Resetting starttime this way seems especially odd.
    instr_time is LARGE_INTEGER on Win32 but struct timeval on Unix. Is
    starttime = currenttime;
    portable if those are structs?
    Sure. We rely on struct assignment in lots of places.

    regards, tom lane
  • Jan Wieck at Nov 29, 2012 at 2:57 pm

    On 11/29/2012 9:46 AM, Tom Lane wrote:
    Jan Wieck <janwieck@yahoo.com> writes:
    On 11/28/2012 3:33 PM, Kevin Grittner wrote:
    Resetting starttime this way seems especially odd.
    instr_time is LARGE_INTEGER on Win32 but struct timeval on Unix. Is
    starttime = currenttime;
    portable if those are structs?
    Sure. We rely on struct assignment in lots of places.
    Will be done then.


    Thanks,
    Jan

    --
    Anyone who trades liberty for security deserves neither
    liberty nor security. -- Benjamin Franklin
  • Jan Wieck at Dec 2, 2012 at 3:13 pm
    Attached is a new patch that addresses most of the points raised in
    discussion before.

    1) Most of the configuration variables are derived from deadlock_timeout
    now. The "check for conflicting lock request" interval is
    deadlock_timeout/10, clamped to 10ms. The "try to acquire exclusive
    lock" interval is deadlock_timeout/20, also clamped to 10ms. The only
    GUC variable remaining is autovacuum_truncate_lock_try=2000ms with a
    range from 0 (just try once) to 20000ms.

    I'd like to point out that this is a significant change in
    functionality: without the config option for the check interval, there
    is no longer any way to disable the call to LockHasWaiters() and return
    to the original behavior (deadlock code kills autovacuum).

    2) The partition lock in LockHasWaiters() was lowered to LW_SHARED. The
    LW_EXCLUSIVE was indeed a copy/paste result.

    3) The instr_time handling was simplified as suggested.

    4) Lower case TRUE/FALSE.


    I did not touch the part about suppressing the stats and the ANALYZE
    step of "auto vacuum+analyze". The situation is no different today. When
    the deadlock code kills autovacuum, stats aren't updated either. And
    this patch is meant to cause autovacuum to finish the truncate in a few
    minutes or hours, so that the situation fixes itself.


    Jan

    --
    Anyone who trades liberty for security deserves neither
    liberty nor security. -- Benjamin Franklin
  • Kevin Grittner at Dec 3, 2012 at 10:42 pm

    Jan Wieck wrote:

    Attached is a new patch that addresses most of the points raised
    in discussion before.

    1) Most of the configuration variables are derived from
    deadlock_timeout now. The "check for conflicting lock request"
    interval is deadlock_timeout/10, clamped to 10ms. The "try to
    acquire exclusive lock" interval is deadlock_timeout/20, also
    clamped to 10ms. The only GUC variable remaining is
    autovacuum_truncate_lock_try=2000ms with a range from 0 (just try
    once) to 20000ms.
    If we're going to keep this GUC, we need docs for it.
    I'd like to point out that this is a significant change in
    functionality as without the config option for the check
    interval, there is no longer any possibility to disable the call
    to LockHasWaiters() and return to the original (deadlock code
    kills autovacuum) behavior.
    Arguably we could simplify the deadlock resolution code a little,
    but it seems like it is probably safer to leave it as a failsafe,
    at least for now.
    I did not touch the part about suppressing the stats and the
    ANALYZE step of "auto vacuum+analyze". The situation is no
    different today. When the deadlock code kills autovacuum, stats
    aren't updated either. And this patch is meant to cause
    autovacuum to finish the truncate in a few minutes or hours, so
    that the situation fixes itself.
    Unless someone wants to argue for more aggressive updating of the
    stats, I guess I'll yield to that argument.

    Besides the documentation, the only thing I could spot on this
    go-around was that this comment probably no longer has a reason for
    being since this is no longer conditional and what it's doing is
    obvious from the code:

    /* Initialize the starttime if we check for conflicting lock requests */

    With docs and dropping the obsolete comment, I think this will be
    ready.

    -Kevin
  • Jan Wieck at Dec 3, 2012 at 11:08 pm

    On 12/3/2012 5:42 PM, Kevin Grittner wrote:
    Jan Wieck wrote:
    Attached is a new patch that addresses most of the points raised
    in discussion before.

    1) Most of the configuration variables are derived from
    deadlock_timeout now. The "check for conflicting lock request"
    interval is deadlock_timeout/10, clamped to 10ms. The "try to
    acquire exclusive lock" interval is deadlock_timeout/20, also
    clamped to 10ms. The only GUC variable remaining is
    autovacuum_truncate_lock_try=2000ms with a range from 0 (just try
    once) to 20000ms.
    If we're going to keep this GUC, we need docs for it.
    Certainly. But since we're still debating which and how many GUC
    variables we want, I don't think doc-time has come yet.
    I'd like to point out that this is a significant change in
    functionality as without the config option for the check
    interval, there is no longer any possibility to disable the call
    to LockHasWaiters() and return to the original (deadlock code
    kills autovacuum) behavior.
    Arguably we could simplify the deadlock resolution code a little,
    but it seems like it is probably safer to leave it as a failsafe,
    at least for now.
    Thinking about it, I'm not really happy with removing the
    autovacuum_truncate_lock_check GUC at all.

    Fact is that the deadlock detection code and the configuration parameter
    for it should IMHO have nothing to do with all this in the first place.
    A properly implemented application does not deadlock. Someone running
    such a properly implemented application should be able to safely set
    deadlock_timeout to minutes without the slightest ill side effect, but
    with the benefit that the deadlock detection code itself does not add to
    the lock contention. The only reason one cannot do so today is because
    autovacuum's truncate phase could then freeze the application with an
    exclusive lock for that long.

    I believe the check interval needs to be decoupled from the
    deadlock_timeout again. This will leave us with 2 GUCs at least.


    Jan

    --
    Anyone who trades liberty for security deserves neither
    liberty nor security. -- Benjamin Franklin
  • Kevin Grittner at Dec 4, 2012 at 1:06 pm

    Jan Wieck wrote:

    Thinking about it, I'm not really happy with removing the
    autovacuum_truncate_lock_check GUC at all.

    Fact is that the deadlock detection code and the configuration
    parameter for it should IMHO have nothing to do with all this in
    the first place. A properly implemented application does not
    deadlock.
    I don't agree. I believe that in some cases it is possible and
    practicable to set access rules which would prevent deadlocks in
    application access to a database. In other cases the convolutions
    required in the code, the effort in educating dozens or hundreds of
    programmers maintaining the code (and keeping the training current
    during staff turnover), and the staff time required for compliance
    far outweigh the benefit of an occasional transaction retry.
    However, it is enough for your argument that there are cases where
    it can be done.
    Someone running such a properly implemented application should be
    able to safely set deadlock_timeout to minutes without the
    slightest ill side effect, but with the benefit that the deadlock
    detection code itself does not add to the lock contention. The
    only reason one cannot do so today is because autovacuum's
    truncate phase could then freeze the application with an
    exclusive lock for that long.

I believe the check interval needs to be decoupled from the
deadlock_timeout again.
OK
    This will leave us with 2 GUCs at least.
    Hmm. What problems do you see with hard-coding reasonable values?
    Adding two or three GUC settings for a patch with so little
    user-visible impact seems weird. And it seems to me (and also
    seemed to Robert) as though the specific values of the other two
    settings really aren't that critical as long as they are anywhere
    within a reasonable range. Configuring PostgreSQL can be
    intimidating enough without adding knobs that really don't do
    anything useful. Can you show a case where special values would be
    helpful?

    -Kevin
  • Jan Wieck at Dec 4, 2012 at 4:55 pm

    On 12/4/2012 8:06 AM, Kevin Grittner wrote:
    Jan Wieck wrote:
I believe the check interval needs to be decoupled from the
deadlock_timeout again.
OK
    This will leave us with 2 GUCs at least.
    Hmm. What problems do you see with hard-coding reasonable values?
    The question is what is reasonable?

Let's talk about the time to (re)acquire the lock first. In the cases
where truncating a table can hurt, we are dealing with many gigabytes.
The original vacuumlazy scan of them can take hours if not days. During
that scan the vacuum worker has probably spent many hours napping in the
vacuum delay points. For me, a 50ms interval for 5 seconds would be
reasonable for (re)acquiring that lock.

The reasoning behind it is that we need some sort of retry mechanism:
if autovacuum just gave up the exclusive lock because someone needed
access, it is more or less guaranteed that an immediate attempt to
reacquire it will fail until that waiter has committed. But if it can't
get the lock after 5 seconds, the system is busy enough that autovacuum
should come back much later, when the launcher kicks it off again.

    I don't care much about occupying that autovacuum worker for a few
    seconds. It just spent hours vacuuming that very table. How much harm
    will a couple more seconds do?

    The check interval for the LockHasWaiters() call however depends very
    much on the response time constraints of the application. A 200ms
    interval for example would cause the truncate phase to hold onto the
    exclusive lock for 200ms at least. That means that a steady stream of
    short running transactions would see a 100ms "blocking" on average,
    200ms max. For many applications that is probably OK. If your response
    time constraint is <=50ms on 98% of transactions, you might want to have
    that knob though.

    I admit I really have no idea what the most reasonable default for that
    value would be. Something between 50ms and deadlock_timeout/2 I guess.


    Jan

    --
    Anyone who trades liberty for security deserves neither
    liberty nor security. -- Benjamin Franklin
  • Kevin Grittner at Dec 4, 2012 at 6:51 pm

    Jan Wieck wrote:

    [arguments for GUCs]
    This is getting confusing. I thought I had already conceded the
    case for autovacuum_truncate_lock_try, and you appeared to spend
    most of your post arguing for it anyway. I think. It's a little
    hard to tell. Perhaps the best thing is to present the issue to the
    list and solicit more opinions on what to do. Please correct me if
    I mis-state any of this.

    The primary problem this patch is solving is that in some
    workloads, autovacuum will repeatedly try to truncate the unused
    pages at the end of a table, but will continually get canceled
    after burning resources because another process wants to acquire a
    lock on the table which conflicts with the one held by autovacuum.
    This is handled by the deadlock checker, so another process must
    block for the deadlock_timeout interval each time. All work done by
    the truncate phase of autovacuum is lost on each interrupted
    attempt. Statistical information is not updated, so another attempt
    will trigger the next time autovacuum looks at whether to vacuum
    the table.

It's obvious that this pattern not only fails to release
potentially large amounts of unused space back to the OS, but the
headbanging can continue to consume significant resources for an
extended period, and the repeated blocking for deadlock_timeout
could cause latency problems.

    The patch has the truncate work, which requires
    AccessExclusiveLock, check at intervals for whether another process
    is waiting on its lock. That interval is one of the timings we need
    to determine, and for which a GUC was initially proposed. I think
    that the check should be fast enough that doing it once every 20ms
    as a hard-coded interval would be good enough. When it sees this
    situation, it truncates the file for as far as it has managed to
    get, releases its lock on the table, sleeps for an interval, and
    then checks to see if the lock has become available again.

    How long it should sleep between tries to reacquire the lock is
    another possible GUC. Again, I'm inclined to think that this could
    be hard-coded. Since autovacuum was knocked off-task after doing
    some significant work, I'm inclined to make this interval a little
    bigger, but I don't think it matters a whole lot. Anything between
20ms and 100ms seems sane. Maybe 50ms?

    At any point that it is unable to acquire the lock, there is a
    check for how long this autovacuum task has been starved for the
    lock. Initially I was arguing for twice the deadlock_timeout on the
    basis that this would probably be short enough not to leave the
    autovacuum worker sidelined for too long, but long enough for the
    attempt to get past a single deadlock between two other processes.
    This is the setting Jan is least willing to concede.

    If the autovacuum worker does abandon the attempt, it will keep
    retrying, since we go out of our way to prevent the autovacuum
    process from updating the statistics based on the "incomplete"
    processing. This last interval is not how long it will attempt to
    truncate, but how long it will keep one autovacuum worker making
    unsuccessful attempts to acquire the lock before it is put to other
    uses. Workers will keep coming back to this table until the
    truncate phase is completed, just as it does without the patch; the
    difference being that anytime it gets the lock, even briefly, it is
    able to persist some progress.
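
In rough pseudo-C, the loop being described looks something like this
(the interval names and the "interrupted_by_waiter" flag are just
placeholders here; the actual patch may differ in detail):

    BlockNumber new_rel_pages;

    do
    {
        int     waited_ms = 0;

        /* try to (re)acquire the exclusive lock, napping between attempts */
        while (!ConditionalLockRelation(onerel, AccessExclusiveLock))
        {
            if (waited_ms >= giveup_after_ms)
                return;             /* starved out; autovacuum will be back */
            pg_usleep(nap_interval_ms * 1000L);
            waited_ms += nap_interval_ms;
        }

        /*
         * Re-verify the empty tail under the lock.  This backward scan
         * bails out early, reporting how far it got, as soon as it notices
         * another backend waiting on the lock (the periodic check above).
         */
        new_rel_pages = count_nondeletable_pages(onerel, vacrelstats);

        if (new_rel_pages < vacrelstats->rel_pages)
        {
            RelationTruncate(onerel, new_rel_pages);    /* persist progress */
            vacrelstats->rel_pages = new_rel_pages;
        }
        UnlockRelation(onerel, AccessExclusiveLock);
    } while (interrupted_by_waiter &&
             vacrelstats->rel_pages > vacrelstats->nonempty_pages);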

    So the question on the table is which of these three intervals
    should be GUCs, and what values to use if they aren't.

    -Kevin
  • Jan Wieck at Dec 4, 2012 at 7:06 pm

    On 12/4/2012 1:51 PM, Kevin Grittner wrote:
    Jan Wieck wrote:
    [arguments for GUCs]
    This is getting confusing. I thought I had already conceded the
    case for autovacuum_truncate_lock_try, and you appeared to spend
    most of your post arguing for it anyway. I think. It's a little
    hard to tell. Perhaps the best thing is to present the issue to the
    list and solicit more opinions on what to do. Please correct me if
    I mis-state any of this.

    The primary problem this patch is solving is that in some
    workloads, autovacuum will repeatedly try to truncate the unused
    pages at the end of a table, but will continually get canceled
    after burning resources because another process wants to acquire a
    lock on the table which conflicts with the one held by autovacuum.
    This is handled by the deadlock checker, so another process must
    block for the deadlock_timeout interval each time. All work done by
    the truncate phase of autovacuum is lost on each interrupted
    attempt. Statistical information is not updated, so another attempt
    will trigger the next time autovacuum looks at whether to vacuum
    the table.

It's obvious that this pattern not only fails to release
potentially large amounts of unused space back to the OS, but the
headbanging can continue to consume significant resources for an
extended period, and the repeated blocking for deadlock_timeout
could cause latency problems.

    The patch has the truncate work, which requires
    AccessExclusiveLock, check at intervals for whether another process
    is waiting on its lock. That interval is one of the timings we need
    to determine, and for which a GUC was initially proposed. I think
    that the check should be fast enough that doing it once every 20ms
    as a hard-coded interval would be good enough. When it sees this
    situation, it truncates the file for as far as it has managed to
    get, releases its lock on the table, sleeps for an interval, and
    then checks to see if the lock has become available again.

    How long it should sleep between tries to reacquire the lock is
    another possible GUC. Again, I'm inclined to think that this could
    be hard-coded. Since autovacuum was knocked off-task after doing
    some significant work, I'm inclined to make this interval a little
    bigger, but I don't think it matters a whole lot. Anything between
20ms and 100ms seems sane. Maybe 50ms?

    At any point that it is unable to acquire the lock, there is a
    check for how long this autovacuum task has been starved for the
    lock. Initially I was arguing for twice the deadlock_timeout on the
    basis that this would probably be short enough not to leave the
    autovacuum worker sidelined for too long, but long enough for the
    attempt to get past a single deadlock between two other processes.
    This is the setting Jan is least willing to concede.

    If the autovacuum worker does abandon the attempt, it will keep
    retrying, since we go out of our way to prevent the autovacuum
    process from updating the statistics based on the "incomplete"
    processing. This last interval is not how long it will attempt to
    truncate, but how long it will keep one autovacuum worker making
    unsuccessful attempts to acquire the lock before it is put to other
    uses. Workers will keep coming back to this table until the
    truncate phase is completed, just as it does without the patch; the
    difference being that anytime it gets the lock, even briefly, it is
    able to persist some progress.
    That is all correct.
    So the question on the table is which of these three intervals
    should be GUCs, and what values to use if they aren't.
    I could live with all the above defaults, but would like to see more
    comments on them.


    Jan

    --
    Anyone who trades liberty for security deserves neither
    liberty nor security. -- Benjamin Franklin
  • Robert Haas at Dec 5, 2012 at 3:07 pm

    On Tue, Dec 4, 2012 at 2:05 PM, Jan Wieck wrote:
    So the question on the table is which of these three intervals
    should be GUCs, and what values to use if they aren't.
    I could live with all the above defaults, but would like to see more
    comments on them.
    I largely agree with what's already been said. The only interval that
    seems to me to maybe need its own knob is the total time after which
    the autovacuum worker will give up. If we set it to 2 *
    deadlock_timeout, some people might find that a reason to raise
    deadlock_timeout. Since people *already* raise deadlock_timeout to
    obscenely high values (a minute? an hour???) and then complain that
    things blow up in their face, I think there's a decent argument to be
    made that piggybacking anything else on that setting is unwise.

    Against that, FWICT, this problem only affects a small number of
    users: Jan is the only person I can ever remember reporting this
    issue. I'm not dumb enough to think he's the only person who it
    affects; but my current belief is that it's not an enormously common
    problem. So the main argument I can see against adding a GUC is that
    the problem is too marginal to justify a setting of its own. What I
    really see as the key issue is: suppose we hardcode this to say 2
    seconds. Is that going to fix the problem effectively for 99% of the
    people who have this problem, or for 25% of the people who have this
    problem? In the former case, we probably don't need a GUC; in the
    latter case, we probably do.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Kevin Grittner at Dec 5, 2012 at 4:24 pm

    Robert Haas wrote:

    Since people *already* raise deadlock_timeout to obscenely high
    values (a minute? an hour???) and then complain that things blow
    up in their face, I think there's a decent argument to be made
    that piggybacking anything else on that setting is unwise.
    If people are really doing that, then I tend to agree. I wasn't
    aware of that practice.
    Against that, FWICT, this problem only affects a small number of
    users: Jan is the only person I can ever remember reporting this
    issue. I'm not dumb enough to think he's the only person who it
    affects; but my current belief is that it's not an enormously
    common problem. So the main argument I can see against adding a
    GUC is that the problem is too marginal to justify a setting of
    its own. What I really see as the key issue is: suppose we
    hardcode this to say 2 seconds. Is that going to fix the problem
    effectively for 99% of the people who have this problem, or for
    25% of the people who have this problem? In the former case, we
    probably don't need a GUC; in the latter case, we probably do.
    Given the fact that autovacuum will keep throwing workers at it to
    essentially loop indefinitely at an outer level, I don't think the
    exact setting of this interval is all that critical either. My gut
    feel is that anything in the 2 second to 5 second range would be
    sane, so I won't argue over any explicit setting within that range.
    Below that, I think the overhead of autovacuum coming back to the
table repeatedly would probably start to get too high; above that
    we could be causing some small, heavily-updated table to be
    neglected by autovacuum -- especially if you get multiple
    autovacuum workers tied up in this delay on different tables at the
    same time.

    Regarding how many people are affected, I have seen several reports
    of situations where users claim massive impact on performance when
    autovacuum kicks in. The reports have not included enough detail to
    quantify the impact or in most cases to establish a cause, but this
seems like it could have a noticeable impact, especially if the
    deadlock timeout was set to more than a second.

    -Kevin
  • Robert Haas at Dec 5, 2012 at 7:00 pm

    On Wed, Dec 5, 2012 at 11:24 AM, Kevin Grittner wrote:
    Robert Haas wrote:
    Since people *already* raise deadlock_timeout to obscenely high
    values (a minute? an hour???) and then complain that things blow
    up in their face, I think there's a decent argument to be made
    that piggybacking anything else on that setting is unwise.
    If people are really doing that, then I tend to agree. I wasn't
    aware of that practice.
    It's probably not quite common enough to be called a "practice", but I
    have encountered it a number of times in support situations. Alas, I
    no longer remember the details of exactly what misery it caused, but I
    do remember it wasn't good. :-)
    Against that, FWICT, this problem only affects a small number of
    users: Jan is the only person I can ever remember reporting this
    issue. I'm not dumb enough to think he's the only person who it
    affects; but my current belief is that it's not an enormously
    common problem. So the main argument I can see against adding a
    GUC is that the problem is too marginal to justify a setting of
    its own. What I really see as the key issue is: suppose we
    hardcode this to say 2 seconds. Is that going to fix the problem
    effectively for 99% of the people who have this problem, or for
    25% of the people who have this problem? In the former case, we
    probably don't need a GUC; in the latter case, we probably do.
    Given the fact that autovacuum will keep throwing workers at it to
    essentially loop indefinitely at an outer level, I don't think the
    exact setting of this interval is all that critical either. My gut
    feel is that anything in the 2 second to 5 second range would be
    sane, so I won't argue over any explicit setting within that range.
    Below that, I think the overhead of autovacuum coming back to the
table repeatedly would probably start to get too high; above that
    we could be causing some small, heavily-updated table to be
    neglected by autovacuum -- especially if you get multiple
    autovacuum workers tied up in this delay on different tables at the
    same time.
    I think that part of what's tricky here is that the dynamics of this
    problem depend heavily on table size. I handled one support case
where lowering autovacuum_naptime to 15s was an indispensable part of
    the solution, so in that case having an autovacuum worker retry for
    more than a few seconds sounds kind of insane. OTOH, that case
    involved a small, rapidly changing table. If you've got an enormous
    table where vacuum takes an hour to chug through all of it, abandoning
    the effort to truncate the table after a handful of seconds might
    sound equally insane.

Maybe it'd be sensible to relate the retry time to the time spent
vacuuming the table. Say, if the amount of time spent retrying
exceeds 10% of the time spent vacuuming the table, with a minimum of
1s and a maximum of 1min, give up. That way, big tables will get a
little more leeway than small tables, which is probably appropriate.
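
Spelled out as code, that rule would amount to something like the
following one-liner (variable names made up for illustration; Min/Max
as in c.h):

    /* give up retrying once we've spent 10% of the vacuum time on it,
     * but never less than 1 second and never more than 1 minute */
    giveup_after_ms = Min(Max(vacuum_elapsed_ms / 10, 1000), 60000);
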
    Regarding how many people are affected, I have seen several reports
    of situations where users claim massive impact on performance when
    autovacuum kicks in. The reports have not included enough detail to
    quantify the impact or in most cases to establish a cause, but this
seems like it could have a noticeable impact, especially if the
    deadlock timeout was set to more than a second.
    Yeah, I agree this could be a cause of those types of reports, but I
    don't have any concrete evidence that any of the cases I've worked
    were actually due to this specific issue. The most recent case of
    this type I worked on was due to I/O saturation - which, since it
    happened to be EC2, really meant network saturation.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Jan Wieck at Dec 6, 2012 at 3:16 am

    On 12/5/2012 2:00 PM, Robert Haas wrote:
Maybe it'd be sensible to relate the retry time to the time spent
vacuuming the table. Say, if the amount of time spent retrying
exceeds 10% of the time spent vacuuming the table, with a minimum of
1s and a maximum of 1min, give up. That way, big tables will get a
little more leeway than small tables, which is probably appropriate.
    That sort of "dynamic" approach would indeed be interesting. But I fear
    that it is going to be complex at best. The amount of time spent in
    scanning heavily depends on the visibility map. The initial vacuum scan
    of a table can take hours or more, but it does update the visibility map
    even if the vacuum itself is aborted later. The next vacuum may scan
    that table in almost no time at all, because it skips all blocks that
    are marked "all visible".

    So the total time the "scan" takes is no yardstick I'd use.


    Jan

    --
    Anyone who trades liberty for security deserves neither
    liberty nor security. -- Benjamin Franklin
  • Robert Haas at Dec 6, 2012 at 5:45 pm

    On Wed, Dec 5, 2012 at 10:16 PM, Jan Wieck wrote:
    On 12/5/2012 2:00 PM, Robert Haas wrote:

Maybe it'd be sensible to relate the retry time to the time spent
vacuuming the table. Say, if the amount of time spent retrying
exceeds 10% of the time spent vacuuming the table, with a minimum of
1s and a maximum of 1min, give up. That way, big tables will get a
little more leeway than small tables, which is probably appropriate.
    That sort of "dynamic" approach would indeed be interesting. But I fear that
    it is going to be complex at best. The amount of time spent in scanning
    heavily depends on the visibility map. The initial vacuum scan of a table
    can take hours or more, but it does update the visibility map even if the
    vacuum itself is aborted later. The next vacuum may scan that table in
    almost no time at all, because it skips all blocks that are marked "all
    visible".
    Well, if that's true, then there's little reason to worry about giving
    up quickly, because the next autovacuum a minute later won't consume
    many resources.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Jan Wieck at Dec 8, 2012 at 10:30 pm

    On 12/6/2012 12:45 PM, Robert Haas wrote:
    On Wed, Dec 5, 2012 at 10:16 PM, Jan Wieck wrote:
    That sort of "dynamic" approach would indeed be interesting. But I fear that
    it is going to be complex at best. The amount of time spent in scanning
    heavily depends on the visibility map. The initial vacuum scan of a table
    can take hours or more, but it does update the visibility map even if the
    vacuum itself is aborted later. The next vacuum may scan that table in
    almost no time at all, because it skips all blocks that are marked "all
    visible".
    Well, if that's true, then there's little reason to worry about giving
    up quickly, because the next autovacuum a minute later won't consume
    many resources.
    "Almost no time" is of course "relative" to what an actual scan and dead
    tuple removal cost. Looking at a table with 3 GB of dead tuples at the
    end, the initial vacuum scan takes hours. When vacuum comes back to this
    table, cleaning up a couple megabytes of newly deceased tuples and then
    skipping over the all visible pages may take a minute.

    Based on the discussion and what I feel is a consensus I have created an
    updated patch that has no GUC at all. The hard coded parameters in
    include/postmaster/autovacuum.h are

    AUTOVACUUM_TRUNCATE_LOCK_CHECK_INTERVAL 20 /* ms */
    AUTOVACUUM_TRUNCATE_LOCK_WAIT_INTERVAL 50 /* ms */
    AUTOVACUUM_TRUNCATE_LOCK_TIMEOUT 5000 /* ms */

    I gave that the worst workload I can think of. A pgbench (style)
    application that throws about 10 transactions per second at it, so that
there is constantly the need to give up the lock due to conflicting lock
requests and then reacquire it. A "cleanup" process is
periodically moving old tuples from the history table to an archive
table, making history a rolling window table. And a third job that 2-3
times per minute produces a transaction lasting 10 seconds, forcing
autovacuum to give up on the lock reacquisition.

Even with that workload, autovacuum is slowly but steadily chopping away
at the table.


    Jan

    --
    Anyone who trades liberty for security deserves neither
    liberty nor security. -- Benjamin Franklin
  • Jan Wieck at Dec 6, 2012 at 6:34 pm
    Kevin and Robert are well aware of most of the below. I just want to put
    this out here so other people, who haven't followed the discussion too
    closely, may chime in.

    Some details on the problem:

First of all, there is a minimum of 1000 pages that the vacuum scan
must detect as possibly being all empty at the end of a relation.
Without at least 8MB of possible free space at the end, the code never
calls lazy_truncate_heap(). This means we don't have to worry about tiny
relations at all. Any relation that stays under 8MB of turnover between
autovacuum VACUUM runs can never get into this situation.
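
For reference, the check in question in vacuumlazy.c looks roughly like
this (quoted from memory, so the details may be slightly off):

    #define REL_TRUNCATE_MINIMUM    1000
    #define REL_TRUNCATE_FRACTION   16

    /* only attempt the truncate phase if enough space might be freed */
    possibly_freeable = vacrelstats->rel_pages - vacrelstats->nonempty_pages;
    if (possibly_freeable > 0 &&
        (possibly_freeable >= REL_TRUNCATE_MINIMUM ||
         possibly_freeable >= vacrelstats->rel_pages / REL_TRUNCATE_FRACTION))
        lazy_truncate_heap(onerel, vacrelstats);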

    Relations that have higher turnover than that, but at random places or
    with a high percentage of rather static rows, don't fall into the
    problem category either. They may never accumulate that much "contiguous
    free space at the end". The turnover will be reusing free space all over
    the place. So again, lazy_truncate_heap() won't be called ever.

    Relations that eventually build up more than 8MB of free space at the
    end aren't automatically a problem. The autovacuum VACUUM scan just
    scanned those pages at the end, which means that the safety scan for
    truncate, done under exclusive lock, is checking exactly those pages at
    the end and most likely they are still in memory. The truncate safety
    scan will be fast due to a 99+% buffer cache hit rate.

The only actual problem case (I have found so far) is rolling window
tables of significant size, which can bloat to multiple times their
normal size every now and then. This is indeed a rare corner case and I
have no idea how many users may (unknowingly) be suffering from it.

    This rare corner case triggers lazy_truncate_heap() with a significant
    amount of free space to truncate. The table bloats, then all the bloat
    is deleted and the periodic 100% turnover will guarantee that all "live"
    tuples will shortly after circulate in lower block numbers again, with
    gigabytes of empty space at the end.

This by itself still isn't a problem. The existing code may do the job
just fine "unless" there is "frequent" access to that very table. Only
with this special combination of circumstances do we actually have a
problem.

    Only now, with a significant amount of free space at the end and
    frequent access to the table, the truncate safety scan takes long enough
    and has to actually read pages from disk to interfere with client
    transactions.

    At this point, the truncate safety scan may have to be interrupted to
    let the frequent other traffic go through. This is what we accomplish
    with the autovacuum_truncate_lock_check interval, where we voluntarily
    release the lock whenever someone else needs it. I agree with Kevin that
    a 20ms check interval is reasonable because the code to check this is
    even less expensive than releasing the exclusive lock we're holding.

    At the same time, completely giving up and relying on the autovacuum
launcher to restart another worker isn't as free as it looks
either. The next autovacuum worker will have to do the VACUUM scan
    first, before getting to the truncate phase. We cannot just skip blindly
    to the truncate code. With repeated abortion of the truncate, the table
    would deteriorate and accumulate dead tuples again. The removal of dead
    tuples and their index tuples has priority.

As said earlier in the discussion, the VACUUM scan will skip pages that
are marked as completely visible. So the scan won't physically read the
majority of the empty pages at the end of the table over and over. But
it will at least scan all pages that had been modified since the last
VACUUM run.

    To me this means that we want to be more generous to the truncate code
    about acquiring the exclusive lock. In my tests, I've seen that a
    rolling window table with a "live" set of just 10 MB or so, but empty
    space of 3 GB, can still have a 2 minute VACUUM scan time. Throwing that
work away because we can't acquire the exclusive lock within 2 seconds
    is a waste of effort.

    Something in between 2-60 seconds sounds more reasonable to me.


    Jan

    --
    Anyone who trades liberty for security deserves neither
    liberty nor security. -- Benjamin Franklin
  • Kevin Grittner at Dec 9, 2012 at 7:37 pm

    Jan Wieck wrote:

    Based on the discussion and what I feel is a consensus I have
    created an updated patch that has no GUC at all. The hard coded
    parameters in include/postmaster/autovacuum.h are

    AUTOVACUUM_TRUNCATE_LOCK_CHECK_INTERVAL 20 /* ms */
    AUTOVACUUM_TRUNCATE_LOCK_WAIT_INTERVAL 50 /* ms */
    AUTOVACUUM_TRUNCATE_LOCK_TIMEOUT 5000 /* ms */
    Since these really aren't part of the external API and are only
    referenced in vacuumlazy.c, it seems more appropriate to define
    them there.
    I gave that the worst workload I can think of. A pgbench (style)
    application that throws about 10 transactions per second at it,
so that there is constantly the need to give up the lock due to
conflicting lock requests and then reacquire it. A
"cleanup" process is periodically moving old tuples from the
history table to an archive table, making history a rolling
window table. And a third job that 2-3 times per minute produces
a transaction lasting 10 seconds, forcing autovacuum to give up on
    the lock reacquisition.

Even with that workload, autovacuum is slowly but steadily chopping
away at the table.
    Applies with minor offsets, builds without warning, and passes
    `make check-world`. My tests based on your earlier posted test
    script confirm the benefit.

    There are some minor white-space issues; for example git diff
    --color shows some trailing spaces in comments.

    There are no docs, but since there are no user-visible changes in
    behavior other than better performance and more prompt and reliable
truncation of tables where we were already doing so, it doesn't seem
like any new docs are needed. Due to the nature of the problem,
tests are tricky to run correctly and take a long time to run, so I
don't see how any regression tests would be appropriate, either.

    This patch seems ready for committer, and I would be comfortable
    with making the minor changes I mention above and committing it.
    I'll wait a day or two to allow any other comments or objections.

To summarize, there has been pathological behavior in an
infrequently-encountered corner case of autovacuum, wasting a lot
of resources indefinitely when it is encountered; this patch gives
a major performance improvement in this situation without any
    other user-visible change and without requiring any new GUCs. It
    adds a new public function in the locking area to allow a process
    to check whether a particular lock it is holding is blocking any
    other process, and another to wrap that to make it easy to check
    whether the lock held on a particular table is blocking another
    process. It uses this new capability to be smarter about scheduling
    autovacuum's truncation work, and to avoid throwing away
    incremental progress in doing so.
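
For reference, the two new functions are roughly:

    /* lock.c: is any other backend waiting for a lock we are holding? */
    extern bool LockHasWaiters(const LOCKTAG *locktag, LOCKMODE lockmode,
                               bool sessionLock);

    /* lmgr.c: convenience wrapper for relation locks, used e.g. as
     *   if (LockHasWaitersRelation(onerel, AccessExclusiveLock)) ...
     */
    extern bool LockHasWaitersRelation(Relation relation, LOCKMODE lockmode);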

    As such, I don't think it would be crazy to back-patch this, but I
    think it would be wise to allow it to be proven on master/9.3 for a
    while before considering that.

    -Kevin
  • Jan Wieck at Dec 11, 2012 at 6:11 pm

    On 12/9/2012 2:37 PM, Kevin Grittner wrote:
    Jan Wieck wrote:
    Based on the discussion and what I feel is a consensus I have
    created an updated patch that has no GUC at all. The hard coded
    parameters in include/postmaster/autovacuum.h are

    AUTOVACUUM_TRUNCATE_LOCK_CHECK_INTERVAL 20 /* ms */
    AUTOVACUUM_TRUNCATE_LOCK_WAIT_INTERVAL 50 /* ms */
    AUTOVACUUM_TRUNCATE_LOCK_TIMEOUT 5000 /* ms */
    Since these really aren't part of the external API and are only
    referenced in vacuumlazy.c, it seems more appropriate to define
    them there.
    I gave that the worst workload I can think of. A pgbench (style)
    application that throws about 10 transactions per second at it,
so that there is constantly the need to give up the lock due to
conflicting lock requests and then reacquire it. A
"cleanup" process is periodically moving old tuples from the
history table to an archive table, making history a rolling
window table. And a third job that 2-3 times per minute produces
a transaction lasting 10 seconds, forcing autovacuum to give up on
    the lock reacquisition.

Even with that workload, autovacuum is slowly but steadily chopping
away at the table.
    Applies with minor offsets, builds without warning, and passes
    `make check-world`. My tests based on your earlier posted test
    script confirm the benefit.

    There are some minor white-space issues; for example git diff
    --color shows some trailing spaces in comments.
    Cleaned up all of those.


    Jan

    --
    Anyone who trades liberty for security deserves neither
    liberty nor security. -- Benjamin Franklin
  • Kevin Grittner at Dec 11, 2012 at 8:38 pm

    Jan Wieck wrote:

    Cleaned up all of those.
    Applied with trivial editing, mostly from a pgindent run against
    modified files.

    -Kevin
