This seems to be bug month for my client. Now they are seeing periods
where all new connections fail immediately with the error:

FATAL: lock AccessShareLock on object 0/1260/0 is already held

This happens on postgresql 8.4.7 on a large (512GB, 32 core) system that has
been up for months. It started happening sporadically a few days ago. It will
do this for a period of several minutes to an hour and then go back to
normal for hours or days.

One complete failing session out of several hundred around that time:
-----------------
2011-08-09 00:01:04.446 8823 [unknown] [unknown] LOG: connection received: host=op05.xxx port=34067
2011-08-09 00:01:04.446 8823 c77 apps LOG: connection authorized: user=apps database=c77
2011-08-09 00:01:04.449 8823 c77 apps FATAL: lock AccessShareLock on object 0/1260/0 is already held
------------------

What can I do to help track this down?

-dg


--
David Gould daveg@sonic.net 510 536 1443 510 282 0869
If simplicity worked, the world would be overrun with insects.

  • Robert Haas at Aug 14, 2011 at 4:16 am

    On Fri, Aug 12, 2011 at 7:19 PM, daveg wrote:
    This seems to be bug month for my client. Now they are seeing periods
    where all new connections fail immediately with the error:

    FATAL:  lock AccessShareLock on object 0/1260/0 is already held

    This happens on postgresql 8.4.7 on a large (512GB, 32 core) system that has
    been up for months. It started happening sporadically a few days ago. It will
    do this for a period of several minutes to an hour and then go back to
    normal for hours or days.

    One complete failing session out of several hundred around that time:
    -----------------
    2011-08-09 00:01:04.446  8823  [unknown]  [unknown]  LOG:  connection received: host=op05.xxx port=34067
    2011-08-09 00:01:04.446  8823  c77  apps  LOG:  connection authorized: user=apps database=c77
    2011-08-09 00:01:04.449  8823  c77  apps  FATAL:  lock AccessShareLock on object 0/1260/0 is already held
    ------------------

    What can I do to help track this down?
    I've seen that error (though not that exact fact pattern) caused by
    bad RAM. It's unclear to me what else could cause it.

    In terms of debugging, it seems like it might be sensible to start by
    injecting some debugging code that dumps out the contents of the LOCK
    and LOCALLOCK structures at the point the error occurs.
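
    A minimal sketch of what such a dump might look like, written against my
    reading of the 8.4/9.0-era lock.h structures; the helper name is made up,
    and the field names (nLocks, grantMask, holdMask, etc.) should be checked
    against the exact tree being patched:
    -----------------
    #include "postgres.h"
    #include "storage/lock.h"

    /*
     * Hypothetical debugging helper: dump the LOCALLOCK bookkeeping and the
     * associated shared-memory LOCK/PROCLOCK just before raising the
     * "already held" error in LockAcquireExtended().
     */
    static void
    DebugDumpLockState(LOCALLOCK *locallock)
    {
        LOCK       *lock = locallock->lock;
        PROCLOCK   *proclock = locallock->proclock;

        elog(LOG, "LOCALLOCK %u/%u/%u mode %d: nLocks=%ld numLockOwners=%d",
             locallock->tag.lock.locktag_field1,
             locallock->tag.lock.locktag_field2,
             locallock->tag.lock.locktag_field3,
             (int) locallock->tag.mode,
             (long) locallock->nLocks,
             locallock->numLockOwners);

        if (lock)
            elog(LOG, "LOCK: grantMask=%x waitMask=%x nRequested=%d nGranted=%d",
                 lock->grantMask, lock->waitMask,
                 lock->nRequested, lock->nGranted);

        if (proclock)
            elog(LOG, "PROCLOCK: holdMask=%x releaseMask=%x",
                 proclock->holdMask, proclock->releaseMask);
    }
    ------------------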

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Daveg at Aug 14, 2011 at 5:44 am

    On Sun, Aug 14, 2011 at 12:16:39AM -0400, Robert Haas wrote:
    On Fri, Aug 12, 2011 at 7:19 PM, daveg wrote:
    This seems to be bug month for my client. Now they are seeing periods
    where all new connections fail immediately with the error:

    FATAL:  lock AccessShareLock on object 0/1260/0 is already held

    This happens on postgresql 8.4.7 on a large (512GB, 32 core) system that has
    been up for months. It started happening sporadically a few days ago. It will
    do this for a period of several minutes to an hour and then go back to
    normal for hours or days.

    One complete failing session out of several hundred around that time:
    -----------------
    2011-08-09 00:01:04.446  8823  [unknown]  [unknown]  LOG:  connection received: host=op05.xxx port=34067
    2011-08-09 00:01:04.446  8823  c77  apps  LOG:  connection authorized: user=apps database=c77
    2011-08-09 00:01:04.449  8823  c77  apps  FATAL:  lock AccessShareLock on object 0/1260/0 is already held
    ------------------

    What can I do to help track this down?
    I've seen that error (though not that exact fact pattern) caused by
    bad RAM. It's unclear to me what else could cause it.
    I'll look into that. I think it is only happening on one host, so that might
    make sense. On the other hand, these are pretty fancy hosts, all ECC and that,
    so I'd hope they would have squeaked about bad RAM.
    In terms of debugging, it seems like it might be sensible to start by
    injecting some debugging code that dumps out the contents of the LOCK
    and LOCALLOCK structures at the point the error occurs.
    Hmm, we will update to 9.0 next week on these hosts, so I'll try to hold off
    on this part at least until then.

    -dg

    --
    David Gould daveg@sonic.net 510 536 1443 510 282 0869
    If simplicity worked, the world would be overrun with insects.
  • Daveg at Aug 23, 2011 at 7:24 am

    On Sun, Aug 14, 2011 at 12:16:39AM -0400, Robert Haas wrote:
    On Fri, Aug 12, 2011 at 7:19 PM, daveg wrote:
    This seems to be bug month for my client. Now they are seeing periods
    where all new connections fail immediately with the error:

    FATAL:  lock AccessShareLock on object 0/1260/0 is already held
    ...
    What can I do to help track this down?
    I've seen that error (though not that exact fact pattern) caused by
    bad RAM. It's unclear to me what else could cause it.

    In terms of debugging, it seems like it might be sensible to start by
    injecting some debugging code that dumps out the contents of the LOCK
    and LOCALLOCK structures at the point the error occurs.
    I've made up the attached patch to print this, please suggest any additions.
    I'll deploy this on a couple of the production hosts that have had the
    issue this evening, but there is no telling when or if it will strike next.

    -dg

    --
    David Gould daveg@sonic.net 510 536 1443 510 282 0869
    If simplicity worked, the world would be overrun with insects.
  • Daveg at Aug 22, 2011 at 7:31 am

    On Fri, Aug 12, 2011 at 04:19:37PM -0700, daveg wrote:

    This seems to be bug month for my client. Now they are seeing periods
    where all new connections fail immediately with the error:

    FATAL: lock AccessShareLock on object 0/1260/0 is already held

    This happens on postgresql 8.4.7 on a large (512GB, 32 core) system that has
    been up for months. It started happening sporadically a few days ago. It will
    do this for a period of several minutes to an hour and then go back to
    normal for hours or days.

    One complete failing session out of several hundred around that time:
    -----------------
    2011-08-09 00:01:04.446 8823 [unknown] [unknown] LOG: connection received: host=op05.xxx port=34067
    2011-08-09 00:01:04.446 8823 c77 apps LOG: connection authorized: user=apps database=c77
    2011-08-09 00:01:04.449 8823 c77 apps FATAL: lock AccessShareLock on object 0/1260/0 is already held
    ------------------
    This is to add additional information to the original report:

    For a while this was happening on many different databases in one postgresql
    8.4.7 instance on a single large host ('U2' 512GB 64cpu) running RH 5.
    That has been quiet for several days and the newest batches of errors have
    happened only on a single database 'c23', in a postgresql 9.0.4 instance
    on a smaller host ('A', 64GB 8cpu) running SuSE 10.2.

    No memory errors or other misbehaviour have been seen on either of these
    hosts in recent months.

    The original error was:

    lock AccessShareLock on object 0/1260/0 is already held

    which is for pg_database. The recent errors are:

    lock AccessShareLock on object 16403/2615/0 is already held

    which is for pg_namespace in database c23.

    All of the original and most of the recent batches of errors were immediately
    after connecting to a database and being authorized, that is, before any
    statements were attempted. However, some of the most recent are on the first
    "query" statement. That is, after logging in and doing things like "set
    transaction ...", the first select would hit this error.

    It seems to come in clusters, sometimes, which suggests something shared
    by multiple processes. For example, here are the times for the errors
    on c23 in the afternoon of August 20:

    20 07:14:12.722

    20 16:05:07.798
    20 16:05:07.808

    20 16:05:10.519

    20 16:07:07.726
    20 16:07:08.722
    20 16:07:09.734
    20 16:07:10.656

    20 16:07:25.436

    20 16:22:23.983
    20 16:22:24.014
    20 16:22:24.335
    20 16:22:24.409
    20 16:22:24.477
    20 16:22:24.499
    20 16:22:24.516

    20 16:30:58.210

    20 16:31:15.261
    20 16:31:15.296
    20 16:31:15.324
    20 16:31:15.348

    20 18:06:16.515

    20 18:06:49.198
    20 18:06:49.204

    20 18:06:51.444

    20 21:03:05.940

    So far I've got:

    - affects system tables
    - happens very soon after process startup
    - in 8.4.7 and 9.0.4
    - not likely to be hardware or OS related
    - happens in clusters for periods of a few seconds to many minutes

    I'll work on printing the LOCK and LOCALLOCK when it happens, but it's
    hard to get downtime to pick up new builds. Any other ideas on getting to
    the bottom of this?

    Thanks

    -dg

    --
    David Gould daveg@sonic.net 510 536 1443 510 282 0869
    If simplicity worked, the world would be overrun with insects.
  • Robert Haas at Aug 23, 2011 at 4:15 pm

    On Mon, Aug 22, 2011 at 3:31 AM, daveg wrote:
    So far I've got:

    - affects system tables
    - happens very soon after process startup
    - in 8.4.7 and 9.0.4
    - not likely to be hardware or OS related
    - happens in clusters for periods of a few seconds to many minutes

    I'll work on printing the LOCK and LOCALLOCK when it happens, but it's
    hard to get downtime to pick up new builds. Any other ideas on getting to
    the bottom of this?
    I've been thinking this one over, and doing a little testing. I'm
    still stumped, but I have a few thoughts. What that error message is
    really saying is that the LOCALLOCK bookkeeping doesn't match the
    PROCLOCK bookkeeping; it doesn't tell us which one is to blame.

    My first thought was that there might be some situation where
    LockAcquireExtended() gets an interrupt between the time it does the
    LOCALLOCK lookup and the time it acquires the partition lock. If the
    interrupt handler were to acquire (but not release) a lock in the
    meantime, then we'd get confused. However, I can't see how that's
    possible. I inserted some debugging code to fail an assertion if
    CHECK_FOR_INTERRUPTS() gets invoked in between those two points or if
    ImmediateInterruptOK is set on entering the function, and the system
    still passes regression tests.

    My second thought is that perhaps a process is occasionally managing
    to exit without fully cleaning up the associated PROCLOCK entry. At
    first glance, it appears that this would explain the observed
    symptoms. A new backend gets the PGPROC belonging to the guy who
    didn't clean up after himself, hits the error, and disconnects,
    sticking himself right back on to the head of the SHM_QUEUE where the
    next connection will inherit the same PGPROC and hit the same problem.
    But it's not clear to me what could cause the system to get into this
    state in the first place, or how it would eventually right itself.

    It might be worth kludging up your system to add a test to
    InitProcess() to verify that all of the myProcLocks SHM_QUEUEs are
    either NULL or empty, along the lines of the attached patch (which
    assumes that assertions are enabled; otherwise, put in an elog() of
    some sort). Actually, I wonder if we shouldn't move all the
    SHMQueueInit() calls for myProcLocks to InitProcGlobal() rather than
    doing it over again every time someone calls InitProcess(). Besides
    being a waste of cycles, it's probably less robust this way. If
    there somehow are leftovers in one of those queues, the next
    successful call to LockReleaseAll() ought to clean up the mess, but of
    course there's no chance of that working if we've nuked the queue
    pointers.
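
    A sketch of the kind of elog()-based check being described, assuming it is
    dropped into InitProcess() in proc.c just before the existing loop that
    calls SHMQueueInit() on MyProc->myProcLocks[]; the helper name and message
    wording are made up:
    -----------------
    #include "postgres.h"
    #include "storage/lock.h"
    #include "storage/proc.h"
    #include "storage/shmem.h"

    /*
     * Hypothetical check: complain if any of the per-partition PROCLOCK
     * queues inherited with this PGPROC is non-empty.  Uses elog(FATAL)
     * rather than Assert() so it also fires in production builds.
     */
    static void
    CheckMyProcLocksEmpty(void)
    {
        int         i;

        for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
        {
            SHM_QUEUE  *queue = &(MyProc->myProcLocks[i]);

            if (!SHMQueueEmpty(queue))
                elog(FATAL,
                     "InitProcess: myProcLocks[%d] not empty: queue %p (prev %p, next %p)",
                     i, (void *) queue,
                     (void *) queue->prev, (void *) queue->next);
        }
    }
    ------------------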

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Daveg at Sep 7, 2011 at 9:16 am

    On Tue, Aug 23, 2011 at 12:15:23PM -0400, Robert Haas wrote:
    On Mon, Aug 22, 2011 at 3:31 AM, daveg wrote:
    So far I've got:

    - affects system tables
    - happens very soon after process startup
    - in 8.4.7 and 9.0.4
    - not likely to be hardware or OS related
    - happens in clusters for periods of a few seconds to many minutes

    I'll work on printing the LOCK and LOCALLOCK when it happens, but it's
    hard to get downtime to pick up new builds. Any other ideas on getting to
    the bottom of this?
    I've been thinking this one over, and doing a little testing. I'm
    still stumped, but I have a few thoughts. What that error message is
    really saying is that the LOCALLOCK bookkeeping doesn't match the
    PROCLOCK bookkeeping; it doesn't tell us which one is to blame. ...
    My second thought is that perhaps a process is occasionally managing
    to exit without fully cleaning up the associated PROCLOCK entry. At
    first glance, it appears that this would explain the observed
    symptoms. A new backend gets the PGPROC belonging to the guy who
    didn't clean up after himself, hits the error, and disconnects,
    sticking himself right back on to the head of the SHM_QUEUE where the
    next connection will inherit the same PGPROC and hit the same problem.
    But it's not clear to me what could cause the system to get into this
    state in the first place, or how it would eventually right itself.

    It might be worth kludging up your system to add a test to
    InitProcess() to verify that all of the myProcLocks SHM_QUEUEs are
    either NULL or empty, along the lines of the attached patch (which
    assumes that assertions are enabled; otherwise, put in an elog() of
    some sort). Actually, I wonder if we shouldn't move all the
    SHMQueueInit() calls for myProcLocks to InitProcGlobal() rather than
    doing it over again every time someone calls InitProcess(). Besides
    being a waste of cycles, it's probably less robust this way. If
    there somehow are leftovers in one of those queues, the next
    successful call to LockReleaseAll() ought to clean up the mess, but of
    course there's no chance of that working if we've nuked the queue
    pointers.
    I did this in the elog flavor as we don't build production images with asserts.
    It has been running on all hosts for a few days. Today it hit the extra
    checks in initproc.

    00:02:32.782 8626 [unknown] [unknown] LOG: connection received: host=bk0 port=42700
    00:02:32.783 8627 [unknown] [unknown] LOG: connection received: host=op2 port=45876
    00:02:32.783 8627 d61 apps FATAL: Initprocess myProclocks[4] not empty: queue 0x2ae6b4b895f8 (prev 0x2ae6b4a2b558, next 0x2ae6b4a2b558)
    00:02:32.783 8626 d35 postgres LOG: connection authorized: user=postgres database=c35
    00:02:32.783 21535 LOG: server process (PID 8627) exited with exit code 1
    00:02:32.783 21535 LOG: terminating any other active server processes
    00:02:32.783 8626 c35 postgres WARNING: terminating connection because of crash of another server process

    The patch that produced this is attached. If you can think of anything I
    can add to this to help I'd be happy to do so. Also, can I clean this up
    and continue somehow? Maybe clear the queue instead of having to restart?
    Or is there a way to just pause this proc here, maybe mark it not to be used
    and exit, or just to sleep forever so I can debug later?

    Thanks

    -dg

    --
    David Gould daveg@sonic.net 510 536 1443 510 282 0869
    If simplicity worked, the world would be overrun with insects.
  • Robert Haas at Sep 7, 2011 at 1:46 pm

    On Wed, Sep 7, 2011 at 5:16 AM, daveg wrote:
    On Tue, Aug 23, 2011 at 12:15:23PM -0400, Robert Haas wrote:
    On Mon, Aug 22, 2011 at 3:31 AM, daveg wrote:
    So far I've got:

    - affects system tables
    - happens very soon after process startup
    - in 8.4.7 and 9.0.4
    - not likely to be hardware or OS related
    - happens in clusters for periods of a few seconds to many minutes

    I'll work on printing the LOCK and LOCALLOCK when it happens, but it's
    hard to get downtime to pick up new builds. Any other ideas on getting to
    the bottom of this?
    I've been thinking this one over, and doing a little testing. I'm
    still stumped, but I have a few thoughts.  What that error message is
    really saying is that the LOCALLOCK bookkeeping doesn't match the
    PROCLOCK bookkeeping; it doesn't tell us which one is to blame. ...
    My second thought is that perhaps a process is occasionally managing
    to exit without fully cleaning up the associated PROCLOCK entry.  At
    first glance, it appears that this would explain the observed
    symptoms.  A new backend gets the PGPROC belonging to the guy who
    didn't clean up after himself, hits the error, and disconnects,
    sticking himself right back on to the head of the SHM_QUEUE where the
    next connection will inherit the same PGPROC and hit the same problem.
    But it's not clear to me what could cause the system to get into this
    state in the first place, or how it would eventually right itself.

    It might be worth kludging up your system to add a test to
    InitProcess() to verify that all of the myProcLocks SHM_QUEUEs are
    either NULL or empty, along the lines of the attached patch (which
    assumes that assertions are enabled; otherwise, put in an elog() of
    some sort).  Actually, I wonder if we shouldn't move all the
    SHMQueueInit() calls for myProcLocks to InitProcGlobal() rather than
    doing it over again every time someone calls InitProcess().  Besides
    being a waste of cycles, it's probably less robust this way.   If
    there somehow are leftovers in one of those queues, the next
    successful call to LockReleaseAll() ought to clean up the mess, but of
    course there's no chance of that working if we've nuked the queue
    pointers.
    I did this in the elog flavor as we don't build production images with asserts.
    It has been running on all hosts for a few days. Today it hit the extra
    checks in initproc.

    00:02:32.782  8626  [unknown] [unknown]  LOG:  connection received: host=bk0 port=42700
    00:02:32.783  8627  [unknown] [unknown]  LOG:  connection received: host=op2 port=45876
    00:02:32.783  8627  d61 apps  FATAL:  Initprocess myProclocks[4] not empty: queue 0x2ae6b4b895f8 (prev 0x2ae6b4a2b558, next 0x2ae6b4a2b558)
    00:02:32.783  8626  d35 postgres  LOG:  connection authorized: user=postgres database=c35
    00:02:32.783  21535  LOG:  server process (PID 8627) exited with exit code 1
    00:02:32.783  21535  LOG:  terminating any other active server processes
    00:02:32.783  8626  c35 postgres  WARNING:  terminating connection because of crash of another server process

    The patch that produced this is attached. If you can think of anything I
    can add to this to help I'd be happy to do so. Also, can I clean this up
    and continue somehow? Maybe clear the queue instead of having to restart?
    Or is there a way to just pause this proc here, maybe mark it not to be used
    and exit, or just to sleep forever so I can debug later?
    I think what we really need to know here is: when the debugging code
    you injected here fires, what happened to the previous owner of that
    PGPROC that caused it to not clean up its state properly? The PGPROC
    that 8627 inherited in the above example is obviously messed up - but
    what did the last guy do that messed it up? It would be nice to log
    the address of the PGPROC in every log message here - then you could
    go back and look to see whether the last guy terminated in some
    unusual way. In the meantime, could you look back a couple of
    minutes from the time of the above occurrence and see if there are any
    FATAL errors, or unusual ERRORs?

    After spending some time staring at the code, I do have one idea as to
    what might be going on here. When a backend is terminated,
    ShutdownPostgres() calls AbortOutOfAnyTransaction() and then
    LockReleaseAll(USER_LOCKMETHOD, true). The second call releases all
    user locks, and the first one (or so we suppose) releases all normal
    locks as part of aborting the transaction. But what if there's no
    transaction in progress? In that case, AbortOutOfAnyTransaction()
    won't do anything - which is fine as far as transaction-level locks
    go, because we shouldn't be holding any of those anyway if there's no
    transaction in progress. However, if we hold a session-level lock at
    that point, then we'd orphan it. We don't make much use of session
    locks. As far as I can see, they are used by (1) VACUUM, (2) CREATE
    INDEX CONCURRENTLY, (3) ALTER DATABASE .. SET TABLESPACE, and (4) on
    standby servers, redo of DROP DATABASE actions. Any chance one of
    those died or was killed off around the time this happened?

    One thing that might be worth doing in the interest of debugging is
    adding code just like what you've already added to ShutdownPostgres(),
    *after* calling AbortOutOfAnyTransaction() and LockReleaseAll(). That
    might enable us to trap the problem at the time it happens, rather
    than waiting until the next backend gets that PGPROC. Of course that
    still might not answer the question of how it got like that, but at
    least we'd be getting closer to the time of the actual event.
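
    A sketch of that placement, assuming the 9.0-era ShutdownPostgres() in
    postinit.c; the first two calls mirror the existing function, and the loop
    at the end is the debugging addition (the message wording is made up):
    -----------------
    #include "postgres.h"
    #include "access/xact.h"
    #include "storage/lock.h"
    #include "storage/proc.h"
    #include "storage/shmem.h"

    static void
    ShutdownPostgres(int code, Datum arg)
    {
        int         i;

        /* Make sure we've killed any active transaction */
        AbortOutOfAnyTransaction();

        /*
         * User locks are not released by transaction end, so be sure to
         * release them explicitly.
         */
        LockReleaseAll(USER_LOCKMETHOD, true);

        /*
         * Debugging addition: if anything is still queued on this backend's
         * shared lock lists after the normal cleanup, report it now, so the
         * orphaning backend is identified rather than the next backend to
         * inherit this PGPROC.
         */
        for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
        {
            if (!SHMQueueEmpty(&(MyProc->myProcLocks[i])))
                elog(WARNING,
                     "ShutdownPostgres: myProcLocks[%d] not empty after cleanup",
                     i);
        }
    }
    ------------------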

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Tom Lane at Sep 7, 2011 at 2:20 pm

    Robert Haas writes:
    After spending some time staring at the code, I do have one idea as to
    what might be going on here. When a backend is terminated,
    ShutdownPostgres() calls AbortOutOfAnyTransaction() and then
    LockReleaseAll(USER_LOCKMETHOD, true). The second call releases all
    user locks, and the first one (or so we suppose) releases all normal
    locks as part of aborting the transaction. But what if there's no
    transaction in progress? In that case, AbortOutOfAnyTransaction()
    won't do anything - which is fine as far as transaction-level locks
    go, because we shouldn't be holding any of those anyway if there's no
    transaction in progress. However, if we hold a session-level lock at
    that point, then we'd orphan it. We don't make much use of session
    locks. As far as I can see, they are used by (1) VACUUM, (2) CREATE
    INDEX CONCURRENTLY, (3) ALTER DATABASE .. SET TABLESPACE, and (4) on
    standby servers, redo of DROP DATABASE actions. Any chance one of
    those died or was killed off around the time this happened?
    I don't believe this theory at all, because if that were the issue,
    we'd have heard about it long since. The correct theory has to involve
    a very-little-used triggering condition. At the moment I'm wondering
    about advisory (userspace) locks ... Dave, do your apps use any of those?

    regards, tom lane
  • Daveg at Sep 7, 2011 at 8:22 pm

    On Wed, Sep 07, 2011 at 10:20:24AM -0400, Tom Lane wrote:
    Robert Haas <robertmhaas@gmail.com> writes:
    After spending some time staring at the code, I do have one idea as to
    what might be going on here. When a backend is terminated,
    ShutdownPostgres() calls AbortOutOfAnyTransaction() and then
    LockReleaseAll(USER_LOCKMETHOD, true). The second call releases all
    user locks, and the first one (or so we suppose) releases all normal
    locks as part of aborting the transaction. But what if there's no
    transaction in progress? In that case, AbortOutOfAnyTransaction()
    won't do anything - which is fine as far as transaction-level locks
    go, because we shouldn't be holding any of those anyway if there's no
    transaction in progress. However, if we hold a session-level lock at
    that point, then we'd orphan it. We don't make much use of session
    locks. As far as I can see, they are used by (1) VACUUM, (2) CREATE
    INDEX CONCURRENTLY, (3) ALTER DATABASE .. SET TABLESPACE, and (4) on
    standby servers, redo of DROP DATABASE actions. Any chance one of
    those died or was killed off around the time this happened?
    I don't believe this theory at all, because if that were the issue,
    we'd have heard about it long since. The correct theory has to involve
    a very-little-used triggering condition. At the moment I'm wondering
    about advisory (userspace) locks ... Dave, do your apps use any of those?
    Yes, we make extensive use of advisory locks. That was my thought too when
    Robert mentioned session level locks.

    I'm happy to add any additional instrumentation, but my client would be
    happier to actually run it if there was a way to recover from this without
    an unplanned outage. Is there something I can do when the patch detects the
    problem to be able to continue without a restart? Is it safe to just reset
    the proclock queue? I don't think they would mind leaking locks, for instance,
    with a later planned restart to clear them up, nearly as much as they mind
    unscheduled downtime.

    Thanks

    -dg

    --
    David Gould daveg@sonic.net 510 536 1443 510 282 0869
    If simplicity worked, the world would be overrun with insects.
  • Robert Haas at Sep 7, 2011 at 8:28 pm

    On Wed, Sep 7, 2011 at 4:22 PM, daveg wrote:
    Yes, we make extensive use of advisory locks. That was my thought too when
    Robert mentioned session level locks.

    I'm happy to add any additional instrumentation, but my client would be
    happier to actually run it if there was a way to recover from this without
    an unplanned outage. Is there something I can do when the patch detects the
    problem to be able to continue without a restart?
    Well, basically, you want to do the same thing that LockRelease()
    would do - remove the PROCLOCK from the SHM_QUEUE and delete it from
    the hash table, adjusting the counts in the LOCK object as
    appropriate. If you just ignore the failure, you'll get the "blah
    blah blah is already held" messages you were having before.

    Tom's right to be skeptical of my theory, because it would require a
    CHECK_FOR_INTERRUPTS() outside of a transaction block in one of the
    pathways that use session-level locks, and I can't find one.

    OTOH, I'm skeptical of the theory that this involves userlocks,
    because this whole thread started because of a complaint about lock
    0/1260/0 already being held. That ain't no userlock.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Tom Lane at Sep 7, 2011 at 8:55 pm

    Robert Haas writes:
    Tom's right to be skeptical of my theory, because it would require a
    CHECK_FOR_INTERRUPTS() outside of a transaction block in one of the
    pathways that use session-level locks, and I can't find one.
    More to the point: session-level locks are released on error. The only
    way to get out of a transaction while still holding one is for the
    VACUUM-or-whichever-command code to deliberately commit and exit while
    still holding it. An error exit path would release the lock.
    OTOH, I'm skeptical of the theory that this involves userlocks,
    because this whole thread started because of a complaint about lock
    0/1260/0 already being held. That ain't no userlock.
    Yeah, and for that matter it seems to let VACUUM off the hook too.
    If we assume that the reported object ID is non-corrupt (and if it's
    always the same, that seems like the way to bet) then this is a lock
    on pg_authid.

    Hmmm ... could the pathway involve an error exit from client
    authentication? We're still finding bugs in the 9.0 rewrite of
    auth-time database access.

    regards, tom lane
  • Robert Haas at Sep 7, 2011 at 10:17 pm

    On Wed, Sep 7, 2011 at 4:55 PM, Tom Lane wrote:
    Yeah, and for that matter it seems to let VACUUM off the hook too.
    If we assume that the reported object ID is non-corrupt (and if it's
    always the same, that seems like the way to bet) then this is a lock
    on pg_authid.

    Hmmm ... could the pathway involve an error exit from client
    authentication?  We're still finding bugs in the 9.0 rewrite of
    auth-time database access.
    Well, according to Dave's report upthread, it's not only this one relation:

    DG> The recent errors are:
    DG> lock AccessShareLock on object 16403/2615/0 is already held
    DG> which is for pg_namespace in database c23.

    I thought about an error exit from client authentication, and that's a
    somewhat appealing explanation, but I can't quite see why we wouldn't
    clean up there the same as anywhere else. The whole mechanism feels a
    bit rickety to me - we don't actually release locks; we just abort the
    transaction and *assume* that will cause locks to get released. And
    it should. But there's a bit more action-at-a-distance there than I'd
    ideally like.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Tom Lane at Sep 7, 2011 at 10:25 pm

    Robert Haas writes:
    I thought about an error exit from client authentication, and that's a
    somewhat appealing explanation, but I can't quite see why we wouldn't
    clean up there the same as anywhere else. The whole mechanism feels a
    bit rickety to me - we don't actually release locks; we just abort the
    transaction and *assume* that will cause locks to get released.
    Well, transaction abort will call LockReleaseAll, which is carefully
    coded to clean up the proclock lists regardless of what is in the
    locallocks table, so I'm not sure why you find that any more rickety
    than anything else. But maybe it'd be interesting for Dave to stick a
    LockReleaseAll call into ProcKill() and see if that makes things better.
    (Dave: test that before you put it in production, I'm not totally sure
    it's safe.)
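
    A minimal sketch of what that might look like, assuming the ProcKill()
    callback in proc.c; placing the calls ahead of the existing cleanup and
    covering both lock methods are assumptions, and per the caveat above it
    should be regression-tested before going anywhere near production:
    -----------------
    #include "postgres.h"
    #include "storage/lock.h"
    #include "storage/proc.h"

    static void
    ProcKill(int code, Datum arg)
    {
        Assert(MyProc != NULL);

        /*
         * Belt-and-suspenders: release any lock-table entries still recorded
         * for this backend, so nothing can survive into the next backend
         * that inherits this PGPROC.
         */
        LockReleaseAll(DEFAULT_LOCKMETHOD, true);   /* regular locks */
        LockReleaseAll(USER_LOCKMETHOD, true);      /* advisory locks */

        /*
         * ... the existing ProcKill() cleanup (returning the PGPROC to the
         * free list, etc.) continues below ...
         */
    }
    ------------------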

    regards, tom lane
  • Daveg at Sep 7, 2011 at 11:05 pm

    On Wed, Sep 07, 2011 at 06:25:23PM -0400, Tom Lane wrote:
    Robert Haas <robertmhaas@gmail.com> writes:
    I thought about an error exit from client authentication, and that's a
    somewhat appealing explanation, but I can't quite see why we wouldn't
    clean up there the same as anywhere else. The whole mechanism feels a
    bit rickety to me - we don't actually release locks; we just abort the
    transaction and *assume* that will cause locks to get released.
    Well, transaction abort will call LockReleaseAll, which is carefully
    coded to clean up the proclock lists regardless of what is in the
    locallocks table, so I'm not sure why you find that any more rickety
    than anything else. But maybe it'd be interesting for Dave to stick a
    LockReleaseAll call into ProcKill() and see if that makes things better.
    (Dave: test that before you put it in production, I'm not totally sure
    it's safe.)
    Re safety, what is the worst case here?

    Also, this is very intermittent; we have seen it only in recent months
    on both 8.4.7 and 9.0.4 after years of no problems. Lately we see it what
    feels like a few times a month. Possibly some new application behaviour
    is provoking it, but I have no guesses as to what.

    -dg

    --
    David Gould daveg@sonic.net 510 536 1443 510 282 0869
    If simplicity worked, the world would be overrun with insects.
  • Tom Lane at Sep 7, 2011 at 11:16 pm

    daveg writes:
    On Wed, Sep 07, 2011 at 06:25:23PM -0400, Tom Lane wrote:
    ... But maybe it'd be interesting for Dave to stick a
    LockReleaseAll call into ProcKill() and see if that makes things better.
    (Dave: test that before you put it in production, I'm not totally sure
    it's safe.)
    Re safety, what is the worst case here?
    I think a failure would be pretty obvious --- if it gets through
    regression tests it's probably fine.

    regards, tom lane
  • Tom Lane at Sep 7, 2011 at 11:39 pm

    daveg writes:
    Also, this is very intermittent; we have seen it only in recent months
    on both 8.4.7 and 9.0.4 after years of no problems. Lately we see it what
    feels like a few times a month. Possibly some new application behaviour
    is provoking it, but I have no guesses as to what.
    BTW ... what were the last versions you were running on which you had
    *not* seen the problem? (Just wondering about the possibility that we
    back-patched some "fix" that broke things. It would be good to have
    a version range before trawling the commit logs.)

    regards, tom lane
  • Daveg at Sep 8, 2011 at 12:56 am

    On Wed, Sep 07, 2011 at 07:39:15PM -0400, Tom Lane wrote:
    daveg <daveg@sonic.net> writes:
    Also, this is very intermittent; we have seen it only in recent months
    on both 8.4.7 and 9.0.4 after years of no problems. Lately we see it what
    feels like a few times a month. Possibly some new application behaviour
    is provoking it, but I have no guesses as to what.
    BTW ... what were the last versions you were running on which you had
    *not* seen the problem? (Just wondering about the possibility that we
    back-patched some "fix" that broke things. It would be good to have
    a version range before trawling the commit logs.)
    The first version we saw it on was 8.4.7.

    -dg

    --
    David Gould daveg@sonic.net 510 536 1443 510 282 0869
    If simplicity worked, the world would be overrun with insects.
  • Tom Lane at Sep 8, 2011 at 1:02 am

    daveg writes:
    On Wed, Sep 07, 2011 at 07:39:15PM -0400, Tom Lane wrote:
    BTW ... what were the last versions you were running on which you had
    *not* seen the problem? (Just wondering about the possibility that we
    back-patched some "fix" that broke things. It would be good to have
    a version range before trawling the commit logs.)
    The first version we saw it on was 8.4.7.
    Yeah, you said that. I was wondering what you'd last run before 8.4.7.

    regards, tom lane
  • Daveg at Sep 8, 2011 at 6:35 am

    On Wed, Sep 07, 2011 at 09:02:04PM -0400, Tom Lane wrote:
    daveg <daveg@sonic.net> writes:
    On Wed, Sep 07, 2011 at 07:39:15PM -0400, Tom Lane wrote:
    BTW ... what were the last versions you were running on which you had
    *not* seen the problem? (Just wondering about the possibility that we
    back-patched some "fix" that broke things. It would be good to have
    a version range before trawling the commit logs.)
    The first version we saw it on was 8.4.7.
    Yeah, you said that. I was wondering what you'd last run before 8.4.7.
    Sorry, misunderstood. We were previously running 8.4.4, but have been on 8.4.7
    since shortly after it was released. Prior to that we have had all the major
    and most of the minor releases since 7.1.

    -dg

    --
    David Gould daveg@sonic.net 510 536 1443 510 282 0869
    If simplicity worked, the world would be overrun with insects.
  • Tom Lane at Sep 8, 2011 at 3:32 pm

    daveg writes:
    On Wed, Sep 07, 2011 at 09:02:04PM -0400, Tom Lane wrote:
    daveg <daveg@sonic.net> writes:
    The first version we saw it on was 8.4.7.
    Yeah, you said that. I was wondering what you'd last run before 8.4.7.
    Sorry, misunderstood. We were previously running 8.4.4, but have been on 8.4.7
    since shortly after it was released. Prior to that we have had all the major
    and most of the minor releases since 7.1.
    Well, I groveled through the commit logs from 8.4.4 to 8.4.7, and
    I can't find anything that looks like it could possibly be related.
    So at this point I'm inclined to think that the bug is older than
    that, but your usage patterns changed so that you started to tickle it.

    Can you think of any changes in your usage that might date to around
    that time, and would somehow be connected to backend startup/shutdown?

    regards, tom lane
  • Robert Haas at Sep 8, 2011 at 1:22 am

    On Wed, Sep 7, 2011 at 6:25 PM, Tom Lane wrote:
    Robert Haas <robertmhaas@gmail.com> writes:
    I thought about an error exit from client authentication, and that's a
    somewhat appealing explanation, but I can't quite see why we wouldn't
    clean up there the same as anywhere else.  The whole mechanism feels a
    bit rickety to me - we don't actually release locks; we just abort the
    transaction and *assume* that will cause locks to get released.
    Well, transaction abort will call LockReleaseAll, which is carefully
    coded to clean up the proclock lists regardless of what is in the
    locallocks table, so I'm not sure why you find that any more rickety
    than anything else.
    Well, it's very clear that you CAN orphan locks if a backend holding a
    session lock ever does CHECK_FOR_INTERRUPTS() outside of a
    transaction. Try the attached patch.

    rhaas=# vacuum full foo;
    FATAL: terminating connection due to administrator command
    FATAL: terminating connection due to administrator command
    The connection to the server was lost. Attempting reset: Succeeded.
    rhaas=# vacuum full foo;
    ERROR: lock AccessExclusiveLock on object 16384/1431013/0 is already held

    Now, I don't see any evidence of a live bug here (and on further
    thought it can't be Dave's bug because he is orphaning
    AccessShareLocks, not AccessExclusiveLocks), but I find this pretty
    convincing as a demonstration of ricketiness. It is certainly not
    obvious on its face that a misplaced CHECK_FOR_INTERRUPTS() can result
    in a backend exiting without cleaning up its locks, and I'd argue it's
    a bad idea to leave it that way even if there's no user-visible
    problem there today.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Daveg at Sep 7, 2011 at 10:23 pm

    On Wed, Sep 07, 2011 at 04:55:24PM -0400, Tom Lane wrote:
    Robert Haas <robertmhaas@gmail.com> writes:
    Tom's right to be skeptical of my theory, because it would require a
    CHECK_FOR_INTERRUPTS() outside of a transaction block in one of the
    pathways that use session-level locks, and I can't find one.
    More to the point: session-level locks are released on error. The only
    way to get out of a transaction while still holding one is for the
    VACUUM-or-whichever-command code to deliberately commit and exit while
    still holding it. An error exit path would release the lock.
    OTOH, I'm skeptical of the theory that this involves userlocks,
    because this whole thread started because of a complaint about lock
    0/1260/0 already being held. That ain't no userlock.
    Yeah, and for that matter it seems to let VACUUM off the hook too.
    If we assume that the reported object ID is non-corrupt (and if it's
    always the same, that seems like the way to bet) then this is a lock
    on pg_authid.

    Hmmm ... could the pathway involve an error exit from client
    authentication? We're still finding bugs in the 9.0 rewrite of
    auth-time database access.
    It does not seem restricted to pg_authid:

    2011-08-24 18:35:57.445 24987 c23 apps ERROR: lock AccessShareLock on object 16403/2615/0

    And I think I've seen it on other tables too.

    -dg

    --
    David Gould daveg@sonic.net 510 536 1443 510 282 0869
    If simplicity worked, the world would be overrun with insects.
  • Tom Lane at Sep 7, 2011 at 10:35 pm

    daveg writes:
    It does not seem restricted to pg_authid:
    2011-08-24 18:35:57.445 24987 c23 apps ERROR: lock AccessShareLock on object 16403/2615/0
    And I think I've seen it on other tables too.
    Hmm. 2615 = pg_namespace, which most likely is the first catalog
    accessed by just about any SQL command that's going to access tables at
    all, so I suspect that this is mostly just a "the first access failed"
    thing and not something peculiar to pg_namespace. But we still don't
    have a clue why the locks are not getting released by the previous
    owner of the PGPROC slot. Have you trawled your logs to see if there's
    any sign of any distress at all, shortly before the problem starts to
    happen?

    regards, tom lane
  • Daveg at Sep 7, 2011 at 11:01 pm

    On Wed, Sep 07, 2011 at 06:35:08PM -0400, Tom Lane wrote:
    daveg <daveg@sonic.net> writes:
    It does not seem restricted to pg_authid:
    2011-08-24 18:35:57.445 24987 c23 apps ERROR: lock AccessShareLock on object 16403/2615/0
    And I think I've seen it on other tables too.
    Hmm. 2615 = pg_namespace, which most likely is the first catalog
    accessed by just about any SQL command that's going to access tables at
    all, so I suspect that this is mostly just a "the first access failed"
    thing and not something peculiar to pg_namespace. But we still don't
    have a clue why the locks are not getting released by the previous
    owner of the PGPROC slot. Have you trawled your logs to see if there's
    any sign of any distress at all, shortly before the problem starts to
    happen?
    Will do, but it's a pretty big haystack. Sure wish I knew what the needle
    looked like. ;-)

    -dg

    --
    David Gould daveg@sonic.net 510 536 1443 510 282 0869
    If simplicity worked, the world would be overrun with insects.
