Hello,

While doing performance tests on Windows Server 2003 we observed the following
two problems.

Environment: J2EE application running in JBoss application server, against
pgsql 8.1 database. Load is caused by a smallish number of (very) complex
transactions, typically about 5-10 concurrently.

The first one, which bothers me the most, is that after about 6-8 hours the
application stops processing. No errors are reported, neither by the JDBC
driver nor by the server, but when I kill the application server, I see that
all my connections hang in a SQL statement (which never seems to return):

2006-03-03 08:17:12 4504 6632560 LOG: duration: 45087000.000 ms statement:
EXECUTE <unnamed> [PREPARE: SELECT objID FROM objects WHERE objID = $1 FOR
UPDATE]

I think I can reliably reproduce this by loading the app, and waiting a couple
of hours.



The second problem is less predictable:

JDBC exception:

An I/O error occured while sending to the backend.
org.postgresql.util.PSQLException: An I/O error occured while sending to the
backend.
at
org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:214)
at
org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:430)
at
org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:346)
at
org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:250)


In my server log, I have:

2006-03-02 12:31:02 5692 6436342 LOG: could not receive data from client: A
non-blocking socket operation could not be completed immediately.

At the time my box is fairly heavily loaded, but still responsive. Server and
JBoss appserver live on the same dual 2 GHz Opteron.

A quick Google told me that:

1. More people have seen this.
2. No solutions.
3. The server message appears to indicate an unhandled WSAEWOULDBLOCK winsock
error on recv(), which MSDN said is to be expected and should be retried.
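
To illustrate the "expected and should be retried" behaviour described in
point 3, here is a minimal Python sketch. It is not the PostgreSQL server code;
the function name recv_with_retry and the timeout parameter are made up for
this example. Python reports a would-block condition on a non-blocking socket
(EWOULDBLOCK, or WSAEWOULDBLOCK on Windows) as BlockingIOError, and the sketch
waits for readability and retries instead of treating it as a failure.

```python
import select
import socket


def recv_with_retry(sock, nbytes, timeout=5.0):
    """Read up to nbytes from a non-blocking socket, retrying on would-block.

    Hypothetical helper for illustration only.
    """
    while True:
        try:
            return sock.recv(nbytes)
        except BlockingIOError:
            # No data available yet: this is the would-block case, not a
            # real I/O error. Wait until the socket is readable, then retry.
            readable, _, _ = select.select([sock], [], [], timeout)
            if not readable:
                raise TimeoutError("no data within %.1f s" % timeout)
```

A caller would put the socket into non-blocking mode with
sock.setblocking(False) and call recv_with_retry(sock, 4096) wherever it
previously called sock.recv(4096) directly.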

Is this a known bug?

jan


--
--------------------------------------------------------------
Jan de Visser                     jdevisser@digitalfairway.com

Baruk Khazad! Khazad ai-menu!
--------------------------------------------------------------

  • Jan de Visser at Mar 9, 2006 at 8:07 pm
    I have more information on this issue.

    First off, the problem now happens after about 1-2 hours, as opposed to the 6-8
    I mentioned earlier. Yay for shorter test cycles.

    Furthermore, it does not happen on Linux machines, both single CPU and dual
    CPU, nor on single CPU Windows machines. We can only reproduce on a dual CPU
    Windows machine, and if we take one CPU out, it does not happen.

    I executed the following after it hung:

    db=# select l.pid, c.relname, l.mode, l.granted, l.page, l.tuple
    from pg_locks l, pg_class c where c.oid = l.relation order by l.pid;

    Which showed me that several transactions were waiting for a particular row
    which was locked by another transaction. This transaction had no pending
    locks (so no deadlock), but just does not complete and hence never
    relinquishes the lock.
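
The reasoning above (waiters have an ungranted lock, while the holder has
nothing pending and so cannot be part of a deadlock) can be sketched in a few
lines of Python. The rows below are invented sample data in the shape
(pid, relname, mode, granted); they are not the actual output from this
machine, and split_waiters is a hypothetical helper name.

```python
from collections import defaultdict

# Made-up sample rows: (pid, relname, mode, granted).
SAMPLE_ROWS = [
    (4504, "objects", "RowShareLock", True),
    (5120, "objects", "RowShareLock", True),
    (5120, "objects", "ExclusiveLock", False),
    (5692, "objects", "RowShareLock", True),
    (5692, "objects", "ExclusiveLock", False),
]


def split_waiters(rows):
    """Return (waiting_pids, non_waiting_pids) for pg_locks-style rows."""
    pending = defaultdict(int)
    for pid, _relname, _mode, granted in rows:
        # Count ungranted lock requests per backend.
        pending[pid] += 0 if granted else 1
    waiting = sorted(pid for pid, n in pending.items() if n > 0)
    not_waiting = sorted(pid for pid, n in pending.items() if n == 0)
    return waiting, not_waiting


waiting, not_waiting = split_waiters(SAMPLE_ROWS)
print("waiting:", waiting)
print("holding locks but not waiting:", not_waiting)
```

A backend that shows up only in the second list is not blocked on any lock, so
if it never finishes, the cause has to be something other than lock contention.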

    What gives? Has anybody ever heard of problems like this on dual CPU Windows
    machines?

    jan


  • Tom Lane at Mar 9, 2006 at 8:10 pm

    Jan de Visser writes:
    Furthermore, it does not happen on Linux machines, both single CPU and dual
    CPU, nor on single CPU Windows machines. We can only reproduce on a dual CPU
    Windows machine, and if we take one CPU out, it does not happen.
    ...
    Which showed me that several transactions were waiting for a particular row
    which was locked by another transaction. This transaction had no pending
    locks (so no deadlock), but just does not complete and hence never
    relinquishes the lock.
    Is the stuck transaction still consuming CPU time, or just stopped?

    Is it possible to get a stack trace from the stuck process? I dunno
    if you've got anything gdb-equivalent under Windows, but that's the
    first thing I'd be interested in ...

    regards, tom lane
  • Jan de Visser at Mar 9, 2006 at 9:15 pm

    On Thursday 09 March 2006 15:10, Tom Lane wrote:
    Jan de Visser <jdevisser@digitalfairway.com> writes:
    Furthermore, it does not happen on Linux machines, both single CPU and
    dual CPU, nor on single CPU Windows machines. We can only reproduce on a
    dual CPU Windows machine, and if we take one CPU out, it does not happen.
    ...
    Which showed me that several transactions were waiting for a particular
    row which was locked by another transaction. This transaction had no
    pending locks (so no deadlock), but just does not complete and hence
    never relinquishes the lock.
    Is the stuck transaction still consuming CPU time, or just stopped?
    CPU drops off. In fact, that's my main clue something's wrong ;-)
    Is it possible to get a stack trace from the stuck process? I dunno
    if you've got anything gdb-equivalent under Windows, but that's the
    first thing I'd be interested in ...
    I wouldn't know. I'm hardly a windows expert. Prefer not to touch the stuff,
    myself. Can do some research though...
    regards, tom lane
    jan

  • Jan de Visser at Mar 10, 2006 at 2:00 am

    On Thursday 09 March 2006 15:10, Tom Lane wrote:
    Is it possible to get a stack trace from the stuck process?  I dunno
    if you've got anything gdb-equivalent under Windows, but that's the
    first thing I'd be interested in ...
    Here ya go:

    http://www.devisser-siderius.com/stack1.jpg
    http://www.devisser-siderius.com/stack2.jpg
    http://www.devisser-siderius.com/stack3.jpg

    There are three threads in the process. I guess thread 1 (stack1.jpg) is the
    most interesting.

    I also noted that cranking up concurrency in my app reproduces the problem in
    about 4 minutes ;-)

    With thanks to Magnus Hagander for the Process Explorer hint.

    jan

  • Magnus Hagander at Mar 9, 2006 at 9:22 pm

    Is it possible to get a stack trace from the stuck process?
    I dunno if you've got anything gdb-equivalent under Windows,
    but that's the first thing I'd be interested in ...
    Try Process Explorer from www.sysinternals.com.

    //Magnus
  • Hakan Kocaman at Mar 10, 2006 at 2:32 pm
    Hi,
    -----Original Message-----
    From: pgsql-performance-owner@postgresql.org
    On Behalf Of Tom Lane
    Sent: Thursday, March 09, 2006 9:11 PM
    To: Jan de Visser
    Cc: pgsql-performance@postgresql.org
    Subject: Re: [PERFORM] Hanging queries on dual CPU windows


    Jan de Visser <jdevisser@digitalfairway.com> writes:
    Furthermore, it does not happen on Linux machines, both single CPU and dual
    CPU, nor on single CPU Windows machines. We can only reproduce on a dual CPU
    Windows machine, and if we take one CPU out, it does not happen.
    ...
    Which showed me that several transactions were waiting for a particular row
    which was locked by another transaction. This transaction had no pending
    locks (so no deadlock), but just does not complete and hence never
    relinquishes the lock.
    Is the stuck transaction still consuming CPU time, or just stopped?

    Is it possible to get a stack trace from the stuck process? I dunno
    if you've got anything gdb-equivalent under Windows, but that's the
    first thing I'd be interested in ...
    Debugging Tools for Windows from Microsoft
    http://www.microsoft.com/whdc/devtools/debugging/installx86.mspx

    Additionally, you need a symbol file, or you can use
    "SRV*c:\debug\symbols*http://msdl.microsoft.com/download/symbols"
    to load the symbol files dynamically from the net.

    Best regards
    regards, tom lane


    Hakan Kocaman
    Software-Development

    digame.de GmbH
    Richard-Byrd-Str. 4-8
    50829 Köln

    Tel.: +49 (0) 221 59 68 88 31
    Fax: +49 (0) 221 59 68 88 98
    Email: hakan.kocaman@digame.de
  • Magnus Hagander at Mar 10, 2006 at 6:31 pm

    Could it be they broke it when they did that????
    In theory, yes, but it still seems a bit far fetched :-(
    Well, I rolled back SP1 and am running my test again. Looking
    much better, hasn't locked up in 45mins now, whereas before
    it would lock up within 5mins.

    So I think they broke something.
    Wow. I guess I was lucky that I didn't say it was impossible :-)


    But what is really happening? What other thread is actually holding the
    critical section at this point, causing us to block? The only place it
    gets held is while looping over the signal queue, but it is released while
    calling the signal function itself...

    But they obviously *have* been messing with critical sections, so maybe
    they accidentally changed something else as well...

    What bothers me is that nobody else has reported this. It could be that
    this was exposed by the changes to the signal handling done for 8.1, and
    the ppl with this level of concurrency are either still on 8.0 or just
    not on SP1 for their windows boxes yet... Do you have any other software
    installed on the machine? That might possibly interfere in some way?

    But let's have it run for a bit longer to confirm this does help. If so,
    we could perhaps recode that part using a Mutex instead of a critical
    section - since it's not a performance critical path, the difference
    shouldn't be large. If I code up a patch for that, can you re-apply SP1
    and test it? Or is this a production system you can't really touch?
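
The locking pattern described above (hold the lock only while touching the
signal queue, release it while the signal function runs) can be sketched in
Python as follows. This is an analogy, not the PostgreSQL Windows signal code:
threading.Lock stands in for the critical section or mutex, and the names
queue_lock, pending_signals, queue_signal and dispatch_signals are invented
for the example.

```python
import threading
from collections import deque

queue_lock = threading.Lock()  # stand-in for the critical section / mutex
pending_signals = deque()      # hypothetical signal queue


def queue_signal(signum):
    """Add a signal to the queue under the lock."""
    with queue_lock:
        pending_signals.append(signum)


def dispatch_signals(handler):
    """Drain the queue, holding the lock only while touching the queue."""
    handled = 0
    while True:
        with queue_lock:
            if not pending_signals:
                return handled
            signum = pending_signals.popleft()
        # The lock is released at this point, so the handler can queue
        # further signals without blocking on queue_lock.
        handler(signum)
        handled += 1
```

Because threading.Lock is not reentrant, calling the handler while still
holding queue_lock would block forever as soon as the handler queued another
signal. Releasing the lock first is what avoids that in this sketch.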

    //Magnus
  • Jan de Visser at Mar 10, 2006 at 7:27 pm

    On Friday 10 March 2006 13:25, Magnus Hagander wrote:
    Could it be they broke it when they did that????
    In theory, yes, but it still seems a bit far fetched :-(
    Well, I rolled back SP1 and am running my test again. Looking
    much better, hasn't locked up in 45mins now, whereas before
    it would lock up within 5mins.

    So I think they broke something.
    Wow. I guess I was lucky that I didn't say it was impossible :-)


    But what is really happening? What other thread is actually holding the
    critical section at this point, causing us to block? The only place it
    gets held is while looping over the signal queue, but it is released while
    calling the signal function itself...

    But they obviously *have* been messing with critical sections, so maybe
    they accidentally changed something else as well...

    What bothers me is that nobody else has reported this. It could be that
    this was exposed by the changes to the signal handling done for 8.1, and
    the ppl with this level of concurrency are either still on 8.0 or just
    not on SP1 for their windows boxes yet... Do you have any other software
    installed on the machine? That might possibly interfere in some way?
    Just a JDK, JBoss, cygwin (running sshd), and a VNC Server. I don't think that
    interferes.
    But let's have it run for a bit longer to confirm this does help.
    I turned it off after 2.5hr. The longest I had to wait before, with less load,
    was 1.45hr.
    If so,
    we could perhaps recode that part using a Mutex instead of a critical
    section - since it's not a performance critical path, the difference
    shouldn't be large. If I code up a patch for that, can you re-apply SP1
    and test it? Or is this a production system you can't really touch?
    I can do whatever the hell I want with it, so if you could cook up a patch
    that would be great.

    As a BTW: I reinstalled SP1 and turned stats collection off. That also seems
    to work, but is not really a solution since we want to use autovacuuming.
    //Magnus
    jan

  • Jan de Visser at Mar 10, 2006 at 7:37 pm

    On Friday 10 March 2006 14:27, Jan de Visser wrote:
    As a BTW: I reinstalled SP1 and turned stats collection off. That also
    seems to work, but is not really a solution since we want to use
    autovacuuming.
    I lied. It hangs now. Just takes a lot longer...

    jan


Discussion Overview
group: pgsql-performance
categories: postgresql
posted: Mar 6, '06 at 2:38p
active: Mar 10, '06 at 7:37p
posts: 10
users: 4
website: postgresql.org
irc: #postgresql