FAQ
Hi all,

I have a go program which hangs in a strange way sometimes and I don't know
how to build a repeatable test case.
Any help on building a repeatable test case or even identify the problem is
very welcome.

The program is called pn, for "periodic noise", and it's purpose is
described here: https://github.com/Jimdo/periodicnoise
System is Debian 6.0.4 / Linux 2.6.32-5-amd64 #1 SMP Mon Jan 16 16:22:28
UTC 2012 x86_64
Go version is 1.1.2

I call it like this from a crontab entry:
pn --monitor-event=blobdeploy_releasecleanup --timeout=45m --use-syslog --
/root/server_scripts/updateserver/deployment_release_cleanup.sh

Here is a gdb debug session listing all goroutines and
gdb -p 31878 /usr/bin/pn
GNU gdb (GDB) 7.3-debian
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
<http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/bin/pn...done.
Attaching to program: /usr/bin/pn, process 31878
Reading symbols from /lib/libpthread.so.0...(no debugging symbols
found)...done.
[Thread debugging using libthread_db enabled]
[New Thread 0x7f7897280700 (LWP 31898)]
[New Thread 0x7f7897a81700 (LWP 31897)]
[New Thread 0x7f7898282700 (LWP 31894)]
[New Thread 0x7f7898a83700 (LWP 31890)]
[New Thread 0x7f7899284700 (LWP 31887)]
[New Thread 0x7f7899a85700 (LWP 31885)]
[New Thread 0x7f789a388700 (LWP 31881)]
Loaded symbols for /lib/libpthread.so.0
Reading symbols from /lib/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols
found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
runtime.futex () at /usr/lib/go/src/pkg/runtime/sys_linux_amd64.s:267
267 RET
Loading Go Runtime support.
(gdb) info goroutines
   1 waiting runtime.park
   2 syscall runtime.goexit
* 3 syscall runtime.entersyscallblock
   4 waiting runtime.park
(gdb) goroutine 1 bt full
#0 0x000000000041adf4 in runtime.park (unlockf=void, lock=void,
reason=void) at /usr/lib/go/src/pkg/runtime/proc.c:1175
No locals.
#1 0x000000000042448f in netpollblock (pd=void, mode=void) at
/usr/lib/go/src/pkg/runtime/netpoll.goc:247
         gpp = -1717014688
#2 0x0000000000423f52 in net.runtime_pollWait (pd=void, mode=void,
err=void) at /usr/lib/go/src/pkg/runtime/netpoll.goc:117
No locals.
#3 0x00000000004bdf81 in net.(*pollDesc).WaitWrite (pd=0xc200060110,
~anon0=...) at /usr/lib/go/src/pkg/net/fd_poll_runtime.go:80
No locals.
#4 0x00000000004c01c6 in net.(*netFD).Write (fd=0xc200060090, p= []uint8 =
{...}, nn=0, err=...) at /usr/lib/go/src/pkg/net/fd_unix.go:294
         n = -1
         err = {tab = 0x0, data = 0x0}
#5 0x00000000004cc913 in net.(*conn).Write (c=0xc200000168, b= []uint8 =
{...}, ~anon1=6, ~anon2=...) at /usr/lib/go/src/pkg/net/net.go:131
No locals.
#6 0x000000000043af0b in fmt.Fprintf (w=..., format="<%d>%s %s[%d]: %s%s",
a= []interface {} = {...}, n=4254901, err=...) at
/usr/lib/go/src/pkg/fmt/print.go:215
         p = 0xc200082820
         n64 = 31878
#7 0x000000000044b247 in
github.com/Jimdo/periodicnoise/syslog.(*netConn).writeString
(n=0xc200091200, p=29, hostname="web8", tag="blobdeploy_releasecleanup",
msg="OK\n", nl="", ~anon5=...)
     at
/home/buildmaster/builds/go/periodicnoise/debian/go_build/src/github.com/Jimdo/periodicnoise/syslog/syslog.go:294
         timestamp = "Oct 31 01:00:05"
#8 0x000000000044af3a in
github.com/Jimdo/periodicnoise/syslog.(*Writer).write (w=0xc20005f120,
p=29, msg="OK\n", ~anon2=3, ~anon3=...) at
/home/buildmaster/builds/go/periodicnoise/debian/go_build/src/github.com/Jimdo/periodicnoise/syslog/syslog.go:276
         nl = ""
         s·2 = Traceback (most recent call last):
   File "/usr/lib/go/src/pkg/runtime/runtime-gdb.py", line 43, in to_string
     return self.val['str'].string("utf-8", "ignore", l)
OverflowError: signed integer is greater than maximum


#9 0x000000000044ad19 in
github.com/Jimdo/periodicnoise/syslog.(*Writer).writeAndRetry
(w=0xc20005f120, p=29, s="OK\n", ~anon2=0, ~anon3=...) at
/home/buildmaster/builds/go/periodicnoise/debian/go_build/src/github.com/Jimdo/periodicnoise/syslog/syslog.go:257
         pr = 29
#10 0x000000000044a6f8 in
github.com/Jimdo/periodicnoise/syslog.(*Writer).Write (w=0xc20005f120, b=
[]uint8 = {...}, ~anon1=833223656192, ~anon2=...) at
/home/buildmaster/builds/go/periodicnoise/debian/go_build/src/github.com/Jimdo/periodicnoise/syslog/syslog.go:178
No locals.
#11 0x000000000042bdf0 in log.(*Logger).Output (l=0xc20007f140,
calldepth=2, s="OK\n", ~anon2=...) at /usr/lib/go/src/pkg/log/log.go:153
         file = ""
         now = {sec = 63518774405, nsec = 178383383, loc = 0x671a20}
         line = 0
#12 0x000000000042c657 in log.Println (v= []interface {} = {...}) at
/usr/lib/go/src/pkg/log/log.go:282
No locals.
#13 0x0000000000401f4d in main.Ok () at
/home/buildmaster/builds/go/periodicnoise/debian/go_build/src/github.com/Jimdo/periodicnoise/cmd/pn/main.go:31
No locals.
#14 0x000000000040322e in main.main () at
/home/buildmaster/builds/go/periodicnoise/debian/go_build/src/github.com/Jimdo/periodicnoise/cmd/pn/main.go:221
         r·2 = {tab = 0xc20008f360, data = 0x30}
         logger = {tab = 0xc20008f4e0, data = 0xc20008e790}
         err = {tab = 0x0, data = 0x0}
         args = []string =
{"/root/server_scripts/updateserver/deployment_release_cleanup.sh"}
         &cmd = 0xc200000088
         stdout = {tab = 0xc20008f7e0, data = 0xc200000268}
         w·3 = {tab = 0x4006710e8, data = 0x40ecb5}
         &wg = 0xc2000910e0
         stderr = {tab = 0xc20008f7e0, data = 0xc200000288}
         limit = 4723630
         command =
"/root/server_scripts/updateserver/deployment_release_cleanup.sh"
         &cw = 0x2aaae4000
         timer = 0xc20007bb80
#15 0x0000000000418fe2 in runtime.main () at
/usr/lib/go/src/pkg/runtime/proc.c:182
No locals.
#16 0x000000000041afb0 in ?? () at /usr/lib/go/src/pkg/runtime/proc.c:1218
No locals.
#17 0x0000000000000000 in ?? ()
No symbol table info available.
(gdb) goroutine 2 bt full
#0 runtime.goexit () at /usr/lib/go/src/pkg/runtime/proc.c:1223
No locals.
#1 0x0000000000000000 in ?? ()
No symbol table info available.
(gdb) goroutine 3 bt full
#0 0x000000000041b52e in runtime.entersyscallblock (dummy=void) at
/usr/lib/go/src/pkg/runtime/proc.c:1334
No locals.
#1 0x0000000000415bbe in runtime.MHeap_Scavenger () at
/usr/lib/go/src/pkg/runtime/mheap.c:454
         notep = -1431621792
         note = 0
         trace = 0
         sumreleased = 0
         k = 584
         limit = -647710720
         forcegc = -259084288
         tick = -129542144
         h = -1707569152
#2 0x000000000041afb0 in ?? () at /usr/lib/go/src/pkg/runtime/proc.c:1218
No locals.
#3 0x0000000000000000 in ?? ()
No symbol table info available.
(gdb) goroutine 4 bt full
#0 0x000000000041adf4 in runtime.park (unlockf=void, lock=void,
reason=void) at /usr/lib/go/src/pkg/runtime/proc.c:1175
No locals.
#1 0x0000000000425ea9 in timerproc () at
/usr/lib/go/src/pkg/runtime/time.goc:210
         safe = 6723302
         arg = 1
         f = 3
         t = 0
         now = 1681968385
         delta = -1
#2 0x000000000041afb0 in ?? () at /usr/lib/go/src/pkg/runtime/proc.c:1218
No locals.
#3 0x0000000000000000 in ?? ()
No symbol table info available.
(gdb) thread apply all bt full

Thread 8 (Thread 0x7f789a388700 (LWP 31881)):
#0 runtime.futex () at /usr/lib/go/src/pkg/runtime/sys_linux_amd64.s:267
No locals.
#1 0x0000000000416539 in runtime.futexsleep (addr=void, val=void, ns=void)
at /usr/lib/go/src/pkg/runtime/os_linux.c:58
         safe = 0
         ts = 4254502
#2 0x000000000040edd9 in runtime.notesleep (n=void) at
/usr/lib/go/src/pkg/runtime/lock_futex.c:125
No locals.
#3 0x000000000041c9e6 in sysmon () at
/usr/lib/go/src/pkg/runtime/proc.c:2007
         ticks = 1983
         lastpoll = 0
         delay = 20
         idle = 1
#4 0x0000000000419972 in runtime.mstart () at
/usr/lib/go/src/pkg/runtime/proc.c:485
         seh = 4298912
#5 0x00000000004f03a4 in crosscall_amd64 ()
No symbol table info available.
#6 0x0000000000000003 in ?? ()
No symbol table info available.
#7 0x00007f78aab30040 in ?? () from /lib64/ld-linux-x86-64.so.2
No symbol table info available.
#8 0x0000000000000000 in ?? ()
No symbol table info available.

Thread 7 (Thread 0x7f7899a85700 (LWP 31885)):
#0 runtime.futex () at /usr/lib/go/src/pkg/runtime/sys_linux_amd64.s:267
No locals.
#1 0x0000000000416539 in runtime.futexsleep (addr=void, val=void, ns=void)
at /usr/lib/go/src/pkg/runtime/os_linux.c:58
         safe = 0
         ts = 4254502
#2 0x000000000040edd9 in runtime.notesleep (n=void) at
/usr/lib/go/src/pkg/runtime/lock_futex.c:125
No locals.
#3 0x0000000000419f8f in stopm () at /usr/lib/go/src/pkg/runtime/proc.c:795
No locals.
#4 0x000000000041ab1d in findrunnable () at
/usr/lib/go/src/pkg/runtime/proc.c:1100
         safe = 512952
         i = 2
         p = 856576
         gp = 856576
#5 0x000000000041acbd in schedule () at
/usr/lib/go/src/pkg/runtime/proc.c:1145
         gp = 0
#6 0x000000000041b0df in goexit0 (gp=void) at
/usr/lib/go/src/pkg/runtime/proc.c:1247
No locals.
#7 0x00000000004263a9 in runtime.mcall (fn=void) at
/usr/lib/go/src/pkg/runtime/asm_amd64.s:195
No locals.
#8 0x000000c200064000 in ?? ()
No symbol table info available.
#9 0x0000000000000000 in ?? ()
No symbol table info available.

Thread 6 (Thread 0x7f7899284700 (LWP 31887)):
#0 runtime.futex () at /usr/lib/go/src/pkg/runtime/sys_linux_amd64.s:267
No locals.
#1 0x0000000000416539 in runtime.futexsleep (addr=void, val=void, ns=void)
at /usr/lib/go/src/pkg/runtime/os_linux.c:58
         safe = 0
         ts = 60
#2 0x000000000040ee9f in runtime.notetsleep (n=void, ns=void) at
/usr/lib/go/src/pkg/runtime/lock_futex.c:147
         deadline = -1467643998
#3 0x0000000000415bbe in runtime.MHeap_Scavenger () at
/usr/lib/go/src/pkg/runtime/mheap.c:454
         notep = -1431621792
         note = 0
         trace = 0
         sumreleased = 0
         k = 941
         limit = -647710720
         forcegc = -259084288
         tick = -129542144
         h = -1707569152
#4 0x000000000041afb0 in ?? () at /usr/lib/go/src/pkg/runtime/proc.c:1218
No locals.
#5 0x0000000000000000 in ?? ()
No symbol table info available.

         seh = 0
#5 0x00000000004f03a4 in crosscall_amd64 ()
No symbol table info available.
#6 0x0000000000000003 in ?? ()
No symbol table info available.
#7 0x00007f78aab30040 in ?? () from /lib64/ld-linux-x86-64.so.2
No symbol table info available.
#8 0x0000000000000000 in ?? ()
No symbol table info available.

Thread 4 (Thread 0x7f7898282700 (LWP 31894)):
#0 runtime.futex () at /usr/lib/go/src/pkg/runtime/sys_linux_amd64.s:267
No locals.
#1 0x0000000000416539 in runtime.futexsleep (addr=void, val=void, ns=void)
at /usr/lib/go/src/pkg/runtime/os_linux.c:58
         safe = 0
         ts = 4254502
#2 0x000000000040edd9 in runtime.notesleep (n=void) at
/usr/lib/go/src/pkg/runtime/lock_futex.c:125
No locals.
#3 0x0000000000419f8f in stopm () at /usr/lib/go/src/pkg/runtime/proc.c:795
No locals.
#4 0x000000000041ab1d in findrunnable () at
/usr/lib/go/src/pkg/runtime/proc.c:1100
         safe = 856320
         i = 2
         p = 409600
         gp = 0
#5 0x000000000041acbd in schedule () at
/usr/lib/go/src/pkg/runtime/proc.c:1145
         gp = 0
#6 0x000000000041b0df in goexit0 (gp=void) at
/usr/lib/go/src/pkg/runtime/proc.c:1247
No locals.
#7 0x00000000004263a9 in runtime.mcall (fn=void) at
/usr/lib/go/src/pkg/runtime/asm_amd64.s:195
No locals.
#8 0x000000c200064000 in ?? ()
No symbol table info available.
#9 0x0000000000000000 in ?? ()
No symbol table info available.

Thread 3 (Thread 0x7f7897a81700 (LWP 31897)):
#0 runtime.futex () at /usr/lib/go/src/pkg/runtime/sys_linux_amd64.s:267
No locals.
#1 0x0000000000416539 in runtime.futexsleep (addr=void, val=void, ns=void)
at /usr/lib/go/src/pkg/runtime/os_linux.c:58
         safe = 0
         ts = 4254502
#2 0x000000000040edd9 in runtime.notesleep (n=void) at
/usr/lib/go/src/pkg/runtime/lock_futex.c:125
No locals.
#3 0x0000000000419f8f in stopm () at /usr/lib/go/src/pkg/runtime/proc.c:795
No locals.
#4 0x000000000041ab1d in findrunnable () at
/usr/lib/go/src/pkg/runtime/proc.c:1100
         safe = 512952
         i = 2
         p = 856576
         gp = 856576
#5 0x000000000041acbd in schedule () at
/usr/lib/go/src/pkg/runtime/proc.c:1145
         gp = 0
#6 0x000000000041b0df in goexit0 (gp=void) at
/usr/lib/go/src/pkg/runtime/proc.c:1247
No locals.
#7 0x00000000004263a9 in runtime.mcall (fn=void) at
/usr/lib/go/src/pkg/runtime/asm_amd64.s:195
No locals.
#8 0x000000c200064000 in ?? ()
No symbol table info available.
#9 0x0000000000000000 in ?? ()
No symbol table info available.

Thread 2 (Thread 0x7f7897280700 (LWP 31898)):
#0 runtime.futex () at /usr/lib/go/src/pkg/runtime/sys_linux_amd64.s:267
No locals.
#1 0x0000000000416539 in runtime.futexsleep (addr=void, val=void, ns=void)
at /usr/lib/go/src/pkg/runtime/os_linux.c:58
         safe = 0
         ts = 4254502
#2 0x000000000040edd9 in runtime.notesleep (n=void) at
/usr/lib/go/src/pkg/runtime/lock_futex.c:125
No locals.
#3 0x0000000000419f8f in stopm () at /usr/lib/go/src/pkg/runtime/proc.c:795
No locals.
#4 0x000000000041ab1d in findrunnable () at
/usr/lib/go/src/pkg/runtime/proc.c:1100
         safe = 983040
         i = 2
         p = 32768
         gp = 4286936
#5 0x000000000041acbd in schedule () at
/usr/lib/go/src/pkg/runtime/proc.c:1145
         gp = 0
#6 0x00000000004199a3 in runtime.mstart () at
/usr/lib/go/src/pkg/runtime/proc.c:494
         seh = 0
#7 0x00000000004f03a4 in crosscall_amd64 ()
No symbol table info available.
#8 0x0000000000000003 in ?? ()
No symbol table info available.
#9 0x00007f78aab30040 in ?? () from /lib64/ld-linux-x86-64.so.2
No symbol table info available.
#10 0x0000000000000000 in ?? ()
No symbol table info available.

Thread 1 (Thread 0x7f78aab25700 (LWP 31878)):
#0 runtime.futex () at /usr/lib/go/src/pkg/runtime/sys_linux_amd64.s:267
No locals.
#1 0x0000000000416539 in runtime.futexsleep (addr=void, val=void, ns=void)
at /usr/lib/go/src/pkg/runtime/os_linux.c:58
         safe = 0
         ts = 4254502
#2 0x000000000040edd9 in runtime.notesleep (n=void) at
/usr/lib/go/src/pkg/runtime/lock_futex.c:125
No locals.
#3 0x0000000000419f8f in stopm () at /usr/lib/go/src/pkg/runtime/proc.c:795
No locals.
#4 0x000000000041ab1d in findrunnable () at
/usr/lib/go/src/pkg/runtime/proc.c:1100
         safe = 4351120
         i = 2
         p = 352512
         gp = 4351839
#5 0x000000000041acbd in schedule () at
/usr/lib/go/src/pkg/runtime/proc.c:1145
         gp = 0
#6 0x000000000041aed0 in park0 (gp=void) at
/usr/lib/go/src/pkg/runtime/proc.c:1194
No locals.
#7 0x00000000004263a9 in runtime.mcall (fn=void) at
/usr/lib/go/src/pkg/runtime/asm_amd64.s:195
No locals.
#8 0x0000000000671c60 in runtime.g0 ()
No symbol table info available.
#9 0x0000000000426276 in _rt0_amd64 () at
/usr/lib/go/src/pkg/runtime/asm_amd64.s:86
No locals.
#10 0x0000000000000006 in ?? ()
No symbol table info available.
#11 0x00007fff75f0a7d8 in ?? ()
No symbol table info available.
#12 0x0000000000000006 in ?? ()
No symbol table info available.
#13 0x00007fff75f0a7d8 in ?? ()
No symbol table info available.
#14 0x0000000000000000 in ?? ()
No symbol table info available.
(gdb) quit
A debugging session is active.

Inferior 1 [process 31878] will be detached.

Quit anyway? (y or n) y
Detaching from program: /usr/bin/pn, process 31878

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Search Discussions

  • Dave Cheney at Nov 1, 2013 at 11:28 pm
    When your program hangs, can you hit it with a SIGQUIT and post the
    complete stacktrace message.

    Cheers

    Dave
    On Sat, Nov 2, 2013 at 9:56 AM, Ingo Oeser wrote:
    Hi all,

    I have a go program which hangs in a strange way sometimes and I don't know
    how to build a repeatable test case.
    Any help on building a repeatable test case or even identify the problem is
    very welcome.

    The program is called pn, for "periodic noise", and it's purpose is
    described here: https://github.com/Jimdo/periodicnoise
    System is Debian 6.0.4 / Linux 2.6.32-5-amd64 #1 SMP Mon Jan 16 16:22:28 UTC
    2012 x86_64
    Go version is 1.1.2

    I call it like this from a crontab entry:
    pn --monitor-event=blobdeploy_releasecleanup --timeout=45m --use-syslog --
    /root/server_scripts/updateserver/deployment_release_cleanup.sh

    Here is a gdb debug session listing all goroutines and
    gdb -p 31878 /usr/bin/pn
    GNU gdb (GDB) 7.3-debian
    Copyright (C) 2011 Free Software Foundation, Inc.
    License GPLv3+: GNU GPL version 3 or later
    <http://gnu.org/licenses/gpl.html>
    This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law. Type "show copying"
    and "show warranty" for details.
    This GDB was configured as "x86_64-linux-gnu".
    For bug reporting instructions, please see:
    <http://www.gnu.org/software/gdb/bugs/>...
    Reading symbols from /usr/bin/pn...done.
    Attaching to program: /usr/bin/pn, process 31878
    Reading symbols from /lib/libpthread.so.0...(no debugging symbols
    found)...done.
    [Thread debugging using libthread_db enabled]
    [New Thread 0x7f7897280700 (LWP 31898)]
    [New Thread 0x7f7897a81700 (LWP 31897)]
    [New Thread 0x7f7898282700 (LWP 31894)]
    [New Thread 0x7f7898a83700 (LWP 31890)]
    [New Thread 0x7f7899284700 (LWP 31887)]
    [New Thread 0x7f7899a85700 (LWP 31885)]
    [New Thread 0x7f789a388700 (LWP 31881)]
    Loaded symbols for /lib/libpthread.so.0
    Reading symbols from /lib/libc.so.6...(no debugging symbols found)...done.
    Loaded symbols for /lib/libc.so.6
    Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols
    found)...done.
    Loaded symbols for /lib64/ld-linux-x86-64.so.2
    runtime.futex () at /usr/lib/go/src/pkg/runtime/sys_linux_amd64.s:267
    267 RET
    Loading Go Runtime support.
    (gdb) info goroutines
    1 waiting runtime.park
    2 syscall runtime.goexit
    * 3 syscall runtime.entersyscallblock
    4 waiting runtime.park
    (gdb) goroutine 1 bt full
    #0 0x000000000041adf4 in runtime.park (unlockf=void, lock=void,
    reason=void) at /usr/lib/go/src/pkg/runtime/proc.c:1175
    No locals.
    #1 0x000000000042448f in netpollblock (pd=void, mode=void) at
    /usr/lib/go/src/pkg/runtime/netpoll.goc:247
    gpp = -1717014688
    #2 0x0000000000423f52 in net.runtime_pollWait (pd=void, mode=void,
    err=void) at /usr/lib/go/src/pkg/runtime/netpoll.goc:117
    No locals.
    #3 0x00000000004bdf81 in net.(*pollDesc).WaitWrite (pd=0xc200060110,
    ~anon0=...) at /usr/lib/go/src/pkg/net/fd_poll_runtime.go:80
    No locals.
    #4 0x00000000004c01c6 in net.(*netFD).Write (fd=0xc200060090, p= []uint8 =
    {...}, nn=0, err=...) at /usr/lib/go/src/pkg/net/fd_unix.go:294
    n = -1
    err = {tab = 0x0, data = 0x0}
    #5 0x00000000004cc913 in net.(*conn).Write (c=0xc200000168, b= []uint8 =
    {...}, ~anon1=6, ~anon2=...) at /usr/lib/go/src/pkg/net/net.go:131
    No locals.
    #6 0x000000000043af0b in fmt.Fprintf (w=..., format="<%d>%s %s[%d]: %s%s",
    a= []interface {} = {...}, n=4254901, err=...) at
    /usr/lib/go/src/pkg/fmt/print.go:215
    p = 0xc200082820
    n64 = 31878
    #7 0x000000000044b247 in
    github.com/Jimdo/periodicnoise/syslog.(*netConn).writeString
    (n=0xc200091200, p=29, hostname="web8", tag="blobdeploy_releasecleanup",
    msg="OK\n", nl="", ~anon5=...)
    at
    /home/buildmaster/builds/go/periodicnoise/debian/go_build/src/github.com/Jimdo/periodicnoise/syslog/syslog.go:294
    timestamp = "Oct 31 01:00:05"
    #8 0x000000000044af3a in
    github.com/Jimdo/periodicnoise/syslog.(*Writer).write (w=0xc20005f120, p=29,
    msg="OK\n", ~anon2=3, ~anon3=...) at
    /home/buildmaster/builds/go/periodicnoise/debian/go_build/src/github.com/Jimdo/periodicnoise/syslog/syslog.go:276
    nl = ""
    s·2 = Traceback (most recent call last):
    File "/usr/lib/go/src/pkg/runtime/runtime-gdb.py", line 43, in to_string
    return self.val['str'].string("utf-8", "ignore", l)
    OverflowError: signed integer is greater than maximum


    #9 0x000000000044ad19 in
    github.com/Jimdo/periodicnoise/syslog.(*Writer).writeAndRetry
    (w=0xc20005f120, p=29, s="OK\n", ~anon2=0, ~anon3=...) at
    /home/buildmaster/builds/go/periodicnoise/debian/go_build/src/github.com/Jimdo/periodicnoise/syslog/syslog.go:257
    pr = 29
    #10 0x000000000044a6f8 in
    github.com/Jimdo/periodicnoise/syslog.(*Writer).Write (w=0xc20005f120, b=
    []uint8 = {...}, ~anon1=833223656192, ~anon2=...) at
    /home/buildmaster/builds/go/periodicnoise/debian/go_build/src/github.com/Jimdo/periodicnoise/syslog/syslog.go:178
    No locals.
    #11 0x000000000042bdf0 in log.(*Logger).Output (l=0xc20007f140, calldepth=2,
    s="OK\n", ~anon2=...) at /usr/lib/go/src/pkg/log/log.go:153
    file = ""
    now = {sec = 63518774405, nsec = 178383383, loc = 0x671a20}
    line = 0
    #12 0x000000000042c657 in log.Println (v= []interface {} = {...}) at
    /usr/lib/go/src/pkg/log/log.go:282
    No locals.
    #13 0x0000000000401f4d in main.Ok () at
    /home/buildmaster/builds/go/periodicnoise/debian/go_build/src/github.com/Jimdo/periodicnoise/cmd/pn/main.go:31
    No locals.
    #14 0x000000000040322e in main.main () at
    /home/buildmaster/builds/go/periodicnoise/debian/go_build/src/github.com/Jimdo/periodicnoise/cmd/pn/main.go:221
    r·2 = {tab = 0xc20008f360, data = 0x30}
    logger = {tab = 0xc20008f4e0, data = 0xc20008e790}
    err = {tab = 0x0, data = 0x0}
    args = []string =
    {"/root/server_scripts/updateserver/deployment_release_cleanup.sh"}
    &cmd = 0xc200000088
    stdout = {tab = 0xc20008f7e0, data = 0xc200000268}
    w·3 = {tab = 0x4006710e8, data = 0x40ecb5}
    &wg = 0xc2000910e0
    stderr = {tab = 0xc20008f7e0, data = 0xc200000288}
    limit = 4723630
    command =
    "/root/server_scripts/updateserver/deployment_release_cleanup.sh"
    &cw = 0x2aaae4000
    timer = 0xc20007bb80
    #15 0x0000000000418fe2 in runtime.main () at
    /usr/lib/go/src/pkg/runtime/proc.c:182
    No locals.
    #16 0x000000000041afb0 in ?? () at /usr/lib/go/src/pkg/runtime/proc.c:1218
    No locals.
    #17 0x0000000000000000 in ?? ()
    No symbol table info available.
    (gdb) goroutine 2 bt full
    #0 runtime.goexit () at /usr/lib/go/src/pkg/runtime/proc.c:1223
    No locals.
    #1 0x0000000000000000 in ?? ()
    No symbol table info available.
    (gdb) goroutine 3 bt full
    #0 0x000000000041b52e in runtime.entersyscallblock (dummy=void) at
    /usr/lib/go/src/pkg/runtime/proc.c:1334
    No locals.
    #1 0x0000000000415bbe in runtime.MHeap_Scavenger () at
    /usr/lib/go/src/pkg/runtime/mheap.c:454
    notep = -1431621792
    note = 0
    trace = 0
    sumreleased = 0
    k = 584
    limit = -647710720
    forcegc = -259084288
    tick = -129542144
    h = -1707569152
    #2 0x000000000041afb0 in ?? () at /usr/lib/go/src/pkg/runtime/proc.c:1218
    No locals.
    #3 0x0000000000000000 in ?? ()
    No symbol table info available.
    (gdb) goroutine 4 bt full
    #0 0x000000000041adf4 in runtime.park (unlockf=void, lock=void,
    reason=void) at /usr/lib/go/src/pkg/runtime/proc.c:1175
    No locals.
    #1 0x0000000000425ea9 in timerproc () at
    /usr/lib/go/src/pkg/runtime/time.goc:210
    safe = 6723302
    arg = 1
    f = 3
    t = 0
    now = 1681968385
    delta = -1
    #2 0x000000000041afb0 in ?? () at /usr/lib/go/src/pkg/runtime/proc.c:1218
    No locals.
    #3 0x0000000000000000 in ?? ()
    No symbol table info available.
    (gdb) thread apply all bt full

    Thread 8 (Thread 0x7f789a388700 (LWP 31881)):
    #0 runtime.futex () at /usr/lib/go/src/pkg/runtime/sys_linux_amd64.s:267
    No locals.
    #1 0x0000000000416539 in runtime.futexsleep (addr=void, val=void, ns=void)
    at /usr/lib/go/src/pkg/runtime/os_linux.c:58
    safe = 0
    ts = 4254502
    #2 0x000000000040edd9 in runtime.notesleep (n=void) at
    /usr/lib/go/src/pkg/runtime/lock_futex.c:125
    No locals.
    #3 0x000000000041c9e6 in sysmon () at
    /usr/lib/go/src/pkg/runtime/proc.c:2007
    ticks = 1983
    lastpoll = 0
    delay = 20
    idle = 1
    #4 0x0000000000419972 in runtime.mstart () at
    /usr/lib/go/src/pkg/runtime/proc.c:485
    seh = 4298912
    #5 0x00000000004f03a4 in crosscall_amd64 ()
    No symbol table info available.
    #6 0x0000000000000003 in ?? ()
    No symbol table info available.
    #7 0x00007f78aab30040 in ?? () from /lib64/ld-linux-x86-64.so.2
    No symbol table info available.
    #8 0x0000000000000000 in ?? ()
    No symbol table info available.

    Thread 7 (Thread 0x7f7899a85700 (LWP 31885)):
    #0 runtime.futex () at /usr/lib/go/src/pkg/runtime/sys_linux_amd64.s:267
    No locals.
    #1 0x0000000000416539 in runtime.futexsleep (addr=void, val=void, ns=void)
    at /usr/lib/go/src/pkg/runtime/os_linux.c:58
    safe = 0
    ts = 4254502
    #2 0x000000000040edd9 in runtime.notesleep (n=void) at
    /usr/lib/go/src/pkg/runtime/lock_futex.c:125
    No locals.
    #3 0x0000000000419f8f in stopm () at /usr/lib/go/src/pkg/runtime/proc.c:795
    No locals.
    #4 0x000000000041ab1d in findrunnable () at
    /usr/lib/go/src/pkg/runtime/proc.c:1100
    safe = 512952
    i = 2
    p = 856576
    gp = 856576
    #5 0x000000000041acbd in schedule () at
    /usr/lib/go/src/pkg/runtime/proc.c:1145
    gp = 0
    #6 0x000000000041b0df in goexit0 (gp=void) at
    /usr/lib/go/src/pkg/runtime/proc.c:1247
    No locals.
    #7 0x00000000004263a9 in runtime.mcall (fn=void) at
    /usr/lib/go/src/pkg/runtime/asm_amd64.s:195
    No locals.
    #8 0x000000c200064000 in ?? ()
    No symbol table info available.
    #9 0x0000000000000000 in ?? ()
    No symbol table info available.

    Thread 6 (Thread 0x7f7899284700 (LWP 31887)):
    #0 runtime.futex () at /usr/lib/go/src/pkg/runtime/sys_linux_amd64.s:267
    No locals.
    #1 0x0000000000416539 in runtime.futexsleep (addr=void, val=void, ns=void)
    at /usr/lib/go/src/pkg/runtime/os_linux.c:58
    safe = 0
    ts = 60
    #2 0x000000000040ee9f in runtime.notetsleep (n=void, ns=void) at
    /usr/lib/go/src/pkg/runtime/lock_futex.c:147
    deadline = -1467643998
    #3 0x0000000000415bbe in runtime.MHeap_Scavenger () at
    /usr/lib/go/src/pkg/runtime/mheap.c:454
    notep = -1431621792
    note = 0
    trace = 0
    sumreleased = 0
    k = 941
    limit = -647710720
    forcegc = -259084288
    tick = -129542144
    h = -1707569152
    #4 0x000000000041afb0 in ?? () at /usr/lib/go/src/pkg/runtime/proc.c:1218
    No locals.
    #5 0x0000000000000000 in ?? ()
    No symbol table info available.

    seh = 0
    #5 0x00000000004f03a4 in crosscall_amd64 ()
    No symbol table info available.
    #6 0x0000000000000003 in ?? ()
    No symbol table info available.
    #7 0x00007f78aab30040 in ?? () from /lib64/ld-linux-x86-64.so.2
    No symbol table info available.
    #8 0x0000000000000000 in ?? ()
    No symbol table info available.

    Thread 4 (Thread 0x7f7898282700 (LWP 31894)):
    #0 runtime.futex () at /usr/lib/go/src/pkg/runtime/sys_linux_amd64.s:267
    No locals.
    #1 0x0000000000416539 in runtime.futexsleep (addr=void, val=void, ns=void)
    at /usr/lib/go/src/pkg/runtime/os_linux.c:58
    safe = 0
    ts = 4254502
    #2 0x000000000040edd9 in runtime.notesleep (n=void) at
    /usr/lib/go/src/pkg/runtime/lock_futex.c:125
    No locals.
    #3 0x0000000000419f8f in stopm () at /usr/lib/go/src/pkg/runtime/proc.c:795
    No locals.
    #4 0x000000000041ab1d in findrunnable () at
    /usr/lib/go/src/pkg/runtime/proc.c:1100
    safe = 856320
    i = 2
    p = 409600
    gp = 0
    #5 0x000000000041acbd in schedule () at
    /usr/lib/go/src/pkg/runtime/proc.c:1145
    gp = 0
    #6 0x000000000041b0df in goexit0 (gp=void) at
    /usr/lib/go/src/pkg/runtime/proc.c:1247
    No locals.
    #7 0x00000000004263a9 in runtime.mcall (fn=void) at
    /usr/lib/go/src/pkg/runtime/asm_amd64.s:195
    No locals.
    #8 0x000000c200064000 in ?? ()
    No symbol table info available.
    #9 0x0000000000000000 in ?? ()
    No symbol table info available.

    Thread 3 (Thread 0x7f7897a81700 (LWP 31897)):
    #0 runtime.futex () at /usr/lib/go/src/pkg/runtime/sys_linux_amd64.s:267
    No locals.
    #1 0x0000000000416539 in runtime.futexsleep (addr=void, val=void, ns=void)
    at /usr/lib/go/src/pkg/runtime/os_linux.c:58
    safe = 0
    ts = 4254502
    #2 0x000000000040edd9 in runtime.notesleep (n=void) at
    /usr/lib/go/src/pkg/runtime/lock_futex.c:125
    No locals.
    #3 0x0000000000419f8f in stopm () at /usr/lib/go/src/pkg/runtime/proc.c:795
    No locals.
    #4 0x000000000041ab1d in findrunnable () at
    /usr/lib/go/src/pkg/runtime/proc.c:1100
    safe = 512952
    i = 2
    p = 856576
    gp = 856576
    #5 0x000000000041acbd in schedule () at
    /usr/lib/go/src/pkg/runtime/proc.c:1145
    gp = 0
    #6 0x000000000041b0df in goexit0 (gp=void) at
    /usr/lib/go/src/pkg/runtime/proc.c:1247
    No locals.
    #7 0x00000000004263a9 in runtime.mcall (fn=void) at
    /usr/lib/go/src/pkg/runtime/asm_amd64.s:195
    No locals.
    #8 0x000000c200064000 in ?? ()
    No symbol table info available.
    #9 0x0000000000000000 in ?? ()
    No symbol table info available.

    Thread 2 (Thread 0x7f7897280700 (LWP 31898)):
    #0 runtime.futex () at /usr/lib/go/src/pkg/runtime/sys_linux_amd64.s:267
    No locals.
    #1 0x0000000000416539 in runtime.futexsleep (addr=void, val=void, ns=void)
    at /usr/lib/go/src/pkg/runtime/os_linux.c:58
    safe = 0
    ts = 4254502
    #2 0x000000000040edd9 in runtime.notesleep (n=void) at
    /usr/lib/go/src/pkg/runtime/lock_futex.c:125
    No locals.
    #3 0x0000000000419f8f in stopm () at /usr/lib/go/src/pkg/runtime/proc.c:795
    No locals.
    #4 0x000000000041ab1d in findrunnable () at
    /usr/lib/go/src/pkg/runtime/proc.c:1100
    safe = 983040
    i = 2
    p = 32768
    gp = 4286936
    #5 0x000000000041acbd in schedule () at
    /usr/lib/go/src/pkg/runtime/proc.c:1145
    gp = 0
    #6 0x00000000004199a3 in runtime.mstart () at
    /usr/lib/go/src/pkg/runtime/proc.c:494
    seh = 0
    #7 0x00000000004f03a4 in crosscall_amd64 ()
    No symbol table info available.
    #8 0x0000000000000003 in ?? ()
    No symbol table info available.
    #9 0x00007f78aab30040 in ?? () from /lib64/ld-linux-x86-64.so.2
    No symbol table info available.
    #10 0x0000000000000000 in ?? ()
    No symbol table info available.

    Thread 1 (Thread 0x7f78aab25700 (LWP 31878)):
    #0 runtime.futex () at /usr/lib/go/src/pkg/runtime/sys_linux_amd64.s:267
    No locals.
    #1 0x0000000000416539 in runtime.futexsleep (addr=void, val=void, ns=void)
    at /usr/lib/go/src/pkg/runtime/os_linux.c:58
    safe = 0
    ts = 4254502
    #2 0x000000000040edd9 in runtime.notesleep (n=void) at
    /usr/lib/go/src/pkg/runtime/lock_futex.c:125
    No locals.
    #3 0x0000000000419f8f in stopm () at /usr/lib/go/src/pkg/runtime/proc.c:795
    No locals.
    #4 0x000000000041ab1d in findrunnable () at
    /usr/lib/go/src/pkg/runtime/proc.c:1100
    safe = 4351120
    i = 2
    p = 352512
    gp = 4351839
    #5 0x000000000041acbd in schedule () at
    /usr/lib/go/src/pkg/runtime/proc.c:1145
    gp = 0
    #6 0x000000000041aed0 in park0 (gp=void) at
    /usr/lib/go/src/pkg/runtime/proc.c:1194
    No locals.
    #7 0x00000000004263a9 in runtime.mcall (fn=void) at
    /usr/lib/go/src/pkg/runtime/asm_amd64.s:195
    No locals.
    #8 0x0000000000671c60 in runtime.g0 ()
    No symbol table info available.
    #9 0x0000000000426276 in _rt0_amd64 () at
    /usr/lib/go/src/pkg/runtime/asm_amd64.s:86
    No locals.
    #10 0x0000000000000006 in ?? ()
    No symbol table info available.
    #11 0x00007fff75f0a7d8 in ?? ()
    No symbol table info available.
    #12 0x0000000000000006 in ?? ()
    No symbol table info available.
    #13 0x00007fff75f0a7d8 in ?? ()
    No symbol table info available.
    #14 0x0000000000000000 in ?? ()
    No symbol table info available.
    (gdb) quit
    A debugging session is active.

    Inferior 1 [process 31878] will be detached.

    Quit anyway? (y or n) y
    Detaching from program: /usr/bin/pn, process 31878

    --
    You received this message because you are subscribed to the Google Groups
    "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an
    email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/groups/opt_out.
    --
    You received this message because you are subscribed to the Google Groups "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/groups/opt_out.
  • Ingo Oeser at Nov 4, 2013 at 1:22 pm
    Thanks Dave,

    I gathered the requested information.

    So the question is: Why doesn't IO make any progress here?
    More Information: The syslogd is up and running, other instances of this
    program even run at the same time successfully reporting to syslogd every
    minute,
    io.Copy is called in a loop here
    https://github.com/Jimdo/periodicnoise/blob/master/cmd/pn/log.go#L32-L42 and
    retries on any error.
    I use the syslog implementation from Go 1.2rc2 and vendored it into
    https://github.com/Jimdo/periodicnoise/syslog to get error handling
    (the Go 1.1.X version of log/syslog is useless, because it never caught the
    errors it should retry on).

    I am not sure about the
    https://github.com/Jimdo/periodicnoise/blob/master/cmd/pn/main.go#L214waiting on the logging to complete vs. process exit.

    I don't know how to reproduce the issue reliably, but it happens often
    enough to bother us. Out of minutely runs on more than 300 hosts,
    we see it at least 10 times a week on random hosts.

    SIGQUIT: quit
    PC=0x4271d1


    goroutine 1 [semacquire]:
    sync.runtime_Semacquire(0xc200000330)
    /usr/lib/go/src/pkg/runtime/zsema_linux_amd64.c:165 +0x2e
    sync.(*WaitGroup).Wait(0xc2000910e0)
    /usr/lib/go/src/pkg/sync/waitgroup.go:109 +0xf2
    main.main()
    /home/buildmaster/builds/go/periodicnoise/debian/go_build/src/github.com/Jimdo/periodicnoise/cmd/pn/main.go:214
    +0x822

    goroutine 2 [syscall]:

    goroutine 5 [IO wait]:
    net.runtime_pollWait(0x7f61bb6cbf00, 0x77, 0x0)
    /usr/lib/go/src/pkg/runtime/znetpoll_linux_amd64.c:118 +0x82
    net.(*pollDesc).WaitWrite(0xc200060110, 0xb, 0xc20008f5d0)
    /usr/lib/go/src/pkg/net/fd_poll_runtime.go:80 +0x31
    net.(*netFD).Write(0xc200060090, 0xc2000e9000, 0xa8, 0xa8, 0x0, ...)
    /usr/lib/go/src/pkg/net/fd_unix.go:294 +0x3e6
    net.(*conn).Write(0xc200000168, 0xc2000e9000, 0xa8, 0xa8, 0x6, ...)
    /usr/lib/go/src/pkg/net/net.go:131 +0xc3
    fmt.Fprintf(0xc20008fa80, 0xc200000168, 0x598c50, 0x13, 0x7f61cc6f4bb8, ...)
    /usr/lib/go/src/pkg/fmt/print.go:215 +0xab
    github.com/Jimdo/periodicnoise/syslog.(*netConn).writeString(0xc200091200,
    0x1d, 0xc200000138, 0x5, 0x7fff33b1de6f, ...)
    /home/buildmaster/builds/go/periodicnoise/debian/go_build/src/github.com/Jimdo/periodicnoise/syslog/syslog.go:294
    +0x297
    github.com/Jimdo/periodicnoise/syslog.(*Writer).write(0xc20009c0c0, 0x1d,
    0xc2000a4300, 0x79, 0x79, ...)
    /home/buildmaster/builds/go/periodicnoise/debian/go_build/src/github.com/Jimdo/periodicnoise/syslog/syslog.go:276
    +0x16a
    github.com/Jimdo/periodicnoise/syslog.(*Writer).writeAndRetry(0xc20009c0c0,
    0x1d, 0xc2000a4300, 0x79, 0x0, ...)
    /home/buildmaster/builds/go/periodicnoise/debian/go_build/src/github.com/Jimdo/periodicnoise/syslog/syslog.go:257
    +0xb9
    github.com/Jimdo/periodicnoise/syslog.(*Writer).Write(0xc20009c0c0,
    0xc2000c0000, 0x79, 0x8000, 0x1, ...)
    /home/buildmaster/builds/go/periodicnoise/debian/go_build/src/github.com/Jimdo/periodicnoise/syslog/syslog.go:178
    +0x68
    main.(*LineWriter).Write(0xc20008e790, 0xc2000c0000, 0x79, 0x8000, 0x0, ...)
    /home/buildmaster/builds/go/periodicnoise/debian/go_build/src/github.com/Jimdo/periodicnoise/cmd/pn/linewriter.go:17
    +0xf5
    io.Copy(0xc20008f4e0, 0xc20008e790, 0xc20008f8a0, 0xc200091280, 0x0, ...)
    /usr/lib/go/src/pkg/io/io.go:350 +0x237
    main.func·001()
    /home/buildmaster/builds/go/periodicnoise/debian/go_build/src/github.com/Jimdo/periodicnoise/cmd/pn/log.go:36
    +0x60
    created by main.logStream
    /home/buildmaster/builds/go/periodicnoise/debian/go_build/src/github.com/Jimdo/periodicnoise/cmd/pn/log.go:43
    +0xd1

    goroutine 7 [semacquire]:
    sync.runtime_Semacquire(0xc20009c10c)
    /usr/lib/go/src/pkg/runtime/zsema_linux_amd64.c:165 +0x2e
    sync.(*Mutex).Lock(0xc20009c108)
    /usr/lib/go/src/pkg/sync/mutex.go:66 +0xbb
    github.com/Jimdo/periodicnoise/syslog.(*Writer).writeAndRetry(0xc20009c0c0,
    0x1d, 0xc2000914e0, 0x1f, 0x0, ...)
    /home/buildmaster/builds/go/periodicnoise/debian/go_build/src/github.com/Jimdo/periodicnoise/syslog/syslog.go:253
    +0x65
    github.com/Jimdo/periodicnoise/syslog.(*Writer).Write(0xc20009c0c0,
    0xc2000914c0, 0x1f, 0x1f, 0xc2000914a0, ...)
    /home/buildmaster/builds/go/periodicnoise/debian/go_build/src/github.com/Jimdo/periodicnoise/syslog/syslog.go:178
    +0x68
    log.(*Logger).Output(0xc20007f140, 0x2, 0xc2000914a0, 0x1f, 0x0, ...)
    /usr/lib/go/src/pkg/log/log.go:153 +0x330
    log.Println(0x7f61cc6f3f68, 0x2, 0x2)
    /usr/lib/go/src/pkg/log/log.go:282 +0x67
    main.TimedOut()
    /home/buildmaster/builds/go/periodicnoise/debian/go_build/src/github.com/Jimdo/periodicnoise/cmd/pn/main.go:52
    +0xf2
    main.func·002()
    /home/buildmaster/builds/go/periodicnoise/debian/go_build/src/github.com/Jimdo/periodicnoise/cmd/pn/main.go:159
    +0x25
    created by time.goFunc
    /usr/lib/go/src/pkg/time/sleep.go:122 +0x48
    rax 0xca
    rbx 0xffffffffffffffff
    rcx 0xffffffffffffffff
    rdx 0x0
    rdi 0x677218
    rsi 0x0
    rbp 0x0
    rsp 0x7fff33b1be18
    r8 0x0
    r9 0x0
    r10 0x0
    r11 0x246
    r12 0x0
    r13 0x23bd
    r14 0x10
    r15 0x5989d0
    rip 0x4271d1
    rflags 0x246
    cs 0x33
    fs 0x0
    gs 0x0

    On Saturday, November 2, 2013 12:27:55 AM UTC+1, Dave Cheney wrote:

    When your program hangs, can you hit it with a SIGQUIT and post the
    complete stacktrace message.

    Cheers

    Dave
    On Sat, Nov 2, 2013 at 9:56 AM, Ingo Oeser wrote:
    Hi all,

    I have a go program which hangs in a strange way sometimes and I don't know
    how to build a repeatable test case.
    Any help on building a repeatable test case or even identify the problem is
    very welcome.

    The program is called pn, for "periodic noise", and it's purpose is
    described here: https://github.com/Jimdo/periodicnoise
    System is Debian 6.0.4 / Linux 2.6.32-5-amd64 #1 SMP Mon Jan 16 16:22:28 UTC
    2012 x86_64
    Go version is 1.1.2

    I call it like this from a crontab entry:
    pn --monitor-event=blobdeploy_releasecleanup --timeout=45m --use-syslog --
    /root/server_scripts/updateserver/deployment_release_cleanup.sh

    Here is a gdb debug session listing all goroutines and
    gdb -p 31878 /usr/bin/pn
    GNU gdb (GDB) 7.3-debian
    Copyright (C) 2011 Free Software Foundation, Inc.
    License GPLv3+: GNU GPL version 3 or later
    <http://gnu.org/licenses/gpl.html>
    This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law. Type "show copying"
    and "show warranty" for details.
    This GDB was configured as "x86_64-linux-gnu".
    For bug reporting instructions, please see:
    <http://www.gnu.org/software/gdb/bugs/>...
    Reading symbols from /usr/bin/pn...done.
    Attaching to program: /usr/bin/pn, process 31878
    Reading symbols from /lib/libpthread.so.0...(no debugging symbols
    found)...done.
    [Thread debugging using libthread_db enabled]
    [New Thread 0x7f7897280700 (LWP 31898)]
    [New Thread 0x7f7897a81700 (LWP 31897)]
    [New Thread 0x7f7898282700 (LWP 31894)]
    [New Thread 0x7f7898a83700 (LWP 31890)]
    [New Thread 0x7f7899284700 (LWP 31887)]
    [New Thread 0x7f7899a85700 (LWP 31885)]
    [New Thread 0x7f789a388700 (LWP 31881)]
    Loaded symbols for /lib/libpthread.so.0
    Reading symbols from /lib/libc.so.6...(no debugging symbols
    found)...done.
    Loaded symbols for /lib/libc.so.6
    Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols
    found)...done.
    Loaded symbols for /lib64/ld-linux-x86-64.so.2
    runtime.futex () at /usr/lib/go/src/pkg/runtime/sys_linux_amd64.s:267
    267 RET
    Loading Go Runtime support.
    (gdb) info goroutines
    1 waiting runtime.park
    2 syscall runtime.goexit
    * 3 syscall runtime.entersyscallblock
    4 waiting runtime.park
    (gdb) goroutine 1 bt full
    #0 0x000000000041adf4 in runtime.park (unlockf=void, lock=void,
    reason=void) at /usr/lib/go/src/pkg/runtime/proc.c:1175
    No locals.
    #1 0x000000000042448f in netpollblock (pd=void, mode=void) at
    /usr/lib/go/src/pkg/runtime/netpoll.goc:247
    gpp = -1717014688
    #2 0x0000000000423f52 in net.runtime_pollWait (pd=void, mode=void,
    err=void) at /usr/lib/go/src/pkg/runtime/netpoll.goc:117
    No locals.
    #3 0x00000000004bdf81 in net.(*pollDesc).WaitWrite (pd=0xc200060110,
    ~anon0=...) at /usr/lib/go/src/pkg/net/fd_poll_runtime.go:80
    No locals.
    #4 0x00000000004c01c6 in net.(*netFD).Write (fd=0xc200060090, p= []uint8 =
    {...}, nn=0, err=...) at /usr/lib/go/src/pkg/net/fd_unix.go:294
    n = -1
    err = {tab = 0x0, data = 0x0}
    #5 0x00000000004cc913 in net.(*conn).Write (c=0xc200000168, b= []uint8 =
    {...}, ~anon1=6, ~anon2=...) at /usr/lib/go/src/pkg/net/net.go:131
    No locals.
    #6 0x000000000043af0b in fmt.Fprintf (w=..., format="<%d>%s %s[%d]: %s%s",
    a= []interface {} = {...}, n=4254901, err=...) at
    /usr/lib/go/src/pkg/fmt/print.go:215
    p = 0xc200082820
    n64 = 31878
    #7 0x000000000044b247 in
    github.com/Jimdo/periodicnoise/syslog.(*netConn).writeString
    (n=0xc200091200, p=29, hostname="web8", tag="blobdeploy_releasecleanup",
    msg="OK\n", nl="", ~anon5=...)
    at
    /home/buildmaster/builds/go/periodicnoise/debian/go_build/src/
    github.com/Jimdo/periodicnoise/syslog/syslog.go:294
    timestamp = "Oct 31 01:00:05"
    #8 0x000000000044af3a in
    github.com/Jimdo/periodicnoise/syslog.(*Writer).write (w=0xc20005f120, p=29,
    msg="OK\n", ~anon2=3, ~anon3=...) at
    /home/buildmaster/builds/go/periodicnoise/debian/go_build/src/
    github.com/Jimdo/periodicnoise/syslog/syslog.go:276
    nl = ""
    s·2 = Traceback (most recent call last):
    File "/usr/lib/go/src/pkg/runtime/runtime-gdb.py", line 43, in to_string
    return self.val['str'].string("utf-8", "ignore", l)
    OverflowError: signed integer is greater than maximum


    #9 0x000000000044ad19 in
    github.com/Jimdo/periodicnoise/syslog.(*Writer).writeAndRetry
    (w=0xc20005f120, p=29, s="OK\n", ~anon2=0, ~anon3=...) at
    /home/buildmaster/builds/go/periodicnoise/debian/go_build/src/
    github.com/Jimdo/periodicnoise/syslog/syslog.go:257
    pr = 29
    #10 0x000000000044a6f8 in
    github.com/Jimdo/periodicnoise/syslog.(*Writer).Write (w=0xc20005f120, b=
    []uint8 = {...}, ~anon1=833223656192, ~anon2=...) at
    /home/buildmaster/builds/go/periodicnoise/debian/go_build/src/
    github.com/Jimdo/periodicnoise/syslog/syslog.go:178
    No locals.
    #11 0x000000000042bdf0 in log.(*Logger).Output (l=0xc20007f140,
    calldepth=2,
    s="OK\n", ~anon2=...) at /usr/lib/go/src/pkg/log/log.go:153
    file = ""
    now = {sec = 63518774405, nsec = 178383383, loc = 0x671a20}
    line = 0
    #12 0x000000000042c657 in log.Println (v= []interface {} = {...}) at
    /usr/lib/go/src/pkg/log/log.go:282
    No locals.
    #13 0x0000000000401f4d in main.Ok () at
    /home/buildmaster/builds/go/periodicnoise/debian/go_build/src/
    github.com/Jimdo/periodicnoise/cmd/pn/main.go:31
    No locals.
    #14 0x000000000040322e in main.main () at
    /home/buildmaster/builds/go/periodicnoise/debian/go_build/src/
    github.com/Jimdo/periodicnoise/cmd/pn/main.go:221
    r·2 = {tab = 0xc20008f360, data = 0x30}
    logger = {tab = 0xc20008f4e0, data = 0xc20008e790}
    err = {tab = 0x0, data = 0x0}
    args = []string =
    {"/root/server_scripts/updateserver/deployment_release_cleanup.sh"}
    &cmd = 0xc200000088
    stdout = {tab = 0xc20008f7e0, data = 0xc200000268}
    w·3 = {tab = 0x4006710e8, data = 0x40ecb5}
    &wg = 0xc2000910e0
    stderr = {tab = 0xc20008f7e0, data = 0xc200000288}
    limit = 4723630
    command =
    "/root/server_scripts/updateserver/deployment_release_cleanup.sh"
    &cw = 0x2aaae4000
    timer = 0xc20007bb80
    #15 0x0000000000418fe2 in runtime.main () at
    /usr/lib/go/src/pkg/runtime/proc.c:182
    No locals.
    #16 0x000000000041afb0 in ?? () at
    /usr/lib/go/src/pkg/runtime/proc.c:1218
    No locals.
    #17 0x0000000000000000 in ?? ()
    No symbol table info available.
    (gdb) goroutine 2 bt full
    #0 runtime.goexit () at /usr/lib/go/src/pkg/runtime/proc.c:1223
    No locals.
    #1 0x0000000000000000 in ?? ()
    No symbol table info available.
    (gdb) goroutine 3 bt full
    #0 0x000000000041b52e in runtime.entersyscallblock (dummy=void) at
    /usr/lib/go/src/pkg/runtime/proc.c:1334
    No locals.
    #1 0x0000000000415bbe in runtime.MHeap_Scavenger () at
    /usr/lib/go/src/pkg/runtime/mheap.c:454
    notep = -1431621792
    note = 0
    trace = 0
    sumreleased = 0
    k = 584
    limit = -647710720
    forcegc = -259084288
    tick = -129542144
    h = -1707569152
    #2 0x000000000041afb0 in ?? () at
    /usr/lib/go/src/pkg/runtime/proc.c:1218
    No locals.
    #3 0x0000000000000000 in ?? ()
    No symbol table info available.
    (gdb) goroutine 4 bt full
    #0 0x000000000041adf4 in runtime.park (unlockf=void, lock=void,
    reason=void) at /usr/lib/go/src/pkg/runtime/proc.c:1175
    No locals.
    #1 0x0000000000425ea9 in timerproc () at
    /usr/lib/go/src/pkg/runtime/time.goc:210
    safe = 6723302
    arg = 1
    f = 3
    t = 0
    now = 1681968385
    delta = -1
    #2 0x000000000041afb0 in ?? () at
    /usr/lib/go/src/pkg/runtime/proc.c:1218
    No locals.
    #3 0x0000000000000000 in ?? ()
    No symbol table info available.
    (gdb) thread apply all bt full

    Thread 8 (Thread 0x7f789a388700 (LWP 31881)):
    #0 runtime.futex () at
    /usr/lib/go/src/pkg/runtime/sys_linux_amd64.s:267
    No locals.
    #1 0x0000000000416539 in runtime.futexsleep (addr=void, val=void, ns=void)
    at /usr/lib/go/src/pkg/runtime/os_linux.c:58
    safe = 0
    ts = 4254502
    #2 0x000000000040edd9 in runtime.notesleep (n=void) at
    /usr/lib/go/src/pkg/runtime/lock_futex.c:125
    No locals.
    #3 0x000000000041c9e6 in sysmon () at
    /usr/lib/go/src/pkg/runtime/proc.c:2007
    ticks = 1983
    lastpoll = 0
    delay = 20
    idle = 1
    #4 0x0000000000419972 in runtime.mstart () at
    /usr/lib/go/src/pkg/runtime/proc.c:485
    seh = 4298912
    #5 0x00000000004f03a4 in crosscall_amd64 ()
    No symbol table info available.
    #6 0x0000000000000003 in ?? ()
    No symbol table info available.
    #7 0x00007f78aab30040 in ?? () from /lib64/ld-linux-x86-64.so.2
    No symbol table info available.
    #8 0x0000000000000000 in ?? ()
    No symbol table info available.

    Thread 7 (Thread 0x7f7899a85700 (LWP 31885)):
    #0 runtime.futex () at
    /usr/lib/go/src/pkg/runtime/sys_linux_amd64.s:267
    No locals.
    #1 0x0000000000416539 in runtime.futexsleep (addr=void, val=void, ns=void)
    at /usr/lib/go/src/pkg/runtime/os_linux.c:58
    safe = 0
    ts = 4254502
    #2 0x000000000040edd9 in runtime.notesleep (n=void) at
    /usr/lib/go/src/pkg/runtime/lock_futex.c:125
    No locals.
    #3 0x0000000000419f8f in stopm () at
    /usr/lib/go/src/pkg/runtime/proc.c:795
    No locals.
    #4 0x000000000041ab1d in findrunnable () at
    /usr/lib/go/src/pkg/runtime/proc.c:1100
    safe = 512952
    i = 2
    p = 856576
    gp = 856576
    #5 0x000000000041acbd in schedule () at
    /usr/lib/go/src/pkg/runtime/proc.c:1145
    gp = 0
    #6 0x000000000041b0df in goexit0 (gp=void) at
    /usr/lib/go/src/pkg/runtime/proc.c:1247
    No locals.
    #7 0x00000000004263a9 in runtime.mcall (fn=void) at
    /usr/lib/go/src/pkg/runtime/asm_amd64.s:195
    No locals.
    #8 0x000000c200064000 in ?? ()
    No symbol table info available.
    #9 0x0000000000000000 in ?? ()
    No symbol table info available.

    Thread 6 (Thread 0x7f7899284700 (LWP 31887)):
    #0 runtime.futex () at
    /usr/lib/go/src/pkg/runtime/sys_linux_amd64.s:267
    No locals.
    #1 0x0000000000416539 in runtime.futexsleep (addr=void, val=void, ns=void)
    at /usr/lib/go/src/pkg/runtime/os_linux.c:58
    safe = 0
    ts = 60
    #2 0x000000000040ee9f in runtime.notetsleep (n=void, ns=void) at
    /usr/lib/go/src/pkg/runtime/lock_futex.c:147
    deadline = -1467643998
    #3 0x0000000000415bbe in runtime.MHeap_Scavenger () at
    /usr/lib/go/src/pkg/runtime/mheap.c:454
    notep = -1431621792
    note = 0
    trace = 0
    sumreleased = 0
    k = 941
    limit = -647710720
    forcegc = -259084288
    tick = -129542144
    h = -1707569152
    #4 0x000000000041afb0 in ?? () at
    /usr/lib/go/src/pkg/runtime/proc.c:1218
    No locals.
    #5 0x0000000000000000 in ?? ()
    No symbol table info available.

    seh = 0
    #5 0x00000000004f03a4 in crosscall_amd64 ()
    No symbol table info available.
    #6 0x0000000000000003 in ?? ()
    No symbol table info available.
    #7 0x00007f78aab30040 in ?? () from /lib64/ld-linux-x86-64.so.2
    No symbol table info available.
    #8 0x0000000000000000 in ?? ()
    No symbol table info available.

    Thread 4 (Thread 0x7f7898282700 (LWP 31894)):
    #0 runtime.futex () at
    /usr/lib/go/src/pkg/runtime/sys_linux_amd64.s:267
    No locals.
    #1 0x0000000000416539 in runtime.futexsleep (addr=void, val=void, ns=void)
    at /usr/lib/go/src/pkg/runtime/os_linux.c:58
    safe = 0
    ts = 4254502
    #2 0x000000000040edd9 in runtime.notesleep (n=void) at
    /usr/lib/go/src/pkg/runtime/lock_futex.c:125
    No locals.
    #3 0x0000000000419f8f in stopm () at
    /usr/lib/go/src/pkg/runtime/proc.c:795
    No locals.
    #4 0x000000000041ab1d in findrunnable () at
    /usr/lib/go/src/pkg/runtime/proc.c:1100
    safe = 856320
    i = 2
    p = 409600
    gp = 0
    #5 0x000000000041acbd in schedule () at
    /usr/lib/go/src/pkg/runtime/proc.c:1145
    gp = 0
    #6 0x000000000041b0df in goexit0 (gp=void) at
    /usr/lib/go/src/pkg/runtime/proc.c:1247
    No locals.
    #7 0x00000000004263a9 in runtime.mcall (fn=void) at
    /usr/lib/go/src/pkg/runtime/asm_amd64.s:195
    No locals.
    #8 0x000000c200064000 in ?? ()
    No symbol table info available.
    #9 0x0000000000000000 in ?? ()
    No symbol table info available.

    Thread 3 (Thread 0x7f7897a81700 (LWP 31897)):
    #0 runtime.futex () at
    /usr/lib/go/src/pkg/runtime/sys_linux_amd64.s:267
    No locals.
    #1 0x0000000000416539 in runtime.futexsleep (addr=void, val=void, ns=void)
    at /usr/lib/go/src/pkg/runtime/os_linux.c:58
    safe = 0
    ts = 4254502
    #2 0x000000000040edd9 in runtime.notesleep (n=void) at
    /usr/lib/go/src/pkg/runtime/lock_futex.c:125
    No locals.
    #3 0x0000000000419f8f in stopm () at
    /usr/lib/go/src/pkg/runtime/proc.c:795
    No locals.
    #4 0x000000000041ab1d in findrunnable () at
    /usr/lib/go/src/pkg/runtime/proc.c:1100
    safe = 512952
    i = 2
    p = 856576
    gp = 856576
    #5 0x000000000041acbd in schedule () at
    /usr/lib/go/src/pkg/runtime/proc.c:1145
    gp = 0
    #6 0x000000000041b0df in goexit0 (gp=void) at
    /usr/lib/go/src/pkg/runtime/proc.c:1247
    No locals.
    #7 0x00000000004263a9 in runtime.mcall (fn=void) at
    /usr/lib/go/src/pkg/runtime/asm_amd64.s:195
    No locals.
    #8 0x000000c200064000 in ?? ()
    No symbol table info available.
    #9 0x0000000000000000 in ?? ()
    No symbol table info available.

    Thread 2 (Thread 0x7f7897280700 (LWP 31898)):
    #0 runtime.futex () at
    /usr/lib/go/src/pkg/runtime/sys_linux_amd64.s:267
    No locals.
    #1 0x0000000000416539 in runtime.futexsleep (addr=void, val=void, ns=void)
    at /usr/lib/go/src/pkg/runtime/os_linux.c:58
    safe = 0
    ts = 4254502
    #2 0x000000000040edd9 in runtime.notesleep (n=void) at
    /usr/lib/go/src/pkg/runtime/lock_futex.c:125
    No locals.
    #3 0x0000000000419f8f in stopm () at
    /usr/lib/go/src/pkg/runtime/proc.c:795
    No locals.
    #4 0x000000000041ab1d in findrunnable () at
    /usr/lib/go/src/pkg/runtime/proc.c:1100
    safe = 983040
    i = 2
    p = 32768
    gp = 4286936
    #5 0x000000000041acbd in schedule () at
    /usr/lib/go/src/pkg/runtime/proc.c:1145
    gp = 0
    #6 0x00000000004199a3 in runtime.mstart () at
    /usr/lib/go/src/pkg/runtime/proc.c:494
    seh = 0
    #7 0x00000000004f03a4 in crosscall_amd64 ()
    No symbol table info available.
    #8 0x0000000000000003 in ?? ()
    No symbol table info available.
    #9 0x00007f78aab30040 in ?? () from /lib64/ld-linux-x86-64.so.2
    No symbol table info available.
    #10 0x0000000000000000 in ?? ()
    No symbol table info available.

    Thread 1 (Thread 0x7f78aab25700 (LWP 31878)):
    #0 runtime.futex () at
    /usr/lib/go/src/pkg/runtime/sys_linux_amd64.s:267
    No locals.
    #1 0x0000000000416539 in runtime.futexsleep (addr=void, val=void, ns=void)
    at /usr/lib/go/src/pkg/runtime/os_linux.c:58
    safe = 0
    ts = 4254502
    #2 0x000000000040edd9 in runtime.notesleep (n=void) at
    /usr/lib/go/src/pkg/runtime/lock_futex.c:125
    No locals.
    #3 0x0000000000419f8f in stopm () at
    /usr/lib/go/src/pkg/runtime/proc.c:795
    No locals.
    #4 0x000000000041ab1d in findrunnable () at
    /usr/lib/go/src/pkg/runtime/proc.c:1100
    safe = 4351120
    i = 2
    p = 352512
    gp = 4351839
    #5 0x000000000041acbd in schedule () at
    /usr/lib/go/src/pkg/runtime/proc.c:1145
    gp = 0
    #6 0x000000000041aed0 in park0 (gp=void) at
    /usr/lib/go/src/pkg/runtime/proc.c:1194
    No locals.
    #7 0x00000000004263a9 in runtime.mcall (fn=void) at
    /usr/lib/go/src/pkg/runtime/asm_amd64.s:195
    No locals.
    #8 0x0000000000671c60 in runtime.g0 ()
    No symbol table info available.
    #9 0x0000000000426276 in _rt0_amd64 () at
    /usr/lib/go/src/pkg/runtime/asm_amd64.s:86
    No locals.
    #10 0x0000000000000006 in ?? ()
    No symbol table info available.
    #11 0x00007fff75f0a7d8 in ?? ()
    No symbol table info available.
    #12 0x0000000000000006 in ?? ()
    No symbol table info available.
    #13 0x00007fff75f0a7d8 in ?? ()
    No symbol table info available.
    #14 0x0000000000000000 in ?? ()
    No symbol table info available.
    (gdb) quit
    A debugging session is active.

    Inferior 1 [process 31878] will be detached.

    Quit anyway? (y or n) y
    Detaching from program: /usr/bin/pn, process 31878

    --
    You received this message because you are subscribed to the Google Groups
    "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an
    email to golang-nuts...@googlegroups.com <javascript:>.
    For more options, visit https://groups.google.com/groups/opt_out.
    --
    You received this message because you are subscribed to the Google Groups "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/groups/opt_out.
  • Andrey mirtchovski at Nov 4, 2013 at 5:57 pm
    if I were to analyse the code just from the source and the trace dump
    I would say that you're being held up waiting on a network
    error/timeout.

    - if you look at the trace you posted it shows that you're reusing the
    same object in goroutines 7 and 5 (0xc200091200, the first argument to
    all methods).

    - goroutine 7 is blocked because it's waiting on a mutex inside
    writeAndRetry (syslog.go:253).

    - goroutine 5, using the same object, seems to have passed through the
    mutex (i.e., is holding it) and has made the call to write
    (syslog.go:257), which subsequently has blocked inside the network
    poller, holding the mutex hostage.

    - the network writer got EAGAIN when writing on the first attempt,
    forcing a call to WaitWrite(), which blocks inside the poller.

    at this point i can't speculate any further. is there packet loss
    involved? is the connection tcp or udp?

    hope this helps at least a bit.

    --
    You received this message because you are subscribed to the Google Groups "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/groups/opt_out.
  • Dave Cheney at Nov 4, 2013 at 6:04 pm
    I agree. I think you've isolated the problem to syslog (oh how I hate
    that package).

    Additionally to what Andrey suggested this logic looks suspect

    https://github.com/Jimdo/periodicnoise/blob/master/cmd/pn/log.go#L32

    If io.Copy returns an error then you don't know the state of r, it
    almost certainly won;t be in the same place as it was before the call
    so retrying won't give you the results you want.

    On Tue, Nov 5, 2013 at 4:57 AM, andrey mirtchovski
    wrote:
    if I were to analyse the code just from the source and the trace dump
    I would say that you're being held up waiting on a network
    error/timeout.

    - if you look at the trace you posted it shows that you're reusing the
    same object in goroutines 7 and 5 (0xc200091200, the first argument to
    all methods).

    - goroutine 7 is blocked because it's waiting on a mutex inside
    writeAndRetry (syslog.go:253).

    - goroutine 5, using the same object, seems to have passed through the
    mutex (i.e., is holding it) and has made the call to write
    (syslog.go:257), which subsequently has blocked inside the network
    poller, holding the mutex hostage.

    - the network writer got EAGAIN when writing on the first attempt,
    forcing a call to WaitWrite(), which blocks inside the poller.

    at this point i can't speculate any further. is there packet loss
    involved? is the connection tcp or udp?

    hope this helps at least a bit.

    --
    You received this message because you are subscribed to the Google Groups "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/groups/opt_out.
    --
    You received this message because you are subscribed to the Google Groups "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/groups/opt_out.
  • Ingo Oeser at Nov 4, 2013 at 6:46 pm
    Oops, didn't reply to all:

    Ok, I can see that goroutine 7 waits on goroutine 5 to return from the
    syslog. What I don't understand is,
    why goroutine 5 doesn't return with an error or making progress writing? I
    use only one global syslog object.
    Am I supposed to use more than one?

    The io.Copy loop works as intended. We checked it with stopping, starting
    syslog and even killing it.
    It will always recover beautifully. It is no problem, if it loses a message
    due to short-writes, but it should always recover
    from syslog problems. syslog itself is not the problem, as other calls of
    the same program make progress.


    On Mon, Nov 4, 2013 at 7:04 PM, Dave Cheney wrote:

    I agree. I think you've isolated the problem to syslog (oh how I hate
    that package).

    Additionally to what Andrey suggested this logic looks suspect

    https://github.com/Jimdo/periodicnoise/blob/master/cmd/pn/log.go#L32

    If io.Copy returns an error then you don't know the state of r, it
    almost certainly won;t be in the same place as it was before the call
    so retrying won't give you the results you want.

    On Tue, Nov 5, 2013 at 4:57 AM, andrey mirtchovski
    wrote:
    if I were to analyse the code just from the source and the trace dump
    I would say that you're being held up waiting on a network
    error/timeout.

    - if you look at the trace you posted it shows that you're reusing the
    same object in goroutines 7 and 5 (0xc200091200, the first argument to
    all methods).

    - goroutine 7 is blocked because it's waiting on a mutex inside
    writeAndRetry (syslog.go:253).

    - goroutine 5, using the same object, seems to have passed through the
    mutex (i.e., is holding it) and has made the call to write
    (syslog.go:257), which subsequently has blocked inside the network
    poller, holding the mutex hostage.

    - the network writer got EAGAIN when writing on the first attempt,
    forcing a call to WaitWrite(), which blocks inside the poller.

    at this point i can't speculate any further. is there packet loss
    involved? is the connection tcp or udp?

    hope this helps at least a bit.

    --
    You received this message because you are subscribed to the Google
    Groups "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send
    an email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/groups/opt_out.
    --
    You received this message because you are subscribed to the Google Groups "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/groups/opt_out.
  • Ingo Oeser at Nov 4, 2013 at 6:56 pm
    I understand the sequence, but I don't understand, why the poller never
    wakes up.
    Even after 14 hours (or even more) the process doesn't recover and just
    keeps hanging forever.

    The syslog is using unix domain sockets (syslog/syslog_unix.go, since code
    running under Linux)

    I will try using net.DialTimeout() in my fork of the package and sprinkle
    in a SetWriteDeadline before each write,
    to see, if this helps.

    @Dave: Could you confirm this is a go library issue (in log/syslog) and I
    should open an issue? Can you recommend a different package for syslogging?

    On Mon, Nov 4, 2013 at 7:39 PM, andrey mirtchovski wrote:

    I think this is the sequence of events for goroutine 5:

    call writeAndRetry
    obtain mutex
    calls Fprintf with the network file descriptor
    Fprintf calls netFD.Write()
    attempt to write on the network is interrupted (EAGAIN) for whatever reason
    the code for netFD.Write() makes a call to poller's WriteWait
    (/usr/lib/go/src/pkg/net/fd_unix.go:294)
    WriteWait calls Wait() inside the poller runtime, which blocks.

    this would happen if, for example, you had a dropped packet and had to
    wait for retry/tcp timeout. no idea if this is the case.
    On Mon, Nov 4, 2013 at 11:34 AM, Ingo Oeser wrote:
    Ok, I can see that goroutine 7 waits on goroutine 5 to return from the
    syslog. What I don't understand is,
    why goroutine 5 doesn't return with an error or making progress writing? I
    use only one global syslog object.
    Am I supposed to use more than one?


    On Mon, Nov 4, 2013 at 6:57 PM, andrey mirtchovski <
    mirtchovski@gmail.com>
    wrote:
    if I were to analyse the code just from the source and the trace dump
    I would say that you're being held up waiting on a network
    error/timeout.

    - if you look at the trace you posted it shows that you're reusing the
    same object in goroutines 7 and 5 (0xc200091200, the first argument to
    all methods).

    - goroutine 7 is blocked because it's waiting on a mutex inside
    writeAndRetry (syslog.go:253).

    - goroutine 5, using the same object, seems to have passed through the
    mutex (i.e., is holding it) and has made the call to write
    (syslog.go:257), which subsequently has blocked inside the network
    poller, holding the mutex hostage.

    - the network writer got EAGAIN when writing on the first attempt,
    forcing a call to WaitWrite(), which blocks inside the poller.

    at this point i can't speculate any further. is there packet loss
    involved? is the connection tcp or udp?

    hope this helps at least a bit.
    --
    You received this message because you are subscribed to the Google Groups "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/groups/opt_out.
  • Dave Cheney at Nov 4, 2013 at 7:01 pm

    On Tue, Nov 5, 2013 at 5:56 AM, Ingo Oeser wrote:
    I understand the sequence, but I don't understand, why the poller never
    wakes up.
    Even after 14 hours (or even more) the process doesn't recover and just
    keeps hanging forever.
    your syslog package does not import time, so it looks like it doesn't
    set any timeouts on writes, this would explain the problem.
    The syslog is using unix domain sockets (syslog/syslog_unix.go, since code
    running under Linux)

    I will try using net.DialTimeout() in my fork of the package and sprinkle in
    a SetWriteDeadline before each write,
    to see, if this helps.

    @Dave: Could you confirm this is a go library issue (in log/syslog) and I
    should open an issue?
    I don't use the log/syslog package so cannot confirm, but I'd suspect
    if yours is a direct copy of the 1.2 library, then it probably has the
    same timeout issues.
    Can you recommend a different package for syslogging?
    I don't know of any syslog packages, someone should definitely write one.
    On Mon, Nov 4, 2013 at 7:39 PM, andrey mirtchovski wrote:

    I think this is the sequence of events for goroutine 5:

    call writeAndRetry
    obtain mutex
    calls Fprintf with the network file descriptor
    Fprintf calls netFD.Write()
    attempt to write on the network is interrupted (EAGAIN) for whatever
    reason
    the code for netFD.Write() makes a call to poller's WriteWait
    (/usr/lib/go/src/pkg/net/fd_unix.go:294)
    WriteWait calls Wait() inside the poller runtime, which blocks.

    this would happen if, for example, you had a dropped packet and had to
    wait for retry/tcp timeout. no idea if this is the case.

    On Mon, Nov 4, 2013 at 11:34 AM, Ingo Oeser <nightlyone@googlemail.com>
    wrote:
    Ok, I can see that goroutine 7 waits on goroutine 5 to return from the
    syslog. What I don't understand is,
    why goroutine 5 doesn't return with an error or making progress writing?
    I
    use only one global syslog object.
    Am I supposed to use more than one?


    On Mon, Nov 4, 2013 at 6:57 PM, andrey mirtchovski
    <mirtchovski@gmail.com>
    wrote:
    if I were to analyse the code just from the source and the trace dump
    I would say that you're being held up waiting on a network
    error/timeout.

    - if you look at the trace you posted it shows that you're reusing the
    same object in goroutines 7 and 5 (0xc200091200, the first argument to
    all methods).

    - goroutine 7 is blocked because it's waiting on a mutex inside
    writeAndRetry (syslog.go:253).

    - goroutine 5, using the same object, seems to have passed through the
    mutex (i.e., is holding it) and has made the call to write
    (syslog.go:257), which subsequently has blocked inside the network
    poller, holding the mutex hostage.

    - the network writer got EAGAIN when writing on the first attempt,
    forcing a call to WaitWrite(), which blocks inside the poller.

    at this point i can't speculate any further. is there packet loss
    involved? is the connection tcp or udp?

    hope this helps at least a bit.
    --
    You received this message because you are subscribed to the Google Groups "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/groups/opt_out.
  • Dave Cheney at Nov 5, 2013 at 10:54 pm

    On Tue, Nov 5, 2013 at 5:56 AM, Ingo Oeser wrote:
    I understand the sequence, but I don't understand, why the poller never
    wakes up.
    Even after 14 hours (or even more) the process doesn't recover and just
    keeps hanging forever.

    The syslog is using unix domain sockets (syslog/syslog_unix.go, since code
    running under Linux)

    I will try using net.DialTimeout() in my fork of the package and sprinkle in
    a SetWriteDeadline before each write,
    to see, if this helps.

    @Dave: Could you confirm this is a go library issue (in log/syslog) and I
    should open an issue? Can you recommend a different package for syslogging?
    https://github.com/cloudflare/golog

    untested, but I think these guys know what they are doing.
    On Mon, Nov 4, 2013 at 7:39 PM, andrey mirtchovski wrote:

    I think this is the sequence of events for goroutine 5:

    call writeAndRetry
    obtain mutex
    calls Fprintf with the network file descriptor
    Fprintf calls netFD.Write()
    attempt to write on the network is interrupted (EAGAIN) for whatever
    reason
    the code for netFD.Write() makes a call to poller's WriteWait
    (/usr/lib/go/src/pkg/net/fd_unix.go:294)
    WriteWait calls Wait() inside the poller runtime, which blocks.

    this would happen if, for example, you had a dropped packet and had to
    wait for retry/tcp timeout. no idea if this is the case.

    On Mon, Nov 4, 2013 at 11:34 AM, Ingo Oeser <nightlyone@googlemail.com>
    wrote:
    Ok, I can see that goroutine 7 waits on goroutine 5 to return from the
    syslog. What I don't understand is,
    why goroutine 5 doesn't return with an error or making progress writing?
    I
    use only one global syslog object.
    Am I supposed to use more than one?


    On Mon, Nov 4, 2013 at 6:57 PM, andrey mirtchovski
    <mirtchovski@gmail.com>
    wrote:
    if I were to analyse the code just from the source and the trace dump
    I would say that you're being held up waiting on a network
    error/timeout.

    - if you look at the trace you posted it shows that you're reusing the
    same object in goroutines 7 and 5 (0xc200091200, the first argument to
    all methods).

    - goroutine 7 is blocked because it's waiting on a mutex inside
    writeAndRetry (syslog.go:253).

    - goroutine 5, using the same object, seems to have passed through the
    mutex (i.e., is holding it) and has made the call to write
    (syslog.go:257), which subsequently has blocked inside the network
    poller, holding the mutex hostage.

    - the network writer got EAGAIN when writing on the first attempt,
    forcing a call to WaitWrite(), which blocks inside the poller.

    at this point i can't speculate any further. is there packet loss
    involved? is the connection tcp or udp?

    hope this helps at least a bit.
    --
    You received this message because you are subscribed to the Google Groups "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/groups/opt_out.
  • Rob Rodgers at Jul 25, 2014 at 3:51 am
    We are running into it as well. It is trivial to reproduce.

    SETUP:

    Take a stock Ubuntu 13.10 image.
    Boot it in vmware or whatever, we're using VmWare fusion on Macs and
    allocating 4 cores and 6GB of DRAM.
    Download and install Go 1.3.
    export GOMAXPROCS="8"
    Compile the following test program: http://play.golang.org/p/Hc-Mn7bWs1
    run it:
          GODEBUG="schedtrace=2000,scheddetail=1" strace -ff -s 128 -v logtest
          GODEBUG="schedtrace=2000,scheddetail=1" logtest
          etc.

    It takes maybe two runs to experience the problem.

    Without having a chance to really dig too deeply, so I apologize if this is
    off, I think the behavior of runtime_pollWait is actually wrong when
    dealing with SOCK_DGRAM file descriptors at least on Linux.

    Here's a (pruned) example run.

    goroutine tester1 and goroutine tester2 are happily running along until
    tester2 attempts a write which returns EAGAIN. Immediately after that
    tester1 does the same and (for the same reason, likely as not) also hits
    EAGAIN. tester1 never recovers, it is dead:


    *[pid 12999] write(6, "<61>2014-07-24T20:34:44-07:00 r2.skyportsystems.com
    tester2[12992]: This is test message 3852.\n", 95) = -1 EAGAIN (Resource
    temporarily unavailable)*

    *[pid 12999] epoll_wait(5, {}, 128, 0) = 0*

    *[pid 12999] write(4, "<61>2014-07-24T20:34:44-07:00 r2.skyportsystems.com
    tester1[12992]: This is test message 1356.\n", 95) = -1 EAGAIN (Resource
    temporarily unavailable)*

    *[pid 12999] epoll_wait(5, {}, 128, 0) = 0*

    *[pid 12999] futex(0xc20806e570, FUTEX_WAIT, 0, NULL <unfinished ...>*

    *[pid 12997] futex(0xc20801bbf0, FUTEX_WAKE, 1 <unfinished ...>*

    *[pid 12998] <... futex resumed> ) = 0*

    *[pid 12998] epoll_wait(5, {}, 128, 0) = 0*

    *[pid 12998] futex(0xc20801bbf0, FUTEX_WAIT, 0, NULL <unfinished ...>*

    *[pid 12997] <... futex resumed> ) = 1*

    *[pid 12997] epoll_wait(5, {}, 128, 0) = 0*

    *[pid 12997] futex(0xc20806e0f0, FUTEX_WAIT, 0, NULL <unfinished ...>*

    *[pid 12996] <... futex resumed> ) = 0*

    *[pid 12996] epoll_wait(5, {}, 128, 0) = 0*

    *[pid 12996] futex(0xc20801b2f0, FUTEX_WAIT, 0, NULL <unfinished ...>*

    *[pid 12993] <... futex resumed> ) = 0*

    *[pid 12993] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>*

    *[pid 12992] <... epoll_wait resumed> {{EPOLLOUT, {u32=2319856568,
    u64=139880814752696}}}, 128, -1) = 1*

    *[pid 12992] futex(0xc20801b2f0, FUTEX_WAKE, 1 <unfinished ...>*

    *[pid 12996] <... futex resumed> ) = 0*

    *[pid 12996] epoll_wait(5, {}, 128, 0) = 0*

    *[pid 12996] epoll_wait(5, <unfinished ...>*

    *[pid 12992] <... futex resumed> ) = 1*

    *[pid 12992] write(6, "<61>2014-07-24T20:34:44-07:00 r2.skyportsystems.com
    tester2[12992]: This is test message 3852.\n", 95) = -1 EAGAIN (Resource
    temporarily unavailable)*

    *[pid 12992] epoll_wait(5, {}, 128, 0) = 0*

    *[pid 12992] futex(0x5ff5d0, FUTEX_WAIT, 0, NULL <unfinished ...>*

    *[pid 12996] <... epoll_wait resumed> {{EPOLLOUT, {u32=2319856568,
    u64=139880814752696}}}, 128, -1) = 1*

    *[pid 12996] futex(0x5ff5d0, FUTEX_WAKE, 1) = 1*

    *[pid 12996] write(6, "<61>2014-07-24T20:34:44-07:00 r2.skyportsystems.com
    tester2[12992]: This is test message 3852.\n", 95) = 95*


    *[pid 12996] write(6, "<61>2014-07-24T20:34:44-07:00 r2.skyportsystems.com
    tester2[12992]: This is test message 3853.\n", 95) = 95*


    *[pid 12996] write(6, "<61>2014-07-24T20:34:44-07:00 r2.skyportsystems.com
    tester2[12992]: This is test message 3854.\n", 95) = 95*

    *[pid 12996] write(6, "<61>2014-07-24T20:34:44-07:00 r2.skyportsystems.com
    tester2[12992]: This is test message 3855.\n", 95) = 95*

    *...*


    This is the pattern you will see over and over: A hits EAGAIN, B hits
    EAGAIN, A recovers, B never recovers.

    The scheduler debug (run with GODEBUG="schedtrace=2000,scheddetail=1"
    logtest ) after the problem has occurred looks like this:

    *SCHED 2006ms: gomaxprocs=8 idleprocs=8 threads=8 idlethreads=5 runqueue=0
    gcwaiting=0 nmidlelocked=1 nmspinning=0 stopwait=0 sysmonwait=0*

    * P0: status=0 schedtick=9 syscalltick=20683 m=-1 runqsize=0 gfreecnt=0*

    * P1: status=0 schedtick=631 syscalltick=10 m=-1 runqsize=0 gfreecnt=0*

    * P2: status=0 schedtick=4 syscalltick=1185 m=-1 runqsize=0 gfreecnt=0*

    * P3: status=0 schedtick=7 syscalltick=2020 m=-1 runqsize=0 gfreecnt=0*

    * P4: status=0 schedtick=1 syscalltick=75 m=-1 runqsize=0 gfreecnt=0*

    * P5: status=0 schedtick=0 syscalltick=0 m=-1 runqsize=0 gfreecnt=0*

    * P6: status=0 schedtick=0 syscalltick=0 m=-1 runqsize=0 gfreecnt=0*

    * P7: status=0 schedtick=0 syscalltick=0 m=-1 runqsize=0 gfreecnt=0*

    * M7: p=-1 curg=-1 mallocing=0 throwing=0 gcing=0 locks=0 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1*

    * M6: p=-1 curg=-1 mallocing=0 throwing=0 gcing=0 locks=0 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1*

    * M5: p=-1 curg=-1 mallocing=0 throwing=0 gcing=0 locks=0 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1*

    * M4: p=-1 curg=-1 mallocing=0 throwing=0 gcing=0 locks=0 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1*

    * M3: p=-1 curg=17 mallocing=0 throwing=0 gcing=0 locks=0 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1*

    * M2: p=-1 curg=-1 mallocing=0 throwing=0 gcing=0 locks=0 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1*

    * M1: p=-1 curg=-1 mallocing=0 throwing=0 gcing=0 locks=1 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1*

    * M0: p=-1 curg=-1 mallocing=0 throwing=0 gcing=0 locks=0 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1*

    * G16: status=4(chan receive) m=-1 lockedm=-1*

    * G17: status=3() m=3 lockedm=-1*

    * G18: status=4(GC sweep wait) m=-1 lockedm=-1*

    * G19: status=4(finalizer wait) m=-1 lockedm=-1*

    * G20: status=4(IO wait) m=-1 lockedm=-1*

    * G21: status=4(chan send) m=-1 lockedm=-1*

    .. which will persist forever. Using a timer to work around this problem
    is actually wrong, it's hiding an underlying error.



    --
    You received this message because you are subscribed to the Google Groups "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/d/optout.
  • Rob Rodgers at Jul 25, 2014 at 4:59 am
    This exaggerated example -- http://play.golang.org/p/OBdx8ZBK49 -- fails
    even faster and exaggerates the issue.

    SCHED 10029ms: gomaxprocs=8 idleprocs=8 threads=18 idlethreads=14
    runqueue=0 gcwaiting=0 nmidlelocked=1 nmspinning=0 stopwait=0 sysmonwait=0
       P0: status=0 schedtick=13 syscalltick=14543 m=-1 runqsize=0 gfreecnt=1
       P1: status=0 schedtick=688 syscalltick=94 m=-1 runqsize=0 gfreecnt=0
       P2: status=0 schedtick=16 syscalltick=442 m=-1 runqsize=0 gfreecnt=0
       P3: status=0 schedtick=5 syscalltick=76 m=-1 runqsize=0 gfreecnt=0
       P4: status=0 schedtick=13 syscalltick=232 m=-1 runqsize=0 gfreecnt=0
       P5: status=0 schedtick=13 syscalltick=5268 m=-1 runqsize=0 gfreecnt=0
       P6: status=0 schedtick=8 syscalltick=145 m=-1 runqsize=0 gfreecnt=0
       P7: status=0 schedtick=41 syscalltick=36 m=-1 runqsize=0 gfreecnt=0
       M17: p=-1 curg=-1 mallocing=0 throwing=0 gcing=0 locks=0 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1
       M16: p=-1 curg=-1 mallocing=0 throwing=0 gcing=0 locks=0 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1
       M15: p=-1 curg=-1 mallocing=0 throwing=0 gcing=0 locks=0 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1
       M14: p=-1 curg=-1 mallocing=0 throwing=0 gcing=0 locks=0 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1
       M13: p=-1 curg=-1 mallocing=0 throwing=0 gcing=0 locks=0 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1
       M12: p=-1 curg=-1 mallocing=0 throwing=0 gcing=0 locks=0 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1
       M11: p=-1 curg=-1 mallocing=0 throwing=0 gcing=0 locks=0 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1
       M10: p=-1 curg=-1 mallocing=0 throwing=0 gcing=0 locks=0 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1
       M9: p=-1 curg=-1 mallocing=0 throwing=0 gcing=0 locks=0 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1
       M8: p=-1 curg=-1 mallocing=0 throwing=0 gcing=0 locks=0 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1
       M7: p=-1 curg=-1 mallocing=0 throwing=0 gcing=0 locks=0 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1
       M6: p=-1 curg=-1 mallocing=0 throwing=0 gcing=0 locks=0 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1
       M5: p=-1 curg=-1 mallocing=0 throwing=0 gcing=0 locks=0 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1
       M4: p=-1 curg=-1 mallocing=0 throwing=0 gcing=0 locks=0 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1
       M3: p=-1 curg=17 mallocing=0 throwing=0 gcing=0 locks=0 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1
       M2: p=-1 curg=-1 mallocing=0 throwing=0 gcing=0 locks=0 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1
       M1: p=-1 curg=-1 mallocing=0 throwing=0 gcing=0 locks=1 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1
       M0: p=-1 curg=40 mallocing=0 throwing=0 gcing=0 locks=0 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1
       G16: status=4(select) m=-1 lockedm=-1
       G17: status=3() m=3 lockedm=-1
       G18: status=4(GC sweep wait) m=-1 lockedm=-1
       G19: status=4(finalizer wait) m=-1 lockedm=-1
       G20: status=4(IO wait) m=-1 lockedm=-1
       G21: status=4(IO wait) m=-1 lockedm=-1
       G22: status=4(IO wait) m=-1 lockedm=-1
       G23: status=4(IO wait) m=-1 lockedm=-1
       G24: status=4(IO wait) m=-1 lockedm=-1
       G25: status=4(IO wait) m=-1 lockedm=-1
       G26: status=4(IO wait) m=-1 lockedm=-1
       G27: status=6() m=-1 lockedm=-1
       G28: status=4(IO wait) m=-1 lockedm=-1
       G29: status=4(IO wait) m=-1 lockedm=-1
       G30: status=4(IO wait) m=-1 lockedm=-1
       G31: status=4(IO wait) m=-1 lockedm=-1
       G32: status=4(IO wait) m=-1 lockedm=-1
       G33: status=4(IO wait) m=-1 lockedm=-1
       G34: status=4(IO wait) m=-1 lockedm=-1
       G35: status=4(IO wait) m=-1 lockedm=-1
       G36: status=4(IO wait) m=-1 lockedm=-1
       G37: status=4(IO wait) m=-1 lockedm=-1
       G38: status=4(IO wait) m=-1 lockedm=-1
       G39: status=4(IO wait) m=-1 lockedm=-1
       G40: status=3(timer goroutine (idle)) m=0 lockedm=-1
    still waiting : 0 1 2 3 4 5 6 8 9 10 11 12 13 14 15 16 17 18 19 <--- all
    of these are dead forever
    collected/done: 7
    The IO event they are waiting on will never come.

    >

    --
    You received this message because you are subscribed to the Google Groups "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/d/optout.
  • Dmitry Vyukov at Jul 25, 2014 at 11:51 am
    Can it be also related to
    https://code.google.com/p/go/issues/detail?id=6336 ? Try to rebuild go
    w/o cgo net dns resolver.
    See this thread for details and for how to build w/o cgo dns resolver:
    https://groups.google.com/forum/#!topic/Golang-nuts/dor_p1mc3WM

    On Fri, Jul 25, 2014 at 8:59 AM, Rob Rodgers wrote:
    This exaggerated example -- http://play.golang.org/p/OBdx8ZBK49 -- fails
    even faster and exaggerates the issue.

    SCHED 10029ms: gomaxprocs=8 idleprocs=8 threads=18 idlethreads=14 runqueue=0
    gcwaiting=0 nmidlelocked=1 nmspinning=0 stopwait=0 sysmonwait=0
    P0: status=0 schedtick=13 syscalltick=14543 m=-1 runqsize=0 gfreecnt=1
    P1: status=0 schedtick=688 syscalltick=94 m=-1 runqsize=0 gfreecnt=0
    P2: status=0 schedtick=16 syscalltick=442 m=-1 runqsize=0 gfreecnt=0
    P3: status=0 schedtick=5 syscalltick=76 m=-1 runqsize=0 gfreecnt=0
    P4: status=0 schedtick=13 syscalltick=232 m=-1 runqsize=0 gfreecnt=0
    P5: status=0 schedtick=13 syscalltick=5268 m=-1 runqsize=0 gfreecnt=0
    P6: status=0 schedtick=8 syscalltick=145 m=-1 runqsize=0 gfreecnt=0
    P7: status=0 schedtick=41 syscalltick=36 m=-1 runqsize=0 gfreecnt=0
    M17: p=-1 curg=-1 mallocing=0 throwing=0 gcing=0 locks=0 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1
    M16: p=-1 curg=-1 mallocing=0 throwing=0 gcing=0 locks=0 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1
    M15: p=-1 curg=-1 mallocing=0 throwing=0 gcing=0 locks=0 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1
    M14: p=-1 curg=-1 mallocing=0 throwing=0 gcing=0 locks=0 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1
    M13: p=-1 curg=-1 mallocing=0 throwing=0 gcing=0 locks=0 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1
    M12: p=-1 curg=-1 mallocing=0 throwing=0 gcing=0 locks=0 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1
    M11: p=-1 curg=-1 mallocing=0 throwing=0 gcing=0 locks=0 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1
    M10: p=-1 curg=-1 mallocing=0 throwing=0 gcing=0 locks=0 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1
    M9: p=-1 curg=-1 mallocing=0 throwing=0 gcing=0 locks=0 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1
    M8: p=-1 curg=-1 mallocing=0 throwing=0 gcing=0 locks=0 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1

    M7: p=-1 curg=-1 mallocing=0 throwing=0 gcing=0 locks=0 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1
    M6: p=-1 curg=-1 mallocing=0 throwing=0 gcing=0 locks=0 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1
    M5: p=-1 curg=-1 mallocing=0 throwing=0 gcing=0 locks=0 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1
    M4: p=-1 curg=-1 mallocing=0 throwing=0 gcing=0 locks=0 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1
    M3: p=-1 curg=17 mallocing=0 throwing=0 gcing=0 locks=0 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1
    M2: p=-1 curg=-1 mallocing=0 throwing=0 gcing=0 locks=0 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1
    M1: p=-1 curg=-1 mallocing=0 throwing=0 gcing=0 locks=1 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1
    M0: p=-1 curg=40 mallocing=0 throwing=0 gcing=0 locks=0 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1
    G16: status=4(select) m=-1 lockedm=-1

    G17: status=3() m=3 lockedm=-1
    G18: status=4(GC sweep wait) m=-1 lockedm=-1
    G19: status=4(finalizer wait) m=-1 lockedm=-1
    G20: status=4(IO wait) m=-1 lockedm=-1
    G21: status=4(IO wait) m=-1 lockedm=-1
    G22: status=4(IO wait) m=-1 lockedm=-1
    G23: status=4(IO wait) m=-1 lockedm=-1
    G24: status=4(IO wait) m=-1 lockedm=-1
    G25: status=4(IO wait) m=-1 lockedm=-1
    G26: status=4(IO wait) m=-1 lockedm=-1
    G27: status=6() m=-1 lockedm=-1
    G28: status=4(IO wait) m=-1 lockedm=-1
    G29: status=4(IO wait) m=-1 lockedm=-1
    G30: status=4(IO wait) m=-1 lockedm=-1
    G31: status=4(IO wait) m=-1 lockedm=-1
    G32: status=4(IO wait) m=-1 lockedm=-1
    G33: status=4(IO wait) m=-1 lockedm=-1
    G34: status=4(IO wait) m=-1 lockedm=-1
    G35: status=4(IO wait) m=-1 lockedm=-1
    G36: status=4(IO wait) m=-1 lockedm=-1
    G37: status=4(IO wait) m=-1 lockedm=-1
    G38: status=4(IO wait) m=-1 lockedm=-1
    G39: status=4(IO wait) m=-1 lockedm=-1
    G40: status=3(timer goroutine (idle)) m=0 lockedm=-1
    still waiting : 0 1 2 3 4 5 6 8 9 10 11 12 13 14 15 16 17 18 19 <--- all
    of these are dead forever

    collected/done: 7

    The IO event they are waiting on will never come.

    --
    You received this message because you are subscribed to the Google Groups
    "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an
    email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/d/optout.
    --
    You received this message because you are subscribed to the Google Groups "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/d/optout.
  • Dmitry Vyukov at Jul 25, 2014 at 11:52 am

    On Fri, Jul 25, 2014 at 7:51 AM, Rob Rodgers wrote:
    We are running into it as well. It is trivial to reproduce.

    SETUP:

    Take a stock Ubuntu 13.10 image.
    Boot it in vmware or whatever, we're using VmWare fusion on Macs and
    allocating 4 cores and 6GB of DRAM.
    Download and install Go 1.3.
    export GOMAXPROCS="8"
    Compile the following test program: http://play.golang.org/p/Hc-Mn7bWs1
    run it:
    GODEBUG="schedtrace=2000,scheddetail=1" strace -ff -s 128 -v logtest
    GODEBUG="schedtrace=2000,scheddetail=1" logtest
    etc.

    It takes maybe two runs to experience the problem.

    Without having a chance to really dig too deeply, so I apologize if this is
    off, I think the behavior of runtime_pollWait is actually wrong when dealing
    with SOCK_DGRAM file descriptors at least on Linux.

    What exactly is wrong about it?
    Do you know what exactly we need to fix in runtime_pollWait?



    Here's a (pruned) example run.

    goroutine tester1 and goroutine tester2 are happily running along until
    tester2 attempts a write which returns EAGAIN. Immediately after that
    tester1 does the same and (for the same reason, likely as not) also hits
    EAGAIN. tester1 never recovers, it is dead:

    [pid 12999] write(6, "<61>2014-07-24T20:34:44-07:00 r2.skyportsystems.com
    tester2[12992]: This is test message 3852.\n", 95) = -1 EAGAIN (Resource
    temporarily unavailable)

    [pid 12999] epoll_wait(5, {}, 128, 0) = 0

    [pid 12999] write(4, "<61>2014-07-24T20:34:44-07:00 r2.skyportsystems.com
    tester1[12992]: This is test message 1356.\n", 95) = -1 EAGAIN (Resource
    temporarily unavailable)

    [pid 12999] epoll_wait(5, {}, 128, 0) = 0

    [pid 12999] futex(0xc20806e570, FUTEX_WAIT, 0, NULL <unfinished ...>

    [pid 12997] futex(0xc20801bbf0, FUTEX_WAKE, 1 <unfinished ...>

    [pid 12998] <... futex resumed> ) = 0

    [pid 12998] epoll_wait(5, {}, 128, 0) = 0

    [pid 12998] futex(0xc20801bbf0, FUTEX_WAIT, 0, NULL <unfinished ...>

    [pid 12997] <... futex resumed> ) = 1

    [pid 12997] epoll_wait(5, {}, 128, 0) = 0

    [pid 12997] futex(0xc20806e0f0, FUTEX_WAIT, 0, NULL <unfinished ...>

    [pid 12996] <... futex resumed> ) = 0

    [pid 12996] epoll_wait(5, {}, 128, 0) = 0

    [pid 12996] futex(0xc20801b2f0, FUTEX_WAIT, 0, NULL <unfinished ...>

    [pid 12993] <... futex resumed> ) = 0

    [pid 12993] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>

    [pid 12992] <... epoll_wait resumed> {{EPOLLOUT, {u32=2319856568,
    u64=139880814752696}}}, 128, -1) = 1

    [pid 12992] futex(0xc20801b2f0, FUTEX_WAKE, 1 <unfinished ...>

    [pid 12996] <... futex resumed> ) = 0

    [pid 12996] epoll_wait(5, {}, 128, 0) = 0

    [pid 12996] epoll_wait(5, <unfinished ...>

    [pid 12992] <... futex resumed> ) = 1

    [pid 12992] write(6, "<61>2014-07-24T20:34:44-07:00 r2.skyportsystems.com
    tester2[12992]: This is test message 3852.\n", 95) = -1 EAGAIN (Resource
    temporarily unavailable)

    [pid 12992] epoll_wait(5, {}, 128, 0) = 0

    [pid 12992] futex(0x5ff5d0, FUTEX_WAIT, 0, NULL <unfinished ...>

    [pid 12996] <... epoll_wait resumed> {{EPOLLOUT, {u32=2319856568,
    u64=139880814752696}}}, 128, -1) = 1

    [pid 12996] futex(0x5ff5d0, FUTEX_WAKE, 1) = 1

    [pid 12996] write(6, "<61>2014-07-24T20:34:44-07:00 r2.skyportsystems.com
    tester2[12992]: This is test message 3852.\n", 95) = 95

    [pid 12996] write(6, "<61>2014-07-24T20:34:44-07:00 r2.skyportsystems.com
    tester2[12992]: This is test message 3853.\n", 95) = 95

    [pid 12996] write(6, "<61>2014-07-24T20:34:44-07:00 r2.skyportsystems.com
    tester2[12992]: This is test message 3854.\n", 95) = 95

    [pid 12996] write(6, "<61>2014-07-24T20:34:44-07:00 r2.skyportsystems.com
    tester2[12992]: This is test message 3855.\n", 95) = 95

    ...


    This is the pattern you will see over and over: A hits EAGAIN, B hits
    EAGAIN, A recovers, B never recovers.

    The scheduler debug (run with GODEBUG="schedtrace=2000,scheddetail=1"
    logtest ) after the problem has occurred looks like this:

    SCHED 2006ms: gomaxprocs=8 idleprocs=8 threads=8 idlethreads=5 runqueue=0
    gcwaiting=0 nmidlelocked=1 nmspinning=0 stopwait=0 sysmonwait=0

    P0: status=0 schedtick=9 syscalltick=20683 m=-1 runqsize=0 gfreecnt=0

    P1: status=0 schedtick=631 syscalltick=10 m=-1 runqsize=0 gfreecnt=0

    P2: status=0 schedtick=4 syscalltick=1185 m=-1 runqsize=0 gfreecnt=0

    P3: status=0 schedtick=7 syscalltick=2020 m=-1 runqsize=0 gfreecnt=0

    P4: status=0 schedtick=1 syscalltick=75 m=-1 runqsize=0 gfreecnt=0

    P5: status=0 schedtick=0 syscalltick=0 m=-1 runqsize=0 gfreecnt=0

    P6: status=0 schedtick=0 syscalltick=0 m=-1 runqsize=0 gfreecnt=0

    P7: status=0 schedtick=0 syscalltick=0 m=-1 runqsize=0 gfreecnt=0

    M7: p=-1 curg=-1 mallocing=0 throwing=0 gcing=0 locks=0 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1

    M6: p=-1 curg=-1 mallocing=0 throwing=0 gcing=0 locks=0 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1

    M5: p=-1 curg=-1 mallocing=0 throwing=0 gcing=0 locks=0 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1

    M4: p=-1 curg=-1 mallocing=0 throwing=0 gcing=0 locks=0 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1

    M3: p=-1 curg=17 mallocing=0 throwing=0 gcing=0 locks=0 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1

    M2: p=-1 curg=-1 mallocing=0 throwing=0 gcing=0 locks=0 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1

    M1: p=-1 curg=-1 mallocing=0 throwing=0 gcing=0 locks=1 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1

    M0: p=-1 curg=-1 mallocing=0 throwing=0 gcing=0 locks=0 dying=0 helpgc=0
    spinning=0 blocked=0 lockedg=-1

    G16: status=4(chan receive) m=-1 lockedm=-1

    G17: status=3() m=3 lockedm=-1

    G18: status=4(GC sweep wait) m=-1 lockedm=-1

    G19: status=4(finalizer wait) m=-1 lockedm=-1

    G20: status=4(IO wait) m=-1 lockedm=-1

    G21: status=4(chan send) m=-1 lockedm=-1

    .. which will persist forever. Using a timer to work around this problem is
    actually wrong, it's hiding an underlying error.



    --
    You received this message because you are subscribed to the Google Groups
    "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an
    email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/d/optout.
    --
    You received this message because you are subscribed to the Google Groups "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/d/optout.
  • Robert Rodgers at Jul 25, 2014 at 11:29 pm

    On Jul 25, 2014, at 4:51 AM, Dmitry Vyukov wrote:
    What exactly is wrong about it?
    Do you know what exactly we need to fix in runtime_pollWait?
    On the first, you have sockets blocked on an IO event that never occurs. I don't mean that the other side has stopped reading and you're being backpressure, I mean that the issue has long cleared and the go runtime never recovers.

    I accidentally left a run of the previous tester going overnight, so around 12 hours later it's still stuck:

      still waiting : 0 1 2 3 4 5 6 7 9 10 11 12 13 14 15 16 17 18 19
      collected/done: 8

    ... all in IO Wait. Clearly whatever we are waiting on isn't going to happen if it hasn't happened by now. Maybe we aren't actually waiting or maybe we shouldn't be waiting. It's not the caller's fault.

    Merging the two messages together. I don't think this case is DNS related; we are talking about a unix domain socket here, there is no DNS activity. It could be related in as much as there may be a general datagram sockets handling issue; like somewhere in the code either an epoll event registration didn't take place or failed (races, logic errors) or an EPOLLET event has fired but the application-side state machine has not carried out the activities necessary for it to fire again despite thinking it has (and thus waiting) - these are very common issues with epoll code.

      http://play.golang.org/p/O4Xt_fkt9q

    This does not involve any syslog code and displays the same symptoms: competitive writing to a unix domain datagram socket eventually hangs the writer. What is the expected behavior? The server is always draining; if you start another instance of the client it works as well.. for awhile.

    With regard to why a timer is wrong, the simple answer is if we are waiting on an IO event, we should be able to identify what IO event is the one that and what will satisfy the wait. In this case, nothing satisfies the event because it is either failing to register the event or the state machine is out of sync or the expectation is wrong. So someone has bad code.

    Using a timeout to solve this is a workaround, not a fix, and not a very good one:

    1. every time we hit this case (which seems to be around EAGAIN and wait) we are going to block on IOWait for some duration
    2. while blocked _there is never the case where the issue clears_, nothing useful happens
    3. return failure

    Might as well use a timeout of 1ns because at least then you're not sitting idle for no reason; either way you're failing. It's sort of like someone using libev or whatever being told to schedule a timer in their application state machine for writes because the underlying event loop driven around select() sometimes forgets to do an FD_SET for the occasional descriptor in question. It will work in a way but it is purely sweeping the underlying problem under the rug.

    There SHOULD be a timeout in the API configured by the user because unix domain sockets on Linux are blocking and syslog almost always uses a unix domain datagram socket; but that timeout should be to recover from a slow consumer blocking the application progress, *not* to hide a descriptor event management issue.

    --
    You received this message because you are subscribed to the Google Groups "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/d/optout.
  • Dmitry Vyukov at Jul 26, 2014 at 8:59 am
    Looks like https://code.google.com/p/go/issues/detail?id=5932
    One of the reproducer there (http://play.golang.org/p/vu12qIyPil) is
    very similar to this one.
    Let's move the discussion there.

    On Sat, Jul 26, 2014 at 3:29 AM, Robert Rodgers wrote:
    On Jul 25, 2014, at 4:51 AM, Dmitry Vyukov wrote:
    What exactly is wrong about it?
    Do you know what exactly we need to fix in runtime_pollWait?
    On the first, you have sockets blocked on an IO event that never occurs. I don't mean that the other side has stopped reading and you're being backpressure, I mean that the issue has long cleared and the go runtime never recovers.

    I accidentally left a run of the previous tester going overnight, so around 12 hours later it's still stuck:

    still waiting : 0 1 2 3 4 5 6 7 9 10 11 12 13 14 15 16 17 18 19
    collected/done: 8

    ... all in IO Wait. Clearly whatever we are waiting on isn't going to happen if it hasn't happened by now. Maybe we aren't actually waiting or maybe we shouldn't be waiting. It's not the caller's fault.

    Merging the two messages together. I don't think this case is DNS related; we are talking about a unix domain socket here, there is no DNS activity. It could be related in as much as there may be a general datagram sockets handling issue; like somewhere in the code either an epoll event registration didn't take place or failed (races, logic errors) or an EPOLLET event has fired but the application-side state machine has not carried out the activities necessary for it to fire again despite thinking it has (and thus waiting) - these are very common issues with epoll code.

    http://play.golang.org/p/O4Xt_fkt9q

    This does not involve any syslog code and displays the same symptoms: competitive writing to a unix domain datagram socket eventually hangs the writer. What is the expected behavior? The server is always draining; if you start another instance of the client it works as well.. for awhile.

    With regard to why a timer is wrong, the simple answer is if we are waiting on an IO event, we should be able to identify what IO event is the one that and what will satisfy the wait. In this case, nothing satisfies the event because it is either failing to register the event or the state machine is out of sync or the expectation is wrong. So someone has bad code.

    Using a timeout to solve this is a workaround, not a fix, and not a very good one:

    1. every time we hit this case (which seems to be around EAGAIN and wait) we are going to block on IOWait for some duration
    2. while blocked _there is never the case where the issue clears_, nothing useful happens
    3. return failure

    Might as well use a timeout of 1ns because at least then you're not sitting idle for no reason; either way you're failing. It's sort of like someone using libev or whatever being told to schedule a timer in their application state machine for writes because the underlying event loop driven around select() sometimes forgets to do an FD_SET for the occasional descriptor in question. It will work in a way but it is purely sweeping the underlying problem under the rug.

    There SHOULD be a timeout in the API configured by the user because unix domain sockets on Linux are blocking and syslog almost always uses a unix domain datagram socket; but that timeout should be to recover from a slow consumer blocking the application progress, *not* to hide a descriptor event management issue.
    --
    You received this message because you are subscribed to the Google Groups "golang-nuts" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
    For more options, visit https://groups.google.com/d/optout.

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupgolang-nuts @
categoriesgo
postedNov 1, '13 at 10:56p
activeJul 26, '14 at 8:59a
posts15
users5
websitegolang.org

People

Translate

site design / logo © 2022 Grokbase