I suspect a race condition.
We have a TCP server that loops over a map of a large number of clients (which
connect and disconnect frequently), broadcasting a message to each. The code was
compiled with Go 1.3, GOARCH amd64.
This very rare error happened once under a load spike:
1405195804.885505 [debug] Go server: unexpected fault address 0x6d50
1405195804.890221 [debug] Go server: fatal error: fault
1405195804.890240 [debug] Go server: [signal 0xb code=0x1 addr=0x6d50 pc=0x4230ea]
1405195804.890254 [debug] Go server: goroutine 3927233 [running]:
1405195804.890268 [debug] Go server: runtime.throw(0x9356e2)
1405195804.890281 [debug] Go server: /usr/local/go/src/pkg/runtime/panic.c:520 +0x69 fp=0x2aaaad38fd00
1405195804.890349 [debug] Go server: sp=0x2aaaad38fce8
1405195804.890364 [debug] Go server: runtime.sigpanic()
1405195804.890378 [debug] Go server: /usr/local/go/src/pkg/runtime/os_linux.c:240 +0x13f fp=0x2aaaad38fd18 sp=0x2aaaad38fd00
1405195804.890393 [debug] Go server: hash_next(0x2aaaad38ff20)
1405195804.890406 [debug] Go server: /usr/local/go/src/pkg/runtime/hashmap.goc:707 +0x50a fp=0x2aaaad38fdb0 sp=0x2aaaad38fd18
1405195804.890422 [debug] Go server: runtime.mapiternext(0x2aaaad38ff20)
1405195804.890436 [debug] Go server: /usr/local/go/src/pkg/runtime/hashmap.goc:1048 +0x12 fp=0x2aaaad38fdd8 sp=0x2aaaad38fdb0
1405195804.890451 [debug] Go server: _/nail/build/imbuild/work/tagservgo/server.(*Server).broadcastWorker(0xc208028380, 0xc208055183, 0x9, 0x0, 0xc20805518d, 0x247)
1405195804.890465 [debug] Go server: /nail/build/imbuild/work/tagservgo/server/Server.go:161 +0x374 fp=0x2aaaad38ff78 sp=0x2aaaad38fdd8
1405195804.890479 [debug] Go server: runtime.goexit()
1405195804.890492 [debug] Go server: /usr/local/go/src/pkg/runtime/proc.c:1445 fp=0x2aaaad38ff80 sp=0x2aaaad38ff78
1405195804.890505 [debug] Go server: created by _/nail/build/imbuild/work/tagservgo/server.(*Server).WriteToGroup
1405195804.890519 [debug] Go server: /nail/build/imbuild/work/tagservgo/server/Server.go:184 +0xc2
1405195804.890532 [debug] Go server: goroutine 16 [runnable]:
The relevant code is:
158 func (s *Server) broadcastWorker(groupName string, shard int, msg string) {
159     toCtr := 0
160     numClients := len(s.Groups[groupName][shard])
161     for client := range s.Groups[groupName][shard] { // <-- fault reported here (Server.go:161)
162         err := s.writeToClient(client, msg)
163         if err != nil {
164             s.Debugln("Error in broadcast: ", err)
165             if strings.Contains(err.Error(), "i/o timeout") {
166                 toCtr++
167                 s.removeClient(client, false)
168             }
169         }
170     }

181 func (s *Server) WriteToGroup(groupName string, msg string) {
182     numShards := len(s.Groups[groupName])
183     for shard := 0; shard < numShards; shard++ {
184         go s.broadcastWorker(groupName, shard, msg)
185     }
186 }
The map s.Groups[name][shard] is modified from other goroutines when
clients join or leave. I wonder whether this error is a consequence of a
race: ranging over the map while entries are concurrently being added or
removed. If so, putting s.lock() around the for loop in broadcastWorker
carries a significant performance penalty (measured in earlier stress
testing). Would making a deep copy of the s.Groups map be a better idea?
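
To make the copy idea concrete, here is a minimal sketch of what I have in mind. It is not our real code: Client, Shard, NewShard, Add, Remove and Snapshot are hypothetical names, and it assumes each shard gets its own sync.RWMutex. The read lock is held only while the client pointers are copied into a slice (a shallow copy, which seems sufficient here), so the slow network writes would run outside the lock.

```go
package main

import (
	"fmt"
	"sync"
)

// Client is a hypothetical stand-in for a connected client.
type Client struct{ ID int }

// Shard is a hypothetical wrapper pairing one shard's client set with its own lock.
type Shard struct {
	mu      sync.RWMutex
	clients map[*Client]struct{}
}

func NewShard() *Shard { return &Shard{clients: make(map[*Client]struct{})} }

// Add and Remove take the write lock, so they never overlap with a Snapshot copy.
func (sh *Shard) Add(c *Client) {
	sh.mu.Lock()
	sh.clients[c] = struct{}{}
	sh.mu.Unlock()
}

func (sh *Shard) Remove(c *Client) {
	sh.mu.Lock()
	delete(sh.clients, c)
	sh.mu.Unlock()
}

// Snapshot copies the current client pointers into a slice.
// The read lock is held only for the duration of the copy.
func (sh *Shard) Snapshot() []*Client {
	sh.mu.RLock()
	defer sh.mu.RUnlock()
	out := make([]*Client, 0, len(sh.clients))
	for c := range sh.clients {
		out = append(out, c)
	}
	return out
}

func main() {
	sh := NewShard()
	a, b := &Client{ID: 1}, &Client{ID: 2}
	sh.Add(a)
	sh.Add(b)

	// Range over the copy, not the live map; the broadcast writes
	// would go here, with no lock held.
	snap := sh.Snapshot()
	for _, c := range snap {
		if c.ID == 2 {
			sh.Remove(c) // stands in for removeClient after a write timeout
		}
	}
	fmt.Println(len(snap), len(sh.Snapshot())) // prints "2 1"
}
```

The cost is one slice allocation per shard per broadcast, and a client that disconnects after the snapshot is taken may still be written to once, so writeToClient would have to tolerate a closed connection.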