I have been reading several discussions about the penalty on external (non-Go)
API calls caused by switching to a regular stack on each external call. It
seems to me that the growable stacks of goroutines, which are great for
asynchronous (parallel) servers, are no good for other purposes such as
programming games or writing regular desktop applications that call some
external API a lot. This can effectively make such a Go app spend most of its
time converting back and forth between stacks.

So I am wondering whether it would make sense to provide a kind of "standard
stack" build (linker?) mode that disables the growable stack feature
altogether for the built binary, so that external API calls, e.g. OpenGL, are
basically toll-free. The trade-off is that using many goroutines becomes
limited, because the standard stack is simply too big to handle many
(thousands of) goroutines.

If I understand correctly, adding the stack check preamble to every function
is a Go linker, not compiler, feature, so such a "-standard-stack" option
would not require having separate packages in pkg/ for growable and standard
stacks.

I am sorry if this was already discussed previously, but I couldn't find such
a discussion here on golang-dev.

-- Adam

--
You received this message because you are subscribed to the Google Groups "golang-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-dev+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


  • Minux at Mar 23, 2016 at 9:45 pm
    The overhead of cgo calls is definitely not because of switching stacks;
    that is just switching a few registers. If switching stacks were the main
    bottleneck, then switching goroutines would see similar slowness.

    The overhead comes from scheduler coordination (goroutines in cgo calls are
    not counted towards GOMAXPROCS).

  • Keith Randall at Mar 23, 2016 at 9:59 pm
    Adam, if you have a particular benchmark in mind, we'd love to see it.
    It's much easier to reason about these things in the concrete instead of
    the abstract.


  • Minux at Mar 23, 2016 at 10:09 pm


    Over one year ago, I filed #9704 for the cgocall performance issue.

  • Adam Strzelecki at Mar 24, 2016 at 12:15 am

    Keith Randall wrote:

    Adam, if you have a particular benchmark in mind, we'd love to see it.
    It's much easier to reason about these things in the concrete instead of
    the abstract.
    I am thinking particularly about the numbers given in the 2013 discussion
    at https://groups.google.com/d/msg/golang-nuts/RTtMsgZi88Q/ebPUKSFsF8UJ.
    The consensus was that the overhead is not that big as long as you don't
    make millions of cgo calls, which is not at all obvious, e.g. when using
    some physics engine or bridging Go to some desktop API like Qt or Cocoa.

    Extending the example from #9704 with a benchmark for a normal Go call, I
    get 2.2ns per Go call vs 190ns per cgo call.

    // run me with: go run bench.go -test.bench=.
    package main

    // int rand() { return 42; }
    import "C"

    import "testing"

    func BenchmarkCgoCall(b *testing.B) {
            for i := 0; i < b.N; i++ {
                    C.rand()
            }
    }

    //go:noinline
    func rand() int {
            switch {
            } // don't inline (go<=1.6)
            return 42
    }

    func BenchmarkGoCall(b *testing.B) {
            for i := 0; i < b.N; i++ {
                    rand()
            }
    }

    func main() {
            testing.Main(func(string, string) (bool, error) {
                    return true, nil
            }, nil, []testing.InternalBenchmark{
                    {"BenchmarkCgoCall", BenchmarkCgoCall},
                    {"BenchmarkGoCall", BenchmarkGoCall},
            }, nil)
    }

    Now what if rand() is some tiny but vital function, such as a vector
    multiplication in some physics engine? The call overhead can be higher
    than the time to execute the tiny function itself, and the only choice is
    to rewrite parts of the engine in Go.
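    One standard workaround is to batch work so one cgo call covers many
    elements. The sketch below just does the arithmetic with the two figures
    measured above (190ns fixed call overhead, 2.2ns of useful work); the
    notion of a batched C API here is hypothetical, not something the engine
    in question necessarily offers:

    ```go
    package main

    import "fmt"

    // Figures from the benchmark above, in nanoseconds: the fixed per-cgo-call
    // overhead and the cost of the tiny function itself.
    const (
            callOverheadNS = 190.0
            workNS         = 2.2
    )

    // amortizedCost is the effective per-element cost if a hypothetical
    // batched C API processed `batch` elements in a single cgo call.
    func amortizedCost(batch int) float64 {
            return callOverheadNS/float64(batch) + workNS
    }

    func main() {
            for _, n := range []int{1, 16, 256, 4096} {
                    fmt.Printf("batch=%5d: %6.1f ns/element\n", n, amortizedCost(n))
            }
    }
    ```

    Even modest batch sizes push the per-element cost close to the 2.2ns of
    the pure Go call, which is why bindings for physics or graphics libraries
    tend to expose array-at-a-time entry points.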

    minux wrote:
    The overhead of cgo calls is definitely not because of switching stacks;
    that is just switching a few registers. If switching stacks were the main
    bottleneck, then switching goroutines would see similar slowness.
    I think such a switch isn't really a problem when it happens once every
    few milliseconds to switch goroutines, but it can be a problem when it
    occurs on every cgo call in some tight loop of Go code calling an
    external API.

    Of course, all of this is just my random rambling about Go applications
    other than high-performance (web) servers.

  • Minux at Mar 24, 2016 at 12:29 am

    I did an experiment with Go tip:
    $ go run issue9704.go -test.bench=1
    testing: warning: no tests to run
    BenchmarkCgo-4 10000000 144 ns/op
    PASS

    Then I modified cgocall to remove all scheduler interactions:

    $ go run issue9704.go -test.bench=1
    testing: warning: no tests to run
    BenchmarkCgo-4 100000000 15.8 ns/op
    PASS

    It's definitely more heavyweight than a direct call, but ~90% of the cgo
    call overhead comes from scheduler interaction rather than from the stack
    switch.
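    For the record, the ~90% figure follows directly from the two ns/op
    measurements above (a trivial check):

    ```go
    package main

    import "fmt"

    // schedulerShare computes what fraction of the measured cgo call cost
    // disappears once scheduler interactions are removed.
    func schedulerShare(total, stripped float64) float64 {
            return 100 * (total - stripped) / total
    }

    func main() {
            // 144 ns/op with the stock runtime, 15.8 ns/op with scheduler
            // interactions removed (figures quoted from this thread).
            fmt.Printf("scheduler share of cgo overhead: ~%.0f%%\n",
                    schedulerShare(144, 15.8))
    }
    ```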

  • Austin Clements at Mar 24, 2016 at 12:35 am
    Here's the perf output I get for BenchmarkCgoCall on my laptop:

         13.04% cgo cgo [.] runtime/internal/atomic.Cas
         10.26% cgo cgo [.] runtime.deferreturn
          7.79% cgo cgo [.] runtime.newdefer
          6.23% cgo cgo [.] runtime.cgocall
          5.86% cgo cgo [.] runtime.systemstack
          5.65% cgo cgo [.] runtime.casgstatus
          5.55% cgo cgo [.] main.BenchmarkCgoCall
          4.87% cgo cgo [.] runtime.reentersyscall
          4.71% cgo cgo [.] runtime.freedefer
          4.38% cgo cgo [.] runtime/internal/atomic.Store
          3.92% cgo cgo [.] main._Cfunc_rand
          3.83% cgo cgo [.] runtime.deferproc
          3.24% cgo cgo [.] runtime.exitsyscallfast
          3.01% cgo cgo [.] runtime.exitsyscall
          2.88% cgo cgo [.] runtime.deferproc.func1
          1.99% cgo cgo [.] runtime.getcallerpc
          1.76% cgo cgo [.] runtime.memmove
          1.66% cgo cgo [.] runtime.asmcgocall
          1.62% cgo cgo [.] runtime.entersyscall
          1.46% cgo cgo [.] runtime.getcallersp
          1.41% cgo cgo [.] runtime.unlockOSThread

    Less than 2% of BenchmarkCgoCall's time is spent in the actual stack switch
    (asmcgocall). Most of the time appears to be spent dealing with the defer
    in cgocall (deferreturn, newdefer, systemstack, freedefer, deferproc). Most
    of the remaining time is doing atomic CAS and store operations, which is
    probably from manipulating scheduler state.

    So, optimizing the stack switch is unlikely to help. However, it's probably
    possible to improve the overhead from the defer. It may also be possible to
    improve the overhead from scheduler interaction, but that's probably harder
    to do without changing semantics.

  • Minux at Mar 24, 2016 at 12:42 am

    On Wed, Mar 23, 2016 at 8:35 PM, Austin Clements wrote:


    Less than 2% of BenchmarkCgoCall's time is spent in the actual stack
    switch (asmcgocall). Most of the time appears to be spent dealing with the
    defer in cgocall (deferreturn, newdefer, systemstack, freedefer,
    deferproc). Most of the remaining time is doing atomic CAS and store
    operations, which is probably from manipulating scheduler state.

    So, optimizing the stack switch is unlikely to help. However, it's
    probably possible to improve the overhead from the defer. It may also be
    possible to improve the overhead from scheduler interaction, but that's
    probably harder to do without changing semantics.
    Indeed, if I remove the "defer endcgo(mp)" and instead call endcgo(mp) at
    the end of the function, the cgocall time is reduced from 144ns/op to
    63.7ns/op. (We can't just remove the defer this way, though; it would
    break panic/recover in Go->C->Go call sequences.)

    Why is defer this slow? This benchmark:
    //go:noinline
    func defers() (r int) {
             defer func() {
                     r = 42
             }()
             return 0
    }
    func BenchmarkDefer(b *testing.B) {
             for i := 0; i < b.N; i++ {
                     defers()
             }
    }

    This shows that calling defers() takes 77.7ns/op on my system. Should I
    file an issue?
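    The gap is easy to reproduce without a separate test binary by driving
    testing.Benchmark from main. A minimal sketch (same shape as the defers()
    function above; names and the direct-assignment baseline are illustrative):

    ```go
    package main

    import (
            "fmt"
            "testing"
    )

    // withDefer mirrors the defers() benchmark above: the deferred closure
    // writes the named result, so a _defer record must be set up per call.
    //go:noinline
    func withDefer() (r int) {
            defer func() { r = 42 }()
            return 0
    }

    // withoutDefer does the same work with a plain return, as a baseline.
    //go:noinline
    func withoutDefer() (r int) {
            return 42
    }

    func main() {
            d := testing.Benchmark(func(b *testing.B) {
                    for i := 0; i < b.N; i++ {
                            withDefer()
                    }
            })
            p := testing.Benchmark(func(b *testing.B) {
                    for i := 0; i < b.N; i++ {
                            withoutDefer()
                    }
            })
            fmt.Printf("with defer: %d ns/op, without: %d ns/op\n",
                    d.NsPerOp(), p.NsPerOp())
    }
    ```

    The absolute numbers will vary by machine and Go version; the point is
    the ratio between the two.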

  • Minux at Mar 24, 2016 at 12:56 am

    On Wed, Mar 23, 2016 at 8:41 PM, minux wrote:

    Why is defer this slow? This benchmark:
    We're already caching _defers in per-P caches, but I'm wondering why we
    can't change the compiler so that it allocates the _defer on the stack
    (and also fills in the defer arguments directly).

    At least we could do this for the usual case, where a function has a
    bounded number of defers.

  • Austin Clements at Mar 24, 2016 at 1:11 am
    This does seem a lot more expensive than it should be, at least for such a
    simple case. I'm not sure how high priority this is, but go ahead and file
    an issue so we can at least keep track of it.
  • Ian Lance Taylor at Mar 23, 2016 at 10:10 pm

    Gccgo can work this way (just compile your non-Go code with
    -fsplit-stack). It doesn't help as much as one would like, because you
    still have to coordinate with the Go scheduler. What you need is a
    combination of your proposal with a way to mark a cgo call as
    non-blocking. Then a call would, in principle, only require going through
    the code that loads the parameters into the registers that C expects. Of
    course, if you make a mistake in your cgo annotations, your program will
    break.

    Ian

  • Mura at Mar 24, 2016 at 2:07 am
    I'd also be happy to have simplified C-ABI call support without the need
    for a C compiler and additional boilerplate code.
    Something like Julia's *ccall* would be great, but I am afraid that it's
    impossible for Go, since it requires macros and compile-time code
    generation, AFAICT.

    In Go, the API I would expect looks like:

    cqsort := cabi.New("libc", "qsort")
    or
    cqsort := cabi.New("libc", "qsort", a_slice_supplied_as_call_stack)

    The type signature of the Call method is like
    func (c *ctx) Call(args ...int) int
    or
    func (c *ctx) Call(args ...interface{}) int

    And the usage is like
    cqsort.Call(base, count, size, mycomp)

    The "Call" method would interact with the scheduler and therefore be
    slower.

    Then we can have a fast version of Call without interactions with the
    scheduler:

    func (c *ctx) Fastcall(args ...int) int
    or
    func (c *ctx) NonBlockingCall(args ...int) int



    Please bear in mind I do not have broad/deep knowledge about programming
    language design and Go's internals so these are just thoughts from a user's
    perspective.
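    To make the proposed shape concrete, here is a compilable sketch of that
    API surface. It is purely illustrative: no dynamic loading actually
    happens, the foreign function is stubbed with a Go func, and all names
    and semantics are assumptions, not an existing package:

    ```go
    package main

    import "fmt"

    // ctx is a hypothetical handle to a foreign symbol. A real implementation
    // would resolve sym from lib (dlopen/dlsym style); this stub just records
    // the name and delegates to a Go func standing in for the C function.
    type ctx struct {
            symbol string
            fn     func(args ...int) int
    }

    // New would look up sym in lib; here the caller supplies the stub.
    func New(lib, sym string, fn func(args ...int) int) *ctx {
            return &ctx{symbol: lib + ":" + sym, fn: fn}
    }

    // Call is the slow path: a real version would notify the scheduler
    // (entersyscall/exitsyscall) around the foreign call.
    func (c *ctx) Call(args ...int) int { return c.fn(args...) }

    // Fastcall is the proposed fast path for known-non-blocking calls,
    // skipping scheduler interaction entirely.
    func (c *ctx) Fastcall(args ...int) int { return c.fn(args...) }

    func main() {
            cabs := New("libc", "abs", func(args ...int) int {
                    if args[0] < 0 {
                            return -args[0]
                    }
                    return args[0]
            })
            fmt.Println(cabs.Call(-5), cabs.Fastcall(7)) // prints: 5 7
    }
    ```

    The interesting design question is the one raised above: whether the
    non-blocking promise lives in the call site (Fastcall) or in an
    annotation on the symbol itself.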

  • Seb Binet at Mar 24, 2016 at 8:34 am

    This looks a lot like dlopen+ffi:
      https://github.com/gonuts/ffi

    (In my "copious" free time, I am slowly working on implementing the
    "plugin" package on top of that: https://github.com/sbinet/go-plugin)

    -s


Discussion Overview
group: golang-dev
categories: go
posted: Mar 23, '16 at 9:32p
active: Mar 24, '16 at 8:34a
posts: 13
users: 7
website: golang.org
