I spent a while on the Stackless Python email lists around 2004, so I like
to think I understand that project's goals and implementation decently.
I've also been on this list long enough that I like to think I understand
the same for Go.
Stackless Go does not make much sense.
The goal of Stackless Python was to make it possible to block a Python
coroutine at any point in a computation and start running another. This was
non-trivial because the Python interpreter implemented language constructs
like function calls by invoking the interpreter loop recursively, not by
maintaining an explicit separate stack. That is, each Python call frame
ended up being a C call frame in the interpreter. This is a perfectly fine
way to write an interpreter, and it no doubt simplifies some things, but it
means you can't just pause one execution context (task, goroutine,
coroutine, lightweight thread, fiber, whatever name you like) and start
another, because some of the execution state is on the C stack, which is
hard to save and restore. To address this, the Stackless Python project
moved all the Python execution state into separate allocation chains: an
explicit Python stack per execution context. Once that's been done, the C
interpreter can run one execution context for a while, and then stop and
run another. Because all the execution state is in data structures, it's
now trivial to flip between them.
Even at the time, there were other possible implementation strategies. For
example, swtch.com/libtask is a simple coroutine library for C. It would
probably have been less work to flip between multiple Python stacks managed
by that library, but not portably: it would have required
architecture-specific assembly and also guessing at the size of the stack
required, since arbitrary C library calls are being made. The approach
taken by Stackless Python does make a lot of sense. If it had been done
from the start of the Python interpreter it would have been pretty easy,
and it's completely portable. Because there was a large existing
interpreter not written that way, it turned out to be a lot of work.
Go obviously shares with Stackless Python the goal of having many
lightweight execution contexts (goroutines). However, Go also accepts that
the underlying machine hardware is optimized for use of call and return
instructions on a hardware-managed stack. It uses architecture-specific
assembly to do stack switches, and it avoids guessing at the size of the
stack by using a sequence of stack segments, allocated on demand and freed
when no longer in use. In the 'Stackless Python' sense, Go is already
stackless: it does not put any per-goroutine state on the main C stack, and
as such can flip between goroutines with ease.
The cons you list for Go's current approach are not fixed by explicit
allocations for each function frame. The first is "checked use of stack
space (split on demand) costs some cycles on every call invocation." It
costs one or two machine instructions. Explicit allocation and deallocation
on every function frame is like splitting at every call. It is guaranteed
to be more expensive. Also, since you don't get to use the hardware's jump
predictor for call and return, those transitions will be slower even
ignoring the frame management. The second is "not compatible with C stacks
(cgo has to start a new thread at the worlds borders because of this)."
This is not true. Cgo does not create new threads. It switches from the
per-goroutine stack to the significantly larger per-thread stack, so that
the C program will not run out of space. The stack switch is super cheap,
just saving and loading a pair of registers on the x86. The calling of the
C routine does require some scheduler bookkeeping to avoid deadlocks. That
is inherent to the scheduling, regardless of stack management.
If you want to do something about the 4 kB per goroutine stack overhead
(and remember that most C thread implementations measure stack size in
megabytes), you don't need to throw the whole thing away and start over.
That's almost certainly counterproductive. Put some logic into the linker
or compiler to analyze the call graph and decide on smaller frames for
computations that cannot possibly use 4 kB. Or ratchet the default stack
size down to 2 kB or 1 kB.
Rewriting everything to do explicit per-call frame allocation may be fun
and instructive, but it is unlikely to be faster.