> They require either segmented stacks, a precise GC
If it were easy we wouldn't have stackless coroutines at all, because they're objectively inferior in every respect; stackful coroutines, absent external constraints (being beholden to the C stack), can result in less memory usage and fewer wasted CPU cycles in both the common and the edge cases. But both PUC Lua and LuaJIT implement stackful coroutines properly and are among the fastest interpreted and JIT'd implementations, respectively, so I think the difficulty is exaggerated.
I understand why these other constructs exist, but I still think it's perverse. At some point we should just revisit and revise the underlying platform ABI so we can get to a place where implementing stackful coroutines is easier for all languages.
For example, the very same debugging information you might add to improve stack traces can be used by implementations to help, say, move objects. Make that mandatory as part of the ABI and a lot of cool things become possible, including easy reflection in compiled languages. DWARF is heavyweight and complex, but Solaris (and now FreeBSD and OpenBSD) supports something called the Compact C Type Format (CTF) for lightweight type descriptions, which shows how system ABIs could usefully evolve.
Newer languages shouldn't be tying themselves to incidental semantics from 40 years ago. Rather, they should be doing what hardware and software engineers were doing 40 years ago when they defined the primary semantics--properly abstracting call stacks/call state into a first-class construct (i.e. a thread of control), while simultaneously pushing the implementation details down so they can be performant.
Of course, I've only focused on the performance issues here, and obviously there are also important issues of ergonomics at play. But my point is that there are benefits to stackless coroutines that can't be simply dismissed.
Like I said, stackful coroutines are a useful tool. They just can't be the only implementation strategy, especially in C++/Rust-like languages.
When I say "thread" in my previous post, I'm referring to the abstract construct that represents the flow of control of sequentially executed instructions, regardless of the underlying implementation and regardless of how you initiate, yield, or resume that control. For languages that support function recursion, that necessarily implies some sort of stack (if not multiple stacks) for storing the state of intermediate callers that have invoked another function. Often such a stack is also used to store variables, both as a performance optimization and because many (perhaps most) objects in a program have a lifetime that naturally coincides with the execution of their enclosing function.
Such a "call stack" has historically been implemented as a contiguous region of memory, but some hardware (e.g. IBM mainframes) implements call stacks as linked lists of activation frames, which is effectively the data structure you're creating when you chain stackless coroutines.
The two best sources I've found that help untangle the mess in both theoretical and practical terms are
* Revisiting Coroutines, a 2004 paper that helped to renew interest in coroutines. (http://www.inf.puc-rio.br/~roberto/docs/MCC15-04.pdf)
* The proposal to add fibers to the JVM. (http://cr.openjdk.java.net/~rpressler/loom/Loom-Proposal.htm...)
The explanation of using a separate data structure to track activation frames is really the key thing for me here; otherwise I can't see how coroutines would work well. I suppose there are cases, as mentioned in the original article, where the C++ compiler can determine that a coroutine can be hosted inside the current stack frame, but that's really a special case.
1/ The typical C stack is fixed-size, and code is written accordingly.
2/ `mmap` can allocate virtual memory, so you can reserve a large stack up front, but that reservation is only an upper limit: pages that are never touched are never backed by physical memory.