Hacker News new | comments | show | ask | jobs | submit login

  > They require either segmented stacks, a precise GC
Which is _exactly_ what stackless coroutines and promises do in a very roundabout manner. Some languages are moving toward annotations to automate chaining, but the problem with having to explicitly annotate coroutines is that you no longer have (or can have) first-class functions; at best you now have multiple classes of functions that only interoperate seamlessly with their own kind, which is the opposite of first-class functions. Plus it's much slower than just using a traditional call stack.

Implementing transparently growable or moveable stacks can be difficult, yes. Solving C ABI FFI issues is a headache. And languages that stack-allocate variables but cannot easily move them are in a real pickle. Though, they can do as Go and only stack-allocate variables that don't have their address taken, and in any event that only applies to C++ and Rust. There's no excuse for all the other modern languages. Languages like Python and JavaScript don't have stackful coroutines because of short-sighted implementation decisions that are now too costly to revisit. Similarly, Perl 6 doesn't officially have them because they prematurely optimized their semantics for targets like the JVM where efficient implementations were thought to be difficult. (Moar VM implements stackful coroutines to support gather/take, which shows that it was simply easier to implement the more powerful construct in order to support the less powerful gather/take construct.)

If it were easy we wouldn't have stackless coroutines at all because they're objectively inferior in every respect, and absent external constraints (beholden to the C stack) can result in less memory usage and fewer wasted CPU cycles in both the common and edge cases. But both PUC Lua and LuaJIT do it properly and are among the fastest interpreted and JIT'd implementations, respectively, so I think the difficulty is exaggerated.

I understand why these other constructs exist, but I still think it's perverse. At some point we should just revisit and revise the underlying platform ABI so we can get to a place where implementing stackful coroutines is easier for all languages.

For example, the very same debugging information you might add to improve stack traces can be used by implementations to help, say, move objects. Make that mandatory as part of the ABI and a lot of cool things become possible, including easy reflection in compiled languages. DWARF is heavyweight and complex, but Solaris (and now FreeBSD and OpenBSD) support something called Compact C Type Format (CTF) for light-weight type descriptions, which shows how system ABIs could usefully evolve.

Newer languages shouldn't be tying themselves to incidental semantics from 40 years ago. Rather, they should be doing what hardware and software engineers were doing 40 years ago when they defined the primary semantics--properly abstract call stacks/call state into a first-class construct (i.e. thread of control), while simultaneously pushing the implementation details down so they can be performant.

I think you're misunderstanding the benefit of stackless coroutines. Here's the core of the issue: Stacks, by definition, are dynamically growable objects. But sometimes (in most cases?) the state that needs to be stored across yield points can be statically bounded up front. In that case, stackful coroutines (really, threads) add unnecessary overhead, because you pay for that ability to grow. Golang, for instance, has to start stacks small and grow them by copying, because it can't reliably statically analyze the goroutine to determine the needed stack space. A properly-implemented stackless coroutine system (or even the classic C approach of writing state machines by hand), by contrast, can determine the amount of space necessary for each coroutine and allocate it precisely. There are even slab allocation tricks for many coroutines of the same size. Fundamentally, this is why manually-written C code using epoll can outperform any stackful coroutine system: optimal allocation of stackful coroutines in a language with recursion and first-class functions requires higher-order control flow analysis, which is generally considered infeasible.

Of course, I've only focused on the performance issues here, and obviously there are also important issues of ergonomics at play. But my point is that there are benefits to stackless coroutines that can't be simply dismissed.

I'm not at all convinced. Stackless coroutines have a statically-known fixed stack size, which can be treated as an anonymous type a la C++ or Rust closures. This is far cheaper and easier to optimize than any possible implementation of stackful coroutines, regardless of what your platform ABI looks like.

Like I said, stackful coroutines are a useful tool. They just can't be the only implementation strategy, especially in C++/Rust-like languages.

This seems like a super-interesting discussion but I can't quite follow due to the mixed-up naming. Is there a good paper or website that clarifies what stackful/less means in the various contexts?

The distinction are difficult to grasp, and it doesn't help that meanings evolve. For example, many people today understand "thread" to mean a preemptively scheduled kernel resource with a contiguous region of memory used for storing objects and return addresses, which shares an address space with other such threads. That sort of "thread" is what 10 years ago many people called a LWP (lightweight process), when thread didn't imply preemptive scheduling. And what was a "thread" 10 years ago is often called a "fiber" or "microthread" today. Though I'm coming from the world of Unix. I think connotations had a slightly different evolution in the world of Windows, largely because Unix emphasized the "process" (concurrent execution, non-shared address space) and modern concurrent execution with shared address space were only formalized in the late 1990s, whereas Windows didn't get isolated virtual address spaces until Windows 95, so the primary construct was always a "thread" (with or without preemptive execution, depending on era).

When I say thread in my previous post I'm referring to the abstract construct that represents the flow of control of sequentially executed instructions, regardless of the underlying implementation, and regardless of how you initiate, yield, or resume that control. For languages that support function recursion that necessarily implies some sort of stack (if not multiple stacks) for storing the state of intermediate callers that have invoked another function. Often such a stack is also used to store variables, both as a performance optimization and because many (perhaps most objects) in a program have a lifetime that naturally coincides with the execution of the enclosing function.

Such a "call stack" has historically been implemented as a contiguous region of memory, but some hardware (i.e. IBM mainframes) implement call stacks as linked-lists of activation frames, which effectively is the data structure you're creating when you chain stackless coroutines.

The two best sources I've found that help untangle the mess in both theoretical and practical terms are

* Revisiting Coroutines, a 2004 paper that helped to renew interest in coroutines. (http://www.inf.puc-rio.br/~roberto/docs/MCC15-04.pdf)

* The proposal to add fibers to the JVM. (http://cr.openjdk.java.net/~rpressler/loom/Loom-Proposal.htm...)

Excellent, thanks. I was first introduced to threads in OS/2 and then Windows so for me Thread still means preemptive + stack + shared address space (vs. process which has it's own address space.) I've run into single-stack multitasking in the context of RTOSes (like this: https://github.com/djboni/librertos) but then it becomes cooperative so really a different beast.

The explanation of using a separate data structure to track activation frames is really the key thing for me here, otherwise I can't see how coroutines would work well. I guess there's cases, as mentioned in the original article, where the C++ compiler can determine that a coroutine could be hosted inside the current stack frame but that's really a special case.

A good start could be "Coroutines in Lua" [1]. It explains well the terminology and makes comparisons between designs in different languages.

[1] http://www.inf.puc-rio.br/~roberto/docs/corosblp.pdf

How do the terms in that paper map to the stackful/stackless distinction being discussed above? For example neither term appears in the paper at all, except for one mention of "Stackless Python" as a name.

I should have linked the other paper by Ierusalimschy, "Revisiting Coroutines". wahern linked it in his reply. They are both good reads to better understand coroutines.

Great, thanks. That is a good start.

I appreciate all your points. However, in my experience implementing coroutines and fibers, where I allocated stacks using `mmap`, with guard pages, I'm not completely convinced that fixed-size stacks are an issue:

1/ The typical C stack is fixed size and code is written accordingly.

2/ `mmap` can allocate virtual memory, so you can allocate a large stack but that's an upper limit, it might be paged out if unused.


How about fixed size stacks- with a canary of a expansionfunction at the top? That way you can have it both- the speed and fixed size- and expandability on demand- by simply checking for the canary in a pattern and executing it ..

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact