C++ Coroutine Theory (2017)

pokoleo · on Feb 6, 2018

Waterloo's CS 343 (concurrency) course uses coroutines as a stepping block towards understanding concurrent programming.

Notes from the course are fantastic, coroutines start here: https://www.student.cs.uwaterloo.ca/~cs343/documents/notes.p...

3uclid · on Feb 6, 2018

Would you recommend CS 343?

I'm disappointed the course teaches μC++ rather than referencing C++11 threading support.

legojoey17 · on Feb 7, 2018

It has been one of my favourite courses because it really dives into the fundamentals and implementation of how concurrency models work, is very learnable but still applicable.

The primitives uC++ make it very easy to form the concurrency models that are present in other languages and these days the prof does exactly that and shows exactly how to implement common models, such as channels, actors, and a few others.

Everything done is very mappable to how other languages provide concurrency.

elvinyung · on Feb 7, 2018

Actually, IIRC 343 does explore C++11 concurrency constructs (see section 11.4.5 of [1]) at some point, as well as concurrency models in other runtimes.

As others have mentioned, uC++ is necessary because other runtimes lack the complete set of concurrency constructs that the course explores. It's essentially the path of least resistance to introducing CS students to all the different flavors of concurrency.

[1] https://www.student.cs.uwaterloo.ca/~cs343/documents/notes.p...

AnthonyCalandra · on Feb 7, 2018

If it helps I had the same reservation about the course but ended up enjoying uC++. Some things like coroutines, monitors, etc. do not have C++ equivalents yet so there's part of the reason. I found uC++ to be a really useful learning tool.

z0r · on Feb 6, 2018

i vaguely remember the course being fine when i took it about a decade ago. teaching concurrency using such a unique language is a little strange but the reasoning is transferable. in practice you'll probably end up having to learn the specifics of whatever platform you might use if you end up professionally writing concurrent code.

setheron · on Feb 7, 2018

I liked that course a lot. Many of the ideas were easily transferable.

I would love to have applied test driven development to my courses.

psyc · on Feb 6, 2018

Do C++ coroutines give you a sane call stack when chained? I’ve only worked with proprietary implementations that do not, and it is an utter catastrophe for maintenance once they have infested a large code base.

wahern · on Feb 6, 2018

There's something truly perverse about the way languages are recapitulating the evolution of call stacks in the form of coroutines[1] and promises.

Erlang, Go, Lua, and Scheme get this right. Stackless Python never caught on :( But Java may get proper stackful coroutines in the near future.

[1] Stackless coroutines, which means you can't yield across nested function invocations. The unfortunately named Stackless Python actually implements stackful coroutines. "Stackless" in Stackless Python refers to not implementing the Python call stack on the C ABI stack. Stackful coroutines are basically threads (sometimes called microthreads or fibers to distinguish from C ABI/kernel thread construct) with a language-level construct for directly passing control to another thread.

Rusky · on Feb 6, 2018

There's nothing about stackless coroutines that means you can't have good stack traces. For example, C# already does it, at least to some degree.

Stackful coroutines are clearly a viable tool, but they don't work in all use cases. They require either segmented stacks, a precise GC, or memory usage comparable to kernel threads. They are tricky to implement correctly on Windows. Etc.

In my ideal world, we'd have stackless coroutines with great debugger support everywhere, with languages free to experiment with the syntax- explicit suspension points, implicit suspension points, effect polymorphism to make it look like you're yielding across nested function calls, etc...

wahern · on Feb 6, 2018

  > They require either segmented stacks, a precise GC

Which is _exactly_ what stackless coroutines and promises do in a very roundabout manner. Some languages are moving toward annotations to automate chaining, but the problem with having to explicitly annotate coroutines is that you no longer have (or can have) first-class functions; at best you now have multiple classes of functions that only interoperate seamlessly with their own kind, which is the opposite of first-class functions. Plus it's much slower than just using a traditional call stack.

Implementing transparently growable or moveable stacks can be difficult, yes. Solving C ABI FFI issues is a headache. And languages that stack-allocate variables but cannot easily move them are in a real pickle. Though, they can do as Go and only stack-allocate variables that don't have their address taken, and in any event that only applies to C++ and Rust. There's no excuse for all the other modern languages. Languages like Python and JavaScript don't have stackful coroutines because of short-sighted implementation decisions that are now too costly to revisit. Similarly, Perl 6 doesn't officially have them because they prematurely optimized their semantics for targets like the JVM where efficient implementations were thought to be difficult. (Moar VM implements stackful coroutines to support gather/take, which shows that it was simply easier to implement the more powerful construct in order to support the less powerful gather/take construct.)

If it were easy we wouldn't have stackless coroutines at all because they're objectively inferior in every respect, and absent external constraints (beholden to the C stack) can result in less memory usage and fewer wasted CPU cycles in both the common and edge cases. But both PUC Lua and LuaJIT do it properly and are among the fastest interpreted and JIT'd implementations, respectively, so I think the difficulty is exaggerated.

I understand why these other constructs exist, but I still think it's perverse. At some point we should just revisit and revise the underlying platform ABI so we can get to a place where implementing stackful coroutines is easier for all languages.

For example, the very same debugging information you might add to improve stack traces can be used by implementations to help, say, move objects. Make that mandatory as part of the ABI and a lot of cool things become possible, including easy reflection in compiled languages. DWARF is heavyweight and complex, but Solaris (and now FreeBSD and OpenBSD) support something called Compact C Type Format (CTF) for light-weight type descriptions, which shows how system ABIs could usefully evolve.

Newer languages shouldn't be tying themselves to incidental semantics from 40 years ago. Rather, they should be doing what hardware and software engineers were doing 40 years ago when they defined the primary semantics--properly abstract call stacks/call state into a first-class construct (i.e. thread of control), while simultaneously pushing the implementation details down so they can be performant.

pcwalton · on Feb 7, 2018

I think you're misunderstanding the benefit of stackless coroutines. Here's the core of the issue: Stacks, by definition, are dynamically growable objects. But sometimes (in most cases?) the state that needs to be stored across yield points can be statically bounded up front. In that case, stackful coroutines (really, threads) add unnecessary overhead, because you pay for that ability to grow. Golang, for instance, has to start stacks small and grow them by copying, because it can't reliably statically analyze the goroutine to determine the needed stack space. A properly-implemented stackless coroutine system (or even the classic C approach of writing state machines by hand), by contrast, can determine the amount of space necessary for each coroutine and allocate it precisely. There are even slab allocation tricks for many coroutines of the same size. Fundamentally, this is why manually-written C code using epoll can outperform any stackful coroutine system: optimal allocation of stackful coroutines in a language with recursion and first-class functions requires higher-order control flow analysis, which is generally considered infeasible.

Of course, I've only focused on the performance issues here, and obviously there are also important issues of ergonomics at play. But my point is that there are benefits to stackless coroutines that can't be simply dismissed.

Rusky · on Feb 6, 2018

I'm not at all convinced. Stackless coroutines have a statically-known fixed stack size, which can be treated as an anonymous type a la C++ or Rust closures. This is far cheaper and easier to optimize than any possible implementation of stackful coroutines, regardless of what your platform ABI looks like.

Like I said, stackful coroutines are a useful tool. They just can't be the only implementation strategy, especially in C++/Rust-like languages.

zwieback · on Feb 6, 2018

This seems like a super-interesting discussion but I can't quite follow due to the mixed-up naming. Is there a good paper or website that clarifies what stackful/less means in the various contexts?

wahern · on Feb 6, 2018

The distinction are difficult to grasp, and it doesn't help that meanings evolve. For example, many people today understand "thread" to mean a preemptively scheduled kernel resource with a contiguous region of memory used for storing objects and return addresses, which shares an address space with other such threads. That sort of "thread" is what 10 years ago many people called a LWP (lightweight process), when thread didn't imply preemptive scheduling. And what was a "thread" 10 years ago is often called a "fiber" or "microthread" today. Though I'm coming from the world of Unix. I think connotations had a slightly different evolution in the world of Windows, largely because Unix emphasized the "process" (concurrent execution, non-shared address space) and modern concurrent execution with shared address space were only formalized in the late 1990s, whereas Windows didn't get isolated virtual address spaces until Windows 95, so the primary construct was always a "thread" (with or without preemptive execution, depending on era).

When I say thread in my previous post I'm referring to the abstract construct that represents the flow of control of sequentially executed instructions, regardless of the underlying implementation, and regardless of how you initiate, yield, or resume that control. For languages that support function recursion that necessarily implies some sort of stack (if not multiple stacks) for storing the state of intermediate callers that have invoked another function. Often such a stack is also used to store variables, both as a performance optimization and because many (perhaps most objects) in a program have a lifetime that naturally coincides with the execution of the enclosing function.

Such a "call stack" has historically been implemented as a contiguous region of memory, but some hardware (i.e. IBM mainframes) implement call stacks as linked-lists of activation frames, which effectively is the data structure you're creating when you chain stackless coroutines.

The two best sources I've found that help untangle the mess in both theoretical and practical terms are

* Revisiting Coroutines, a 2004 paper that helped to renew interest in coroutines. (http://www.inf.puc-rio.br/~roberto/docs/MCC15-04.pdf)

* The proposal to add fibers to the JVM. (http://cr.openjdk.java.net/~rpressler/loom/Loom-Proposal.htm...)

zwieback · on Feb 6, 2018

Excellent, thanks. I was first introduced to threads in OS/2 and then Windows so for me Thread still means preemptive + stack + shared address space (vs. process which has it's own address space.) I've run into single-stack multitasking in the context of RTOSes (like this: https://github.com/djboni/librertos) but then it becomes cooperative so really a different beast.

The explanation of using a separate data structure to track activation frames is really the key thing for me here, otherwise I can't see how coroutines would work well. I guess there's cases, as mentioned in the original article, where the C++ compiler can determine that a coroutine could be hosted inside the current stack frame but that's really a special case.

lboasso · on Feb 6, 2018

A good start could be "Coroutines in Lua" [1]. It explains well the terminology and makes comparisons between designs in different languages.

[1] http://www.inf.puc-rio.br/~roberto/docs/corosblp.pdf

BeeOnRope · on Feb 6, 2018

How do the terms in that paper map to the stackful/stackless distinction being discussed above? For example neither term appears in the paper at all, except for one mention of "Stackless Python" as a name.

lboasso · on Feb 6, 2018

I should have linked the other paper by Ierusalimschy, "Revisiting Coroutines". wahern linked it in his reply. They are both good reads to better understand coroutines.

zwieback · on Feb 6, 2018

Great, thanks. That is a good start.

ioquatix · on Feb 7, 2018

I appreciate all your points. However, in my experience implementing coroutines and fibers, where I allocated stacks using `mmap`, with guard pages, I'm not completely convinced that fixed-size stacks are an issue:

1/ The typical C stack is fixed size and code is written accordingly.

2/ `mmap` can allocate virtual memory, so you can allocate a large stack but that's an upper limit, it might be paged out if unused.

Thoughts?

WillReplyfFood · on Feb 7, 2018

How about fixed size stacks- with a canary of a expansionfunction at the top? That way you can have it both- the speed and fixed size- and expandability on demand- by simply checking for the canary in a pattern and executing it ..

bonzini · on Feb 7, 2018

Stackful coroutines are trivial to implement on Windows---it supports "fibers" natively.

Rusky · on Feb 8, 2018

Trivial to use the included fiber API; impossible to implement yourself without the risk of a future OS update breaking your implementation.

avodonosov · on Feb 7, 2018

References about Java? I was looking for coroutines / CPS for Java recently and found no JSRs and no plans to include it into the language anytime soon.

Matthias247 · on Feb 7, 2018

I guess this might be about Project Loom? Pron (here on HN, author of Quasar) is leading that effort.

ioquatix · on Feb 6, 2018

In my experience it depends on the tooling - both the compiler needs to allow for insertion of the necessary debugging symbols, and the debugger needs to interpret them correctly.

I'd like to improve this situation, right now the following markers are commented out because I've had issues with them.

https://github.com/kurocha/concurrent/blob/8334ecf758b43eac3...

gumby · on Feb 6, 2018

Given the semantics of coroutines, what would you consider a sane call stack? Should each routine have its own stack?

This is not a glib question, I genuinely don't know what the proper answer might be. Perhaps what's needed is a different debugger mechanism?

spc476 · on Feb 6, 2018

My own implementation of coroutines for C [1] requires each coroutine to have its own stack. It simplifies things quite a bit.

[1] https://github.com/spc476/C-Coroutines [2]

[2] More of a "proof-of-concept" than something really meant for production.

jayd16 · on Feb 6, 2018

>I’ve only worked with proprietary implementations that do not

Hmm, what does this mean? I've used coroutines in Unity and to that respect, yielding functions in C# as well as generators in Python. The stacks are sane in the sense that they include the current running iteration.

Is that tricky to implement? Are you expecting something different? Seems pretty straight forward to me.

paulddraper · on Feb 6, 2018

IDK, but stack space is a big reason people say "threads are heavy". There is a trade-off.

vvanders · on Feb 6, 2018

Threads are heavy because of the context switches and mutexes involved. Coroutines can be very lightweight(see Lua for instance).

BeeOnRope · on Feb 6, 2018

It depends what you mean by "heavy", but the common example is something like thread-per-request or implementing async IO using threads that make blocking calls.

In those context threads are primary "heavy" because each thread needs a dedicated stack, so there is a pretty low limit on the total number of threads you can have based on your available memory, so you run into scaling limitations whenever you'd naturally want to have 100 of thousands or millions of such threads.

The CPU use of context switches is a secondary factor there and probably pales compared to IO costs - and some competing designs also have context switch costs.

Threads were also heavy in the sense that if they are OS scheduled, the behavior often degrades with large numbers of threads, depending on the implementation. Recent scheduler designs are mostly able to mitigate this, however.

KeepFlying · on Feb 6, 2018

Dumb question, but is this something that a dev would expect the compiler to do for us automatically depending on what is deemed most efficient or is this something that a developer can write directly into the code to make it work this way?

I want to be sure I understand what is going on here to be sure.

Can someone offer an example of where this ability would be particularly useful?

Twisol · on Feb 6, 2018

Coroutines are quite nice for managing sessions with asynchronous events, like stateful GUIs or client-server interactions. If you're familiar with the tendency of callbacks in e.g. JavaScript to nest deeply, you can think of coroutines as a way to recover a flat, procedural style.

I'm a particular fan of how coroutines work in Lua. Here's an article that helps explain them a bit in that context: http://leafo.net/posts/itchio-and-coroutines.html

vvanders · on Feb 6, 2018

Yeah, they're awesome for doing sequences of events that can have temporal gaps in the middle.

We used to use them(Lua) in games to do scripted sequences and AI. Was simple enough that even our designers could edit/extend them.

gumby · on Feb 6, 2018

Read the wikipedia page on coroutines first. It's a fairly basic concept in computer science.

It's a different model of control structure, primarily when you have a thing that produces data and a thing that consumes it. It could be in multiple threads, multiple machines, or just a single thread, so is often included in multiprocessing portions of designs, standards and instruction.

ioquatix · on Feb 6, 2018

Actually, a coroutine is just a more generalised function - it's a function which retains it's state over multiple calls.

I recommend watching this video, which has a great overview. https://www.youtube.com/watch?v=_fu0gx-xseY

Splines · on Feb 7, 2018

I've seen useful usage of coroutines where you have a task you want to run synchronously on the main thread, but you want it to yield time without unwinding the whole stack.

So you create another thread, have the main thread yield time to it, and at fixed points in this other thread yield back to the main thread. When the main thread yields to this other thread, all it's context is maintained and you don't need to "walk up" the stack to restore the work this thread was doing.

At a high level, this solves the class of problems where you need to do background tasks in applications that aren't thread safe. There are probably other places where these work well, but this is just my experience.

RajuVarghese · on Feb 7, 2018

Modula-2, one of the early languages with coroutines, had a pretty simple implementation. With NEWCOROUTINE a new coroutine was created (including the heap memory that would function as a workspace for that coroutine), TRANSFER to transfer control from one coroutine to another and IOTRANSFER to do the same but for interrupts. With these one could design a scheduler and off you went!

I had built a coroutine system for a Pascal environment by implementing NEWCOROUTINE and TRANSFER. Both turned out to be pretty simple in assembly language. The workspace contained an area for the CPU registers and the stack. So TRANSFER involved saving the registers of one coroutine in the workspace and restoring the registers from the second.

steveklabnik · on Feb 6, 2018

The visualization here is excellent; exactly the kind of thing I've been meaning to look up.

eptcyka · on Feb 6, 2018

Seems like this would be more like tokio for Rust rather than Go's goroutines.

steveklabnik · on Feb 6, 2018

Yes, and we're exploring a similar thing in Rust right now, though we call them "generators."

A key question right now is, can we have an implementation where the heap allocations that occur here don't have to? It's not 100% clear.

tmandry · on Feb 7, 2018

The post says that if a compiler can prove "that the lifetime of the coroutine is indeed strictly nested within the lifetime of the caller", it would be able to avoid heap allocations. It makes it sound like an optional optimization, though.

I think Rust clearly has an advantage here: since lifetimes are well-defined concept in the language itself, it should also be well-defined whether a generator can be allocated on the stack. Of course, I'm probably missing some hairy edge cases here.

Another aspect I found interesting was the ability to define custom code which runs every time a certain coroutine is suspended, or returns. In my reading, this allows for the coroutine to e.g. manage its own membership in a resume queue.

Tokio tackles this problem differently, but is this more powerful than Rust coroutines? Is there a mechanism by which a generic type could be used to wrap a generator and provide this functionality in a similar way? Asking partly as a point of conversation, and partly as someone who isn't fully steeped in Rust's type system yet.

steveklabnik · on Feb 7, 2018

Yes, the issue here is what happens on moves?

https://boats.gitlab.io/blog/post/2018-01-25-async-i-self-re... and the two follow up posts are the latest thinking on this topic, and by latest, I mean "posted in the last few days". This is cutting-edge stuff!

The intention is for this stuff to support Tokio; that is, generators make asyc/await work, which lets you write futures code more easily, to be run with Tokio.

tmandry · on Feb 8, 2018

Interesting. The fourth post from today seems very promising. It gives me increasing confidence that Rust has chosen its code abstractions very well, to know that we can provably solve this problem without heap allocations.

The lifetime binding tricks feel reminiscent of template metaprogramming-fu in C++, though the parts that have to be reasoned about past the standard library seem smaller and cleaner.

In Rust, it feels like we are leveraging our cleverness to accomplish a higher order of expression than what’s possible in other imperative languages today. In 20 years, maybe we’ll be talking about a promising new language concept that reminds of how exciting Rust lifetimes were back in 2018 :)

GolDDranks · on Feb 7, 2018

It's interesting that people refer the same thing with "generators", "semicoroutines" or "stackless coroutines". (A resumable computation that doesn't have a growable stack)

On the other hand, "stackful coroutines" and "coroutines" without a qualifier often refers things with dynamically growing stacks.

It seems to me that the C++ Coroutines refer to the former, which likely complicates the terminology further.

tpush · on Feb 7, 2018

They are all specific restrictions on and/or implementations details of delimited continuations anyway :).

pjmlp · on Feb 7, 2018

Because coroutines are the building blocks of generators.

Being stackless is an implementation detail, even if a relevant one.

mark_edward · on Feb 7, 2018

As I understand, generators are a subset of coroutines. Generators can only yield to their caller, but coroutines can yield to other coroutines. coroutines are more general. they can be implemented on top of generators with some fancy stuff i don't remember involving returning tokens to some kind of scheduler/dispatcher in the caller

jnordwick · on Feb 6, 2018

Does anybody know if there has been some research and experimentation with inlining? It is probably the most important part of an optimizing compiler, and coroutines would seem to make that very difficult or impossible.

lucozade · on Feb 7, 2018

I'd recommend looking at the work the LLVM team have done on this. [0] for example and there is a Youtube video from one of the LLVM confs if I recall correctly.

In a nutshell they are still researching this but their general approach is to split the coroutine up and devirtualise/inline the parts where they can.

[0] https://llvm.org/devmtg/2016-11/Slides/Nishanov-LLVMCoroutin...