Re: coroutines in games more generally, you can look to older console games (NES/SNES-era) for some good examples. Earlier games on the C64 et al were—as is commonly explained—purely event-loop based; but by the mid-1980s, the paradigm had switched to using a hardcoded coroutine-scheduled actor model.
If you've never read it, the Super Mario Bros 1 disassembly[1] is extremely enlightening. It feels a lot like Erlang, with individual parallel "process" actors being responsible for updating the physics of each sprite on screen—but instead of a "process" being an abstraction handled by a scheduler, it's just an entry-point function that calls a finite-state-machine-state function, then either calls a hardwired next "process" function, or a trampoline function that calls whatever is left to run in that game state. Effectively, it's a "direct-threaded next-jmp-prefetched" implementation of scheduling.
The coroutines are regular threads. The library provides an interface for syncing that makes them behave as if they were "real" coroutines. This avoids common problems you run into when building coroutines in C++, which doesn't support coroutines natively by itself - but it supports threads very well.
Pros: Very easy to implement and use. Will likely work very well anywhere without much thought.
Cons: Resource hungry. Not really designed for thousands of simultaneous coroutines.
Yeah, we'd traditionally use coroutines per-actor so this would be way too heavyweight.
It's really nice for scripting things as you can test a bunch of conditions, yield(), and then re-test without needing to build an explicit state machine. Much easier for designers to grok.
Usually done in Lua, which is a fantastic language for gamedev (low overhead, fast, coroutine support, and it integrates very nicely with C/C++).
Question: couldn't you use OS fibers to do "coroutines" without being resource intensive? I know all major OS kernels have supported fibers for some time...
If I remember correctly, that's an additional interface for usermode-initiated, cooperative scheduling of kernel threads (basically a variant of pthread_yield that allows specifying which thread is going to get the timeslice we are relinquishing).
It still requires a syscall, and the entities being scheduled are proper kernel-level threads.
Directed graphs are also a great way to create scripted dialogues and quests for games. Visually, a DG for your game dev is quite powerful and makes it very easy to reason about what's going on in the game story. I don't think the same can be said for coroutines hardcoded in C++. In addition, it's easy to find a cycle in a graph; I'm not sure the same can be said for a coroutine.
Here's a site that might be of interest: http://twinery.org/
One big problem with using coroutines for "scripting" long-running events in games that I don't often see mentioned is what to do - if anything - about saving their state. For cutscenes this can be ignored for sure, but what if you tie something like a long conversation tree into a coroutine?
For this example library, if you're making some sort of traditional adventure game and you wanted the player to be able to save the game, you'd have to restrict saving to checkpoints, use a loop with a switch (practically moving back to the state-machine approach), or just split the conversation up into multiple coroutines as it progresses.
And of course, there's the whole "what to do when you need to patch the game" problem on top of all that.
Yeah, I've had the exact same problem while exploring coroutines. Stackless Python is the only implementation I'm aware of which allows you to trivially serialize its microthreads.
In the end, I prefer to just use mostly "stateless" Lua scripts, with all the saveable data managed in C++.
As a lot of people have mentioned Lua ITT, there's a library called Pluto[1] which can serialize coroutine states. It was used in the game Fable to store quest statuses[2].
I recommend having a look at the source code for ScummVM. The original SCUMM engine, first used in Maniac Mansion, implemented its own VM with cooperative multithreading, way back on the C64.
Basically, the game ran "scripts" (containing custom bytecode), each given its own "slot" with space for locals. Each script is run in turn, and each script must voluntarily relinquish control when appropriate.
Of course, this is a bit more advanced than simple coroutines. If you're going to the effort of creating your own embedded scripting language, you may as well use an existing language like Lua.
Quick question for you C++ gurus: Why doesn't C++ have native generators? Is there some technical reason that makes it hard to compile such a feature, or is there a political reason?
Plenty of lightweight thread systems exist for C++. You're thinking about the problem the wrong way. You don't really want generators: you want fibers. Once you have fibers, generators are trivial to implement. One implementation of fibers for Unix-like systems is GNU nPth.
Not exactly. I'm not familiar with nPth specifically, but it appears it uses swapcontext, which is a lot slower than doing it in an event loop the proper way. Like, 1000 times slower (on Linux)[1]. Having proper generators and extendable event loops is a far cleaner solution to the problem, in my humble opinion. Python's asyncio is perhaps the best implementation of such an event loop I've seen.
nPth can use swapcontext if everything else fails, but its preferred portable switching strategy is the so-called sigaltstack trick, which is usually significantly faster.
BTW, I'm the author of the above page. Coroutine libraries have been my personal hobby since forever.
I'm a bit out of my wheelhouse here, I usually do C and rarely touch C++, and in addition the coding standard I have to adhere to prevents setjmp/longjmp, but doesn't setjmp need to be used in conjunction with longjmp, which makes it completely unsafe to use malloc, fopen, etc? Also, it would prevent RAII from running.
getcontext and setcontext are fundamentally about saving and restoring CPU registers. setjmp and longjmp are also about saving and restoring CPU registers, and they (and related functions) give you much finer-grained control over whether you want to save the signal mask. Since saving and restoring the signal mask is the objectionable part of setcontext, setjmp is a decent alternative.
If you want to unwind, just unwind. setjmp and setcontext neither help nor hurt you here.
Fair enough. I still have doubts about whether that would scale up to millions or billions of coroutines, but I will defer to your knowledge in that it's probably Good Enough™ for most purposes. Thanks for the explanation!
It is actually pretty easy: stackless generators desugar to a switch statement. Microsoft's implementation of their C++ coroutine proposal already demonstrates that. The problem is that, as currently specified, it requires a memory allocation which can't always be optimised out, and that has proven contentious with many.
Say function A calls into generator B, which has entry points X,Y,Z. Maybe you could generate code for functions A<X>, A<Y> and A<Z>, specialised on the entry point into B. When B yields, it actually "calls" the appropriate A, but does something funny with the stack pointer. Maybe?
The generated code would explode if you had too many coroutines going at once, and you need a lot of things to be known at compile time, and maybe the specific things you'd need to do to the stack would slow you down. I don't think you'd have to mess with the return address, though, so maybe it wouldn't be so bad.
Yes of course. At that point any standard optimisation that can be applied to normal code is applicable.
In this case you wouldn't expect any call at all: the switch statement is inlined in the caller and standard constant propagation can remove the switch itself.
[1] https://gist.github.com/1wErt3r/4048722