
Coroutines in C, revisited - codr4life
http://vicsydev.blogspot.com/2016/11/coroutines-in-c-revisited.html
======
markpapadakis
This is a C++ coroutines framework ([https://github.com/markpapadakis/coros-fibers](https://github.com/markpapadakis/coros-fibers)), but it's trivial to
“port” it to C.

I've also written a few things about coroutines and fibers (e.g.
[https://medium.com/software-development-2/coroutines-and-fib...](https://medium.com/software-development-2/coroutines-and-fibers-why-and-when-5798f08464fd#.qz59082w0)
and
[https://medium.com/software-development-2/high-performance-s...](https://medium.com/software-development-2/high-performance-services-using-coroutines-ac8e9f54d727#.1s8lfoswb)).

Coroutines based on stack manipulation (as opposed to simple FSMs) are
extremely powerful and are particularly useful for implementing fair
scheduling across runnable ‘processes’ and concurrent execution contexts. Fair
scheduling in particular can be really important in the domain of distributed
systems that need to process potentially long-running and/or expensive
requests, which would otherwise stall the processing of other, cheaper
requests indefinitely. Combined with async I/O, this gives near-optimal
concurrency and resource utilisation. Coroutines are often associated
(literally and figuratively) with futures/promises. See the SeaStar framework
for a great implementation of those ideas.

~~~
marktangotango
>> Coroutines based on stack manipulation(as opposed to simple FSMs) are
extremely powerful and are particularly useful for implementing fair
scheduling across runnable ‘processes’ and concurrent execution contexts.

So, isn't this what threads are for? This is not a snarky comment, genuine
interest: isn't this what the OS scheduler and thread implementations are
meant to do, and isn't implementing essentially one's own threads and
scheduling the epitome of 'reinventing the wheel'? Specifically, when going
down the route of using a coroutine to essentially duplicate functionality
provided by native threads, is there any performance advantage? My
understanding is that, nowadays, OS threads and schedulers are very highly
optimized for performance.

Other than being able to write concurrent code with yield() and resume(), is
there really any reason NOT to use OS-level primitives?

~~~
markpapadakis
That's a good question, and the answer has to do with, obviously, the cost of
spawning OS threads/processes, context-switching among them, and how all that
affects caches (cache hit/miss ratio dominates most workloads nowadays, more
so than actual CPU instruction execution). A context switch can cost on
average 30µs of CPU overhead, but can reach as high as 1500µs. (You can
reduce that by pinning threads to cores, but the savings will likely be at
most around 20% - maybe.) Thankfully now, at least on Linux, AFAIK, a context
switch doesn't result in a TLB flush - back in the day that was quite
expensive. Whenever the OS scheduler needs to context switch, it needs to save
the old state (including registers), restore another state, and perform some
other kind of book-keeping. This is more expensive and convoluted than doing
it yourself -- you also usually don't need to care about all possible
registers, so you can reduce switching costs by e.g. not touching the SSE
registers. Apps that create too many threads are constantly fighting for CPU
time - you will end up wasting a lot of CPU cycles just cycling between
threads.

Another hidden cost that can severely impact server workloads is that, after
your thread has been switched out, even once it becomes runnable again, it
will need to wait in the kernel's run queue until a CPU core is available for
it. Linux kernels are often compiled with HZ=100, which means that processes
are given time slices of 10ms. If your thread has been switched out, but
becomes runnable almost immediately, and there are two threads ahead of it in
the run queue waiting for CPU time, your thread may have to wait up to 20ms in
a worst-case scenario to get CPU time. Depending on the average length of the
run queue (reflected in the system load average), and how long threads
typically run before getting switched out again, this can considerably affect
performance.

Also, kernels don't generally do a good job at respecting CPU affinity -- even
on an idle system. You can control that yourself.

If you are going to represent ‘jobs’ or tasks as threads, and you need
potentially 100s or 1000s of those, it just won’t scale - the overhead is too
high.

If you control context-switching in your application (in practice that's about
saving and restoring a few registers) -- cooperative multitasking -- you have
a lot more control over how it works, when it's appropriate to switch, etc.,
and the overhead is far lower.

Oh, and on average it's 2.5-3x more expensive to context switch when using
virtualisation.

~~~
marktangotango
>> If you are going to represent ‘jobs’ or tasks as threads, and you need
potentially 100s or 1000s of those, it just won’t scale - the overhead is too
high.

This seems to be an indictment against containerization (i.e. Docker,
Kubernetes, etc.). How can a single machine, even with containers, scale out
if this is true? Do people advocating Docker and similar technologies not
understand or recognize that their containers should only utilize y threads
(where y = (total threads per machine)/(number of containers hosted on that
machine))? I personally have never heard this mentioned as a consideration
when deploying Docker, for example.

~~~
ecnahc515
The advantages of containers are multidimensional. The benefit of running
multiple processes on a single node is not solely to increase utilization of a
given machine but also to gain advantages in deployment, etc.

Schedulers can take what's being mentioned here into account. You can say "I
need at least 2 cores but would like up to 8" or "I need 8 guaranteed cores,
evict lower-priority containers if necessary" to ensure your containers aren't
fighting over resources. You can also classify machines for certain types of
workloads and then specify that certain containers should run on a particular
class of node. Not all container platforms handle this, but Kubernetes does at
least.

~~~
smarterclayton
And as we add improved resource management and tuning to Kubernetes you gain
those improvements without having to change your deployment pipeline - just
like when a compiler gets better you get more performance for "free".

------
btrask
I just created a submission for libco, which I think is the best (fastest and
most portable) library for coroutines in C. It's great, I wish it got more
love.
[https://news.ycombinator.com/item?id=13199581](https://news.ycombinator.com/item?id=13199581)

~~~
codr4life
I'm curious, how is libco faster or more portable than the approach described
in this post?

~~~
byuu
It's not, they're two separate things. Unfortunately, nobody can agree on
terminology for this stuff. I try not to ever use the word 'coroutine' anymore
because of this.

libco is what I call a cooperative threading library. It allocates a stack for
each thread. This means that thread A can call six functions deep into helper
functions, and suspend in the middle of a helper function, and then resume
right back at that same point.

Thus, you program libco very similarly to how you do preemptive threading,
only you don't ever need any locks, and you have to do your own scheduling
(which can be as simple as "co_call and co_return", or "co_jump" in assembly
terms.)

The CORO #define wrappers in the article are what I would call a very simple
state machine. There is no thread, there is no stack, there is no user-space
context switching going on. You only get the call/ret style behavior, and you
can't ret from a subroutine called inside of it. It's far, far less useful.

However, when you can get away with it, CORO will be about ten times faster on
average. libco is optimized as much as possible with pure assembler. On ARM,
it's literally three instructions. On x86, it's ten. Nearly all of that
overhead is because modern CPUs do not enjoy having their stack pointer
changed. Go to older CPUs and the overhead drops substantially, but few people
are coding on a 68000 anymore. This also means that libco is _not_ ISO C. It
requires an assembly implementation, or a platform that lets you poke at
jmp_buf from setjmp/longjmp, to work. In practice, it'll work anywhere you'd
reasonably want: x86, amd64, ppc, ppc64, arm, mips, sparc, alpha; Windows,
macOS, Linux, *BSD, Haiku; etc. have all been tested. And even if you find
something crazy obscure, you only have to write about twenty lines of code to
port it.

The magic of libco is that it's about 100 times faster than preemptive
threading would be for the context switching overhead. So when you actually
need subroutines, and when you don't want to be swamped in
locks/semaphores/critical sections, it can eliminate giant swaths of nested
state machines (switch tables) by keeping all that state inside each local
stack frame, transparently and out of your codebase. Deeply nested state
machines are a serious pain in the ass to reason about and maintain, trust me.

Also ... both of these methods have a serious flaw (as does preemptive
threading): serialization. If you want to save the state of your program in
the middle of working, and restore to that position later (think save states
in an emulator) ... it'll be a bit tricky with libco. But since the state is
hidden by the #define, it'll be absolutely impossible with CORO, unless you
redesigned it to use external state, which would make it even clunkier to use.

In conclusion: both complement each other. I feel C++20 or whatever should add
CORO-style stackless state machine wrappers to the language for convenience.
But it should definitely leave stack-based cooperative threading out, because
that can easily be done as a library. (It can include library functions for
that purpose, but it doesn't need language-level primitives.)

There's a hundred cooperative threading libraries, by the way. My aim with
libco was to make a library that has four functions (co_create, co_delete,
co_active, co_switch) and zero types/structs. As simple and fast as possible,
supplying as many optimized backends as possible. Then let people build on top
of that. The downside is most people still choose to write their own libraries
from scratch, so there's a lot of wheel reinventing going on.

~~~
naasking
> It's not, they're two separate things. Unfortunately, nobody can agree on
> terminology for this stuff.

There is standard terminology though. Coroutines allow arbitrarily nested
calls back and forth. So-called "stackful" coroutines are the only possibility
in C. Other languages perform a program translation into continuation-passing
style to support stackless coroutines with the same semantics.

CORO falls under a class of abstractions known as "generators". You can think
of it like a degenerate coroutine limited to a single call level, but I don't
think conflating the terminology like this is useful.

~~~
codr4life
You got it backwards though; all the name means is that the subroutine has
multiple entry points for suspending and resuming execution, which can be used
to implement nonpreemptive threads, generators, and other abstractions.

------
clarkmoody
How does this work compare with libdill[1]?

[1] [http://libdill.org/](http://libdill.org/)

~~~
drvdevd
At a glance, this just looks like a short, simple coroutine implementation/hack
using C macros, whereas libdill does a lot more to extend C syntax in a
similar manner, but in a much more comprehensive way (I believe with the
intention of matching some of Go's concurrency mechanisms). The author
compares this to Duff's device, so the comparison seems apt.

I would pick libdill for any serious program and keep this on hand as a C
snippet in Vim or $EDITOR.

[edit] I should say this could be used in a "serious" program too, but it's
nice and short in a way that looks like a great snippet to me.

------
zserge
I've recently made a similar protothread/coroutine library for embedded
systems where local continuations can be either setjmp, goto, or case labels.
Nothing fancy, but perhaps someone will find it useful:
[https://github.com/zserge/pt](https://github.com/zserge/pt)

------
enqk
This is a rediscovery of contiki's protothreads:

[http://contiki.sourceforge.net/docs/2.6/a01802.html](http://contiki.sourceforge.net/docs/2.6/a01802.html)

~~~
drakehutner
Which AFAIK is based on the publication of Simon Tatham:

[http://www.chiark.greenend.org.uk/~sgtatham/coroutines.html](http://www.chiark.greenend.org.uk/~sgtatham/coroutines.html)

He directly references good old Duff's device. The title seems to acknowledge
the existence of said article, but I didn't find any reference.

~~~
codr4life
I know, I know. Why is it so important that someone else did something
conceptually similar once upon a time? Can't you see the value in standing on
their shoulders and trying to reach further in some direction? We're all in
the same boat here; you can't own ideas. In this case I was aiming to simplify
the API and usage while keeping the performance, and I feel I managed to take
at least a few steps in that direction.

~~~
scandox
Well it is useful for others to know the history and have other references. I
am sure it is not intended to diminish your work.

~~~
drakehutner
Exactly that!

Sorry if my comment came across like an insult. It wasn't intended as such.
Maybe it was just me hoping to find a direct improvement on Tatham's
implementation, which I would call "quirky" at best.

Your implementation seems to be much more "usable" and understandable in
comparison.

After reading through more of your library, I must say I like its style.

~~~
codr4life
None of this is personal to me, I have no idea where the inspiration came
from; I'm just trying to share the beauty I see in the concepts. So I get a
bit frustrated with anything that steals focus from my mission, like history
lessons. I don't really care who thought of what first. Sorry if that came
across as hurt feelings, there is no conflict here.

I'm glad you like the humble beginnings of the library I've always been
wishing for :)

~~~
drakehutner
This makes your work even more interesting. I don't know if you read the
original mail [1] in which Tom Duff shared his device, but at the end he
mentions that he had an idea for something like coroutines:

> It amazes me that after 10 years of writing C there are still little corners
> that I haven't explored fully. (Actually, I have another revolting way to
> use switches to implement interrupt driven state machines but it's too
> horrid to go into.)

He later replied to Simon Tatham that his coroutines actually were those
"horrid" ideas.

Obviously you must have had thoughts similar to those two. Kudos for that. You
seem to have a deep understanding of the concepts that make up the C language
(and how to twist them) ;).

[1]: [http://www.lysator.liu.se/c/duffs-device.html](http://www.lysator.liu.se/c/duffs-device.html)

------
c-smile
That appears as a variant of Simon Tatham idea:
[http://www.chiark.greenend.org.uk/~sgtatham/coroutines.html](http://www.chiark.greenend.org.uk/~sgtatham/coroutines.html)

Just in case, I've made C++ variant of it:
[https://www.codeproject.com/Tips/29524/Generators-in-C](https://www.codeproject.com/Tips/29524/Generators-in-C)

