
Implementing simple cooperative threads in C - brenns10
https://brennan.io/2020/05/24/userspace-cooperative-multitasking/
======
ChuckMcM
FWIW, the original implementation of Java (when it was known as Oak) used this
trick to manage user level threading. There was also some code to do the stack
swap; since you were in a virtual machine anyway, that stuff was all there to
be adjusted.

Random meta note brenns10: the idiom for giving up the processor is 'yield'
(rather than 'relinquish'); it is common in cooperative systems.

That said, the code in Java that switched threads was called the 'trampoline'
code since you bounced in, switched threads and bounced out.

And my last comment is that if you set up your scheduler in a way that it has
a 'runnable' and a 'sleeping' queue, you can do clever things like have a task
sleep on a mutex or other condition variable and wake up when that variable
does what you're waiting for. It makes implementing network services that
service multiple clients much cleaner.

~~~
brenns10
> FWIW, the original implementation of Java (when it was known as Oak) used
> this trick to manage user level threading. There was also some code to do
> the stack swap; since you were in a virtual machine anyway, that stuff was
> all there to be adjusted.

I never knew Java did this. I figured it would be trivial for most
"interpreted" languages (direct interpreters or bytecode VMs) to implement any
kind of threads (even pre-emptive), because at any point the interpreter could
decide to save the state of one language thread and switch to another, without
bothering the interpreter's own stack.

> the idiom for giving up the processor is 'yield' (rather than 'relinquish')
> it is common in cooperative systems

Makes sense, I had been using relinquish for my other projects but I almost
named it yield this time around.

> And my last comment is that if you set up your scheduler in a way that it
> has a 'runnable' and 'sleeping' queue, you can do clever things like have a
> task sleep on a mutex or other condition variable and wake up when that
> variable does what you're waiting for. It makes implementing network
> services that service multiple clients much cleaner.

I'm really excited to tweak large parts of this, scheduler included! I'm
particularly interested in allowing my scheduler to use `poll()` to determine
readiness of blocked tasks, and yeah, unblocking tasks for purposes like
coroutines or mutexes or condition variables. This was the absolute minimum
scheduler that could form a foundation to write about, and I was really
excited to write about the topic and share it.

~~~
pansa2
> I figured most "interpreted" languages would be trivial to implement any
> kind of threads, because at any point the interpreter could decide to save
> the state of one language thread and switch to another, without bothering
> the interpreter's own stack.

Some language interpreters keep the "virtual" stack and the C stack separate
like this, including current versions of Lua. As you suggest, this allows Lua
to support coroutines which each have their own virtual stack.

However other interpreters, such as CPython, keep both stacks synchronized. In
this case, a call in the interpreted language corresponds to a call in C -
recursively back into the main interpreter loop. This is why Python's
generators do not have independent stacks.

The awfully-named "Stackless Python" [0] breaks the synchronization between
stacks, to allow Python to support stackful coroutines.

At least, the above was my understanding until recently. The upcoming Lua 5.4
changes to using a recursive call [1], like CPython, but still supports
stackful coroutines. I don't know how this is implemented.

[0] [https://github.com/stackless-dev/stackless](https://github.com/stackless-dev/stackless)

[1]
[https://github.com/lua/lua/commit/196c87c9cecfacf978f37de4ec...](https://github.com/lua/lua/commit/196c87c9cecfacf978f37de4ec69eba0a5971256)

~~~
anonymoushn
I was curious about these changes to Lua 5.4, so I had a look. When you create
a Lua coroutine that allocates 200 Lua stack frames and then yields across
them, the first longjmp crosses 200ish C stack frames. When you later resume,
all those C stack frames are gone, but there wasn't any important state in
them! Lua just calls luaV_execute in a loop[0]. (Trampolining?) So at that
point, for that coroutine, the lower 200ish Lua stack frames do not have
corresponding C stack frames, but any Lua subroutines you then call will have
corresponding C stack frames.

[0]
[https://github.com/lua/lua/blob/9b7987a9d1471ba94764286b28e0...](https://github.com/lua/lua/blob/9b7987a9d1471ba94764286b28e0998f73deb46a/ldo.c#L569)

------
snarfy
I've done this before [1], many years ago. At the time, I had never heard of
setjmp or longjmp. I understood task switching as 'push all registers onto the
stack, change stacks, pop all registers off the stack', and so that's what I
wrote for the task switching routines using GNU assembler. Making it pre-
emptive was a matter of calling it periodically. I wasn't sure about timers
either, so I ended up hooking the timer interrupt handlers directly. You could
use either the irq0 timer or the irq8 timer. I recall getting my DOS boxes to
'multitask' inside Windows by hooking irq8.

[1] - [https://github.com/zenakuten/lwp20](https://github.com/zenakuten/lwp20)

------
ColinWright
Interesting to compare and contrast with Simon Tatham's method of implementing
Co-Routines in C:

[https://www.chiark.greenend.org.uk/~sgtatham/coroutines.html](https://www.chiark.greenend.org.uk/~sgtatham/coroutines.html)

Submitted here:
[https://news.ycombinator.com/item?id=23293835](https://news.ycombinator.com/item?id=23293835)

That's decades old, and actually used in PuTTY.

~~~
brenns10
That's really cool, and much lighter weight for the purpose of coroutines!

~~~
SV_BubbleTime
Still looking for something to appropriately handle blocking and resuming
without a full RTOS, but I do like both of these options.

------
ausjke
Other C coroutines

[https://github.com/hnes/libaco](https://github.com/hnes/libaco)

[https://github.com/baruch/libwire](https://github.com/baruch/libwire)

[https://swtch.com/libtask/](https://swtch.com/libtask/)

[https://github.com/halayli/lthread](https://github.com/halayli/lthread)

[https://github.com/Tencent/libco](https://github.com/Tencent/libco)

[https://byuu.org/library/libco/](https://byuu.org/library/libco/)

~~~
deepspace
People just love re-inventing the wheel. I created my own implementation back
in the early 90s in Borland Turbo C and used it to implement a multi-terminal
POS system for a busy bar. I did not know what the technique was called at the
time, though.

Then I joined a company in 96 and discovered that their software all ran on a
custom in-house developed multitasking O/S (!). I immediately recognized the
task switching mechanism and formally learned about coroutines.

Now here we are in 2020 and the youngins are still rediscovering the same
thing.

~~~
winrid
And it will keep happening, because there are many more people starting
programming each year than old people who can teach them.

~~~
retSava
And it's a Good Thing, since it fosters an inventive way of thinking about a
problem, rather than just reading/hearing that X exists, downloading it, and
being done. It's part of learning and honing a skill.

This doesn't mean that the implementation should then be used in practice;
there are likely other solutions that are better suited, more mature, etc.

~~~
winrid
I agree!

------
chacham15
The main problems I've seen with cooperative threading (aka fibers) are
threefold:

1. You need to decide how much stack space to allocate to a fiber beforehand.
If you go over this limit, you will segfault.

2. Thread-local storage doesn't exist anymore, since you're potentially
swapping between different kernel threads (and different fibers using the same
thread will share storage).

3. Certain types of locks can now fail, because the lock isn't transferred to
the new thread upon being woken up. Other locks run into reentrancy problems
here.

Does anyone know of good solutions to these problems? Is there a library that
takes all these things into account?

~~~
Blackthorn
I think there's a fourth problem that's even more dire: what do you do about
blocking?

A blocking call stalls the whole system; it doesn't in a non-cooperative
threading environment. Various languages have "solved" this by not allowing
you to use blocking calls and sometimes adding syntactic sugar like
async/await, but this sucks. The point of having threads is to use them like
threads. At least Go gets this right.

~~~
kccqzy
> Various languages have "solved" this by not allowing you to use blocking
> calls

That is the right solution. Do you seriously think in Go when you make an
apparently blocking system call, an actual blocking system call is made? No,
you are calling the runtime provided version that doesn't actually block. Of
course in C/C++ that needs more discipline to do.

~~~
Blackthorn
> Do you seriously think in Go when you make an apparently blocking system
> call, an actual blocking system call is made?

Who cares? From the perspective of the programmer, it's a blocking call.

~~~
kccqzy
Then how is a syntactic construct like "await blocking_call()" inferior? I
fail to see your point. The await keyword even signals to you the fact that
this call may be blocking the progress of the current fiber/coroutine.

And back to the original example, it's totally possible for a production
implementation to handhold the user like this by, e.g., supplying a modified
libc so that all the blocking system calls are linked against the runtime's
implementation. That would basically achieve what you like in Go.

~~~
Blackthorn
> Then how is a syntactic construct like "await blocking_call()" inferior?

It's inferior because you have to actually specify it whenever you make such a
call, rather than having the runtime or library be smart and figure it out.

> And back to the original example, it's totally possible for a production
> implementation to handhold the user like this by, e.g. supplying a modified
> libc so that all the blocking system calls are linked against the runtime's
> implementation.

I know. I didn't say it was intractable. Just like problems 1-3, they all have
potential solutions -- they just involve hard work.

~~~
kccqzy
I see what you mean now. Omitting the keyword "await" is incredibly easy to
do: simply do something different when the returned type is an awaitable. But
the fact is that most languages haven't done this, and have you considered
that perhaps language designers considered this issue and decided that
explicit is better than implicit? And there are a whole host of other issues
too.

------
felixr
Another interesting approach to lightweight cooperative threading is
Protothreads [1][2]. Those are, however, stackless, and use Duff's device for
the implementation rather than setjmp/longjmp.

[1] [http://dunkels.com/adam/pt/](http://dunkels.com/adam/pt/) [2]
[https://en.wikipedia.org/wiki/Protothread](https://en.wikipedia.org/wiki/Protothread)

~~~
mrpippy
On GCC/Clang, pt uses goto and the “labels as values” GNU extension so it can
avoid Duff’s device. This simplifies things, and allows yielding inside a
switch statement.

------
nickelpro
A more common method of implementing this, which avoids setjmp and longjmp, is
to code a little assembly to do the stack switch. This is typically faster and
easier to port; see:

[https://probablydance.com/2013/02/20/handmade-coroutines-for-windows/](https://probablydance.com/2013/02/20/handmade-coroutines-for-windows/)

~~~
simias
Faster I can believe, but how can custom asm be more portable than
setjmp/longjmp which are standard since C89?

Although in this particular instance TFA also uses inline assembly to setup
the stack, so it's moot.

~~~
nickelpro
I'm wrong. I seemed to remember many platforms not supporting them, but a
quick Google search reveals it's just a foggy memory. I was probably thinking
of ucontext, actually.

~~~
Gibbon1
Most if not all RTOSes swap threads by saving and restoring the stack and
registers. In most of the ones I've seen, you can turn off preemption, and
then you have cooperative threads.

It's much easier to deal with shared memory with cooperative threads at the
expense of higher latency.

------
kccqzy
I implemented something similar last year: [https://github.com/kccqzy/green-threads/blob/master/green_th...](https://github.com/kccqzy/green-threads/blob/master/green_threads.cpp)

Actually implementing it was quite fun and I encourage every enterprising
systems engineer to try. My implementation even supports growable stacks (up
to a limit of course).

------
jerrysievert
we used something similar when I was writing games in the early to mid 90s,
but our yield function was simply called yield().

we were running on windows (3.11), macOS, and amiga (for tooling; the games
didn't run on the amiga) with great success.

of course, we just used setjmp and longjmp, no assembly, for portability
(though there ended up being some assembly in one of the games: while macOS
had good sound management, windows did not at the time, and the audio mixing
code ended up written in hand-rolled x86 assembly in order to perform well
enough for the games to work on the hardware of the time).

~~~
Zardoz84
DIV Games Studio, a language and IDE for developing DOS games in the late 90s,
used coroutines in a similar fashion. It even had priority levels, signals,
and hierarchies of coroutines. In DIV parlance, these coroutines were called
"processes", and yield() was called "frame()". I managed to remake this from
scratch, using D fibers, for a pet/toy game engine that I'm writing.

~~~
jerrysievert
I've always really liked the idea of cooperative multitasking: fewer gotchas,
possibly safer memory access, and easier control over where/when a thread
switches. of course, in those days we had to deal with timing issues; we
couldn't let one cpu hog block another cpu hog :)

------
norrius
I recall being taught a (fairly rudimentary?) explanation of the memory model:
the available space is used from one side as the stack, and from the other
side as the heap. Reading the article, I realised that memory management in a
concurrent context is way more complex than that, but you don't go into a lot
of detail here (fair enough, that's not the main focus).

So each task gets 16k of stack space, but what if it runs out? These stacks
are allocated from the "main" heap, so they won't exactly run into each other,
but overflowing one might trigger UB or crash the program. This is something
that a real scheduler (like the one in the OS's kernel) has to deal with
anyway, so how does, say, Linux solve this? Giving a lot of stack space to
each running thread would lead to fragmented and underused memory, or is that
not a concern at the stack sizes you'd normally deal with?

~~~
ori_b
> Giving a lot of stack space to each running thread would lead to fragmented
> and underused memory, or is that not a concern at the stack sizes you'd
> normally deal with?

That's exactly what happens, and it's not a problem in practice -- each thread
gets 8 megabytes of address space allocated. (Note that due to demand paging,
physical usage grows in 4kb increments.)

------
greggman3
This style of coop multi-tasking was in Crash Team Racing on PS1 for a while.
Unfortunately, on a machine with only 2MB of RAM, it turned out to take too
much memory and be too fragile. The stacks had to be small; add one printf for
debugging and you'd overflow the stack. Eventually it was torn out.

~~~
anonymoushn
I'm surprised you didn't pile on by making a printf which switches to the main
stack, does the actual printf, and switches back.

~~~
greggman3
printf was just an obvious example. Any function had this problem. You run,
things work, and then under some condition the function you called calls
something else and crashes. Or you try to add some functionality to a function
in the system, and suddenly the tasks crash because your edit used more stack.

The point isn't to dis this method. It's just to say it was frustrating with
tiny stacks.

~~~
anonymoushn
I see, that does sound tough.

------
msaltz
This is a really great explanation. I believe that libdill [1] [2] [3] works
this way but this gave me a much more thorough understanding.

[1] [http://libdill.org/](http://libdill.org/) [2] Call to setjmp:
[https://github.com/sustrik/libdill/blob/de7a917bc39756f61237...](https://github.com/sustrik/libdill/blob/de7a917bc39756f61237b8b779dd8be735652f88/cr.c#L442)
[3] Call to longjmp:
[https://github.com/sustrik/libdill/blob/de7a917bc39756f61237...](https://github.com/sustrik/libdill/blob/de7a917bc39756f61237b8b779dd8be735652f88/cr.c#L504)

------
nroets
Many C implementations allow this without any assembly:
[https://en.m.wikipedia.org/wiki/Setjmp.h#Cooperative_multita...](https://en.m.wikipedia.org/wiki/Setjmp.h#Cooperative_multitasking)

(2007)

------
kccqzy
An alternative to implementing your own fibers is to use user-mode scheduled
threads. These have advantages: they are real threads, so supporting things
like thread-local storage is much easier, and the mechanism is much more
forgiving when code uses blocking system calls. Unfortunately, the only
mainstream operating system that implements it is Windows:
[https://docs.microsoft.com/en-us/windows/win32/procthread/us...](https://docs.microsoft.com/en-us/windows/win32/procthread/user-mode-scheduling)

------
loeg
Here's the best prior art I know of in this space:
[https://swtch.com/libtask/](https://swtch.com/libtask/)

I would not recommend taking this approach in C.

------
thornjm
I wonder how hard it would be to handle the segfault and allocate stack space
on demand for these “threads”. Almost certainly stupid, but a fun experiment
to see if it's possible...

EDIT: found a nice example of this here: [https://rethinkdb.com/blog/handling-stack-overflow-on-custom...](https://rethinkdb.com/blog/handling-stack-overflow-on-custom-stacks/)

~~~
loeg
The kernel does this for you, generally. You could use a segfault handler,
stack probing, and a guard page to figure out when you're out of stack, but
it's not clear how you grow the C stack if you're out of contiguous virtual
address space.

------
H2NCH2COOH
This kind of context switching can be achieved more easily by using the old
SysV APIs when working under *nix: [http://man7.org/linux/man-pages/man3/makecontext.3.html](http://man7.org/linux/man-pages/man3/makecontext.3.html)

------
numlock86
Before diving into something as complex as, e.g., FreeRTOS, can anyone
recommend production-ready code, preferably in C, that shows off preemptive
task scheduling using an external timer, including features like priorities
and task delays?

The post is a good explanation regarding stack handling.

------
riscy
For a recently published comparison of various stack implementations, see here
(pdf):
[https://kavon.farvard.in/papers/pldi20-stacks.pdf](https://kavon.farvard.in/papers/pldi20-stacks.pdf)

------
watatuti
The code and explanation of scheduler_relinquish seemed at odds to me.

------
optymizer
I find the contrast between the text color and background color to be too low
for comfortable reading.

~~~
brenns10
Hey, I hear you and I don't want reading my site to be uncomfortable! My site
uses the solarized light colorscheme [1], which I chose because I use the
solarized schemes for my text editors. I think they are lower contrast to lend
themselves to long editing sessions without being jarring on the eyes, but
maybe that's not ideal for webpages?

In any case, if I added a small JS powered button to switch into a higher
contrast style, would that make the site more usable to you?

[1]:
[https://ethanschoonover.com/solarized/](https://ethanschoonover.com/solarized/)

~~~
optymizer
Hi Stephen, I appreciate your willingness to accept feedback. I'm flattered
that you would add a button to make my reading experience better, but I think
a website should be readable by default, so the button would just be a
workaround from my point of view.

I would also like to respectfully challenge your assumption that lower
contrast text is less jarring for the eyes for long reading sessions. If it
were so, they'd be printing books in light gray color, wouldn't they?

Looking at the Solarized theme, my guess is it's the yellow-ish background
that removes the strain on the eyes, as #fff is typically very bright. Since
most people aren't focusing on the background when reading, it doesn't cause
the eyes to focus more than usual. However, changing the color of the text is
a different matter because that's what the eyes are focusing on, and making
the text lighter causes it to be harder to read.

That's easy to verify: on the opacity scale of 0.0 to 1.0, where 1.0 is fully
opaque, #000 (full black) at 0.1 opacity (almost transparent) is harder to
focus on than at 0.9 opacity (almost fully opaque). The only reason why some
(most?) designers wouldn't be caught dead using #000 at 1.0 opacity for text,
despite it being the most readable text color, is that very few things in the
real world are at #000 or #fff, so the general advice is to use some shade
close to them to make the webpage look more natural.

They're essentially trading off readability for aesthetics, and my original
comment was that you traded off too much readability for aesthetics, in my
opinion.

That said, if I were you, I wouldn't trust my opinion or anyone else's really.
Seeing as you made it to HN, you could run an A/B test to see if serving a
higher contrast page causes people to spend more time on that page on average,
presumably because they're spending more time reading the content, instead of
just skimming it like me. That way you'd be making informed changes based on
data, not people's subjective experiences (including mine and yours).

Just for fun, I fiddled with the text color on your webpage and I'd say
anything lighter than #555 for the body and #666 for the h2 seems too washed
out (with the same background color).

Finally, I do want to note that the content is excellent. Happy Memorial Day!

------
presiozo
This looks like more work and more opportunities for bugs than protothreads.

------
Ram_Lakshmanan
I highly recommend thread analyzer tool
[https://fastthread.io/](https://fastthread.io/)

------
adamnemecek
Yes, this has been done ad nauseam. Do you know why no one uses it? Reasoning
about cooperative threads in C is impossible. Reasoning about async C code is
impossible.

~~~
loeg
> Yes, this has been done ad nauseam. Do you know why no one uses it?

Yes.

> Reasoning about cooperative threads in C is impossible.

Well, sorta. The problem with these approaches is mostly that C runtimes
expect you to use the native operating system thread construct, not your own
handrolled thing. If you were writing a libc for your own operating system you
_could_ use something vaguely like these libraries.

Another obvious problem is the lack of utilization of multiple CPU cores. And
once you try to mix the two (M:N threading), the system becomes difficult to
understand (and often has worse performance than just 1:1 threading).

I wouldn't say it's because reasoning about them is impossible, but that in
practice, they are extremely difficult to use.

> Reasoning about async C code is impossible.

No, that part isn't true.

------
SlowRobotAhead
> It’s like a goto statement, but it can even be used to jump outside of a
> function. It’s also a lot more difficult to read than a goto, since it looks
> like a regular function call.

Ah! Blood pressure rising!

> Like with goto, the common advice is to avoid

Whew. And then he goes on to make a really good case for performing a
complicated task with simple error handling.

It's a good application not for general-purpose cooperative routines, but for
the specific use case of performing multiple procedural steps.

I think where it falls short, and where an RTOS/OS will be more useful, is
when you have to wait on one of those steps.

That's the harder case: in the HTTP request example, you're waiting for the
response and keeping that context ready while blocking in a separate thread.
That ends up being easy to read and maintainable. But I suppose there is no
reason you can't do both.

~~~
asveikau
Does your blood pressure also rise from use of exceptions?

Because the pattern that emerges with setjmp/longjmp is pretty similar, and
that's the most common use: to simulate what other languages offer with
exceptions. The most prominent libraries I can think of that _force you to do
this_ are image related, libpng and libjpeg both use longjmp to handle errors
[though IIRC it's optional in the latter].

~~~
SlowRobotAhead
Are there exceptions in C?

If I'm interviewing someone for a programming job and I see goto in their C
code... they'd better have an amazing reason, or they won't be getting the
job. Harsh, but it's the reality of how few people are suited for embedded
programming.

~~~
Gibbon1
I spend a lot of time dicking with embedded C code. I think using goto is
reasonable in some cases. The classic case is to short-circuit a block of code
when there is an error of some sort.

    
    
    int function(void)
    {
        int oops_err = 0;

        // early returns here

        // stuff
        // more stuff

        if (oops_err) {
            goto some_error;
        }

        // yet more stuff

        // normal exit
        return 0;

        // bad exit
    some_error:
        // clean up
        return oops_err;
    }
    

The above is fine: you avoid a convoluted execution path, or cleanup and a
return in the middle of a function. Looking at the top of the function you see
the early returns; looking at the end you see the normal and error exits.

~~~
rtomayko
This is specifically allowed in the Linux kernel coding style guidelines as
well FWIW:

[https://www.kernel.org/doc/html/v4.10/process/coding-style.h...](https://www.kernel.org/doc/html/v4.10/process/coding-style.html#centralized-exiting-of-functions)

