
Libaco: A fast and lightweight C asymmetric coroutine library - adamnemecek
https://github.com/hnes/libaco
======
dmytroi
Would be nice to compare performance with other libraries! For example
Stacklet [0] and Boost Context [1].

- [0]
[https://github.com/mozillazg/pypy/tree/master/rpython/transl...](https://github.com/mozillazg/pypy/tree/master/rpython/translator/c/src/stacklet)

- [1]
[https://github.com/boostorg/context/tree/develop/src/asm](https://github.com/boostorg/context/tree/develop/src/asm)

~~~
ynezz
And to [http://libdill.org](http://libdill.org) as well

~~~
zii
Yeah, make a performance diagram.

------
ausjke
The most starred C coroutine library is:

[https://github.com/Tencent/libco](https://github.com/Tencent/libco)

It's said that Go learned from libtask:

[https://github.com/jamwt/libtask](https://github.com/jamwt/libtask)

And something else:

[https://github.com/baruch/libwire](https://github.com/baruch/libwire)

[https://github.com/halayli/lthread](https://github.com/halayli/lthread)
(multi-core friendly)

~~~
byuu
Disappointing that Tencent took my library's name. I'd been using the name
since 2006 or so. I guess you can't expect a two-letter lib* name to remain
unique. My luck that theirs would become the most well known in this space,
though.

~~~
ausjke
Hi byuu, are you going to add arm64 support, or consider C99?

If I have multiple cores, what's the best way to leverage that along with
libco? Will CPU affinity help?

~~~
byuu
Aarch64 support will show up as soon as I can get a Linux distro running on it
to test the implementation with. My understanding is the ARM ABI isn't as well
defined, and you tend to need slightly different implementations per device,
so it may not be as useful as simply using the setjmp version.

You can set -DLIBCO_MP to get the co_active handle to be multi-threaded. Each
preemptive thread would then have its own set of cooperative threads.

------
ur-whale
A maybe-naive question, and also maybe slightly off-topic, but: is there
ever a reason to use co-routines when you have full-fledged multi-threading
available?

As much as multi-threading can quickly become a headache wrt robustness and
provable behaviour, I've always felt that a complex co-routine based codebase
- even if you can theoretically reason about it using FSMs - is harder to "fit
in head".

~~~
jandrewrogers
I use coroutines in C++ frequently as well as threads; they solve different
problems. There are two major use cases for coroutines in my experience: they
are a _much faster_ concurrency primitive in some cases and they make the
design of some complex concurrency problems _much simpler_. Coroutines, which
are functional in nature, seem complicated until you grok them, at which point
the design of such code becomes clear and obvious (like functional programming
generally).

The large performance advantage of coroutines comes from two main areas.
First, context-switching in coroutines is _at least_ a thousand times faster
than with threads. On modern processors for software with fine-grained locking
requirements, it is common to see multithreaded applications spending more CPU
time on context-switching than actually running application code. Database
engines were early adopters of the coroutine model because they have this
problem in a major way. Second, you have complete control of the execution
schedule ensuring that your code is always doing the best possible thing at
the best possible time; schedule-based optimization offers a large throughput
improvement by allowing you to dynamically select the task that is most
optimal to run in the current context rather than selecting randomly. For a
simple example from the database world, most atomic units of work apply a
scan operator to a data page, which is generally very fast -- access
latency is a major fraction of the total cost. If you have a thousand queries
running concurrently, many of them will touch the same pages at different
times. In a coroutine model, when one coroutine pulls a data page into L2
cache, it is inexpensive to immediately schedule the other queries that will
need to eventually access that data page to run their operator on the same
page at the same time, switching between them sequentially. In multithreaded
environments, you tend to end up churning the CPU cache for almost every
single page operation. The overhead of most schedule-based optimization in
threaded environments is prohibitive, so it largely isn't done.

The simplicity advantage for complex concurrency problems is that you don't
need locks, which is the bane of multithreaded programming, with the caveat
that all your operations need to be non-blocking. The full state of the
various tasks is available to you so conflicts can be eliminated by scheduling
the coroutine execution so that they never occur. For all practical purposes,
code correctness is evaluated in a single-threaded context. You can design
multithreaded code primitives that can inspect their own lock graphs to ensure
that deadlocks and other conflicts can always be resolved automatically;
database engines tend to use these structures pervasively, but they are both
complex and computationally expensive. The disadvantage of coroutines here is
that all of your I/O interactions must be non-blocking. Most I/O these days
_should be_ non-blocking as a matter of performant design, but if you are
using coroutines then blocking I/O is simply not an option.

Architecturally, coroutines lead to a model where virtually all data is
private to a thread whereas in a multithreaded model most data tends to be
shared between threads. There are many concurrency and complexity advantages
to making data non-shared but it comes with one major downside -- you need to
figure out how to balance compute load in such a model, which is perfectly
solvable but, unlike the multithreaded case, not trivial. Distributed systems
have
the same load balancing problem (though few attempt to solve it in a robust
way).

~~~
ur-whale
> by allowing you to dynamically select the task that is most optimal to run
> in the current context rather than selecting randomly

Thanks for the very complete explanation; this sentence is the crux, AFAIC.

[edit]: what do you use to do coroutines in C++?

------
cryptonector
The fastest C co-routine library is going to be inside PuTTY [0].

[0]
[https://www.chiark.greenend.org.uk/~sgtatham/coroutines.html](https://www.chiark.greenend.org.uk/~sgtatham/coroutines.html)

------
rjeli
Comparison to libmill/libdill?

~~~
FrozenVoid
Libmill is stack-unsafe
[https://github.com/sustrik/libmill/issues/167](https://github.com/sustrik/libmill/issues/167)

------
ComputerGuru
Does anyone know if this plays well with multiple threads with a separate
coroutine machine on each simultaneously?

~~~
rurban
You have to check whether the thread-context register GS is saved and restored
in the assembler part.

~~~
rurban
Oops, FS stores the TIB, not GS.

------
rijoja
Does anybody feel like adding some #define black magic to sugar-coat the
syntax?

------
lkurusa
This looks nice. It's always good to see formal proofs!

------
stevefan1999
I suggest using setjmp/longjmp instead of copying registers manually.

~~~
ufo
How would you do that? longjmp lets you jump back to the parent coroutine, but
after that you cannot jump back into the child to continue where it left off.

~~~
cryptonector
By using different stacks for each coroutine. But it's still not trivial.

