async.h is indeed similar to protothreads, but simpler. I like its mechanism for co-routines calling co-routines.
As for scheduling, there's a lot of history of M:N scheduling. M:N scheduling hasn't worked out for, e.g., Solaris, Linux, Rust, and some others. Thread libraries tend to be 1:1 nowadays. Erlang uses M:N and I see claims that it works well enough, but I am not familiar enough with it to understand if that's true or why.
Here M is the number of user-land threads and N is the number of threads allocated in the OS kernel to run those user-land threads. M:N means the two differ, with M>N and a user-land scheduler choosing which OS thread runs which user-land thread (which looks a lot like a co-routine). It's easy to end up with pathological conditions and to leave performance on the table. But I suppose a lot depends on just what exactly the workload looks like. An I/O-bound workload with no long CPU runs (or lots of yielding during them) will probably work well enough with M:N scheduling, but 1:1 threading with as many threads as CPUs should work even better.
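To make the M:N shape concrete, here is a minimal sketch in Python, not anyone's real scheduler. The names `run_m_on_n` and `counter` are made up for illustration; generators stand in for user-land threads, and a shared run queue stands in for the user-land scheduler. In CPython the N worker threads do not actually run bytecode in parallel, so this only shows the structure, not the performance.

```python
import queue
import threading

def counter(limit):
    """A user-land 'thread': sums 0..limit-1, yielding after every step."""
    total = 0
    for i in range(limit):
        total += i
        yield              # cooperative yield point back to the scheduler
    return total

def run_m_on_n(tasks, n_workers):
    """Run M generator tasks on N OS threads; returns {name: result}."""
    run_queue = queue.Queue()
    results = {}
    results_lock = threading.Lock()

    def worker():
        while True:
            item = run_queue.get()
            if item is None:           # shutdown sentinel
                return
            name, gen = item
            try:
                next(gen)              # run until the task's next yield
            except StopIteration as done:
                with results_lock:
                    results[name] = done.value
            else:
                run_queue.put((name, gen))   # still runnable: requeue it
            finally:
                run_queue.task_done()

    for name, gen in tasks.items():
        run_queue.put((name, gen))
    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    run_queue.join()                   # wait until every task has finished
    for _ in threads:
        run_queue.put(None)            # one sentinel per worker
    for t in threads:
        t.join()
    return results

# M = 100 user-land tasks multiplexed over N = 4 OS threads.
results = run_m_on_n({f"t{k}": counter(k) for k in range(100)}, n_workers=4)
```

A task that never reaches a `yield` would pin its worker for the whole run, which is one of the pathological conditions mentioned above.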
I.e., coroutines are just a C10K method, and you must end up with more of them than you have OS threads and HW CPUs.
If, e.g., Bryan Cantrill and others who claim M:N is bad are wrong, they're only wrong, I think, if they extend the claim to stackless / small-stack coroutines. But Bryan Cantrill's seminal paper on the badness of M:N threading was not about stackless coroutines, but about very stackful coroutines (pthreads).
M:N is necessarily bad if the M things have large stacks (which was and is the case in, e.g., pthreads).
M:N is necessarily good (C10K) if the M things are extremely light-weight.
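A back-of-the-envelope sketch of why the size of the M things matters. The numbers are assumptions for illustration only: 8 MiB is a common default pthread stack reservation, 256 bytes is a guess at the state of a small stackless coroutine, and `footprint` is a made-up helper. Thread stacks are usually reserved virtual memory that is committed lazily, so the first figure is an upper bound on address space, not resident memory.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB

def footprint(m_tasks, bytes_per_task):
    """Total bytes reserved if each of the M tasks holds bytes_per_task."""
    return m_tasks * bytes_per_task

# Assumed sizes, not measurements.
stackful = footprint(10_000, 8 * MIB)    # 10K tasks with 8 MiB stacks
stackless = footprint(10_000, 256)       # 10K tasks with 256 bytes of state

print(f"stackful:  {stackful / GIB:.3f} GiB")
print(f"stackless: {stackless / MIB:.3f} MiB")
```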
Everything we see in this space, from Scheme-style continuations and partial continuations, to hand-coded CPS, to stackless co-routines, to async/await primitives that allow compilers to do partial CPS conversion / coroutines: all of these are about C10K, which means a) using async I/O, and b) compressing program state / reducing overhead to serve the most possible clients.
As a program state compression technique, nothing beats hand-coded CPS, but it's utterly not user-friendly. Scheme-style continuations mostly shift program state from the stack to the heap. The sweet spot is async/await.
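A small sketch of the same logic written both ways, to show the ergonomic gap. Everything here is hypothetical: `read_cps`, `read_async`, and the list-backed `source` stand in for real async I/O. In the CPS version the only state kept alive between steps is what each closure captures (`a`, then `a` and `b`); in the async/await version the language runtime does that bookkeeping.

```python
import asyncio

# Hand-coded CPS: every step takes an explicit continuation `k`.
def read_cps(source, k):
    k(source.pop(0))                  # "I/O completes", continuation is called

def sum_two_cps(source, k):
    read_cps(source, lambda a:
        read_cps(source, lambda b:
            k(a + b)))                # only `a` and `b` are kept alive

# async/await: same steps, written in direct style.
async def read_async(source):
    await asyncio.sleep(0)            # stand-in for waiting on async I/O
    return source.pop(0)

async def sum_two_async(source):
    a = await read_async(source)
    b = await read_async(source)
    return a + b

def run_cps(values):
    out = []
    sum_two_cps(list(values), out.append)
    return out[0]

def run_async(values):
    return asyncio.run(sum_two_async(list(values)))
```

With two sequential reads the CPS version is still readable; add a loop or error handling and the nesting gets out of hand quickly, which is the user-friendliness problem.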