
High performance services using coroutines - gmosx
https://medium.com/software-development-2/high-performance-services-using-coroutines-ac8e9f54d727
======
themartorana
Writing high-performance server software is a whole other world from writing
services. In the former, data copies can cause unacceptable slow-downs of a
fraction of a millisecond - and those copies may be hidden all the way down at
the driver level. When writing services (which are just as often written in
Ruby or Python), trading off a few milliseconds for safety is often worth it.

I spend most of my time writing services, and when looking at a language like
Go, there's a reason the default is pass-by-value. Passing pointers around
between multiple co-routines is asking for race conditions and, in non-GC
languages, dangling pointers. Services are rarely written with the precision
of high-performance servers.

I don't envy the server writers (although I can see how it would be fun!).
Giving up a few milliseconds per request to make sure my co-routines aren't
sharing pointers is worthwhile, and I appreciate the safety that gives me. I'm
sure someone will mention that Rust could give me the safety I was looking for
in a non-GC language, but that's the point, isn't it - that by being able to
game the system, you can gain a few precious microseconds here and there that
enforced safety might cost you.
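To make the pass-by-value point concrete, here's a toy illustration (in Python rather than Go, and every name in it is made up): each worker gets its own copy of the payload, so no mutation in one task can race or corrupt another.

```python
import threading

# Each worker receives a private copy of the request payload. Mutating
# that copy is safe; the shared original is never touched.
def worker(payload, out, key):
    payload.append("touched-by-" + key)
    out[key] = payload

original = ["item"]
out = {}
threads = [
    threading.Thread(target=worker, args=(list(original), out, "a")),  # copy
    threading.Thread(target=worker, args=(list(original), out, "b")),  # copy
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(original)   # still ["item"]: the workers never saw the shared list
```

The cost is the copy itself, which is exactly the few milliseconds being traded away for safety.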

~~~
kibwen

      > I'm sure someone will mention that Rust could give me 
      > the safety I was looking for in a non-GC language, but 
      > that's the point, isn't it
    

I'm not sure what you're trying to say here. There is no need to game the
system, as Rust lets you pass pointers between threads while enforcing that
they are used safely. Rust lets you be as fast as an unsafe language while
being as safe as a managed language; the tradeoff is that you have to give
more thought to your design up-front (which, given the pain of trying to
reproduce and fix race conditions, is more than worth the effort IMO).

------
aaronlevin
This sounds very similar to Warp, the Haskell web server that is known for its
high performance on many-core machines. The Web Application Interface (WAI) is
also coroutine-based (I believe):
[http://www.aosabook.org/en/posa/warp.html](http://www.aosabook.org/en/posa/warp.html)

~~~
raksoras
Sorry for the shameless plug, but I just released an HTTP server based on Lua
coroutines
([https://github.com/raksoras/luaw](https://github.com/raksoras/luaw)) that
pretty much follows the exact design from the OP article.

~~~
markpapadakis
This is pretty cool. Thanks for sharing.

------
marktangotango
>> If not, the coroutine will yield, and at the same time, be
scheduled(migrated) to another set of threads responsible for the ‘slow’
requests(ones that need to block). Those will just schedule the coro in their
own scheduler, and perform the read again and continue.

>> Alternatively, we could just yield the coro, waiting for a new special
coroutine that would run in a background thread just executing readahead() and
then re-scheduling(making runnable) the original coro, waiting for readahead.

Seems to me this scheme will ultimately be limited by the slower requests.
Mixing fast and slow operations is essentially Little's Law[1], where the
average time dominates. However, if the slow/blocking reads were also async, I
think you'd eventually be limited by IO speed?

[1]
[http://en.wikipedia.org/wiki/Little%27s_law](http://en.wikipedia.org/wiki/Little%27s_law)
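To put numbers on the "average time dominates" point, here's a quick back-of-the-envelope using Little's Law, L = λW (the arrival rate and the fast/slow mix below are made-up figures, just for illustration):

```python
# Little's law: L = lambda * W
# (average concurrency = arrival rate * average time in system).
arrival_rate = 1000.0                      # requests/second (assumed)

# Suppose 90% of requests take 1 ms and 10% take 100 ms (assumed mix).
mean_latency = 0.9 * 0.001 + 0.1 * 0.100   # seconds
concurrency = arrival_rate * mean_latency

print(mean_latency)   # ~0.0109 s: the 10% slow tail contributes ~92% of it
print(concurrency)    # ~10.9 requests in flight on average
```

Even a small fraction of slow requests dominates the mean, which is why segregating them matters so much.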

~~~
markpapadakis
I am familiar with Little's Law. The key concept is optimistic sequential
execution. The overwhelming majority of requests do not need to page in cold
data or otherwise stall for too long. And in the cases where they can stall,
e.g. processing a lot of data in a loop, they can yield back to the scheduler,
thereby being fair to all other accepted coroutines/requests and giving others
a chance to complete earlier. In practice, at least based on my prototype and
tests, this works extremely well. If only about 10-15% of requests may end up
blocking the thread, while the rest can read kernel-cache-resident data and
proceed, that's really huge in terms of performance gains. The background
threads that take ownership of and responsibility for running slow/blocking
coros can rely on a work-stealing scheme to keep them all busy; if you don't
offload to them, thereby leaving contention/stalls on the 'fast' threads, you
are eventually going to suffer stalls and latency spikes.
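The fast/slow split can be sketched in a few lines of Python, with generators standing in for coroutines and one background thread standing in for the slow pool (the real system is C++; every name here is invented):

```python
import queue
import threading

# A "fast" loop drives coroutines (generators); any coroutine that
# signals it would block is handed to a background "slow" worker, so
# the fast loop itself never stalls.
slow_queue = queue.Queue()
results = []

def slow_worker():
    while True:
        coro = slow_queue.get()
        if coro is None:          # sentinel: shut down
            break
        for _ in coro:            # drive the blocking coroutine to the end
            pass

def request(name, would_block):
    if would_block:
        yield "migrate"           # signal: move me to the slow threads
    results.append(name)          # the actual "work" of the request
    yield "done"

def fast_loop(coros):
    for coro in coros:
        if next(coro) == "migrate":
            slow_queue.put(coro)  # fast path stays unblocked
        # otherwise the request already completed on the fast path

t = threading.Thread(target=slow_worker)
t.start()
fast_loop([request("a", False), request("b", True), request("c", False)])
slow_queue.put(None)
t.join()
```

Here "a" and "c" complete inline on the fast path while "b" finishes on the background thread; a real system would use many slow workers plus work-stealing rather than a single queue.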

------
SixSigma
Syscalls are slow. See my previous HN post

[https://news.ycombinator.com/item?id=8961582](https://news.ycombinator.com/item?id=8961582)

------
rubiquity
The end implementation sounds a lot like... Erlang. With the exception that
this solution avoids allocations that Erlang probably makes due to
immutability.

~~~
markpapadakis
This is a C++ implementation. Erlang and other high-level languages use
Processes (in the case of Erlang) or other such constructs to encapsulate
logic, and those are indeed scheduled in either a cooperative or preemptive
fashion.

The main goal of this system is to minimize the potential for blocking or
otherwise slow processing steps which can stall processing of other requests,
especially if this happens for many such requests/second.

Specifically, the 'optimistic sequential execution' idea is that you run the
request on the thread it was accepted in, use mincore() to determine if it
will _likely_ stall when you read the data, and if so, either migrate the coro
to a 'slow' coros thread pool, or just keep going (which should be the fast
path here).

This is a solution to a specific class of problems. The similarity with
Erlang's Process scheduling is that a Process is a lot like a coroutine (it
encapsulates logic and data, it's scheduled by a user-space scheduler, there
can be thousands of them in a runnable state, and each has its own stack). But
that's really the definition of stackful coroutines.
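For anyone curious what the mincore() residency check looks like, here is a rough, Linux-flavoured sketch via Python's ctypes (the real implementation is C++; the helper names are mine). mincore(2) fills one byte per page of a mapping, with bit 0 set when the page is resident, i.e. a read won't fault:

```python
import ctypes
import mmap
import os
import tempfile

libc = ctypes.CDLL(None, use_errno=True)   # resolve mincore from libc
PAGE = mmap.PAGESIZE

def pages_resident(mm, length):
    """Return one bool per page of the mapping, via mincore(2)."""
    npages = (length + PAGE - 1) // PAGE
    vec = (ctypes.c_ubyte * npages)()
    addr = ctypes.addressof(ctypes.c_char.from_buffer(mm))
    if libc.mincore(ctypes.c_void_p(addr), ctypes.c_size_t(length), vec) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return [bool(b & 1) for b in vec]

# Map a two-page file and fault its pages in by reading them.
with tempfile.NamedTemporaryFile() as f:
    f.write(b"x" * (2 * PAGE))
    f.flush()
    mm = mmap.mmap(f.fileno(), 2 * PAGE,
                   prot=mmap.PROT_READ | mmap.PROT_WRITE,
                   flags=mmap.MAP_PRIVATE)
    _ = mm[:]                              # touching the pages loads them
    resident = pages_resident(mm, 2 * PAGE)
    # The "optimistic" decision: if all pages are resident, read inline
    # on the fast path; otherwise migrate the coro to a slow thread.
    fast_path_ok = all(resident)
```

The check is advisory (a page can be evicted between the check and the read), which is exactly why it's called *optimistic*.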

------
amelius
Coroutines are essentially a form of cooperative multitasking. I'm not sure we
should be using that in this day and age, especially considering that requests
are not purely I/O bound in all cases, and could depend on actual computation.

~~~
markpapadakis
The fewer the trips to the kernel, the more work you can do in user-space; the
fewer the syscalls, the fewer the context switches, and the better the
throughput and performance.

Even if a request is completely CPU-bound, it’s a good idea to be fair to
other requests accepted in the thread; that is, if a request should take 2
seconds of processing time, while others would take a few microseconds, they
shouldn’t all wait until that one request is processed.

Instead, yielding back to the scheduler where appropriate will give them a
chance to complete in time proportional to the effort required to process
them, not to the time required to handle long-running requests that stall
processing of other coros.
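That fairness argument is easy to demonstrate with a minimal cooperative round-robin scheduler over Python generators (a toy sketch; the actual system is C++ and these names are made up):

```python
from collections import deque

def run(coros):
    """Round-robin over (name, generator) pairs; returns completion order."""
    order = []
    ready = deque(coros)
    while ready:
        name, coro = ready.popleft()
        try:
            next(coro)                  # run until the coroutine yields
            ready.append((name, coro))  # requeue: be fair to the others
        except StopIteration:
            order.append(name)
    return order

def short_task():
    yield                               # one small slice of work

def long_task():
    for _ in range(1000):
        yield                           # a long job, yielding between slices

print(run([("long", long_task()), ("s1", short_task()), ("s2", short_task())]))
```

Because the long task yields between slices, both short tasks complete almost immediately instead of queuing behind two seconds of CPU work.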

~~~
amelius
Yielding will take CPU time, even when no yielding is necessary. And where to
put these yields? In an inner-loop? Should all loops be rewritten to ensure a
yield takes place at least every N instructions? Considering all this, I think
a better approach would be to optimize the thread scheduler in the kernel,
assuming it is as inefficient as you are suggesting.

~~~
markpapadakis
In practice, you indeed yield often enough (there is no perfect formula, I
believe), in loops or elsewhere where you are spending too much time doing one
thing. You don't need to do it perfectly; you just need to yield often enough
to give other coros a fair chance at completion. Relying on the kernel is a
bad idea; that's where you lose on performance. Also, to do what you are
suggesting would mean using many, many OS threads, with the kernel managing
them all in a pre-emptive fashion so that you don't need to do it yourself;
but that means you pay the context-switching, memory-footprint, etc. costs.
It won't scale and will be slow.
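One common way to keep the yield overhead negligible in a hot loop is to yield only every N iterations; here's a small Python sketch (N and the names are arbitrary):

```python
def checksum(data, yield_every=1024):
    """Sum bytes, yielding to the scheduler every `yield_every` iterations."""
    total = 0
    for i, byte in enumerate(data):
        total = (total + byte) & 0xFFFFFFFF
        if (i + 1) % yield_every == 0:
            yield            # periodic fairness point, not on every step
    return total

# Drive it to completion, counting how often it actually yielded.
coro = checksum(bytes(range(256)) * 16)   # 4096 bytes of input
yields = 0
try:
    while True:
        next(coro)
        yields += 1
except StopIteration as stop:
    result = stop.value

print(yields, result)   # prints: 4 522240
```

Amortized over 1024 iterations, the branch and occasional yield cost almost nothing, while still bounding how long any one coroutine can hog the thread.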

~~~
amelius
The memory footprint is there only to provide for stack-space. This makes me
wonder, how do coros implement stacks? And don't they suffer effectively from
the same problem?

Another point: what if your code calls a library function for which you don't
have control over when it yields?

