
RethinkDB internals: making coroutines fast - coffeemug
http://www.rethinkdb.com/blog/2010/12/making-coroutines-fast/
======
pmjordan
Great stuff, good catch on the syscall! I've been tinkering with C coroutines
as well recently but since our critical code runs in kernel space they ended
up being more hassle than they're worth. I'm still using them in our (user
space) testing sandbox.

As you mention malloc()ing stacks: you might want to allocate them using
mmap() and MAP_ANONYMOUS instead. You can map the adjacent memory pages with
appropriate protection to prevent stack overflows and as a result, memory
corruption. I'm not aware of any drawbacks (malloc itself typically uses mmap
above a certain size) but it certainly beats hoping your stacks are big
enough. Less of an issue on 64-bit environments of course.

~~~
littledanehren
Thanks for the suggestion! I'm actually looking into doing something like that
right now.

------
jhrobert

So, coroutines are as fast as callbacks and easier to program with... so why
aren't they available in all languages?

I understand that it is difficult to change JavaScript, but when you create a
whole new framework, say nodejs, why not provide coroutines?

Does it need a language construct to be efficient? Maybe; then have a look at
the Icon programming language, where coroutines rule.

~~~
strlen
Personally, I view node.js as a step backwards. It's great to have a single
thread holding multiple connections, with high-performance non-blocking I/O
within each thread; the problem is that in their case there's also a single
thread per UNIX process, with no primitives for communication between
processes (unlike e.g. Erlang/OTP). If you want good performance, you want an
event loop per logical core.

Completely independent processes may be acceptable for simple, stateless web
apps, but with most anything else you risk losing a great deal of efficiency
without the correct primitives. For example, if you have on-disk state, unless
the threads can share memory (either by running within the same process or by
using UNIX shm), you risk a situation where threads compete for resources like
the OS page cache (provided you're not doing your own in-process caching and
direct I/O; but even in that case, you've lost the ability to share a resource
between any two connections without copying).

Some, e.g. Asana, have added fibers to JavaScript (in their case, to V8):
<http://asana.com/blog/?p=49>

The previous post from RethinkDB is quite interesting in terms of the
motivation for coroutines: to me, this superficially resembles SEDA. Here,
instead of stages in a pipeline (a thread pool per stage, the first stage
processing events from epoll/kqueue, a thread per core in each pool, each
thread holding state machines, communication between threads via queues), you
are using coroutines for clearer code.

~~~
sjs
I don't think you want an event loop per core, that's back into the realm of
concurrent madness, and I don't think that's how Erlang works (Erlang folk:
correct me if I'm wrong on that).

It's not a bad thing that data has to be copied to be shared between Erlang-
style processes. That is part of the reason why things "just work" when you
move that process to another machine, or data centre. If the implementation
can be smart and "cheat" by sharing data in the same process that's fine, but
it is an implementation detail and not a property of the system.

The way Erlang works is that you have a single event loop, and threads are
spawned to handle I/O and such (the N:M threading model: N green threads are
mapped onto M OS threads). When a process blocks, its execution is suspended.

Node employs an N:M threading model as well, except "processes" in Node are
just functions. One big difference is that if you make a blocking call in
Node, the whole process is blocked. There's only an event loop; no scheduler
or anything that OS-like is involved. The Erlang model is clearly superior
imo, but Node is _far_ more accessible (for better and worse).

~~~
strlen
> I don't think you want an event loop per core, that's back into the realm of
> concurrent madness, and I don't think that's how Erlang works (Erlang folk:
> correct me if I'm wrong on that).

If you don't care about performance, sure (and there's nothing wrong with not
caring about performance, e.g. low-volume web applications, where server-side
JavaScript can shine, IMO). However, I can give you a command line you can run
which will plot a very nice graph of throughput vs. the number of selector
threads in a specific system I've been working on.

Yes, it will bring you back to "concurrent madness". What you _want_ is
primitives that make it possible to deal with this madness, not handwave it
away. In Erlang those are the actors themselves (optimized for efficient
delivery of messages within a single node and remotely), supervision of
processes that fail, ETS, and mnesia; in Java they're Doug Lea's beautiful
concurrent collections (java.util.concurrent). You're assuming multithreading
means pthreads or synchronized/notify/wait (Java before java.util.concurrent).

Note that there are two models for this: one is Erlang's, as well as
traditional UNIX IPC -- message passing; the other is shared-memory multi-
threading with concurrent and lock-free collections (Java), which also goes
nicely with the idea of minimizing mutable state (Haskell, Clojure, Scala).
One is good for applications that optimize for worst-case latency (Erlang
shines there); the other for those that optimize for average throughput (the
JVM shines there).

------
modeless
I prefer Grand Central Dispatch, which fixes the callback model by providing
closures and anonymous functions as an extension to the C language.

<http://en.wikipedia.org/wiki/Grand_Central_Dispatch#Examples>

------
finiteloop
This looks really cool. I have been messing around with coroutines a lot
lately as well, testing libtask (<http://swtch.com/libtask/>) as well as
libcoroutine. Is the source for the changes to libcoroutine available? I
didn't see a link from the article.

~~~
coffeemug
We'd like to. The difficulty is that in order to get the performance
improvements we had to integrate three different components: a few patches to
libcoroutine, a patch to glibc, and pooling code that's tightly integrated
with our asynchronous IO layer. A compelling patch would span three different
licenses, involve patches to two libraries and re-engineering of the
integrated piece to make it usable by the outside world. We'd like to do it,
but for now this would be too much of a distraction.

------
warrenwilkinson
I've blogged about co-routines before, and this post inspired me to write a
little bit that contrasts RethinkDB with my own special-purpose DB that my
startup uses.

<http://formlis.wordpress.com/2010/12/24/co-routines/>

------
mscarborough
Sounds great, and I love the idea that RethinkDB is following.

I don't know a lot about them, though, and an extra benchmark against a
similar SSD installation of MySQL or Postgres, running the same select query
that achieved 1.5 million QPS, would be pretty interesting.

