
Co-routines as an alternative to state machines (2009) - adamnemecek
http://eli.thegreenplace.net/2009/08/29/co-routines-as-an-alternative-to-state-machines/
======
userbinator
A few observations:

* I've found that state machines are _much_ easier to understand when they consist of actual gotos, with labels for each state; the current state being where execution is.

* Coroutines are one of those things that I think are easier to understand and write in Asm - a yield looks much like a call but you pop the return address off the stack and save it somewhere, so the next call basically becomes a jump to the resumption point.

* Yielding between coroutines is very similar to switching between threads (see the sketch after this list).

* It helps a lot when understanding coroutines if you forget any existing notions of what functions are and have to do, and just focus on how the execution flows.
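
To make the thread-switching similarity concrete, here's a toy sketch in Python (all names invented): two generators plus a tiny round-robin driver behave much like two threads taking turns.

    # Two coroutines "switching" like threads: each yield suspends one
    # and the driver resumes the other where it left off.
    def worker(name, steps):
        for i in range(steps):
            print(name, "step", i)
            yield                    # switch: save position, give up control

    def round_robin(*coros):
        pending = list(coros)
        while pending:
            coro = pending.pop(0)
            try:
                next(coro)           # resume right after its last yield
                pending.append(coro)
            except StopIteration:
                pass                 # finished; drop it

    round_robin(worker("A", 3), worker("B", 2))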

~~~
groovy2shoes
I agree with all of these points.

As much as Dijkstra liked to cry foul on goto, there are some instances where
they really can make imperative code _more_ readable [1]. State machines are a
fine example of this. In languages that lack goto but _do_ have tail call
optimization (such as many functional languages), you can safely substitute
mutual recursion. Then functions become your states, and transitions are tail
calls. Lambda is, after all, the ultimate goto [2].

Edit: forgot the links -_-

[1]:
[http://web.archive.org/web/20090320002214/http://www.ecn.pur...](http://web.archive.org/web/20090320002214/http://www.ecn.purdue.edu/ParaMount/papers/rubin87goto.pdf)

[2]:
[http://dspace.mit.edu/bitstream/handle/1721.1/5753/AIM-443.p...](http://dspace.mit.edu/bitstream/handle/1721.1/5753/AIM-443.pdf)
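
Here's a sketch of the states-as-functions idea in Python, which has no TCO, so a small trampoline stands in for the tail calls (all names invented):

    # Recognizes strings with an even number of '1's. In a language with
    # TCO the state functions would call each other directly; here the
    # trampoline fakes those tail calls without growing the stack.
    def even(s, i):
        if i == len(s):
            return True
        return (odd if s[i] == '1' else even, s, i + 1)

    def odd(s, i):
        if i == len(s):
            return False
        return (even if s[i] == '1' else odd, s, i + 1)

    def trampoline(state, *args):
        result = (state,) + args
        while isinstance(result, tuple):   # a tuple is a pending "tail call"
            fn, *rest = result
            result = fn(*rest)
        return result

    print(trampoline(even, "1001", 0))     # True: two '1's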

~~~
ufo
Since this is the internet, I would like to nitpick and point out that
Dijkstra never said that gotos are always evil. From his considered harmful
letter:

> In [2] Guiseppe Jacopini seems to have proved the (logical) superfluousness
> of the go to statement. The exercise to translate an arbitrary flow diagram
> more or less mechanically into a jump-less one, however, is not to be
> recommended. Then the resulting flow diagram cannot be expected to be more
> transparent than the original one.

The important thing is to make sure that the state of the program can be
modeled by something as static as possible and as close to the source code as
possible. Ideally you should be able to tell what is going on just by looking
at the line of code you are at; you want to avoid the cases where the only way
to understand a program is to go back to "main" and keep track of the whole
code path taken to reach the point you are currently at.

So a state machine with gotos is fine since the state machine states are
strongly correlated to the source lines of code. On the other hand, an
unstructured program translated to a goto-less one via the introduction of
lots of "flag" variables is just as hard to understand as the version using
gotos.

---

By the way, the "GOTO considered harmful" letter is shorter than your average
blog post and is very easy to understand. I highly recommend reading it if you
haven't done so already :)

[http://www.u.arizona.edu/~rubinson/copyright_violations/Go_T...](http://www.u.arizona.edu/~rubinson/copyright_violations/Go_To_Considered_Harmful.html)

~~~
groovy2shoes
> Since this is the internet, I would like to nitpick and point out that
> Dijkstra never said that gotos are always evil.

To further nitpick your nitpick: I never claimed Dijkstra _did_ say that gotos
are always evil. I only said that he cried foul on goto, which he did:

> More recently I discovered why the use of the go to statement has such
> disastrous effects, and I became convinced that the go to statement should
> be abolished from all "higher level" programming languages (i.e. everything
> except, perhaps, plain machine code).

As an aside, the Jacopini paper [1] doesn't really seem to have anything to do
with eliminating jumps. In fact, all of their example normalizations given in
figures 20-22 clearly contain jumps.

[1]:
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.119...](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.119.9119&rep=rep1&type=pdf)

Edit: After further reading the Jacopini paper, I think I see what Dijkstra
was hinting at: abstracting the jumps into 2 basic operations which are less
powerful than a generic jump -- a conditional jump and an iterative jump.
Jumps remain in the normalizations but are more restricted, and in some sense
"abstracted away". However, the paper does explicitly say that not all flow
diagrams can be decomposed this way.

~~~
mtdewcmu
I suspect that the programming landscape has changed so much that the goto
debate of the 70s is no longer relevant. People used to use gotos because they
were necessary; naturally, bad code got written; a theory emerged that goto
caused bad code. Fast forward to the present. Gotos are so feared that most
programmers have never even seen one. Naturally, bad code still gets written.
What we now know for absolute certain is that it's possible to write any kind
of code without gotos (including awful code). If you wanted to revive goto
today, you'd have to force people to use it, because no one would know what to
do with it. There's a goto-free solution for everything. Bad code would still
get written, but I doubt it would look like the bad code of the 70s.

------
pdq
Rob Pike used coroutines for the Go Template implementation, instead of a
standard lexer/parser:

Video:
[https://www.youtube.com/watch?v=HxaD_trXwRE](https://www.youtube.com/watch?v=HxaD_trXwRE)

Slides:
[http://cuddle.googlecode.com/hg/talk/lex.html](http://cuddle.googlecode.com/hg/talk/lex.html)
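
The heart of the talk is a lexer built from state functions, where each state returns the next state to run. A rough Python analogue of that shape (a sketch with invented names, not the actual Go code):

    # Each state function consumes some input and returns the next state
    # (or None when done) -- the control flow *is* the state machine.
    def lex_number(inp, pos, out):
        start = pos
        while pos < len(inp) and inp[pos].isdigit():
            pos += 1
        out.append(("NUMBER", inp[start:pos]))
        return lex_any, pos

    def lex_any(inp, pos, out):
        if pos == len(inp):
            return None, pos              # done: no next state
        if inp[pos].isdigit():
            return lex_number, pos
        out.append(("CHAR", inp[pos]))
        return lex_any, pos + 1

    def lex(inp):
        out, state, pos = [], lex_any, 0
        while state is not None:
            state, pos = state(inp, pos, out)
        return out

    print(lex("12+34"))  # [('NUMBER', '12'), ('CHAR', '+'), ('NUMBER', '34')]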

~~~
ysleepy
OMG that code looks horrible.

Kind of disappointing how he doesn't use proper parser generators and
grammars.

I almost always avoid imperative style when I can go declarative.

~~~
vidarh
Very few production compilers use "proper parser generators and grammars", for
good reasons. He's even listed some of those reasons on one of the earliest
slides.

Basically, the tools are generally not good enough.

------
happy4crazy
Neat, I was just wondering about this. I'm using the same idea with core.async
channels in ClojureScript to parse the BitTorrent peer protocol[0], but I
wasn't sure how to describe it.

[0]
[https://github.com/happy4crazy/ittybit/blob/master/src/ittyb...](https://github.com/happy4crazy/ittybit/blob/master/src/ittybit/protocol.cljs#L93)

~~~
juliangamble
You can read more about the State Machines of core.async here:
[http://hueypetersen.com/posts/2013/08/02/the-state-
machines-...](http://hueypetersen.com/posts/2013/08/02/the-state-machines-of-
core-async/)

------
jacquesm
State machines are an excellent way to get deterministic behavior out of
complex systems. Doing this with co-routines would be a lot harder, if it can
be done at all once you reach a certain level of complexity.

I tried to do something very similar to this in a library I wrote two years
ago; in the end we reverted to state machines because they were so much easier
to tame. They're boring, but sometimes boring is good.

------
kerneis
Coroutines and state machines are indeed equivalent. Here is an example of how
to mechanically translate one into the other, and good reasons why you might
want to:

[http://gabriel.kerneis.info/research/files/kerneis-boutier-2...](http://gabriel.kerneis.info/research/files/kerneis-boutier-2013.pdf)

More context in my work about Continuation-Passing C:
[http://gabriel.kerneis.info/research/](http://gabriel.kerneis.info/research/)
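
For a quick feel of the mechanical translation, here is a toy pair in Python rather than the paper's C transformation (names invented): the same two-step parser as a coroutine, and as the state machine you get by making each suspension point an explicit state.

    def parser_coro():
        header = yield            # suspension point 0: wait for header
        body = yield              # suspension point 1: wait for body
        print("got", header, body)

    class ParserSM:               # the same logic, one branch per yield
        def __init__(self):
            self.state, self.header = 0, None
        def send(self, item):
            if self.state == 0:
                self.header, self.state = item, 1
            elif self.state == 1:
                print("got", self.header, item)
                self.state = 2    # finished

    c = parser_coro()
    next(c)                       # prime: run to the first yield
    c.send("HDR")
    try:
        c.send("BODY")            # prints, then the coroutine finishes
    except StopIteration:
        pass

    m = ParserSM()
    m.send("HDR")
    m.send("BODY")                # prints the same thing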

~~~
drudru11
Reading your papers now. Great stuff.

------
munro
The blame is being placed on the wrong thing. It's not the state graph model
that is the issue, but modeling it with object-oriented constructs. The
article actually agrees that the concept is succinctly modeled with the state
graph diagram, and yet the object-oriented version is hard to understand [1].

It's like saying Hashmaps are hard to use, when using List operations to
manipulate a map stored as a list of tuples.

What the author needs is a library that can model state graphs in a much more
readable way. I developed a state graph library for JavaScript when creating a
turn-based game [2], which helped clean up my code immensely. There also
appears to be a fairly popular Python library, with a nice looking API [3].

[1] "it’s generally difficult to understand what’s going on in the code
without having some sort of a state diagram in front of your eyes." [2]
[https://github.com/munro/stategraph](https://github.com/munro/stategraph) [3]
[https://github.com/jtushman/state_machine](https://github.com/jtushman/state_machine)

~~~
userbinator
There is nothing "object oriented" about the example, except for the fact that
the code is in a class.

The "proper object oriented way" would be the State design pattern, using
separate classes for each state and replacing the if/else with a virtual call,
which I agree would be even more obfuscation of the control flow:

[http://en.wikipedia.org/wiki/State_pattern](http://en.wikipedia.org/wiki/State_pattern)

No, I don't think the answer is to make yet another library, unless it's one
that encodes the state graph as a table.
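
For what it's worth, such a table can be as simple as a dict of dicts. A sketch in Python (states and events invented for illustration):

    # The whole state graph in one table: {state: {event: next_state}}.
    TRANSITIONS = {
        "idle":    {"start": "running"},
        "running": {"pause": "paused", "stop": "idle"},
        "paused":  {"start": "running", "stop": "idle"},
    }

    def step(state, event):
        try:
            return TRANSITIONS[state][event]
        except KeyError:
            raise ValueError("no transition from %r on %r" % (state, event))

    state = "idle"
    for event in ["start", "pause", "start", "stop"]:
        state = step(state, event)
    print(state)  # back to "idle"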

~~~
munro
Yep! I completely agree, there is nothing object oriented about the example.
So why use constructs meant for describing object-oriented code? Though the
fact that someone defined a pattern for designing state machines on top of
object-oriented constructs is interesting; thanks for bringing that up.

The other option is to build a model for talking about state machines on top
of Python's features (like decorators, metaclasses, even coroutines!) rather
than strict OOP, so it feels more like idiomatic Python than, say, Java. And
when someone has already written the code, I say why reinvent the wheel! :)

~~~
Too
Classes are a way to encapsulate state; you don't have to use them together
with OO. How else would you describe the example in Python _without_ using
object-oriented constructs? Global variables, or passing around a dictionary
with all the things set up in init?

~~~
munro
That's a distraction from the main point, which is: state graphs rock! They
deserve a high-level way of talking about them in any language of choice,
which most languages don't include in the stdlib, so one may have to roll
their own or find a library. The article simply confused code that's
"difficult to understand" with state graphs being bad.

> Classes are a way to encapsulate state, you don't have to use them together
> with oo.

Lexical closures are another way to encapsulate state, modules are another,
then there are monads! People have thought of lots of cool ways to encapsulate
state.

> How else would you describe the example in python without using object
> oriented constructs? Global variables or passing around a dictionary with
> all the things set up in init?

I would build the example on state graph terminology, instead of directly on
top of Python's classes. Though, as userbinator pointed out, someone has
described an OOP pattern for talking about state graphs [1].

Unfortunately there is no built-in way to talk about state graphs in Python.
To build one, Python has lots of fun features I can leverage! I wouldn't
artificially limit myself. But first I would definitely research whether
someone has already written a good library.

[1]
[http://en.wikipedia.org/wiki/State_pattern](http://en.wikipedia.org/wiki/State_pattern)

------
dimatura
This is very much the same reasoning behind protothreads[1]. As a bonus, the
implementation of protothreads uses a neat trick reminiscent of Duff's device
[2].

[1] [http://dunkels.com/adam/pt/](http://dunkels.com/adam/pt/)

[2]
[http://en.wikipedia.org/wiki/Duff's_device](http://en.wikipedia.org/wiki/Duff's_device)

------
hadoukenio
The great thing about state machines is that they are easy to debug. Once you
have co-routines firing off co-routines, you're in co-routine spaghetti. In
fact, I'll coin the term now:

    
    
      co-routines = "asynchronous gotos".

~~~
eliben
Can you clarify why state machines are easier to debug?

~~~
hadoukenio
I've worked on large state machines in the past, which were rock solid and
easy to work with. Because the code was built from state transition tables,
the state handling was automatic. I would hate to think of how to juggle these
systems using co-routines.

~~~
eliben
I'm not sure you're answering my question, though. You made a claim that state
machines are easy to debug and coroutines are not. What makes you think they
are not? I'm really just curious.

FWIW I've done significantly more work with state machines in the past, and I
find them reasonable to debug, though not as easy as straight-line code,
because in the latter the full state history (the execution stack) is always
known, while with state machines you usually need to do extra work to trace
how the machine _got to_ some state.

~~~
hadoukenio
Sorry, I forgot my "why". Probably because state transition tables make the
system's whereabouts salient. Compare this to co-routines sprinkled throughout
a large code base, essentially creating asynchronous goto hell.

------
shoover
Good laughs chasing links:

- Simon Tatham: "PuTTY is a Win32 Telnet and SSH client. The SSH protocol
code contains real-life use of this coroutine trick. As far as I know, this is
the worst piece of C hackery ever seen in serious production code."
[http://www.chiark.greenend.org.uk/~sgtatham/coroutines.html](http://www.chiark.greenend.org.uk/~sgtatham/coroutines.html)

- Tom Duff: "Yeah. I knew this. In my piece about The Device, I mention
something about a similar trick for interrupt-driven state machines that is
too horrible to go into. This is what I was referring to."
[http://brainwagon.org/2005/03/05/coroutines-in-c/#comment-18...](http://brainwagon.org/2005/03/05/coroutines-in-c/#comment-1878)

------
joefkelley
This is cool for this example, but it seems like for a non-trivial state
machine (I'm thinking TCP open/close or similar) the co-routine code could get
very messy. It's really a small subset of state machines for which this is a
good idea, or maybe even possible.

------
olegp
Co-routines are great! Have been using them as part of Common Node
([https://github.com/olegp/common-node](https://github.com/olegp/common-node))
via node-fibers for a number of projects. Both stability and performance have
been admirable.

For example at StartHQ ([https://starthq.com](https://starthq.com)) we have a
framework that automatically fetches data about web apps from a number of
sources, some as frequently as once every few minutes (blog RSS feeds) for all
the 2K+ apps. At the same time we've had as many as a few hundred concurrent
visitors on site at peak, all on an EC2 micro instance, with no noticeable
impact on responsiveness.

------
mtdewcmu
Theoretically speaking, any machine that keeps a fixed (finite) amount of
state is a finite state machine. Once you recognize that, you will realize
that you have written lots of them without even trying. If a problem
inherently requires a fixed amount of state, writing a finite state machine is
virtually inevitable. So it's not really a choice of whether to write a finite
state machine; it's a choice of how you want it to look; it's a question of
style. The state machine style assigns a name to each state, and makes the
transitions very obvious. Co-routines can be finite state machines, quite
clearly. You could argue that they do not follow the state machine style,
though.

------
byuu
At one point I was attempting to simulate multiple processors running in
parallel. It turns out this behavior can get really fine-grained.

So a processor executes instructions, and an instruction/opcode is made up of
several cycles. Say for an "INCrement" instruction, there would be a cycle to
fetch the opcode prefix, another to fetch the location to increment, another
to read the value at that address, another to increment the value, and another
to write the value back to that address.

There can arise a case where one CPU is writing to memory right as another is
reading that same address. And depending on whether you synchronize the two
CPUs between opcodes or between cycles, the one reading can end up seeing a
different value. This can, in certain cases, break emulated software if not
done correctly.

It gets even hairier when you factor in that each cycle consists of several
clock ticks. There is time required to hold a value on a bus before a read or
write is acknowledged.

So as I continued to try and increase the precision, I found myself nesting
state machines within state machines. Each CPU core had one for the
instruction table to select the active instruction. And then each instruction
had one for the active cycle. And then each cycle had one for the active
clock. And then there was also an interrupt-driven DMA unit that could trigger
within cycles that had to be accounted for. It was just too complex to
collapse into one giant, 100,000 case state machine inside of a single
function. Imagine trying to implement an x86 CPU inside one switch table
inside one function.

So you would have to traverse through 3-4 state machines to do something as
trivial as "increment an integer", and then return all the way back up the
stack frame and switch to the other CPU. The code ended up being around 90%
state machine maintenance to 10% actual simulation code. And it was painfully
slow.

This led me to the idea of having a separate cooperative thread
(coroutine+stack frame) for each CPU. Whenever one would read from or write to
something the other CPU could see, it would make sure it was 'behind in time'
to the other CPU. Otherwise, it would switch the stack frame and resume where
the other thread left off. When that CPU stopped, it would resume right at our
first CPU's pause. Very highly reciprocal.
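
In Python generators the scheme might be sketched like this (a toy model with invented names, nothing like the real emulator code):

    # "Just in time switching": each CPU is a coroutine that yields its
    # local clock only when it touches shared state; the scheduler always
    # resumes whichever CPU is behind in time.
    def cpu(name, ops):                    # ops: (cycles, touches_shared)
        clock = 0
        for cost, touches_shared in ops:
            clock += cost                  # private work: run freely
            if touches_shared:
                print(name, "syncs at t =", clock)
                yield clock                # let the other CPU catch up
        yield float("inf")                 # finished: never ahead again

    def run(a, b):
        ta, tb = next(a), next(b)
        while ta < float("inf") or tb < float("inf"):
            if ta <= tb:
                ta = next(a)               # A is behind (or tied): resume A
            else:
                tb = next(b)

    run(cpu("A", [(3, False), (1, True), (5, True)]),
        cpu("B", [(2, True), (4, False), (1, True)]))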

The end result was a code reduction from around 350KB for a CPU core to around
35KB for the exact same code. And thanks to what I'll call "just in time
switching", it was possible to run one CPU through hundreds of instructions
when it wasn't communicating with the other, greatly reducing host CPU cache
thrashing. I ended up with a huge speed bonus to boot.

The one problematic area ended up being serialization. Basically, this model
moves the state machine into the host CPU's stack frames. But you can't write
that out portably, so it becomes a real problem to capture an exact point in
your program, write it out to disk, and then resume at that exact point in a
future run. That took a long time to solve, and required some serious trickery
involving "checkpoints" for serialization. So it's something to keep in mind
if you want to use coroutines/cooperative threads and also want to
serialize/unserialize things.

What was refreshing was that the core logic for switching between two threads
is surprisingly simple. For x86, it is:

    
    
        thread_switch(to = ecx; from = edx):
        mov [edx],esp; mov esp,[ecx]; pop eax
        mov [edx+ 4],ebp; mov [edx+ 8],esi; mov [edx+12],edi; mov [edx+16],ebx
        mov ebp,[ecx+ 4]; mov esi,[ecx+ 8]; mov edi,[ecx+12]; mov ebx,[ecx+16]
        jmp eax
    

Basically: save the old stack pointer and the non-volatile registers, swap to
the new stack pointer, restore the other side's non-volatile registers, and
return. The function design is the program-equivalent of a palindrome.

(push/pop tends to be slower on Athlon chips than indirect memory accesses;
and getting the return address into eax instead of relying on ret allows x86
CPUs to start caching the new instructions quicker.)

So, like everyone else who has discovered this concept, I of course wrote my
own library for it in C89, which can be downloaded here:
[http://byuu.org/programming/libco/](http://byuu.org/programming/libco/)

Instead of building in complicated schedulers and hundreds of functions like
other libraries, mine is just four functions taking zero to two arguments
each. No data structures, no #defines. The idea was for the user to write
their own scheduling system that works best for their intended use case,
instead of trying to be one-size-fits-all.

~~~
userbinator
Very interesting. If you're going for size instead of speed, then you can do
that thread switch in 5 instructions:

    
    
        pusha
        mov [edx], esp
        mov esp, [ecx]
        popa
        ret
    

(This might actually be faster on recent CPUs, too.)

~~~
byuu
That will definitely work, and be more portable to esoteric ABIs and calling
conventions. But it was more than 3x as slow on the Pentium 4, Athlon 64 and
Core 2 Duo E6600. I haven't benchmarked since then. But you're pushing and
restoring a whole bunch of volatile registers in vain.

Another fun detail: I tried using xchg r32,m32 to swap the stack pointer out
in one instruction. Turns out that on the Pentium 4 (and probably others), the
instruction is now implemented in microcode, and xchg with a memory operand
carries an implicit lock, making it a guaranteed atomic operation. The
benchmark I wrote ran at least 30x slower than with two mov instructions. I
was absolutely blown away by that. People used that trick all the time in the
8086/80286 days to save a bit on code size (a much bigger deal back then). Yet
that same code, run today, can end up being substantially slower. Not knowing
which opcodes will become slower in the future is a fairly compelling argument
against writing inline assembly for speed.

ARM has nice register lists that you can use to mask out the volatile regs. So
an optimal implementation is something like:

    
    
        push {r4-r11}
        stmia r1, {sp,lr}
        ldmia r0, {sp,lr}
        pop {r4-r11}
        bx lr
    

Moving on to amd64 ... Microsoft ignored the System V ABI (where rbp, rbx, and
r12-r15 are non-volatile) and instead made xmm6-xmm15 non-volatile as well.
This makes a safe thread switch there more than twice as slow. Even their own
fibers implementation ignores xmm6-xmm15, unless a special flag is used.

Probably the strangest was the SPARC CPU. It has register windows for fast
register saves/restores on leaf functions. Pretend your 16 regs were a block
of memory. It gave you 16 blocks of that memory, and you could change one
value to move to a new block of memory. When attempting threading, you
couldn't know if you would recurse enough to exhaust this window. So you had
no choice but to save and restore every single register window. Context
switching became _immensely_ demanding. So much so that GCC offered a
compilation flag to not use register windows in binaries it produced.

The choice of volatile vs non-volatile is really fascinating. The fewer non-
volatile registers you have, the faster both cooperative and preemptive task
switching is. But it also means you have fewer registers that remain safe to
use after function calls.

There's also caller vs callee non-volatility: either the caller has to back up
the regs it thinks the callee will trample (or all of them); or the callee has
to back up the regs it knows it will trample (but may end up backing up regs
the caller doesn't actually care about.)

------
dkarapetyan
Yup, the connection between state machines and co-routines is pretty nice.
Instead of explicit state and a bunch of gotos you get a linear flow
interrupted by 'yield' statements, or whatever the equivalent is in your
co-routine library.

------
tree_of_item
Coroutines as an encoding of state machines...?

~~~
Fasebook
Pretty much, which basically gives state to a stateless system.

