Lua has had excellent support for coroutines for a couple of years, and, since its implementation is "stackless" (in the same sense as Stackless Python), its coroutines don't have the limitations of standard Python's. (It doesn't have a GIL either...) There are a couple of papers about coroutines in Lua, but "Revisiting Coroutines" is the best overview. They're surprisingly versatile.
The paper shows how to efficiently implement one-shot continuations on asymmetric coroutines. (There is also a small C patch by Mike Pall that adds a coroutine.copy function to the VM, making multi-shot continuations possible: http://lua-users.org/lists/lua-l/2006-01/msg00652.html, and note the fix in his follow-up post.) Implementing continuations in terms of coroutines rather than the reverse (as is commonly done in Scheme) often leads to clearer code, IMHO.
It's the only way we've found to write understandable, maintainable flows that touch network requests and user input. Explicit continuations are way too error-prone, in our experience.
Oh yeah, software state machines are silly, considering that microprocessors are very efficient state machines. It's funny how algorithms that make perfect sense in Assembler can become so slow and convoluted in an HLL. Coroutines, continuations, and GOTOs are some remedies for this high/low-level disconnect.
I wonder, are software FSMs ever really necessary?
I recently used a FSM approach to simultaneously match multiple patterns against a tree in one walk, to determine which rules match at each node and with which pattern-variable bindings (a dictionary). Specifically, to associate the ordered list of suitable error renderers with each specific validation error that can occur.
The way I normally do pattern-matching is simply to walk the pattern and the tree to match against simultaneously, but of course this approach breaks down if you want to match multiple patterns against one tree in one pass. So I compiled each pattern into a list of closures, each closure representing one step of pattern-matching, a state. Then I walk the input tree normally and at each node, I call "push" on each of the patterns. If the current state of the pattern matches, I push the next state and the new dictionary variable values that were matched at this step. If the match is invalid, I push :invalid on the stack. After visiting the node I call pop on all the patterns.
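To make that concrete, here's a minimal sketch of the idea in Python (my reconstruction, not the poster's actual code; all names here are illustrative, and for simplicity the "patterns" are just root-to-node label paths, without the variable bindings):

```python
class Pattern:
    """One pattern, compiled to a list of step closures; a stack of
    states is pushed/popped as the tree walk descends and ascends."""
    def __init__(self, steps):
        self.steps = steps
        self.stack = [0]          # state 0 = nothing matched yet

    def push(self, label):
        state = self.stack[-1]
        if state is None or state >= len(self.steps):
            self.stack.append(None)              # the ":invalid" marker
        else:
            self.stack.append(self.steps[state](label))

    def pop(self):
        self.stack.pop()

    def matched(self):
        # fully matched when every step has been consumed
        return self.stack[-1] == len(self.steps)

def compile_path(labels):
    # compile a root-to-node label path into one closure per step
    def make_step(index, want):
        return lambda label: index + 1 if label == want else None
    return Pattern([make_step(i, l) for i, l in enumerate(labels)])

def walk(node, patterns, hits):
    """One pass over the tree; every pattern advances in lockstep."""
    label, children = node
    for p in patterns:
        p.push(label)
    if any(p.matched() for p in patterns):
        hits.append(label)
    for child in children:
        walk(child, patterns, hits)
    for p in patterns:
        p.pop()
```

The point is the single walk: adding another pattern adds another stack, not another traversal.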
Sorry if this is confusing. That's just to illustrate that learning the concept of a FSM let me implement something I wouldn't have known how to do otherwise.
Anyway, understanding the concept of a FSM is valuable in itself even if you never were to implement one because thinking about all possible states, inputs (symbols) and transitions helps you think more formally about some problems.
And even if the CPU itself is a state-machine, what if you want to make an emulator for that CPU?
Each method then becomes a concise description of the state transitions:
code
    self atEndOfInput
        ifTrue: [^#final].        "Go to #final state"
    self input == $"
        ifTrue:
            [self writeToOutput.
            ^#insideComment].     "Go to #insideComment"
    ^#code                        "Stay in #code"

insideComment
    self input == $"
        ifTrue: [^#code].         "Go to #code"
    ^#insideComment               "Stay in #insideComment"
Then debugging just becomes a matter of putting breakpoints in methods. #perform: is almost as fast as a regular message send, and the methods themselves are mostly #ifTrue: statements, which are bytecode-optimized, so this is mostly JIT-ed to machine code.
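A rough Python analog of the method-per-state style (my sketch, not the poster's code): one method per state, each returning the name of the next state, with the driver dispatching via getattr the way Smalltalk dispatches via #perform:.

```python
class CommentStripper:
    """Strips quote-delimited comments; each state is a method that
    consumes one character and returns the next state's name."""
    def __init__(self, text):
        self.text, self.out = text, []

    def code(self, ch):
        if ch == '"':
            return 'inside_comment'    # go to inside_comment
        self.out.append(ch)
        return 'code'                  # stay in code

    def inside_comment(self, ch):
        return 'code' if ch == '"' else 'inside_comment'

    def run(self):
        state = 'code'
        for ch in self.text:
            state = getattr(self, state)(ch)   # the #perform: step
        return ''.join(self.out)
```

Breakpoint-per-state debugging works the same way here: set one in the method for the state you care about.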
In my C programming days, I used FSMs heavily. They enabled me to write code to handle extremely tricky, bug-prone problems of various kinds. I wrote them as static data with a tiny interpreter. They took almost no memory, they were super-fast, and they almost always worked the first time. I can't say that FSMs are particularly readable, but they were definitely more readable than tangles of while loops and if statements.
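An illustrative sketch of that "static data plus tiny interpreter" style (my reconstruction in Python, not the poster's C code): the machine is a literal table keyed by (state, input class), and the interpreter is a three-line loop.

```python
# The whole machine as static data: (state, input-class) -> next state.
TABLE = {
    ('code',    'quote'): 'comment',
    ('code',    'other'): 'code',
    ('comment', 'quote'): 'code',
    ('comment', 'other'): 'comment',
}

def classify(ch):
    # map raw input onto the small input alphabet the table uses
    return 'quote' if ch == '"' else 'other'

def run(text, state='code'):
    for ch in text:                      # the tiny interpreter
        state = TABLE[(state, classify(ch))]
    return state
```

In C the table would be a static array indexed by enums, which is why it costs almost no memory and runs fast.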
It's always been a disappointment that FSMs work so badly in most high-level languages. They are indeed a good fit to microprocessors.
My favorite example of that is "protothreads", stackless "threads" in C introduced by Adam Dunkels and related to the Contiki project.
An implementation of the paper fits in a header of about 130 lines, and protothreads conveniently replace state machines with straightforward sequential code.
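Protothreads themselves are C preprocessor macros, but the idea maps closely onto generators; here's a rough Python analogy (mine, not from the library) of sequential code replacing a state machine, plus a toy round-robin scheduler:

```python
def blinker(n):
    # sequential description of behavior; each yield is a point where
    # the C protothread would block and return to the scheduler
    for _ in range(n):
        yield 'on'
        yield 'off'

def scheduler(threads):
    """Round-robin over protothread-like generators until all finish."""
    events = []
    while threads:
        t = threads.pop(0)
        try:
            events.append(next(t))
            threads.append(t)       # still alive: back of the queue
        except StopIteration:
            pass                    # finished: drop it
    return events
```

Without generators (or protothreads), blinker would need an explicit state variable and a switch over it.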
For cool Python coroutines with networking code, have a look at Eventlet. I'm using it for a lot of my server code; it makes server-side socket programming dead-simple to understand.
Also don't forget cogen, a library for network oriented, coroutine based programming using the enhanced generators from python 2.5. I recently used it as a front-end concurrent WSGI server in my Google Summer of Code project.
Besides simulating state machines, coroutines have many other uses as well (e.g. implementing concurrency); they are a very powerful tool that not many programmers know about. For those who are interested, I suggest a tutorial:
Nice explanation, but the Python example could be expressed just as naturally, and with a bit less code, using a generator instead of a coroutine. I'd have added a note on when to use which style: prefer generators since Python supports them better, but use coroutines when the producer of the data stream has to be 'on top' in the control flow.
A "generator" is a subtype of coroutine. Python calls them "generators" because 1) that's what Icon called them, and 2) Python's implementation of coroutines is* restricted by being unable to yield from within called functions, among other things, so they're not as generally useful.
* Was? Maybe Python 3 fixed this. I use Lua instead now -- it's cleaner, consistently gets these things (TCO, lambdas, coroutines, the GIL, etc.) right, and I'm not going to wait forever for Guido to change his mind.
I followed the language of Beazley's slides on Python coroutines: a 'generator' is a thing like this:
def g():
    for i in range(10):
        yield i
and a 'coroutine' is a thing like this:
@coroutine
def c():
    while True:
        x = yield
        print x
The code in the original post is like the latter. g() produces successive outputs when you call .next() on it; c() consumes successive inputs when you call .send(input) on it. The g() style has been in Python longer and is better integrated into the language -- you can say "for x in g():", for example. That's the point I was making, maybe too tersely.
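Here are both styles side by side as runnable code (my sketch, in Python 3 syntax where the thread's snippets are Python 2; the @coroutine decorator that primes the generator is assumed from Beazley's slides):

```python
def g():
    # generator: the caller pulls values out
    for i in range(10):
        yield i

def coroutine(func):
    # priming decorator: advance a fresh coroutine to its first yield
    def start(*args):
        c = func(*args)
        next(c)
        return c
    return start

@coroutine
def c(out):
    # coroutine: the caller pushes values in with .send()
    while True:
        out.append((yield))

produced = list(g())       # pull style: iteration drives g()
consumed = []
sink = c(consumed)
for x in range(3):
    sink.send(x)           # push style: the caller drives c()
```

The asymmetry is exactly the "who is on top" question: with g() the consumer owns the loop; with c() the producer does.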
Your point 2 is still correct, though there's a PEP to make the trampolining used to work around it nicer.
Also, generators cannot yield control while other functions are executing, unless those functions are themselves expressed as generators, and the outer generator is written to yield in response to values yielded by the inner generator. This complicates the implementation of even relatively simple use cases like asynchronous communications, because calling any functions either requires the generator to "block" (i.e. be unable to yield control), or else a lot of boilerplate looping code must be added around every needed function call.
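Concretely, that boilerplate looks like this (my illustration): the outer generator must re-yield everything the inner one produces, because a plain call to inner() cannot suspend the outer generator.

```python
def inner():
    # a sub-task that needs to yield control at each step
    yield 'connect'
    yield 'send'

def outer():
    yield 'start'
    for step in inner():   # manual delegation loop around the "call"
        yield step
    yield 'done'
```

Python 3.3's `yield from` (PEP 380) later collapsed that delegation loop into a single expression.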
It looks like 2.5 fixed that particular issue. (I'm not holding my breath on tail-call optimization, though...)
Good point. I had to re-read the article a couple times to figure out why this wasn't implemented as a generator. The Dunkels' TCP stack approach in C made more sense.
That's a good point... mutual recursion lacks dynamism.
The lack of TCO or ability to do mutual recursion+TCO is really the fault of the language. If you have a state machine with a small number (<20) states, mutual recursion seems like a more natural fit (unless you have to have states determined at run-time).
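Here's what the mutual-recursion style looks like sketched in Python, a language without TCO, using a trampoline to keep the stack flat (even/odd is a toy two-state machine; the names are mine):

```python
def even(n):
    # state "even": accept if input is exhausted, else hand off
    if n == 0:
        return True
    return lambda: odd(n - 1)     # the "tail call" to the other state

def odd(n):
    if n == 0:
        return False
    return lambda: even(n - 1)

def trampoline(thunk):
    """Run thunks in a loop so deep state chains don't grow the stack."""
    while callable(thunk):
        thunk = thunk()
    return thunk
```

With real TCO the lambdas and trampoline disappear and each state is just a direct tail call, which is why this style reads so naturally in languages that have it.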
While coroutines may help in writing more maintainable/readable code, they make for a very inefficient structure since they have to keep a separate stack. Having many concurrent coroutines is probably a bad idea.
This is the classic trade-off between event based processing (state machines) and process based models.
There is not that much overhead. If you are writing an event-based network server, you have to allocate one structure of some type per connection, this is going to use memory regardless of whether it's a coroutine, or just a closure or instance of a class. If your coroutine implementation is good, these representations will be exactly equivalent. (In the real world, it is not as good of course, but it is theoretically possible to make coroutines as space efficient as hand-crafted state machines.)
I did an experiment; I created 100,000 coroutines in Perl (with the Coro module) each implementing a simple network protocol (echo). This used about 14k of RAM per "connection".
So sure, you could do it with less, but I know a lot of people who are happy with using 1.4G of RAM handling three concurrent requests -- this handles 100 thousand. (Right now, my web browser is using half that amount of memory so I can type this post.)
And BTW, this is also not too expensive CPU-wise; under the hood it uses epoll on my machine, which scales linearly over the active connections. With most of the connections idle, this results in very good performance.
Edit: I did the same thing with GHC and Control.Concurrent's forkIO; the overhead is about 3k per thread. Very reasonable.
As a data point, creating 100,000 coroutines in Lua uses about 183 MB*, which works out to about 1.9K apiece. This seems independent of the function included in the coroutine -- "return 3" and a function with several pages of code use the same space. It's probably a small stack, a closure for "upvalues", and a function pointer.
Note that this doesn't include memory for the IO connection, but the original question was about the space overhead of coroutines themselves. How much do Perl and Haskell use without the IO handle?
* On an AMD64 running OpenBSD 4.6. On an i386 (32-bit) system, it took 142 MB.
With empty suspended coros, we use about 169M for 300,000 of them. So that's about .5K per coroutine.
In Haskell, 500,000 threads waiting for an empty mvar used about 1.4G of RAM. But if they just spin forever (foo = foo), not waiting for anything, then it uses 129M for 500,000 threads; .2K each.
So really, coroutines are not costly at all. Most of the price you pay you would have to pay anyway. The overhead of the coroutine itself is tiny.