Lua has had excellent support for coroutines for a couple of years, and, since its implementation is "stackless" (in the same sense as Stackless Python), its coroutines don't have the limitations of standard Python's. (It doesn't have a GIL either...) There are a couple of papers about coroutines in Lua, but "Revisiting Coroutines" is the best overview. They're surprisingly versatile.
The paper shows how to efficiently implement one-shot continuations on asymmetric coroutines. (There is also a small C patch by Mike Pall that adds a coroutine.copy function to the VM, making multi-shot continuations possible: http://lua-users.org/lists/lua-l/2006-01/msg00652.html, and note the fix in his follow-up post.) Implementing continuations in terms of coroutines rather than the reverse (as is commonly done in Scheme) often leads to clearer code, IMHO.
It's the only way we've found to write understandable, maintainable flows that touch network requests and user input. Explicit continuations are way too error-prone, in our experience.
Oh yeah, software state machines are silly, considering that microprocessors are very efficient state machines. It's funny how algorithms that make perfect sense in Assembler can become so slow and convoluted in an HLL. Coroutines, continuations, and GOTOs are some remedies for this high/low-level disconnect.
I wonder, are software FSMs ever really necessary?
I recently used a FSM approach to simultaneously match multiple patterns against a tree in one walk, to determine which rules match at each node and with which pattern-variable bindings (a dictionary). Specifically, to associate the ordered list of suitable error renderers with each specific validation error that can occur.
The way I normally do pattern-matching is simply to walk the pattern and the tree to match against simultaneously, but of course this approach breaks down if you want to match multiple patterns against one tree in one pass. So I compiled each pattern into a list of closures, each closure representing one step of pattern-matching, a state. Then I walk the input tree normally and at each node, I call "push" on each of the patterns. If the current state of the pattern matches, I push the next state and the new dictionary variable values that were matched at this step. If the match is invalid, I push :invalid on the stack. After visiting the node I call pop on all the patterns.
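To make that concrete, here's a minimal sketch of the idea in Python (my reconstruction, not the poster's actual code; all names here are illustrative, and for simplicity the "patterns" are just root-to-node label paths, without the variable bindings):

```python
class Pattern:
    """One pattern, compiled to a list of step closures; a stack of
    states is pushed/popped as the tree walk descends and ascends."""
    def __init__(self, steps):
        self.steps = steps
        self.stack = [0]          # state 0 = nothing matched yet

    def push(self, label):
        state = self.stack[-1]
        if state is None or state >= len(self.steps):
            self.stack.append(None)              # the ":invalid" marker
        else:
            self.stack.append(self.steps[state](label))

    def pop(self):
        self.stack.pop()

    def matched(self):
        # fully matched when every step has been consumed
        return self.stack[-1] == len(self.steps)

def compile_path(labels):
    # compile a root-to-node label path into one closure per step
    def make_step(index, want):
        return lambda label: index + 1 if label == want else None
    return Pattern([make_step(i, l) for i, l in enumerate(labels)])

def walk(node, patterns, hits):
    """One pass over the tree; every pattern advances in lockstep."""
    label, children = node
    for p in patterns:
        p.push(label)
    if any(p.matched() for p in patterns):
        hits.append(label)
    for child in children:
        walk(child, patterns, hits)
    for p in patterns:
        p.pop()
```

The point is the single walk: adding another pattern adds another stack, not another traversal.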
Sorry if this is confusing. That's just to illustrate that learning the concept of a FSM let me implement something I wouldn't have known how to do otherwise.
Anyway, understanding the concept of a FSM is valuable in itself even if you never were to implement one because thinking about all possible states, inputs (symbols) and transitions helps you think more formally about some problems.
And even if the CPU itself is a state-machine, what if you want to make an emulator for that CPU?
Each method then becomes a concise description of the state transitions:
code
    self atEndOfInput
        ifTrue: [^#final].        "Go to #final state"
    self input == $"
        ifTrue:
            [self writeToOutput.
            ^#insideComment].     "Go to #insideComment"
    ^#code                        "Stay in #code"

insideComment
    self input == $"
        ifTrue: [^#code].         "Go to #code"
    ^#insideComment               "Stay in #insideComment"
Then debugging just becomes a matter of putting breakpoints in methods. #perform: is almost as fast as a regular message send, and the methods themselves are mostly #ifTrue: statements, which are bytecode-optimized, so this is mostly JIT-ed to machine code.
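A rough Python analog of the method-per-state style (my sketch, not the poster's code): one method per state, each returning the name of the next state, with the driver dispatching via getattr the way Smalltalk dispatches via #perform:.

```python
class CommentStripper:
    """Strips quote-delimited comments; each state is a method that
    consumes one character and returns the next state's name."""
    def __init__(self, text):
        self.text, self.out = text, []

    def code(self, ch):
        if ch == '"':
            return 'inside_comment'    # go to inside_comment
        self.out.append(ch)
        return 'code'                  # stay in code

    def inside_comment(self, ch):
        return 'code' if ch == '"' else 'inside_comment'

    def run(self):
        state = 'code'
        for ch in self.text:
            state = getattr(self, state)(ch)   # the #perform: step
        return ''.join(self.out)
```

Breakpoint-per-state debugging works the same way here: set one in the method for the state you care about.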
In my C programming days, I used FSMs heavily. They enabled me to write code to handle extremely tricky, bug-prone problems of various kinds. I wrote them as static data with a tiny interpreter. They took almost no memory, they were super-fast, and they almost always worked the first time. I can't say that FSMs are particularly readable, but they were definitely more readable than tangles of while loops and if statements.
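An illustrative sketch of that "static data plus tiny interpreter" style (my reconstruction in Python, not the poster's C code): the machine is a literal table keyed by (state, input class), and the interpreter is a three-line loop.

```python
# The whole machine as static data: (state, input-class) -> next state.
TABLE = {
    ('code',    'quote'): 'comment',
    ('code',    'other'): 'code',
    ('comment', 'quote'): 'code',
    ('comment', 'other'): 'comment',
}

def classify(ch):
    # map raw input onto the small input alphabet the table uses
    return 'quote' if ch == '"' else 'other'

def run(text, state='code'):
    for ch in text:                      # the tiny interpreter
        state = TABLE[(state, classify(ch))]
    return state
```

In C the table would be a static array indexed by enums, which is why it costs almost no memory and runs fast.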
It's always been a disappointment that FSMs work so badly in most high-level languages. They are indeed a good fit to microprocessors.
My favorite example of that is "protothreads", stackless "threads" in C introduced by Adam Dunkels and related to the Contiki project.
An implementation of the paper fits in a header of about 130 lines, and protothreads conveniently replace state machines with straightforward sequential code.
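Protothreads themselves are C preprocessor macros, but the idea maps closely onto generators; here's a rough Python analogy (mine, not from the library) of sequential code replacing a state machine, plus a toy round-robin scheduler:

```python
def blinker(n):
    # sequential description of behavior; each yield is a point where
    # the C protothread would block and return to the scheduler
    for _ in range(n):
        yield 'on'
        yield 'off'

def scheduler(threads):
    """Round-robin over protothread-like generators until all finish."""
    events = []
    while threads:
        t = threads.pop(0)
        try:
            events.append(next(t))
            threads.append(t)       # still alive: back of the queue
        except StopIteration:
            pass                    # finished: drop it
    return events
```

Without generators (or protothreads), blinker would need an explicit state variable and a switch over it.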
For cool Python coroutines with networking code, have a look at Eventlet. I'm using it for a lot of my server code; it makes server-side socket programming dead-simple to understand.
Also don't forget cogen, a library for network oriented, coroutine based programming using the enhanced generators from python 2.5. I recently used it as a front-end concurrent WSGI server in my Google Summer of Code project.
Besides simulating state machines, coroutines have many other uses as well (e.g. implementing concurrency); they are a very powerful tool that not many programmers know about. For those who are interested, I suggest a tutorial:
Nice explanation, but the Python example could be expressed just as naturally, and with a bit less code, using a generator instead of a coroutine. I'd have added a note on when to use which style: prefer generators since Python supports them better, but use coroutines when the producer of the data stream has to be 'on top' in the control flow.
A "generator" is a subtype of coroutine. Python calls them "generators" because 1) that's what Icon called them, and 2) Python's implementation of coroutines is* restricted by being unable to yield from within called functions, among other things, so they're not as generally useful.
* Was? Maybe Python 3 fixed this. I use Lua instead now -- it's cleaner, consistently gets these things (TCO, lambdas, coroutines, the GIL, etc.) right, and I'm not going to wait forever for Guido to change his mind.
I followed the language of Beazley's slides on Python coroutines: a 'generator' is a thing like this:
def g():
    for i in range(10):
        yield i
and a 'coroutine' is a thing like this:
@coroutine
def c():
    while True:
        x = yield
        print x
The code in the original post is like the latter. g() produces successive outputs when you call .next() on it; c() consumes successive inputs when you call .send(input) on it. The g() style has been in Python longer and is better integrated into the language -- you can say "for x in g():", for example. That's the point I was making, maybe too tersely.
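Here are both styles side by side as runnable code (my sketch, in Python 3 syntax where the thread's snippets are Python 2; the @coroutine decorator that primes the generator is assumed from Beazley's slides):

```python
def g():
    # generator: the caller pulls values out
    for i in range(10):
        yield i

def coroutine(func):
    # priming decorator: advance a fresh coroutine to its first yield
    def start(*args):
        c = func(*args)
        next(c)
        return c
    return start

@coroutine
def c(out):
    # coroutine: the caller pushes values in with .send()
    while True:
        out.append((yield))

produced = list(g())       # pull style: iteration drives g()
consumed = []
sink = c(consumed)
for x in range(3):
    sink.send(x)           # push style: the caller drives c()
```

The asymmetry is exactly the "who is on top" question: with g() the consumer owns the loop; with c() the producer does.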
Your point 2 is still correct, though there's a PEP to make the trampolining used to work around it nicer.
Also, generators cannot yield control while other functions are executing, unless those functions are themselves expressed as generators, and the outer generator is written to yield in response to values yielded by the inner generator. This complicates the implementation of even relatively simple use cases like asynchronous communications, because calling any functions either requires the generator to "block" (i.e. be unable to yield control), or else a lot of boilerplate looping code must be added around every needed function call.
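Concretely, that boilerplate looks like this (my illustration): the outer generator must re-yield everything the inner one produces, because a plain call to inner() cannot suspend the outer generator.

```python
def inner():
    # a sub-task that needs to yield control at each step
    yield 'connect'
    yield 'send'

def outer():
    yield 'start'
    for step in inner():   # manual delegation loop around the "call"
        yield step
    yield 'done'
```

Python 3.3's `yield from` (PEP 380) later collapsed that delegation loop into a single expression.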
It looks like 2.5 fixed that particular issue. (I'm not holding my breath on tail-call optimization, though...)
Good point. I had to re-read the article a couple times to figure out why this wasn't implemented as a generator. The Dunkels' TCP stack approach in C made more sense.
That's a good point... mutual recursion lacks dynamism.
The lack of TCO or ability to do mutual recursion+TCO is really the fault of the language. If you have a state machine with a small number (<20) states, mutual recursion seems like a more natural fit (unless you have to have states determined at run-time).
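Here's what the mutual-recursion style looks like sketched in Python, a language without TCO, using a trampoline to keep the stack flat (even/odd is a toy two-state machine; the names are mine):

```python
def even(n):
    # state "even": accept if input is exhausted, else hand off
    if n == 0:
        return True
    return lambda: odd(n - 1)     # the "tail call" to the other state

def odd(n):
    if n == 0:
        return False
    return lambda: even(n - 1)

def trampoline(thunk):
    """Run thunks in a loop so deep state chains don't grow the stack."""
    while callable(thunk):
        thunk = thunk()
    return thunk
```

With real TCO the lambdas and trampoline disappear and each state is just a direct tail call, which is why this style reads so naturally in languages that have it.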
While coroutines may help in writing more maintainable/readable code, they make for a very inefficient structure since they have to keep a separate stack. Having many concurrent coroutines is probably a bad idea.
This is the classic trade-off between event based processing (state machines) and process based models.
There is not that much overhead. If you are writing an event-based network server, you have to allocate one structure of some type per connection, this is going to use memory regardless of whether it's a coroutine, or just a closure or instance of a class. If your coroutine implementation is good, these representations will be exactly equivalent. (In the real world, it is not as good of course, but it is theoretically possible to make coroutines as space efficient as hand-crafted state machines.)
I did an experiment; I created 100,000 coroutines in Perl (with the Coro module) each implementing a simple network protocol (echo). This used about 14k of RAM per "connection".
So sure, you could do it with less, but I know a lot of people who are happy with using 1.4G of RAM handling three concurrent requests -- this handles 100 thousand. (Right now, my web browser is using half that amount of memory so I can type this post.)
And BTW, this is also not too expensive CPU-wise; under the hood it uses epoll on my machine, which scales linearly over the active connections. With most of the connections idle, this results in very good performance.
Edit: I did the same thing with GHC and Control.Concurrent's forkIO; the overhead is about 3k per thread. Very reasonable.
As a data point, creating 100,000 coroutines in Lua uses about 183 MB*, which works out to about 1.9K apiece. This seems independent of the function included in the coroutine -- "return 3" and a function with several pages of code use the same space. It's probably a small stack, a closure for "upvalues", and a function pointer.
Note that this doesn't include memory for the IO connection, but the original question was about the space overhead of coroutines themselves. How much do Perl and Haskell use without the IO handle?
* On an AMD64 running OpenBSD 4.6. On an i386 (32-bit) system, it took 142 MB.
With empty suspended coros, we use about 169M for 300,000 of them. So that's about .5K per coroutine.
In Haskell, 500,000 threads waiting for an empty mvar used about 1.4G of RAM. But if they just spin forever (foo = foo), not waiting for anything, then it uses 129M for 500,000 threads; .2K each.
So really, coroutines are not costly at all. Most of the price you pay you would have to pay anyway. The overhead of the coroutine itself is tiny.