
Asynchronous IO in Rust - nercury
https://medium.com/@paulcolomiets/asynchronous-io-in-rust-36b623e7b965
======
jerf
If you use threads, green or otherwise, you don't have to "implement" special
code for composing things together; you get the full set of tools for
composing code, which includes, in passing, state machines among everything
else. This approach basically builds an Inner Platform Effect: an internal,
data-based language for concurrency that the host language interprets, which
will A: forever be weaker than the exterior language (such is the nature of
the Inner Platform Effect) and B: require a lot of work that essentially
duplicates program control flow and all sorts of other things the exterior
language already has.

There are some programming languages that have sufficiently impoverished
semantics that this is their best effort that they can make towards
concurrency.

But this is _Rust_. It's the language that fixes all the reasons to be afraid
of threading in the first place. What's actually _wrong_ with threads here?
This isn't Java. And having written network servers with pretty much every
abstraction so much as mentioned in the article, green threads are a _dream_
for writing network servers. You can hardly believe how much accidental
complexity you're fighting with every day in non-threaded solutions until you
try something like Erlang or Go. Rust could be something that I mention in the
same breath for network servers. But not with this approach.

There's plenty to debate in this post and I don't expect to go unchallenged.
But I would remind repliers that we are _explicitly_ in a Rust context. We
must talk about _Rust_ here, not 1998-C++. What's so wrong with _Rust_
threads, excepting perhaps them being heavyweight? (Far better to solve that
problem directly.)

~~~
pcwalton
> What's so wrong with Rust threads, excepting perhaps them being heavyweight?
> (Far better to solve that problem directly.)

Nothing. If threads work fine, use them! That's what most Rust network apps
do, and they work fine and run fast. A modern Linux kernel is very good at
making 1:1 threading fast these days.
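
For a concrete sense of what "just use threads" looks like, here is a minimal
thread-per-connection echo server using only the standard library (a sketch;
the address, buffer size, and error handling are placeholder choices):

    // Sketch: one native (1:1) OS thread per accepted connection, blocking I/O.
    use std::io::{Read, Write};
    use std::net::TcpListener;
    use std::thread;

    fn main() -> std::io::Result<()> {
        let listener = TcpListener::bind("127.0.0.1:8080")?; // address is arbitrary
        for stream in listener.incoming() {
            let mut stream = stream?;
            thread::spawn(move || {
                let mut buf = [0u8; 1024];
                loop {
                    match stream.read(&mut buf) {
                        // Peer closed the connection or an error occurred.
                        Ok(0) | Err(_) => break,
                        Ok(n) => {
                            if stream.write_all(&buf[..n]).is_err() {
                                break;
                            }
                        }
                    }
                }
            });
        }
        Ok(())
    }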

> And having written network servers with pretty much every abstraction so
> much as mentioned in the article, green threads are a dream for writing
> network servers.

Green threads didn't provide performance benefits over native threads in Rust.
That's why they were removed.

Based on my experience, goroutine-style green threads don't really provide a
performance benefit over native threads. The overhead that really matters for
threads is in stack management, not the syscalls, and M:N threading has a ton
of practical disadvantages (which is why Java removed it, for example). It's
worth going back to the debate from around the time of NPTL and looking at
what the conclusions of the Linux community were (essentially, that 1:1 is the
way to go and M:N is not worth it).

There _are_ benefits to be had with goroutine-style threads in stack
management. For example, if you have a GC that can relocate pointers, then you
can start with small stacks. But note that this is a property of the stack,
not the scheduler, and you could theoretically do the same with 1:1 threading.
It also doesn't get you to the same level of performance as something like
nginx, which forgoes the stack entirely.

If you really want to eliminate the overhead and go as fast as possible, you
need to get rid of not only the syscall overhead but also the stack overhead.
Eliminating the stack means that there is no "simple" solution at the runtime
level: your compiler has to be heavily involved. (Look at async/await in C#
for an example of one way to do this.) This is the approach that I'm most keen
on for Rust, given Rust's focus on achieving maximum performance. To that end,
I really like the "bottom-up" approach that the Rust community is taking:
let's get the infrastructure working first as libraries, and then we'll figure
out how to make it as ergonomic as possible, possibly (probably?) with
language extensions.
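
To make the "no stack" point concrete, this is roughly what a hand-written,
stackless state machine for a single echo exchange looks like without compiler
support. (A sketch only: the `Echo`/`Step` names and the poll-style return
value are illustrative, not any particular library's API.)

    // Sketch: a stackless state machine driven by an external event loop.
    // All of the task's state lives in the enum, not on a per-task stack.
    use std::io::{self, Read, Write};
    use std::net::TcpStream;

    enum Echo {
        Reading { buf: Vec<u8> },
        Writing { data: Vec<u8>, written: usize },
        Done,
    }

    enum Step {
        Ready,
        NotReady, // would block: the event loop should call poll() again later
    }

    impl Echo {
        fn new() -> Echo {
            // Pre-sized read buffer; the size is an arbitrary choice.
            Echo::Reading { buf: vec![0u8; 1024] }
        }

        fn poll(&mut self, sock: &mut TcpStream) -> io::Result<Step> {
            loop {
                match self {
                    Echo::Reading { buf } => match sock.read(buf) {
                        Ok(0) => *self = Echo::Done,
                        Ok(n) => {
                            let data = buf[..n].to_vec();
                            *self = Echo::Writing { data, written: 0 };
                        }
                        Err(e) if e.kind() == io::ErrorKind::WouldBlock => {
                            return Ok(Step::NotReady)
                        }
                        Err(e) => return Err(e),
                    },
                    Echo::Writing { data, written } => match sock.write(&data[*written..]) {
                        Ok(n) => {
                            *written += n;
                            if *written == data.len() {
                                *self = Echo::Done;
                            }
                        }
                        Err(e) if e.kind() == io::ErrorKind::WouldBlock => {
                            return Ok(Step::NotReady)
                        }
                        Err(e) => return Err(e),
                    },
                    Echo::Done => return Ok(Step::Ready),
                }
            }
        }
    }

An event loop (epoll/kqueue, e.g. via something like mio) would keep the
socket non-blocking and call `poll()` each time readiness is signalled;
compiler support along the lines of async/await essentially generates this
enum and its `poll` for you from straight-line code.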

My overarching point is this: it's very tempting to just say "I/O is a solved
problem, just use green threads". But it's not that simple. M:N is obviously a
viable approach, but it leaves a lot of performance on the table by keeping
around stacks and comes with a lot of disadvantages (slow FFI, for example,
and fairness).

~~~
theseoafs
> Green threads didn't provide performance benefits over native threads in
> Rust. That's why they were removed.

This seems wrong. This shouldn't be the point of green threads -- green
threads aren't a "faster" alternative to native threads. How would that work?
How could it be the case that virtual threads implemented in user space are
faster than native threads provided by the OS? I don't think anyone expects
that to be true of green threads in general.

We don't have green threads as a useful abstraction because they're fast; we
have them because they're _flexible_, and because they allow us to easily
capture a lot of patterns. They let you easily and efficiently distribute work
amongst multiple processors without having to think about the subtleties of
the underlying threading model.

Of course, given Rust's other goals, shying away from green threads makes
sense.

~~~
pcwalton
I'm confused. To me, green threads are just an implementation detail. The fact
that green threading is being used shouldn't leak into the interface. To take
Go for an example, you could perfectly well write a conforming implementation
of Go that used 1:1 native threading: it would just have different performance
characteristics and things like LockOSThread() would become a no-op.

Could you elaborate on what you consider the interface difference to be?

~~~
theseoafs
This is backwards. Green threading can be an implementation detail, but in
languages where it's a key feature, green threads let you write code that
would otherwise be incorrect. For example, in a native threading model, it is
an awful idea to spawn a thread for every incoming connection on a server.
It's the easy way to write it, but it's wrong. With a green threading model,
though, that's easy and efficient.

~~~
comex
I was curious what the actual state was of the "modern Linux kernel" pcwalton
mentioned, so I tried running a test program to create a million threads on a
VM - x86-64 with 8GB of RAM, Linux 4.0. For comparison, Go apparently uses
about 4KB per goroutine, so it should be possible to create somewhat under 2
million goroutines. To be fair, I allocated the stacks manually in one large
allocation; otherwise it dies quite quickly running out of VM mappings. I set
the stack size to the pthread minimum of 16KB (actually, I tried cheating and
making it smaller, but it crashed, so I gave up - not a good idea anyway). The
threads waited for an indication from the main thread, sent after thread
creation was done, to exit; in an attempt to avoid the overhead associated
with pthread conditions, I just used futex directly:

    
    
        // `ready` is an int flag shared with the main thread; waiting on it via
        // the raw futex syscall avoids pthread condition-variable overhead.
        while (!ready)
            assert(!syscall(SYS_futex, &ready, FUTEX_WAIT, 0, NULL, NULL, 0));
    
    

The program caused the kernel to hit OOM (rather ungracefully!) somewhere
around 270,000 threads. To see how long it took while ensuring all the threads
actually ran, I reduced the thread count to 200,000, had it join all the
threads at the end, and timed this whole process: after the first run it took
about 4 seconds. (The first run was considerably slower, but that isn't a big
deal for a server, which is the most important use case for having such a
large number of goroutines/threads.) Therefore, the C version uses about 20
microseconds and 32 KB of memory per thread.

For completeness, I also tested a similar Go program on Go 1.4 (the version
available from Debian on the VM); it actually got up to 3,150,000 before OOM,
and took 9 seconds to do 2 million - 4.5 microseconds and 2.7KB per thread.

In other words, Linux is about an order of magnitude slower at managing a large
number of threads. That looks pretty bad, but on the other hand, it's not that
much in absolute terms! I'm pretty sure most server programs don't need more
than 250,000 simultaneous connections (or can afford to spend more than 8GB of
RAM on them) and don't mind spending an extra 20 microseconds to initiate a
connection, so if operating systems other than Linux aren't a concern, they
could be written to create a thread per connection without too much trouble.
It's not going to give you the absolute maximum performance (meaning it's not
appropriate for a decent class of program - then again, I suspect Go isn't
either), but it's not terrible either.

I'd like to see it improve. I wouldn't be surprised if there is (still) some
low hanging fruit; do kernel developers actually care about this use case?

(And yes, I know this doesn't really test performance of the scheduler during
sustained operation. That's its own can of worms.)

~~~
theseoafs
> To be fair, I allocated the stacks manually in one large allocation;
> otherwise it dies quite quickly running out of VM mappings.

Okay, so the test you did doesn't actually reflect the use case in practice.
Can I expect to reach 200,000 threads if the threads are not all created at
exactly the same moment? What if (God forbid) they're doing memory allocation?
And if it does work out, will everything be handled efficiently?

~~~
Scramblejams
Hope comex replies to your question. Typical green thread usage is spawn-em-
as-you-need-em, so if, in order to spawn lots of 1:1 threads, I need to do it
all up front, that could be very limiting or at least complicate things.

~~~
comex
Yeah, I just made a mistake - you can increase the maximum number of mappings
using /proc/sys/vm/max_map_count; I tried doing that and switching back to
normal stack allocation (but still specifying the minimum size of 16KB using
pthread_attr_setstacksize) and it didn't change the number of threads I was
able to create.

...in fact, neither did removing the setstacksize call and having default 8MB
stacks. I guess this makes sense: of course the extra VM space reserved for
the stacks doesn't require actual RAM to back it; there is some page table
overhead, but I guess it's not enough to make a significant difference at this
number of allocations. Of course, on 32-bit architectures this would quickly
exhaust the address space.

If increasing max_map_count hadn't worked, it would still be possible to
allocate stacks on the fly - but you would get a bunch of them in one mmap()
call, and therefore in one VM mapping, and dole them out in userland. However,
in this case guard pages wouldn't separate different threads' stacks, so you
would have to generate code that manually checks the stack pointer to avoid
security issues from stack overflows, rather than relying on crashing by
hitting the guard page. Rust actually already does this, mostly unnecessarily;
I'm not sure what Go is doing these days but I think it does too. Anyway,
given the above result, I suspect this won't be an actual issue, at least
until the number of threads goes up by an order of magnitude or something.

~~~
Scramblejams
Thanks for your reply. Wonder how other platforms fare.

------
amelius
It would be much nicer to use coroutines instead of state machines. That way
you could write code as if it were doing I/O in a "blocking" fashion, while
behind the scenes it is actually asynchronous.
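
For illustration, this is the shape that gives you; the sketch below uses
Rust's async/await with tokio's I/O types as one concrete way to realize it
(the crate choice, address, and buffer size are assumptions, not part of the
point):

    // Sketch: reads and writes look blocking, but each `.await` is a suspension
    // point where the task yields to the event loop instead of tying up a thread.
    use tokio::io::{AsyncReadExt, AsyncWriteExt};
    use tokio::net::TcpStream;

    async fn echo_once(mut sock: TcpStream) -> std::io::Result<()> {
        let mut buf = [0u8; 1024];
        let n = sock.read(&mut buf).await?;  // "blocking" in appearance only
        sock.write_all(&buf[..n]).await?;
        Ok(())
    }

    #[tokio::main]
    async fn main() -> std::io::Result<()> {
        let sock = TcpStream::connect("127.0.0.1:8080").await?; // address is arbitrary
        echo_once(sock).await
    }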

------
mtanski
I think a lot of folks think of network I/O when they say asynchronous I/O.
That's only half the story: unless you're just building proxies and caches,
you have to deal with disk I/O at some point. And async disk I/O is horrible
in every OS / language.

~~~
justincormack
Well, apparently not on Windows. But this is a kernel interface issue not a
language issue, and clearly needs fixing now we have SSDs where having
thousands of outstanding requests is useful, if not required for performance,
unless the hardware APIs are going to change (maybe if they get memory
interfaces this does change).

~~~
mtanski
The async story on Windows is better, but not great. In many cases the async
calls on Windows silently block due to a number of special cases (that are not
so special).

~~~
trentnelson
(Presuming you're referring to this: [https://support.microsoft.com/en-us/kb/156932](https://support.microsoft.com/en-us/kb/156932))

Those are all entirely reasonable things to block for if you understand the
mechanics behind _why_ those calls may block. And none of it should be fast-
path stuff; it's easy to architect a solution where you avoid those things
except in corner cases.

The other thing you're missing is that when you architect around Windows
completion ports and threadpool I/O facilities, it doesn't matter if one
thread blocks every so often. Windows will detect this and schedule another
one to run, such that there is one active thread scheduled on every core.

I/O completion ports facilitate thread-agnostic I/O completion; the thread
that blocked because it was extending an encrypted file won't impede the
latency of other clients because there is no thread-specific association.

I exploit all of this with PyParallel.
[http://pyparallel.org](http://pyparallel.org)

For more details on asynchronous I/O on Windows and why it's fundamentally
better than the UNIX I/O model in every possible way:
[https://speakerdeck.com/trent/pyparallel-how-we-removed-the-...](https://speakerdeck.com/trent/pyparallel-how-we-removed-the-gil-and-exploited-all-cores?slide=47)

------
vvanders
Really awesome stuff, great example of how well composability works with
traits and the nice handling that tagged unions bring to state machines.

------
geertj
This is identical to the Protocol abstraction in Python's asyncio, right?

