How Rust optimizes async/await (tmandry.gitlab.io)
351 points by tmandry on Aug 19, 2019 | 123 comments

As a newcomer to Rust, I wish this post had been one of the first I read on this topic. It took scouring many, many posts, some of them here on HN, to grasp some of the same ideas. (I may not be alone, judging from the very long discussion the other day: https://news.ycombinator.com/item?id=20719095)

Development of high quality async support in Rust is happening right now, so remember to wear a hard hat. ;-) I like to watch https://areweasyncyet.rs/ and https://this-week-in-rust.org/ to see where things are.

Part of this is just that the design has been in the works for four years and has changed significantly during that time; it’s only now that things are almost stable that it’s worthwhile to write these kinds of things.

Well, there are still three months until it lands on stable; maybe the time just hasn't been ripe for great, understandable posts about the feature until recently.

This is one of the most concise tutorials on how generators, coroutines and futures/promises are related (from first principles) that I've seen.

I'm hopeful that eventually promises and async/await fade into history as a fad that turned out to be too unwieldy. I think that lightweight processes with no shared memory, connected by streams (the Erlang/Elixir, Go and Actor model) are the way to go. The advantage to using async/await seems to be to avoid the dynamic stack allocation of coroutines, which can probably be optimized away anyway. So I don't see a strong enough advantage in moving from blocking to nonblocking code. Or to rephrase, I don't see the advantage in moving from deterministic to nondeterministic code. I know that all of the edge cases in a promise chain can be handled, but I have yet to see it done well in deployed code. Which makes me think that it's probably untenable for the mainstream.

So I'd vote to add generators to the Rust spec in order to make coroutines possible, before I'd add futures/promises and async/await. But maybe they are all equivalent, so if we have one, we can make all of them, not sure.

It's the same underlying mechanism for generators as for futures: they are stackless coroutines. All the space they need for local variables is allocated ahead of time.
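A rough way to see the "allocated ahead of time" part, as a sketch only: the function names below are made up for illustration, and the exact size printed is a compiler detail. The future returned by an async fn is a plain value whose size already includes any local that is still needed after an `.await`.

```rust
use std::mem::size_of_val;

// Hypothetical leaf future, used only for illustration.
async fn leaf() -> u8 {
    1
}

// `buf` is used after the `.await`, so it has to be stored inside the future.
async fn holds_buffer() -> u8 {
    let buf = [7u8; 1024];
    let x = leaf().await;
    buf[0] + x
}

// Size in bytes of the future value, measured before it is ever polled.
fn future_size() -> usize {
    let fut = holds_buffer();
    size_of_val(&fut)
}

fn main() {
    // No stack is allocated for the coroutine; its state is this one value.
    println!("future size: {} bytes", future_size());
}
```

The number printed should be a little over 1024, since the buffer lives across the suspension point.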

In my experience, the fact that they are stackless is not at all obvious when you're coding with them. Rust makes working with them really simple and intuitive.

Debugging can be a pain though, as you may not see the right stack, which makes it harder to follow how the code executed in that context. But yes, rather than writing an async state machine with callbacks, I would prefer this.

https://crates.io/crates/tracing is attempting to solve this issue!

Regarding determinism, async is way more deterministic than multiple threads, because you don’t have arbitrary points where execution contexts can change.

That's true in a way, but only for multithreaded code. Multi-process code with full isolation uses different metaphors like joining threads within higher order functions to achieve parallelism in code that looks single-threaded.

For example, lisp-based languages like Clojure can be statically analyzed and then parallelized so that all noninteracting code runs in its own process. This can also be done for code that operates on vectors like MATLAB and TensorFlow.

For me, isolated processes under the Actor model in languages like Elixir/Erlang and Go are much simpler conceptually than async/await, which is only one step above promises/futures, which are only one step above callback hell. I know that the web world uses async/await for now, but someday I think that will be replaced with something that works more like Go.

> I think that lightweight processes with no shared memory, connected by streams [...] are the way to go.

No, at least not in general. There are a lot of problems in the real world for which "no shared memory" is incompatible with "efficient parallelism".

That's actually not true - the inefficiencies due to copying can be overcome with the runtime or abstractions.

For example, there's the copy-on-write (COW) mechanism of Unix, where memory pages are mapped to the same location until a forked process writes to one, in which case the virtual memory manager makes a mutable copy.

There's also Redux and Clojure's immutable state tree that makes copies under a similar mechanism to COW but through code, since they run at a level of abstraction above C++ or Rust.

My feeling is that these techniques run within a few percent of the speed of hand-optimization. But in the real world, I've seen very little human code remain optimized over the long term. Someone invariably comes along who doesn't understand the principles behind the code and inadvertently does a manual copy somewhere or breaks the O() speed of the algorithm by using the wrong abstractions. Meanwhile immutable code using mechanisms like COW avoids these pitfalls because the code is small and obvious.

I feel that the things that Rust is trying to do were solved long ago under FP, so I don't think it's the language for me. That's also why I moved away from C#, Java, C++, etc. Better languages might be Elixir or Clojure/ClojureScript, although they still have ugly syntax from a mainstream perspective compared to say Javascript or Python. I love that Rust exists as a formal spec of a mature imperative programming language. I think it's still useful in kernels and the embedded space. But I'm concerned that it's borrowing ideas like async/await that trade determinism for performance.

> I feel that the things that Rust is trying to do were solved long ago under FP

I don't think this is true at all. Statically analyzed Sync/Send are an amazing tool that I don't see in any other language. In fact, your point of view w.r.t. the actor model is extremely well-supported in Rust, due to Sync/Send traits. Immutable objects can be shared between threads using simple pointers, and mutable objects can have ownership moved between threads with just a pointer copy.

Those threads may have 'shared memory' in the strict sense of the term. But compared to other languages, Rust makes working with shared memory 'feel' like working with independent processes, except with extremely efficient message passing. The language statically guards against threads interacting directly as long as you don't use primitives like Mutex<T>.
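To make the "ownership moved between threads with just a pointer copy" point concrete, here is a minimal sketch using only the standard library. The function name is made up for illustration. It sends a large `Vec` over a channel and checks that the heap buffer on the receiving side is the same allocation.

```rust
use std::sync::mpsc;
use std::thread;

// Sends a Vec to another thread and reports whether the heap buffer stayed put.
fn send_without_copy() -> (bool, usize) {
    let data = vec![1u8; 1_000_000];
    let before = data.as_ptr() as usize;

    let (tx, rx) = mpsc::channel();
    let handle = thread::spawn(move || {
        let received: Vec<u8> = rx.recv().unwrap();
        (received.as_ptr() as usize, received.len())
    });

    // `data` is moved here; touching it after this line is a compile error.
    tx.send(data).unwrap();

    let (after, len) = handle.join().unwrap();
    (before == after, len)
}

fn main() {
    let (same_buffer, len) = send_without_copy();
    println!("same buffer: {same_buffer}, length: {len}");
}
```

Only the pointer, length and capacity travel through the channel; the million bytes are never copied.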

> For example, the copy on write (COW) mechanism of unix where memory pages are mapped to the same location until a forked process writes to one

That is shared memory – memory which is shared between two or more threads or processes.

> There's also Redux and Clojure's immutable state tree that makes copies under a similar mechanism to COW but through code

That's also shared memory.

"Shared memory" does not necessarily mean "shared mutable memory". Rust, for example, statically prevents memory from being simultaneously shared and mutable (except through synchronization primitives like mutexes etc.). A pure actor-based language, in contrast, would prevent even immutable memory from being shared.
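As a small illustration of "shared but not simultaneously mutable", here is a sketch with scoped threads (names invented for the example). Two threads read disjoint halves of the same `Vec` through plain shared references, and the only mutation goes through a `Mutex`.

```rust
use std::sync::Mutex;
use std::thread;

// Sums 1..=100 on two threads that share `data` immutably.
fn sum_in_parallel() -> u64 {
    let data: Vec<u64> = (1..=100).collect();
    let total = Mutex::new(0u64);
    let (left, right) = data.split_at(50);

    thread::scope(|s| {
        for half in [left, right] {
            let total = &total;
            s.spawn(move || {
                // Reading shared memory needs no synchronization...
                let partial: u64 = half.iter().sum();
                // ...but mutating shared state has to go through the Mutex.
                *total.lock().unwrap() += partial;
            });
        }
    });

    total.into_inner().unwrap()
}

fn main() {
    println!("total = {}", sum_in_parallel());
}
```

Replacing the `Mutex<u64>` with a plain `u64` and writing to it from both threads is rejected at compile time, which is the static guarantee being described.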

>> dynamic stack allocation of coroutines, which can probably be optimized away anyway

This seems interesting. Do you have any pointers to places/papers I can look more into this? I'm also curious, since the stacks have to be rather small when you are running several thousands of coroutines (like Go), how often people get into issues of running out of stack because of some big stack allocation somewhere and stuff like that.

I haven't studied it deeply, but a breadcrumb would be that cooperative threads (green threads) are equivalent to coroutines.

Ok it looks like current techniques are stackless runtimes and compiling coroutines to stackless continuations:

This looks like a rare gem, although I just started reading it:

I grew up with the cooperative multithreading of classic Mac OS and was really shocked when I first saw Javascript back in the 90s and it had no notion of it (because it didn't have generators). That sent us down the callback hell evolutionary dead end, through promises/futures and finally to async/await where we are now. That could have been largely avoided if we had listened to programming language experts!

Oh I think I misunderstood you (or you me), and I didn't really articulate my question well. I'm well aware of how stackless coroutines are implemented.

My main concern was that whenever you do that, you lose the ability to look at a backtrace when something goes wrong. So some implementations keep a separate stack for each coroutine (like Go), and switch SP whenever the context is switched. That way you don't have a bunch of random function calls in your stack trace. The problem though is that the individual stack for each coroutine has to be pretty small in general (since you would spawn hundreds of thousands of these). Go solved this temporarily using segmented stacks, and later ditched that in favor of copying the whole stack over. And I thought you were talking about optimizations around this whole "stackful" coroutines thing so that I can have my backtrace.

As a reminder, you don't need to use async/await to implement socket servers in Rust. You can use threads, and they scale quite well. M:N scheduling was found to be slower than 1:1 scheduling on Linux, which is why Rust doesn't use that solution.

Async/await is a great feature for those who require performance beyond what threads (backed by either 1:1 or M:N implementations) can provide. One of the major reasons behind the superior performance of async/await futures relative to threads/goroutines/etc. is that async/await compiles to a state machine in the manner described in this post, so a stack is not needed.
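For readers who have not seen what "compiles to a state machine" looks like, here is a hand-written approximation. It is only a sketch of the idea, not what rustc actually emits. The enum stands in for a hypothetical `async { step().await + 1 }`, where `step()` is pending once and then yields 41.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// A waker that does nothing; enough for polling by hand in this sketch.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

// One variant per suspension point, plus start and end states.
enum AddOne {
    Start,
    Waiting, // suspended at the `.await`
    Done,
}

impl Future for AddOne {
    type Output = u32;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<u32> {
        let this = self.get_mut();
        match *this {
            AddOne::Start => {
                *this = AddOne::Waiting;
                cx.waker().wake_by_ref(); // ask to be polled again
                Poll::Pending
            }
            AddOne::Waiting => {
                *this = AddOne::Done;
                Poll::Ready(41 + 1)
            }
            AddOne::Done => panic!("polled after completion"),
        }
    }
}

// Polls the future to completion; returns (output, number of polls).
fn drive() -> (u32, u32) {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = AddOne::Start;
    let mut polls = 0;
    loop {
        polls += 1;
        if let Poll::Ready(v) = Pin::new(&mut fut).poll(&mut cx) {
            return (v, polls);
        }
    }
}

fn main() {
    let (value, polls) = drive();
    println!("value = {value}, polls = {polls}");
}
```

Each suspension just returns from `poll`, so there is no separate stack to save or restore; the enum is the whole saved state.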

I find async/await easier to reason about than threads for anything more involved than the 1 request per thread web server use case. This is because you avoid bringing in the abstraction of threads (or green threads) and their communication with one another. You trade syntactical complexity (what color is your function, etc), for semantic complexity (threads, channels, thread safety, lock races).

I would agree to the statement for languages where it's an either/or. E.g. in Javascript there are only callbacks and async/await, so we can avoid all the complexity of threads - which is great!

However, in multithreaded languages it's always an AND. Once you add async/await, people need to know how traditional threading works as well as how async/await works. Rust will also make that very explicit. Even if you use async/await on a single thread, the compiler will prevent accessing data from more than one task, even if those tasks are running on the same thread. So you need synchronization again, with maybe the upside that the primitive can be non-thread-safe (or !Sync in Rust terms) when it only synchronizes between tasks that run on the same thread. But there is also the downside that using the wrong synchronization primitives (e.g. thread-blocking ones) in async code can stall other tasks too.

Overall I think using async/await in Rust is strictly more complex. But it has its advantages in terms of performance, and being able to cancel arbitrary async operations.

They have the same semantic complexity. You have tasks in async/await and you still need to deal with inter-task communication, locking, etc.

The main difference is that in async/await you control where context switches occur and the syntax (.await) explicitly points them out. This means you can often avoid locks and do things in a more straightforward manner.

Note that this applies only to tasks that are !Send. Otherwise, two tasks accessing the same state could be moved to different threads and access that state racily. In that case .await tells you nothing about accessing non-local state. However, for purposefully single-threaded task pools, avoiding locks this way certainly seems possible.
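Here is a sketch of the single-threaded case, with a toy round-robin executor written for this example (`YieldNow`, `worker` and `run_all` are invented names, and the executor ignores wakers and simply re-polls). Two tasks share an `Rc<RefCell<..>>`, which is not thread-safe, and switches can only happen at the explicit `.await`.

```rust
use std::cell::RefCell;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

// Hypothetical helper: a future that suspends exactly once.
struct YieldNow(bool);
impl Future for YieldNow {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            Poll::Pending
        }
    }
}

// No Mutex here: the log is shared through Rc<RefCell<..>>.
async fn worker(name: &'static str, log: Rc<RefCell<Vec<String>>>) {
    for i in 0..2 {
        // Nothing else can run in the middle of this statement...
        log.borrow_mut().push(format!("{name}{i}"));
        // ...because the only switch point is this explicit await.
        YieldNow(false).await;
    }
}

// Polls all tasks in turn on the current thread until every one has finished.
fn run_all() -> Vec<String> {
    let log = Rc::new(RefCell::new(Vec::new()));
    let mut tasks: Vec<Pin<Box<dyn Future<Output = ()>>>> = Vec::new();
    tasks.push(Box::pin(worker("a", log.clone())));
    tasks.push(Box::pin(worker("b", log.clone())));

    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    while !tasks.is_empty() {
        tasks.retain_mut(|t| t.as_mut().poll(&mut cx).is_pending());
    }

    let out = log.borrow().clone();
    out
}

fn main() {
    println!("{:?}", run_all());
}
```

Because `Rc` is not `Send`, these futures could not be handed to a multi-threaded pool in the first place, which is the condition the comment above is pointing at.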

Async/await are effectively threads, the switches are just scheduled statically.

Async/await is also good for integration with other systems where starting new threads is not practical or you are calling non-threadsafe FFI functions. Tokio offers the CurrentThread runtime, which allows concurrency without creating any new threads.

To implement, no. To make a performant server, yes. Most systems have a limited number of threads (a low limit constant compiled into the kernel), and each thread is triggered by the scheduler, not by network events, which is very uneconomical.

With application-level concurrency you get a 100x performance boost for network servers.

On extreme workloads, perhaps. But we have people happily running Rust code with thousands of threads per second in production.

M:N was experimentally found to be slower than 1:1 in Rust.

An epoll I/O loop performs better for most network I/O though, and is simpler to manage when you have to start dealing with out-of-band issues (like efficient heartbeats - every time I've had a conversation about how to move some of the heartbeat code over to async, it comes down to just accepting it isn't going to be as efficient as my C++ implementation and either straining heavily or accepting over-publication).

Last time I looked there were still a couple of extra allocations going on in the compiler too (I was told they were being worked on), and basically the default executor, tokio, wasn't very efficient at changing events in the queue (requiring an extra self signal to get the job done).

I'd be interested to see how little these cost, because there is definitely a cost to the generator implementation. Yes, if I wrote a generator to do this, I couldn't write it better, but I wouldn't write a generator (and that would be a very odd definition of zero-cost, where anything can be called zero-cost, even GC, as long as it is implemented well - well, that depends on whether rustc saves unnecessary state).

> Additionally, it should allow miri to validate the unsafe code inside our generators, checking for uses of unsafe that result in undefined behavior.

This is really useful, as a lot of the buffer handling code needs to use unsafe for efficiency reasons. And the enums sharing values is nice too - hopefully the extra pointer dereferences can be optimized out.

I do worry, though, about all this state sitting on the heap and ruining cache locality on function calls.

If you want to use epoll directly in Rust, go right ahead. Nothing is stopping you. You don't have to switch to C++ to use epoll.

It's pretty clear that most people don't want to write and maintain that kind of code, though—that's what async/await is for. Personally, I hate writing state machines.

They're not very well done. Rust kind of skipped the whole non-blocking thing and went straight to tokio, then had to backtrack a little to mio+tokio, then, when mio was only halfway there, jumped to async/await. (mio originally had tokio rolled into it - it was the whole thing and you had to use mio's sockets and mio's timers on the mio event loop - tokio was then split out - metal io, it was not - this was because there wasn't a non-blocking interface in Rust yet.)

Async is definitely a distraction. Rust should have stabilized other pieces more before jumping to yet another paradigm. It is gaining this collection of half-implemented things. I'm guessing generators are going to be the next move before async/await is even in the print book.

Rust has moved from zero-cost abstraction to "a fairly efficient implementation of everything you can think of".

Rust didn't implement async/await for fun. The team implemented it because it was perhaps the #1 requested feature from users, including me.

You can see it here on HN. What's the #1 comment that always appears on Rust async threads? Invariably it's "why not just use goroutines? That kind of code is so much easier to write!" Now for various reasons M:N threading like in Go doesn't work in Rust. What does work, and offers most of the ergonomics of threads/goroutines, is async/await. That's why the team has prioritized it.

There are people who care less about ergonomics than optimizing every CPU cycle for socket servers. For those people, the focus on getting the ergonomics right before microoptimization of mio and tokio might be frustrating to see. But the overwhelming majority of users just want easy-to-use async I/O. Async/await is an enormous improvement in readability and maintainability over hand-written state machines. Rust would be doing a disservice to its users if it focused on microoptimization of tokio instead.

If you think that Rust doesn't care about addressing the needs of C++ users, I'm not sure what to tell you.

It's not like C++ only focuses on features with zero overhead. The zero-overhead principle means that you don't pay for what you don't use. It doesn't mean that compiler developers are forbidden from working on any language or library feature that consumes more than zero cycles at runtime.

I think the issue is that the zero-cost abstraction terminology is used in two ways often. First, to mean a feature you can use and costs nothing because the compiler will change it into a more common form for you (such as functional programming approaches where it will take multiple filters and map passes and turn it into a single loop), and second, to mean that features added that do add overhead, only do so when you use them (as in, there's no runtime to manage M:N threading unless you include a crate that does so, because developers not using that feature shouldn't have to pay for it).
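The first sense can be illustrated with a small sketch (function names invented here). Both functions compute the same thing, and the usual expectation, which this example does not itself measure, is that an optimized build turns the iterator chain into a single loop much like the hand-written one.

```rust
// Iterator version: one filter pass and one map pass, written separately.
fn with_iterators(xs: &[u32]) -> u32 {
    xs.iter().filter(|&&x| x % 2 == 0).map(|&x| x * x).sum()
}

// Hand-written version: the single loop the chain is expected to fuse into.
fn with_loop(xs: &[u32]) -> u32 {
    let mut total = 0;
    for &x in xs {
        if x % 2 == 0 {
            total += x * x;
        }
    }
    total
}

fn main() {
    let xs = [1, 2, 3, 4, 5, 6];
    println!("{} {}", with_iterators(&xs), with_loop(&xs));
}
```

The second sense is about what the program pays when the feature is absent, which a snippet like this cannot show; that is roughly why the two meanings get conflated.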

I think the terminology here is looser than it should be, and these are not really the same thing, even though I've seen zero-cost abstractions used to describe them both. It causes confusion sometimes, as I believe it is doing here.

Sure, and there are a few hooks for one in the C++ standard library, e.g. std::declare_reachable [1], although it doesn't include an actual implementation. Only mark and sweep is feasible, not mark and compact, for obvious reasons [2]. Here's a mention of garbage collection in Stroustrup's old C++11 FAQ [3].

[1] https://en.cppreference.com/w/cpp/memory/gc/declare_reachabl...

[2] https://herbsutter.com/2011/10/25/garbage-collection-synopsi...

[3] https://web.archive.org/web/20120228143039/http://www2.resea...

Sure. It's called Rc and Arc.

Don't be intentionally obtuse. You know I mean something more like mark and sweep

Reference counting is a form of garbage collection, despite the fact that it does not detect cycles without extra work. This has been the correct terminology in the memory management literature for decades.

Is shared_ptr a zero cost abstraction?

Mostly, with the caveat that in single threaded usage, you're paying for the unnecessary overhead of atomic ops on the reference count.

Rust has `Rc`, which is not atomic but `!Send` so the type system ensures that it stays within a thread, as well as the atomic `Arc`.
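A quick sketch of that split, using only the standard library (the function name is made up): `Rc` stays on one thread with a plain counter, while `Arc` can cross into a spawned thread.

```rust
use std::rc::Rc;
use std::sync::Arc;
use std::thread;

// Returns (Rc strong count, value computed on another thread through an Arc).
fn counts() -> (usize, i32) {
    let local = Rc::new(5);
    let _local2 = Rc::clone(&local); // non-atomic reference count increment

    let shared = Arc::new(5);
    let shared2 = Arc::clone(&shared); // atomic reference count increment
    let handle = thread::spawn(move || *shared2 + 1);

    // thread::spawn(move || *_local2 + 1); // would not compile: Rc<i32> is !Send

    let from_thread = handle.join().unwrap();
    (Rc::strong_count(&local), from_thread)
}

fn main() {
    println!("{:?}", counts());
}
```

So the single-threaded case does not pay for atomics, and the compiler rejects the mistake of sending the cheap version across threads.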

> Rust kind of skipped the whole non-blocking thing and went straight to tokio, then had to backtrack a little to mio+tokio, then, when mio was only halfway there, jumped to async/await.

Not sure exactly what you mean here; mio predates tokio by a couple years, and has been possible to use standalone since before tokio ever became a thing.

Early versions of mio had a bunch of non-essential things like a timer wheel built-in, that were later removed.

What does tokio have to do with that?

This strikes me as needlessly hostile without contributing anything.

Honestly I don’t get why simply using mio (epoll, kqueue, iocp wrapper) is so unpopular.

See these examples of how async/await simplifies looping and error handling: https://docs.rs/dtolnay/0.0.3/dtolnay/macro._01__await_a_min.... Await syntax makes it much easier to compose asynchronous code from different libraries, in the same way that ordinary function calls compose synchronous code. Talking to epoll/mio directly is certainly possible, but it's hard to make two different libraries work well together if they both need to talk to a single epoll handle.

Yeah but you’re talking about things not even stable. Old tokio with closures is terrible imo.

It’s extremely close to being stable; it just missed a release train, because there’s some last minute polish that needs to be done. It will be on its way soon!

Everybody always points to the 5 line basics, never to how complex async/await gets in a real system with timers, timeouts, multiple error types that need to be handled in different ways, heartbeats, side channel info, etc... It isn't better at that point.

Cool, you can write a 5 line echo server easier, but good luck with a high performance server handling important transactions.

(If there is documentation on how to do some of these or good example code, please point me to it, because my last attempt at getting send/rcv heartbeats efficiently didn't work.)

The networking implementations I've had to do were all made a lot easier moving to async/await from hand-written state machines. It's a reduction in LOC terms of 6x or so, and the logic is much, much easier to follow.

State machines are viable if your transition function is such that effectively every combination of (state, input) leads to a different state. If the transition function is mostly describing "go to the (single) next state or error," then you're essentially asking the user to do a lot of bookkeeping that compilers are good at and users are bad at.

Having ported a full peer-to-peer engine in JavaScript from old-school callbacks to async/await, I strongly disagree with your comment.

Async/await really shines when the complexity arises.

People are doing a lot more with async/await than five-line echo servers. They're converting entire codebases and seeing dramatic reductions in noise.

Probably because you're just looking at examples; that's selection bias. The combinators/manual implementing Future approach to async IO gets even worse at scale, async/await is infinitely easier. If you don't like it, nobody is forcing you to use it.

Because most people don't like writing state machines by hand.

Pre-await Tokio code looks way worse to me.

A lot of that, I suspect, is that Rust's borrow checker makes specifying ownership correctly for callback chains extremely painful and unergonomic.

In my experience, closure-based async programming is only sufficiently painless in managed languages to be worth using. When you're dealing with manual memory management (or, rather, smart pointer memory management), you spend a lot more time trying to make the captures hold onto stuff.

I'd agree with this, and emphasize the point that this stuff is really tricky to get right without GC. Fighting the borrow checker is somewhat expected when you're dealing with this level of inherent complexity in your memory management.

One of the key reasons for shipping async/await is that it erases almost all of this difficulty and lets you write straight-line code again.

> tokio, wasn't very efficient at changing events in the queue (requiring an extra self signal to get the job done).

Could you expand on that?

Curious: what is the fundamental difference that lets Go do M:N threading efficiently? Considering that the compiler has far less information than Rust about the program.

Go is more efficient at M:N than Rust can be mostly for two reasons:

1. Go can start stacks small and relocate them, because the runtime needed to implement garbage collection allows relocation of pointers into the stack by rewriting them. Rust has no such runtime infrastructure: it cannot tell what stack values correspond to pointers and which correspond to other values. Additionally, Rust allows for pointers from the heap into the stack in certain circumstances, which Go does not (I don't think, anyway). So what Rust must do is to reserve a relatively large amount of address space for each thread's stack, because those stacks cannot move in memory. (Note that in the case of Rust the kernel does not have to, and typically does not, actually allocate that much physical memory for each thread until actually needed; the reservation is virtual address space only.) In contrast, Go can start thread stacks small, which makes them faster to allocate, and copy them around as they grow bigger. Note that async/await in Rust has the potential to be more efficient than even Go's stack growth, as the runtime can allocate all the needed space up front in some cases and avoid the copies; this is the consequence of async/await compiling to a static state machine instead of the dynamic control flow that threads have.

2. Rust cares more about fast interoperability with C code that may not be compiled with knowledge of async I/O and stack growth. Go chooses fast M:N threading over this, sacrificing fast FFI in the process as every cgo call needs to switch to a big stack. This is just a tradeoff. Given that 1:1 threading is quite fast and scalable on Linux, it's the right tradeoff for Rust's domain, as losing M:N threading isn't losing that much anyway.

Go needs to allocate a growing stack on the heap, needs to move it around, etc. It's not as efficient as Rust's async.

But is it as efficient as, or more efficient than, Linux threads?

I don't have the full answer (and I would love it if someone more knowledgeable could jump in this thread) but I'd say it depends, since there are a few antagonistic effects:

- goroutines are (unless it changed since last time I used it) cooperatively scheduled. It's cheaper than preemptive scheduling, but it can lead to big inefficiencies on some workloads if you're not careful enough (tight loops can hold a (Linux) thread for a long time and prevent any other goroutines from running on this thread).

- goroutines start with a really small (a few kB) stack which needs to be copied to be grown. If you end up with a stack as big as a native stack, you'd have done a lot of copies in the process, that wouldn't have been necessary if the stack was allocated upfront.

I don't think the implication was that Go does it efficiently.

> One of the major reasons behind the superior performance of async/await futures relative to threads/goroutines/etc. is that async/await compiles to a state machine in the manner described in this post, so a stack is not needed.

That's a great optimization but doesn't that mean it also breaks stack traces?

Does anyone know how Rust's implementation compares to C++2a's? The C++ people seem to have spent a lot of time creating an extremely generic framework for async/await wherein it is easy to change out how the scheduler works (I currently have a trivial work stack, but am going to be moving to something more akin to a deadline scheduler in the near future for a large codebase I am working with, which needs to be able to associate extra prioritization data into the task object, something that is reasonably simple to do with await_transform). I am also under the impression that existing implementation in LLVM already does some of these optimizations that Rust says they will get around to doing (as the C++ people also care a lot about zero-cost).

Disclaimer: I'm not an expert on the proposal, but have looked at it some, and can offer my impressions here. (Sorry, this got a bit long!)

The C++ proposal definitely attacks the problem from a different angle than Rust. One somewhat surface-level difference is that it implements co_yield in terms of co_await, which is the opposite of Rust implementing await in terms of yield.

Another difference is that in Rust, all heap allocations of your generators/futures are explicit. In C++, technically every initialization of a sub-coroutine defaults to being a new heap allocation. I don't want to spread FUD: my understanding is that the vast majority of these are optimized out by the compiler. But one downside of this approach is that you could change your code and accidentally disable one of these optimizations.

In Rust, all the "state inlining" is explicitly done as part of the language. This means that in cases where you can't inline state, you must introduce an explicit indirection. (Imagine, say, a recursive generator - it's impossible to inline inside of itself! When you recurse, you must allocate the new generator on the heap, inside a Box.)
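The recursive case can be sketched with futures, which are built on the same mechanism. This is a minimal example with invented names and a toy `block_on` that simply re-polls; a real executor would wait for wake-ups instead.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

// A future cannot contain itself inline, so every level of recursion is
// boxed: the heap allocation is visible right here in the source.
fn sum_to(n: u64) -> Pin<Box<dyn Future<Output = u64>>> {
    Box::pin(async move {
        if n == 0 {
            0
        } else {
            n + sum_to(n - 1).await
        }
    })
}

// Toy driver for this sketch: polls in a loop until the future is ready.
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = Box::pin(fut);
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(v) = fut.as_mut().poll(&mut cx) {
            return v;
        }
    }
}

fn main() {
    println!("sum_to(10) = {}", block_on(sum_to(10)));
}
```

Writing `sum_to` as a plain recursive `async fn` without the box is rejected, because the future's size would have to include itself.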

To be clear, the optimizations I'm talking about in the blog post are all implemented today. I'll be covering what they do and don't do, as well as future work needed, in future blog posts.

One benefit of C++ that you allude to is that there are a lot of extension points. I admit to not fully understanding what each one of them is for, but my feeling is that some of it comes from approaching the problem differently. Some of it absolutely represents missing features in Rust's initial implementation. But as I say in the post, we can and will add more features on a rolling basis.

The way I would approach the specific problem you mention is with a custom executor. When you write the executor, you control how new tasks are scheduled, and can add an API that allows specifying a task priority. You can also allow modifying this priority within the task: when you poll a task, set a thread-local variable to point to that task. Then inside the task, you can gain a reference to yourself and modify your priority.
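A very rough sketch of the executor half of that idea, under heavy assumptions: `Task`, `Executor` and the `priority` field are invented for this example, the executor is single-threaded and busy-polls instead of using wakers, and the thread-local trick for changing priority from inside a task is left out.

```rust
use std::cell::RefCell;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Wake, Waker};

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

// Hypothetical task record: the executor decides what a task carries.
struct Task {
    priority: u8,
    fut: Pin<Box<dyn Future<Output = ()>>>,
}

struct Executor {
    tasks: Vec<Task>,
}

impl Executor {
    // The spawn API is where scheduling data such as a priority is attached.
    fn spawn(&mut self, priority: u8, fut: impl Future<Output = ()> + 'static) {
        self.tasks.push(Task { priority, fut: Box::pin(fut) });
    }

    // Polls every pending task each round, highest priority first.
    fn run(&mut self) {
        let waker = Waker::from(Arc::new(NoopWake));
        let mut cx = Context::from_waker(&waker);
        while !self.tasks.is_empty() {
            self.tasks.sort_by(|a, b| b.priority.cmp(&a.priority));
            self.tasks.retain_mut(|t| t.fut.as_mut().poll(&mut cx).is_pending());
        }
    }
}

// Spawns a low-priority task first, then a high-priority one.
fn demo() -> Vec<&'static str> {
    let log = Rc::new(RefCell::new(Vec::new()));
    let mut ex = Executor { tasks: Vec::new() };

    let l = log.clone();
    ex.spawn(1, async move { l.borrow_mut().push("low"); });
    let l = log.clone();
    ex.spawn(9, async move { l.borrow_mut().push("high"); });

    ex.run();
    let out = log.borrow().clone();
    out
}

fn main() {
    println!("{:?}", demo());
}
```

The task type is entirely the executor's own, so a deadline, a spawn policy, or a pointer to the parent task could sit next to `priority` in the same struct.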

> In C++, technically every initialization of a sub-coroutine defaults to being a new heap allocation. I don't want to spread FUD: my understanding is that the vast majority of these are optimized out by the compiler. But one downside of this approach is that you could change your code and accidentally disable one of these optimizations.

I don't think this is correct. C++ 20 allows a lot of choices to implement it without forcing a heap allocation.

See https://lewissbaker.github.io/2017/11/17/understanding-opera... Also see this video, which goes in depth on how to have billions of coroutines with C++: https://www.youtube.com/watch?v=j9tlJAqMV7U

Thanks for the information!!

On your last paragraph, the thing I'm concerned by is where this extra priority information is stored and propagated, as the term "task" is interesting: isn't every single separate thing being awaited its own task? There isn't (in my mental model) a concept that maps into something like a "pseudo-thread" (but maybe Rust does something like this, requiring a very structured form of concurrency?), which would let me set a "pseudo-thread" property, right?

As an example: if I am already in an asynchronous coroutine and I spawn off two asynchronous web requests as sub-tasks, the results of which will be processed potentially in parallel on various work queues, and then join those two tasks into a high-level join task that I wait on (so I want both of these things to be done before I continue), I'd want the background processing done on the results to be handled at the priority of this parent spawning task; do I have to manually propagate this information?

In C++2a, I would model this by having a promise type that is used for my prioritize-able tasks and, to interface with existing APIs (such as the web request API) that are doing I/O scheduling; I'd use await_transform to adapt their promise type into one of mine that lets me maintain my deadline across the I/O operation and then recover it in both of the subtasks that come back into my multi-threaded work queue. Everything I've seen about Rust seems to assume that there is a single task/promise type that comes from the standard library, meaning that it isn't clear to me how I could possibly do this kind of advanced scheduling work.

(Essentially, whether or not it was named for this reason--and I'm kind of assuming it wasn't, which is sad, because not enough people understand monads and I feel like it hurts a lot of mainstream programming languages... I might even say particularly Rust, which could use more monadic concepts in its error handling--await_transform is acting as a limited form of monad transformer, allowing me to take different concepts of scheduled execution and merge them together in a way that is almost entirely transparent to the code spawning sub-tasks. The co_await syntax is then acting as a somewhat-lame-but-workable-I-guess substitute for do notation from Haskell. In a perfect world, of course, this would be almost as transparent as exceptions are, which are themselves another interesting form of monad.)

The concept of a pseudo-thread you're referring to is a task. A task contains a whole tree of futures awaiting other futures. So no manual propagation is necessary.

Of course, it's possible for tasks to spawn other tasks that execute independently. (To be clear, if you are awaiting something from within your task, it is not a separate task.) For spawning new tasks, there's a standard API[1], which doesn't include any executor-specific stuff like priority. You'll have to decide what you want the default behavior to be when someone calls this; for example, a newly spawned task can inherit the priority of its parent.

To get more sophisticated, you could even have a "spawn policy" field for every task that your first-party code knows how to set. Any new task spawned from within that task inherits priority according to that task's policy. The executor implementation decides what tasks look like and how to spawn new ones, so you can go crazy. (Not that you necessarily should, that is.)

To summarize the Rust approach, I'd say you have 3 main extension points:

1. The executor, which controls the spawning, prioritization, and execution of tasks

2. Custom combinators (like join_all[2]), which allow you to customize the implementation of poll[3] and, say, customize how sub-futures are prioritized (This is at the same level as await, so per-Future, not per-Task.)

3. Leaf futures (like the ones that read or write to a socket). These are responsible for working with the executor to schedule their future wake-ups (with, say, epoll or some other mechanism). For more on this, see [4].

[1]: https://doc.rust-lang.org/1.28.0/std/task/trait.Executor.htm...

[2]: https://rust-lang-nursery.github.io/futures-api-docs/0.3.0-a...

[3]: https://doc.rust-lang.org/1.28.0/std/future/trait.Future.htm...

[4]: https://boats.gitlab.io/blog/post/wakers-i/
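To make extension point 2 a bit more concrete, here is a rough sketch of a hand-written combinator. `Join2`, `NoopWake`, `block_on` and `join_demo` are made-up names for illustration (this is not how `join_all` is implemented), and it boxes its sub-futures only to keep the pinning simple. The point is just that the combinator's own `poll` decides which child gets polled first, using the same `Context` it was handed.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Hypothetical combinator (not the futures crate's join_all): it owns two
// sub-futures and decides the order in which they are polled.
struct Join2<A, B> {
    a: Pin<Box<dyn Future<Output = A>>>,
    b: Pin<Box<dyn Future<Output = B>>>,
    a_out: Option<A>,
    b_out: Option<B>,
}

impl<A, B> Join2<A, B> {
    fn new(
        a: impl Future<Output = A> + 'static,
        b: impl Future<Output = B> + 'static,
    ) -> Self {
        Join2 { a: Box::pin(a), b: Box::pin(b), a_out: None, b_out: None }
    }
}

impl<A: Unpin, B: Unpin> Future for Join2<A, B> {
    type Output = (A, B);

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<(A, B)> {
        let this = self.get_mut();
        // "Prioritization" here is simply: `a` always gets the first turn.
        if this.a_out.is_none() {
            if let Poll::Ready(v) = this.a.as_mut().poll(cx) {
                this.a_out = Some(v);
            }
        }
        if this.b_out.is_none() {
            if let Poll::Ready(v) = this.b.as_mut().poll(cx) {
                this.b_out = Some(v);
            }
        }
        if this.a_out.is_some() && this.b_out.is_some() {
            Poll::Ready((this.a_out.take().unwrap(), this.b_out.take().unwrap()))
        } else {
            Poll::Pending
        }
    }
}

// Waker that does nothing; good enough for a busy-polling toy executor.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

// Toy executor: polls one future in a loop until it completes.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

// Both sub-futures run as part of the single task driven by block_on.
fn join_demo() -> (u32, &'static str) {
    block_on(Join2::new(async { 1u32 }, async { "two" }))
}
```

A real combinator would avoid the boxing with pin projection, and could reorder or skip children based on whatever policy it likes.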

Thank you so much for the context here!! And yeah: a big terminology clash is that many of the libraries for C++ come from a common lineage (almost entirely from Lewis Baker, who has been involved in the STL abstractions, wrote cppcoro, and then got allocated by Facebook to work on Folly) and use the term "task" to essentially mean "a future that you can efficiently co_await". What I'm seeing so far seems reasonably sane and arguably similar to the abstraction I have been building up using lower-level components in C++; which all makes me very happy, as I'm anticipating eventually wanting to rewrite what I have done so far in Rust.

In a Rust implementation of your example, you might not necessarily spawn the two sub-operations as separate tasks. Awaiting them directly in a parent async function (probably using FuturesUnordered, like Promise.all in JS) will cause all of their work to be scheduled, prioritized, cancelled, etc. together because they’ll be a part of the same task. There’s a 1-many relation from tasks to Futures in Rust.

FWIW, what I meant by "join those two tasks into a high-level join task" was "call something akin to Promise.all to compose the two futures into a single one upon which I could await". It sounds like I need to learn more about the concept Rust has for "tasks", as maybe they are providing this "pseudo-thread" abstraction I was discussing in passing. I am seeing terms like "leaf Futures" and "top-level Futures" in some of the documentation I am quickly turning up.

I don’t fully understand your use case, but in case this helps:

- Future is a trait, not a type, so you can write your own future types.

- Future’s poll method takes a context argument, and async/await-based futures pass that argument unchanged to any sub-futures that they await. The context argument is a pointer to a standard library type (std::task::Context), which itself contains a pointer to a “waker” object that’s meant to be supplied by the executor, with some predefined methods. There’s some room for customization here, but it’s not as flexible as it probably should be – e.g. for now, as far as I can tell, you can’t get the pointer back out of the waker, only call the standard methods on it. Thread-local storage is an option, of course.

A “task” is the thing you submit to an executor, that is, the sum of all async/awaits in a single computation. Each await is not its own task.
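A small sketch of the two points above, using only the standard library. `CountingWaker`, `Countdown`, `inner`, `outer` and `count_wakes` are invented for this example. The leaf future is written by hand, the async functions just await it, and the waker that the "executor" supplies at the top is the one the leaf ends up calling, even though it sits two levels down.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Executor-side waker that just counts how often it is woken.
struct CountingWaker(AtomicUsize);

impl Wake for CountingWaker {
    fn wake(self: Arc<Self>) {
        self.0.fetch_add(1, Ordering::SeqCst);
    }
}

// Hand-written leaf future: returns Pending `self.0` times, then Ready.
struct Countdown(u32);

impl Future for Countdown {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.0 == 0 {
            return Poll::Ready(());
        }
        self.0 -= 1;
        // Ask to be polled again, via whatever waker the executor supplied.
        cx.waker().wake_by_ref();
        Poll::Pending
    }
}

// Neither async fn mentions the Context; the compiler threads it through.
async fn inner() {
    Countdown(2).await;
}

async fn outer() {
    inner().await;
    Countdown(1).await;
}

// Drives `outer()` as one task and reports how many wake-ups reached the top.
fn count_wakes() -> usize {
    let counter = Arc::new(CountingWaker(AtomicUsize::new(0)));
    let waker = Waker::from(counter.clone());
    let mut cx = Context::from_waker(&waker);
    let mut task = Box::pin(outer());
    while task.as_mut().poll(&mut cx).is_pending() {}
    counter.0.load(Ordering::SeqCst)
}
```

Here the whole `outer()` call tree is one task; the three `Pending` returns from the leaves all show up as wake-ups on the single waker handed in at the top.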

It's interesting that support for recursion is no longer the default here. A partial reversal of what happened going from Fortran to Algol?

Aside from the high-level similarity of the "function -> state machine" transformation, Rust's is quite a bit different (and IMO both simpler and more flexible).

A C++ coroutine chooses a particular promise type as part of its definition. Its frame defaults to a separate heap-allocation per coroutine, with some allowance for elision. At a suspension point, it passes a type-erased handle to the current coroutine to an `await_suspend` method, which can either `return false` or call `handle.resume()` to resume the coroutine. A stack of `co_await`ing coroutines (or "pseudo-thread" as you call it) is thus a linked list of `coroutine_handle`s stored in the coroutine frames of their await-ees, rooted with whoever is responsible for next resuming the coroutine.

A Rust async function does things inside out, in a sense. It has no promise type; calling one directly returns its frame into the caller's frame, as a value with an anonymous type that implements the `Future` trait. This trait has a single method called `poll`, which resumes the function and runs it until its next suspension point. `poll` takes a single argument, a handle which is used to signal when it is ready to continue. This handle is threaded down through a stack of `poll`s (a "task" or pseudo-thread), and stored with whoever is responsible for notifying the task it should continue.
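A quick way to see the "returns its frame as a value" part in action (names like `work`, `NoopWake` and `lazy_demo` are mine, and the waker is a do-nothing stand-in for a real executor's): calling the async function only builds the future, and the body does not start running until something calls `poll`.

```rust
use std::cell::Cell;
use std::future::Future;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Waker that does nothing; enough to construct a Context by hand.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

async fn work(ran: &Cell<bool>) -> u32 {
    ran.set(true);
    7
}

// Returns (did the body run after the call?, after the poll?, poll result).
fn lazy_demo() -> (bool, bool, Poll<u32>) {
    let ran = Cell::new(false);

    // Calling the async fn only constructs the state machine value.
    let fut = work(&ran);
    let ran_after_call = ran.get();

    // Polling is what actually resumes the function body.
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    let result = fut.as_mut().poll(&mut cx);

    (ran_after_call, ran.get(), result)
}
```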

One implication of the Rust approach is that the "executor" and the "reactor" are decoupled. An executor maintains a collection of running tasks and schedules them. A reactor holds onto those handles and notifies executors of relevant events. This lets you control scheduling without language hooks like await_transform- you can associate your prioritization data with a task when you spawn it on a particular executor, and it can await any reactor without losing that information.

Another implication is that you have a choice of whether to a) `await` a future, making it part of the current task, or b) spawn it as its own task, to be scheduled on its own, much like OS thread APIs. Option (a) can get really interesting with multiple concurrent sub-futures (with things like Promise.all or select); it can be as simple as having the caller poll all its children every time it wakes up, or as complex as wrapping `poll`'s handle argument and implementing your own scheduling within a task.
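As a sketch of what "associate your prioritization data with a task when you spawn it" could look like, here is a deliberately naive executor. Everything in it (`Task`, `PriorityExecutor`, the `u8` priority, `priority_demo`) is hypothetical, and it busy-polls instead of waiting for wakers, so treat it as an illustration of where the data lives rather than a usable scheduler.

```rust
use std::cell::RefCell;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Wake, Waker};

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

// A task is a top-level future plus whatever metadata the executor wants.
struct Task {
    priority: u8,
    future: Pin<Box<dyn Future<Output = ()>>>,
}

// Hypothetical toy executor: always polls the highest-priority task next.
struct PriorityExecutor {
    tasks: Vec<Task>,
}

impl PriorityExecutor {
    fn spawn(&mut self, priority: u8, future: impl Future<Output = ()> + 'static) {
        self.tasks.push(Task { priority, future: Box::pin(future) });
    }

    fn run(&mut self) {
        let waker = Waker::from(Arc::new(NoopWake));
        let mut cx = Context::from_waker(&waker);
        while !self.tasks.is_empty() {
            // The task list is non-empty here, so a maximum always exists.
            let idx = (0..self.tasks.len())
                .max_by_key(|&i| self.tasks[i].priority)
                .unwrap();
            let mut task = self.tasks.remove(idx);
            // A real executor would park a pending task until its waker
            // fires; this sketch just re-queues it, so a high-priority task
            // that stays pending would starve the others.
            if task.future.as_mut().poll(&mut cx).is_pending() {
                self.tasks.push(task);
            }
        }
    }
}

// Spawns a low-priority task first, then a high-priority one, and records
// the order in which they actually ran.
fn priority_demo() -> Vec<&'static str> {
    let log = Rc::new(RefCell::new(Vec::new()));
    let mut ex = PriorityExecutor { tasks: Vec::new() };
    let (low, high) = (log.clone(), log.clone());
    ex.spawn(1, async move { low.borrow_mut().push("low"); });
    ex.spawn(9, async move { high.borrow_mut().push("high"); });
    ex.run();
    let order = log.borrow().clone();
    order
}
```

Anything awaited inside one of those tasks is polled as part of it, so the priority never has to be passed down by hand.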

My understanding is that C++ opted more for a coroutine-first design, where a very generic coroutine abstraction is in the center of the design, and other things (like generators and async functions) are built around it. That makes it very universal - but probably also harder to understand if one only has a specific use-case.

Rust's design, by comparison, was focused not only on async functions as the main design goal, but also on maintaining compatibility with a "struct Future" type which can also be implemented by hand and represents a state machine.

The latter will allow Rust to reuse lots of async infrastructure that had been built in the (combinators & state-machine) Futures world in the last years (e.g. the tokio library and everything on top of it).

One downside of Rust's approach might be that some parts feel a bit awkward and hard, e.g. the 2 different states of Futures (one where it hasn't been executed and can be moved around, and one where it has started executing and can't be moved anymore) and the pinning system. As far as I understand C++ exposes less of those details to end-users - this might be something where the implicit allocations might have helped it.

As far as I understand, the async function flavor of C++ coroutines also has run-to-completion semantics and can't be cancelled at any yield point like Rust's Futures can be. This has the advantage of being able to wrap IO completion based operations in a more natural fashion than Rust. But it then again has the downside that users need to pass CancellationTokens around for cancellation, and that some code might not be cancellable.
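To illustrate the Rust side of that cancellation point, a minimal sketch (`Cleanup`, `work`, `NoopWake` and `cancel_demo` are invented names, and `std::future::pending` stands in for an I/O operation that has not completed yet): dropping a future that is suspended at an await point cancels it, running destructors for its live locals but never the code after the await.

```rust
use std::cell::Cell;
use std::future::Future;
use std::sync::Arc;
use std::task::{Context, Wake, Waker};

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

// Sets a flag when dropped, so we can observe cleanup happening.
struct Cleanup<'a>(&'a Cell<bool>);
impl Drop for Cleanup<'_> {
    fn drop(&mut self) {
        self.0.set(true);
    }
}

async fn work(cleaned_up: &Cell<bool>, finished: &Cell<bool>) {
    let _guard = Cleanup(cleaned_up);
    // Stands in for I/O that has not completed yet.
    std::future::pending::<()>().await;
    finished.set(true);
}

// Returns (did the guard's destructor run?, did the body finish?).
fn cancel_demo() -> (bool, bool) {
    let cleaned_up = Cell::new(false);
    let finished = Cell::new(false);
    {
        let waker = Waker::from(Arc::new(NoopWake));
        let mut cx = Context::from_waker(&waker);
        let mut fut = Box::pin(work(&cleaned_up, &finished));
        // First poll runs up to the await and suspends there.
        assert!(fut.as_mut().poll(&mut cx).is_pending());
    } // `fut` is dropped here, and that drop is the cancellation.
    (cleaned_up.get(), finished.get())
}
```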

I don't quite follow. What exactly is the overhead that other languages have for futures that is eliminated here?

Going down the same rabbit hole earlier this week, found this to be a good explanation:

All of the data needed by a task is contained within its future. That means we can neatly sidestep problems of dynamic stack growth and stack swapping, giving us truly lightweight tasks without any runtime system implications. ... Perhaps surprisingly, the future within a task compiles down to a state machine, so that every time the task wakes up to continue polling, it continues execution from the current state—working just like hand-rolled code. [1]

[1] https://aturon.github.io/blog/2016/09/07/futures-design/

> Perhaps surprisingly, the future within a task compiles down to a state machine, so that every time the task wakes up to continue polling, it continues execution from the current state

How are those tasks implemented, and what's scheduling them?

You need a third-party library to provide an executor. Rust does not come with one to keep the runtime size small. The community seems to have centralized on https://tokio.rs/ (under the hood it uses epoll/whatever OS-specific functionality to schedule M:N)

See: https://news.ycombinator.com/item?id=20722297

How's that play out in terms of, say, performing some long-running calculations, rather than something that performs IO?

(Of course, this is Tokio-specific)

Cooperative scheduling is used to schedule tasks on executors. A single executor is expected to manage many tasks across a small set of threads. There will be a far greater number of tasks than threads. There also is no pre-emption. This means that when a task is scheduled to execute, it blocks the current thread until the poll function returns.

Because of this, it is important for implementations of poll to only execute for very short periods of time. For I/O bound applications, this usually happens automatically. However, if a task must run a longer computation, it should defer work to a blocking pool [1] or break up the computation into smaller chunks and yield back to the executor after each chunk. [2]

"poll" here refers to the callback function in the future that actually does the work.

[1] https://docs.rs/tokio-threadpool/0.1.15/tokio_threadpool/fn....

[2] https://tokio.rs/docs/internals/runtime-model/
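For the "break up the computation into smaller chunks and yield" option, here is a rough standard-library-only sketch. `YieldNow`, `chunked_sum` and `sum_and_count_polls` are hypothetical names (this is not Tokio's API), the driver loop is a stand-in for a real executor, and `chunk_len` is assumed to be non-zero.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

// Hypothetical helper: Pending on the first poll (after requesting a
// wake-up), Ready on the second. Awaiting it hands control back once.
struct YieldNow {
    yielded: bool,
}

impl Future for YieldNow {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.yielded {
            return Poll::Ready(());
        }
        self.yielded = true;
        cx.waker().wake_by_ref();
        Poll::Pending
    }
}

// Long-running computation split into chunks, yielding after each one.
// Assumes chunk_len > 0 (slice::chunks panics on zero).
async fn chunked_sum(data: &[u64], chunk_len: usize) -> u64 {
    let mut total = 0;
    for chunk in data.chunks(chunk_len) {
        total += chunk.iter().sum::<u64>();
        YieldNow { yielded: false }.await;
    }
    total
}

// Stand-in for an executor: polls until done and counts the polls, i.e. how
// many times other tasks would have had a chance to run in between.
fn sum_and_count_polls(data: &[u64], chunk_len: usize) -> (u64, usize) {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut task = Box::pin(chunked_sum(data, chunk_len));
    let mut polls = 0;
    loop {
        polls += 1;
        if let Poll::Ready(total) = task.as_mut().poll(&mut cx) {
            return (total, polls);
        }
    }
}
```

Each individual `poll` now only does one chunk of work, which is what keeps the executor thread from being blocked for long.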

When you want to do long-running blocking work (either computation, or synchronous IO from some library that doesn't use futures), you usually want to farm that work out to a thread pool. I think Tokio provides convenience functions for doing that.

It's also possible to write a long-running computation as a future, which can yield after some smaller amount of work to let some IO get done. But I'm not sure if that's the recommended approach.

You're essentially implementing statically scheduled cooperative userspace threads on top of a single thread with async/await.

So, for computation, this is a bad solution.

This is a great quote, and one that I missed while first writing the post! I've added it now.

Most languages allocate every future (and sub-future, and sub-sub-future) separately on the heap. This leads to some overhead, allocating and deallocating space to store our task state.

In Rust, you can "inline" an entire chain of futures into a single heap allocation.
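A small sketch of what that looks like in practice (the function names are made up, and the exact sizes depend on the compiler version): the future returned by the outermost async fn is a single value that already contains the state of everything it awaits, so boxing it once covers the whole chain.

```rust
use std::future::Future;
use std::mem::size_of_val;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

async fn leaf(x: u8) -> u8 {
    x
}

async fn middle(seed: u8) -> u8 {
    // Used after the await, so it has to be stored in the future itself.
    let buf = [seed; 64];
    let v = leaf(1).await;
    buf[0] + v
}

async fn top(seed: u8) -> u8 {
    middle(seed).await + 1
}

// Size of the outermost future; it embeds `middle`'s state, including `buf`.
fn chain_size() -> usize {
    size_of_val(&top(3))
}

// Runs the chain with one Box for all three levels of futures.
fn run_chain() -> u8 {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut task: Pin<Box<dyn Future<Output = u8>>> = Box::pin(top(3));
    loop {
        if let Poll::Ready(v) = task.as_mut().poll(&mut cx) {
            return v;
        }
    }
}
```

On an embedded target the `Box` could in principle be replaced by a static or stack slot, since the size is known at compile time.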

In .NET something similar is possible via ValueTask.

ValueTask is more of an optimization for scenarios where the Future (or Task<T>) is often intended to be ready immediately (synchronously). For those cases it's wasteful to first allocate a continuation on the heap, and then not to use it - because the code after the await can directly be executed.

The introduction of ValueTask allowed C# code to only perform dynamic allocations whenever the task definitely needed to be yielded - as opposed to one allocation for every Task<T>. However it doesn't allow for guaranteed avoidance of allocations - like Rust's model can do.

However on the positive side removing the allocations for those cases is probably good enough since the other yields are of low-frequency (when waiting for IO). And Rust code currently requires allocations like Box<dyn Future> in order to make certain things work (like type-erasure and dynamically dispatched interfaces) that are not necessarily required in .NET in the non-yielded case.

From an ergonomic perspective I definitely prefer .NET's easy-by-default model which is still highly optimizable up to the point of avoiding all allocations. But I understand why this wouldn't have worked for Rust and its constraints (no GC, no classes, no runtime).

Thanks for the thoughtful reply.

You are right, however there is a scenario that you have forgotten.

The possibility that the JIT might eventually optimise the code path given enough PGO data, and apply RVO or similar approaches.

Naturally I don't know if RyuJIT can already do this kind of optimization, given that only recently they have strengthened their focus on this area.

However, these kinds of optimizations are already doable in Graal, so it is possible that Microsoft also adds them to RyuJIT.

Which in any case would rule out some of the Rust deployment scenarios I guess.

Has anyone ever done a comparison to see how much overhead this actually adds? I'd be really curious to see this represented in concrete terms.

So for example, in Kotlin each piece of synchronous code within an async function is compiled into what is essentially a Java Runnable object, which must be allocated on the heap.

As far as I understand in Kotlin continuations are only allocated on the heap when necessary (the suspending function can not be executed synchronously). Therefore most allocations should be avoided until blocking for IO.

Ah. What kind of asynchronous task executes so fast that a heap allocation is measurable?

I like to think of it from the other direction. I'm super excited to use futures and async/await in embedded hardware, where "the heap" might not even exist. Hardware interrupts can be treated as events that wake the executor. That lets me write some code like this:

    async fn bracketed_echo() -> ! {
        loop {
            let mut buf = 0;
            Serial.rx(&mut buf).await;
            // (rest of the loop body is cut off here; it writes `buf` back between `>` and `<`)
        }
    }

This reads from the serial port and echoes it back inside of `><`. The code reads very cleanly, does not require the heap, and should be very efficient.

The caveat is that I am also trying to target a platform that Rust / LLVM doesn't support completely yet, so I haven't gotten this to actually work well. I bet the people using the Cortex processor are far ahead of me though.

That's exactly what I am looking forward to, the state machine generation can make a lot of code targeting embedded platforms, especially the IO part, much more comfortable. Might I ask which controller you are using? With Rust I'm still on a Cortex, but looking to apply it on other architectures in the future

A network read or write can be in the order of hundreds of nanoseconds when using an accelerated userspace networking stack. Allocation can be a good chunk of that.

Related questions: does anybody know:

1) which was the first language that introduced the async/await keywords? I want to say C#, but I’m not sure.

2) are there any papers that describe the code transformation to state machine that is commonly performed to implement these keywords?

For #1, there's a well-sourced Stack Exchange post https://softwareengineering.stackexchange.com/a/377514.

Cool, I'll take a look!

I was quickly searching around and found this paper [1]:

"Pause 'n' play: formalizing asynchronous C#".

It looks promising, although it is behind a paywall ;-(. Also, the keyword "formalizing" tells me that maybe this goes a bit deeper than the kind of description I'm looking for...

1: https://dl.acm.org/citation.cfm?id=2367181

sci-hub.tw usually helps with getting papers that are behind paywalls!

I'm a big user of Rust, but I'm kinda dismayed that async IO wasn't part of the original design.

It's nice they're making Futures a zero-cost abstraction, but it feels like it is at the expense of ergonomics.

You have to prioritize something. Development of other features was prioritized over async first. Futures support came later, and async later still. It's an iterative approach. In the end, it let people get a LOT of value out of rust well before async was fully baked.

For myself, I'd been waiting for the syntax to finalize as I'm still learning and didn't want to really delve into old and new ways, though I'm sure I will see them in practice. For others, depending on your needs, if you didn't need it, or were willing to jump through the hoops, you could still get value along the way.

Futures were developed outside Rust core, in a third-party library, before being brought into the language. Working with them in combinator form definitely was less ergonomic, but async/await fixes that.

Fun fact: there was a future type in the rust standard library, long, long ago.


That version of Rust also did async I/O in the runtime. Async I/O has always been part of Rust. The model changed because there was too much overhead doing it the more ergonomic way and it got booted out of the runtime.

Yep, this is a great point.

Someday, we should get a book about the history of Rust together...

I didn’t know about this.. I’d love to read that book :)

I think the year is off by 1 in the blog post, or the title needs a 2018 tag.

That's what I get for publishing late at night. Fixed, thanks!

Former, or that's some impressive time travel knowledge.


How does `yield` work under the hood? Does it add a reference to some array, with code to loop over all the references with some small timeout until the reference status changes from "pending" to "completed"?

Generators do nothing unless you call their resume() method. resume moves the generator from the last state it was in to the next yield (or return).

Internally, when the code hits a yield, it's happening inside the resume method. yield works by saving the current state of the generator in the object (see e.g. resume_from in the post), and returning from resume().

It gets converted into a state machine.

  || {
    yield 2;
    yield 3;
    yield 5;
  }

will get converted to a struct that implements the Generator trait[1] with a resume method something like

  fn resume(self: Pin<&mut Self>) -> GeneratorState<i32, ()> {
    match self.next_state {
      0 => {
        self.next_state = 1;
        Yielded(2)
      }
      1 => {
        self.next_state = 2;
        Yielded(3)
      }
      2 => {
        self.next_state = 3;
        Yielded(5)
      }
      _ => Complete(())
    }
  }

Local variables become fields in the struct and if you have more complex control flow, the state could end up jumping around instead of just increasing by one each time. It's nothing that you couldn't write by hand, but it would be very tedious to do so.

[1] https://doc.rust-lang.org/beta/unstable-book/language-featur...
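Since the Generator trait is unstable, here is a stable-Rust sketch of the "local variables become fields" part, written by hand. `Step` is a stand-in for `GeneratorState`, and `RunningSum` is roughly what a generator like `|| { let mut total = 0; for i in 1..=3 { total += i; yield total; } }` might turn into; the real compiler output will look different.

```rust
// Stand-in for the unstable GeneratorState enum.
#[derive(Debug, PartialEq)]
enum Step {
    Yielded(u32),
    Complete,
}

// The generator's locals (`i` and `total`) live on as struct fields, because
// they have to survive between calls to resume().
struct RunningSum {
    i: u32,
    total: u32,
}

impl RunningSum {
    fn new() -> Self {
        RunningSum { i: 1, total: 0 }
    }

    // Runs the "body" from where it last stopped up to the next yield.
    fn resume(&mut self) -> Step {
        if self.i > 3 {
            return Step::Complete;
        }
        self.total += self.i;
        self.i += 1;
        Step::Yielded(self.total)
    }
}

// Drives the state machine to completion and collects what it yielded.
fn collect_yields() -> Vec<u32> {
    let mut machine = RunningSum::new();
    let mut out = Vec::new();
    while let Step::Yielded(v) = machine.resume() {
        out.push(v);
    }
    out
}
```

Here the loop counter doubles as the state, which is the kind of "jumping around" control flow that gets tedious to write by hand for anything larger.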

Just like for async, borrows across yield points here are a special feature that you couldn't implement in Rust as an open coded state machine, you'd have to find a workaround.

We're getting generators as well? Awesome.

Not necessarily. They're an implementation detail of the compiler, and aren't fully baked yet to boot.

But there's plenty of reason to want generators, including the fact that they let you build streams. And the fact that async/await relies heavily on them has pushed the implementation much closer to being ready. I hope we get them at some point!

They are available in nightly - https://doc.rust-lang.org/beta/unstable-book/language-featur... - but are very unstable.

I think "at some point" is the best answer :)

To be clear on how unstable: they have not received an actual RFC yet, so they’re barely designed at all. It will be some time before they’re stable, as in some sense, they haven’t even started the process.

I’d imagine that they would go through the stages faster than many other features, though, given that the implementation will have received a lot of testing via async/await.

A future is a one-shot generator, give or take.

Good reading list at the bottom of https://areweasyncyet.rs, starting with this post that uses generators as an example: https://boats.gitlab.io/blog/post/2018-01-25-async-i-self-re...

Thank you for the great writeup <3 Eagerly waiting for the upcoming parts - please keep it up!
