Zero-cost futures in Rust (aturon.github.io)
930 points by steveklabnik on Aug 11, 2016 | 337 comments



This is huge.

This allows one to express concurrency in a natural way that isn't prone to the typical errors of this class of problem, with no runtime overhead, while competing with C in terms of runtime constraints.

Big kudos to Aaron Turon and Alex Crichton. You guys knocked it out of the park.


Agreed, and very cool... I do think that async/await syntax is very nice for reducing a certain amount of nesting bloat. Even with chaining, the code doesn't look as nice IMHO, though it's still better than deeply nested callbacks.
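
For a sense of the chaining style being compared, here's a rough sketch against the futures-rs combinator API (names as they settled in the 0.1 line; fetch_user and fetch_order_ids are hypothetical stand-ins, stubbed with already-completed futures):

    extern crate futures;

    use futures::Future;
    use futures::future;
    use std::io;

    // Hypothetical async operations, stubbed with already-completed futures.
    fn fetch_user(id: u64) -> Box<Future<Item = String, Error = io::Error>> {
        Box::new(future::ok::<String, io::Error>(format!("user-{}", id)))
    }

    fn fetch_order_ids(user: String) -> Box<Future<Item = Vec<u64>, Error = io::Error>> {
        Box::new(future::ok::<Vec<u64>, io::Error>(vec![user.len() as u64, 42]))
    }

    // Chaining keeps the flow flat: each step feeds the next via `and_then`,
    // and the whole pipeline is itself a future that a reactor can drive.
    fn order_count(id: u64) -> Box<Future<Item = usize, Error = io::Error>> {
        Box::new(
            fetch_user(id)
                .and_then(fetch_order_ids)
                .map(|orders| orders.len()),
        )
    }

    fn main() {
        // `wait` just drives the future on the current thread for this demo.
        println!("{}", order_count(7).wait().unwrap());
    }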


Check out https://github.com/tokio-rs/tokio. It builds on top of futures-rs.


As mentioned in the post, given that Rust wants to operate in the same space as C, this approach makes sense. However, at a higher level, when building more complex concurrent systems, dealing with futures/deferreds/promises and/or a central select/epoll/kqueue reactor loop gets daunting and doesn't mix well with complex business rules.

The deferred-based approach has been around for many years. I experienced it by using Twisted (a Python framework) for five or so years, and early on it was great. However, when we switched to using green threads, the logic and the amount of code were greatly simplified.

So I'm wondering if Rust provides any way to add that kind of N:M threading approach, perhaps via an extension, a macro, or some other mechanism.

Note that in C, it being C, such things can be done with some low-level trickery. Here is a library that attempts that:

http://libmill.org/

And there were a few others before, but none have taken off enough to become mainstream.


> So I'm wondering if Rust provides any way to add that kind of N:M threading approach, perhaps via an extension, a macro, or some other mechanism.

I don't want M:N threading as Go implements it. It's a big loss of performance for marginal benefit over futures. In particular the libmill approach was tried in Rust and the results were far worse than 1:1.

However, assuming this takes off I would like to see async/await style syntactic sugar over futures down the road to make it easier to write code that looks blocking but actually isn't. Crucially, this sugar would maintain the zero-cost nature of futures. With that approach, we'd have the ergonomics of languages like Erlang and Go without the significant performance tax associated with M:N.


> I don't want M:N threading as Go implements it.

As someone who writes code for enterprisey businesses doing a lot of I/O-bound stuff, Go-style M:N threading is a godsend compared to Java's standard library and the other common platforms in that space. Being able to express your code in a sequential manner and still get the performance offered by implicit fiber/goroutine/whatever scheduling is pretty awesome. With futures, there's still some language-semantic overhead from the developer's point of view.

In my problem space, M:N threading is simply worth it. And if I need anything better performing, I can still switch to C for specific use cases.


> Being able to express your code in a sequential manner and still get the performance offered by implicit fiber/goroutine/whatever scheduling is pretty awesome.

You don't gain as much performance. On Linux, you don't actually gain that much if anything over 1:1 threading. Most of the benefit of goroutines actually comes from their small stacks, which have nothing to do with M:N vs. 1:1 to begin with—they're a feature of the GC.

As the blog post states, our end goal is to achieve the ergonomics of Go-style M:N without sacrificing nginx levels of performance. With this futures library, we've established the foundations. It would make no sense to give up before even trying.

> And if I need anything better performing, I can still switch to C for specific use cases.

Why would you write networking code in C in 2016, when there are better alternatives available (like this one)?

Rust's philosophy is to have both performance and ergonomics. It rejects the idea that optimal performance requires a cumbersome programming model. I'm not about to give up on that.


> Why would you write networking code in C in 2016, when there are better alternatives available (like this one)?

As someone who's done a bit of game development in the past, I can easily say that in some fields you'll fall flat on your face if you don't write your own protocols. There are simply no protocols that work well in every use case, and there are also use cases without standardized protocols.

For extremely high-performance operations you NEED to use a custom protocol in some cases to squeeze every bit of power out of your application.

From a game development perspective you need to handle two types of content: content that isn't really ordered by time of occurrence, and content that needs to happen sequentially.

Because of this you can correctly optimize your networking stack for those needs and get your networking system really fast.

Some people have taken to using TCP with a JSON message-passing system, and that just doesn't work at large scale. You'll have huge latency issues, and with a big enough game you have to do some serious bit-packing to make it cost-effective to host servers for it.

Now I'm speaking generally and this is just from my experience but there is definitely still a reason to write your own network code.


I believe pcwalton agrees that the ability to write one's own networking code is important. The quoted line is simply questioning why you'd write such code in C (which, if you're a game developer, means that "platform support" is a perfectly valid answer).


Sure, there are valid reasons to write your own network code. The question is whether you should write it in C. :)


> On Linux, you don't actually gain that much if anything over 1:1 threading.

You use green threads instead of native threads because native threads have space overhead, not because they have time overhead. Attempting to spawn 100k OS threads will do strange things to most kernel scheduling algorithms; they're not optimized for that use-case.


> You use green threads instead of native threads because native threads have space overhead, not because they have time overhead.

The main overhead of a thread, green or native, is the stack. The size of the stack is independent of whether you use native or green threads. Go's small stacks are actually made possible by its GC, not its choice of 1:1 or M:N. In musl, for example, you can have 2KB stacks [1] with 1:1.
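
A minimal illustration (my own, not from the thread): Rust's plain 1:1 threads already let you request a small stack via std::thread::Builder, e.g.:

    use std::thread;

    fn main() {
        // Spawn 1:1 OS threads with explicitly small stacks. The platform may
        // round the request up to its minimum stack size (plus a guard page).
        let handles: Vec<_> = (0..100u64)
            .map(|i| {
                thread::Builder::new()
                    .stack_size(32 * 1024) // 32 KiB instead of the multi-MB default
                    .spawn(move || i * i)
                    .expect("spawn failed")
            })
            .collect();

        let sum: u64 = handles.into_iter().map(|h| h.join().unwrap()).sum();
        println!("sum of squares: {}", sum);
    }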

I haven't seen a benchmark of huge numbers of native threads vs. a userland scheduler, but I have a hard time imagining that a userland scheduler will beat the kernel's scheduler. The kernel scheduler has a much more global picture of the system compared to userland.

[1]: https://github.com/rofl0r/musl/blob/d05aaedaabd4f5472c233dbb...


> The kernel scheduler has a much more global picture of the system compared to userland.

In most of the comparisons I've seen (usually for Erlang), worst-case latency was the important factor, so interaction with the kernel scheduler was avoided as much as possible. In the Erlang runtime, you can pass a switch to cause the userland scheduler-threads to each get bound to a particular processor core, and to cause the kernel scheduler to avoid scheduling anything on those cores. Effectively, this partitions the processor into a set of cores the OS scheduler entirely manages, and a set of cores that the userland scheduler entirely manages.


If the kernel scheduler completely hands control of the core over to the program, doesn't that mean you can only run one or two programs on the entire machine at once without running out of cores? Surely the kernel still schedules other threads on that core too.


Yes, that's rather the point: we're talking about highly-multicore server machines (e.g. 16/32 cores, or perhaps far more) entirely dedicated to running your extremely-concurrent application. You want all but one or two of those cores just running the app and nothing else. You leave one or two cores for the "control plane" or "supervisor"—the OS—to schedule all the rest of its tasks on.

It's a lot like a machine running a hypervisor with a single VM on it set to consume 100% of available resources—but actually slightly more efficient than that, since a guest can intelligently allocate cores to itself, pin its scheduler-threads to them, and then just stop thinking about the pinning, while a hypervisor host is stuck constantly thinking about whether its vCPU-to-pCPU mapping is currently optimal, with what's basically a black box consuming those vCPUs.


Is this something Erlang supports, or is it something that you get merely because you've pinned threads to a specific core?

I know with cgroups in Linux you can pin processes to cores pretty easily. Just curious how Erlang makes this easier.


If you mean that in terms of "can you ask Erlang itself to do this for you", then yes: http://erlang.org/doc/man/erl.html#+sbt

If you mean that in terms of "does the Erlang runtime intelligently take advantage of the fact that its schedulers are pinned to cores to do things you don't get from plain OS-level pinning", I'm not sure.

I think it might, though. This is my impression from reading, a year or so back, the same docs I just linked; you can read them for yourself and form your own opinion if you think this sounds crazy:

It seems like ERTS (the Erlang runtime: BEAM VM + associated processes like epmd and heart) has a pool of "async IO" threads, separate from the regular scheduler threads, that just get blocking syscalls scheduled onto them. Erlang will, if-and-only-if it knows it has pinned schedulers, attempt to "pair" async IO threads with scheduler threads, so that Erlang processes that cause syscalls schedule those syscalls onto "their" async IO threads, and the completion events can go directly back to the scheduler-thread that should contain the Erlang process that wants to unblock in response to them.†

In the default case, if you don't tell ERTS any different, it'll assume you've got one (UMA) CPU with N cores, and will try to pin async IO threads to the same cores as their paired scheduler-threads. This has context-switching overhead, but not much, since 1. the async IO thread is mostly doing kernel select() polling and racing to sleep, and 2. the two threads are in a producer-consumer relationship, like a Unix pipeline, where both can progress independently without needing to synchronize.

If you want, though, you can further optimize by feeding ERTS a CPU map, describing how the cores in your machine are grouped into CPU packages, and how the CPU packages are further grouped into NUMA memory-access groups. ERTS will then attempt to schedule its async IO threads onto a separate core of the same CPU package, or if not possible, the same NUMA group as the scheduler-thread, to decrease IPC memory-barrier flushing overhead. (The IPC message is still forced to dump from a given core's cache-lines into the CPU or to NUMA local memory, but it doesn't have to go all the way to main memory.)

ERTS will also, when fed a CPU map, penalize the choice in its scheduling algorithm to move an Erlang process to a different CPU package or NUMA group. (It will still do it, but only if it has no other choice.)

---

† This is in contrast to a runtime without "native" green-threads, like the JVM, where even if you've got an async IO pool, it just sees an opaque pool of runtime threads and sends its completion events to one at random, and then it's the job of a framework like Quasar to take time out of the job of each of its JVM runtime threads to catch those messages and route them to a scheduler running on one of said runtime threads.

The same is true of an HVM hypervisor: without both OS support (paravirtualization) plus core pinning inside each VM, a hardware interrupt will just "arrive at" the same pCPU that asked to be interrupted, even if the vCPU that was scheduled on that pCPU when it made the hypercall is now somewhere else. This is why SR-IOV is so important: it effectively gives VMs their own named channels for hardware to address messages to, so they don't get delayed by misdelivery.


The space overhead is significantly worse for native threading in the presence of many threads since each thread needs both a user and kernel stack.

For M:N, there are N kernel stacks.


Fair enough, but kernel stacks are 8K. 10K user + kernel size is a far cry from the 2MB default pthread stack size people usually talk about when they talk about 1:1.

I don't know of any benchmark comparing Go vs. a 1:1 implementation with 2K pthread stack sizes, but I would be surprised if the performance difference is large at all.


> Fair enough, but kernel stacks are 8K.

And getting even cheaper than that, with the effort to make kernel stacks use virtual memory. As I understand it, once kernel stacks use virtual memory, they'll start out at a single 4k page.


Is there a writeup somewhere of the progress towards this, or a prospective timeline?



> I haven't seen a benchmark of huge numbers of native threads vs. a userland scheduler, but I have a hard time imagining that a userland scheduler will beat the kernel's scheduler. The kernel scheduler has a much more global picture of the system compared to userland.

Doesn't using kernel threads imply lots of context switches? Doesn't that tend to be expensive in terms of time on modern architectures?


> Doesn't using kernel threads imply lots of context switches?

No, it's actually fewer context switches. That's because each I/O completion event goes straight from the kernel to the code that was waiting on it (1 context switch), not from the kernel to the userland dispatcher to the code that was waiting on it (2 context switches).


This is not quite right. The userland dispatch to run the next thread is not a context switch: the TLB doesn't need to get flushed, for instance. Furthermore, many userland threads can be woken at once, and they don't incur context switches when they suspend, prompting the next runnable thread to run.

If you're willing to put in a lot of compiler work, userland thread switching is literally a function call. I don't think 1:1 threads with small stacks are going to compete with that.


The real cost of a true "context switch" is the transition from user level to kernel level, which takes thousands of cycles. This cost isn't incurred on a userland context switch, so those costs aren't comparable.


Thousands of cycles for a SYSCALL/SYSRET on a reasonably modern Intel/AMD CPU? I think you should try to measure that.


"Thousands of cycles" is definitely not true on modern CPUs. SYSENTER/SYSEXIT takes 150 cycles on my Xeon X3460.


Given the kind of "cloud"-backed startups most of HN are working on, I think the more practical measurement would be the overhead of a ring 0/3 separation in a VM, plus hypercalls, plus a ring 0/3 separation in the hypervisor (for paravirtualized syscalls); averaged against a ring 0/3 separation in a VM, plus SR-IOV virtualization (for HVM-backed syscalls.)

Yeah, this doesn't apply to context switches that happen because of plain old pre-emption, but it does apply if the context switch is because, e.g., a network packet wants to arrive at a process in your VM other than the one that's currently running.

I would imagine that there's a reason bare-metal IaaS providers have a business model. :)


My knowledge of current timings is admittedly a little out of date since my microkernel days are behind me. Even the 150 cycles another poster suggested is still a significant cost over a userland context switch.


What if you have async system calls (e.g. FlexSC)?


Well, one advantage the Go runtime has is the green threads are cooperatively scheduled, so switches are a lot more lightweight.


> Well, one advantage the Go runtime has is the green threads are cooperatively scheduled, so switches are a lot more lightweight.

Switches occur on timeouts (which involve a round trip through the kernel), on I/O events (which also involve a round trip through the kernel), or on goroutine message passing. So in every case except goroutine message sends, the switches don't actually save you a trip through the kernel.


System call context switching is cheaper than and different from thread/scheduler context switching.


OK, but that's irrelevant in this context because Go has to context switch either way.


It can also switch on memory allocations and function calls. I'd expect a large fraction of the context switches to come from those events, which don't go through the kernel.


Except when the I/O doesn't go through the kernel in the first place, of course.


What I/O doesn't go through the kernel?


For I/O, you can buffer inside the application whether the respective file is readable/writeable or not. If it's not readable, you can switch directly to another goroutine without invoking a syscall. The scheduler will later get notified by select/epoll that the file is readable/writeable, update the buffered status, and reschedule the goroutine.

But OK, for reading you might often have a buffered state of "readable" while in reality it's not, and you only get that information through EAGAIN on read, so you won't avoid the syscall there. For writing you most likely always would.
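
The analogous pattern in the futures model, as a sketch (assuming the Poll/Async types as they settled in futures 0.1): a non-blocking read that hits WouldBlock (EAGAIN) reports "not ready" instead of blocking, and the task is polled again once the reactor's epoll/kqueue says the descriptor is readable.

    extern crate futures;

    use futures::{Async, Poll};
    use std::io::{self, Read};

    // Sketch: try a non-blocking read without ever blocking the thread.
    // On WouldBlock we report NotReady; a real reactor (e.g. tokio) would also
    // have registered interest in the descriptor so the task gets woken and
    // re-polled once epoll/kqueue reports it readable.
    fn poll_read<R: Read>(reader: &mut R, buf: &mut [u8]) -> Poll<usize, io::Error> {
        match reader.read(buf) {
            Ok(n) => Ok(Async::Ready(n)),
            Err(ref e) if e.kind() == io::ErrorKind::WouldBlock => Ok(Async::NotReady),
            Err(e) => Err(e),
        }
    }

    fn main() {
        use std::io::Cursor;
        // A Cursor never returns WouldBlock, so this demo is Ready immediately.
        let mut data = Cursor::new(vec![1u8, 2, 3]);
        let mut buf = [0u8; 8];
        match poll_read(&mut data, &mut buf) {
            Ok(Async::Ready(n)) => println!("read {} bytes", n),
            Ok(Async::NotReady) => println!("not ready yet; task would yield"),
            Err(e) => println!("error: {}", e),
        }
    }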


For example, if you are bypassing the kernel network stack, which is increasingly common in high-throughput/low-latency applications.


If you're going to bypass the kernel network stack for performance, you're definitely not going to write the rest of your code in go. The garbage collector, goroutine scheduler, and conservative compile time optimizations will totally undermine the end goal.

If you're in that world, you're using C, C++, or That if you're feeling dangerous.


The JVM is used a lot in Fintech.

Of course it isn't the OpenJDK, but rather specialized JVMs like Azul's, or they code in a C style with the GC kept out of the hot paths, having profiled the applications with tools like Java Mission Control.

These are the customers driving Oracle's efforts on value types, AOT compilation, and a JNI replacement.

Why not just use C and C++, one might ask?

In spite of all these tricks and the required knowledge, salaries are lower than for C and C++ developers, and overall project costs are lower anyway thanks to shorter development times from the tooling not available in C and C++.

Hence Fintech nowadays is looking into languages like Pony, because they don't want to wait for Java 10 if they can help it.

So an area where Rust might eventually earn some friends.


I've done kernel bypass work on the JVM.


FWIW, I wasn't thinking about Go.


Is anybody doing kernel-bypass networking in Go?


Couldn't you switch on timeout using some kind of CPU counter, e.g. RDTSC?


Then you're polling the system clock all the time, which is a large unnecessary throughput loss.


But that's just one instruction, highly parallelisable (I assume), as opposed to going through the kernel... Anyway, I haven't measured it; it's just an idea I had!


> Why would you write networking code in C in 2016, when there are better alternatives available (like this one)?

Support for kernel-bypass networking libraries like ibverbs and DPDK (which has an old, unmaintained Rust wrapper), and for other I/O kernel-bypass libraries such as SPDK and IOAT.

If Rust supported these libraries, I'd much prefer the future based Rust code to a massive event loop in C.


I'm one of the authors of SPDK (which includes an NVMe driver and an IOAT driver). If the community wants to add Rust bindings for those two components, I'd be very supportive.


The article mentioned that the authors would love to help make futures-based wrappers for C libraries. You might ask if they could help with wrappers for ibverbs, DPDK, SPDK, and IOAT.


It's all open source, and this is an area where, if it matters this much to you, you could help out. The Rust community is hugely welcoming to people coming in and helping.


> In my problem space, M:N threading is simply worth it. And if I need anything better performing, I can still switch to C for specific use cases.

It maybe sounds like you should just use Go, which (as you point out) makes exactly those tradeoffs. I think Rust is aiming for a different market segment, with customers that want to make a different set of tradeoffs. Seems ideal to have both options. In particular, I don't think switching to C will ever be something Rust will want to require of its users.


Well, one of the problems with M:N threading is that the "switch into C for that part" gets a lot more expensive, negating the speed increase.


> I can still switch to C for specific use cases.

So use Go, and switch to Rust for specific use cases.


> sugar over futures down the road to make it easier to write code that looks blocking but actually isn't.

Yes! That is basically what I am after.

Currently using Erlang, so having to give up cheap isolated processes and message passing for concurrency would be hard. But I understand that's a different level of abstraction as well (a VM, multi-machine clusters, etc. vs. almost raw C speed).

But in general bringing Rust and Erlang together seems like a swell idea. Both are focused on safety, which I like. Been keeping my eye on https://github.com/hansihe/Rustler


> ... I would like to see async/await style syntactic sugar over futures down the road to make it easier to write code that looks blocking but actually isn't....

Would a Scala-style 'for' comprehension achieve the same thing while also being more generally applicable? Scala achieves this by using instance methods 'map', 'flatMap' and 'withFilter'. Personally I think Rust should just lift the sugar wholesale :-)


Simply comparing M:N threading with futures is not quite right. In my experience using futures in JavaScript, they do not support control flow at all. The goroutine equivalent is async/await, which transforms the CPS form of asynchronous I/O back into a synchronous-looking form, but it may impose more overhead than M:N threading since it stores the stack structure in its state.


Are libmill and Go's approach the same? I.e. Holding concurrent state in multiple stack frames? What causes that to have degraded performance?

Is this new approach different because instead of stack frames you just have dynamically allocated callback state?


> Are libmill and Go's approach the same? I.e. Holding concurrent state in multiple stack frames? What causes that to have degraded performance?

In the M:N approach you have to allocate stack space for each goroutine that you spawn. This requires that you either know the size of the stack up front (generally not possible without being conservative and requesting a large allocation) or that you start small and grow (resulting in a lot of memory traffic and pauses in the growth case, and much harder to do in C).

By contrast, with the zero-cost futures approach we statically know exactly how much per-"goroutine" state we will ever need, and we can allocate precisely that amount. Furthermore, we only save the data that's absolutely needed across blocking calls. This results in much smaller per-connection state, and as a result it's quicker to allocate.

It's the difference between static and dynamic control flow. Full M:N requires us to give up static knowledge of what a goroutine will do and try to do the best we can at runtime. With futures, we have a lot more static knowledge, and as a result we can optimize more aggressively.
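
A concrete illustration of "statically know exactly how much state we need" (my own sketch, assuming the futures 0.1 combinators): the chained combinators nest into a single concrete type whose size is known at compile time, and that type is the entire per-task state.

    extern crate futures;

    use futures::Future;
    use futures::future;
    use std::mem;

    fn main() {
        // Each combinator wraps the previous future in a new struct, so the whole
        // chain is one concrete type. Its size (the entire per-task state) is
        // known at compile time; no stack has to be guessed at or grown.
        let chain = future::ok::<u32, ()>(40)
            .map(|n| n + 2)
            .and_then(|n| future::ok::<u32, ()>(n * 10));

        println!("state machine size: {} bytes", mem::size_of_val(&chain));
        println!("result: {:?}", chain.wait());
    }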


> This requires that you either know the size of the stack up front ... generally not possible without being conservative

Well, you are writing the compiler. Sure you'd be up against the halting problem, but relatively few functions are (non-tail) recursive.

Perhaps the unwieldiness of a large stack is better attributed to the feature of unbounded recursion (and FFI into "uncharted territory") than the feature of green threads.

I appreciate that Rust has already been down the road of lightweight threads. This statement just struck me as an assumption that deserved to be questioned.


It would require higher order control flow analysis like k-CFA, which would certainly fail to produce a bounded stack size on any nontrivial program.

The futures library is the control flow analysis. Because it uses the type system instead of higher order control flow analysis, it actually achieves precision.


"It would require higher order control flow analysis like k-CFA, which would certainly fail to produce a bounded stack size on any nontrivial program."

You can actually do a pretty good job without k-CFA.

First, and I know you know this, when you say "fail to produce a bounded stack size", you really mean "a reasonable bounded stack size".

A bounded stack size is easy, and does not require context-sensitive analysis: the stack size is just the max sum of frame sizes over all paths (a meet-over-all-paths computation) in an acyclic graph :)

For cycles, you either can statically calculate the recurrence count or you can't.

Bounded heap size is pretty much impossible, but stack size is pretty easy.

Recursion is also not hard. You form strongly connected components. You try to prove how often the component is cycled. If you can, victory. If you can't, you can bound it unless it's truly dynamic.

More to the point, here's a paper on doing it for Ada in GCC in 2009 (i.e. with sticks and fire):

http://www.adacore.com/uploads/technical-papers/Stack_Analys...

Note they are within 2% on most cases, and can detect whether it's statically knowable, bounded, or dynamically changing.


Are you sure about that? First, many higher-order constructs boil down to simple first-order control flow after partial evaluation (see e.g. AnyDSL https://anydsl.github.io/#publications). Inferring stack size bounds in the presence of higher-order programming features is also possible, and was implemented (as the "static region optimization") in MLkit.

But the main point is that code in a high-level programming language looks very different from C or even Rust code. Sparks in Haskell, or processes in Erlang, or even individual goroutines in Go, are typically executing comparatively tiny programs. If the call graph of the program in question is acyclic, then it should be easy to get a bound on the maximum stack size.

To be clear, I agree with you that a 1:1 threading model is the better design for Rust, simply because it's the only design where you can guarantee zero overhead and don't need a complicated runtime. What I don't agree with is that the M:N threading model is inherently inferior. Especially for a high-level programming language, I would assume that you can reduce the overhead significantly through static analysis and a good implementation.


You can't get around page fault blocking with user-space M:N threading. It will be approximate until there is a way to do that.


I was thinking of a much lighter analysis, based on some optional restrictions. If a function:

- is not recursive

- does not call function pointers (ie trait objects)

- allocates only fixed-sized objects on the stack

- only calls functions with known stack requirements

then its stack requirement should be known, no? It feels like with these requirements, one can still write many programs (threads, really). And if one goes beyond the restrictions, then they just pay the cost of having to guess a large stack size.

Stack size analysis would be helpful for other applications as well, like embedded platforms.


No program beyond the most trivial will meet these requirements. The moment you call any function indirectly you lose.


Yes, but "indirectly" means "&Fn", not "&F where F: Fn".

And a whole program doesn't need to conform, only individual threads. Presumably the ones you want to make a lot of.

(And if a little dynamism were required, its expense could be paid at the point of use, by creating a fresh stack of the necessary size at that point. But I'm probably opening up old split-stack wounds, sorry.)


The problem with an optimization like this is that the current approach always gives you a guarantee on runtime cost. Heuristic-based optimizations make performance harder to ensure, not to mention making it much more difficult to recover performance when you fall outside the valid subset. You can obviously provide another static analysis to warn you about violations, but this seems more complex and less flexible than the current approach.


Actually the "current approach" gives you no bound on the stack size. You guess, hope, and test (or do this analysis manually).

While your general objection is good to always keep in mind, it is a tradeoff. There is no perfect solution when you're up against the halting problem (unless you're proposing to change the language to only bounded recursion).

What I'm arguing for here is for more than the single problem the OP is solving, so it's not a case of choosing one or the other and calling it a day.


libmill only uses a single kernel thread so it's a little bit simpler here than Go, but otherwise they should be comparable.

Regarding stack vs. dynamically allocated state (in a future), I'm not convinced which is better. Yes, the allocated stack most likely has some overhead. But at least the data will be stored there in a linear fashion, and accessing the state will be super cheap once the stack is loaded. In the case of futures that hold dynamically allocated data, the state might be spread over many more memory locations, so it might be slower to access. I don't have any more scientific data on this, however.

In my real-world applications I'm getting about the same performance from a Go-based networking application and a Boost.Asio-based one (which uses no futures, but dynamically allocated callback closures for storing state). However, the programming style is completely different.


With Rust's zero-cost futures, only one memory allocation is done for the entire chain of futures. It's not doing an allocation for every callback.

This makes it comparable to using a stack. In an M:N solution, the stack is usually (but not always) allocated similarly (e.g. using malloc).
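
A sketch of what that looks like in practice (again assuming the futures 0.1 API): the combinator structs and their closures all live inline in one value, so boxing it to hand the whole thing to an event loop is a single allocation for the entire chain, not one per callback.

    extern crate futures;

    use futures::Future;
    use futures::future;

    // The closures and combinator state are stored inline in one nested struct;
    // the Box below is the only heap allocation for the entire chain.
    fn build() -> Box<Future<Item = u32, Error = ()>> {
        let chain = future::ok::<u32, ()>(1)
            .map(|n| n + 1)
            .and_then(|n| future::ok::<u32, ()>(n * 2))
            .map(|n| n + 10);

        Box::new(chain) // one allocation, regardless of how many steps above
    }

    fn main() {
        println!("{:?}", build().wait());
    }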


> With that approach, we'd have the ergonomics of languages like Erlang and Go without the significant performance tax associated with M:N.

Can you quantify the "performance tax associated with M:N"?


What is the "significant performance tax" in Go? Do you have any numbers about this implementation?

When I look at the benchmark of your minihttp vs. Go's fasthttp, Rust is supposed to be 3x faster without any GC, and yet minihttp pulls 2M req/sec vs. 1.5M for Go, which is a small margin when you know how Go works and that it has a GC.


> What is the "significant performance tax" in Go?

I explain in this sibling comment: https://news.ycombinator.com/item?id=12271194

> Do you have any numbers about this implementation?

See the blog post.


I did edit my first question after seeing the numbers.


Arguing about whether a 30% difference is "significant" isn't really interesting.


You're talking about using a language where you have to manually manage memory, that is significantly harder to use than Go, and that has fewer built-in concurrency patterns and an inferior standard library, so yes, 30% seems a very small benefit.

I haven't looked into the benchmark, but my guess is that it's waiting on I/O, so any language with a good concurrency model will post similar numbers.

The benchmark would have been more interesting if it were more CPU-intensive; that would definitely have favored Rust.

It's not about Rust vs. Go; it's just that 30% seems very low when you know how much faster Rust is than Go in some scenarios.


> You're talking about using a language where you have to manually manage memory, that is significantly harder to use than Go, and that has fewer built-in concurrency patterns and an inferior standard library, so yes, 30% seems a very small benefit.

Naturally I disagree with all of those items, but I'm also equally uninterested in starting a Rust vs. Go war.


> ergonomics of languages like Erlang and Go

My brain read that as "ergolangmics"


In other words futures are not as composable as one would hope.

John Reppy's CML seems like a much better toolbox, one that gets composability without pretension. Vesa Karvonen (who I understand has worked on the MLton compiler) has offered an excellent implementation in Hopac for C# and F#, complete with a slew of combinators: https://github.com/Hopac/Hopac

I'm not aware of anyone offering an alternative superior to an informal CSP yet which seems to be the reason why Go and Clojure have picked it as well for their concurrency model.

See David Nolen's excellent blog posts on the matter:

http://swannodette.github.io/2013/07/12/communicating-sequen... http://swannodette.github.io/2013/08/17/comparative


I think Rust's approach here makes a lot more sense for the language than Go/CML style M:N would. You can recover most of the M:N ergonomics over time via async/await style syntax. But if your foundation is built on a (comparatively) slow "pthreads in userland" threading model, then you hit a performance ceiling that you can never break through. For a language like Rust, it makes sense to begin in the optimum place and then layer zero-cost abstractions on top to achieve the right ergonomics.


I think there's little or no evidence that "you can recover most of the M:N ergonomics over time via async/await style syntax", despite a decade or so of attempts. I think there's an underlying semantic concern that seems unsugarable.

Munificent's http://journal.stuffwithstuff.com/2015/02/01/what-color-is-y... expresses the problem eloquently.

None of that means you're wrong about async making "a lot more sense for the language", though.


> Munificent's http://journal.stuffwithstuff.com/2015/02/01/what-color-is-y.... expresses the problem eloquently.

I have several issues with that blog post.

1. The performance concerns are glossed over. The conclusion is "threads are superior", but that ignores the reason why we want async I/O in the first place: performance.

2. "What if the new place you want to call it is blue? You’ll have to turn it red." is false. There's a way to convert blocking code to async-friendly code: use a thread pool. This is in fact what cgo does internally—see the next point.

3. You always have red functions and blue functions if you have an FFI. But this isn't a big deal, because synchronous functions are easily convertible to asynchronous functions (via a thread pool) and asynchronous functions are easily convertible into synchronous functions (via just blocking). So the supposedly big semantic problem boils down to making sure you remember to switch into the other mode when necessary (which you can always easily do without restructuring your program). This is something the language can do for you automatically—proof is that Go does it!—and I would like to see type-system or other static approaches to doing this in Rust.
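
A sketch of both directions (assuming the futures-cpupool crate from the same authors; blocking_lookup is a hypothetical blocking call standing in for, say, a synchronous C library):

    extern crate futures;
    extern crate futures_cpupool;

    use futures::Future;
    use futures_cpupool::CpuPool;

    // Hypothetical blocking operation (e.g. a synchronous FFI or syslog call).
    fn blocking_lookup(name: &str) -> String {
        format!("result for {}", name)
    }

    fn main() {
        let pool = CpuPool::new(4);

        // Sync -> async: offload the blocking call to the pool and get back a
        // future that composes with the rest of an async pipeline.
        let fut = pool.spawn_fn(|| Ok::<_, ()>(blocking_lookup("example")));

        // Async -> sync: block the current thread until the future resolves.
        println!("{}", fut.wait().unwrap());
    }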


This is currently hotly debated in the C++ committee. Some people want shallow C#/Python-style generators, while others want proper stackful coroutines a la Lua (full disclosure: I'm in this group). A third group is trying to mediate, and is trying to come up with a hybrid stackful model that can be optimized as well as the async/await model, at least in some cases (i.e. a full CPS transform, falling back to a cactus stack when invoking non-transformable functions).


I've been writing async I/O networking software for about 15 years now. Early on most of that was in C; now it's split about 50/50 between C and Lua. Most of my I/O code is still in C because I prefer my libraries to be reusable outside of Lua or any particular event loop, and they often are. Lua's coroutines are usually higher up the stack, juggling more abstract state; and I use them for more than asynchronous I/O or asynchronous tasks.

The thing about async/await is that in a language like C, I can already accomplish much of that with tricks like Duff's Device and macros. It has its limitations, but IME they're not much more onerous than the limitations of async/await, especially in the context of a language lacking GC. I have to manually keep state off the stack (or otherwise copy/restore it), but you do that anyhow when you don't have lexical closures and GC, and often even when you do.

The beautiful thing about coroutines in Lua is that it's based on a threading model, but not one bound to the C stack or a kernel thread, which are completely orthogonal concerns left to the application to deal with or not deal with. And it does this while preserving functions as first-class objects. Neither callers nor callees need to know anything about coroutines. That kind of composability makes coroutines useful and convenient for many more things than simulating green threading or managing CPU parallelism. Among other things, it means I can mix-and-match functional and imperative styles according to the problem, and not whether it will be convenient to then make use of coroutines. It means that I have a single, natural call stack--not an implicit stack and an explicit stack. async/await and futures unify your data stack, but you're still manually managing the call stack through syntactic devices or otherwise formalized calling conventions. However heavily sugared, it will hinder the design of your software no less than if you had to manually manage the data stack, too.

Coroutines that aren't stackful aren't nearly as powerful in terms of problem solving. Without them being stackful, it's a horribly leaky abstraction for non-trivial uses. Most people would agree that the C preprocessor is a mess, and that functions as first-class objects are powerful. So modern languages strive to create templating systems that allow you to construct _real_ functions that are indistinguishable from any other function. But when they introduce monstrosities like futures or async/await, that beautiful symmetry is broken. It's like bringing back C's macro preprocessor--now you have regular functions and these weird things with different syntactic and control-flow semantics, whether you wanted it or not. The decision is no longer yours, which means you're bending to the language's deficiencies.

Why even bother with such half-baked solutions? In almost every case it's utterly transparent that these solutions exist for the benefit of the compiler and runtime author, usually because of intentional or unintentional technical debt--a direct or indirect dependency on the C or kernel stack. For C++ it's understandably a difficult dilemma, but for every other language it's a total cop-out.

Then these solutions are sold to the public by prettifying the implementations with fancy terminology and beguiling examples showing how they can be used to implement async I/O or parallel CPU jobs. But few, if any, language features are so narrowly tailored to such specific use cases. Why? Because languages are supposed to provide simple building blocks that compose as seamlessly as possible at a much higher level of abstraction than, e.g., a slightly nicer way to implement HTTP long-polling servers. Such contrivances are as removed from the basic simplicity of the function as C's macros are from first-class functions. In both cases you can implement solutions for a certain subset of problems that superficially look convenient and nice; but in the real world the limitations become swiftly apparent, and you realize a lot of effort was spent in design and implementation for little real-world gain.

With Lua's coroutines, I can implement a futures pattern easily when it's appropriate, and it will be more powerful because the futures themselves can make use of coroutines both internally and externally. But in my use of coroutines in Lua, futures are rarely the most natural design pattern. Sometimes you want a full-blown CPS solution; sometimes you simply want to be able to arbitrarily swap producer/consumer control flow, for example in a lexer. Often you want a mixture of all of these. Coroutines--stackful coroutines--provide all that and more, seamlessly.

Futures only look nice and elegant in contrast to event loop oriented, callback-style programming. But that's a really, really low bar. Please aim higher, people!


> Why even bother with such half-baked solutions? In almost every case it's utterly transparent that these solutions exist for the benefit of the compiler and runtime author, usually because of intentional or unintentional technical debt--a direct or indirect dependency on the C or kernel stack.

There is a significant faction of language designers who disagree, and think that keeping coroutines shallow is important for developers writing and reading the code. This post from Dave Herman (involved in JavaScript and Rust) sums it up: http://calculist.org/blog/2011/12/14/why-coroutines-wont-wor.... (The comment from Tom van Cutsem is also a good rephrasing: http://disq.us/p/9jcee9.) Note that the argument is not as applicable in languages with racy multithreading (like C/C++ or Java).

I don't think it's necessarily a knockout argument, but it at least helps me sleep with what we've chosen for JavaScript.


I never understood this argument; it feels to me like a post hoc rationalisation. Yes, with top-level-only yield you know all suspension points, but that doesn't really buy you anything, as calling an unknown function can potentially mutate any object, possibly by invoking user callbacks or other generators. If the function's behaviour is well documented, then whether it is a suspension point would be as well.


The difference is that when you call a function, you can easily know what will happen: the function will execute. You can use your knowledge about the function you are calling (and the functions it calls, etc.) to ensure it does not violate any invariants you set up.

Whereas, if you have an implicit yield point that goes back to the event loop, the event loop can run arbitrary other tasks---not ones you can locally reason about or predict, but simply those that are ready to be executed.


But if you know what a function does when you call it, you would also know whether it is an implicit yield point or not, right?

Also, I don't know about JS, but many event loops can be reentered recursively, so even with top level only yield all bets are off.


Was that a correct link? I feel like it only re-iterates my points.

My point is that thinking about coroutines in the context of green threading is totally the wrong way to think about it. That you can implement something approximating green threads with coroutines is a testament to the power of coroutines, but it's hardly the defining feature for them.

And coroutines are not sufficient to implement green threading. You still need a way to have multiple outstanding I/O requests. That could have been done with other mechanisms. User code could wrap such mechanisms with coroutines and fashion an event loop if they desired, and no doubt most would have done that. But by leaving that up to the application people could experiment with patterns for addressing concerns regarding concurrency. And I would also note that concurrency problems related to order of operations hardly go away with futures, the preferred solution in JavaScript, or async/await. Stackful coroutines can theoretically be worse when callees can yield from any expression, but don't forget that the real problem is shared mutable state, which you're passing or otherwise making available in equal measures for each option. For that and other reasons the distinction with futures and async/await is, I think, not very meaningful.

For similar reasons, Rust's failed experiment with green threading is not an argument against the practicality of stackful coroutines. Quite the contrary--it's an example of why it's more important to focus on the abstraction and preserving the power of that abstraction than to tailor the solution for specific scenarios. Rust could easily have had stackful coroutines with zero performance cost and negligible cost to the language design. But instead they focused on coroutines as a roundabout way to ease the burden of async I/O, and, worse, they tried to refactor Rust's entire standard I/O library to work with that green threading model. It was destined for failure, and for good reason.

When discussing coroutines, my favorite example is something like libexpat. libexpat was early on in the history of XML the most widely used XML parser. But it was a push parser. Push parsers are easier to implement and often more performant, but they're more difficult to use. All of those qualities stem from push parsers literally and figuratively pushing the burden onto the application for solving issues related to state management and buffering.

You couldn't easily refactor libexpat into a pull parser because token production relied on automatic variable stack state and internal stack structures. You'd have to copy and buffer everything it produced. No wonder so many people either forked libexpat or just reinvented the wheel.

If C had stackful coroutines, it would be _trivial_ to make libexpat a pull parser. Heck, trivial undersells how simple and elegant it would be. In that context coroutines could have provided the best of all worlds, for both libexpat developer(s) and direct and indirect users.

The narrative around coroutines has been poisoned by this narrow focus on async I/O and similar contemporary problems, and conflation and equivocation with fibers, kernel threads, and the low-level details of platforms' C ABIs. It's created lost opportunities for providing stronger language abstractions.

Perl 6 committed a similar sin, IMO. MoarVM technically can support stackful coroutines, but it doesn't because they're only implemented to support Perl's gather/take control structure. That coroutines could have easily been used to implement gather/take entirely within application code, but not vice-versa, should have been a strong hint that coroutines were the stronger abstraction, and gather/take should have been defined as standard API utilizing proper stackful coroutines.

Note to future language designers: coroutines are not about async I/O. Coroutines are not about green threads. Coroutines are not about map/reduce. Don't conflate the means with the ends. Stackful coroutines can be used to implement all of those and more because they're the better abstraction. Lua's stackful coroutines can be used to easily implement green-threading-like async I/O, or non-trivial map/reduce patterns, not because the Lua authors bent over backwards to make that possible, but because they preserved their abstractive power; because they modified both the language design and the implementation so stackful coroutines didn't come with needless caveats.


> You can recover most of the M:N ergonomics over time via async/await style syntax.

While I agree with the tradeoff made by Rust (although I think the approach used by C++ coroutines is better), I don't think that having async/await syntax gives you "most" of the ergonomics of the Go M:N model.

The main advantage of the Go model is that asynchronous and synchronous operations are identical; with async/await you still need to model the async operation and the sync operation with different types. Not having to decide (and design) upfront which parts of your computation are async and which are not is what makes Go so attractive.


> although I think the approach used by C++ coroutines is better

How?

> The main advantage of the Go model is that asynchronous and synchronous operations are identical; with async/await you still need to model the async operation and the sync operation with different types.

It's more like "everything is synchronous" in the Go model. Semantically, Go doesn't have async I/O at all. It has a userspace M:N implementation of synchronous I/O. You can get the same effect in any other language by just not using async I/O.

> Not having to decide (and design) upfront which parts of your computation are async and which are not is what makes Go so attractive.

That doesn't make sense to me. If async I/O is important in your app, why not just make your entire app use async I/O?


Defaults matter. If some people use async I/O and others don't then you get a mess when they want to share reusable libraries. It's similar to the mess you get when there is more than one string type.

I think the "what color is your function" problem could be mostly solved by making async/await the default function type - that is, most callback functions should be allowed to suspend execution by waiting on a Future.

Then you could have special-purpose "atomic" functions that are occasionally useful for pure functional code.

(Unfortunately, the default has to be the opposite in browser-based languages due to performance concerns.)


> (Unfortunately, the default has to be the opposite in browser-based languages due to performance concerns.)

Also in Rust. Most apps that aren't servers don't want async I/O, and it causes a lot of problems when you need high-performance FFI. For example, in Servo async-everywhere would be a big problem for WebRender, which needs to be able to call OpenGL extremely quickly.

> Defaults matter. If some people use async I/O and others don't then you get a mess when they want to share reusable libraries. It's similar to the mess you get when there is more than one string type.

Given that having both is necessary for Rust (maybe a necessary evil), I think the right approach is to make jumping from one mode to the other painless. For sync-to-async, it needs to be easy to block on the result; for async-to-sync, it needs to be easy and fast to offload to a thread pool. If we can make it really easy to switch from one mode to the other, then most of the really hairy problems go away.


Easy and error prone.

Sometimes libraries pretend to be async and accidentally are sync. Consider some library that in the normal case just does some pure computation, but logs to syslog or something on some error condition. If you use that library in an async context, it could work fine most of the time, until you hit some unexpected situation where it happens to make a network request to some syslog daemon and blocks up your worker thread. The same thing can happen with mutexes, or many other common blocking operations.

It's also the case that often async libraries depend on some sync library and so they have their own worker pool. You can easily have many libraries with their own worker pools all using more resources than they need.

You also have to worry about if your functions do any of these transformations under the hood. For example, if you have some async worker that delegates some sync task to the worker pool, and that sync task happens to use some async function and blocks on it, and that async function ALSO has a sync task and attempts to delegate it to a worker pool, and that worker pool is bounded, then you have just opened yourself up to a deadlock under high load that you probably won't find under normal operation.

On top of all that, debugging is usually much harder in these environments because you have to inspect the runtime memory state that depends on the specific internal details of the libraries being used instead of a bunch of stacks. It's extremely hard to diagnose deadlocks or stalls in these environments. It's non-trivial to provide a good debugging experience that doesn't cause extra load in production environments.

These issues are all real things that I have hit in production with Twisted. A static type system could help all these things, but I think it requires buy in from every library you might use, transitively.


> It's also the case that often async libraries depend on some sync library and so they have their own worker pool. You can easily have many libraries with their own worker pools all using more resources than they need.

The Rust story will not be complete without a canonical single implementation of a thread pool that everybody doing async I/O uses for blocking tasks.

> For example, if you have some async worker that delegates some sync task to the worker pool, and that sync task happens to use some async function and blocks on it, and that async function ALSO has a sync task and attempts to delegate it to a worker pool, and that worker pool is bounded

I think the solution here is "don't have strictly bounded worker pools". This is what cgo does, I believe.

> It's non-trivial to provide a good debugging experience that doesn't cause extra load in production environments.

But this is the exact same problem that any M:N system will have. So I don't see any special problem for Rust's system here.


I hope your optimism that a single kind of thread pool will serve all applications is well founded. It seems like people would want to specialize them, much as they want to specialize their memory allocators. The Rust team has a really great track record of innovation and technical excellence, so I look forward to the design that will accommodate that, and I hope the ecosystem buys into it.

Go does limit the number of threads and will crash the process if it goes over. In my experience, it's also very rare to have cgo call back into Go multiple times, versus libraries juggling adapters between async and sync. It's also easy to have your library put a limiter on the number of cgo calls you make, but less easy to limit the number of tasks you throw into a thread pool, because you don't have the option to block. (Edit: I think you can just store the closure on the thread pool and schedule it eventually, at the cost of writing your own scheduler and perhaps requiring allocations?) I have a feeling that a similar crashing solution won't work in the Rust community, and what to do when the limits are hit will be punted upstream. My main point is that there are many subtle details in solving the "colored function" problem.

I don't think all M:N systems have the debuggability problem, because the runtime has a single canonical representation of what is running: the stack traces. Since the entire ecosystem bought into the runtime, you don't have any fracturing of representations. If you're optimistic that the entire ecosystem will buy into whatever mechanism you have for doing async work, then this can be solved, but I believe that's already not the case (people hand-code state machines) and it's unlikely to stay the case as time goes on.


> Given that having both is necessary for Rust (maybe a necessary evil), I think the right approach is to make jumping from one mode to the other painless. For sync-to-async, it needs to be easy to block on the result; for async-to-sync, it needs to be easy and fast to offload to a thread pool. If we can make it really easy to switch from one mode to the other, then most of the really hairy problems go away.

That really sounds like the Task<T> type from the C# TPL, which can also be used in sync and async environments and was probably also designed to be a generic solution. While it basically works, there are a number of pitfalls associated with that model. E.g. you can synchronously block on .Result for some tasks (those that will be fulfilled by other threads), but not for others (those that would be fulfilled by the same thread, because that causes a deadlock). In the Task + multithreading world there's also always the question of where continuations (attached with .ContinueWith, .then, ...) run: synchronously in the thread that fulfills the promise, asynchronously in a new event-loop iteration (but on which event loop?), on an explicitly specified scheduler, etc. C# uses TaskScheduler and SynchronizationContext variables for that. But as they are only partially known, and even behave somewhat differently for await Task versus Task.ContinueWith, there's quite some room for confusion.


> Also in Rust. Most apps that aren't servers don't want async I/O, and it causes a lot of problems when you need high-performance FFI. For example, in Servo async-everywhere would be a big problem for WebRender, which needs to be able to call OpenGL extremely quickly.

I don't understand why this is the case. Since async/await allows the compiler to transform the code into a state machine, why would it not be able to optimize this?


Because async-everywhere usually means pushing blocking FFI calls over to a thread pool, which would be unacceptably slow for e.g. OpenGL.


"Usually" is a funny word, but I see what you are saying.

I guess if you are in a single-threaded event loop scenario (ala nodejs) you could get away with it somewhat, but as soon as you introduced multiple threads of execution all bets are off.

That's unfortunate.

We have a fairly large codebase written in .NET. We wouldn't mind porting it to CoreCLR, but we'd have to migrate everything to async/await. The red/blue nature of method signatures makes this look like almost a complete rewrite. Given the difficulty of this migration so far, and the nature of the change, it's certainly caused us to explore other options and we've already rewritten a significant chunk of the code in Go.

Moving a large codebase from a single color model to a dual color model really sucks. I hope Rust can lock this down sooner rather than later otherwise a lot of people are going to feel pain down the road.

The good news is you have a good static type system. I cannot even begin to imagine migrating a Python codebase to async/await...


> How? In terms of allocation: when the future uses some variable present on the current function's stack, you have two options

1 - Waiting for the future to complete before exiting the current function (which essentially is blocking)

2 - Allocating the closure on the heap (allocation + deallocation)

In a language with coroutine support, we have a third alternative. Instead of blocking or allocating memory, it's possible to just suspend the current function (no heap allocation necessary, no blocking), and resume when the future completes.

In terms of context-switching speed: the cost of moving the state machine is essentially the cost of a double dispatch (probably a double dispatch plus a virtual function call), while switching coroutines is closer to the cost of a function call (I think it's cheaper than a normal function call, but that becomes too technical).


I just watched (mostly) the CppCon talk you posted elsewhere. The coroutine approach is really interesting, but I'm confused as to how it's different. According to a source I found[1], the way coroutines are implemented is that a new stack is created on the heap and it moves back and forth between that. Isn't that the same case here? Is the compiler level implementation(as opposed to boost, as in the linked reference) different in some way?

1: http://stackoverflow.com/questions/121757/how-do-you-impleme...


> The main advantage of the go model is that both asynchronous and synchronous operations are identical

There is a bit of misunderstanding on your part. There are no asynchronous operations in that go model, everything is synchronous. There is no event loop underneath, despite what some people claim. And this is absolutely not an advantage in a shared-memory environment. Instead it forces you to do synchronization to access memory, while different-looking asynchronous operations do not, precisely because they are different-looking; otherwise it would be impossible to know what is running when, and you would have to think about synchronization too.

It is not a secret that shared memory multithreading, while useful for parallelism, is the worst possible model for concurrency and this includes all flavors of coroutines as well.


> There are no asynchronous operations in that go model, everything is synchronous.

You are the second person saying this, and I have to admit I am really confused by this statement. From my understanding, golang does not expose an asyncio interface, but this doesn't mean that the golang runtime doesn't perform IO operations asynchronously. So golang exposes async operations through a synchronous interface, which is what most language constructs (C# async/await, the F# async monad, etc.) try to emulate.

From https://morsmachine.dk/go-scheduler : when a goroutine performs a blocking operation, the local scheduler migrates the remaining goroutines to another OS thread, allowing them to continue their execution while the blocking operation completes asynchronously.


> From my understanding, golang does not expose an asyncio interface, but this doesn't meant that golang runtime doesnt perform io operations asynchronously.

This is also true for an OS kernel.


I don't understand the point you are making.


The blocking (i.e. synchronous) syscalls for IO are doing asynchronous things internally, they just wait for the event to finish before returning. The kernel can even do similar scheduling things to a language runtime, e.g. when a thread does a call to a blocking read on a socket, the thread can be switched out until the socket actually has data, allowing other code to run on that core.


I am still unclear on how that relates to the original discussion. The point we were debating is whether or not Golang does async IO. I am not sure why the kernel's behavior is important here.


The Go programming model is synchronous: the code executes an IO operation and that thread of execution halts until the operation completes. The Go runtime is implemented using asynchronous IO internally, and manages scheduling between threads itself, but that is an implementation detail.

This is exactly the same as using normal blocking operations with OS threads. That programming model is synchronous: the code executes an IO operation and that thread of execution halts until the operation completes. The kernel is implemented using asynchronous IO internally, and manages scheduling between threads itself, but that is an implementation detail.

The original point is Go's programming model is the same as OS-level blocking IO, the fact the runtime is implemented in user-space on top of async IO is an implementation detail that doesn't change the surface behaviour of the code. One could substitute OS threads and normal OS blocking calls for goroutines and the runtime's IO operations and code would behave essentially identically, just possibly with different performance characteristics.


> I'm not aware of anyone offering an alternative superior to an informal CSP yet which seems to be the reason why Go and Clojure have picked it as well for their concurrency model.

What about Quasar [0] on the JVM?

0 - http://docs.paralleluniverse.co/quasar/


But Quasar does seem to offer go-like channels? I'm unclear what kind of combinators are provided but those could be implemented.


It does offer channels, see the linked documentation.

For combinators, I don't think those are really needed in CSP environments, because you can do most transformations with normal function calls and normal control flow constructs (loops, if, ...).


Yes, this is the lowest layer of the story. For actually writing an app, writing something higher-level makes sense: consider the relationship between raw futures and Finagle, for example. You have to build out the lower layers before you can add the upper ones, though.

Rust doesn't natively supply anything like green threads, as their requirements are too heavy. But there are libraries that can implement them, since Rust is so low-level: https://crates.io/crates/mioco as an example.


Dealing with futures is fairly straightforward if the language offers some syntactic sugar for continuations (such as C#'s async/await that has been spreading around for some time now). Why doesn't it mix well with complex business rules?


I'm not sure I would go as far as saying they can't. I'm quite comfortable with F#'s async computation expression and like them a lot.

You get a strange sense, however, when playing with Hopac (which btw has analogous sugar to F#'s async, and it works just fine) and going through the content in Reppy's book, that the model is strictly more composable. I'd invite you to try it, although I don't have a sense for the C# combinators.


TLDR;

I’ve claimed a few times that our futures library provides a zero-cost abstraction, in that it compiles to something very close to the state machine code you’d write by hand. To make that a bit more concrete:

- None of the future combinators impose any allocation. When we do things like chain uses of and_then, not only are we not allocating, we are in fact building up a big enum that represents the state machine. (There is one allocation needed per “task”, which usually works out to one per connection.)

- When an event arrives, only one dynamic dispatch is required.

- There are essentially no imposed synchronization costs; if you want to associate data that lives on your event loop and access it in a single-threaded way from futures, we give you the tools to do so.

This sounds quite badass and awesome. I'm not sure what other language implementations take this approach, but this is clearly an extremely beautiful, powerful, and novel (to me at least!) concept. Before reading this, I thought rust was great. This takes it to the next level, though.


And in practice:

> ...we’ve added a comparison of minihttp against a directly-coded state machine version in Rust. The two are within 0.3% of each other.


Generating a state machine out of async fluent APIs was enlightening.

How are you handling errors? One Error state with the context?


This is probably better answered by how Rust handles errors generally:

https://doc.rust-lang.org/book/error-handling.html

Suffice it to say, similar magic tricks have already been at work to make handling errors not soak up a lot of space, and it'd work well with something like this, I think.


That sounds great, but I guess it will increase compile times.


In Rust it's frequently the case that slow compilations are dominated by generating and optimizing LLVM IR. This codegen step (generating LLVM IR) often takes a while just because we're generating so much IR.

Rust takes an approach with generic functions called monomorphization which means that we generate a new version of each function for each set of generics it's instantiated with. This means that a future of a String will generate entirely different code from a future of an integer. This allows generics to be a zero cost abstraction because code is optimized as if you had substituted all the generics by hand.

Putting all that together, highly generic programs will generally trend towards higher compile times. With all the generics in play, there tends to be a lot of monomorphization which causes quite a lot of LLVM IR to get generated.

As with many aspects of Rust, however, you have a choice! Rust supports what we call "trait objects" which is a way to take a future and put it behind an allocation with a vtable (virtual dispatch). This forces the compiler to generate code immediately when a trait object is created, rather than down the line when something is monomorphized.

Put another way, you've got control over compile times if you're using futures. If you're taking a future generically and that takes too long to compile, you can instead take a trait object (or quickly convert it to a trait object). This will help cut down on the amount of code getting monomorphized.

So in general futures shouldn't make compilation worse. You'll have a choice between performance (no boxes) and compile times (boxing) occasionally, but that's basically already the case of what happens in Rust today.
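
To make the trade-off concrete, here's a rough sketch assuming the futures-0.1-style Future trait (with Item/Error associated types); the function names are invented:

    extern crate futures; // futures 0.1
    use futures::Future;

    // Generic: monomorphized for every concrete future type passed in. Static
    // dispatch and inlining, but each instantiation adds LLVM IR and compile time.
    fn run_generic<F: Future>(f: F) -> Result<F::Item, F::Error> {
        f.wait()
    }

    // Trait object: compiled exactly once, however many future types callers box up.
    // The cost is one allocation plus dynamic dispatch at the boundary.
    fn run_boxed<T, E>(f: Box<Future<Item = T, Error = E>>) -> Result<T, E> {
        f.wait()
    }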


Do note that with MIR the focus is polymorphic optimizations - reducing the LLVM IR for all monomorphizations of a generic function, at once.

Unlike C++ templates, Rust enforces a single definition with uniform semantics, for a generic type/function (specialization going through the existing trait static dispatch mechanism), so we can take advantage of that to reduce compile times.


Hm. I've heard arguments that C# or Java is slow for multiple reasons, but never because of the minuscule overhead of a virtual method dispatch when using objects behind interfaces (kind of similar to trait objects).

It's interesting that this is seen as significant here. Are we dealing with much shorter timescales, or just being eager to optimise everything?


Virtual dispatch per se is not terribly slow, as long as the branch is predictable by the CPU. The problem is that virtual dispatch prevents the sort of aggressive inlining and interprocedural optimizations that C++ compilers are known to do. C# and Java JITers get around that via runtime analysis and speculative inlining, but that is done at runtime and eats away some of the precious little time available for optimizations.

Edit: spelling


Put it this way:

The cost of a branch misprediction is tens of CPU cycles (1), on processors clocked in gigahertz (10^9 cycles per second): call it on the order of 10 nanoseconds.

The time to turn around a web request is, if you're very lucky and have done the work of mainly fetching a value from an in-memory cache, multiple milliseconds (2). That's on the order of 10^-3 seconds.

If you're not lucky, it's tens or hundreds of milliseconds to generate the response.

It seems that the second duration is, best case, around 10^5 times longer. I would not sweat the first one.

1) http://stackoverflow.com/a/289860/5599 2) http://synsem.com/MCD_Redis_EMS/


Contrary to popular belief, not all C++ programs (or rust FWIW) are web servers serving HTTP requests over the Internet.


Yep, that's why I'm asking about the use-cases in the grandparent comment.


As an example, many real-time systems are often a giant ball of messy asynchronous code and state machines. Futures can help with that, although lately I have found that sometimes the best, cleanest way to implement a state machine is to make it explicit.


How much do you attribute that to the benefit of creating a high barrier to entry for modifying that code? Could this be summarized as: code that inexperienced devs can't understand, stays performant because they can't figure out how to change it?


None of the teams I've worked with had such a policy and certainly I wouldn't work in a team like that.


I assume at least part of it is that "zero-cost abstractions" is a fairly objective and boolean metric to calculate. "Is this performance impact significant enough to worry about?" would probably result in a lot more bikeshedding.


A tremendous amount of effort has gone into the CLR towards optimizing interface dispatch, because at one time it was slow. Interface dispatches are cached at the call site to avoid real virtual (vtable) dispatch, just like a Smalltalk or JavaScript VM would.


Yeah, using closures in zero-cost abstractions (like iterator or now future adaptors) does make compilation slow. LLVM isn't used to this kind of code, and takes time optimizing it.

However, now that we have MIR we can optimize this on the Rust side first, before the type info has been lost. This was one of the reasons MIR was added :)


I'm... curious about why you think LLVM isn't used to the sort of code that results from these sort of zero-cost abstractions: C++ code is also very aggressive about templates that inline away to nothing, and modern C++ especially has many similar abstractions to Rust (cf. closures, and the range proposal). Rust isn't special in that particular regard.

Maybe you mean LLVM is very low-level and thus it sees everything in the abstractions, which obviously takes time to sort through (and is especially wasteful to do multiple times for different monomorphisations). Thus, using MIR to optimise at a higher level can simplify code more efficiently before it hits LLVM.


I believe he meant lots of closures, which is not normal in C or C++, but is the norm in Rust because they are zero-cost.

Without knowing anything about LLVM's internals, I would assume it doesn't anticipate so much closure chaining, and therefore doesn't leverage the fact that they are so easy to inline.


Lambdas in C++ are almost identical to Rust ones (each closure is a unique unnameable struct containing the captures, with an overloaded operator()), in fact, the current Rust scheme was explicitly inspired by C++11 closures. Historically (C++98), it's true that not much code used closures, because they didn't exist at a language level, but the modern language uses them much more aggressively, even pushing hard on making them more and more flexible (e.g. auto/generic closures in C++14). For instance, sort[1] can take a closure which is more efficient that way than a function pointer, and easier to use than a manually implemented custom type.

Also, closures are just structs with functions that take them as arguments, so if the compiler can handle those, it can handle closures.

[1]: http://en.cppreference.com/w/cpp/algorithm/sort
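
On the Rust side, a rough sketch of that desugaring (the struct name here is invented, and since the real Fn* impls can't be written by hand on stable, a plain method stands in for operator()):

    // What `|x| x + amount` roughly becomes: a struct holding the captures plus a
    // call method the compiler can trivially inline.
    struct Adder { amount: i32 }

    impl Adder {
        fn call(&self, x: i32) -> i32 { x + self.amount }
    }

    fn main() {
        let amount = 5;
        let closure = |x: i32| x + amount;
        let desugared = Adder { amount };
        assert_eq!(closure(3), desugared.call(3));
    }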


Exactly. And hand written function objects were already common in C++98. C++1* lambdas are just (very sweet) syntactic sugar.


Lambdas in modern idiomatic C++ are also essentially zero-cost, or at least they're supposed to be - they're not heap-allocated, and are strong candidates for inlining.


Does clang directly go for LLVM or is there an intermediate form that is optimized similar to MIR?


I believe it goes straight to LLVM IR.


Yeah, I meant the latter, sorry.


Compared to what?


Compared to a non-free Rust future.


I dabbled with rust in the past and was really fascinated with it, but haven't played around lately. One thing caught my eye in the post:

    fn get_row(id: i32) -> impl Future<Item = Row>;
That return type looks odd to me. What does it mean to return an "impl", and is that a new feature in rust, or just something advanced that I missed in my exploration before?


This is an exciting upcoming feature in Rust, which you can read more about in a couple places:

- http://aturon.github.io/blog/2015/09/28/impl-trait/ - https://github.com/rust-lang/rfcs/pull/1522

This feature allows you to return any struct that implements the trait, without having to type the name of the struct (which can sometimes be quite big). It also means that clients only know what traits are implemented; the concrete type is invisible to them. The above links have a bunch more detail.

(And this feature is set to land in nightly Rust very soon! Shoutout to eddyb :)
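
A tiny illustration, using Iterator since it's the familiar case (nightly-only at the time):

    // The caller sees only "some Iterator over i32", not the concrete
    // Filter<Range<i32>, closure> type.
    fn evens_up_to(n: i32) -> impl Iterator<Item = i32> {
        (0..n).filter(|x| x % 2 == 0)
    }

    // evens_up_to(10).collect::<Vec<_>>() == vec![0, 2, 4, 6, 8]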


So, existentially quantified types?


Yes. However, there isn't full support for existential types on the horizon – only returning them from a function (and using them inside the caller, as type inference allows this). This resolves some specific pain points of the current Rust experience.

There may or may not be some extensions (like storing them in fields of structs) in the future.

Note that Rust supported existentials before too, in the form of trait objects. But this was only possible behind a pointer and using virtual dispatch, so the performance story wasn't perfect. The impl Trait syntax is supported through monomorphization.


Yeah, since Rust is eager, existential types seem like a necessary evil.


Sounds awesome. Love it :)


Rust couldn't return bare traits, since their sizes vary. So using a couple of chains and maps would result in the following syntax:

    fn did_too_many_chains() -> Chain<Map<'a, (int, u8), u16, Enumerate<Filter<'a, u8, vec::MoveItems<u8>>>>, SkipWhile<'a, u16, Map<'a, &u16, u16, slice::Items<u16>>>>
Now, you can write:

   fn did_too_many_chains() -> impl Iterator 
and be done with it.


That's awesome. I've been trying to implement a Drain iterator for a tree structure, and it needs to recurse. I ended up with types like your Chain<Map...>> and was banging my head against the wall over it.


That is new syntax which has been accepted as a language change, but has not actually landed in the compiler yet. It was shown this way because it's much easier to understand, but is the same thing as it would be in the real code.

https://github.com/rust-lang/rfcs/blob/master/text/1522-cons... is an explanation of the feature in detail, but it boils down to "I will return you some kind of thing that implements the Future trait, but I'm not telling you exactly what it is."


Does the need for this arise, in some sense, because Rust doesn't offer classical inheritance and so no mechanism to indicate co/contra variance?


Rust does let you communicate variance (it's very important for lifetimes). Variance is determined based on a type's composition, and PhantomData can be used to specify variance that doesn't immediately follow by providing an "example" of what it should behave like. The only limitation of this approach is that you can't override the compiler's reasoning -- if it sees evidence for covariance, and you provide more evidence for contravariance it will just give up and pick invariance (which is of course strictly correct if both pieces of evidence are accurate). Also I guess you can't have bivariance but like... bivariance is so bad.

All of Rust's standard collection and smaht pointer types are written in pure Rust and have relatively complex variance behaviour.


For anyone else wondering what the deal is with "smaht pointers": https://www.reddit.com/r/rust/comments/3404ml/prepooping_you... I think I first saw the term on that thread but didn't stay long enough to see the explanation. It was driving me nuts! :)


No, it's more related to the fact that Rust allows you to use traits in both a statically- and dynamically-dispatched way. You can read some more on the topic here: http://aturon.github.io/blog/2015/09/28/impl-trait/


I'm not sure variance is at all related to this particular feature (maybe you could expand on your thinking?), but I also don't think the classical inheritance techniques for solving this problem are appropriate. I assume you're referring to, say, having an Iterator or Future base class and returning that. This unfortunately imposes costs like virtual calls and allocations, and is in fact already possible in Rust via trait objects (a function can return Box<Iterator<...>>, which is an allocation containing the data along with a vtable of virtual function pointers to manipulate that data). The impl trait feature is designed to allow returning closures and other unnameable/complicated types with no unnecessary cost, it's the same as returning a struct in C without compromising on programmer niceties.


I probably misunderstood something, and it's been awhile since I looked closely at the language. Your reply, and the others to my comment, have been enlightening, thanks!


It's a new feature (hopefully to be merged this week). Here are the gory details:

https://github.com/rust-lang/rfcs/blob/master/text/1522-cons...


Worth noting that by "hopefully to be merged this week" we mean that this initial prototype implementation will land in nightly: https://github.com/rust-lang/rust/pull/35091 . It's not being stabilized this week.


On a related note: why `Future<Item = Row>` instead of `Future<Row>`?

I notice it's consistent with the standard `Iterator<Item = ...>`, but why is Iterator like that?


Because it's an associated type (https://doc.rust-lang.org/book/associated-types.html), not a regular generic parameter.


Aren't generic types with one associated type isomorphic to 'normal' generic types? In that sense, it is still odd.


Not exactly. A given type can have infinite implementations of iterator<A> (by implementing for different A) but only one impl for iterator<Item=Foo>


I don't think I get that without an example. Do you know of any simple examples?


Generic traits can be implemented multiple times, as long as the params differ. This is generally used to establish relationships between types, which can naturally be M:N. Comparison and conversion are standard use cases:

    impl PartialOrd<OsString> for OsString
    impl PartialOrd<str> for OsString
    impl PartialOrd<OsStr> for OsString
    impl PartialOrd<Cow<OsStr>> for OsString
Traits without generics can only be implemented once (because there's no generic params to "differentiate" impls). But it's often still desirable/necessary to relate types to the implementation. This is much more common, as it's generally a case that some type is a ThingDoer in only one way (and having it be a ThingDoer in more than one way may be confusing).

Associated types are often more convenient and give the compiler more room to reason about your code.
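
As a toy illustration of the associated-type side (made up for this comment, not from any library):

    struct Countdown(u32);

    // `type Item` is fixed once per implementing type: Countdown always yields u32.
    impl Iterator for Countdown {
        type Item = u32;
        fn next(&mut self) -> Option<u32> {
            if self.0 == 0 {
                None
            } else {
                let current = self.0;
                self.0 -= 1;
                Some(current)
            }
        }
    }
    // A second `impl Iterator for Countdown { type Item = String; ... }` would be
    // rejected as a conflicting implementation, whereas the generic parameter on
    // PartialOrd<T> above lets one type be comparable to many others.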


Gankro gave an example, but here's a better way to think about associated types:

Think about them the same way you think of a method. Methods are "associated functions". If you implement a trait on a type, it can only have one version of a trait method, not two. Similarly, it can have one associated type, not multiple.

With a generic trait the trait itself is generic; there are multiple "versions" of this trait so you can implement it multiple times for different parameters and different methods.

A concrete way to think about this is overloaded methods. Rust doesn't have the regular kind of overloading, but it does have the Fn traits.

https://doc.rust-lang.org/core/ops/trait.FnOnce.html

(ignore the rust-call stuff)

The trait is essentially:

    trait FnOnce<Args> {
        type Output;
        fn call_once(self, args: Args) -> Self::Output;
    }
Note that it has both a type parameter and an associated type.

Now, we might have a type Foo which is `FnOnce<(u8,),Output=bool>`. This means that it takes a single integer in, and outputs a bool. We can overload it by also implementing `FnOnce<(char,), Output=bool>` and `FnOnce<(u8,u8),Output=char>`. This means that it can also take in a char and return a bool, or take two integers and output a char.

Now, given these impls, I can try to overload it with `FnOnce<(u8,u8),Output=bool>`. However, we can't. Because this means that the function will return either a bool or a char when you feed it two ints. This is not how we want functions to behave, and this is why Output is an associated type. For a given trait implemented on a given object (in this case FnOnce<(u8,u8)> implemented on Foo), there is only one output type. But there can be multiple versions of the trait implemented by changing the generic part. So, while an overloaded function may take in multiple different input types, each set of inputs has only one possible output type.

Of course, Rust doesn't forbid having it the other way around -- we could make Output a generic parameter too, and have overloaded functions which can have different output types for the same input (and need type annotations to choose). But we don't want FnOnce to work that way, so we don't have it like that.

Similarly, given an iterator, there is only one type it can produce. We don't want there to be types which are iterators over multiple things.

On the other hand, the trait PartialEq has a single type parameter. This is because we want you to be able to test equality between many types.


> Of course, Rust doesn't forbid having it the other way around -- we could make Output a generic parameter too, and have overloaded functions which can have different output types for the same input (and need type annotations to choose). But we don't want FnOnce to work that way, so we don't have it like that.

Can you please explain why FnOnce shouldn't work this way?


Because overloading the return type (for the same inputs) is pretty unusual :)

Rust has the machinery to deal with this -- you'd need type annotations -- it just gets annoying.

Specific functions in Rust do get "overloaded" by return type by returning a generic parameter (e.g. the collect() method on iterators), but in general functions are expected to work without needing type annotations on return.
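
For reference, a small example of the collect() case (nothing futures-specific here):

    let words = vec!["zero", "cost"];
    // Same call, two return types, selected by the annotation on the binding:
    let as_vec: Vec<String> = words.iter().map(|w| w.to_string()).collect();
    let as_one: String = words.iter().map(|w| w.to_string()).collect();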

I mean, allowing this kind of use of FnOnce isn't a bad idea. I don't see major issues with it, just minor ones. That's not the choice that was made during the design. I'm sure if we looked at the original design rfcs this point will have come up somewhere :)

I wasn't involved in that decision, so I'm not aware of the exact reasoning.


Thanks! I just thought there might be some quite important technical reasons for this decision.

After all, overloading the return type is quite common and extremely helpful for parsing or type conversion.

But these kinds of operations will most likely be done with a trait in Rust, so having this option for closures might not be that important.


Exactly -- it only stops you from creating function-likes that have multiple return types (not closures, because closures aren't locally generic anyway). If you really need such behavior, regular generic functions or traits will work, e.g. `.collect()`


It's an upcoming feature, AFAIK the syntax is still being decided upon. What it means is that the `get_row` function returns some type that implements the `Future` trait, but we refrain from saying exactly which type that is. This is useful for various reasons, and in this case it's probably just being used because typing out the concrete type for some of these deeply-nested and combined types quickly starts to resemble C++ template hell.


In my opinion - as someone with some background in CS - the name "future" is a little too overloaded here. It is not only used for the deferred computation of a value; it also covers the composition of computations. This is not wrong per se, but calling the result a "future" alone oversimplifies what's happening below and hides some properties of the combinations.

The first observation one can make - which is not mentioned anywhere in the article - is that the composition of futures here can be understood as a monadic composition. This by itself gives a big hint why this interface is so powerful. Second is that this library could be understood as an implementation of process and process combination from pi-calculus [1] - sequential combination, joining, selection, etc - so it could be formalized using its process algebra.

From the practical side, one example of a mature library that implements similar concepts is the LWT [2] library for OCaml, which has the same idea of deferred computation, joining and sequencing, but calls the computations "lightweight threads". One could also argue about naming in this case, but it seems to better reflect the idea of independent "processes" that are combined in the same address space.

Finally, as much as these concepts of futures and processes look similar on the surface, they each have their own properties - so it's always good to consider what better fits the model. By looking at the research and at other similar solutions, one can make more informed choices and have a better idea of what to expect from the implementation.

[1] http://www.cs.cmu.edu/~wing/publications/Wing02a.pdf

[2] http://ocsigen.org/lwt/manual/


Anyone else find the f.select(g)/f.join(g) syntax unintuitive/awkward? I'm confused as to why they wouldn't go with the (IMO) more logical select(f, g) and join(f, g) in this case (since neither Future is really the "subject" in these cases). Not that this is a major concern (it would take only a few lines of code to change within your own program using an alias for the functions), just interested in knowing the rationale behind the choice.


You can write it both ways: t.method() is the same as T::method(t).
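
For example, with any trait method (Iterator::zip here, standing in for Future::join):

    let a = vec![1, 2];
    let b = vec![3, 4];
    // Method syntax and the fully qualified form compile to the same call:
    let zipped_method: Vec<_> = a.iter().zip(b.iter()).collect();
    let zipped_ufcs: Vec<_> = Iterator::zip(a.iter(), b.iter()).collect();
    assert_eq!(zipped_method, zipped_ufcs);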


Finally a nice async IO interface for Rust; I always felt that it was a big missing piece. A couple of questions for people familiar with async in other languages:

1 - Isn't the state machine approach the same as the one C#/.NET async/await uses, but with the added convenience of syntactic sugar?

2 - No allocation: doesn't the lambda closure need to be allocated somewhere?

3 - I would have loved some comparison (both performance-wise and in theory) with the upcoming C++ coroutine work; from my understanding the C++ approach is even more efficient in terms of context switching and has the advantage of even less allocation.


> Isn't the state machine approach the same as the one C#/.NET async/await uses, but with the added convenience of syntactic sugar?

Similar in principle, but the implementation is different. Tasks in C# are more of an OO style instead of a FP style where they turn into an enum (sum type, if you want to get theoretical).

> 2 - No allocation: doesn't the lambda closure need to be allocated somewhere?

No, not in Rust (or in C++).

> from my understanding the C++ approach is even more efficient in terms of context switching and has the advantage of even less allocation

No, you're asserting this without understanding Rust. It's not possible to get less than zero allocation.


Yes, I only have a basic understanding of Rust's memory model. So what happens when/if the future escapes/outlives the current stack/context?


To be clear, the issue here isn't so much stack vs heap, as much as how many heap allocations are happening.

In practice, you build up a really big combined future on the stack, which has all of the space needed for any state in its state machine, and then when the future is fired off (in a task -- one per connection), that entire thing is moved into the heap in one shot. Thus, generally, you do one big allocation up front per connection -- exactly the same as you'd do when writing state machines by hand.

The follow up posts will go into much more detail on this point :)
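
A rough sketch of that shape with futures-0.1-style combinators (the values and types here are placeholders):

    extern crate futures; // futures 0.1
    use futures::{future, Future};

    fn main() {
        // Chaining builds one nested value (the whole state machine) on the stack.
        let work = future::ok::<u32, ()>(1)
            .map(|x| x + 1)
            .and_then(|x| future::ok(x * 2));

        // Handing it off, e.g. boxing it to spawn as a task, is a single allocation
        // that moves the entire state machine to the heap in one shot.
        let _task: Box<Future<Item = u32, Error = ()>> = Box::new(work);
    }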


Fair enough, but the OP was quite misleading in saying that in Rust closures don't need allocation.

So back to futures vs coroutines: it seems that coroutines have the advantage in terms of allocation simply because the context is not destroyed when suspending/resuming them.


  let mut counter = 0;
  let mut increment = || { counter += 1; counter };

  increment();
  increment();
There is no heap allocation there.


This is missing his point. "Increment" is called outside of its scope when the epoll triggers. If "counter" is stack allocated that would cause a segfault, so it must be heap allocated which is his point.


Well, they're two different points: closures in rust do not inherently have to heap allocate. But that also doesn't mean that they can _not_ be heap allocated either.

And in this example, it's not even really the closures that allocate: it's still one allocation, regardless of the number of closures.


No, they are not two different points. You are artificially restricting the argument, and saying that some parts are a different question, by introducing this new idea of the "inherent nature" of a closure in Rust: some humans have legs, some do not. Does that mean that having legs is not an "inherent" part of the human experience?

So to go back to the argument, we are trying to compare a coroutine-based approach vs a future/state machine + closure approach. My point was that because a coroutine allows one to enter and leave a stack frame without destroying it, it can lead to less allocation and therefore be more efficient in those cases (and let's not even talk about the cost of the activation/deactivation of the stack frame).


To use your analogy, I feel like you're saying that only humans that have legs are human. I'm saying that they're all human.

This analogy is weird.

Furthermore, because closures aren't inherently boxed, there might not even be a stack frame to begin with. Closures are syntax sugar for a struct + a trait implementation on that struct, and said call is then very inline-able.

This is also what I'm getting at with the single invocation thing: with these futures, all of these closures are going to be inlined into one big state machine, which is then itself eventually put on the heap. But that heap allocation has nothing to do with the closure, and more to do with the future wanting to not tie its data down to a specific stack frame.


I think this is becoming a semantic argument.

I think we all understand that Rust/C++ closures don't necessarily allocate memory. His point is simply that by using a rust closure to asynchronously handle an event a heap allocation must be done. Compare to the stack swapping method of classical coroutines, which wouldn't require a heap allocation.

I recognize that with Rust zero-cost futures, one big heap allocation is done for the entire futures combination and is roughly analogous to one big eager stack allocation for a coroutine.


Yes, I think so too. I just want to make sure the semantics are clear. I think your summary here is excellent.


> Some humans have legs, some do not. Does it mean that having legs is not an "inherent" part of the human experience?

I mean - yes, that is what it means.


You don't have to destroy the "stack frame" (by which I assume you mean the state retained between blocking operations) every time you enter and leave. Why do you think you need to?


He's talking about the difference between swapping the stack pointer and executing a single "jump" instruction vs initiating a full function call (pushing argument(s), function prologue/epilogue)


Calling a function isn't "allocation". And you don't just swap the stack pointer: you have to reload the register state. It's the same cost as a function call.


May I ask, where do you assume a coroutine's stack lives?

i.e. In relation to program memory


Not sure what you mean, but from my understanding, coroutine stacks are just allocated pieces of memory in the current address space (they might be stack-allocated or heap-allocated depending on the situation).


Compile error if it's holding on to any borrowed data on the stack. However, borrowing is only the default: if you move from a capture, your closure will contain that value instead of a reference to the value, and if you have "move" in front of your closure, all the captures are contained in the closure, so by boxing it or by using -> impl Fn(...) you can safely return it.

If you need to hold onto data accessed from multiple locations, you'd need to use Rc<T> instead of keeping that data on the stack, which is closer to GC'd languages (and pretty similar to Swift, IIUC, except Rust is more explicit).
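
A minimal sketch of the boxed `move` closure case (make_counter is a made-up name):

    // The counter is owned by the closure thanks to `move`, so the boxed closure
    // can safely outlive the stack frame that created it.
    fn make_counter() -> Box<FnMut() -> u32> {
        let mut count = 0;
        Box::new(move || {
            count += 1;
            count
        })
    }

    fn main() {
        let mut tick = make_counter();
        assert_eq!(tick(), 1);
        assert_eq!(tick(), 2);
    }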


This is handled by the borrow checker, the same as anything else that might hold references to stack data - you get a compiler error and then can decide how to resolve the conflict, either by extending the lifetime of the referent (perhaps by boxing, or creating it earlier in the call stack), or by shrinking the lifetime of the future.


It's an error if that would be unsafe, or else it compiles fine: the future compiles down to an enum/struct that stores everything inline, and it can be returned/moved up the stack/stored in global memory the same way that aggregate values are returned without allocation (etc.) in C or C++.


From the post

  > There are essentially no imposed synchronization costs; if you want
  > to associate data that lives on your event loop and access it in a
  > single-threaded way from futures, we give you the tools to do so.


C++ programmer here; re 3: the C++ coroutine proposals (there are multiple) are completely orthogonal to futures.

The closest thing in the standard to these Rust futures is of course std::future, which is currently very limited, as it even lacks chaining and composition. The Concurrency TS (which eventually will be added to the standard, but sadly not for C++17) does provide these features (and more); multiple implementations have been available for some time (Boost, HPX, Folly, and some compiler vendors).

The issue with std::future is that it is far from zero-cost, as it requires allocation (the control block is dynamically allocated), type erasure (the callback type is not part of the future) and synchronisation (as the promise can be fulfilled from another thread). Efficient implementations of std::future are possible, but will never be as fast as what is shown in this Rust library.

On the other hand the SeaStar C++ library (I'm in no way involved, just a fan) already provides pretty much the same zero cost futures with a very similar implementation as it doesn't try to follow the standard design.


Thanks, your answer clarifies a lot. So my question was: as a mechanism for implementing async operations, how do C++ coroutines compare to this Rust library? From my understanding, the C++ committee is pushing for the introduction of coroutines specifically because any library-based approach would have the limitations you mentioned. Is that correct?

You seem familiar with the C++ committee; do you know a good way to track a feature one is interested in? I kinda lost track of the coroutine proposal after the 4th revision of Gor's paper.

By the way, besides Visual Studio, does any other compiler have a working implementation of the current draft?


Coroutines would still run on top of std::future; it would be just syntactic sugar to avoid callback hell via .then chaining. In some cases this might save an allocation, as the callback would be a plain pointer to the control block, but that's about it.

I'm not part of the committee, nor do I attend meetings, but I follow the papers and the public online discussions.


From the latest Gor paper, coroutines don't need to run on top of the std::future class. The required type traits allow for more computational patterns (like generators, agents/actors, full coroutines, etc.).


No, but they are still designed to work well with std::future, and that's going to be the standard-blessed way to do async IO.


In Rust, closures don't need to be boxed, because they're just bytes like everything else. In this case, they can be stored by-value as a generic type in the future itself, which can be stored on the stack.

The one time that you do tend to see closures get boxed is when returning them, because you can't write out their type. In the future you'll be able to get around that by using the `impl Trait` syntax mentioned in the post (https://github.com/rust-lang/rfcs/pull/1522).
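
A hypothetical sketch of what that will look like (nightly-only at the time):

    // The unnameable closure type is returned directly: no Box, no allocation,
    // and callers only see "some Fn(i32) -> i32".
    fn adder(amount: i32) -> impl Fn(i32) -> i32 {
        move |x| x + amount
    }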


1. I am not sure exactly how async/await is implemented, but I believe it is very similar. Some people are also working on implementing similar sugar in Rust, but it's not done yet.

2. Closures are on the stack, not the heap, by default, in Rust. If you don't see a Box, they're not heap allocated.

3. I agree!


Stack-allocated closures are one of my favorite features of Rust. They make it easy to write functional-style code that performs well.



> 2 - No allocation: doesn't the lambda closure need to be allocated somewhere?

I expect it is embedded in the state machine structure. In Rust (and C++) lambdas/closures are "value types".


In C++ lambdas have limited size (about 3 machine words). If its context is larger than that, the data may be allocated on the heap (if the compiler can't optimize that allocation out).


What you're talking about is std::function, a wrapper around raw lambda objects that performs dynamic dispatch. Since any lambda can be wrapped in std::function yet std::function, like all C++ types, must have a fixed size, it uses the heap as you say.

If you stick to the raw lambdas, however, the compiler knows the size and doesn't need to use the heap. Only problem is that their type can't be named. In C++, you can work around this with auto.

Without getting too much into it, Rust has a similar distinction between raw lambdas and Box<Fn(...)>. This library uses the former.


I thought in C++ the lambda gets created on the stack like any other object.


They do, but to actually use it, to pass it anywhere, you have to either wrap it in some other object (std::function, as comex pointed out) or use a template.


A little benchmark, futures-rs vs lwan (https://lwan.ws), on my machine (Core i5).

futures-minihttp(singlethread):

  $ wrk -c 100 -t 2 -d 20 http://127.0.0.1:8080/plaintext
  Running 20s test @ http://127.0.0.1:8080/plaintext
    2 threads and 100 connections
    Thread Stats   Avg      Stdev     Max   +/- Stdev
      Latency   823.09us  449.37us  20.98ms   98.69%
      Req/Sec    62.15k    10.51k  105.24k    48.63%
    2479035 requests in 20.10s, 340.44MB read
  Requests/sec: 123335.77
  Transfer/sec:     16.94MB
lwan(singlethread):

  $ wrk -c 100 -t 2 -d 20 http://127.0.0.1:8080/
  Running 20s test @ http://127.0.0.1:8080/
    2 threads and 100 connections
    Thread Stats   Avg      Stdev     Max   +/- Stdev
      Latency   596.45us  573.31us  24.46ms   99.33%
      Req/Sec    86.17k    13.15k  119.71k    76.00%
    3429720 requests in 20.01s, 624.73MB read
  Requests/sec: 171404.15
  Transfer/sec:     31.22MB
For lwan I used the HTTP server example from the lwan.ws main page.

As you can see, in this example the C HTTP server is much faster than the simple Rust HTTP server.

* futures-minihttp release build

* lwan -O3


On Reddit [1], Alex mentioned that the single-thread case was not optimized (yet). What do the numbers for multi-threading look like?

[1]: http://reddit.com/r/rust/comments/4x8jqt/zerocost_futures_in...


futures-minihttp(2 threads):

  $ wrk -c 100 -t 2 -d 20 http://127.0.0.1:8080/plaintext
  Running 20s test @ http://127.0.0.1:8080/plaintext
    2 threads and 100 connections
    Thread Stats   Avg      Stdev     Max   +/- Stdev
      Latency   633.87us  808.22us  24.66ms   98.41%
      Req/Sec    86.72k     4.84k   94.90k    85.75%
    3452383 requests in 20.01s, 424.73MB read
  Requests/sec: 172546.21
  Transfer/sec:     21.23MB

lwan(2 threads):

  $ wrk -c 100 -t 2 -d 20 http://127.0.0.1:8080/
  Running 20s test @ http://127.0.0.1:8080/
    2 threads and 100 connections
    Thread Stats   Avg      Stdev     Max   +/- Stdev
      Latency   375.26us  618.54us  20.60ms   98.21%
      Req/Sec   116.82k     5.56k  124.07k    92.25%
    4647906 requests in 20.00s, 846.62MB read
  Requests/sec: 232339.90
  Transfer/sec:     42.32MB


But for me, this difference is nothing, because in Rust we can write safe and fast close-to-the-metal code. I think Rust is the greatest modern systems programming language.


This is cute. This is clever. Whether or not it's too clever time will tell. A year ago, I noted that Rust was starting out at roughly the cruft level C++ took 20 years to reach. Rust is now well beyond that.

All this "futures" stuff strongly favors the main path over any other paths. You can't loop, retry, or easily branch on an error, other than bailing out. It's really a weird syntax for describing a limited type of state machine.

I'm not saying it's good or bad, but it seems a bit tortured.


How does futures count as cruft? It's a pure library. This is like saying GObject is cruft in C. It is a library that some folks don't like (like most libraries), but it is not part of the language and nobody is forced to use it in their code.

Why do you think Rust has cruft anyway? It has a lot of typesystem features, yes, but this is no different from languages like Haskell. These features work together nicely and are useful. C++ has lots of features which for better or for worse have been hacked in to the language (can't be made an organic part of the language because backcompat). This is not the case with Rust (or D, which is to me a cruft-less organically-designed C++)


Yes, it's a pure library. But this sort of thing is becoming standard for Rust. If your code isn't full of "foo.and_then(|x| ...)" it's uncool. This isn't "functional"; these functions have major side effects.

The bothersome thing is that the control structures of the language are hidden under object-specific functions. "Things really start getting interesting with futures when you combine them. There are endless ways of doing so." I'd rather have "there's only one way to do it", as in Python, rather than "endless ways of doing so". That usually leads to code that's hard to read and debug. As in "how did control get there?". At least in traditional code, you can see control flow easily. Adding an user-definable level of abstraction hides that.


> This isn't "functional"; these functions have major side effects

uh, no, I've rarely seen adaptors like and_then being used with side effects (folks use regular loops if they want that). Rust doesn't have a strict notion of purity, but that doesn't mean that most rust code isn't pure.

There's nothing wrong with having lots of adaptors scattered around the code, either. It's not less readable, it's just different.

> There are endless ways of doing so." I'd rather have "there's only one way to do it", as in Python

Uh, "there are endless ways of combining future adaptors", not "there are endless ways of solving a problem". Each combination of adaptors solves a different problem (mostly).

> That usually leads to code that's hard to read and debug

This is async code. This has always been hard to read and debug. Futures make the control flow more explicit, if anything (especially if you have async and await), because the flow is now in one place, at least. Grokking manual event loop code is much more annoying.

Sure, the async argument doesn't apply to regular iterators and Option. However, these "object specific functions" are not object specific. All futures have the same adaptors. All iterators have the same adaptors. The only special objects with their own set of such methods are Result and Option. These share the names of the methods, and these are used often enough to justify it. There aren't that many of them either, so this really isn't that big a deal. You just need to know what each of this small number of methods does. It only hides control flow if you're not aware of these methods, which is a state of mind that goes away quickly after the first few rust programs.

Besides, because of the closures it's pretty obvious that some tweaking of control flow is happening, so it isn't hidden. You can check the docs for that function to know exactly what the tweaking is.

I also don't know what you mean by "traditional code", this pattern is exceedingly common in languages which aren't C or C++.


Your definition of cruft is rather strange. Yes, Rust code can be kind of complicated and messy-looking, but it makes sense when you think about it. All the parts justify their existence. IMO, cruft is stuff that doesn't make sense or is extraneous; incidental rather than essential complexity, in general.


> All this "futures" stuff strongly favors the main path over any other paths. You can't loop, retry, or easily branch on an error, other than bailing out.

You can do all of that.


OK, make an HTTP request, and if it fails, wait 2 seconds and retry. After 10 times, give up.


Timeouts are covered here. http://alexcrichton.com/futures-rs/futures/index.html#exampl...

As for repeating something multiple times, you would either repeat the chain 10 times or write your own future combinator. Probably writing a custom future combinator would make the most sense here; actually, such a thing ought to be built-in to futures.rs.
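
To sketch what such a hand-rolled combinator could look like against a futures-0.1-style Future trait (every name here is invented, and the 2-second delay between attempts is elided; in practice you'd chain in a timer future before restarting):

    extern crate futures; // futures 0.1
    use futures::{Async, Future, Poll};

    // Construct with the first attempt already started, e.g.
    // Retry { make: start_request, current: start_request(), remaining: 9 }.
    struct Retry<F, Fut> {
        make: F,        // closure that starts a fresh attempt
        current: Fut,   // attempt currently in flight
        remaining: u32, // retries left before giving up
    }

    impl<F, Fut> Future for Retry<F, Fut>
    where
        F: FnMut() -> Fut,
        Fut: Future,
    {
        type Item = Fut::Item;
        type Error = Fut::Error;

        fn poll(&mut self) -> Poll<Self::Item, Self::Error> {
            loop {
                match self.current.poll() {
                    Ok(Async::Ready(value)) => return Ok(Async::Ready(value)),
                    Ok(Async::NotReady) => return Ok(Async::NotReady),
                    Err(err) if self.remaining == 0 => return Err(err),
                    Err(_) => {
                        self.remaining -= 1;
                        self.current = (self.make)(); // kick off the next attempt
                    }
                }
            }
        }
    }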


That's the problem - having to roll your own control structures. IF, WHILE, and FOR cover just about everything. This semi-functional style doesn't break down into a small number of simple, well-known primitives.


What's your definition of "cruft," and where does it appear in Rust?

I will be the first to admit that Rust is not perfect, but that adjective does not come to mind. I am of course incredibly biased.


Cruft? Coming from C++ I find it quite the opposite.


I'm confused by

    .map(|row| { json::encode(row) })
    .map(|val| some_new_value(val))
Over

    .map(json::encode)
    .map(some_new_value)
Is the explicit extra layer of lambda generally prefered in Rust over just passing the functions?


It's just a style thing, some people prefer one way, some another. I personally prefer the latter. They compile to the exact same thing.


That makes sense. Do you know if Rust already has an idiomatically correct style?


I don't think that there's ever been an explicit discussion about it. I'm not even sure how many people know the latter is possible, to be honest, it's a bit harder to learn about.


I sometimes find point-free style confusing, because I can't tell what is being encoded. Whereas in the first example I would be like "OK, we encode a row..."


It makes the blog post more approachable for users. I have seen the explicit style being used over the point-free style many times before, since it seems like it's easier for newbies to grasp - I haven't seen any evidence to support that, though.


I wonder if it is information used by the compiler to do error checks. Keep in mind that one of the goals of Rust is to catch common errors in C-like languages at compile time rather than run time.


No. If there were a type error with either form, the compiler would catch it.


I love the direction this is going, and the performance it achieves.

Debugging promises/deferreds in other languages has given me nightmares, compare with erlang/golang debugging where you get a simple stacktrace.

Does this provide some nice way of debugging complex future chains? Are there plans towards making it super easy to debug?

Cheers!


Native promises in JavaScript suffered from debugging issues - but there are hooks that can help you with those, like unhandledRejection and the like.

Debugging promises in JavaScript is very easy at the moment.


I'm rather surprised by the benchmark; I would expect the Go benchmark to be faster than Java (and the fact that it isn't may indicate some improvements that can be done to fasthttp by learning from rapidoid or minihttp). Then again, the difference isn't that much, so it just could be implementation details that would require a total refactor to fix.


You may find it makes somewhat more sense to think of it as ~5.3 microseconds per request for fasthttp vs. ~4.8 microseconds for Java vs. ~4.3 for Rust. It's 40 microseconds or so for the standard lib Go. I'm just eyeballing the graph but this should be close enough (dominated by local CPU variances and such). Just as some people point out that "gallons per mile" is a more intuitively useful way of thinking, I think that at this scale "overhead per request" is a better way of thinking about it.

I'm not generally a big fan of measuring the "ping time" response for web servers, but in this case I believe it is justified, since we really are trying to establish that this futures library is very fast. I fear, based on experience, that some developers will be looking at this and will sit there trying to choose fasthttp vs. rapidoid vs. minihttp based on this one graph, without considering what the numbers really mean. Overhead-per-request, I think, makes it more clear that for the vast, vast majority of purposes, all of these, including Go and Node, are "way way faster than your code", and unless you know you're building a server where you seriously need to answer an API call at several hundred thousand responses per second, all of these are "fast enough" and the real criterion for choosing should be "everything else".

(One last edit... remember, it's "milli/micro/nano". Micro seems to be forgotten since it seems like many things that we care about fit into nano- or milli-. It's quite a challenge in the web world to get your request out in under a millisecond usually, so .040 milliseconds added as HTTP overhead is rarely the problem.)


Thanks for this comment! I actually totally agree with this perspective, and wish I'd used your suggested scale in the post.


In the case of valyala/fasthttp vs. net/http, a bonus thing is that with fasthttp you (currently, AFAIK) lose HTTP/2 support! Needs vary a ton with the app, environment, etc., of course, but it'd be sad to change your server to save 40µs of CPU per request, then wind up making users wait extra RTTs--tens to hundreds of ms--as a result.


>> "I fear, based on experience, that some developers well be looking at this and will sit there trying to choose..."

Sure, but it's not the responsibility of this blog post to protect bad/lazy engineers from themselves.


Why would you expect that? AFAIK the Go compiler has a worse optimizer, a worse GC (at least throughput-wise), and possibly also a worse threading runtime. Since the benchmark measures steady-state performance rather than memory or latency, and the best Java server is probably optimized to almost never allocate, Java having better code generation and threading could give it that much of an edge.


Actually, optimizer, GC and a threading runtime have almost nothing to do with Go's net/http performing that badly. The reasons are mostly in very poor design choices in the library itself that forced them to do a lot of implicit synchronization, unnecessary memory copying, unnecessary system calls, etc. It's just a huge mess, avoid it whenever possible.


Yah I know, both me and the parent comment were talking about fasthttp vs the fastest Java web server. Both of which are likely optimized like crazy to avoid synchronization and allocation as much as possible.


Go GCs fewer things, though. At least in my experience -- this isn't a hard fact :)

To me, Go feels like C-but-some-things-are-GCd, which Java is not. The explicit sharing annotations in Go mean that it's super easy for it to not put most things on the GC heap, and it's super easy for me, the programmer, to see what's not being put on the GC heap. Whenever I've debugged Go code this has been validated.

But yes, it's not as clear cut. I ignored the threading runtime being a factor, and ignored that Java has a better GC (though it GCs more things).

I also ignored JIT as steve pointed out to me in irc.

So yeah, there could be language reasons behind the benchmark order too. Worth investigating though, I'm sure the folks using fasthttp would love a perf boost :)


Well, Go has a 1.7 release coming up very soon which has a nice performance boost due to the new SSA backend. I assume this test was done using Go 1.6.


Yep, same here -- at least for fasthttp. This is pretty far into microbenchmarking territory, so the differences among the top libraries likely don't matter much in practice. But I think the main point is pretty clear -- even with an extreme microbenchmark, futures don't seem to be imposing notable cost.


Add around a 10-30% speedup (with cases of even 150%) in Go 1.7, which will be released next week.


