I do want to make clear that most of the optimizations discussed are implemented in the Go scheduler, which is where I discovered them. I wrote the article to call them out as they were not easy to discover.
Do you think it would make sense to dynamically adjust the number of threads in the pool? An example of a thread pool that dynamically adjusts the number of threads is .NET's. The algorithm is described by Matt Warren and in a paper. In addition to the C++ version Matt links to, there is a C# implementation.
There is follow-up work planned to allow annotating blocking / CPU-intensive bits of code so that the scheduler can do something smarter (things like what you linked).
The old scheduler already had such an annotation but, in the interest of shipping, this was punted in the new scheduler.
So, if that is acceptable, then it is fine.
Supposing you were running an HTTP server and most responses were very quick, but occasionally a request came along that required calculating pi to the n-billionth digit (for example). Would you need to farm that request out to a separate process, or could you keep it in-process but on a separate thread outside of Tokio?
Preemption is out of scope for Tokio, as we are focusing on a runtime to power Rust async functions. So, for the foreseeable future, Tokio will use "cooperative preemption" via `await` points.
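To make the "cooperative preemption via `await` points" idea concrete, here is a minimal hand-rolled yield future, a sketch using only std types (not Tokio's implementation):

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

/// A future that yields back to the scheduler exactly once: the kind of
/// explicit `await` point that cooperative scheduling relies on.
struct YieldNow {
    yielded: bool,
}

impl Future for YieldNow {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.yielded {
            Poll::Ready(())
        } else {
            self.yielded = true;
            // Ask to be polled again right away...
            cx.waker().wake_by_ref();
            // ...but return Pending so other tasks get a turn first.
            Poll::Pending
        }
    }
}
```

Sprinkling `YieldNow { yielded: false }.await;` inside a long-running loop is what "cooperating" looks like in practice.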
How do you preempt syscalls or FFI calls? (Go will almost certainly also have this issue with cgo modules)
I would be very surprised if preemptible Futures are something Tokio can or would implement.
But this requires code pages to be writable, which, from a security perspective, is awful.
The `unsync_load` is used because, at that point, two things are guaranteed:
- The only thread that mutates the field is the `unsync_load` caller
- All other threads will only read from that field.
Because of this, we can just do a "plain old" access of the field w/o the atomic overhead.
As for whether or not it should be in `std`, I don't know. It probably could, but it doesn't have to. It's a pretty "advanced" access strategy.
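A minimal sketch of what such an access can look like (illustrative only, not Tokio's actual code):

```rust
use std::sync::atomic::AtomicU32;

/// Plain, unsynchronized load of an atomic cell.
///
/// Safety: the caller must uphold exactly the two guarantees above: the
/// calling thread is the only one that mutates the value, and all other
/// threads only read it (via atomic loads).
unsafe fn unsync_load(cell: &AtomicU32) -> u32 {
    // `AtomicU32` is documented to have the same in-memory representation
    // as `u32`, so a plain read is sound when no data race is possible.
    std::ptr::read(cell as *const AtomicU32 as *const u32)
}
```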
Can't you get some values "out of thin air" (partial updates of some bits), or some intermediate values?
(For example, can the compiler decide to use this memory location to store some unrelated temporary result, since it knows it will erase it soon with a final value and that no other thread is supposed to access it?)
Although in this case, I guess this is probably fine since the non-atomic read can't race with a write.
For example, the compiler is not allowed to turn a single

    x = load-relaxed(a)

into two separate loads

    x0 = load-relaxed(a)
    x1 = load-relaxed(a)

so under register pressure the compiler would be required to spill something onto the stack instead of just reloading `x` from the original location.
The consumer-side mutex works pretty well, until you take an interrupt while holding it; then, you need to make sure you're using an unfair mutex, or you'll get a lock convoy.
The non-blocking version isn't hard to write, though. IIRC, for the Windows threadpool, I used https://www.cs.rochester.edu/~scott/papers/1996_PODC_queues.... for the implementation.
I'm really impressed by the concurrency analysis tool. Again for the threadpool, we used model-based testing to write tests for every permutation of API calls, but we never got down to the level of testing thread execution permutations; that's amazingly good to see, and it really builds confidence in their implementation.
I could have kept working on this for many more months... but my collaborators were starting to get grumpy and telling me to ship already :)
In "A better run queue" the second code block has
self.tail.store(tail + 1, Release);
> The first load can most likely be done safely with Relaxed ordering, but there were no measurable gains in switching.
This code is using an Acquire/Release pair, except it's Acquire/Release on different memory slots, which means there's no actual happens-before edge. This makes the Acquire/Release pair somewhat confusing (which tripped me up too in the first version of this comment).
Furthermore, I don't see anything in here that relies on the head load here being an Acquire. This Acquire would synchronize with the Release store on head in the pop operation, except what's being written in the pop is the fact that we don't care about the memory anymore, meaning we don't need to be able to read anything written by the pop (except for the head pointer itself).
This suggests to me that the head load/store should be Relaxed (especially because in the load literally all we care about for head is the current pointer value, for the length comparison, as already explained by the article). The tail of course should remain Release on the push, and I assume there's an Acquire there on the work-stealing pop (since you only list the local-thread pop code, which is doing an unsynced load here as an optimization). I don't promise that this is correct but this is my impression after a few minutes of consideration.
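To illustrate the happens-before point: a Release store only synchronizes with an Acquire load of the same location. A self-contained message-passing example (plain std, nothing Tokio-specific):

```rust
use std::sync::atomic::{AtomicBool, AtomicU32};
use std::sync::atomic::Ordering::{Acquire, Relaxed, Release};
use std::thread;

static DATA: AtomicU32 = AtomicU32::new(0);
static READY: AtomicBool = AtomicBool::new(false);

fn main() {
    let producer = thread::spawn(|| {
        DATA.store(42, Relaxed);    // write the payload first...
        READY.store(true, Release); // ...then publish with Release on READY
    });

    // The Acquire must be on the SAME atomic (READY) as the Release;
    // an Acquire load of some other location would create no
    // happens-before edge with the producer's store.
    while !READY.load(Acquire) {
        std::hint::spin_loop();
    }
    assert_eq!(DATA.load(Relaxed), 42); // guaranteed to observe 42

    producer.join().unwrap();
}
```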
If I have the time tonight I'll try to look over the PR. No promises though.
FWIW: Hyper is among several frameworks that are bandwidth-limited by the 10-gigabit Ethernet in our (TechEmpower) Plaintext test. We hope to increase the available network bandwidth in the future to give these higher-performance platforms an opportunity to really show off.
I read a long time ago about how the Linux kernel handled incoming packets in a really inefficient way that led to the network stack being the bottleneck in high-performance applications. Is that no longer true?
I'd love to learn more about this whole area if anyone has any good links.
It's generally possible to go 3x-10x faster than Linux if you make some tradeoffs like dedicating cores to polling. See https://www.usenix.org/node/186147 and https://wiki.fd.io/view/VPP/What_is_VPP%3F Moore's Law excuses a lot of sins, however, so on the latest servers Linux is probably adequate up to 100 Gbps for most use cases.
So the theoretical optimum is 14.88 Mpps (millions of packets per second). That's with the smallest possible payload. You need to divide that by two because the test involves both a request and a response, which gives you 7.44 million requests per second as the theoretical maximum.
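For reference, the arithmetic: a minimum-size request occupies 84 bytes on the wire (64-byte frame plus 8-byte preamble and 12-byte inter-frame gap), so 10 Gbit/s ÷ (84 bytes × 8 bits/byte) ≈ 14.88 Mpps, and 14.88 ÷ 2 ≈ 7.44 million request/response pairs per second.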
The best tests max out around 7 M rps. This gives an implied overhead of about 6%, which corresponds to the HTTP overhead.
However, if you added a benchmark that downloads 10 MB static files, network utilization would likely go up a lot, even if the same frameworks were used.
I'll be spending some time on the CDSChecker and dynamic partial-order reduction papers.
The goal for this article was primarily to focus on scheduler design.
I'm hoping to get posts dedicated to loom soon. It's been an indispensable tool.
I guess 2019 is the year of concurrency checking :-)
From a short glance, my understanding of Tokio is that, at a high level, it is an execution environment that basically implements data-exchange protocols based on asynchronous I/O.
Then they say the following:

> Tokio is built around futures. Futures aren't a new idea, but the way Tokio uses them is unique. Unlike futures from other languages, Tokio's futures compile down to a state machine.
So the question here is: why not expose a generic state-machine framework (FSM/HSM) which can deal with those protocols but is also very useful for many other tasks as well?
Short answer is: you can.
Tokio is more like "node.js for Rust" or "the Go runtime for Rust".
The bit about Tokio & futures is how you use Tokio today. Tasks are defined as futures, and Rust futures are FSMs. So, a protocol can be implemented as a FSM completely independently of Tokio / futures and then used from Tokio.
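For instance, here is a toy length-prefixed protocol written as a plain FSM, with no Tokio or futures types involved (an illustrative sketch, not Tokio code): one length byte, then that many payload bytes.

```rust
/// Decoder states for the toy protocol.
enum State {
    AwaitLen,
    AwaitBody { remaining: usize },
    Done,
}

struct Decoder {
    state: State,
    body: Vec<u8>,
}

impl Decoder {
    fn new() -> Self {
        Decoder { state: State::AwaitLen, body: Vec::new() }
    }

    /// Feed bytes as they arrive (from blocking reads, async reads, or a
    /// test vector); returns true once a full frame has been decoded.
    fn feed(&mut self, mut input: &[u8]) -> bool {
        while !input.is_empty() {
            match self.state {
                State::AwaitLen => {
                    self.state = State::AwaitBody { remaining: input[0] as usize };
                    input = &input[1..];
                }
                State::AwaitBody { remaining } => {
                    let take = remaining.min(input.len());
                    self.body.extend_from_slice(&input[..take]);
                    input = &input[take..];
                    self.state = if take == remaining {
                        State::Done
                    } else {
                        State::AwaitBody { remaining: remaining - take }
                    };
                }
                State::Done => break,
            }
        }
        matches!(self.state, State::Done)
    }
}

fn main() {
    let mut dec = Decoder::new();
    assert!(!dec.feed(&[3, b'h']));   // partial frame
    assert!(dec.feed(b"i!"));         // the rest arrives later
    assert_eq!(dec.body, b"hi!");
}
```

An async task would simply call `feed` with whatever bytes an async read produced; the decoder itself never needs to know about the runtime.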
In Tokio's next major release (coming soon), it will work with Rust's async fn feature (which is also about to hit stable).
Hope that clarifies things.
But that was the point of my question. Since you've already built that FSM framework inside, why not expose it? Or I guess it might be too specialized for your particular case.
The futures crate, version 0.3 (released under the name `futures-preview` right now with an alpha version number), contains the helper methods for the standard library's trait.
A few questions:
Did you consider work sharing or work requesting instead of work stealing (the global queue is a form of work sharing, of course)? With work requesting you can avoid the required store/load on the pop path. Of course, work requesting is not appropriate if the scheduler needs to protect against the occasional long-running job that doesn't yield to the scheduler in time to service a work request with appropriate frequency, but assuming this is rare enough, an expensive last-resort work-stealing path could be done via asymmetric synchronization (i.e. sys_membarrier or similar), keeping the pop fast path, well, fast.
Is there any way to constrain (groups of) tasks to a specific scheduler? This is particularly useful if they all need to access specific resources, as there is then no need for mutual exclusion; this works well in applications where executors are heterogeneous and have distinct responsibilities.
Because of the global queue and the work-stealing behaviour, absolutely no ordering is guaranteed for any task, is that correct?
I find it surprising that you need to reference count a task explicitly. Having implemented a similar scheduler (albeit in C++), I haven't found the need for refcounting: normally the task is owned by either a scheduler queue or a single wait queue (but obviously never both). In case the task is waiting for any of multiple events, on first wake-up it will unregister itself from all remaining events (tricky but doable); only in the case where it is waiting for N events do you need an (atomic) count of the events delivered so far (and only if you want to avoid having to wake up just to immediately go back to sleep waiting for the next event). Also, this increment comes for free if you need an expensive synchronization in the wake-up path.
That's all so far, I'm sure I'll come up with more questions later. Again great job and congratulations on the performance improvements.
Not really. It did cross my mind for a moment, but my gut (which is often wrong) says the latency needed to request work would be much higher than what is needed to steal. I probably should test it though at some point :)
> assuming this is rare enough, an expensive last-resort work-stealing
It's not _that_ rare. Stealing is key when a large batch of tasks arrive (usually after a call to `epoll_wait`). Again, I have no numbers to back any of this :)
I also have no numbers to back it up (we deal with a relatively small number of fds at $JOB and we care about latency more than throughput), but I do not buy epoll_wait generating large batches of work: it seems to me that you should only be using epoll for fds that are unlikely to be ready for I/O (i.e. those for which a speculative read or write has returned EWOULDBLOCK), which means you should not have large batches of ready fds. Even if you do, the only case in which you would need to steal after that is if another processor ran out of work after this processor returned from epoll_wait (if it did so before, it should have blocked on the same epoll fd and gotten a chunk of the work directly, or could have signaled a request for work), which might be less likely.
Anyway, at the end of the day, a single mostly-uncontended CAS is relatively cheap on x86, especially if the average job length is large enough, so maybe it is just not worth optimizing further, especially if it requires more complexity.
> I find surprising that you need to reference count a task explicitly.
This is likely due to how the task and waker system is defined in Rust. A Waker is kind of a handle to the task, whose lifetime is indefinite. Some code can hold a handle/Waker to a task which finished long ago. One way to make sure the handle is still valid is to reference count it. Since in the Tokio system the Waker/handle and the task's internal state are actually the same object, the combined state gets a reference count.
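A minimal sketch of that pattern using the standard library's `Wake` trait (illustrative only; Tokio's real task layout is more involved, and predates this trait):

```rust
use std::sync::{Arc, Mutex};
use std::task::{Wake, Waker};

/// The task's shared state doubles as the wake target: every Waker clone
/// is just an Arc clone (a refcount bump), so a Waker can safely outlive
/// the task's completion.
struct TaskState {
    scheduled: Mutex<bool>,
}

impl Wake for TaskState {
    fn wake(self: Arc<Self>) {
        // A real scheduler would push the task back onto a run queue here.
        *self.scheduled.lock().unwrap() = true;
    }
}

fn main() {
    let state = Arc::new(TaskState { scheduled: Mutex::new(false) });
    let waker: Waker = Arc::clone(&state).into(); // refcount +1
    let second = waker.clone();                   // refcount +1 again
    second.wake();                                // consumes one reference
    assert!(*state.scheduled.lock().unwrap());
    drop(waker);                                  // refcount back down
}
```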
Or maybe more like Scala Akka Futures?
Tokio is the oldest and most established of these runtimes, and is often the basis for web servers in Rust.
So yeah, “the node event loop as a library” is a decent way to think about it.
Why is this better than having one official, standardized, optionally importable runtime?
1. official - it's not clear how this would be better; that is, the team is not likely to have more expertise in this space than others do
2. standardized - it is standardized, as I said, the interface is in the standard library
3. optionally importable - this is fine, of course, but so are these runtimes, so this attribute isn't really better or worse.
But I don't think this is the biggest strength of this approach, the big strength is that not everyone wants the same thing out of a runtime. The embedded runtimes work very differently than Tokio, for example. As the post talks about, it's assuming you're running on a many core machine; that assumption will not be true on a small embedded device. Others would develop other runtimes for other environments or for other specific needs, and we'd be back to the same state as today.
libuv is a native C library that V8+node uses for async disk and network access.
From what I know about Akka, I believe Akka does more than Tokio. For example, Akka manages a supervision tree. I believe Akka is trying to be more of an actor system, whereas Tokio is just trying to provide an asynchronous runtime for Rust.
They even went into that with the comparison with Seastar.
If this is better than Seastar, someone should start porting ScyllaDB to Rust. Does Rust have a userspace TCP/IP stack like Seastar's? I think that is the other major advance Seastar uses for speedup.
Tokio is a runtime for managing Rust's Futures, i.e. the scheduler that manages the various jobs (Futures) to run, etc.
well, I guess not :)
Should they update their description? lol
Some chance you might be hitting some pathological behavior that could be fixed or tweaked.
I'd expect something highly specialized, such as a gRPC server, to perform better.
The only optimization I've done was to ditch maps wherever I could. They were dominating flamegraphs.
This is why I expected gRPC to perform better. But I agree it depends heavily on what's being done.
"Because synchronization is almost entirely avoided, this strategy can be very fast. However, it’s not a silver bullet. Unless the workload is entirely uniform, some processors will become idle while other processors are under load, resulting in resource underutilization. This happens because tasks are pinned to a specific processor. When a bunch of tasks on a single processor are scheduled in a batch, that single processor is responsible for working through the spike even if other procesors are idle."
The trade-off here is high performance over uniform utilization, which is a better goal than loading cores evenly. And all you have to do to make use of it is put a bit of thought into your load distribution; you also have plenty of thought budget here, since you don't have to think about synchronization. Stealing work from other queues is not going to be faster: the gains get wasted on synchronization and non-locality overhead, and it would require you to do synchronization everywhere. Not a good direction to pursue.
With an infinite queue, the same number of tasks per second happen either way, but the delay until the last task you care about finishes can be pretty substantial.
I haven't had a closer look at Rust in general, async, or Tokio, but I suspect you can use different schedulers in the same codebase. Can someone comment on this?
If that were the case, one might use a different runtime for more uniform workloads to maximize throughput, while using Tokio for general-purpose async work.
Cooperative multitasking works great as long as everyone is cooperating. Guess what? Cooperation is a voluntary act. If people are ignorant of their impact or min-maxing between five different 'top priority issues', they aren't going to stay on top of it at all times and in all things.
The cost of freedom is eternal vigilance, and the quarterly roadmap doesn't list it. Which is why we use preemption most of the time. It reduces the blast radius at the cost of some efficiency.
All this is just the preemptive versus cooperative argument wearing new hats.
These queues each execute tasks sequentially. The library provides for multiple queues to run in parallel. All of these scheduling problems and patterns begin to look fundamentally similar if you stop getting hung up on the implementation details. Separate queues have worse 'behavior' than shared queues, which have worse behavior than sequential operation. Except for the extremely real and very huge problem of utilization. Avoiding utilization problems by forcing that mental work onto the callers is exactly the argument of cooperative multitasking; you could make it work, if you were as smart as we are.
Transferring a block of tasks in one go reduces the use of concurrency primitives. But now your worst-case behavior is worse. Solution? Provide a way to avoid worst-case behavior. Even if it's more expensive, it only runs infrequently, so the cost is still amortized.
Usually these libraries strongly recommend that your jobs be working with independent data. So I don't buy the shared memory argument. There are in fact tasking libraries in other languages, like Node, that push the tasks off into a separate process, and yet still have the same class of concerns.
You are arguing that if they put more thought into the usage that the library wouldn't need to do the work.
That's not how software works. Blaming others for not thinking hard enough is what fragile smart people do to prop up their egos, instead of getting down to the hard work of providing real solutions. Essentially, all of these problems were identified in the '70s, and yet there's still plenty of R&D dollars in this space because this shit is hard. Being flip about it is not helpful.
Tokio is a lot about IO. The resources the scheduler is waiting for will almost exclusively be network and file operations.
There is a risk that a task does some long running CPU work on the back of a future or uses sync network/file operations. Maybe a preemptive scheduler could help then, but I'm not convinced by your argument that tokio needs to optimize for this.
(Thank you for your hard work on Tokio!)
The new scheduler will still require using a special API to run CPU intensive futures: `tokio_executor::blocking::run(|| cpu_intensive())`
Why are there many processors? If it's all one thread, why not just use one processor?
Sorry, the article must not have been clear. If you can point out where in the article you got the impression there were multiple processors per thread, I can try to update the article to fix it.
> There are multiple processors, each running on a separate thread.
It's probably easy to miss; do you have thoughts on where else this can be clarified?
I probably should make that clear in the article.
The writeup doesn't mention design choices in Grand Central Dispatch. Is that because we have moved on from GCD, or because it's not known?
Hint: Tokio-based web servers are the fastest.
It is a fact that I did not notice any reference in the article to GCD or the techniques that have made it a success.
In the meantime, it would not hurt even Rust fans to learn more about how Grand Central Dispatch works.