
Making the Tokio scheduler 10x faster - steveklabnik
https://tokio.rs/blog/2019-10-scheduler/
======
carllerche
Author here. I'll be watching the comments and can answer any questions!

I do want to make clear that most of the optimizations discussed are
implemented in the Go scheduler, which is where I discovered them. I wrote the
article to call them out as they were not easy to discover.

~~~
networkimprov
The Go team is working on an update of its scheduler to make it preemptive.
Will Tokio follow suit?

[https://github.com/golang/go/issues/24543](https://github.com/golang/go/issues/24543)

~~~
carllerche
My understanding of that PR is that it relates to how the Go compiler does code
generation. Rust takes a different approach and is about to ship `async /
await`, which is a different strategy.

Preemption is out of scope for Tokio as we are focusing on a runtime to power
Rust async functions. So, for the foreseeable future, Tokio will be using
"cooperative preemption" via `await` points.

------
earhart
I thought this was super interesting; it's great to see a solid writeup.

The consumer-side mutex works pretty well, until you take an interrupt while
holding it; then, you need to make sure you're using an unfair mutex, or
you'll get a lock convoy.

The non-blocking version isn't hard to write, though. IIRC, for the Windows
threadpool, I used
[https://www.cs.rochester.edu/~scott/papers/1996_PODC_queues....](https://www.cs.rochester.edu/~scott/papers/1996_PODC_queues.pdf)
for the implementation.

I'm really impressed by the concurrency analysis tool; again, for the
threadpool, we used model-based testing to write tests for every permutation
of API calls, but we never got down to the level of testing thread execution
permutations. That's amazingly good to see, and it really builds confidence in
their implementation.

~~~
carllerche
Thanks! I agree with your assessment of the Mutex. Improving that is included
in the follow-up work.

I could have kept working on this for many more months... but my collaborators
were starting to get grumpy and telling me to ship already :)

------
lilyball
Tiny comment:

In "A better run queue" the second code block has

    
    
      self.tail.store(tail + 1, Release);
    

Should this not instead look like the following?

    
    
      self.tail.store(tail.wrapping_add(1), Release);
    

---

Another comment:

> _The first load can most likely be done safely with Relaxed ordering, but
> there were no measurable gains in switching._

This code is using an Acquire/Release pair, except it's Acquire/Release on
different memory slots, which means there's no actual happens-before edge.
This makes the Acquire/Release pair somewhat confusing (which tripped me up
too in the first version of this comment).

Furthermore, I don't see anything in here that relies on the head load here
being an Acquire. This Acquire would synchronize with the Release store on
head in the pop operation, except what's being written in the pop is the fact
that we don't care about the memory anymore, meaning we don't need to be able
to read anything written by the pop (except for the head pointer itself).

This suggests to me that the head load/store should be Relaxed (especially
because in the load literally all we care about for head is the current
pointer value, for the length comparison, as already explained by the
article). The tail of course should remain Release on the push, and I assume
there's an Acquire there on the work-stealing pop (since you only list the
local-thread pop code, which is doing an unsynced load here as an
optimization). I don't promise that this is correct but this is my impression
after a few minutes of consideration.
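
Concretely, the push path I have in mind would look roughly like this (my own
sketch from reading the article, not the actual Tokio code; capacity assumed
to be a power of two):

    use std::cell::UnsafeCell;
    use std::sync::atomic::{AtomicU32, Ordering::{Relaxed, Release}};

    // Sketch of a single-producer push onto the local run queue.
    struct Queue<T> {
        head: AtomicU32,                        // advanced by stealers / local pop
        tail: AtomicU32,                        // only written by the owning thread
        buffer: Box<[UnsafeCell<Option<T>>]>,   // capacity is a power of two
    }

    impl<T> Queue<T> {
        fn push(&self, task: T) -> Result<(), T> {
            // The owning thread is the only writer of tail, so Relaxed is fine.
            let tail = self.tail.load(Relaxed);
            // Relaxed should be enough here too: we only need head's value for
            // the length check, not any memory written by the stealing side.
            let head = self.head.load(Relaxed);

            if tail.wrapping_sub(head) as usize >= self.buffer.len() {
                return Err(task); // full; caller falls back to another queue
            }

            let idx = tail as usize & (self.buffer.len() - 1);
            unsafe { *self.buffer[idx].get() = Some(task); }

            // Release so the write of `task` above is visible to a stealer
            // that Acquire-loads `tail`.
            self.tail.store(tail.wrapping_add(1), Release);
            Ok(())
        }
    }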

~~~
carllerche
Ah thanks! You are correct! Perhaps you should also review the PR? (a tiny bit
longer) :)

~~~
lilyball
I'm not done with the article yet ;) (BTW I just updated my comment, not sure
if you saw the first "another comment" addendum but I got it wrong and rewrote
it)

If I have the time tonight I'll try to look over the PR. No promises though.

~~~
carllerche
I saw your edit. Your summary of why `Relaxed` is sufficient matches my
reasoning. The code should probably be switched to `Relaxed`.

------
bhauer
Awesome work!

FWIW: Hyper is among several frameworks that are bandwidth-limited by the
10-gigabit Ethernet in our (TechEmpower) Plaintext test. We hope to increase
the available network bandwidth in the future to allow these higher-performance
platforms an opportunity to really show off.

~~~
GonzaloQuero
This might be slightly off topic, but what are the others in this same space?

~~~
bhauer
In the Round 18 Plaintext results that the OP linked to [1], you can see a
pretty clear convergence at ~7 million requests per second, which I've screen
captured in [2].

[1]
[https://www.techempower.com/benchmarks/#section=data-r18&hw=...](https://www.techempower.com/benchmarks/#section=data-r18&hw=ph&test=plaintext)

[2]
[https://www.techempower.com/images/misc/converge.png](https://www.techempower.com/images/misc/converge.png)

~~~
jxcl
How can you tell how each test is bottlenecked? Is it something as simple as
saying that if the processor isn't at 100% utilization then the bottleneck is
somewhere else?

I read a long time ago about how the Linux kernel handled incoming packets in
a really inefficient way that led to the network stack being the bottleneck in
high-performance applications. Is that no longer true?

I'd love to learn more about this whole area if anyone has any good links.

~~~
pnako
Well, to start with you can actually compute the optimal number of packets per
second, if you know the bit rate and the packet size.

[https://packet.company/ethernet-statistics](https://packet.company/ethernet-statistics)

So the theoretical optimum is 14.88 Mpps (millions of packets per second).
That's with the smallest possible payload. You need to divide that by two
because the test involves both a request and a response, which gives you 7.44
million requests per second as the theoretical maximum.

The best tests max out around 7 M rps. This gives an implied overhead of about
6%, which corresponds to the HTTP overhead.
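
Spelling the arithmetic out (standard 10GbE / minimum-frame numbers, nothing
benchmark-specific):

    fn main() {
        // Smallest unit on the wire: preamble + SFD (8 bytes) + minimum
        // Ethernet frame (64 bytes) + interframe gap (12 bytes) = 84 bytes.
        let bits_per_frame: u64 = 84 * 8;             // 672 bits
        let line_rate: u64 = 10_000_000_000;          // 10 Gbit/s

        let max_pps = line_rate / bits_per_frame;     // 14_880_952 -> "14.88 Mpps"
        let max_rps = max_pps / 2;                    // request + response per round trip
        println!("{} pps, {} rps", max_pps, max_rps); // 14880952 pps, 7440476 rps
    }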

~~~
Hikikomori
Why would you divide by two? Tx/Rx are separate.

~~~
pnako
You're right. But I assume the benchmark basically implements a ping-pong game
instead of DoSing the server to see how many requests actually survive (I
could be wrong).

------
cormacrelf
You should sign your name under the blog post. So many references to "I" and I
didn't know who to be impressed by until I clicked through to the PR.

------
SloopJon
From a QA perspective, I feel like you buried the lede on this Loom tool.
Benchmarks are nice, but confidence is an elusive thing when it comes to non-
trivial concurrency.

I'll be spending some time on the CDSChecker and dynamic partial-order
reduction papers.

~~~
carllerche
True... correctness is very important! I made mention of loom for correctness
in the intro at least.

The goal for this article was primarily to focus on scheduler design.

I'm hoping to get posts dedicated to loom soon. It's been an indispensable
tool.

~~~
snowAbstraction
Yes loom seems very important. I would interested to know how it compares with
similar tools like nidhugg[1][2] for the C and C++.

[1] [https://github.com/nidhugg/nidhugg](https://github.com/nidhugg/nidhugg)
[2] [https://uu.diva-portal.org/smash/get/diva2:1324003/FULLTEXT0...](https://uu.diva-portal.org/smash/get/diva2:1324003/FULLTEXT01.pdf)

~~~
carllerche
I skimmed the paper and it looks similar to loom. Specifically, they cite CHESS
and CDSChecker as prior art. Both of those are what loom is based on.

I guess 2019 is the year of concurrency checking :-)

------
FpUser
Warning: since I've never written a line of Rust code, never mind worked with
Tokio, I might just be full of it. If that's the case, please forgive me ;)

From a short glance, my understanding of Tokio is that, at a high level, it is
an execution environment that basically implements data-exchange protocols
based on asynchronous I/O.

Then they say the following: _Tokio is built around futures. Futures aren’t a
new idea, but the way Tokio uses them is unique. Unlike futures from other
languages, Tokio’s futures compile down to a state machine_

So the question here is: why not expose a generic state machine framework
(FSM/HSM) that can deal with those protocols but is also very useful for many
other tasks?

~~~
carllerche
> So the question here is: why not expose a generic state machine framework
> (FSM/HSM) that can deal with those protocols but is also very useful for
> many other tasks?

Short answer is: you can.

Tokio is more like "node.js for Rust" or "the Go runtime for Rust".

The bit about Tokio & futures is how you use Tokio today. Tasks are defined as
futures, and Rust futures are FSMs. So, a protocol can be implemented as a FSM
completely independently of Tokio / futures and then used from Tokio.
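
As a rough illustration (hand-rolled, with hypothetical protocol states and the
actual I/O elided), a future is just a type whose `poll` drives an explicit
state machine; `async fn` generates something morally equivalent to this for
you:

    use std::future::Future;
    use std::pin::Pin;
    use std::task::{Context, Poll};

    // Hypothetical protocol expressed as an explicit FSM.
    enum Handshake {
        SendHello,
        AwaitAck,
        Done,
    }

    impl Future for Handshake {
        type Output = ();

        fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
            // Handshake is Unpin, so we can work with a plain &mut.
            let state = self.get_mut();
            loop {
                match *state {
                    Handshake::SendHello => {
                        // ... write "hello" to the transport (omitted) ...
                        *state = Handshake::AwaitAck;
                    }
                    Handshake::AwaitAck => {
                        // ... if the ack hasn't arrived, register `_cx.waker()`
                        // and return Poll::Pending (omitted) ...
                        *state = Handshake::Done;
                    }
                    Handshake::Done => return Poll::Ready(()),
                }
            }
        }
    }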

In Tokio's next major release (coming soon), it will work with Rust's async fn
feature (which is also about to hit stable).

Hope that clarifies things.

~~~
FpUser
_So, a protocol can be implemented as a FSM completely independently of Tokio
/ futures and then used from Tokio_

But that was the point of my question. Since you've already built that FSM
framework inside, why not expose it? Or I guess it might be too specialized
for your particular case.

~~~
carllerche
Ah, all the helpers around the "FSM" aspect of things are already exposed as
part of the `futures` crate: [https://github.com/rust-lang-nursery/futures-rs/](https://github.com/rust-lang-nursery/futures-rs/)

~~~
MuffinFlavored
Which is different than the standard adopted `std::futures`?

~~~
shepmaster
Not quite. The futures crate, version 0.1, contained a definition of the
`Future` trait and a bunch of the combinator methods. This was iterated on a
lot. Eventually, the bare minimum traits were added to the standard library.

The futures crate, version 0.3 (released under the name `futures-preview`
right now with an alpha version number), contains the helper methods for the
standard library's trait.

~~~
autumnal
Just the bare minimum traits required for built-in async/.await, right?

------
qznc
Even though I don't use Rust myself, I still admire the Rust community for the
awesome technical blog posts they produce all the time.

~~~
carllerche
Thanks! I appreciate it.

------
Sawamara
Thank you for posting articles like these. I feel a bit relieved seeing that
it's not demigods with decades of in-domain experience who create these
sometimes huge, monumental modules or building blocks of whole languages, but
actual human beings who might have been short on options or time, or took on
some technical debt in the process of creating something useful.

------
gpderetta
First of all, great article! Lots of interesting details and a great
explanation.

A few questions:

Did you consider work sharing or work requesting instead of work stealing (the
global queue is a form of work sharing, of course)? With work requesting you
can avoid the required store/load on the pop path. Of course, work requesting
is not appropriate if the scheduler needs to protect against the occasional
long-running job that doesn't yield to the scheduler in time to service a work
request with appropriate frequency, but assuming this is rare enough, an
expensive last-resort work-stealing path could be done via asymmetric
synchronization (i.e. sys_membarrier or similar), keeping the pop fast path,
well, fast.

Is there any way to constrain (groups of) tasks to a specific scheduler? This
is particularly useful if they all need to access specific resources, as there
is then no need for mutual exclusion; this works well in applications where
executors are heterogeneous and have distinct responsibilities.

Because of the global queue and the work-stealing behaviour, absolutely no
ordering is guaranteed for any task, is that correct?

I find it surprising that you need to reference count a task explicitly. Having
implemented a similar scheduler (albeit in C++), I haven't found the need for
refcounting: normally the task is owned by either a scheduler queue or a
single wait queue (but obviously never both). In case the task is waiting for
any of multiple events, on first wake-up it will unregister itself from all
remaining events (tricky but doable); only in the case where it is waiting for
N events do you need an (atomic) count of the events delivered so far (and
only if you want to avoid waking up just to immediately go back to sleep
waiting for the next event). Also, this increment comes for free if you
already need expensive synchronization in the wake-up path.

That's all so far, I'm sure I'll come up with more questions later. Again
great job and congratulations on the performance improvements.

~~~
carllerche
> Did you consider work sharing or work requesting instead of work stealing

Not really. It did cross my mind for a moment, but my gut (which is often
wrong) says the latency needed to request would be much higher than what is
needed to steal. I probably should test it at some point, though :)

> assuming this is rare enough, an expensive last-resort work-stealing path

It's not _that_ rare. Stealing is key when a large batch of tasks arrives
(usually after a call to `epoll_wait`). Again, I have no numbers to back any
of this :)

~~~
gpderetta
The idea is to steal only if a request fails to return work in a reasonable
time frame, i.e. if a job is keeping the scheduler busy for more than a minimum
amount of time.

I also have no numbers to back it up (we deal with a relatively small number of
fds at $JOB and we care about latency more than throughput), but I don't buy
epoll_wait generating large batches of work: it seems to me that you should
only be using epoll for fds that are unlikely to be ready for I/O (i.e. those
for which a speculative read or write has returned EWOULDBLOCK), which means
you should not have large batches of ready fds. Even if you do, the only case
where you would need to steal after that is if another processor ran out of
work after this processor returned from epoll_wait (if it did so before, it
should have blocked on the same epoll FD and gotten a chunk of the work
directly, or could have signaled a request for work), which might be less
likely.

Anyway, at the end of the day a single mostly uncontested CAS is relatively
cheap on x86, especially if the average job length is large enough, so maybe it
is just not worth optimizing further, especially if it requires more
complexity.

------
karmakaze
Howdy, I'm not familiar with Tokio (or Rust, much). Is Tokio sort of like Node
for Rust (packaged as a library rather than an executable)?

Or maybe more like Scala's Akka Futures?

~~~
steveklabnik
Rust, as a systems language, doesn’t include a large runtime. It’s about the
same size as C or C++‘s. But if you want to do asynchronous IO, you’ll need a
runtime to manage all of your tasks. Rust’s approach is to include the
interface to make a task (futures and async/await) into the language and
standard library, but lets you import whatever runtime you’d like. This means
that you can pull in one good for servers, one good for embedded, etc, and not
be partitioned off from a chunk of the ecosystem. And folks who don’t need
this functionality don’t have to pay for it.

Tokio is the oldest and most established of these runtimes, and is often the
basis for web servers in Rust.

So yeah, “the node event loop as a library” is a decent way to think about it.

~~~
MuffinFlavored
> Rust’s approach is to include the interface to make a task (futures and
> async/await) into the language and standard library, but lets you import
> whatever runtime you’d like.

Why is this better than having one official, standardized, optionally
importable runtime?

~~~
steveklabnik
Each point has something different:

1. Official: it's not clear how this would be better; that is, the team is not
likely to have more expertise in this space than others do.

2. Standardized: it is standardized; as I said, the interface is in the
standard library.

3. Optionally importable: this is fine, of course, but so are the existing
runtimes, so this attribute isn't really better or worse.

But I don't think that is the biggest strength of this approach; the big
strength is that not everyone wants the same thing out of a runtime. The
embedded runtimes work very differently than Tokio, for example. As the post
talks about, Tokio assumes you're running on a many-core machine; that
assumption will not be true on a small embedded device. Others would develop
other runtimes for other environments or for other specific needs, and we'd be
back to the same state as today.

------
cogman10
Oh man, I really wish ya'll had chosen a name other than "Loom" for your
permutation tool. Loom is also the name of a project to bring fibers to the
JVM.

------
williamallthing
Really cool. Tokio is the core of Linkerd
([https://linkerd.io](https://linkerd.io)) and I am really excited to see
exactly what kind of impact this will have on Linkerd performance. A super
fast, super light userspace proxy is key to service mesh performance.

------
jmakov
Also it would be interesting to have a benchmark to compare different
approaches to multicore e.g.
[https://parsec.cs.princeton.edu/overview.htm](https://parsec.cs.princeton.edu/overview.htm)

------
jumpingmice
The result is pretty impressive. Go's scheduler is not this good. When my Go
GRPC servers get up near 100,000 RPS their profiles are totally dominated by
`runqsteal` and `findrunnable`.

~~~
hu3
Weird. In one of my Go explorations I wrote a naive trade-algo backtester
which ingested 100,000 datapoints per second from SQLite on commodity
hardware.

I'd expect something highly specialized such as a GRPC server to perform
better.

~~~
jumpingmice
Reading 100k rows per second from sqlite sounds WAY easier than serving 100k
HTTP/2 queries in the same time.

~~~
hu3
True. But in my case the code also distributes these datapoints to multiple
threads, computes technical indicators, and simulates trades while gathering
statistics to fine-tune parameters.

The only optimization I've done was to ditch maps wherever I could. They were
dominating the flamegraphs.

This is why I expected GRPC to perform better. But I agree it depends heavily
on what's being done.

------
throwaway13000
Does anyone know how to print the blog post? Saving it as a PDF only captures
the first page, not the entire content.

~~~
liamdiprose
If you enable readability mode in Firefox first, the entire article shows up
in the PDF.

------
zzzcpan
_" Many processors, each with their own run queue"_

 _" Because synchronization is almost entirely avoided, this strategy can be
very fast. However, it’s not a silver bullet. Unless the workload is entirely
uniform, some processors will become idle while other processors are under
load, resulting in resource underutilization. This happens because tasks are
pinned to a specific processor. When a bunch of tasks on a single processor
are scheduled in a batch, that single processor is responsible for working
through the spike even if other procesors are idle."_

The trade-off here is high performance over uniform utilization, which is a
better goal than loading cores evenly. And all you have to do to make use of
it is put a bit of thought into your load distribution, and you have plenty of
thought budget for that since you don't have to think about synchronization.
Stealing work from other queues is not going to be faster: the gains get eaten
by synchronization and non-locality overhead, and it requires you to do
synchronization everywhere. Not a good direction to pursue.

~~~
lpghatguy
Other parts of the Rust ecosystem have had great luck using work-stealing
queues. There's a talk about how Rayon did this[1] and a good article about
using it in Stylo[2], a new CSS engine written by Mozilla.

[1]
[https://www.youtube.com/watch?v=gof_OEv71Aw](https://www.youtube.com/watch?v=gof_OEv71Aw)

[2] [https://hacks.mozilla.org/2017/08/inside-a-super-fast-css-
en...](https://hacks.mozilla.org/2017/08/inside-a-super-fast-css-engine-
quantum-css-aka-stylo/)

~~~
hinkley
The thesis of Practical Parallel Rendering was that any useful task
distribution strategy requires a work stealing mechanism because you are
leaving serious amounts of responsiveness on the table if you don't.

With an infinite queue, the same number of tasks per second happen either way,
but the delay until the last task you care about finishes can be pretty
substantial.

------
r873436
Does this new scheduler require a dedicated thread pool for CPU intensive
futures, as the current scheduler does?

(Thank you for your hard work on Tokio!)

~~~
carllerche
Thanks, I appreciate it!

The new scheduler will still require using a special API to run CPU-intensive
futures: `tokio_executor::blocking::run(|| cpu_intensive())`
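
Rough usage would look something like the following; this is a sketch that
assumes `run` returns a future resolving to the closure's return value, so
check the released API for the exact path and signature:

    // Sketch only; `expensive_hash` and `Digest` are placeholders.
    async fn handle(input: Vec<u8>) -> Digest {
        // Hand the CPU-bound work to the blocking pool so this scheduler
        // thread can keep driving other tasks.
        tokio_executor::blocking::run(move || expensive_hash(&input)).await
    }

    // Stand-ins so the sketch is self-contained.
    struct Digest;
    fn expensive_hash(_bytes: &[u8]) -> Digest { Digest }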

------
erikpukinskis
Thank you for the detailed write up!

Why are there many processors? If it's all one thread, why not just use one
processor?

~~~
carllerche
There is one processor per thread.

Sorry, the article must not have been clear. If you can point out where in the
article you got the impression there were multiple processors per thread, I
can try to update the article to fix it.

~~~
erikpukinskis
I just got a bit into the “One queue, many processors” section and that’s
where I got confused. Knowing it’s one processor per CPU is the information I
needed. Thanks!

~~~
carllerche
From that section:

> There are multiple processors, each running on a separate thread.

It's probably easy to miss; do you have thoughts on where else this could be
clarified?

------
ncmncm
Not to be snarky, but the easiest way to get a 10x speedup is to make the
previous one 10x too slow. 10x feels pretty good until you find out it was
really 40x too slow. I say this only because getting 100x improvements on
naive designs has led me to be suspicious of 10x improvements. The principle
is that where 10x was available, more is there.

The writeup doesn't mention design choices in Grand Central Dispatch. Is that
because we have moved on from GCD, or because it's not known?

~~~
xiphias2
You should look at the HTTP server benchmarks before writing this.

Hint: Tokio-based web servers are the fastest.

~~~
ncmncm
Getting downvoted for asking a serious question tells me all I needed to know.

~~~
xiphias2
I wasn't downvoting you, but I'm sure it was the first part of your comment
with its unchecked facts, not your question, that got you downvoted.

~~~
ncmncm
There were no unchecked facts in the original post. It is a fact that I have
got 100x speedups, and it is my personal experience that 10x speedups rarely
come close to exhausting the opportunities.

It is a fact that I did not notice any reference in the article to GCD or the
techniques that have made it a success.

~~~
ncmncm
Clearly the facts are not the problem. My guess would be that Rust fans are
unusually thin-skinned. Let us see...

In the meantime, it would not hurt even Rust fans to learn more about how
Grand Central Dispatch works.

~~~
housel
You might suggest specific GCD mechanisms that might be usefully adopted by
tokio.

