Thread-per-core

duped · on Oct 6, 2023

Personally I feel like this post misses the forest for the trees.

The debate isn't about thread-per-core work stealing executors, it's whether async/await is a good abstraction for it in Rust. And the more async code I write the more I feel that it's leaky and hard to program against.

The alternative concurrency model people want is structured concurrency via stackful coroutines and channels on top of a work stealing executor.

Until someone does the work to demo that and compare it to async/await with futures I don't think there's any productive discussion to be had. People who don't like async are going to avoid it and people who don't care about making sure everything and its mother is Send + Sync + 'static are going to keep on doing it.

kllrnohj · on Oct 6, 2023

> The alternative concurrency model people want is structured concurrency via stackful coroutines and channels on top of a work stealing executor.

I mean why not just use a thread per connection and not bother with anything fancier at all unless you really truly need to hit those C10M scales? Which I suspect is a very rare need for most things?

So many of these articles just go "kernel threads are expensive" and blow on past it as if that's just inherently true & nothing else needs to be said on it. But they really aren't, and unless your work is doing nothing but spawning no-op tasks then the overhead of a "real thread" is likely minimal and in exchange the simplicity you get is tremendous.

VirusNewbie · on Oct 6, 2023

>I mean why not just use a thread per connection and not bother with anything fancier at all unless you really truly need to hit those C10M scales? Which I suspect is a very rare need for most things?

At every larger company I've worked for, that breaks down right away. It's really not hard to see how processing some larger proto/json/soap/xml msg can slow the entire system down.

kllrnohj · on Oct 6, 2023

The larger the workload per connection the better thread-per-connection performs relative to the alternatives. So it'd do the exact opposite of break down under the workload you've outlined.

fulafel · on Oct 6, 2023

Can you spell out how processing a large message would result in a bigger system slowdown with thread-per-connection than with async?

mcguire · on Oct 6, 2023

Have a number of threads greater than the number of cores results in either synchronization costs or longer latency.

IshKebab · on Oct 6, 2023

No it doesn't. It results in more context switching costs maybe, but I seriously doubt that would make a difference in all but the most extreme cases.

mcguire · on Oct 9, 2023

Context switching, cache coherency, and the way that, in practice, you end up with #threads >> #cores.

bcrl · on Oct 7, 2023

Latency is not the only cost of an excessive number of threads: context switches are expensive! Every context switch is a waste of CPU cycles that could be better spent doing actual work for your application. Furthermore, the cost of context switches keeps going up with every new generation of CPU, and I don't see that trend reversing any time soon.

laurencerowe · on Oct 6, 2023

> At every larger company I've worked for, that breaks down right away. It's really not hard to see how processing some larger proto/json/soap/xml msg can slow the entire system down.

What's the problem with thread-per-connection in this case? It usually works well when application code is moderately heavy since the overhead from threads is very small in comparison.

By contrast I've often seen issues in event driven systems without work stealing where heavy requests slow everything else down.

I think work stealing is a sensible default for an event driven system in a language like Rust. But I do sometimes find myself wishing for the option to write non-async threaded http servers with zero-copy send.

Spivak · on Oct 6, 2023

> without work stealing where heavy requests slow everything else down

Right because threads vs async doesn't make any sense. You use both, async is your default, free yourself up during io, and is basically the developer friendly abstraction on top of epoll, and then use threads for stuff you actually need threads for.

laurencerowe · on Oct 6, 2023

Mixing the two is absolutely the right approach in general. But I still find myself wishing it were easier to build a very simple threaded web server in Rust so I could play around with stuff like MSG_ZEROCOPY [1]. Last time I looked there were no non-async http libraries in cargo other than a few toy servers.

[1] https://lwn.net/Articles/726917/

fiddlerwoaroof · on Oct 6, 2023

I don’t understand this: context switching takes microseconds, I/O latencies are typically in the millisecond range. I’d think thread overhead would be negligible in an I/O-bound application (especially if you take steps to reduce the amount of memory per thread)

jandrewrogers · on Oct 6, 2023

The important distinction is between operation latency and operation rate. Modern I/O devices are highly concurrent and support massive throughput. A device can have millisecond latency while still executing an operation every microsecond. In these cases, the operation latency doesn't matter, your thread has to handle events at the rate the device executes operations. If it is a million operations per second then from the perspective of the thread you have a microsecond to handle each operation. Context switch throughput is much lower by comparison.

In these types of systems, you may issue a thousand concurrent I/O operations before the first I/O operation returns a result. Threads don't wait for the first operation to finish, they keep a deep pipeline of I/O operations in flight concurrently so that they always have something to do.

fiddlerwoaroof · on Oct 6, 2023

> Modern I/O devices are highly concurrent and support massive throughput. A device can have millisecond latency while still executing an operation every microsecond. In these cases, the operation latency doesn't matter, your thread has to handle events at the rate the device executes operations.

This is true for some applications, like an OLAP database or similar. It’s not true for the typical user-facing app where you want to finish requests as soon as possible because a user is waiting and every millisecond costs you money.

galdosdi · on Oct 7, 2023

Well, at every mid sized company I've been at, it's worked great. You can service a fuckton or clients off of one machine even with thread per connection.

And frankly, if well architected so much of the logic and business is well separated from the logistics, it's not the hardest thing (and by then there plenty of money to fund it) to rearchitect a thread per client model into something more effecient.

It's hard to go wrong when you create a system initially and stick with the easy lower scale ways, like a single big MySQL (or oracle or postgres or even SQLite or whatever) database, thread per connection, etc. YAGNI, and if you do need it, get it when you do need it

scottlamb · on Oct 6, 2023

> The debate isn't about thread-per-core work stealing executors, it's whether async/await is a good abstraction for it in Rust.

I think async/await vs stackful coroutines is a more interesting debate [1], but it's definitely not the debate. The quote withoutboats discussed is from a linked article complaining specifically about multi-threaded by default and work-stealing.

[1] https://www.reddit.com/r/rust/comments/16p47f1/the_state_of_...

withoutboats3 · on Oct 6, 2023

In fact there are multiple things in the world that people disagree about. This post addresses a different debate than the one you wanted me to write about, that's all.

duped · on Oct 6, 2023

The post you quoted and whose author you insulted at the end was talking about this issue.

withoutboats3 · on Oct 6, 2023

No it wasn't. Anyone can follow the link and see that it was about preferring non-work-stealing executors, none of the things you complained about.

efficax · on Oct 6, 2023

You can use async/await with channels and restrict yourself to only passing immutable references or copy types through arguments to async functions, communicating w/ channels for mutable shared types. Like you could build Erlang style "servers" that own your mutable types and communicate with them over channels. Or you can Arc<Mutex<T>> your way through things. Rust gives you the power to do both.

zozbot234 · on Oct 6, 2023

AIUI, nothing's stopping you from using channels over Rust's existing async support. Stackful coroutines are kinda pointless, since at that point you might as well be using separate threads.

packetlost · on Oct 6, 2023

Greenthreads have their benefits though, especially for massive concurrency.

littlestymaar · on Oct 6, 2023

Stackless coroutines are a flavor of green threads though. And without a GC, your stackful coroutines would be using segmented stacks, which is almost indistinguishable to stackless coroutines anyway.

vacuity · on Oct 6, 2023

The important distinction of stackless coroutines-as-async vs. green threads-as-async is the function coloring problem. I think stackful coroutines are largely obviated given more manageable function coloring, structured concurrency, and whatnot that Rust is currently exploring and working on. Actual OS thread-like preemption probably requires a heavy runtime (I imagine Java can do this for their virtual threads?) or OS support, though. (See scheduler activations for the latter. They might actually catch on this time, who knows?)

littlestymaar · on Oct 6, 2023

I don't think “function coloring” (which is a concept that I find overhyped and uninformative anyway, especially when talking about Rust[1]) has anything to do with the difference between stackful and stackless coroutines: it's about whether you let the language insert yield points automatically or force the programmer to spell them out explicitly. We could make the exact same thing for blocking code by the way: just create two keywords `blocking`/`block` and mark all stdlib function that do trigger a blocking syscall as `blocking` and force the programmer to use the `block` keyword and transitively annotate the calling function as `blocking` and tada! You have static enforcement of blocking code (aka function coloring) in your language without even having any kind of coroutines.

It also doesn't have much to do with structured concurrency either, and we could have gotten structured concurrency for free in Rust had tokio decided to cancel tasks handles on drop like futures are. And they even weren't that far of doing so[2]

[1]: https://news.ycombinator.com/item?id=36497399

[2]: https://github.com/tokio-rs/tokio/issues/1830

vacuity · on Oct 6, 2023

I get where you're coming from, but it's a real problem that we have const-async-fallible-panicking-allocating combinatorial explosion. An effects system could abstract all of this away except for the base libraries and the application developers. A big goal of Rust is to be accessible, and all these sorts of "function colors" are a hindrance to that goal.

> You have static enforcement of blocking code (aka function coloring) in your language without even having any kind of coroutines.

And now it's annoying to write sync and async Rust.

littlestymaar · on Oct 7, 2023

> it's a real problem that we have const-async-fallible-panicking-allocating combinatorial explosion.

True, but there's no free lunch: you either have the ability to express these things in the type system, or you don't. I'm personally very happy rust has `Result` even though it's legitimately much more tedious than having invisible exceptions.

> A big goal of Rust is to be accessible, and all these sorts of "function colors" are a hindrance to that goal.

It's always a trade-off between usability and expressive power. And not every Rust user have the same priorities. As an application developer, I care less about having allocations expressed in a tractable way than a kernel developer for instance.

> An effects system could abstract all of this away

Maybe, but it's not clear yet if that's something that could be added at this point. Also it would add yet another layer of complexity which is also a factor that needs to be taken into accounts.

> And now it's annoying to write sync and async Rust.

Honestly, this is just a matter of tooling: the annotation could/should be transitively added to every caller function (with a boundary at threads spawn) by something like `cargo fix`.

assbuttbuttass · on Oct 6, 2023

function coloring is exactly the stackful/stackless division. Stackless functions can't be called from a stackful function without special ceremony

littlestymaar · on Oct 7, 2023

This “ceremony” as you call it can totally be handled by the compiler, the same way the compiler deals with the unwinding ceremony without you ever knowing about it. The only reason why we ask the programmer to do it by themselves, is because explicitness is favored.

This is exactly the same reasoning as the difference between Rust's `Result` and C++ exceptions. They serve the same purpose, they spread into the code base the same way (if on of the function you transitively call can raise an exception, you can raise an exception) but in one case we ask the programmer to to the bookkeeping manually so that it's explicit when reading the code.

The only meaningful difference between stackful and stackless coroutines is the ability to write recursive function call. If your coroutines are stackless you need to box to avoid the sizedness issue, but in practice it's only marginally different from segmented stacks. If you have a moving GC and can just relocate the stack then you performance in case of recursive function call will be better, but it's also true about boxing if you have a generational collector with bump allocation. So this is mostly a difference of GC vs no-GC, in a situation where GC is actually improving performance characteristics.

DarkNova6 · on Oct 6, 2023

The JVM has shown that copying the stackframes can be magnitudes more efficient then context switching via separate threads. Likely, Rust's implementation would look different but this route seems too promising not to do basic research on.

I mean, is there anybody who seriously thinks that colored functions is the holy grail?

withoutboats3 · on Oct 6, 2023

Just last month .NET ended a green threading experiment, mainly because the overhead it adds to FFI was too high: https://github.com/dotnet/runtimelab/issues/2398

Rust had green threads until late 2014, and they were removed because of their impact on performance.

Everyone has done the basic research: green threading is a convenient abstraction that comes with certain performance trade offs. It doesn't work for the kind of profile that Rust is trying to target.

pron · on Oct 6, 2023

The performance tradeoffs are very different for different languages. In Java, virtual threads add no FFI costs and virtually no performance impact not because the baseline is lower but because memory management in Java is just so incredibly efficient these days (although that does come at the cost of a higher footprint). So the impact and the tradeoffs are not the same. Allocating heap memory in a very general way is simply faster -- even in absolute terms -- in Java than in a low-level language. Java, unlike .NET or Rust, also doesn't allow pointers into the stack, so we can do very efficient copying "in the shadow" of cache misses, and that takes care of FFI.

vacuity · on Oct 6, 2023

> also doesn't allow pointers into the stack, so we can do very efficient copying "in the shadow" of cache misses

How does this work?

pron · on Oct 7, 2023

First, the use-case where virtual threads offer the most benefit is servers with high concurrency (due to Little's law https://youtu.be/07V08SB1l8c?si=rwTQrnHBnp4NGrj7), which means we're talking about a very large number of threads. That, in turn, means that the state of all those threads cannot fit in the CPU cache, so even the most efficient implementation possible, i.e. one that simply changes the sp register to point to a new stack will incur an expensive cache miss. That means that copying small sections of the stack is almost free (because a cache miss+copy isn't all that more expensive than just a cache miss).

Copying portions of the stack means that you can continue executing code directly in OS threads (rather than change the stack pointer), so the overhead for FFI is zero. However, you cannot copy portions of stack if there are pointers into the stack. In .NET and in low-level languages there are pointers into the stack which makes copying inefficient. Furthermore, managing these stack portions efficiently in the heap requires extremely efficient dynamic memory management, which is something that languages with good GCs do better than low-level languages with more direct memory management.

raggi · on Oct 6, 2023

Yup, the ffi cost is extremely painful. I work on a Go product that is performance sensitive and this exact problem is a constant source of aggravation

geodel · on Oct 6, 2023

Rather than green threads effectiveness in general dotnet experiment proves that sometimes early design decisions are pervasive enough to simply not allow efficient/economical way of moving to new paradigm.

And yes I agree Rust has its own application domain and they are perfectly right to do things as they see fit.

kllrnohj · on Oct 6, 2023

> The JVM has shown that copying the stackframes can be magnitudes more efficient then context switching via separate threads.

...when there's literally no work in those threads whatsoever, that is. Unless you have something more substantial than the many "benchmarks" out there that are "look, spawning 10 bajillion virtual threads that don't do anything at all but sleep is now super efficient"?

jandrewrogers · on Oct 6, 2023

The original problem thread-per-core was invented to solve ~15 years ago was scalability and efficiency of compute on commodity many-core servers. Contrary to what some have suggested, thread-per-core was expressly about optimizing for CPU bound workloads. It turned out to be excellent for high-throughput I/O bound workloads later, albeit requiring more sophisticated I/O handling. When I read articles like this, it looks like speed-running the many software design mistakes that were made when thread-per-core architectures were introduced. To be fair, thread-per-core computer science is poorly documented, having originated primarily in HPC.

This article focuses on a vexing problem of thread-per-core architectures: balancing work across cores. There are four elementary models for this, push/pull of data/load. Work-stealing is essentially the "load pull" model. This only has low overhead if you almost never need to use it e.g. if the work is naturally balanced in a way that few real-world problems actually are. For workloads where dynamic load skew across cores is common, which is the more interesting problem, work-stealing becomes a performance bottleneck due to coordination overhead. Nonetheless, it is easy to understand so people still use work-stealing when the workload is amenable to it, it just doesn’t generalize well. There are a few rare types of workloads (not mentioned in the article) where it is probably the best choice. The model with the most gravity these days seems to be "data push", which is less intuitive but requires much less thread coordination. The "data push" model has its own caveats — there are workloads for which it is poor — but it generalizes well to most common workloads.

Thread-per-core architectures are here to stay -- they cannot be beat for scalability and efficiency. However, I have observed that most software engineers have limited intuition for what a modern and idiomatic thread-per-core design looks like, made worse by the fact that there are relatively few articles or papers that go deep on this subject.

withoutboats3 · on Oct 6, 2023

Thanks for this response, it's really interesting.

> For workloads where dynamic load skew across cores is common, which is the more interesting problem, work-stealing becomes a performance bottleneck due to coordination overhead. Nonetheless, it is easy to understand so people still use work-stealing when the workload is amenable to it, it just doesn’t generalize well

This sounds right to me. The reason Rust async frameworks use work-stealing is mainly that it's easy to enable at the framework level and will improve performance in a lot of applications, especially those that are not ideally architected. Based on your comment and your self-description on your profile, these are not the kinds of applications you work on.

I would be interested in receiving links to more literature.

gukoff · on Oct 7, 2023

What is the idea of "data push"?

asd4 · on Oct 6, 2023

"What they mean by IO bound is actually that their system doesn’t use enough work to saturate a single core when written in Rust: if that’s the case, of course write a single threaded system."

Many of the applications I write are like this, a daemon sitting in the background reacting to events. Making them single threaded means I can get rid of all the Arc and Mutex overhead (which is mostly syntactic at that point, but makes debugging and maintenance easier). Being able to do this is one of the things I love about Rust: only pay for what you need.

The article that this one is responding to calls out tokio and other async libraries for making it harder to get back to a simple single threaded architecture. Sure there is some hyperbole but I generally agree with the criticism.

Making everything more complex by default because its better for high throughput applications seems to be opposite of Rust's ideals.

amluto · on Oct 6, 2023

I’ve written services like this, and I would never have called them IO bound. They’re not throughput-bound at all. They mostly sit idle, then they do work and try to get it done quickly to minimize use of system resources. Unless they sometimes get huge bursts of work and something else cares quite a lot about latency during those bursts, using more than one thread adds complexity and overhead for no gain.

withoutboats3 · on Oct 6, 2023

A lot of people on the internet are confused about what "IO bound" means, and use it in this incorrect way.

__alexs · on Oct 6, 2023

In an era of 10Gb NICs in every server very few things are really IO bound.

tuetuopay · on Oct 6, 2023

The NIC does not really have a lot to do with being IO bound.

IO bound means you spend most of your time waiting on an IO operation to complete. Usually writes are bound by the hardware (how fast your NIC is, how fast your storage is, ...), but reads are bounds by the hardware, but mostly by the "thing" that sends the data. So it's great you have a 10Gbps NIC, but if your database takes 10ms to run your query, you'll still be sitting for 10ms on your arse to read 1KB of data.

withoutboats3 · on Oct 6, 2023

In this context, we're talking about things for which the throughput is IO-bound. You're talking about the latency of an individual request.

Throughput being IO-bound is indeed about the hardware, and the truth is that at the high end it's increasingly uncommon for things to be IO-bound, because our NICs and disks continue to improve while our CPU cycles have stagnated.

raggi · on Oct 6, 2023

In purely practical terms the old system interfaces are sufficiently problematic that for any workload with necessarily smaller buffers than tens of kb, most implementations will get stuck being syscall bound first. Spectre really didn’t help here either.

vacuity · on Oct 6, 2023

I think this is where we have to really move towards the io_uring/FlexSC approach.

elteto · on Oct 6, 2023

The speed of your NIC doesn't matter when you are waiting for an INSERT on a DB with a bad schema. Heck, your DB could be on localhost and you are not even hitting the NIC card. Still the same.

PaulDavisThe1st · on Oct 6, 2023

Although NVMe/SSD drives have changed things a lot, any media creation software is still IO bound in the sense that:

a. you cannot plan to read data from disk on demand, because it will take too long (still!), and it will almost certainly block

b. you cannot plan to write data to disk on demand, because it will take too long (still!) and it will almost certainly block

c. the bandwidth is still a limit on the amount of data that can be handled. It is much higher than it was with spinners, but there is still a limit.

hjl22 · on Oct 6, 2023

There are plenty of applications that do not run on servers. Lots of IO bound stuff in mobile or desktop apps - waiting for network responses, reading data files on startup, etc.

riku_iki · on Oct 6, 2023

> In an era of 10Gb NICs in every server very few things are really IO bound.

for my data crunching project, one core processes about 500MB/s = 4Gb/s, and I have 64 cores..

kosolam · on Oct 6, 2023

10gb nics and their respective connections are quite expensive. Not many servers have these at all.

reacharavindh · on Oct 6, 2023

As a person with a sysadmin + HPc background having built several clusters recently, this is not true(anymore). 10G NICs are almost as common as Gigabit NICs(both in availability and cost). To give you an idea, we commonly use 10G NICs on all compute nodes, and they connect to a 10G top of the rack switch which connects to services like file servers via 100G connections. The 10G connections are all 10GBase-T simple Ethernet connections. The 100G connections are DACs that are more expensive but not prohibitively so.

What cloud providers give you for VMs is not the norm in the datacenters anymore.

kosolam · on Oct 6, 2023

Everything is relative. If you are a cloud provider it’s one thing. I’m speaking from the perspective of the small medium business that rents these physical or virtual servers.

packetslave · on Oct 6, 2023

my $700 Mac Mini has a 10gb NIC. 2.5gb and 5gb NICs are very common on modern PC motherboards. Modern servers from Dell and HP are shipping with 25gb or even 100gb NICs.

andrewprock · on Oct 6, 2023

The cost of 10g is much higher than a single computer. The entire networking stack must be upgraded to 10g. At the very least the Internet device, and possibly the Internet connection as well. It will be cheaper in the cloud than on site.

packetslave · on Oct 6, 2023

Well, it depends on what your use case for "10g" is. If all you care about is fast file transfers between your PC and your NAS, you can get a small 5-8 port 10gb switch for under $300 that will easily handle line-rate traffic (at least for large packet sizes)

If you want 10g line-rate bandwidth between hundreds or thousands of servers? Yeah, I used to help build those fabrics at Google. It's not cheap or easy.

10g to the internet is more about aggregate bandwidth for a bunch of clients than throughput to any single client. Except for very specialized use cases you're going to have a hard time pushing anywhere close to 10g over the internet with a single client.

fulafel · on Oct 6, 2023

10Gb ethernet is 20+ year old tech and and used these days in applications that don't have high bandwidth demands. 100 Gb (and 40 Gb for mid range) NICs came around 2014. People were building affordable home 40 Gb setups in 2019 or so[1]. But I can believe you that the low-end makes up a lot of the volume in the server market.

[1] https://forums.servethehome.com/index.php?threads/cheap-40gb...

packetslave · on Oct 6, 2023

In my experience, 40gb and 100gb are still mostly used for interconnects (switch/switch links, peering connections, etc.). Mostly due to the cost of NICs and optics. 25gb or Nx10gb seems to be the sweet spot for server/ToR uplinks, both for cost, but also because it's non-trivial to push even a 10gb NIC to line rate (which is ultimately what this entire thread is about).

There's some interesting reading in the Maglev paper from Google about the work they did to push 10gb line rate on commodity Linux hardware.

fulafel · on Oct 7, 2023

I guess it'll also depend a lot on what size of server you have. You'd pick a different NIC for a 384-vCPU EPYC box running a zillion VMs in a on-prem server room than a small business $500 1u colo rack web server.

The 2016 Maglev paper was an interesting read, but note that the 10G line rate was with tiny packets and without stuff like TCP send offload (because it's a software router that handles each packet on CPU). Generally if you browe around there isn't issue with saturating a 100G nic when using multiple concurrent TCP connections.

surajrmal · on Oct 6, 2023

Yes exactly. Not everything seeking concurrency is a web server. In an OS, every single system service must concurrently serve IPC requests, but the vast majority of them do so single threaded to reduce overall CPU consumption. Making dozens of services thread per core on a four core device would be a waste of CPU and RAM.

marcosdumay · on Oct 6, 2023

> Not everything seeking concurrency is a web server.

Web servers should be overwhelmingly synchronous.

They are the one easiest kind of application to just launch a lot more. Even on different machines. There are some limits on how many you can achieve but they aren't anything near low. (And when you finally reach them, you are much better rearchitecting your system than squeezing a marginal improvement due with asynchronous code.)

There's a lot to gain from non-blocking IO, so you can serve lots and lots of idle clients. But not much from asynchronous code. Honestly, I feel like the world has gone crazy.

withoutboats3 · on Oct 6, 2023

tokio supports a single threaded executor when you really need it, and its not even hard. It's called a LocalSet in tokio's API:

https://docs.rs/tokio/latest/tokio/task/struct.LocalSet.html...

basro · on Oct 6, 2023

This is true but the rest of the ecosystem is not built for it.

If you try to use axum in this way you'd still need to use send and sync all over the place.

Kinrany · on Oct 6, 2023

I was going to comment on the same quote.

The problem is that one may still want concurrency even when a single thread on a single CPU is enough.

basro · on Oct 6, 2023

Instead of Arc and Mutex you'd be using Rc and RefCell. Wouldn't it be just as complex and verbose code-wise?

I understand that it is less efficient but in the case you describe wouldn't paying for a few extra atomics be negligible anyway?

monocasa · on Oct 6, 2023

I've found that practically I'm more likely to simply use Box, Vec, and just regular data on the stack rather than Rc and RefCell when I esque Arc and Mutex by using a single context. The data modeling is different enough that you generally don't have to share multiple references to the same data in the first place. That's where the real efficiencies come to play.

scottlamb · on Oct 6, 2023

Regarding the quote:

> The Original Sin of Rust async programming is making it multi-threaded by default. If premature optimization is the root of all evil, this is the mother of all premature optimizations, and it curses all your code with the unholy Send + 'static, or worse yet Send + Sync + 'static, which just kills all the joy of actually writing Rust.

Agree about the melodramatic tone. I also don't think removing the Send + Sync really makes that big a difference. It's the 'static that bothers me the most, and that's not there because of work stealing. I want scoped concurrency. Something like <https://github.com/tokio-rs/tokio/issues/2596>.

Another thing I really hate about Rust async right now is the poor instrumentation. I'm having a production problem at work right now in which some tasks just get stuck. I wish I could do the equivalent of `gdb; thread apply all bt`. Looking forward to <https://github.com/tokio-rs/tokio/issues/5638> landing at least. It exists right now but is experimental and in my experience sometimes panics. I'm actually writing a PR today to at least use the experimental version on SIGTERM to see what's going on, on the theory that if it crashes oh well, we're shutting down anyway.

Neither of these complaints would be addressed by taking away work stealing. In fact, I could keep doing down my list, and taking away work stealing wouldn't really help with much of anything.

teraflop · on Oct 6, 2023

> I'm having a production problem at work right now in which some tasks just get stuck. I wish I could do the equivalent of `gdb; thread apply all bt`

For all the hate that Java gets, this is something that has Just Worked(tm) for like 25 years, and it's enormously helpful for troubleshooting. You don't even need a debugger; you can just send the JVM a SIGQUIT, and it'll dump a stack trace of every thread to stderr (including which locks each thread is holding and/or waiting on) and keep running.

I miss this feature in every other language I've worked with. You can even use it for ad-hoc profiling in production: just take a bunch of snapshots, and use grep/sed/sort/uniq to look for hotspots.

scottlamb · on Oct 6, 2023

Go does this well, too. iirc the std library has a package for serving <http://blah/debug/pprof/goroutines> so you don't even need to ssh in to the server in question.

There are libraries for doing the same for any language that just uses kernel threads. It's when you throw in async that you really need to reinvent this kind of observability, and Rust isn't there yet unfortunately.

yawboakye · on Oct 6, 2023

i’m slowly coming around to the idea that in most cases (1) big runtimes are a good thing, and that (2) compile-once-run-many was a bad idea. i think our programming languages should create and run software in a highly introspective and interruptible environment.

vacuity · on Oct 6, 2023

I don't know what you have against compile-once-run-many, but as a Rust user I agree that most software doesn't need C or Rust. I think there could probably be a Rust-like and a Java/Go-like, two general-purpose languages that cover 99% of software.

yawboakye · on Oct 8, 2023

compile-once-run-many makes decisions, especially optimization decisions, too early ie way before the software got a chance to see real input. vm-based programs that are able to do just-in-time optimization can learn on the job and make adjustments as necessary. making optimization decisions late, taking input into consideration, makes sense for long-running applications.

vacuity · on Oct 8, 2023

Ah, that's fair. I doubt most applications need that level of optimization, and AOT still seems to be more reliable overall for performance. A language with GC and/or VM does have great properties, though, so for most apps it's not a loss to go with the managed approach. Poorly designed software seems to be the bigger issue in general.

rav · on Oct 6, 2023

> I'm having a production problem at work right now in which some tasks just get stuck.

To mitigate this kind of problem, at my company we use a library [1] that allows regularly logging which tasks are running and what file/line numbers each task is currently at. It requires manually sprinkling our code with `r.set_location(file!(), line!());` before every await point, but it has helped us so many times to explain why our systems seem to be stuck.

[1] https://github.com/antialize/tokio-tasks/blob/main/src/run_t... has set_location(), and task.rs has list_tasks()

scottlamb · on Oct 6, 2023

Yeah, I can see how that'd be helpful. In my case, I suspect this is happening inside a third-party library I'd rather not have to vendor/patch extensively. So that method could confirm my suspicion but probably wouldn't easily allow me to drill down as far as I'd like.

That said, I think the newest version of the third-party library might have some middleware hooks and/or tracing spans. With the right middleware impl / tracing subscriber, maybe I could accomplish something similar.

This code also should be following the general distributed systems practice of setting deadlines/timeouts at the top level of each incoming request, propagating through to all dependent requests, and also setting timeouts on background ops. It's not. Fixing that is also on my list and might be enlightening...

cmrdporcupine · on Oct 6, 2023

There is no right answer on this front, and it's all about different use cases.

It comes down to I/O-bound vs CPU bound workloads, and to how negatively things like cache evictions and lock contention might affect you. If your thing is an HTTP server talking to an external database with some mild business logic inbetween, and hosted on a shared virtual server, then, yeah, work-stealing and re-using threads at least intuitively makes sense (tho you should always benchmark.)

If you're building a database or similar type of system where high concurrency under load with lots of context switches is going to lead to cache evictions and contention all over the place -- you're going to have a bad time. Thread per core makes immense sense. An async framework itself may not make any sense at all.

But there is no right, dogmatic answer on what "is better." Profile your application.

I've said it before, but I feel like the focus of Rust as a whole is being distorted by a massive influx of web-service type development. I remain unconvinced that Rust is the right language for that kind of work, but it seems to do ok for them, so whatever. But the kind of emphasis it puts on popular discussion of the language, and the kind of crates that get pushed to the forefront right now on the whole reflect this bias. (Which is also the bias of most people employed as SWEs on this forum, etc.)

Ar-Curunir · on Oct 6, 2023

I haven’t really seen any issues with async affecting other parts of Rust. People are successfully building systems applications, including game engines, cryptography libraries, kernels, command line tools, compilers, etc, all without having to touch async.

I maintain large cryptography libraries, and have been completely unaffected by the async business.

cmrdporcupine · on Oct 6, 2023

I gripe about it all the time, but it hasn't really been an issue for me, and TBH the biggest codebase that I've written outside of work (where we don't really use async) ... uses tokio.

I think it's more a question of emphasis. If you go looking for crates for network I/O related things (esp HTTP) on the whole you'll find mostly async driven ones. And among them, you'll often find they're hardcoded for tokio, too.

anonymousDan · on Oct 6, 2023

Yeah I agree about the webdev influx. It would be a shame if Rusts utility for systems programming is ruined as a result.

ThinkBeat · on Oct 6, 2023

>I've said it before, but I feel like the focus of Rust as a whole is being >distorted by a massive influx of web-service type development.

This. That is quite right and well said

ndriscoll · on Oct 6, 2023

Funny, I was thinking a web app is ideal for thread per core. The application itself generally has very little state outside of a request (other than for the socket listener and database connections, which can be segmented per thread), and what state it has is probably mostly static across requests (so caches don't invalidate often). It should be easy to deal with ownership of shared state because there isn't any.

cmrdporcupine · on Oct 6, 2023

It all depends on if you think the operating systems scheduler is better than your async framework's.

There's a long tradition of not wanting to "waste" threads by blocking on I/O.

geodel · on Oct 6, 2023

>.. focus of Rust as a whole is being distorted by a massive influx of web-service type development.

True, I think it happened because Rust community quite aggressively sought out those developers to gain market and mindshare. I am not saying it is bad or good but now Rust has to live with unending stream of web related libraries and frameworks of varied quality.

And async will remain constant topic of discussion because most of critical base libraries/crates etc have taken async first approach. Now it is to a level that normal devs can not write plain sync code for business problems unless they make not using async one of main point of their projects.

withoutboats3 · on Oct 6, 2023

> True, I think it happened because Rust community quite aggressively sought out those developers to gain market and mindshare.

Guilty as charged, honestly. But we were really target high performance web services where Rust really makes sense. And Rust has had a lot of success gaining market share in that area, but most people who work on things like that are writing closed source software and don't comment so much on the internet, so what you see online is mostly people blogging about their side project that doesn't need to serve a million concurrent connections.

piperswe · on Oct 6, 2023

I feel lucky(-ish) that the only web Rust project I've ever really worked on is one that absolutely takes advantage of Rust's performance, Cloudflare's Pingora. But yeah, CRUD app #64326 probably should just use Rails/Django/Phoenix/Go/etc. instead of Rust.

geodel · on Oct 6, 2023

This makes me think of another reason of Rust in web space. It is that Rust rather explicitly tried to create an image of real hardcore language used for serious systems stuff. And it is of course true in technical sense. In larger context however message became more of like if I use Rust my stuff will become serious.

So now if someone is told their web app is cute little thing more suited for Rails/PHP/Go etc. they will feel patronized and try to use Rust despite being unsuitable because their app is going to be serious one.

cmrdporcupine · on Oct 6, 2023

I had someone reach out to try to hire me last year to build a website. A neat and useful one, but a website. And they had decided they wanted to use Rust, so they got in touch with me. This person was primarily non-technical, but an entrepreneur. I couldn't understand the decision making that led them there, other than: people had told him that Rust was the new good thing. So that's what he wanted.

newpavlov · on Oct 6, 2023

I believe that having the Send bound as a requirement to allow migration of tasks between executor threads is a clear deficiency of the Rust async system by itself, together with the fundamental issues around async Drop, which prevent implementation of scoped APIs. Similarly to threads it should be sufficient to have Send bound on functions like spawning and sending data through channels. The share-nothing approach is usually used as nothing more than a workaround to hide this deficiency..

Selectively pinning tasks to a thread/core has advantages and can be really useful in some circumstances, but it's a finer discussion, which has little to do with async users dissatisfaction related to Send.

DarkNova6 · on Oct 6, 2023

Good writeup I recommend to read more than just the headline.

My favorite line: > I have a hard time believing that someone who’s biggest complaint is adding “Send” bounds to some generics is engaged in that kind of engineering.

Edit: I fully agree with the comment of "duped". I was not aware of the greater context of this discussion. As such, I might have quoted this sentence prematurely.

MrBuddyCasino · on Oct 6, 2023

> This appears in Rust’s APIs as futures needing to be Send, which can be difficult for people with a poor view of their system’s state to figure out the best way to ensure. This is why work-stealing is said to be “harder.”

Is it just me or does this come across as rather arrogant? The problem of 'static lifetimes and send/sync constraints resonates among developers, and my impression is not that those were morons.

withoutboats3 · on Oct 6, 2023

I'm just referring back to my earlier point: people say not doing work stealing would be both easier and faster. My claim is that its one or the other, because to get share-nothing to be faster you need to architect your code in a way that is not easier than making a shared-state architecture thread-safe. There is a parallel sentence with "slower" in the next paragraph.

I don't think people who struggle to get parallel & concurrent Rust to compile are morons, though I don't like when they act like the APIs we built for them are ruining their lives.

MrBuddyCasino · on Oct 7, 2023

Fair enough, the technical point is argued well. Thanks for the clarification.

Guvante · on Oct 6, 2023

Harder is in quotes because it isn't necessarily harder.

If you would need to do it anyway it isn't harder.

Less "people are worked up over doing a little work" more "async makes you solve problems earlier you were going to solve anyway".

Similar vibe to the borrow checker. Sometimes it is overly restrictive othertimes you didn't actually consider the corner cases when you assumed everything would be fine.

eminence32 · on Oct 6, 2023

From my own personal experience, I definitely struggle sometimes to understand if my state is Send or not. So the line you quoted from the article resonates with me.

dpc_01234 · on Oct 6, 2023

This is missing the forest for the trees.

There is no one always-right way to get best performance for every program. You can argue about it all you want. Thread-per-core benefits/downside is a typical "it depends" discussion.

The problem is that using `async` in the first place is a premature optimization. 99% of Rust programs are not redis, linkerd and alikes. They are some CLI tools or Web apps that could have been written in Python or Ruby, and they would still be fast enough.

So why as a community we abandoned blocking IO Rust and everything is async now, and developers are just doing `#[tokio::main]` by default on everything?

vbezhenar · on Oct 6, 2023

One reason might be: if you're fine with thread-per-core performance, you probably wouldn't want to use Rust in the first place, as there are language with better programming experience trading it for speed. Like python. And if you want to use Rust, probably you need extra performance and you can adopt less convenient style (because you already adapted less convenient language) to get better performance.

dpc_01234 · on Oct 11, 2023

Wat. No.

Pxtl · on Oct 6, 2023

I've never touched Rust, but I can see the complaint -- if I have to write my code in a special way so that its state can be marshalled across threads to redistribute load in a way that I do not need and will make the actual end-to-end latency of a single request actually slower in cases where I've got oodles of CPU head-room, I would find that infuriating.

I could see this approach making sense in a platform where transferrable state was the default and is rarely broken, but it doesn't sound like that's the case in Rust?

edit: I'm curious, what's the ergonomics like for this? Is it just "oh, your code won't compile if you don't add this magic incantation to say it has `Send`?" Or is it "your code will fail in intermittent, hard-to-debug ways as state gets mangled during work-steals if you don't do this right"?

duped · on Oct 6, 2023

> edit: I'm curious, what's the ergonomics like for this? Is it just "oh, your code won't compile if you don't add this magic incantation to say it has `Send`?" Or is it "your code will fail in intermittent, hard-to-debug ways as state gets mangled during work-steals if you don't do this right"?

You write some code that looks like this.

    struct Server {
      // ...
    }

    impl Server {
        async fn serve(&self) {
            loop {
                let server = self.clone();
                let message = read_message().await;

                // Each message handler is a new task, seems reasonable?
                spawn(async move { 
                    let result = server.handler(message).await;
                    write_response(result);
                })
            }
        }

        async fn handler(&self, arg:Message) -> Result<U, Error> {
            let this = do_this();
            let that = do_that(arg).await;
            do_the_rest(this, that);
        }
    }

And everything works fine.

Then one day you change the implementation of `do_this` such that the type of `this` is no longer `Send`. You will get a nasty compile error that spawn(...) doesn't work because the type created by the anonymous scope in `async move { }` is not Send. The reason is not necessarily obvious (and the error message is unhelpful). If `this` is not `Send` then you can't hold it across the `.await` of `do_that(arg).await`, because each .await represents a point in execution where the future may yield and be scheduled onto another thread by the executor.

If you can make the type Send then everything is fine. If you can't (which is entirely possible!) then you need to change the scheduling of the future to `spawn_local` (or whatever your async executor calls it). This may require adding a bunch of boilerplate to even call `spawn_local` in the first place.

This is the issue with Send. It's not just adding type annotations. There are subtle ways it infects your code that may cause it to break later in non-obvious ways, because if a type implements Send is not always obvious.

cmrdporcupine · on Oct 6, 2023

Ok, but it's not Send that's doing it, and causing you problems. It's using async.

You were always sending crap across concurrent execution contexts. But the language syntactically kinda-hid it from you and you got away with not worrying about it. And now it bit you in the ass.

In general, this is the problem with implicit/masqueraded behaviour. Why back in the day we moaned about REST vs RPC (stop trying to hide that there's a network there!). Or why it's a bad idea to have languages where "+" is a string append operator. Etc. In systems design: surprises suck.

Tokio is sending your stuff around, passing it around. So of course it has to be Send. Yeah, you can avoid that by doing thread per-core and just never having things send. But I swear, it's going to bite you in the ass in the end anyways. It's your control flow that's confusing (because it's hidden), not the Send.

In the end, C++ or etc would just let you hang yourself here, and you'd be puzzling over a segfault or deadlock later, at 3am.

(There are some things that piss me off tho. AtomicPtr<T> being Send+Sync while *T/*const T is not makes no sense. Both are equally unsafe, and there's no difference in how they behave across thread boundaries, really.)

duped · on Oct 6, 2023

The problem that I'm trying to illustrate is Send + Async which always comes up in this context, not just Send.

> You were always sending crap across concurrent execution contexts. But the language syntactically kinda-hid it from you and you got away with not worrying about it. And now it bit you in the ass.

I disagree. Here's a more realistic example

    async fn func (resource: Resource) {
       let a = foo(resource).await;
       let b = bar(resource).await;
       baz(a, b).await
    }

Assume that this isn't "crap" and the only correct way to implement this code. There are two fundamentally distinct async operations over the same resource and you can't call them in parallel.

Now say you have an async callstack that looks like this

    let task = async move {
       outer(inner(func(resource).await).await).await;
    };
    spawn(task);

That call to func() could be buried deep in the callstack instead of inner or outer or anywhere else, and very small changes to the implementation of func can cause it to create a compiler error a mile away from where the problem actually is.

I don't feel like this is inherent complexity and it has nothing to do with tokio, it has a bunch of related problems with the limitations of Rust, from the stripped-down generator design to lack of specialization for traits.

imo these limitations make async code quite fragile to write in practice, and it's kind of frustrating to be gaslit repeatedly with "no it's actually good that your code is hard to write."

cmrdporcupine · on Oct 6, 2023

I don't think we disagree. I was saying: it's async that's the problem, not Sync tracing.

Maybe I didn't write that clearly?

diath · on Oct 6, 2023

> And everything works fine.

Until the handler needs to access a resource that's shared by tasks.

duped · on Oct 6, 2023

That's an example of how your code can subtly break

diath · on Oct 6, 2023

My bad, I made the remark before I finished reading the rest of your reply.

cmrdporcupine · on Oct 6, 2023

On the whole the compiler detects when things are 'Send' & 'Sync'. If you write your program thread-safe, you won't have issues.

And that's the crux of the matter: people are griping on the whole because maybe Tokio async is hard when it puts Send&Sync demands all over the placer, but the reality is that writing safe concurrent code of any kind is hard. It's not intuitive, and the problem is that async makes them feel like it's happening automagically and just "taking care of it" -- but it's really not. You need to know what you're doing, and the compiler is just helping you here.

Yes, you can hide that by going thread-per-core and that could eliminate the need for Send in some circumstances (not all). But it might come back to bite you, architecturally, in the long run.

creata · on Oct 6, 2023

> I'm curious, what's the ergonomics like for this?

Your code won't compile unless everything is Send/Sync as appropriate, and (I might be wrong, but) the lazy path to achieving this is usually wrapping things that might be shared in Arcs and/or Mutexes.

ko27 · on Oct 6, 2023

I think the argument against thread-per-core as the default can be made simply:

- if you are CPU bound, work stealing will be better for most cases

- if you are IO bound, thread-per-core might work better, but again, you have enough CPU headroom that the performance doesn't really matter

IMO, work stealing is a better default to encode into language API

vlovich123 · on Oct 6, 2023

The main argument for work stealing is that it’s hard to achieve uniformity of work loads across all threads. The main argument for a single-threaded thread per core design is that it’s easier to code AND performs/scales way better than work stealing (including average and tail latencies, TPS etc).

IMHO it’s a misconception that this is somehow tied to CPU or IO bound work. Take for example databases. You’d think that that’s the prime “I/O bound” use case. Except it’s not. There’s a talk about a DB researcher that analyzed that Postgres spends 70% of its time book keeping things within the database. And that makes sense. I/O is done in bulk with the cost amortized over a lot of transactions. That book keeping work? Extremely expensive because you have to acquire locks all over the place, do atomics, memory allocations etc.

Atomics and memory allocations are extremely expensive in certain contexts and atomics also have a negative in that your scaling with number of CPUs is sub linear due to hard to remove false sharing of cache lines and CPU stalls to handle the synchronization.

On the other hand, a shared nothing approach where you’re not allocating memory in your hot path is very hard to achieve and not suitable for all problems. Nor does everyone need that performance. So the work stealing approach is better in those use cases as it provides reasonable performance and the programming model is simpler in some ways since you don’t have to think about the data path as careful since everything has an Arc / Mutex in there.

layer8 · on Oct 6, 2023

> you have enough CPU headroom that the performance doesn't really matter

But it would affect power consumption? (Just trying to understand.)

basro · on Oct 6, 2023

No, performance and power consumption should go hand in hand in this case. If you are strongly IO bound, paying for the synchronization is not really going to matter much I believe.

There are cases where you can be CPU bound and using the share nothing model would work out to your advantage. There's also the case where you only have one cpu core anyway (for example if you want to get all the juice out of a cheap single core VPS)

pornel · on Oct 6, 2023

Everyone who has worked with sharding at scale knows the pain when there's this one terrible key (your biggest customer, hottest key, largest blob) that just ruins load-balancing of the shards.

scottlamb · on Oct 6, 2023

100% true, but in fairness work stealing doesn't completely solve that either. You often end up locking the shard anyway because lockless data structures are limited and hard. And if you're doing something like a key/value store, you're probably sharding to decide which server to direct traffic to, and so you have the same problem at that layer. (There are things that can help, of course, e.g. <https://aws.amazon.com/blogs/architecture/shuffle-sharding-m...>.)

IshKebab · on Oct 6, 2023

Yeah maybe, but how many people are running these high performance async servers on 2 core machines?

samsquire · on Oct 6, 2023

Thank you for this blog post.

I like the thread per core design because kernel context switches are expensive. But userspace context switching scheduling (such as an unbuffered golang channel) for a 64 bits of data at a time makes me uncomfortable too unless it represents a large amount of work backed by a pointer to work data.

I think I like to multiplex sockets over threads and multiplex IO over threads so that you can do CPU work while IO is going on and you can process multiple clients per thread.

You cannot scale memory mutation by adding threads. You want ideally one thread to own the data and be safe to mutate it uncontendedly.

When you send data to another thread, don't refer to that data again. Transfer ownership to that thread.

If you can divide your request into phases of expansion (map) and contraction (synchronization) you can do intrarequest parallisation.

I've been looking into runqueues of go and tokio where you have a local runqueue without a mutex and a global runqueue with a mutex.

I've been trying to think how IO threads that run liburing or epoll can wake up a Coroutine or an async task on a worker thread without the mutex.

The worker thread is looking for tasks to resume that are unblocked and needs to be notified when there is IO finished. I think you can have a Coroutine that is always runnable to read from a lock free ringbuffer. You can have that Coroutine that checks for finished IO's ringbuffer yield if its contended by writes by the IO thread that is trying to make work available to the worker threads.

ori_b · on Oct 6, 2023

> I like the thread per core design because kernel context switches are expensive

You can do a millions to tens of millions kernel entries/exits per second per core. They're on the same order of magnitude as a single digit number of full cache misses.

So, expensive, but a lot less expensive than many people assume.

Edit: for some (outdated, 2018) measurements, https://eli.thegreenplace.net/2018/measuring-context-switchi...

amluto · on Oct 6, 2023

You’re replying to a comment about context switches, not syscalls. Linux on x86 cannot do tens of millions of context switches per second per core. (I expect that, at high clock speed, carefully tuned, under ideal and possibly unrealistic conditions, you might get 2M/s — it’s been a while since I benchmarked this.)

But this is silly — shared nothing will have no substantial difference in the rate of context switches (or of syscalls) compared to a thread-per-core-shared-everything architecture. The architecture that actually loses is many-threads-per-core if it ends up in a small-batch mode in which it context switches once per request or so.

kldx · on Oct 6, 2023

That's an interesting view and one I agree with. Surely if context switch = expensive is common knowledge for a long time in computing, the kernel people and CPU companies would have tried optimizing it to the fullest, no?

insanitybit · on Oct 6, 2023

Syscalls have gotten quite a lot slower after Specter/Meltdown mitigations. Though I suspect that if you're writing a web service that wants TPC you should just disable these - I've told that to Scylla Cloud before, they have no reason to keep them enabled and plenty of reasons to disable them.

The other thing is 'io_uring', which is trying to come at the problem by removing context switching altogether by providing a cheap primitive for communicating with the kernel.

kldx · on Oct 7, 2023

Sadly io_uring has its share of vulnerabilities as well

insanitybit · on Oct 7, 2023

Of course, I'm very aware of that fact. But it is designed to minimize system calls exactly because they can add significant overhead.

TexasMick · on Oct 6, 2023

I have seen context switches be a killer in some embedded contexts. Especially on FPGA soft core applications where you often are dealing with heavy I/O, if you just naively spin up threads for I/O then you will suffer heavily.

cmrdporcupine · on Oct 6, 2023

And all our assumptions get blown up with every new hardware generation, too.

Hardware engineers are building systems for the applications of today. And the kernel developers are optimizing as well. It's all a moving target.

Dogma has no place here.

davidhyde · on Oct 6, 2023

> “The problem with work-stealing is that it means a task can run on one thread, pause, and then be started again on another thread”.

The article explains how work-stealing is a way to solve tail latency. I understand tail latency to be caused by tasks that yield (by themselves) after an unexpectedly long amount of time. This cannot be know in advance. The article explains how work-stealing results in cache misses and extra developer constraints like Send, Sync and ‘static.

What if the executor only moves a task to another thread after a timeout period has expired? This should result in no latency penalty if things are running smoothly and a constant, low, extra latency when tasks need to be stolen now and again. This would mitigate cache miss issues but alas not the developer overhead of all those multithreaded constraints.

catern · on Oct 6, 2023

An important thing omitted in this post, which makes work-stealing less attractive, is that one core being idle can actually improve performance of other cores. Today's CPUs basically have a fixed energy budget, and if one core is idle that means more of that budget can go to other cores.

In other words, core utilization is less relevant today - what you care about is energy utilization (which is shared across cores).

Of course, there's a point at which this stops being relevant - if you have multiple sockets for example, this won't apply. But work stealing across multiple sockets is so expensive anyway that you would never want to do it. You might as well work-steal across machines at that point - something which is indeed useful sometimes, but usually niche.

CyberDildonics · on Oct 6, 2023

If a CPU is being cooled enough to not throttle, it is much more time and energy efficient to use all the cores you can rather than have another core run at a slightly higher frequency.

Higher frequencies have diminishing returns and exponential heat loss.

You might as well work-steal across machines at that point

Shared memory is extremely fast, it crushes using local loopback networking, let alone using actual networking.

jeffbee · on Oct 6, 2023

You can practice energy-aware scheduling at higher levels, too. If you have to send an RPC and you can choose between multiple peers, choose the one with the coldest CPU temperature.

cryptonector · on Oct 6, 2023

> It’s always off-putting to me that claims written this way can be taken seriously as a technical criticism, but our industry is rather unserious.

It's not the industry. It's people.

> What these people advocate instead is an alternative architecture that they call “thread-per-core.” They promise that this architecture will be simultaneously more performant and easier to implement. In my view, the truth is that it may be one or the other, but not both.

Thread-per-core is hands-down better for performance. That's because you can't rely on storing application state smeared on a thread/co-routine stack, so you have to instead make all your state explicit, and in the process you'll effectively compress it by no longer smearing it on a very large data structure called the call stack. Smaller state == less cache thrashing.

The problem with thread-per-core is that you really need async/await or coding in continuation passing style in order to make it work. CPS is mostly not an option in most cases, so async/await it has to be.

Developers should understand the concepts of thread-per-core and thread-per-client and pick the best one for their case. IMO thread-per-core is always better because if your application explodes in popularity then you'll really care to make it more efficient, but rewrites are always hard or infeasible, so if you get it right from the beginning, then you win. What I mean by "pick the best one" then has to do with the developer's skills and productivity and time constraints. Sometimes "best" is about what you can manage, not what's "best" in an objective vacuum that ignores real-world constraints.

IshKebab · on Oct 6, 2023

> It's not the industry. It's people.

Yes, and this is just naive to think that other industries are somehow bastions of serious debate.

sakras · on Oct 6, 2023

> The problem with work-stealing is that it means a task can run on one thread, pause, and then be started again on another thread: that’s what it means for the work to be stolen. This means that any state that is used across a yield point in that task needs to be thread-safe.

Do any existing "thread-per-core" systems actually provide yield points as a thing you can do? Most of my experience is with OpenMP (using both BSP and task parallelism) and a little bit of TBB (also task parallelism). If you want to yield in these systems, you break it up into two tasks.

> if state is moved from one thread to another, this introduces synchronization costs

This isn't obvious to me. If you're moving state, then there is no synchronization because only one thread is touching the state at any given point. Unless we're talking about synchronization due to starting a new task?

hjl22 · on Oct 6, 2023

I haven't touched async Rust, but have used normal Rust a little bit.

My understanding is that 'static is program lifetime, meaning either static compile-time data like constant strings, or it is memory leaks. How does this apply to async in practice? Do you need to leak things that you want to make async?

pornel · on Oct 6, 2023

No, when used as a bound, it doesn't really say how long data has to live. It just forbids use of all temporary references (types that are borrowing from something short-lived they don't own).

Lifetime requirements simply don't apply at all to types that own their data. Or another way to see it is that self-contained types like Vec and String automatically meet every lifetime requirement, regardless of how long or short they actually live.

Rust kinda screwed up with terminology here, because the general computing term of "object lifetime" applies to more things than the specific 'lifetime concept that Rust applies to references/loans.

hjl22 · on Oct 7, 2023

Thanks for the explanation. I understand it now.

pitaj · on Oct 6, 2023

Most heap-allocated values like `String` are also static. Essentially it means that anything the value references is not on the stack.

hjl22 · on Oct 7, 2023

Thanks for the explanation. I understand it now.

Georgelemental · on Oct 6, 2023

`T: 'static` means that values of the type `T` own all their data (more specifically, their lifetime is not constrained by the lifetime of anything they reference).

`&'static T` is a reference to something that lives as long as the program (static data or leaked memory).

hjl22 · on Oct 7, 2023

Thanks for the explanation. I understand it now.

scottlamb · on Oct 6, 2023

It means spawned futures can't contain references that aren't 'static. But they can own memory. String for example is 'static. Likewise Vec<T>, Box<T>, Arc<T>, Rc<T>, etc. where T: 'static.

hjl22 · on Oct 7, 2023

Thanks for the explanation. I understand it now.

exfalso · on Oct 6, 2023

TLDR summary:

1. Straw man 2. Talk about an unrelated paper 3. Conclusion: I am smart and those people are dumb

On a somewhat related note: I'd say 95% of the software I've written was IO bound(and if it wasn't at the beginning then the goal was to make it IO-bound), and that remaining 5% does not benefit at all from async/coroutines(used them in Haskell, Kotlin(Quasar) and Rust). I'm very curious about what real-world CPU-bound use cases people have that can benefit from async performance-wise. If you're optimizing on that microsecond-nanosecond scale then you shouldn't even have yields/blocks on your hot paths, and synchronization will at most be done using memfences, so what are we talking about?

I'd also conjecture that if you place your problem domain on the IO vs CPU bound axis, the problems where async would provide performance benefits can invariably be solved by other designs that perform even better (GPU/FPGA).

Yes I'm one of those people who thinks coroutines have extremely marginal actual value, and most of that value is the "feeling of how cool this is" that people experience when they first learn about the concept. If there was a way to bin the concept altogether I would do it in a heartbeat. As-is, the entire Rust ecosystem is suffering heavily because of it.

slaymaker1907 · on Oct 6, 2023

A lot of the time, the benefit can be from much more controlled scheduling without having to introduce actual threads. A lot of software doesn't want to take the overhead of creating platform threads (memory and taking additional CPU time) but does want to do scheduling internally for various CPU bound workloads. I think web pages are often like this. You don't really want to hog tons of CPU, but you also want to have guarantees like not blocking rendering when running a user provided regex since said regex could take a long time to run, but in practice that regex will be fast. Even if we want to eventually offload that to a separate thread, async gives us the option to first try evaluating it in the main thread for some number of iterations before offloading it.

GPU/FPGA programming is kind of irrelevant because as expensive as developing high performance code is for CPUs, costs go up by an order of magnitude for those platforms unless there is some existing library you can utilize (mainly applicable for ML/AI). These platforms are also very expensive. It's like saying just rent a helicopter if you need to get somewhere quickly instead of asking what "quickly" actually means and optimizing accordingly.

exfalso · on Oct 7, 2023

Suspendable computations like regex statemachines are actually a good example use case. However I'd still categorize this as extremely marginal. I have seen literally zero Rust crates that handle suspendable computations (ofc outside of the scheduler runtime). Also note that by their nature regex statemachines do not actually necessitate the use of async, it's effectively syntax sugar on top.

My point with GPU/FPGA is that if you're at the point where this level of optimization matters then you're actually dealing with situations where it's worth to invest in the big boy tools. Examples are HPC in fintech (low latency trading) and scientific computations, game development, video codecs etc. You know, "actual" computations.

Webservices are not in this category. Generally speaking with web services your goal is to "hide in the shadow of IO". If you max out your network and/or database and/or filesystem capacity, further CPU optimizations will have literally no effect. I have yet to encounter a web service where this wasn't the case.

What I do see sometimes with webservices is simply unnecessary compute, bad internal structuring, dynamic dispatch, parallelism overcommit, lack of batching, fragmented apis, fragmented data accesses etc etc all of which appear as CPU capacity saturation and also sometimes as kernelspace overhead in profiles. Async does not help solving any of these issues. And once you do solve them, you reached IO boundness and it doesn't matter anymore.

Again this is just my experience, and I'm happy to learn about what kind of web service can utilize coroutines with measurable performance benefits over a managed threadpool.

TexasMick · on Oct 6, 2023

I write software for FPGA soft cores. Even with all this acceleration around the soft core, we need some sort of scheduling and kernel context switches are really a big killer. We have the same issue as web developers, dealing with 50,000 things per second means we need to avoid kernel context switches.

exfalso · on Oct 7, 2023

Avoiding context switches is not a problem async solves. A threadpool popping work items from a queue(or something like lmax disruptors) has the same effect on context switches. The only thing one could argue is that the async runtime's threadpool "homogenizes" work. Again, yet to encounter a case with web services where this was the issue.

I'm not familiar with FPGA scheduling, are there any resources that explore the issue? I was under the impression it's akin to GPU compute where the main bottleneck is the bus.

lordnacho · on Oct 6, 2023

First of all, does anyone know if there's a flag that just turns off work-stealing? Seems like you could then decide for yourself how you want Tokio in particular to work.

Second, the send + sync thing seems onerous, but from my POV you are encouraged to only pass little messages on channels anyway. I find it easier to just not share anything. If two things need the same structure, they can both subscribe to the ring that gets the messages, and each construct the object for themselves. I find if you have a bunch of Arc<RwLock<>> something is wrong. YMMV.

c-cube · on Oct 6, 2023

You can configure tokio with feature flags in cargo. I'm particular you can pick a single threaded scheduler.

weinzierl · on Oct 6, 2023

"This means that any state that is used across a yield point in that task needs to be thread-safe. This appears in Rust’s APIs as futures needing to be Send [..]"

Does that mean that I have to unnecessarily use locking (e.g. ARC and Mutex) even with flavor = "current_thread" or am I misunderstanding this?

necubi · on Oct 6, 2023

No locking (which provides mutable `Sync` in rust terms) is generally required, but things must be Send (i.e., allowed to move between threads, but not shared). In my experience is mostly an issue when dealing with references, because `&T` is only Send if `T` is sync.

However, `&mut T` is Send if `T` is Send. So if you can avoid aliasing your references, you're also ok.

weinzierl · on Oct 6, 2023

Thanks! So does that mean that if I write my code with default tokio and ARC where required and then when I add flavor = "current_thread" I can replace my ARC with Rc?

the_white_oak · on Oct 7, 2023

Rc is not Send.

But you can use tokio's LocalSet to run non-Send futures.

https://docs.rs/tokio/latest/tokio/task/struct.LocalSet.html

cryptonector · on Oct 7, 2023

Work stealing is like process migration, and as such it has significant overhead and you really need to be picky about when to rebalance by moving loads.

mikhailfranco · on Oct 6, 2023

Erlang/Elixir/BEAM emphasizes share nothing, allows (encourages) a bezillion user-space processes, then executes with a thread-per-core (by default).

The actual number of schedulers (real threads) is configurable as a command line option, but it's rare, approaching unheard-of, to override the default.

Virding's First Rule of Programming ...

https://rvirding.blogspot.com/2008/01/virdings-first-rule-of...

jerf · on Oct 6, 2023

If Go with its green threads is a step down from Rust in performance, Erlang is two or three steps down from Go. If you step down your performance needs, a lot of these problems melt away.

Most programmers should indeed do that. There's no need to bring these problems on yourself if you don't actually need them. Personally I harbor a deep suspicion a non-trivial amount of the stress in the Rust ecosystem over async and its details is coming from people who don't actually need the performance they are sacrificing for. (Obviously, there absolutely people who do need that performance and I am 100% not talking about them.) But it's hard to tell, because they don't exactly admit that's what they're doing if you ask, or, at least, not until many years and grey hairs later.

But in the meantime, some language needs to actually solve these problems (better than C++), and since Rust has volunteered for that role, that means that at the limit, the fact that other languages that chose to just take a performance hit don't seem to have these problems doesn't have very many applicable lessons for Rust, at least when it is being used at these maximum performance levels.

mikhailfranco · on Oct 6, 2023

Agree, Erlang will never win any performance benchmarks, but that is mostly due to other aspects of the language: big integers, string handling, safer-rather-than-faster floating point, etc.

[Elixir is a little better, supporting binary-first for strings, rather than charlists - Erlang is very good at pattern-matching binaries.]

Share-nothing and thread-per-core are good for many reasons, including performance, but they also feed into the main philosophies for Erlang development: resilience, horizontal scalability and comprehensibility.

As Joe Armstrong said:

“Make it work, then make it beautiful, then if you really, really have to, make it fast.

90% of the time, if you make it beautiful, it will already be fast.

So really, just make it beautiful.”

ergl · on Oct 6, 2023

There's nothing inherently slow about the way you structure a program in Erlang. Most of the problems come from copying values around when sending them across processes.

jerf · on Oct 6, 2023

Erlang/BEAM is significantly slower than either Go or Rust. Its speed reputation was often misunderstood; it was very good at juggling green threads, but it was never a fast programming language. Now that its skill at juggling green threads is commoditized, what's left is the "not very fast programming language".

It's not the slowest language either; it has a decent performance advantage over most of the dynamic scripting languages. But it is quite distinctly slower than Go, let alone Rust.

insanitybit · on Oct 7, 2023

I think you're misunderstanding.

1. Erlang makes sharing data easy, it does so through mailboxes. Data moves across actors constantly.

2. Erlang is not "Thread per Core". I think you're misunderstanding TPC as "there is one OS thread per CPU Core" and that is not the case.

Erlang is basically the exact opposite of the TPC design.

mikhailfranco · on Oct 8, 2023

Erlang (BEAM) has schedulers that execute the outstanding tasks (reductions) on the bezillion user-space (green thread) processes.

For most of Erlang's history, there was a single scheduler per node, so one thread on a physical machine to run all the processes. There is a fixed number of reductions for each (green thread) process, then a context switch to a different process. Repeat.

A few years ago (2008), the schedulers were parallelized, so that multiple schedulers could cooperate on multi-core machines. The number of schedulers and (hw/thread) cores are independent - you can choose any number of real threads to run the schedulers on any physical machine. But, by default, and in practice, the number of schedulers is configured to be one thread-per-core, where core means hardware supported thread (e.g. often Intel chips have 2 hardware threads for each physical core).

So yes, almost always and almost everywhere, there really is one OS thread per hardware supported thread (usually 1x or 2x physical CPU cores) to run the schedulers.

https://www.erlang.org/doc/man/erl.html

https://erlang.org/pipermail/erlang-questions/2008-September...

whytevuhuni · on Oct 8, 2023

As the original article noted, one of the biggest problems of "thread per core" is the name of it, because it confuses people. It does not mean "one thread per one core" in the literal sense of the word, but rather a specific kind of architecture in which message passing is NOT done between threads (as is very common in Erlang), or it is kept to the minimum possible. Instead, the processing for a single request happens, from the beginning to the end, on one single core.

This is done in order to minimize the need to transfer L1 caches between threads, and to keep each thread's cache pool tied to one request, and not much else (at least, to the extent possible).

In the context of Rust async runtimes, this is very similar to Tokio if work-stealing did not exist, and all futures spawned tasks only on their local thread, in order to make coding easier (lack of Sync + Send + 'static constraints), while also making code more performant (which the article argues it does not).

For examples of thread-per-core runtimes, see glommio and monoio.

insanitybit · on Oct 8, 2023

I am extremely familiar with Erlang and its history. You are misunderstanding what "Thread Per Core" means.

Again, the fact that data moves across threads in Erlang means it is not TPC - period. Erlang is basically the exact opposite of a TPC system, it is practically its opposite because it is all about sharing data across actors, which can be on any thread.

Perenti · on Oct 7, 2023

Perhaps the title should mention this is primarily about Rust, and not thread-per-core in general, which would be interesting.

kosolam · on Oct 6, 2023

Very nice write up that demonstrates the difference between theory and practice.

insanitybit · on Oct 6, 2023

> Some Rust users are unhappy with this decision, so unhappy that they use language I would characterize as melodramatic:

Seriously. I've said this before, the way people talk about these problems is so dramatic. I've written 10s of thousands of LOC in Rust and you'd think, from these blog posts, that I must be miserable. I am not. I am quite pleased with it.

It's so funny to me that Thread Per Core is seen as the holy land. Like it's just objectively better. It is not. TPC is quite tricky, indeed. Scylla had a great post recently talking about how they were suffering performance penalties due to accidentally sharing some memory - this is the sort of thing you have to be super careful about when leaning into TPC. "Hot partitions" is another one.

I'm going to refrain from commenting more (this article looks excellent and I want to read the citations) but I'm very very happy to see this post. I think far to little credit is given to the effort put into getting a novel language like Rust to support this so well.

Waterluvian · on Oct 6, 2023

This kind of attitude exists everywhere, but good gravy does it exist strongly in Software Development. I feel like every software developer could benefit from a summer tarring rooftops to mature their idea of what "killing the joy" of doing work feels like.

insanitybit · on Oct 6, 2023

A big problem with software dev is the lack of rigor, imo.

> It’s always off-putting to me that claims written this way can be taken seriously as a technical criticism, but our industry is rather unserious.

From the article, this resonates so strongly with me. Instead of people reading papers, talking seriously, it's so hand wavy and weak.

Waterluvian · on Oct 6, 2023

My general feeling is that I think people lean on how they feel about these problems because building an experiment and quantifying the problem to make a solid case for it actually being a problem worthy of resources and remediation is... tedious and unfun?

Not to say these people don't exist. But I think they exist in lesser numbers. Even I'm guilty of this. I know it... so I do it... but it's very unfun and tedious and... well I'm paid to do it so I can't complain.

aeonik · on Oct 6, 2023

I think it's more then tedium and fun that blocks the rigor. Though, those two properties definitely hold a fair number of people back.

I'll use myself as a research subject: I really like the scientific method, and don't mind a long grind. But the main barriers I run into with software fall into at least two major categories: 1. Compatible datasets or software: it takes a LOT of manual effort to collate good data sets, and because I am a software engineer, I want to automate these things, then I end up in tarpits and rabbit holes, building bridges rather than testing hypotheses.

2. Abstraction fatigue, There is so much vocabulary that we use, and we switch layers so quickly and in different contexts that I find it very time consuming and opaque to harmonize all the concepts. Real life pressures don't always afford the ability to really understand an entire stack, which I posit is necessary for a certain level of Rigor.

For example, right now I'm trying to Grok Clojure Transducers, and some folks are saying they are the same things as Monads. Are transducers a better pattern? Are they the same? Different? How much do they overlap?

I have feelings relating to dynamic types vs static types. I have feelings about wrapping implementations. I have feelings about the JVM vs the Haskell compiler vs the V8 JavaScript engine. Also, I get feelings related to polymorphism, metaprogramming, dependent types, homoiconicty.

But nothing fully baked. To call my feelings even 10% baked would be charitable. I can't currently describe all the relationships, or why my brain is currently zeroing in here in full fidelity. Only parts of it, and I remain in a state perplexity for now.

The formality of Category Theory and Abstract algebra sounds like it would be really nice here, but it's going to take me years to get to to speed in those domains, and I'm not even sure they have the theorums to compare these two patterns!

mulberrybush · on Oct 6, 2023

This video [1] explaining Clojure transducers is excellent. No terminology, no analogies with category theory - just code getting built up in a repl.

[1] https://www.youtube.com/watch?v=TaazvSJvBaw

insanitybit · on Oct 6, 2023

I'm going to avoid derailing into a "tech has these problems", respectfully, as I think the core of the post is really worthwhile and I could talk about tangentially related things for too long and be too distracting. I think we are likely, overall, in agreement.

withoutboats3 · on Oct 6, 2023

> A big problem with software dev is the lack of rigor, imo.

It's amazing to me how much more informed you can seem than everyone else if you just read the thing everyone is quoting.

insanitybit · on Oct 6, 2023

Indeed. I have brought this up when people reference "premature optimization is the root of all evil" and mistakenly attribute a false meaning to it. What an approachable paper, what is the excuse for not simply reading it? To boil it down to one quote and then to use that quote to dogmatically advocate for practically the opposite of the point is... sigh.

I've also been banging the "you are using 'IO Bound wrong'" drum for years and still this persists.

HN is a daily example of this. Reading the blog post immediately highlights all of the comments where users seemingly did not.

pjmlp · on Oct 6, 2023

A side effect from those calling themselves enginners without having a proper Software Engineer degree where this kind of stuff is actually taught.

insanitybit · on Oct 6, 2023

And here I am with no degree at all :) I suspect I've read quite a few more papers than your average graduate, however.

taeric · on Oct 6, 2023

I was going to say the same thing. My gut is it exists as strongly in other places, but that we don't have direct exposure to their social zones to see it.

And it isn't like rust is particularly hit by it, either. You'd think people writing PHP or Java hate life, if you only went by what you are likely to see in our social sites.

safety1st · on Oct 6, 2023

The PHP developers I speak to these days seem pretty happy to be using PHP. The ones who fly into a rage over it seem to be people who don't actually use it!

cesarb · on Oct 6, 2023

> The ones who fly into a rage over it seem to be people who don't actually use it!

Well, that does make some sense: if something bothers you so much that you'd "fly into a rage over it", you would avoid using it as much as possible (and PHP is not like JavaScript which has no real alternative).

taeric · on Oct 6, 2023

The annoying thing is more that a ton of the ones that fly into a rage are also ones that have never used it. There is a ton of dog-piling in what people complain about.

cesarb · on Oct 6, 2023

> The annoying thing is more that a ton of the ones that fly into a rage are also ones that have never used it.

Even that does make some sense: if, on a first look, you see something you deeply dislike, you probably will also avoid using it in the first place.

As a personal example, I never learned Go because, when I first looked at it (IIRC, it was when I had to use it to run a Heartbleed detector), I deeply disliked the way it required developers to organize their source code (a single per-language directory mixing all projects together, instead of the per-project directories I have always used); I understand this might have changed later, but in the meantime I invested my time in learning another programming language, and so far haven't found a need to look at Go again.

taeric · on Oct 6, 2023

Ish? You are, of course, more than welcome to not like things. Dislike them, even. Really, you are more than welcome to dog pile on things, freedom and all of that. It is still annoying and generally not a healthy activity, from my point of view.

mcronce · on Oct 6, 2023

Your analogy really hits. I don't mind doing construction in general (I quite like doing it once in a while), but roofing absolutely sucks no matter how you cut it

api · on Oct 6, 2023

I'm pleased with it too. Seriously. When I learned it it took me a bit to get lifetimes and borrows and stuff but once I did I got it and I rarely have to "fight" it.

Running in an IDE hooked to rust-analyzer helps a lot too since you get pretty instant feedback if you get something wrong most of the time. I have this suspicion that at least a few of the people who hate a lot of languages with more complex type systems or large stdlibs (e.g. Go, Java, C#) are trying to edit them in a plain vanilla text editor without these features. This would require you to memorize way too much shit. Why? Modern machines have gigabytes of RAM. Run a language server.

efficax · on Oct 6, 2023

i've written about 30k lines of async rust this year alone and haven't found the horror of Send + Sync + 'static to impact me much at all. You just have to think about things for a second sometimes.

insanitybit · on Oct 6, 2023

TBH most of the time you don't have to think at all. Compiler says "you need those annotations" and you copy/paste them and then you move on. It's really incredible to me that people are acting like this is some extremely onerous process, like writing a dozen characters causes some sort of incredible pain.

VirusNewbie · on Oct 6, 2023

I love this stuff, and while I consider myself informed, i'm not an expert.

However, I've bantered a bit with Scylla's CTO and others about how I would regularly run into issues with Scylla because they didn't use TPC, and I'd see hot keys that would suffer performance because they only let one core manager a subset of keys...

So, while I can't make an authoritative statement, I think TPC is better overall...

jacknews · on Oct 6, 2023

Paints the 'rust community' as quite restrictive, and not a place I want to be, IMHO.