Hacker News
A single line of code made a 24-core server slower than a laptop (pkolaczk.github.io)
617 points by Ygg2 on Dec 31, 2021 | 195 comments

An aside:

> L3 cache is shared by all cores of a CPU.

I recently learned this is no longer true! AMD EPYC processors (and maybe other recent AMDs?) divide cores into groups called "core complexes" (CCX), each of which has a separate L3 cache. My particular processor is a 32-core part where each set of 4 cores is a CCX. I discovered this when trying to figure out why a benchmark was performing wildly differently from one run to the next, with a bimodal distribution -- it turned out to depend on whether Linux had scheduled the two processes in my benchmark to run on the same CCX or on different CCXs.

https://en.wikipedia.org/wiki/Epyc shows the "core config" of each model, which is (number of CCX) x (cores per CCX).

> I recently learned this is no longer true! AMD EPYC processors (and maybe other recent AMDs?) divide cores into groups called "core complexes" (CCX), each of which has a separate L3 cache.

It's still kinda true, just that "a CPU" in that context is a CCX. Cross-CCX communication has shown up a fair bit in reviews and benches, and really in all chips at that scale (e.g. Ampere's Altras and Intel's Xeons): https://www.anandtech.com/show/16529/amd-epyc-milan-review/4 and one of the "improvements" in Zen 3 is that the CCXs are much larger (there's one CCX per CCD, rather than 2), so there's less crosstalk.

And it could already be untrue previously, e.g. the Pentium D was rushed out by sticking two P4 dies on the same PCB; I think they even had to go through the northbridge to communicate, so they were dual-socket in all but physical conformation (hence being absolute turds). I don't think they had L3 at all though, so that wasn't really a factor, but still...

Yup, and that's why overclocking on AMD tests each of your CCXs/cores, marking golden cores that can reach 5GHz+ and slower cores that don't overclock. It's also why AMD bins each CCX, moving worse ones to lower-end products and the best ones to the 5950X.

Then you can tweak the overclock per core and try to milk out performance; PBO tends to be pretty good at detecting which cores can take a slight performance boost. With a few BIOS options, all that adds up without much effort.

And then finally get a better (advanced) scheduler in Windows/Linux that can move work between cores depending on the workload. Windows released multiple fixes for the scheduler on AMD starting with Win10 1903.

I find scheduler mods and modest BIOS tweaks increase performance without much effort; the difference is very detectable.

On Linux I use xanmod (and pf-kernel before that).

On Windows I use Process Lasso and AMD PBO.

Slow/fast cores and scheduler tweaks are also how Win11, ARM Macs, and Android make things appear faster.

Amusingly, scheduler/core tweaks have been around on Linux for a decade+, making the desktop super smooth, but they're only now mainstream in Win11 and ARM macOS.

Does xanmod work as a drop-in replacement in Ubuntu with an NVIDIA card?

I think it does.

There are core complexes in Apple’s M1 variations as well.

And it’s not just L3 that can be shared by various chips.

Each complex also has its own memory bandwidth, so running on two cores in two complexes will get around twice the memory bandwidth of two cores in the same complex.

I have that Pentium D 3.4GHz CPU somewhere. It taught me important lessons: CPU clocks mean nothing if the architecture is botched, be wary of purchasing first-generation compute hardware, and more importantly, don't trust big-name brands blindly.

Not only did I put a hole in my father's monthly budget by paying a premium for it, but I continued to do so for years with the power bills for this inefficient crap.

I remember reading that some politics between Intel's India and Israel teams led to a rushed design blunder; I couldn't find that article now.

> And it could already be untrue previously e.g. the Pentium D were rushed out by sticking two P4 dies on the same PCB, I think they even had to go through the northbridge to communicate

Part of me hopes you're wrong here. That is absolutely absurd.

Yeah you'd hope so wouldn't you?

But Intel got caught completely unaware by the switch to multi-core, just as it had been by the 64-bit switch.

The eventual Core 2 was not ready yet (Intel even had to bridge to it with the intermediate Core, which really had more to do with the Pentium M than with the Core 2, though it did feature a proper dual-core design... so much so that the Core Solo was actually a binned Duo with one core disabled).

So anyway, Intel was caught with its pants around its ankles for the second time, and they couldn't let that happen. And they actually beat AMD to market, having turned out a working dual-core design in the time between AMD's announcement of the dual-core Opteron (and strong hints of the X2) and the actual release, about 8 months.

To manage that, Intel could not rearchitect their chip (and probably didn't want to, as it'd become clear Netburst was a dead end), so they stapled two Prescott cores together, FSB included, and connected both to the northbridge.

It probably took more time to validate that solution for the server market, which is why, where AMD released the dual-core Opterons in April and the Athlons in May, it took until October for the first dual-core Xeon to be available.

Here's the Wikipedia article:


"The Pentium D brand refers to two series of desktop dual-core 64-bit x86-64 microprocessors with the NetBurst microarchitecture, which is the dual-core variant of Pentium 4 "Prescott" manufactured by Intel. Each CPU comprised two dies, each containing a single core, residing next to each other on a multi-chip module package."

They didn't have to go through the northbridge itself, but they had to go over the frontside bus that connects the CPU to the northbridge (and would normally be point to point).

Just as information for later readers: this is applicable not just to Epyc but also to the consumer version, Threadripper. My understanding is that this is an example of Non-Uniform Memory Access (NUMA), which has long been used to link multiple CPUs in different sockets together, but now it's been integrated into a CPU that fits in one socket.

This actually matters if you are running a VM on such a system, since the actual RAM (not L3 cache) is often directly linked to a particular NUMA node. For example, accessing memory in the first RAM stick vs. the second will give different latencies, as it goes ccx1 => ccx2 => stick2 versus ccx1 => stick1. This applies to, I think, 2XXX versions and earlier for Threadripper. My understanding is that they solved this in later versions using the Infinity Fabric (IO die), so now all CCXs go through the IO die.

I ran into all of this trying to run an Ubuntu machine that ran Windows using KVM while passing through my NVIDIA graphics card.

> Just as information for some later readers, but this is applicable not just to Epyc but also the consumer version Threadripper.

It's applicable to any Zen design with more than one CCX, which is... any Zen 3 CPU of more than 8 cores (in Zen 2 it was 4).

The wiki has the explanation under the "Core config" entry of the Zen CPUs, but for the 5000s it's all the 59xx (12 and 16 cores).

Zen 3 APUs are all single-CCX, though there are Zen 2 parts in the 5000 range which are multi-CCX (because why not confuse people): the 6-core 5500U is a 2x3 and the 8-core 5700U is a 2x4.

The rest is either low-core enough to be single-CCX Zen 2 (5300U) or tops out at a single 8-core CCX (everything else).

This applies to Ryzen as well. The first-gen Ryzen was plagued by software and schedulers that couldn't handle CCX affinity.

This sheds some light on why I had so much trouble running GPU passthrough with Unraid (KVM) on my 1950X w/ NVIDIA. Those were dark days.

Frankly, that was probably 200% Unraid...


If this is NUMA, then so is any CPU with hyperthreading, as the hyperthreaded cores share L1.

This kind of [stuff] drives me nuts every time I’m trying to read the performance telemetry tea leaves at work.

Do we have requests that take longer, or did Linux do something dumb with thread affinity?

Pro tip: if you care about performance, and especially if you care about reliable performance measurements, you should pin your threads (and boost their priority). Using isolated CPUs (isolcpus) is the next step.

In other words: if your application is at the mercy of Linux making bad decisions about what threads run where, that is a performance bug in your app.
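For example, pinning from the command line on Linux looks something like this (a sketch; core numbers and binary names are illustrative, and which cores to pick depends on your topology):

```shell
# Pin a process to specific cores at launch
taskset -c 2,3 ./my_server

# Re-pin an already-running process by PID
taskset -cp 2,3 12345

# Stronger isolation: keep the kernel's scheduler off those cores entirely
# via a boot parameter, then place only your pinned threads on them:
#   GRUB_CMDLINE_LINUX="isolcpus=2,3"
```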

How would you do this when running in kubernetes? Afaik it just guarantees that you get scheduled on some cpu for the “right” amount of time. Not on which cpu that is.

I only know that there is one scheduler policy that gives you dedicated cores if you set the request and limit both to equal integer multiples of 1000m.

On post-docker pre-kubernetes time I used `--cpuset-cpus` on docker args to dedicate specific cpus to redis instances, using CoreOS and fleet for cluster orchestration.

> Afaik it just guarantees that you get scheduled on some cpu for the “right” amount of time. Not on which cpu that is.

If you set cpu request == cpu limit (integer values, with the static CPU Manager policy enabled), the container will be pinned to dedicated CPU cores using cpusets [0].

There is also a way to influence NUMA node allocation using Memory Manager [1]

[0] https://kubernetes.io/docs/tasks/administer-cluster/cpu-mana...

[1] https://kubernetes.io/docs/tasks/administer-cluster/memory-m...
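As a sketch, a pod spec that qualifies for exclusive cores might look like this (names are illustrative, and the kubelet must be running with the static CPU Manager policy):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pinned-app            # illustrative
spec:
  containers:
  - name: app
    image: example.com/app:1  # illustrative
    resources:
      requests:
        cpu: "2"              # integer CPU count...
        memory: 1Gi
      limits:
        cpu: "2"              # ...equal to the limit => Guaranteed QoS,
        memory: 1Gi           # eligible for exclusive cores via cpusets
```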

Another reason to put off moving to kubernetes.

We use Kubernetes to host apps running Node.js, and K8s is great in keeping the apps running, scaling the pods and rolling out upgrades. Use the right tool for the job. If you are trying to saturate 24-core server with CPU-bound computation, don't think Kubernetes.

> Do we have requests that take longer, or did Linux do something dumb with thread affinity?


Cross-socket communications do take longer, but a properly configured NUMA-aware OS should probably have segregated threads of the same process to the same socket, so the performance should have increased linearly from 1 to 12 threads, then fallen off a cliff as the cross-socket effect started blowing up performance.

Out of curiosity — was the “bad” mode a case of two separate workloads competing for limited cache within one CCX, or was it really bad cache behaviour across CCXs causing problems because the two processes shared memory?

The two processes were a client and a server; the client just sends HTTP requests to the server over a unix socket, but otherwise they don't share resources. TBH I'm surprised there was so much impact just from that. There might be something more to the story -- I haven't investigated much yet. But basically when I used `taskset` to limit the benchmark to two cores in different CCXs, it ran at about half the speed it did when limited to two cores in the same CCX. I guess another possibility is that the kernel is allowing both processes to swap between cores regularly, which would much better explain the disastrous performance penalty. But, in that case I don't understand the bimodal distribution I see when running the benchmark with no core affinity...
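For reference, the pinning experiment looks something like this (a sketch; binary names are placeholders, and the core-to-CCX mapping depends on the part -- `lscpu -e` or `hwloc-ls` will show which cores share an L3):

```shell
# Assume cores 0-3 are one CCX and 4-7 another (4-core-per-CCX part).

# Same CCX: client and server share an L3
taskset -c 0 ./server &
taskset -c 1 ./client

# Different CCXs: every exchanged cache line crosses the fabric
taskset -c 0 ./server &
taskset -c 4 ./client
```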

L3 being shared is really only common on smaller chips; on any large Intel design the L3 isn’t all in one place, it’s broken up into slices across the cores.

That's still a single logical L3 (e.g. a single core could use the entire L3 capacity).

I've always wondered how much this affects shared cloud instances (for example, a single c5.small instance). Can your neighbor, who shares your L3 cache, be using memory so much more aggressively than you that it causes you to evict your L2/L1 cache? Or is cache coherency maintained if the core that evicted your L3 cache is known to be living in a different memory space?

Coherency will be maintained (since the protocols support that case). But yes, a separate process would evict that memory. From the processor's point of view, they're just addresses tagged with data. Caching behavior doesn't depend on virtual memory addresses, because I believe that info is stripped away at that point.

So if someone is thrashing the cache on the same core you're on, you will notice it if the processes aren't being scheduled effectively.

The contents of the cache aren't stored as part of a paused process or context switch. But I'd appreciate a correction here if I'm wrong.

For an example, consider two processes A and B running on a set of cores. If A makes many more memory accesses than B, A can effectively starve B of "cached" memory accesses because A accesses memory more frequently.

If B were run alone, then its working set would fit in cache, effectively making the algorithm operate from cache instead of RAM.

BUT. You really have to be hitting the caches hard. That doesn't happen too often in casual applications. I've only encountered this on GPUs (where each core has a separate L1 but a shared L2). Even then it's only a problem if every core is hitting different cache lines.

Found some detail - apparently the CPU manuals are the place to check https://cs.stackexchange.com/a/1090

> Can your neighbor, who shares your L3 cache, be using memory so much more aggressively than you that it causes you to evict your L2/L1 cache?

I'm curious. How could the "neighbor" possibly evict your L1/L2 when it is local to you? Worst it can do is thrash L3 like crazy but if your own data is on L1/L2, how would that get affected?

I believe most caches are designed such that a lower-level cache learns whether its copy of memory is still valid or has been changed from the cache level above it, so typically everything in L1 is also in L2 and L3. If the data is no longer in L3, this may cause the L1 to reload the memory (into L3, then L2, then L1) to know that the memory is valid.

And L3 wasn't shared across all cores in 8-core Ryzens since Zen 1: even though it's a single chip, they still have an L3 cache split across CCXs.

It's true for Zen 1, Zen+ and Zen 2, but they changed it with Zen 3. Latency went up from 40 cycles to 47 cycles, but the cache capacity accessible to a single core doubled, and it greatly improved single-threaded performance, since a single thread can now utilize the full CCD's L3 cache.

Do you use numa nodes to tell the kernel about this or something else?

You can enable node-per-CCX mode in UEFI to do that. Otherwise you have to use tools like hwloc to discover the cache topology and pin workloads accordingly.
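For instance (a sketch; exact invocations vary by hwloc version, and the binary name is a placeholder):

```shell
# Which cores share which caches: the CACHE column lists L1d:L1i:L2:L3 indices
lscpu --extended=CPU,CORE,SOCKET,CACHE

# Full topology picture, including the L3 groupings (from the hwloc package)
hwloc-ls

# Bind a workload to the cores under the first L3 domain
hwloc-bind l3:0 -- ./my_workload
```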

> You can enable node-per-CCX mode in UEFI to do that.

For those of us who know basically nothing about UEFI: how do we do that? I have a 5900x, and numactl shows only one node.

Press delete or F1 or whatever at boot and look through the menus for the nodes per socket (NPS) setting. This setting may not exist on desktop systems.

Oh, right. I knew that as the BIOS setup, but I guess my terminology is just out of date. I see the "NUMA nodes per socket" option now. Not quite sure yet if that actually allows having two nodes for the two CCXs on my one socket, but I'll play with it...

Reporting back in case anyone looks at this: I think this option does nothing on my system (ASUS TUF Gaming X570-PRO motherboard, latest BIOS version 4021). The "ACPI SRAT L3 Cache as NUMA Domain" also does nothing. Somewhere I read these options are for 2-socket machines only (even though in concept they'd make NUMA meaningful on 1-socket machines). I'd be interested if anyone finds otherwise.

I also tried passing "numa=fake=32G" or "numa=fake=2" to Linux. That syntax seems to match the documentation [1] but produced an error "Malformed early option 'numa'". Haven't dug into why. I'm not sure it'd correctly partition the cores anyway.

I gave up for the moment.

[1] https://www.kernel.org/doc/html/latest/x86/x86_64/boot-optio...

> I also tried passing "numa=fake=32G" or "numa=fake=2" to Linux. That syntax seems to match the documentation [1] but produced an error "Malformed early option 'numa'". Haven't dug into why. I'm not sure it'd correctly partition the cores anyway.

Oh, because the stock Ubuntu kernel doesn't enable CONFIG_NUMA_EMU.

Underrated and useful reply. Thank you!

This is important when sizing and tuning VMs/hypervisors.

The culprit was a shared atomic reference counter (Arc). The cache invalidation across cores eliminated any kind of scaling via multiple threads.

The solution is to stop sharing. The author had to make changes specific to this codebase for each thread to have its own copy of the offending dataset.

The issue is that Rust doesn't allow static borrowing across thread boundaries (except for globals), which necessitates cloning the Arc. This is why they chose to instead copy the contents and use thread locals (may be wrong - skimmed the article).

I've had this complaint for over a year, and nobody is addressing it. Arcs are absolutely ubiquitous, it can come back and bite anytime, it's just very hard to know upfront where a bottleneck is and most people will never profile to this extent.

I really wish Rust would reconsider the decision to throw its hands up at non-lexical destruction guarantees. This is related to the thread::scoped removal from 2015 (IIRC), after which mem::forget was made safe. Anyway, I'm rambling. There's tons to read for the curious.

The simplest mitigation right now is to clone Arcs only when necessary, and instead use plain references to those Arcs whenever possible (especially in hot paths). It's a mess for refactors and easy to miss, but better than nothing.

> I really wish Rust would reconsider the decision to throw its hands up at non-lexical destruction guarantees. This is related to the thread::scoped removal from 2015 (IIRC), after which mem::forget was added. Anyway, I'm rambling. There's tons to read for the curious.

I think the original argument, that safe code can leak memory by building a reference cycle, is still valid. The language would have to change significantly to prevent that, so either reference counting must be unsafe, or we accept that destruction cannot be guaranteed just because something goes out of scope. That is why `forget` is safe.

The general solution here is to use scopes instead of RAII to ensure destruction. Rust lost its thread::scoped libstd API because that was based on RAII, but there are multiple crates in the ecosystem that provide the lexical scope-based solution.

This is indeed the reason often cited, but it's more nuanced than that. If cycle forming ref types had a lifetime param, you could guarantee destruction at a lifetime level, a weaker but highly useful guarantee. In fact one proposal at the time was to add a lifetime param to Arc and Rc, but it was rejected because it had such big impact and the team was stressed about the 1.0 release.

If you look at the various crates that offer static borrowing across threads today, they use lexical scope and they are perfectly safe and guarantee destruction IFF the inner scopes terminate (which is exactly the guarantee you want). However, they are too restrictive to be useful.

Or use a language with a good concurrent garbage collector.

I don't know why you're being downvoted. Memory reclamation is hard in the face of parallelism. It's much easier to design lock-free algorithms in languages with a garbage collector, because memory reclamation issues are taken care of for you (and often in a batched manner, which won't affect the hot path of your code). There _are_ techniques to avoid using a garbage collector: atomic reference counting, epoch-based reclamation, hazard pointers, and more. But they can be quite difficult to implement and difficult to use.

Rust's Arc is an atomic reference counter. But, as the article describes, those aren't free.

One could argue that using a GC for reclamation isn't lock-free if it's not a concurrent GC.

There are very few GCs that are actually lock-free. While uncooperative mutators cannot block other mutators in an on-the-fly collector, for example, the collector is blocked until all mutators cooperate. Nonetheless I'd guess that a sufficiently concurrent collector is good enough.

Right. Plus GCs can compact memory, which we can't do with manual memory management. I don't understand why people fetishize manual memory management --- do you want to spend your short time on Earth twiddling references?

After 28 years of programming in C and 25 years programming in C++ (on and off), I am sick to death of managing memory. Programmers can't do it right or well. Let the super-intelligent idiot savant with billions of cycles of computation per second and trillions of bits of storage do the thinking for me.

Personally I'm glad the world is moving away from unsafe manual memory management in C++, be that with better RAII and managed pointers, Rust's type system, etc. But those things are still breakable and IMHO, ultimately a big waste of people's time. If Rust's lifetime annotations could be inferred by a compiler, then by all means. But if they can't, just GC already. All that extra thinking pushes people to think about the wrong things, and designs end up ossified, difficult to refactor because of the extra overhead of radically changing ownership. Forget it already! Let computers figure out the whole memory management problem.

Want your program to go fast? Don't allocate a bunch of garbage. Redesign it to reuse data structures carefully. Don't burden yourself forever twiddling with managing the large and small alike, obsessing over every dang byte.

> Let the super-intelligent idiot savant with billions of cycles of computation per second and trillions of bits of storage do the thinking for me.

Unfortunately, it can't. To be fair, GC does manage memory very well, but there is more to memory than handling allocation.

It becomes immediately obvious when you compare Go to Rust. Go strives for simplicity. Compare Go and Rust code, and Rust's overhead for making memory handling safe (type annotations, the type system exploding because it carries information about so many types of ownership) make it horribly complex compared to Go. Go code expresses the same concepts in far fewer tokens, and just as memory safe as the Rust code - provided there is only one thread.

Add a second thread, and the Rust code will continue to work as before. The Go code - well, the GC manages memory allocation only; it does _not_ arbitrate how multiple threads access those objects. Screw up how multiple Go threads access that nice GC-managed memory and undefined and non-deterministic behaviour is the inevitable result. That IMO is the _worst_ type of bug. Most of the overhead Rust imposes (like ownership) has very little to do with managing memory lifetimes. It is about preventing two threads from stomping on each other. Managing memory lifetimes just comes for free.

I suspect the rise and rise of Rust is mostly due to its concurrency guarantees. Go back a couple of decades, when multiple CPUs were almost non-existent, and I suspect the complexity Rust imposes would have been laughed out of the room, given all it really gave you was a very complex alternative to GC - which is what you are effectively saying is all it provides. Nowadays we have $1 computers with multiple ARM cores in a disposable COVID testing kit. Once you've been hit a couple of times by a concurrency heisenbug taking out your products in the field, you are screaming out for some tool to save your arse. Rust is one such tool.

Yeah, I recognize that ownership helps with concurrency, and TBH most of the code I write isn't highly concurrent, especially at a fine-grained level. Nevertheless, I think all that ownership is going overboard to solve what is in effect a niche problem. (After all, 30 years ago, engineers at Sun architected the Solaris kernel with very carefully designed fine-grained locking with little more than C. "Solaris Internals" is still highly recommended.).

There are a lot of other strategies for writing concurrent systems, like, for example, making most data immutable, coarse-grained sharing, shared-nothing multi-process architectures like actors, using race detectors and/or thread sanitizers, using good concurrent libraries (with lock-free hashtables and other datastructures).

The difference between the number of concurrency bugs causing catastrophic failure and all other kinds of bugs causing catastrophic failure has gotta be at least 100 or maybe 1000 to 1. So forcing everyone to think about ownership because maybe they are writing concurrent code (then again maybe they aren't) so that "congrats your memory management problems are solved" seems like a Pyrrhic victory--you've already blown their brain cells on the wrong problem. Worse, you've forced them to bake ownership into the seams in every part of the system, making it more difficult to refactor and evolve that system in the future. [1]

> Screw up how multiple Go threads access that nice GC managed memory and undefined and non-deterministic behaviour is inevitable result.

You get race conditions, you don't get undefined behavior (nasal demons). Go and Java have memory models that ensure your program at least doesn't violate the source type system, and you don't get out-of-thin-air results. You get tearing and deadlocks and race conditions, not undefined behavior. At Google we used TSAN extensively in Chrome and V8, and similar tools exist for Java, and I assume for Go as well.

It's a broader conversation, but I don't think advocating that everyone write highly concurrent programs with shared mutable state, no matter what tools they have at their disposal, is pushing in the right direction. Erlang, Actors, sharing less, immutable data, and finding better higher level parallel programming constructs should be what we focus our and other's precious brain cycles on, not getting slapped around by a borrower checker or thread sanitizer. We gotta climb out of this muck one of these decades.

[1] If you are instead saying that people should think about the design of their system from the beginning so that they don't have to refactor as much, then I agree; but just don't waste time thinking about ownership, or worse, architecting ownership into the system; it will alter the design.

> So forcing everyone to think about ownership because maybe they are writing concurrent code (then again maybe they aren't) so that "congrats your memory management problems are solved" seems like a Pyrrhic victory--you've already blown their brain cells on the wrong problem.

https://manishearth.github.io/blog/2015/05/17/the-problem-wi... argues that "[a]liasing with mutability in a sufficiently complex, single-threaded program is effectively the same thing as accessing data shared across multiple threads without a lock". This is especially true in Qt apps which launch nested event loops, which can do anything and mutate data behind your back, and C++ turns it into use-after-free UB and crashing (https://github.com/Nheko-Reborn/nheko/issues/656, https://github.com/Nheko-Reborn/nheko/commit/570d00b000bd558...). I find Rust code easier to reason about than C++, since I know that unrelated function calls will never modify the target of a &mut T, and can only change the target of a &T if T has interior mutability.

Nonetheless the increased complexity of Rust is a definite downside for simple/CRUD application code.

On the other hand, when a programmer does write concurrent code with shared mutability (in any language), in my experience, the only way they'll write correct and understandable code is if they've either learned Rust, or were tutored by someone at the skill level of a Solaris kernel architect. And learning Rust is infinitely more scalable.

Rust taught me to make concurrency tractable in C++. In Rust, it's standard practice to designate each piece of data as single-threaded, shared but immutable, atomic, or protected by a mutex, and separate single-threaded data and shared data into separate structs. The average C++ programmer who hasn't studied Rust (eg. the developers behind FamiTracker, BambooTracker, RtAudio, and RSS Guard) will write wrong and incomprehensible threading code which mixes atomic fields, data-raced fields, and accessing fields while holding a mutex, sometimes only holding a mutex on the writer but not reader, sometimes switching back and forth between these modes ad-hoc. Sometimes it only races on integer/flag fields and works most of the time on x86 (FamiTracker, BambooTracker, RtAudio), and sometimes it crashes due to a data race on collections (https://github.com/martinrotter/rssguard/issues/362).

Though I can't find it right now, there was a recent paper that did do some effective copying/compaction under a normal malloc/free regime. It did so by noting that once two pages were getting somewhat empty, and the allocations within them wouldn't then overlap, it could use virtual memory to put them in the same physical page. I believe it claimed effective results, though I cannot recall the scale.

What language wouldn't hit this issue? The second you have an atomic write across multiple numa clusters you'll hit this.

Edit: it appears they're talking specifically about the ref counting whereas I was considering the entirety of a shared context. That clarification helps me understand where their statement was coming from

The problem was caused by a memory management primitive. This wouldn't be an issue in a runtime with a GC designed for concurrent loads.

Wood for trees. The problem was caused by an ill-thought-out design. We can do similarly performance-degrading things in GC languages, just the details will be different. However, at some extreme, which, to be fair, most systems won't hit, GC languages perform vastly worse than non-GC languages. In one service I own, Java's G1GC uses 20x more CPU than Rust for an application-specific benchmark. Most of this time is spent in the concurrent phase, so Shenandoah and GenShen aren't going to make a dent (and we can't afford the RAM for Shenandoah). 20x CPU and 2x wall clock. The question we are looking at is, "Should we just continue to spend 20x $ on operating costs for the Java version to avoid writing it in Rust?"

How would the GC avoid the atomic lock and cache invalidation across NUMA boundaries? What language has a sufficiently lockless read/write-capable GC?

Edit: to clarify I was thinking about a mutable context share as shown in the code example, not solely about the ref counting.

> How would the GC avoid the atomic lock and cache invalidation across numa boundaries?

By not using reference counting. State-of-the-art GCs don’t count references. They usually do mark-and-sweep, implement multiple generations, and/or do a few other things.

Most of that overhead only happens while collecting. Merely referencing an object from another thread doesn’t modify any shared cache lines.

> What language has a sufficiently lock less rw capable GC?

Java, C#, F#, Golang.

Yeah I think my confusion was that I was thinking about a fully mutable context shared across threads based on the code example in the post.

But if it's just for the ref counting part of the Arc then I can see how a GC would solve it by not needing the RC

> I was thinking about a fully mutable context shared across threads

A quote from the article: “No locks, no mutexes, no syscalls, no shared mutable data here. There are some read-only structures context and unit shared behind an Arc, but read-only sharing shouldn’t be a problem.” As you see, the data shared across threads was immutable.

However, the library they have picked was designed around Rust’s ref-counting Arc<> smart pointers. Apparently for some other use cases, not needed by the OP, that library needs to modify these objects.

> I can see how a GC would solve it by not needing the RC

Interestingly enough, C++ would also solve that. The language does not stop programmers from changing things from multiple threads concurrently. For this reason, very few libraries have their public APIs designed around std::shared_ptr<> (the C++ equivalent of Rust's Arc<>). Instead, what usually happens is that library authors write in the documentation things like "the object you pass to this API must be thread safe" and "it's your responsibility to make sure the pointer you pass stays alive for as long as you're using the API", and call it a day.

To be fair, anything you can do in C++ can be done in Rust. The language just steers you away from it unless you enter into unsafe.
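As a rough illustration (the type and function names here are made up, not Rune's actual API), the difference between an Arc-taking API and a borrow-taking one looks like this in Rust; for the simple case, no unsafe is even needed:

```rust
use std::sync::Arc;

// Hypothetical context type, purely for illustration.
struct Context {
    data: Vec<u64>,
}

// Arc-based API: every call clones the Arc and bumps the atomic refcount,
// which is where the cross-core cache-line traffic comes from.
fn sum_shared(ctx: &Arc<Context>) -> u64 {
    let local = Arc::clone(ctx); // atomic increment on the shared counter
    local.data.iter().sum()
}

// Borrow-based API in the C++ spirit: the caller guarantees the reference
// outlives the call; no refcount is touched at all.
fn sum_borrowed(ctx: &Context) -> u64 {
    ctx.data.iter().sum()
}

fn main() {
    let ctx = Arc::new(Context { data: vec![1, 2, 3] });
    assert_eq!(sum_shared(&ctx), 6);
    assert_eq!(sum_borrowed(&ctx), 6);
}
```

The borrow-based version pushes the lifetime obligation onto the caller, which is exactly the trade-off C++ libraries make implicitly.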

> To be fair, anything you can do in C++ can be done in Rust.

Technically, all programming languages are Turing-complete. Practically, various things can affect development cost by an order of magnitude. The OP acknowledges that; they wrote "Rewriting Rune just for my tiny use case was out of the question".

Just because something can be done doesn't mean it's a good idea to do that. Programming is not science, it's engineering, it's all about various tradeoffs.

> The language just steers you away

Such steering has caused unique performance issues absent from both safer garbage-collected languages and unsafe C++.

Note the OP was lucky to be able to work around it by cloning the data. If those context or unit objects used a gigabyte of RAM, that workaround probably wouldn't work: too much RAM overhead.

Your comment said that C++ would solve it; I was merely pointing out that Rust can solve it identically to C++ by sidestepping the borrow checker. You can do so without any performance penalties as well, and the code would function identically to the way it does in C++.

> Rust can solve it identically to C++ by side stepping the borrow checker

Doing that is prohibitively expensive in this particular case. It would require patching a large third-party library who uses Arc for the API: https://docs.rs/rune/latest/rune/struct.Vm.html#method.new

And the reason that library uses Arc in the API is unique to Rust.

You're talking about the specifics of this library. As you said in your original comment, that would apply to any C++ library using shared_ptr too across the API boundary.

A different non-GC language wouldn't change things, because you'd have the exact same trade off if the same decision was made.

The only major difference is that Rust pushes you to Arc but C++ doesn't push you to a shared_ptr.

GoLang. It was designed for highly concurrent loads. For an overview of history and technology involved, read e.g.:


Edit: another comment explains that you're likely talking about just the ref counting aspect rather than the entire context sharing used by the rune code shown, in which case, yes I see why a concurrent GC would avoid the issue in that scenario.


I'm familiar with Go's GC. Your linked post doesn't explain how it would avoid the hit mentioned by the cache invalidation across multiple clusters.

It'll either try to put multiple goroutines on a single cluster (as listed in the link) or it'll need to copy the necessary stack per thread, which is effectively what the original article ends up doing.

But if you encounter anything that needs to run concurrently across threads while using a single r/w object, you'll hit the same cliff surely?

While this isn't true with Go's GC, a GC that's stop-the-world can avoid these cache coherency issues altogether. If you pause every thread before marking and sweeping, then you won't run into problems--only the GC is running. While this may sound silly, stopping the world will oftentimes lead to higher throughput than concurrent collectors, at the cost of higher latency (which is why it's not suitable for Go, which is often used for building servers where latency is a priority).

For one, I'd be surprised if Azul's C4 (Zing) didn't have these issues already solved (assuming anyone has solved that, they probably did).

If the only purpose of the Arc was to be a reference count, a language with a (non-refcount) GC wouldn't need that atomic in the first place.

That's fair. It read to me, from his post, that Rune was using it for context sharing as well between threads (since that's the source of the highest level Arc in his code example) . If it's only for ref counts then it makes sense that a concurrent GC could avoid the issue

Awesome TL;DR! I read the full article but this summarizes it well. The article is worth a full read anyway.

Immutable collections save the day!

Immutable collections wouldn't really help here since Rust does not have a garbage collector, so they would still be wrapped in an `Arc<T>`, or use one internally. Unless you're giving each thread their own copy of the immutable collection, in which case the immutability doesn't really add much.

You're right if they're wrapped in an Arc<T>. But it's a lot easier to confidently clone an immutable data structure to threads/async functions because you know they won't be mutated by other cores. If you're cloning mutable data, you'll have to collect each "shard" at the end of whatever runtime loop your program has, and merge them back into the authoritative model. (Or drop the changes, but then why make the data mutable?) Gets hairy.
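A minimal sketch of the "give each thread its own copy" approach (types here are illustrative): since nothing mutates the data, deep-cloning it per thread is trivially safe, and afterwards no refcount or cache line is shared between cores:

```rust
use std::thread;

// Each thread gets its own deep copy of the (logically immutable) data,
// so no cache line is shared between cores during the hot loop.
fn parallel_sums(data: &[u64], threads: usize) -> Vec<u64> {
    let mut handles = Vec::new();
    for _ in 0..threads {
        let local = data.to_vec(); // per-thread copy; safe because nobody mutates it
        handles.push(thread::spawn(move || local.iter().sum::<u64>()));
    }
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    let sums = parallel_sums(&[1, 2, 3, 4], 3);
    assert_eq!(sums, vec![10, 10, 10]);
}
```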

The notion of "lock-free" is fundamentally misleading.

It trips up everybody who hopes "lock-free" will be a magic bullet freeing them from resource contention bottlenecks.

When you have explicit locks, aka mutexes (condition variables, semaphores, what-have-you), the interaction between your threads is visible on the surface. Replacing that with "lock-free" interaction, you have essentially the same set of operations as when taking a mutex. On the up side, overhead cost may be lower. On the down side, mistakes show up as subtly wrong results instead of deadlocks. And, you get less control, so when you have a problem, it is at best harder to see why, and harder to fix.

Because the lock-free operations involve similar hardware bus interactions under the covers, they cannot fix contention problems. You have no choice but to fix contention problems at a higher, architectural design level, by reducing actual contention. Having solved the problems there, the extra cost of explicit mutex operations often does not matter, and the extra control you get may be worth any extra cost.

What is lock-free good for, then? Lock-free components reduce overhead when there is no contention. Given any actual contention, performance is abandoned in favor of correctness. So, if you have mutexes and performance is good, moving to lock-free operations might make performance a little better. If performance is bad, mutexes and lock-free operations will be about equally as bad.
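To make the "essentially the same set of operations" point concrete, here is a hedged sketch: an atomic counter and a mutex-guarded counter produce identical results, and both funnel every update through the same contended cache line:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

// Lock-free counter: each fetch_add still needs exclusive ownership of the
// cache line, so under contention the cores serialize much like with a mutex.
fn count_atomic(threads: usize, per_thread: u64) -> u64 {
    let counter = Arc::new(AtomicU64::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let c = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    c.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    counter.load(Ordering::Relaxed)
}

// Mutex counter: the same serialization point, made explicit.
fn count_mutex(threads: usize, per_thread: u64) -> u64 {
    let counter = Arc::new(Mutex::new(0u64));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let c = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    *c.lock().unwrap() += 1;
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let n = *counter.lock().unwrap();
    n
}

fn main() {
    assert_eq!(count_atomic(4, 1000), 4000);
    assert_eq!(count_mutex(4, 1000), 4000);
}
```

Both versions are correct; which one is faster depends almost entirely on how much actual contention there is, not on the "lock-free" label.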

I don't see anything more misleading about lock-free than about anything else involving parallelism. Lock freedom has to do with guarantees about scheduling: you can be lock-free and still have contention, but a lock-free algorithm is guaranteed to make progress within a bounded amount of time, whereas a blocking algorithm is not guaranteed to make any kind of progress within a bounded amount of time.

If you're working on a real-time system, and I don't mean the colloquial meaning of that term but rather a system that must guarantee a side effect is produced no later than a given deadline, then you must use a lock-free algorithm; there is no alternative. Avionics, DSP, high-frequency trading: these are all domains where lock-free algorithms are necessary. Making a fast lock-free algorithm is great, and the goal of any kind of parallelism is generally performance, but "fast" is not an unambiguous term. It can mean low latency or high bandwidth, and it's very unlikely you'll write an algorithm that is both.

Lock free algorithms are great when you need low latency and guaranteed throughput. If you don't need that then it's possible to trade those properties for much higher bandwidth and use blocking algorithms.

If you are counting on lock-free operations to guarantee you will meet your real-time scheduling requirements, you have chosen to fail, right out of the gate. Guaranteeing met deadlines takes engineering. Using lock-free operations is not a substitute for engineering. There is no such substitute.

Also: All this has nothing to do with parallelism. Parallelism is about things happening at the same time. Locking or not locking, and contention, are about concurrency, which is about synchronization and serialization, a different topic.

If you are using a mutex to guarantee real-time scheduling requirements, you have failed.

Furthermore if you think a discussion about benchmarking an algorithm on a 24-core SMP system involving atomic reads, writes and shared caches has nothing to do with parallelism then let's just agree to disagree with one another and I wish you all the best in your programming endeavors.

People can decide for themselves what credibility they wish to lend to your argument given your perspective on this matter.

. . . I don't even need to say anything here.

It's best that you don't.


RTOS programming tends to make extensive use of semaphores, often to build queues.

(Commenting here since the other thread is going nowhere)

To clarify, the main property of a non-blocking algorithm is that it's exclusively composed of simpler, smaller ("atomic") operations that demonstrate time-bound/progress-making guarantees, in turn making the bigger algorithm generally easier to reason about (Note "generally").

The consequence of this argument is twofold:

1. It's entirely possible to make a blocking algorithm provide the same guarantees as a non-blocking algorithm, as long as the underlying operations (blocking or otherwise) are proven to be correct. Blocking algorithms tend to throw other complex systems into the mix (e.g. the scheduler), and all of the dependencies and their usages have to be correct (w.r.t. whatever property we desire) to show that the final blocking algorithm is also correct. For instance, it can be argued that the priority inversion problem stems from the scheduler's lack of correctness, or alternatively, the incorrect use of the scheduler. However, in the absence of such problems, there are effectively no disadvantages to blocking algorithms in this respect.

2. If it turns out any atomic operations below the non-blocking algorithm do not meet the expected guarantees, then the non-blocking algorithm itself also becomes unsafe. I think the OP clearly demonstrates this problem--the CPU atomic instructions are but some kind of hardware "locks" (either a LOCK# pin on older processors or some complex cache coherency protocol, but this is an implementation detail) on their own that can sometimes fail to meet the programmer's expectations, leading to surprising results.

In short, the fine line between blocking algorithms and nonblocking algorithms can be drawn by whichever smaller operations or components we assume to be correct.

Unfortunately I think what this comment does is muddy the waters even more. All "lock free" means is that one process(+) may not cause any other process to block during execution.

"Contention" is not very meaningful to a lock-free algorithm, since lock-free algorithms typically do not permit shared data access semantics. Rather they describe how data is sent between processes, which is why a lock-free solution looks very different than something behind mutexes.

But to the heart of your point, I do wish people spent more time understanding that lock-free vs lock-ful is a distinction in semantics and doesn't imply anything about performance. There are subtle trade offs.

(+) I'm using "process" with a lot of hand waving, assume it means whatever your atom of parallelism is

You are confusing two fundamentally different topics. Anywhere there is no possibility of contention for access to a resource, "lock-free" is trivially meaningless.

> The fact that our reference counters are atomic is an additional problem that makes things even more complex for the processor. Although using atomic instructions is often referred to as “lockless programming”, this is slightly misleading – in fact, atomic operations require some locking to happen at the hardware level. This locking is very fine-grained and cheap as long as there is no congestion, but as usual with locking, you may expect very poor performance if many things try to fight for the same lock at the same time. And it is of course much worse if those “things” are whole CPUs and not just single cores that are close to each other.

Contention is the devil, and you can never hide from it if you bring it into your life. The fastest "contention avoidance hack" would be the CAS operation as mentioned above, which still performs approximately 2 orders of magnitude slower than a single-threaded operation would in the most adverse scenarios.
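A sketch of the CAS pattern being referred to, here as a lock-free running maximum (the function name is made up): under heavy contention, each failed compare-exchange costs another round trip for cache-line ownership:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Classic CAS retry loop: lock-free update of a shared maximum. Each failed
// compare_exchange means another core won the cache line first, and we retry.
fn atomic_max(slot: &AtomicU64, value: u64) {
    let mut current = slot.load(Ordering::Relaxed);
    while value > current {
        match slot.compare_exchange_weak(current, value, Ordering::Relaxed, Ordering::Relaxed) {
            Ok(_) => break,
            Err(observed) => current = observed, // lost the race; retry with the fresh value
        }
    }
}

fn main() {
    let slot = AtomicU64::new(5);
    atomic_max(&slot, 3); // smaller value: no store happens
    assert_eq!(slot.load(Ordering::Relaxed), 5);
    atomic_max(&slot, 9);
    assert_eq!(slot.load(Ordering::Relaxed), 9);
}
```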

If you aren't sure of which problems actually "fit" on a modern x86 thread these days, rest assured that many developers in fintech have been able to squeeze entire financial exchanges onto one:


For most scenarios, you really don't need all those damn cores. They are almost certainly getting in the way of you solving your problems.

Another interesting example is audio synthesis. It is interesting because there are at least two models for audio synthesis: single sample processing (as used in, for example, VCV Rack) and block structured processing (as used by every plugin API and thus every DAW-that-runs-plugins). But it gets even more interesting because the degree of potential parallelism is entirely controlled by the signal routing pathways chosen by the user. If the software is processing N independent signal pathways that do not meet until a final output stage, you have a high (well, "N") level of potential parallelism, and if N can expand to a high value then you will always be able to benefit from more cores.

However, if the signal pathway is more convoluted (no pun intended for my audio nerd crew), then not only does the degree of parallelism decrease but the issue of single-sample vs. block structured can become much more important. In the single sample case, the cost of computing a single sample has (relatively) enormous fixed overhead if the code has any serious level of modularity, but is still very cheap compared to inter-thread and inter-core synchronization. Spreading the computation across cores will frequently, but not always, slow things down quite dramatically. By contrast, block structured processing has a much lower per-sample cost, because the function call tree is invoked only once for an entire block of samples, and so the relative cost of synchronization decreases.

This sounds like a slam dunk for block structured processing.

Problem is, there are things you cannot do correctly with block structured processing, and so there is always a place for single sample processing. However, the potential for parallelization and the level at which it should take place differs significantly between the two processing models, which opens the door to some highly questionable design.

The short version of this is that audio synthesis in the abstract can always expand to use any number of cores, particularly for cpu-intensive synthesis like physical modelling, but that in the presence of single-sample processing, the cost of synchronization may completely negate the benefits of parallelism.

To add onto this very good explanation from Paul, in practice, audio processing graphs always rapidly decay into serial processing after starting out massively parallel. Most of the time, we are writing to a single output (i.e. speakers via your audio interface or a file). On top of that, some of the heaviest weight DSP happens on this final output stage (mastering chains with multiband compressors, linear phase equalizers, limiters etc.) So every 5.3ms (256 sample buffer @ 48kHz sample rate) we start massively parallel processing all the leaf nodes (audio tracks with sound files, virtual instruments and synths) and end up bottlenecking as the tree collapses into a line. Then we are stuck doing some of the most CPU intensive work on a single core since we can’t process the limiter DSP plug-in until the EQ finishes its work, for example.

We need to meet our real-time deadline or risk dropping buffers and making nasty pops and clicks. That mastering stage can pretty easily be the limiting (hah) step that causes us to miss the deadline, even if we processed hundreds of tracks in parallel moments before in less time.

The plug-in APIs (AudioUnits, VSTs, AAX) which are responsible for all the DSP and virtual instruments are also designed to process synchronously. Some plug-ins implement their own threading under the hood but this can often get in the way of the host application’s real-time processing. On top of that, because the API isn’t designed to be asynchronous, the host’s processing thread is tied up waiting for the completed result from the plug-in before it can move on.

Add on that many DSP algorithms are time-dependent. You can’t chop up the sample buffer into N different parts and process them independently. The result for sample i+1 depends on processing sample i first.
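A tiny example of such a time-dependent algorithm, sketched as a one-pole lowpass filter (the coefficient and names are illustrative): the serial dependency on the previous output sample is exactly what prevents chopping the buffer into chunks:

```rust
// One-pole lowpass: y[i] = y[i-1] + a * (x[i] - y[i-1]).
// Each output sample depends on the previous one, so the buffer cannot be
// split into N independent chunks and processed on N cores.
fn one_pole_lowpass(input: &[f32], a: f32) -> Vec<f32> {
    let mut y = 0.0f32;
    input
        .iter()
        .map(|&x| {
            y += a * (x - y); // serial dependency on the previous y
            y
        })
        .collect()
}

fn main() {
    let out = one_pole_lowpass(&[1.0, 1.0, 1.0], 0.5);
    assert_eq!(out, vec![0.5, 0.75, 0.875]);
}
```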

Great write up!

What I like about Pd is that you can freely reblock and resample any subpatch. Want some section with single-sample-feedback? Just put a [block~ 1]. You can also increase the blocksize. Usually, this is done for upsampling and FFT processing. Finally, reblocking can be nested, meaning that you can reblock to 1024 samples and inside have another subpatch running at 1 sample blocksize.

SuperCollider, on the other hand, has a fixed global blocksize and samplerate, which I think is one of its biggest limitations. (Needless to say, there are many things it does better than Pd!)


In the last few days I have been experimenting with adding multi-threading support to Pd (https://github.com/Spacechild1/pure-data/tree/multi-threadin...). With the usual blocksize of 64 samples, you can definitely observe the scheduling overhead in the CPU meter. If you have a few (heavy-weight) subpatches running in parallel, the overhead is negligible. But for [clone] with a high number of (light-weight) copies, the overhead becomes rather noticeable. In my quick tests, reblocking to 256 samples already reduces the overhead significantly, at the cost of increased latency, of course.


Also, in my plugin host for Pd/Supercollider (https://git.iem.at/pd/vstplugin/) I have a multi-threading and bridging/sandboxing option. If the plugin itself is rather lightweight and the blocksize is small, the scheduling overhead becomes quite noticeable. In Pd you can just put [vstplugin~] in a subpatch + [block~]. For the SuperCollider version I have added a "reblock" argument to process the plugin at a higher blocksize, at the cost of increased latency.

"For most scenarios, you really don't need all those damn cores. They are almost certainly getting in the way of you solving your problems." This rings true. I still think the most promising multi-threading strategy is to, by default, keep everything on one core and if you find some expensive and more or less independent workload to put that in another thread. This is the old-fashioned way though. All of the cool kids will push for thread pools where every workload will jump from core to core and back again.... This potentially can require a lot of locking, though, which also is not good for performance.

It's an interesting reflection then on the Actor paradigm where you are encouraged to create large numbers of fine grained tasks and toss them into a giant thread pool blender because light weight threads are "free" now. It would be quite interesting to see the consequences of thread affinity loss in that setting.

I believe Rust is pulling on the right thread here but I have no idea how this will look in fifteen years. At some point we will need to be able to tell that two threads don’t contend and thus can be “farther away” from each other, but these threads are constantly communicating and need to use the same complex and/or memory banks.

I think it could be something built upon lifetime analysis. I think a sufficiently complex compiler could use the data available already today to infer thread affinity and other rules.

Basically what I was thinking. In the near term it doesn't even have to be that precise to have a benefit (Rust builds on escape analysis that helped with distributed garbage collection, even when it wasn't that exhaustive). Even ranking threads by suspected degree of interaction is a lot of data.

These definitely don't talk. Offload. These definitely talk. Colocate. These probably talk. Bin pack.

Games are a prime example of systems which must take advantage of all possible hardware resources.

Games similarly struggle with multi core/ccx designed chips.

Battlefield 4 was famously terrible on Ryzen when AMD first released it.

To be fair, Ryzen mostly underperformed when it was released, which was not AMD's fault. When for years engineers used Intel as their benchmarking and testing machines and tweaked all sorts of parameters to measure against Intel only, you inevitably end up with an optimized implementation that resembles the hardware details of that chip in some way. Any new architecture would perform badly, except a perfect clone of the original testing hardware.

I wasn't implying it was AMD's fault, but the idea that the games were optimized for Intel doesn't hold true IMHO, because they were also built for consoles that were AMD.

The majority of the performance gap was down to not accounting for CCX scheduling.

The fact that we believe this is probably part of the problem with game development today.

This article is a well-written deep dive on parallel performance debugging that ends up identifying the difference being due to how caching is handled on multi-CPU vs single-CPU systems. Thanks to the author for taking time to write up their troubleshooting process description.

Agreed. I appreciate the author adding the small effort to briefly describe the idiosyncratic pieces (e.g., Rune) in their unique problem space. So often, writers believe their readers all play in a similar problem space. So they elect to leave understanding such idiosyncrasies for the reader to either dig deeper on their own, or self-select out of reading altogether.

Would this tool have scaled as well as 24 separate OS processes?

I remember hitting this same issue many years ago. Some primitive in the library (C++) didn't on the surface appear to involve shared state and a cache write lock, but of course it did if you thought about it. Since I switched over to more application-level coding I don't work that low level very much and tend to scale with processes now (typically not needing that much parallelism anyway), so I don't often have to think these things through anymore.

The problem outlined in the article is not with process vs thread, the problem is with multiple parallel execution contexts reading and writing the same cache line. This issue would have manifested in the same way if your separate processes touched the same offset in a shared memory mapping.

Of course, designing with multiple processes in mind helps to design a shared-nothing architecture, as the costs of shared state between processes are much more explicit. But any kind of state that needs to be shared will run into the same bottleneck, because all forms of shared IPC fundamentally have to use some form of locking somewhere in the stack to serialize incoming inputs.

I think the intuition on processes vs threads was right:

1. The Arc use is a case (I believe! I'm not a Rust'er!) of true sharing of references. The cache line will be contended across cores. Doing processes would have eliminated true sharing. Tracking logical core affinity for references via a modal type system could have flagged this at compile time, but I suspect that's not a thing in Rust land.

2. Even if it was false sharing at the cacheline level, process abstractions in shared memory systems are generally aligned to avoid false sharing at cacheline granularity (and misaligned ops at smaller granularities). Of course, in the extreme, a 10X hero programmer may have tried to do amazing levels of bit packing and syscall hackery and thus break all those niceties, but that's the cleanup everyone else on the team has to do to make the hero a 10Xer.
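For what it's worth, the usual false-sharing countermeasure looks something like this in Rust (the 64-byte figure is an assumption; cache-line sizes vary by CPU):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Pad each per-thread counter out to its own 64-byte cache line so that
// updates from different cores never invalidate each other's lines.
#[repr(align(64))]
struct PaddedCounter(AtomicU64);

fn main() {
    // One counter per worker; adjacent counters land on distinct cache lines.
    let counters: Vec<PaddedCounter> =
        (0..4).map(|_| PaddedCounter(AtomicU64::new(0))).collect();
    assert_eq!(std::mem::size_of::<PaddedCounter>(), 64);
    counters[0].0.fetch_add(1, Ordering::Relaxed);
    assert_eq!(counters[0].0.load(Ordering::Relaxed), 1);
}
```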

Probably yes, unless the author went to great lengths to contend on shared memory segments.

This is an excellent writeup, but I'm curious about how dependent the microbenchmark is on the scripting engine doing no real work. My off-the-cuff guess is that the unshared version might still perform better with other workloads (it would have higher throughput and avoid variance caused by fluctuating degrees of cache contention) but it might be that the observable overhead from hammering a single cache line would disappear if each thread had more varied work to do. And of course if the scripting engine benefits substantially from caching internal state across invocations (things like incrementally optimized JIT code) then having 24 instances can incur substantial extra work.

Is there any tooling that can demonstrate cache coherency hotspots like cache line bouncing or inter-core communication?

Yes. Linux perf has a c2c hit modify “mode”/“option” that will report precisely this issue.

Perf c2c is extremely useful. Especially when using it with call graphs. Seeing which call stack interacts how with a cache line can make it a lot easier to find solutions to contended cache lines.

What is c2c? Thanks.

c2c stands for cache-2-cache.

perf-c2c allows you to measure cacheline contention - see https://man7.org/linux/man-pages/man1/perf-c2c.1.html and https://joemario.github.io/blog/2016/09/01/c2c-blog/.


Cache to cents? Kernels 2 cents on how your code is utilizing the cache? I have 0 clue too.

Cache to cache.

It tracks a bunch of cache-related counters, and HITM (hit modified) tracks loads of data modified by either another core of the same node ("local hitm") or by another node entirely ("remote hitm").

Here, the contention would have shown up as extreme amounts of both, as each core would almost certainly try atomically updating a value which had just been updated by another core, with better-than-even chances the value had been updated by a core of the other socket (= node).

I've used the Intel VTune Profiler, I think it's free. It's a pretty good tool for profiling code and gives you quite a lot of information on how you're using the core pipelines, but you need an Intel processor to use it. I think it would have quickly flagged this kind of problem and told you that the processor was stalled waiting for cache.

VTune is free now. It's by far the best profiler I've ever come across; AMD's equivalent is a bit rough-and-ready by comparison.

Until this moment I didn't realize that it also became available for Linux.

All of the "serious" profilers support this. Intel have a bunch of tooling for diagnosing threading issues, AMD also have some less polished but still functional metrics you can use.

Do they track it down to the actual problem point in the code? Or do you just see that maybe there’s a problem and then experimenting with changes to figure out what needs to be fixed?

Short answer: yes. Much better than perf does.

> Because there are 4 cores, the throughput increases linearly up to 4 threads. Then it increases slightly more up to 8 threads, thanks to hyper-threading that squeezes a bit more performance out of each core. Obviously there is no performance improvement beyond 8 threads, because all CPU resources are saturated at this point. [emphasis mine]

Can someone more knowledgeable than I please explain the obvious part of this? Why is the performance identical for 8-12 threads? What is it that is saturated at 8 threads even though there are 4 more threads hanging around?

Note that it is a 4-core CPU which employs simultaneous multithreading[1] (hyper-threading in this case) to achieve a higher throughput while executing independent instruction streams. This technique can provide multiple hardware threads to the OS to schedule its processes on (typically 2 per physical core), and the CPU takes care of executing them as concurrently as possible on a single core (typically by employing a few hardware enhancements within the core so that it can work with multiple instruction streams in a single clock cycle).

At first, the throughput up to 4 threads increases linearly because 4 OS threads can utilize 4 hardware threads independently, making the greatest possible use of available hardware resources. Beyond 4 OS threads, simultaneous multithreading comes into play, and up to 8 OS threads get scheduled on 8 "simultaneously multithreaded" hardware threads, which offers increased throughput, but not as drastically (see how the author uses the word "slightly"). Beyond 8 OS threads, the throughput will not increase, as you are out of hardware threads that actually execute your instruction streams independently. The OS can spawn an arbitrary number of threads beyond 8, but they will take turns executing on your 8 available hardware threads, with no gain in net throughput. You get limited by the hardware.

[1] https://en.wikipedia.org/wiki/Simultaneous_multithreading

The simplest explanation I've heard is that hyperthreading is like using two hands to keep the mouth always full(whereas with only one sometimes it remains idle). Once you can keep it always full, there's no use adding more hands.

POWER has 8-way SMT. More than 2-way SMT is definitely useful in some cases, but it does not exist on any x86 chip, so it doesn't matter if you try to use more threads in software. The hardware can only handle 2 per core.

The CPUs can only run two threads per core at a time. Therefore, it doesn't matter if you make more software threads; only 8 will actually run at a time.

All CPU resources aren't actually saturated, as there will still be idle execution units, but since the CPU can't actually dispatch another thread to make use of those units, there's nothing you can do about that.

In situations like this, it’s valuable to know about all the event types perf can sample beyond CPU cycles, including cache misses at various levels and branch mis-predictions. These can go right into flamegraphs as well.

Interesting when you start to compare with the C++ stdlib, which only has shared_ptr (the equivalent of Arc) but no single-threaded equivalent of Rc. Usually the argument you hear is that the performance difference is negligible.

One could also argue that you shouldn't use any kind of refcounter in a tight loop where it can affect performance as much as it did there. Refactoring the code a bit to allow passing by reference there would likely help even more. Still cool to see that just changing to Rc gave that much gain.

If I understand this correctly, the problem is strictly contention on the cache lines containing the reference counts for the shared data. The fact that the data itself was shared is fine, it's just the reference counts that are a problem, yes?

Instead of unsharing the whole thing, I wonder if it would make more sense for Rune's Vm type to just re-wrap the Arcs in a new reference-counted type. I don't know if Rune spawns new threads internally, if not then it could store these as Rc<Arc<…>>, but if so then it could use Arc<Arc<…>>. This may look odd but it would ensure that every Vm gets its own independent reference count to use for all of its internal cloning. Or it could redesign its internals to avoid doing so much cloning of the pointers, instead using borrows, though I don't know anything about the Rune API so I don't know how viable that is.

In any case, the basic takeaway here should be that using Arc is still potentially expensive if you end up cloning it often (and it looks like Rune does; in a quick skim it's cloning the runtime context and unit quite often). Arc should be used just to deal with data lifetime issues, but cloning of Arc should be avoided as much as possible.
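A hedged sketch of that re-wrapping idea (the function is hypothetical, not Rune's API): pay the atomic refcount bump once per thread, then do all the hot-loop cloning on a thread-local Rc, whose refcount is a plain non-atomic integer that never bounces between cores:

```rust
use std::rc::Rc;
use std::sync::Arc;

// Each thread receives one Arc clone up front, then re-wraps it in an Rc.
// Cloning the Rc in the hot loop only touches a core-local, non-atomic count.
fn busy_loop(shared: Arc<Vec<u64>>) -> u64 {
    let local: Rc<Arc<Vec<u64>>> = Rc::new(shared); // one atomic op, done once
    let mut total = 0;
    for _ in 0..1000 {
        let handle = Rc::clone(&local); // non-atomic increment, stays on this core
        total += handle.iter().sum::<u64>();
    }
    total
}

fn main() {
    assert_eq!(busy_loop(Arc::new(vec![1, 2])), 3000);
}
```

Since Rc is !Send, the compiler also enforces that this inner handle never leaks to another thread, which is what makes the cheap count safe.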

Shared-nothing dataflow models are an interesting way to engineer (some types of) parallelism.

FWIW this is the kind of content I originally came to HN for and what keeps me coming back.

This one is quite surprising. Typical multithreaded/async Rust code contains lots of atomically reference counted variables. And while it’s somewhat common sense that it’s not perfect from a performance point of view (and shared-nothing frameworks like seastar/glommio try to avoid it), I’ve never seen a report of such a massive degradation before. I'm wondering what kind of pathological case is at play here.

Does anyone have any idea where those format strings come from?

As far as I know rust doesn't have them.

And as far as I know, if Rust gets format strings it will most likely get them in the form of an `f` prefix (like raw strings have an `r` prefix and byte strings have a `b` prefix).

Is it just a thing to make the code more readable/shorter on the web?

The code shown in the article isn't Rust, but Rune [0], which has similar syntax, but is a dynamic scripting language. It has template literals [1].

[0] https://rune-rs.github.io/

[1] https://rune-rs.github.io/book/template_literals.html

Oh, that makes sense.

If you are talking about "Here is an example of a complete workload that measures performance of selecting rows by random keys", then you are looking at Rune code rather than Rust code.

See https://rune-rs.github.io/book/template_literals.html

HN doesn't use Markdown.

Is this the kind of situation where assigning a core affinity to the various threads would have made a difference (to make sure they all run on the same physical processor), or is the mere presence of another processor enough to trigger the behavior shown in the article?

I don't think core affinity would have helped in this situation. They had one instance being shared across all cores, which required the reference count to be kept in sync between sockets at the L3 cache level. They may have seen some benefit if they had, say, 2 instances, one per CPU socket, and dedicated an instance per socket to avoid cross-socket cache interactions. But in this case it is far simpler, cleaner, and better performing to not share the instance and avoid the reference counting entirely.

No, that would not make a significant difference here. The problem is that the same line has to bounce around all of the threads that share it; pinning doesn’t reduce that very much on this time-scale. Thread migrations will be very infrequent relative to atomic operations, especially when the number of runnable threads is lower than number of physical cores.

Core affinity would only help if he didn't have more threads than clusters. Otherwise he'd either hit the wall his laptop did, or cross the cluster and hit the cache issues

It seems to me that one solution would be to be able to shard an Arc into separate counts for each thread, so that the synchronisation wouldn’t be required (?) but I don’t know how it would be done without making most Arcs unnecessarily expensive.

Perhaps another problem is that libraries can’t be (or just aren’t) generic about the kind of smart-pointers that are used but the kind of smart-pointers used can matter a lot (e.g. libraries often have to use Arc for some of their clients even though Rc would often suffice).

But maybe I’m just totally wrong about this?

It's a great idea and in fact a variation of it is being explored by Python to eliminate the use of the GIL without introducing performance regressions.

The variation is instead of one count per thread, have two reference counts where one is atomic and the other is not atomic. The heuristic is that at any given time, the majority of writes are performed within a single thread and so that thread has exclusive read/write access to the non-atomic reference count, all other threads share the atomic reference count. There is some work needed to coordinate the atomic and non-atomic reference count so as to test when an object can be deleted, but the overhead of this work is less than the cost of always using an atomic.


In the case of the original article, the reference count was being updated by all the threads so having an Arc-with-Rc-for-one-thread wouldn’t actually change things, I think. But maybe you just mean for ‘cheaper smart-pointers in general’?

The python thing is interesting. I thought there were other reasons for the GIL than ref counts but I guess everything must be addressed.

Yep in this case it might not have helped, but all this to say that some exploration of having multiple reference counts is a great idea. The balance becomes the overhead of coordinating multiple reference counts versus the gains from reduced atomic operations.

Relevant video - Scott Meyers: CPU Caches and Why You Care


Remember, boys and girls, small mistakes can have big consequences.

Keep your data in immutable data structures when multi-threading. The actor model also helps.

My laptop (i7-8650U) has slightly faster single-core performance than my 96-thread dual 7402 EPYC Milan home server. I'm fine with that. I'm not fine with undue inefficiencies, i.e., excessive cache misses, branch mis-predictions, and unparallelization due to sloppy or unfortunate code.

The mantra I live by is that multi-threaded programming is so potentially complex and so easy to screw up (e.g. erasing all your supposed performance gains, as was the case in this article, introducing subtle bugs, or worse, security flaws) that you should do absolutely everything you can to avoid it.

IME C++ programmers seem to put an unreasonable amount of faith in RAII to "solve" these problems, e.g.:

    std::mutex m;
    {
      std::lock_guard<std::mutex> lg(m);
      // do stuff with the mutex held
    } // lock_guard destructor releases mutex
If you're just updating a (private) map or something, this is fine. But as soon as this gets complex (e.g. you call a function and lose control over what thread is executing because some deep code creates a new thread), then this all turns to crap.

So whenever anyone tells you "just start a thread" without any hesitation, it's reasonably safe to assume they have no idea what they're talking about.

This is one reason I love the async code model used in a number of languages. I have the most experience with Hack [1] but it's not unique to that. What I tend to think of as "product" code (which is a real Facebook-ism) never needs to create a thread or deal with any concurrency issues.

Now if you're programming an HTTP server or an RDBMS or other infrastructure then sure, you have a valid need. But if you're just serving HTTP requests, avoid it entirely if you can, because any performance gain is probably imaginary but the complexity increase isn't.

[1]: https://docs.hhvm.com/hack/asynchronous-operations/introduct...

Isn't that mantra a bit too pessimistic? After fixing the performance issue, a speedup that was almost linear in the number of cores was achieved using multithreading.

In addition, async code can bring much of the same challenges as multithreading. For example, in C# tasks may run on the current thread, but they may also run on a thread-pool, and this is mostly transparent to the programmer. When you don't immediately await all asynchronous operations your tasks may end up running at the same time.

Like anything you weigh up the upside against the downside.

The downside is more complex code, code that's harder to reason about, the possibility of concurrency bugs, slower development times, the possibility of security flaws and you may just be shifting your performance bottleneck (eg creating a hot mutex).

The upside is lower latency, more throughput and higher QPS.

My argument is quite simply YAGNI. A lot of these benefits are for many people purely imaginary. They're the very definition of premature optimization. Code that is less complex, faster to develop in and likely "good enough" is probably going to help you way more than that.

If you get to the scale where heavy multithreading actually matters, that's a good problem to have. My argument is that many people aren't there and I would hazard a guess that a contributing factor to not getting there is worrying about this level of performance before they need to.

>"But if you're just serving HTTP requests, just avoid it entirely if you can because any performance gain is probably imaginary but the complexity increase isn't"

I serve HTTP requests offering JSON-based RPC in my C++ servers that do all kinds of calculations (often CPU-consuming) in multiple threads. Works just fine and scales well. Yes, I have to be careful about how I work with shared data, but it is centralized, encapsulated, and done in a single place I call the request router / controller. I reuse it among projects. Nothing too exciting about it and no reason to be scared to death ;)

Don't call functions (that aren't part of the C++ standard library) inside the lock guard. It's not a hard rule and you'll get 99% of the way there.

This model is good for I/O heavy code that is typical for most web services. However if your load is a mix of I/O and compute something like Go should do a lot better.

Not sure why you're being downvoted because you raise a valid point: Go has a concurrency model aimed at avoiding many of these problems. This is of course the channel abstraction.

Go has other issues. My first piece of advice to any new Go programmer is to make every single one of your channels unbuffered. This way it operates a lot like the async model I mentioned and it tends to be easy to reason about.

But people fall into a trap of thinking adding a buffer will improve concurrency and throughput. This might be true but most of the time it's a premature optimization. Buffered channels also often lead to obscure bugs and a system that's simply harder to reason about.

It's really better to think of Go as a tool for organizing concurrent code, not necessarily a direct solution to the problems involved. But organization matters.

All concurrency tools are foot guns but go seems to be the safest foot gun :)

That actually makes me appreciate that the single-threadedness of Python might be a good thing, since Python uses refcounting.

CQL is a very confusing term to pronounce.

Yes! Finally, HN removed the rule to strip the word "How" from the title.

A few hours later...

TL;DR: serializing your parallel work dispatch on an atomic global counter is bad for scaling across many cores.

Or, per-cpu counters exist for a reason.

Can anyone TL;DR it? A single line, in Rust, can make a 24-core server slower than a laptop? What line is that?

In a multithreaded context, data was shared between threads via atomic reference counting (Arc). The atomic refcount updates acted as a single bottleneck for every operation. The solution was to derive deep cloning for the shared data (a one-line fix), giving each thread its own copy and removing the atomic contention.

There are only two hard things in Computer Science: cache invalidation and naming things. - Phil Karlton

Better version:

There are only two hard things in Computer Science: cache invalidation, naming things, and off-by-one errors.

There are only two hard things in Computer Science: cache invalidation, naming things, and off-by-oly two hard things in Computer Scivetëm� dy gjëra të vështira në Shkencën Komp���jute

Now, just toss in a byte order mark, and I think we've a winner!

Hehe. Ok. That is very funny :)

I don't get it -- am I slow?

This used to happen to me in C. Basically, you're trying to print a string that has overflowed its allocated length, and you get junk printed to the screen. C strings are null-terminated, so the system will keep printing until it encounters a null. So bounds checking and buffer overflows are also hard problems.

Buffer overflow :)

Memory corruption due to off-by-one errors.

Gotta watch out for those nulls :)

Nulls are easy compared to invalid non-null pointers.

People say null in Java was a mistake, but I've never had any issues in my null-only dialect of Java /s

There are only two hard problems in Computer Science: cache invalidation, naming things, off-by-one errors, and remembering to update the documentation.

Better better version:

There are only two hard things in Computer Science: cache invalidation, and…what did you ask me again?

There are actually only two hard problems in computer science:

0) Cache invalidation

1) Naming things

5) Asynchronous callbacks

2) Off-by-one errors

3) Scope creep

6) Bounds checking

There is only one hard problem in computer science: people.

You forgot text encodings, endianess, ...

Is 4 uninitialized variables?

Most languages will do bounds checking for you. I reckon that one's mostly solved at this point.

But... but... but... we HAVE to use C, it's the only language that's fast enough!


Too slow.

You missed the off-by-one errors

So you mean I was off by one?

That code? Thread.sleep(5000)

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact