> L3 cache is shared by all cores of a CPU.
I recently learned this is no longer true! AMD EPYC processors (and maybe other recent AMDs?) divide cores into groups called "core complexes" (CCX), each of which has a separate L3 cache. My particular processor is 32-core where each set of 4 cores is a CCX. I discovered this when trying to figure out why a benchmark was performing wildly differently from one run to the next with a bimodal distribution -- it turned out to depend on whether Linux had scheduled the two processes in my benchmark to run on the same CCX vs. different CCXs.
https://en.wikipedia.org/wiki/Epyc shows the "core config" of each model, which is (number of CCX) x (cores per CCX).
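For anyone who wants to reproduce the same-CCX vs. cross-CCX cases deliberately instead of at the scheduler's whim, here's a minimal sketch using the third-party core_affinity crate (added as a Cargo dependency). Which core indices share a CCX is an assumption here; check the real topology with lscpu or the sysfs cache hierarchy first.

    use std::thread;

    // Pin a worker to one core; the benchmark body itself is elided.
    fn pinned_worker(core: core_affinity::CoreId) -> thread::JoinHandle<()> {
        thread::spawn(move || {
            core_affinity::set_for_current(core);
            // ... run one half of the benchmark here ...
        })
    }

    fn main() {
        let cores = core_affinity::get_core_ids().expect("no core ids");
        // Assumption: with 4 cores per CCX, cores 0 and 1 share an L3,
        // while cores 0 and 4 sit on different CCXs.
        let pairs = [("same CCX", cores[0], cores[1]), ("cross CCX", cores[0], cores[4])];
        for (label, a, b) in pairs {
            let (h1, h2) = (pinned_worker(a), pinned_worker(b));
            h1.join().unwrap();
            h2.join().unwrap();
            println!("{label}: done");
        }
    }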
It's still kinda true, just that "a CPU" in that context is a CCX. Cross-CCX communication has shown up a fair bit in reviews and benches, and really in all chips at that scale (e.g. Ampere's Altras and Intel's Xeons): https://www.anandtech.com/show/16529/amd-epyc-milan-review/4 and one of the "improvements" in Zen 3 is that the CCXs are much larger (there's one CCX per CCD, rather than 2), so there's less crosstalk.
And it was already untrue previously, e.g. the Pentium D was rushed out by sticking two P4 dies on the same package; I think they even had to go through the northbridge to communicate, so they were dual-socket in all but physical conformation (hence being absolute turds). I don't think they had L3 at all though, so that wasn't really a factor, but still...
Then you can OC-tweak each core and try to milk out performance; PBO tends to be pretty good at detecting which cores can take a slight performance boost. With a few BIOS options, all that adds up without effort.
And then finally get a better (advanced) scheduler in Windows/Linux that can move workloads between cores depending on the workload. Windows got multiple scheduler fixes for AMD starting with Win10 1903.
I find scheduler mods and modest BIOS tweaks increase performance without much effort; the gains are very noticeable.
On Linux I use xanmod (and pf-kernel before that).
On Windows I use Process Lasso and AMD PBO.
Slow/fast cores and scheduler tweaks are also how Win11, ARM Macs, and Android make things appear faster.
Amusingly, scheduler/core tweaks have been around on Linux for a decade+, making the desktop super smooth, but they're only now mainstream in Win11 and ARM macOS.
And it's not just the L3 that can be shared by various chips.
Each complex also has its own memory bandwidth, so running on two cores in two complexes will get around twice the memory bandwidth of two cores in the same complex.
Not only did I put a hole in my father's monthly budget by paying a premium for this, but I continued to do so for years through the power bills for this inefficient crap.
I remember reading that politics between Intel's India and Israel teams led to a rushed design blunder; I couldn't find that article now.
Part of me hopes you're wrong here. That is absolutely absurd.
But Intel got caught completely unaware by the switch to multi-core, just as it had by the 64-bit switch.
The eventual Core 2 was not ready yet (Intel even had to bridge to it with the intermediate Core, which really had more to do with the Pentium M than with the Core 2, though it did feature a proper dual-core design... so much so that the Core Solo was actually a binned Duo with one core disabled).
So anyway, Intel was caught with its pants around its ankles for the second time, and they couldn't let that happen. And they actually beat AMD to market, having turned out a working dual-core design in the time between AMD's announcement of the dual-core Opteron (and strong hints of an X2) and the actual release, about 8 months.
To manage that, Intel could not rearchitect their chip (and probably didn't want to, as it'd become clear NetBurst was a dead end), so they stapled two Prescott cores together, FSB included, and connected both to the northbridge.
It probably took more time to validate that solution for the server market, which is why, where AMD released the dual-core Opterons in April and the Athlons in May, it took until October for the first dual-core Xeon to be available.
"The Pentium D brand refers to two series of desktop dual-core 64-bit x86-64 microprocessors with the NetBurst microarchitecture, which is the dual-core variant of Pentium 4 "Prescott" manufactured by Intel. Each CPU comprised two dies, each containing a single core, residing next to each other on a multi-chip module package."
This actually matters if you are running a VM on such a system, since things like the actual RAM (not L3 cache) are often directly linked to a particular NUMA node. For example, accessing memory in the first RAM stick vs. the second will give different latencies, as it goes ccx1 => ccx2 => stick2 versus ccx1 => stick1. This applies to, I think, the 2XXX versions and earlier of Threadripper. My understanding is that they solved this in later versions using the Infinity Fabric (IO die), so now all CCXs go through the IO die.
I ran into all of this trying to run an Ubuntu machine that ran Windows using KVM while passing through my Nvidia graphics card.
It's applicable to any Zen design with more than one CCX, which is... any Zen 3 CPU with more than 8 cores (in Zen 2 it was 4).
The wiki has the explanation under the "Core config" entry of the Zen CPUs, but for the 5000s it's all the 59xx parts (12 and 16 cores).
Zen 3 APUs are all single-CCX, though there are Zen 2 parts in the 5000 range which are multi-CCX (because why not confuse people): the 6-core 5500U is a 2x3 and the 8-core 5700U is a 2x4.
The rest is either low-core-count Zen 2 that fits in a single CCX (5300U) or tops out at a single 8-core CCX (everything else).
If this is NUMA, then so is any CPU with hyperthreading, as the hyperthreaded cores share L1.
Do we have requests that take longer, or did Linux do something dumb with thread affinity?
In other words: if your application is at the mercy of Linux making bad decisions about what threads run where, that is a performance bug in your app.
I only know that there is one scheduler that gives you dedicated cores if you set the request and limit both to equal multiples of 1000.
If you set the CPU request equal to the CPU limit, then the container will be pinned to CPU cores via cpusets.
There is also a way to influence NUMA node allocation using the Memory Manager.
Cross-socket communications do take longer, but a properly configured NUMA-aware OS should probably have segregated threads of the same process to the same socket, so performance should have increased linearly from 1 to 12 threads, then fallen off a cliff as the cross-socket effect started blowing up performance.
So if someone is thrashing the cache on the same core you're on, you will notice it if the processes aren't being scheduled effectively.
The contents of the cache aren't stored as part of a paused process or context switch. But I'd appreciate a correction here if I'm wrong.
As an example, consider two processes A and B running on a set of cores. If A makes many more memory accesses than B, A can effectively starve B of "cached" memory accesses simply because A touches memory more frequently.
If B were run alone, its working set would fit in cache, effectively letting the algorithm operate from cache instead of RAM.
BUT. You really have to be hitting the caches hard. This doesn't happen too often in casual applications. I've only encountered it on GPUs (where each core has a separate L1 but a shared L2). Even then it's only a problem if every core is hitting different cache lines.
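A crude sketch of that A-starves-B scenario (buffer sizes are assumptions; tune them to the actual cache sizes of the part, and run the small loop once without the thrasher for a baseline):

    use std::sync::atomic::{AtomicBool, Ordering};
    use std::time::Instant;

    static STOP: AtomicBool = AtomicBool::new(false);

    fn main() {
        // Thread A: stream over a buffer far larger than the shared cache,
        // evicting everything else. 256 MiB is a guess; adjust per part.
        let thrasher = std::thread::spawn(|| {
            let big = vec![1u8; 256 << 20];
            let mut sink = 0u64;
            while !STOP.load(Ordering::Relaxed) {
                for chunk in big.chunks(64) {
                    sink = sink.wrapping_add(chunk[0] as u64); // one touch per line
                }
            }
            sink
        });

        // Thread B (here: the main thread): a ~1 MiB working set that
        // would stay cached if B ran alone.
        let small = vec![1u8; 1 << 20];
        let mut sink = 0u64;
        let t = Instant::now();
        for _ in 0..1_000 {
            for chunk in small.chunks(64) {
                sink = sink.wrapping_add(chunk[0] as u64);
            }
        }
        println!("small working set: {:?} (sink={sink})", t.elapsed());

        STOP.store(true, Ordering::Relaxed);
        let _ = thrasher.join().unwrap();
    }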
I'm curious: how could the "neighbor" possibly evict your L1/L2 when it is local to you? The worst it can do is thrash L3 like crazy, but if your own data is in L1/L2, how would that get affected?
For those of us who know basically nothing about UEFI: how do we do that? I have a 5900x, and numactl shows only one node.
I also tried passing "numa=fake=32G" or "numa=fake=2" to Linux. That syntax seems to match the documentation, but it produced the error "Malformed early option 'numa'". I haven't dug into why. I'm not sure it'd correctly partition the cores anyway.
I gave up for the moment.
Oh, because the stock Ubuntu kernel doesn't enable CONFIG_NUMA_EMU.
The solution is to stop sharing. The author had to make changes specific to this codebase for each thread to have its own copy of the offending dataset.
I've had this complaint for over a year, and nobody is addressing it. Arcs are absolutely ubiquitous; this can come back and bite at any time, and it's just very hard to know up front where a bottleneck is. Most people will never profile to this extent.
I really wish Rust would reconsider the decision to throw its hands up at non-lexical destruction guarantees. This is related to the thread::scoped removal from 2015 (IIRC), after which mem::forget was declared safe. Anyway, I'm rambling. There's tons to read for the curious.
The simplest mitigation right now is to clone Arcs only when necessary, and instead pass plain references to them wherever possible (especially on hot paths), as in the sketch below. It's a mess for refactors and easy to miss, but better than nothing.
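A minimal sketch of that mitigation (BigData and the call structure are invented for illustration): clone at ownership boundaries, borrow everywhere else.

    use std::sync::Arc;

    struct BigData([u8; 4096]);

    // The hot path takes a plain borrow: no refcount traffic at all.
    fn hot_path(data: &BigData) -> u8 {
        data.0[0]
    }

    fn main() {
        let shared = Arc::new(BigData([0; 4096]));

        // Clone only where a new owner is genuinely needed, e.g. when
        // handing the data to a spawned thread:
        let for_thread = Arc::clone(&shared);
        let handle = std::thread::spawn(move || hot_path(&for_thread));

        // On this thread's hot loop, borrow through the Arc instead of
        // cloning it on every iteration:
        for _ in 0..1_000_000 {
            hot_path(&shared); // auto-deref; the counter is untouched
        }
        handle.join().unwrap();
    }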
I think the original argument, that you can leak memory in safe code by building a reference cycle, is still valid. The language would have to change significantly to prevent that, so either reference counting must be unsafe, or we accept that destruction cannot be guaranteed just because something goes out of scope. That is why `forget` is safe.
If you look at the various crates that offer static borrowing across threads today, they use lexical scope, and they are perfectly safe and guarantee destruction IFF the inner scopes terminate (which is exactly the guarantee you want). However, they are too restrictive to be useful.
Personally I'm glad the world is moving away from unsafe manual memory management in C++, be that with better RAII and managed pointers, Rust's type system, etc. But those things are still breakable and, IMHO, ultimately a big waste of people's time. If Rust's lifetime annotations could be inferred by a compiler, then by all means. But if they can't, just GC already. All that extra thinking pushes people to think about the wrong things, and designs end up ossified, difficult to refactor because of the extra overhead of radically changing ownership. Forget it already! Let computers figure out the whole memory management problem.
Want your program to go fast? Don't allocate a bunch of garbage. Redesign it to reuse data structures carefully. Don't burden yourself forever twiddling with managing the large and small alike, obsessing over every dang byte.
Unfortunately, it can't. To be fair, GC does manage memory very well, but there is more to memory than handling allocation.
It becomes immediately obvious when you compare Go to Rust. Go strives for simplicity. Compare Go and Rust code, and Rust's overhead for making memory handling safe (type annotations, the type system exploding because it carries information about so many kinds of ownership) makes it horribly complex compared to Go. Go code expresses the same concepts in far fewer tokens, and is just as memory safe as the Rust code - provided there is only one thread.
Add a second thread, and the Rust code will continue to work as before. The Go code - well, the GC manages memory allocation only; it does _not_ arbitrate how multiple threads access those objects. Screw up how multiple Go threads access that nice GC-managed memory, and undefined, non-deterministic behaviour is the inevitable result. That is IMO the _worst_ type of bug. Most of the overhead Rust imposes (like ownership) has very little to do with managing memory lifetimes. It is about preventing two threads from stomping on each other. Managing memory lifetimes just comes for free.
I suspect the rise and rise of Rust is mostly due to its concurrency guarantees. Go back a couple of decades, when multiple CPUs were almost non-existent, and I suspect the complexity Rust imposes would have been laughed out of the room, given all it really gave you was a very complex alternative to GC - which is what you are effectively saying is all it provides. Nowadays we have $1 computers with multiple ARM cores in a disposable COVID testing kit. Once you've been hit a couple of times by a concurrency heisenbug taking out your products in the field, you are screaming out for some tool to save your arse. Rust is one such tool.
There are a lot of other strategies for writing concurrent systems: for example, making most data immutable, coarse-grained sharing, shared-nothing multi-process architectures like actors, using race detectors and/or thread sanitizers, and using good concurrent libraries (with lock-free hashtables and other data structures).
The ratio of all other kinds of bugs causing catastrophic failure to concurrency bugs causing catastrophic failure has gotta be at least 100 or maybe 1000 to 1. So forcing everyone to think about ownership because maybe they are writing concurrent code (then again, maybe they aren't) so that "congrats, your memory management problems are solved" seems like a Pyrrhic victory: you've already blown their brain cells on the wrong problem. Worse, you've forced them to bake ownership into the seams in every part of the system, making it more difficult to refactor and evolve that system in the future.
> Screw up how multiple Go threads access that nice GC-managed memory, and undefined, non-deterministic behaviour is the inevitable result.
You get race conditions; you don't get undefined behavior (nasal demons). Go and Java have memory models that guarantee, at minimum, that your program doesn't violate the source type system and you don't get out-of-thin-air results. You get tearing and deadlocks and race conditions, not undefined behavior. At Google we used TSan extensively in Chrome and V8, and similar tools exist for Java; I assume the same is true for Go.
It's a broader conversation, but I don't think advocating that everyone write highly concurrent programs with shared mutable state, no matter what tools they have at their disposal, is pushing in the right direction. Erlang, actors, sharing less, immutable data, and finding better higher-level parallel programming constructs should be what we focus our and others' precious brain cycles on, not getting slapped around by a borrow checker or thread sanitizer. We've gotta climb out of this muck one of these decades.
If you are instead saying that people should think about the design of their system from the beginning so that they don't have to refactor as much, then I agree; but just don't waste time thinking about ownership, or worse, architecting ownership into the system; it will alter the design.
https://manishearth.github.io/blog/2015/05/17/the-problem-wi... argues that "[a]liasing with mutability in a sufficiently complex, single-threaded program is effectively the same thing as accessing data shared across multiple threads without a lock". This is especially true in Qt apps which launch nested event loops, which can do anything and mutate data behind your back, and C++ turns it into use-after-free UB and crashing (https://github.com/Nheko-Reborn/nheko/issues/656, https://github.com/Nheko-Reborn/nheko/commit/570d00b000bd558...). I find Rust code easier to reason about than C++, since I know that unrelated function calls will never modify the target of a &mut T, and can only change the target of a &T if T has interior mutability.
Nonetheless the increased complexity of Rust is a definite downside for simple/CRUD application code.
On the other hand, when a programmer does write concurrent code with shared mutability (in any language), in my experience the only way they'll write correct and understandable code is if they've either learned Rust or were tutored by someone at the skill level of a Solaris kernel architect. And learning Rust is infinitely more scalable.
Rust taught me to make concurrency tractable in C++. In Rust, it's standard practice to designate each piece of data as single-threaded, shared but immutable, atomic, or protected by a mutex, and to separate single-threaded data and shared data into separate structs. The average C++ programmer who hasn't studied Rust (e.g. the developers behind FamiTracker, BambooTracker, RtAudio, and RSS Guard) will write wrong and incomprehensible threading code which mixes atomic fields, data-raced fields, and fields accessed while holding a mutex, sometimes holding a mutex on the writer but not the reader, sometimes switching back and forth between these modes ad hoc. Sometimes it only races on integer/flag fields and works most of the time on x86 (FamiTracker, BambooTracker, RtAudio), and sometimes it crashes due to a data race on collections (https://github.com/martinrotter/rssguard/issues/362).
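To make that practice concrete, a hypothetical sketch of the kind of split meant here (all names invented, not from any of the projects mentioned):

    use std::sync::atomic::{AtomicBool, Ordering};
    use std::sync::{Arc, Mutex};

    struct UiState {
        selected_track: usize, // single-threaded: UI thread only
    }

    struct Shared {
        playing: AtomicBool,     // lone flag crossing threads: atomic suffices
        pattern: Mutex<Vec<u8>>, // compound data crossing threads: mutex
    }

    fn main() {
        let ui = UiState { selected_track: 0 };
        let shared = Arc::new(Shared {
            playing: AtomicBool::new(false),
            pattern: Mutex::new(vec![0; 16]),
        });

        // The audio thread gets its own handle to the shared part only.
        let audio = Arc::clone(&shared);
        let t = std::thread::spawn(move || {
            if audio.playing.load(Ordering::Acquire) {
                let _pattern = audio.pattern.lock().unwrap();
                // ... render audio from the pattern ...
            }
        });

        shared.playing.store(true, Ordering::Release);
        let _ = ui.selected_track; // UI-only state never crosses the boundary
        t.join().unwrap();
    }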
Edit: it appears they're talking specifically about the ref counting, whereas I was considering the entirety of a shared context. That clarification helps me understand where their statement was coming from.
Edit: to clarify, I was thinking about a mutable context share as shown in the code example, not solely about the ref counting.
By not using reference counting. State-of-the-art GCs don't count references. They usually do mark-and-sweep, implement multiple generations, and/or do a few other things.
Most of that overhead only happens while collecting. Merely referencing an object from another thread doesn't modify any shared cache lines.
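A minimal sketch of the difference: every Arc clone and drop, from any thread, is an atomic read-modify-write on the one shared counter, so the cache line holding it bounces between every core touching the Arc. A tracing GC updates no per-reference count when a pointer is handed around.

    use std::sync::Arc;
    use std::thread;

    fn main() {
        let data = Arc::new(vec![0u8; 1024]);
        let handles: Vec<_> = (0..8)
            .map(|_| {
                let data = Arc::clone(&data); // atomic fetch_add on the counter
                thread::spawn(move || {
                    for _ in 0..1_000_000 {
                        let tmp = Arc::clone(&data); // fetch_add, same cache line
                        drop(tmp);                   // fetch_sub, same cache line
                    }
                })
            })
            .collect();
        for h in handles {
            h.join().unwrap();
        }
    }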
> What language has a sufficiently lock less rw capable GC?
Java, C#, F#, Golang.
But if it's just for the ref-counting part of the Arc, then I can see how a GC would solve it by not needing the refcount.
A quote from the article: “No locks, no mutexes, no syscalls, no shared mutable data here. There are some read-only structures context and unit shared behind an Arc, but read-only sharing shouldn’t be a problem.” As you see, the data shared across threads was immutable.
However, the library they picked was designed around Rust's reference-counting Arc<> smart pointers. Apparently, for some other use cases not needed by the OP, that library needs to modify these objects.
> I can see how a GC would solve it by not needing the RC
Interestingly enough, C++ would also solve that. The language does not stop programmers from changing things from multiple threads concurrently. For this reason, very few libraries have their public APIs designed around std::shared_ptr<> (the C++ equivalent of Rust's Arc<>). Instead, what usually happens is that library authors write in the documentation things like "the object you pass to this API must be thread safe" and "it's your responsibility to make sure the pointer you pass stays alive for as long as you're using the API", and call it a day.
Technically, all programming languages are Turing-complete. Practically, various things can affect development cost by an order of magnitude. The OP acknowledges that; they wrote "Rewriting Rune just for my tiny use case was out of the question".
Just because something can be done doesn't mean it's a good idea to do that. Programming is not science, it's engineering, it's all about various tradeoffs.
> The language just steers you away
Such steering caused a unique performance issue absent from both safer garbage-collected languages and unsafe C++.
Note the OP was lucky to be able to work around it by cloning the data. If those context or unit objects used a gigabyte of RAM, that workaround probably wouldn't work: too much RAM overhead.
Doing that is prohibitively expensive in this particular case. It would require patching a large third-party library that uses Arc in its API: https://docs.rs/rune/latest/rune/struct.Vm.html#method.new
And the reason that library uses Arc in its API is unique to Rust.
A different non-GC language wouldn't change things, because you'd face the exact same trade-off if the same design decision were made.
The only major difference is that Rust pushes you toward Arc, but C++ doesn't push you toward shared_ptr.
I'm familiar with Go's GC. Your linked post doesn't explain how it would avoid the hit from cache invalidation across multiple clusters.
It'll either try to put multiple goroutines on a single cluster (as listed in the link) or it'll need to copy the necessary stack per thread, which is effectively what the original article ends up doing.
But if you encounter anything that needs to run concurrently across threads while using a single r/w object, you'll hit the same cliff, surely?
It trips up everybody who hopes "lock-free" will be a magic bullet freeing them from resource contention bottlenecks.
When you have explicit locks, aka mutexes (condition variables, semaphores, what-have-you), the interaction between your threads is visible on the surface. Replacing that with "lock-free" interaction, you have essentially the same set of operations as when taking a mutex. On the up side, overhead cost may be lower. On the down side, mistakes show up as subtly wrong results instead of deadlocks. And, you get less control, so when you have a problem, it is at best harder to see why, and harder to fix.
Because the lock-free operations involve similar hardware bus interactions under the covers, they cannot fix contention problems. You have no choice but to fix contention problems at a higher, architectural design level, by reducing actual contention. Having solved the problems there, the extra cost of explicit mutex operations often does not matter, and the extra control you get may be worth any extra cost.
What is lock-free good for, then? Lock-free components reduce overhead when there is no contention. Given any actual contention, performance is abandoned in favor of correctness. So, if you have mutexes and performance is good, moving to lock-free operations might make performance a little better. If performance is bad, mutexes and lock-free operations will be about equally as bad.
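One way to see the "about equally as bad" claim: under contention, an atomic RMW and a mutex-protected increment fight over the same cache line. A hedged sketch, not a rigorous benchmark (thread and iteration counts are arbitrary):

    use std::sync::atomic::{AtomicU64, Ordering};
    use std::sync::{Arc, Mutex};
    use std::thread;
    use std::time::Instant;

    const N: u64 = 1_000_000;
    const THREADS: usize = 8;

    fn main() {
        // "Lock-free": every thread does an atomic RMW on one cache line.
        let counter = Arc::new(AtomicU64::new(0));
        let t = Instant::now();
        let handles: Vec<_> = (0..THREADS)
            .map(|_| {
                let c = Arc::clone(&counter);
                thread::spawn(move || {
                    for _ in 0..N {
                        c.fetch_add(1, Ordering::Relaxed);
                    }
                })
            })
            .collect();
        for h in handles { h.join().unwrap(); }
        println!("atomic: {:?}", t.elapsed());

        // Mutex: the same cache-line ping-pong, plus lock bookkeeping.
        let counter = Arc::new(Mutex::new(0u64));
        let t = Instant::now();
        let handles: Vec<_> = (0..THREADS)
            .map(|_| {
                let c = Arc::clone(&counter);
                thread::spawn(move || {
                    for _ in 0..N {
                        *c.lock().unwrap() += 1;
                    }
                })
            })
            .collect();
        for h in handles { h.join().unwrap(); }
        println!("mutex:  {:?}", t.elapsed());
    }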
If you're working on a real-time system - and I don't mean the colloquial meaning of that word, but rather a system that must guarantee a side effect is produced no later than a given deadline - then you must use a lock-free algorithm; there is no alternative. Avionics, DSP, high-frequency trading: these are all domains where lock-free algorithms are necessary. Making a fast lock-free algorithm is great, and generally the goal of any kind of parallelism is performance, but "fast" is not an unambiguous term: it can mean low latency or high bandwidth, and it's very unlikely that one algorithm is both.
Lock free algorithms are great when you need low latency and guaranteed throughput. If you don't need that then it's possible to trade those properties for much higher bandwidth and use blocking algorithms.
Also: all this has nothing to do with parallelism. Parallelism is about things happening at the same time. Locking or not locking, and contention, are about concurrency, which is about synchronization and serialization - a different topic.
Furthermore, if you think a discussion about benchmarking an algorithm on a 24-core SMP system involving atomic reads, writes, and shared caches has nothing to do with parallelism, then let's just agree to disagree, and I wish you all the best in your programming endeavors.
People can decide for themselves what credibility they wish to lend to your argument given your perspective on this matter.
To clarify, the main property of a non-blocking algorithm is that it's exclusively composed of simpler, smaller ("atomic") operations that demonstrate time-bound/progress-making guarantees, in turn making the bigger algorithm generally easier to reason about (Note "generally").
The consequence of this argument is twofold:
1. It's entirely possible to make a blocking algorithm provide the same guarantees as a non-blocking algorithm, as long as the underlying operations (blocking or otherwise) are proven to be correct. Blocking algorithms tend to throw other complex systems into the mix (e.g. the scheduler), and all of the dependencies and their usages have to be correct (w.r.t. whatever property we desire) to show that the final blocking algorithm is also correct. For instance, it can be argued that the priority inversion problem stems from the scheduler's lack of correctness, or alternatively from incorrect use of the scheduler. However, in the absence of such problems, there are effectively no disadvantages to blocking algorithms in this respect.
2. If it turns out any atomic operations below the non-blocking algorithm do not meet the expected guarantees, then the non-blocking algorithm itself also becomes unsafe. I think the OP clearly demonstrates this problem: the CPU's atomic instructions are but some kind of hardware "locks" of their own (either a LOCK# pin on older processors or a complex cache-coherency protocol, but that's an implementation detail) that can sometimes fail to meet the programmer's expectations, leading to surprising results.
In short, the fine line between blocking algorithms and nonblocking algorithms can be drawn by whichever smaller operations or components we assume to be correct.
"Contention" is not very meaningful to a lock-free algorithm, since lock-free algorithms typically do not permit shared data access semantics. Rather they describe how data is sent between processes, which is why a lock-free solution looks very different than something behind mutexes.
But to the heart of your point, I do wish people spent more time understanding that lock-free vs. lock-ful is a distinction in semantics and doesn't imply anything about performance. There are subtle trade-offs.
(+) I'm using "process" with a lot of hand-waving; assume it means whatever your atom of parallelism is.
Contention is the devil, and you can never hide from it if you bring it into your life. The fastest "contention avoidance hack" would be the CAS operation mentioned above, which still performs approximately two orders of magnitude slower than a single-threaded operation in the most adverse scenarios.
If you aren't sure of which problems actually "fit" on a modern x86 thread these days, rest assured that many developers in fintech have been able to squeeze entire financial exchanges onto one:
For most scenarios, you really don't need all those damn cores. They are almost certainly getting in the way of you solving your problems.
However, if the signal pathway is more convoluted (no pun intended for my audio nerd crew), then not only does the degree of parallelism decrease, but the issue of single-sample vs. block-structured processing can become much more important. In the single-sample case, the cost of computing a single sample has (relatively) enormous fixed overhead if the code has any serious level of modularity, but is still very cheap compared to inter-thread and inter-core synchronization. Spreading the computation across cores will frequently, but not always, slow things down quite dramatically. By contrast, block-structured processing has a much lower per-sample cost, because the function call tree is invoked only once for an entire block of samples, and so the relative cost of synchronization decreases.
This sounds like a slam dunk for block structured processing.
Problem is, there are things you cannot do correctly with block-structured processing, and so there is always a place for single-sample processing. However, the potential for parallelization and the level at which it should take place differ significantly between the two processing models, which opens the door to some highly questionable design.
The short version of this is that audio synthesis in the abstract can always expand to use any number of cores, particularly for cpu-intensive synthesis like physical modelling, but that in the presence of single-sample processing, the cost of synchronization may completely negate the benefits of parallelism.
We need to meet our real-time deadline or risk dropping buffers and making nasty pops and clicks. That mastering stage can pretty easily be the limiting (hah) step that causes us to miss the deadline, even if we processed hundreds of tracks in parallel moments before in less time.
The plug-in APIs (AudioUnits, VSTs, AAX) which are responsible for all the DSP and virtual instruments are also designed to process synchronously. Some plug-ins implement their own threading under the hood but this can often get in the way of the host application’s real-time processing. On top of that, because the API isn’t designed to be asynchronous, the host’s processing thread is tied up waiting for the completed result from the plug-in before it can move on.
Add on that many DSP algorithms are time-dependent. You can’t chop up the sample buffer into N different parts and process them independently. The result for sample i+1 depends on processing sample i first.
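The classic example of that serial dependency is a recursive filter: each output feeds the next, so the buffer cannot be split into independently processed chunks (a generic sketch, not any particular plug-in's code):

    // y[n] = y[n-1] + a * (x[n] - y[n-1]): a one-pole low-pass, processed
    // in place. The loop-carried dependency on `state` is why sample i+1
    // cannot be computed before sample i.
    fn one_pole_lowpass(a: f32, state: &mut f32, block: &mut [f32]) {
        for x in block.iter_mut() {
            *state += a * (*x - *state);
            *x = *state;
        }
    }

    fn main() {
        let mut state = 0.0;
        let mut block = [1.0f32; 8]; // a step input
        one_pole_lowpass(0.5, &mut state, &mut block);
        println!("{block:?}"); // rises toward 1.0, each sample built on the last
    }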
What I like about Pd is that you can freely reblock and resample any subpatch. Want some section with single-sample-feedback? Just put a [block~ 1]. You can also increase the blocksize. Usually, this is done for upsampling and FFT processing. Finally, reblocking can be nested, meaning that you can reblock to 1024 samples and inside have another subpatch running at 1 sample blocksize.
SuperCollider, on the other hand, has a fixed global blocksize and samplerate, which I think is one of its biggest limitations. (Needless to say, there are many things it does better than Pd!)
In the last few days I have been experimenting with adding multi-threading support to Pd (https://github.com/Spacechild1/pure-data/tree/multi-threadin...). With the usual blocksize of 64 samples, you can definitely observe the scheduling overhead in the CPU meter. If you have a few (heavy-weight) subpatches running in parallel, the overhead is negligible. But for [clone] with a high number of (light-weight) copies, the overhead becomes rather noticeable. In my quick tests, reblocking to 256 samples already reduces the overhead significantly, at the cost of increased latency, of course.
Also, in my plugin host for Pd/SuperCollider (https://git.iem.at/pd/vstplugin/) I have a multi-threading and bridging/sandboxing option. If the plugin itself is rather lightweight and the blocksize is small, the scheduling overhead becomes quite noticeable. In Pd you can just put [vstplugin~] in a subpatch + [block~]. For the SuperCollider version I have added a "reblock" argument to process the plugin at a larger blocksize, at the cost of increased latency.
These definitely don't talk. Offload. These definitely talk. Colocate. These probably talk. Bin pack.
Battlefield 4 was famously terrible on Ryzen when AMD first released those chips.
The majority of the performance loss was down to not accounting for CCX scheduling.
I remember hitting this same issue many years ago. Some primitive in the library (C++) didn't, on the surface, appear to involve shared state and cache write locking, but of course it did if you thought about it. Since I switched over to more application-level coding I don't work at that low a level very much, and I tend to scale with processes now (typically not needing that much parallelism anyway), so I don't often have to think these things through anymore.
Of course, designing with multiple processes in mind helps to design a shared-nothing architecture, as the costs of shared state between processes are much more explicit. But any kind of state that needs to be shared will run into the same bottleneck, because all forms of shared IPC fundamentally have to use some form of locking somewhere in the stack to serialize incoming inputs.
1. The Arc use is a case (I believe! I am not a Rust'er!) of true sharing of references. The cache line will be contended across cores. Using processes would have eliminated the true sharing. Tracking logical core affinity for references via a modal type system could have flagged this at compile time, but I suspect that's not a thing in Rust land.
2. Even if it were false sharing at the cache-line level, process abstractions in shared-memory systems are generally aligned to avoid false sharing at cache-line granularity (and misaligned ops at smaller granularities). Of course, in the extreme, a 10X hero programmer may have tried to do amazing levels of bit packing and syscall hackery and thus break all those niceties, but that's the cleanup everyone else on the team has to do to make the hero a 10Xer.
perf-c2c allows you to measure cacheline contention - see https://man7.org/linux/man-pages/man1/perf-c2c.1.html and https://joemario.github.io/blog/2016/09/01/c2c-blog/.
It tracks a bunch of cache-related counters, and HITM (hit modified) tracks loads of data modified by either another core of the same node ("local HITM") or by another node entirely ("remote HITM").
Here, the contention would have shown up as extreme amounts of both, as each core would almost certainly be trying to atomically update a value which had just been updated by another core, with higher-than-even chances that the value had been updated by a core on the other socket (= node).
Can someone more knowledgeable than I please explain the obvious part of this? Why is the performance identical for 8-12 threads? What is it that is saturated at 8 threads even though there are 4 more threads hanging around?
At first, the throughput up to 4 cores increases linearly because 4 OS threads can utilize 4 hardware threads independently, making the greatest possible use of available hardware resources. Beyond 4 OS threads, simultaneous multithreading comes into play and up to 8 OS threads get scheduled on 8 "simultaneously multithreaded" hardware threads, which offers increased throughput, but not as drastic (see how the author uses the word "slightly"). Beyond 8 OS threads, the throughput will not increase as you are out of hardware threads which actually execute your instruction streams independently. The OS can spawn an arbitrary number of threads beyond 8, but they will take turns executing on your 8 available hardware threads - no gain in net throughput. You get limited by the hardware.
All CPU resources aren't actually saturated, as there will still be idle execution units, but since the CPU can't actually dispatch another thread to make use of those units, there's nothing you can do about that.
One could also argue that you shouldn't use any kind of refcounting in a tight loop where it can affect performance as much as it did here. Refactoring the code a bit to allow passing by reference would likely help even more. Still, it's cool to see that just changing to Rc gave that much of a gain.
Instead of unsharing the whole thing, I wonder if it would make more sense for Rune's Vm type to just re-wrap the Arcs in a new reference-counted type. I don't know if Rune spawns new threads internally, if not then it could store these as Rc<Arc<…>>, but if so then it could use Arc<Arc<…>>. This may look odd but it would ensure that every Vm gets its own independent reference count to use for all of its internal cloning. Or it could redesign its internals to avoid doing so much cloning of the pointers, instead using borrows, though I don't know anything about the Rune API so I don't know how viable that is.
In any case, the basic takeaway here should be that using Arc is still potentially expensive if you end up cloning it often (and it looks like Rune does; in a quick skim it's cloning the runtime context and unit quite often). Arc should be used just to deal with data lifetime issues, but cloning of Arc should be avoided as much as possible.
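A sketch of what that re-wrapping could look like if the VM stays on one thread internally (Vm, Context, and fork are invented stand-ins here, not Rune's actual API):

    use std::rc::Rc;
    use std::sync::Arc;

    struct Context; // stand-in for the shared runtime data

    struct Vm {
        // Outer Rc: cheap non-atomic clones within this Vm's thread.
        // Inner Arc: keeps the data alive and shareable across threads.
        context: Rc<Arc<Context>>,
    }

    impl Vm {
        fn new(context: Arc<Context>) -> Self {
            Vm { context: Rc::new(context) }
        }

        // Internal cloning bumps only the Rc count: a plain increment,
        // no contended cache line.
        fn fork(&self) -> Vm {
            Vm { context: Rc::clone(&self.context) }
        }
    }

    fn main() {
        let shared = Arc::new(Context);
        let vm = Vm::new(Arc::clone(&shared));
        let _child = vm.fork();
    }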
As far as I know, Rust doesn't have them.
And as far as I know, if Rust gets format strings it will most likely get them in the form of an `f` prefix (like raw strings have an `r` prefix and byte strings have a `b` prefix).
Is it just a thing to make the code more readable/shorter on the web?
Perhaps another problem is that libraries can't be (or just aren't) generic over the kind of smart pointers that are used, but the kind of smart pointer used can matter a lot (e.g. libraries often have to use Arc for some of their clients even though Rc would often suffice).
But maybe I’m just totally wrong about this?
The variation is: instead of one count per thread, have two reference counts, where one is atomic and the other is not. The heuristic is that at any given time the majority of writes are performed within a single thread, so that thread gets exclusive read/write access to the non-atomic reference count, while all other threads share the atomic reference count. Some work is needed to coordinate the atomic and non-atomic counts so as to detect when an object can be deleted, but the overhead of this work is less than the cost of always using an atomic.
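A heavily simplified sketch of that scheme; real implementations (e.g. in the biased-reference-counting literature) also have to merge the two counts, hand off ownership between threads, and decide when the object dies, all of which is omitted here. The UnsafeCell fast path is sound only because of the owner-thread check, which is exactly the biased-counting argument:

    use std::cell::UnsafeCell;
    use std::sync::atomic::{AtomicUsize, Ordering};
    use std::thread::{self, ThreadId};

    struct BiasedCount {
        owner: ThreadId,
        biased: UnsafeCell<usize>, // only ever touched by `owner`
        shared: AtomicUsize,       // used by every other thread
    }

    // Safe to share only because `biased` is guarded by the owner check below.
    unsafe impl Sync for BiasedCount {}

    impl BiasedCount {
        fn incref(&self) {
            if thread::current().id() == self.owner {
                // Owner's fast path: a plain non-atomic increment, sound
                // because no other thread ever writes `biased`.
                unsafe { *self.biased.get() += 1 }
            } else {
                // Everyone else pays for the atomic RMW.
                self.shared.fetch_add(1, Ordering::Relaxed);
            }
        }
    }

    fn main() {
        let rc = BiasedCount {
            owner: thread::current().id(),
            biased: UnsafeCell::new(1),
            shared: AtomicUsize::new(0),
        };
        rc.incref(); // takes the non-atomic fast path on this thread
    }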
The Python thing is interesting. I thought there were other reasons for the GIL than refcounts, but I guess everything must be addressed.
Remember, boys and girls, small mistakes can have big consequences.
Keep that data in immutable data structures when multi-threading. The actor model also helps.
IME C++ programmers seem to put an unreasonable amount of faith in RAII to "solve" these problems, e.g.:
{
    std::lock_guard<std::mutex> guard(mutex); // assumes a std::mutex named `mutex`
    // do stuff while the mutex is held
} // lock_guard destructor releases the mutex
So whenever anyone tells you "just start a thread" without any hesitation, it's reasonably safe to assume they have no idea what they're talking about.
This is one reason I love the async code model used in a number of languages. I have the most experience with Hack, but it's not unique to that. What I tend to think of as "product" code (which is a real Facebook-ism) has no need to ever create a thread or deal with any concurrency issues.
Now if you're programming an HTTP server or an RDBMS or other infrastructure, then sure, you have a valid need. But if you're just serving HTTP requests, avoid it entirely if you can, because any performance gain is probably imaginary but the complexity increase isn't.
In addition, async code can bring many of the same challenges as multithreading. For example, in C# tasks may run on the current thread, but they may also run on a thread pool, and this is mostly transparent to the programmer. When you don't immediately await all asynchronous operations, your tasks may end up running at the same time.
The downside is more complex code, code that's harder to reason about, the possibility of concurrency bugs, slower development times, the possibility of security flaws and you may just be shifting your performance bottleneck (eg creating a hot mutex).
The upside is lower latency, more throughput and higher QPS.
My argument is quite simply YAGNI. A lot of these benefits are for many people purely imaginary. They're the very definition of premature optimization. Code that is less complex, faster to develop in and likely "good enough" is probably going to help you way more than that.
If you get to the scale where heavy multithreading actually matters, that's a good problem to have. My argument is that many people aren't there and I would hazard a guess that a contributing factor to not getting there is worrying about this level of performance before they need to.
I serve HTTP requests offering JSON-based RPC in my C++ servers that do all kinds of calculations (often CPU-consuming) in multiple threads. It works just fine and scales well. Yes, I have to be careful about how I work with shared data, but it is centralized, encapsulated, and done in a single place I call the request router/controller. I reuse it across projects. Nothing too exciting about it, and no reason to be scared to death ;)
Go has other issues. My first piece of advice to any new Go programmer is: make every single one of your channels unbuffered. That way it operates a lot like the async model I mentioned, and it tends to be easy to reason about.
But people fall into the trap of thinking that adding a buffer will improve concurrency and throughput. This might be true, but most of the time it's a premature optimization. Buffered channels also often lead to obscure bugs and a system that's simply harder to reason about.
It's really better to think of Go as a tool for organizing concurrent code, not necessarily a direct solution to the problems involved. But organization matters.
Or, per-cpu counters exist for a reason.
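In userspace, the closest easy approximation is a sharded counter; a hedged sketch (a real per-CPU counter would index by the executing CPU rather than a caller-supplied shard):

    use std::sync::atomic::{AtomicU64, Ordering};

    // One slot per cache line so the shards don't false-share.
    #[repr(align(64))]
    struct Slot(AtomicU64);

    struct ShardedCounter {
        slots: Vec<Slot>,
    }

    impl ShardedCounter {
        fn new(shards: usize) -> Self {
            Self { slots: (0..shards).map(|_| Slot(AtomicU64::new(0))).collect() }
        }
        // `shard` stands in for "current CPU"; a real per-CPU counter would
        // derive it from the executing CPU instead of taking an argument.
        fn add(&self, shard: usize, n: u64) {
            self.slots[shard % self.slots.len()].0.fetch_add(n, Ordering::Relaxed);
        }
        fn sum(&self) -> u64 {
            self.slots.iter().map(|s| s.0.load(Ordering::Relaxed)).sum()
        }
    }

    fn main() {
        let c = ShardedCounter::new(8);
        std::thread::scope(|s| {
            for shard in 0..8 {
                let c = &c;
                s.spawn(move || {
                    for _ in 0..1_000_000 {
                        c.add(shard, 1); // each thread stays on its own cache line
                    }
                });
            }
        });
        assert_eq!(c.sum(), 8_000_000);
    }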
There are only two hard things in Computer Science: cache invalidation, naming things, and off-by-one errors.
There are only two hard things in Computer Science: cache invalidation, and…what did you ask me again?
0) Cache invalidation
1) Naming things
5) Asynchronous callbacks
2) Off-by-one errors
3) Scope creep
6) Bounds checking