A few cores too many (2016) (northeastern.edu)
35 points by gus_massa on Oct 12, 2023 | 20 comments



I remember writing some threaded code years ago, and was surprised to find that running it on a machine with multiple cores didn't make it faster; sometimes it went slower.

Then I found out about pinning threads and processes to a CPU, which made everything go a lot faster.
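
For readers who haven't seen it, here is a minimal sketch of what thread pinning looks like on Linux, using the glibc-specific pthread_setaffinity_np. This is illustrative only, not the parent's code; compile with -pthread.

    // Sketch: pin a std::thread to CPU 0 on Linux (glibc-specific, not portable).
    #include <pthread.h>
    #include <sched.h>
    #include <thread>
    #include <cstdio>

    int main() {
        std::thread worker([] {
            // ... hot loop doing the actual work ...
        });

        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);  // pin to core 0
        int rc = pthread_setaffinity_np(worker.native_handle(), sizeof(set), &set);
        if (rc != 0) std::fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);

        worker.join();
        return 0;
    }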


Yeah, with the GHz free lunch done, we're headed for more hardware-specific optimization. A lot of the abstractions in code from the last six decades carry an informal assumption of "we're getting 2x speed every 18 months, so let's add more abstractions for the programmer".

Now, CPUs are still stupid fast, so you can get away with the poor abstractions, but if you really want to make a high-core-count, big-cache CPU system work, you'll need some hardware-specific code.

ScyllaDB vs Cassandra was a big example of this, and Scylla's performance leap forced a pretty big rewrite of the Java-based Cassandra core from a staged event-driven architecture (SEDA), which would thrash the cores and caches as events moved around, to a pinned-CPU architecture.

Even though we are, what, 15+ years into desktops having 4+ cores, and ludicrous core counts on servers, there really isn't very good documentation, or many tutorials and examples, for squeezing maximum performance out of these multicore systems. Platforms like Java have 20 years of multithreading code, libraries, and examples, and a huge amount of it is really bad practice these days.

The big explosion of Python and JavaScript isn't really helping either; those languages are saddled with a GIL or a single-process, async/await architecture. That isn't going to wring the maximum out of a huge multicore system that pinned processes would benefit from. But then again, does Java, while it is multiprocess/multithreaded, even have the ability to pin a process to a CPU?


We have Rust now (which is also making its way into the JS and Python ecosystems), and it's arriving on the desktop right now. Other native languages will follow. Also, both Python and JS have solutions (subinterpreters and web workers, respectively) that will eventually become more popular.


CPUs also have different frequencies depending on how many cores are active.

It is pretty annoying to benchmark nowadays unless you have access to the BIOS, or at least root.
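
For context, Linux exposes the scaling governor and current clock through sysfs. A quick check like the sketch below (which assumes the usual cpufreq layout) shows whether a core is likely to boost or clock down mid-benchmark; actually changing the governor requires root.

    // Sketch: read cpu0's scaling governor and current frequency from sysfs.
    // Paths assume the standard Linux cpufreq layout; adjust for your system.
    #include <fstream>
    #include <iostream>
    #include <string>

    int main() {
        const std::string base = "/sys/devices/system/cpu/cpu0/cpufreq/";
        std::ifstream gov(base + "scaling_governor");
        std::ifstream cur(base + "scaling_cur_freq");
        std::string governor, freq_khz;
        if (gov >> governor && cur >> freq_khz) {
            std::cout << "governor: " << governor
                      << ", current freq: " << freq_khz << " kHz\n";
        } else {
            std::cout << "cpufreq sysfs entries not available\n";
        }
        return 0;
    }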


I was talking about this with a colleague today who does benchmarking. He was of the opinion that the OS scheduler generally does a better job than manual core pinning, but I would be interested to hear more about your experience.


OS scheduling generally does a good job of maximizing overall throughput and/or evening out usage of cores so as to give a fair share to all processes. What it does not normally do a good job of is keeping relatively good latency/timing. If you're doing anything that is latency sensitive (say, audio applications or high-frequency trading) then pinning a single process to a single core and not running anything else on that core is going to minimize jitter and latency spikes.
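
A minimal sketch of that on Linux: sched_setaffinity pins the current process to one core. Actually keeping everything else off that core additionally needs kernel-level isolation such as the isolcpus boot parameter; this is illustrative only.

    // Sketch: pin the calling process to core 3 on Linux with sched_setaffinity.
    // The core number is arbitrary; keeping other work off that core entirely
    // also requires isolating it (e.g. isolcpus), which is not shown here.
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE
    #endif
    #include <sched.h>
    #include <cstdio>

    int main() {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(3, &set);
        if (sched_setaffinity(0 /* current process */, sizeof(set), &set) != 0) {
            std::perror("sched_setaffinity");
            return 1;
        }
        // ... latency-sensitive work runs here, alone on core 3 ...
        return 0;
    }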


Confused about “From here we tried increasing the running time of the benchmark, removing I/O and system calls, checking for hyperthreading (ARM cores don’t support it), and even changing the cores’ CPU governor”. Where do ARM cores come in here? Is it a typo for AMD? But those cores (Bulldozer era) did have shared resources between threads even though they didn’t call it Hyperthreading.


The "shared" components being the integer units, which are typically assembled with cores in a 1:1 ratio. So by having two cores with one integer unit you technically still don't have a hyper-threaded CPU. AMD argued in court at the time that this counted as 2 separate cores, while some lawsuits claimed that statement was false advertising.

I opted out of that lawsuit in writing, stating my opinion that a judgement against AMD in this case would have a chilling effect on future architectural developments. AMD eventually settled for $12.1m.

https://www.anandtech.com/show/14804/amd-settlement


> So by having two cores with one integer unit

IIRC they shared the floating point unit, not the integer unit. (A long time ago I had one with 8 "cores" but only 4 FPUs.)


Yeah, sharing the FPU makes a lot more sense than sharing the integer unit: the FPU, particularly in home-user applications, is not used nearly as heavily as the integer units. But AFAICT the CPU used in the article was one of these Bulldozer-based designs with shared FPUs, and it didn't sound like they tried pinning processes to specific cores to avoid two threads sharing the same FPU. The description of their code doesn't sound like it would be floating-point heavy, but it's hard to say for sure.


I was using a 6-core one until earlier this year; it worked well for compiling software, which doesn't care about floating point.


So they argued basically the FPU was a coprocessor?

Ahhh 80386SX and 80387SX live again in spirit ...


I've learned that you cannot accelerate writes to a single memory location by adding cores. Memory bandwidth is fixed when running on a single core, and when you add writes from other threads you need thread safety, which slows things down. Mutexes do not scale. If you want to saturate the memory bus, you read and write to different locations and fan out.
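
A small illustration of that fan-out point (not the parent's code): give each thread its own cache-line-padded slot and combine the slots once at the end, instead of having every thread hammer one shared location.

    // Sketch: per-thread counters, padded to avoid false sharing, combined once
    // at the end, instead of all threads updating one shared word.
    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <thread>
    #include <vector>

    struct alignas(64) PaddedCounter {   // 64 = typical cache line size
        std::uint64_t value = 0;
    };

    int main() {
        const unsigned n_threads = std::max(1u, std::thread::hardware_concurrency());
        const std::uint64_t iters = 10'000'000;
        std::vector<PaddedCounter> counters(n_threads);
        std::vector<std::thread> threads;

        for (unsigned t = 0; t < n_threads; ++t) {
            threads.emplace_back([&counters, t, iters] {
                for (std::uint64_t i = 0; i < iters; ++i)
                    counters[t].value++;          // each thread writes its own line
            });
        }
        for (auto& th : threads) th.join();

        std::uint64_t total = 0;
        for (const auto& c : counters) total += c.value;  // gather once
        std::printf("total = %llu\n", (unsigned long long)total);
        return 0;
    }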

I am working on a multithreaded barrier which does mass synchronization on a schedule, a rhythm.

It can send ~169 million messages a second across 10 barrier threads and ingest 63 million events a second from 3 external threads.

Your algorithm has to be redesigned to support this style of programming. Even message passing can be slow due to context switches. But bulk buffer processing is fast.
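
This is not the parent's implementation, but C++20's std::barrier gives the general shape: threads batch events into private buffers, everyone arrives at the barrier on a fixed rhythm, and the buffers are processed in bulk by the completion step.

    // Sketch (C++20): threads batch work into private buffers and synchronize on
    // a barrier; the buffers are drained in bulk in the completion step rather
    // than message-by-message. Illustrative only.
    #include <barrier>
    #include <cstddef>
    #include <cstdio>
    #include <thread>
    #include <vector>

    constexpr int kThreads = 4;
    constexpr int kRounds = 3;
    std::vector<std::vector<int>> buffers(kThreads);

    int main() {
        // Completion function runs once per round, after all threads arrive.
        std::barrier sync_point(kThreads, []() noexcept {
            std::size_t total = 0;
            for (auto& b : buffers) { total += b.size(); b.clear(); }
            std::printf("processed %zu events in bulk\n", total);
        });

        std::vector<std::thread> threads;
        for (int t = 0; t < kThreads; ++t) {
            threads.emplace_back([&sync_point, t] {
                for (int round = 0; round < kRounds; ++round) {
                    for (int i = 0; i < 1000; ++i)
                        buffers[t].push_back(i);       // batch locally, no locks
                    sync_point.arrive_and_wait();      // the "rhythm": all sync here
                }
            });
        }
        for (auto& th : threads) th.join();
        return 0;
    }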


The trick with scalable threading is to make the behaviour as independent as possible, including memory allocations. jemalloc has been something worth looking at.

It’s better to follow the scatter/gather model of MPI (don’t use MPI itself) within the same process, and to do as much amortisation in each thread as possible before collation.

When it comes to I/O (with spinning rust), a measured rule is to use twice as many I/O threads as cores to account for seek time, and to use big writes/reads rather than small ones. This can saturate the disks, so do measure it.

I spent about ten years worrying about these kinds of issues on enterprise storage in C++.
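
A minimal sketch of the scatter/gather-in-one-process idea above, using std::async: each task reduces its own chunk, amortising the work locally, and only the small partial results are collated at the end. Illustrative only.

    // Sketch: scatter/gather inside one process. Each task reduces its own chunk
    // (amortising locally); only the per-thread partial sums are gathered.
    #include <algorithm>
    #include <cstddef>
    #include <cstdio>
    #include <future>
    #include <numeric>
    #include <thread>
    #include <vector>

    int main() {
        std::vector<double> data(1'000'000, 1.0);
        const unsigned n = std::max(1u, std::thread::hardware_concurrency());
        const std::size_t chunk = data.size() / n;

        std::vector<std::future<double>> partials;
        for (unsigned t = 0; t < n; ++t) {
            auto begin = data.begin() + t * chunk;
            auto end = (t + 1 == n) ? data.end() : begin + chunk;
            // Scatter: each task works only on its own slice of the data.
            partials.push_back(std::async(std::launch::async,
                [begin, end] { return std::accumulate(begin, end, 0.0); }));
        }

        double total = 0.0;
        for (auto& f : partials) total += f.get();   // gather the small results
        std::printf("total = %f\n", total);
        return 0;
    }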


Would snmalloc (https://news.ycombinator.com/item?id=37851210) help with scalable threading or not? It claims to be better at allocating memory on a producer and freeing on a consumer thread, and "Freeing memory in a different thread to initially allocated it, does not take any locks and instead uses a novel message passing scheme to return the memory to the original allocator, where it is recycled. This enables 1000s of remote deallocations to be performed with only a single atomic operation enabling great scaling with core count."


I’ll offer that if you have a long-lasting thread, you might consider allocating big blocks to it and giving it a “pool” it can grab memory from. In C++ you can use a shared pointer to release memory back to the pool.

This avoids contention in user space. It also reduces fragmentation. You can also bound the memory usage by blocking until memory is free.

If memory serves, Boost C++ has some code to help there, though I did it myself.
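
A rough sketch of that pattern (illustrative only, not the Boost version): the pool is carved out of one big up-front allocation, and a shared_ptr custom deleter returns blocks to the free list instead of freeing them. The pool is assumed to outlive every block it hands out and to be used only by its owning thread.

    // Sketch: per-thread pool carved from one big allocation; a shared_ptr
    // deleter returns blocks to the free list instead of freeing them.
    // No locks are needed as long as only the owning thread uses the pool.
    #include <cstddef>
    #include <memory>
    #include <vector>

    class BlockPool {
    public:
        BlockPool(std::size_t block_size, std::size_t block_count)
            : block_size_(block_size), storage_(block_size * block_count) {
            for (std::size_t i = 0; i < block_count; ++i)
                free_list_.push_back(storage_.data() + i * block_size);
        }

        // Hand out a block; the deleter pushes it back onto the free list.
        // (The pool must outlive every block it hands out.)
        std::shared_ptr<std::byte> acquire() {
            if (free_list_.empty()) return nullptr;  // or block/grow to bound usage
            std::byte* p = free_list_.back();
            free_list_.pop_back();
            return std::shared_ptr<std::byte>(p, [this](std::byte* q) {
                free_list_.push_back(q);             // recycle, don't free
            });
        }

    private:
        std::size_t block_size_;
        std::vector<std::byte> storage_;             // the one big allocation
        std::vector<std::byte*> free_list_;
    };

    int main() {
        BlockPool pool(4096, 64);        // 64 blocks of 4 KiB each
        auto block = pool.acquire();     // use block.get() as scratch memory
        // ...when the last shared_ptr copy drops, the block returns to the pool
        return 0;
    }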


It’s a shame that they don’t compare against mimalloc, which is another Microsoft project, and it’s unclear which tcmalloc they are comparing against (the gperftools one is stale and performs worse than the current standalone release).


But why?


Perhaps cache effects. If one instance has a relatively large memory footprint, then two instances will compete for L3 cache. One moment they may share some common memory, or just not use much memory, and run fast; the next moment they will work on distinct memory sets, which causes active cache thrashing. Different runs of several instances will interact a bit differently each time, resulting in big differences in run times.

On their system, two instances work fine because they have two CPUs, and therefore two L3 caches.

Their Opterons actually have an unusual cache hierarchy by modern standards: the L2 and L1i caches are also shared within each pair of cores. The L3 is shared between all cores, as usual.


I think this might have been a compiler or runtime bug. It sounds from the write-up like they found it baffling that the answer changed with the number of cores.



