The weirdest one of the bunch is the AMD EPYC 9175F: 16 cores with 512MB of L3 cache! Presumably this is for customers trying to minimize software costs that are based on "per-core" licensing. Otherwise it really doesn't make much sense to pay that much for so few cores. Does Oracle still use this style of licensing? If so, they need to knock it off.
The only other thing I can think of is some purpose like HFT may need to fit a whole algorithm in L3 for absolute minimum latency, and maybe they want only the best core in each chiplet? It's probably about software licenses, though.
Another good example is any kind of discrete event simulation. Things like spiking neural networks are inherently single-threaded if you are simulating them accurately (i.e., serialized through the pending spike queue). Being able to keep all the state in local cache and picking the fastest core to do the job is the best possible arrangement. The ability to run 16 in parallel simply reduces the search space by the same factor. Worrying about inter-CCD latency isn't a thing for these kinds of problems. The amount of bandwidth between cores is minimal, even if we were doing something like a genetic algorithm with periodic crossover between physical cores.
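To make the "serialized through the pending spike queue" point concrete, here's a minimal event-driven sketch (Python; the neuron model, weights, and delay are made up for illustration, not any particular simulator). Every spike goes through the one time-ordered heap, so the simulation itself can't be split across cores without changing its semantics:

    import heapq

    # Toy event-driven spiking network: src -> [(dst, weight)] (numbers are illustrative)
    weights = {0: [(1, 0.6), (2, 0.8)], 1: [(2, 0.5)], 2: []}
    potential = {n: 0.0 for n in weights}
    THRESHOLD = 1.0
    DELAY = 1.0  # synaptic delay, arbitrary units

    pending = [(0.0, 0)]  # (time, neuron): the pending spike queue, the serial bottleneck
    while pending:
        t, src = heapq.heappop(pending)      # events must be processed strictly in time order
        for dst, w in weights[src]:
            potential[dst] += w              # deliver the spike
            if potential[dst] >= THRESHOLD:
                potential[dst] = 0.0         # fire: reset and schedule the outgoing spike
                heapq.heappush(pending, (t + DELAY, dst))

Running 16 of these side by side, one per core, each on a different parameter set, is embarrassingly parallel, which is the "reduces the search space" point above.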
Plenty of applications are single-threaded, and it's cheaper to spend thousands on a super fast CPU to run them as fast as possible than to spend tens of thousands on a programmer to rewrite the code to be more parallel.
And like you say, plenty of times it is infeasible to rewrite the code because it's third-party code for which you don't have the source or the rights.
A couple years ago I noticed that some Xeons I was using had as much cache as the RAM in the systems I had growing up (millennial, so we're not talking about ancient Commodores or whatever; real usable computers that could play Quake and everything).
But 512MB? That’s roomy. Could Puppy Linux just be held entirely in L3 cache?
CCDs can't access each other's L3 cache as their own (the fabric penalty is too high to do that directly). Assuming it's anything like the 9174F, that means it's really 8 groups of 2 cores that each have 64 MB of L3 cache. Still enormous, and you can still access data over the Infinity Fabric with penalties, but not quite the single block of 512 MB of cache shared by 16 cores that it might sound like at first.
Zen 4 also had 96 MB-per-CCD variants like the 9184X, so 768 MB per socket, and they are dual-socket, so you can end up with 1.5 GB of total L3 cache in a single machine! The downside being that on top of the CCD<->CCD latencies you now also have socket<->socket latencies.
Hmm. Ok, instead of treating the cache as ram, we will have to treat each CCD as a node, and treat the chip as a cluster. It will be hard, but you can fit quite a bit in 64MB.
Firmware uses cache-as-RAM (e.g. https://www.coreboot.org/images/6/6c/LBCar.pdf) to do early init, like DRAM training. I guess later stages in the boot chain probably rely on DRAM being set up, though.
I would be pretty curious about such a system. Or, maybe more practically, it might be interesting to have a system that pretends the L3 cache is RAM, and the RAM is the hard drive (in particular, RAM could disguise itself as the swap partition, so the OS would treat it as basically a chunk of RAM that it would rather not use).
> The RAMpage memory hierarchy is an alternative to a conventional cache-based hierarchy, in which the lowest-level cache is managed as a paged memory, and DRAM becomes a paging device.
So, essentially, you're just doing cache eviction in software. That's obviously a lot of overhead, but at least it gives you eviction control. However, there is very little to do when it comes to cache eviction. The algorithms are all well known and there is little innovation in that space. So baking that into the hardware is always better, for now.
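As a sense of scale for how "well known" that space is: a software LRU evictor is a handful of lines (Python, purely illustrative; not how any real hardware replacement policy is wired up), and the hardware does roughly the same bookkeeping per cache set without spending a single instruction on it:

    from collections import OrderedDict

    # Minimal "cache eviction in software" sketch: LRU over an ordered dict.
    class LRUCache:
        def __init__(self, capacity):
            self.capacity = capacity
            self.data = OrderedDict()

        def get(self, key):
            if key not in self.data:
                return None
            self.data.move_to_end(key)         # mark as most recently used
            return self.data[key]

        def put(self, key, value):
            if key in self.data:
                self.data.move_to_end(key)
            self.data[key] = value
            if len(self.data) > self.capacity:
                self.data.popitem(last=False)  # evict the least recently used entry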
Many algorithms are limited by memory bandwidth. On my 16-core workstation I've run several workloads that hit peak performance with fewer than 16 threads.
It's common practice to test algorithms with different numbers of threads and then use the optimal count. For memory-intensive algorithms the peak frequently comes at a relatively small number of cores.
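A sketch of that kind of sweep (Python; the chunk sizes and the reduction are placeholders, and it assumes the per-chunk work releases the GIL, which NumPy reductions on large arrays generally do):

    import time
    from concurrent.futures import ThreadPoolExecutor

    import numpy as np

    # Memory-bound toy workload: sum large arrays that won't fit in cache.
    chunks = [np.random.rand(10_000_000) for _ in range(16)]

    def work(chunk):
        return chunk.sum()  # touches every element once, so it's bandwidth-bound

    # Sweep thread counts and keep whichever is fastest on your machine.
    for n in (1, 2, 4, 8, 16):
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=n) as pool:
            list(pool.map(work, chunks))
        print(f"{n:2d} threads: {time.perf_counter() - start:.3f}s")

Past the point where the memory controllers are saturated, extra threads just add contention, which is where the "fewer threads is faster" result comes from.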
Is this because of NUMA or is it L2 cache or something entirely different?
I worked on high perf around 10 years ago and at that point I would pin the OS and interrupt handling to a specific core, so I'd always lose one core. Testing led me to disable hyperthreading in our particular use case, so that was "cores" (really threads) halved.
A colleague had a nifty trick built on top of Solarflare zero-copy, but at that time it required fairly intrusive kernel changes, which never totally sat well with me, and again I'd lose a second core to some bookkeeping code that orchestrated it.
I'd then taskset the app to the other cores.
NUMA was a thing by then, so it really wasn't straightforward to eke out maximum performance. It became somewhat of a competition to see who could get the highest throughput, but usually those configurations were unusable due to unacceptable p99 latencies.
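For anyone wanting to reproduce that layout, the in-process equivalent of taskset is just an affinity call. A Linux-only sketch (Python; the choice of core 0 as the housekeeping core is made up for illustration, and IRQ affinity still has to be steered separately via /proc/irq/*/smp_affinity):

    import os

    # Keep this process off the "housekeeping" core reserved for the OS and interrupts.
    HOUSEKEEPING = {0}                  # illustrative: whichever core the OS/IRQs were pinned to
    allowed = os.sched_getaffinity(0)   # CPUs this process may currently run on
    os.sched_setaffinity(0, allowed - HOUSEKEEPING)
    print("pinned to cores:", sorted(os.sched_getaffinity(0)))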
Windows Server and MSSQL are licensed per core now. A lot of enterprise software is. They switched to per-core licensing because before it was based on CPU sockets. It's not just Oracle.