Testing AMD's Bergamo: Zen 4c (chipsandcheese.com)
158 points by latchkey 5 months ago | 100 comments



The actual title seems to be "Testing AMD’s Bergamo: Zen 4c Spam", which I really like because from the perspective of 20 years ago this would feel a bit like "spam" or a CPU-core "Zergling rush".

As I said before, I do believe that this is the future of CPU cores. [1] With RAM latency not really having kept pace with CPUs, having more performant cores really seems like a waste. In a cloud setting where you always have some work to do, it seems like simpler cores, but more of them, is really the answer. It's in that environment that the weight of x86's legacy will catch up with us and we'll need to get rid of all the transistors wasted decoding cruft.

https://news.ycombinator.com/item?id=40535915


I largely agree with you, but funnily enough the very same blog has a great post on the x86 decoding myth https://chipsandcheese.com/2021/07/13/arm-or-x86-isa-doesnt-...


I'm not sure I agree with that. The unknown length of instructions in x86 does make decoders more power hungry. There's no doubt about that. That is a big problem for the power efficiency of x86, and the blog doesn't address it at all.

M1 really is a counterexample to all the theory Jim is putting forward. The real proof would be if the same results were also reproduced on M1 instead of Zen.


> The unknown length of instructions in x86 does make decoders more power hungry. There's no doubt about that.

I have doubts about that. I-cache word lines are much larger than instructions anyway, and it was the reduction in memory fetch operations that made THUMB more energy-efficient on ARM (and even there, there's lots of discussion on whether that claim holds up). And if you're going for fixed-width instructions then many instructions will use more space than they use now, reducing the overall effectiveness of the I-cache.

So even if you can prove that a fixed-size decoder uses less power, you will still need to prove that that gain in decoder power efficiency is greater than the increased power usage due to reduced instruction density and accompanying loss in cache efficiency.


It's the width of multiplexing that has to go on between having a fetch line and extracting a single instruction. As an instruction can start at many different locations, you need to be able to shift down from all those positions.

That's not too bad for the first instruction in a line, but the second instruction is dependent on how the first instruction decodes, and the third dependent on the second. Etc. So it's not only a big multiplexer tree, but a content-dependent multiplexer tree. If you're trying to unpack multiple instructions each clock (of course you are, you've got six schedulers to feed) then that's a big pile of logic.

Even RISC-V has this problem, but there they've limited it to two sizes of instruction (2 and 4 bytes), and the size is in the first 2 bits of each instruction (so no fancy decode needed)
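
As a rough sketch of the difference (hypothetical helper names, not any real decoder): finding instruction N in an x86 fetch line is a serial chain, while a RISC-V decoder with the C extension only needs the low two bits of each instruction.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical helper (declaration only): a full x86 length decode has to look
     * at prefixes, opcode, ModRM, SIB, displacement and immediate bytes. */
    size_t x86_insn_length(const uint8_t *bytes);

    /* The offset of instruction i depends on the decoded length of instruction i-1,
     * so locating several instructions per cycle needs a content-dependent mux tree. */
    size_t x86_nth_insn_offset(const uint8_t *fetch_line, size_t n)
    {
        size_t off = 0;
        for (size_t i = 0; i < n; i++)
            off += x86_insn_length(fetch_line + off);
        return off;
    }

    /* RISC-V with the C extension: 16-bit instruction unless the low two bits are 11. */
    size_t rv_insn_length(uint16_t first_halfword)
    {
        return (first_halfword & 0x3) == 0x3 ? 4 : 2;
    }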


Jim was involved in the early versions of Zen & M1, I believe he knows.

Apple's M series looks very impressive because typically, at launch, they are a node ahead of the competition; early access deals with TSMC are the secret weapon, and this buys them about 6 months. They are also primarily laptop chips; AMD has competitive technology but always launches the low-power chips after the desktop & server parts.


> so early access deals with TSMC is the secret weapon this buys them about 6 months

Aren't Apple typically 2 years ahead? M1 came out in 2020; other CPUs on the same node level (5 nm TSMC) came out in 2022. If you mean Apple launches theirs 6 months ahead of when the rest of the industry gets on the previous node, sure, but not the current node.

What you are thinking of is maybe that AMD 7nm is comparable to Apple 5nm, but really what you should compare is today's AMD CPUs with the Apple CPU from 2022, since they are on the same architecture.

But yeah, all the impressive bits about Apple performance disappear once you take architecture into account.


> but really what you should compare is todays AMD cpus with the Apple cpu from 2022, since they are on the same architecture

There only seem to be comparisons between laptop CPUs, which are quite limited.


Same node. Not same architecture.


Not necessarily. Qualcomm just released its Windows chips, and in the benchmarks I've seen, it loses to the M1 in power efficiency despite being built on a more advanced node, performing much closer to the Intel and AMD offerings.

Apple is just that good.


I agreed with you up until the x86 comment. Legacy x86 support is largely a red herring. The constraints are architectural (as you noted, per-core memory bandwidth, plus other things) more than they are due to being tied down to legacy instruction sets.


If the goal ends up being many-many-core, x86's complexity tax may start to matter. The cost of x86 compatibility relative to all the complexities required for performance has been small, but if we end up deciding that memory latency is going to kill us and we can't keep complex cores fed, then that is a vote for simpler architectures.

I suspect the future will be something between these extremes (tons of dumb cores or ever-more-complicated cores to try and squeeze out IPC), though.


The x86 tax ALREADY matters. Intel was only able to increase the number of decoders by adding entirely independent decoder complexes while reducing the number of decoders per branch.

In practice, this means that decode increases for branchy code, but non-branchy code will be limited to just 3 decoders. In contrast, an ARM X4 or Apple M4 can decode 10 instructions under all conditions.

This also plays into ideas like Larrabee/Knights processors where you basically want the tiniest core possible attached to a massive SIMD engine. x86 decode eats up a lot of real estate. Even worse, x86 decode adds a bunch of extra stages, which in turn increases the size of the branch predictor.

That's not the biggest issue though. Dealing with all of this and all the x86 footguns threaded throughout the pipeline slows down development. It takes more designers and more QA engineers more time to make and test everything. ARM can develop a similar CPU design for a fraction of the cost compared to AMD/Intel, and in less time too, because there are simply fewer edge cases they have to work with. This ultimately means the ARM chips can be sold for significantly less money or higher margins.


Then show me the low-cost ARM version of AMD's 96-core Threadripper.


Ampere One was supposed to be out by now.


In the talk given by the lead architect for Skymont, they implied it could decode 9 under all conditions, not just when the code is branchy.


The fun thing with branch predictors is that they tell you where the next branch is (among other things like the direction of the branch). Since hardware is built out of finite wires, the prediction will saturate to some maximum distance (something in the next few cache lines).

How this affects decode clusters is left as an exercise to the reader.


> In contrast, an ARM X4 or Apple M4 can decode 10 instructions under all conditions.

Not if there are fewer than 10 instructions between branches...


Each core needs to handle the full complexity of x86. Now, as super-scalar OoO x86 cores have evolved, the percentage of die allocated to decoding the cruft has gone down.

…but when we start swarming simple cores, that cost starts to rise. Each core needs to be able to decode everything. Now when you have 100 cores, even if the cruft is just 4%, that means you can have 4 more cores. This is for free if you are willing to recompile your code.

Now, it may turn out that we need more decoding complexity than something like RISC-V currently has (Qualcomm has been working on it), but these will be deliberate, intentional choices instead of accrued ones, meeting the needs of today and current trade-offs, and not of the early '80s.


As a developer of fairly standard software, there's very little I can say I rely on from the x86/x64 ISA.

One big one is probably the memory consistency model[1], which affects atomic operations and synchronizing multi-threaded code. Usually not directly though; I typically use libraries or OS primitives.
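
As a minimal sketch of what that means in practice (toy C11 example, not tied to any particular library): a flag-and-data handoff that happens to work on x86 because of its strong ordering needs explicit release/acquire to stay correct on ARM.

    #include <stdatomic.h>

    int        data;            /* payload written before the flag */
    atomic_int ready;           /* flag the consumer spins on      */

    void publish(void)          /* producer thread */
    {
        data = 42;
        /* With memory_order_relaxed this tends to "work" on x86, whose hardware
         * doesn't reorder stores with stores, but it is a genuine bug on ARM.
         * Release/acquire makes the handoff correct on both. */
        atomic_store_explicit(&ready, 1, memory_order_release);
    }

    int consume(void)           /* consumer thread */
    {
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;                   /* spin until the flag is published */
        return data;            /* guaranteed to observe 42 */
    }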

Are there any non-obvious (to me anyway) ways us "typical devs" rely on x86/x64?

I get the sense that a lot of software is one recompile away from running on some other ISA, but perhaps I'm overly naive.

[1]: https://en.wikipedia.org/wiki/Consistency_model


> Are there any non-obvious (to me anyway) ways us "typical devs" rely on x86/x64?

Generally the answer is "we bought this product 12 years ago and it doesn't have an ARM version". Or variants like "We can't retire this set of systems which is still running the binary we blessed in this other contract".

It's true that no one writing "fairly standard software" is freaking out over the inability to work on a platform without AVX-VNNI, or deals with lockless algorithms that can't be made feasibly correct with only acquire/release memory ordering semantics. But that's not really where the friction is.


Yea was just trying to check for a blind spot. In these cloudy days, it seems nearly trivial for a lot of workloads, but like I said perhaps I had missed something.

For us the biggest obstacle is that our compiler doesn't support anything but x86/x64. But we're moving to .Net so that'll solve itself.


A lot of systems are “good enough” and run flawlessly for years/decades so unless you have a good business case you won’t be able to move from x86 to ARM or the new RISC open stuff because the original system cost a couple million dollars.


"good enough" but made a decade ago would run fine in an emulator, while much more instrumentalized and under control than if running directly on hardware.


That was true through the 90's, but not anymore. A typical datacenter unit from 2014 would have been a 4-socket Ivy Bridge-EP with 32ish ~3 GHz cores. You aren't going to emulate that feasibly with equivalent performance on anything modern, not yet.


Cycle exact? Sure. But what are the odds you need that for some x86 software made in 2014.

Otherwise, via translation techniques, getting 80% of native performance isn't unheard of. Which would be very fast relative to any such Ivy Bridge server.

Transitioning away from x86 definitely is feasible, as very successfully demonstrated by Apple.


> Legacy x86 support

... is slowly disappearing. Even on Windows 10 it is very hard to run Win32 programs from the Win95/Win98 era.


People have been saying this for 20ish years (or probably much longer): more, simpler cores are the future.

In my experience, people just don't know how to build multi-threaded software and programming languages haven't done all that much to support the paradigm.

Multi threading is still the domain of gnarly bugs, and specialists writing specialist software.

The only kind of forward-looking thing I've seen in this area is the Bend language, which was making strides a couple of months ago.

And besides all that, Amdahl's law still exists - if 10% of the program cannot be parallelized, you're going to get a 10x speedup at most. Every consumer grade chip tends to have that many cores already.
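
That cap is easy to eyeball from the formula itself (trivial sketch; p is the parallelizable fraction):

    #include <stdio.h>

    /* Amdahl's law: speedup on n cores when a fraction p of the work parallelizes. */
    static double amdahl_speedup(double p, int n)
    {
        return 1.0 / ((1.0 - p) + p / n);
    }

    int main(void)
    {
        /* 10% serial work (p = 0.9): even with unlimited cores the cap is 10x. */
        printf("16 cores:  %.2fx\n", amdahl_speedup(0.9, 16));   /* ~6.4x */
        printf("192 cores: %.2fx\n", amdahl_speedup(0.9, 192));  /* ~9.6x */
        return 0;
    }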


> In my experience, people just don't know how to build multi-threaded software and programming languages haven't done all that much to support the paradigm.

Go? Rust? Any functional language?


Go just turned a bunch of library constructs (green threads, queues) into language keywords, many languages have the same stuff with a bit more involved syntax.

Rust, in my opinion, is the biggest admission of failure of modern multithreaded thinking, with classes like 'X but single threaded' and 'X but slower but works with multiple threads', requiring a complex rewrite for the code to even compile. It's moving all the mental burden of threading to the programmer.

CPUs have the ability to automatically execute non-dependent instructions in parallel, at the finest granularity. Yet if we look at a higher level, at the level of functions and operational blocks of a program, there's almost nothing production grade that can say: hey, you sort array A and B and then compare them, so let's run these two sorts in parallel.
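
To make that concrete (a plain-C sketch, not any real framework): even when the two sorts are obviously independent, the programmer has to spell the parallelism out by hand today.

    #include <pthread.h>
    #include <stdlib.h>

    static int cmp_int(const void *a, const void *b)
    {
        int x = *(const int *)a, y = *(const int *)b;
        return (x > y) - (x < y);
    }

    struct sort_job { int *arr; size_t n; };

    static void *sort_worker(void *p)
    {
        struct sort_job *job = p;
        qsort(job->arr, job->n, sizeof(int), cmp_int);
        return NULL;
    }

    /* Nothing discovers that the two sorts don't depend on each other; the thread
     * creation, argument passing and join are all on the programmer. */
    int sorted_arrays_equal(int *a, int *b, size_t n)
    {
        pthread_t t;
        struct sort_job job_a = { a, n };

        pthread_create(&t, NULL, sort_worker, &job_a);  /* sort A on a second thread */
        qsort(b, n, sizeof(int), cmp_int);              /* sort B on this one        */
        pthread_join(t, NULL);

        for (size_t i = 0; i < n; i++)
            if (a[i] != b[i])
                return 0;
        return 1;
    }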


Writing multithreaded programs does not mean that people know how to do this.

Just fire up the Windows Process Explorer and look at the CPU graphs.


> Multi threading is still the domain of gnarly bugs, and specialists writing specialist software.

1) In the cloud there are always more requests to serve. Each request can still be serial. 2) Stuff like reactive streams allows for parallelisation. Independent threads acquiring locks will forever be difficult, but there are other models that are easier and getting adopted.


> Multi threading is still the domain of gnarly bugs, and specialists writing specialist software.

It is not even there. Windows (7, 10) has difficulties splitting jobs between I/O and the processor. Simulations take hours because of this and because Windows likes to migrate tasks from one core to another.


I haven't written low-level code for Windows for a while, but I recall that all Windows I/O since the NT days has been asynchronous at the kernel level. An I/O thread is purely a user-space construct.

In Linux, I/O threads are real, with true asynchronous I/O only being recently introduced with io_uring.


> As I said before, I do believe that this is the future of CPUs core

It is not. Or at least not the future, singular. Many applications still favor strong single-core performance, which means in, say, a 64-core CPU, ~56 (if not more) of them will be twiddling their thumbs.

> It's in the environment that the weight of x86's legacy will catch up with us and we'll need to get rid of all the waste transistors decoding cruft.

This very same site has a well-known article named “ISA doesn’t matter”. As noted though, with many-core, having to budget decoder silicon/power might start to matter enough.


Why does everyone keep repeating this mantra? I wrote the x86 decoder for https://github.com/jart/blink which is based off Intel's xed disassembler. It's so tiny to do if you have the know-how to do it.

    master jart@studio:~/blink$ ls -hal o/tiny/blink/x86.o
    -rw-r--r-- 1 jart staff 23K Jun 22 19:03 o/tiny/blink/x86.o
Modern microprocessors have 100,000,000,000+ transistors, so how much die space could 184,000 bits for x86 decoding really need? What proof is there that this isn't just some holy war over the user-facing design? The stuff that actually matters is probably just memory speed and other chip internals, and companies like Intel, AMD, NVIDIA, and ARM aren't sharing that with us. So if you think you understand the challenges and tradeoffs they're facing, then I'm willing to bet it's just false confidence and peanut gallery consensus, since we don't know what we don't know.


Decoding 1 x86 instruction per cycle is easy. That was solved like 40 years ago.

The problem is that superscalar CPU needs to decode multiple x86 instructions per cycle. I think latest Intel big core pipeline can do (IIRC) 6 instructions per cycle, so to keep the pipeline full the decode MUST be able to decode 6 per cycle too.

If it's ARM, it's easy to do multiple decode. M1 does (IIRC) 8 per cycle easily, because the instruction length is fixed. So the first decoder starts at PC, the second starts at PC+4, etc. But x86 instructions are variable length, so after the first decoder decodes the instruction at IP, where does the second decoder start decoding?


It isn't quite that bad. The decoders write stop bits back into the L1i, to demarcate where the instructions align. Since those bits aren't indexed in the cache and don't affect associativity, they don't really cost much. A handful of 6T SRAMs per cache line.


I would have assumed it just decodes the x86 into a 32-bit ARM-like internal ISA, similar to how a JIT works in software. x86 decoding is extremely costly in software if you build an interpreter. Probably like 30%, maybe, and that's assuming you have a cache. But with JIT code morphing in Blink, decoding cost drops to essentially nothing. As best as I understand it, all x86 microprocessors since the NexGen Nx586 have worked this way too. Once you're code morphing the frontend user-facing ISA, a much bigger problem rears its ugly head, which is the 4096-byte page size. That's something Apple really harped on with their M1 design, which increased it to 16 KB. It matters since morphed code can't be connected across page boundaries.


It decodes to uOPs optimized for the exact microarchitecture of that particular CPU. High performance ARM64 designs do the same.

But in the specific case of tracking variable length instruction boundaries, that happens in the L1i cache. uOP caches make decode bandwidth less critical, but it is still important enough to optimize.


That's called a uOP cache, which Intel has been using since Sandy Bridge (and AMD too, though I don't remember off the top of my head since when). But that's more transistors for the cache and its control mechanism.


It's definitely better than what NVIDIA does, inventing an entirely new ISA each year. If the hardware isn't paying the cost for a frontend, then it shovels the burden onto software. There's a reason every AI app has to bundle a 500mb matrix multiplication library in each download, and it's because GPUs force you to compile your code ten times for the last ten years of ISAs.


Part of it is that, but part of it is that people pay for getting from 95% optimal to 99% optimal, and doing that is actually a lot of work. If you peek inside the matrix multiplication library you'll note that it's not just "we have the best algorithm for the last 7 GPU microarchitectures" but also 7 implementations for the latest architecture, because that's just what you need to go fast. Kind of like how if you take an uninformed look at glibc memcpy you'll see there is an AVX2 path and an ERMS path, but also that it will switch between algorithms based on the size of the input. You can easily go "yeah my SSE2 code is tiny and gets decent performance", but if you stop there you're leaving something on the table, and with GPUs it's this but even more extreme.
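
A toy version of that kind of dispatch (stand-in names and bodies, nothing from glibc itself): pick an implementation once from CPU feature flags, and still branch on size at every call.

    #include <stddef.h>
    #include <string.h>

    /* Stand-in bodies: the real variants are hand-tuned assembly (AVX2 loops,
     * "rep movsb" for ERMS, and so on); here they just forward to memcpy. */
    static void *copy_scalar(void *d, const void *s, size_t n) { return memcpy(d, s, n); }
    static void *copy_avx2(void *d, const void *s, size_t n)   { return memcpy(d, s, n); }
    static void *copy_erms(void *d, const void *s, size_t n)   { return memcpy(d, s, n); }

    static void *(*copy_impl)(void *, const void *, size_t) = copy_scalar;

    /* Chosen once at startup based on CPUID-style feature flags. */
    void copy_init(int has_avx2, int has_erms)
    {
        if (has_erms)      copy_impl = copy_erms;
        else if (has_avx2) copy_impl = copy_avx2;
    }

    /* Even within one chosen path, the best strategy depends on the copy size. */
    void *fast_copy(void *d, const void *s, size_t n)
    {
        if (n < 128)
            return copy_scalar(d, s, n);  /* tiny copies: setup cost dominates */
        return copy_impl(d, s, n);
    }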


Using the uOPs directly as the ISA would be a bad idea for code density. In RISC-V land, vendors tend to target standard extensions/profiles, but when the hardware is capable of other operations they often expose those through custom extensions.


IMO if the trade off is cheaper, faster hardware iteration then Nvidia’s strategy makes a lot of sense.


Chips and Cheese specifically talks about this in the article I mention[0].

x86 decoders take a tiny but still significant silicon and power budget, usually somewhere between 3-7%. Not a terrible cost to pay, but if legacy is your only reason, why keep doing so? It’s extra watts and silicon you could dedicate to something else.

[0] https://chipsandcheese.com/2021/07/13/arm-or-x86-isa-doesnt-...


But decoders for e.g. ARM are not free either, right? Or am I misunderstanding something?


Correct. However because ARM has fixed-length instructions the decoder can make more assumptions, keeping the decoder simpler.

Like I said, it's only a small amount of extra silicon you're paying the x86 tax with, but with the world mostly becoming ARM-compatible, there's no more reason to pay it.


> with many-core, having to budget decoder silicon/power might start to matter enough

That seems backwards to me. Narrower, simpler cores with fewer execution engines have a much easier time decoding. It's the combinatorics of x86's variable length instructions and prefix coding that makes wide decoders superlinearly expensive.


I apologize for removing the word spam (and apologized to C&C directly as well). I mistook it as a mistake on their part since the word "spam" was not used anywhere else in the article. They put it in there as an assumption that people would just get it and I did not. My bad!


This idea was explored over a decade ago, in the context of cloud computing: https://www.cs.cmu.edu/~fawnproj/


And almost a decade earlier, in the context of private cloud, was implemented in Sun's Niagara line of processors: https://en.wikipedia.org/wiki/UltraSPARC_T1

The UltraSPARC T1 was designed from scratch as a multi-threaded, special-purpose processor, and thus introduced a whole new architecture for obtaining performance. Rather than try to make each core as intelligent and optimized as they can, Sun's goal was to run as many concurrent threads as possible, and maximize utilization of each core's pipeline.


Something I've realised is that we're entering the era of "kilocores", where we start counting cores by the thousands, much like the early computers had kilo-words of memory. Soon... mega-cores, then giga-cores, and on!


Hate to burst your bubble, but with the end of Moore's law this seems unlikely to come to pass.


There are 256-core and 288-core server processors from AMD and Intel respectively about to ship this year. If you count hyper-threads as a "virtual core", and count all such vCPUs in a box, then we're up to 1,024 or 1,152 already. That is the number you'll see in 'top' or Task Manager, and that's what matters to running software.

Also worth noting that a high-end GPU already has 16K cores, although the definition of a "core" in that context isn't as clear-cut as with normal CPUs.

These server CPUs are still being made with 5nm or 4nm technology. Sure, that's just a marketing number, not a physical size, but the point is that there are already firm plans from both Intel and TSMC to at least double the density compared to these current-gen nodes. Another doubling is definitely physically possible, but might not be cost effective for a long time.

Still, I wouldn't be surprised to see 4K vCPUs in a single box available in about a decade in the cloud.

After that? Maybe 3D stacking or volumetric manufacturing techniques will let us side-step the scale limits imposed by the atomic nature of matter. We won't be able to shrink any further, but we'll eventually figure out how to make more complex structures more efficiently.


Yes, but that's still a far cry from a million cores. Unless we change the power requirements and heat generation fundamentally, I don't see how we could get to a point where we have a million CPU cores that look anything like we have today (you can get there with more limited cores, but my impression of OP's comment was that they would be like today's cores).


A million cores won't look like a bigger Xeon or EPYC in a socket.

It'll be a combination of nascent technologies on the cusp of viability.

First, something like this will have to be manufactured with a future process node about 2-3 generations past what is currently planned. Intel has plans in place for "18A", so we're talking something like "5A" here, with upwards of 1 trillion transistors per chip. We're already over 200 billion, so this is not unreasonable.

Power draw will have to be reduced by switching materials to something like silicon-carbide instead of pure silicon.

Then this will have to use 3D stacking of some sort and packaging more like DIMMs instead of a single central CPU. So dozens of sockets per box with much weaker memory coherency guarantees. We can do 8-16 sockets with coherent memory now, and we're already moving towards multiple chiplets on a board that is itself a lot like a large chip with very high bandwidth interconnect. This can be extended further, with memory and compute interleaved.

Some further simplifications might be needed. This might end up looking like a hybrid between GPUs and CPUs. An example might be the IBM POWER server CPUs, some of which have 8 hyper-threads per physical core. Unlike POWER, getting to hundreds of kilocores or one megacore with general-purpose compute might require full-featured but simple cores.

Imagine 1024 compute chiplets, each with 64 GB of local memory layered on top. Each chiplet has 32 simple cores, and each core has 8 hyper threads. This would be a server with 64 TB of memory and 256K vCPUs. A single big motherboard with 32 DIMM-style sockets each holding a compute board with 32 chiplets on it would be about the same size as a normal server.


The article says "AMD’s server platform also leaves potential for expansion. Top end Genoa SKUs use 12 compute chiplets while Bergamo is limited to just eight. 12 Zen 4c compute dies would let AMD fit 192 cores in a single socket".

It should be noted that the successor of Bergamo, Turin dense, which is expected by the end of the year, will have 12 compute chiplets, for a total of 192 Zen 5c cores, bringing thus both more cores and faster cores.


192 cores per socket? If so, that's pretty wild.


I wonder if these large cache sizes make it possible to operate functionally RAM-less servers? When optimized for size, it'd probably be possible to fit some microservices in a MB or 8 of RAM. You can fit a decent bit of per-connection client state in a few KB, which can take up the rest of the cache. Throw in the X3D cache for 96 MB per CCD and you can actually do some pretty serious stuff.

With the right preprocessing, it should be possible to essentially stream from NIC straight to CPU, process some stuff, and output it straight to the NIC again without ever touching RAM - although I doubt current DMA hardware allows this. It'd require quite a bit of re-engineering of the on-wire protocol to remove the need for any nontrivially-sized buffers, but I reckon it isn't impossible.


> With the right preprocessing, it should be possible to essentially stream from NIC straight to CPU, process some stuff, and output it straight to the NIC again without ever touching RAM

this already exists [0]

[0] https://www.intel.com/content/dam/www/public/us/en/documents...


I wasn't aware of that, thanks for sharing!


Lots of older or low-end wireless routers have only 16 or 32 MB of RAM. A full Linux stack fits in that. I do a full HTTPS stack on an ESP32-C3 in ~500k. Lots of limitations, of course. Probably the biggest difficulty is that most modern systems simply refuse to boot without RAM installed, so you can't choose any platform without BIOS/EFI modifications. Definitely a worthwhile idea though.


The difference is that those low-end routers offload most or even all of the packet processing to dedicated networking hardware: it never even touches the CPU. The whole Linux stack is only used to host the management interface, configure the switching hardware, and deal with high-level stuff like DNS and DHCP. That's also why VPN performance is so horribly bad: there's no hardware acceleration for encryption, so it needs to be done on the CPU, which is not sized to deal with that kind of traffic.


Offloading the packet processing doesn't really save any RAM at all. The limitations you're describing come from running on a low-power ~700 MHz MIPS or ARM core with very low performance. As you said, using a VPN or adding additional routing rules requires the packets to pass through the CPU, which demonstrates that all the software and buffers for doing the routing on the CPU are present within the 16 or 32 MB of RAM. It's just attached to a very low-performance CPU core.

It's perfectly possible to load the same OpenWRT release on a modern x86 and run in the same amount of ram with much higher performance. That's how I do my routing on a 1gbit/s synchronous fiber connection and I can do line-speed wireguard just fine.

Distributions like OpenWRT require so little RAM because they select very conservative options when compiling everything, are very selective about what they include, build against minimal C libraries like musl libc, make use of highly integrated system utilities like busybox, and strip things like debugging symbols from the resulting binaries.


This is theoretically true, but I think you'd have to get a lot of support from AMD and be responsible for implementing most of a traditional computer yourself. I have a feeling it'd be roughly as feasible to do it all with fully custom hardware.

Otoh if you just mean try to write software to minimize the need to hit memory, that's totally reasonable -- and what you have to do if you want the best performance.


But in that case, wouldn't it be more efficient to put a processing core on/near the NIC, and do simple responses without even bothering the main CPU? I mean, if your "process some stuff" doesn't need data from RAM, it doesn't need to involve the main CPU either.


Pretty much, yeah! But the only NICs with compute that I'm aware of use quite expensive FPGAs. Being able to use off-the-shelf hardware would be far cheaper, far easier to develop for, and far more flexible.


Intel definitely has cache injection for DMA; I'm not sure about AMD. With DPDK you should be able to process packets completely in the L3.


I remember as part of the TBB library release, Intel remarked that 100 Pentium cores had the same transistor count as a Core 2. Took a while, but we're starting to turn the corner on slower and wider becoming more common.


Reminds me of the ill-fated Larrabee, which was supposed to be something like 50 Pentium-like cores in a GPU-shaped trenchcoat.


Larrabee was launched as the first-generation Knights-series chip (it even had the GPU-specific stuff still in there, disabled, IIRC).

I believe the Knights-series chips ran into a bunch of issues and politics with Knights Mill, and were then killed in favor of their upcoming GPU architecture, only for that to have issues and be delayed too.

I suspect that a RISC-V version of Larrabee would work better, as the ratio of SIMD to the rest of the core would be a lot better.


A dual-issue in-order RISC-V isn't enough smaller than a P54c to make a significant difference. The overwhelming majority of Larrabee's area went to SIMD units and caches.


Do you have any evidence for this statement? Their actual core wasn’t a Pentium. It added in most/all the x86 bloat from the last 3 decades too.

Cutting all that stuff and saving power means you can squeeze in a couple more cores and improve performance that much more. 32 registers are a significant advantage too.

RVV also allows wider vectors than AVX. Without reordering, you can get stronger performance guarantees from a single, wide SIMD than from two narrow SIMD (it’s also possible to look at the next couple instructions and switch between executing one at full width or two at half width each without massive investment).

And we haven’t even reached peak advantage yet. RISCV has the option for 48/64 bit instructions with loads of room to add VLIW instructions. While most VLIW implementations require everything to be VLIW, this would be optional in RISCV. Your core could be dual issuing two wide VLIW allowing a single thread to theoretically use four vector units. If you add in 4-8 wide SMT, the result is something very GPU-like while still using standard RISCV.

Maybe x86 can do this too, but I don’t think it would be anywhere near as efficient and it would certainly eat up a lot of the precious remaining opcode space.


RVV is not exactly in great shape, as an ISA extension itself and in its support, to be competitive.


RVV was only ratified Nov 2021 (not too far from the release of SVE2). It shouldn’t be surprising that adoption will take some time.

The first cores using the final extension are hitting the market this year, so adoption is happening really quickly.


The problem with adding more cores is that SRAM is the real bottleneck. Adding more cores means more cache.

Until someone figures out how to do 3D stackable SRAM similar to how SSDs work, SRAM will always consume most of the area on your chip.


This isn't fill the reticle with CPU, it's make a dozen separate chips and package them on a network fabric. Amount of cache can increase linearly with number of cores without a problem.

There is some outstanding uncertainty about cache coherency vs performance as N goes up which shows up in numa cliff fashion. My pet theory is that'll be what ultimately kills x64 - the concurrency semantics are skewed really hard towards convenience and thus away from scalability.


I know next to nothing about CPU architecture so please forgive a stupid question.

Are you saying that the x86 memory model means RAM latency is more impactful than on some other architectures?

Is this (tangentially) related to the memory-ordering mode that Apple reportedly added to the M1 to emulate the x86 memory model and make emulation faster? - presumably to account for assumptions that compilers make about the state of the CPU after certain operations?


The preceding comment was about CPU cache coherency, which is the bane of symmetric multiprocessor (SMP) system design. The problem arises due to the fact that main memory is shared across processors (or CPU cores). Consider this rough sketch:

                                            [memory (shared)]
                                                    ⇕
                                           [L3 cache (shared)]
                                                    ⇕
  [core 0] ⟺ [L1 cache (private)]   ⟺   [L2 cache (shared)]   ⟺   [L1 cache (private)] ⟺ [core N]
                                                    ⇕
                                     [core 1] ⟺ [L1 cache (private)]

Each CPU (or CPU core) has its own private L1 cache that other CPUs/CPU cores do not have access to. Now, code running on CPU 0 has modified 32 bytes at address 0x1234, but the modification does not occur directly in main memory; it takes place within a cache, and the changes to the data now have to be written back into main memory. Depending on the complexity of the system design, the change has to be propagated back through a hierarchy of L2/L3/L4 caches (POWER CPUs have an L4 cache) until the main memory that is shared across all CPUs is updated.

It is easy and simple if no other CPU is trying to access the address 0x1234 at the same time – the change is simply written back and the job is done.

But when another CPU is trying to access the same 0x1234 address at the same time, while the change has not yet made it back into main memory, there is a problem: stale data reads are typically not allowed, and the other CPU/CPU core has to wait for CPU 0 to complete the write-back. Since multiple cache levels are involved in modern system designs, this is known as the cache coherency problem, and it is a very complex problem to solve in SMP designs.

It is a grossly oversimplified description of the problem, but it should be able to illustrate what the parent was referring to.
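
One classic way to feel that cost from userspace (a minimal sketch, timing left to the reader): two threads hammering counters that share one cache line keep stealing the line from each other, while padding them onto separate lines removes the coherence traffic.

    #include <pthread.h>
    #include <stdint.h>

    #define ITERS 100000000UL

    /* Both counters live in the same 64-byte cache line, so every increment by one
     * core invalidates the other core's copy of that line. */
    struct { uint64_t a, b; } same_line;

    /* Padding pushes each counter onto its own line; the threads stop interfering. */
    struct { uint64_t a; char pad[56]; uint64_t b; } padded;

    static void *bump(void *counter)
    {
        uint64_t *c = counter;
        for (unsigned long i = 0; i < ITERS; i++)
            (*c)++;
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;

        /* Swap in &padded.a / &padded.b and the same loops typically run several
         * times faster, purely because the cache-line ownership ping-pong is gone. */
        pthread_create(&t1, NULL, bump, &same_line.a);
        pthread_create(&t2, NULL, bump, &same_line.b);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }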


Thanks for the explanation, however I was more curious about why ARM would have an advantage over x86.

I think the sibling comment explains it - x86 makes memory consistency promises that are increasingly expensive to keep, suggesting that x86’s future success might be limited by how much it can scale in a single package.


I'm guessing GP's point is that x86 makes really strong memory consistency guarantees.

It makes life easier for programmers of multithreaded software, but at the cost of high synchronization overhead.

Contrast to e.g. Arm, where programmers can avoid a lot of that synchronization, but in exchange they have to be more careful.

* IIRC from some recent reading.


The real real problem is that we don't know how to effectively use growing transistor budgets to speed applications up compared to the single core era. Adding cache and more cores are both options that make for very sublinear improvements at the margin.

(for 95% of usage - some niche things can utilize resources better, those not coincidentally often being the focus of benchmarks)



It's called V-Cache.


V-cache is just L3 which is shared between cores. L1 and L2 are still within each core.


Every time I hear about a 100+ core my mind jumps to python GIL


Has been heavily worked on in python 3.12 and up, don’t count out python just yet :)


Spoiler alert: Software doesn't jump from “uses one core” to “efficiently uses 192 cores” once you remove the first lock.


Well, back in the day when somebody invented gunicorn, it was basically running one worker per core, if the software allowed it. Thus you could at least scale at the process level, which of course was super slow if you only had like 10 req/s on a 60-core machine.


Never said it did. I assume you know that at least a few people on here have a basic grasp on multiprocessors, locks, threads, asynch, etc


Yeah not actually an issue for me. I run into skill limitations with python long before 100 core issues.


My immediate thought is that Sun beat both of these guys to market with the T1 20 years ago. Not exactly the same thing, but I wonder how close to the same idea they are; the line seems to have stopped with the M8, which is a 32c/256t part from 2017.


> the line seems to have stopped with the M8, which is a 32c/256t part from 2017.

Epyc Naples was also 32c and also 2017. 4- and 8-way SMT wasn't something AMD or Intel ever bothered with, no, but IBM did with the POWER line. It seems to help in some workloads, but probably isn't generally useful enough for volume solutions.


There is an absolutely huge difference between a 32c/256t part and a 128c/256t part. And Tera's MTA predated the T1 by quite a bit with a similar (and evidently doomed) architecture...


Is the 4c core slower in any other way than L3 cache reductions?

Would be interesting to see a compute bound perfectly scaling workload and compare it in terms of absolute performance and performance per watt between Bergamo and Genoa.


Same performance per cycle; less L3 and lower maximum frequency.

Zen 4c is said to have slightly better performance per watt than Zen 4, but it's not clear how much.


> Is the 4c core slower in any other way than L3 cache reductions?

It's got a lower clock ceiling. Not much lower than achievable clocks in really dense Zen 4 Epyc, but a lot less than an 8-core Ryzen.



[flagged]


test failed successfully




