AMD's Strix Point: Zen 5 Hits Mobile (chipsandcheese.com)
166 points by klelatti 32 days ago | 228 comments



I sit firm in my belief that the best thing Microsoft could do for their laptop ecosystem is to add support for a "max fan speed" slider somewhere prominent in the Windows UI.

People want the option to make their laptop silent or nearly silent. And when users do need the power, they generally prefer a slightly slower laptop at a reasonable volume rather than the roar of a jet engine.

Laptop manufacturers want their devices to score high on benchmarks. The best way to do that is to add a fan that can become very loud.

The incentives are not aligned.

All laptops should be designed to operate passively 100% of the time, if the owner so chooses. I doubt manufacturers will go that route unless Microsoft nudges them towards it. It would have downstream effects on how review sites benchmark laptops (i.e., at various power draws/noise levels producing a curve rather than a single number), which would have downstream effects on what CPU designers optimize for. It'd be great for consumers.


Stop demanding paper thin laptops. My work Dell rarely turns on its fan unless an AV scan is in progress and even then it's rather tolerable. It isn't a fashionable thickness so has plenty of internal volume for heat distribution.


I want a thin and fanless laptop. You might want something else.


I want a very thick laptop with many fans. It should be chonky and have personality, like an old Sun Tadpole Unix laptop, not svelte and artsy like a 2000s MacBook. I like the white noise, and if it helps performance all the better.


MacBook Air is thin and fanless, so it can be done.


"Thin and fanless" aren't that hard, just use any low power CPU.

But then people also want fast.

Apple does this by buying out TSMC's capacity for the latest process nodes and then taking the performance/efficiency trade off in favor of efficiency, so they get something with similar performance and lower power consumption. But then they charge you $400 for $50 worth of RAM and solder it so you can't upgrade it yourself.

The thing to realize is that fans are not required to spin, and the difference between the faster and slower processors is the clock speed rather than the transistors. So what you want is exactly what the OP requested: A laptop with fans in it, but you can turn them off. Then the CPU gets capped at a 15W TDP, has basically the same single-thread performance but is slower on threaded workloads, and it's no longer possible for Chrome to make your laptop loud.

But if you want to open up Blender you still have the option to move up the fan slider and make it go faster.


I know everyone on this site loves to hate on soldered RAM, but my impression is most people don’t understand that soldered RAM is not the same thing as regular RAM modules. They are literally different memory chips (LPDDR vs DDR). When building for a specific chip, my understanding is you can design for tighter timings and higher bandwidth, which is important for the GPU. The M1 shipped with very fast LPDDR4X running at 4266MT/s, which was pretty fast even by XMP desktop standards at the time (2020). There are real engineering advantages to soldered RAM, especially if the memory controller has been designed to take advantage of it. I guess it is similar to how GPU memory configurations are specialized and not modular.


You're making two separate points here.

The first is the timings, which is nominally real but it was never a huge difference. Moreover, the new CAMM standard aims to address this and basically does. The legacy SODIMM standard wasn't great and is essentially what caused this.

The second is the bus width. If you use slotted memory and want a wide bus then you need at least one slot per channel and then you could end up needing a lot of slots. This isn't impossible -- servers do it -- but there is a cost attached to it.

But this doesn't apply to systems that aren't using a wide bus. The base M3 has the same bus width as ordinary dual-channel PCs. The Pro has the equivalent of four channels or, for the newer generation, three. That's still not a crazy number in a high end system. With CAMM it would only be two modules, since the modules are each 128-bit. By way of comparison, Threadripper has four or eight channels and modern servers have dozens.


The objection to soldered RAM isn’t the RAM.


The M1 was made on 5nm, which has long been available to AMD and other competitors in volume.


Fast is relative. The Ryzen HX 370 has a TDP configurable down to 15W and at that power level it could be run fanless and would be faster than the M1, but it's still faster yet if you give it 54W and raise the clock speed.


Is that the chip AMD just released? Isn't the M1 about 4 years old?


The premise is that others can now use the same process as the M1 did to make fanless CPUs. Which they can, but they could always make fanless CPUs. The issue is that people also want them to be fast, which is not an absolute measurement fixed for all time, it's relative to competing contemporary systems with more cooling, which will always be faster.


I'm going to need a source on that.

What does HX 370 score at 10w?


You're asking for a benchmark result for a CPU which just came out and has a configurable TDP that hardly anybody is going to have set to its lowest value, if they even disclose it, much less have done so in a test against the original M1. If you think a source for that even exists you can provide a link.

But the result seems pretty obvious. Even the 7nm Ryzen U-series at 15W (e.g. 7730U) was beating the 5nm M1 on multi-threaded workloads and the HX 370 is well ahead of both on single-thread performance. Single-thread workloads aren't significantly power limited, so for that not to be the case the Zen5 HX 370 would have to be slower than the Zen3 7730U on threaded workloads at the same TDP, which seems unlikely.


>But the result seems pretty obvious. Even the 7nm Ryzen U-series at 15W (e.g. 7730U) was beating the 5nm M1 on multi-threaded workloads and the HX 370 is well ahead of both on single-thread performance. Single-thread workloads aren't significantly power limited, so for that not to be the case the Zen5 HX 370 would have to be slower than the Zen3 7730U on threaded workloads at the same TDP, which seems unlikely.

Again, would like a source on that. Please no Cinebench R23.


7730U (7nm) vs. M1 (5nm) for MT:

https://nanoreview.net/en/cpu-compare/apple-m1-vs-amd-ryzen-...

Faster in Passmark MT, basically tied in Geekbench MT, faster in average MT score.

HX 370 vs. M1:

https://nanoreview.net/en/cpu-compare/apple-m1-vs-amd-ryzen-...

Faster in everything, ST and MT. ST difference is significant, MT difference is huge. Obviously this is expected because in this comparison AMD has the process advantage, but the expected thing is indeed what happens.


Let's not use Passmark MT. Stick to the better benchmarks that are optimized for both ARM and x86. GB5 and GB6, M1 is faster in MT despite having 4 fewer cores. If you can find SPEC scores, that'd be great too.

HX 370 vs M1, what's the perf/watt for SPEC and GB5/6 and Cinebench 2024?

HX370 consumes a lot more power. Hence, there aren't any fanless laptops available for it.

4 years later, AMD's chips still can't work in a fanless laptop.


> Let's not use Passmark MT. Stick to the better benchmarks that are optimized for both ARM and x86.

At some point you just run out of benchmarks. The majority of benchmarks people ordinarily use already don't run on Macs.

> GB5 and GB6, M1 is faster in MT despite having 4 fewer cores.

It has the same number of cores as the 7730U. Half the M1's cores are E-cores, but that should be an advantage on a comparison at a given power level because E-cores have better performance per watt. The M1 gets within the margin of error of the same MT score on the benchmark you actually like even though the M1 is built on a newer process.

> HX 370 vs M1, what's the perf/watt for SPEC and GB5/6 and Cinebench 2024?

You keep asking for benchmarks that probably nobody has published.

> HX370 consumes a lot more power. Hence, there aren't any fanless laptops available for it.

It has a configurable TDP down to 15W. You can make a laptop that passively dissipates 15W. But you can also make a laptop with a fan in it which is capable of higher performance from the same silicon and then have a setting for "silent mode" that lets you switch between them at will. People generally like that better so that's what they make.


>At some point you just run out of benchmarks. The majority of benchmarks people ordinarily use already don't run on Macs.

You just need GB5 or GB6. They are correlated to SPEC. Anything else is sort of worthless in 2024.

>It has the same number of cores as the 7730U. Half the M1's cores are E-cores, but that should be an advantage on a comparison at a given power level because E-cores have better performance per watt. The M1 gets within the margin of error of the same MT score on the benchmark you actually like even though the M1 is built on a newer process.

You're right, the 7730U does have 8 cores only. My mistake.


OP has already cited sufficient stats to prove his point, and you keep replying with requests for different sources.

Why don't you supply your own sources? You're making a claim just the same as the OP, without providing any evidence in your favor. A good faith responder would do the legwork to provide a researched counterpoint.

To anybody that has actually been paying attention to CPU evolution over the years, the process node has clearly been the main differentiator between CPU performance. Intel had the process advantage and thus the CPU advantage, and now they don't.

Architecture matters too, but conventionally it does not result in anywhere near an order-of-magnitude difference.


OP has been using outdated benchmarks not optimized for ARM to prove his point.

https://browser.geekbench.com/v6/cpu/compare/7292282?baselin...

Look at GB instead. GB is highly correlated to SPEC.

M1 is significantly faster in ST and MT while using a lot less power.


And this AMD processor is not the one specified in the OP (Ryzen HX 370) and is not on the same process node as the m1, thus not valid to prove your counterpoint.

You are comparing a 7nm processor to a 5nm one, and yet the gap isn't even very large. Which was entirely the OP's point.

Does a 5nm AMD chip perform similarly to the 5nm Apple chip at the same wattage? (Again, performance does not increase linearly with wattage, as you're likely to cite something violating this logic in the next response)

You seem to not understand the point being discussed though, so no reason to discuss further


>You are comparing a 7nm processor to a 5nm one, and yet the gap isn't even very large. Which was entirely the OP's point.

The gap is huge. AMD's 7nm chip typically uses ~5x more power than the M1 and is still slower.

>Does a 5nm AMD chip perform similarly to the 5nm Apple chip at the same wattage? (Again, performance does not increase linearly with wattage, as you're likely to cite something violating this logic in the next response)

No it does not. Apple's chips are significantly more efficient.

>You seem to not understand the point being discussed though, so no reason to discuss further

What is your point?


Don't run Windows and you don't need fast. Unfortunately Linux on notebooks is always a dice roll of random features (cam, fingerprint, ..) not working.

There is a lot of older hardware running like crap because Windows just bloats up.


So is running MacOS on non-apple laptops or running windows on chromebooks. Preinstalled is another story, of course, you paid someone to make sure all those random features work.

That said, defaults often seem wrong, and defaults matter. For instance, I recently got an HP Elitebook with an AMD 7840HS because for some reason the _u version was tied to a lower res screen. By default it runs high powered and then the fan is loud enough to be annoying. Set it to balanced or low power and the fan is inaudible.


The cheapest MacBook Air is $1000, and it's more like $1500+ if you want a reasonable amount of RAM and storage. There are similarly expensive Windows laptops available that are fanless.


Mind linking some? (Genuinely curious, I would like to check out a potential Linux machine for the next upgrade.)


The Huawei MateBook X.

I think there's a fanless Asus ZenBook too, or the Surface Pro X.

https://www.notebookcheck.net/Huawei-MateBook-X-in-review-Ul...


Count me in


Surface laptop.


Not even the arm version of surface laptop is fanless.


I spent $1700 or so on my M1 Air not too long after they were released. A ThinkPad X1 Carbon would have cost me more money for massively worse performance. Quality costs more.

The difference is that a 4800U would be looking pretty bad vs a HX370 while the M1 still looks decent 4 years later (especially when that HX370 is unplugged).


>There are similarly expensive Windows laptops available that are fanless.

Such as?


Robo & Kala 2 in 1. Thinkpad x13s Snapdragon which can still be found in a few places.


Wait until Lunar Lake comes out.


For $1200 you can easily pick up a decent refurb MBP - these are apple refurbs for example. OOS but an example of what you can find if you look around a bit.

https://sellout.woot.com/offers/apple-14-macbook-pro-with-10...

There’s very little reason to chase the exact latest model when even a 2020/2021 M1 family is still great.


Many things can be done if you don't have to coexist with AV.

(Though the use of the fan is always a configuration choice with thermal management in CPUs these days).


> Stop demanding paper thin laptops

Then use a desktop. Like most people I want my laptop paper thin: at least Apple understands that correctly. My daily laptop driver is an "LG Gram" which is especially slick, thin and light (lighter than any Mac laptop) and it's no slouch: 24 GB of RAM for example. And it's basically quiet: I don't even know if it has any fan (I've owned it for years and never heard a fan).

I'll take a slightly slower laptop if it means it's much quieter. But there's no way I'm going back to the bricks we used to have in 90s/2000s.

If you need a 4090 GPU, buy a desktop and call it a day. For everything else, you can get plenty of power, fast NVMe M.2 SSD, lots of RAM in a paper thin laptop.


> add support for a "max fan speed" slider somewhere prominent in the Windows UI.

Isn't that what the "power settings" do? It's a slider at the bottom right, hidden in a tray icon. Sure, it only has three positions and also influences battery consumption but it pretty much does what you want. (Not sure if windows 11 kept this though)


It might also be good to mandate 10+ hours of battery life when the laptop is in power saving mode. A number of laptops that'd otherwise have decent battery life are hampered by things like half-baked power management of discrete GPUs that doesn't completely cut power supply to those components. Manufacturers should be more heavily testing under this mode.


A few misbehaving CSS filters can make my discrete GPU turn on and at that point my battery life is a goner. Not sure who to blame in that scenario.

There was an old bug in FF around 2018 where a tab using a GPU would prevent a Windows laptop from ever sleeping. That ended up destroying that laptop's battery after it got thrown in my backpack and overheated a couple times.


Seems like this could be fixed by a system setting that disables automatic graphic switching which can be controlled by power profiles. That way the user can set the machine to use iGPU only when on battery, regardless of what programs want.


>I sit firm in my belief that the best thing Microsoft could do for their laptop ecosystem is to add support for a "max fan speed" slider somewhere prominent in the Windows UI

Until then, there's https://github.com/Rem0o/FanControl.Releases


A closed source application for controlling one's fan...umm no thank you.

I never will understand the reasoning behind why people are so afraid of releasing their source code. Looks like a weekend project; does he expect to make a living out of a weekend project?


>Looks like a weekend project;

Is that the best developer insult in your repertoire?

>does he expect to make a living out of a weekend project?

Trying to make money from writing SW is not illegal. The free market will decide.

>A closed source application for controlling one's fan...umm no thank you.

Well since you think it's only a weekend project, why don't you put your money where your mouth is and spend a weekend developing a FOSS fan control app if you need one?


> Looks like a weekend project; does he expect to make a living out of a weekend project?

Why won’t you spend a weekend and do the world a service?


In this case, every OEM will just copy it and slap their name on it. He released it as freeware and he has every right to do so. You have no right to others' work.


It's a .NET program without any obfuscation. You can read the source code using any .NET decompiler, such as ILSpy.


Did you know NBFC (Notebook Fan Control)? It's old, but still works on some devices and you can create custom profiles via XML.

https://github.com/hirschmann/nbfc


This exists.

My friend's new ASUS ROG G16 lets you control maximum power draw for the CPU / GPU and draw the fan curve (temp → RPM).

The only thing that seems missing is a (temperature → max power curve) to handle spikes.
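To make the two curves concrete, here is a minimal sketch of how a (temp → RPM) fan curve with a user-set maximum, plus the missing (temp → max power) curve, could be evaluated. The curve points, the function names and the piecewise-linear interpolation are all assumptions made up for illustration; this is not how ASUS firmware or any real tool is known to implement it.

```python
from bisect import bisect_right

def interpolate_curve(points, temp_c):
    """Piecewise-linear lookup on a list of (temp_c, value) points sorted by temperature."""
    temps = [t for t, _ in points]
    if temp_c <= temps[0]:
        return points[0][1]          # clamp below the first point
    if temp_c >= temps[-1]:
        return points[-1][1]         # clamp above the last point
    i = bisect_right(temps, temp_c)  # index of the first point hotter than temp_c
    (t0, v0), (t1, v1) = points[i - 1], points[i]
    return v0 + (v1 - v0) * (temp_c - t0) / (t1 - t0)

# Hypothetical curves, invented for this example.
FAN_CURVE_RPM = [(40, 0), (60, 2000), (80, 4000), (95, 6000)]
POWER_CAP_W = [(60, 54), (80, 35), (95, 15)]

def fan_rpm(temp_c, max_rpm):
    """Fan speed from the curve, never above the user's 'max fan speed' slider."""
    return min(interpolate_curve(FAN_CURVE_RPM, temp_c), max_rpm)

def power_cap_watts(temp_c):
    """Power limit that falls as temperature rises, to absorb spikes."""
    return interpolate_curve(POWER_CAP_W, temp_c)

print(fan_rpm(70, max_rpm=2500), power_cap_watts(70))
```

With the slider at 2500 RPM, the fan stops following the curve at that ceiling and the power cap curve does the rest of the thermal work.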


I don't know anything about Windows, but at least on Mac, I've been using TGPro for years [0]. I'd assume there is something similar in the Windows world.

In normal conditions my M1 mac can control its fans just fine, but when I travel to hot places like Vietnam... I just keep the fans on more often and my machine doesn't get nearly as hot. I end up having to open it up after a few months and clean out the fans, but that's fine.

[0] https://www.tunabellysoftware.com/tgpro/


Some recent asus ROG/TUF 2022+ models have fine tuning available via "armoury crate" or "g-helper" (non-proprietary, fan/community supported, code on github).

Disabling a dGPU and reducing power can yield some impressive results for battery life. It also allows defining fan speed curves depending on temperature.


I don't mind fans at all, in fact I find fan noises a little soothing (a childhood thing, we didn't have AC). Everyone has different priorities, personally I'd prefer to not have throttled performance.


You can use Linux and hwmon.
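As a rough sketch of what that looks like: the kernel's hwmon sysfs interface exposes fan tachometer readings as `fanN_input` files (in RPM) under `/sys/class/hwmon/hwmon*/`. The helper name `read_fans` below is made up for this example, and whether a given laptop exposes any fans at all (or writable `pwmN` controls) depends on its driver, so treat this as an assumption-laden starting point rather than a guaranteed recipe.

```python
from pathlib import Path

def read_fans(hwmon_root="/sys/class/hwmon"):
    """Return {(chip_name, fan_file): rpm} for every readable fan*_input file."""
    readings = {}
    root = Path(hwmon_root)
    if not root.is_dir():
        return readings  # not Linux, or hwmon not available
    for chip in sorted(root.iterdir()):
        name_file = chip / "name"
        # Fall back to the directory name (e.g. "hwmon3") if the chip has no name file.
        chip_name = name_file.read_text().strip() if name_file.is_file() else chip.name
        for fan in sorted(chip.glob("fan*_input")):
            try:
                readings[(chip_name, fan.name)] = int(fan.read_text().strip())
            except (OSError, ValueError):
                continue  # sensor present but not readable right now
    return readings

for (chip, fan), rpm in read_fans().items():
    print(f"{chip} {fan}: {rpm} RPM")
```

Actually setting a speed usually means writing to the matching `pwmN` file after switching `pwmN_enable` to manual mode, which needs root and driver support, so it is left out here.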


IMO, the most interesting thing about this line is the battery life---within an hour of MBP3 and within 2 hours of Asus's Qualcomm. Making it comparable to ARM architectures.

Which is a little surprising because ARM is commonly believed to be much more power efficient than x86.

[1] https://youtu.be/Z8WKR0VHfJw?si=A7zbFY2lsDa8iVQN&t=277


ARM got a lot of hype since the release of the M1, but most users only compared it to the terrible Intel MBPs. Ryzen mobile has been consistently close to Apple silicon perf/watt for 5 years. But got little press coverage.

Hype can be really decorrelated from real world performance.


Any efficiency comparison involving Apple's chips also has to factor in that Tim Cook keeps showing up at TSMC's door with a freight container full of cash to buy out exclusive access to their bleeding edge silicon processes. ARM may be a factor but don't underestimate the power of having more money than God.

Case in point, Strix Point is built on TSMC 4nm while Apple is already using TSMC's second generation 3nm process.


Let's do the math on M1 Pro (10-core, N5, 2021) vs HX370 (12-core, N4P, 2024).

Firestorm without L3 is 2.281mm2. Icestorm is 0.59mm2. M1 Pro has 8P+2E for a total of 19.428mm2 of cores included.

Zen4 without L3 is 3.84mm2. Zen4c reduces that down to 2.48mm2. Zen5 CCD is pretty much the same size as Zen4 (though with 27% more transistors), so core size should be similar. AMD has also stated that Zen5c has a similar shrink percent to Zen4c. We'll use their numbers. HX370 has 4P+8C for a total area of 35.2mm2. If being twice the size despite being on N4P instead of N5 like M1 seems like foreshadowing, it is.

We'll use notebookcheck's Cinebench 2024 multithread power and performance numbers to calculate perf / power / area then multiply that by 100 to eliminate some decimals.

M1 Pro scores 824 (10-core) and while they don't have a power value listed, they do list 33.6w package power running the prime95 power virus, so cinebench's power should be lower than that.

HX370 scored 1213 (12-core) and averaged 119w (maxing at a massive 121.7w and that's without running a power virus).

This gives the following perf/power/area*100 scores:

M1 Pro — 126 PPA

HX 370 — 29 PPA

M1 Pro is more than 4.3x better while being an entire node behind and being released years before.
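For anyone who wants to check the arithmetic, here is a short sketch that reproduces it from the figures quoted in this comment. The variable names are made up; the 33.6 W figure for the M1 Pro is the prime95 package power used as an upper bound, and the Zen 5 / Zen 5c core areas are assumed equal to the Zen 4 / Zen 4c ones, exactly as stated above.

```python
# Per-core areas in mm^2 (without L3), as quoted above.
FIRESTORM_MM2, ICESTORM_MM2 = 2.281, 0.59
ZEN_P_MM2, ZEN_C_MM2 = 3.84, 2.48   # Zen 4 / Zen 4c sizes, assumed for Zen 5 / Zen 5c

def ppa(score, watts, area_mm2):
    """Performance per watt per mm^2 of core area, times 100."""
    return score / watts / area_mm2 * 100

m1_pro_area = 8 * FIRESTORM_MM2 + 2 * ICESTORM_MM2   # 8P + 2E
hx370_area = 4 * ZEN_P_MM2 + 8 * ZEN_C_MM2           # 4P + 8C

m1_pro = ppa(824, 33.6, m1_pro_area)    # Cinebench 2024 MT, prime95 package power
hx370 = ppa(1213, 119, hx370_area)      # Cinebench 2024 MT, measured average power

print(f"M1 Pro area {m1_pro_area:.3f} mm2, PPA {m1_pro:.0f}")
print(f"HX 370 area {hx370_area:.1f} mm2, PPA {hx370:.0f}")
print(f"ratio {m1_pro / hx370:.2f}x")
```

The result is only as good as the power inputs: swapping in a different wattage for either chip changes the score in direct proportion.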


119W for hx370 looks extremely sus, seems to me more like the system level power consumption and not CPU-only.

According to phoronix [1,2], in their blender CPU test, they measured a peak of 33W.

Here are max power numbers from some other tests that I know are multi-threaded:

--

Linux 6.8 Compilation: 33.13 W

LLVM Compilation: 33.25 W

--

If I plug 33W into your equation, that would give the HX 370 a score of 104 PPA.

This supports the HX 370 being pretty power efficient, although still not as power efficient as M3.

[1] https://www.phoronix.com/review/amd-ryzen-ai-9-hx-370/3

[2] https://www.phoronix.com/review/amd-ryzen-ai-9-hx-370/4


https://www.notebookcheck.net/AMD-Zen-5-Strix-Point-CPU-anal...

They got those kinds of numbers across multiple systems. You can take it up with them I guess.

I didn't even mention one of these systems was peaking at 59w on single-core workloads.


I see what's going on, they have two HX370 laptops:

  Laptop  MC score  Avg Power
     P16      1213      113 W
     S16       921       29 W
  M3 Pro      1059    (30 W?)

They don't have M3 Pro power numbers, but I assume it is somewhere around 30W, so the HX 370 in the S16 at about 30 W seems to have similar power efficiency.

Any more power, and the CPU is much less power efficient, 300% increase in power for 30% increase in performance.
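A quick sketch of that comparison, using only the two HX 370 rows from the table above (the dictionary and function names are made up for illustration):

```python
# (Cinebench MC score, average power in watts) from the table above.
laptops = {"P16": (1213, 113), "S16": (921, 29)}

def points_per_watt(score, watts):
    return score / watts

for name, (score, watts) in laptops.items():
    print(f"{name}: {points_per_watt(score, watts):.1f} points per watt")

p16_score, p16_watts = laptops["P16"]
s16_score, s16_watts = laptops["S16"]
extra_power_pct = (p16_watts / s16_watts - 1) * 100   # how much more power the P16 draws
extra_perf_pct = (p16_score / s16_score - 1) * 100    # how much more performance it gets
print(f"+{extra_power_pct:.0f}% power for +{extra_perf_pct:.0f}% performance")
```

That works out to roughly 290% more power for about 32% more performance, which is where the "300% for 30%" rounding comes from.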


This is true for every CPU. Past a certain point power consumption scales quadratically with performance.


About cinebench-geekbench-spec: https://old.reddit.com/r/hardware/comments/pitid6/eli5_why_d... That's about Cinebench 20, an overview of Cinebench 24 cpu&gpu(!): https://www.cgdirector.com/cinebench-2024-scores/


Even with the M3 the difference is marginal in multi-threaded benchmarks, from the Cinebench link [1] someone posted earlier on the thread.

    Apple M3 Pro 11-Core - 394 Points per Watt
    AMD Ryzen AI 9 HX 370 - 354 Points per Watt
    Apple M3 Max 16-Core - 306 Points per Watt

And the Ryzen is on TSMC 4nm while the M3 is on 3nm. As parent is saying, a lot of the Apple Silicon hype was due to the massive upgrade it was over the Intel CPUs Apple was using previously.

[1]: https://www.notebookcheck.net/AMD-Zen-5-Strix-Point-CPU-anal...


Their efficiency tests use Cinebench R23 (as called out explicitly).

R23 is not optimized for Apple silicon but is for x86. The R24 numbers are actually what you need for a fair comparison, otherwise you put the Arm numbers at a significant handicap.


That the M3 Max should be worse than the M3 Pro is a little bit shady.


Cinebench might not be the most relevant benchmark, it uses lots of scalar instructions with fairly high branch mispredictions and low IPC: https://chipsandcheese.com/2021/02/22/analyzing-zen-2s-cineb....


Power efficiency is a curve, and Apple may have its own reason not to make M1 Pro run at 110W as well


I think the OC might have misread the power numbers; 110 W is well into desktop CPU power range. Here is an excerpt from AnandTech:

> In our peak power test, the Ryzen AI 9 HX 370 ramped up and peaked at 33 W.

https://www.anandtech.com/show/21485/the-amd-ryzen-ai-hx-370...


You can read the notebookcheck review for yourself.

https://www.notebookcheck.net/AMD-Zen-5-Strix-Point-CPU-anal...


Those 100W+ numbers are total system power. And that system has the CPU TDP set to 80W (far above AMD's official max of 54W). It also has a discrete 4070 GPU that can use over 100W on its own.


If x86 laptops have 90w of platform power, that’s concerning in itself, not a reasonable defense.

Remember, apple laptops have screens too, etc, and that shows up in the average system power measurements the same way. What's the difference in an x86 laptop?

I really doubt it's actually platform power, the problem is that x86 is boosting up to 35W average/60W peak per thread. 120W package power isn't unexpected, if you're boosting 3-4 cores to maximum!

And that's the problem. x86 is far far worse at race-to-sleep. It's not just "macos has better scheduling"... you can see from the 1T power measurements that x86 is simply drawing 2-3x the power while it's racing-to-sleep, for performance that's roughly equivalent to ARM.

Whatever the cause, whether it's just bad design from AMD and Intel, or legacy x86 cruft (I don't get how this applies to actual computational load though, as opposed to situations like idle power), or what... there is no getting around the fact that M2 tops out at 10W per core and an 8840HS or HX370 or Intel Meteor Lake are boosting to 30-35W at 1T loads.


I stacked the deck in AMD's favor using a 3-year-old chip on an older node.

Why is AMD using 3.6x more power than M1 to get just 32% higher performance while having 17% more cores? Why are AMD's cores nearly 2x the size despite being on a better node and having 3 more years to work on them?

Why are Apple's scores the same on battery while AMD's scores drop dramatically?

Apple does have a reason not to run at 120w -- it doesn't need to.

Meanwhile, if AMD used the same 33w, nobody would buy their chips because performance would be so incredibly bad.


You should try not to talk so confidently about things you don't know about -- this statement

> if AMD used the same 33w, nobody would buy their chips because performance would be so incredibly bad

is completely incorrect. As another commenter (and I think the notebookcheck article?) points out, 30w is about the sweet spot for these processors. The 110w laptop seems so inefficient because it's giving the APU 80w of TDP, which is a bit silly since it only performs marginally better than if you gave it e.g. 30 watts. It's not a good idea to take that example as a benchmark for the APU's efficiency: efficiency varies depending on how much TDP you give the processor, and 80w is not a good TDP for these.


Halo products with high scores sell chips. This isn’t a new idea.

So you lower the wattage down. Now you’re at M1 Pro levels of performance with 17% more cores and nearly double the die area and barely competing with a chip 3 years older while on a newer, more expensive node too.

That’s not selling me on your product (and that’s without mentioning the worst core latency I’ve seen in years when going between P and C cores).


> if AMD used the same 33w, nobody would buy their chips because performance would be so incredibly bad

I’m writing this comment on HP ProBook 445 G8 laptop. I believe I bought it in early 2022, so it's a relatively old model. The laptop has a Ryzen 5 5600U processor which uses ≤ 25W. I’m quite happy with both the performance and battery life.


It's well known that performance doesn't scale linearly with power.

Benchmarking incentives on PC have long pushed X86 vendors to drive their CPUs at points of the power/performance curve that make their chips look less efficient than they really are. Laptop benchmarking has inherited that culture from desktop PC benchmarking to some extent. This is slowly changing, but Apple has never been subject to the same benchmarking pressures in the first place.

You'll see in reviews that Zen5 can be very efficient when operated in the right power range.


Zen5 can be more efficient at lower clockspeeds, but then it loses badly to Apple's chips in raw performance.


> I stacked the deck in AMD's favor using a 3-year-old chip on an older node.

You could just compare the ones that are actually on the same process node:

https://www.notebookcheck.net/R9-7945HX3D-vs-M2-Max_15073_14...

But then you would see an AMD CPU with a lower TDP getting higher benchmark results.

> Why is AMD using 3.6x more power than M1 to get just 32% higher performance while having 17% more cores?

Getting 32% higher performance from 17% more cores implies higher performance per core.
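As a back-of-the-envelope check, dividing the performance ratio by the core-count ratio gives the average per-core ratio. The sketch below does that for the quoted 32% / 17% figures and, separately, for the raw numbers given earlier in the thread (1213 vs 824 points, 12 vs 10 cores). The function name is made up, and since both chips mix big and small cores this is only an average across all cores, not a statement about any single core.

```python
def per_core_ratio(perf_ratio, core_ratio):
    """Average per-core performance ratio implied by total perf and core-count ratios."""
    return perf_ratio / core_ratio

# Using the percentages as quoted: 32% more performance from 17% more cores.
quoted = per_core_ratio(1.32, 1.17)

# Using the raw figures from earlier in the thread.
raw = per_core_ratio(1213 / 824, 12 / 10)

print(f"quoted figures: {(quoted - 1) * 100:.0f}% higher per core")
print(f"raw figures:    {(raw - 1) * 100:.0f}% higher per core")
```

Either way the ratio is above 1, so the direction of the claim holds; only the size of the gap depends on which set of figures is used.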

The power measurements that site uses are from the plug, which is highly variable to the point of uselessness because it takes into account every other component the OEM puts into the machine and random other factors like screen brightness, thermal solution and temperature targets (which affects fan speed which affects fan power consumption) etc. If you measure the wall power of a system with a discrete GPU that by itself has a TDP >100W and the system is drawing >100W, this tells you nothing about the efficiency of the CPU.

AMD's CPUs have internal power monitors and configurable power targets. At full load there is very little light between the configured TDP and what they actually use. This is basically required because the CPU has to be able to operate in a system that can't dissipate more heat than that, or one that can't supply more power.

> Meanwhile, if AMD used the same 33w, nobody would buy their chips because performance would be so incredibly bad.

33W is approximately what their mobile CPUs actually use. Also, even lower-configured TDP models exist and they're not that much slower, e.g. the 7840U has a base TDP of 15W vs. 35W for the 7840HS and the difference is a base clock of 3.3GHz instead of 3.8GHz.


> Getting 32% higher performance from 17% more cores implies higher performance per core.

I don't disagree that it is higher perf/core. It is simply MUCH worse perf/watt because they are forced to clock so high to achieve those results.

> The power measurements that site uses are from the plug, which is highly variable to the point of uselessness

They measure the HX370 using 119w with the screen off (using an external monitor). What on that motherboard would be using the remaining 85+W of power?

TDP is a suggestion, not a hard limit. Before thermal throttling, they will often exceed the TDP by a factor of 2x or more.

As to these specific benchmarks, the R9 7945HX3D you linked to used 187w while the M2 Max used 78w for CB R15. As to perf/watt, Cinebench before 2024 wasn't using NEON properly on ARM, but was using Intel's hyper-optimized libraries for x86. You should be looking at benchmarks without such a massive bias.


> I don't disagree that it is higher perf/core. It is simply MUCH worse perf/watt because they are forced to clock so high to achieve those results.

The base clock for that CPU is nominally 2 GHz.

> They measure the HX370 using 119w with the screen off (using an external monitor). What on that motherboard would be using the remaining 85+W of power?

For the Asus ProArt P16 H7606WI? Probably the 115W RTX 4070.

> TDP is a suggestion, not a hard limit. Before thermal throttling, they will often exceed the TDP by a factor of 2x or more.

TDP is not really a suggestion. There are systems that can't dissipate more than a specific amount of heat and producing more than that could fry other components in the system even if the CPU itself isn't over-temperature yet, e.g. because the other components have a lower heat tolerance. There are also systems that can't supply more than a specific amount of power and if the CPU tried to non-trivially exceed that limit the system would crash.

The TDP is, however, configurable, including different values for boost. So if the OEM sets the value to the higher end of the range even though their cooling solution can't handle it, the CPU will start out there and gradually lower its power use as it becomes thermally limited. This is not the same as "TDP is a suggestion", it's just not quite as simple as a single number.

> As to these specific benchmarks, the R9 7945HX3D you linked to used 187w while the M2 Max used 78w for CB R15.

Which is the same site measuring power consumption at the plug on an arbitrary system with arbitrary other components drawing power. Are they even measuring it through the power brick and adding its conversion losses?

These CPUs have internal power meters. Doing it the way they're doing it is meaningless and unnecessary.
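
A rough sketch of what reading such a meter can look like on Linux, assuming the kernel exposes the package energy counter through the powercap sysfs interface. The `intel-rapl:0` directory name is an assumption about the system (it may be named differently, be absent, or need root), and the helper names are mine:

```python
import time
from pathlib import Path

# Assumed location of the CPU package energy counter (Linux powercap sysfs).
# This path is an assumption; it varies by system and may require root to read.
RAPL_DIR = Path("/sys/class/powercap/intel-rapl:0")


def average_watts(start_uj, end_uj, seconds, max_range_uj):
    """Average power from two microjoule counter readings, allowing one wraparound."""
    delta = end_uj - start_uj
    if delta < 0:  # the counter wrapped around between the two readings
        delta += max_range_uj
    return delta / 1_000_000 / seconds


def read_uj(name):
    """Read one integer microjoule value from the powercap directory."""
    return int((RAPL_DIR / name).read_text())


def sample_package_power(interval=1.0):
    """Read the package counter twice and return average watts over the interval."""
    max_range = read_uj("max_energy_range_uj")
    start = read_uj("energy_uj")
    t0 = time.monotonic()
    time.sleep(interval)
    end = read_uj("energy_uj")
    return average_watts(start, end, time.monotonic() - t0, max_range)
```

Calling `sample_package_power()` while a benchmark runs gives the package's own draw, without the screen, dGPU, or power brick losses that a wall measurement folds in.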

> You should be looking at benchmarks without such a massive bias.

Do you have one that compares the same CPUs on some representative set of tests and actually measures the power consumption of the CPU itself? Diligently-conducted benchmarks are unfortunately rare.

Note however that the same link shows the 7945HX3D also ahead in Blender, Geekbench ST and MT, Kraken, Octane, etc. It's consistently faster on the same process, and has a lower TDP.


lmao he’s citing cinebench R15? Which isn’t just ancient but actually emulated on arm, of course.

Really digging through the vaults for that one.

Geekbench 6 is perfectly fine for that stuff. But that still shows Apple tying in MT and beating the pants off x86 in 1T efficiency.

x86 1T boosts being silly is where the real problem comes from. But if they don’t throw 30-35w at a single thread they lose horribly.


> lmao he’s citing cinebench R15?

It's the only one where they measured the power use. I don't get to decide which tests they run. But if their method of measuring power use is going to be meaningless then the associated benchmark result might as well be too, right?

> Geekbench 6 is perfectly fine for that stuff. But that still shows Apple tying in MT and beating the pants off x86 in 1T efficiency.

It shows Apple behind by 8% in ST and 12% in MT with no power measurement for that test at all, but an Apple CPU with a higher TDP. Meanwhile the claim was that AMD hadn't even caught up on the same process, which isn't true.

> x86 1T boosts being silly is where the real problem comes from. But if they don’t throw 30-35w at a single thread they lose horribly.

They don't use 30-35W for a single thread on mobile CPUs. The average for the HX 370 from a set of mostly-threaded benchmarks was 20W when you actually measure the power consumption of the CPU:

https://www.phoronix.com/review/amd-ryzen-ai-9-hx-370/13

On single-threaded tests like PyBench the average was 10W:

https://www.phoronix.com/review/amd-ryzen-ai-9-hx-370/9

34W was the max across all tests, presumably the configured TDP for that system, derived from the tests like compiling LLVM that max out arbitrarily many cores.


Process helps but have you seen benchmarks showing equivalent performance between the same process node? I think it’s less that ARM is amazing than the Apple Silicon team being very good and paired with aggressive optimization throughout the stack but everything I’ve seen suggests they are simply building better chips at their target levels (not server, high power, etc.).


> Our benchmark database shows the Dimensity 9300 scores 2,207 and 7,408 in Geekbench 6.2's single and multi-core tests. A 30% performance improvement implies the Dimensity 9400 would score around 2,869 and 9,630. Its single-core performance is close to that of the Snapdragon 8 Gen 4 (2,884/8,840) and it understandably takes the lead in multi-core. Both are within spitting distance from the Apple A17 Pro, which scores 2,915 and 7,222 points in the benchmark. Then again, all three chips are said to be manufactured on TSMC's N3 class node, effectively leveling the playing field.

https://www.notebookcheck.net/MediaTek-Dimensity-9400-rumour...
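
The projected scores in that quote are just the Dimensity 9300 numbers scaled by the rumoured 30%, which is easy to reproduce (variable names are mine; the inputs are the figures from the quote):

```python
base_single, base_multi = 2207, 7408  # Dimensity 9300 scores from the quote
uplift = 1.30                         # the rumoured 30% improvement

projected_single = round(base_single * uplift)
projected_multi = round(base_multi * uplift)
print(projected_single, projected_multi)  # 2869 9630
```

So the 9400 figures are an extrapolation from a percentage, not a measured result.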


That appears to be an unconfirmed rumor and it’s exciting if true (and there aren’t major caveats on power), but did you notice how they mentioned extra work by ARM? The argument isn’t that Apple is unique, it’s that the performance gaps they’ve shown are more than simply buying premium fab capacity.

That doesn’t mean other designers can’t also do that work, but simply that it’s more than just the process - for example, the M2 shipped on TSMC’s N5P first as an exclusive but when Zen 5 shipped later on the same process it didn’t close the single core performance or perf/watt gap. Some of that is x86 vs. ARM but there isn’t a single, simple factor which can explain this - e.g. Apple carefully tuning the hardware, firmware, OS, compilers, and libraries too undoubtedly helps a lot and it’s been a perennial problem for non-Intel vendors on the PC side since so many developers have tuned for Intel first/only for decades.


> for example, the M2 shipped on TSMC’s N5P first as an exclusive but when Zen 5 shipped later on the same process it didn’t close the single core performance or perf/watt gap.

That was Zen 4, but it did close the gap:

https://www.notebookcheck.net/R9-7945HX3D-vs-M2-Max_15073_14...

Single thread performance is higher (so is MT), TDP is slightly lower, Cinebench MT "points per watt" is 5% higher.

We'll get to see it again when the 3nm version of Zen5 is released (the initial ones are 4nm, which is a node Apple didn't use).


Since it's unclear whether Apple has a significant architectural advantage over Qualcomm and MediaTek, I would rather attribute this to relatively poor AMD architectures. Provisionally. At least their GPUs have been behind Nvidia for years. (AMD holding its own against Intel is not surprising given Intel's chip fab problems.)


Yes, to be clear I’d be very happy if MediaTek jumps in with a strong contender since consumers win. It doesn’t look like the Qualcomm chips are performing as well as hoped but I’d wait a bit to see how much tuning helps since Windows ARM was not a major target until now.


I guess getting close to the same single thread score is nice. Unfortunately, since only Apple is shipping it is hard to compare if the others burn the battery to get there.

I suspect the other two, like Apple with the A18 shipping next month, will be using the second gen N3. Apple is expected to be around 3500 on that node.

Needless to say, what will be very interesting is to see the perf/watt of all three on the same node and shipping in actual products where the benchmarks can be put to more useful tests.


Yeah, and GPU tests, since the benchmarks above were only for the CPU.


But how is this the case? I never saw a single article mentioning that a non-Mac laptop was better.

(Random article saying M3 pro is better than a Dell laptop https://www.tomsguide.com/news/macbook-pro-m3-and-m3-max-bat... )


You're right, but... The idea comes from the desktop world. AMD's Zen 4 desktop CPUs, especially the gaming variants like the Ryzen 7 7800X3D, almost match the performance per watt of Apple's M3.

Their laptop CPUs, judging by the cases where a company released the same model with different CPUs, were less efficient than Intel's.

But the Asus ProArt P16 (used in the article) did manage an extreme endurance score of 21 hours in the video test, Big Buck Bunny H.264 1080p, which runs at 150 cd/m². With its higher resolution, OLED and 10% less battery capacity, that's 40 minutes better than the MacBook Pro 16 M3 Max. In the Wi-Fi test, also run at 150 cd/m², the M3 ran for 16 hours, the Asus 8. ( https://www.notebookcheck.net/Asus-ProArt-P16-laptop-review-... )

For me noise matters; that Asus has a whisper mode which produces 42 dB, as much as an M3 Max under full load. Please be aware that if you're susceptible to PWM, that Asus laptop has issues.


I have heard that part of the reason for little coverage of Ryzen mobile CPUs is their limited availability as AMD was focussing on using the fab capacity for server chips.


I think that's because all the press talks about actual battery life per laptop and the Apple Silicon laptops ship with literally double the size battery of any AMD based laptop without a discrete GPU. So while the efficiency may be close, actual perceived battery life of the Mac will be more than double when you also consider the priority Apple puts into their power control combined with a larger overall battery.


Ryzen mobile is consistently close, yeah. But with the sole exception of the Steam deck, I've yet to see a Ryzen mobile-bearing laptop, Windows included, which is close to the overall performance of the Macbook.


"overall performance" does a lot of work here. On sheer benchmarks it's really comparable, with AMD being slightly better depending on what you look at. e.g. the M1 vs the 5700U (a similar class widely available mobile CPU):

https://www.cpubenchmark.net/cpu.php?cpu=AMD%20Ryzen%207%205...

https://www.cpubenchmark.net/cpu.php?cpu=Apple+M1+8+Core+320...

They're not profiled the same, and don't belong in the same ecosystem though, which makes a lot more difference than the CPUs themselves. In particular the AMD doesn't get a dedicated compiler optimizing every application on the system to its strengths and weaknesses (the other side of it being the compatibility with the two vastest ecosystems we have now).


Depends on what you mean by "overall performance", but my Asus ROG Zephyrus G14 2023 is full AMD, and outperforms my work issued top of the line M1 MacBook Pro from a few months earlier in every task I've done across the two (gaming, compiling, heavy browsing). Battery life is lower under heavy load and high performance on the Zephyrus, but in power saving mode it's roughly comparable, albeit still worse.


Same here, my G14 and the M1 MBP are pretty much interchangeable for most workloads. The only time the G14 starts its fans is when the 4070 turns on... and that's not an option on the M1 at all.


> But with the sole exception of the Steam deck

Uuh wut? The Steam Deck is like 3-generation-old hardware in mobile Ryzen terms. In a lot of ways it's similar to a pared-back 4800u with fewer (and older) cores, and a slightly bumped up GPU.

To me it's kinda the opposite. Excluding the Steam Deck, I think most of AMD's Ultrabook APUs have been very close to the products Apple's made on the equivalent nodes. Even on 7nm the 4800u put up a competitive fight against M1, and the gap has gotten thinner with each passing year. According to the OpenCL benchmarks, the Radeon 680m on 6nm scores higher than the M1 on 5nm: https://browser.geekbench.com/opencl-benchmarks

Even back when Ryzen Mobile only shipped with Vega, it was pretty clear that Apple and AMD were a pretty close match in onboard GPU power.


Steam Deck might be behind in terms of hardware but in terms of software it's way beyond your typical x86 Linux system's power efficiency, and dare I say it's doing better than Windows machines with the typical shoddy BIOSes and drivers, especially when you consider all the extraneous services constantly sapping varying amounts of CPU time. All that contributes to make the SD punch well above its weight.


My Alienware M15 Ryzen edition gets 7-8W power consumption by just running "sudo powertop --autotune". Basically all of the power efficiency stuff in the Steam Deck applies to other Ryzen systems and is in the mainline kernel.


Battery tests are important, but so is how it fares on battery (what is the performance drop-off to maintain that), what its performance is at its peak, and how long before it throttles when pushed.

The M series processors have succeeded in all four: battery life, performance parity between battery and plugged in, high performance and performance sustainability.

So far, very few benchmarks have been comparing the latter three as part of the full package assessment.


> because ARM is commonly believed to be much more power efficient than x86.

Because most ARM processors were designed for mobile phones and optimised to death for power efficiency.

The total power usage of the front end decoders is a single digit percentage of the total power draw. Even if ARM magically needed 0 watts for this, it couldn’t save more power than that. The rest of the processor design elements are essentially identical.


>5hr Battery life in laptops is mostly a function of how well idle is managed, I think. The less work you can do while running the user's core program, the better. I'm not sure how much impact CPU efficiency really has in that case.

If you are running a remotely demanding program (say, a game) , your battery life will be bad no matter what (ie. <4hrs) unless you choose a very low TDP that performs badly always.

A laptop at idle should be able to manage ~5W power consumption regardless of AMD/Intel/Apple processor, but it's largely on the OS to achieve that.


I have a 365 AMD laptop.

The battery is great if you're doing very light stuff; Call of Duty takes its battery down to 3 hours.

Macs don't really support higher end games, so I can't directly compare to my M1 Air.


How does “great” translate to hours?


This is really tricky.

The OEMs will use every trick possible and do something like open Gmail to claim 10 hours, but given my typical use I average 5 to 6. I make music using software called Maschine.

It's a massive step up over my old (still working, just very heavy) Lenovo Legion 2020, which would last about 2 hours given the same usage.

This is all subjective at the end of the day. If none of your applications actually work since you're on ARM Windows, of course you'll have higher battery life.


The CPU core's instruction set has no influence on how well the chip as a whole manages power when not executing instructions.


That is fair, I was taught that decoders for x86 are less efficient and more power hungry than RISC ISAs because of their variable length instructions.

I remember being told (and it might be wrong) that ARM can decode multiple instructions in parallel because the CPU knows where the next instruction starts, but for x86, you'd have to decode the instructions in order.
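
To make the claim concrete, here is a toy sketch of the dependency being described. The variable-length encoding below is made up for illustration (first byte of each instruction is its total length); it is not real x86, and real decoders have ways around the serial walk:

```python
def fixed_width_starts(code, width=4):
    """Every start offset is known up front, so each slot could be decoded independently."""
    return list(range(0, len(code), width))


def variable_width_starts(code):
    """Toy encoding (not real x86): the first byte of each instruction is its total length.

    The start of instruction i+1 is only known after inspecting instruction i,
    which is the serial dependency described above.
    """
    starts = []
    offset = 0
    while offset < len(code):
        starts.append(offset)
        length = code[offset]
        if length == 0:
            raise ValueError("toy encoding requires a non-zero length byte")
        offset += length
    return starts


fixed = bytes(16)                                     # four 4-byte instructions
variable = bytes([1, 3, 0, 0, 2, 0, 5, 0, 0, 0, 0])   # lengths 1, 3, 2, 5

print(fixed_width_starts(fixed))        # [0, 4, 8, 12]
print(variable_width_starts(variable))  # [0, 1, 4, 6]
```

The fixed-width list is a pure function of the buffer length, while the variable-width one has to be walked front to back in this naive form.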


That seems to not matter much nowadays. There's another great (according to my untrained eye) writeup of the lack of importance on chips and cheese.

https://chipsandcheese.com/2021/07/13/arm-or-x86-isa-doesnt-...


The various mentioned power consumption amounts are 4-10% per-core, or 0.5-6% of package (with the caveat of running with micro-op cache off) for Zen 2, and 3-10% for Haswell. That's not massive, but is still far from what I'd consider insignificant; it could give leeway for an extra core or some improved ALUs; or, even, depending on the benchmark, is the difference between Zen 4 and Zen 5 (making the false assumption of a linear relation between power and performance, at least), which'd essentially be a "free" generational improvement. Of course the reality is gonna be more modest than that, but it's not nothing.


You missed the part where they mention ARM ends up implementing the same thing to go fast.

The point is processors are either slow and efficient, or fast and inefficient. It's just a tradeoff along the curve.


ARM doesn't need the variable-length instruction decoding though, which on x86 essentially means that the decoder has to attempt to decode at every single byte offset for the start of the pipeline, wasting computation.

Indeed pretty much any architecture can benefit from some form of op cache, but less of a need for it means its size can be reduced (and savings spent in more useful ways), and you'll still need actual decoding at some point anyway (and, depending on the code footprint, may need it a lot).

More generally, throwing silicon at a problem is, quite obviously, a more expensive solution than not having the problem in the first place.


x86 processors simply run an instruction length predictor the same way they do it for branch prediction. That turns the problem into something that can be tuned. Instead of having to decode the instruction at every byte offset, you can simply decide to optimize for the 99% case with a slow path for rare combinations.


That's still silicon spent on a problem that can be architecturally avoided.


But bigger fixed-length instructions mean more I$ pressure, right?


RISC doesn't imply wasted instruction space; RISC-V has a particularly interesting thing for this - with the compressed ('c') extension you get 16-bit instructions (which you can determine by just checking two bits), but without it you can still save 6% of icache silicon via only storing 30 bits per instruction, the remaining two being always-1 for non-compressed instructions.

Also, x86 isn't even that efficient in its variable-length instructions - some half of them contain the byte 0x0F, representing an "oh no, we're low on single-byte instructions, prefix new things with 0F". On top of that, general-purpose instructions on 64-bit registers have a prefix byte with 4 fixed bits. The VEX prefix (all AVX1/2 instructions) has 7 fixed bits. EVEX (all AVX-512 instructions) is a full fixed byte.
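
As a sketch of the "just check two bits" point for RISC-V: the length of an instruction falls out of the low bits of its first 16-bit parcel. The function name is mine, and only the standard 16-bit and 32-bit encodings are handled; anything marked as a longer encoding is rejected rather than guessed at:

```python
def rv_instruction_length(first_halfword):
    """Length in bytes of a RISC-V instruction, from its lowest-addressed 16 bits.

    Only the 16-bit (compressed) and 32-bit encodings are handled here.
    """
    if (first_halfword & 0b11) != 0b11:
        return 2   # compressed: low two bits are 00, 01 or 10
    if (first_halfword & 0b11100) != 0b11100:
        return 4   # standard 32-bit encoding: low two bits are 11
    raise ValueError("longer-than-32-bit encoding, not handled in this sketch")


print(rv_instruction_length(0x0001))  # c.nop
print(rv_instruction_length(0x0013))  # low half of nop (addi x0, x0, 0)
print(f"{2 / 32:.2%} of each 32-bit instruction is the fixed '11' suffix")
```

The last line is where the roughly 6% icache saving comes from: two of every 32 bits carry no information when no compressed instructions are present.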


https://oscarlab.github.io/papers/instrpop-systor19.pdf

ARM64 instructions are 4 bytes. x86 instructions in real-world code average 4.25 bytes. ARM64 gets closer to x86 code size as it adds new instructions to replace common instruction sequences.

RISC-V has 2-byte and 4-byte instructions and averages very close to 3-bytes. Despite this, the original compressed code was only around 15% more dense than x86. The addition of the B (bitwise) extensions and Zcb have increased that advantage by quite a lot. As other extensions get added, I'd expect to see this lead increase over time.


x86-64 wastes enough of its address space that arm64 is typically smaller in practice. The RISC-V folks pointed this out a decade ago - geomean across their SPEC suite, x86 is 7.3% larger binary size than arm64.

https://people.eecs.berkeley.edu/%7Ekrste/papers/EECS-2016-1...

So there’s another small factor leaning against x86 - inferior code density means they get less out of their icache than ARM64 due to their ISA design (legacy cruft). And ARM64 often has larger icaches anyway - M1 is 6x the icache of zen4 iirc, and they get more out of it with better code density.

<uno-reverse-card.png>


That stuff is WAY out-of-date and was flatly wrong when it was published.

A715 cut decoder size a whopping 75% by dropping the more CISC 32-bit stuff and completely eliminated the uop cache too. Losing all that decode, cache, and cache controllers means a big reduction in power consumption (decoders are basically always on). All of ARM's latest CPU designs have eliminated uop cache for this same reason.

At the time of publication, we already knew that M1 (already out for nearly a year) was the highest IPC chip ever made and did not use a uop cache.


Clam makes some serious technical mistakes in that article and some info is outdated.

1. His claim that "ARM decoder is complex too" was wrong at the time (M1 being an obvious example) and has been proven more wrong since publication. ARM dropped the uop cache as soon as they dropped support for their very CISC-y 32-bit catastrophe. They bragged that this coincided with a whopping 75% reduction in decoder size for their A715 (while INCREASING from 4 decoders to 5) and this was almost single-handedly responsible for the reduced power consumption of that chip (as all the other changes were comparatively minor). NONE of the current-gen cores from ARM, Apple, or Qualcomm use uop cache eliminating these power-hungry cache and cache controllers.

2. The paper[0] he quotes has a stupid conclusion. They show integer workloads using a massive 22% of total core power on the decoder and even their fake float workload showed 8% of total core power. Realize that a study[1] of the entire Ubuntu package repo showed that just 12 int/ALU instructions made up 89% of all code with float/SIMD being in the very low single digits of use.

3. x86 decoder situation has gotten worse. Because adding extra decoders is exponentially complex, they decided to spend massive amounts of transistors on multiple decoder blocks working on various speculated branches. Setting aside that this penalizes unrolled code (where they may have just 3-4 decoders while modern ARM will have 10+ decoders), the setup for this is incredibly complex and man-year intensive.

4. "ARM decodes into uops too" is a false equivalency. The uops used by ARM are extremely close to the original instructions as shown by them being able to easily eliminate the uop cache. x86 has a much harder job here mapping a small set of instructions onto a large set.

5. "ARM is bloated too". ARM redid their entire ISA to eliminate bloat. If ISA didn't actually matter, why would they do this?

6. "RISC-V will become bloated too" is an appeal to ignorance. x86 has SEVENTEEN major SIMD extensions excluding the dozen or so AVX-512 extensions all with various incompatibilities and issues. This is because nobody knew what SIMD should look like. We know now and RISC-V won't be making that mistake. x86 has useless stuff like BCD instructions using up precious small instruction space because they didn't know. RISC-V won't do this either. With 50+ years of figuring the basics out, RISC-V won't be making any major mistakes on the most important stuff.

7. Omitting complexity. A bloated, ancient codebase takes forever to do anything with. A bloated, ancient ISA takes forever to do anything with. If ARM and Intel both put X dollars into a new CPU design, Intel is going to spend 20-30% or maybe even more of their budget on devs spending time chasing edge cases and testers to test all those edge cases. Meanwhile, ARM is going to spend that 20-30% of their budget on increasing performance. All other things equal, the ARM chip will be better at any given design price point.

8. Compilers matter. Spitting out fast x86 code is incredibly hard because there are so many variations on how to do things each with their own tradeoffs (that conflate in weird ways with the tradeoffs of nearby instructions). We do peephole heuristic optimizations because provably fast would take centuries. RISC-V and ARM both make it far easier for compiler writers because there's usually just one option rather than many options and that one option is going to be fast.

[0] https://www.usenix.org/system/files/conference/cooldc16/cool...

[1] https://oscarlab.github.io/papers/instrpop-systor19.pdf


One more: there's more to an ISA than just the instructions; there's semantic differences as well. x86 dates to a time before out-of-order execution, caches, and multi-core systems, so it has an extremely strict memory model that does not reflect modern hardware -- the only memory-reordering optimization permitted by the ISA is store buffering.

Modern x86 processors will actually perform speculative weak memory accesses in order to try to work around this memory model, flushing the pipeline if it turns out a memory-ordering guarantee was violated in a way that became visible to another core -- but this has complexity and performance impacts, especially when applications make heavy use of atomic operations and/or communication between threads.

Simple atomic operations can be an order of magnitude faster on ARMv8 vs x86: https://web.archive.org/web/20220129144454/https://twitter.c...


"the only memory-reordering optimization permitted by the ISA is store buffering."

I think this is a mischaracterization of TSO. TSO only dictates the store ordering to other entities in the system, the individual cores are fully capable of using the results of stores that are not yet visible for their own OoO purposes as long as the dataflow dependencies are correctly solved. The complexities of the read/write bypassing is simply to clarify correct program order.

And this is why the TSO/non TSO mode on something like the apple cores doesn't seem to make a huge difference, particularly if one assumes that the core is aggressively optimized for the arm memory model, and the TSO buffering/ordering is not a critical optimization point.

Put another way, a core designed to track store ordering utilizing some kind of writeback merging is going to be fully capable of executing just as aggressively OoO and holding back or buffering the visibility of completed stores until earlier stores complete. In fact for multithreaded lock-free code the lack of explicit write fencing is likely a performance gain for very carefully optimized code in most cases. A core which can pipeline and execute multiple outstanding store fences is going to look very similar to one that implements TSO.


Yes, and Apple added this memory model to their ARM implementation so Rosetta2 would work well.


Some notes:

3: I don't think more decoders should be exponentially more complex, or even polynomial; I think O(n log n) should suffice. It just has a hilarious constant factor due to the lookup tables and logic needed, and that log factor also impacts the critical path length, i.e. pipeline length, i.e. mispredict penalty. Of note is that x86's variable-length instructions aren't even particularly good at code size.

Golden Cove (~1y after M1) has 6-wide decode, which is probably reasonably near M1's 8-wide given x86's complex instructions (mainly free single-use loads). [EDIT: actually, no, chipsandcheese's diagram shows it only moving 6 micro-ops per cycle to reorder buffer, even out of the micro-op cache. Despite having 8/cycle retire. Weird.]

6: The count of extensions is a very bad way to measure things; RISC-V will beat everything in that in no time, if not already. The main things that matter are ≤SSE4.2 (uses same instruction encoding as scalar code); AVX1/2 (VEX prefix); and AVX-512 (EVEX). The actual instruction opcodes are shared across those. But three encoding modes (plus the three different lengths of the legacy encoding) is still bad (and APX adds another two onto this) and the SSE-to-AVX transition thing is sad.

RISC-V already has two completely separate solutions for SIMD - v (aka RVV, i.e. the interesting scalable one) and p (a simpler thing that works in GPRs; largely not being worked on but there's still some activity). And if one wants to count extensions, there are already a dozen for RVV (never mind its embedded subsets) - Zvfh, Zvfhmin, Zvfbfwma, Zvfbfmin, Zvbb, Zvkb, Zvbc, Zvkg, Zvkned, Zvknhb, Zvknha, Zvksed, Zvksh; though, granted, those work better together than, say, SSE and AVX (but on x86 there's no reason to mix them anyway).

And RVV might get multiple instruction encoding forms too - the current 32-bit one is forced into allowing using only one register for masking due to lack of encoding space, and a potential 48-bit and/or 64-bit instruction encoding extension has been discussed quite a bit.

8: RISC-V RVV can be pretty problematic for some things if compiling without a specific target architecture, as the scalability means that different implementations can have good reason to have wildly different relative instruction performance (perhaps most significant being in-register gather (aka shuffle) vs arithmetic vs indexed load from memory).


3. You can look up the papers released in the late 90s on the topic. If it was O(n log n), going bigger than 4 full decoders would be pretty easy.

6. Not all of those SIMD sets are compatible with each other. Some (eg, SSE4a) wound up casualties of the Intel v AMD war. It's so bad that the Intel AVX10 proposal is mostly about trying to unify their latest stuff into something more cohesive. If you try to code this stuff by hand, it's an absolute mess.

The P proposal is basically DOA. It could happen, but nobody's interested at this point. Just like the B proposal subsumed a bunch of ridiculously small extensions, I expect a new V proposal to simply unify these. As you point out, there isn't really any conflict between these tiny instruction releases.

There is discussion around the 48-bit format (the bits have been reserved for years now), but there are a couple different proposals (personally, I think 64-bit only with the ability to put multiple instructions inside is better, but that's another topic). Most likely, a 48-bit format does NOT do multiple encoding, but instead does a superset of encodings (just like how every 16-bit instruction expands into a 32-bit instruction). They need/want 48-bits to allow 4-address instructions too, so I'd imagine it's coming sooner or later.

Either way, the length encoding is easy to work with compared to x86 where you must check half the bits in half the bytes before you can be sure about how long your instruction really is.

8. There could be some variance, but x86 has this issue too and SO many more besides.


The trend seems to be going towards multiple decoder complexes. Recent designs from AMD and Intel do this.

It makes sense to me: if the distance between branches is small, a 10-wide decode may be wasted anyway. Better to decode multiple basic blocks in parallel


I know the E-cores (gracemont, crestmont, skymont) have the multi-decoder setup; the first couple search results don't show Golden Cove being the same. Do you have some reference for that?

6. Ah yeah the funky SSE4a thing. RISC-V has its own similar but worse thing with RVV0.7.1 / xtheadvector already though, and it can be basically guaranteed that there will be tons of one-off vendor extensions, including vector ones, given that anyone can make such.

8. RVV's vrgather is extremely bad at this, but is very important for a bunch of non-trivial things; existing RVV1.0 hardware has it at O(LMUL^2), e.g. BPI-F3 takes 256 cycles for LMUL=8[1]. But some hypothetical future hardware could do it at O(LMUL) for non-worst-case indices, thus massively changing tradeoffs. So far the compiler approaches are to just not do high LMUL when vrgather is needed (potentially leaving free perf on the table), or using indexed loads (potentially significantly worse).

Whereas x86 and ARM SIMD perf variance is very tiny; basically everything is pretty proportional everywhere, with maybe the exception of very old atom cores. There'll be some differences of 2x up or down of throughput of instruction classes, but it's generally not so bad as to make way for alternative approaches to be better.

[1]: https://camel-cdr.github.io/rvv-bench-results/bpi_f3/index.h...


I think you may be correct about gracemont v golden cove. Rumors/insiders say that Intel has supposedly decided to kill off either the P or E-core team, so I'd guess that the P-core team is getting laid off because the E-core IPC is basically the same, but the E-core is massively more efficient. Even if the P-core wins, I'd expect them to adopt the 3x3 decoder just as AMD adopted a 2x4 decoder for zen5.

Using a non-frozen spec is at your own risk. There's nothing comparable to stuff like SSE4a or FMA4. The custom extension issue is vastly overstated. Anybody can make extensions, but nobody will use unratified extensions unless you are in a very niche industry. The P extension is a good example here. The current proposal is a copy/paste of a proprietary extension a company is using. There may be people in their niche using their extension, but I don't see people jumping to add support anywhere (outside their own engineers).

There's a LOT to unpack about RVV. Packed SIMD doesn't even have LMUL>1, so the comparison here is that you are usually the same as Packed SIMD, but can sometimes be better which isn't a terrible place to be.

Differing performance across different performance levels is to be expected when RVV must scale from tiny DSPs up to supercomputers. As you point out, old atom cores (about the same as the Spacemit CPU) would have a different performance profile from a larger core. Even larger AMD cores have different performance characteristics with their tendency to like double-pumping AVX2/512 instructions (but not all of them -- just some).

In any case, it's a matter of the wrong configuration unlike x86 where it is a matter of the wrong instruction (and the wrong configuration at times). It seems obvious to me that the compiler will ultimately need to generate a handful of different code variants (shouldn't be a code bloat issue because only a tiny fraction of all code is SIMD) then dynamically choose the best variant for the processor at runtime.


> Packed SIMD doesn't even have LMUL>1, so the comparison here is that you are usually the same as Packed SIMD, but can sometimes be better which isn't a terrible place to be.

Packed SIMD not having LMUL means that hardware can't rely on it being used for high performance; whereas some of the theadvector hardware (which could equally apply to rvv1.0) already had VLEN=128 with 256-bit ALUs, thus having LMUL=2 have twice the throughput of LMUL=1. And even above LMUL=2 various benchmarks have shown improvements.

Having a compiler output multiple versions is an interesting idea. Pretty sure it won't happen though; it'd be a rather difficult political mess of more and more "please add special-casing of my hardware", and would have the problem of it ceasing to reasonably function on hardware released after being compiled (unless like glibc or something gets some standard set of hardware performance properties that can be updated independently of precompiled software, which'd be extra hard to get through). Also P-cores vs E-cores would add an extra layer of mess. There might be some simpler version of just going by VLEN, which is always constant, but I don't see much use in that really.


> it's a matter of the wrong configuration unlike x86 where it is a matter of the wrong instruction

+1 to dzaima's mention of vrgather. The lack of fixed-pattern shuffle instructions in RVV is absolutely a wrong-instruction issue.

I agree with your point that multiple code variants + runtime dispatch are helpful. We do this with Highway in particular for x86. Users only write code once with portable intrinsics, and the mess of instruction selection is taken care of.
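For readers unfamiliar with the idea, here is a rough sketch of "multiple variants + runtime dispatch", in Python for brevity. This is not Highway's API: the feature names, the `VARIANTS` table and `choose_variant` are all made up for illustration, and the variants are plain-Python stand-ins for what would really be SIMD code.

```python
# Hypothetical dispatch table, ordered best-first. Each entry names the CPU
# features a variant needs. Real variants would use SIMD intrinsics; here they
# all compute the same plain-Python sum.
def sum_avx512(xs):
    return sum(xs)

def sum_avx2(xs):
    return sum(xs)

def sum_scalar(xs):
    return sum(xs)

VARIANTS = [
    ("avx512", {"avx2", "avx512f"}, sum_avx512),
    ("avx2", {"avx2"}, sum_avx2),
    ("scalar", set(), sum_scalar),  # needs nothing, so it always matches
]

def choose_variant(cpu_features):
    """Return (name, function) of the first variant the CPU fully supports."""
    for name, required, fn in VARIANTS:
        if required <= cpu_features:  # subset test on feature sets
            return name, fn
    raise RuntimeError("no usable variant")

# The choice is made once at startup, then the chosen function is called.
name, fn = choose_variant({"avx2"})
```

The user-facing code is written once; only the table lookup depends on the machine it runs on.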


> +1 to dzaima's mention of vrgather. The lack of fixed-pattern shuffle instructions in RVV is absolutely a wrong-instruction issue.

What others would you want? Something like vzip1/2 would make sense, but that isn't much of a permutation, since the input elements are exactly next to the output elements.


Going through Highway's set of shuffle ops:

64-bit OddEven/Reverse2/ConcatOdd/ConcatEven, OddEvenBlocks, SwapAdjacentBlocks, 8-bit Reverse, CombineShiftRightBytes, TableLookupBytesOr0 (=PSHUFB) and Broadcast especially for 8-bit, TwoTablesLookupLanes, InsertBlock, InterleaveLower/InterleaveUpper (=vzip1/2).

All of these are considerably more expensive on RVV. SVE has a nice set, despite also being VL-agnostic.


More RVV questionable optimization cases:

- broadcasting a loaded value: a stride-0 load can be used for this, and could be faster than going through a GPR load & vmv.v.x, but could also be much slower.

- reversing: could use vrgather (could do high LMUL everywhere and split into multiple LMUL=1 vrgathers), could use a stride -1 load or store.

- early-exit loops: It's feasible to vectorize such, even with loads via fault-only-first. But if vl=vlmax is used for it, it might end up doing a ton of unnecessary computation, esp. on high-VLEN hardware. Though there's the "fun" solution of hardware intentionally lowering vl on fault-only-first to what it considers reasonable as there aren't strict requirements for it.


Expanding on 3: I think it ends up at O(n^2 * log n) transistors, O(log n) critical path (not sure on routing or what fan-out issues might there be).

Basically: determine end of instruction at each byte (trivial but expensive). Determine end of two instructions at each byte via end2[i]=end[end[i]]. Then end4[i]=end2[end2[i]], etc, log times.

That's essentially log(n) shuffles. With 32-byte/cycle decode that's roughly five 'vpermb ymm's, which is rather expensive (though various forms of shortcuts should exist - for the larger layers direct chasing is probably feasible, and for the smaller ones some special-casing of single-byte instructions could work).

And, actually, given the mention of O(log n)-transistor shuffles at http://www.numberworld.org/blogs/2024_8_7_zen5_avx512_teardo..., it might even just be O(n * log^2(n)) transistors.

Importantly, x86 itself plays no part in the non-trivial part. It applies equally to the RISC-V compressed extension, just with a smaller constant.
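The end2[i]=end[end[i]] doubling can be sketched in software like this. It is only a sequential Python model of the idea, not a hardware description. The per-byte `lengths` input stands in for the "determine end of instruction at each byte" step and is assumed to be precomputed; the function name and the final step that picks out the k-th start by composing jumps are illustrative choices.

```python
def instruction_starts(lengths):
    """lengths[i] = length of the instruction that would start at byte i.

    Returns the byte offsets where instructions actually start, assuming the
    first instruction starts at byte 0.
    """
    n = len(lengths)
    # end[i]: where the next instruction starts if one starts at i.
    # Index n is a sentinel that maps to itself.
    end = [min(i + lengths[i], n) for i in range(n)] + [n]
    tables = [end]  # tables[j] jumps over 2**j instructions
    while (1 << len(tables)) < n:
        prev = tables[-1]
        tables.append([prev[prev[i]] for i in range(n + 1)])  # end2 = end[end]
    starts = []
    for k in range(n):
        pos = 0
        # Compose the power-of-two jumps selected by the bits of k.
        for bit, table in enumerate(tables):
            if (k >> bit) & 1:
                pos = table[pos]
        if pos >= n:
            break
        starts.append(pos)
    return starts
```

Each level of `tables` is one "shuffle" in the sense used above, and there are about log2(n) of them.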


Determining the end of a RISC-V instruction requires checking two bits and you have the knowledge that no instruction exceeds 4 bytes or uses less than 2 bytes.
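As a minimal sketch of that two-bit check (the function name is made up, and it assumes, as stated above, that only 16-bit and 32-bit encodings are in use):

```python
def rv_instruction_length(first_halfword):
    """Length in bytes of a RISC-V instruction, given its first 16 bits.

    Low two bits == 0b11 means a 32-bit instruction; anything else is a
    16-bit compressed instruction. Longer encodings are assumed absent.
    """
    return 4 if (first_halfword & 0b11) == 0b11 else 2
```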

x86 requires checking for a REX, REX2, VEX, EVEX, etc prefix. Then you must check for either 1 or 2 instruction bytes. Then you must check for the existence of a register byte, how many immediate byte(s), and if you use a scaled index byte. Then if a register byte exists, you must check it for any displacement bytes to get your final instruction length total.

RISC-V starts with a small complexity then multiplies it by a small amount. x86 starts with a high complexity then multiplies it by a big amount. The real world difference here is large.

As I pointed out elsewhere ARM's A715 dropped support for aarch32 (which is still far easier to decode than x86) and cut decoder size by 75% while increasing raw decoder count by 20%. The decoder penalties of bad ISA design extend beyond finding instruction boundaries.


I don't disagree that the real-world difference is massive; that much is pretty clear. I'm just pointing out that, as far as I can tell, it's all just a question of a constant factor, it's just massive. I've written half of a basic x86 decoder in regular imperative code, handling just the baseline general-purpose legacy encoding instructions (determines length correctly, and determines opcode & operand values to some extent), and that was already much.


> With 50+ years of figuring the basics out, RISC-V won't be making any major mistakes on the most important stuff.

RVV does have significant departures from prior work, and some of them are difficult to understand:

- the whole concept of avl, which adds complexity in many areas including reg renaming. From where I sit, we could just use masks instead.

- mask bits reside in the lower bits of a vector, so we either require tons of lane-crossing wires or some kind of caching.

- global state LMUL/SEW makes things hard for compilers and OoO.

- LMUL is cool but I imagine it's not fun to implement reductions, and vrgather.


How does avl affect register renaming? (there's the edge-case of vl=0 that is horrifically stupid (which is by itself a mistake for which I have seen no justification but whatever) but that's probably not what you're thinking of?) Agnostic mode makes it pretty simple for hardware to do whatever it wants.

Over masks it has the benefit of allowing simple hardware short-circuiting, though I'd imagine it'd be cheap enough to 'or' together mask bit groups to short-circuit on (and would also have the benefit of better masked throughput)

Cray-1 (1976) had VL, though, granted, that's a pretty long span of no-VL until RVV.


Was thinking of a shorter avl producing partial results merged into another reg. Something like a += b; a[0] += c[0]. Without avl we'd just have a write-after-write, but with it, we now have an additional input, and whether this happens depends on global state (VL).

Espasa discusses this around 6:45 of https://www.youtube.com/watch?v=WzID6kk8RNs.

Agree agnostic would help, but the machine also has to handle SW asking for mask/tail unchanged, right?
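A toy model of the extra input being described, in Python. The `vadd` function, the 8-bit element width and the `ALL_ONES` fill are illustrative assumptions; filling with all ones is just one of the behaviors allowed for agnostic elements.

```python
ALL_ONES = 0xFF  # one value agnostic tail elements are allowed to take

def vadd(dest_old, a, b, vl, tail_undisturbed):
    """Model of an 8-bit-element vector add over the first vl elements."""
    body = [(x + y) & 0xFF for x, y in zip(a[:vl], b[:vl])]
    if tail_undisturbed:
        tail = dest_old[vl:]  # the old destination is a real input here
    else:
        tail = [ALL_ONES] * (len(dest_old) - vl)  # no read of dest_old
    return body + tail

# a += b over 4 elements, then a[0] += c[0] with vl=1 and tail undisturbed.
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
c = [100, 0, 0, 0]
a = vadd(a, a, b, 4, True)
a = vadd(a, a, c, 1, True)
```

With `tail_undisturbed=True` the result depends on `dest_old` whenever `vl` is short, which is the merge dependency in question; with the agnostic fill it does not.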


> Agree agnostic would help, but the machine also has to handle SW asking for mask/tail unchanged, right?

Yes, but it should rarely do so.

The problem is that because of the vl=0 case you always have a dependency on avl. I think the motivation for the vl=0 case was that any serious ooo implementation will need to predict vl/vtype anyways, so there might as well be this nice to have feature.

IMO they should've only supported ta,mu. I think the only use case for ma is when you need to avoid exceptions. And while tu is useful, e.g. summing an array, it could be handled differently. E.g. once vl<vlmax you write the sum to a different vector and do two reductions (or rather two different vectors given the avl to vl rules).


What's the "nice to have feature" of vl=0 not modifying registers? I can't see any benefit from it. If anything, it's worse, due to the problems on reduce and vmv.s.x.


"nice to have" because it removes the need for a branch for the n=0 case. For regular loops you probably still want it, but there are situations where not needing to worry about vl=0 corrupting your data is somewhat nice.


Huh, in what situation would vl=0 clobbering registers be undesirable while on vl≥1 it's fine?

If hardware will be predicting vl, I'd imagine that would break down anyway. Potentially catastrophically so if hardware always chooses to predict vl=0 doesn't happen.


> Agree agnostic would help, but the machine also has to handle SW asking for mask/tail unchanged, right?

The agnosticness flags can be forwarded at decode-time (at the cost of the non-immediate-vtype vsetvl being very slow), so for most purposes it could be as fast as if it were a bit inside the vector instruction itself. Doesn't help vl=0 though.


Some notes: 1. Consider whether M1's 8-wide decoder could hit the 5+ GHz clock speeds that Intel Golden Cove's decoder can. More complex logic with more delays is harder to clock up. Of course M1 may be held back by another critical path, but it's interesting that no one has managed to get an 8-wide Arm decoder running at the clock speeds that Zen 3/4 and Golden Cove can.

A715's slides say the L1 icache gains uop cache features including caching fusion cases. Likely it's a predecode scheme much like AMD K10, just more aggressive with what's in the predecode stage. Arm has been doing predecode (moving some stages to the L1i fill path rather than the hotter L1i hit path) to mitigate decode costs for a long time. Mitigating decode costs again with a uop cache never made much sense especially considering their low clock speeds. Picking one solution or the other is a good move, as Intel/AMD have done. Arm picked predecode for A715.

2. The paper does not say 22% of core power is in the decoders. It does say core power is ~22% of package power. Wrong figure? Also, can you determine if the decoder power situation is different on Arm cores? I haven't seen any studies on that.

3. Multiple decoder blocks doesn't penalize decoder blocks once the load balancing is done right, which Gracemont did. And you have to massively unroll a loop to screw up Tremont anyway. Conversely, decode blocks may lose less throughput with branchy code. Consider that decode slots after a taken branch are wasted, and clustered decode gets around that. Intel stated they preferred 3x3 over 2x4 for that reason.

4. "uops used by ARM are extremely close to the original instructions" It's the same on x86, micro-op count is nearly equal to instruction count. It's helpful to gather data to substantiate your conclusions. For example, on Zen 4 and libx264 video encoding, there's ~4.7% more micro-ops than instructions. Neoverse V2 retires ~19.3% more micro-ops than instructions in the same workload. Ofc it varies by workload. It's even possible to get negative micro-op expansion on both architectures if you hit branch fusion cases enough.

8. You also have to tell your ARM compiler which of the dozen or so ISA extension levels you want to target (see https://gcc.gnu.org/onlinedocs/gcc/AArch64-Options.html#inde...). It's not one option by any means. Not sure what you mean by "peephole heuristic optimizations", but people certainly micro-optimize for both arm and x86. For arm, see https://github.com/dotnet/runtime/pull/106191/files as an example. Of course optimizations will vary for different ISAs and microarchitectures. x86 is more widely used in performance critical applications and so there's been more research on optimizing for x86 architectures, but that doesn't mean Arm's cores won't benefit from similar optimization attention should they be pressed into a performance critical role.


> Not sure what you mean by "peephole heuristic optimizations"

Post-emit or within-emit stage optimization where a sequence of instructions is replaced with a more efficient shorter variant.

Think replacing pairs of ldr and str with ldp and stp, changing ldr and increment with ldr with post-index addressing mode, replacing address calculation before atomic load with atomic load with addressing mode (I think it was in ARMv8.3-a?).

The "heuristic" here might be possibly related to additional analysis when doing such optimizations.

For example, the previously mentioned ldr, ldr -> ldp (or stp) optimization is not always a win. During work on .NET 9, there was a change[0] that improved load and store reordering to make it more likely that simple consecutive loads and stores are merged on ARM64. However, this change caused regressions in various hot paths because, for example, previously matched ldr w0, [addr], ldr w1, [addr+4] -> modify w0 -> str w0, [addr] pair got replaced with ldp w0, w1, [addr] -> modify w0 -> str w0, [addr].

Turns out this kind of merging defeated store forwarding on Firestorm (and newer) as well as other ARM cores. The regression was subsequently fixed[1], but I think the parent comment author may have had scenarios like these in mind.

[0]: https://github.com/dotnet/runtime/pull/92768

[1]: https://github.com/dotnet/runtime/pull/105695
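To make the ldr, ldr -> ldp rewrite concrete, here is a toy peephole pass in Python. It is not the .NET JIT's implementation: the tuple instruction format and the function name are invented for illustration, and it only handles adjacent 32-bit loads.

```python
def merge_ldr_pairs(instrs):
    """Replace adjacent 32-bit loads from consecutive addresses with one ldp.

    Instructions are tuples; a load is ("ldr", dest, base, offset).
    """
    out = []
    i = 0
    while i < len(instrs):
        cur = instrs[i]
        nxt = instrs[i + 1] if i + 1 < len(instrs) else None
        if (nxt is not None
                and cur[0] == "ldr" and nxt[0] == "ldr"
                and cur[2] == nxt[2]        # same base register
                and nxt[3] == cur[3] + 4    # consecutive 4-byte slots
                and cur[1] != cur[2]        # first load must not clobber the base
                and cur[1] != nxt[1]):      # ldp needs two distinct destinations
            out.append(("ldp", cur[1], nxt[1], cur[2], cur[3]))
            i += 2
        else:
            out.append(cur)
            i += 1
    return out
```

The pass itself is purely structural. The "heuristic" part is deciding when not to apply it, for example when a later narrow store to the same address would then have to forward into a load of a different width.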


1. Why would you WANT to hit 5+GHz when the downsides of exponential power take over? High clocks aren't a feature -- they are a cope.

AMD/Intel maintain I-cache and maintain a uop cache kept in sync. Using a tiny part to pre-decode is different from a massive uop cache working as far in advance as possible in the hopes that your loops will keep you busy enough that your tiny 4-wide decoder doesn't become overwhelmed.

2. The float workload was always BS because you can't run nothing but floats. The integer workload had 22.1w total core power and 4.8w power for the decoder. 4.8/22.1 is 21.7%. Even the 1.8w float case is 8% of total core power. The only other argument would be that the study is wrong and 4.8w isn't actually just decoder power.

3. We're talking about worst cases here. Nothing stops ARM cores from creating a "work pool" of upcoming branches in priority order for them to decode if they run out of stuff on the main branch. This is the best of both worlds where you can be faster on the main branch AND still do the same branchy code trick too.

4. This is the tail wagging the dog (and something else if your numbers are correct). Complex x86 instructions have garbage performance, so they are avoided by the compiler. The problem is that you can't GUARANTEE those instructions will NEVER be used, so the mere specter of them forces complex algorithms all over the place where ARM can do more simple things.

In any case, your numbers raise a VERY interesting question about x86 being RISC under the hood.

Consider this. Say that we have 1024 bytes of ARM code (256 instructions). x86 is around 15% smaller (870.4 bytes) and with the longer 4.25 byte instruction average, x86 should have around 205 instructions. If ARM is generating 19.3% more uops than instructions, we have about 305 uops. x86 with just 4.7% more has 215 uops (the difference is way outside any margin of error).

If both are doing the same work, x86 uops must be in the range of 30% more complex. Given the limits of what an ALU can accomplish, we can say with certainty that x86 uops are doing SOMETHING that isn't the RISC they claim to be doing. Perhaps one could claim that x86 is doing some more sophisticated instructions in hardware, but that's a claim that would need to be substantiated (I don't know what ISA instructions you have that give a 15% advantage being done in hardware, but aren't already in the ARM ISA and I don't see ARM refusing to add circuitry for current instructions to the ALU if it could reduce uops by 15% either).
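The back-of-envelope estimate can be redone without intermediate rounding. The 15% density advantage and the 4.25-byte average are the assumptions from the comment; the expansion ratios are the figures quoted upthread. Variable names are just for illustration.

```python
arm_bytes = 1024
arm_instrs = arm_bytes / 4            # fixed 4-byte instructions
x86_bytes = arm_bytes * (1 - 0.15)    # assumed 15% denser code
x86_instrs = x86_bytes / 4.25         # assumed average x86 instruction length
arm_uops = arm_instrs * 1.193         # 19.3% more uops than instructions
x86_uops = x86_instrs * 1.047         # 4.7% more uops than instructions
fewer = 1 - x86_uops / arm_uops       # fraction fewer uops on the x86 side

print(f"ARM uops ~{arm_uops:.0f}, x86 uops ~{x86_uops:.0f}, x86 fewer by {fewer:.1%}")
```

Under these inputs the x86 side comes out just under 30% lower in uop count, which is where the "30%" figure comes from. The result is only as good as the two assumed inputs.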

8. https://en.wikipedia.org/wiki/Peephole_optimization

The final optimization stage is basically heuristic find & replace. There could in theory be a mathematically provable "best instruction selection", but finding it would require trying every possible combination which isn't possible as long as P≠NP holds true.

My favorite absurdity of x86 (though hardly the only one) is padding. You want to align function calls at cacheline boundaries, but that means padding the previous cache line with NOPs. Those NOPs translate into uops though. Instead, you take your basic, short instruction and pad it with useless bytes. Add a couple useless bytes to a bunch of instructions and you now have the right length to push the function over to the cache boundary without adding any NOPs.

But the issues go deeper. When do you use a REX prefix? You may want it so you can use 16 registers, but it also increases code size. REX2 with APX is going to increase this issue further where you must juggle when to use 8, 16, or 32 registers and when you should prefer the long REX2 because it has 3-register instructions. All kinds of weird tradeoffs exist throughout the system. Because the compilers optimize for the CPU and the CPU optimizes for the compiler, you can wind up in very weird places.

In an ISA like ARM, there isn't any code density weirdness to consider. In fact, there's very little weirdness at all. Write it the intuitive way and you're pretty much guaranteed to get good performance. Total time to work on the compiler is a zero-sum game given the limited number of experts. If you have to deal with these kinds of heuristic headaches, there's something else you can't be working on.


> My favorite absurdity of x86 (though hardly the only one) is padding. You want to align function calls at cacheline boundaries, but that means padding the previous cache line with NOPs. Those NOPs translate into uops though.

I'd call that more neat than absurd.

> You may want it so you can use 16 registers, but it also increases code size.

RISC-V has the exact same issue, some compressed instructions having only 3 bits for operand registers. And on x86 for 64-bit-operand instructions you need the REX prefix always anyways. And it's not that hard to pretty reasonably solve - just assign registers by their use count.

Peephole optimizations specifically here are basically irrelevant. Much of the complexity for x86 comes from just register allocation around destructive operations (though, that said, that does have rather wide-ranging implications). Other than that, there's really not much difference; all have the same general problems of moving instructions together for fusing, reordering to reduce register pressure vs putting parallelizable instructions nearer, rotating loops to reduce branches, branches vs branchless.
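A deliberately naive sketch of "assign registers by their use count", in Python. It ignores live ranges and interference entirely (every value gets its own register), so it only illustrates the ordering idea: the most-used values get the registers that need no REX prefix for 32-bit operations. The register lists and the function name are illustrative.

```python
from collections import Counter

# 32-bit registers encodable without a REX prefix (stack/frame pointers left out).
LOW_REGS = ["eax", "ecx", "edx", "ebx", "esi", "edi"]
# Registers that always need a REX prefix.
HIGH_REGS = ["r8d", "r9d", "r10d", "r11d", "r12d", "r13d", "r14d", "r15d"]

def assign_by_use_count(uses):
    """uses: one entry per occurrence of a virtual register in the code."""
    counts = Counter(uses)
    ordered = [vreg for vreg, _ in counts.most_common()]  # most-used first
    pool = LOW_REGS + HIGH_REGS
    if len(ordered) > len(pool):
        raise ValueError("more values than registers; spilling not modeled")
    return dict(zip(ordered, pool))
```

For example, `assign_by_use_count(["a", "b", "b", "c", "b", "c"])` puts `b` in `eax`, `c` in `ecx` and `a` in `edx`.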


RISC-V has a different version of this issue that is pretty straight-forward. Preferring 2-register operations is already done to save register space. The only real extra is preferring the 8 registers C uses for math. After this, it's all just compression.

x86 has a multitude of other factors than just compression. This is especially true with standard vs REX instructions because most of the original 8 registers have specific purposes and instructions that depend on them for these (eg, Accumulator instructions with A register, Mul/div using A+D, shift uses C, etc). It's a problem a lot harder than simple compression.

Just as cracking an alphanumeric password is exponentially harder than a same-length password with numbers only, solving for all the x86 complications and exceptions is also exponentially harder.


If anything, I'd say x86's fixed operands make register allocation easier! Don't have to register-allocate that which you can't. (ok, it might end up worse if you need some additional 'mov's. And in my experience more 'mov's is exactly what compilers often do.)

And, right, RISC-V even has the problem of being two-operand for some compressed instructions. So the same register allocation code that's gone towards x86 can still help RISC-V (and vice versa)! On RISC-V, failure means 2→4 bytes on a compressed instruction, and on x86 it means +3 bytes of a 'mov'. (granted, the additional REX prefix cost is separate on x86, while included in decompression on RISC-V)


With 16 registers, you can't just avoid a register because it has a special use. Instead, you must work to efficiently schedule around that special use.

Lack of special GPRs means you can rename with impunity (this will change slightly with the load/store pair extension). Having 31 truly GPR rather than 8 GPR+8 special GPR also gives a lot of freedom to compilers.


Function arguments and return values already are effectively special use, and should frequently be on par if not much more frequent than the couple x86 instructions with fixed registers.

Both clang and gcc support calls having differing used calling conventions within one function, which ends up effectively exactly identical to fixed-register instructions (i.e. an x86 'imul r64' can be done via a pseudo-function where the return values are in rdx & rax, an input is in rax, and everything else is non-volatile; and the dynamically-choosable input can be allocated separately). And '__asm__()' can do mixed fixed and non-fixed registers anyway.


Unlike x86, none of this is strictly necessary. As long as you put things back as expected, you may use all the registers however you like.


The option of not needing any fixed register usage would apply to, what, optimizing compilers without support for function calls (at least via passing arguments/results via registers)? That's a very tiny niche to use as an argument for having simplified compiler behavior.

And good register allocation is still pretty important on RISC-V - using more registers, besides leading to less compressed instruction usage, means more non-volatile register spilling/restoring in function prologue/epilogue, which on current compilers (esp. clang) happens at the start & end of functions, even in paths that don't need the registers.

That said, yes, RISC-V still indeed has much saner baseline behavior here and allows for simpler basic register allocation, but for non-trivial compilers the actual set of useful optimizations isn't that different.


Not just simpler basic allocation. There are fewer hazards to account for as well. The process on RISC-V should be shorter, faster, and with less risk that the chosen heuristics are bad in an edge case.


1. Performance. Also Arm implemented instruction cache coherency too.

Predecode/uop cache are both means to the same end, mitigating decode power. AMD and Intel have used both (though not on the same core). Arm has used both, including both on the same core for quite a few generations.

And a uop cache is just a cache. It's also big enough on current generations to cache more than just loops, to the point where it covers a majority of the instruction stream. Not sure where the misunderstanding of the uop cache "working as far in advance as possible" comes from. Unless you're talking about the BPU running ahead and prefetching into it? Which it does for L1i, and L2 as well?

2. "you can't run nothing but floats" they didn't do that in the paper, they did D += A[j] + B[j] ∗ C[j]. Something like matrix multiplication comes to mind, and that's not exactly a rare workload considering some ML stuff these days.

But also, has a study been done on Arm cores? For all we know they could spend similar power budgets on decode, or more. I could say an Arm core uses 99% of its power budget on decode, and be just as right as you are (they probably don't, my point is you don't have concrete data on both Arm and x86 decode power, which would be necessary for a productive discussion on the subject)

3. You're describing letting the BPU run ahead, which everyone has been doing for the past 15 years or so. Losing fetch bandwidth past a taken branch is a different thing.

4. Not sure where you're going. You started by suggesting Arm has less micro-op expansion than x86, and I provided a counterexample. Now you're talking about avoiding complex instructions, which a) compilers do on both architectures, they'll avoid stuff like division, and b) humans don't in cases where complex instructions are beneficial, see Linux kernel using rep movsb (https://github.com/torvalds/linux/blob/5189dafa4cf950e675f02...), and Arm introducing similar complex instructions (https://community.arm.com/arm-community-blogs/b/architecture...)

Also "complex" x86 instructions aren't avoided in the video encoding workload. On x86 it takes ~16.5T instructions to finish the workload, and ~19.9T on Arm (and ~23.8T micro-ops on Neoverse V2). If "complex" means more work per instruction, then x86 used more complex instructions, right?

8. You can use a variable length NOP on x86, or multiple NOPs on Arm to align function calls to cacheline boundaries. What's the difference? Isn't the latter worse if you need to move by more than 4 bytes, since you have multiple NOPs (and thus multiple uops, which you think is the case but isn't always true, as some x86 and some Arm CPUs can fuse NOP pairs)
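The NOP-count comparison can be put in numbers with a small Python sketch. The function name is made up, and the 9-byte figure for the longest single x86 NOP is an assumption based on the commonly recommended multi-byte NOP forms; NOP fusion is not modeled.

```python
import math

def padding_nops(addr, align=64, max_nop_len=4):
    """Bytes of padding needed to reach the next aligned address, and how many
    NOP instructions that takes if one NOP covers at most max_nop_len bytes."""
    pad = (-addr) % align
    return pad, math.ceil(pad / max_nop_len)

# A function body ending at 0x1028 needs 24 bytes to reach 0x1040.
arm = padding_nops(0x1028, max_nop_len=4)   # fixed 4-byte NOPs
x86 = padding_nops(0x1028, max_nop_len=9)   # assumed longest single NOP
```

For that address the fixed-width case needs 6 NOPs and the variable-length case 3.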

But seriously, do try gathering some data to see if cacheline alignment matters. A lot of x86/Arm cores that do micro-op caching don't seem to care if a function (or branch target) is aligned to the start of a cacheline. Golden Cove's return predictor does appear to track targets at cacheline granularity, but that's a special case. Earlier Intel and pretty much all AMD cores don't seem to care, nor do the Arm ones I've tested.

Anyway, you're making a lot of unsubstantiated guesses on "weirdness" without anything to suggest it has any effect. I don't think this is the right approach. Instead of "tail wagging the dog" or whatever, I suggest a data-based approach where you conduct experiments on some x86/Arm CPUs, and analyze some x86/Arm programs. I guess the analogy is, tell the dog to do something and see how it behaves? Then draw conclusions off that?


1. The biggest chip market is laptops and getting 15% better performance for 80% more power (like we saw with X Elite recently) isn't worth doing outside the marketing win of a halo product (a big reason why almost everyone is using slower X Elite variants). The most profitable (per-chip) market is servers. They also prefer lower clocks and better perf/watt because even with the high chip costs, the energy will wind up costing them more over the chip's lifespan. There's also a real cost to adding extra pipeline stages. Tejas/Jayhawk cores are Intel's cancelled examples of this.

L1 cache is "free" in that you can fill it with simple data moves. uop cache requires actual work to decode and store elements for use in addition to moving the data. As to working ahead, you already covered this yourself. If you have a nearly 1-to-1 instruction-to-uop ratio, having just 4 decoders (eg, zen4) is a problem because you can execute a lot more than just 4 instructions on the backend. 6-wide Zen4 means you use 50% more instructions than you decode per clock. You make up for this in loops, but that means while you're executing your current loop, you must be maxing out the decoders to speculatively fill the rest of the uop cache before the loop finishes. If the loop finishes and you don't have the next bunch of instructions decoded, you have a multi-cycle delay coming down the pipeline.

2. I'd LOVE to see a similar study of current ARM chips, but I think the answer here is pretty simple to deduce. ARM's slide says "4x smaller decoders vs A710" despite adding a 5th decoder. They claim 20% reduction in power at the same performance and the biggest change is the decoder. As x86 decode is absolutely more complex than aarch32, we can only deduce that switching from x86 to aarch64 would be an even more massive reduction. If we assume an identical 75% reduction in decoder power, the Haswell decoder would move from 4.8w down to 1.2w, reducing total core power from 22.1w to 18.5w, or a ~16% overall reduction in power. This isn't too far from the power numbers claimed by ARM.
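Spelling out that arithmetic in Python, with the wattages taken from the comment and the 75% reduction as its stated assumption (variable names are illustrative):

```python
core_w = 22.1      # total core power, integer workload
decode_w = 4.8     # power attributed to the decoder

decode_share = decode_w / core_w                  # decoder share of core power
new_decode_w = decode_w * (1 - 0.75)              # assumed 75% decoder reduction
new_core_w = core_w - (decode_w - new_decode_w)   # everything else unchanged
saving = 1 - new_core_w / core_w                  # overall core power reduction

print(f"share {decode_share:.1%}, new core {new_core_w:.1f} W, saving {saving:.1%}")
```

This treats the 75% area figure as if it were a 75% power figure, which is the step the estimate hinges on.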

4. This was a tangent. I was talking about uops rather than the ISA. Intel claims to be simple RISC internally just like ARM, but if Intel is using nearly 30% fewer uops to do the same work, their "RISC" backend is way more complex than they're admitting.

8. I believe aligning functions to cacheline boundaries is a default flag at higher optimization levels. I'm pretty sure that they did the analysis before enabling this by default. x86 NOP flexibility is superior to ARM (as is its ability to avoid them entirely), but the cause is the weirdness of the x86 ISA and I think it's an overall net negative.

Loads of x86 instructions are microcode only. Use one and it'll be thousands of cycles. They remain in microcode because nobody uses them, so why even try to optimize and they aren't used because they are dog slow. How would you collect data about this? Nothing will ever change unless someone pours in millions of dollars in man-hours into attempting to speed it up, but why would anyone want to do that?

Optimizing for a local maximum rather than a global maximum happens all over technology and it happens exactly because of the data-driven approach you are talking about. Look for the hot code and optimize it without regard that there may be a better architecture you could be using instead. Many successes relied on an intuitive hunch.

ISA history has a ton of examples. iAPX432 super-CISC, the RISC movement, branch delay slots, register windows, EPIC/VLIW, Bulldozer's CMT, or even the Mill design. All of these were attempts to find new maxima with greater or lesser degrees of success. When you look into these, pretty much NONE of them had any real data to drive them because there wasn't any data until they'd actually started work.


1. Yeah I agree, both X Elite and many Intel/AMD chips clock well past their efficiency sweet spot at stock. There is a cost to extra pipeline stages, but no one is designing anything like Tejas/Jayhawk, or even earlier P4 variants these days. Also P4 had worse problems (like not being able to cancel bogus ops until retirement) than just a long pipeline.

Arm's predecoded L1i cache is not "free" and can't be filled with simple data moves. You need predecode logic to translate raw instruction bytes into an intermediate format. If Arm expanded predecode to handle fusion cases in A715, that predecode logic is likely more complex than in prior generations.

2. Size/area is different from power consumption. Also the decoder is far from the only change. The BTBs were changed from 2 to 3 level, and that can help efficiency (could make a smaller L2 BTB with similar latency, while a slower third level keeps capacity up). TLBs are bigger, probably reducing page walks. Remember page walks are memory accesses and the paper earlier showed data transfers count for a large percentage of dynamic power.

4. IMO no one is really RISC or CISC these days

8. Sure you can align the function or not. I don't think it matters except in rare corner cases on very old cores. Not sure why you think it's an overall net negative. "feeling weird" does not make for solid analysis.

Most x86 instructions are not microcode only. Again, check your data with performance counters. Microcoded instructions are in the extreme minority. Maybe microcoded instructions were more common in 1978 with the 8086, but a few things have changed between then and now. Also microcoded instructions do not cost thousands of cycles, have you checked? E.g. a gather is ~22 micro ops on Haswell, from https://uops.info/table.html. Golden Cove does it in 5-7 uops.

ISA history has a lot of failed examples where people tried to lean on the ISA to simplify the core architecture. EPIC/VLIW, branch delay slots, and register windows have all died off. Mill is a dumb idea and never went anywhere. Everyone has converged on big OoO machines for a reason, even though doing OoO execution is really complex.

If you're interested in cases where ISA does matter, look at GPUs. VLIW had some success there (AMD Terascale, the HD 2xxx to 6xxx generations). Static instruction scheduling is used in Nvidia GPUs since Kepler. In CPUs ISA really doesn't matter unless you do something that actively makes an OoO implementation harder, like register windows or predication.


That was true when ARM was first released, but over the years the decoder for ARM has gotten more and more complicated. Who would have guessed adding more specialized instructions would result in more complicated decoders? ARM now uses multi-stage decoders, just the same as x86.


Sure, but it's not idle power consumption that's the difference between these.


When a laptop gets 12 hours or more of battery life that's because it's 90% idle.


And while it's important to design a chip that can enter a deep idle state, the thing that differentiates one Windows laptop from the next is how many mistakes the BIOS writers made and whether the platform drivers work correctly. This is also why you cannot really judge the expected battery life under Linux by reading reviews of laptops running Windows.


I didn’t watch this link, but my Zenbook S 16 only gets remotely close to my M2 MBA battery life if the Zenbook is in whatever Windows 11 calls ‘efficiency’ mode, and then it benchmarks at 50% of the M2.

I don’t think the two are remotely comparable in perf/watt.


Unlike AMD and Qualcomm, Apple uses an expensive TSMC 3nm process, so you would expect better battery life from the "MBP3". I assume they used the process improvements to increase performance instead.


Perf per watt is higher for M1 on N5 vs Zen5 on N4P, so the problems go deeper than just process.

X Elite also beats AMD/Intel in perf/watt while being on the same N4P node as HX370.

https://www.notebookcheck.net/AMD-Zen-5-Strix-Point-CPU-anal...


Performance per watt also depends on clock speed. Other things equal, higher clock speed means worse performance per watt.
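A toy model of why that holds, assuming dynamic power scales as C·V²·f and that voltage has to rise roughly linearly with clock once you are past the lowest usable voltage. Every constant here is made up for illustration and does not describe any real chip:

```python
# Toy DVFS model: dynamic power ~ C * V^2 * f.
# All constants are hypothetical, chosen only to show the shape of the curve.

V_MIN = 0.7          # volts: assumed lowest usable voltage
F_AT_V_MIN = 2.0     # GHz: assumed highest clock reachable at V_MIN
VOLTS_PER_GHZ = 0.2  # assumed extra voltage needed per GHz above F_AT_V_MIN
C = 5.0              # arbitrary capacitance-like constant (W per V^2 per GHz)


def voltage(f_ghz: float) -> float:
    """Voltage needed for a clock: flat at V_MIN up to F_AT_V_MIN, then linear."""
    return V_MIN + max(0.0, f_ghz - F_AT_V_MIN) * VOLTS_PER_GHZ


def power(f_ghz: float) -> float:
    """Dynamic power in watts under the C * V^2 * f assumption."""
    v = voltage(f_ghz)
    return C * v * v * f_ghz


def perf_per_watt(f_ghz: float) -> float:
    """Assumes performance is proportional to clock (ignores memory stalls)."""
    return f_ghz / power(f_ghz)


for f in (2.0, 3.0, 4.0, 5.0):
    print(f"{f:.1f} GHz  {power(f):6.2f} W  {perf_per_watt(f):.3f} perf/W")
```

In this model perf/watt is flat up to the clock reachable at minimum voltage and falls off as 1/V² beyond it, so the most "efficient" setting is also the slowest one.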


The display, RAM, and other peripherals are consuming power too. Short of running continuous high CPU loads, which most people don't do on laptops, changes in CPU efficiency have less apparent effect on battery life because it's only a fraction of overall power draw.
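As a rough sketch of that effect: if the CPU is only a share of the average system draw, a CPU power reduction is diluted by everything else. The 25% and 80% shares and the 30% reduction below are hypothetical inputs, not measurements of any laptop:

```python
def battery_life_gain(cpu_share: float, cpu_power_reduction: float) -> float:
    """Relative battery-life gain when only the CPU's share of power shrinks.

    cpu_share: fraction of average system power drawn by the CPU (0..1)
    cpu_power_reduction: fraction by which CPU power drops (0..1)
    """
    new_total = (1.0 - cpu_share) + cpu_share * (1.0 - cpu_power_reduction)
    return 1.0 / new_total - 1.0


# Hypothetical light-use split: CPU is 25% of the average draw.
print(f"light use:      {battery_life_gain(0.25, 0.30):.1%}")  # 8.1%
# Hypothetical sustained load: CPU is 80% of the average draw.
print(f"sustained load: {battery_life_gain(0.80, 0.30):.1%}")  # 31.6%
```

Same CPU improvement, very different battery-life result, depending purely on how much of the budget the CPU was using in the first place.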


> within an hour of MBP3

Not a good way to measure. The Zenbook S16 has a larger 78Wh battery vs the MacBook Pro’s 69.6Wh.

So that’s 11% less battery life despite 12% more battery capacity.
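Normalizing for capacity makes the gap clearer. This sketch assumes both machines drained their full rated capacity in the test and takes the 11% runtime difference at face value:

```python
zenbook_wh = 78.0            # Zenbook S 16 battery capacity quoted above
macbook_wh = 69.6            # MacBook Pro battery capacity quoted above
runtime_ratio = 1.0 - 0.11   # Zenbook runtime relative to the MacBook (11% less)

capacity_ratio = zenbook_wh / macbook_wh
# Average power = capacity / runtime, so the ratio of average draw is:
avg_power_ratio = capacity_ratio / runtime_ratio

print(f"capacity ratio:  {capacity_ratio:.2f}x")
print(f"avg power ratio: {avg_power_ratio:.2f}x")
```

Under those assumptions the Zenbook averaged roughly a quarter more power over the run than the MacBook did.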


Yeah if you make a worse core and then downclock it then you will increase power efficiency. AMD thankfully only downclocks the 5c, but Intel is shipping ivy lake equivalents in their flagship products just to get power efficiency up.


One of these has to be true (or both true):

1. ARM is inherently more efficient than x86 CPUs in most tasks

2. Nuvia and Apple are better CPU designers than AMD and Intel

Here are results from Notebookcheck:

Cinebench R24 ST perf/watt

* M3: 12.7 points/watt

* X Elite: 9.3 points/watt

* AMD HX 370: 3.74 points/watt

* AMD 8845HS: 3.1 points/watt

* Intel 155H: 3.1 points/watt

In ST, Apple is 3.4x more efficient than Zen5. X Elite is 2.5x more efficient than Zen5.

Cinebench R24 MT perf/watt

* M3: 28.3 points/watt

* X Elite: 22.6 points/watt

* AMD HX 370: 19.7 points/watt

* AMD 8845HS: 14.8 points/watt

* Intel 155H: 14.5 points/watt

In MT, Apple is 1.9x more efficient than Zen4 and 1.4x more efficient than HX 370. I expect M3 Pro/Max to increase the gap because generally, more cores means more efficiency for Cinebench MT. X Elite is also more efficient but the gap is closer. However, we should note that in a laptop, ST matters more for efficiency because of the burst behavior of usage. It's easier to gain in MT efficiency as long as you have many cores and run them at lower wattage. In this case, AMD's Zen5 12 core setup and 24 threads works well in Cinebench. Cinebench loves more threads.

One thing that is intriguing is that X Elite does not have little cores, which hurts its MT efficiency. It's likely a remnant of Nuvia designing a server CPU, which does not need big.LITTLE, before Qualcomm shipped the design in a laptop SoC first.

Sources: https://www.youtube.com/watch?v=ZN2tC8DfJnc

https://www.notebookcheck.net/AMD-Zen-5-Strix-Point-CPU-anal...
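For anyone who wants to recompute the ratios, here is a small sketch using only the points/watt figures listed above. The helper name `relative_to` is just for illustration:

```python
# Cinebench R24 points/watt as quoted in this comment.
st = {"M3": 12.7, "X Elite": 9.3, "HX 370": 3.74, "8845HS": 3.1, "155H": 3.1}
mt = {"M3": 28.3, "X Elite": 22.6, "HX 370": 19.7, "8845HS": 14.8, "155H": 14.5}


def relative_to(table: dict, baseline: str) -> dict:
    """Return each chip's points/watt as a multiple of the baseline chip."""
    base = table[baseline]
    return {chip: value / base for chip, value in table.items()}


for name, table in (("ST", st), ("MT", mt)):
    print(name)
    for chip, ratio in relative_to(table, "HX 370").items():
        print(f"  {chip:8s} {ratio:.2f}x vs HX 370")
```

Swapping the baseline argument to "8845HS" gives the Zen4 comparison.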


Performance per watt is a bad metric. You want instead performance for a given power budget (eg, how much performance can I get at 15w? 30w? etc...)

Otherwise you can trivially win 100% of performance/watt comparisons by just setting clocks to the limit of the lowest usable voltage level.

For example compare the 7950X to the 7950X@65w using the officially supported eco mode option: https://www.anandtech.com/show/17585/amd-zen-4-ryzen-9-7950x...

Cinebench R23 MT:

7950X stock: 225 points/watt

7950X @ 65w eco mode: 482 points/watt

Over 2x perf/watt improvement on the exact same chip, and a power efficiency that tops the charts, beating every laptop chip in that notebookcheck test by a large amount as well. And yet how many 7950x owners are using the 65w eco mode option? Probably none. Because perf/watt isn't actually meaningful. Rather it's how much performance can I get for a given power budget.
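The arithmetic behind that, as a quick sketch. The implied score assumes the eco-mode run drew exactly the 65 W it was configured for, which is an assumption on my part rather than something stated above:

```python
stock_ppw = 225.0   # points/watt, 7950X stock (quoted above)
eco_ppw = 482.0     # points/watt, 7950X in 65 W eco mode (quoted above)
eco_watts = 65.0    # assumption: measured power equals the 65 W setting

gain = eco_ppw / stock_ppw
implied_eco_score = eco_ppw * eco_watts

print(f"perf/watt gain: {gain:.2f}x")
print(f"implied eco-mode score: {implied_eco_score:.0f} points")
```

The perf/watt number alone says nothing about how much absolute performance was given up to get there, which is why a fixed power budget is the more useful comparison.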


>Performance per watt is a bad metric.

It isn't. It needs context.

Sure, you can get an Intel Celeron to have more perf/watt than an M1 if you give the Celeron low enough wattage.

The key here is the absolute performance.

In this case, both the M3 and the X Elite are not only significantly more efficient than both Zen4 and Zen5 in ST, they are also straight up faster while being more efficient.


> 1. ARM is inherently more efficient than x86 CPUs in most tasks

> 2. Nuvia and Apple are better CPU designers than AMD and Intel

The third possibility is that they just pick a different point on the efficiency curve. You can double power consumption in exchange for a few percent higher performance, double it again for an even smaller increase.

The max turbo on the i9 14900KS is 253 W. The power efficiency is bad. But it generally outperforms the M3, despite being on a significantly worse process node, because that's the trade off.

AMD is only on a slightly worse process node and doesn't have to do anything so aggressive, but they'll also sell you whatever you want. The 8845HS and 8840U are basically the same chip, but the former has around double the TDP. In exchange for that you get ~2% more single thread performance and ~15% more multi-thread performance. Whereas the performance per watt for the 8840U is nearly that of the M3, and the remaining difference is basically the process node.
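To put numbers on that trade, here is a minimal sketch using the approximate figures above. It assumes actual power draw tracks TDP, which is a simplification:

```python
def relative_perf_per_watt(perf_gain: float, power_ratio: float) -> float:
    """Perf/watt of the higher-power part relative to the lower-power one."""
    return (1.0 + perf_gain) / power_ratio


# Figures from this comment: ~2x TDP buys ~2% ST and ~15% MT.
print(f"ST perf/watt, 8845HS vs 8840U: {relative_perf_per_watt(0.02, 2.0):.1%}")
print(f"MT perf/watt, 8845HS vs 8840U: {relative_perf_per_watt(0.15, 2.0):.1%}")
```

So on these figures the higher-TDP part lands at a little over half the perf/watt of the lower-TDP one in both cases, from the same silicon.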


>The third possibility is that they just pick a different point on the efficiency curve. You can double power consumption in exchange for a few percent higher performance, double it again for an even smaller increase.

This only makes sense if the Zen5 is actually faster in ST than the M3. In this case, the M3 is 1.24x faster and 3.4x more efficient in ST than Zen5.

AMD's Zen5 chip is just straight up slower in any curve.

>The max turbo on the i9 14900KS is 253 W. The power efficiency is bad. But it generally outperforms the M3, despite being on a significantly worse process node, because that's the trade off.

It's not a trade off that Intel wants. The 14900KS runs at 253w (sometimes 400w+) because that's the only way Intel is able to stay remotely competitive at the very high end. An M3 Max will often match a 14900KS in performance using 5-10% of the power.


> This only makes sense if the Zen5 is actually faster in ST than the M3. In this case, the M3 is 1.24x faster and 3.4x more efficient in ST than Zen5.

It makes sense if Zen5 is faster in MT, since that's when the CPUs will be power limited, and it is. For ST the performance generally isn't power-limited for either of them and then the M3 is on a newer process node.

It also depends on the benchmark. For example, Zen5 is faster in ST on Cinebench R23. It's not obvious what's going on with R24, but it's a difference in the code rather than the hardware.

The power numbers in that link also don't inspire a lot of confidence. They have two systems with the same CPU but one of them uses 119.3W and the other one uses 46.7W? Plausibly the OEMs could have configured them differently but that kind of throws out the entire premise of using the comparison to measure the efficiency of the CPUs. The number doesn't mean anything if the power consumption is being set as a configuration parameter by the OEM and the number of watts going to the display or a discrete GPU are an uncontrolled hidden variable.

> It's not a trade off that Intel wants.

It's the one they've always taken, even when they were unquestionably in the lead. They were selling up to 150W desktop processors in 2008, because people buy them, because they're faster.

Now they have to do it just to be competitive because their process isn't as good, but the process is a different thing than the ISA or the design of the CPU.


>It also depends on the benchmark. For example, Zen5 is faster in ST on Cinebench R23. It's not obvious what's going on with R24, but it's a difference in the code rather than the hardware.

Cinebench R23 uses Intel Embree underneath. It's hand optimized for AVX instruction set and poorly translated to NEON. It's not even clear if it has any NEON optimization.


R23 doesn’t have the same SIMD optimizations available for ARM as it does for x86.

Anyone using R23 instead of R24 is putting Arm at a disadvantage. Notebookcheck is often called out for this and hasn’t really addressed why they stick with R23 beyond not wanting to redo tests for older hardware. They are by far the outlier for performance numbers, which is why the discussion around performance gets muddied.


> R23 doesn’t have the same SIMD optimizations available for ARM as it does for x86.

The single-thread benchmark is SIMD-heavy?

Now it just sounds like Cinebench ST is a useless benchmark because it's putting a parallelizable SIMD workload on a single core. In real life you'd always be running those multi-threaded, whereas the reason people care about ST performance is for the serialized branch-heavy spaghetti code that inherently only runs on one core. "Run the SIMD code, but clamp it to a single thread" is a garbage proxy for that.


Yes, there’s no difference between the single and multi core benchmark other than how many threads get spun up.

I’m not sure why you’re trying to equate simd with parallelization. Tbh, a lot of your response seems odd to me because it’s making several incorrect assumptions.

You can’t really escape parallelization with how any modern core works, even on a single core. You may have certain operations process concurrently depending on how the core’s resources are available at a given time and what is needed.

Regardless, SIMD isn’t concurrency. It’s batching.

There’s still significant benefit to having SIMD on a single threaded task. There’s a lot of thread overhead to using multiple cores to do something, whereas SIMD lets you effectively batch things on a single core.


SIMD workloads generally imply that you're doing the same operation repeatedly. It's literally in the name; single instruction, multiple data. There are occasional cases where that happens but doesn't parallelize well. TLS is probably a good example because you might have to encrypt a network packet and it's big enough to benefit from SIMD but not big enough that the overhead of splitting it across cores is worth it.

But most of the time if you're doing the same operation repeatedly you'll benefit from using more cores. Even for TLS, the client might not split the individual connection across multiple cores, but the server is going to handle multiple clients at once in parallel. Heavy workloads like video encoding make this even more apparent. In general the things that benefit from SIMD are parallel tasks that benefit from multiple cores.

Compare this with, say, a browser running JavaScript in a single tab. There is nothing to put on another core, you don't know what instructions will be executed next until you get there. This is where people actually care about single-thread performance, and where processors achieve it by using branch prediction etc. But these exercise very different parts of the CPU than SIMD-heavy workloads. The latter can easily fill the execution units of a wide processor that would be stymied by the former.


This feels like a really absurd stretch of trying to discern SIMD away from stuff like standard integer and float operations.

Maybe in the early 90s, but they’re such a part of processor design that you can’t realistically avoid them.

Especially for rendering, which is matrix math heavy, you’d have to design something completely bespoke to avoid it. SIMD is a natural necessity for rendering with any kind of performance.

And because SIMD is such a part of every mainstream processor, it’s very important that benchmarks show how well they perform.

I also don’t understand why you think a JavaScript runtime wouldn’t use SIMD. V8 can make use of SIMD, whether directly targeted or indirectly via the compiler that compiled the runtime itself.

If you want to stress very specific parts of a processor, then use something like SPEC. Cinebench is meant to be a realistic reflection of production rendering.


> Especially for rendering, which is matrix math heavy, you’d have to design something completely bespoke to avoid it. SIMD is a natural necessity for rendering with any kind of performance.

Well sure, but rendering is a classically parallel operation which is regularly implemented as threaded.

> I also don’t understand why you think a JavaScript runtime wouldn’t use SIMD. V8 can make use of SIMD, whether directly targeted or indirectly via the compiler that compiled the runtime itself.

JavaScript runtimes are executing code, so they'll implement the whole gamut and their execution will depend on what kind of code it actually is. But the common JavaScript code, and the kind presumably being tested in JavaScript benchmarks because it's what people care about, isn't implementing a video encoder using SIMD. It's manipulating DOM objects and parsing short pieces of text input, which is branch-heavy code with lots of indirection and very little use of SIMD if any.

> If you want to stress very specific parts of a processor, then use something like SPEC. Cinebench is meant to be a realistic reflection of production rendering.

Which is kind of my point. Production rendering is going to be threaded and max out all the cores, which is Cinebench MT. "CineBench ST" is measuring something that nobody does in real life and doesn't even really correlate with the things people actually do.

It doesn't represent real threaded workloads (which optimally run on many low-clocked cores, not one high-clocked one) nor real serialized workloads (which are full of conditional jumps and cache misses).


Rendering is parallel, yes, but it also makes heavy use of SIMD to accelerate the operations per thread. One does not obviate the other.

In the most trivial case, sure, the JavaScript runtimes won’t compile to use SIMD, but there are lots of cases where they will, as part of their JIT. I think you’re trivializing how they work.

And back to the main point, it doesn’t really matter if you believe Cinebench reflects real world rendering. The fact is that R23 uses SIMD for x86, but not for ARM. R24 rectifies that. Both R23 and R24 use the same rendering code path regardless of running in single or multi threaded mode.

So using R23 as benchmarks for efficiency and performance will naturally benefit x86 significantly. There’s a reason none of the people who push the “AMD is almost the same” use the fairer benchmark to do so. R24 really highlights the actual discrepancy when both are given a fair playing field.


> The fact is that R23 uses SIMD for x86, but not for ARM. R24 rectifies that. Both R23 and R24 use the same rendering code path regardless of running in single or multi threaded mode.

But that's not the issue. Even if Cinebench R23 isn't a valid comparison, Zen5 is also faster for Cinebench R24 MT.

Cinebench ST (R24 or R23) turns out to be a silly benchmark, because nobody in real life is going to artificially limit their renderer to one thread, but a renderer limited to one thread is also a bad proxy for real single-threaded workloads.

What it's mostly telling you is how wide the CPU is. Which only matters for real single-threaded code if the CPU can find enough instruction-level parallelism to exploit (which that test doesn't probe), and only matters for multi-threaded code to the extent that the processor can maintain that instruction density without becoming limited by thermals/cache/memory/etc. when the code is running on all the cores, which is the thing the MT benchmark tests.


Again, I think your logic here is flawed. It’s still valuable to know how a single core behaves for a rendering work load.

I’ve worked professionally in feature film production. We test single threaded performance all the time.

It tells us the performance characteristics of the nodes on our farm, and lets us more accurately get a picture of how jobs will scale and/or be packed.

It’s not uncommon to have a single thread rendering, for example to update part of an image while an artist works on other parts.

It’s not just a test of how wide the chip is. It also tests things like how it actually handles the various instruction sets from a real world codebase. Not all processors that are equally “wide” (not the right term but whatever) handle AVX the same, and you need to know how a single core behaves for that. It’s also useful to see how the cores actually behave on their own so you can eliminate the overhead of thread synchronization and system scheduling affecting you.


> An M3 Max will often match a 14900KS in performance using 5-10% of the power.

On Cinebench and Passmark CPU the 14900K is 50-60% faster so I'm not sure that's true.


The 14900KS and Ryzen 7950X can basically turbo way over 253w as long as the chip stays cool. Both chips have two (or more?) eco modes. It’s actually super silly, because the 125w mode is most often good enough.


> 1. ARM is inherently more efficient than x86 CPUs in most tasks

I'm not sure how you're reaching the conclusion of "most tasks" when Cinebench R24 is the only test you used because R23, which doesn't agree, was rejected for hand-wavey nebulous reasons, and nothing else was tested.

R24 is hardly a representative workload of "most tasks" nor is it claiming/trying to be.


Anandtech shows[0] that M3 is massively ahead in integer performance, but slightly behind in float performance on Spec 2017.

Integer workloads are by far the most common, but they tend to not scale to multiple cores very well. Most workloads that scale well across cores also benefit from big FP/SIMD units too.

Put another way, the real issue with R24 is that it makes HX370 look better than it would look in more normal consumer workloads.

[0] https://www.anandtech.com/show/21485/the-amd-ryzen-ai-hx-370...


> that M3 is massively ahead in integer performance

The M3 is certainly an impressive chip, but note that it's only massively ahead in some of the int tests. It's not a consistent gap.

> Integer workloads are by far the most common, but they tend to not scale to multiple cores very well.

The HX370 does better than the M3 in specint MT though.

But regardless the anandtech results paint a much closer picture than the single R24 results that GP used as the basis of the efficiency thesis.


The HX370 should win in SPECINT MT. It has 12 cores to the M3's 8 cores and it runs at significantly higher power.

Compare HX370 SPECINT MT to an M3 Pro and let's see the results.


> [HX370] runs at significantly higher power.

It used 33w. Meanwhile the M3 result came from a 2023 MacBook Pro 14-Inch, which certainly has the potential for a TDP of around that. If you can find SPECINT MT numbers w/ power data for an M3 Pro, let's see it. Or even just power data for an M3 non-pro in the 14" MBP. A quick search isn't turning up any.


M3's CPU uses around ~10w while running ST tasks.

The ~30w TDP of an M3 Pro is if it's running CPU and GPU tasks at max load.


>I'm not sure how you're reaching the conclusion of "most tasks" when Cinebench R24 is the only test you used because R23, which doesn't agree, was rejected for hand-wavey nebulous reasons, and nothing else was tested.

There are no hand-wavey nebulous reasons.

Cinebench R23 uses Intel Embree engine, which is hand optimized for x86 CPUs. That's why x86 CPUs look far better than ARM CPUs in it.

If there is an application that is purely hand optimized for ARM, and then compiled for x86, do you think it's fair to use it to compare the two architectures?

SPEC & GB6 mostly agree with Cinebench 2024.


The Snapdragon X Elite is on the same node and when actually doing a lot of work (i.e. cores loaded) it is close enough to HX 370 while delivering similar throughput.

Why wouldn't the inherent inefficiency of x64 be as noticeable in MT when all the inefficient cores are working? Because it is running at lower clocks? Then what allows it to match the SDXE in throughput? Does that need to lower its clock even more? I'm not seeing what makes it inherent.


From your link - Intel is topping the performance charts (alongside AMD in SC) - they probably tune power usage aggressively to achieve these results.

I would guess it's more to do with coming from desktop CPU design to mobile vs. phones to laptops.


>From your link - Intel is topping the performance charts (alongside AMD in SC) - they probably tune power usage aggressively to achieve these results.

Cinebench 2024 ST:

* M3: 142 points

* X Elite: 123 points

* AMD HX 370: 116 points

* AMD 8845HS: 102 points

* Intel 155H: 108 points

Amongst each company's best laptop ST SoCs, no, Intel and AMD are far behind in both ST scores and perf/watt.

If you're referring to desktop speeds, then yes, Intel's 14900k does top the charts in ST Cinebench but it likely uses well over 100w.

I mostly care about laptop SoCs. In the case of the M3, it doesn’t even have a fan.
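Dividing these ST scores by the ST points/watt figures quoted earlier in the thread gives a rough implied power draw per chip. This sketch assumes both sets of numbers come from the same Notebookcheck runs, which I have not verified:

```python
# Cinebench 2024 ST scores listed in this comment.
scores = {"M3": 142, "X Elite": 123, "HX 370": 116, "8845HS": 102, "155H": 108}
# ST points/watt quoted earlier in the thread.
points_per_watt = {"M3": 12.7, "X Elite": 9.3, "HX 370": 3.74, "8845HS": 3.1, "155H": 3.1}

# score / (score per watt) = watts
implied_watts = {chip: scores[chip] / points_per_watt[chip] for chip in scores}

for chip, watts in implied_watts.items():
    speed = scores[chip] / scores["HX 370"]
    print(f"{chip:8s} {speed:.2f}x HX 370 speed at ~{watts:.1f} W")
```

On those inputs the two Arm chips land in the low teens of watts while the three x86 chips land above 30 W for a single-threaded run.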


That's what I'm thinking - they make trade-offs to reach peak performance in desktop designs that don't translate optimally to laptops and when you start from mobile designs you probably made the opposite trade-offs - that would be my guess for the discrepancy.


Laptops vastly outsell desktops, so this tradeoff means hurting the majority of your customers to please a small minority. Servers also care about perf/watt a LOT and they are the highest profit margin segment.

Why would AMD choose a target that hurts the majority of their market unless there wasn't another good option available?


The architecture started in desktop space and data center/mobile was an afterthought up until Intel shitting the bed repeatedly. If they redesigned from ground up they could probably get better instructions/watt but that would look terrible if it wasn't accompanied by a perf boost over previous generation. Just like Apple doesn't seem to scale well with more power.


> Just like Apple doesn't seem to scale well with more power.

How do you know this? When has Apple ever given 400w to the M3 Max like the 14900KS can get up to?

PS. An M4 running at 7w is faster in ST than a 14900KS running at 250w+.


I'm pretty sure that M3 Max closely matches the 14900k in ST speeds but using something like 5 - 10% of the power.


Not sure - they had power/thermal envelope in desktop parts and no difference in performance AFAIK.


In the same article, if you picked their R23 benchmarks the advantage vanishes in MT benchmarks; the 3nm M3 Max is actually behind 4nm Strix Point in efficiency.


Cinebench R23 is hand optimized for X86. It uses Intel Embree engine underneath.

That's why Cinebench R23 heavily favors x86.


Why are you comparing a chip optimized for performance with a chip optimized for efficiency? Take an Ultrabook Zen5, not the HX370.


Switching from individual schedulers to unified one for integer execution makes sense to me, but I still don’t quite understand why FP execution units do the opposite, could somebody explain why?


In this interview Mike Clark from AMD explains that a little, search for "third scheduler":

https://chipsandcheese.com/2024/07/15/a-video-interview-with...

The way I understand it, it's a combination of a unified scheduler for floating point being difficult to implement because many FP instructions need multiple cycles, and FP code being more regular in practice, so you don't need the scheduler to be as powerful.


Do all new processors include NPUs now? What if I don't need AI, should I still pay for the unneeded transistors? If only things could be made modular/reconfigurable.


I wonder if the NPUs can help other parts of the computer offload some tasks anyway?


You still pay for unused hw when you do need AI since it's incredibly unlikely your software can use it.


Are there any mini PCs with Zen 5?


I think there's a lot of hope for a Strix Halo mini-PC.

We currently have 7840HS+6650M with two massive heatsinks just barely coming in at what you would consider a large mini-PC rather than a SFFPC.

Just one chip cuts the heatsink demand in half. The cores should be faster and it moves from 28CU up to 40CU and from RDNA2 to RDNA3.5. As long as the cross-core-complex latency isn't as bad as the HX370, I think it could be a real winner for a long time as it's basically an upgraded PS5 that runs Linux/Windows.


Hell I'd settle for a Z1 SBC or even mini PC. They can't seem to keep up with demand for these new chips to put them into any sort of products that don't have complete mass appeal. It's impossible to find one that's not in a handheld gaming console. I doubt they'll even make enough of the Halo to cover laptops.


Z1 is basically the same as 8840U which you can find.


"AMD Strix Point expected to debut in October, claims AOOSTAR" was a recent headline I saw. Seems about right that laptops (higher margin parts) land a few months before the SFFs.

I'm excited about the Strix Halo that has more cores, double the GPU, and double the memory bandwidth (to keep the GPU fed).


You could probably put a 9700X in a MS-A1. Besides that it will take a few months.


There seem to be some on hawk point but not many. I'd like to replace a PN50 (4800u system) from years ago but am attached to it being small enough to fit in the cable tidy under the desk - the 4" form factor seems to have grown a little over time.


What does "strix point" mean?


It's AMD's code/product name for their mobile CPUs with the Zen5 microarchitecture.

The last AMD laptop code name was "Hawk Point" - strix is a mythological bird. Who knows if AMD will keep with this naming scheme.


>Read bandwidth from a single cluster caps out at just under 62 GB/s. The memory controller has a bit more bandwidth on tap, but you’ll need to load cores from both clusters to get it.

Except for DDR5-7500 it isn't just "a bit more"; it is actually double, at 120GB/s. This might pose a challenge for LLM inference, which absolutely needs the full 120GB/s.
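Where the 120GB/s comes from, and why it matters for LLM decoding, as a rough sketch. The 128-bit bus width and the 8 GB model size are my assumptions for illustration, and the tokens/s figure is only a ceiling that assumes every generated token streams all the weights from memory once:

```python
def peak_bandwidth_gbs(transfer_rate_mts: float, bus_width_bits: int) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (decimal units)."""
    return transfer_rate_mts * 1e6 * (bus_width_bits / 8) / 1e9


def max_tokens_per_s(bandwidth_gbs: float, model_gb: float) -> float:
    """Upper bound on decode speed if each token reads all weights once."""
    return bandwidth_gbs / model_gb


full = peak_bandwidth_gbs(7500, 128)  # 128-bit bus is an assumption
single_cluster = 62.0                 # GB/s, single-cluster figure quoted above
model_gb = 8.0                        # hypothetical model size in memory

print(f"peak: {full:.0f} GB/s")
print(f"tokens/s bound, full bus:       {max_tokens_per_s(full, model_gb):.1f}")
print(f"tokens/s bound, single cluster: {max_tokens_per_s(single_cluster, model_gb):.1f}")
```

Whatever the real model size is, the bound scales linearly with bandwidth, so being stuck at the single-cluster figure roughly halves the ceiling.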


Speaking of Zen 5, are there any rumors on when 128 core turin-x will ship?


[flagged]


[flagged]


What's insane? The claim that recall is spyware isn't that bad of an exaggeration, and the claim that the NPU is hype is a reasonable opinion to have.

If that 7% of the die goes unused it's not a huge waste but it's not very good either. If you're not using it enough to affect your battery then you might not get much benefit, because the GPU can do the same tasks about half as fast, it's just less efficient (and it would be a lot more than half if the NPU die space was converted into GPU compute blocks). And for serious tasks it's not very powerful. There's a range where it's good, but not as big a range as people might expect.


If my speculation about block FP16 is correct, it might be possible to hit 24 to 48 teraflops on the NPU. This means it would be entirely memory bottlenecked even in prompt processing. There would basically be no application where you would run into the NPU being a limitation. What I fear though is that the NPU will be gimped with a single infinity fabric port, which would limit it to a relatively weak 62GB/s out of 120GB/s. For crazy people who want to use 128k token context, that might turn out to make it or break it, in the end.
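A roofline-style sanity check on that claim, taking the speculative 24 to 48 teraflops and the two bandwidth figures at face value. It computes how many FLOPs the NPU would need to perform per byte fetched before compute, rather than memory, becomes the limit:

```python
def flops_per_byte_to_saturate(tflops: float, bandwidth_gbs: float) -> float:
    """Arithmetic intensity (FLOPs per byte moved) where compute becomes the limit."""
    return (tflops * 1e12) / (bandwidth_gbs * 1e9)


for tflops in (24, 48):      # speculative NPU throughput from this comment
    for bw in (62, 120):     # GB/s: one fabric port vs. full memory bandwidth
        ai = flops_per_byte_to_saturate(tflops, bw)
        print(f"{tflops} TFLOPS @ {bw} GB/s -> {ai:.0f} FLOPs/byte")
```

Any workload that does fewer FLOPs per byte than those thresholds would be waiting on memory, and a single 62GB/s port roughly doubles the threshold.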


Whoever makes something cheap with lots of memory bandwidth and parallel compute gets my vote :)

They somehow managed to put 16GB of HBM2 with a 4096 bit bus on the Radeon VII for $700, it's not like it's impossible. Or at least enough memory channels to rival Apple's M series in some meaningful way.

Besides, what do you expect the average person will be able to run on these Coral tier NPUs? They won't be running Yolov9 all day and all these things can do in practice is accelerate really tiny convnets.



