Right, its easy to name a half dozen companies that designed their own cores tha...

celrod · on June 25, 2023

The ARM cortex x4 will have 10-wide dispatch: https://www.anandtech.com/show/18871/arm-unveils-armv92-mobi... It'll probably be roughly the same chip as one of their upcoming V chips.

The specs there sound seriously impressive to me, and like they might be getting ready to leave AMD and Intel behind in terms of IPC (for their highest performing chips).

ARM's clock speeds are much lower, so single core performance will probably be worse. But I'd guess server clock speeds may be similar.

_a_a_a_ · on June 25, 2023

Having an N-wide dispatcher means nothing unless the software can use it, and server clock speeds tend to be lower than desktops.

Disclaimer: I don't know what I'm talking about

celrod · on June 25, 2023

Given that the M1 and M2 perform similarly to AMD and Intel CPUs with far higher clock speeds, it seems most software can use wider dispatch.

Note that dispatch doesn't mean vector width, which is harder for software to take advantage of. It means how many uops the pipeline can handle/clock cycle.

_a_a_a_ · on June 25, 2023

A basic block is usually taken to be about 6 instructions. I suppose if you take one or two speculated branches as well then you might easily get to your 10 dispatches possible. Perhaps.

As for higher clock speeds, there's a whole lot more that matters such as pipelining instructions, cache sizes, and any number of other things. Clock speed by itself isn't particularly revealing.

celrod · on June 26, 2023

FWIW, IIRC my Skylake-X CPU normally has around 2 instructions per clock when I run perf. It has a pipeline width of 4 uops internally. So it's falling far short of a typical basic block/clock cycle.

I would also not expect the X4 to utilize that full width in any real workload, but it only needs a fraction of its full width to get more IPC than the X64 competition.

But I'd expect it is a reasonably balanced chip (why waste silicon?), and thus the wide pipeline is an indicator of the chip itself being wide with immense out of order capability.

Branch prediction rates tend to be extremely high. The X4's frontend also has 10 frontend pipeline stages. Which means ideally, it'd be correctly predicting all branches at least 10 cycles into the future, so that on clock cycle `N-10`, the frontend can get started on the correct instructions that'll be needed on clock cycle `N`. The difference between 1 basic block/cycle and >1 basic block/cycle is really small; it already needs a long history of successful predictions to get 1. But of course, each mispredict is extremely costly.

As for bringing up clock speeds and the M1, my point there was that the M1 has already left Intel and AMD behind in terms of IPC; it achieves similar performance despite much lower clock speeds. My original comment said that the ARM Cortex X4 looks like it is starting to leave Intel and AMD behind in terms of IPC, and I used the width as an indicator. You responded saying that the software has to actually allow for this. Yet the M1 example shows that existing software does in fact allow for significantly more out of order execution than Intel and AMD CPUs achieve.

So you could argue that, unlike the M1, the Cortex X4 will not be able to realize such an advantage. While plausible, if it does fail to do so, we at least won't be able to blame the software, because the M1 is able to do so despite the software. It'd have to be some deficiency of the X4 relative to the M1 -- such as cache sizes, memory bandwidth... Hopefully it does turn out to be a great chip! But that remains to be seen.

_a_a_a_ · on June 26, 2023

I can't imagine getting every instruction in a basic block started at every clock. There is almost certainly dependencies within the block. I talked about basic blocks because that would mean that if you want to kick-off more instructions than in the block, you'd have to speculate about the branch taken. And I am sure can start executing instructions down are speculated jump, only I don't know how far. I also don't know if you can speculate past a write to RAM. I'd like to know.

> Yet the M1 example shows that existing software does in fact allow for significantly more out of order execution than Intel and AMD CPUs achieve

My point is, shortening the pipeline needed to execute an instruction would also get higher performance. Perhaps they invented a better cache, perhaps larger, perhaps more associative, perhaps…? There's more than one way of increasing performance besides IPC and clock.

celrod · on June 26, 2023

I'd say the ways to increase performance are 1. decrease the number of instructions needed (has to be done in software, but also dependent on ISA, e.g. using AVX512 can help a lot here, so long as you don't end up executing more scalar epilogue iterations). 2. increase IPC (obviously software can help a lot here) 3. increase clocks (not much software can do here; wider instructions are generally worth it, so if choosing between "1." and "3." in software, it's generally better to favor "1." (especially on more recent CPUs that don't have downclocking problems).

Design of the CPU can also influence all three of these. Things like better cache, better branch predictors and shorter pipelines, will all help IPC.

> My point is, shortening the pipeline needed to execute an instruction would also get higher performance.

This wouldn't increase throughput if 100% of branches are predicted correctly -- except for the the extra cycles before instructions start executing. It'd decrease branch mispredict penalties though, which is a big deal and would help in practice. The Cortex X4 did shave off a frontend pipeline stage relative to the Cortex X3 (11 -> 10). This is better than Intel Alder Lake. One contributor is probably that it is easier to decode ARM instructions in parallel without needing mutliple pipeline stages (e.g., one to find out where variable width instructions end before the instruction byte stream can be sent to decoders [not an issue if instructions are already in the uop cache]).

> There's more than one way of increasing performance besides IPC and clock.

I assume by "IPC" here you mean dispatch width? Things like a better cache for fewer misses, better prefetching, better branch prediction, larger reorder buffers so that it can speculate further ahead before stalling, all help IPC.

Zen1 CPUs (6 uops) were already wider than Intel Skylake (4 uops) and Ice/Tiger lake (5 uops), matching Alder Lake (6 uops). But they were obviously far behind in IPC (and Zen1 in particular also decoded AVX2 instructions into 2 uops).

Zen1 has SMT, which was part of the reason to go wide early on: the frontend wasn't good enough to feed that width with a single thread, but using two threads could mitigate that. Early on, Zen1 (and the Zen family) generally did better in multithreaded than single threaded benchmarks thanks to that approach.

The ARM Cortex X4 doesn't have SMT, so it's taking a different approach to performance.

A single number isn't going to be representative of performance across benchmarks or all the tasks you're interested in.

Unfortunately, I think it'll be more than a year before we can see the Cortex X4 (as it's aiming at TSMC N3E), but I'm definitely looking forward to deep dives into it's performance (and also that of Intel's Meteor Lake, Zen5, etc).

tracker1 · on June 26, 2023

Even then... in what ways are they actually faster, it seems that a lot of the custom (non-arm) custom processing units are what gives them the boost in a few cases, and compared to NVidia and AMD are still a bit behind there. There are a lot of ways to approach this, and in the Apple case, where they're really winning is the power utilization in terms of performance/watt.

wmf · on June 25, 2023

Ampere is on their fourth generation custom core but none of the earlier X-Gene cores were any good so I don't know if they're learning from their mistakes.