
> is they promoted the myth that ISA has no impact on performance

IMO, Intel (and AMD) did prove the impact of a legacy ISA was low enough to not be a competitive disadvantage. Not zero, but close enough for high-performance designs.

In fact, I actually think the need to continue supporting the legacy x86 ISA was a massive advantage to Intel. It forced them to go down the path of massively out-of-order μarches at a point in history where everyone else was reaping massive gains from following the RISC design philosophy.

If abstracting away the legacy ISA was all the massive out-of-order buffers did, then they would be considered nothing more than overhead. But the out-of-order μarch also had a secondary benefit of hiding memory latency, which was starting to become a massive issue at this point in history. The performance gains from this latency hiding far outweighed the losses from translating x86 instructions, allowing Intel/AMD x86 cores to dominate the server, workstation and consumer computing markets in the late 90s and 2000s, killing off almost every competing RISC design (including Intel's own Itanium).

RISC designs only really held onto the low power markets (PDAs, cellphones), where simplicity and low power consumption still dominated the considerations.

------------------

What Intel might have missed is that x86 didn't hold a monopoly on massively out-of-order μarch. There was no reason you couldn't make a massively out-of-order μarch for a RISC ISA too.

And that's what eventually happened, starting in the mid-2010s. We started seeing ARM μarchs (especially from Apple) that looked suspiciously like Intel's and AMD's designs, just with much simpler frontends. They could get the best of both worlds, taking advantage of simpler instruction decoding while still getting the advantages of being massively out-of-order.

------------------

You are right about Intel's arrogance, especially assuming they could keep a process lead. But the "x86 tax" really isn't that high. It's worth noting that one of the CPUs they are losing ground to is also x86.



> In fact, I actually think the need to continue supporting the legacy x86 ISA was a massive advantage to Intel.

I think this is a myth that Intel (or somebody else) has invented in an attempt to save face. Legacy x86 instructions could have been culled from the silicon and implemented in software via emulation traps – this has been done elsewhere nearly since the first revised CPU design came out. Since CPUs have been getting faster and faster, and legacy instructions have been used less and less, the emulation overhead would have been negligible to the point that no one would even notice it.
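The trap-and-emulate mechanism described above can be sketched as a toy model. Everything here is invented for illustration (the opcode names, the register file, the "machine" itself); real hardware would raise an illegal-instruction fault (e.g. #UD on x86) and the OS or firmware would emulate the culled instruction in software before resuming:

```python
# Fast hardware path: only the modern subset is "implemented in silicon".
HARDWARE_OPS = {
    "add": lambda regs, a, b, d: regs.__setitem__(d, regs[a] + regs[b]),
    "mov": lambda regs, a, b, d: regs.__setitem__(d, regs[a]),
}

# Slow software path: culled legacy instructions handled by a trap handler.
LEGACY_EMULATION = {
    # a made-up BCD-adjust-style legacy instruction, emulated in software
    "daa_like": lambda regs, a, b, d: regs.__setitem__(d, regs[a] % 100),
}

def execute(program, regs):
    """Run (op, src_a, src_b, dst) tuples against a register dict."""
    for op, a, b, d in program:
        if op in HARDWARE_OPS:
            HARDWARE_OPS[op](regs, a, b, d)      # native execution
        elif op in LEGACY_EMULATION:
            LEGACY_EMULATION[op](regs, a, b, d)  # trap -> emulate -> resume
        else:
            raise RuntimeError(f"illegal instruction {op}")
    return regs
```

Since the legacy path only fires for rarely used instructions, its per-trap cost is amortised to near zero – which is the argument being made here.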


They tried this with Itanium and got beaten up constantly about how legacy performance was bad. Personally, I agree with you, but the market isn't rational. This paved the way for AMD to eat some of their lunch by making a "compatible" 64 bit ISA. Itanium could've been great for the kinds of workloads I was interested in at the time.


We aren't talking about legacy instructions. Those have all been culled and replaced by microcode (in fact, most of them were always microcode from the very first 8086; they just never got non-microcoded versions through the 286, 386, 486, Pentium era).

We are talking about how the whole ISA is legacy. How the basic structure of the encoding is complex and hard to decode. How the newer instructions get longer encodings. Or how things which can be done with a single instruction on some RISC ISAs take 3 or 4 instructions on x86.


I was actually commenting on the need to support aspect being a myth.

x86 is not just a legacy, it is a legacy of legacies, as the x86 ISA traces all the way back to the 8008 via the 8080 – at least as a spiritual predecessor, even if it can't directly execute 8008 binary code.

Intel also had their own, indigenous RISC design – the i960, which was a very good design. At some point – if I am not mistaken – Intel contemplated phasing out the x86 ISA and replacing it with the i960, but there was a change of plans and they went all in with the 80486 CPU. The i960 remained around in embedded and defence applications.

Intel also had their own hybrid VLIW/RISC design, the i860, which preceded Itanium and which they did not know what to do with. Similarly, they faced the same issue of compilers of the day not being able to produce fast code for it.


> if I am not mistaken – Intel contemplated phasing out the x86 ISA and replacing it with i960, but there was a change of plans,

I don't think there was ever any serious thought about replacing x86 with the i960 (at least nothing publicly). There was a serious plan to replace x86 with the iAPX 432, which is the predecessor to the i960, but those plans all predated x86 becoming a runaway success when the IBM PC became an industry standard. And "replace" is kind of overstating it; there was no plan for any compatibility, not even source compatibility. It's more that they were planning for the iAPX 432 to take the "workstation CPU" spot on their product chart that was then occupied by the 8086.

By the time the i960 was in development, x86 was so entrenched that I really doubt there could have been any serious thoughts of replacing x86 with something that wasn't fully backwards compatible.

And we know that when Intel did try to replace x86 with Itanium, they went with a hardware backwards compatibility mode.

> I was actually commenting on the need to support aspect being a myth.

Yes, you have a point that it should have been possible to replace x86 with a software emulation approach.

But the only person who can really do that is the platform owner. Apple were quite successful with their 68k to PowerPC transition. And the PowerPC to x86 transition. And the x86 to Aarch64 transition. But that transition really needs to be done by the platform owner.

But the PC didn't really have an owner. IBM had lost control of it. You could argue that Microsoft had control, but they didn't have enough (especially with DOS: most DOS programs bypassed DOS to some extent or another and accessed hardware directly). Intel certainly didn't have enough control to transition to another arch.

(The PC had such high demands for backwards compatibility that even the Pentium Pro ran into issues. It worked, but it simply wasn't fast enough when executing instructions with 16-bit operands. Attempting to run DOS or Win95 apps would be slower than on a 486. So the Pentium Pro was limited to the market of Windows NT workstations running apps with fully 32-bit code. Intel had to fix this with the Pentium II before they could sell the P6 arch outside of the workstation market.)

Intel didn't even have control over x86 itself. Other companies were already making competing CPU designs that were faster than Intel's own. If Intel didn't keep releasing faster x86 designs, then someone else would steal all of their market share. Intel were more or less forced to keep releasing faster native x86 designs, or they would lose what little control they did have.


I can't readily find the original reference where I read it, but one source[0] does allude that it was a real possibility:

«At the time, the 386 team felt that they were treated as the "stepchild" while the P7 [80960] project was the focus of Intel's attention. This would change as the sales of x86-based personal computers climbed and money poured into Intel. The 386 team would soon transform from stepchild to king».

And, yes, the histories of iAPX 432 and 80960 are so closely intertwined, that in many ways the 960 can be considered a design successor of the 432.

> But the only person who can really do that is the platform owner. Apple were quite successful with their 68k to PowerPC transition. And the PowerPC to x86 transition. And the x86 to Aarch64 transition. But that transition really needs to be done by the platform owner.

I wholeheartedly and vehemently agree with you on this – full platform ownership and control of the entire vertical is key to being able to successfully execute an ISA transition. Another success story is, of course, IBM with iSeries (née AS/400) and zSeries (née 360/370/390), albeit their approach is rather different.

[0] https://www.righto.com/2023/07/the-complex-history-of-intel-...


> but one source[0] does allude that it was a real possibility

That doesn't really suggest an intention to replace. To me that seems more of a hope that x86 would fade into irrelevance on its own, beaten down by superior RISC ISAs.

--------------

It is interesting to consider what a transition away from x86 would have looked like.

I think the best chance would have been something led by Microsoft in the early 90s. The 386 version of Windows 3.0 was already virtualising both DOS and Win16 code into their own isolated VMs. If you added a translation layer for 16-bit x86 code to those VMs, then you could probably port Windows to any host CPU arch.

I think we are talking about a world where 486-class CPUs never arrived, or they performed horribly and the Pentium was canceled.

But it's a small window. In 1990, it was very rare to see 32-bit x86 code. 32-bit DOS extenders were only just starting to be a thing. Windows didn't support a 32-bit userspace until 1993. The main 32-bit code anyone was running in 1990 was the Windows 3.0 kernel itself. By 1992, it was common for DOS games to use DOS extenders, and the transition would have required a 32-bit x86 translation layer too.

These RISC PC compatibles would have lost the ability to boot directly into real-mode DOS, but would have run DOS just fine inside a Windows DOS VM.

It should have been possible to get good hardware compatibility too. Windows 3.0 could already run DOS drivers inside a DOS VM, and adding CPU translation shouldn't have caused issues. With motherboard support, it should have been possible to support most existing ISA/EISA/VLB cards.


> IMO, Intel (and AMD) did prove the impact of a legacy ISA was low enough to not be a competitive disadvantage. Not zero, but close enough for high-performance designs.

And Apple proved that in fact it was a significant problem once you factored in performance per watt, allowing them to completely spank AMD and Intel once those hit a thermal limit. There’s a benefit to being able to decode and dispatch multiple instructions in parallel vs having to emulate that by heuristically guessing at instruction boundaries and backtracking when you make a mistake (among other things).


> having to emulate that through heuristically guessing at instruction boundaries and backtrack when you make a mistake

Intel/AMD don't use heuristics-based decoding, or backtracking. They can decode 4 instructions in a single cycle. They implement this by starting a pre-decode at every single byte offset (within 16 bytes) and then resolving it to actual instructions at the end of the cycle.

The actual decode is then done the following cycle, but the pre-decoder has already moved up to 4 instructions forwards, so the whole pipelined decoder can maintain 4 instructions per cycle on some code.
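The pre-decode scheme described above can be modelled in a few lines. This is a toy sketch, not real x86: the instruction lengths, window size, and "length encoded in the first byte" rule are all invented for illustration. The point is the shape of the algorithm – speculatively length-decode at every byte offset in parallel, then chain from the known start offset to pick the real boundaries:

```python
WINDOW = 16        # bytes pre-decoded per cycle (illustrative)
MAX_PICK = 4       # instructions resolved per cycle (illustrative)

def toy_length(byte):
    """Pretend the instruction length (1-4 bytes) is in the low bits."""
    return (byte & 0x3) + 1

def predecode(window_bytes, start=0):
    # Speculative step: length-decode at EVERY offset. In hardware these
    # all happen in parallel; most results are thrown away.
    lengths = [toy_length(b) for b in window_bytes[:WINDOW]]
    # Resolve step: walk the chain from the one known-good start offset.
    boundaries, pos = [], start
    while pos < len(lengths) and len(boundaries) < MAX_PICK:
        boundaries.append(pos)
        pos += lengths[pos]
    return boundaries
```

The speculative step is trivially parallel; it's the resolve step (the chained walk) that limits how wide this can scale, which is the point made below about propagation delays.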

This pre-decode approach does have limits. Due to propagation delays, 4 instructions over 16 bytes is probably the realistic limit that you can push it (while Apple can easily do 8 instructions over 32 bytes). Intel's Golden Cove did finally push it to 6 instructions over 32 bytes, but I'm not sure that's worth it.

Intel's Skymont shows the way forward. It only uses 3-wide decoders, but it has three of them running in parallel, leapfrogging over each other. They use the branch predictor to start each decoder at a future instruction boundary (inserting dummy branches to break up large branchless blocks). Skymont can maintain 9 instructions per cycle, which is more than the 8-wide Apple is currently using. And unlike the previous "parallel pre-decode in a single cycle" approach, this one is scalable: nothing stops Intel adding a fourth decoder for 12 instructions per cycle, or a fifth for 15. AMD is showing signs of going down the same path; Zen 5 has two 4-wide decoders, though they can't work on the same thread yet.
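The clustered-decode idea above can be sketched abstractly. All parameters and names here are invented for illustration; the essence is that the branch predictor hands each narrow decode cluster a different predicted instruction boundary, so the clusters work on different parts of the stream in the same cycle:

```python
CLUSTER_WIDTH = 3   # each cluster is only 3-wide (Skymont-style)

def clustered_decode(instr_stream, start_points):
    """One cycle of clustered decode: each cluster decodes up to
    CLUSTER_WIDTH instructions from its predicted start point.
    Sequentially here; in hardware the clusters run in parallel."""
    decoded = []
    for start in start_points:
        decoded.extend(instr_stream[start:start + CLUSTER_WIDTH])
    return decoded
```

With three clusters started at boundaries 0, 3 and 6, you get 9 instructions per cycle out of decoders that are individually only 3-wide – and adding a fourth start point scales it to 12 without widening any single cluster.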


You're arguing semantics IMHO. In my mind, speculatively decoding every single byte offset and then resolving at the end of the cycle which to take is a form of heuristic execution, because the heuristic is "decode all possible executions". And 4 vs 8 is a pretty sizeable difference. Moreover, the pre-decoder needing to know the opcodes at the end means all instruction decodes are serialized on resolving 16 candidate offsets, whereas Apple can just decode each op independently and only decodes 8.


Oh, I see what you are saying. I don't consider it to be a heuristic because it's simply bruteforcing it. IMO a heuristic needs to improve over brute force.

> And 4 vs 8 is a pretty sizeable difference.

True, but x86 was doing four instructions 20 years ago. As I mentioned, the current state of the art (in a shipping product) is 9, and 9 is larger than 8. Importantly, this Skymont approach of leapfrogging decoders is scalable.

> whereas Apple can just decode each op independently & only decodes 8.

Apple isn't as free from serialisation as you suggest. Like x86, many instructions decode to multiple uops. According to research [1], instructions which decode to two uops are common, and a few decode to as many as 12 uops.

It also does instruction fusion: two neighbouring instructions can sometimes decode into a single uop. All this means there is plenty of serialisation within Apple's decoder. And branching also creates serialisation.

It's just not as simple as independently decoding eight instructions into eight uops every cycle. Simpler than what x86 implementations need to do, but not as brain-dead simple as you suggest.
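The uop-cracking and fusion effects above can be made concrete with a toy model. The instruction names, uop counts and fusible pair are all made up for illustration (loosely inspired by the cited research, not taken from it); the point is that the number of uops a decode group emits depends on its neighbours, so the eight decode slots are not fully independent:

```python
UOP_COUNT = {"add": 1, "ldp": 2, "complex_op": 12}   # uops per instruction
FUSIBLE = {("cmp", "b.eq")}                          # fused pair -> 1 uop

def uops_emitted(instrs):
    """Count uops for a decode group, with cracking and pair fusion."""
    total, i = 0, 0
    while i < len(instrs):
        if i + 1 < len(instrs) and (instrs[i], instrs[i + 1]) in FUSIBLE:
            total += 1      # two instructions fuse into a single uop
            i += 2
        else:
            total += UOP_COUNT.get(instrs[i], 1)  # default: 1 uop
            i += 1
    return total
```

Note how whether slot N fuses depends on slot N+1, and how a single 12-uop instruction can swamp the group – both are forms of serialisation inside an otherwise "simple" fixed-length decoder.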

Actually, Skymont's approach has an advantage over Apple here, because it only needs to serialise within each 3-wide decoder.

[1] https://dougallj.github.io/applecpu/firestorm.html


I don't think Apple's battery life wins are primarily from the ISA. I think they're largely from better targeting, process optimization and ecosystem control. Intel (and somewhat AMD) make most of their money in servers, where what matters is performance/watt in a 100% loaded system. They are also designing for JEDEC RAM and PCIe connectivity (and lots of other industry standards). Most of Apple's efficiency advantage comes at the edges: lowering max clock speed, integrating the RAM to save power, using custom SSDs where the controller is on the CPU, etc.


Reportedly, the decoder is only like 10-20% of the power budget of Ryzen CPUs, which rather contradicts the idea that the ISA is the main reason Ryzen's efficiency is worse than Apple's A and M cores.


I looked around and couldn't find anything about Ryzen decode power consumption. I'm only aware of one report on x86 decode power consumption, [1] and there are quite a few problems with trying to use it to justify that conclusion.

First, it covers Intel Haswell, which is not Ryzen – it's not even AMD. Plus, Haswell is 12 years old at this point; how much relevance does it even have to modern Intel CPUs?

Second, the "instruction decoders" power zone was only 10%, not 20%. And it still reported 3% on workloads that used very few instructions and always hit the uop cache, so really we are talking about a 7% overhead for decoding instructions. They do speculate that other workloads use more power (they only tested two workloads), as the theoretical instruction throughput might be double (which is where I suspect you got the 20% from), but they provide no evidence for that, and double the throughput doesn't mean double the power consumption. Even doubled, 7% on top of the 3% base would be 17% at most.

Third, Intel doesn't publish any details about what this "instruction decoder" zone actually covers. It's almost certainly more than just the "decoding x86" part. Given there are only four zones, I'm almost certain this zone covers the entire frontend, which includes branch prediction, instruction fetch and the stack engine. It might include register renaming too. Maybe instruction TLB lookups? I'm reasonably sure it includes the (dynamic) power cost of accessing the L1i cache too.

So this 7% power usage covers way more than just the decoding of x86 instructions. It's the entire frontend.

Finally, I haven't seen any power numbers for the frontend of an equivalent ARM processor (like Apple's M1). For all we know, they are also using 7% of their power budget to fetch ARM instructions from the L1 cache, decode them, do branch prediction, and do all the fancy frontend stuff. The 7% number isn't x86 overhead, as many people imply; it's just the cost of running Haswell's frontend.

Without anything else to compare to, this 7% number is worthless. It's certainly an interesting paper, I don't have any major criticisms, but it simply cannot be used to support (or disprove) any arguments about the overhead of x86 decoding.

[1] https://www.usenix.org/system/files/conference/cooldc16/cool...



