Ask HN: How did Apple manage to create such a better chip than Intel?
68 points by aarkay on Dec 11, 2020 | 81 comments
The recent Apple M1 chip is faster and more efficient than the equivalent Intel chips. Can someone with a better understanding of chip design explain to me:

1/ What specific design choices make the M1 so much better than the equivalent Intel chips? It looks like there are a bunch of changes -- 5nm silicon, a single memory pool, a combination of high-efficiency and high-power cores. Can someone explain to me how each of these changes helps Apple achieve the gains it did? Are these breakthrough architectural changes in chip design or have these been discussed before?

2/ How did Apple manage to create this when Intel has been making chips for decades and that is the singular focus of the company? Is it the fact that macOS could be better optimized for the M1 chips? Given the design changes, that doesn't seem like the only reason.




https://news.ycombinator.com/item?id=25257932

My 10,000' view understanding:

- Small feature size. M1 is a 5nm process. Intel is struggling to catch up to TSMC's 7nm process

- more efficient transistors. 7nm uses FinFET. M1 probably uses GAAFET, which means you can cram more active gate volume into less chip footprint. This means less heat and more speed

- layout. M1 is optimized for Apple's use case. General purpose chips do more things, so they need more real estate

- specialization. M1 offloads oodles of compute to hardware accelerators. No idea how their code interfaces with it, but I know lots of the demos involve easy-to-accelerate tasks like codecs and neural networks

- M1 has tons of cache, IIRC, something like 3x the typical amount

- some fancy stuff that allows them to really optimize the reorder buffer, decoder, and branch prediction, which also leverages all that cache


One small correction: TSMC 5nm still uses FinFET. Samsung's 3nm is being tested with GAAFET and is slated for 2021. TSMC's roadmap also has it, but for 2022, and Intel's for 2025.


The fancy stuff with the reorder buffer, decoder, and branch prediction is the most important thing. The M1 is just as general purpose as any Intel/AMD CPU, and indeed even more so because it's designed to scale from cell phones to desktops.

Specialization helps when it helps, but doesn't do much on typical programs and M1 still excels on those.


Can you explain what you mean by Apple's use case vs general purpose chips?


The M1 does one thing: run macOS and macOS apps. They can control the vast majority of the compiled code that will be run on the chip - unlike an x86 platform where the exact same architecture is used for desktops, servers and everything in between - including Linux, Windows, and Mac.

Specifically there is a reference counting optimization on the M1 that dramatically helps performance of compiled Swift apps - something only worthwhile if you know the majority of what the chip will ever do is run Swift apps.
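For anyone unfamiliar with what that optimization targets: ARC-style reference counting boils down to an atomic increment on every retain and an atomic decrement on every release, and compiled Swift/Objective-C code executes these constantly. A minimal C sketch (made-up names, nothing Apple-specific) of the operation a core speeds up by making uncontended atomics nearly free:

  #include <stdatomic.h>
  #include <stdlib.h>

  /* Toy object with an atomic reference count. */
  typedef struct { _Atomic long refcount; } obj_t;

  static inline void obj_retain(obj_t *o) {
      atomic_fetch_add_explicit(&o->refcount, 1, memory_order_relaxed);
  }

  static inline void obj_release(obj_t *o) {
      /* The last release (count drops from 1 to 0) frees the object. */
      if (atomic_fetch_sub_explicit(&o->refcount, 1, memory_order_acq_rel) == 1)
          free(o);
  }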


It also does that for, basically, one device (the three models available so far are almost identical: fan/no fan, and some binning on the GPU).

That’s one further reason their system is faster: they designed a system, and the design of their CPU, GPU, etc. was driven by what the final system needed.

Other system manufacturers buy individual parts, where the manufacturer of each part extends/optimizes it with only a vague knowledge of what the system it will be used in will look like (and they don’t want to focus on one specific system, as that would mean they can sell it to fewer device manufacturers)


> The M1 does one thing: run macOS and macOS apps. They can control the vast majority of the compiled code that will be run on the chip - unlike an x86 platform where the exact same architecture is used for desktops, servers and everything in between

The M1 is strongly related to their A14, which runs phones and tablets.

Also: what's between a server and a desktop machine?


Could there also be hardware acceleration or built-in support for Objective-C's message-passing? I've always wondered how Apple gets decent performance with Objective-C in spite of message-passing, given its complexity compared to vtables (and vtables have the advantage of being easily cacheable in L1/L2).


I haven't read it, but this article might address your performance question: https://www.mikeash.com/pyblog/friday-qa-2017-06-30-dissecti...


Typical Objective-C code is mostly C (and sometimes C++). It doesn’t use message-passing in the hot loops.
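For anyone curious what the difference actually looks like, here's a rough C sketch of the two dispatch styles (toy structures, not the real Objective-C runtime): a vtable call is one table load plus an indirect call, while a message send has to find the implementation by selector, which is why the runtime leans so heavily on its per-class method cache.

  #include <string.h>

  typedef void (*imp_t)(void *self);

  /* C++-style vtable dispatch: an index into a fixed table, then an indirect call. */
  struct vtable { imp_t draw; };
  struct widget { const struct vtable *vt; };
  static inline void widget_draw(struct widget *w) { w->vt->draw(w); }

  /* Message-send style: find the method by selector in a small per-class cache,
     falling back to a slower full lookup on a miss. */
  struct cache_entry { const char *sel; imp_t imp; };
  struct klass { struct cache_entry cache[8]; };
  struct object { struct klass *isa; };

  static inline imp_t lookup(struct object *o, const char *sel) {
      for (int i = 0; i < 8; i++)        /* toy linear scan; the real cache is hashed */
          if (o->isa->cache[i].sel && strcmp(o->isa->cache[i].sel, sel) == 0)
              return o->isa->cache[i].imp;
      return 0;                          /* miss: the real runtime walks the method lists */
  }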


> - Small feature size. M1 is a 5nm process. Intel is struggling to catch up to TSMC's 7nm process

See: https://news.ycombinator.com/item?id=25277124


> - specialization. M1 offloads oodles of compute to hardware accelerators. No idea how their code interfaces with it, but I know lots of the demos involve easy-to-accelerate tasks like codecs and neural networks

That wouldn't help in benchmarks, would it? Only in the most dishonest benchmarks would they compare x264 to a hardware h.264 encoder, for instance.


Maybe - but how do you know they are offloading to a hardware encoder?

Example: I run a benchmark for AES encryption - a modern CPU will have circuitry designed explicitly for this task and its own asm instructions. An old CPU just supporting the base x86 instructions probably doesn't have a hardware solution. Is it unfair to compare them?

If the utilisation of the hardware accelerators is completely opaque to the user (no importing of special libraries), is it unfair that one CPU has a specific hardware implementation for common tasks and one only has the generic circuitry?
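To make that concrete, a minimal sketch (assuming GCC or Clang on x86) of why the "same" AES benchmark measures very different things on two CPUs, completely transparently to the user:

  #include <stdio.h>

  int main(void) {
      __builtin_cpu_init();   /* GCC/Clang builtin, x86 only */
      if (__builtin_cpu_supports("aes"))
          puts("AES-NI present: an AES benchmark here measures dedicated circuitry");
      else
          puts("No AES-NI: the same benchmark measures a generic software implementation");
      return 0;
  }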


> I run a benchmark for AES encryption - a modern CPU will have circuitry designed explicitly for this task and its own asm instructions. An old CPU just supporting the base x86 instructions probably doesn't have a hardware solution. Is it unfair to compare them?

YES! Unless you're specifically searching for the fastest AES CPU.

If you want to compare general performance this benchmark is flawed. E.g.: I could have the fastest CPU in existence, but since it happens to be lacking hardware AES circuitry, your benchmark will always show another CPU as the 'fastest'.

It's not 'unfair' or whatever. It just means you need to think harder about your benchmark, what you want to measure, and what you're actually measuring. Or you need to adjust your conclusion: "this is the fastest CPU" -> "this CPU performs best on this specific task".


Thanks for the link to the old HN item. Missed it the first time. Seems like having the ability to control the entire SoC helped Apple make a lot of decisions specific to how they wanted to use the chips.


It's not just the M1's design, it's what they don't have to do: no need to support anything legacy. You can't change the x86 ISA to the point where it makes a huge difference because it would no longer run x86 code.

Intel can probably make faster stuff than they currently do but then their customers (PC manufacturers for instance) would have to modify all their stuff as well and they don't want to, or at least, won't want the same thing.


The overhead of translating x86 instructions to RISC-like microcode is known and is something like 1-2%. This is not the reason for the difference between Intel and Apple.


They fully support i386 via Rosetta 2 though.

The real explanation is Intel has been complacent and lazy. We had 5 generations of the same chip. Enough is enough.


What about AMD though? They are on par with the M1 but use a lot more power while doing so. I don't think it's an Intel problem, it's an x86 problem.


It's not an x86 problem. Apple has been lapping all of its ARM competitors for a while now too.


Nobody writes i386 code anymore though; Apple mostly supports x86-64 as it existed 10 years ago.


And i386 support was removed in macOS Mojave anyway.


macOS Catalina, actually.


> Intel can probably make faster stuff than they currently do but then their customers (PC manufacturers for instance) would have to modify all their stuff

This is a crucial point. It's a coordination problem.

Drivers only need to be made for those components that Apple chooses for their system, vs the multitude of combinations of GPU boards, drive systems, motherboard chips, Wi-Fi, etc. that exist for a modern PC.

There's no steering committee for PCs (at least none that Intel couldn't influence) that could cause this change to happen industry-wide. And there's been little appetite (yet) for Windows/Linux + ARM consumer PCs to help this happen from the bottom up.


> You can't change the x86 ISA to the point where it makes a huge difference because it would no longer run x86 code.

Couldn't they remove obsolete instructions from the ISA & then emulate removed instructions? Sure, it would be slower than having them still in the ISA, but given software using them was written for older machines, it might wash out in the end.


I respectfully disagree. Why can't Intel/AMD make a new flavor of chips & motherboards that explicitly doesn't support the x86 ISA? Wouldn't that address the problem?


Apple has an advantage here because they have both an integrated hardware and software platform they can control. If Intel tried the same, they'd be relying on Microsoft to provide a translation layer for their chips, and that'd be unlikely unless they paid up and Microsoft dared to stray from its traditional course.


Intel tried that with Itanium.

Didn’t work out so well


Unfortunately Jon Stokes at Ars Technica and David Kanter at RWT have mostly stopped doing CPU design articles. The AnandTech one is the best I’ve seen: https://www.anandtech.com/show/16226/apple-silicon-m1-a14-de...


Lots of other comments point out the vertical integration.

For raw single-thread performance:

1. ARM64 is a fixed-width instruction set, so their frontend can decode more instructions in parallel.

2. They've got one honking monster of an out-of-order execution engine (a ~630-entry reorder buffer), which feeds:

3. 16 execution ports.


I don't fully grasp assembly, instruction sets, and how CPUs work so pardon the silly questions.

I think I understand 1): since they know the width, they can more easily divide the instructions among more parallel executors (whatever they are - the execution ports?)

2) I believe this allows more "pre-work" to get done before it's actually needed, but then the "pre-work" just chills until

3) these things do the work, and there's an abnormally high number of them?

p.s. Any noob friendly reading is also appreciated!


For 1), just think of instructions as little bundles of bytes. The CPU runs through the instructions in forward order, jumping around to other bits of the code as it goes. x86 has variable-width instructions (i.e. from 1 byte up to 15 bytes -- x86 is complex and there are a lot of prefix bytes that have been used to add new functionality over the years). To determine how long an instruction is, you need to decode the bits of the instruction. For ARM64, and most other ISAs nowadays, the instructions are all 4 bytes long. That means they can all be decoded in parallel.

For 2), imagine a boa constrictor swallowing a huge piece of prey. One mouth (CPU: the frontend) and one rear (CPU: the retirement phase). The instructions go in the front end in program order. They are decoded into operations that pile up in the middle (the giant bulge in the boa constrictor). When an instruction is ready to go, one of the execution ports (point 3 -- think of 16 little stomachs) picks it up and executes it. Then at the back end, the retirement phase, instructions are committed in the order they appeared in the original program, so that the program computes the same result.

By making basically all of the pieces of this boa constrictor bigger and more numerous, it eats a lot more instructions per clock (on average). Making that bulge (the reorder buffer) huge gives the CPU a high chance of having some useful work to feed to one of its 16 stomachs.
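A rough way to see, in plain C, what the wide backend and the big bulge buy you (function names are just for illustration): the first loop hands the out-of-order engine four independent addition chains it can spread across its ports, while the second is one serial dependency chain that no amount of width can speed up.

  /* Four independent accumulators: adds from different chains can execute
     in the same cycle on different ports. */
  long sum_wide(const long *a, long n) {
      long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
      long i;
      for (i = 0; i + 3 < n; i += 4) {
          s0 += a[i];
          s1 += a[i + 1];
          s2 += a[i + 2];
          s3 += a[i + 3];
      }
      for (; i < n; i++) s0 += a[i];      /* leftover elements */
      return s0 + s1 + s2 + s3;
  }

  /* One accumulator: every add depends on the previous one, so the chain
     runs one add at a time no matter how wide the core is. */
  long sum_serial(const long *a, long n) {
      long s = 0;
      for (long i = 0; i < n; i++) s += a[i];
      return s;
  }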


I think it's easy to underestimate how much difference (1) makes. Take the famous line "thequickbrownfoxjumpsoverthelazydog" - and think how you'd parse that out programmatically. You'd start at the beginning, reading each character in, comparing it against a dictionary, and when you decide you have a whole word - then you can split that word out - and then continue on to the next.

But you can't really do this in parallel as the start for each word depends on the previous split already being known.

If it were simply law that every word in existence was 5 characters, you could parse this out with zero lookups and zero knowledge. "Accurately" isn't so much the issue; it's that you have to decode each instruction to know where the next one starts.
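The same point in toy C (made-up byte stream, not real encodings): with a fixed 4-byte width, every instruction's start offset is known up front, so they can all be handed to decoders at once, while variable-width decode is inherently serial because each start offset depends on the previous instruction's length.

  #include <stddef.h>

  /* Fixed width: the offset of instruction i is just 4*i -- every iteration is
     independent, so the starts could all be computed (decoded) in parallel. */
  void fixed_starts(size_t count, size_t *starts) {
      for (size_t i = 0; i < count; i++)
          starts[i] = 4 * i;
  }

  /* Variable width: you can't know where instruction i+1 begins until you've
     decoded instruction i far enough to learn its length. */
  size_t variable_starts(const unsigned char *code, size_t len, size_t *starts,
                         size_t (*length_of)(const unsigned char *)) {
      size_t n = 0;
      for (size_t off = 0; off < len; off += length_of(code + off))
          starts[n++] = off;
      return n;
  }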


Yup, you've got the basic ideas. Hennessy and Patterson's books are the standard rec. "Computer Organization and Design" is the version more targeted at developers, and "Computer Architecture: A Quantitative Approach" is more focused on CEs or people who will be getting more into the guts.


I think the M1 chip finally proves the inherent design superiority of RISC over CISC. For years, Intel stayed ahead of all other competitors by having the best process, clockspeeds, and the most advanced out-of-order execution. By internally decoding CISC to RISC, Intel could feed a large number of execution ports to extract maximum ILP. They had to spend gobs of silicon for that: complex decoding, made worse by the legacy of x86's encodings, complex branch prediction, and all that OOE takes a lot of real estate. They could do that because they were ahead of everyone else in transistor count.

But in the end all of that went bye bye when Intel lost the process edge and therefore lost the transistor count advantage. Now with the 5nm process others can field gobs of transistors and they don't have the x86 frontend millstone around their necks. So ARM64 unlocked a lot of frontend bandwidth to feed even more execution ports. And with the transistor budget so high, 8 massive cores could be put on die.

Now, people have argued for decades that the instruction density of CISC is a major advantage, because that density would make better use of I-cache and bandwidth. But it looks like decode bandwidth is the thing. That, and RISC usually requires aligned instructions, which means that branch density cannot be too high, and branch prediction data structures are simpler and more effective. (Intel still has weird slowdowns if you have too many branches in a cache line).

It seems frontend effects are real.


ARM64 can't be that easy to decode, since ARM's recent high-performance cores (A78, X1) decode ARM64 instructions into MOPS and feature a MOP cache: https://www.anandtech.com/show/15813/arm-cortex-a78-cortex-x.... And we don't know that the M1 doesn't do that. Also, even on Zen 2, the entire decode section is still a fraction of the size of say the vector units: https://forums.anandtech.com/threads/annotated-hi-res-core-d.... And the cores themselves take up a small amount of the die space on a modern CPU: https://cdn.mos.cms.futurecdn.net/m22pkncJXbqSMVisfrWcZ5-102....

I bet doing 8-wide x86 decoding would be tough, but once you've got a micro-op cache, it's doable so long as you have a cache hit. Zen 3 is 8-wide the ~95% of the time you hit the micro-op cache.

The real question is how does Apple keep that thing fed? An 8-wide decoder is pointless if most of the time you've got 6 empty pipelines: https://open.hpi.de/courses/parprog2014/items/aybclrPgY4nPyY... (discussing the ILP wall). The M1 outperforms Zen 3 by 20% on the SPEC GCC benchmark, at a 1/3 lower clock speed. That's 80% more ILP than Zen 3, which is itself a large advance in ILP.


> ARM64 can't be that easy to decode, since ARM's recent high-performance cores (A78, X1) decode ARM64 instructions into MOPS and feature a MOP cache

My point was more that fixed-width instructions allow trivial parallel decoding, while x86 requires predicting the length of instructions or just brute-forcing all possible offsets, which is costly.

> The real question is how does Apple keep that thing fed?

That's why there's such an enormous reorder buffer. It's so that there's a massive amount of potential work out there for execution ports to pick up and do. Of course, that's all wasted when you have a branch mispredict. I haven't seen anything specific about M1's branch prediction, but it is clearly top-notch.


Other things helping are forced >=16KB page sizes, and massive L1 caches (the M1 has 4x Zen 3's L1 data cache and 3x the L1 instruction cache; how much of that cache size is enabled by the new process node and larger page sizes vs just the lack of x86 decode, I don't know).


L1 cache size is driven by the target clock frequency. Apple is not aiming for 5+GHz, whereas both Intel and AMD cores can turbo above 5GHz these days.


I just wanted to point out that it's not as if Apple made a better chip than Intel's out of the blue. They have been designing chips for quite a while: the APL0098 chip used in the original iPhone was introduced back in 2007.


Right, the M1 is based on the A14 - the iPhone/iPad chip. It's just the first PC chip based on that architecture. Making top-of-the-line mobile chips has given them a huge efficiency leg up. The SoC also gives big gains compared to a traditional separation of memory modules and CPUs.


SoC doesn't make as large of a difference as you'd think. The only place you really get hammered is if you're moving a lot of memory between separate memory domains.

As a real-world example, the X360 had a unified memory architecture and the PS3 had a split between system/GPU. From a CPU performance perspective they were pretty close (although the SPUs in the PS3 could really go if you vectorized your data for them appropriately).


> SoC doesn't make as large of a difference as you'd think.

There are now enough transistors on chip for a reasonably good on-chip GPU. Plus you get better bandwidth between CPU and GPU. This is independent of the instruction architecture. The PS5's SoC does some of the same things, though it's an x86 instruction set, not ARM.

Neither the PS5 nor the Xbox use AMD CPUs. Intel has a big problem.


Both the PS5 and the Xbox use AMD Zen 2 CPUs. I guess you meant that the PS5 and Xbox don't use Intel CPUs.

I highly doubt Intel wants to be in the low-margin business of supplying console CPUs. Intel's fortune is entirely dependent on whether it can take back the advanced-process-node crown from TSMC.


The code complexity to pipeline everything to be DSP-like to avoid memory latencies is very different. And you can be sure that most developers (particularly non-console developers) aren't thinking about it. Was super impressed with how Naughty Dog was able to use 70% of the SPUs on Uncharted 2 though - which was either at launch or close to launch.


The answer for that one is real simple, it was mostly a function of what your first target platform was.

Back in that era the X360 had the edge by coming out slightly ahead schedule-wise, so most engines targeted it first and then had a brutal slog to cram things down to the SPUs on the subsequent PS3 port.

If you went the other way, though, your nicely linearized, SPU-friendly data and code fit in cache on the X360/PC and usually ran remarkably better.


Once again, it was just more work to pipeline your data. That the same code worked well across both isn't my point. I worked on an engine team supporting both of those platforms, so I know. This kind of approach is pretty rare for desktop and mobile apps, which is what the M1 is used for.

Edit: Exceptions being typical DSP realms, such as video/image/sound processing, rendering packages, AI, which are all already targeting GPUs. Note that Final Cut Pro works faster on Intel setups with traditional (non-integrated) GPUs vs M1.


Oh I worked on titles in that era too, and on other oddball platforms like the PSP :).

Without getting into the weeds too much you just weren't doing much with the single CPU core on the PS3. As soon as you booted anything of note it was pretty obvious that you'd have to offload to the SPUs. I'm not saying that it wasn't more work, but most of the other teams we talked with who had done PS3 as a baseline first had a much, much easier time of it.

I would say that for any console/handheld developer the approach wasn't unfamiliar, but there were certainly a fair number of engines with roots in the PC space that had a rough time of it. That said, it was more down to the SPUs than anything related to a unified memory architecture (although we did do some fun shenanigans around streaming music back from VRAM to the CPU to give us more system memory headroom).


Which also implies that Intel had a lot of time to catch up.


I read somewhere (possibly on this site) that Intel made a purposeful decision not to invest in trying to catch up, as they didn't believe in the feasibility of 5nm and lower due to difficulties managing thermal issues. Not sure how true that is, but it would explain a lot.


My suspicion is that they don't need to worry about backward compatibility, or compatibility outside of their own hardware choices at all. That opens the door for them to tie OS decisions with CPU decisions straight from a single product family perspective.


There have been several HN links discussing this. Have you read these yet? I thought they did a pretty good job answering your questions.

  https://debugger.medium.com/why-is-apples-m1-chip-so-fast-3262b158cba2

https://medium.com/swlh/what-does-risc-and-cisc-mean-in-2020...


Thanks for the links. These articles do indeed answer a lot of my questions.


ARM is RISC, Intel and AMD are CISC. One important reason is their new pipelining facility.

Apple M1 has 16 units that can pipeline their instructions.

Meaning, they can reorder sequential instructions that aren't dependent on each other to run in parallel. That is not threads or anything; it can be and is done within a single-threaded program.

AMD and Intel have 4 units for reordering tops, because their architecture is CISC and an instruction can be up to 15 bytes. The M1 is RISC and instructions are a fixed 4 bytes. Thus, architecturally, it is easier to reorder instructions for RISC than for CISC.

CISC used to be better because of its specialized instructions, but now Apple has stuffed their CPU with dedicated hardware for a lot of things, including machine learning, graphics, and encryption. Instead of specific instructions, Apple has specific hardware, and can get by with fewer instructions.

And since they control the hardware, the software SDKs, and the OS, they can actually get away with such radical changes. Intel and others can't, without a big change in the industry.

Source: https://debugger.medium.com/why-is-apples-m1-chip-so-fast-32...


1) There’s not a meaningful difference between RISC and CISC on modern architectures. CISC has certain advantages these days because it is compact at encoding memory operations. Intel and AMD crack instructions into operations called micro-ops. There is no meaningful difference between how easy it is to reorder the micro-ops versus RISC ops. CISC or RISC, the internal structures of the processor operate on something quite different than the instruction as written in memory (a decoded form).

2) Reordering is different than pipelining, and CPUs have done both for decades. The difference between the M1 and Intel/AMD is that the M1 is wider in spots and can do much more extensive reordering. The M1 can decode and issue 8 instructions at a time. AMD can do 4 or 8 depending on whether the instruction is coming from memory or a special cache for pre-decoded instructions. The M1 has a reorder buffer of over 600 instructions—meaning it can have 600 instructions waiting for completion at a time (e.g. some executing while others are waiting for data to come back from memory). Intel and AMD’s reorder buffers are half the size.

3) Special instructions and controlling the software interface has little to do with performance on general purpose code.


I disagree, here's why:

CISC instructions are still variable length. People can argue that micro-ops are RISC-like, but microcode is an implementation detail very close to the hardware.

One of the key ideas of RISC was to push a lot of heavy lifting over to the compiler. That is still the case. Micro-ops cannot be re-arranged by the compiler for optimal execution.

Time is more critical when running micro-ops than when compiling. It is an obvious advantage to make it possible for an advanced compiler to rearrange code rather than relying on precious silicon to do it.

While RISC processors have gotten more specialized instructions over the years (e.g. for vector processing), they still lack the complex memory access modes that many CISC instructions have.


> How did Tesla manage to create this when GM has been making cars for decades and that is the singular focus of the company?

(replaced the subjects with auto-industry corollaries)

I think it's the power of vertical integration. When you aren't cobbling together a bunch of general-purpose bits for general-purpose consumers, you don't have to pay as many taxes. Sort of like SRP in software - a multi-purpose function is going to become super bloated and expensive to maintain, compared to several single-purpose alternatives.

https://en.wikipedia.org/wiki/Single-responsibility_principl...

Vertical integration is like taking horizontally-integrated business units and refactoring them per SOLID principles.

https://en.wikipedia.org/wiki/SOLID


Guess where some of Intel's engineers have fled to? People move around, so it's not like one company has a strangle-hold on knowledge that can't be replicated by another company, especially when one of those companies is willing to pay more for talent.


They weren't weighed down by the legacy bloat in the x86 instruction set architecture.


Except we already had 30 years of other ISAs without that bloat, and they were all resoundingly beaten by Intel.


I think the answer is that the marginal benefits of a better ISA were less than the marginal benefits of better node process and faster iteration that Intel enjoyed. But for various reasons Intel no longer enjoys those advantages. With TSMC Apple has the process advantage, and their smartphone business has given them both the motivation and cash to iterate their architecture faster than Intel. The simpler ISA has compounded those advantages.


Several comments mention that the Apple M1 doesn't need to support legacy code, but Rosetta 2's support for amd64 (yes, I choose this term over x86-64) is beyond great. I looked into that specifically a while ago, and some mention Apple had something designed specifically for amd64 emulation. So I'm against that point.


amd64 is the correct term. It's AMD's instruction set; they were the first to do it. The BSDs call it amd64; it's only Linux being the odd duck calling it x86_64.


This makes me wonder: If there's such a benefit from creating an integrated and specialized chip, will the next consoles follow the same approach? Will they be ARM based? If Microsoft and Sony follow this same model then PC games might be left behind with poorer graphics and fewer titles.


Consoles are already like this. The PS4, for instance, had unified memory access between its GPU/CPU, and custom silicon was basically the norm for consoles PS3-era and earlier, across all manufacturers. The PS5 uses custom silicon to get insane SSD streaming straight to memory for almost instantaneous load times. In fact, older Apples, the Amiga, etc. from the 80s were all powered by custom silicon. Apple is taking this trope from the console world and using it to disruptive effect in the modern PC.


I imagine part of it is vertical integration. If you make the hardware that the CPU integrates with, and the OS that runs on the hardware, you can do a lot of optimization.

Intel CPUs don't know what memory they're talking to, i.e. they have to support a variety of memory. Likewise they don't necessarily know what OS they're running; how it context switches, etc. If it's virtualized, etc. Sure they have optimizations for those common cases, but the design is sort of accreted rather than derived from first principles.

To make an analogy, if you know your JS is running on v8, you can do a bunch of optimization so you don't fall off the cliff of the JIT, and get say 10x performance wins in many cases. But if you're writing JS abstractly then you may use different patterns that hit slow paths in different VMs. Performance is a leaky abstraction.


I read that out-of-order execution on RISC was simpler to handle with the fixed 32-bit instructions. They said Apple managed to dispatch 8 instructions in parallel whereas high-end CISC (x86) tops out at 4.


The greater simplicity of ARMv8 and its fixed-size instructions definitely helps, but also Intel runs their cores at nearly 2x higher frequency, which means a lot less logic can be squeezed into a clock cycle. That makes it much harder to make a wider processor.


Apple has less technical debt and more aggressively eliminates old technologies. And Apple makes its own laptops and uses its own operating system. This allows Apple to provide relatively more complete support.


Does the unified memory architecture mean these SoCs will always have “limited” memory as some is in use by graphics?

Are we seeing a deeper bifurcation of the industry: personal vs server?

Maybe Intel and others can happily coexist?


Innovators Dilemma.

I think you need to look at this from another angle. Yes, Apple did make some excellent choices, but the market was Intel's to lose.

The difference in the chips isn't limited to the 5nm, memory pooling, etc etc. Look at the base x86 vs ARM core architecture, and that is where you'll see the problem Intel had.

I'm sure there were discussions inside Intel which went along the lines of one person arguing that they had to start developing ARM-based or other RISC-based chips, and somebody else countering "but at Intel we build for the desktop, and servers, and RISC processors are toys, they're for mobile devices and tablets. They'll never catch up with our..."

This change in architecture was a long time coming. As we all know, there is very little we do with our computers today, that we can't also accomplish on a phone (or tablet). The processing requirements for the average person are not that large, and ARM chips, made by Apple, Qualcomm, Samsung, or anybody else, have improved to the point they are up to some of the more demanding tasks. Even able to play high quality games at a good frame-rate or edit video.

So, now we have to ask, what was delaying the move from x86 to ARM? Apple aren't the only ones making ARM-based computers. Microsoft has two generations of ARM-based Surface laptops out, and I think Samsung has made one too. I'm sure there are others. This is a wave that has been building for a long time.

So, now we can look at why Apple was able to be so successful in their ARM launch compared to Microsoft and the lackluster reviews of Windows based ARM devices.

From my understanding, it isn't the 5nm technology, though I am no expert in chip design. However, as you state, Apple was able to pool memory and put the memory right on the chip package, which (from what I understand) saves the overhead of transferring data in and out, as well as allowing the CPU and GPU to share memory more efficiently.

As I understand it, the Qualcomm or other chips have a much smaller internal memory footprint, expecting the memory to be external to the CPU/GPU. Perhaps because this is just always the way it has been done.

Now this is where Apple's real breakthrough comes in. First off, they have the iOS App Store and all its apps now available to use on the desktop. This means all the video editing or gaming apps that were already designed for iOS can now run perfectly fine on the "new" ARM architecture. Then there is Rosetta 2. Apple understood how important running legacy software would be for a small number of their users, and I suspect they also had very good metrics on what those legacy programs were. They did an exceptional job on Rosetta (from what I understand), and should be commended for that. Though most users will likely never use Rosetta extensively, it goes a huge way toward making the M1 chip an absolute no-brainer.

Compare Rosetta to Microsoft's attempt at backward compatibility, and the difference seems glaring. HOWEVER, I think again this comes down to strategy and execution. Apple knows that only a small number of their customers need a small number of apps to run in Rosetta. Microsoft, having both a larger user base, AND much more bespoke software running on their platform, don't have this luxury.

I'm sure there are other factors, but my thinking is that it is less about direct technology and more about flawed strategy/execution from Intel and absolutely amazing execution from Apple.

I'm very torn by this all tbh. I've been an Apple hater for a long time. Every Apple product I've bought has turned out to be crap (except my original generation 2 iPod, it was truly magical). I'm beginning to think Apple may have actually got the upper hand here.


Time to hop on the Apple bandwagon :) I've been a Mac user for 10 years and it felt like Apple completely lost the plot with their laptops from 2015-2019. Glad to see them make a comeback and do what they do best -- make excellent hardware and operating systems.


Just as a small note, as far as I can tell the PPC -> x86 version of Rosetta was at least partly licensed from "Transitive Corporation" by Apple, not entirely in-house: https://en.wikipedia.org/wiki/QuickTransit . I have no idea about Rosetta2 though.


Transitive was acquired by Apple, I imagine Rosetta 2 was the product of many of the same engineers.


Transitive was acquired by IBM, not Apple. I believe some engineers ended up at Apple though.


Only supporting 8G RAM may help too.


How would less RAM make the M1 faster?


16GB. Still not enough though.


The M1 CPU only supports a 36-bit address space, so nothing can be done right now.


I'd like to know this too.


It just shows the power of RISC. Soon there will be RISC-V chips (or at least Samsung ARM chips - remember them?) that will closely follow its lead, mark my words...



