I'm still waiting for my M1 to show up but from studying the details that other people have commented on, it's clear that Apple is slathering on the microarchitectural resources that both Intel and AMD are stingy with. Their core has a gigantic L1 instruction cache, rather a lot of L1 TLB entries (4x more than Intel), large(r) pages by default, and many other aspects. L1 TLB entries in particular are a well-known problem for x86 performance so it's a real mystery why Intel doesn't simply add more of them. x86 is probably married to 4K pages forever, because of god-damned DOS, whereas Apple can make the default page size be whatever they want it to be. People with high requirements can jump through a million hoops to get their x86 program loaded into hugepages, but on Apple's platform you're getting it for free.
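To give a flavour of those hoops: on Linux you either pre-reserve hugepages and ask for them explicitly, or hope transparent huge pages kick in. A minimal sketch (assuming the admin has reserved 2 MiB pages via /proc/sys/vm/nr_hugepages):

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 2 * 1024 * 1024;   /* one 2 MiB x86 huge page */

    /* Explicit hugepage mapping; fails unless pages were reserved,
       e.g. echo 64 > /proc/sys/vm/nr_hugepages */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");

        /* Fallback: a normal 4K mapping, then ask the kernel to back it
           with transparent huge pages if and when it can. */
        p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        madvise(p, len, MADV_HUGEPAGE);
    }

    ((char *)p)[0] = 1;   /* touch it so a page is actually faulted in */
    munmap(p, len);
    return 0;
}
```

And that only covers heap data; getting your text segment onto huge pages (which is what helps iTLB misses) means linker alignment tricks and remapping at startup. On a platform with 16K pages by default you get a big chunk of that benefit without doing anything.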
While Intel is indeed losing the performance game, even with the shady Management Engine shenanigans Intel CPUs are still much more open than the locked-down proprietary nightmare the new Apple ARM SoCs apparently are.
That's why I commute to the office via horse — so I don't need to deal with those fascists at the DMV. Sure the costs of boarding and hay and horseshoes pile up, but at least I'm not being hassled by The Man.
This is a weirdly reductive take. Any new technology can be better in some aspects and worse in others. (To use your car/horse analogy: cars are much faster and have increased standards of living, but they have also had large negative effects on society and the environment.)
And it's a poor analogy because driver licensing is almost a straight consequence of how dangerous cars are, but the openness of an SoC does not directly follow from how fast it is.
Many of us are? And we can replace parts from third parties.
It is getting more and more integrated though, which makes it hard to replace parts. But that has more to do with greed and size requirements than vertical integration; a soldering iron can amend some of that. The driver situation isn't by any means something new, or necessarily an indication of vertical integration either.
Since the DMV exists primarily for traffic safety, I assume Apple is removing your ability to run whatever you want on your processor for your safety too?
I was serious. I don't want to actually get into the usefulness of the DMV, I was just assuming a common opinion for the purposes of the discussion since the OP made the comparison to the DMV.
Yeah, just like they hassle me with my wire-wrapped homebrew OpenBSD-based SDR mobile phone in my blue Radio Shack project box every time I try to get on an airplane. Why does The Man always pick on me?!
Yeah, this point is being continually missed - having better performance doesn't mean much if you are limited in what you can do by the OS, the ecosystem and the hardware. As it is, you can't run another OS or attach external or PCIe GPUs to anything Apple Silicon. So if macOS continues its downward trajectory, or if someone else comes up with a better GPU, you're stuck - unlike with a PC.
I think it is typical of Apple to have feature-poor early entrants and then increase the feature set with new generations. I am sure in the next few years the GPU/other PCIe issues will be resolved.
There are basically three things to optimize for:
1. Power efficiency (read: battery life)
2. Performance
3. Openness
99% of consumers only care about two of those. The more time goes on, the clearer it becomes that prioritizing openness, in particular for hardware, is a direct trade-off against building a great consumer product.
I think this is simply a consequence of the fact that the technological overlords who wish to oversee all of computing haven't yet had a chance to fully lock down computers (though they certainly wish to). Once they inevitably overplay their hand and the average consumer truly loses the ability to run the things he wishes to, you'll start to see a dramatic increase in the consumer's desire for openness.
That, in part, comes from x86 vs ARM. x86, with its variable-length instructions, means your decode logic gets increasingly complex as you widen. Meanwhile with ARM you have a constant instruction size (other than Thumb, but that's more straightforward than fully variable lengths), which lets Apple do things like an 8-wide decode block (vs. 4-wide for Intel) and a humongous ~630-entry OOB (almost double Intel's deepest cores).
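A toy way to see the decode-width problem (illustrative C, nothing like how the hardware actually does it): with fixed-width instructions each decode slot can compute its start address independently, while with variable lengths every slot depends on all the lengths before it.

```c
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

/* Fixed-width ISA (AArch64): instruction i always starts at byte 4*i,
   so eight decode slots can all pick their start points in parallel. */
static size_t fixed_start(size_t i) { return i * 4; }

/* Variable-length ISA (x86: 1..15 bytes per instruction): where
   instruction i starts depends on the length of every instruction before
   it -- a serial dependency the decoder has to break with extra
   length-predecode logic, which gets harder the wider you go. */
static size_t variable_start(const uint8_t *len, size_t i) {
    size_t off = 0;
    for (size_t k = 0; k < i; k++)
        off += len[k];
    return off;
}

int main(void) {
    const uint8_t x86_len[8] = {1, 3, 2, 7, 5, 1, 4, 6}; /* made-up lengths */
    for (size_t i = 0; i < 8; i++)
        printf("slot %zu: fixed @ byte %zu, variable @ byte %zu\n",
               i, fixed_start(i), variable_start(x86_len, i));
    return 0;
}
```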
I think this marks the definitive point where RISC dominates CISC. It's been a long time coming, but the M1 spells it out in bold. Sure, variable-size instructions are great when cache is limited and page faults are expensive. But with clock speeds and node sizes asymptoting, the only way to scale is horizontal: more chiplets, more cores, bigger caches, bringing DRAM closer to cache. Minimizing the losses from cache flushes caused by more threads, and doing less context-switching.
Basically computers that start looking and acting more like clusters. Smarter memory and caching, zero-copy/fast-copy-on-mutate IPC as first-class citizens. More system primitives like io_uring to facilitate I/O (see the sketch below for the shape of that model).
The popularity of modern languages that make concurrency easy means more leveraging of all those cores.
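Since io_uring came up: the model is "queue work, submit in a batch, reap completions", rather than a syscall per operation. A minimal liburing sketch, just to show the shape of it (assumes liburing is installed and a readable /etc/hostname; build with -luring):

```c
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    struct io_uring ring;
    if (io_uring_queue_init(8, &ring, 0) < 0) {
        perror("io_uring_queue_init");
        return 1;
    }

    int fd = open("/etc/hostname", O_RDONLY);   /* any readable file */
    if (fd < 0) { perror("open"); return 1; }

    char buf[256];
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf) - 1, 0);
    io_uring_submit(&ring);             /* one submit can carry many ops */

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);     /* reap the completion */
    if (cqe->res > 0) {
        buf[cqe->res] = '\0';
        printf("read %d bytes: %s", cqe->res, buf);
    }
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}
```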
Agree with everything except the RISC vs CISC bit. Modern ARM isn't really RISC, and x86 gets translated into RISC-like micro-instructions anyway. To the extent that ARMv8 has an advantage, I think it's due to being a nearly clean design from 10 years ago, not carrying the x86 legacy.
AnandTech's analysis corroborates what DeRock is saying here about the x86 variable length instructions being the limiting factor on how wide the decode stage can be.
The other factor is Apple is using a "better" process node (TSMC 5nm). I put it in quotes because Intel's 10nm and upcoming nodes may be competitive, but Intel's 14nm is what Apple gets to compete against today, right now.
> but Intel's 14nm is what Apple gets to compete against today, right now.
Intel's 10nm node is out, I'm typing this on one right now. It's competitive in single-core performance against what we've seen from the M1. Graphics and multi-core it gets beat though...
Or do you mean what Apple used to use? (edit: the following is incorrect) It's true Apple never used an Intel 10nm part.
EDIT: I was wrong! Apple has used an Intel 10nm part. Thanks for the correction!
I'm using a MacBook Pro with an Intel 10nm part in it. The 4 port 13" MBP still comes with one. I think the MacBook Air might have had a 10nm part before it went ARM, too.
There are still no 10nm parts for the desktop or the high-end/high-TDP laptops anyway afaik.
Tiger Lake is objectively the fastest thing in any laptop you can buy, regardless of whether its TDP is as high as others. You're right about the lack of desktop parts, though.
I'm typing this on a Tiger Lake, too, and it is very fast and draws modest power. But are they making any money on it? If they lose Apple as a laptop customer, how much does that hurt their business?
There was a lot of discussion on a related, recent post about Apple buying many of Intel’s most desirable chips. Will be interesting to see whether the loss of a high-end customer translates into more woes for Intel.
> Meanwhile with arm, you have constant instruction size (other than thumb, but that’s more straightforward than variable size),
That's only for 32-bit ARM; for 64-bit ARM, the instruction size is always constant (there's no Thumb/Thumb2/ThumbEE/etc). It wouldn't surprise me at all if Apple's new ARM processor is 64-bit only (being 64-bit only is allowed for ARM, unlike x86), which means the decoder doesn't have to worry about 2-byte Thumb instructions, nor about the older 32-bit ARM instructions (including the quirky ones like LDM/STM).
That would also explain why Apple can have wide decoders while their competitors can't: those competitors want to keep compatibility with 32-bit ARM, while Apple doesn't care.
From the Anandtech article[1] on the M1, I think this is referring to the reorder buffer, or ROB. Reorder buffers allow instructions to execute out-of-order, so maybe that's where the "OO" comes from.
Yes, I was going from memory and thought “out-of-order buffer” was a thing. I meant ROB. The point still stands, in that scaling these blocks is very difficult with variable instruction size.
Not a chip designer, but I think it could lead to a more efficient use of the functional units.
With more instructions decoded per clock and a larger reorder buffer, the core should be able to keep its units busier, i.e. not wasting energy without producing valuable output.
This efficiency gain of course needs to outweigh the consumption of the additional decoders. This part is easier with ARM as decoding x86 is complicated.
In addition to higher unit utilization, the increased parallelism should also be an advantage in the "race to sleep" power management strategy.
The manufacturing process is TSMC's, so probably not.
To me the innovation is using phone-type SoC technology.
A lot of the power (and die area) demand in modern chips is in the interface circuitry to drive the few centimetres of PCB trace between ICs.
Put the RAM in the same package as the CPU and you eliminate that interface. That gives you die space and power budget for big buffers and lots of parallel execution.
Interface transistors (and wires) are many times larger than the core compute logic transistors (and wires), so there is much more than a one-for-one gain.
Interesting that you mention page size; IMHO 4K was small even 10 years ago. Though of course the next bump on x86 is 2M (or 4M without PAE), which I think might be too big for most apps.
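For a sense of scale (entry count purely illustrative): a 128-entry L1 TLB covers 128 × 4 KB = 512 KB of address space with 4 KB pages, 2 MB with 16 KB pages, and 256 MB with 2 MB huge pages - the same silicon, 4x or 512x the reach before a miss.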
Apparently 32-bit ARM supports 4KB and 64KB pages plus 1MB sections, while the 64-bit translation granules are 4KB, 16KB and 64KB (I just googled that); Apple uses 16KB.
Also maybe all the x86 crud might be finally taking its toll (and not just the ISA but all the cruft that got on top of it for 40yrs+ like legacy buses and functionalities, ACPI, or even the overengineered mess that's UEFI)
There’s nothing wrong with ACPI - in fact it’s what’s going to enable ARM in the server space. Without something like ACPI, you need customized kernels and other crap to control system devices. ACPI is actually quite nice. And it replaced something (APM) that really was legacy cruft (dropping into real mode for power management?!! Ugh!)
EFI doesn’t seem that bad either. Maybe something like U-Boot would suit you better? But the reason we have those nice graphical BIOS setup utilities now is because of EFI.
It seems like a straightforward tradeoff of IPC vs scalability.
They are able to shovel transistors at the core because there are only 4 of them. AMD on the other hand is packing 64 cores into a package, so each core has to make do with less.
I believe my point is that the M1 seems to spend its gate budget where it really helps performance of software in practice, while x86 vendors are spending it somewhere else.
AMD in particular wasted an entire generation of Zen by having too few BTB entries. Zen1 performance on real software (i.e. not SPEC) was tragic, similar to Barcelona on a per-clock comparison. They learned this lesson and dramatically increased BTB entries on Zen2 and again for Zen3. But the question in my mind is why would you save a few pennies per million units by shaving off half the BTB entries? Doesn't make sense. They must have been guided by the wrong benchmark software.
I doubt it was about saving their pennies. Zen1 EPYC was a huge package with an expensive assembly & copious silicon area. But it was spread across 32 cores. A larger BTB probably had to come at the expense of something else.
What 'real software' are you thinking of? Anything in particular? Just curious, not looking to argue.
(sorry I changed my comment around the same time you replied)
Very large, branchy programs with very short basic blocks. This describes all the major workloads at Google, according to their paper at https://research.google/pubs/pub48320/
Is page size actually a noticeable problem for most people (say, MacBook users)? Linux has supported huge pages for years, and my impression is that it's still considered a niche feature for squeezing out the last percentage of performance. I don't think any popular distro enables it by default.
If you have a large program, putting its text on huge pages can make much more than a one percent difference. Any time your program spends on iTLB misses can be eliminated with larger pages.
Not just Linux: Windows also has large page support. Back in the day when CPU cryptocurrency mining was still a thing (and for some cryptocurrencies aimed at CPU mining it still is), some mining programs would ask you to enable large page support.
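On Windows the hoop is the "Lock pages in memory" privilege (SeLockMemoryPrivilege); once that's granted, large pages are just a flag on VirtualAlloc. A rough sketch of what those miners do, with the privilege setup omitted:

```c
#include <windows.h>
#include <stdio.h>

int main(void) {
    /* Large pages require SeLockMemoryPrivilege and an allocation size
       that is a multiple of the minimum large-page size. */
    SIZE_T min_large = GetLargePageMinimum();   /* typically 2 MiB on x86-64 */
    if (min_large == 0) {
        puts("large pages not supported");
        return 1;
    }

    void *p = VirtualAlloc(NULL, min_large,
                           MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
                           PAGE_READWRITE);
    if (!p) {
        /* Often ERROR_PRIVILEGE_NOT_HELD if the privilege isn't granted. */
        printf("VirtualAlloc failed: %lu\n", GetLastError());
        return 1;
    }

    VirtualFree(p, 0, MEM_RELEASE);
    return 0;
}
```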
Honestly? A lot of it has to do with the fact that they booked TSMC's entire 5nm production capability for some time.
So they are working on a smaller process node than their competitors, and they used their market position to ensure that they will be the only ones using it for a little while. That gives them a significant leg up.
Intel dropped the ball on 10nm and 7nm processes constantly for the last 5 years.
Skylake was the "tock"; then they couldn't "tick" to 10nm, so they made 14nm+, then 14nm++, then 14nm+++... Intel's been stuck on the same process for half a decade now.
Part of it is just that Apple spent a bunch of time hiring the right people and making the right acquisitions.
Another part of it is just that designing a CPU with good single-threaded performance is hard and costs a lot of die area, and no one else in the ARM world was incentivized to do this - who needs a smartphone with that kind of performance (I know, I know - all of us, now)? But Apple also makes iPads, which are kind of a bridge device - mobile and battery-powered, but also big and something people might want to do real work on.
I’m also going to say that I think Apple is still alone among the phone manufacturers in recognizing that smartphones are software platforms, not simply hardware devices. I think they have a special appreciation for the enabling power of a more capable CPU.
The 5 Watt iPhone CPU almost beats the 45 Watt AMD CPU. And the M1 will then simply obliterate the x86 incumbents.
Why? The real reason is that Apple had/has the real advantages: size, resources, money, strategy, know-how.
It's bigger than Intel. It has a better corporate culture than Intel, and it has a vertically integrated market that has been completely safe from any outside disruption for years and will be for years to come. It had the opportunity to do this switch, so it had the motivation to develop this capability.
Apple was able to hire the best semi designers plus benefit from the ARM platform.
AMD is still too small and relatively spread thin - trying to cover multiple segments of the general market - compared to Apple, which basically has this one chip in two versions (mobile and ultra mobile).
Neither had the underlying single enormous vertically integrated market slice.
Apple has and had the iPhone. Apple started using their own chips, the A5, about 10 years ago (in March 2011, starting with the iPad 2 - and they used ARM before that too). Since then Apple has optimized the “supply chain” in incremental steps, but when you are Apple every step can be almost revolutionary (and year after year Steve told the believers exactly that, and it's basically true - just not on the feature front, but on the hardware and systems level: e.g. Gorilla Glass for the phones, power efficiency, the oversized battery for the MBP, milling, machining, marketing, design, and so on).
They have access to a fab process that is ~2 full generations more advanced than what Intel is shipping most of its products on, and 64-bit ARM has one advantage over x86 (fixed-width instructions) which allows them to make a very wide core.
Of those two advantages, the process one is a lot more significant. Their chip loses in single-threaded performance to AMD's newest, which are still one node behind Apple, but appears to be more power-efficient. I fully expect that once AMD gets on the matching node they will take the crown back, even in a laptop power envelope.
My guess until reviewers get their hands on the thing is that they've got some pretty good R&D alongside the massive process-node advantage over Intel. AMD has already shown that there's a lot more performance in 7nm than we thought, and the M1 looks like it'll be on par with what you'd expect from Zen 4, so I wouldn't say they've got any advantage from the architecture (as you'd expect).
Samsung's chip division doesn't have a guaranteed buyer or the option to build an SoC specifically for a given product (even if you can usually bet on the next-generation Galaxy device as your target, you still have to design as if you were selling to the open market - and that means less freedom to, for example, throw in a lot of the memory/cache resources that have been the main differentiator of Apple's chips - which started from the same design as Exynos!)
Apple was actually the one designing those cores in the Exynos - up until and including the Exynos 4000. That was when Apple decided to sever ties with Samsung and stopped designing cores for them and Samsung had to go back to stock ARM designs with the Exynos 5000. That was why the 5000 series sucked so badly - Samsung lost access to those Apple/Intrinsity juiced cores. It’s also what prompted Samsung to open SARC in Austin.
I'm wrong about that - it was the Exynos 3110 that had the Apple designed core that also appeared in the A4. And after that, it was back to ARM's humble designs for Samsung.
Hummingbird in the original Galaxy S (this was before the rename to Exynos) and the A4 in the iPhone 4 stem from cooperation between Samsung and Apple (who previously used Samsung SoCs in the iPhone 3G and, I think, the 2G). The teams split before both phones reached maturity in design. From then on Samsung kept iterating on the Hummingbird design until, I think, quite recently - I'm not sure my Note 10+ doesn't still have cores derived from that design.
Graph conveniently omits AMD who now beat Intel using the same ISA but different process.