I'm still waiting for my M1 to show up but from studying the details that other people have commented on, it's clear that Apple is slathering on the microarchitectural resources that both Intel and AMD are stingy with. Their core has a gigantic L1 instruction cache, rather a lot of L1 TLB entries (4x more than Intel), large(r) pages by default, and many other aspects. L1 TLB entries in particular are a well-known problem for x86 performance so it's a real mystery why Intel doesn't simply add more of them. x86 is probably married to 4K pages forever, because of god-damned DOS, whereas Apple can make the default page size be whatever they want it to be. People with high requirements can jump through a million hoops to get their x86 program loaded into hugepages, but on Apple's platform you're getting it for free.
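To give a flavour of those hoops: on Linux you either pre-reserve hugepages and ask for them explicitly, or hope transparent huge pages kick in. A minimal sketch (assuming the admin has reserved 2 MiB pages via /proc/sys/vm/nr_hugepages):

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 2 * 1024 * 1024;   /* one 2 MiB x86 huge page */

    /* Explicit hugepage mapping; fails unless pages were reserved,
       e.g. echo 64 > /proc/sys/vm/nr_hugepages */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");

        /* Fallback: a normal 4K mapping, then ask the kernel to back it
           with transparent huge pages if and when it can. */
        p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        madvise(p, len, MADV_HUGEPAGE);
    }

    ((char *)p)[0] = 1;   /* touch it so a page is actually faulted in */
    munmap(p, len);
    return 0;
}
```

And that only covers heap data; getting your text segment onto huge pages (which is what helps iTLB misses) means linker alignment tricks and remapping at startup. On a platform with 16K pages by default you get a big chunk of that benefit without doing anything.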
While Intel is indeed losing the performance game, even with the shady Management Engine shenanigans Intel CPUs are still much more open than the locked-down proprietary nightmare the new Apple ARM SoCs apparently are.
That's why I commute to the office via horse — so I don't need to deal with those fascists at the DMV. Sure the costs of boarding and hay and horseshoes pile up, but at least I'm not being hassled by The Man.
This is a weirdly reductive take. Any new technology can be better in some aspects and worse in others. (To use your car/horse analogy: cars are much faster and have increased standards of living, but they have also had large negative effects on society and the environment.)
And it's a poor analogy because driver licensing is almost a straight consequence of how dangerous cars are, but the openness of an SoC does not directly follow from how fast it is.
Many of us are? And we can replace parts from third parties.
It is getting more and more integrated though, which makes it hard to replace parts. But that has more to do with greed and size requirements than vertical integration; a soldering iron can amend some of that. The driver situation isn't by any means something new, or necessarily an indication of vertical integration either.
Since the DMV exists primarily for traffic safety, I assume Apple is removing your ability to run whatever you want on your processor for your safety too?
I was serious. I don't want to actually get into the usefulness of the DMV, I was just assuming a common opinion for the purposes of the discussion since the OP made the comparison to the DMV.
Yeah, just like they hassle me with my wire-wrapped homebrew OpenBSD-based SDR mobile phone in my blue Radio Shack project box every time I try to get on an airplane. Why does The Man always pick on me?!
Yeah, this point is being continually missed - having better performance doesn't mean much if you are limited in what you can do by the OS, the ecosystem and the hardware. As it is, you can't run another OS or attach external or PCIe GPUs to anything Apple Silicon. So if macOS continues its downward trajectory, or if someone else comes up with a better GPU, you're stuck - unlike with a PC.
I think it is typical of Apple to have feature-poor early entrants and then increase the feature set with new generations. I am sure in the next few years the GPU/other PCIe issues will be resolved.
There are basically three things to optimize for:
1. Power efficiency (read: battery life)
2. Performance
3. Openness
99% of consumers only care about two of those. The more time goes on, the clearer it becomes that prioritizing openness, in particular for hardware, is a direct trade-off against building a great consumer product.
I think this is simply a consequence of the fact that the technological overlords who wish to oversee all of computing haven't yet had a chance to fully lock down computers (though they certainly wish to). Once they inevitably overplay their hand and the average consumer truly loses the ability to run the things he wishes to, you'll start to see a dramatic increase in the consumer's desire for openness.
That, in part, comes from x86 vs ARM. x86, with its variable-length instructions, means your decode logic gets increasingly complex as you widen. Meanwhile with ARM you have a constant instruction size (other than Thumb, but that's more straightforward than fully variable lengths), which lets Apple do things like an 8-wide decode block (vs. 4-wide for Intel) and a humongous ~630-entry OOB (almost double Intel's deepest cores).
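A toy way to see the decode-width problem (illustrative C, nothing like how the hardware actually does it): with fixed-width instructions each decode slot can compute its start address independently, while with variable lengths every slot depends on all the lengths before it.

```c
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

/* Fixed-width ISA (AArch64): instruction i always starts at byte 4*i,
   so eight decode slots can all pick their start points in parallel. */
static size_t fixed_start(size_t i) { return i * 4; }

/* Variable-length ISA (x86: 1..15 bytes per instruction): where
   instruction i starts depends on the length of every instruction before
   it -- a serial dependency the decoder has to break with extra
   length-predecode logic, which gets harder the wider you go. */
static size_t variable_start(const uint8_t *len, size_t i) {
    size_t off = 0;
    for (size_t k = 0; k < i; k++)
        off += len[k];
    return off;
}

int main(void) {
    const uint8_t x86_len[8] = {1, 3, 2, 7, 5, 1, 4, 6}; /* made-up lengths */
    for (size_t i = 0; i < 8; i++)
        printf("slot %zu: fixed @ byte %zu, variable @ byte %zu\n",
               i, fixed_start(i), variable_start(x86_len, i));
    return 0;
}
```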
I think this marks the definitive point where RISC dominates CISC. It's been a long time coming, but the M1 spells it out in bold. Sure, variable-size instructions are great when cache is limited and page faults are expensive. But with clock speeds and node sizes asymptoting, the only way to scale is horizontal: more chiplets, more cores, bigger caches, bringing DRAM closer to cache. Minimizing the losses from cache flushes caused by more threads, and doing less context-switching.
Basically computers that start looking and acting more like clusters. Smarter memory and caching, zero-copy/fast-copy-on-mutate IPC as first-class citizens. More system primitives like io_uring to facilitate I/O (see the sketch below for the shape of that model).
The popularity of modern languages that make concurrency easy means more leveraging of all those cores.
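Since io_uring came up: the model is "queue work, submit in a batch, reap completions", rather than a syscall per operation. A minimal liburing sketch, just to show the shape of it (assumes liburing is installed and a readable /etc/hostname; build with -luring):

```c
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    struct io_uring ring;
    if (io_uring_queue_init(8, &ring, 0) < 0) {
        perror("io_uring_queue_init");
        return 1;
    }

    int fd = open("/etc/hostname", O_RDONLY);   /* any readable file */
    if (fd < 0) { perror("open"); return 1; }

    char buf[256];
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf) - 1, 0);
    io_uring_submit(&ring);             /* one submit can carry many ops */

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);     /* reap the completion */
    if (cqe->res > 0) {
        buf[cqe->res] = '\0';
        printf("read %d bytes: %s", cqe->res, buf);
    }
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}
```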
Agree with everything except the RISC vs CISC bit. Modern ARM isn't really RISC, and x86 gets translated into RISC-like micro-instructions anyway. To the extent that ARMv8 has an advantage, I think it's due to being a nearly clean design from 10 years ago, not carrying the x86 legacy.
AnandTech's analysis corroborates what DeRock is saying here about the x86 variable length instructions being the limiting factor on how wide the decode stage can be.
The other factor is Apple is using a "better" process node (TSMC 5nm). I put it in quotes because Intel's 10nm and upcoming nodes may be competitive, but Intel's 14nm is what Apple gets to compete against today, right now.
> but Intel's 14nm is what Apple gets to compete against today, right now.
Intel's 10nm node is out, I'm typing this on one right now. It's competitive in single-core performance against what we've seen from the M1. Graphics and multi-core it gets beat though...
Or do you mean what Apple used to use? (edit: the following is incorrect) It's true Apple never used an Intel 10nm part.
EDIT: I was wrong! Apple has used an Intel 10nm part. Thanks for the correction!
I'm using a MacBook Pro with an Intel 10nm part in it. The 4 port 13" MBP still comes with one. I think the MacBook Air might have had a 10nm part before it went ARM, too.
There are still no 10nm parts for the desktop or the high-end/high-TDP laptops anyway afaik.
Tiger Lake is objectively the fastest thing in any laptop you can buy, regardless of whether its TDP is as high as others. You're right about the lack of desktop parts, though.
I'm typing this on a Tiger Lake, too, and it is very fast and draws modest power. But are they making any money on it? If they lose Apple as a laptop customer, how much does that hurt their business?
There was a lot of discussion on a related, recent post about Apple buying many of Intel’s most desirable chips. Will be interesting to see whether the loss of a high-end customer translates into more woes for Intel.
> Meanwhile with arm, you have constant instruction size (other than thumb, but that’s more straightforward than variable size),
That's only for 32-bit ARM; for 64-bit ARM, the instruction size is always constant (there's no Thumb/Thumb2/ThumbEE/etc). It wouldn't surprise me at all if Apple's new ARM processor is 64-bit only (being 64-bit only is allowed for ARM, unlike x86), which means the decoder doesn't have to worry about 2-byte Thumb instructions, nor about the older 32-bit ARM instructions (including the quirky ones like LDM/STM).
That would also explain why Apple can have wide decoders while their competitors can't: those competitors want to keep compatibility with 32-bit ARM, while Apple doesn't care.
From the Anandtech article[1] on the M1, I think this is referring to the reorder buffer, or ROB. Reorder buffers allow instructions to execute out-of-order, so maybe that's where the "OO" comes from.
Yes, I was going from memory and thought “out-of-order buffer” was a thing. I meant ROB. The point still stands, in that scaling these blocks is very difficult with variable instruction size.
Not a chip designer, but I think it could lead to a more efficient use of the functional units.
With more instructions decoded per clock and a larger reorder buffer, the core should be able to keep its units busier, i.e. not wasting energy without producing valuable output.
This efficiency gain of course needs to outweigh the consumption of the additional decoders. This part is easier with ARM as decoding x86 is complicated.
In addition to higher unit utilization, the increased parallelism should also be an advantage in the "race to sleep" power management strategy.
The manufacturing process is TSMC's, so probably not.
To me the innovation is using phone-type SoC technology.
A lot of the power (and die area) demand in modern chips is in the interface circuitry to drive the few centimetres of PCB trace between ICs.
Put the RAM in the same package as the CPU and you eliminate that interface. That gives you die space and power budget for big buffers and lots of parallel execution.
Interface transistors (and wires) are many times larger than the core compute logic transistors (and wires), so there is much more than a one-for-one gain.
Interesting that you mention page size; IMHO 4K was small even 10 years ago. Though of course the next bump on x86 is 2M (or 4M without PAE), which I think might be too big for most apps.
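For a sense of scale (entry count purely illustrative): a 128-entry L1 TLB covers 128 × 4 KB = 512 KB of address space with 4 KB pages, 2 MB with 16 KB pages, and 256 MB with 2 MB huge pages - the same silicon, 4x or 512x the reach before a miss.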
Apparently 32-bit ARM supports 4KB and 64KB pages plus 1MB sections, while the 64-bit translation granules are 4KB, 16KB and 64KB (I just googled that); Apple uses 16KB.
Also maybe all the x86 crud might be finally taking its toll (and not just the ISA but all the cruft that got on top of it for 40yrs+ like legacy buses and functionalities, ACPI, or even the overengineered mess that's UEFI)
There’s nothing wrong with ACPI - in fact it’s what’s going to enable ARM in the server space. Without something like ACPI, you need customized kernels and other crap to control system devices. ACPI is actually quite nice. And it replaced something (APM) that really was legacy cruft (dropping into real mode for power management?!! Ugh!)
EFI doesn’t seem that bad either. Maybe something like U-Boot would suit you better? But the reason we have those nice graphical BIOS setup utilities now is because of EFI.
It seems like a straightforward tradeoff of IPC vs scalability.
They are able to shovel transistors at the core because there are only 4 of them. AMD on the other hand is packing 64 cores into a package, so each core has to make do with less.
I believe my point is that the M1 seems to spend its gate budget where it really helps performance of software in practice, while x86 vendors are spending it somewhere else.
AMD in particular wasted an entire generation of Zen by having too few BTB entries. Zen1 performance on real software (i.e. not SPEC) was tragic, similar to Barcelona on a per-clock comparison. They learned this lesson and dramatically increased BTB entries on Zen2 and again for Zen3. But the question in my mind is why would you save a few pennies per million units by shaving off half the BTB entries? Doesn't make sense. They must have been guided by the wrong benchmark software.
I doubt it was about saving their pennies. Zen1 EPYC was a huge package with an expensive assembly & copious silicon area. But it was spread across 32 cores. A larger BTB probably had to come at the expense of something else.
What 'real software' are you thinking of? Anything in particular? Just curious, not looking to argue.
(sorry I changed my comment around the same time you replied)
Very large, branchy programs with very short basic blocks. This describes all the major workloads at Google, according to their paper at https://research.google/pubs/pub48320/
Is page size actually a noticeable problem for most people (say, MacBook users)? Linux has supported huge pages for years, and my impression is that it's still considered a niche feature for squeezing out the last percentage of performance. I don't think any popular distro enables it by default.
If you have a large program, putting its text on huge pages can make much more than a one percent difference. Any time your program spends on iTLB misses can be eliminated with larger pages.
Not just Linux: Windows also has large page support. Back in the day when CPU cryptocurrency mining was still a thing (and for some cryptocurrencies aimed at CPU mining it still is), some mining programs would ask you to enable large page support.
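On Windows the hoop is the "Lock pages in memory" privilege (SeLockMemoryPrivilege); once that's granted, large pages are just a flag on VirtualAlloc. A rough sketch of what those miners do, with the privilege setup omitted:

```c
#include <windows.h>
#include <stdio.h>

int main(void) {
    /* Large pages require SeLockMemoryPrivilege and an allocation size
       that is a multiple of the minimum large-page size. */
    SIZE_T min_large = GetLargePageMinimum();   /* typically 2 MiB on x86-64 */
    if (min_large == 0) {
        puts("large pages not supported");
        return 1;
    }

    void *p = VirtualAlloc(NULL, min_large,
                           MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
                           PAGE_READWRITE);
    if (!p) {
        /* Often ERROR_PRIVILEGE_NOT_HELD if the privilege isn't granted. */
        printf("VirtualAlloc failed: %lu\n", GetLastError());
        return 1;
    }

    VirtualFree(p, 0, MEM_RELEASE);
    return 0;
}
```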
Honestly? A lot of it has to do with the fact that they booked TSMC's entire 5nm production capability for some time.
So they are working on a smaller process node than their competitors, and they used their market position to ensure that they will be the only ones using it for a little while. That gives them a significant leg up.
Intel dropped the ball on 10nm and 7nm processes constantly for the last 5 years.
Skylake was the "tock"; then they couldn't "tick" to 10nm, so they made 14nm+, then 14nm++, then 14nm+++... Intel's been stuck on the same process for half a decade now.
Part of it is just that Apple spent a bunch of time hiring the right people and making the right acquisitions.
Another part of it is just that designing a CPU with good single-threaded performance is hard and costs a lot of die area, and no one else in the ARM world was incentivized to do this - who needs a smartphone with that kind of performance (I know, I know - all of us, now)? But Apple also makes iPads, which are kind of a bridge device - mobile and battery-powered, but also big and something people might want to do real work on.
I’m also going to say that I think Apple is still alone among the phone manufacturers in recognizing that smartphones are software platforms, not simply hardware devices. I think they have a special appreciation for the enabling power of a more capable CPU.
The 5 Watt iPhone CPU almost beats the 45 Watt AMD CPU. And the M1 will then simply obliterate the x86 incumbents.
Why? The real reason is that Apple had/has the real advantages: size, resources, money, strategy, know-how.
It's bigger than Intel. It has a better corporate culture than Intel, and it has a vertically integrated market that has been completely safe from any outside disruption for years and will be for years to come. It had the opportunity to do this switch, so it had the motivation to develop this capability.
Apple was able to hire the best semi designers plus benefit from the ARM platform.
AMD is still too small and relatively spread thin - trying to cover multiple segments of the general market - compared to Apple, which basically has this one chip in two versions (mobile and ultra mobile).
Neither had the underlying single enormous vertically integrated market slice.
Apple has and had the iPhone. Apple started using their own chips, the A5, about 10 years ago (in March 2011, starting with the iPad 2 - and they used ARM before that too). Since then Apple has optimized the “supply chain” in incremental steps, but when you are Apple every step can be almost revolutionary (and year after year Steve told the believers exactly that, and it's basically true - just not on the feature front, but on the hardware and systems level: e.g. Gorilla Glass for the phones, power efficiency, the oversized battery for the MBP, milling, machining, marketing, design, and so on).
They have access to a fab process that is ~2 full generations more advanced than what Intel is shipping most of its products on, and 64-bit ARM has one advantage over x86 (fixed-width instructions) which allows them to make a very wide core.
Of those two advantages, the process one is a lot more significant. Their chip loses in single-threaded performance to AMD's newest, which are still one node behind Apple, but appears to be more power-efficient. I fully expect that once AMD gets on the matching node they will take the crown back, even in a laptop power envelope.
My guess until reviewers get their hands on the thing is that they've got some pretty good R&D alongside the massive process-node advantage over Intel. AMD has already shown that there's a lot more performance in 7nm than we thought, and the M1 looks like it'll be on par with what you'd expect from Zen 4, so I wouldn't say they've got any advantage from the architecture (as you'd expect).
Samsung's chip division doesn't have a guaranteed buyer or the option to build an SoC specifically for a given product (even if you can usually bet on the next-generation Galaxy device as your target, you still have to design as if you were selling to the open market - and that means less freedom to, for example, throw in a lot of the memory/cache resources that have been the main differentiator of Apple's chips - which started from the same design as Exynos!)
Apple was actually the one designing those cores in the Exynos - up until and including the Exynos 4000. That was when Apple decided to sever ties with Samsung and stopped designing cores for them and Samsung had to go back to stock ARM designs with the Exynos 5000. That was why the 5000 series sucked so badly - Samsung lost access to those Apple/Intrinsity juiced cores. It’s also what prompted Samsung to open SARC in Austin.
I'm wrong about that - it was the Exynos 3110 that had the Apple designed core that also appeared in the A4. And after that, it was back to ARM's humble designs for Samsung.
Hummingbird in the original Galaxy S (this was before the rename to Exynos) and the A4 in the iPhone 4 stem from cooperation between Samsung and Apple (who previously used Samsung SoCs in the iPhone 3G and, I think, the 2G). The teams split before both phones reached maturity in design. From then on Samsung kept iterating on the Hummingbird design until, I think, quite recently - I'm not sure my Note 10+ doesn't still have cores derived from that design.
Graph conveniently omits AMD who now beat Intel using the same ISA but different process.