Why is Apple's M1 chip so fast? (erik-engheim.medium.com)
768 points by socialdemocrat on Nov 30, 2020 | 629 comments



Contrary to what has been said on Twitter, the answer to why the M1 is fast isn’t technical tricks; it’s Apple throwing a lot of hardware at the problem.

The M1 is really wide (8 wide decode) and has a lot of execution units. It has a huge 630 deep reorder buffer to keep them all filled, multiple large caches and a lot of memory bandwidth.

It is just a monster of a chip: well designed, balanced, and executed.

BTW this isn’t really new. Apple has been making incremental progress year by year on these processors for their A-series chips. It's just that nobody believed those Geekbench results showing that, in short benchmarks, your phone could be faster than your laptop. Well, it turns out that given the right cooling solution those benchmarks were accurate.

Anandtech has a wonderful deep dive into the processor architecture.

https://www.anandtech.com/show/16226/apple-silicon-m1-a14-de...

Edit: I didn’t mean to disparage Apple or the M1 by saying that Apple threw hardware at the problem. That Apple was able to keep power low with such a wide chip is extremely impressive and speaks to how finely tuned the chip is. I was trying to say that Apple got the results they did the hard way by advancing every aspect of the chip.


The answer of wide decode and deep reorder buffer gets much closer than the “tricks” mentioned in tweets. That still doesn’t explain how Apple built an 8-wide CPU with such deep OOO that operates on 10-15 watts.

The limit that keeps you from arbitrarily scaling up these numbers isn’t transistor count. It’s delay: how long it takes for complex circuits to settle, which determines the top clock speed. And it’s also power usage. The timing delay of many circuits inside a CPU scales super-linearly with things like decode width. For example, the delay in the decode stage itself scales quadratically with the width of the decoder: ftp://ftp.cs.wisc.edu/sohi/trs/complexity.1328.pdf (p. 15). The delay of the issue queues is quadratic in both the issue width and the depth of the queues. The delay of a full bypass network is quadratic in execution width. Decoding N instructions at a time also requires a register renaming unit that can rename that many instructions per cycle, and the register file must have enough ports to feed 2-3 operands to N different instructions per cycle. On top of that, big multi-ported register files, deep and wide issue queues, and big reorder buffers all tend to be extremely power hungry.
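
To give a feel for that quadratic term, here is a back-of-the-envelope count of forwarding paths in a full bypass network (a sketch loosely following the model in the paper linked above; "stages" is how many pipeline stages can hold a not-yet-written-back result, "srcs" is source operands per instruction, both hypothetical parameters):

    /* Every in-flight result (width * stages of them) must be forwardable to
       every source operand of every instruction issued that cycle
       (width * srcs of them), so the path count grows roughly as width^2.
       Illustrative only; real designs cluster and prune these paths. */
    unsigned bypass_paths(unsigned width, unsigned stages, unsigned srcs) {
        return (width * stages) * (width * srcs);
    }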

On the flip side, the conventional wisdom is that most code doesn’t have enough inherent parallelism to take advantage of an 8-wide machine: https://www.realworldtech.com/shrinking-cpu/2/ (“The first sign that the party was over was diminishing returns from wider and wider superscalar designs. As CPUs went from being capable of executing 1, to 2, to 4, to even 6 instructions per cycle, the percentage of cycles during which they actually hit their full potential was dropping rapidly as both a function of increasing width and increasing clock rate.”). At the very least, such designs tend to be very application-dependent. Branch-heavy integer code like compilers tends to perform poorly on such wide and slow designs. The M1, by contrast, manages to come close to Zen 3, which is already a high-ILP CPU to begin with, despite a large clock speed deficit (3.2 GHz versus 5 GHz). And the performance seems to be robust, doing well on everything from compilation to scientific kernels. That’s really phenomenal and blows a lot of the conventional wisdom out of the water.
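
To make the ILP point concrete, here is a toy sketch (illustrative only, not a benchmark): the first loop is one long dependency chain that no amount of decode width can speed up, while the second exposes four independent chains that a wide out-of-order core can keep in flight simultaneously.

    #include <stddef.h>

    /* Serial dependency chain: each iteration needs the previous result,
       so a 1-wide and an 8-wide core retire it at roughly the same rate. */
    double serial_chain(const double *a, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s = s * 1.0000001 + a[i];
        return s;
    }

    /* Four independent accumulators: a wide core can issue several of
       these additions per cycle because they don't depend on each other. */
    double independent_chains(const double *a, size_t n) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (size_t i = 0; i + 4 <= n; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        return s0 + s1 + s2 + s3;
    }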

An insane amount of good engineering went into this CPU.


> An insane amount of good engineering went into this CPU.

I agree, but let's not overblow the difficulties either.

> For example, the delay in the decode stage itself scales quadratically with the width of the decoder.

That could be irrelevant for small enough numbers, and ARM is easier to decode than x86, so this can very well be dominated by other things. What you cite seems to be only about decoding the logical registers going into the renaming structures, and even for just that tiny part it says: "We found that, at least for the design space and technologies we explored, the quadratic component is very small relative to the other components. Hence, the delay of the decoder is linearly dependent on the issue width."

> The delay of a full bypass network is quadratic in execution width.

Maybe if that's a problem don't do a full bypass network.

> dropping rapidly as both a function of increasing width and increasing clock rate

Good thing that the clock rate is not too high then :p

More seriously, the M1 can keep the beast fed probably because everything is dimensioned correctly (and yes, also because the clocks are not too high; but if you manage to make a wide and slow CPU that actually works well, I don't see why you would want to push the frequency much higher, given that you would quickly consume power like crazy and there is only limited headroom above 3.2GHz anyway). It obviously helps to have a gigantic OOO. So I don't really see where there is so much surprise, esp. since we saw the progression in the A series.

To finish, TSMC 5nm probably does not hurt either. The competitors are on bigger nodes and have smaller structures. Coincidence? Or just how it has worked for decades already.


It's not completely groundbreaking, but painting it as an outgrowth of existing trends doesn't give Apple enough credit. The challenges of scaling wider CPUs within available power budgets are widely accepted: https://www.cse.wustl.edu/~roger/560M.f18/CSE560-Superscalar... (for "high-performance per watt cores," optimal "issue width is ~2"). Intel designed an entire architecture, Itanium, around the theory that OOO scaling would hit a point of diminishing returns. https://www.realworldtech.com/poulson ("Many of them were convinced that dynamic instruction scheduling and out-of-order execution would ultimately prove to be too complex and power hungry."). It is also well accepted that we are hitting limits on the ability to extract instruction-level parallelism: https://docencia.ac.upc.edu/master/MIRI/PD/docs/03-HPCProces... ("There just aren’t enough instructions that can actually be executed in parallel!"); https://compas.cs.stonybrook.edu/~nhonarmand/courses/sp16/cs... ("Hardly more than 1-2 IPC on real workloads").

Apple being able to wring robust performance out of an 8-wide 3.2 GHz design, on a variety of benchmarks, is impressive and unexpected. For example, the M1 outperforms a Ryzen 5950X by 15%. https://www.anandtech.com/show/16252/mac-mini-apple-m1-teste.... Zen 3 is either 4- or 8-wide decode (depending on whether you hit the micro-op cache) and boosts to 5 GHz. It beats the 10900k, a 4+1-way design that boosts to 5.1 GHz, by 25%. The GCC subtest, meanwhile, is famous for being branch-heavy code with low parallelism. Apple extracting 80% more IPC from that test than AMD's latest core (which is already a very impressive, very wide core to begin with!) is very unexpected.

A lot of the conventional wisdom is based on assumptions about branch prediction and memory disambiguation, which have major impacts on how much ILP you can extract: http://www.cse.uaa.alaska.edu/~afkjm/cs448/handouts/ILP-limi.... To do so well, Apple must be doing something very impressive on both fronts.


The i7-1165G7 is within 20% of the M1 on single-core performance. The Ryzen 4800U is within 20% on multi-core performance. Both are sustained ~25W parts similar to the M1. If you turned x64 SMT/AVX2 off and normalized for cores (Intel/AMD 4/8 vs Apple 4+4), on-die cache (Intel/AMD 12MB L2+L3 vs Apple 32MB L2+L3) and frequency (Apple 3.2 vs AMD/Intel 4.2/4.7), you'd likely get very close results on the same 5nm or 7nm-equivalent process. Zen 3 with 2666 vs 3200 RAM alone is about a 10% difference; the M1 uses 4266 RAM IIRC.

TBH, laptop/desktop-level performance is indeed very nice to see out of ARM, after a few years of false starts by a few startups and Qualcomm. Apple designed a wider core and deserves credit for it, but wider cores have been a definitive trend starting with the Pentium M vs the Pentium 4. There is a trade-off here for die area IMO: AMD/Intel have AVX2 and even AVX512 and SMT on each core, and narrower cores (with smaller structures, higher frequency); Apple has wider cores (larger structures, lower frequency, higher IPC). It's not that simple, but it kind of is if you squint a bit.


The i7-1165G7 boosts to 4.7 GHz, 50% higher than the M1. A 75% uplift in IPC (20% more performance at 2/3 the clock speed) compared to Intel’s latest Sunny Cove architecture is enormous. Especially since Sunny Cove is itself the biggest update to Intel’s architecture since Sandy Bridge a decade ago.
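
Back-of-the-envelope version of that figure, assuming the ~20% single-core gap mentioned above and the quoted clocks (rough, hypothetical round numbers):

    #include <stdio.h>

    int main(void) {
        double perf_ratio  = 1.20;       /* M1 ~20% faster single-core        */
        double clock_ratio = 4.7 / 3.2;  /* i7-1165G7 boost clock / M1 clock  */
        /* Relative IPC = relative performance divided by relative clock,
           i.e. perf_ratio * (Intel clock / M1 clock). Prints roughly 76%.    */
        printf("IPC uplift: ~%.0f%%\n", (perf_ratio * clock_ratio - 1.0) * 100.0);
        return 0;
    }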


Like I said, this is absolutely a die-size tradeoff IMO. That 75% IPC gain is only around a ~20% difference in Geekbench at similar sustained power levels. If you want AVX2/512+SMT, a slightly narrower core of realistically 6+ wide (up to 8-wide out of the uOP cache) is an acceptable tradeoff. We have seen Zen 3 go wider than Zen 1/2[1], so wider x64 designs with AVX/SMT should be coming, but this is the squinting part with TSMC 5nm vs 7nm.

[1] https://www.anandtech.com/show/16214/amd-zen-3-ryzen-deep-di...


Intel’s 10nm is equivalent to TSMC’s 7nm, so we’re just talking one generation on the process side. I don’t think you can chalk a 75% IPC gain to a die shrink. That’s a much bigger IPC uplift than Intel has achieved from Sandy Bridge to Sunny Cove, which happened over 4-5 die shrinks.

The total performance gain, comparing a 4.7 GHz core to a 3.2 GHz core, is 20%. But there is more to it than bottom line. The conventional wisdom would tell you that increasing clock speed is better than making the core wider because of diminishing returns to chasing IPC. Intel has followed the CW for generations: it has made cores modestly wider and deeper, but has significantly increased clock speed. Intel doubled the size of the reorder buffer from Sandy Bridge to Sunny Cove. Intel increased issue width from 5 to 6 over 10 years.

If your goal was to achieve a 20% speed-up compared to Sunny Cove, in one die shrink, the CW would be to make it a little wider and a little deeper but try to hit a boost clock well north of 5 GHz. It wouldn’t tell you to make it a third wider and twice as deep at the cost of dropping the boost clock by a third. Apple isn’t just enjoying a one-generation process advantage, but is hitting a significantly different point in the design space.


Superscalar vs super-pipelining isn't new. If there's no magic, then going a third wider would exactly offset dropping the boost clock by a third with perfect code. With SMT off, I get 25-50% more performance on single-threaded benchmarks; that's because a thread gets full access to 50% more decode/execution units in the same cycle. It's not that simple again, but that's probably the simplest example.

The M1 is definitely a significantly different point in the design space. Intel is also doing big/little designs with Lakefield, but it's still a bit early to see where that goes for x64. I don't think Intel/AMD have specifically avoided going wider as fast as Apple; AVX/AVX2/AVX512 probably take up more die-area than going 1/3 wider, and that's what they've focused on with extensions over the years. If there is an x64 ISA limitation to going wider, we'll find out, but that's highly unlikely IMO.


> Superscalar vs super-pipelining isn't new. If there's no magic, then a third wider would likely exactly decrease the boost clock by a third with perfect code.

It's not new, but it's surprising. You're correct that going a third wider at the cost of a third of clockspeed is a wash with "perfect code" but the experience of the last 10-20 years is that most code is far from perfect: https://www.realworldtech.com/shrinking-cpu/2/

> The first sign that the party was over was diminishing returns from wider and wider superscalar designs. As CPUs went from being capable of executing 1, to 2, to 4, to even 6 instructions per cycle, the percentage of cycles during which they actually hit their full potential was dropping rapidly as both a function of increasing width and increasing clock rate. Execution efficiency (actual instruction execution rate divided by peak execution rate) dropped with increasing superscalar issue width because the amount of instruction level parallelism (ILP) in most programs is limited.... The ILP barrier is a major reason that high end x86 MPUs went from fully pipelined scalar designs to 2-way superscalar in three years and then to 3-way superscalar in another 3 years, but have been stuck at 3-way issue superscalar for the last nine years.

Theoretical studies have shown that higher ILP is attainable (http://www.cse.uaa.alaska.edu/~afkjm/cs448/handouts/ILP-limi...) but the M1 suggests some really notable advances in being able to actually extract higher ILP in real-world code.

I agree there's probably no real x86-related limitation to going wider, if you've got a micro-op cache. As noted in the study referenced above, I suspect it's the result of very good branch prediction, memory disambiguation, and an extremely deep reorder window. Each of those is an engineering feat. Designing a CPU that extracts 80% more ILP than Zen 3 in branch-heavy integer benchmarks like SPEC GCC is a major engineering feat.


The M1 is a 10W part, no? I would kill to see the 25W M-series chip.

Oh and the 10W is for the entire SOC, GPU and memory included.


Nope. Anandtech measured 27W peak on M1 CPU workloads with the average closer to 20W+[1].

The Ryzen 4800U and i7-1165G7 also have comparable GPUs (and TPU+ISP for the i7) within the same ~15-25W TDP. The Intel i7-1165G7's average TDP might be closer to ~30W because of its 4.7GHz boost clock, but it's still comparable to the M1.

The i7-1165G7 and 4800U have a few laptop designs with soldered RAM. You can get 17hrs+ of video out of a 4800U laptop with a 60Wh battery[2]. Also comparable with i7-1065G7/i7-1165G7 at 15hrs+/50Wh.

[1] https://www.anandtech.com/show/16252/mac-mini-apple-m1-teste...

[2] https://www.pcworld.com/article/3531989/battery-life-amd-ryz...


Wasn’t 27W for the whole Mac Mini machine using a meter at the wall plug? So that includes losses in the power supply and ssd and everything else outside the chip that uses a bit of juice whereas the AMD tdp is just the chip. I thought Anandtech said there was currently no reliable way to do an ‘apples to apples’ tdp comparison?

Edit: quote from anandtech:

“As we had access to the Mac mini rather than a Macbook, it meant that power measurement was rather simple on the device as we can just hook up a meter to the AC input of the device. It’s to be noted with a huge disclaimer that because we are measuring AC wall power here, the power figures aren’t directly comparable to that of battery-powered devices, as the Mac mini’s power supply will incur a efficiency loss greater than that of other mobile SoCs, as well as TDP figures contemporary vendors such as Intel or AMD publish.”


The M1 doesn’t use 24W, it uses 12-16 watts. https://www.anandtech.com/show/16252/mac-mini-apple-m1-teste... (CPU, GPU + DRAM combined)


I have the device with me, and on full load on both CPU and GPU it can go up to 25W, but for most use cases I see the whole SoC hovering around 15W.


M1 is a 20W (max) CPU and a 40W SoC (whole package max).

However, in most intense workloads it doesn't go near 40W, more like ~25W under high load.

Still incredibly impressive.


18W CPU peak power, 10W is their power efficiency comparison point.


The M1 doesn’t use 24W, it uses 12-16 watts. https://www.anandtech.com/show/16252/mac-mini-apple-m1-teste... (CPU, GPU + DRAM combined)

https://www.youtube.com/watch?v=_MUTS7xvKhQ&list=PLo11Rczpzu... Check this out, 12.5W power consumption for the M1 CPU vs. 68W CPU power consumption for the Intel i9 CPU of the 16” Macbook Pro, and yet the M1 is 8% faster in Cinebench R23 in multi-core score.


My naive assumption would be that 4c big + 4c little would perform better than 4c/8t all other things being equal (and assuming software was written to optimize for each design respectively). Also no reason you can't have 4c/8t big + 4c/8t little too.


Apple’s 4big + 4LITTLE config performs better than Intel’s 8c/16t mobile chips right now.


> For example, the delay in the decode stage itself scales quadratically with the width of the decoder: ftp://ftp.cs.wisc.edu/sohi/trs/complexity.1328.pdf (p. 15).

That's a decoder for a single field, where width of the field is the parameter it scales by. That would be instruction size or smaller, and instructions don't change size depending on how many you decode at once.

And logically once you separate the instructions you can decode in parallel in fixed time, and if all your instructions are 4 bytes then it takes no circuitry to separate them.
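
A toy way to see it (a sketch; length() below stands in for a real x86 length decoder): with fixed 4-byte instructions every decoder knows its start offset up front, whereas with variable-length instructions each start depends on the previous instruction's length, so finding boundaries is inherently serial unless you decode speculatively at many offsets.

    #include <stdint.h>
    #include <stddef.h>

    /* Fixed-width ISA (e.g. AArch64): decoder k simply reads bytes [4k, 4k+4). */
    static size_t fixed_start(size_t k) {
        return 4 * k;
    }

    /* Variable-width ISA: the start of instruction k is a serial computation,
       because each length is only known after (partially) decoding it. */
    static size_t variable_start(const uint8_t *code, size_t k,
                                 size_t (*length)(const uint8_t *)) {
        size_t off = 0;
        for (size_t i = 0; i < k; i++)
            off += length(code + off);   /* each step depends on the last */
        return off;
    }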

Also: "We found that, at least for the design space and technologies we explored, the quadratic component is very small relative to the other components. Hence, the delay of the decoder is linearly dependent on the issue width."


Your first source does not support your statement.

While there is theoretically a quadratic component, in their words:

> We found that, at least for the design space and technologies we explored, the quadratic component is very small relative to the other components. Hence, the delay of the decoder is linearly dependent on the issue width.


> That still doesn’t explain how Apple built an 8-wide CPU with such deep OOO that operates on 10-15 watts.

Well, because it doesn't; it's ~25 watts. And also because it runs at just 3GHz. You'll see similar power numbers from x86 CPUs at 3GHz, too. The M1's multicore performance vs. the 4800U and 4900HS demonstrates this nicely.


I haven’t read the linked AnandTech article yet, but is there a clear answer why Apple was able to defy common comp arch wisdom (M1 has wider decode which works fine for various applications/code)?


Check the parent article that explains it well. Apple didn’t defy common comp arch wisdom... they applied it.

The reason why it is hard for Intel/AMD to do the same is not a lack of engineering geniuses (I’m sure they have plenty), but the support for a legacy ISA, and a particular business model.

What Apple defies is common business survival instincts: why spend so much on R&D for a chip when there are market leaders that are impossible to beat? The answer seems obvious now... but it probably wasn’t obvious when Apple acquired PA Semi in 2008.


> What Apple defies is common business survival instincts: why spend so much on R&D for a chip when there are market leaders that are impossible to beat?

Having your own silicon means an upstream supplier cannot turn the lights off on you (see Samsung, a company keeping a quarter of its host country's GDP hostage). I believe the immediate goal of the PA Semi purchase was that.

> The answer seems to be obvious now... but probably it wasn’t obvious when Apple acquired PA Semi in 2008.

PA Semi was clearly a diamond in the rough. It took great insight to single out PA Semi, because on the surface it was a very barebones SoC sweatshop, but in reality PA was the last of the Mohicans of US chip design.

PA was the place where non-Intel IC engineers went after the severe carnage of the microchip businesses of US tech giants like Sun, IBM, HP, DEC, SGI, etc.

It was a star team which back then was toiling at router box SoCs.


Just to quantify your adjectives, per the Anandtech article:

> The M1 is really wide (8 wide decode)

In contrast to x86 CPUs which are 4 wide decode.

> It has a huge 630 deep reorder buffer

By comparison, Intel Sunny/Willow has 352.


Zen 2 has 8-wide issue in many places, and Ice Lake moves up to 6-wide. Intel/AMD have had 4-wide decode and issue widths for 10 years and I'm glad they're moving to wider machines.

Edited "decode" to "issue" for clarity.


Could you explain what you mean by "8-wide decode in many places"? How is that possible? Isn't instruction decoding kinda always the same, i.e. always 4-wide or always 8-wide, but not sometimes this and sometimes that?

All sources I could find say it is 4-wide, so I'd also be interested if you could perhaps give a link to a source?


https://en.wikichip.org/wiki/amd/microarchitectures/zen_2

The actual instruction decoder is 4-wide. However, the micro-op cache has 8-wide issue, and the dispatch unit can issue 6 instructions per cycle (and can retire 8 per cycle to avoid ever being retire-bound). In practice, Zen 2 generally acts like a 6-wide machine.

Oh, on this terminology: x86 instructions are 1-15 bytes wide (averaging around 3-4 bytes in most code). n-wide decode refers to decoding n instructions at a time.


Thanks for the link! Yeah, those are basically the numbers I also found, although the number of instructions decoded per clock cycle is a different metric from the number of µops that can be issued, so that feels a bit like moving the goalposts.

But, fair enough, for practical applications the latter may matter more. For an apples-to-apples comparison (pun not intended) it'd be interesting to know what the corresponding number for the M1 is; while it is ARM and thus RISC, one might still expect that there can be more than one µop per instruction, at least in some cases?

Of course then we might also want to talk about how certain complex instructions on x86 can actually require more than one cycle to decode (at least that was the case for Zen 1) ;-). But I think those are not that common.

Ah well, this is just intellectual curiosity, at the end of the day most of us don't really care, we just want our computers to be as fast as possible ;-).


I have usually heard the top-line number as the issue width, not the decode width (so Zen 2 is a 6-wide issue machine). Most instructions run in loops, so the uop cache actually gives you full benefit on most instructions.

On the Apple chip: I believe the entire M1 decode path is 8-wide, including the dispatch unit, to get the performance it gets. ARM instructions are 4 bytes wide, and don't generally need the same type of micro-op splitting that x86 instructions need, so the frontend on the M1 is probably significantly simpler than the Zen 2 frontend.

Some of the more complex ops may have separate micro-ops, but I don't think they publish that. One thing to note is that ARM cores often do op fusion (x86 cores do too), but with a fixed issue width there are very few places where this would move the needle. The textbook example is fusing DIV and MOD into one two-input, two-output instruction (the x86 DIV instruction computes both, but the ARM DIV instruction does not).
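
For instance, the usual candidate is the quotient/remainder pair below (a sketch): x86 gets both results from a single DIV, while on AArch64 compilers typically emit an SDIV followed by an MSUB to recover the remainder, which a fused two-output op could collapse.

    /* Quotient and remainder of the same operands: one DIV on x86,
       typically SDIV + MSUB (r = n - q * d) on AArch64. */
    void divmod(long n, long d, long *q, long *r) {
        *q = n / d;
        *r = n % d;
    }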


x86 doesn't have fixed-width instructions. Depending on the mix, you may be able to decode more instructions. And if you target common instructions, you can get a lot of benefit in real-world programs.

Arm is different but probably easier to decode. So you can widen the decoder.


This I think is the real answer; for a long time people were saying that "CISC is just compression for RISC, making virtue of necessity", but it seems like M1 serves as a good counterexample where a simpler ISA is scaled up to modern transistor counts (and given exclusive access to the world's best manufacturing, TSMC 5nm).


Considering that x86 is less dense than any RISC ISA, the "compression" argument behind CISC falls apart. No surprise a denser, trivial to decode ISA does better.


You have a source for that? The first google result I found for research on that shows it as denser than almost every RISC ISA [1]. It’s just one study and it predates ARM64 fwiw though.

[1] https://www.researchgate.net/profile/Sally_McKee/publication...


That paper uses no actual benchmarks, but rather grabbed a single system utility and then hand-optimized it; SPEC and geekbench show x86-64 comes in well over 4 bytes on average.


Sure, I never claimed it to be the be-all-end-all, just the only real source I could find. Adding "SPEC" or "geekbench" didn't really help.

Doing a little more digging, I have also found this [1], which claims "the results show that the average instruction length is about 2 to 3 bytes". On the other hand, this [2] finds that the average instruction length is 4.25 bytes.

Bytes per instruction doesn't really say anything useful for code density when talking about RISC vs. CISC though, since (arguably) the whole idea is that individual CISC instructions are supposed to do more than individual RISC instructions. A three instruction CISC routine at five bytes each is still a win over a four instruction RISC routine at four bytes each. Overall code size is what actually matters.

[1] https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.40...

[2] http://www.cs.unc.edu/~porter/pubs/instrpop-systor19.pdf


But there's more work being done per average x86_64 instruction due to RMW ops. Hence why they just look at an entire binary.


OK, I could see how one could implement a variable width instruction decoder (e.g. "if there are 8 one-byte instructions in a row, handle them, otherwise fallback to 4-way decoding" -- of course much more sophisticated approach could be made).

But is this actually done? I honestly would be interested in a source for that; I just searched again and could find no source supporting this (but of course I may simply not have used the right search terms, I would not be surprised by that in the least). E.g. https://www.agner.org/optimize/microarchitecture.pdf#page216 makes no mention of this and describes AMD Zen (version 1; it doesn't say anything about Zen 2/3) as 4-wide.

I did find various sources which talk about how many instructions / µops can be scheduled at a time, and there it may be 8-way, but that's a completely different metric, isn't it?


As a historical note, the Pentium P6 uses an interesting approach. It has three decoders but only one of them can handle "complex macroinstructions" that require micro-operations from the ROM. If a limited-functionality decoder got the complex instruction, the instruction gets redirected to another decoder the next cycle.

As far as variable-length instructions, a separate Instruction Length Decoder sorts that out before decoding.

Ref: "Modern Processor Design" p268.


> As far as variable-length instructions, a separate Instruction Length Decoder sorts that out before decoding.

And how fast is that able to run on x86? How many instructions can that process at once, compared to an alternate universe where that circuit has the same transistor and time budget but only has to look at the first four bits of an instruction?


Probably should check Agner's guide, but the P6 is still the rough blueprint for everything (except the P4) that Intel has done since.


They were still doing this in the Nehalem timeframe (possibly influence from Hinton?)


I guess this requires extending the architecture to 8-wide instructions when it makes sense.


What do you mean with "8-wide instructions", and what does that have to do with multiple decoders?


So Intel and AMD are capable of building a chip like this, but the ambitious size meant it was more economically feasible for Apple to build it themselves?

(Not a hardware guy)


Neither Intel nor AMD is capable of doing this, for a very basic reason: there is no market for it. You can't just release a CPU for which there is no operating system.

Apple can pull it off because they already own the entire stack, from hardware to operating system to cloud services, and they can swap out a component like the CPU for a different architecture and release a new version of the OS that supports it.

By creating its own CPU, Apple replaces a part of the stack that was owned by Intel with its own, which strengthens its position even if it didn't improve performance at all.

Apple is invulnerable to other companies copying the CPU and creating their own, because they are not really competition here. Apple sells an integrated product of which the CPU is just one component.


That's not entirely true. Windows ARM64 can execute natively on the M1 (through QEMU for hardware emulation, but the instructions execute natively). Intel/AMD could produce an ARM processor that could find a market. They also have a close partnership with Microsoft and I have to believe there would be a path forward there. They could also target Linux.

I haven't seen enough evidence yet though that ARM is the reason the M1 performs so efficiently. It may just be the fact that it is on a cutting-edge 5nm process with a SoC design. I'm not even sure the PC/Windows market would adopt such a chip since it lacks upgradability. It's really nice to be able to swap out RAM and/or GPU. Heck, even AMD has been retaining backwards compatibility with older motherboards since it's been on one socket for a while.

I think for laptops/mobile this makes a lot of sense. For a desktop though I honestly prefer AMD's latest Ryzen Zen 3 chips.


> It may just be the fact that it is on a cutting edge 5nm process with a SoC design.

Yup. It's fast because it's got short distances to memory and everything else. Shorten the wire to memory cells and not only can you make signaling faster and run the memory at faster clock speed but you can do it with less accessory hardware for signal conditioning and error correction, which saves complexity and power. Using shorter paths to memory also lets you use lower voltages, which means less waste heat and less need to spend effort on cooling and overall power savings for the chip.

Shortening the wire also lowers latency between all the various on board devices, so communicating everywhere is faster.

There's a reason that manufacturers used to be able to "speed" up a chip by just doing a die shrink - photographically reducing chip masks to make them smaller, which also made them faster with relatively small amounts of work.

As the late Adm. Grace Hopper put it, there are ever so many picoseconds between the two ends of a wire.


> Shortening the wire also lowers latency between all the various on board devices, so communicating everywhere is faster.

A maximum of a few nanoseconds. Not much in comparison to overall memory system latency.

> Shorten the wire to memory cells and not only can you make signaling faster and run the memory at faster clock speed but you can do it with less accessory hardware for signal conditioning and error correction, which saves complexity and power.

You cannot run away from that with just shorter PCB distance. The circuitry for link training is mandated by the standard.

You will need a redesigned memory standard for that.


Until the late 90s, on-chip wire delays were something we just didn't care much about; speed was limited by gate capacitance, and we got speedups when we shrunk the gate sizes on transistors. After the mid 90s, RC delays in wires started to matter (not speed-of-light delays, but how fast you can shuffle electrons in there to fill up the C). Soon after it got worse, because wire RC delays don't scale perfectly with shrinks due to edge effects. This was addressed in a number of ways: high speed systems reduced the R by switching from Al wires to Cu, and tools got better at modeling those delays and doing synthesis and layout at (almost) the same time.


Intel/AMD could produce an ARM processor that could find a market.

Intel did have an ARM processor line, and it did have a market. They acquired the StrongARM line from Digital Equipment and evolved it into the XScale line. What Intel didn't want was for something to eat into its x86 market, and Windows/ARM didn't exist. So they evolved ARM in a different direction than Apple later did. It was very successful in the high-performance embedded market.


"It was very successful in the high-performance embedded market."

As long as you don't define that market as "billions of mobile smartphones".

I remember StrongArms in PDAs back in the early 2000s.

They should have had processors ready for the smartphone, but IIRC they kept pushing x86 on phones.


Fair point. I forgot about the PXA line. I suspect, however, more of the IOP & IXP embedded processors were sold.


>AMD could produce an ARM processor that could find a market.

They did and it couldn't find a market.


IIRC the chip they sort-of released was originally meant for Amazon, but it missed targets wildly, leading Amazon to do one on their own.

Lisa Su put the kibosh on K12 for focus reasons, given how well Zen turned out it was the right call at least for now.


> it lacks upgradability. It's really nice to be able to swap out RAM and/or GPU.

Honestly, it doesn't. You just swap it for a new one and sell the old one.

When you have 8GB, you pay for another 8GB and end up with 16.

In this case, you just sell your SoC with 8GB and buy another SoC with 16GB. You'll only pay the difference.

This is pretty much how upgrading a relatively recent phone works, too.


Depreciation means you won't pay out only the difference, in most cases.


This is kinda where Apple products thrive. They drop in price very little, and have, ultimately, long lives.

iPhone 7s and iPhone 8s are still great low-end devices, and they reach the right market by being resold by people getting a newer one.

I don't see why M1 laptops would be an exception.


> "Apple can pull it off because they already own entire stack from hardware to operating system to cloud services and the can swap out a component like CPU for a different architecture and release new version of OS that supports it."

Note that this is the same model that Sun Microsystems, DEC, HP, etc. had and it didn't work out for them.

I'd venture to say that it currently only works out for Apple because Intel has stumbled very, very badly and TSMC has pulled ahead in fabbing process technology. If Intel manages to get back on its feet with both process enhancements and processor architecture (and there's no doubt they've had a wake up call), this strategic move could come back to bite Apple.


Only because of Linux, without it in the picture they would still be selling UNIX workstations.


Without Linux, they would've lasted longer but still would've lost out on price/performance against x86 and Intel's tick-tock cadence well before Intel's current stumble. We might all have wound up running Windows Server in our datacenters.


I doubt places like Hollywood would have migrated to Windows, given their dependency on UNIX variants like Irix.


I don't understand, how do these low level changes impact the OS exactly assuming that the ISA remains the same? It doesn't seem much more impactful than SSE/AVX and friends, i.e. code not specifically optimized for these features won't benefit but it'll still work and people can incrementally recompile/reoptimize for the new features.

After all that's pretty much how Intel has operated all the way since the 8086.

It's not like Itanium where everything had to be redone from scratch basically.


Are you referring to Apple's laptop x86 -> ARM change? Entertaining the idea that the ISA is significant here: Surely there would be a big market for ARM chips in the Android and server sides too, so this shouldn't be the only reason why other vendors aren't making competitive ARM chips. Apple's laptop volumes aren't that big compared to those markets.

And of course you have to factor in the large amount of pain that Apple is imposing on its user and ISV base in addition to the inhouse cost of switching out the OS and supporting two architectures for a long time in parallel. A vendor making chips for Android or servers wouldn't have to bear that.


> You can't just release a CPU for which there is no operating system

sure you can. That's what compilers are for.


Intel had such an attitude once before.


Donald Knuth said "The Itanium approach...was supposed to be so terrific—until it turned out that the wished-for compilers were basically impossible to write."[82]

https://en.wikipedia.org/wiki/Itanic

So they didn't have the needed compiler


Surely there were compilers, they just weren't as good (as optimizing) as Intel wished.


Of course there were itanic-targetting compilers, they worked, just not well enough to deliver on marketing promise (edit: and what the hardware was theoretically capable of).


I wonder how HP and Microsoft managed to port HP-UX and Windows without a compiler.


That's kind of the point.

Compilers existed just fine to do the porting, and solved that problem.

Intel's failure is that they were unable to solve a different problem because that compiler didn't exist, one that went well beyond merely porting.

In other words, "That's what compilers are for." is a perfectly fine attitude when those compilers exist, and a bad attitude when they don't exist. Porting is the former, making VLIW efficient is the latter.


GP probably means that you won't be able to sell it, even if there is a compiler. (Not true in the super embedded space, sure.)


It's not that it was more economical, but that at least some of this AMD and Intel would not benefit from due to the ISA: x64 instructions can be up to 15 bytes, so just finding 8 instructions to decode would be costly, and I assume Intel and AMD think more costly than the gains from more decoders (you couldn't keep them fed enough to be worth it, basically).


I can't comment on the economics of it but I can comment on the technical difficulties. The issue for x86 cores is keeping the ROB fed with instructions - no point in building a huge OoO if you can't keep it fed with instructions.

Keeping the ROB full falls on the engineering of the front-end, and here is where CISC v RISC plays a role. The variable length of x86 has implications beyond decode. The BTB design becomes simpler with a RISC ISA since a branch can only lie in certain chunks in a fetched instruction cache line in a RISC design (not so in CISC). RISC also makes other aspects of BPU design simpler - but I digress. Bottom line, Intel and AMD might not have a large ROB due to inherent differences in the front-end which prevent larger size ROBs from being fed with instructions.

(Note that CISC definitely does have its advantages - especially in large code-footprint server workloads where the dense packing of instructions helps - but it might be hindered in typical desktop workloads)

Source: I've worked in front-end CPU micro-architecture research for ~5 years


How do you feel about RISC-V compact instructions? The resulting code seems to be 10-15% smaller than x86 in practice (25-30% smaller than aarch64) while not requiring the weirdness and mode-switching associated with thumb or MIPS16e.

Has there actually been much research into increasing instruction density without significantly complicating decode?

Given the move toward wide decoders, has there been any work on the idea of using fixed-size instruction blocks and huffman encoding?


I can't really comment on the tradeoffs between specific ISAs since I've mainly worked on micro-arch research (which is ISA agnostic for most of the pipeline).

As for the questions on research into looking at decode complexity v instruction density tradeoff - I'm not aware of any recent work but you've got me excited to go dig up some papers now. I suspect any work done would be fairly old - back in the days when ISA research was active. Similar to compiler front-end work (think lex, yacc, grammar etc..) ISA research is not an active area currently. But maybe it's time to revisit it?

Also, I'm not sure if Huffman encoding is applicable to a fixed-size ISA. Wouldn't it be applicable only in a variable size ISA where you devote smaller size encoding to more frequent instructions?


Fixed instruction block was referring to the Huffman encoding. Something like 8 or 16kb per instruction block (perhaps set by a flag?). Compilers would have to optimize to stay within the block, but they optimize for sticking in L1 cache anyway.

Since we're going all-in on code density, let's go with a semi-accumulator 16-bit ISA. 8 bits for instructions, 8 bits for registers (with 32 total registers). We'll split into 5 bits and 3 bits. 5-bits gives access to all registers since quite a few are either read-only (zero register, flag register) or write occasionally (stack pointer, instruction counter). The remaining 3 bits specify 8 registers that can be the write target. There will be slightly more moves, but that just means that moves compress better and seems like it should enforce certain register patterns being used more frequently which is also better for compression.

We can take advantage of having 2 separate domains (one for each byte) to create 2 separate Huffman trees. In the worst case, it seems like we increase our code size, but in more typical cases where we're using just a few instructions a lot and using a few registers a lot, the output size should be smaller. While our worst-case lookup would be 8 deep, more domain-specific lookup would probably be more likely to keep the depth lower. In addition, two trees means we can process each instruction in parallel.

As a final optimization tradeoff, I believe you could do a modified Huffman that always encodes a fixed number of bits (e.g. 2, 4, 6, or 8), which would halve the theoretical decode time at the expense of an extra bit on some encodings. It would be +25% for a 3-bit encoding, but only 16% for a 5-bit encoding (perhaps steps of 2, 3, 4, 6, 8). For even wider decode, we could trade off a little more by forcing the compiler to ensure that each Huffman encoding breaks evenly every N bytes so we can set up multiple decoders in advance. This would probably add quite a bit to compile time, but would be a huge performance and scaling advantage.

Immediates are where things get a little strange. The biggest problem is that the immediate value is basically random so it messes up encoding, but otherwise it messes with data fetching. The best solution seems to be replacing the 5-bit register address with either 5 bits of data or 6 bits (one implied) of jump immediate.

Never gave it too much thought before now, but it's an interesting exercise.
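
A sketch of how the fixed-step variant above could be decoded (everything here is hypothetical, just to make the idea concrete): restricting code lengths to 2/4/6/8 bits means a single 256-entry table, indexed by the next 8 bits of the stream, resolves both the decoded field and how many bits to consume, and keeping the opcode and register fields in separate streams with separate tables makes the two lookups independent of each other.

    #include <stdint.h>
    #include <stddef.h>

    typedef struct {
        uint8_t symbol;  /* decoded opcode or register field    */
        uint8_t nbits;   /* 2, 4, 6 or 8 bits actually consumed */
    } DecodeEntry;

    /* Generated per program (or per block) from the Huffman trees: every
       8-bit value sharing the same short prefix maps to the same entry. */
    static DecodeEntry opcode_table[256];
    static DecodeEntry reg_table[256];

    static uint8_t decode_field(const uint8_t *stream, size_t *bitpos,
                                const DecodeEntry *table) {
        size_t byte = *bitpos / 8, shift = *bitpos % 8;
        /* Peek the next 8 bits, which may straddle a byte boundary. */
        uint16_t window = (uint16_t)stream[byte] |
                          ((uint16_t)stream[byte + 1] << 8);
        DecodeEntry e = table[(uint8_t)(window >> shift)];
        *bitpos += e.nbits;
        return e.symbol;
    }

    /* One "instruction": an opcode field and a register field, each drawn
       from its own stream, so the two lookups can happen in parallel. */
    static void decode_insn(const uint8_t *ops, const uint8_t *regs,
                            size_t *op_bits, size_t *reg_bits,
                            uint8_t *opcode, uint8_t *reg) {
        *opcode = decode_field(ops, op_bits, opcode_table);
        *reg = decode_field(regs, reg_bits, reg_table);
    }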


Not necessarily. Samsung used to make custom cores that were just as large if not larger than Apple’s (amusingly the first of these was called M1).

Unfortunately, Samsung’s cores always performed worse and used significantly more power than the contemporary Apple cores.

Apple’s chip team has proven capable of making the most of their transistor budget, and there’s reason to believe neither Intel nor AMD could achieve Apple’s efficiency even if they had the same process, ISA, and area to work with.


> there’s reason to believe neither Intel nor AMD could achieve Apple’s efficiency even if they had the same process, ISA, and area to work with.

From what I have seen the only difference in efficiency is the manufacturing process. M1 consumes about as much power per core as a Ryzen core. AMD also has a mobile chip with 8 non heterogeneous cores that has around the same TDP as the M1.


> AMD also has a mobile chip with 8 non heterogeneous cores that has around the same TDP as the M1.

TDP is nowhere near actual load power use.


> and there’s reason to believe neither Intel nor AMD could achieve Apple’s efficiency even if they had the same process, ISA, and area to work with.

What's that reason?


> What's that reason?

Apple’s efficiency is based on a very wide and deep core that operates at a somewhat lower clock speed. Frequent branches and latency for memory operations can make it difficult to keep a wide core fully utilized. Moreover, wider cores generally cannot clock as high. That’s why Intel and AMD have chosen to pursue narrower cores that can clock near 5 GHz.

The maximum ILP that can be extracted from code can be increased with better branch prediction accuracy, larger out of order window size, and more accurate memory disambiguation: http://www.cse.uaa.alaska.edu/~afkjm/cs448/handouts/ILP-limi.... The M1 appears to have made significant advances in all three areas, in order for it to be able to keep such a wide core utilized.


What you write makes sense but it does not address why AMD and Intel could not do the same "even if they had the same process, ISA, and area to work with."


They wouldn’t have Apple’s IP relating to branch prediction, memory disambiguation, etc.


I think it's "faith".


Faith implies no data.

Why will Apple always out compete Intel and other non-vertically-integrated systems? Margins, future potential, customer relationship and compounding growth/virtuous cycle.

The margins logic is simple, iPhones and MacBooks make tons more money per unit compared with a CPU. Imagine if improving the performance of a CPU by 15% makes the demand increase by 1%. For Apple improving the performance of a CPU by 15% makes the demand increase by 1% for the whole iPhone or MacBook. For this reason alone, Apple can invest 2-5x more R&D into their chips than everyone else.

The future potential logic is more nuanced:

1. Intel's/whoever's 10-year vision is to build a better CPU/GPU/RAM/screen/camera, because their customers are the companies buying CPUs/GPUs/screens/cameras/RAM. They are focused on the metrics the market has previously used to measure success and build to optimize for those metrics, e.g. performance per dollar. Intel doesn't pay for the electricity in the datacenter, nor does it pay through its customers' complaints about battery life. RAM manufacturers aren't looking at Apple's products and asking, "do consumers even still replace RAM?" In other words, they are focused on "micro"-trends.

2. Apple's vision is to build the best product for customers. They look at "macro"-trends into the future and apply their personal preferences at scale. For example, do people even still need replaceable RAM? Will they want 5G in the future, or can we improve the technology to replace it with direct connections to a LEO satellite cluster?

The customer relationship logic:

Let's take one such example of a macro-trend: VR and other wearables. Apple is tracking these trends and can "bet on" them because it's in full control, whereas Nvidia, Intel, etc. typically don't want to "bet on" them because even if they are fully invested, their partners (which sell to consumers) might back out. Apple also isn't really "betting", because it has a healthy group of early adopters that trust Apple and will buy and try it even when a "better" product in the same market segment isn't purchased. Creating/retaining that customer relationship lets Apple over-invest in keeping heat (i.e. power) low, because it's thinking about the whole market segment that Apple's VR headset can start to compete in and collect more revenue from.

Compounding growth/virtuous cycle logic is also relatively simple:

Improving the metrics in any of these three pillars in turn improves the other pillars, i.e. a better customer relationship increases cash flow, which increases R&D funding, which 1. improves the product, improving the customer relationship; or 2. reduces costs, increasing margins, and loops back to increasing cash flow.


Read the article linked, it explains why Intel and AMD are unable to throw more decoders at the problem.


The problem is the market.

Windows only supports a single architecture, so they can't really deviate from that. Sure, Windows can switch (or, apparently, run on ARM), but because Windows applications are generally distributed as binaries, lots of apps wouldn't work.

Linux users would have far fewer issues, and would be a great clientele for a chip like this, but probably too niche a market, sadly.


Windows only a single architecture

People forget that at launch Windows NT ran on MIPS & DEC Alpha in addition to x86. The binary app issue was a killer for the alternative archs.


Pragmatically, windows runs on a single Architecture.

Sure, there have been editions for other architectures, but they're more anecdotal experiments than something usable.

I can go out and buy several weird ARM or PPC devices and run Linux or OpenBSD on them, and run the same stuff I use on my desktop regularly (except Steam).

The fact that Windows relies on a stable ABI is its major anchor (while Linux only guarantees a stable API).


they're more anecdotal experiments than something usable

Wrong. Microsoft explicitly set out multi-architecture support as a design goal for NT. MIPS was the original reference architecture for NT. Microsoft even designed their own MIPS based systems to do it (called 'Jazz'). There was a significant market for the Alpha port, especially in the database arena, and it was officially supported through Windows 2000. They were completely usable, production systems sold in large numbers.

In the end, the market didn't buy into the vision. The ISVs didn't want to support 3+ different archs. Intel didn't like competition. The history is all pretty well documented should one take the time to learn it.


> Microsoft explicitly set out multi-architecture support as a design goal for NT

They set it as a design goal, but that doesn't mean they achieved it.


Except they did, though apparently you missed it. MIPS was the original port. Alpha was supported from NT 3.1 through Windows 2000, and only died because DEC abandoned the Alpha, not because Microsoft abandoned Alpha (it was important to their 64-bit strategy). Itanium was supported from Windows 2003 to 2008 R2. Support for Itanium only ended at the beginning of this year, once again because the manufacturer abandoned the chip.

I'm sure you can redefine "achieve" to exclude almost 17 years of support (for Itanium), if you're that committed to being right. Heck, x86-64 support has "only" been around for 20 years or so. Doesn't make it right.


Dec Alpha servers running NT in production used to be a thing.


Linux has ABI guarantees.


DEC Alpha NT could run X86 code thanks to FX!32, and faster than a core you could buy from Intel at the time.


Well, for some things. FX!32 was deficient for the apps people wanted, though. The NT 3.1-era Alphas didn't have byte-level memory operations, so things like Excel, Word, etc. all ran terribly, as did Emacs and X. I supported a lab of Alphas running Ultrix and they were dogs for anything interactive and fantastic for anything that was a floating-point application.


Yeah...anyone who thinks fx32 was faster in the real world than a native Intel core never actually ran it.


Indeed, but it didn't have anything that justified actually paying big bucks for an Alpha.


x86 instructions are variable length with byte granularity, and the length isn’t known until you’ve mostly decoded an instruction. So, to decode 4 instructions in parallel, AIUI you end up simultaneously decoding at maybe 20 byte offsets and then discarding all the offsets that turn out to be in the middle of an instruction.

So the Intel and AMD decoders may well be bigger and more power hungry than Apple’s.


Maybe they are, assuming there's sufficient area in the die for this.

They would likely still be massive power hogs.


But in one x86 instruction you often have more complex operations. Isn't that part of the reason why Sunny Cove has only 4 wide decode but still the decoders can yield 6 micro-ops per cycle? That single stat makes it look worse than it is in reality, I think.


The whole principle of CISC (v RISC) is that you have more information density in your instruction stream. This means that each register, cache, decode unit, etc. is more effective per unit area & time. Presumably, this is how the x86 chips have been keeping up with fewer elements in terms of absolute # of instructions optimized for. The obvious trade-off being the decode complexity and all the extra area that requires. One may argue that this is a worthwhile trade-off, considering the aggregate die layout (i.e. one big complicated area vs thousands of distributed & semi-complicated areas) and economics of semiconductor manufacturing (defect density wrt aggregate die size).


Except that the RISC-V ISA manages to reach information density on par with x86 via a simple, backwards-compatible instruction compression scheme. It eats up a lot of coding space, but they've managed to make it work quite nicely. ARM64 has nothing like that; even the old Thumb mode is dead.


You mention most of the big changes, except one. Assuming a random (but TLB friendly) pattern the M1 manages a latency of around 30-33ns to main memory, about half of what I've seen anywhere else. Impressive.

Maybe motherboards should stop coming with DIMMs and use the Apple approach to get great bandwidth and latency, and come in 16, 32, and 64GB varieties by soldering LPDDR4X onto the motherboard.


> Assuming a random (but TLB friendly) pattern the M1 manages a latency of around 30-33ns to main memory, about half of what I've seen anywhere else. Impressive.

Cite your number please. Anandtech measured M1's memory latency at 96ns, worse than either a modern Intel or AMD CPU: "In terms of memory latency, we’re seeing a (rather expected) reduction compared to the A14, measuring 96ns at 128MB full random test depth, compared to 102ns on the A14."

https://www.anandtech.com/show/16252/mac-mini-apple-m1-teste...

vs. "In the DRAM region, we’re measuring 78.8ns on the 5950X"

https://www.anandtech.com/show/16214/amd-zen-3-ryzen-deep-di...


Well my comment mentioned "random (but TLB friendly)", which I define as visiting each cache line exactly once, but only with a few (32-64) pages active.

The reason for this is that I like to separate out the cache latency to main memory from the TLB-related latencies. Certainly there are workloads that are completely random (thus the term cache thrashing), but there are also many workloads that only have a few tens of pages active. Doubly so under Linux, where if needed you can switch to huge pages if your workload is TLB thrashing.

For a description of the Anandtech graph you posted see: https://www.anandtech.com/show/14072/the-samsung-galaxy-s10p...

So the cache-friendly line is the "R per RV prange" one: for the 5950X that latency is on the order of 65ns, while the similar line for the M1 is dead on 30ns at around 128KB and goes up slightly in the 256-512KB range. Sadly they don't publish the raw numbers and pixel counting on log/log graphs is a pain. However, I wrote my own code that produces similar numbers.

My numbers are pretty much a perfect match, if my sliding window is 64 pages (average swap distance = 32 pages) I get around 34ns. If I drop it to 32 pages I get 32ns.

So the M1, assuming a relatively TLB friendly access pattern only keeping 32-64 pages active is about half the latency of the AMD 5950.

So compare the graphs yourself and I can provide more details on my numbers if still curious.
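
For the curious, here is a minimal sketch of the kind of measurement described above (simplified and hypothetical, not the exact code): build a pointer chain that visits every cache line exactly once but only jumps within a sliding window of pages, then time the dependent loads.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define LINE 128                     /* assumed cache-line size (bytes) */
    #define PAGE 16384                   /* assumed page size (bytes)       */
    #define SIZE (1 << 24)               /* 16 MB working set               */

    int main(void) {
        char *buf = malloc(SIZE);
        size_t lines = SIZE / LINE, window = (PAGE / LINE) * 64, i;
        size_t *order = malloc(lines * sizeof *order);

        for (i = 0; i < lines; i++) order[i] = i;
        /* Shuffle only within a sliding window (~64 pages) so the walk is
           random at line granularity but stays TLB friendly. */
        for (i = 0; i + 1 < lines; i++) {
            size_t span = window < lines - i ? window : lines - i;
            size_t j = i + (size_t)rand() % span;
            size_t t = order[i]; order[i] = order[j]; order[j] = t;
        }
        /* Each visited line stores the address of the next one. */
        for (i = 0; i + 1 < lines; i++)
            *(void **)(buf + order[i] * LINE) = buf + order[i + 1] * LINE;
        *(void **)(buf + order[lines - 1] * LINE) = buf + order[0] * LINE;

        void *p = buf + order[0] * LINE;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < lines; i++) p = *(void **)p;   /* dependent loads */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%.1f ns per load (%p)\n", ns / lines, p);
        free(order);
        free(buf);
        return 0;
    }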


This reminds me of the Amiga which had FastRAM and ChipRAM. It was all main memory, but the ChipRAM could be directly addressed by all the co-processor HW in the Amiga and the FastRAM could not.

It would be sort of interesting for Intel/AMD to do something like this where they have 16GB on the CPU and the OS has the knowledge to see it differently from external RAM.

Apple is going to have to do this for their real "Pro" side devices as getting 1TB on the Mx will be a non-starter. I would expect to see the M2 (or whatever) with a large amount of the basic on chip RAM and then an external RAM bus also.


Dunno, rumors claim 8 fast cores and 4 slow cores for the follow up. With some package tweaks I think they could double the ram to 32GB inside the package and leave the motherboard interface untouched.

I do wonder how many use cases actually need more than 32GB when you have low-latency NVMe flash with 5+ GB/sec of bandwidth. Especially with the special magic sauce I've seen mentioned related to hardware acceleration for compressing memory, or maybe it was compressing swap.

In any case, I'm not expecting the top of the line in the next releases. Step 1 is the low end (MBA, MBP 13", and mini). Step 2 is the mid range, rumored to be 14.1" and 16" MBPs in the first half of 2021. After that presumably the Mac Pro desktop and iMacs "within 2 years". Maybe step 3 will be a dual-socket version of step 2 with, of course, double the RAM.


So I am not one of those who screamed about the 16GB limit, which drew a huge number of comments here on HN. That being said, I do know people in the creative industry that have Mac Pros with 1.5TB of RAM and use all of it and hit swap. For a higher-end Pro laptop I would be happy with the 32GB range. However, in something like an 8K-display iMac (8K displays will be coming!) I would like to see 128GB, which will not fit on chip. They are going to have to go to a 2-level memory design at some point.


Maybe, or just move the memory chips from the CPU package to the motherboard.


Oh, that is very much something they could do, but given that they control the OS completely it would be very interesting to keep both on-chip and off-chip RAM and enable the software to understand which RAM is where, letting application developers tweak things. For example, let's say you are editing a very large 8K stream and you tell the app: hey, load this video into RAM. You could put the part that is in the current edit window in the on-chip RAM and feed the rest of the video in from the 2nd-level RAM as the editor moves forward. There are some interesting possibilities here.

Also, from the ASIC yield view it allows for some interesting binning of chips. Let's say the M2 has 32GB on chip plus an off-chip memory controller. They could use the ones that pass in the high end, then bin the ones that fail a memory test as 16GB parts for a laptop, etc. Part of keeping ASIC cost down is building the exact same thing and binning the chips into devices based on yield.


Unless you are doing some crazy synthesis or simulation, 32GB is plenty.

Maybe editing 4K (or in the future, 8K) video might need more?

My brother does a lot of complex thermal airflow simulation, and his workstation has 192GB of RAM, but that is an extreme use case.


8GB MacBook Air can easily handle 4K.

And it can handle 8K for 1-2 streams and starts to lag at 4+.


The Amiga was never multi-core. It has Vampire accelerators to replace the 68K chips, and PowerPC upgrade cards.

Apple, in making the M1 chip, is using some of the Amiga IP from circa 1985 that sped up the system by having the CPU and GPU etc. share memory. Amiga is shattered into different companies, but if they hadn't gone out of business they would have made an M1-type chip for their Amiga brand.


> Assuming a random (but TLB friendly) pattern the M1 manages a latency of around 30-33ns to main memory.

This, right here. It also helps that the L1D is a whopping 128 kB and only 3 cycles of load-use latency.


This is the first I've heard of this. This alone, plus unified memory in general, I bet explains 60% of the performance difference.


I wonder how they managed that.


Huge block size (128 bytes). Probably they are using POWER7-like scheduling (i.e. scheduling works on packs of instructions; that might explain the humongous 600+ entry ROB, since the wake-up logic certainly can't deal with that one-by-one at such low power). If you combine that with JIT and/or good compilers, you get this. I guess only Apple can pull this trick: they control the whole stack (and some key POWER architects are working there).


Big cache lines and big pages together. 16 kB pages combined with 128-byte lines means it can be 8-way set associative and still take advantage of a VIPT structure.
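If you want to check that arithmetic, here is a quick back-of-the-envelope (the cache parameters are the ones reported by Anandtech, not confirmed by Apple):

  /* VIPT sanity check: the set index + line offset must fit inside the
     page offset. Assumed parameters: 128 KB L1D, 128 B lines, 8 ways,
     16 kB pages. */
  #include <stdio.h>

  static int ilog2(unsigned x) { int n = 0; while (x >>= 1) n++; return n; }

  int main(void) {
      unsigned cache = 128 * 1024, line = 128, ways = 8, page = 16 * 1024;
      unsigned sets = cache / (line * ways);               /* 128 sets   */
      int index_bits  = ilog2(sets) + ilog2(line);         /* 7 + 7 = 14 */
      int offset_bits = ilog2(page);                       /* 14         */
      printf("index bits %d vs page offset bits %d -> %s\n",
             index_bits, offset_bits,
             index_bits <= offset_bits ? "VIPT works" : "aliasing problem");
      return 0;
  }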

Larger pages mean that performance on memory-mapped small files will suffer... which is a use-case that Apple doesn't normally care about in its client computers.

Larger cache lines mean that highly mulththreaded server loads could suffer from false sharing more often. Again, its a client computer so who cares?
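A tiny illustration of the false-sharing point (a generic sketch, nothing M1-specific; the 128 is an assumption about the target's line size):

  /* Two counters 64 bytes apart would live on different 64-byte lines,
     but share a single 128-byte line; aligning each one to the line
     size avoids the cache-line ping-pong between writer threads. */
  #include <stdalign.h>
  #include <stdio.h>

  struct counters_bad {
      long a;                 /* written by thread 1 */
      char pad[64];
      long b;                 /* written by thread 2, same 128 B line as 'a' */
  };

  struct counters_good {
      alignas(128) long a;    /* each counter gets its own 128 B line */
      alignas(128) long b;
  };

  int main(void) {
      printf("bad: %zu bytes, good: %zu bytes\n",
             sizeof(struct counters_bad), sizeof(struct counters_good));
      return 0;
  }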

Regarding the definition of "huge": A64FX uses 256B cache lines. Granted its a numerical computing vector machine, but still. Huge covers a lot of ground.


The NVIDIA Carmel cores on 12nm had a 64KB L1D cache with a 2 cycles latency.


Means nothing without saying what the clock goes at.


2.26GHz, on a quite old process.


Latencies like this are doable with a lot of tuning on Intel CPUs; out of the box you'll get to the 40s with fast memory. And those CPUs have three cache levels instead of two...

A good old-fashioned 2010-era gaming PC would already get down to around 50 ns levels.

It's definitely really good, but considering it's rather fast RAM (DDR4 4266 CL16) and doesn't have L3 it's not that surprising.


Apple M1 has three cache levels:

- for big cores, private: 128KB L1D

- for big cores, shared within a cluster: 12MB L2

- system-level cache (shared between everything, CPU clusters, GPU, neural engine...): 16MB

and then you reach RAM.


I've written a benchmark to measure such things, and from what I can tell:

Each fast core has a L1D of 128KB.

The fast cores share a 12MB L2 within their cluster; cache misses go to main memory.

The slow cores have a 4MB L2.

The cache misses from the fast cores' L2 can't quite saturate the main memory system (I believe it's 8 channels of 16 bits). So when all cores are busy you keep 12MB of L2 for the fast cores and 4MB of L2 for the slow cores, and end up getting better throughput from the memory system since you are keeping all 8 channels busy.


Wonder if the SLC is then mostly used for coherency purposes and for the other blocks...

And yeah, it's 128-bit wide LPDDR4X-4266, pretty quick imo.


Not just 128 bits wide (standard on high end laptops and most desktops), but 8 channels. The latency is halved and over the last decades I've only been seeing very modest improvements in latency to main memory on the order of 3-5% a year.
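The headline bandwidth is easy to sanity-check from those figures (assuming the 128-bit / 4266 MT/s numbers are right):

  /* Peak bandwidth = transfer rate x bus width.
     8 channels x 16 bits = 128 bits = 16 bytes per transfer. */
  #include <stdio.h>
  int main(void) {
      double transfers_per_sec  = 4266e6;   /* LPDDR4X-4266 */
      double bytes_per_transfer = 16.0;     /* 128-bit bus  */
      printf("peak ~%.1f GB/s\n",
             transfers_per_sec * bytes_per_transfer / 1e9);  /* ~68.3 */
      return 0;
  }

Which lines up with the 60-68 GB/s figures quoted elsewhere in the thread.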


Or maybe use the best of both worlds, with soldered-in ultra-fast RAM plus a large amount of DIMM RAM.

Same as you can have a big storage drive and a smaller NVMe.


> Or maybe use the best of both worlds, with soldered-in ultra fast ram

That's basically what L3 cache is on Intel & AMD's existing CPUs. You could add an L4, but at some point the amount of caches you go through is also itself a bottleneck, along with being a bit ridiculous.


The way I see it, you could have a Mac Pro with (let’s say) 32GB of super-fast on-package RAM and arbitrarily upgradable DIMM slots. The consequence would be that some RAM would be faster and some would be a bit slower.

They would be contiguous, not layered.


The non-uniform memory performance of such a solution would be a software nightmare.


Doesn't seem much different than various multichip or multisocket solutions where different parts of memory have different latencies, called NUMA. Basically the OS keeps track of how busy pages are and rebalances things for heavily used pages that are placed poorly.

Similarly, Optane (in DIMM form) is basically slow memory, and OSes seem to handle it fine. NUMA support seems pretty mature today and handles common use cases well.
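For what it's worth, the plumbing for this kind of placement already exists on Linux. A minimal sketch with libnuma (the node numbers are made up, and this assumes firmware that exposes the fast and slow pools as separate NUMA nodes):

  /* Ask for memory from specific NUMA nodes, e.g. a hypothetical fast
     on-package node 0 vs. a slow off-package node 1.
     Build with: cc numa_demo.c -lnuma */
  #include <numa.h>
  #include <stdio.h>
  #include <string.h>

  int main(void) {
      if (numa_available() < 0) {
          puts("no NUMA support on this system");
          return 1;
      }
      size_t sz = 64 << 20;                       /* 64 MB                    */
      void *fast = numa_alloc_onnode(sz, 0);      /* request node 0           */
      void *slow = numa_alloc_onnode(sz, 1);      /* request node 1           */
      if (fast) memset(fast, 0, sz);              /* touch to place the pages */
      if (slow) memset(slow, 0, sz);
      printf("highest node: %d, fast=%p slow=%p\n",
             numa_max_node(), fast, slow);
      if (fast) numa_free(fast, sz);
      if (slow) numa_free(slow, sz);
      return 0;
  }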

With all that said, Apple could just add a second CPU to double the RAM and cores; seems like a great fit for a Mac Pro.


It doesn't seem any worse than existing NUMA systems today, where memory latency depends on what core you're running on. In contrast, the proposed system would have the same performance for on-board vs plugged DIMM regardless of which CPU is accessing it, which simplifies scheduling — from a scheduling perspective, it's all the same. I think that's easier to work with than e.g. Zen1 NUMA systems.


OSes have had this problem solved for decades; the solution is called "swap files". You could naively get any current OS working in a system with fast and slow RAM by simply creating a ramdisk on the slow memory address block and telling the OS to create a swap file there.


> OSes have had this problem solved for decades; the solution is called "swap files".

What operating systems handle NUMA memory through swapping? The only one I'm familiar with doesn't use a swapping design for NUMA systems, so I'm curious to learn more.


Not really the best idea for the kind of speed baselines and differences discussed here. You can use better ideas like putting GPU memory first in the fast part then the rest in the slow area. You know, like XBox Series does.


Yet Apple is managing excellent performance with just an L1+L2.


But the context of this thread is that it is being done with soldered RAM. I don't know how much that matters, just pointing out that you are taking the conversation in a circle.


>"Maybe motherboards should stop coming with dimms and use the apple approach to get great bandwidth and latency and come in 16, 32, and 64GB varieties by soldering LPDDR4x on the motherboard."

I prefer the upgradeability.


> I prefer the upgradeability.

The market doesn't care about such niche concerns, but it'll not flip completely overnight.

I like the idea of upgradeability too, but when the trade-off is such great performance, I'm not going to give that up. It would be different if the performance numbers were not as stark.

The open question is how long this performance advantage will be sustained. If it drops off, then concerns like upgradeability may become higher priority (and an opportunity for hardware vendors).


Yes, this is a niche concern... called environmental protection. The new stuff cannot be upgraded, so when the amount of RAM stops being sufficient, the old computer needs to be recycled (a modern word for throwing something into the waste bin, together with all the CO2 that was emitted when the computer was produced, not to mention the environmental costs of digging up rare earths, etc.).

I am still able to use my Lenovo ThinkPad 510T only because I could easily replace the HDD with a cheap, stock Samsung EVO SSD and throw in more RAM.

The absurdity of Apple's approach is that the Mac Mini with a 512 GB SSD is $200 more expensive than the one with 256 GB. 256 GB for $200 is a crazy price, so Apple basically says: hey, pay us a lot, lot more, so maybe you can use our stuff a bit longer; but in fact we actively discourage you from doing this, since we want you to buy the cheaper model, and in two-three years you will need to buy a new, fancier model.

But Tim Cook will tell you a lot about how much he cares about humanity, the environment and CO2 emissions. Maybe he will even fly his private jet to some conference to tell people how awful all that oil & gas & coal industry is.


On-package RAM and upgradability are not at odds with each other. If upgradability were desired, we could see socketed SoCs. This is one of the things modular phones (Project Ara) were about.

Just because RAM was one of the last holdouts of upgradability does not make it inherently more suitable for upgradability.

The problem is a lack of interest in manufacturing repairable and upgradable hardware. It is simply less profitable.


>"The market doesn't care about such niche concerns, but it'll not flip completely overnight."

I do not care all that much about the market either. When I need something it always seems to be there, and I do not mind if it is not produced by the biggest guy on the block. If at some point I can't find what I need I'll deal with it, but so far that has not happened.


Bandwidth is comparable with other high-end laptop chipsets (I've seen 60-68 GB/s quoted, and recent Ryzens are 68 GB/s). Is the on-chip latency a big factor in the single core performance?


Depends on the workload. Compilers are famous for being branch heavy with random lookups... something for which people have reported excellent performance on the M1. Parsing is hard as well (say, JavaScript).

Of course for any CPU workload it's going to be harder on the memory system when you have video going on. Doubly so for a video conference (compression, decompression, updating the video frames, streaming the results, network activity, reading from the camera, etc).

Seems like the Apple memory system wouldn't have received as much R&D as it did unless Apple thought it was justified. Clearly the speed, performance, and price show that Apple made quite a few good decisions.


Memory usage has definitely stalled over the last decade as more applications move to the web or mobile devices.

There's just nothing driving most people to have more than 8/16GB of RAM and even photo/video editing has been shown to be a breeze on the 8GB MacBook Air.

I wouldn't be surprised to see laptops move to soldered RAM and SSDs.


> Memory usage has definitely stalled over the last decade as more applications move to the web or mobile devices.

I beg you to look at memory consumption of those "lightweight" web apps


We should definitely not continue Apple's approach of soldering every damn component, even if it comes at the cost of performance


> even if it comes at the cost of performance

Why? What's the purpose of artificially limiting performance when one doesn't need the upgradability?

I've, personally, never upgraded the RAM on any system I've built or carried it to a new motherboard with a new socket. I'm absolutely the target audience for this. I would love this increased performance, as long as it wasn't some surprise. Having the extra plastic on the motherboard is literally e-waste for me. Don't touch my PCI-e slots though.


Used to be I'd upgrade my MBP memory and hard drive to eke out one more year between upgrades. The drive could always come back and be reused as a portable drive, and the best memory for an old machine was typically cheap enough by then that it wasn't that big of a deal.


The best present is receiving something you never knew you needed until you get it, so I love giving RAM (and SSD) for birthdays! That you can keep the same computer but that it simply becomes faster is a nice surprise for many.


Components have flaws, or they break down over time, and soldering components hampers repair and reuse.


I suggest you look up "integrated circuits" and "system on a chip", which is where all of our performance/power improvements have come from. You're in for a shock when it comes to repairability!


Not sure why you're being downvoted, it's completely true! If the SSD in my computer dies, I can just buy another one for cheap (500GB for what, 80 dollars?).

If the SSD in my Macbook/Mac Mini dies, either I can buy a new motherboard, or more likely, a new device. It is not economical nor ecological.

Also, paying 200 dollars for additional 256GB of storage? WTF.


Dunno, increasingly with machine learning, more cores, GPUs, etc the bottleneck is the memory system. How much are you willing to pay for a dimm slot?

Personally I'd rather have half the latency, more bandwidth, and 4x the memory channels instead of being able to expand ram mid life.

However I would want the option to buy 16, 32, and 64GB up front, unlike the current M1 systems that are 8 or 16GB.


Then, make desktops/laptops with 4 or 8 channels. We'd need more dimms, of a smaller size.


Only if you use dimms. If you use the LPDDR4x-4266 each chip has 2 channels x 16 bits. So the M1 has 4 chips and a total of eight 16 bit wide channels.


Does the 8GB variant have all 8 channels or just 4?


My guess is it's the same and using the half density chip in the same family, but I'm just guessing.


That extra memory will cost you an arm and a leg.


My understanding is that the LPDDR4x chips cost less per GB than the random chips you find in common DIMMs. There are also costs (board space, part cost, motherboard layers, and layout complexity) for DIMM slots.

Sure, manufacturers might try to charge significantly more than market price for on-the-motherboard RAM, but it's an opportunity to increase their profit margin and ASP. Random 2x16GB DIMMs on Newegg cost $150 per 32GB. Apparently LPDDR is easier to route, requires less power, and costs less for the same amount of RAM. I'd happily pay $500 for a motherboard with 64GB of LPDDR4x-4266. Seems like Asus, Gigabyte, Tyan, Supermicro and friends would MUCH rather sell a $500 motherboard with RAM than a $150 motherboard without.


The normal rate (not contract price) for LPDDR4 / LPDDR4X and LPDDR5 is roughly double the cost of DRAM per GB. Depending on channels and package, the one used in the M1 is likely even more expensive as they fit 4 channels per chip. DIMMs and board space add very little to the total BOM.


Ah, I had heard differently, for the same clock rate?

In any case the Apple parts, from what I can tell, are:

https://www.skhynix.com/products.do?lang=eng&ct1=36&ct2=40&c...

In particular this one:

https://www.skhynix.com/products.view.do?vseq=2271&cseq=77


If you don’t want it, just don’t buy it, but please don’t tell other people what they should or should not like or need.


Apple's (and everyone else's) anti-repair stance (both in terms of design and in policy) is harming the environment and generating tons of e-waste. What's wrong with expressing a view that helps the planet?


Because it’s just virtue signalling, not actual environmentalism. What matters environmentally is aggregate device lifetime, so you get the most use out of the materials. Apple devices use a minimum of materials and have industry leading usable lifetimes. They are also designed to be highly recyclable.

Greenpeace rated Apple the number 1 most environmentally friendly of the big technology companies.

https://www.techrepublic.com/article/the-5-greenest-tech-com...


  Apple devices use a minimum of materials and have industry leading usable lifetimes.
Their phones have far longer lifetimes for sure, but their laptops? I would like to see evidence of that. Outside of the mostly cheaply made laptops, most laptop/desktop computers can have very long secondary lives. Linux/Windows can run on some very old (multiple decades) machines.


> Because it’s just virtue signalling

Not buying a new MBP and instead passing the old one on to children in third-world countries qualifies as "virtue signaling" now?


Promoting reuse and repair is environmentalism. Preventing repair (as Apple does) generates more e-waste. There really is no way around that fact.

>What matters environmentally is aggregate device lifetime, so you get the most use out of the materials. Apple devices use a minimum of materials and have industry leading usable lifetimes. They are also designed to be highly recyclable.

Reuse and repair are FAR superior to recycling, which actually wastes a lot of energy, in addition to generating e-waste for the parts which are not recycled.

>Greenpeace rated Apple the number 1 most environmentally friendly of the big technology companies.

What good does it do? They are still harming the environment.


> Preventing repair (as Apple does) generates more e-waste. There really is no way around that fact.

There are plenty of ways around that fact.

Preventing repair while changing nothing else generates more e-waste. But that's not what Apple does.

If preventing repair lets you do any or all of the following things to a sufficient degree, the result is less e-waste than if you hadn't prevented repair:

- Use less environmentally harmful materials (e.g. on-board sockets, larger PCBs etc)

- Make the device last longer before it needs repair (reliability, longevity)

- Make the device easier to recycle

> Reuse and repair is FAR superior to recycle

It's a good goal, but it's only superior for sure if everything else is able to be kept the same to make it possible.

Some things really are better for the environment melted down and ground down and then rebuilt from scratch. I'm guessing big old servers running 24x7 are in this category: Recycling the materials into new computers takes a lot of energy, but just running the old server takes a huge amount of energy over its life compared with the newer, faster, more efficient ones you could make from the same materials. I would be surprised if not recycling was less harmful than recycling.

> What good does it do? They are still harming the environment.

When saying Apple should change the way they manufacture to be more like other manufacturers for environmental benefit, Apple being rated number 1 tells you that the advice is probably incorrect, as following it would probably cause more environmental harm, not less.


>- Make the device last longer before it needs repair (reliability, longevity)

If Apple makes devices that last so long, then how come Apple's own extended warranty program generates billions of dollars of revenue? Note that this doesn't include third-party repair shops. To me, this indicates a large industry dedicated to repairing Apple products - hardly a niche - and that a large number of Apple devices need repair, something Apple is hostile to.

Also, while AppleCare is easy and convenient for the customer, Apple's "geniuses" do not do board repair; they simply replace and throw away broken logic boards (which sometimes need nothing more than a 10-cent capacitor). As if that weren't bad enough, they actively prevent other businesses from performing component-level repair by blocking access to spare parts.

> I'm guessing big old servers running 24x7 are in this category: Recycling the materials into new computers takes a lot of energy, but just running the old server takes a huge amount of energy over its life compared with the newer, faster, more efficient ones you could make from the same materials. I would be surprised if not recycling was less harmful than recycling.

If that was the case, then of course, we should recycle. Maybe we should have a case-by-case approach depending on specific products? I'm totally willing to go wherever the evidence leads us. As of now pretty much every single environmental organization promotes reuse over recycling for electronics.

>When saying Apple should change they way they manufacture to be more like other manufacturers for environmental benefit, Apple being rated number 1 tells you that the advice is probably incorrect,

I merely accepted the "number one" in good faith at face value. Digging further with a cursory Google search, things seem a lot more nuanced. That being said, I have no idea what "number one" even means without context.

https://www.fastcompany.com/40561811/greenpeace-to-apple-for...


> If Apple makes devices that last so long,

I don't want to back the idea that Apple does make reliable or long-lived devices, although I'm very happy with my 2013 MBP still. I honestly don't know how reliable they are in practice, although they do seem to keep market value for longer than similar non-Apple devices, and they have supported them with software for a long time (my 2013 is still getting updates).

And I would love to be able to add more RAM to my 2013 MBP, which has soldered-in RAM; and I would love if it were easier to replace the battery, if the SSD were a standard fast kind that was cheap to get replacements for, and if I could have replaced the screen, which has a stuck pixel due to a screen coating flaw. So I'm not uncritical of the limitations that come with the device.

I'm only disputing your assertion that preventing repair and reuse of parts inevitably generates more e-waste. It's more nuanced than that.

Of course wherever and in whatever ways we can find to repair, reuse and recycle we should.

But there will always be some situations, especially with high-end technology, where repair and reuse needs more extra materials, components, embodied energy and complexity (and subtle consequences like extra weight adding to shipping costs) resulting in a net loss for the environment.

An extreme example but one that's so small we don't think of it is silicon chips. There is no benefit at all in trying to make "reusable" parts of silicon chips. The whole slab is printed in one complex, dense process. As things like dense 3D printing and processes similar to silicon manufacture but for larger object components come online, we're going to find the same factors apply to those larger objects: It's cheaper (environmentally) to grind down the old object and re-print a new one, than to print a more complex but "repairable" version of the object in the first place.


> - Make the device last longer before it needs repair (reliability, longevity)

2016 MBP owners will appreciate the joke!


Very good point!

I don't want to back the idea that Apple does make long-lived laptops, only that it's hypothetically possible they do sometimes :-)

My 2013 MBP is still going strong thankfully, I'm very happy with it still after all these years.


My last 2013 MBP is still alive only because I was able to source a third-party battery / power connector...

Though, somehow, if I trust some argument made here, it would be better for the environment to buy a whole new laptop rather than fix the existing one... Though, I'm not doing it for the environment, I'm just cheap as f*ck.


> if I trust some argument made here, it would be better for the environment to buy a whole new laptop rather than fix the existing one

No, I don't think that argument is being made by anyone.

The argument being made is that to make the laptop more able to have replaceable components could potentially require more environmental costs up front in making that laptop.

I doubt that argument works for the power connector. I suspect that's more to do with making sure Magsafe is really solid, but it might for the battery due to the pouch design instead of extra battery casing, I'm not sure.

There's no question that if you can repair it afterwards you probably should.

By the way, literally all my other laptops either died due to the power connector failing, or I repaired the failing power connector. Sometimes I had to replace the motherboard to sort out the power connector properly, which seems like poor design.

The Apple has been the only one that hasn't failed in that way, which from my anecdote of about 5 laptops says Apple's approach has worked best from that point of view so far. Of course Apple power supply cables keep fraying and needing to be replaced, so it balances out :-)


For a manufacturer on the whole it’s a negligible issue. It’s simply a fact that Apple devices have longer average lifetimes and lower overall environmental impact than any of the other manufacturers. Hence the Greenpeace rating. If you actually care about the environment, as you claim, the choice is clear.

What you are doing is picking a single marginal factor that can make a difference in rare cases, but is next to irrelevant in practice, and raising that above the total environmental impact of the whole range of devices. That’s just absurd.


> It’s simply a fact that Apple devices have longer average lifetimes

I've used the same desktop for the last 6 to 8 years, upgrading with STANDARDIZED components over the years, and my laptop from that era still works. Heck, I've got an 18-year-old ThinkPad still working fine.

In the mean time, two MBP died on me. Try again...


Do you care about the overall ecological footprint of Apple, as Greenpeace does, or only a few specific devices in particular? How do you evaluate likely future device lifetimes and ecological footprint: cherry-picked statistics or manufacturer track record?

Should I take your evaluation on this, or trust a detailed whole-enterprise evaluation by Greenpeace?


Apple's approach uses less material overall. For the vast majority of the machines, which A) don't fail and B) are never upgraded in any case, the Apple method of getting rid of sockets reduces the e-waste burden.


Yet it’s Apple’s devices that last the longest and have the highest resale values.


Hermes handbags also have high resale value. That tells us nothing. Apple's anti-repair approach absolutely harms the environment. Certainly they are not alone in this, many/most electronics these days are irreparable. But Apple is actively hostile to the repair industry, which makes them more deserving of criticism.


The repair industry in this case is hostile to the environment. They are incentivized to want computers to break so that they can sell repair services.

It turns out that soldering parts in place makes them less likely to break than a socket whose connections can oxidize or come loose.

The tiny number of devices that can’t be repaired because of soldered components, is dwarfed by the number of devices that never broke in the first place because of soldered components.


> It turns out that soldering parts in place makes them less likely

You've obviously never heard of MBP BGA chip solder balls cracking and rendering the whole device useless...


I didn’t say they never failed. Just that they fail less frequently.

In any case, solder ball cracking results from process issues and is a solved problem: https://www.pcbcart.com/article/content/reasons-for-cracks-i...

Certainly not something that would be improved with sockets.


Funny, the only refurbished computer, phone & tablet chain in NL has a pure Apple offering.


Unfortunately, even Intel's white-label laptop specs soldered RAM, so I expect the trend to continue in low/mid range PC laptops.

https://www.theverge.com/2020/11/19/21573577/intel-nuc-m15-l...


Maybe for some consumer devices we should try it? Clearly the results are excellent. Most people don't open up and modify their laptops.


> Just nobody believed those Geekbench benchmarks showing that in short benchmarks your phone could be faster than your laptop.

Except for a lot of the Apple-centric journalists and podcasters, who have been imagining for years how fast a desktop built on these already-very-fast-when-passively-cooled chips could be.

Not that benchmarks matter very much if experienced, real-world workload performance suffers, but as far as I can tell, the M1 is no slouch in that respect either.


I heard 10 years ago, or whatever, that the iPad 2 had the most power-efficient CPU available, period. This was said at a keynote by an HPC scientist who cannot be called an «Apple journalist». Apple have been doing well for a very long time and I've expected this moment since that keynote, basically.


OK, I misspoke. I heard the sentiment from the Apple-focused voices in my information bubble, doesn't mean that nobody else said it. It's just that "nobody believed those Geekbench benchmarks" is not completely true.


Yeah, Apple’s CPUs have been doing really well for a while now: https://www.cs.utexas.edu/~bornholt/post/z3-iphone.html


> Just nobody believed those Geekbench benchmarks showing that in short benchmarks your phone could be faster than your laptop.

I saw a paper on (I think?) SMT solvers on the iPhone. They turned out to be faster than laptops; I kind of brushed it off as irrelevant at the time.



I did believe those benchmarks. I also knew that a sustained load at those speeds, if not throttled down, would just melt the thing, so yes, they were irrelevant.


Maybe in practice at the time, but in hindsight they were actually a valid indication of the true performance capabilities of the architecture. It’s just that a phone has too little thermal capacity to sustain the workload.


>"Maybe in practice at the time"

Exactly this. If I am shopping now I do not care how a particular CPU/architecture evolves in the future. I only care about what it can practically do now and at what price. As it is now, the M1 has 4 fast cores and a non-upgradeable maximum of 16GB. For many people that is more than they will ever need. Let them be happy. For my purposes I am way happier with a 16-core new AMD coupled with 128GB of RAM (my main desktop at the moment). It runs at a sustained 4.2 GHz without any signs of thermal throttling. The cooler keeps it at 60C.


As well as too little battery capacity to sustain the workload.

Applying those cell phone methods to a laptop with better cooling and a larger battery was a win.


Another detail that came out today is just how beefy Apple's "little" cores are.

>The performance showcased here roughly matches a 2.2GHz Cortex-A76 which is essentially 4x faster than the performance of any other mobile SoC today which relies on Cortex-A55 cores, all while using roughly the same amount of system power and having 3x the power efficiency.

https://www.anandtech.com/show/16192/the-iphone-12-review/2


Incredible. Now it should be MiDdLe core.


Spot on. Exactly this. It’s like pre-iPhone when people just assumed you had a laptop and a cellphone. Then Apple said “phone computer!” and changed the game. Same with iPad just less innovation shock. Meanwhile we continued to have this delineation of computer / phone while under the hood - to a hardware engineer - it’s all the same. Naturally the chips they produced for iOS-land are beasts. My phone is faster than the computer I had 5 years ago. My M1 air is just a freak of nature. On par with high end machines but passively cooled and cheaper. I’m still kinda in awe. Not a big fan of the hush hush on Apple Silicon causing us all to play catch-up for support, but that’s Apple’s track record I guess.

The M1 is all the things they learned from the A1-A12 chips (or whatever the ordering) which is over a decade of tweaking the design for efficiency (phone) while giving it power (iPad).


> the M1 is fast isn’t due to technical tricks, but due to Apple throwing a lot of hardware at the problem.

Apple threw more hardware at the problem and they lowered the frequency.

By lowering the frequency relative to AMD/Intel parts, they get two great advantages. 1) they use significantly less power and 2) they can do more work per cycle, making use of all of that extra hardware.


> Unlike what has been said on Twitter the answer to why the M1 is fast isn’t due to technical tricks, but due to Apple throwing a lot of hardware at the problem.

Of course Apple’s advantage is not solely due to technical tricks, but neither is it entirely, or even mostly due to an area advantage. If it were so easy, Samsung’s M1 would have been a good core.


Yeah, the article is interesting. I just like knowing that Apple will keep iterating and the performance gap between Apple Silicon and x86 will continue to grow. I keep spec'ing out an Apple M1 Mac mini only to not pull the trigger, because I am curious what an M2 will hold.


The nice thing about Apple hardware resale value is that you can just buy now and upgrade later. I just unloaded a five year old MBP for a lot more than I'd ever have gotten for a five year old PC laptop. Ordered myself a fancy new M1 laptop, and if M1X/M2/whatever-it-is-called is substantially better, then I'll just trade it in on a new model. There's always a new hotness coming next year.


The rumors I've read so far is that the next chip will be the M1-X, and will be a 12-core CPU, with 8 "high performance" cores and 4 "efficiency" cores, and will be released in a 16" MBP. The current M1, by contrast, has an 8-core CPU (4 HP cores and 4 efficiency cores).


Is this with the 16" rumored first half of next year? I hadn't seen the one you're referring to. I'm contemplating options with my current 2018 15" which is about to receive its 3rd top case replacement (besides other issues), and whether I want to press for a full replacement and how long to wait, as AC expires next June.


I saw that rumor as well. Looks compelling. What I’m holding out for is what they bring to the rumored redesigned 16 inch MacBook Pro and the rumored 14 inch MacBook Pro. I think I’ll choose between one of those.


I think the old adage, "don't buy first-generation Apple products", applies here. Seems sensible to wait and see where this goes.


Hardly worth not doing it. The stuff depreciates far slower.

I’ve been using an M1 mini as my desktop for over a week now and am impressed.


Personally I'm just waiting for the software to catch up and maybe the landscape to stabilise. It's hardly irrational to wait and get the 2nd gen.


Yes true but as the respondent said above I’m just waiting for the software to catch up and maybe a redesign on the laptop side. In the mean time I’m saving for one.


Hah, won't this perpetuate? Whenever M(N) is released M(N+1) will be on the horizon with even greater promise.


Yes, been doing that for about 30 years. Now and then you just have to take a leap. I usually buy what was the best last year at a bargain price instead.


This is a good year for getting last year's stuff, given how low the supply is on most of the high-profile hardware (new Ryzens, current-generation dedicated graphics, game consoles).

I did get something launched this year that reviewed well - a gaming laptop (Legion 5) - but it was not hard to find and I even got it used, like-new. Perhaps because nobody's travelling now.


The problem is that a phone with a lightning fast CPU is rather useless with current ecosystems.

I do think there are "technical tricks" though, especially the x86-compatible (total store ordering) memory mode that makes x86 emulation faster than on comparable ARM chips. Whether you call it a trick or finesse is probably a matter of perspective.


Not mentioned in the article is the downside of having really wide decoders (and why they're not likely to get much larger). Essentially the big issue in all modern CPUs is branch prediction, because the cost of misprediction on a big CPU is so high. There's a rule of thumb that in real-world instruction streams there's a branch every 5 instructions or so. That means that if you're decoding 8 instructions, each bundle has 1 or 2 branches in it; if any are predicted taken you have to throw away the subsequent instructions. If you're decoding 16 instructions you've got 3 or 4 branches to predict, and the chance of having to throw something away gets higher as you go... there's a law of diminishing returns that kicks in, and it has probably kicked in at 8.
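You can put rough numbers on that with a toy model. Assume (my assumptions, not measured data) a branch every 5 instructions and roughly half of them predicted taken, so each decode slot ends the useful part of a fetch group with probability about 0.1:

  /* Expected useful instructions per fetch group of width W when each
     slot is a taken branch with probability q (toy model, not real data):
     E[useful] = sum_{i=0}^{W-1} (1-q)^i                                  */
  #include <stdio.h>
  int main(void) {
      double q = 0.2 * 0.5;              /* 1-in-5 branches, ~50% taken */
      int widths[] = {4, 8, 16};
      for (int w = 0; w < 3; w++) {
          double useful = 0, p = 1;
          for (int i = 0; i < widths[w]; i++) { useful += p; p *= 1 - q; }
          printf("decode %2d-wide -> ~%.1f useful slots\n", widths[w], useful);
      }
      return 0;
  }

In this model, going from 8-wide to 16-wide roughly doubles the decode hardware but only adds about 2.5 useful slots per group, which is the diminishing-returns curve described above.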


The 5nm process is a large factor as well.


...and Apple was able to throw hardware at the problem because they got TSMC's manufacturing process. When everyone else is using 5nm, let's see if any of this other stuff actually matters.


The M1 is also the only TSMC 5nm chip that is widely available, and there is nothing remotely comparable from a process standpoint.


Are they really throwing more hardware at the problem?

The die size for the whole M1 SoC is comparable to or even smaller than Intel processors, and the vast majority of that SoC is non-CPU related stuff; the CPU cores/cache/etc seem to be at most 20% of that die - though a denser die because of the 5nm process. This also seems to imply that the transistor 'budget' for the CPU part of the SoC is comparable to previous Intel processors, not a significant increase. (Assuming 20% of the 16B transistors in the M1 is the CPU part, it would be 3-ish billion transistors; Intel does not seem to publish transistor counts but I believe it's more than that for the Intel i9 chips in last year's MacBook Pros.)

Perhaps my estimates are wrong, but it seems that they aren't throwing more hardware, but managing to achieve much more with the same "amount of hardware" because it is substantially different.


Well you don't have any of the AVX512 nonsense, but probably the OOO of the M1 uses more transistors than on an Intel chip. And Google tells me that a quad core i7-7700K has 2.16 B.


The next question: What prevents Intel or AMD from doing this on their processors?


The article specifically answers this:

- x86 instruction set can't be queued up as easily because instructions have different lengths - 4 decoders max, while Apple has 8 and could go higher.

- Business model does not allow this kind of integration.


> instructions have different lengths

also allows extremely long instructions; the ISA allows up to 15 bytes and faults at 16 (without that artificial limit you could create arbitrarily long x86 instructions).
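To make that concrete, redundant prefixes count toward the limit, so you can pad a one-byte instruction right up to it (my own illustration; the byte values are standard x86 encodings):

  /* A NOP (0x90) preceded by 14 redundant operand-size prefixes (0x66)
     is a legal 15-byte x86 instruction; prepend one more 0x66 and the
     CPU faults because the instruction would exceed 15 bytes. */
  #include <stdio.h>

  static const unsigned char padded_nop[15] = {
      0x66, 0x66, 0x66, 0x66, 0x66, 0x66, 0x66,
      0x66, 0x66, 0x66, 0x66, 0x66, 0x66, 0x66,
      0x90                                        /* nop */
  };

  int main(void) {
      for (int i = 0; i < 15; i++) printf("%02x ", padded_nop[i]);
      printf("(%zu bytes)\n", sizeof padded_nop);
      return 0;
  }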


What a nightmare, but it makes me wonder: rather than decoding into micro-ops at runtime, could Intel or AMD "JIT" code up-front, in hardware, into a better bytecode?

I'm sure it wouldn't work for everything, but why wouldn't it be feasible to keep a cache of decoding results by page or something?


This is exactly how the hardware works and what micro-ops are: on any system with a µop cache or trace cache, those decoded instructions are cached and used instead of decoding again. Unfortunately you still have to decode the instructions at least once first, and that bottleneck is the one being discussed here. This is all transparent to the OS and not visible outside a low-level instruction cache, which means you don't need a major OS change, but arguably if you were willing to take that hit you could go further here.


So what stops x86 from adding the micro-ops as a dedicated alternate instruction set, Thumb style? Maybe with the implication that Intel will not hold the instruction set stable between chips, pushing vendors to compile to it on the fly?


Micro-ops are usually much wider than instructions of the ISA. They are usually not a multiple of 8 bits wide either.

A dedicated alternative instruction set would be possible, but that would take die space and make x86_64 plus the new set even harder to decode.


From what I understand this is exactly what the instruction decoder does.


They do something similar for 'loops'. The CPU doesn't decode the same instructions over and over again; it just uses them from the 'decoded instruction cache', which has a capacity of around 1,500 decoded micro-ops.


Hmm, this reminds me of Transmeta https://en.wikipedia.org/wiki/Transmeta


They do this in a lot of designs. It's called a micro op cache, or sometimes an L0I cache.


I think the latter is the biggest challenge.

I imagine that Apple's M1 designers are using what they know about macOS, what they know about applications in the store, and what user telemetry macOS customers have opted into, all to build a picture of which problems are most important to solve for the type of customer who will be buying an M1-equipped macOS device. They have no requirement to provide something that will work equally well for server, desktop, etc. roles, or for Windows and Linux, and they have a lot more information about what's actually running on a day-to-day basis.


They say in the article that AMD "can't" build more than 4 decoders. Is that really true? It could mean:

* we can't get a budget to sort it out

* 5 would violate information theory

* nobody wants to wrestle that pig, period

* there are 20 other options we'd rather exhaust before trying

When they've done 12 of those things and the other 8 turn out to be infeasible, will they call it quits, or will someone figure out how to either get more decoders or use them more effectively?


Their business model allowed them to integrate a GPU and a video decoder. Of course it allows for this kind of integration. The author is not even in that industry, so a lot of his claims are fishy. Moore's law is not about frequency, for example.


I think what they mean is the lack of coordination between software and hardware manufacturers + the unwillingness of Intel/AMD etc. to license their IP to Dell etc. What is untrue about that?

On Moore's law, yes it's about transistors on a chip, but I think the point they're making is that Moore's law created a situation where you couldn't ramp up frequency anymore because the transistors are so small.


> What is untrue about that?

The fact that they don't need to license technology. They can bring more functionality into the package that they sell to Dell, etc. like they have already done.

> but I think the point they're making is that Moore's law created a situation where you couldn't ramp up frequency anymore because the transistors are so small.

That is not the point they are making. Clock frequencies have not changed since the deep-pipelined P4, but transistor count has continued to climb. Here is what the author, who clearly does not know what he is talking about, said about that:

"increasing the clock frequency is next to impossible. That is the whole 'End of Moore’s Law' that people have been harping on for over a decade now."


Seems like backward compatibility completely rules this out. Apple can provide tools and clear migration paths because the chips are only used in their systems. Intel chips are used everywhere, and Intel has no visibility.


I wonder if in 3 or 4 years there will be a new chipmaker startup which offers CPUs to the market similar to the M1. If Intel or AMD won't do it that is.


The problem is competing with the size of Apple (and Intel and AMD). The reason there are so few competing high-performance chips on the market is that it is extremely expensive to design one. And you need a fab with a current process. For many years Intel beat all other companies because they had the best fabs and could afford the most R&D. Now Apple has the best fab - TSMC's 5nm - and certainly gigantic developer resources they can afford to spend, as the chip design is pretty similar between the iPhones, iPads and the Macs.

And of course, as mentioned in the article, any feature in the Apple Silicon will be supported by MacOS. Getting OS support for your new SOC features in other OSes is not a given.


Qualcomm could come out with an M1 class chip, they have the engineering capability, but if Microsoft or Google don’t adopt it with a custom tuned OS and dev tooling customised for that specific architecture ready from day one, they’d lose a fortune.

The same goes in the other direction. If MS does all the work on ARM Windows optimised for one vendor's experimental chip and the chip vendor messes up or pulls out, they'd be shivved. It's too much risk for a company on either side of the table to wear on their own.


Sadly Qualcomm hasn't developed its own CPU cores since the ARMv8 transition, and Samsung stopped developing its own cores. So only ARM's core designs are available for mobile.


The article answers this: x86 has variable-sized instructions, which makes decoding much more complex and hard to parallelize since you don't know the starting point of each instruction until you read it.
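A toy way to see the difference (not a real length decoder, just the shape of the dependency):

  /* With fixed 4-byte instructions, the start of instruction i is just
     4*i, so 8 decoders can all start immediately. With variable lengths,
     start i+1 isn't known until instruction i has been (at least partly)
     decoded. insn_length() is a placeholder; real x86 length decode has
     to chew through prefixes/opcode/modrm first. */
  #include <stdio.h>

  static size_t insn_length(const unsigned char *p) {
      return 1 + (p[0] % 15);           /* pretend lengths of 1..15 bytes */
  }

  int main(void) {
      unsigned char code[64] = {3, 7, 1, 12, 5, 9};   /* dummy bytes */
      size_t fixed[8], variable[8], off = 0;

      for (size_t i = 0; i < 8; i++)
          fixed[i] = 4 * i;             /* all 8 offsets known in parallel */

      for (size_t i = 0; i < 8; i++) {  /* a serial chain of dependencies */
          variable[i] = off;
          off += insn_length(code + off);
      }
      printf("8th start: fixed=%zu, variable=%zu\n", fixed[7], variable[7]);
      return 0;
  }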


AMD until very recently (and Intel almost 25 years ago) used to mark instruction boundaries in the L1 I$ to speed up decode. They stopped recently for some reason though.


For those unaware, ARM (as with many RISC-like architectures) uses 4 bytes for each instruction. No more. No less. (THUMB uses 2, but it's a separate mode.) x86, OTOH, being CISC-based in origin, has instructions ranging from a single byte all the way up to 15 bytes[a].

[a]: It is possible to write instructions that would be 16 or more bytes, but the micro architecture will “throw” an illegal instruction exception if it encounters one. Intel doesn’t say why there’s a limit, but they do mention (in the SDM) that they may raise or remove the limit in the future.

Why 15 bytes? My guess is so the instruction decoder only needs 4 bits to encode the instruction length. A nice round 16 would need a 5th bit.


> It is possible to write instructions that would be 16 or more bytes, but the micro architecture will “throw” an illegal instruction exception if it encounters one. Intel doesn’t say why there’s a limit

I remember noticing on the 80286 that you could in theory have an arbitrarily long instruction, and that with the right prefixes or instructions interrupts would be disabled while the instruction was read.

I wondered what would happen if you filled an entire segment with a single repeated prefix, but never got a chance to try it. Would it wrap during decoding, treating it as an infinite length instruction and thereby lock up the system?

My guess is that implementations impose a limit to preclude any such shenanigans.


You could encode 16 bytes–there's no need to save a slot for zero.


I honestly don’t know how the processor counts the instruction length, so it was only pure speculation on my part as to why the limit is 15. Maybe they naively check for the 4-bit counter overflowing to determine if they’ve reached a 16th byte? Maybe they do offset by 1 (b0000 is 1 and b1111 is 16) and check for b1111? I honestly have no idea, and I don’t think we’ll get an answer unless either (1) someone from Intel during x86’s earlier years chimes in, or (2) someone reverses the gates from die shots of older processors.


x86 instruction-length is not encoded in a single field. You have to examine it byte by byte to determine the total length.

There may be internal fields in the decoder that store this data, I suppose.


Yes I am aware. My wording could’ve been better. I was referring to the (possible) internal fields in the decoder.


How does 8 wide decode on ARM RISC compare to 4 wide decode on x64 CISC? If, say, you'd need two RISC ops per CISC op on average, that should be the same, right?


Most instructions map 1:1. x86 instructions can encode memory operands, potentially doubling the practical width, but a) not all CPUs can decode them at full width and b) in practice compilers generate mem+op instructions only for a (not insignificant) minority of instructions.

So the apples-to-apples (pun intended) practical width difference is closer than it appears, though still not as wide.

X86 machines usually target 5-6 wide rename, so that would become a bottleneck (not all instructions require rename of course). I expect that M1 has an 8-wide rename.

Edit: another limitation is that most x86 decoders can only decode 16 bytes at a time and many instructions can be very long, further limiting actual decode throughput.

Conversely, the expectation is that most hot code will skip decode completely and be fed from the uop cache. This also saves power.
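As a concrete instance of the memory-operand point (typical compiler output; exact registers vary, so treat the assembly in the comment as illustrative):

  /* The same source line becomes one x86-64 instruction but a two-
     instruction load+add pair on AArch64 (typical codegen):
         x86-64 :  add  rax, qword ptr [rdi]
         AArch64:  ldr  x8, [x0]
                   add  x0, x1, x8
     So each x86 decode slot can carry a bit more work, which narrows
     (but doesn't close) the 4-wide vs 8-wide gap. */
  #include <stdio.h>

  long add_from_memory(const long *p, long total) { return total + *p; }

  int main(void) {
      long x = 40;
      printf("%ld\n", add_from_memory(&x, 2));   /* 42 */
      return 0;
  }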


Thank you for a better synopsis.

Bugs me so much when people don't look at the logical side of things. Tons of mac'n'knights going around downvoting and stating people are wrong that it has something to do with being a RISC processor.

While fundamentally pre-2000s things were more RISC-this and CISC-that, the designs are more similar than ever on x86 and ARM. It's just that the components are designed differently to handle the different base instruction sets.

Also, the article is entirely wrong about SoCs: Ryzen chips have been SoCs since their inception. In fact AMD went SoC with the first APUs, which carried north bridge components onto the CPU die.


It also has a shared 12MB L2 cache for the performance cores, which is huge.


> But Apple has a crazy 8 decoders. Not only that but the ROB is something like 3x larger. You can basically hold 3x as many instructions. No other mainstream chip maker has that many decoders in their CPUs.

The author completely misses "the baby in the water"

Yes, X86 cores are HUGE; the whole CPU is there for them only.

They can afford wider decode, even at a giant area cost (which itself would be dwarfed by the area cost of the cache system).

The thing is, more decode and bigger buffers will still not improve X86 perf by much.

Modern X86 has good internal register and pipeline utilisation; it's simply that there isn't enough to keep all of those registers busy most of the time!

What it lacks is memory and cache I/O. All X86 chips today are I/O starved at every level. And that starvation also comes as a result of decades-old X86 idiosyncrasies about how I/O should be done.


How does I/O works differently in a M1 chip compared to x86?


I think X86 is the only modern ISA family that still has a separate address space for I/O. It is not used anymore today, but it exists somewhere deep in the chip, and its legacy kind of messed up how the entire wider memory and cache systems on X86 were designed.

X86 has memory-mapped I/O for modern hardware, but on the way there, X86 memory access got tangled with bus access. X86 still treats the wider memory system as a kind of "peripheral" with a mind of its own.

The intricacies of how X86 memory access evolved to keep accommodating decades-old drivers and hardware apparently made a grand mess of what you can and cannot memory-map or cache, and of many things deeper in the chip.

One of many casualties of that design decision is the X86 cache miss penalty, and overall expensive memory operations.


I don't really get what you are talking about. Everybody has been doing MMIO for a while now (and by for a while I mean multiple decades), and IO is usually not an issue in personal computers anyway; OOO, compute and the memory hierarchy is. We are not discussing about some mainframes...


> I think X86 is the only modern ISA family that still have a separate address space for I/O. It is not used today anymore, but it exists somewhere deep in the chip, and its legacy kind of messed up how the entire wider memory, and cache systems on X86 were designed.

Internally it's just a bank of memory these days. You can publicly see how HyperTransport has treated it as a weird MMIO range for decades (just like Message Signaled Interrupts), and QPI takes the same approach.


Why can't they just give the IO bus a slower clock and devote the resources to the memory bus? Or, memory-map everything and make the IO area yet another reserved area the BIOS tells the OS about?


I believe that if anybody knew the answers to the many questions along the lines of yours, that person would be rich, and Intel would not be in the ditch.

To be clear, the I/O bus with its own address space is no longer in use, but the design considerations that came from keeping it around for so long are still there.


I find it interesting you use this kind of disparaging tone when discussing Apple Silicon. I also find it interesting that you consider having a wide decoder not as a technical trick but as "throwing hardware at the problem."

However you try and spin it, what it comes down to is this, Apple is somehow designing more performant processors than every other company in the world and we should acknowledge they are beating the "traditional" chip designing companies handily while being new at the game.

If it's as easy as "throwing hardware" at the problem, then Intel and AMD and Samsung etc should have no problem beating Apple right?


I interpreted that differently: they used good engineering to make a speed demon of a chip, not some magic trick that's only good in benchmarks and not real world usage. I don't think it was disparaging at all.


It’s never considered “magical” after the feat has been accomplished. But a year ago if you claimed this is where Apple would be today, a lot of people would say that would require waving a magic wand.


Yeah, it's like saying "well all Apple is doing is throwing good engineering at the problem".

All they've really done is taken an advanced process, coupled it with a powerful architecture they've been iteratively improving on for years, and thrown it at software that has been codesigned to work really well on those chips.

Yeesh!


They didn't just throw hardware at the problem, but also talent. Something that is way more scarce.


Apple is throwing money at the problem. IC size is directly proportional to cost.

They couldn't sell a chip like this at a cost-competitive price on the open market against AMD/Intel products.


Hmm. The M1 is about 120 sq mm, substantially smaller than many Intel designs. The current estimate for 5 nm wafers is about $17K (almost certainly high). A single 300 mm wafer is about 70,700 sq mm. If we get die from 75% of that area, that gives us about a $38 raw die cost. Even with packaging and additional testing, I suspect they would be competitive.


Nice back of the envelope calculation. I think I'd add yield to it though.

TSMC had a 90% yield in 2019 for an 18mm² test chip[1]. Assuming a 120mm² chip would have more defects, and assuming process improvements since 2019, maybe 80% would be an accurate-ish estimate.

Found an even better number: [2] lists the defect density as 0.11 per 100mm², which works out to about 87% yield for a 120mm² die.

$38 / 0.87 ≈ $43.7

[1] https://www.anandtech.com/show/15219/early-tsmc-5nm-test-chi... [2] https://www.anandtech.com/show/16028/better-yield-on-5nm-tha...
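Putting the whole estimate in one place (all inputs are the rough public figures quoted above, so the output is order-of-magnitude at best):

  /* Back-of-the-envelope M1 die cost: wafer price / good dies per wafer,
     with Poisson yield from the quoted defect density. All inputs are
     rough public estimates from the posts above.
     Build with: cc die_cost.c -lm */
  #include <math.h>
  #include <stdio.h>

  int main(void) {
      double wafer_cost  = 17000.0;        /* $ per 5nm wafer (high estimate) */
      double wafer_area  = 70700.0;        /* mm^2, 300 mm wafer              */
      double usable      = 0.75;           /* edge loss, test structures, ... */
      double die_area    = 120.0;          /* mm^2, M1                        */
      double defects     = 0.11 / 100.0;   /* 0.11 defects per 100 mm^2       */

      double dies  = wafer_area * usable / die_area;
      double yield = exp(-defects * die_area);          /* ~0.88 */
      printf("dies/wafer ~%.0f, yield ~%.0f%%, cost/good die ~$%.0f\n",
             dies, yield * 100, wafer_cost / (dies * yield));
      return 0;
  }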


That's a fair point, yield has to be included. I lumped yield in with the pessimistic 75% factor for area utilization of the wafer. I should have been more clear. The area loss for square die on a round wafer should be much less than 25% of the total wafer area.

If you look at process tech cost trends, the $17K is also very pessimistic. I think a customer the size of Apple is probably getting a much better rate than that. Remember, they sell well over 200 million TSMC fabbed chips a year. Hard to know for sure of course, but I imagine these chips are ultimately costing Apple well under $40. We'll never know of course...


The big skew in availability between 8GB and 16GB models implies to me that yields of perfect chips are lower than Apple expected, with too many ending up in the 8GB bin.


The DRAM is on separate chips from the M1 processor. The availability skew is probably just a production forecasting error.


I came to the opposite conclusion. I think more users than expected paid for the 16GB models, leaving extra inventory of the 8GB models. During Black Friday/Cyber Monday I saw several discounts on 8GB M1 systems, but none on the 16GB systems.

Hopefully that sends a message to Apple to build systems with more memory. Seems insane to invest in an expensive M1 system (once you add storage, a 3-year warranty, etc.) and get only 8GB. Even if it works well today, with a useful life of 3-6 years it seems likely the extra 8GB will have significant value over the life of the system... even if it's just to reduce wear on the storage system.


That would imply that the M1 has the DRAM rather than just the controllers on the chip but all of the coverage I've seen says that they are separate chips in the same package.


This is interesting. What sources do you visit to learn about CPU manufacturing trends?


Well, there are a number of industry sites, but here are a few good starting points:

https://en.wikichip.org/wiki/WikiChip https://semiwiki.com https://www.tomshardware.com https://semiaccurate.com (paywall for some articles, very opinionated...)


As the sister comments have noted they almost certainly could given the size of the die.

But in another sense you are right - Apple is throwing money at the problem: their scale and buying power means that they have forced their way to the front of the TSMC queue and bought all the 5nm capacity.


Is that true? Intel chips are known to be overpriced.


Intel probably isn't the best example here; a better comparison would be AMD. Their R&D budget was 20-25% of Intel's, yet they were able to produce a better-performing part with Zen 3.


Intel fabs their own chips. AMD outsources that. Intel is still on 14nm vs AMD's 7nm. It has been a really long time since AMD has even come close to Intel. The question is whether Intel can recover from their slump before AMD can get enough chips out.


AMDs on a big upswing, design wins in multiple markets from servers down to laptops. The PS5 and Xbox S/X also will help will volumes for the next few years.


>> Intel is still on 14nm

Why do people keep saying that? There's a variety of 10nm processors from Intel on the market.


Are there any chips like this being sold at all by AMD or Intel? Can you get this performance and power consumption anywhere else?


The AMD 4900U has similar power consumption and higher multi-threaded performance, but lower single-threaded performance.

I expect the AMD part to have quite a bit lower GPU performance, since it uses 1/3 of the transistors of the M1: 4.9 billion transistors vs 16 billion.

https://wccftech.com/intel-and-amd-x86-mobility-cpus-destroy...


AMD CPUs are very hard to fully utilize with a single thread, and Intel has always held the single-thread perf crown. Most high-performance use cases are multi-threaded now, so the single-threaded performance delta isn't that significant. The Apple chip is really built for running a snappy GUI: in most of those cases, you need to run ~1M instructions super fast on one thread for a short time. Intel has historically had the crown on this metric, but not any more with all of their problems.


AMD Zen3 outperforms everything Intel has in single threaded performance.

https://wccftech.com/amd-ryzen-9-5950x-16-core-zen-3-cpu-obl...

Or are you saying that some operations like FMAs for example are hard to keep a high utilization for?


Historically yes, but not so with the Zen3.


The closest in the next few months is the new Zen 3 based APUs that AMD is announcing at CES which is in Mid January. Zen 3 is reasonably competitive with the M1 on a per core basis. Not quite the single thread perf, but pretty competitive on multicore throughput.

As a rough estimate I'd expect the AMD chips to be within 10-20%, and you'll be able to run Windows or Linux on them.

