Why is Apple's M1 chip so fast? (erik-engheim.medium.com)
768 points by socialdemocrat on Nov 30, 2020 | 629 comments



Contrary to what has been said on Twitter, the answer to why the M1 is fast isn't technical tricks, but Apple throwing a lot of hardware at the problem.

The M1 is really wide (8-wide decode) and has a lot of execution units. It has a huge, roughly 630-entry reorder buffer to keep them all filled, multiple large caches, and a lot of memory bandwidth.

It is just a monster of a chip, well designed, balanced, and executed.

BTW, this isn’t really new. Apple has been making incremental progress year by year on these processors for their A-series chips. Just nobody believed those Geekbench benchmarks showing that in short benchmarks your phone could be faster than your laptop. Well, it turns out that given the right cooling solution those benchmarks were accurate.

Anandtech has a wonderful deep dive into the processor architecture.

https://www.anandtech.com/show/16226/apple-silicon-m1-a14-de...

Edit: I didn’t mean to disparage Apple or the M1 by saying that Apple threw hardware at the problem. That Apple was able to keep power low with such a wide chip is extremely impressive and speaks to how finely tuned the chip is. I was trying to say that Apple got the results they did the hard way by advancing every aspect of the chip.


The answer of wide decode and a deep reorder buffer gets much closer to it than the “tricks” mentioned in tweets. That still doesn’t explain how Apple built an 8-wide CPU with such deep OOO that operates on 10-15 watts.

The limit that keeps you from arbitrarily scaling up these numbers isn’t transistor count. It’s delay—how long it takes for complex circuits to settle, which drives the top clock speed. And it’s also power usage. The timing delay of many circuits inside a CPU scales super-linearly with things like decode width. For example, the delay in the decode stage itself scales quadratically with the width of the decoder: ftp://ftp.cs.wisc.edu/sohi/trs/complexity.1328.pdf (p. 15). The delay of the issue queues is quadratic both in the issue width and the depth of the queues. The delay of a full bypass network is quadratic in execution width. Decoding N instructions at a time also requires a register renaming unit that can perform register renaming for that many instructions per cycle, and the register file must have enough ports to be able to feed 2-3 operands to N different instructions per cycle. Additionally, big, multi-ported register files, deep and wide issue queues, and big reorder buffers also tend to be extremely power hungry.
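To make the scaling claim concrete: the delay models in the cited report are, roughly, low-order polynomials in the structure's width, so the settle time of a stage looks something like this (a schematic form, not the paper's exact fitted equations):

    \[ T_{\text{stage}} \;\approx\; c_0 + c_1 W + c_2 W^2 \]

where W is the decode/issue width (or window depth, for the issue queues) and the constants depend on the circuit and process. The quadratic term is what eventually bites as W grows.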

On the flip side, the conventional wisdom is that most code doesn’t have enough inherent parallelism to take advantage of an 8-wide machine: https://www.realworldtech.com/shrinking-cpu/2/ (“The first sign that the party was over was diminishing returns from wider and wider superscalar designs. As CPUs went from being capable of executing 1, to 2, to 4, to even 6 instructions per cycle, the percentage of cycles during which they actually hit their full potential was dropping rapidly as both a function of increasing width and increasing clock rate.”). At the very least, such designs tend to be very application-dependent. Branch-y integer code like compilers tends to perform poorly on such wide and slow designs. The M1, by contrast, manages to come close to Zen 3, which is already a high-ILP CPU to begin with, despite a large clock speed deficit (3.2 GHz versus 5 GHz). And the performance seems to be robust—doing well on everything from compilation to scientific kernels. That’s really phenomenal and blows a lot of the conventional wisdom out of the water.
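As a toy illustration of “inherent parallelism” (my example, not from the linked article): the first loop below is one long dependency chain, so an 8-wide core mostly waits on the previous iteration; the second has four independent chains that a wide out-of-order core can overlap.

    #include <stdio.h>

    /* One long dependency chain: each iteration needs the previous acc,
     * so issue width beyond ~1-2 buys almost nothing here. */
    static double serial_chain(const double *x, int n) {
        double acc = 1.0;
        for (int i = 0; i < n; i++)
            acc = acc * x[i] + 1.0;
        return acc;
    }

    /* Four independent accumulators: a wide core can keep several of
     * these additions in flight per cycle. */
    static double independent_sums(const double *x, int n) {
        double a = 0, b = 0, c = 0, d = 0;
        for (int i = 0; i + 3 < n; i += 4) {
            a += x[i]; b += x[i + 1]; c += x[i + 2]; d += x[i + 3];
        }
        return a + b + c + d;
    }

    int main(void) {
        double x[1024];
        for (int i = 0; i < 1024; i++) x[i] = 1.0 / (i + 1);
        printf("%f %f\n", serial_chain(x, 1024), independent_sums(x, 1024));
        return 0;
    }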

An insane amount of good engineering went into this CPU.


> An insane amount of good engineering went into this CPU.

I agree, but let’s not overblow the difficulties either.

> For example, the delay in the decode stage itself scales quadratically with the width of the decoder.

That could be irrelevant for small enough numbers, and ARM is easier to decode than x86. So this could very well be dominated by other things. What you cite seems to be only about decoding the logical registers going into the renaming structures, and even for just that tiny part it says that "We found that, at least for the design space and technologies we explored, the quadratic component is very small relative to the other components. Hence, the delay of the decoder is linearly dependent on the issue width."

> The delay of a full bypass network is quadratic in execution width.

Maybe if that's a problem, don't do a full bypass network.

> dropping rapidly as both a function of increasing width and increasing clock rate

Good thing that the clock rate is not too high then :p

More seriously, the M1 can keep the beast fed probably because everything is dimensioned correctly (and yes, also because the clocks are not too high; but if you manage to make a wide and slow CPU that actually works well, I don't see why you would want to scale the frequency much higher, given that you would quickly burn power like crazy, and there is only limited headroom above 3.2GHz anyway). It obviously helps to have a gigantic OOO. So I don't really see where there is so much surprise. Esp. since we saw the progression in the A series.

To finish, TSMC 5nm probably does not hurt. The competitors are on bigger nodes and have smaller structures. Coincidence? Or just how it has worked for decades already.


It's not completely groundbreaking, but painting it as an outgrowth of existing trends doesn't give Apple enough credit. The challenges of scaling wider CPUs within available power budgets are widely accepted: https://www.cse.wustl.edu/~roger/560M.f18/CSE560-Superscalar... (for "high-performance per watt cores," optimal "issue width is ~2"). Intel designed an entire architecture, Itanium, around the theory that OOO scaling would hit a point of diminishing returns. https://www.realworldtech.com/poulson ("Many of them were convinced that dynamic instruction scheduling and out-of-order execution would ultimately prove to be too complex and power hungry."). It is also well accepted that we are hitting limits on our ability to extract instruction-level parallelism: https://docencia.ac.upc.edu/master/MIRI/PD/docs/03-HPCProces... ("There just aren’t enough instructions that can actually be executed in parallel!"); https://compas.cs.stonybrook.edu/~nhonarmand/courses/sp16/cs... ("Hardly more than 1-2 IPC on real workloads").

Apple being able to wring robust performance out of an 8-wide 3.2 GHz design, on a variety of benchmarks, is impressive and unexpected. For example, the M1 outperforms a Ryzen 5950X by 15%. https://www.anandtech.com/show/16252/mac-mini-apple-m1-teste.... Zen 3 is either 4- or 8-wide decode (depending on whether you hit the micro-op cache) and boosts to 5 GHz. It beats the 10900k, a 4+1-way design that boosts to 5.1 GHz, by 25%. The GCC subtest, meanwhile, is famous for being branch-heavy code with low parallelism. Apple extracting 80% more IPC from that test than AMD's latest core (which is already a very impressive, very wide core to begin with!) is very unexpected.

A lot of the conventional wisdom is based on assumptions about branch prediction and memory disambiguation, which have major impacts on how much ILP you can extract: http://www.cse.uaa.alaska.edu/~afkjm/cs448/handouts/ILP-limi.... To do so well, Apple must be doing something very impressive on both fronts.


The i7-1165G7 is within 20% of the M1 on single-core performance. The Ryzen 4800U is within 20% on multi-core performance. Both are sustained ~25W parts similar to the M1. If you turned x64 SMT/AVX2 off and normalized for cores (Intel/AMD 4/8 vs Apple 4+4), on-die cache (Intel/AMD 12MB L2+L3 vs Apple 32MB L2+L3) and frequency (Apple 3.2 vs AMD/Intel 4.2/4.7), you'd likely get very close results on the same 5nm or 7nm-equivalent process. Zen 3 with 2666 vs 3200 RAM alone is about a 10% difference. The M1 is 4266 RAM IIRC.

TBH, laptop/desktop-level performance is indeed very nice to see out of ARM, after a few years of false starts by a few startups and Qualcomm. Apple designed a wider core and deserves credit for it, but wider cores have been a definite trend starting with the Pentium M vs Pentium 4. There is a trade-off here for die area IMO: AMD/Intel have AVX2 and even AVX512 and SMT on each core, plus narrower cores (with smaller structures, higher frequency), while Apple has wider cores (larger structures, lower frequency, higher IPC). It's not that simple, but it kind of is if you squint a bit.


The i7-1165G7 boosts to 4.7 GHz, 50% higher than the M1. A 75% uplift in IPC (20% more performance at 2/3 the clock speed) compared to Intel’s latest Sunny Cove architecture is enormous. Especially since Sunny Cove is itself the biggest update to Intel’s architecture since Sandy Bridge a decade ago.
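Spelling out the arithmetic behind that 75% figure (treating performance as IPC times clock, and using the ~20% gap and the two boost clocks above):

    \[ \frac{\mathrm{IPC}_{M1}}{\mathrm{IPC}_{i7}} \;=\; \frac{\mathrm{perf}_{M1}/f_{M1}}{\mathrm{perf}_{i7}/f_{i7}} \;\approx\; 1.20 \times \frac{4.7}{3.2} \;\approx\; 1.76 \]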


Like I said, this is absolutely a die-size tradeoff IMO. That 75% IPC gain is only around a ~20% difference in Geekbench at similar sustained power levels. If you want AVX2/512+SMT, a slightly narrower core of realistically 6+ wide decode, with a uOP cache taking it up to 8-wide, is an acceptable tradeoff. We have seen Zen 3 go wider than Zen 1/2[1], so wider x64 designs with AVX/SMT should be coming, but this is the squinting part with TSMC 5nm vs 7nm.

[1] https://www.anandtech.com/show/16214/amd-zen-3-ryzen-deep-di...


Intel’s 10nm is equivalent to TSMC’s 7nm, so we’re just talking one generation on the process side. I don’t think you can chalk a 75% IPC gain to a die shrink. That’s a much bigger IPC uplift than Intel has achieved from Sandy Bridge to Sunny Cove, which happened over 4-5 die shrinks.

The total performance gain, comparing a 4.7 GHz core to a 3.2 GHz core, is 20%. But there is more to it than the bottom line. The conventional wisdom would tell you that increasing clock speed is better than making the core wider because of diminishing returns to chasing IPC. Intel has followed the CW for generations: it has made cores modestly wider and deeper, but has significantly increased clock speed. Intel doubled the size of the reorder buffer from Sandy Bridge to Sunny Cove. Intel increased issue width from 5 to 6 over 10 years.

If your goal was to achieve a 20% speed-up compared to Sunny Cove, in one die shrink, the CW would be to make it a little wider and a little deeper but try to hit a boost clock well north of 5 GHz. It wouldn’t tell you to make it a third wider and twice as deep at the cost of dropping the boost clock by a third. Apple isn’t just enjoying a one-generation process advantage, but is hitting a significantly different point in the design space.


Superscalar vs super-pipelining isn't new. If there's no magic, then a third wider would likely exactly decrease the boost clock by a third with perfect code. With SMT off, I get 25-50% more performance on single-threaded benchmarks; that's because a thread then gets full access to 50% more decode/execution units in the same cycle. It's not that simple again, but that's likely the simplest example.

The M1 is definitely a significantly different point in the design space. Intel is also doing big/little designs with Lakefield, but it's still a bit early to see where that goes for x64. I don't think Intel/AMD have specifically avoided going wider as fast as Apple; AVX/AVX2/AVX512 probably take up more die-area than going 1/3 wider, and that's what they've focused on with extensions over the years. If there is an x64 ISA limitation to going wider, we'll find out, but that's highly unlikely IMO.


> Superscalar vs super-pipelining isn't new. If there's no magic, then a third wider would likely exactly decrease the boost clock by a third with perfect code.

It's not new, but it's surprising. You're correct that going a third wider at the cost of a third of clockspeed is a wash with "perfect code" but the experience of the last 10-20 years is that most code is far from perfect: https://www.realworldtech.com/shrinking-cpu/2/

> The first sign that the party was over was diminishing returns from wider and wider superscalar designs. As CPUs went from being capable of executing 1, to 2, to 4, to even 6 instructions per cycle, the percentage of cycles during which they actually hit their full potential was dropping rapidly as both a function of increasing width and increasing clock rate. Execution efficiency (actual instruction execution rate divided by peak execution rate) dropped with increasing superscalar issue width because the amount of instruction level parallelism (ILP) in most programs is limited.... The ILP barrier is a major reason that high end x86 MPUs went from fully pipelined scalar designs to 2-way superscalar in three years and then to 3-way superscalar in another 3 years, but have been stuck at 3-way issue superscalar for the last nine years.

Theoretical studies have shown that higher ILP is attainable (http://www.cse.uaa.alaska.edu/~afkjm/cs448/handouts/ILP-limi...) but the M1 suggests some really notable advances in being able to actually extract higher ILP in real-world code.

I agree there's probably no real x86-related limitation to going wider, if you've got a micro-op cache. As noted in the study referenced above, I suspect it's the result of very good branch prediction, memory disambiguation, and an extremely deep reorder window. Each of those is an engineering feat. Designing a CPU that extracts 80% more ILP than Zen 3 in branch-heavy integer benchmarks like SPEC GCC is a major engineering feat.


The M1 is a 10W part, no? I would kill to see the 25W M-series chip.

Oh and the 10W is for the entire SOC, GPU and memory included.


Nope. Anandtech measured 27W peak on M1 CPU workloads with the average closer to 20W+[1].

The Ryzen 4800U and i7-1165G7 also have comparable GPUs (and TPU+ISP for the i7) within the same ~15-25W TDP. The Intel i7-1165G7 average TDP might be closer to ~30W because of its 4.7GHz boost clock, but it's still comparable to the M1.

The i7-1165G7 and 4800U have a few laptop designs with soldered RAM. You can get 17hrs+ of video out of a 4800U laptop with a 60Wh battery[2]. Also comparable with i7-1065G7/i7-1165G7 at 15hrs+/50Wh.

[1] https://www.anandtech.com/show/16252/mac-mini-apple-m1-teste...

[2] https://www.pcworld.com/article/3531989/battery-life-amd-ryz...


Wasn’t 27W for the whole Mac Mini machine using a meter at the wall plug? So that includes losses in the power supply and ssd and everything else outside the chip that uses a bit of juice whereas the AMD tdp is just the chip. I thought Anandtech said there was currently no reliable way to do an ‘apples to apples’ tdp comparison?

Edit: quote from anandtech:

“As we had access to the Mac mini rather than a Macbook, it meant that power measurement was rather simple on the device as we can just hook up a meter to the AC input of the device. It’s to be noted with a huge disclaimer that because we are measuring AC wall power here, the power figures aren’t directly comparable to that of battery-powered devices, as the Mac mini’s power supply will incur a efficiency loss greater than that of other mobile SoCs, as well as TDP figures contemporary vendors such as Intel or AMD publish.”


The M1 doesn’t use 24W, it uses 12-16 watts. https://www.anandtech.com/show/16252/mac-mini-apple-m1-teste... (CPU, GPU + DRAM combined)


I have the device with me, and on full load on both CPU and GPU it can go up to 25W, but for most use cases I see the whole SoC hovering around 15W.


M1 is a 20W (max) CPU and a 40W SoC (whole package max).

However, in most intense workloads it doesn't go near 40W, more like ~25W under high load.

Still incredibly impressive.


18W CPU peak power, 10W is their power efficiency comparison point.


The M1 doesn’t use 24W, it uses 12-16 watts. https://www.anandtech.com/show/16252/mac-mini-apple-m1-teste... (CPU, GPU + DRAM combined)

https://www.youtube.com/watch?v=_MUTS7xvKhQ&list=PLo11Rczpzu... Check this out, 12.5W power consumption for the M1 CPU vs. 68W CPU power consumption for the Intel i9 CPU of the 16” Macbook Pro, and yet the M1 is 8% faster in Cinebench R23 in multi-core score.


My naive assumption would be that 4c big + 4c little would perform better than 4c/8t all other things being equal (and assuming software was written to optimize for each design respectively). Also no reason you can't have 4c/8t big + 4c/8t little too.


Apple’s 4big + 4LITTLE config performs better than Intel’s 8c/16t mobile chips right now.


> For example, the delay in the decode stage itself scales quadratically with the width of the decoder: ftp://ftp.cs.wisc.edu/sohi/trs/complexity.1328.pdf (p. 15).

That's a decoder for a single field, where the width of the field is the parameter it scales by. That would be instruction size or smaller, and instructions don't change size depending on how many you decode at once.

And logically once you separate the instructions you can decode in parallel in fixed time, and if all your instructions are 4 bytes then it takes no circuitry to separate them.
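A little sketch of that point (toy code, with a made-up length function standing in for real x86 length decoding): with fixed 4-byte instructions every start offset is known up front, while with variable-length instructions each start depends on all the lengths before it.

    #include <stdio.h>
    #include <stddef.h>

    /* Made-up stand-in: real x86 length decoding has to look at prefixes,
     * opcode, ModRM/SIB, displacement and immediate fields. */
    static size_t toy_len(const unsigned char *p) { return 1 + (p[0] & 7); }

    int main(void) {
        unsigned char fetch[64] = { 3, 0, 0, 0, 0, 0, 0, 0, 1, 0, 2, 0 };
        size_t fixed[8], var[8], off = 0;

        for (int i = 0; i < 8; i++)
            fixed[i] = 4u * (size_t)i;      /* independent: trivially parallel */

        for (int i = 0; i < 8; i++) {       /* serial: start i needs lengths 0..i-1 */
            var[i] = off;
            off += toy_len(fetch + off);
        }

        for (int i = 0; i < 8; i++)
            printf("insn %d: fixed-width @%zu, variable-length @%zu\n",
                   i, fixed[i], var[i]);
        return 0;
    }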

Also: "We found that, at least for the design space and technologies we explored, the quadratic component is very small relative to the other components. Hence, the delay of the decoder is linearly dependent on the issue width."


Your first source does not support your statement.

While there is theoretically a quadratic component, in their words:

> We found that, at least for the design space and technologies we explored, the quadratic component is very small relative to the other components. Hence, the delay of the decoder is linearly dependent on the issue width.


> That still doesn’t explain how Apple built an 8-wide CPU with such deep OOO that operates on 10-15 watts.

Well, because it doesn't; it's ~25 watts. And also because it runs at just 3GHz. You'll see similar power numbers from x86 CPUs at 3GHz, too. The M1's multicore performance vs. the 4800U and 4900HS demonstrates this nicely.


I haven’t read the linked AnandTech article yet, but is there a clear answer why Apple was able to defy common comp arch wisdom (M1 has wider decode which works fine for various applications/code)?


Check the parent article that explains it well. Apple didn’t defy common comp arch wisdom... they applied it.

The reason it is hard for Intel/AMD to do the same is not a lack of engineering geniuses (I’m sure they have plenty), but the need to support a legacy ISA, and a particular business model.

What Apple defies is common business survival instincts: why spend so much on R&D for a chip if there are market leaders that are impossible to beat? The answer seems to be obvious now... but probably it wasn’t obvious when Apple acquired PA Semi in 2008.


> What Apple defies is common business survival instincts: why spend so much on R&D for a chip if there are market leaders that are impossible to beat?

Having your own silicon means the upstream supplier will not be able to turn the lights off on you (Samsung — a company keeping a quarter of its host country's GDP hostage). I believe the immediate goal of the PA Semi purchase was that.

> The answer seems to be obvious now... but probably it wasn’t obvious when Apple acquired PA Semi in 2008.

PA Semi was clearly a diamond in the rough. It took great insight to single out PA Semi, because on the surface it was a very barebones SoC sweatshop, but in reality PA were the last of the Mohicans of US chip design.

PA was the place where non-Intel IC engineers went after the severe carnage among the microchip businesses of US tech giants like Sun, IBM, HP, DEC, SGI, etc.

It was a star team which back then was toiling at router box SoCs.


Just to quantify your adjectives, per the Anandtech article:

> The M1 is really wide (8 wide decode)

In contrast to x86 CPUs, which are 4-wide decode.

> It has a huge 630 deep reorder buffer

By comparison, Intel Sunny/Willow has 352.


Zen 2 has 8-wide issue in many places, and Ice Lake moves up to 6-wide. Intel/AMD have had 4-wide decode and issue width for 10 years, and I'm glad they're moving to wider machines.

Edited "decode" to "issue" for clarity.


Could you explain what you mean by "8-wide decode in many places"? How is that possible? Isn't the instruction decode width kinda always the same, i.e. always 4-wide or always 8-wide, but not sometimes this and sometimes that?

All sources I could find say it is 4-wide, so I'd also be interested if you could perhaps give a link to a source?


https://en.wikichip.org/wiki/amd/microarchitectures/zen_2

The actual instruction decoder is 4-wide. However, the micro-op cache has 8-wide issue, and the dispatch unit can issue 6 instructions per cycle (and can retire 8 per cycle to avoid ever being retire-bound). In practice, Zen 2 generally acts like a 6-wide machine.

Oh, on this terminology: x86 instructions are 1-15 bytes wide (averaging around 3-4 bytes in most code). n-wide decode refers to decoding n instructions at a time.


Thanks for the link! Yeah, that's basically the numbers I also found -- although the number of instructions decoded per clock cycle is a different metric from the number of µop that can be issued, so that feels a bit like moving the goal post.

But, fair enough, for practical applications the latter may matter more. For an apples-to-apples comparison (pun not intended) it'd be interesting to know what the corresponding number for the M1 is; while it is ARM and thus RISC, one might still expect that there can be more than one µop per instruction, at least in some cases?

Of course then we might also want to talk about how certain complex instructions on x86 can actually require more than one cycle to decode (at least that was the case for Zen 1) ;-). But I think those are not that common.

Ah well, this is just intellectual curiosity, at the end of the day most of us don't really care, we just want our computers to be as fast as possible ;-).


I have usually heard the top-line number as the issue width, not the decode width (so Zen 2 is a 6-wide issue machine). Most instructions run in loops, so the uop cache actually gives you full benefit on most instructions.

On the Apple chip: I believe the entire M1 decode path is 8-wide, including the dispatch unit, to get the performance it gets. ARM instructions are 4 bytes wide, and don't generally need the same type of micro-op splitting that x86 instructions need, so the frontend on the M1 is probably significantly simpler than the Zen 2 frontend.

Some of the more complex ops may have separate micro-ops, but I don't think they publish that. One thing to note is that ARM cores often do op fusion (x86 cores also do op fusion), but with a fixed issue width, there are very few places where this would move the needle. The textbook example is fusing DIV and MOD into one two-input, two-output instruction (the x86 DIV instruction computes both, but not the ARM DIV instruction).


x86 doesn't have fixed-width instructions. Depending on the mix you may be able to decode more instructions. And if you target common instructions, you can get a lot of benefit in real-world programs.

Arm is different but probably easier to decode. So you can widen the decoder.


This I think is the real answer; for a long time people were saying that "CISC is just compression for RISC, making a virtue of necessity," but it seems like the M1 serves as a good counterexample where a simpler ISA is scaled up to modern transistor counts (and given exclusive access to the world's best manufacturing, TSMC 5nm).


Considering that x86 is less dense than any RISC ISA, the "compression" argument behind CISC falls apart. No surprise a denser, trivial to decode ISA does better.


You have a source for that? The first google result I found for research on that shows it as denser than almost every RISC ISA [1]. It’s just one study and it predates ARM64 fwiw though.

[1] https://www.researchgate.net/profile/Sally_McKee/publication...


That paper uses no actual benchmarks, but rather grabbed a single system utility and then hand-optimized it; SPEC and geekbench show x86-64 comes in well over 4 bytes on average.


Sure, I never claimed it to be the be-all-end-all, just the only real source I could find. Adding "SPEC" or "geekbench" didn't really help.

Doing a little more digging, I have also found this [1], which claims "the results show that the average instruction length is about 2 to 3 bytes". On the other hand, this [2] finds that the average instruction length is 4.25 bytes.

Bytes per instruction doesn't really say anything useful for code density when talking about RISC vs. CISC though, since (arguably) the whole idea is that individual CISC instructions are supposed to do more than individual RISC instructions. A three instruction CISC routine at five bytes each is still a win over a four instruction RISC routine at four bytes each. Overall code size is what actually matters.

[1] https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.40...

[2] http://www.cs.unc.edu/~porter/pubs/instrpop-systor19.pdf


But there's more work being done per average x86_64 instruction due to RMW ops. Hence why they just look at an entire binary.


OK, I could see how one could implement a variable width instruction decoder (e.g. "if there are 8 one-byte instructions in a row, handle them, otherwise fallback to 4-way decoding" -- of course much more sophisticated approach could be made).

But is this actually done? I honestly would be interested in a source for that; I just searched again and could find no source supporting this (but of course I may have simply not used the right search, I would not be surprised by that in the least). E.g. https://www.agner.org/optimize/microarchitecture.pdf#page216 makes no mention of this for AMD Zen (version 1; it doesn't say anything about Zen 2/3).

I did find various sources which talk about how many instructions / µops can be scheduled at a time, and there it may be 8-way, but that's a completely different metric, isn't it?


As a historical note, the Pentium P6 uses an interesting approach. It has three decoders but only one of them can handle "complex macroinstructions" that require micro-operations from the ROM. If a limited-functionality decoder got the complex instruction, the instruction gets redirected to another decoder the next cycle.

As far as variable-length instructions, a separate Instruction Length Decoder sorts that out before decoding.

Ref: "Modern Processor Design" p268.


> As far as variable-length instructions, a separate Instruction Length Decoder sorts that out before decoding.

And how fast is that able to run on x86? How many instructions can that process at once, compared to an alternate universe where that circuit has the same transistor and time budget but only has to look at the first four bits of an instruction?


Probably should check Agner's guide, but the P6 is still the rough blueprint for everything (except the P4) that Intel has done since.


They were still doing this in the Nehalem timeframe (possibly influence from Hinton?)


I guess this requires extending the architecture to 8-wide instructions when it makes sense.


What do you mean with "8-wide instructions", and what does that have to do with multiple decoders?


So Intel and AMD are capable of building a chip like this, but the ambitious size meant it was more economically feasible for Apple to build it themselves?

(Not a hardware guy)


Neither Intel nor AMD is capable of doing this, for a very basic reason: there is no market for it. You can't just release a CPU for which there is no operating system.

Apple can pull it off because they already own the entire stack from hardware to operating system to cloud services, and they can swap out a component like the CPU for a different architecture and release a new version of the OS that supports it.

Apple, by creating a new CPU, replaces a part of the stack that was owned by Intel with their own, which strengthens their position even if it did not improve performance at all.

Apple is invulnerable to other companies copying the CPU and creating their own, because those companies are not really competition here. Apple sells an integrated product of which the CPU is just one component.


That's not entirely true. Windows ARM64 can execute natively on the M1 (through QEMU for hardware emulation, but the instructions execute natively). Intel/AMD could produce an ARM processor that could find a market. They also have a close partnership with Microsoft and I have to believe there would be a path forward there. They could also target Linux.

I haven't seen enough evidence yet though that ARM is the reason the M1 performs so efficiently. It may just be the fact that it is on a cutting-edge 5nm process with a SoC design. I'm not even sure if the PC/Windows market would adopt such a chip since it lacks upgradability. It's really nice to be able to swap out RAM and/or GPU. Heck, even AMD has been retaining backwards compatibility with older motherboards since it's been on one socket for a while.

I think for laptops/mobile this makes a lot of sense. For a desktop though I honestly prefer AMD's latest Ryzen Zen 3 chips.


> It may just be the fact that it is on a cutting edge 5nm process with a SoC design.

Yup. It's fast because it's got short distances to memory and everything else. Shorten the wire to memory cells and not only can you make signaling faster and run the memory at faster clock speed but you can do it with less accessory hardware for signal conditioning and error correction, which saves complexity and power. Using shorter paths to memory also lets you use lower voltages, which means less waste heat and less need to spend effort on cooling and overall power savings for the chip.

Shortening the wire also lowers latency between all the various on board devices, so communicating everywhere is faster.

There's a reason that manufacturers used to be able to "speed" up a chip by just doing a die shrink - photographically reducing chip masks to make them smaller, which also made them faster with relatively small amounts of work.

As the late Adm. Grace Hopper put it, there are ever so many picoseconds between the two ends of a wire.


> Shortening the wire also lowers latency between all the various on board devices, so communicating everywhere is faster.

A maximum of a few nanoseconds. Not much in comparison to an overall memory system latency.
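Rough numbers behind "a few nanoseconds" (my own back-of-the-envelope, assuming signal propagation in a PCB trace of roughly half the speed of light, ~15 cm/ns, and trace lengths of a few cm):

    \[ \Delta t \;\approx\; \frac{2 \times (5\,\mathrm{cm} - 1\,\mathrm{cm})}{15\,\mathrm{cm/ns}} \;\approx\; 0.5\,\mathrm{ns} \]

i.e. moving the DRAM from a slot a few centimetres away to on-package saves a fraction of a nanosecond of flight time per round trip, which is tiny against an overall DRAM access measured in tens of nanoseconds.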

> Shorten the wire to memory cells and not only can you make signaling faster and run the memory at faster clock speed but you can do it with less accessory hardware for signal conditioning and error correction, which saves complexity and power.

You cannot run away from that with just shorter PCB distance. The circuitry for link training is mandated by the standard.

You will need a redesigned memory standard for that.


Until the late '90s, on-chip wire delays were something we just didn't care much about; speed was limited by gate capacitance, and we got speedups when we shrunk the gate sizes of transistors. After the mid '90s, RC delays in wires started to matter (not speed-of-light delays, but how fast you can shuffle electrons in there to fill up the C). Soon after, it got worse, because wire RC delays don't scale perfectly with shrinks due to edge effects. This was addressed in a number of ways: high-speed systems reduced the R by switching from Al wires to Cu, and tools got better at modeling those delays and doing synthesis and layout at (almost) the same time.


Intel/AMD could produce an ARM processor that could find a market.

Intel did have an ARM processor line, and it did have a market. They acquired the StrongARM line from Digital Equipment and evolved it into the XScale line. What Intel didn't want was for something to eat into its x86 market, and Windows/ARM didn't exist. So they evolved ARM in a different direction than Apple later did. It was very successful in the high-performance embedded market.


"It was very successful in the high-performance embedded market."

As long as you don't define that market as "billions of mobile smartphones".

I remember StrongArms in PDAs back in the early 2000s.

They should have had ready processors for the smartphone, but IIRC they kept pushing x86 on phones.


Fair point. I forgot about the PXA line. I suspect, however, more of the IOP & IXP embedded processors were sold.


>AMD could produce an ARM processor that could find a market.

They did and it couldn't find a market.


IIRC the chip they sort-of released was originally meant for Amazon, but missed targets wildly, leading Amazon to doing one on their own.

Lisa Su put the kibosh on K12 for focus reasons, given how well Zen turned out it was the right call at least for now.


> it lacks upgradability. It's really nice to be able to swap out RAM and/or GPU.

Honestly, it doesn't. You just swap it for a new one and sell the old one.

When you have 8GB, you pay for another 8GB and end up with 16.

In this case, you just sell your SoC with 8GB, and buy another SoC with 16GB. You'll only pay out the difference.

This is pretty much how upgrading a relatively recent phone works too.


Depreciation means you won't pay out only the difference, in most cases.


This is kinda where Apple products thrive. They drop in price very little, and have, ultimately, long lives.

iPhone 7s and iPhone 8s are still great low-end devices, and they reach the right market by being resold by people getting a newer one.

I don't see why M1 laptops would be an exception.


> "Apple can pull it off because they already own entire stack from hardware to operating system to cloud services and the can swap out a component like CPU for a different architecture and release new version of OS that supports it."

Note that this is the same model that Sun Microsystems, DEC, HP, etc. had and it didn't work out for them.

I'd venture to say that it currently only works out for Apple because Intel has stumbled very, very badly and TSMC has pulled ahead in fabbing process technology. If Intel manages to get back on its feet with both process enhancements and processor architecture (and there's no doubt they've had a wake up call), this strategic move could come back to bite Apple.


Only because of Linux, without it in the picture they would still be selling UNIX workstations.


Without Linux, they would've lasted longer but still would've lost out on price/performance against x86 and Intel's tick-tock cadence well before Intel's current stumble. We might all have wound up running Windows Server in our datacenters.


I doubt places like Hollywood would have migrated to Windows, given their dependency on UNIX variants like Irix.


I don't understand, how do these low level changes impact the OS exactly assuming that the ISA remains the same? It doesn't seem much more impactful than SSE/AVX and friends, i.e. code not specifically optimized for these features won't benefit but it'll still work and people can incrementally recompile/reoptimize for the new features.

After all that's pretty much how Intel has operated all the way since the 8086.

It's not like Itanium where everything had to be redone from scratch basically.


Are you referring to Apple's laptop x86 -> ARM change? Entertaining the idea that the ISA is significant here: Surely there would be a big market for ARM chips in the Android and server sides too, so this shouldn't be the only reason why other vendors aren't making competitive ARM chips. Apple's laptop volumes aren't that big compared to those markets.

And of course you have to factor in the large amount of pain that Apple is imposing on its user and ISV base in addition to the inhouse cost of switching out the OS and supporting two architectures for a long time in parallel. A vendor making chips for Android or servers wouldn't have to bear that.


> You can't just release a CPU for which there is no operating system

sure you can. That's what compilers are for.


Intel had such an attitude once before.


Donald Knuth said "The Itanium approach...was supposed to be so terrific—until it turned out that the wished-for compilers were basically impossible to write."[82]

https://en.wikipedia.org/wiki/Itanic

So they didn't have the needed compiler


Surely there were compilers, they just weren't as good (as optimizing) as Intel wished.


Of course there were Itanic-targeting compilers; they worked, just not well enough to deliver on the marketing promise (edit: and what the hardware was theoretically capable of).


I wonder how HP and Microsoft managed to port HP-UX and Windows without a compiler.


That's kind of the point.

Compilers existed just fine to do the porting, and solved that problem.

Intel's failure is that they were unable to solve a different problem because that compiler didn't exist, one that went well beyond merely porting.

In other words, "That's what compilers are for." is a perfectly fine attitude when those compilers exist, and a bad attitude when they don't exist. Porting is the former, making VLIW efficient is the latter.


GP probably means that you won't be able to sell it, even if there is a compiler. (Not true in the super embedded space, sure.)


It's not that it was more economical, but that AMD and Intel would not benefit from at least some of these things due to the ISA: x64 instructions can be up to 15 bytes, so just finding 8 instructions to decode would be costly, and I assume Intel and AMD think more so than the gains from more decoders (you couldn't keep them fed enough to be worth it, basically).


I can't comment on the economics of it but I can comment on the technical difficulties. The issue for x86 cores is keeping the ROB fed with instructions - no point in building a huge OoO if you can't keep it fed with instructions.

Keeping the ROB full falls on the engineering of the front-end, and here is where CISC v RISC plays a role. The variable length of x86 has implications beyond decode. The BTB design becomes simpler with a RISC ISA since a branch can only lie in certain chunks in a fetched instruction cache line in a RISC design (not so in CISC). RISC also makes other aspects of BPU design simpler - but I digress. Bottom line, Intel and AMD might not have a large ROB due to inherent differences in the front-end which prevent larger size ROBs from being fed with instructions.
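To put a rough number on the branch-position point (assuming 4-byte fixed instructions and, say, a 64-byte fetch line; these are illustrative figures, not any specific core's):

    \[ \frac{64\ \mathrm{B\ line}}{4\ \mathrm{B/instruction}} = 16 \ \text{possible branch slots, vs. up to } 64 \ \text{byte positions on x86} \]

so the predecode/BTB logic has far fewer cases to track per fetched line in the fixed-width case.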

(Note that CISC definitely does have its advantages - especially in large code-footprint server workloads where the dense packing of instructions helps - but it might be hindered in typical desktop workloads)

Source: I've worked in front-end CPU micro-architecture research for ~5 years


How do you feel about RISC-V compact instructions? The resulting code seems to be 10-15% smaller than x86 in practice (25-30% smaller than aarch64) while not requiring the weirdness and mode-switching associated with thumb or MIPS16e.

Has there actually been much research into increasing instruction density without significantly complicating decode?

Given the move toward wide decoders, has there been any work on the idea of using fixed-size instruction blocks and huffman encoding?


I can't really comment on the tradeoffs between specific ISAs since I've mainly worked on micro-arch research (which is ISA agnostic for most of the pipeline).

As for the questions on research into looking at decode complexity v instruction density tradeoff - I'm not aware of any recent work but you've got me excited to go dig up some papers now. I suspect any work done would be fairly old - back in the days when ISA research was active. Similar to compiler front-end work (think lex, yacc, grammar etc..) ISA research is not an active area currently. But maybe it's time to revisit it?

Also, I'm not sure if Huffman encoding is applicable to a fixed-size ISA. Wouldn't it be applicable only in a variable size ISA where you devote smaller size encoding to more frequent instructions?


Fixed instruction block was referring to the Huffman encoding. Something like 8 or 16kb per instruction block (perhaps set by a flag?). Compilers would have to optimize to stay within the block, but they optimize for sticking in L1 cache anyway.

Since we're going all-in on code density, let's go with a semi-accumulator 16-bit ISA. 8 bits for instructions, 8 bits for registers (with 32 total registers). We'll split into 5 bits and 3 bits. 5-bits gives access to all registers since quite a few are either read-only (zero register, flag register) or write occasionally (stack pointer, instruction counter). The remaining 3 bits specify 8 registers that can be the write target. There will be slightly more moves, but that just means that moves compress better and seems like it should enforce certain register patterns being used more frequently which is also better for compression.

We can take advantage of having 2 separate domains (one for each byte) to create 2 separate Huffman trees. In the worst case, it seems like we increase our code size, but in more typical cases where we're using just a few instructions a lot and using a few registers a lot, the output size should be smaller. While our worst-case lookup would be 8 deep, more domain-specific lookup would probably be more likely to keep the depth lower. In addition, two trees means we can process each instruction in parallel.

As a final optimization tradeoff, I believe you could do a modified Huffman that always encoded a fixed number of bits (e.g., 2, 4, 6, or 8), which would halve the theoretical decode time at the expense of an extra bit on some encodings. It would be +25% for 3-bit encoding, but only 16% for 5-bit encoding (perhaps step 2, 3, 4, 6, 8). For even wider decode, we could trade off a little more by forcing the compiler to ensure that each Huffman encoding breaks evenly every N bytes so we can set up multiple decoders in advance. This would probably add quite a bit to compiling time, but would be a huge performance and scaling advantage.

Immediates are where things get a little strange. The biggest problem is that the immediate value is basically random so it messes up encoding, but otherwise it messes with data fetching. The best solution seems to be replacing the 5-bit register address with either 5 bits of data or 6 bits (one implied) of jump immediate.

Never gave it too much thought before now, but it's an interesting exercise.
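For what it's worth, here is a toy decoder for the stepped-prefix idea a couple of paragraphs up: every code is 2, 4, 6 or 8 bits, and an all-ones 2-bit group means "keep reading". The code assignment is invented purely for illustration, not a proposal for real opcodes.

    #include <stdint.h>
    #include <stdio.h>

    /* Read `nbits` bits starting at bit position *pos (MSB-first in each byte). */
    static unsigned get_bits(const uint8_t *buf, unsigned *pos, unsigned nbits) {
        unsigned v = 0;
        for (unsigned i = 0; i < nbits; i++, (*pos)++)
            v = (v << 1) | ((buf[*pos >> 3] >> (7 - (*pos & 7))) & 1u);
        return v;
    }

    /* Codes 00/01/10 are 2-bit symbols; "11" means a longer code follows, so
     * 1100/1101/1110 are 4-bit symbols, and so on up to 8 bits.  Worst case is
     * four 2-bit steps, which is the "halve the decode depth" trade-off above. */
    static unsigned decode_symbol(const uint8_t *buf, unsigned *pos) {
        for (unsigned level = 0; ; level++) {
            unsigned group = get_bits(buf, pos, 2);
            if (level == 3)          /* 8 bits consumed: all four groups terminal */
                return 9 + group;
            if (group != 3)          /* anything but "11" terminates the code */
                return level * 3 + group;
        }
    }

    int main(void) {
        const uint8_t buf[] = { 0x72 };   /* bits 01|1100|10 -> symbols 1, 3, 2 */
        unsigned pos = 0;
        for (int i = 0; i < 3; i++)
            printf("symbol %u\n", decode_symbol(buf, &pos));
        return 0;
    }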


Not necessarily. Samsung used to make custom cores that were just as large if not larger than Apple’s (amusingly the first of these was called M1).

Unfortunately, Samsung’s cores always performed worse and used significantly more power than the contemporary Apple cores.

Apple’s chip team has proven capable of making the most of their transistor budget, and there’s reason to believe neither Intel nor AMD could achieve Apple’s efficiency even if they had the same process, ISA, and area to work with.


> there’s reason to believe neither Intel nor AMD could achieve Apple’s efficiency even if they had the same process, ISA, and area to work with.

From what I have seen the only difference in efficiency is the manufacturing process. The M1 consumes about as much power per core as a Ryzen core. AMD also has a mobile chip with 8 homogeneous cores that has around the same TDP as the M1.


> AMD also has a mobile chip with 8 homogeneous cores that has around the same TDP as the M1.

TDP is nowhere near actual load power use.


> and there’s reason to believe neither Intel nor AMD could achieve Apple’s efficiency even if they had the same process, ISA, and area to work with.

What's that reason?


> What's that reason?

Apple’s efficiency is based on a very wide and deep core that operates at a somewhat lower clock speed. Frequent branches and latency for memory operations can make it difficult to keep a wide core fully utilized. Moreover, wider cores generally cannot clock as high. That’s why Intel and AMD have chosen to pursue narrower cores that can clock near 5 GHz.

The maximum ILP that can be extracted from code can be increased with better branch prediction accuracy, larger out of order window size, and more accurate memory disambiguation: http://www.cse.uaa.alaska.edu/~afkjm/cs448/handouts/ILP-limi.... The M1 appears to have made significant advances in all three areas, in order for it to be able to keep such a wide core utilized.


What you write makes sense but it does not address why AMD and Intel could not do the same "even if they had the same process, ISA, and area to work with."


They wouldn’t have Apple’s IP relating to branch prediction, memory disambiguation, etc.


I think it's "faith".


Faith implies no data.

Why will Apple always outcompete Intel and other non-vertically-integrated systems? Margins, future potential, customer relationship, and compounding growth/virtuous cycle.

The margins logic is simple: iPhones and MacBooks make tons more money per unit compared with a CPU. Imagine if improving the performance of a CPU by 15% makes demand increase by 1%. For Apple, improving the performance of a CPU by 15% makes demand increase by 1% for the whole iPhone or MacBook. For this reason alone, Apple can invest 2-5x more R&D into their chips than everyone else.

The future potential logic is more nuanced:

1. Intel's/whoever's 10-year vision is to build a better CPU/GPU/RAM/Screen/Camera because their customers are the companies buying CPUs/GPUs/Screens/Cameras/RAM. They are focused on the metrics the market has previously used to measure success and want to optimize for those metrics, e.g. performance per dollar. Intel doesn't pay for the electricity in the datacenter, nor does it field its customers' complaints about battery life. RAM manufacturers aren't looking at Apple's products and asking, "do consumers even still replace RAM?" i.e. they are focused on "micro"-trends.

2. Apple's vision is to build the best product for customers. They look at "macro"-trends into the future and apply their personal preferences at scale. For example, do people even still need replaceable RAM? Will they want 5G in the future, or can we improve the technology to replace it with direct connections to a LEO satellite cluster?

The customer relationship logic:

Let's take one such example of a macro-trend: VR and other wearables. Apple is tracking these trends and can "bet on" them because it's in full control, but Nvidia, Intel, etc. typically don't want to "bet on" these numbers because even if they are fully invested, their partners (which sell to consumers) might back out. Apple also isn't really "betting," because it has a healthy group of early adopters that trust Apple and will buy and try it even though a "better" product in the same market segment goes unpurchased. Creating/retaining that customer relationship lets Apple over-invest in keeping heat (i.e. power) low, because it's thinking about the whole market segment that Apple's VR headset can start to compete in and collect more revenue from.

Compounding growth/virtuous cycle logic is also relatively simple:

Improving the metrics in any of these three pillars in turn improves the other pillars, i.e. a better customer relationship increases cash flow, which increases R&D funding, which 1. improves the product, improving the customer relationship; or 2. reduces costs, increasing margins, and loops back to increasing cash flow.


Read the article linked, it explains why Intel and AMD are unable to throw more decoders at the problem.


The problem is the market.

Windows only supports a single architecture in practice, so they can't really deviate from that. Sure, Windows can switch (or, apparently, run on ARM), but because Windows applications are generally distributed as binaries, lots of apps wouldn't work.

Linux users would have far less issues, and would be a great clientele for a chip like this, but probably too niche a market, sadly.


Windows only supports a single architecture in practice

People forget that at launch Windows NT ran on MIPS & DEC Alpha in addition to x86. The binary app issue was a killer for the alternative archs.


Pragmatically, Windows runs on a single architecture.

Sure, there's been editions for other architectures, but they're more anecdotal experiments than something usable.

I can go out and buy several weird ARM or PPC devices and run Linux or OpenBSD on them, and run the same stuff I use on my desktop regularly (except Steam).

The fact that Windows relies on a stable ABI is its major anchor (while Linux only guarantees a stable API).


they're more anecdotal experiments than something usable

Wrong. Microsoft explicitly set out multi-architecture support as a design goal for NT. MIPS was the original reference architecture for NT. Microsoft even designed their own MIPS based systems to do it (called 'Jazz'). There was a significant market for the Alpha port, especially in the database arena, and it was officially supported through Windows 2000. They were completely usable, production systems sold in large numbers.

In the end, the market didn't buy into the vision. The ISVs didn't want to support 3+ different archs. Intel didn't like competition. The history is all pretty well documented should one take the time to learn it.


> Microsoft explicitly set out multi-architecture support as a design goal for NT

They set it as a design goal, but that doesn't mean that they achieved it.


Except they did, though apparently you missed it. MIPS was the original port. Alpha was supported from NT 3.1 through Windows 2000, and only died because DEC abandoned the Alpha, not because Microsoft abandoned Alpha (it was important to their 64-bit strategy). Itanium was supported from Windows 2003 to 2008R2. Support for Itanium only ended at the beginning of this year, once again because the manufacturer abandoned the chip.

I'm sure you can redefine "achieve" to exclude almost 17 years of support (for Itanium), if you're that committed to being right. Heck, x86-64 support has "only" been around for 20 years or so. Doesn't make it right.


Dec Alpha servers running NT in production used to be a thing.


Linux has ABI guarantees.


DEC Alpha NT could run X86 code thanks to FX!32, and faster than a core you could buy from Intel at the time.


Well, for some things. FX!32 was deficient for the apps people actually wanted, though. The NT 3.1-era Alphas didn't have byte-level memory operations, so things like Excel, Word, etc. all ran terribly, as did Emacs and X. I supported a lab of Alphas running Ultrix and they were dogs for anything interactive and fantastic for anything that was a floating point application.


Yeah...anyone who thinks fx32 was faster in the real world than a native Intel core never actually ran it.


Indeed, but it didn't have anything that justified actually paying big bucks for an Alpha.


x86 instructions are variable length with byte granularity, and the length isn’t known until you’ve mostly decoded an instruction. So, to decode 4 instructions in parallel, AIUI you end up simultaneously decoding at maybe 20 byte offsets and then discarding all the offsets that turn out to be in the middle of an instruction.

So the Intel and AMD decoders may well be bigger and more power hungry than Apple’s.
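A sketch of that brute-force scheme (with a made-up length function standing in for real x86 length decoding): step 1 decodes a length at every byte offset, which is independent work a chip can do in parallel; step 2 is the serial pass that keeps only the offsets actually reachable from the start of the block and discards the rest.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    #define FETCH 16   /* bytes per fetch block, purely illustrative */

    /* Made-up stand-in: real x86 needs prefixes, opcode maps, ModRM/SIB,
     * displacement and immediate sizes before it knows the length. */
    static size_t toy_length(const unsigned char *p) { return 1 + (p[0] & 3); }

    static void decode_block(const unsigned char *fetch) {
        size_t lens[FETCH];
        bool   is_start[FETCH] = { false };

        for (size_t off = 0; off < FETCH; off++)   /* "decode" at every offset */
            lens[off] = toy_length(fetch + off);

        for (size_t off = 0; off < FETCH; off += lens[off])
            is_start[off] = true;                  /* keep only reachable starts */

        for (size_t off = 0; off < FETCH; off++)
            if (is_start[off])
                printf("instruction at +%zu, length %zu\n", off, lens[off]);
    }

    int main(void) {
        const unsigned char block[FETCH] =
            { 3, 0, 0, 0, 1, 0, 2, 0, 0, 3, 0, 0, 0, 1, 0, 0 };
        decode_block(block);
        return 0;
    }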


Maybe they are, assuming there's sufficient area in the die for this.

They would likely still be massive power hogs.


But in one x86 instruction you often have more complex operations. Isn't that part of the reason why Sunny Cove has only 4 wide decode but still the decoders can yield 6 micro-ops per cycle? That single stat makes it look worse than it is in reality, I think.


The whole principle of CISC (v RISC) is that you have more information density in your instruction stream. This means that each register, cache, decode unit, etc. is more effective per unit area & time. Presumably, this is how the x86 chips have been keeping up with fewer elements in terms of absolute # of instructions optimized for. The obvious trade-off being the decode complexity and all the extra area that requires. One may argue that this is a worthwhile trade-off, considering the aggregate die layout (i.e. one big complicated area vs thousands of distributed & semi-complicated areas) and economics of semiconductor manufacturing (defect density wrt aggregate die size).


Except that the RISC-V ISA manages to reach information density on par with x86 via a simple, backwards-compatible instruction compression scheme. It eats up a lot of coding space, but they've managed to make it work quite nicely. ARM64 has nothing like that; even the old Thumb mode is dead.


You mention most of the big changes, except one. Assuming a random (but TLB friendly) pattern the M1 manages a latency of around 30-33ns to main memory, about half of what I've seen anywhere else. Impressive.

Maybe motherboards should stop coming with DIMMs and use the Apple approach to get great bandwidth and latency, and come in 16, 32, and 64GB varieties by soldering LPDDR4x on the motherboard.


> Assuming a random (but TLB friendly) pattern the M1 manages a latency of around 30-33ns to main memory, about half of what I've seen anywhere else. Impressive.

Cite your number please. Anandtech measured M1's memory latency at 96ns, worse than either a modern Intel or AMD CPU: "In terms of memory latency, we’re seeing a (rather expected) reduction compared to the A14, measuring 96ns at 128MB full random test depth, compared to 102ns on the A14."

https://www.anandtech.com/show/16252/mac-mini-apple-m1-teste...

vs. "In the DRAM region, we’re measuring 78.8ns on the 5950X"

https://www.anandtech.com/show/16214/amd-zen-3-ryzen-deep-di...


Well my comment mentioned "random (but TLB friendly)", which I define as visiting each cache line exactly once, but only with a few (32-64) pages active.

The reason for this is I like to separate out the cache latency to main memory from the TLB-related latencies. Certainly there are workloads that are completely random (thus the term cache thrashing), but there are also many workloads that only have a few tens of pages active. Doubly so under Linux, where if needed you can switch to huge pages if your workload is thrashing the TLB.

For a description of the Anandtech graph you posted see: https://www.anandtech.com/show/14072/the-samsung-galaxy-s10p...

So for the cache/TLB-friendly line (the "R per RV" range), the 5950X latency is on the order of 65ns; the similar line for the M1 is dead on 30ns at around 128KB and goes up slightly in the 256-512KB range. Sadly they don't publish the raw numbers, and pixel counting on log/log graphs is a pain. However, I wrote my own code that produces similar numbers.

My numbers are pretty much a perfect match, if my sliding window is 64 pages (average swap distance = 32 pages) I get around 34ns. If I drop it to 32 pages I get 32ns.

So the M1, assuming a relatively TLB friendly access pattern only keeping 32-64 pages active is about half the latency of the AMD 5950.

So compare the graphs yourself and I can provide more details on my numbers if still curious.
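For anyone who wants to reproduce the flavor of this, here is a rough single-file sketch of that kind of measurement (my code, not the parent's; line size, page size and window size are assumptions, and it uses a blocked approximation of the sliding window: each cache line is visited exactly once, but jumps stay inside a 64-page chunk at a time, so only a handful of TLB entries are hot while the buffer stays far larger than the caches).

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define LINE   128                 /* assumed cache-line size (M1)       */
    #define PAGE   16384               /* assumed page size (Apple Silicon)  */
    #define WINDOW 64                  /* pages kept "hot" at a time         */
    #define NWIN   64                  /* 64 windows x 64 pages x 16K = 64MB */

    int main(void) {
        size_t total = (size_t)NWIN * WINDOW * PAGE;
        size_t lines_per_win = (size_t)WINDOW * PAGE / LINE;
        char *buf = aligned_alloc(PAGE, total);
        size_t *order = malloc(lines_per_win * sizeof *order);
        char **prev = NULL, *first = NULL;
        srand(1);

        for (size_t w = 0; w < NWIN; w++) {
            char *base = buf + w * (size_t)WINDOW * PAGE;
            for (size_t i = 0; i < lines_per_win; i++) order[i] = i;
            for (size_t i = lines_per_win - 1; i > 0; i--) {   /* Fisher-Yates shuffle */
                size_t j = (size_t)rand() % (i + 1), t = order[i];
                order[i] = order[j]; order[j] = t;
            }
            for (size_t i = 0; i < lines_per_win; i++) {       /* thread the pointer chain */
                char *line = base + order[i] * LINE;
                if (prev) *prev = line; else first = line;
                prev = (char **)line;
            }
        }
        *prev = first;                                         /* close the cycle */

        size_t iters = 1u << 26;                               /* ~67M dependent loads */
        char *p = first;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < iters; i++) p = *(char **)p;    /* load-to-load chain */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%.1f ns per dependent load (sink %p)\n", ns / iters, (void *)p);
        free(order); free(buf);
        return 0;
    }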


This reminds me of the Amiga which had FastRAM and ChipRAM. It was all main memory, but the ChipRAM could be directly addressed by all the co-processor HW in the Amiga and the FastRAM could not.

It would be sort of interesting for Intel/AMD to do something like this where they have 16GB on the CPU and the OS has the knowledge to see it differently from external RAM.

Apple is going to have to do this for their real "Pro" side devices as getting 1TB on the Mx will be a non-starter. I would expect to see the M2 (or whatever) with a large amount of the basic on chip RAM and then an external RAM bus also.


Dunno, rumors claim 8 fast cores and 4 slow cores for the follow up. With some package tweaks I think they could double the ram to 32GB inside the package and leave the motherboard interface untouched.

I do wonder how many use cases actually need more than 32GB when you have low-latency NVMe flash with 5+ GB/sec of bandwidth. Especially with the special magic sauce I've seen mentioned related to hardware acceleration for compressing memory, or maybe it was compressing swap.

In any case, I'm not expecting the top of the line in the next releases. Step 1 is the low end (MBA, MBP 13", and Mini). Step 2 is the mid range, rumored to be MBP 14.1" and 16" models in the first half of 2021. After that presumably the Mac Pro desktop and iMacs "within 2 years". Maybe step 3 will be a dual-socket version of step 2 with, of course, double the RAM.


So I am not one of those that screamed about the 16GB limit which was a huge number of comments here on HN. That being said I do know people in the creative industry that have Mac Pros with 1.5TB of RAM and use all of it and hit swap. For a higher end Pro laptop I would be happy at the 32GB range. However in something like an 8K display (which will be coming!) iMac I would like to see 128GB which will not fit on chip. They are going to have to go to a 2 level memory design at some point.


Maybe, or just move the memory chips from the CPU package to the motherboard.


Oh, that is very much something they could do, but given that they control the OS completely, it would be very interesting to keep both on-chip and off-chip RAM and let the software understand which RAM is where, so application developers can tweak placement. For example, let's say you are editing a very large 8K stream and you tell the app, hey, load this video into RAM. You could put the part that is in the current edit window in the on-chip RAM and feed the rest of the video in from the second-level RAM as the editor moves forward. There are some interesting possibilities here.

Also, from the ASIC yield view it allows for some interesting binning of chips. Let's say the M2 has 32GB on chip plus an off-chip memory controller. They could use the ones that pass the full memory test in the high end, then bin the ones that fail as 16GB parts for a laptop, etc. Part of keeping ASIC cost down is building the exact same thing and binning the chips into devices based on yield.


Unless you are doing some crazy synthesis or simulation, 32GB is plenty.

Maybe editing 4K (or in the future, 8K) video might need more?

My brother does a lot of complex thermal airflow simulation, and his workstation has 192GB of RAM, but that is an extreme use case.


8GB MacBook Air can easily handle 4K.

And it can handle 8K for 1-2 streams and starts to lag at 4+.


The Amiga was never multi-core. It has Vampire accelerators to replace the 68K chips, and PowerPC upgrade cards.

Apple, in making the M1 chip, is using some of the Amiga IP circa 1985 that sped up the system by having the CPU, GPU, etc. share memory. Amiga shattered into different companies, but if they hadn't gone out of business they would have made an M1-type chip for their Amiga brand.


> Assuming a random (but TLB friendly) pattern the M1 manages a latency of around 30-33ns to main memory.

This, right here. It also helps that the L1D is a whopping 128 kB and only 3 cycles of load-use latency.


This is the first I've heard of this. This alone, plus unified memory in general, I bet explains 60% of the performance difference.


I wonder how they managed that.


Huge block size (128 bytes). Probably they are using POWER7-like scheduling (i.e. the scheduler works on packs of instructions; that might explain the humongous 600+ entry ROB. Certainly the wake-up logic can't deal with that one-by-one at such low power). If you combine that with JIT and/or good compilers, you get this. I guess only Apple can pull off this trick: they control the whole stack (and some key POWER architects are working there).


Big cache lines and big pages together. 16 kB pages combined with 128-byte lines means it can be 8-way set associative and still take advantage of a VIPT structure.
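To sanity-check the arithmetic (for an alias-free VIPT cache what matters is cache size per way vs. page size; the line size cancels out of that inequality):

  /* VIPT constraint: the index bits must fit inside the page offset,
   * i.e. cache_size / ways <= page_size.  Bigger pages mean you can get
   * away with lower associativity for the same cache size. */
  #include <stdio.h>

  int main(void) {
      unsigned cache   = 128 * 1024;  /* M1 L1D       */
      unsigned page16k = 16 * 1024;   /* 16 kB pages  */
      unsigned page4k  = 4 * 1024;    /* 4 kB pages   */
      printf("min ways with 16 kB pages: %u\n", cache / page16k);  /* 8  */
      printf("min ways with  4 kB pages: %u\n", cache / page4k);   /* 32 */
      return 0;
  }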

Larger pages mean that performance on memory-mapped small files will suffer... which is a use-case that Apple doesn't normally care about in its client computers.

Larger cache lines mean that highly multithreaded server loads could suffer from false sharing more often. Again, it's a client computer, so who cares?

Regarding the definition of "huge": the A64FX uses 256B cache lines. Granted, it's a numerical-computing vector machine, but still. "Huge" covers a lot of ground.


The NVIDIA Carmel cores on 12nm had a 64KB L1D cache with a 2 cycles latency.


Means nothing without saying what the clock goes at.


2.26GHz, on a quite old process.


Latencies like this are doable with a lot of tuning on Intel CPUs; out of the box you'll get to the 40s with fast memory. And those CPUs have three cache levels instead of two...

A good old-fashioned 2010-era gaming PC would already get down to around 50 ns levels.

It's definitely really good, but considering it's rather fast RAM (DDR4 4266 CL16) and doesn't have L3 it's not that surprising.


Apple M1 has three cache levels:

- for big cores, private: 128KB L1D

- for big cores, shared within a cluster: 12MB L2

- system-level cache (shared between everything, CPU clusters, GPU, neural engine...): 16MB

and then you reach RAM.


I've written a benchmark to measure such things, and from what I can tell:

Each fast core has a L1D of 128KB.

The fast cores share a 12MB L2 per cluster; cache misses go to main memory.

The slow cores have a 4MB L2.

The cache misses from the fast cores' L2 can't quite saturate the main memory system (I believe it's 8 channels of 16 bits). So when all cores are busy you keep 12MB of L2 for the fast cores and 4MB of L2 for the slow cores, and you end up getting better throughput from the memory system since you are keeping all 8 channels busy.


Wonder if the SLC is mostly used for coherency purposes and the other blocks then...

And yeah, it's 128-bit wide LPDDR4X-4266, pretty quick imo.


Not just 128 bits wide (standard on high end laptops and most desktops), but 8 channels. The latency is halved and over the last decades I've only been seeing very modest improvements in latency to main memory on the order of 3-5% a year.


Or maybe use the best of both worlds, with soldered-in ultra fast ram, plus large amount of dimm ram.

Same as you can have a big storage drive and a smaller NVMe one.


> Or maybe use the best of both worlds, with soldered-in ultra fast ram

That's basically what L3 cache is on Intel & AMD's existing CPUs. You could add an L4, but at some point the amount of caches you go through is also itself a bottleneck, along with being a bit ridiculous.


The way I see it, you could have a Mac Pro with (let’s say) 32GB of super-fast on-package RAM and arbitrarily upgradable DIMM slots. The consequence would be that some RAM would be faster and some would be a bit slower.

They would be contiguous, not layered.


The non-uniform memory performance of such a solution would be a software nightmare.


Doesn't seem much different than various multichip or multisocket solutions where different parts of memory have different latencies, called NUMA. Basically the OS keeps track of how busy pages are and rebalances things for heavily used pages that are placed poorly.

Similarly, Optane (in DIMM form) is basically slow memory, and OSes seem to handle it fine. NUMA support seems pretty mature today and handles common use cases well.
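For a rough idea of what that already looks like to software, a sketch using libnuma on Linux (assuming the fast and slow pools appear as separate NUMA nodes; the node numbers are placeholders):

  /* Place latency-critical allocations on the fast node and bulk data on
   * the slow node.  Uses libnuma (numa.h); link with -lnuma. */
  #include <numa.h>
  #include <stdio.h>

  int main(void) {
      if (numa_available() < 0) {
          fprintf(stderr, "no NUMA support\n");
          return 1;
      }
      size_t sz = 64UL * 1024 * 1024;
      void *hot  = numa_alloc_onnode(sz, 0);  /* e.g. fast on-package RAM    */
      void *cold = numa_alloc_onnode(sz, 1);  /* e.g. slower DIMM-backed RAM */
      /* ...put latency-critical structures in hot, bulk data in cold... */
      numa_free(hot, sz);
      numa_free(cold, sz);
      return 0;
  }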

With all that said, apple could just add a second CPU to double the ram and cores, seems like a great fit for a Mac Pro.


It doesn't seem any worse than existing NUMA systems today, where memory latency depends on what core you're running on. In contrast, the proposed system would have the same performance for on-board vs plugged DIMM regardless of which CPU is accessing it, which simplifies scheduling — from a scheduling perspective, it's all the same. I think that's easier to work with than e.g. Zen1 NUMA systems.


OSes have had this problem solved for decades; the solution is called "swap files". You could naively get any current OS working in a system with fast and slow RAM by simply creating a ramdisk on the slow memory address block and telling the OS to create a swap file there.


> OSes have had this problem solved for decades; the solution is called "swap files".

What operating systems handle NUMA memory through swapping? The only one I'm familiar with doesn't use a swapping design for NUMA systems, so I'm curious to learn more.


Not really the best idea for the kind of speed baselines and differences discussed here. You can use better ideas like putting GPU memory first in the fast part then the rest in the slow area. You know, like XBox Series does.


Yet Apple is managing excellent performance with just an L1 + L2.


But the context of this thread is that it is being done with soldered RAM. I don't know how much that matters, just pointing out that you are taking the conversation in a circle.


>"Maybe motherboards should stop coming with dimms and use the apple approach to get great bandwidth and latency and come in 16, 32, and 64GB varieties by soldering LPDDR4x on the motherboard."

I prefer the upgradeability.


> I prefer the upgradeability.

The market doesn't care about such niche concerns, but it'll not flip completely overnight.

I like the idea of upgradeability too, but when the trade-off is such great performance, I'm not going to give that up. It would be different if the performance numbers were not as stark.

The open question is how long this performance advantage will be sustained. If it drops off, then concerns like upgradeability may become higher priority (and an opportunity for hardware vendors).


Yes, this is a niche concern... called environmental protection. The new stuff cannot be upgraded, so when the amount of RAM stops being sufficient, the old computer needs to be recycled (a modern word for throwing something into the waste bin together with all the CO2 that was emitted when the computer was produced, not to mention the environmental costs of digging up rare earths, etc.).

I am still able to use my Lenovo Thinkpad 510T only because I could easily replace the HDD with a cheap, stock Samsung EVO SSD and throw in more RAM.

The absurdity of Apple's approach is that the Mac Mini with a 512 GB SSD is $200 more expensive than the one with 256 GB. An extra 256 GB for $200 is a crazy price, so Apple basically says: hey, pay us a lot, lot more so maybe you can use our stuff a bit longer; but in fact we actively discourage you from doing this, since we want you to buy the cheaper model, and in two or three years you will need to buy a new, fancier model.

But Tim Cook will tell you at length how much he cares about humanity, the environment, and CO2 emissions. Maybe he will even fly his private jet to some conference to tell people how awful all that oil & gas & coal industry is.


On-package RAM and upgradability are not at odds with each other. If upgradability were desired, we could see socketed SoCs. This is one of the things modular phones (Project Ara) were about.

Just because RAM was one of the last holdouts of upgradability does not make it inherently more suitable for upgradability.

The problem is a lack of interest in manufacturing repairable and upgradable hardware. It is simply less profitable.


>"The market doesn't care about such niche concerns, but it'll not flip completely overnight."

I do not care all that much about the market either. When I need something it always seems to be there, and I do not mind if it is not produced by the biggest guy on the block. If at some point I'm not able to find what I need I'll deal with it, but so far that has not happened.


Bandwidth is comparable with other high-end laptop chipsets (I've seen 60-68 GB/s quoted, and recent Ryzens are 68 GB/s). Is the on-chip latency a big factor in the single core performance?


Depends on the workload. Compilers are famous for branch-heavy code and random lookups... something people have reported excellent M1 performance on. Parsing is hard as well (like, say, JavaScript).

Of course for any CPU workload it's going to be harder on the memory system when you have video going on. Doubly so for a video conference (compression, decompression, updating the video frames, streaming the results, network activity, reading from the camera, etc).

Seems like the Apple memory system wouldn't have received as much R&D as it did unless Apple thought it was justified. Clearly the speed, performance, and price show that Apple made quite a few good decisions.


Memory usage has definitely stalled over the last decade as more applications move to the web or mobile devices.

There's just nothing driving most people to have more than 8/16GB of RAM and even photo/video editing has been shown to be a breeze on the 8GB MacBook Air.

I wouldn't be surprised to see laptops move to soldered RAM and SSDs.


> Memory usage has definitely stalled over the last decade as more applications move to the web or mobile devices.

I beg you to look at memory consumption of those "lightweight" web apps


We should definitely not continue Apple's approach of soldering every damn component, even if it comes at the cost of performance


> even if it comes at the cost of performance

Why? What's the purpose of artificially limiting performance when one doesn't need the upgradability?

I've, personally, never upgraded the RAM on any system I've built or carried it to a new motherboard with a new socket. I'm absolutely the target audience for this. I would love this increased performance, as long as it wasn't some surprise. Having the extra plastic on the motherboard is literally e-waste for me. Don't touch my PCI-e slots though.


Used to be I'd upgrade my MBP memory and hard drive to eke out one more year between upgrades. The drive could always come back and be reused as a portable drive, and the best memory for an old machine was typically cheap enough by then that it wasn't that big of a deal.


The best present is receiving something you never knew you needed until you get it, so I love giving RAM (and SSD) for birthdays! That you can keep the same computer but that it simply becomes faster is a nice surprise for many.


Components have flaws, or they break down over time, and soldering components hampers repair and reuse.


I suggest you look up "integrated circuits" and "system on a chip", which is where all of our performance/power improvements have come from. You're in for a shock when it comes to repairability!


Not sure why you're being downvoted, it's completely true! If the SSD in my computer dies, I can just buy another one for cheap (500GB for what, 80 dollars?).

If the SSD in my Macbook/Mac Mini dies, either I can buy a new motherboard, or more likely, a new device. It is not economical nor ecological.

Also, paying 200 dollars for additional 256GB of storage? WTF.


Dunno, increasingly with machine learning, more cores, GPUs, etc the bottleneck is the memory system. How much are you willing to pay for a dimm slot?

Personally I'd rather have half the latency, more bandwidth, and 4x the memory channels instead of being able to expand ram mid life.

However I would want the option to buy 16, 32, and 64GB up front, unlike the current M1 systems that are 8 or 16GB.


Then, make desktops/laptops with 4 or 8 channels. We'd need more dimms, of a smaller size.


Only if you use dimms. If you use the LPDDR4x-4266 each chip has 2 channels x 16 bits. So the M1 has 4 chips and a total of eight 16 bit wide channels.
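Which, back of the envelope, lines up with the bandwidth figures quoted elsewhere in the thread:

  /* 8 channels x 16 bits = 128-bit bus at LPDDR4X-4266 transfer rates. */
  #include <stdio.h>

  int main(void) {
      double channels = 8, bits = 16, mts = 4266e6;
      double bytes_per_transfer = channels * bits / 8;              /* 16 bytes */
      printf("peak ~%.1f GB/s\n", bytes_per_transfer * mts / 1e9);  /* ~68.3    */
      return 0;
  }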


Does the 8GB variant have all 8 channels or just 4?


My guess is it's the same and using the half density chip in the same family, but I'm just guessing.


That extra memory will cost you an arm and a leg.


My understanding is that the LPDDR4x chips cost less per GB than the random chips you find in common DIMMs. There are also costs (board space, part cost, motherboard layers, and layout complexity) for DIMM slots.

Sure, manufacturers might try to charge significantly more than market price for on-motherboard RAM, but it's an opportunity to increase their profit margin and ASP. Random 2x16GB DIMMs on Newegg cost $150 per 32GB. Apparently LPDDR is easier to route, requires less power, and costs less for the same amount of RAM. I'd happily pay $500 for a motherboard with 64GB of LPDDR4x-4266. Seems like Asus, Gigabyte, Tyan, Supermicro and friends would MUCH rather sell a $500 motherboard with RAM than a $150 motherboard without.


The normal rate (not contract price) for LPDDR4/LPDDR4X and LPDDR5 is roughly double the cost of regular DRAM per GB. Depending on channels and package, the one used in the M1 is likely even more expensive as they fit 4 channels per chip. DIMMs and board space add very little to the total BOM.


Ah, I had heard differently, for the same clock rate?

In any case, the Apple parts, from what I can tell, are:

https://www.skhynix.com/products.do?lang=eng&ct1=36&ct2=40&c...

In particular this one:

https://www.skhynix.com/products.view.do?vseq=2271&cseq=77


If you don’t want it, just don’t buy it, but please don’t tell other people what they should or should not like or need.


Apple's (and everyone else's) anti-repair stance (both in terms of design and in policy) is harming the environment and generating tons of e-waste. What's wrong with expressing a view that helps the planet?


Because it’s just virtue signalling, not actual environmentalism. What matters environmentally is aggregate device lifetime, so you get the most use out of the materials. Apple devices use a minimum of materials and have industry leading usable lifetimes. They are also designed to be highly recyclable.

Greenpeace rated Apple the number 1 most environmentally friendly of the big technology companies.

https://www.techrepublic.com/article/the-5-greenest-tech-com...


  Apple devices use a minimum of materials and have industry leading usable lifetimes.
Their phones have far longer lifetimes for sure, but their laptops? I would like to see evidence of that. Outside of the mostly cheaply made laptops, most laptop/desktop computers can have very long secondary lives. Linux/Windows can run on some very old (multiple decades) machines.


> Because it’s just virtue signalling

Not buying a new MBP and giving the old one to children in third-world countries qualifies as "virtue signaling" now?


Promoting reuse and repair is environmentalism. Preventing repair (as Apple does) generates more e-waste. There really is no way around that fact.

>What matters environmentally is aggregate device lifetime, so you get the most use out of the materials. Apple devices use a minimum of materials and have industry leading usable lifetimes. They are also designed to be highly recyclable.

Reuse and repair is FAR superior to recycle - which actually wastes a lot of energy, in addition to generating e-waste for the parts which are not recycled.

>Greenpeace rated Apple the number 1 most environmentally friendly of the big technology companies.

What good does it do? They are still harming the environment.


> Preventing repair (as Apple does) generates more e-waste. There really is no way around that fact.

There are plenty of ways around that fact.

Preventing repair while changing nothing else generates more e-waste. But that's not what Apple does.

If preventing repair also lets you do any or all of the following things to a sufficient degree, the result is less e-waste than if you hadn't prevented repair:

- Use fewer environmentally harmful materials (e.g. by avoiding on-board sockets, larger PCBs, etc.)

- Make the device last longer before it needs repair (reliability, longevity)

- Make the device easier to recycle

> Reuse and repair is FAR superior to recycle

It's a good goal, but it's only superior for sure if everything else is able to be kept the same to make it possible.

Some things really are better for the environment melted down and ground down and then rebuilt from scratch. I'm guessing big old servers running 24x7 are in this category: Recycling the materials into new computers takes a lot of energy, but just running the old server takes a huge amount of energy over its life compared with the newer, faster, more efficient ones you could make from the same materials. I would be surprised if not recycling was less harmful than recycling.

> What good does it do? They are still harming the environment.

When saying Apple should change the way they manufacture to be more like other manufacturers for environmental benefit, Apple being rated number 1 tells you that the advice is probably incorrect, as following it would probably cause more environmental harm, not less.


>- Make the device last longer before it needs repair (reliability, longevity)

If Apple makes devices that last so long, then how come Apple's own extended warranty program generates billions of dollars of revenue? Note that this doesn't include third party repair shops. To me, this indicates a large industry dedicated to repairing Apple products - hardly a niche industry. To me, this indicates that a large amount of Apple devices need repair, something that Apple is hostile to.

Also, while AppleCare is easy and convenient for the customer, Apple's "geniuses" do not do board repair; they simply replace and throw away broken logic boards (which sometimes need nothing more than a 10-cent capacitor). As if that weren't bad enough, they actively prevent other businesses from performing component-level repair by blocking access to spare parts.

> I'm guessing big old servers running 24x7 are in this category: Recycling the materials into new computers takes a lot of energy, but just running the old server takes a huge amount of energy over its life compared with the newer, faster, more efficient ones you could make from the same materials. I would be surprised if not recycling was less harmful than recycling.

If that was the case, then of course, we should recycle. Maybe we should have a case-by-case approach depending on specific products? I'm totally willing to go wherever the evidence leads us. As of now pretty much every single environmental organization promotes reuse over recycling for electronics.

>When saying Apple should change they way they manufacture to be more like other manufacturers for environmental benefit, Apple being rated number 1 tells you that the advice is probably incorrect,

I merely accepted the "number one" in good faith at face value. Digging further with a cursory Google search, things seem a lot more nuanced. That being said, I have no idea what "number one" even means without context.

https://www.fastcompany.com/40561811/greenpeace-to-apple-for...


> If Apple makes devices that last so long,

I don't want to back the idea that Apple does make reliable or long-lived devices, although I'm very happy with my 2013 MBP still. I honestly don't know how reliable they are in practice, although they do seem to keep market value for longer than similar non-Apple devices, and they have supported them with software for a long time (my 2013 is still getting updates).

And I would love to be able to add more RAM to my 2013 MBP, which has soldered-in RAM; and I would love if it were easier to replace the battery, and if the SSD were a standard fast kind that was cheap to get replacements for, and if I could have replaced the screen due to the stuck pixel it has due to a screen coating flaw. So I'm not uncritical of the limitations that come with the device.

I'm only disputing your assertion that preventing repair and reuse of parts inevitably generates more e-waste. It's more nuanced than that.

Of course wherever and in whatever ways we can find to repair, reuse and recycle we should.

But there will always be some situations, especially with high-end technology, where repair and reuse needs more extra materials, components, embodied energy and complexity (and subtle consequences like extra weight adding to shipping costs) resulting in a net loss for the environment.

An extreme example but one that's so small we don't think of it is silicon chips. There is no benefit at all in trying to make "reusable" parts of silicon chips. The whole slab is printed in one complex, dense process. As things like dense 3D printing and processes similar to silicon manufacture but for larger object components come online, we're going to find the same factors apply to those larger objects: It's cheaper (environmentally) to grind down the old object and re-print a new one, than to print a more complex but "repairable" version of the object in the first place.


> - Make the device last longer before it needs repair (reliability, longevity)

2016 MBP owners will appreciate the joke!


Very good point!

I don't want to back the idea that Apple does make long-lived laptops, only that it's hypothetically possible they do sometimes :-)

My 2013 MBP is still going strong thankfully, I'm very happy with it still after all these years.


My last 2013 MBP is still alive only because I was able to source third party battery / power connector...

Though, somehow, if I trust some argument made here, it would be better for the environment to buy a whole new laptop rather than fix the existing one... Though, I'm not doing it for the environment, I'm just cheap as f*ck.


> if I trust some argument made here, it would be better for the environment to buy a whole new laptop rather than fix the existing one

No, I don't think that argument is being made by anyone.

The argument being made is that to make the laptop more able to have replaceable components could potentially require more environmental costs up front in making that laptop.

I doubt that argument works for the power connector. I suspect that's more to do with making sure Magsafe is really solid, but it might for the battery due to the pouch design instead of extra battery casing, I'm not sure.

There's no question that if you can repair it afterwards you probably should.

By the way, literally all my other laptops either died due to the power connector failing, or I repaired the failing power connector. Sometimes I had to replace the motherboard to sort out the power connector properly, which seems like poor design.

The Apple has been the only one that hasn't failed in that way, which from my anecdote of about 5 laptops says Apple's approach has worked best from that point of view so far. Of course Apple power supply cables keep fraying and needing to be replaced, so it balances out :-)


For a manufacturer on the whole it’s a negligible issue. It’s simply a fact that Apple devices have longer average lifetimes and lower overall environmental impact than any of the other manufacturers. Hence the Greenpeace rating. If you actually care about the environment, as you claim, the choice is clear.

What you are doing is picking a single marginal factor that can make a difference in rare cases, but is next to irrelevant in practice, and raising that above the total environmental impact of the whole range of devices. That’s just absurd.


> It’s simply a fact that Apple devices have longer average lifetimes

I've used the same desktop for the past 6 to 8 years, upgrading with STANDARDIZED components over the years, and my laptop from that era still works. Heck, I've got an 18-year-old ThinkPad still working fine.

In the meantime, two MBPs died on me. Try again...


Do you care about the overall ecological footprint of Apple, as Greenpeace does, or only about a few specific devices in particular? How do you evaluate likely future device lifetimes and ecological footprint: cherry-picked statistics, or manufacturer track record?

Should I take your evaluation on this, or trust a detailed whole-enterprise evaluation by Greenpeace?


Apple's approach uses less material overall. For the vast majority of the machines that A) don't fail and B) are never upgraded in any case, the Apple method of getting rid of sockets reduces the e-waste burden.


Yet it’s Apple’s devices that last the longest and have the highest resale values.


Hermes handbags also have high resale value. That tells us nothing. Apple's anti-repair approach absolutely harms the environment. Certainly they are not alone in this, many/most electronics these days are irreparable. But Apple is actively hostile to the repair industry, which makes them more deserving of criticism.


The repair industry in this case is hostile to the environment. They are incentivized to want computers to break so that they can sell repair services.

It turns out that soldering parts in place makes them less likely to break than a socket whose connections can oxidize or come loose.

The tiny number of devices that can’t be repaired because of soldered components, is dwarfed by the number of devices that never broke in the first place because of soldered components.


> It turns out that soldering parts in place makes them less likely

You've obviously never heard of MBP BGA chips' solder balls cracking and rendering the whole device useless...


I didn’t say they never failed. Just that they fail less frequently.

In any case, solder ball cracking results from process issues and is a solved problem: https://www.pcbcart.com/article/content/reasons-for-cracks-i...

Certainly not something that would be improved with sockets.


Funny, the only refurbished computer, phone & tablet chain in NL has a pure Apple offering.


Unfortunately, even intel’s white label laptop specs soldered RAM, so I expect the trend to continue in low/mid range PC laptops.

https://www.theverge.com/2020/11/19/21573577/intel-nuc-m15-l...


Maybe for some consumer devices we should try it? Clearly the results are excellent. Most people don't open up and modify their laptops.


> Just nobody believed those Geekbench benchmarks showing that in short benchmarks your phone could be faster than your laptop.

Except for a lot of the Apple-centric journalists and podcasters, who have been imagining for years how fast a desktop built on these already-very-fast-when-passively-cooled chips could be.

Not that that would matter much if real-world, experienced workload performance suffered, but as far as I can tell, the M1 is no slouch in that respect either.


I heard 10 years ago, or whatever, that the iPad 2 had the most power-efficient CPU available, period. This was said at a keynote by an HPC scientist who cannot be said to be an «Apple journalist». Apple has been doing well for a very long time and I've expected this moment since that keynote, basically.


OK, I misspoke. I heard the sentiment from the Apple-focused voices in my information bubble, doesn't mean that nobody else said it. It's just that "nobody believed those Geekbench benchmarks" is not completely true.


Yeah, Apple’s CPUs have been doing really well for a while now: https://www.cs.utexas.edu/~bornholt/post/z3-iphone.html


> Just nobody believed those Geekbench benchmarks showing that in short benchmarks your phone could be faster than your laptop.

I saw a paper on (I think?) SMT solvers on iPhone. Turned out to be faster than laptops, I kind of brushed over it as irrelevant at the time.



I did believe those benchmarks. I also knew that sustained load at those speeds, if not throttled down, would just melt the thing, so yes, they were irrelevant.


Maybe in practice at the time, but in hindsight they were actually a valid indication of the true performance capabilities of the architecture. It’s just that a phone has too little thermal capacity to sustain the workload.


>"Maybe in practice at the time"

Exactly this. If I am shopping now I do not care how a particular CPU/architecture will evolve in the future. I only care about what it can practically do now and at what price. As it is now, the M1 has 4 fast cores and a non-upgradeable maximum of 16GB. For many people that is more than they will ever need. Let them be happy. For my purposes I am way happier with a 16-core new AMD coupled with 128GB RAM (my main desktop at the moment). It runs at a sustainable 4.2 GHz without any signs of thermal throttling. The cooler keeps it at 60C.


As well as too little battery capacity to sustain the workload.

Applying those cell phone methods to a laptop with better cooling and a larger battery was a win.


Another detail that came out today is just how beefy Apple's "little" cores are.

>The performance showcased here roughly matches a 2.2GHz Cortex-A76 which is essentially 4x faster than the performance of any other mobile SoC today which relies on Cortex-A55 cores, all while using roughly the same amount of system power and having 3x the power efficiency.

https://www.anandtech.com/show/16192/the-iphone-12-review/2


Incredible. Now it should be MiDdLe core.


Spot on. Exactly this. It’s like pre-iPhone when people just assumed you had a laptop and a cellphone. Then Apple said “phone computer!” and changed the game. Same with iPad just less innovation shock. Meanwhile we continued to have this delineation of computer / phone while under the hood - to a hardware engineer - it’s all the same. Naturally the chips they produced for iOS-land are beasts. My phone is faster than the computer I had 5 years ago. My M1 air is just a freak of nature. On par with high end machines but passively cooled and cheaper. I’m still kinda in awe. Not a big fan of the hush hush on Apple Silicon causing us all to play catch-up for support, but that’s Apple’s track record I guess.

The M1 is all the things they learned from the A1-A12 chips (or whatever the ordering) which is over a decade of tweaking the design for efficiency (phone) while giving it power (iPad).


> the M1 is fast isn’t due to technical tricks, but due to Apple throwing a lot of hardware at the problem.

Apple threw more hardware at the problem and they lowered the frequency.

By lowering the frequency relative to AMD/Intel parts, they get two great advantages. 1) they use significantly less power and 2) they can do more work per cycle, making use of all of that extra hardware.


> Unlike what has been said on Twitter the answer to why the M1 is fast isn’t due to technical tricks, but due to Apple throwing a lot of hardware at the problem.

Of course Apple’s advantage is not solely due to technical tricks, but neither is it entirely, or even mostly due to an area advantage. If it were so easy, Samsung’s M1 would have been a good core.


Yeah the article is interesting I just like knowing that Apple will keep iterating and the performance gap between Apple Silicon and x86 will continue to grow. I keep spec'ing out an Apple M1 Mac mini only to not pull the trigger because I am curious what an M2 will hold.


The nice thing about Apple hardware resale value is that you can just buy now and upgrade later. I just unloaded a five year old MBP for a lot more than I'd ever have gotten for a five year old PC laptop. Ordered myself a fancy new M1 laptop, and if M1X/M2/whatever-it-is-called is substantially better, then I'll just trade it in on a new model. There's always a new hotness coming next year.


The rumors I've read so far is that the next chip will be the M1-X, and will be a 12-core CPU, with 8 "high performance" cores and 4 "efficiency" cores, and will be released in a 16" MBP. The current M1, by contrast, has an 8-core CPU (4 HP cores and 4 efficiency cores).


Is this with the 16" rumored first half of next year? I hadn't seen the one you're referring to. I'm contemplating options with my current 2018 15" which is about to receive its 3rd top case replacement (besides other issues), and whether I want to press for a full replacement and how long to wait, as AC expires next June.


I saw that rumor as well. Looks compelling. What I’m holding out for is what they bring to the rumored redesigned 16 inch MacBook Pro and the rumored 14 inch MacBook Pro. I think I’ll choose between one of those.


I think the old adage, "don't buy first-generation Apple products", applies here. Seems sensible to wait and see where this goes.


Hardly worth not doing it. The stuff depreciates far slower.

I’ve been using an M1 mini as my desktop for over a week now and am impressed.


Personally I’m just waiting for the software to catch and maybe the landscape to stabilise. It’s hardly irrational to wait and get the 2nd gen.


Yes true but as the respondent said above I’m just waiting for the software to catch up and maybe a redesign on the laptop side. In the mean time I’m saving for one.


Hah, won't this perpetuate? Whenever M(N) is released M(N+1) will be on the horizon with even greater promise.


Yes, been doing that for about 30 years. Now and then you just have to take a leap. I usually buy what was the best last year at a bargain price instead.


This is a good year for getting last year's stuff, given how low the supply is on most of the high-profile hardware (new Ryzens, current-generation dedicated graphics, game consoles).

I did get something launched this year that reviewed well - a gaming laptop (Legion 5) - but it was not hard to find and I even got it used, like-new. Perhaps because nobody's travelling now.


The problem is that a phone with a lightning fast CPU is rather useless with current ecosystems.

I do think there are "technical tricks" though, especially the compatibility memory-ordering mode (x86-style total store ordering) that makes x86 emulation faster than on comparable ARM chips. Whether you call it a trick or finesse is probably a matter of perspective.


Not mentioned in the article is the downside of having really wide decoders (and why they're not likely to get much larger): essentially, the big issue in all modern CPUs is branch prediction, because the cost of a misprediction on a big CPU is so high. There's a rule of thumb that in real-world instruction streams there's a branch every 5 instructions or so. That means that if you're decoding 8 instructions, each bundle has 1 or 2 branches in it, and if any are predicted taken you have to throw away the subsequent instructions. If you're decoding 16 instructions you've got 3 or 4 branches to predict, and the chances of having to throw something away get higher as you go. There's a law of diminishing returns that kicks in, and in fact it has probably kicked in at 8.
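To put rough numbers on that, here's a toy model (assuming each instruction is a taken branch with probability ~0.1, i.e. a branch every 5 instructions with about half of them taken; illustrative only):

  /* Expected useful instructions per fetch bundle if everything after a
   * taken branch in the same bundle is wasted. */
  #include <stdio.h>

  static double useful_slots(int width, double p) {
      double sum = 0.0, survive = 1.0;   /* P(slot k is still useful) */
      for (int k = 0; k < width; k++) {
          sum += survive;
          survive *= 1.0 - p;
      }
      return sum;
  }

  int main(void) {
      int widths[] = {4, 8, 16};
      for (int i = 0; i < 3; i++)
          printf("decode width %2d -> ~%.1f useful instructions per bundle\n",
                 widths[i], useful_slots(widths[i], 0.1));
      /* prints ~3.4, ~5.7, ~8.1: each doubling of width buys less than the last */
      return 0;
  }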


The 5nm process is a large factor as well.


...and Apple was able to throw hardware at the problem because they got TSMC's manufacturing process. When everyone else is using 5nm, let's see if any of this other stuff actually matters.


M1 is also the only TSMC 5nm chip that is widely available and there is nothing remotely comparable from a process stand point.


Are they really throwing more hardware at the problem?

The die size for the whole M1 SoC is comparable to or even smaller than Intel processors, and the vast majority of that SoC is non-CPU stuff; the CPU cores/cache/etc. seem to be at most 20% of the die - though a denser die because of the 5nm process. This also seems to imply that the transistor 'budget' for the CPU part of the SoC is comparable to previous Intel processors, not a significant increase. (Assuming 20% of the 16 billion transistors in the M1 is the CPU part, that would be 3-ish billion transistors; Intel does not seem to publish transistor counts, but I believe it's more than that for the i9 chips in last year's MacBook Pros.)

Perhaps my estimates are wrong, but it seems that they aren't throwing more hardware, but managing to achieve much more with the same "amount of hardware" because it is substantially different.


Well you don't have any of the AVX512 nonsense, but probably the OOO of the M1 uses more transistors than on an Intel chip. And Google tells me that a quad core i7-7700K has 2.16 B.


The next question: What prevents Intel or AMD from doing this on their processors?


The article specifically answers this:

- The x86 instruction set can't be decoded in parallel as easily because instructions have different lengths - 4 decoders max, while Apple has 8 and could go higher.

- Business model does not allow this kind of integration.


> instructions have different lengths

It also allows extremely long instructions: the ISA allows up to 15 bytes and faults at 16 (without that artificial limit you could create arbitrarily long x86 instructions).


What a nightmare, but it makes me wonder: rather than decoding into micro-ops at runtime, could Intel or AMD "JIT" code up-front, in hardware, into a better bytecode?

I'm sure it wouldn't work for everything, but why wouldn't it be feasible to keep a cache of decoding results by page or something?


This is exactly how the hardware works and what micro-ops are: on any system with a u-op cache or trace cache, those decoded instructions are cached and used instead of being decoded again. Unfortunately you still have to decode the instructions at least once first, and that bottleneck is the one being discussed here. This is all transparent to the OS and not visible outside a low-level instruction cache, which means you don't need a major OS change, but arguably if you were willing to take that hit you could go further here.


So what stops x86 from adding the micro-ops as a dedicated alternate instruction set, Thumb style? Maybe with the implication that Intel will not hold the instruction set stable between chips, pushing vendors to compile to it on the fly?


Micro-ops are usually much wider than the ISA's instructions. They are usually not a multiple of 8 bits wide either.

A dedicated alternative instruction set would be possible, but it would take die space and make x86_64-plus-the-new-set even harder to decode.


From what I understand this is exactly what the instruction decoder does.


They do something similar for loops. The CPU doesn't decode the same instructions over and over again; it just uses them from the 'decoded instruction cache', which has a capacity of around 1.5K micro-ops.


Hmm, this reminds me of Transmeta https://en.wikipedia.org/wiki/Transmeta


They do this in a lot of designs. It's called a micro op cache, or sometimes an L0I cache.


I think the latter is the biggest challenge.

I imagine that Apple's M1 designers are using what they know about MacOS, what they know about applications in the store, and what user telemetry MacOS customers have opted into, all to build a picture of which problems are most important to solve for the type of customer who will be buying an M1-equipped MacOS device. They have no requirement to provide something that will work equally well for server, desktop, etc. roles, or for Windows and Linux, and they have a lot more information about what's actually running on a day-to-day basis.


They say in the article that AMD "can't" build more than 4 decoders. Is that really true? It could mean:

* we can't get a budget to sort it out

* 5 would violate information theory

* nobody wants to wrestle that pig, period

* there are 20 other options we'd rather exhaust before trying

When they've done 12 of those things and the other 8 turn out to be infeasible, will they call it quits, or will someone figure out how to either get more decoders or use them more effectively?


Their business model allowed for them to integrate GPU and video decoder. Of course it allows for this kind of integration. The author is not even in that industry, so a lot of his claims are fishy. Moore's law is not about frequency, for example.


I think what they mean is lack of coordination between software and hardware manufacturers + unwillingness of intel/amd etc to license their IP to dell etc. What is untrue about that?

On Moore's law, yes it's about transistors on a chip, but I think the point they're making is that Moore's law created a situation where you couldn't ramp up frequency anymore because the transistors are so small.


> What is untrue about that?

The fact that they don't need to license technology. They can bring more functionality into the package that they sell to Dell, etc. like they have already done.

> but I think the point they're making is that Moore's law created a situation where you couldn't ramp up frequency anymore because the transistors are so small.

That is not the point they are making. Clock frequencies have not changed since the deep-pipelined P4, but transistor count has continued to climb. Here is what the author, who clearly does not know what he is talking about, said about that:

"increasing the clock frequency is next to impossible. That is the whole 'End of Moore’s Law' that people have been harping on for over a decade now."


Seems like backward compatibility completely rules this out. Apple can provide tools and clear migration paths because the chips are only used in their systems. Intel chips are used everywhere, and Intel has no visibility.


I wonder if in 3 or 4 years there will be a new chipmaker startup which offers CPUs to the market similar to the M1. If Intel or AMD won't do it that is.


The problem is competing with the size of Apple (and Intel and AMD). The reason there are so few competing high-performance chips on the market is that it is extremely expensive to design one. And you need a fab with a current process. For many years Intel beat all other companies because they had the best fabs and could afford the most R&D. Now Apple has the best fab - TSMC's 5nm - and certainly gigantic developer resources they can afford to spend, as the chip design is pretty similar between the iPhones, iPads, and the Macs.

And of course, as mentioned in the article, any feature in the Apple Silicon will be supported by MacOS. Getting OS support for your new SOC features in other OSes is not a given.


Qualcomm could come out with an M1 class chip, they have the engineering capability, but if Microsoft or Google don’t adopt it with a custom tuned OS and dev tooling customised for that specific architecture ready from day one, they’d lose a fortune.

The same goes in the other direction. If MS does all the work on ARM windows optimised for one vendor’s experimental chip and the chip vendor messes up or pulls out, they’d be shived. It’s too much risk for a company on either side of the table to wear on their own.


Sadly Qualcomm hasn't developed its own CPU cores since the ARMv8 transition. Samsung stopped developing its own core too. So only ARM's core designs are available for mobile.


The article answers this: x86 has variable-sized instructions, which makes decoding much more complex and hard to parallelize since you don't know the starting point of each instruction until you read it.
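A toy illustration of that serial dependency (the "ISA" here is made up, with the length in the first byte; real x86 length decoding is far messier, since prefixes, opcode, ModRM, SIB and immediates all affect the length):

  #include <stdio.h>
  #include <stdint.h>

  static size_t insn_length(const uint8_t *p) {
      return p[0];   /* toy rule standing in for real x86 length decoding */
  }

  int main(void) {
      uint8_t code[] = {1, 3,0,0, 2,0, 5,0,0,0,0};  /* lengths 1, 3, 2, 5 */
      size_t off = 0;
      while (off < sizeof code) {
          size_t len = insn_length(code + off);
          printf("instruction at offset %zu, length %zu\n", off, len);
          off += len;   /* the next start depends on finishing this decode */
      }
      /* With fixed 4-byte instructions the starts are just 0, 4, 8, ...,
         so all 8 decoders can grab their instruction in parallel. */
      return 0;
  }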


AMD until very recently (and Intel almost 25 years ago) used to mark instruction boundaries in the L1 I$ to speed up decode. They stopped recently for some reason though.


For those unaware, ARM (as with many RISC-like architectures) uses 4 bytes for each instruction. No more. No less. (THUMB uses 2, but it's a separate mode.) x86, OTOH, being CISC in origin, has instructions ranging from a single byte all the way up to 15[a].

[a]: It is possible to write instructions that would be 16 or more bytes, but the micro architecture will “throw” an illegal instruction exception if it encounters one. Intel doesn’t say why there’s a limit, but they do mention (in the SDM) that they may raise or remove the limit in the future.

Why 15 bytes? My guess is so the instruction decoder only needs 4 bits to encode the instruction length. A nice round 16 would need a 5th bit.


> It is possible to write instructions that would be 16 or more bytes, but the micro architecture will “throw” an illegal instruction exception if it encounters one. Intel doesn’t say why there’s a limit

I remember noticing on the 80286 that you could in theory have an arbitrarily long instruction, and that with the right prefixes or instructions interrupts would be disabled while the instruction was read.

I wondered what would happen if you filled an entire segment with a single repeated prefix, but never got a chance to try it. Would it wrap during decoding, treating it as an infinite length instruction and thereby lock up the system?

My guess is that implementations impose a limit to preclude any such shenanigans.


You could encode 16 bytes–there's no need to save a slot for zero.


I honestly don’t know how the processor counts the instruction length, so it was only pure speculation on my part as to why the limit is 15. Maybe they naively check for the 4-bit counter overflowing to determine if they’ve reached a 16th byte? Maybe they do offset by 1 (b0000 is 1 and b1111 is 16) and check for b1111? I honestly have no idea, and I don’t think we’ll get an answer unless either (1) someone from Intel during x86’s earlier years chimes in, or (2) someone reverses the gates from die shots of older processors.


x86 instruction-length is not encoded in a single field. You have to examine it byte by byte to determine the total length.

There may be internal fields in the decoder that store this data, I suppose.


Yes I am aware. My wording could’ve been better. I was referring to the (possible) internal fields in the decoder.


How does 8 wide decode on ARM RISC compare to 4 wide decode on x64 CISC? If, say, you'd need two RISC ops per CISC op on average, that should be the same, right?


Most instructions map 1:1. x86 instructions can encode memory operands, potentially doubling the practical width, but a) not all CPUs can decode them at full width and b) in practice compilers generate mem+op instructions only for a (not insignificant) minority of instructions.

So the apples-to-apples (pun intended) practical width difference is closer than it appears, though still not as wide.

X86 machines usually target 5-6 wide rename, so that would become a bottleneck (not all instructions require rename of course). I expect that M1 has an 8-wide rename.

Edit: another limitation is that most x86 decoders can only decode 16 bytes at a time and many instructions can be very long, further limiting actual decode throughput.

Conversely, the expectation is that most hot code will skip decode completely and be fed from the uop cache. This also saves power.


Thank you for a better synopsis.

Bugs me so much when people don't look at the logical side of things. Tons of mac'n'knights going around downvoting and stating people are wrong that it has something to do with being a RISC processor.

While fundamentally, pre-2000s, things were more RISC-this and CISC-that, the designs are more similar than ever on x86 and ARM. It's just that the components are designed differently to handle the different base instruction sets.

Also, the article is entirely wrong about SoCs: Ryzen chips have been SoCs since their inception. In fact, AMD has been making SoCs since the first APU, which carried north bridge components onto the CPU die.


It also has a shared 12MB L2 cache for the performance cores, which is huge.


> But Apple has a crazy 8 decoders. Not only that but the ROB is something like 3x larger. You can basically hold 3x as many instructions. No other mainstream chip maker has that many decoders in their CPUs.

The author completely misses "the baby in the water"

Yes, x86 cores are HUGE; the whole of the CPU is for them only.

They can afford wider decode, even if at a giant area cost (which itself would be dwarfed by the area cost of the cache system).

The thing is, more decode and bigger buffers will still not improve x86 perf by much.

Modern x86 has good internal register and pipeline utilisation; it's simply that they don't have something to keep all of those registers busy most of the time!

What it lacks is memory and cache I/O. All x86 chips today are I/O starved at every level. And that starvation also comes as a result of decades-old x86 idiosyncrasies about how I/O should be done.


How does I/O works differently in a M1 chip compared to x86?


I think x86 is the only modern ISA family that still has a separate address space for I/O. It is not used today anymore, but it exists somewhere deep in the chip, and its legacy kind of messed up how the entire wider memory and cache systems on x86 were designed.

x86 has got memory-mapped I/O for modern hardware, but on the way there, x86 memory access got tangled with bus access. x86 still treats the wider memory system as a kind of "peripheral" with a mind of its own.

The intricacies of how x86 memory access evolved to keep accommodating decades-old drivers and hardware apparently made a grand mess of what you can and cannot memory map or cache, and of many things deeper in the chip.

One of many casualties of that design decision is the x86 cache-miss penalty, and overall expensive memory operations.


I don't really get what you are talking about. Everybody has been doing MMIO for a while now (and by for a while I mean multiple decades), and IO is usually not an issue in personal computers anyway; OOO, compute and the memory hierarchy is. We are not discussing about some mainframes...


> I think X86 is the only modern ISA family that still have a separate address space for I/O. It is not used today anymore, but it exists somewhere deep in the chip, and its legacy kind of messed up how the entire wider memory, and cache systems on X86 were designed.

Internally it's just a bank of memory these days. You can publicly see how HyperTransport has treated it as a weird MMIO range for decades (just like Message Signaled Interrupts), and QPI takes the same approach.


Why can't they just give the IO bus a slower clock and devote the resources to the memory bus? Or, memory-map everything and make the IO area yet another reserved area the BIOS tells the OS about?


I believe that if anybody knew the answer to the many questions along the lines of yours, that person would be rich, and Intel would not be in the ditch.

To be clear, the I/O bus with its own address space is no longer in use, but the design considerations that came from keeping it around for so long are still there.


I find it interesting you use this kind of disparaging tone when discussing Apple Silicon. I also find it interesting that you consider having a wide decoder not as a technical trick but as "throwing hardware at the problem."

However you try and spin it, what it comes down to is this, Apple is somehow designing more performant processors than every other company in the world and we should acknowledge they are beating the "traditional" chip designing companies handily while being new at the game.

If it's as easy as "throwing hardware" at the problem, then Intel and AMD and Samsung etc should have no problem beating Apple right?


I interpreted that differently: they used good engineering to make a speed demon of a chip, not some magic trick that's only good in benchmarks and not real world usage. I don't think it was disparaging at all.


It’s never considered “magical” after the feat has been accomplished. But a year ago if you claimed this is where Apple would be today, a lot of people would say that would require waving a magic wand.


Yeah, it's like saying "well all Apple is doing is throwing good engineering at the problem".

All they've really done is taken an advanced process, coupled it with a powerful architecture they've been iteratively improving on for years, and thrown it at software that has been codesigned to work really well on those chips.

Yeesh!


They didn't just throw hardware at the problem, but also talent. Something that is way more scarce.


Apple is throwing money at the problem. IC size is directly proportional to cost.

They couldn't sell a chip like this at a cost-competitive price on the open market against AMD/Intel products.


Hmm. The M1 is about 120 sq mm, substantially smaller than many Intel designs. The current estimate for 5 nm wafers is about $17K (almost certainly high). A single 300 mm wafer is about 70,700 sq mm. If we get die from 75% of that area, that gives us about a $38 raw die cost. Even with packaging and additional testing, I suspect they would be competitive.


Nice back of the envelope calculation. I think I'd add yield to it though.

TSMC had a 90% yield in 2019 for an 18mm2 chip[1]. Assuming a 120mm2 chip would have more defects, and assuming process improvements since 2019, maybe 80% would be an accurate-ish estimate.

Found an even better number: [2] lists the defect rate as 0.11 per 100mm2 => 87%.

$38/87% = $43.7

[1] https://www.anandtech.com/show/15219/early-tsmc-5nm-test-chi... [2] https://www.anandtech.com/show/16028/better-yield-on-5nm-tha...
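Folding the whole estimate together (using a simple Poisson yield model, which lands at ~88% and ~$44, close to the ~87% / $43.7 above; all inputs are the rough public figures already quoted):

  #include <math.h>
  #include <stdio.h>

  int main(void) {
      double wafer_cost  = 17000.0;  /* $ per 5nm wafer (pessimistic)     */
      double wafer_area  = 70700.0;  /* mm^2, 300 mm wafer                */
      double die_area    = 120.0;    /* mm^2, M1                          */
      double area_util   = 0.75;     /* edge loss, scribe lines, etc.     */
      double defect_rate = 0.0011;   /* defects/mm^2 (0.11 per 100 mm^2)  */

      double dies  = wafer_area * area_util / die_area;   /* ~442 */
      double yield = exp(-defect_rate * die_area);        /* ~0.88 */
      printf("~%.0f dies/wafer, ~%.0f%% yield, ~$%.0f per good die\n",
             dies, yield * 100, wafer_cost / (dies * yield));
      return 0;
  }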


That's a fair point, yield has to be included. I lumped yield in with the pessimistic 75% factor for area utilization of the wafer. I should have been more clear. The area loss for square die on a round wafer should be much less than 25% of the total wafer area.

If you look at process tech cost trends, the $17K is also very pessimistic. I think a customer the size of Apple is probably getting a much better rate than that. Remember, they sell well over 200 million TSMC fabbed chips a year. Hard to know for sure of course, but I imagine these chips are ultimately costing Apple well under $40. We'll never know of course...


The big skew in availability between 8GB and 16GB models implies to me that yields of perfect chips are lower than Apple expected, with too many ending up in the 8GB bin.


The DRAM is on separate chips from the M1 processor. The availability skew is probably just a production forecasting error.


I came to the opposite conclusion. I think more users than expected paid for the 16GB models, leaving extra inventory of the 8GB models. During Black Friday/Cyber Monday I saw several discounts on 8GB M1 systems, but none on the 16GB systems.

Hopefully that sends a message to Apple to build systems with more memory. It seems insane to invest in an expensive M1 system (once you add storage, a 3-year warranty, etc.) and get only 8GB. Even if it works well today, with a useful life of 3-6 years it seems likely the extra 8GB will have significant value over the life of the system... even if it's just to reduce wear on the storage.


That would imply that the M1 has the DRAM rather than just the controllers on the chip but all of the coverage I've seen says that they are separate chips in the same package.


This is interesting. What sources do you visit to learn about CPU manufacturing trends?


Well, there are a number of industry sites, but here are a few good starting points:

https://en.wikichip.org/wiki/WikiChip

https://semiwiki.com

https://www.tomshardware.com

https://semiaccurate.com (paywall for some articles, very opinionated...)


As the sister comments have noted they almost certainly could given the size of the die.

But in another sense you are right - Apple is throwing money at the problem: their scale and buying power means that they have forced their way to the front of the TSMC queue and bought all the 5nm capacity.


Is that true? Intel chips are known to be overpriced.


Intel probably isn't the best example here; a better comparison would be AMD. Their R&D budget was 20-25% of Intel's, yet they were able to produce a better performing part with Zen 3.


Intel fabs their own chips. AMD outsources that. Intel is still on 14nm vs AMD's 7nm. It has been a really long time since AMD has even come close to Intel. The question is whether Intel can recover from their slump before AMD can get enough chips out.


AMD's on a big upswing, with design wins in multiple markets from servers down to laptops. The PS5 and Xbox Series S/X will also help with volumes for the next few years.


>> Intel is still on 14nm

Why do people keep saying that? There's a variety of 10nm processors from Intel on the market.


Are there any chips like this being sold at all by AMD or Intel? Can you get this performance and power consumption anywhere else?


The AMD 4900U has similar power consumption and higher multi-threaded performance, but lower single-threaded performance.

I expect the AMD part to have quite a bit lower GPU performance since it uses about 1/3 of the transistors of the M1: 4.9 billion transistors vs 16 billion.

https://wccftech.com/intel-and-amd-x86-mobility-cpus-destroy...


AMD CPUs are very hard to fully utilize with a single thread, and Intel has always held the single-thread perf crown. Most high-performance use cases are multi-threaded now, so the single-threaded performance delta isn't that significant. The Apple chip is really built for running a snappy GUI: in most of those cases, you need to run ~1M instructions super fast on one thread for a short time. Intel has historically had the crown on this metric, but not any more with all of their problems.


AMD Zen3 outperforms everything Intel has in single threaded performance.

https://wccftech.com/amd-ryzen-9-5950x-16-core-zen-3-cpu-obl...

Or are you saying that some operations like FMAs for example are hard to keep a high utilization for?


Historically yes, but not so with the Zen3.


The closest in the next few months is the new Zen 3 based APUs that AMD is announcing at CES which is in Mid January. Zen 3 is reasonably competitive with the M1 on a per core basis. Not quite the single thread perf, but pretty competitive on multicore throughput.

As a rough estimate I'd expect the AMD chips to be within 10-20%, and you'll be able to run Windows or Linux on them.


While the M1 is super impressive, I'm kind of wondering how scalable the performance increases they made here are.

They moved high bandwidth RAM to be shared between the CPU & GPU, but they[1] can't just keep expanding the SoC[2] with ever larger amounts of RAM. At some point they will need external RAM in addition to the on chip RAM. Perhaps they will ship 32GB of onboard memory with external RAM being used as swap? This takes a bit away from the super-efficient design they have currently.

Likewise, putting increasingly larger GPUs onto the SoC is going to be a big challenge for higher performance/ Pro setups.

I think Apple really hit the sweet spot with the M1. I suspect the higher end chips will be faster than their Intel/ AMD counterparts, but they won't blow them out of the water the way the current MacBook Air/ MBP/ mini blow away the ~$700-2,000 PC market.

[1] I updated the text here because I'd originally commented here about the price of RAM which is largely irrelevant to the actual limitations.

[2] As has been pointed out below, the memory is not on the die with the CPU, but is in the same package. Leaving the text above intact.


To your main points, they have yet to fully utilize advanced packaging (TSMC 3D for example) to get much more RAM in the SOC. They could also go off package at the expense of power, a good tradeoff for a desktop system. The die size is also small compared to competitive processors (120 mm^2) so they can certainly add more cores. I think with a larger cost budget they'll make the high end sing.

I don't mean to pick on you as you're not alone in what is a very natural skepticism. However, it is somewhat amusing to watch the comments about the M1 over time. When Apple announced the M1 based products many critics were crying impossible, faked, rigged benchmarks, etc... Now that the products have proven to have performance at least as good as claimed, lots of people are suggesting this is some kind of low end fluke result and the higher end systems won't be so great. Just wait. I think we are seeing a tipping point event where RISC (fueled by great engineering) is finally fulfilling its promise from many years ago.


> To your main points, they have yet to fully utilize advanced packaging (TSMC 3D for example) to get much more RAM in the SOC.

The big problem is as you get larger and larger amounts of RAM, the demand drops precariously. The number of people who need 16GB of RAM? Very large. The number who need 32GB is at least an order of magnitude smaller. The number who need 64GB another order of magnitude smaller. The number who need 256GB of RAM or more is likely in the low thousands or even hundreds.

Making a custom package for those kind of numbers becomes prohibitively expensive.

> I don't mean to pick on you as you're not alone in what is a very natural skepticism. However, it is somewhat amusing to watch the comments about the M1 over time. When Apple announced the M1 based products many critics were crying impossible, faked, rigged benchmarks, etc

I've figured from the start that Apple wouldn't make this transition unless there were a significant win here. In my above comment, I think I made it quite clear that I expect Apple's upcoming CPUs to outperform Intel.

I'm just not as certain the delta between Apple's top end CPUs and Intel/ AMD will be as great as the delta between the M1 and the Intel CPU it replaced. So for example, the M series chip might be 20-30% faster than the Intel in the 16" MacBook Pro, not double the performance as it was in the MacBook Air.


> The number of people who need 16GB of RAM? Very large. The number who need 32GB is at least an order of magnitude smaller. The number who need 64GB another order of magnitude smaller. The number who need 256GB of RAM or more is likely in the low thousands or even hundreds.

"need" can be a combination of objective and subjective takes, but I would posit that the amount who would at least purport to need 256GB+ is radically, radically higher than the low thousands.


Nearly every engineer I know would benefit from >=32GB of RAM as opposed to 16GB.

On the consumer side, if more of the market has 32GB available, you can bet that applications will expand to utilize the available space.


I have no idea, but I suspect the number of people willing spend $5000 to upgrade a desktop Mac from 32GB to 256GB is quite low.

The process to provision 8 or 16GB of memory spread over millions of units doesn't seem like it would work well at smaller scales.

Whether that's low 1000s or tens of thousands isn't really important.


It's less than $2000 to upgrade an iMac Pro to 256GB: https://eshop.macsales.com/item/OWC/DID2627DS256/ (and that's a branded kit with a lifetime warranty)


In a laptop, low thousands might be about right. And for RAM, need captures it pretty well. Either you will once in your life need to do something that uses 256GB ram, in which case you need it, or you don't.


Where was the discussion scoped to laptops?

As for need, as alluded to in my previous reply, I generally agree in an objective sense (although using 256GB does not necessarily mean needing it either). On the subjective side, I have long observed that people will blindly assert that more memory is better even when they don't profile their peak physical usage and may never make use of the amount they have outside of disk caching. Even if folks don't need incredible amounts of memory, that doesn't necessarily stop them from wanting it, even if it provides little to no benefit.


Where I work I have a cloud instance with more than 256GB of RAM.

But I don't need that on my laptop.


Where was the discussion scoped to laptops?


Yes, I expect an ARM MacBook Pro not to be massively faster in peak performance, but it should run considerably cooler. This means sustained peak performance and overall less fan noise. If I have anything to criticize about the MB Pro, it is that there is just too much fan noise even with not-so-big loads, and the machine gets very hot. Adding more battery life would also be a welcome improvement. (Though 8 high-performance cores would also deliver quite impressive speed.)


I've had problems with MB Pros getting too hot for at least a decade. My 2008 model was actually too hot to use on my lap. Sun-burn level of hot if I wore shorts, or even something like pajama pants.


This is kind of what I expect of the 16" MBP and the higher end 13" MBP, cooler running with low/ no fan noise, tremendous battery life, 20-50% better performance than the M1, drop 8GB RAM and offer 32GB, better video performance (support dual 5k displays).

It's possible they will launch the iMac with the same chip as the MBP 16 as well.


Fair enough. I may have projected a number of other comments onto yours a bit. Sorry about that. If the delta is similar it would certainly be crazy fast.

As far as the packaging, I agree with your general assertion. However, they ship about 20 million computers and about 240 million iOS devices per year, all of which require custom SOC packages. Needless to say, they are a very large customer with their packaging suppliers. I could be wrong, but I think the leverage they get will keep the costs in line even for a smaller slice of their premium priced products.


Apple has always billed their big selling point in terms of performance per watt. That narrative might change.

I suspect even if the 16" Pro has more modest performance gains than the Air, it will still have massively better battery life. The iMac and Apple's bigger devices have a lot of thermal and power headroom to make up for the shortcomings of Intels CPUs.


It seems like Apple walked away from a bad PowerPC situation into Intel's arms only until they could do something like this... Especially after the past few years, where Intel couldn't execute on its fab process and forced Apple to switch vendors when Intel couldn't deliver on 5G.


> To your main points, they have yet to fully utilize advanced packaging (TSMC 3D for example) to get much more RAM in the SOC.

The M1 is using off the shelf LPDDR4 modules on the package but not on the die. 3D stacking is possible for denser DRAM modules but wouldn't make much sense compared to just adding 2 more DRAM modules and spending the physical space on it - there's more than enough room especially in devices like the Mac Mini. A boring ol' daisy chain setup would work perfectly fine here.


Interesting, where do you find the size of these dies? I was really curious about how large the Firestorm cores are compared to say Zen3 or Intel cores.

How many cores do you think Apple can realistically add? Can they get up to something like Threadripper, or are we talking more about something along the lines of a 12 core upper limit?


https://www.anandtech.com/show/16226/apple-silicon-m1-a14-de...

https://www.techarp.com/computer/amd-zen-3-tech-report/

https://www.tomshardware.com/news/der8auer-intel-core-i9-109...

Here's a summary of numbers from the articles.

The M1 is about 120 sq mm.

The Zen 3 is interesting. It has a separate IO chip coupled with one or two 8 core processors. The 8 core version has a total die area of about 80 + 125 = 205 sq mm. The 16 core is 285 sq mm.

Intel chips are all over the place depending on core design and number of cores. As an example from the reference above, a 6 core i7-8700K is about 154 sq mm. The i9-9900K is 180 sq mm and the 10 core is 206 sq mm.

How many cores can they add? From the die photograph of the M1, a very rough estimate of the area dedicated to the 8 cores and associated cache is maybe 40 percent, or 48 sq mm. Compared to the 206 sq mm of the 10 core i9, they could add about 16 more cores. Seems unlikely of course because of all the other things you have to do to support that. It is reasonable though to imagine a 180 to 200 sq mm Mx chip with maybe 16 CPU cores and perhaps a few more GPU cores. Fun.
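
The same arithmetic as a rough sketch (every number here is an eyeballed guess from the estimates above, so treat the output as ballpark only):

    m1_die        = 120          # sq mm
    core_fraction = 0.40         # guessed share of die for 8 CPU cores + cache
    per_core      = core_fraction * m1_die / 8         # ~6 sq mm per core

    target_die    = 206          # sq mm, an i9-class die budget
    extra_cores   = (target_die - m1_die) / per_core   # ~14 more cores, the same
                                                       # ballpark as the guess above,
                                                       # ignoring interconnect/IO/GPU
    print(per_core, extra_cores)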

The fundamental limit to die size is what's known as the reticle size. That's the maximum size of the stepped design that is repeated over the wafer. This is a limitation of the lithography equipment. That limit is around 850 sq mm, but no one builds a high volume chip anywhere close to that big because the yield and therefore the cost would be atrocious. Instead, several repeats of the design are included in the reticle and then that is stepped over the entire wafer.


More interesting is comparing core size, I think. Just spitballing in MS Paint, if we chop off the FP, the Zen 3 core plus 1MB L2 is probably ~3 sq mm. The M1 die seems to be about 2 Firestorm complexes high and 4 wide, so a complex would be ~15 sq mm, or ~3.75 sq mm for a core plus 3MB of L2.

https://wccftech.com/amd-ryzen-5000-zen-3-vermeer-undressed-...


One thing you haven't mentioned is that the Zen 3 processor is on TSMC's 7nm process, whereas the M1 is on the 5nm process.


There isn't any fundamental limit. Threadrippers don't have GPUs taking up space and heat budget, though.


> but at $200/ 8GB, getting up to 32GB or 64GB is going to be a challenge and quite expensive

It's worth noting that Apple's price for the upgrade doesn't necessarily tell us much about the actual cost to them. They've famously charged exorbitant amounts for RAM for a long time, including charging that same rate, $400/16GB, to bump up the RAM on the current (Intel) 16" Macbook Pro:

https://www.apple.com/shop/buy-mac/macbook-pro/16-inch-space...


Yes!

I updated my comment to hopefully make it more clear. The issue is less about the $$ amount and more about the practicality of including ever larger amounts of RAM on the SoC.


There's no DRAM on the SoC die. The DRAM is separate dies that are packaged close to the SoC die, but they don't have to be that close. A typical graphics card has an array of 8+ DRAM packages surrounding the GPU, with a faster bus speed than Apple's using.

To scale up DRAM capacity and performance, Apple will have to increase the number of DRAM controllers on the SoC and maybe increase the drive strength so that they can push the signals a few cm over PCB instead of less than 1cm over the current SoC package substrate. Neither of those is a particularly difficult scaling problem.


I think you might be underestimating the importance of RAM proximity to the CPU.

The speed of electricity over copper is listed as being 299,792,458 meters per second. A meter is 39.3701 inches, so that would be 11,802,859,050 (11.8 billion) inches per second.

Now imagine a CPU trying to send a round trip electrical signal 4 billion times per second to the RAM, over a distance of two inches (a 4 inch round trip). That's literally 16 billion inches of distance that you are asking the electrical signal to cover in the space of a second, but we know that the electricity can only physically cover 11.8 billion inches in that second, so we would essentially have a bottleneck due to the physical restrictions of the speed of light.

Now imagine if you could cut that distance down from inches to cm or even millimeters... This is the benefit of having everything together on an integrated chip.


Speed of light delay due to trace length really is not that important to DRAM. Adding ~6cm to the round-trip path would add about 0.3ns to DRAM latency, which is already over 30ns. So we're looking at less than 1% difference between on-package vs on the motherboard. This is a much less important concern than signal integrity and drive strength necessary to maintain bandwidth when moving DRAM further from the CPU.

Your argument would be much closer to relevant if we were discussing SRAM caches.
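
For anyone who wants to check that arithmetic, here's a rough sketch (the trace velocity and the DRAM latency are ballpark assumptions, not measurements):

    C_VACUUM = 299_792_458              # m/s, speed of light in vacuum
    v_trace  = 0.6 * C_VACUUM           # signals in PCB traces run at very roughly
                                        # half to two-thirds of c; 0.6c as a ballpark
    extra_m  = 0.06                     # 6 cm of added round-trip trace length

    extra_ns = extra_m / v_trace * 1e9  # ~0.33 ns added
    dram_ns  = 30.0                     # order-of-magnitude DRAM load latency
    print(extra_ns, extra_ns / dram_ns) # ~0.33 ns, i.e. roughly 1% of the total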


Thanks, I updated my comment above.


> Moving high bandwidth RAM to be shared between the CPU & GPU

There's a really weird amount of focus on this. Unified memory has been a thing for almost half a decade now. It became a thing as soon as AMD stuck the CPU & GPU behind the same IOMMU ( https://en.wikipedia.org/wiki/Heterogeneous_System_Architect... )

Now you may ask, why hasn't anyone leveraged this before? And the answer is they have, but it's usually not worth doing because the integrated GPU usually isn't worth using for anything. As in, the market for HPC software that doesn't benefit from a discrete GPU is vanishingly small. So nobody really cares about it. But there are applications of this. Things like Intel's QuickSync should be internally leveraging this on the relevant CPUs, for example. So those sorts of frameworks (like MLKit) will just be a bit faster and cheaper to spin up.

The rise in consumer software that does "HPC-like" workloads (eg, facial recognition, voice recognition, etc..) make this more interesting, but there's nothing new here, either. It's certainly not playing any meaningful role in the M1's current performance. The bigger question here would be to what extent does Apple push this. Like is Apple going to try and push integrated GPUs for the iMac Pro? What about the Mac Pro?


Unified memory is quite a bit older than that; the SGI O2[0] had it in 1996.

[0] https://en.wikipedia.org/wiki/SGI_O2


So did the original Macintosh, and the ZX Spectrum. Unified memory is not a good thing; it makes the CPU and GPU fight for access.


However, it also means zero-copy sharing between them, so there are benefits.


The M1's memory is ported in such a way that there isn't contention for the memory. At least according to the article.


Well, the article is basically just wrong on that, kinda. It is a single memory controller, which means the CPU & GPU are fighting for bandwidth. It's not an exclusive lock, no, but you will see a very sharp decline in memcpy performance if you're also hammering the GPU with blit commands.

Which is the same as basically all integrated GPUs on all modern (or even kinda modern) CPUs. You don't usually have such heavy CPU memory bandwidth and GPU memory bandwidth workloads simultaneously, so it's mostly fine in practice, but there is technically resource contention there.


Is this from real world observation, or just looking at the diagrams or what? Because by all accounts, these chips don’t seem to encounter any issues with the ui or encoding movies or whatever. Which would strongly imply that memory contention between the gpu and cpu is not a significant issue.

If you have actual data that backs up your position, I would love to see it.


As I said, in most scenarios this doesn't matter, so I'm not sure why you're pointing at an average scenario as some sort of counter-argument?

There's a single memory controller on the M1. The CPU, GPU, neural net engine, etc... all share that single controller (hence how "unified memory" is achieved). Given the theoretical maximum throughput of the M1's memory controller is 68GB/s, and that the CPU can hit around 68GB/s in a memcpy, I'm not sure what you're expecting? If you hammer the GPU at the same time, it must by design share that single 68GB/s pipe to the memory. There's not a secondary dedicated pipe for it to use. So the bandwidth must necessarily be split in a multi-use scenario, there's no other option here.

Think of it like a network switch. You can put 5 computers behind a gigabit switch and 99% of the time nobody will ever have an issue. At any given point you can speedtest on one of them and see a full 1gbit/s on it. But if you speedtest on all of them simultaneously you of course won't see each one getting 1gbit/s since there's only a single 1gbit/s connection upstream. Same thing here, just with a 68GB/s 8x16-bit channel LPDDR4X memory controller.

The only way you can get full-throughput CPU & GPU memory performance is if you physically dedicated memory modules to the CPU & GPU independently (such as with a typical desktop PC using discrete graphics). Otherwise they must compete for the shared resource by definition of being shared.
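
A toy model of that sharing, using the 68GB/s figure from above (real arbitration is far smarter than proportional splitting, so this only illustrates the point):

    TOTAL_GBPS = 68.0                                   # M1 controller peak, per above

    def split(cpu_demand, gpu_demand, total=TOTAL_GBPS):
        demand = cpu_demand + gpu_demand
        if demand <= total:
            return cpu_demand, gpu_demand               # no contention, nobody notices
        scale = total / demand                          # both get throttled under load
        return cpu_demand * scale, gpu_demand * scale

    print(split(20, 20))   # (20, 20) -- the common case
    print(split(68, 40))   # (~42.8, ~25.2) -- memcpy slows while the GPU is blitting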


The point I was originally responding to was, "Unified memory is not a good thing, makes the CPU and GPU fight for access." There is no evidence here that this is a significant issue for the M1.

So even if the memory bandwidth is split, the argument that this is a problem is not in evidence.


> There's a really weird amount of focus on this.

Considering how much emphasis Apple put on it, I don't think it is weird at all. Apple made big claims about their CPU and attributed a significant chunk of that performance gain to their unified memory.

Their claims about the CPU have proved largely accurate. Why would they fabricate the reasons behind that performance? It's not like they are throwing off competitors here.


It seems that some analyses overestimate UMA. It won't accelerate simple CPU performance and won't reduce RAM usage unless the data is also used on the GPU.

Possibly they are also calling the stacked RAM "UMA", rather than just RAM shared between the CPU and GPU?


What I find odd is how someone rushes out to "debunk" nearly everything Apple has said that they did to optimize the M1. It's clear the M1 is fast, so clearly some of the things they did to optimize it worked. Why Apple would lie about what those optimizations are is a head scratcher.

It strikes me as odd that so many people claim there is zero benefit. I'm left wondering if people think Apple is lying about all of the actual reasons the CPU is fast and have presented this as a smokescreen. Makes zero sense to me.


> Why Apple would lie about what those optimizations are is a head scratcher.

Indeed it is. But marketing isn't really known for being technically accurate, why would Apple's be any exception?

> It strikes me as odd that so many people claim there is zero benefit.

That's not the claim at all. The claim is that unified memory isn't new, and there's a decent chance you already had it. For example, every Macbook Air of the last few years has been unified memory. The M1's unified memory is therefore a continuation of the existing norms & not something different. The M1's IPC is something different. The M1's cache latency is something different. The M1's 8-wide decoder is something different. There's a lot about the M1 that's different. Unified memory just isn't one of them, and unified memory still just doesn't improve CPU performance. All those CPU benchmarks that M1 is tearing up? It'd put up identical scores without unified memory, since none of those benchmarks involve copying/moving data between the CPU and a non-CPU coprocessor like the GPU or neural processor.


> But marketing isn't really known for being technically accurate, why would Apple's be any exception?

The comments and emphasis about the benefits on unified memory come right out of Johny Srouji's mouth. Granted, everything Apple execs say is vetted by marketing and legal, but I can't see Srouji emphasizing a made-up marketing point as a weird red herring either.

As I mentioned above, someone on HN has "debunked" every single optimization Apple says they've done. But the numbers don't lie. Somewhere along the way, some of the "debunking" is full of shit. I'm guessing most of it is.

The simplest explanation is that Apple is being forthright here and people don't understand what they've done or how those optimizations work.


The whole focus on this makes no sense to me as well. UMA has been standard architecture for mobile SoCs since the dawn of time(aside from a few oddballs). Ditto tiling GPUs and the few other things I've seen called out.

Heck, the Xbox 360 used UMA[1] and the PS3 didn't[2]. It really didn't play into the systems' performance directly (other than you might do some crazy tricks like storing audio data in VRAM and streaming it back). With UMA you can get into cases where heavy CPU reads impact other parts of the system because the memory controller is shared.

[1] https://en.wikipedia.org/wiki/Xbox_360_technical_specificati...

[2] https://en.wikipedia.org/wiki/PlayStation_3_technical_specif...


I've seen a few very rare edge cases where you can do cool tricks with UMA on mobile (which as you say have been UMA for years). For example on Android for VR you can have the sensor data stored in a chunk of memory that can also be read by the GPU, so your last-second time warp can have that smidge less latency by sampling the latest sensor data as hot off the sensor as it gets, without even bouncing off of the CPU. ( https://developer.android.com/ndk/reference/group/sensor#ase... )

But the best VR experiences are still done on non-UMA desktop PCs with discrete graphics so... At some point the slight efficiency wins of UMA are trumped by just the raw horsepower you get from multiple big dies.


> Moving high bandwidth RAM

It's just LPDDR4. HBM is a different thing. You can tell because HBM has a 1024-bit bus per chip, but LPDDR4 is just 128 bits per chip.

There's almost nothing special about the RAM, aside from it being locked to the chip and unable to be upgraded. It seems like the clock rate of the LPDDR4 is a bit higher than average, but it's nothing extravagant.

> How do they manage even larger amounts of RAM? Perhaps they will have external RAM in addition to the on chip RAM, with perhaps 32GB of onboard memory and the external RAM being used as swap? This takes a bit away from the super-efficient design they have currently.

They COULD just support a DIMM stick like everyone else. But Apple doesn't want to do that strategy and prefers packaging the RAM and CPU together for higher prices.

This isn't a 1024-bit bus (that requires 1024 wires, usually an interposer). This is just your standard 128-bit LPDDR4, maybe at a slightly higher clockrate than others for a slight advantage in memory.

------

In the case of HBM2, you can't upgrade RAM beyond one stack per 1024-bit bus. A GPU goes 4x wide with 4x 1024-bit buses to 4 different HBM2 stacks for a massive 4096-pin layout (!!!); that's 4096 wires connecting a GPU to individual HBM2 chips.

Hypothetically, a future GPU might go 6-stacks or 8-stacks (8192-wires), but obviously running all those wires gets more-and-more complex the bigger you go. So in practice, GPUs seem to be ~4-stacks, and then you just buy a new GPU and run the entire GPU in parallel when you need more RAM.

> > Moving high bandwidth RAM to be shared between the CPU & GPU

That's not the advantage. When the L3 cache is connected between CPU and GPU, you gain a bandwidth edge in heterogeneous systems. AMD / Intel have been connecting L3 caches to their iGPU solutions for over a decade now.

CPU - to DDR4 - to GPU is very slow compared to CPU - to L3 cache - to GPU. Keeping the iGPU and CPU memory-cohesive is a nifty trick, but is kind of standard at this point.


On a related topic, will Apple need to go up to LPDDR5 to get to 32GB of RAM? I read a comment a day or so ago that said that LPDDR4 is limited to 16GB. I'm wondering if Apple can release a 32GB machine within the next 6 months.


Micron makes 32GB LPDDR4 memory chips if I'm reading https://www.micron.com/products/dram/lpdram correctly. Above that, they plan on using LPDDR5.


I'm not an expert here, just going by the articles I've read on this topic recently. That said...

> Its just LPDDR4. HBM is a different thing.

Hmm, Apple refers to it as High Bandwidth Memory. The Register[1] refers to it as "High Bandwidth Memory" and also:

"This uses 4266 MT/s LPDDR4X SDRAM (synchronous DRAM) and is mounted with the SoC using a system-in-package (SiP) design."

Which implies to me that some kinds of LPDDR are indeed HBM and that what's on the M1 isn't something which can be replaced by drop-in RAM.

> That's not the advantage.

Hmm, again quoting The Reg here because I think it's a fairly independent source on Apple hardware.

"In other words, this memory is shared between the three different compute engines and their cores. The three don't have their own individual memory resources, which would need data moved into them."

Maybe I didn't express this clearly enough above.

> AMD / Intel have been connecting L3 caches to their iGPU solutions for over a decade now.

My point above wasn't that Apple invented this idea. It was that the idea doesn't scale well. Who invented it doesn't really matter.

[1] https://www.theregister.com/2020/11/19/apple_m1_high_bandwid...


You're right. The Register article says that. The Register article says that because they're parroting Apple's marketing. I don't fault The Register for copying Apple. I fault Apple for being misleading with their arguments. LPDDR4x is NOT HBM2. HBM2 is a completely different technology.

"High Bandwidth Memory", or HBM2 (since we're on version 2 of that tech now), HBM2 is on the order of 1024-bit lanes per chip-stack. It is extremely misleading for Apple to issue a press-release claiming they have "high bandwidth memory".

HBM2 is the stuff of supercomputers: the Fujitsu A64FX. Ampere A100 GPUs. Etc. etc. That's not the stuff you'd see in a phone or laptop.

> "In other words, this memory is shared between the three different compute engines and their cores. The three don't have their own individual memory resources, which would need data moved into them."

It would be colossally stupid for the iGPU, CPU, and Neural Engine to communicate over RAM. I don't even need to look at the design: they almost certainly share Last-Level Cache. LLC communications are faster than RAM communications.

From the perspective of the iGPU / CPU / Neural Engine, its all the same (because the "cache" pretends to be RAM anyway). But in actuality, the bandwidth and latency characteristics of that communication are almost certainly optimized to be cache-to-cache transfers, without ever leaving the chip.


> HBM2 is the stuff of supercomputers: the Fujitsu A64FX. Ampere A100 GPUs. Etc. etc. That's not the stuff you'd see in a phone or laptop.

It also made an appearance on some consumer discreet GPUs, notably the Vega 56, Vega 64, and Radeon VII.


No wonder nobody knows about it, if they're keeping it discreet.


You can get it in Apple's current laptop GPUs as well: Navi 10 + HBM.


While I can certainly understand a certain amount of frustration at Apple for using phrasing that has been used as an official marketing term, I also have to say that I don't think too highly of whoever came up with the idea that "HBM" should be an exclusive marketing term in the first place, when it's a very straightforward description of memory that includes, but is not limited to, memory within that specific label.

Based on what Apple's doing, it seems perfectly legitimate and reasonable to refer to the memory in the M1 Macs as "high-bandwidth memory", even if its lanes are not 1024 bits wide.


When supercomputers and GPUs are pushing 1000GB/s with "high bandwidth memory", it's utterly ridiculous to call a 70GB/s solution 'high bandwidth' (128 bits x 4266 MT/s).

There's an entire performance tier between normal desktops and supercomputers: the GDDR6 / GDDR6x tier of graphics RAM, pushing 512GB/s (Radeon 6xxx series) and 800GB/s (NVidia RTX 3080).

To call 128-bit x 4266MT/s "high bandwidth" is a complete misnomer, no matter how you look at it. Any quad-channel (Threadripper: ~100GB/s) or hex-channel (Xeon ~150GB/s) already crushes it, let alone truly high-bandwidth solutions. And nobody in their right mind calls Threadripper or Xeon "high bandwidth", we call them like they are: quad-channel or hex-channel.


> "High Bandwidth Memory", or HBM2 (since we're on version 2 of that tech now), HBM2 is on the order of 1024-bit lanes per chip-stack. It is extremely misleading for Apple to issue a press-release claiming they have "high bandwidth memory".

Considering how few people know what HBM2 is, the idea that Apple is trying to make any claims that their solution uses HBM2 seems weird.


> Considering how few people know what HBM2 is, the idea that Apple is trying to make any claims that their solution uses HBM2 seems weird.

https://www.apple.com/mac-pro/

Apple absolutely knows what HBM2 is. They've literally got products with HBM2 in it.

Note: LPDDR4x RAM is typically called "Low Power" RAM; in fact, that's exactly what LPDDR4x is for. It's designed to be low-power-consumption RAM for extended battery life.

It takes a special level of marketing (i.e. misleading / misdirection) to buy two chips of "low power" RAM and try to sell it as "high bandwidth RAM".


> "This uses 4266 MT/s LPDDR4X SDRAM (synchronous DRAM) and is mounted with the SoC using a system-in-package (SiP) design."

> Which implies to me that some kinds of LPDDR is indeed HBM and that what's on the M1 isn't something which can be replaced by drop in RAM.

Sibling comment already outlined that HBM is unique & different from LPDDR. Using LPDDR means it's definitely not HBM.

But you can definitely hit LPDDR4-4266 speeds with drop-in RAM: that's DDR5, which in the "initial" spec goes all the way to 6400 MT/s. More relevantly, DDR5-4800 (so 4800 MT/s) modules are among the first to actually be manufactured: https://www.anandtech.com/show/16142/ddr5-is-coming-first-64...


It is just LPDDR4X; it is normal RAM. Claims of something special about the RAM are marketing untruths, other than it being high-end laptop RAM, like you would get on a high-end x86 laptop too.

That it is packaged right next to the CPU may reduce latency by half a nanosecond out of ~60.


Where it might help more is by reducing the power needed to run the ram. Putting it on package keeps the trace lengths to a minimum and might reduce the power the memory controller needs to talk to it.


Perhaps the 8 wide, deep ROB is better utilizing the peak throughput of the wide memory?


Which (non-Apple) laptops are shipping with 4266 MT/s LPDDR4X RAM?


Another discussion thread has noted that LPDDR4x channels can be x16 or x32 bits.

DDR4 is always 64-bits. Two channel DDR4 is 128-bits. So right there, 2-channel x 64bits DDR4 is the same bus-width as the 8-channel x 16bits LPDDR4x.

With that being said, 8-channel LPDDR4x is more than most I've heard of. But it's not really that much more than DDR4 configurations.

128-bit (2-channel) DDR4 at 3200 MT/s is 51 GB/s bandwidth.

4266 MT/s x 128-bits (8-channel) LPDDR4x is 68GB/s. An advantage, but nothing insurmountable.

--------

A Threadripper can easily run 4-channel 3200 MT/s (100GB/s). Xeons are 6-channel. GPUs are 500GB/s to 800GB/s tier. Supercomputer-GPUs (A100) and Supercomputer-CPUs (A64Fx) are 1000+GB/s.

----

HBM2 has a MINIMUM speed of 250GB/s (single stack, 1024 bits), and is often run in x4 configurations for 1000GB/s. That's what "high bandwidth" means today. Not this ~68GB/s tier Apple is bragging about; HBM is about breaking the TB/s barrier in bandwidth.

--------

But yes, I'll agree that Apple's 68GB/s configuration is an incremental (but not substantial) upgrade over the typical 40GB/s to 50GB/s DDR4 stuff being used today.
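
All of those figures fall out of the same formula, bus width (in bytes) times transfer rate; the HBM2 rate below is an assumed 2000 MT/s for a typical stack:

    def peak_gbps(bus_bits, mt_per_s):
        return bus_bits / 8 * mt_per_s / 1000      # GB/s

    print(peak_gbps(128, 3200))    # 2-channel DDR4-3200          -> ~51 GB/s
    print(peak_gbps(128, 4266))    # M1: 8 x 16-bit LPDDR4X-4266  -> ~68 GB/s
    print(peak_gbps(256, 3200))    # 4-channel Threadripper       -> ~102 GB/s
    print(peak_gbps(1024, 2000))   # one HBM2 stack               -> ~256 GB/s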


Some brief googling tells me:

- Razer Book 13

- Dell XPS 13 9310 2-in-1

- MSI Prestige 14 EVO

- Intel's new whitebox laptop platform

I think I've read that there are now Ryzen laptops shipping with LPDDR4x as well. It's awesome that Apple is using ram with this much bandwidth, but it's not exclusive.


I completely forgot that 11th gen Intels actually supported LPDDR4 / LPDDR4x.

It's kind of ridiculous: LPDDR4 has been used in phones for years, but it took until this year before Intel / AMD added support for it. Ah well. This one is definitely Intel's / AMD's fault for being slow on the uptake.

DDR4 had largely the same power draw as LPDDR3 (but was missing sleep mode). So I think CPU makers got a bit lazy and felt like DDR4 was sufficient for the job. But LPDDR4x is leapfrogging ahead... the phone market is really seeing more innovation than the laptop / desktop market in some respects.


Apple is using high-bandwidth memory, but not High Bandwidth Memory (HBM).


> LPDDR4 is just 128-bits per chip.

It is? I thought it was up to 32.


32-bits per channel. Multiple channels per chip.

I'm doing this from memory btw, so I might have gotten something wrong. But my understanding of LPDDR4x is 32-bits per channel, 4x channels per chip (typical: different LPDDR4x chips can have different configurations).


My understanding is 16 bits per channel, and 2 channels per chip, typical.

Wikipedia says the same and here's a randomly selected datasheet where you can see the pinout https://www.mouser.com/datasheet/2/198/43_46LQ32256A_AL-1877...


Sounds like time to look things up!

I just went to Micron's page for LPDDR4: https://www.micron.com/products/dram/lpdram/automotive-lpddr...

Automotive grade, but allegedly still LPDDR4. Looks like they're x16 and x32, as you said.

--------

Hmmm, I know that the M1 is specified as 128 bits on Anandtech's site. If it's 4 dies per chip, then 32 bits per die, 4 total dies of LPDDR4? I know Anandtech is claiming a 128-bit bus, so I'm reverse-engineering things from that tidbit of knowledge.

-----

Either way, these 128-bit or 64-bit numbers are all very much smaller than 1024-bit-per-stack HBM2 (high-bandwidth memory). Much, much, much much smaller. It's clear that the M1 isn't using HBM at all, but some kind of LPDDR4x configuration.


https://images.anandtech.com/doci/16252/M1_575px.png This states 8 channels but teardowns have shown that there are only 2 memory chips so I'm not sure exactly what's going on.


LPDDR4x can have multiple dies per chip.

Based on your discussion point, then I'm thinking x16 per channel, 4-dies per chip, 2-chips for a total of 8-dies for the 8-memory controllers.


But is there a standard pinout with that many data pins?


https://www.digikey.com/en/products/detail/micron-technology...

Hmm... I don't know what's going on.

This chip claims to be LPDDR4x, but it is a 556-pin package. This is in contrast to your earlier data-sheet, which only has 200-pins. Maybe LPDDR4x doesn't have any standardized pinouts?

This isn't exactly where I normally work, so I'm not entirely sure what is going on.


Most likely, it's two DRAM dies per package. Those DRAM packages look ridiculously wide, so the DRAM dies might not even be stacked.


> LPDDR4x is 32-bits per channel, 4x channels per chip

DDR4 bus width is 64 bits per channel. Are you saying DIMMs include a sufficiently fast and complex controller chip?


Clearly, this discussion has proven that I'm a bit out of my area here.

But my understanding is that LPDDR4 packages have multiple dies of RAM stacked inside a chip. There are only two RAM chips on the Apple M1.

https://upload.wikimedia.org/wikipedia/commons/8/83/Apple_M1...

You can clearly see the two LPDDR4x chips packaged with the Apple M1. There's no question that the M1 package contains two DRAM chips. The only question is the configuration of the "insides" of these DRAM chips.

------

Anandtech's interviews give us a preview: 128-bit wide bus (8 x 16-bit it seems), two chips, LPDDR4x protocol. The details beyond that are pure speculation on my part.


> They moved high bandwidth RAM to be shared between the CPU & GPU, but they[1] can't just keep expanding the SoC with ever larger amounts of RAM Perhaps they will have external RAM in addition to the on chip RAM, with perhaps 32GB of onboard memory and the external RAM being used as swap? This takes a bit away from the super-efficient design they have currently.

The RAM isn't on the SOC, it's soldered on the package, which is a very different proposition. Possibly so they don't have to bother routing on the board (meaning they can shrink the board further), and can cool the RAM at the same time as the SoC (without the need for a wider cooler base).

The PS5 also has unified memory and uses faster RAM, the RAM chips are soldered on the board around the SoC.


> The PS5 also has unified memory and uses faster RAM, the RAM chips are soldered on the board around the SoC.

Perhaps this won't be as limiting as I suspected.


>I'm kind of wondering how scalable the performance increases they made here are

It's a good question and really only time will tell, but this same comment was made about these chips in the iPhone: "impressive yes, but would they be able to scale up to power a laptop?" A lot of people said no back then, and yet here we are.

Apple's not stupid, and they're not going to slap Intel in the face like this unless they're 100% sure they won't need them again in the future. I'd be willing to bet they already know the answer to these questions (and likely even have 32/64GB M1s running in their labs). There's just no way they'd go into something like this blindly.


I thought I was pretty clear that I fully expect Apple will be able to migrate their entire line.

My only question is whether they will blow the doors off Intel/AMD the way they did with the M1. With the MacBook Air (and largely with the base MacBook Pro), Apple was able to more than double the processor speed even while improving battery life. I'm not convinced we'll see that kind of explosive speed improvement with the 16" MacBook Pro. Battery life will almost certainly trash the current MBP. I'm just not confident we'll see the base 16" MacBook doubling its CPU performance.

FWIW, I don't see anything Apple has done as a slap in the face. They've been pretty reasonable about not mentioning Intel in their performance numbers. Instead all metrics are based on the performance of previous generation MacBook Air.


> Apple's not stupid and they're not going to slap Intel in the face like this if they're not 100% sure they won't need them again in the future.

Is there some explanation why AMD processors are unsuitable in, e.g., the Mac Pro?


Given that AMD licenses recent tech to Microsoft and Sony, I'd be shocked if they would turn down a reasonable offer to design SOC's for/with Apple.


I would be stunned if Apple outsourced any part of their CPU/ GPU design. About the closest I could see to that would be offering discrete graphics on some of their top end models.


Just to add a few things.

The HBM confusion comes from Apple marketing; they have since corrected the phrasing.

It uses LPDDR4x, fairly standard across the industry.

The LPDDR4 spec allows up to 8 channels per package (although I have not yet seen any manufacturer announce such a SKU), meaning Apple could scale to 32GB, or with LPDDR5 the same setup gives them 64GB while increasing the bandwidth by ~30%+.

From a portable perspective I think 64GB is good enough, not to mention that larger amounts of RAM are unsuitable for a product running on battery.

I have no idea how their desktop parts will play out, especially given the memory bandwidth required for their unified memory.


This is something I have been trying to understand. What is the practical limit today? How many cores, and how much more memory, can Apple add before they run out of silicon? From what I understand there are a number of practical limits to how big they can make the die, because the equipment in the fabs isn't really made for arbitrarily large chips, and anyway the defect rate rises quickly if you make the chips too large.

Like can we expect say something like a 32 core M3 chip with 64 GB RAM, or will that be too much for this design?


> with perhaps 32GB of onboard memory and the external RAM being used as swap?

I wouldn't be surprised if Apple was working on some ways to optimize swap so that SSDs can fill the role of "external RAM." At this point, PCI Express has the theoretical bandwidth to compete with DDR5, but it can't handle anywhere near the number of transactions. But I bet with some clever OS-level tricks, they could give performance that's good enough for most background applications.


Apple's still working with the same NAND flash memory as everyone else, so there's little opportunity for them to do anything particularly clever at the low level.

But even looking at commodity hardware, high-end SSDs are already capable of handling a comparable number of 4kB random reads per second to the number of context switches/page faults a CPU core can handle in the same second. The huge latency disparity is the problem: the SSD would prefer you request 4kB pages dozens at a time, but a software thread can only fault on one page access at a time. Using a larger page size than 4kB will get you much better throughput out of an SSD. On the OS side, swapping out a large number of pages when the active application changes can make good use of SSD bandwidth, but when a single application starts spilling into swap space, you're still fundamentally screwed on the performance front.
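
Rough numbers (all assumed, not measured) that show why faulting one page at a time wastes the drive:

    page_bytes   = 4 * 1024      # classic 4kB page; Apple Silicon uses 16kB
    read_latency = 80e-6         # s, assumed NVMe random-read latency
    drive_iops   = 500_000       # assumed 4kB random reads/s at high queue depth

    one_fault_at_a_time = page_bytes / read_latency / 1e6   # ~51 MB/s
    fully_queued        = drive_iops * page_bytes / 1e6     # ~2000 MB/s

    print(one_fault_at_a_time, fully_queued)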


While not significantly better, XNU on Apple Silicon uses 16kB pages already.


You could take a stack of DIMMs, add a memory controller and a USB-C interface, and create the world's fastest external swap drive.


"Now you got a big problem, because neither Intel, AMD or Nvidia are going to license their intellectual property to Dell or HP for them to make an SoC for their machines."

This isn't quite correct. AMD and Nvidia have made custom SOCs for game consoles for years now; they could create customized SOCs for PCs if this was a critical issue. But I don't believe it is. The PC workloads and use cases follow well-known patterns and market segments and you don't really need a lot of different custom accelerators, or be on a single SOC for that matter. Video-related processing has been part of GPUs for many years.

For that matter, making RISC CPUs is not actually a huge obstacle for these guys. (Nvidia does.) Yes even Intel.

The article also seems to entirely omit the M1's heterogeneous core strategy, where 4 of the cores are high performance and the other 4 are optimized for power efficiency. A deeper analysis of this and how software manages them would be more interesting.


>The article also seems to entirely omit the M1's heterogeneous core strategy, where 4 of the cores are high performance and the other 4 are optimized for power efficiency. A deeper analysis of this and how software manages them would be more interesting.

When the CPU utilization ramps up on M1 MacBooks, the user interface always remains buttery smooth. I haven't been able to make the M1 Air drop a frame yet no matter how hard I push it. Meanwhile doing anything remotely CPU intensive on an Intel MacBook will turn the user interface into a janky stuttery mess.

I suspect the heterogeneous cores are the "secret" behind this. Performance tasks are delegated to the performance cores, leaving the efficiency cores free to instantly respond to any UI-bound tasks.


Where do you have frame drop issues on your intel Mac? I have a 2017 15 inch with the best i7 and run a 4K and 1440p monitor with no noticeable issues. Just ordered an M1 Air.


Maxed out iMac 2017 here, I can't wait for a new iMac with better performance and zero noise. I love the 5K display, and the only comparable display is the Pro Display XDR, which I can't justify.


What are your frame drop issues on your iMac 2017?


It does, all the time, when I run anything that taxes the CPU. Switching apps doesn't feel instant, and I can always hear the fan going in the background.



Not to mention that AMD's IO / memory controller has a whole ARM CPU on the same die for handling their platform security.


AMD does, but AFAIK Nvidia doesn't license modifiable IP for the Switch, PS3, or original Xbox.


Gotta say I'm not a fan of this article. A few points I really disagree with:

1) It puts way too much emphasis on the ISA being a driver for performance. People have been making this claim against x86 forever, going back to the 90s when RISC chips started putting up some very impressive performance numbers. Lots of talk about how the x86 ISA was out of runway. This period existed for a grand total of about 4 years (94 thru 97) until the Pentium II hit the market, which allowed PCs to start cutting into the workstation market. AMD creating the x86-64 extension handled the addressing limitation well too, letting it flourish in the server space as well.

2) The author seems to think Intel or AMD cannot make SoCs because of their business model. This really isn't the case, and both companies have slowly been moving more "stuff" on package. AMD is already there with the Xbox and PlayStation chips they manufacture. AMD's chiplet approach would be particularly adroit at exploiting this. I get the feeling both AMD and Intel would LOVE it if PC vendors started demanding SoCs. And while the M1 has lots of little specialty processors onboard, both AMD and Intel have more robust processor I/O. That's by design: the M1 isn't designed for the person who needs a gazillion PCI-E lanes for storage and networking (the M1 likely has half the PCI-E lanes of consumer Zen 2/3 CPUs and 1/4 to 1/8 compared to Threadripper and EPYC)

3) Glosses over the fact that one of the greatest challenges in computer architecture is coping with the so-called "memory wall": there's a growing latency chasm between cache and memory. Apple cleverly mitigates this by using very fast, low latency memory (LPDDR4x at 4266 MT/s vs DDR4 at 3200 MT/s in x86 CPUs). But beyond how many MT/s you're operating with, I would love to see some latency figures. They likely blow your typical consumer DIMMs out of the water. Both AMD and Intel work around this limitation with larger caches and smarter prefetching, but that only gets you so far. edit: And I would wager much of the M1's impressive performance comes from Apple's approach to addressing the memory wall. Memory latency benchmarks seem to strongly favor the M1 over both AMD and Intel CPUs.


The Intel ISA has some very powerful instructions that allow it to keep performance up, but unfortunately decoding n x86 instructions involves doing a number of things that are O(n) in latency and O(n^2) in chip area/power - both instruction boundary detection and micro-op scheduling (since instructions can emit more than one uop) are included here. This isn't too bad when n is 4, but it has stressed both AMD and Intel's capabilities to make it to 8, and I don't think anyone would ever seriously consider building a 16-wide x86 machine. Once memory and I/O buses get fast enough to feed a 16-wide general-purpose computer, x86 as an architecture is in trouble. They have quietly gotten fast enough to feed an 8-wide machine.
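
A toy sketch of that serial dependency (the lengths are made up; real x86 length decoding is far more involved, so this only illustrates the dependency chain):

    def boundaries(lengths):
        starts, pos = [], 0
        for n in lengths:              # each start depends on the previous length
            starts.append(pos)
            pos += n
        return starts

    print(boundaries([1, 3, 7, 2, 5]))   # [0, 1, 4, 11, 13]
    # A fixed 4-byte ISA knows every boundary up front (0, 4, 8, ...), so all the
    # decoders in an 8-wide front end can start in parallel without guessing.
    print([i * 4 for i in range(8)])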

LPDDR4 very quietly took the speed crown from DDR4 due to the fact that the signal integrity problems are less demanding (soldered vs DIMM being the standard interface), and it looks like the DDR interface may be near the end of its useful life. Transceiver-based memory modules using standards like CXL and CCIX may be the long-term fix to the problems of DDR4. In the short-term, those are going to hurt, but long-term they should be a lot better.


Common knowledge is that going wider quickly hits diminishing returns so generally there hasn't been a lot of pressure for x86 to go wider.

It seems to be working for Apple though, possibly because they are targeting different trade-offs (lower clock speed, for example); but maybe their secret is better modelling of common workloads, revealing this common knowledge to be outdated.


I'd like to see this explored more. I think AMD/Intel might be historically looking at C++/compiled workloads more than JS/Java/python interpreted workloads. JITs very well could take better advantage of wider cores than compiled code.

Diminishing returns with increasing cache sizes and going wider doesn't seem to hold back the M1.


Maybe. JITs and runtimes in general might add some "overhead" code that does book-keeping and verifies assumptions (to rollback conditional optimistic optimizations). Maybe wider engines allow running this overhead without impacting actual runtime.

It probably also helps hide the cost of type checks and bounds checks.


It might just be that the combination of better branch prediction, TSMC 5nm, 8-channel memory, 16k pages, and 8-wide cores with larger structures (especially cache) is enough to account for the better perf/watt of the M1.


> The author seems to think Intel or AMD cannot make SoCs because of their business model.

You gave a good example of AMD, but Intel has been making the Xeon-D SoC for a while now. They've always looked interesting to me, but outside of expensive SuperMicro motherboards, I've never seen one available to the general public.

I think you are right though - both AMD and Intel would love it if vendors demanded high-performance SoCs.

https://en.wikipedia.org/wiki/Xeon_D


fwiw, I think Intel declared the Xeon-D line dead. They're not planning anything new there.

That said, I don't think they would need much to make a Core series processor into a SoC; the chip does almost everything you'd need already: memory, some of the PCI-e, (some) USB, (some) SATA, and legacy IO are already in the CPU for both Intel and AMD, and both offer chips with GPUs on board too. AMD has a 'chipless' chipset available for Ryzen, the A300; it even has a new performance version, the X300. There's still plenty of other chips on the board, but there's nothing that's really a chipset.


That's really disappointing, but entirely unsurprising. It seems that the "market" doesn't ever want anything new unless someone hits it over the head with it, like Apple just did.

The neural engine and DSP on-chip are actually what I find most interesting about the Apple chip. It would be nice to see that start to happen with other manufacturers.


On 1) I think that you're right about the impact of the ISA being overdone, but let's not forget that Intel has had a process advantage for a long time and that it was partly the volume of consumer sales that supported their R&D and helped them win on servers. That situation has changed now - the process advantage has gone and there is lots of money pouring into TSMC and the mobile SoC designers.

For me the key point on the ISA is that Arm is not limited to two companies: what we've seen in other markets will happen in desktops and servers - the really big customers can take the ISA and build highly competitive SoCs that suit them on the leading process technology and for lower cost.


By what measure are modern AMD CPUs not an SOC (especially the APU models)? They literally can function without a chipset.

The chipsets are little more than a PCI-E extender with some built-in USB and SATA controllers connected to it.

The CPUs do perform some form of handshaking with the chipset, but after that is done, it does not care. This handshake is literally about being able to lock out certain on-die features if you don't buy the more expensive chipsets. (Or partner with AMD to get access to a small 8-pin activator chip to unlock the features you need, and add your own PCI-E extender if desired.)


I agree but I think the author is using a different standard and I’m saying AMD and Intel already meet that standard.


The explanation provided by the article is poor. It has very little to do with SoC or specialized instructions -- it's fast on general purpose code.

The main reasons appear to be that it's a solid design, i.e. competitive with what x86-64 processors are offering, and on top of that has some specific characteristics that x86-64 processors don't currently have (but obviously could). It's built on TSMC 5nm, which is possibly the most power efficient process in the world right now. It's using memory which is faster than the DDR4 currently used by the competition. It's a big.LITTLE design, which is more power efficient, because you get strong single thread performance from the big cores and higher efficiency on threaded code from the little cores.

It's not magic. It's engineering.


The article goes into what makes it fast on general purpose code in the bottom half. Mainly, it's the better out-of-order execution that is possible with a large ROB, and how this wouldn't work as well on X86.

Your points don't explain why at 3Ghz this chip can outperform 5Ghz chips in single-threaded workloads. Using a 5nm process can give you more cores and less power usage, but it won't improve IPC. The DDR4 is faster than average, but you can get those kinds of speeds on enthusiast PCs and it doesn't make a big difference there.


> Your points don't explain why at 3Ghz this chip can outperform 5Ghz chips in single-threaded workloads.

It's a trade off. Making it wider is the reason the IPC is higher, but it's also the reason it is a 3GHz chip instead of a 5GHz chip.

The author's explanation of why x86 processors can't do this is also not especially compelling. The reason very wide processors are atypical is that common spaghetti code has poor instruction level parallelism -- the processor can't extract what isn't there, so there is a point at which higher clocks become the way to make bad code run faster. Interestingly, synthetic benchmarks are often less susceptible to this because the code is optimized to maximize ILP. It's a shame we still have so few real world benchmarks (mostly because the applications still haven't been ported).

> The DDR4 is faster than average, but you can get those kinds of speeds on enthusiast PCs and it doesn't make a big difference there.

It depends heavily on the task, but the difference in some cases is close to 40%:

https://www.tomshardware.com/reviews/best-ram-speed,5951-6.h...


Exactly. The far larger ROB, great branch predictor, large L1 and significantly decreased memory latency, combined with a lower core frequency, all contribute to keeping the core fed and preventing it from stalling.

Still, sustaining 8-wide (or even just 4-wide) is very hard. Apparently most code has an ILP of 1.5 on average.

I guess that the large width helps with recovering from stalls (somehow absorbing spikes).
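
To make the ILP point concrete, here is a minimal C++ sketch (illustrative only; the 1.5 figure above is the commenter's number, not something this measures). The first loop is one long dependency chain, while the second exposes four independent chains that a wide out-of-order core can overlap:

    #include <cstdio>

    int main() {
        const int N = 1 << 20;
        static float a[1 << 20];
        for (int i = 0; i < N; ++i) a[i] = 1.0f;

        // One serial dependency chain: every add waits on the previous sum, ILP ~1.
        float s = 0.0f;
        for (int i = 0; i < N; ++i) s += a[i];

        // Same work split over four independent accumulators: the chains don't
        // depend on each other, so a wide out-of-order core can run them in parallel.
        float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (int i = 0; i < N; i += 4) {
            s0 += a[i]; s1 += a[i + 1]; s2 += a[i + 2]; s3 += a[i + 3];
        }
        std::printf("%f %f\n", s, s0 + s1 + s2 + s3);
        return 0;
    }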


It'll be interesting when they inevitably release the "M2" chip with higher memory capacity. If the M2 has 32GB support I'll buy one immediately. For now, I'll be sticking to my "never buy gen 1 of any technology" rule. Right alongside my "use tech until it breaks" rule; however, if you incentivize yourself to be a "power user", meeting the latter rule is far easier than the former ;) .


> never buy gen 1 of any technology

I totally understand that sentiment, but in some ways this isn’t the first generation. While it’s the first desktop chip that Apple has RELEASED, they’ve been making their own chips since the iPhone 5s (IIRC). I imagine all the big bugs are sorted out.


By the time docker, virtualization, and programming environments are mostly sorted out, there will likely be a successor to the M1 on the market, so while I'm not GP, I do agree with the sentiment in this case. Especially with an ISA change.


Virtualization and Docker are both well on their way. You can download working VM systems, from the very simple to the complex with QEMU. A Docker engineer announced on Twitter that he had an early (pre-alpha) version of Docker working on the M1.

https://forums.macrumors.com/threads/success-virtualize-wind...

https://forums.macrumors.com/threads/ubuntu-linux-virtualize...

https://twitter.com/morlhon/status/1332609373051478016?s=21


Docker is still super early in development, see the most recent update here: https://github.com/docker/roadmap/issues/142#issuecomment-73...

I wouldn't expect official stable support to be available very soon. Even then, even if most of the official Docker image library gets ARM support, a ton of community images will need to be ported – all of this will take quite some time.


Not to mention Homebrew and associated packages.


But they run ARM docker images, not x86.


This. So far, after a week, my M1 is flawless. This is not first gen tech by a mile.

I’m slightly in shock to be honest. I was about to bail out to Linux and buy a monster PC but I can’t get a new Ryzen or Radeon card at the moment due to supply so I blew the cash on a Mini, which was actually in stock.


I'll be waiting for the M3X in 2022, at which point they should be using TSMC's 3nm node (another 15% more performance and density) and have had 2 years for the software ecosystem to get settled on ARM.


You can see the gaps in the lineup that they have yet to fill, since the 4-port 13" Macbook Pro and high end Mac Mini are still being sold with Intel CPUs.

Both of those machines need at least 32GB options to properly replace them. I would guess we see that support in an M1X with more cores and more Thunderbolt lanes next year as they work their way up the lineup.


It's a total guess, but I figure "M" is for "Mobile", and it will increment up in numbers like the A series has done in iPhones/iPads.

Then we'll see some other letter, maybe "D1" for Desktop, for the machines next up the list performance-wise (higher-end MacBook Pros and Minis and lower-end iMacs).

And finally we'll see "P1" for the pros, which is the iMac Pro and Mac Pro. The real beast, hoping for at least 128 cores.

All of them will just go up in number as the years go on.


Don't you think "M" is just for "Mac"?


It sure could be, but I'll be shocked if they have their current M1 in the $999 MacBookAir and something still called "M1(x)" in the Mac Pro that has 128+ cores and >128GB RAM, etc. etc.

IMO, that's a totally different product, and it needs a different naming convention. But I'm just guessing.


They put an M1 in a mac mini, I think it just stands for Mac. "M2 Max" or something is exactly what I expect for their next desktop chip name


I suspect M stands for Mac, but it's fair to note that the Mini (and iMac) were using mobile (laptop) hardware previously, so it's certainly plausible that M stands for mobile.


The Intel Mac Mini and iMac both use Intel's desktop chip line, with 65 watt TDP.


Huh, I stand corrected, thanks. Did that change at some point or have I been consistently wrong?


That I don't know! Except the original Bondi iMac definitely was laptop parts.


I think you are right in the P series, that would make a bunch of sense


I think if you want an MB Air, it is absolutely fine to buy now, as the M1 is precisely targeted at this machine. For the other machines, there are surely variations to come which have more cores, more memory, etc. For those, you should wait until they appear.


Buying an M1 now that I will give to my wife once the M2 16-inch comes out. A family member opted for an iPad over a 15-inch MBP that was collecting dust, so I'm not really out any money on the "use tech until it breaks" rule.

I almost ordered an intel 16 MBP inch a couple weeks ago. Thankfully I didn’t pull the trigger.


I'm am very appreciative to Apple for finally closing the endless "The ISA doesn't matter" nonsense. The ISA does matter and it's part of what enabled Apple to go so wide.

However, the article is slightly misleading. The ROB isn't where instructions are issued from; that would be the schedulers, which usually hold a much smaller set of instructions (16 per scheduler is common). The ROB holds everything that has been decoded and not yet retired, thus including instructions that haven't yet been issued to schedulers and, more importantly, instructions that have been executed but not retired (e.g. they might be waiting on older instructions to retire first).


Historically, I have yet to see a single case where the ISA has mattered. Every ISA will of course take advantage of its unique features, but in general fast chips come from companies with money and talent, not a particular ISA.


I think Itanium mattered. Its EPIC instruction set tried to push the scheduling complexity from hardware into the compiler, which proved too difficult. That's part of why it failed.


Great example. Seeing as I'm working on my Merced box I felt like commenting on this.

Intel/HP did many things wrong when they designed EPIC. The absence of dynamic scheduling was supposed to have made it simpler and thus have a faster cycle time. Alas, it was so bloated and horrendously complicated when it finally arrived that it clocked very slowly (mine @ 733 MHz while contemporaries ran at 1+ GHz).

The EPIC designers probably also didn't anticipate the advances in OoO machines that got really good. Trying to statically schedule for unknown load-misses isn't just difficult, it's impossible without code explosion.

EPIC got some things right, but design by committee and betting against dynamic scheduling doomed it.


My recollection is that when P. A. Semi was acquired by Apple, a major part of their "secret sauce" was a custom cell library. And then Intrinsity, a few years later, also seemed to be using custom cells on Samsung's process. Is this recollection / understanding correct? Is Apple still using custom cell libraries rather than TSMC's standard library?


What does "cell" mean in this context?


It's one level of abstraction above the transistor level; in other words gates and sets of gates. Designing at the raw transistor level is tedious, time-consuming, and requires specialized knowledge of low-level electrical and physics characteristics. When you design with standard cells, you only have to think about the logic functionality.

https://en.wikipedia.org/wiki/Standard_cell

The parent's question is about whether Apple really did do some of the designing at the transistor level by designing their own cells, rather than using a standard library. This could give them an edge that ordinary users of the standard library wouldn't have.


https://en.wikipedia.org/wiki/Standard_cell#Library

"A standard cell library is a collection of low-level electronic logic functions such as AND, OR, INVERT, flip-flops, latches, and buffers. "

I think SRAMs are also built of standard cells & there can be a lot of flexibility to improve on the fab's library, particularly with multiple ports for caches.


It's the basic low-level building block of an integrated circuit.[0]

[0] https://en.wikipedia.org/wiki/Standard_cell



The team at P.A.Semi was the team that developed the DEC Alpha and the DEC StrongARM. After DEC folded, they went on to develop PowerPC based chips. Shortly after they completed their PowerPC design they were bought up by Apple.

https://en.wikipedia.org/wiki/P.A._Semi


Ironically StrongARM was sold to Intel who developed it as XScale before selling to Marvell.


Couldn't Intel and AMD implement a "Rosetta2" like strategy? That is, couldn't they ship CPUs that fundamentally are not decoding x86 ops, but some different ISA, and then layer a translation layer on top of it?

The Transmeta Crusoe used to do this, and I think the NVidia Denver cores did too?

I think fundamentally, though, this analysis depends heavily on single-threaded performance being the bottleneck for most apps. A lot of sophisticated workloads are GPU or TPU driven, so hand-waving away multi-threaded performance and treating single-threaded performance as the end-all of client-side performance overemphasizes its importance, I think.

Also, the "they don't control the stack" argument is wrong. Effectively, Intel/AMD + Microsoft acted as a duopoly, and if you toss in NVidia, the structure of DirectX and Windows is largely a cooperative effort between these OEMs. If Intel/AMD needed some fundamentally new support for some mechanism to boost Windows performance by 20-30%, Microsoft would work with them to ship it.


The PC world has already switched architectures from 32-bit to 64-bit, so it's doable.

The big issue I think is, if you are Intel / AMD, why would you do this? Even if there are performance gains, do you really want to undermine one of the things (backwards compatibility) that distinguishes you from the competition? Plus, what would the new ISA be?

Plus Intel still has the memory of Itanium.


The whole m1 vs intel and AMD saga reminds me of innovators dilemma and the lesser referenced investors solution by Clayton Christensen.

In the solution book it basically details that everything goes through an ebb and flow of aggregation and disaggregation. This was often pointed out with Craigslist as the original aggregator and how many startups have been founded now to disaggregate it.

Interesting to read this article from that perspective. Intel and AMD disaggregated and found success for decades but now aggregating with SoC looks to yield massively better results.

Interesting to see certain business patterns continue to reappear.

Also shows that in tech no one has dominance if they don’t adapt.


As I've written elsewhere and the article discusses in excellent detail, something that's missing in the conversation is a discussion of the custom application specific silicon over here, https://areoform.wordpress.com/2020/11/30/tanstaafl-apples-m... (on HN, https://news.ycombinator.com/item?id=25256025 )

I dislike linking to myself, but the take is long and it essentially points out that Apple themselves hint at Application Specific silicon in the M1 in their keynote, and when added with their acquisitions of ASIC specialists, it's not a long-take to say that comparing the M1 against most vanilla X86 processors isn't quite fair. It's a very different beast with its own tradeoffs.

By colocating several different application specific processors under one roof and selling it as a CPU, Apple may have revitalized/invented something new. It's a bit like the iPhone - taking elements that were already there and catalyzing them into a new form. This new not-quite-a-CPU SoC likely has significant implications for the future of computing that should be discussed and detailed. Because,

There Ain't No Such Thing As A Free Lunch.


But the cores are really good in real world, general purpose computing so I'm not sure the ASS (Application Specific Silicon) is more than half the story.


While that’s no doubt helping in, say, classifying photos in iPhoto, it’ll do nothing for SPECInt. The impressive general purpose benchmark results are down to the CPU.


A lot of the stuff you’re referring to as “application specific” is already present in Intel Macs with the T2 chip. Except obviously slower and older. People aren’t discussing it because it isn’t really “new”


They have indeed moved the system controller etc from the T2 onto the M1 more or less wholesale. If you look at the TEM image I have added to my post of the M1's sister chip, the A14, you'll see that the classical CPU and GPU take up only a small fraction of the die. The rest isn't memory as a naive take would suggest. It's custom silicon handling a large basket of previously CPU tasks.


No more a tradeoff than something like AVX-512 though right? Or the custom instructions Intel will add for large customers like Facebook?


I am wondering how the M1 is doing regarding the various side-channel attacks that came to light with the Spectre/Meltdown publications. Is the M1 less prone to side-channel attacks by design, or is this just something not yet known?


I'm sure it'll be impervious to any attack vector just like the T2 chip. Oh, wait. Ooops.

I'm sure all of the hackers are having a field day looking for vulns in the new chip. It'd be a hacker's fantasy to be the first one to find something on the brand new chip.


The core design is largely identical to Apple's mobile processor line.


A few years ago, Apple's Portland, Oregon office hired dozens of top architects away from the Intel site 10 miles away (for big $$$). For those of you who don't know, Oregon hosts D1X (Intel's gigantic cutting-edge fab) and several architecture teams, predominantly veterans from the P4 product line.

This probably has something to do with it.


Ugh. Any way around the requirement to sign in with Google? Very frustrating that this is required to read a damn blog post.


I use uBlock origin. Never saw any signing form.


disable third-party cookies in your browser


I'm never sure if I like or hate when a site's adblock/article limit/sign in requirement etc is so easy to circumvent.

On the one hand it's good for me, on the other I feel like, couldn't you have put in a bit more effort? if you're going to make it that easy why bother?

(I know it's just because most people don't realise)


I use the excellent Readium bookmarklet[0]. Drag it to your bookmarks bar and whenever you encounter a Medium article you can't read, just click it and problem solved.

[0]https://sugoidesune.github.io/readium/


Use an incognito tab


Incognito mode of your browser


Using a JavaScript toggle button also works.


Block javascript from medium.com


All these specialized co-processors make me a bit sad. I've already got a graphics card that can decode some video formats, but not new ones, and that will never change until it becomes e-waste. Now we get the same thing for image formats, ML, etc? And you're even more locked in to these co-processors since they use space that could have been used to have a dozen general purpose cores. So good luck ever watching a video that doesn't use a codec your hardware supports.


A lot of modern hardware coprocessors are hybrid blocks that can be configured through firmware. You can dedicate a very small DSP with the block to feed and manage it. I haven't been involved in a new chip design in a few years, but a lot of specialized cores were vector processors (very narrowly defined vector processors). I don't know if that has changed.

I find it surprising that you have a graphics card that won't decode new video formats, unless the vendor has completely abandoned support for it. I have never encountered a modern graphics card that wasn't completely configured and run with firmware.


Maybe I just don't pay attention, but I've never heard of new firmware going out to multi-year old graphics cards and adding new features. Sure it may be possible, but what company is going to go through the trouble of that?


I'd be ok with coprocessors if there was an easy way to swap them in and out.


A lot fewer molecules of e-waste with the tiny SoC, at least.


Also, doesn't Apple have it "easy", designing a chip from scratch with zero backward compatibility, plus managing both the HW and the SW?

Intel & AMD have to support thousands of different configs:

- I can plug my old memory in and it works

- I can plug my old graphics card in and it works (PCI-e)

Apple has zero constraints around legacy hardware and software (besides that emulation layer), so yes, it's a great achievement, but it's very different from designing a chip for the PC.


You put "that emulation layer" in parentheses but it's no trivial feat writing an x86/x86_64 emulator with high enough fidelity to run almost all Mac apps at almost native speeds.


Furthermore the M1 has some hardware support for this, via a runtime-toggle to force x86-style total store ordering.
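
For anyone wondering why that toggle matters, here's a minimal C++ sketch of the classic store/flag pattern (names are illustrative). Every x86 store is ordered after earlier stores, so a faithful translation to a weakly ordered ARM core would otherwise need a barrier per store; with hardware TSO the translated plain stores already behave the way the x86 code expects:

    #include <atomic>
    #include <cassert>
    #include <thread>

    std::atomic<int> data{0};
    std::atomic<int> flag{0};

    void writer() {
        data.store(42, std::memory_order_relaxed);
        // On x86 (or under the M1's TSO mode) a plain store is enough here;
        // on a weakly ordered core this release store is what emits the barrier.
        flag.store(1, std::memory_order_release);
    }

    void reader() {
        if (flag.load(std::memory_order_acquire) == 1)
            assert(data.load(std::memory_order_relaxed) == 42);  // must never fire
    }

    int main() {
        std::thread t1(writer), t2(reader);
        t1.join();
        t2.join();
        return 0;
    }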


Question for the chip design crowd here: Would it be feasible for Intel/AMD to design a fast ISA from the ground up and add it as a runtime option to their x86_64 chips? Like, on a context switch to userland, the kernel puts the chip into whatever ISA is in the respective process's binary, and the syscall instruction and interrupts put the chip back into the kernel ISA. So you would have backwards compatibility, but new stuff could use the faster ISA. I guess my main question is whether there would be enough die space to support two separate instruction decoders.


The short answer is no. Basically, you are asking them to really have two sets of cores on a single die and switch back and forth.

I think the more realistic point of view is that x86 has likely run its course, and now it is time to move on.

The long answer is look up Transmeta and do some reading. Basically they designed a RISC CPU core with an external microcode engine, that was able to run x86 instructions. It was a valiant effort, but it didn't work as well as hoped.

Here is a fun history read that mentions Transmeta as well as P.A.Semi. P.A.Semi is the core team that designed the A1 (and following Apple cores) after being bought by Apple in 2008. Prior to that, the team had designed the DEC Alpha (awesome processor) and StrongARM cores. Later, they developed PowerPC cores. A team with really deep CPU talent, and in particular, prior ARM ISA experience, when they were acquired in 2008.

Sorry, I forgot the link on the first post....

https://www.linleygroup.com/newsletters/newsletter_detail.ph...


You say no, but the PowerPC 615 is an example of a core which did that. You just need two instruction decoders (and get to keep one of them off at a time, which is awesome for dark silicon reasons).


This is what Intel attempted with Itanium. The hardware implementation was too slow so they pulled it.

More recent is Nvidia's Denver project, which has a private ISA and dynamically translates ARM instructions. It's an open secret they had it working for x86 too, but it was scuttled due to patent threats.


I suspect they never shipped the x86 version because it wasn't competitive. Nvidia's CPUs couldn't keep up with contemporary Arm Cortex parts (probably a good part of the reason they bought them), let alone Intel and AMD stuff.


Vertical integration seems to have really paid off for Apple; it's what allowed them to pull this off. While I certainly expect cloud providers to continue adopting ARM CPUs, I'm worried that the enthusiast/DIY desktop market will become irrelevant if x86 continues to lag behind ARM. It certainly doesn't seem like buying an Ampere server is as simple as getting an x86 machine. Perhaps POWER10 or eventually RISC-V offerings will become more accessible & competitive?


Regarding vertical integration, I wonder if their owning the OS and the hardware stacks leads to special insights on what would lead to better performance.


And they are completely unconstrained in where to put the compute. They can decide to add new special hardware to the SOC and just support it in the OS.


Power 10 does have some nice features so it would be pretty nice to see IBM seize the opportunity, not that they will.


Great explanation and also a good read on challenges Intel and AMD will face going forward. I am wondering if any of the windows / linux laptop makers will follow suit? Or will it be too hard to use the generic windows / linux and optimize it for custom built hardware?


> I am wondering if any of the windows / linux laptop makers will follow suit?

The article was actually spot-on in giving a reason why this won't be the case: AMD, Intel, and NVIDIA aren't going to license any of their IP to the likes of Dell, HP, MSI, Toshiba, Lenovo, TongFeng or ASUS.

Only NVIDIA and Qualcomm are left in the ARM SoC market (in terms of proprietary chip designs) since Samsung shut down their custom CPU design department end of last year. Neither of the two is competitive in terms of performance with Apple's custom chips and their SoCs are OK for SBCs, tablets, and Chromebook-level hardware but in no way competitive with even mid-range offerings from AMD and Intel.

Another factor is vendor lock-in. Since M1-style ARM-based SoCs wouldn't allow for discrete GPUs and RAM upgrades, creating a carefully segmented range of SKUs would be next to impossible. Even Apple struggles to do so as seen by the virtually non-existent differentiation between the M1 Macbook Air and -Pro.

AMD- and Intel-based systems allow OEMs to create bespoke product lines for each target market: NVIDIA Quadros and 32+GB of RAM for CAD/CAM and design work, powerful systems for gamers, CPUs with "pro"-features for businesses, etc.

An SoC is the same for all products. Sure, you can do some minimal differentiation by disabling a core or two, binning and cooling, but in the end there's very little practical difference between an SoC with 7 or 8 GPU cores... I'm pretty sure the MBA will cannibalise sales of the MBP big time for this exact reason.


Hope so. I could happily use arm chips on mobile.


Why bother? Computing is blazing fast already. Anything that is slow should use multi threading.

There's a reason most computers compromise on the processor performance, we don't need it. Take that extra few hundred and give me more ram and a better video card.


Because battery life.


So it's not the speed like the title says? Lol


It’s obvious that those are related, isn’t it? I want a powerful laptop which I can carry around all day. Yet the latest Intel ultrabooks either throttle or get hot and loud. So much energy just gets wasted on generating heat; that’s why the passively cooled M1 looks like a wonder.


It’s the subtext, isn’t it? Apple went on and on about performance PER WATT. They could crank up the power and shorten the battery life any time they like.


So my 2012 MBP is on its last legs (love that thing). Now we have the M1, and I'm hearing impressive things. But my question is, as a developer who does, let's face it, pretty much the gamut, will I be handicapped in the near term? I use a combination of IntelliJ IDEs and Emacs, with MacPorts, and a large variety of open source and closed source tools. I just don't want the headache of having some part of my workflow cut off.


Same question here, except I disagree with all the discussion on other threads about "wow, 16GB RAM is magically sufficient for most use cases." I'm sure it is now sufficient for web browsing + IDE + Slack, but what about anything beyond that?

I realize 16GB RAM goes farther on the M1 than before, but some things just need RAM and there is no way around it. If I'm running a VM or multiple VMs (e.g., for Android app development), or multiple Docker containers, I just need more RAM.

I'm hoping that the upcoming 16" MBP upgrade offers more RAM tiers.


In almost the same position (MacBook Pro 2013) and I plan on waiting for the 16". By the time that's out we'll see how many of these things are getting quickly ported - my expectation, looking at how excited everyone is, is that things will get ported FAST. But there's time to wait and see before the one I actually want comes out.


MacPorts is working pretty well. About 80% in my testing. I've been able to work around anything that doesn't build. It seems much further along than HomeBrew.


Good to hear (for us MacPorts users!)


Well, at least Homebrew are claiming that "There won’t be any support for native ARM Homebrew installations for months to come." (https://github.com/Homebrew/brew/issues/7857) so if you rely on a large variety of open source and closed source tools that might be a problem.


Almost everything works if you start Terminal in Rosetta mode: Cmd+I on Terminal.app and check the Rosetta box.


Just had this discussion at work. We have a mix of MacBook Pros and high end Windows 10 laptops, and every one of them is running a VM hosting Ubuntu 18.04

Is the M1 going to have a VM capable of running a native Ubuntu instance?


Based on what people are saying here, it does seem that by the time the 16" comes out, support should be pretty good. That's pretty impressive, in a way.


MacPorts isn't doing all that badly right now.


is this based on your own experience?


Not who you responded to but in my personal experience it is about 80% of what I've tried to build. I mostly had trouble with VideoLan x264 (but oddly not x265).


Yes.


Because of the 12MB L2 cache as the point of unification. The Snapdragon 865 has a 4MB L3. Same as my recent Intel MacBook.

Apple can spend at least as much extra die area as AMD/Qualcomm/Intel take as profit margin.


Also, being on TSMC 5nm on top of significant microarchitecture improvements: a ~630-instruction-deep ROB, the ability to queue ~150 outstanding loads and ~100 outstanding stores, and huge TLBs which cut down on address-translation misses. The AnandTech article does a great job of explaining this.


All that helps, but if you look at the die you are going to see a massive 12MB L2 with a bunch of routing to all the L1s. That’s where the majority of the cost is going vs any other chip.


If cache size was the answer then Intel would have already stuck more on their CPUs, and AMD's 32MB on Zen3 would be carrying far more weight against the M1 than it is.

Bigger caches mean higher latency caches, it's not strictly a matter of bigger is better.


It’s an L2; AMD Zen has 2x16MB L3s, with a wacky point of unification. L3 is muuuuch slower than L2.

Apple’s L2 is huge and is the PoU for all the cores. I wouldn’t be surprised if it has direct routing to all the L1s since they are so small. 5nm doesn’t hurt.


> amd zen has 2x16mb L3s

Zen 3 is not, it's a single 32MB L3 per chiplet.

> L3 is muuuuch slower than L2.

"L2" and "L3" are just how many layers away from the CPU it is, it's not an intrinsic "type" of cache. The speed is a function of the size.


Intel probably should have dropped their profit margin just to prevent this embarrassment. This is the foot in the door. Or through the door.


Is anybody here using one of these for serious development, and can comment on performance? On reddit there are some comments about anything Java-related being rather slow for instance.


Depending on your definition of serious (I'd say it's serious when people pay money for it), I'm doing that and am quite happy with it. I haven't tried building any Java code though, only Go and Swift.

I'm seeing about 2x performance (eyeballing build times) of a recent Intel MBP for compiling a medium size Go codebase, running natively on arm64. The machine is noticeably faster than previous iterations for most tasks.

I've made an effort to avoid emulated code though so am running a patched version of Go 1.16 and tools. Not all tools have officially supported native versions yet, but the performance is definitely there.


I have the Azul JDK working but I haven't really done much with it. Been side-tracked by ffmpeg, QEmu and SimpleVM. But I plan to do some testing soon.


There's a need to use a native Java runtime instead of one running on Rosetta (like what IntelliJ ships).

Azul Zulu is one.


It really depends on your workflow. Also see: https://news.ycombinator.com/item?id=25238608


If they're still not using a native JVM, Java devs will be running Java under Rosetta. While Rosetta can run code with JITs, the performance is nowhere near AOT-translated performance (for many reasons).


You can get M1 native Java JDK from Azul.

https://www.azul.com/downloads/zulu-community/?package=jdk


I'm also curious. Phoronix added clash/GHC compiler benchmarks which is relevant to my interests.


As the article points out, part of the reason is that ARM's RISC instruction set has a lot more room for optimization than the x86 CISC instruction set, which Intel has been milking since 1972. If that were the whole story, ARMs would have been faster than Intel long before now, so there's obviously much more going on than simply RISC vs. CISC.

Still, I find it gratifying that my least favorite instruction set of all time may finally be going the way of the dinosaur.


> Still, I find it gratifying that my least favorite instruction set of all time may finally be going the way of the dinosaur.

Only I fear it won't be replaced by anything more open.


Interesting article here, but a lot of the stuff laid at Apple's feet here did not begin at Apple. Yes, they implemented it in a high power design, and yes, they did a great job, and it is really performing well, but all of this has been common in cell phone chips available from various vendors for over a decade.

Multiple cores, multiple DSPs, ISPs, GPUs, secure enclaves, encryption engines, high speed serial controllers, ADC/DAC, plus all the stuff to support the cellular modem such as viterbi encoders/decoders. All these little specialized blocks all embedded on the chip, and the DDR soldered directly to the top of it.

This has been a common technique for over a decade. Why isn't your phone as fast as the M1? How long do you think the M1 would run on your cell phone battery? Everything is an engineering trade off.

So Apple basically took the last decade of smart phone architecture, and blew it up with more transistors, to accomplish more processing throughput.

They didn't invent the technique of hybrid multi-core silicon by any stretch of the imagination.

You can go back to the mid 1990s and find single chip 56K dialup modems with a general purpose CPU, a DSP, onboard RAM, onboard NOR FLASH, and a half dozen dedicated modem hardware blocks, all on a single die.


Apple is about three years ahead of any other mobile chip maker. Sure, it's the same type of SoC and general architecture, but somehow Apple's CPU and GPU designs are miles ahead of anyone else.


The CPU is exceptionally great, but I don't think Apple is also leading in GPUs. I suspect the M1's GPU efficiency mostly comes from the 5nm process.


My phone has comparable single-threaded performance to an M1. But it happens to have an A14 chip inside it ;)


There are some technical things I feel are somewhat misrepresented:

1. Why don't AMD/Intel use a unified GPU/CPU memory setup like Apple? Because the big discrete gpus would be too big and hot to be part of the same chip as a cpu, and need faster memory than DDR4.

2. Zen3 is clocked at 5ghz: the best of them can turbo to 4.9 for a little bit, base clocks are ~3.4 to 3.8 depending on core count / model


> Why don't AMD/Intel use a unified GPU/CPU memory setup like Apple?

Well, they do, for like 10+ years now


There has been talk of using eGPUs with the M1 chips. Obviously there would be a downside to having to move stuff over a slow Thunderbolt 3 connection, but given that the GPU can be much bigger, how do you think that would play out?

Would that offer higher performance? If so on what kind of workloads?


It's an interesting take, and I agree with the issue being the x86 difficulties. There was a link today about "how many registers an x86 processor has" and it really reminded me of how many issues and how much legacy the processors have to deal with.

I think the solution would be to move the x86 cruft to a SW emulation layer, and focus on base performance. Something like this (though I'm skeptical of taking things like rings and memory protection out of hw): http://www.emulators.com/docs/nx03_10fixes.htm

Your latest Intel chip boots like an 8086 from the 70s, and it has to do several steps before it goes into protected mode. It has to deal with x87, MMX, SSE, SSE2, AVX, etc. (ARM has NEON and that's it). Every register has several ways of being accessed: AH/AL, AX, EAX, RAX. Wanna access a partial part of a register on ARM? AND/shift.
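
As a rough illustration of that last point (not real compiler output, just the shape of it): x86 can name AL/AH/AX/EAX directly as sub-registers, while AArch64 code gets the same effect with masks and shifts, roughly what this C++ spells out explicitly:

    #include <cstdint>
    #include <cstdio>

    int main() {
        uint64_t rax = 0x1122334455667788ULL;   // stand-in for a full 64-bit register
        uint8_t  al  = rax & 0xFF;              // low byte    (x86: AL)
        uint8_t  ah  = (rax >> 8) & 0xFF;       // second byte (x86: AH)
        uint16_t ax  = rax & 0xFFFF;            // low 16 bits (x86: AX)
        uint32_t eax = rax & 0xFFFFFFFF;        // low 32 bits (x86: EAX)
        std::printf("%02x %02x %04x %08x\n", al, ah, ax, eax);
        return 0;
    }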


X86's legacy really only affects the decoding logic, whereas the actual CPU itself could be almost any superscalar execution stage (simplified).

For example, amd64 has about 20 general purpose registers, but the silicon that actually executes it could have something like an order of magnitude more to play with (i.e. for register renaming)
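
A toy sketch of what "more registers to play with" means in practice (a simplified model, not any real core's implementation): the hardware keeps a rename table mapping the handful of architectural registers onto a much larger physical register file, and every new write grabs a fresh physical register so false dependencies between in-flight instructions disappear.

    #include <array>
    #include <cstdio>
    #include <vector>

    int main() {
        constexpr int kArchRegs = 16;   // architectural GPRs visible to amd64 software
        constexpr int kPhysRegs = 160;  // hypothetical physical register file size

        std::array<int, kArchRegs> rename_table{};  // arch reg -> current physical reg
        std::vector<int> free_list;                 // physical regs not currently mapped
        for (int r = 0; r < kArchRegs; ++r) rename_table[r] = r;
        for (int p = kArchRegs; p < kPhysRegs; ++p) free_list.push_back(p);

        // Renaming a write to architectural register 3: the write gets a fresh
        // physical register, so older in-flight readers of r3 keep their value
        // and write-after-write hazards on r3 vanish.
        int dest = 3;
        int fresh = free_list.back();
        free_list.pop_back();
        std::printf("r%d now maps to phys %d (was phys %d)\n",
                    dest, fresh, rename_table[dest]);
        rename_table[dest] = fresh;
        return 0;
    }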


From the article

> If we have more decoders we can chop up more instructions in parallel and thus fill up the ROB faster.

> And this is where we see the huge differences. The biggest baddest Intel and AMD microprocessor cores have 4 decoders, which means they can decode 4 instructions in parallel spitting out micro-ops.

> But Apple has a crazy 8 decoders. Not only that but the ROB is something like 3x larger. You can basically hold 3x as many instructions. No other mainstream chip maker has that many decoders in their CPUs.

> You see, for x86 an instruction can be anywhere from 1–15 bytes long. On a RISC chip instructions are fixed size. Why is that relevant in this case?

> Because splitting up a stream of bytes into instructions to feed into 8 different decoders in parallel becomes trivial if every instruction has the same length.
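
A tiny sketch of why that matters (hypothetical instruction lengths, just to show the dependency): with fixed 4-byte instructions every decoder's start offset is known up front, while with 1-15 byte x86 instructions each start offset depends on the lengths of all the instructions before it.

    #include <cstdio>
    #include <vector>

    int main() {
        // Fixed-width (RISC-style): decoder i always starts at byte i*4,
        // independent of the bytes themselves, so all 8 can work at once.
        for (int i = 0; i < 8; ++i)
            std::printf("fixed-width decoder %d starts at byte %d\n", i, i * 4);

        // Variable-length (x86-style): start offsets form a serial chain,
        // because each one needs every earlier instruction length first.
        std::vector<int> lengths = {3, 1, 7, 2, 5, 4, 1, 6};  // made-up lengths
        int offset = 0;
        for (int i = 0; i < (int)lengths.size(); ++i) {
            std::printf("variable-length decoder %d starts at byte %d\n", i, offset);
            offset += lengths[i];  // must be known before the next decoder can begin
        }
        return 0;
    }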


I found this recent ColdFusion video on the background of ARM and RISC processors really educational, I think most will learn _something_ from it: https://www.youtube.com/watch?v=OuF9weSkS68


> So e.g. smaller ARM CPUs don’t use micro-ops at all. But that also means they cannot do things such as OoO.

Perhaps I'm picking nits, but the second sentence is not true. The dependency/resource-tracking and scheduling required for OoO execution do not depend upon using micro-ops.


> Sure Intel and AMD may simply begin to sell whole finished SoCs. But what are these to contain? PC makers may have different ideas of what they should contain.

There used to be many competing CPUs with different instruction sets. I personally used z80*, 6502, 680xx, 80xx, Alpha, one from HP and who knows how many others. Add PowerPC, and wasn't Itanium a non-x86 CPU? Then we almost standardized on one for a long time, plus ARM lately. There's no reason we won't standardize on one SoC. It could be everybody buying only Apple computers (unlikely) or some company winning the forthcoming SoC wars. Or business as usual, staying on the current general-purpose architecture forever, which is also unlikely given the optimization advantages of an SoC like the M1.


So in the (excellent) article, they have two advantages: System on a Chip and enhanced out of order execution. Wasn't out of order execution implicated in Spectre and Meltdown? How are they mitigating this issue? I imagine for a new chip they could design something in.


Out-of-order execution makes your code run something like 20x-50x faster. You can’t produce a CPU without it and compete in the market. There are many variations of the Spectre and Meltdown bugs; some of them might be fixed in the CPU and some in the OS. So there is no simple answer to your question, I believe.


I would just wonder if increasing the quantity of OoO makes mitigation more difficult in some way.


AFAIK all Spectre/Meltdown-family attacks are possible even with a single OoO execution unit. But increasing the quantity (adding more complexity) might cause a new batch of security issues, of course.
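
For readers who haven't seen it, the canonical Spectre v1 (bounds check bypass) pattern needs nothing more than a branch predictor plus speculative loads; the names and constants here follow the commonly cited textbook example and are not M1-specific:

    #include <cstddef>
    #include <cstdint>

    uint8_t array1[16];
    std::size_t array1_size = 16;
    uint8_t array2[256 * 4096];

    uint8_t victim(std::size_t x) {
        // If the predictor guesses "in bounds" for an attacker-chosen x, the
        // out-of-bounds read of array1[x] and the dependent load from array2
        // run speculatively; the result is squashed, but the secret-dependent
        // cache line it touched can later be observed via timing.
        if (x < array1_size)
            return array2[array1[x] * 4096];
        return 0;
    }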


A lot of people are asking why Intel or AMD haven't produced a CPU like this, but that's not so interesting to ask because maybe that's something to do with the x86 ISA.

What about Qualcomm, ARM, and Samsung though? What's their excuse?


Maybe it simply doesn't economically make sense for Qualcomm and even less so for Samsung. Their SoCs equip a lot more entry-level or mid-range low-margin phones, compared to Apple.

There might be the need to make more powerful, desktop-class SoCs in the future if Google and Microsoft push for that. But I imagine the real competition will come mostly from the server market for now. Unless Nvidia starts making beefier SoCs, maybe the upcoming Orin.


The one thing that still puzzles me is how that unified memory thing works and actually handles contention between the GPU and CPU without slowing either down. There's not much information out there about how it actually works.

Anybody have any details?


So they have 'co-processors', like the Amiga, but on a chip!

Recall, this is what made the Amiga such a groundbreaking computer back in the 80s and early 90s: separate co-processors that handled graphics, sound, DMA, IO,...

It worked (for a while) because the Amiga was an integrated system of hardware-and-OS - exactly like Apple. Unlike the MS-DOS and Windows PCs of the time, there was no need to support third-party hardware.

Together with a really nice OS, the platform encouraged creativity and gave rise to a generation of coders, artists, and musicians.

So everyone: now you know what it felt like when the Amiga first appeared on the home computing scene!


> Sure Intel and AMD may simply begin to sell whole finished SoCs. But what are these to contain? PC makers may have different ideas of what they should contain. You potentially get a conflict between Intel, AMD, Microsoft and PC makers about what sort of specialized chips should be included.

It will be interesting to see what happens to the Wintel business in the years to come - is this the beginning of the end? Especially if Apple can build a strong price/performance advantage in a way that can't be replicated by PC makers and, for example, starts bringing out cheaper 'SE' versions of its laptops.


After analyzing the benchmarks, I have a suspicion that the Apple compiler does some automatic parallelization and issues some instructions on the GPU, since the GPU shares memory coherency with the CPU. That would explain why Rosetta 2 performance is slower, as optimizing loops in the binary is harder. If they are not doing that, they should, as it would be something like AVX-512 on a tight energy budget. Just launching a debate, as the wide decode with a long pipeline smells like either Pentium 4-style crap or a miracle.


Interesting article, and call me cynical, but wasn't anyone else immediately reminded of the excellent HN post earlier this month [1], and the Apple big-picture play proposed there?

"Those shiny new Apple Silicon macs that Apple just announced, three times faster and 50% more battery life? They won’t run any OS before Big Sur."

[1] https://sneak.berlin/20201112/your-computer-isnt-yours/


Requires an account. Is there a less restricted URL for the story?


Everyone talks about its top scoring benchmarks, which I presume are the Firestorm cores, but I have to wonder what engineering marvels went into the Icestorm cores?


"RISC CPUs have a choice. So e.g. smaller ARM CPUs don’t use micro-ops at all. But that also means they cannot do things such as OoO."

I don't agree with this. What do uops have to do with OoO execution? I can easily make a CPU with no uops doing OoO execution. Unless I'm misunderstanding and the article is just saying smaller cores don't do OoO since they are trying to stay within a certain area/power footprint.


Intel will already sell you a CPU/GPU with a unified memory architecture and hardware video encode/decode. Intel's mobile parts are already SoCs (the U and Y models have the PCH on-chip, and the U's TDP range is probably the closest to the M1). These are not the reasons the M1 is faster than x86, but the article focusing on them so much means I'm more skeptical about the claims around OoO.


I (as a tech lead at IBM) had a sales visit from P.A. Semi years ago; we needed a CPU for a co-processor board. We didn't use their chip, and I was amazed that Apple bought them soon after. But then consider who was the founder:

https://en.wikipedia.org/wiki/Daniel_W._Dobberpuhl


Recently passed, RIP


I'm super curious -- since Apple is not marketing on Ghz and playing the Intel game, how do they deal with variable chip manufacturing yield? Is there a narrow band or minimum chip speed that they'll accept and toss anything performing below?

Does the SoC become so much physically larger that yields also go down by quite a bit, and any defects on 5nm become much more troublesome?


If, in an alternate universe, the people who have designed the amazing hardware we have today suddenly switched to work on designing web backends of medium-sized tech companies, would they do a better job than the random mess of databases, services, microservices, HTTP APIs, GraphQL APIs, client-side logic, server-side logic, etc that we have today?


I keep coming back to the implications for the rest of the industry:

1. What does this mean for x86?

2. What does the answer to question #1 mean for Windows and Microsoft? Are they going to sit by and let Apple lap them?

3. What does the answer to question #2 mean for cloud computing? AWS is already making ARM chips... does the M1 change anything fundamentally about where the cloud goes with ARM?


#1: The writing's on the wall for x86, probably. Once its former monopoly rent is only justified by high-end "scientific computing" uses -- well you know what happened to IBM/DEC/SGI/HP and their workstation solutions...


Intel is going to be strongly motivated to pull their finger out and match the performance of the M1. Increase the performance of the chip, and release SOCs with RAM on the package for laptops, not just for phones.


At least for #2, see Windows ARM

https://docs.microsoft.com/en-us/windows/arm/


1. x86 (and then AMD64) has had vast amounts of R&D money and talent thrown at it for decades, with the result that it has out-performed every competitor: SPARC, MIPS, Alpha, PA-RISC, Itanium, etc. As a consequence of this domination, it has a massive software, tooling, and expertise advantage.

But ... ARM, which through a series of coincidences, ended up being the right architecture at the right time for mobile devices, has had its own injection of talent and money, and has advanced rapidly in the last 15 years. The M1 is a long way from the ARM710 or StrongARM of the late 90's!

Similarly, computer circuitry has evolved. Early minicomputers used small circuit boards with just a few gates to assemble their CPUs. In the early 80's, bigger CPUs were built from multiple "bit slice" chips ganged together. The modern PC motherboard has been a thing since the era of the Apple ][, growing through 8-, 16-, 32- and now 64-bit CPUs with ever-increasing speed and power, but maintaining a model of a motherboard and expansion cards. SoCs have been a thing for embedded devices for some time now: the M1 is perhaps the first really high-profile "computer" to use this technique to integrate its major architectural components in one "chip" (package). I think it's inevitable that others will follow, given the obvious performance and power gains to be had.

AMD and Intel do this to some extent already. The AMD chips powering the new PlayStation and Xbox consoles are SoCs, and Intel has sold embedded variants of its CPUs that are basically SoCs for some years now too.

x86 has a long future ahead of it, but for the first time since Alpha (?) it's seeing a major competitor. Unlike DEC though, ARM has a broad and deep ecosystem behind it, and (as the article points out) there are some advantages to the instruction set that x86-64 will perhaps struggle to overcome.

2. Microsoft already has the Surface. They've known for some time that Apple's integration of hardware and software is hard to beat, and that the iPhone represented a technology that could be applied to laptops and desktops any time Apple wanted to do so. So ... they've already made steps in the direction of competing, but they have a lot of business-model challenges to overcome in order to do so.

Honestly, I'd be much more interested in the futures of Dell, HP(E), Lenovo, etc, than Microsoft. This is a massive challenge for them.

3. No. The cloud vendors are already moving towards ARM for power reasons. The AWS Nitro board is begging to be integrated with a Graviton -- I would be surprised if Amazon doesn't come up with a compute platform that's powered by a similarly integrated SoC in the very near future.

The M1 helps accelerate this process because all Open Source software will now have to treat AARCH64 as a first-class citizen (which, to be fair, the Raspberry Pi has already helped to do). Similarly, commercial software vendors now have another market that will need a native ARM version in 2 years (after Rosetta2 is phased out, like Rosetta was for PPC). So .. I don't think it's changed the cloud direction at all, but it could make the transition faster.


> But Apple has a crazy 8 decoders. Not only that but the ROB is something like 3x larger. You can basically hold 3x as many instructions. No other mainstream chip maker has that many decoders in their CPUs.

Does this mean I'd expect to see better IPC (instructions per cycle) on these chips vs an x86, due to better memory parallelism?


Yes, that's the main reason the article gives that a 3.2GHz laptop chip is almost as fast as a 5GHz Zen 3 chip with a lot more cooling.


Thanks, @socialdemocrat, that is a good intro for someone (like me) who doesn't understand hardware architecture.


Great! I thought there were many people who wanted some in-depth discussion but who did not necessarily have a lot of prior knowledge about the topic. E.g. AnandTech gives a lot of great detailed info, but it's hard to read if you are not a regular reader of their website.


Every discussion thread about M1 has a dozen people saying "Give me a 64GB M2 with more cores and more ports and I'll buy one." But I just want a 15 inch M1 Air. For most people 16GB and 4+4 is more than enough, and it's the larger screen that differentiates the models.


I don't understand the big fuss about the M1. After all, the M1 is a 5nm CPU. AMD Renoir is better than Intel for the same reason; it's 7nm while Intel can't get below 10nm yet.

So while M1 is probably a great CPU I expect its advantages will mostly vanish once AMD and Intel move to 5nm also.


Couple of years back I bought a big gaming laptop to do ML and deep learning on, TensorFlow, Keras, all that. i7 and NVidia 1080, 8G of GPU ram. I know TensorFlow has been ported to Big Sur/M1, anyone know how well it performs with sharing memory between CPU and GPU?


Good packaging and TSMC's better than expected results are certainly a big part of the picture. TSMC: '5nm yielding better than 7nm.' Apple can put the higher leakage chips in mac minis where they can afford to pump more power through.


What do we know about the memory model? I see refs to not copying memory which suggests there isn’t even any L1 cache on die. Is this true? Anybody got a detailed arch diagram of the M1’s memory hierarchy?


I've been wondering what kind of hardware optimizations might be possible based on tagged pointers? If you know a memory location is uniquely referenced for example when you load or store ...


Anyone wanting linux on the M1 there's now a patreon at https://www.patreon.com/marcan/


I still don't get why they didn't use RISC-V. Sure, they have an ARM license, but running existing ARM software on the M1 doesn't seem very useful. Why not use the superior ISA?


Why use a different arch from the iPhone/iPad? The M1 borrows most of its design from the A-series chips, and they've spent time optimizing software for ARM.


As someone who's generally been tech-aware for decades, this seems like a revolution. I've never seen reviews like this. I keep buying AAPL. I think they're going to eat Intel.


I suspect Intel will in the future have separate processors which can less painfully take advantage of more decoders, perhaps with a restricted instruction set?


As a person who has written tons of machine and assembly code and dealt with decades of working directly with the cpu: this is a well written and accurate article.


With out of order execution being apparently a big deal here, what are the odds of side-channel attacks based on this OoOE being found for this architecture?


I wonder, has anyone written about to what extent Xcode has been optimized for the M1? With the wide decoder, loop unrolling gets a lot more attractive...


So basically Intel shouldn't have abandoned Itanium.


Can't wait to see this chip in the 16-inch.


Did Apple patent anything in their chip?


Of course they did.


Care to be more specific?


Every company files for hardware patents; you want to have a portfolio that can be used for reciprocal license agreements.


Hello! Thanks for sharing; it would have been nice not to do it on Medium, as it's not free anymore. But thanks again!


I think what will make Apple Silicon a success is that they are not going to purposely slow down progress for the sake of sales and marketing. Apple is still an engineering-driven organization, despite their marketing genius. They’re not going to intro filler products here and there to squeeze the market.


Apple M1 is so fast because you paid a lot for it and want to feel good about that I guess.


It's almost 2021.

Is it still ok to post paywalled articles?


It’s almost 2021.

Is it still ok to post Medium articles?


We are better with paywalled articles than without. Moreover, writing articles doesn't necessarily have to be free work.


What are the technical reasons why the M1 chip is so fast? Does Apple use any exotic tricks? Can Intel and AMD do the same trick in their chips?


They use this one trick that AMD and others hate: they paid TSMC to use up all of their 5nm capacity.


I have an anecdote possibly related to this. I work for a semiconductor company and we were in pretty good standing with TSMC. All of a sudden we were informed we should no longer contact the usual engineers there, and a new team of engineers was assigned to us. It was very clear the new staff was their B-team. Rumor around the office was that the A-team went to support Apple.


The word I learned in years past because of Apple that applies to it in several ways is "monopsony": the ability to control the market for a product by being the dominant buyer of same.


That is not what monopsony means. Monopsony means there is only one buyer. Apple is one of many buyers. However, they are able to secure the resources since they can pay the most.


Monopsony means there’s a dominant buyer, not just a single buyer.

Fortune was talking about it almost 10 years ago.

https://macdailynews.com/2011/07/05/how-apple-became-a-monop...

Of course they’ve been accused in the Supreme Court of being a monopsonist with regard to the App Store, but that’s a separate discussion.

(Update: regrettably my economic theory is weak, so while I think it makes sense to apply the term, after more reading I can't argue the case with a reasonable level of conviction.

Certainly there should be some way to describe a buyer who locks up a supply of components to the exclusion of competitors, as Apple has done on multiple occasions dating at least to the original iPod hard drives.)


https://en.wikipedia.org/wiki/Monopsony

See the box at the top of article showing difference between monopoly, monopsony, oligopoly, oligopsony.

A monopsony would mean the buyer gets to dictate all the terms, and the seller has to sell at very low price. In this case, TSMC has something very valuable, and Apple is paying a considerable amount (presumably, I haven’t seen real numbers) to the seller to make sure no one else gets it.

In fact, TSMC is a monopoly, the only seller of a much sought after product.


Sounds like Apple made multiple specialized processors for specific tasks, built a motherboard around that, and cooked all that into the OS. It's a bit of a cheat since they aren't building a generic computer architecture, but a very, very specialized one.

This is different from, say, the Windows architecture. In the Windows world, vendors work on generic CPUs/GPUs/motherboards. Some hardware isn't exactly available or uniformly accessible due to drivers and such. This is kind of the huge benefit Apple has by creating the hardware, OS, drivers, languages and libraries.


This is what they do with everything. It is the reason for vertical integration.

Calling it ‘a cheat’ is somewhat weird.

The point is that a general purpose CPU is a bad solution for a personal computer.

The neural engine is about processing the ambiguous world that humans occupy so that the human doesn’t have to translate so much for the computer.

The GPU is about communicating visually with humans.


GP said a bit of a cheat not that it was cheating.

Point is that if you try to run a different OS, it will likely run slower, perhaps significantly slower than running macOS


Anecdotally, I saw some folks on reddit's linux board that were running Linux in virtual machines on the M1 and getting quite decent results.


Given that the individual cores benchmark at the very high end of any core available today, this is almost certainly not significant.

If another OS doesn’t take advantage of the specialized processors, it will run slower, but only on specialized loads.


I see Apple following the path of Nintendo. They don't need to compete on performance, their users aren't buying for performance. Vertical integration puts you at tremendous risk of falling behind.

The future is keeping Apple "fresh", "cool", etc... While keeping product costs low to compensate for a falling market share.

Really says something that Apple is willing to compromise on long term hardware performance.


How are they following the Nintendo path? These machines are clearly faster than the market segment, and apple is the leader in ARM CPU technology at this point, and they’ve been the mobile CPU leader for about 7 years, whereas Nintendo’s latest offering integrated mostly commodity technology from Nvidia Tegra. I don’t see how you can draw this parallel.


Can you explain how they are ‘compromising on long term performance’?

I don’t see any logic that supports this claim.


Nintendo has, since the Wii, shied away from bleeding-edge or custom components, preferring to use cheaper, proven tech that they can more easily maintain but is still strong enough for their needs. Just look at the Switch -- it uses the Tegra X1 in a device that launched in March of 2017, despite the X2 having been available for more than a year.

Apple, on the other hand, has been running rings around Qualcomm and other ARM vendors for a while now, and sharing tech makes it all the more important that they continue to be competitive.


I think only the SNES/N64 years really embodied cutting-edge hardware in the service of novel games. Their whole approach to entertainment (especially with the handheld line becoming the core product, in a way) still feels rooted in Yokoi's "lateral thinking with seasoned technology" philosophy.


Yeah, and I think it will be quite interesting to see in the years ahead how this kind of highly specialized approach to computing will stack up against the more generic Wintel approach, and how customers will respond.

Customers may end up having the choice between a very high-performance system at competitive prices, versus an open architecture with more flexibility in what you can add, but which will ultimately cost more and have worse performance.

You can see this play out in different markets. Just look at Tesla, for example: they are taking the Apple approach to car making, with full vertical integration, doing everything from making their own alloys to creating their own machine-learning hardware.

I think we are at some kind of paradigm shift and the old guard has been caught off guard.


If that were the case, though, we would expect to see performance gains only in areas the M1 is specialized for -- but that's not what we see. The M1's GeekBench scores are impressively high, and general-computing workloads like compiling code are very fast as well, indicating that the M1 is fast in general, not just for specific tasks.

There's been a lot of hype about things like "specialized hardware for NSObjects," but AFAIK that's more of a flawed/outdated design on Intel's side than a specialization on Apple's: the "magic" is really just ARM's weaker and more modern memory consistency model, which makes things like reference counting substantially faster.
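To make that concrete, here's a minimal sketch (my own illustration, not Apple's actual runtime code) of the hot path in question: an uncontended atomic retain/release pair. On x86-64, even a relaxed fetch_add compiles to a LOCK-prefixed instruction with full-fence semantics, while on AArch64 it can lower to a plain LDADD with no barrier, which is consistent with the retain/release microbenchmarks that have been circulating.

    // Sketch only: the uncontended hot path of reference counting.
    // On x86-64 fetch_add becomes a LOCK-prefixed read-modify-write (an
    // implicit full fence even with relaxed ordering); on AArch64 it can
    // lower to LDADD (relaxed) / LDADDAL (acq_rel) with much cheaper ordering.
    #include <atomic>
    #include <chrono>
    #include <cstdio>

    struct RefCounted {
        std::atomic<long> refs{1};
        void retain()  { refs.fetch_add(1, std::memory_order_relaxed); }
        bool release() { return refs.fetch_sub(1, std::memory_order_acq_rel) == 1; }
    };

    int main() {
        RefCounted obj;
        constexpr long iters = 100'000'000;
        auto t0 = std::chrono::steady_clock::now();
        for (long i = 0; i < iters; ++i) {
            obj.retain();
            obj.release();  // never reaches zero; we always hold one reference
        }
        auto t1 = std::chrono::steady_clock::now();
        std::printf("%.2f ns per retain/release pair\n",
                    std::chrono::duration<double, std::nano>(t1 - t0).count() / iters);
    }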


Apple did optimize for some common use cases: reference counting (as used by Objective-C/Swift), JavaScript tweaks, and emulation support for Rosetta 2. However, they also made a great general-purpose CPU that will run many standard codes impressively quickly for impressively little power.


Today all of this is exciting, but I have a bad feeling this is going to slide Apple towards a low-performance future.

They Apple'd themselves and locked themselves in.

In two years, secrets aside, Apple would need to beat all three big dogs to stay on top. How long can they sustain that?


Apple has been beating “the big dogs” in performance for years with this processor and OS architecture in mobile & tablet segments. They’re switching laptop and desktop segments to this architecture because the mobile architecture already overtook the big dogs and has been growing the lead for a couple of generations. The A14 phone chip is already faster than most laptop CPUs, just like the A12 phone CPU was faster than most laptop CPUs when it was released.


They've been sustaining it for many years already; that's how they've gotten to the point of being able to pass up the incumbents. They didn't just start on this path this year, it's the culmination of a long-game strategy.


How do you reach this conclusion? Their individual cores and memory architecture still beat almost everyone else on general purpose loads without any of the specialized engines.


Today they overdelivered, so why is it so hard to believe they'll meet expectations with a scaled-up SoC? It's guaranteed.


Couldn't they just switch back then?


They could at significant expense.

But why would they? The users don't buy for performance.


Basically every laptop from here on out is going to be benchmarked against these devices.

And battery life seems to be the sleeper performance metric here. Every review is going to say "you could get 6+ more hours if you just bought the MacBook".


IIRC iPhone did well on SMT solver benchmarks as well.


I see two main reasons:

- 5nm process (1.8x the density of 7nm [1])

- unified memory directly on the SoC

[1] https://hexus.net/tech/news/industry/130175-tsmc-5nm-will-im...


Unified memory is old news; every CPU with an integrated GPU uses unified memory. It's not doing anything at all for the M1's general performance.

The main reason for the M1's performance is just that Apple managed to get an 8-wide CPU to work and be fed. That's it. That's the entirety of the story of the M1. Apple got 8-wide to work, while Intel and AMD are still 4-wide. ARM's fixed-length instructions are helping a ton there, but Apple also put in the work to feed it.

All the other shit about specialized units, unified memory, and vertical integration is irrelevant and mostly wrong. The existing Intel MacBook Airs already have unified memory and specialized hardware units for specialized tasks (Intel QuickSync is 9 years old now: dedicated silicon for the specialized task of video encoding). Apple did absolutely nothing new or interesting on that front. Other than marketing. They are magic at marketing. And also magic at making a super-wide CPU front-end with really good cache latency numbers.


That's overly simplified. Larger reorder buffer, larger caches, lower-latency caches, more outstanding loads, etc.

They also managed, in a cheap platform (Mac mini = $700), to get 4266 MHz memory working with about half the latency of any x86-64 I've seen. The Mac mini can fetch a random cache line (assuming a relatively TLB-friendly pattern) in around 30-33ns.

It makes the usual Intel Core i7/i9 or Ryzen 5000 parts look rather quaint with their 60ns or higher memory latencies.


> They also managed, in a cheap platform (Mac mini = $700), to get 4266 MHz memory working with about half the latency of any x86-64 I've seen. The Mac mini can fetch a random cache line (assuming a relatively TLB-friendly pattern) in around 30-33ns.

Where are you getting those numbers? From https://www.anandtech.com/show/16252/mac-mini-apple-m1-teste...

"In terms of memory latency, we’re seeing a (rather expected) reduction compared to the A14, measuring 96ns at 128MB full random test depth, compared to 102ns on the A14."

Which would put the M1's DRAM latency at worse than a modern Intel or AMD desktop, which measures around 70-80ns: https://www.anandtech.com/show/16214/amd-zen-3-ryzen-deep-di...


I've replicated them myself with my own code, so I'm pretty confident. It doesn't hurt that my numbers match Anandtech's, at least for the range of arrays they use and only using a single thread.

On pretty much any current CPU, if you randomly access an array significantly larger than cache (12MB in the M1's case) you end up thrashing the TLB, which significantly impacts latency. The number of pages that can be accessed quickly depends on the number of entries in the TLB.

To separate TLB latency from memory latency, I allow controlling the size of the sliding window used for randomizing the array, so that only a few pages are heavily used at any one time while each cache line is still visited exactly once.

That's exactly what the brown "R per RV prange" curve does. For more info, look at the description at: https://www.anandtech.com/show/14072/the-samsung-galaxy-s10p...

My code builds an array, then does a Knuth shuffle, modified so the maximum shuffle distance is 64 pages (the average shuffle is 32 pages or so). I get a nice clean line at 34ns. With 2 or 4 threads I get a throughput (not true latency) of a cache line every 21ns. With 8 threads (using the 4 slow and 4 fast cores) I get a somewhat better cache line per 12.5ns.
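For flavor, here's a minimal sketch of that kind of measurement (not my actual benchmark code; the sizes, the 16 KB page assumption, and names like WINDOW_LINES are mine): build a pointer-chase chain with one node per cache line, apply a Knuth shuffle whose swap distance is capped at 64 pages, then time a dependent walk over the whole array.

    // Sketch of a TLB-friendly pointer-chase latency measurement.
    #include <chrono>
    #include <cstdio>
    #include <random>
    #include <vector>

    int main() {
        constexpr size_t LINE         = 64;                    // cache-line size in bytes
        constexpr size_t BYTES        = 256u << 20;            // 256 MB, far larger than any cache
        constexpr size_t N            = BYTES / LINE;          // number of chain nodes (cache lines)
        constexpr size_t STRIDE       = LINE / sizeof(size_t); // one chain node per cache line
        constexpr size_t PAGE_LINES   = 16384 / LINE;          // lines per page (16 KB pages assumed)
        constexpr size_t WINDOW_LINES = 64 * PAGE_LINES;       // cap shuffle distance at 64 pages

        // Visiting order, shuffled so no element moves more than WINDOW_LINES away.
        std::vector<size_t> order(N);
        for (size_t i = 0; i < N; ++i) order[i] = i;
        std::mt19937_64 rng(42);
        for (size_t i = N - 1; i > 0; --i) {
            size_t lo = (i > WINDOW_LINES) ? i - WINDOW_LINES : 0;
            size_t j  = lo + rng() % (i - lo + 1);
            std::swap(order[i], order[j]);
        }

        // Each visited line stores the index of the next line to visit.
        std::vector<size_t> mem(N * STRIDE);
        for (size_t i = 0; i + 1 < N; ++i) mem[order[i] * STRIDE] = order[i + 1] * STRIDE;
        mem[order[N - 1] * STRIDE] = order[0] * STRIDE;

        // Dependent loads: each access has to wait for the previous one to finish.
        auto t0 = std::chrono::steady_clock::now();
        size_t idx = order[0] * STRIDE;
        for (size_t i = 0; i < N; ++i) idx = mem[idx];
        auto t1 = std::chrono::steady_clock::now();

        std::printf("%.1f ns per dependent cache-line load (sink=%zu)\n",
                    std::chrono::duration<double, std::nano>(t1 - t0).count() / N, idx);
    }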

Pretty stunning to have latencies that low on a low-end $700 Mac mini that embarrasses machines costing 10x as much. Even high-end Epyc machines (200-watt TDP) with 8 x 64-bit memory channels have to try hard to get a cache line every 10ns.


> Pretty stunning to have latencies that low on a low-end $700 Mac mini that embarrasses machines costing 10x as much. Even high-end Epyc machines (200-watt TDP) with 8 x 64-bit memory channels have to try hard to get a cache line every 10ns.

Eh? That's not how memory latency works. The cheaper consumer chips with "non-spec" RAM and without ECC are regularly better here than the enterprise stuff. This isn't something that scales with price.


Sure, ECC and in particular registered memory do increase latency a bit. But servers are designed for throughput and have multiple memory channels to better feed the large number of cores involved, up to 64 cores for the new AMD Epyc chips. The amazing thing is that the Apple M1 can fetch random cache lines almost as fast as a current AMD Epyc.


You're confusing throughput & latency here. More channels increases throughput, but doesn't improve latency.

The M1's memory bandwidth is ~68GB/s, which is of course a tiny fraction of AMD Epyc's ~200GB/s per socket.

Epyc's latency isn't even competitive with AMD's own consumer parts, so I'm really not sure why you're surprised that Epyc's latency is also worse than the M1's?


I'm not surprised the latency on the M1 is better than Epyc's, but it's near half that of any other consumer part, like, say, the AMD Ryzen 5950X. When accessed in a TLB-friendly way (not TLB-thrashing), the M1 manages 30ns, which is excellent.

Even more impressive is that the random cache-line throughput is also excellent. So if all 8 cores have a cache miss, the M1 memory system is very good at keeping multiple pending requests in flight and achieves surprisingly good throughput. Granted this isn't pure latency, so I call it throughput. Getting a random cache line every 12ns is quite good, especially for a cheap, low-power system. Normally getting more than 2 memory channels on a desktop requires something exotic like an AMD Threadripper.
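To illustrate the latency-versus-throughput distinction, here's a rough sketch (again my own illustration, not the code behind the numbers above): each thread chases its own independent chain, so the memory system can keep several misses in flight at once, and the per-line figure you get out is aggregate throughput rather than true latency. For brevity it uses a full shuffle, i.e. the TLB-thrashing case rather than the windowed one.

    // Sketch: per-thread pointer chases measuring aggregate miss throughput.
    #include <chrono>
    #include <cstdio>
    #include <random>
    #include <thread>
    #include <vector>

    static std::vector<size_t> make_chain(size_t bytes, unsigned seed) {
        constexpr size_t STRIDE = 64 / sizeof(size_t);   // one chain node per 64-byte line
        const size_t n = bytes / 64;
        std::vector<size_t> order(n);
        for (size_t i = 0; i < n; ++i) order[i] = i;
        std::mt19937_64 rng(seed);
        for (size_t i = n - 1; i > 0; --i) std::swap(order[i], order[rng() % (i + 1)]);
        std::vector<size_t> mem(n * STRIDE);
        for (size_t i = 0; i + 1 < n; ++i) mem[order[i] * STRIDE] = order[i + 1] * STRIDE;
        mem[order[n - 1] * STRIDE] = order[0] * STRIDE;
        return mem;
    }

    int main() {
        constexpr int    THREADS = 4;
        constexpr size_t BYTES   = 64u << 20;            // 64 MB chain per thread
        constexpr size_t HOPS    = BYTES / 64;

        std::vector<std::vector<size_t>> chains;
        for (int t = 0; t < THREADS; ++t) chains.push_back(make_chain(BYTES, 1 + t));

        std::vector<size_t> sink(THREADS);
        auto t0 = std::chrono::steady_clock::now();
        std::vector<std::thread> workers;
        for (int t = 0; t < THREADS; ++t)
            workers.emplace_back([&, t] {
                size_t idx = 0;
                for (size_t i = 0; i < HOPS; ++i) idx = chains[t][idx];
                sink[t] = idx;                           // keep the chase from being optimized out
            });
        for (auto& w : workers) w.join();
        auto t1 = std::chrono::steady_clock::now();

        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
        std::printf("%.1f ns per cache line aggregated across %d threads\n",
                    ns / double(HOPS * THREADS), THREADS);
    }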


Is this a bot? This is just lifted from the article verbatim.

Do I get my Turing tester badge?


IIRC Apple got some exotic automated optimization tech when they acquired PA Semi; I think it was mentioned in the Steve Jobs biography. The A-series chips have always been impressive and the timing lines up.


Did you read the article? Does not seem like it.


read the article!


I just had to jump on here to say how wrong the article is on numerous things.

- "Ryzen is an APU": no, it's an SoC. The northbridge and memory controller are on the CPU die; all Ryzen parts are SoCs and contain the memory controller on-chip.

- CISC vs RISC: while x86 (CISC) and ARM (RISC) are fundamentally of those natures, both instruction sets today carry a similar mix of complex and reduced instructions.

Why are Apple's chips better at the moment? Because they have actually been throwing money into the R&D of the processor (cough, Intel, cough).

You actually see similar performance gains on AMD now that it can flex its muscles: AMD laptops are getting 10+ hours of battery life. Give AMD a few more years and the performance will probably overtake the M1, since the chiplet design scales better.


> In real world test after test, the M1 Macs are not merely inching past top of the line Intel Macs, they are destroying them. In disbelief people have started asking how on earth this is possible?

I'm getting sick of this. The M1 is an impressive chip for sure, but it's ridiculous to compare them to the Intel chips in last year's MacBooks. Those chips were old, the cooling solution was terrible, and Intel had already fallen way behind AMD for the kinds of tasks you might actually use a MacBook for. The cynic in me thinks Apple intentionally handicapped the 2019 models to make the M1 look as good as possible... and it clearly worked.

Can we please stop comparing apples to oranges (pun intended)?

Like I said, the M1 is impressive - especially in terms of power efficiency - but we should be comparing it to the likes of an Intel Core i7-1185G7 or the imminent AMD Ryzen 7 5800U.


What are you talking about? The last-gen MacBook Air and Pro had Ice Lake chips with Iris Plus graphics, so why would that not be a fair comparison? Sure, the Air did have a terrible cooling system, but so does the new Air. The 2020 4-port 13” MacBook Pro, which continues to be sold because it's the only 13” MacBook that can be configured with a 4TB SSD and 32GB of RAM as of now, has a 4-core Ice Lake chip that Apple pushes to its limits to squeeze out every ounce of performance it can, if you watch a review or two covering that machine. Linus Tech Tips complained that it was running too hot: not thermal throttling, just going really fast. And the M1's GPU is something like 2-2.5x faster than even those machines' iGPU.



