Why is Apple's M1 chip so fast? (erik-engheim.medium.com)
768 points by socialdemocrat on Nov 30, 2020 | 629 comments



Contrary to what has been said on Twitter, the answer to why the M1 is fast isn’t technical tricks; it’s Apple throwing a lot of hardware at the problem.

The M1 is really wide (8 wide decode) and has a lot of execution units. It has a huge 630 deep reorder buffer to keep them all filled, multiple large caches and a lot of memory bandwidth.

It is just a monster of a chip: well designed, balanced, and executed.

BTW this isn’t really new. Apple has been making incremental progress year by year on these processors for their A-series chips. It's just that nobody believed those Geekbench results showing that, in short benchmarks, your phone could be faster than your laptop. Well, it turns out that given the right cooling solution those benchmarks were accurate.

Anandtech has a wonderful deep dive into the processor architecture.

https://www.anandtech.com/show/16226/apple-silicon-m1-a14-de...

Edit: I didn’t mean to disparage Apple or the M1 by saying that Apple threw hardware at the problem. That Apple was able to keep power low with such a wide chip is extremely impressive and speaks to how finely tuned the chip is. I was trying to say that Apple got the results they did the hard way by advancing every aspect of the chip.


The answer of wide decode and deep reorder buffer gets much closer than the “tricks” mentioned in tweets. That still doesn’t explain how Apple built an 8-wide CPU with such deep OOO that operates on 10-15 watts.

The limit that keeps you from arbitrarily scaling up these numbers isn’t transistor count. It’s delay: how long it takes for complex circuits to settle, which determines the top clock speed. And it’s also power usage. The timing delay of many circuits inside a CPU scales super-linearly with things like decode width. For example, the delay in the decode stage itself scales quadratically with the width of the decoder: ftp://ftp.cs.wisc.edu/sohi/trs/complexity.1328.pdf (p. 15). The delay of the issue queues is quadratic in both the issue width and the depth of the queues. The delay of a full bypass network is quadratic in execution width. Decoding N instructions at a time also requires a register renaming unit that can rename that many instructions per cycle, and the register file must have enough ports to feed 2-3 operands to N different instructions per cycle. On top of that, big multi-ported register files, deep and wide issue queues, and big reorder buffers all tend to be extremely power hungry.
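
To give a feel for that quadratic term, here is a back-of-the-envelope count of forwarding paths in a full bypass network (a sketch loosely following the model in the paper linked above; "stages" is how many pipeline stages can hold a not-yet-written-back result, "srcs" is source operands per instruction, both hypothetical parameters):

    /* Every in-flight result (width * stages of them) must be forwardable to
       every source operand of every instruction issued that cycle
       (width * srcs of them), so the path count grows roughly as width^2.
       Illustrative only; real designs cluster and prune these paths. */
    unsigned bypass_paths(unsigned width, unsigned stages, unsigned srcs) {
        return (width * stages) * (width * srcs);
    }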

On the flip side, the conventional wisdom is that most code doesn’t have enough inherent parallelism to take advantage of an 8-wide machine: https://www.realworldtech.com/shrinking-cpu/2/ (“The first sign that the party was over was diminishing returns from wider and wider superscalar designs. As CPUs went from being capable of executing 1, to 2, to 4, to even 6 instructions per cycle, the percentage of cycles during which they actually hit their full potential was dropping rapidly as both a function of increasing width and increasing clock rate.”). At the very least, such designs tend to be very application-dependent. Branch-heavy integer code like compilers tends to perform poorly on such wide and slow designs. The M1, by contrast, manages to come close to Zen 3, which is already a high-ILP CPU to begin with, despite a large clock speed deficit (3.2 GHz versus 5 GHz). And the performance seems to be robust, doing well on everything from compilation to scientific kernels. That’s really phenomenal and blows a lot of the conventional wisdom out of the water.
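
To make the ILP point concrete, here is a toy sketch (illustrative only, not a benchmark): the first loop is one long dependency chain that no amount of decode width can speed up, while the second exposes four independent chains that a wide out-of-order core can keep in flight simultaneously.

    #include <stddef.h>

    /* Serial dependency chain: each iteration needs the previous result,
       so a 1-wide and an 8-wide core retire it at roughly the same rate. */
    double serial_chain(const double *a, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s = s * 1.0000001 + a[i];
        return s;
    }

    /* Four independent accumulators: a wide core can issue several of
       these additions per cycle because they don't depend on each other. */
    double independent_chains(const double *a, size_t n) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (size_t i = 0; i + 4 <= n; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        return s0 + s1 + s2 + s3;
    }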

An insane amount of good engineering went into this CPU.


> An insane amount of good engineering went into this CPU.

I agree, but let's not overblow the difficulties either.

> For example, the delay in the decode stage itself scales quadratically with the width of the decoder.

That could be irrelevant for small enough numbers, and ARM is easier to decode than x86, so this can very well be dominated by other things. What you cite seems to be only about decoding the logical registers going into the renaming structures, and even for just that tiny part it says: "We found that, at least for the design space and technologies we explored, the quadratic component is very small relative to the other components. Hence, the delay of the decoder is linearly dependent on the issue width."

> The delay of a full bypass network is quadratic in execution width.

Maybe if that's a problem don't do a full bypass network.

> dropping rapidly as both a function of increasing width and increasing clock rate

Good thing that the clock rate is not too high then :p

More seriously, the M1 can keep the beast fed probably because everything is dimensioned correctly (and yes, also because the clocks are not too high; but if you manage to make a wide and slow CPU that actually works well, I don't see why you would want to push the frequency much higher, given that you would quickly consume power like crazy and there is only limited headroom above 3.2GHz anyway). It obviously helps to have a gigantic OOO. So I don't really see where there is so much surprise, esp. since we saw the progression in the A series.

To finish, TSMC 5nm probably does not hurt either. The competitors are on bigger nodes and have smaller structures. Coincidence? Or just how it has worked for decades already.


It's not completely groundbreaking, but painting it as an outgrowth of existing trends doesn't give Apple enough credit. The challenges of scaling wider CPUs within available power budgets are widely accepted: https://www.cse.wustl.edu/~roger/560M.f18/CSE560-Superscalar... (for "high-performance per watt cores," optimal "issue width is ~2"). Intel designed an entire architecture, Itanium, around the theory that OOO scaling would hit a point of diminishing returns. https://www.realworldtech.com/poulson ("Many of them were convinced that dynamic instruction scheduling and out-of-order execution would ultimately prove to be too complex and power hungry."). It is also well accepted that we are hitting limits on the ability to extract instruction-level parallelism: https://docencia.ac.upc.edu/master/MIRI/PD/docs/03-HPCProces... ("There just aren’t enough instructions that can actually be executed in parallel!"); https://compas.cs.stonybrook.edu/~nhonarmand/courses/sp16/cs... ("Hardly more than 1-2 IPC on real workloads").

Apple being able to wring robust performance out of an 8-wide 3.2 GHz design, on a variety of benchmarks, is impressive and unexpected. For example, the M1 outperforms a Ryzen 5950X by 15%. https://www.anandtech.com/show/16252/mac-mini-apple-m1-teste.... Zen 3 is either 4- or 8-wide decode (depending on whether you hit the micro-op cache) and boosts to 5 GHz. It beats the 10900k, a 4+1-way design that boosts to 5.1 GHz, by 25%. The GCC subtest, meanwhile, is famous for being branch-heavy code with low parallelism. Apple extracting 80% more IPC from that test than AMD's latest core (which is already a very impressive, very wide core to begin with!) is very unexpected.

A lot of the conventional wisdom is based on assumptions about branch prediction and memory disambiguation, which have major impacts on how much ILP you can extract: http://www.cse.uaa.alaska.edu/~afkjm/cs448/handouts/ILP-limi.... To do so well, Apple must be doing something very impressive on both fronts.


The i7-1165G7 is within 20% of the M1 on single-core performance. The Ryzen 4800U is within 20% on multi-core performance. Both are sustained ~25W parts similar to the M1. If you turned x64 SMT/AVX2 off and normalized for cores (Intel/AMD 4/8 vs Apple 4+4), on-die cache (Intel/AMD 12MB L2+L3 vs Apple 32MB L2+L3) and frequency (Apple 3.2 vs AMD/Intel 4.2/4.7), you'd likely get very close results on the same 5nm or 7nm-equivalent process. Zen 3 with 2666 vs 3200 RAM alone is about a 10% difference; the M1 uses 4266 RAM IIRC.

TBH, laptop/desktop-level performance is indeed very nice to see out of ARM, after a few years of false starts by a few startups and Qualcomm. Apple designed a wider core and deserves credit for it, but wider cores have been a definitive trend starting with the Pentium M vs the Pentium 4. There is a trade-off here for die area IMO: AMD/Intel have AVX2 and even AVX512 and SMT on each core, and narrower cores (with smaller structures, higher frequency); Apple has wider cores (larger structures, lower frequency, higher IPC). It's not that simple, but it kind of is if you squint a bit.


The i7-1165G7 boosts to 4.7 GHz, 50% higher than the M1. A 75% uplift in IPC (20% more performance at 2/3 the clock speed) compared to Intel’s latest Sunny Cove architecture is enormous. Especially since Sunny Cove is itself the biggest update to Intel’s architecture since Sandy Bridge a decade ago.
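
Back-of-the-envelope version of that figure, assuming the ~20% single-core gap mentioned above and the quoted clocks (rough, hypothetical round numbers):

    #include <stdio.h>

    int main(void) {
        double perf_ratio  = 1.20;       /* M1 ~20% faster single-core        */
        double clock_ratio = 4.7 / 3.2;  /* i7-1165G7 boost clock / M1 clock  */
        /* Relative IPC = relative performance divided by relative clock,
           i.e. perf_ratio * (Intel clock / M1 clock). Prints roughly 76%.    */
        printf("IPC uplift: ~%.0f%%\n", (perf_ratio * clock_ratio - 1.0) * 100.0);
        return 0;
    }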


Like I said, this is absolutely a die-size tradeoff IMO. That 75% IPC gain is only around a ~20% difference in Geekbench at similar sustained power levels. If you want AVX2/512+SMT, a slightly narrower core of realistically 6+ wide (up to 8-wide out of the uOP cache) is an acceptable tradeoff. We have seen Zen 3 go wider than Zen 1/2[1], so wider x64 designs with AVX/SMT should be coming, but this is the squinting part with TSMC 5nm vs 7nm.

[1] https://www.anandtech.com/show/16214/amd-zen-3-ryzen-deep-di...


Intel’s 10nm is equivalent to TSMC’s 7nm, so we’re just talking one generation on the process side. I don’t think you can chalk a 75% IPC gain to a die shrink. That’s a much bigger IPC uplift than Intel has achieved from Sandy Bridge to Sunny Cove, which happened over 4-5 die shrinks.

The total performance gain, comparing a 4.7 GHz core to a 3.2 GHz core, is 20%. But there is more to it than bottom line. The conventional wisdom would tell you that increasing clock speed is better than making the core wider because of diminishing returns to chasing IPC. Intel has followed the CW for generations: it has made cores modestly wider and deeper, but has significantly increased clock speed. Intel doubled the size of the reorder buffer from Sandy Bridge to Sunny Cove. Intel increased issue width from 5 to 6 over 10 years.

If your goal was to achieve a 20% speed-up compared to Sunny Cove, in one die shrink, the CW would be to make it a little wider and a little deeper but try to hit a boost clock well north of 5 GHz. It wouldn’t tell you to make it a third wider and twice as deep at the cost of dropping the boost clock by a third. Apple isn’t just enjoying a one-generation process advantage, but is hitting a significantly different point in the design space.


Superscalar vs super-pipelining isn't new. If there's no magic, then going a third wider would exactly offset dropping the boost clock by a third with perfect code. With SMT off, I get 25-50% more performance on single-threaded benchmarks; that's because a thread gets full access to 50% more decode/execution units in the same cycle. It's not that simple again, but that's probably the simplest example.

The M1 is definitely a significantly different point in the design space. Intel is also doing big/little designs with Lakefield, but it's still a bit early to see where that goes for x64. I don't think Intel/AMD have specifically avoided going wider as fast as Apple; AVX/AVX2/AVX512 probably take up more die-area than going 1/3 wider, and that's what they've focused on with extensions over the years. If there is an x64 ISA limitation to going wider, we'll find out, but that's highly unlikely IMO.


> Superscalar vs super-pipelining isn't new. If there's no magic, then a third wider would likely exactly decrease the boost clock by a third with perfect code.

It's not new, but it's surprising. You're correct that going a third wider at the cost of a third of clockspeed is a wash with "perfect code" but the experience of the last 10-20 years is that most code is far from perfect: https://www.realworldtech.com/shrinking-cpu/2/

> The first sign that the party was over was diminishing returns from wider and wider superscalar designs. As CPUs went from being capable of executing 1, to 2, to 4, to even 6 instructions per cycle, the percentage of cycles during which they actually hit their full potential was dropping rapidly as both a function of increasing width and increasing clock rate. Execution efficiency (actual instruction execution rate divided by peak execution rate) dropped with increasing superscalar issue width because the amount of instruction level parallelism (ILP) in most programs is limited.... The ILP barrier is a major reason that high end x86 MPUs went from fully pipelined scalar designs to 2-way superscalar in three years and then to 3-way superscalar in another 3 years, but have been stuck at 3-way issue superscalar for the last nine years.

Theoretical studies have shown that higher ILP is attainable (http://www.cse.uaa.alaska.edu/~afkjm/cs448/handouts/ILP-limi...) but the M1 suggests some really notable advances in being able to actually extract higher ILP in real-world code.

I agree there's probably no real x86-related limitation to going wider, if you've got a micro-op cache. As noted in the study referenced above, I suspect it's the result of very good branch prediction, memory disambiguation, and an extremely deep reorder window. Each of those is an engineering feat. Designing a CPU that extracts 80% more ILP than Zen 3 in branch-heavy integer benchmarks like SPEC GCC is a major engineering feat.


The M1 is a 10W part, no? I would kill to see the 25W M-series chip.

Oh and the 10W is for the entire SOC, GPU and memory included.


Nope. Anandtech measured 27W peak on M1 CPU workloads with the average closer to 20W+[1].

The Ryzen 4800U and i7-1165G7 also have comparable GPUs (and TPU+ISP for the i7) within the same ~15-25W TDP. The Intel i7-1165G7's average TDP might be closer to ~30W because of its 4.7GHz boost clock, but it's still comparable to the M1.

The i7-1165G7 and 4800U have a few laptop designs with soldered RAM. You can get 17hrs+ of video out of a 4800U laptop with a 60Wh battery[2]. Also comparable with i7-1065G7/i7-1165G7 at 15hrs+/50Wh.

[1] https://www.anandtech.com/show/16252/mac-mini-apple-m1-teste...

[2] https://www.pcworld.com/article/3531989/battery-life-amd-ryz...


Wasn’t 27W for the whole Mac Mini machine using a meter at the wall plug? So that includes losses in the power supply and ssd and everything else outside the chip that uses a bit of juice whereas the AMD tdp is just the chip. I thought Anandtech said there was currently no reliable way to do an ‘apples to apples’ tdp comparison?

Edit: quote from anandtech:

“As we had access to the Mac mini rather than a Macbook, it meant that power measurement was rather simple on the device as we can just hook up a meter to the AC input of the device. It’s to be noted with a huge disclaimer that because we are measuring AC wall power here, the power figures aren’t directly comparable to that of battery-powered devices, as the Mac mini’s power supply will incur a efficiency loss greater than that of other mobile SoCs, as well as TDP figures contemporary vendors such as Intel or AMD publish.”


The M1 doesn’t use 24W, it uses 12-16 watts. https://www.anandtech.com/show/16252/mac-mini-apple-m1-teste... (CPU, GPU + DRAM combined)


I have the device with me, and on full load on both CPU and GPU it can go up to 25W, but for most use cases I see the whole SoC hovering around 15W.


M1 is a 20W (max) CPU and a 40W SoC (whole package max).

However, in most intense workloads it doesn't go near 40W, more like ~25W under high load.

Still incredibly impressive.


18W CPU peak power, 10W is their power efficiency comparison point.


The M1 doesn’t use 24W, it uses 12-16 watts. https://www.anandtech.com/show/16252/mac-mini-apple-m1-teste... (CPU, GPU + DRAM combined)

https://www.youtube.com/watch?v=_MUTS7xvKhQ&list=PLo11Rczpzu... Check this out, 12.5W power consumption for the M1 CPU vs. 68W CPU power consumption for the Intel i9 CPU of the 16” Macbook Pro, and yet the M1 is 8% faster in Cinebench R23 in multi-core score.


My naive assumption would be that 4c big + 4c little would perform better than 4c/8t all other things being equal (and assuming software was written to optimize for each design respectively). Also no reason you can't have 4c/8t big + 4c/8t little too.


Apple’s 4big + 4LITTLE config performs better than Intel’s 8c/16t mobile chips right now.


> For example, the delay in the decode stage itself scales quadratically with the width of the decoder: ftp://ftp.cs.wisc.edu/sohi/trs/complexity.1328.pdf (p. 15).

That's a decoder for a single field, where width of the field is the parameter it scales by. That would be instruction size or smaller, and instructions don't change size depending on how many you decode at once.

And logically once you separate the instructions you can decode in parallel in fixed time, and if all your instructions are 4 bytes then it takes no circuitry to separate them.
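
A toy way to see it (a sketch; length() below stands in for a real x86 length decoder): with fixed 4-byte instructions every decoder knows its start offset up front, whereas with variable-length instructions each start depends on the previous instruction's length, so finding boundaries is inherently serial unless you decode speculatively at many offsets.

    #include <stdint.h>
    #include <stddef.h>

    /* Fixed-width ISA (e.g. AArch64): decoder k simply reads bytes [4k, 4k+4). */
    static size_t fixed_start(size_t k) {
        return 4 * k;
    }

    /* Variable-width ISA: the start of instruction k is a serial computation,
       because each length is only known after (partially) decoding it. */
    static size_t variable_start(const uint8_t *code, size_t k,
                                 size_t (*length)(const uint8_t *)) {
        size_t off = 0;
        for (size_t i = 0; i < k; i++)
            off += length(code + off);   /* each step depends on the last */
        return off;
    }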

Also: "We found that, at least for the design space and technologies we explored, the quadratic component is very small relative to the other components. Hence, the delay of the decoder is linearly dependent on the issue width."


Your first source does not support your statement.

While there is theoretically a quadratic component, in their words:

> We found that, at least for the design space and technologies we explored, the quadratic component is very small relative to the other components. Hence, the delay of the decoder is linearly dependent on the issue width.


> That still doesn’t explain how Apple built an 8-wide CPU with such deep OOO that operates on 10-15 watts.

Well, because it doesn't; it's ~25 watts. And also because it runs at just 3GHz. You'll see similar power numbers from x86 CPUs at 3GHz, too. The M1's multicore performance vs. the 4800U and 4900HS demonstrates this nicely.


I haven’t read the linked AnandTech article yet, but is there a clear answer why Apple was able to defy common comp arch wisdom (M1 has wider decode which works fine for various applications/code)?


Check the parent article that explains it well. Apple didn’t defy common comp arch wisdom... they applied it.

The reason why it is hard for Intel/AMD to do the same is not a lack of engineering geniuses (I’m sure they have plenty), but the support for a legacy ISA, and a particular business model.

What Apple defies is common business survival instincts: why spend so much on R&D for a chip when there are market leaders that are impossible to beat? The answer seems obvious now... but it probably wasn’t obvious when Apple acquired PA Semi in 2008.


> What Apple defies is common business survival instincts: why spend so much on R&D for a chip when there are market leaders that are impossible to beat?

Having your own silicon means an upstream supplier cannot turn the lights off on you (see Samsung, a company keeping a quarter of its host country's GDP hostage). I believe the immediate goal of the PA Semi purchase was that.

> The answer seems to be obvious now... but probably it wasn’t obvious when Apple acquired PA Semi in 2008.

PA Semi was clearly a diamond in the rough. It took great insight to single out PA Semi, because on the surface it was a very barebones SoC sweatshop, but in reality PA was the last of the Mohicans of US chip design.

PA was the place where non-Intel IC engineers went after the severe carnage of the microchip businesses of US tech giants like Sun, IBM, HP, DEC, SGI, etc.

It was a star team which back then was toiling at router box SoCs.


Just to quantify your adjectives, per the Anandtech article:

> The M1 is really wide (8 wide decode)

In contrast to x86 CPUs which are 4 wide decode.

> It has a huge 630 deep reorder buffer

By comparison, Intel Sunny/Willow has 352.


Zen 2 has 8-wide issue in many places, and Ice Lake moves up to 6-wide. Intel/AMD have had 4-wide decode and issue widths for 10 years and I'm glad they're moving to wider machines.

Edited "decode" to "issue" for clarity.


Could you explain what you mean by "8-wide decode in many places"? How is that possible? Isn't instruction decoding kinda always the same, i.e. always 4-wide or always 8-wide, but not sometimes this and sometimes that?

All sources I could find say it is 4-wide, so I'd also be interested if you could perhaps give a link to a source?


https://en.wikichip.org/wiki/amd/microarchitectures/zen_2

The actual instruction decoder is 4-wide. However, the micro-op cache has 8-wide issue, and the dispatch unit can issue 6 instructions per cycle (and can retire 8 per cycle to avoid ever being retire-bound). In practice, Zen 2 generally acts like a 6-wide machine.

Oh, on this terminology: x86 instructions are 1-15 bytes wide (averaging around 3-4 bytes in most code). n-wide decode refers to decoding n instructions at a time.


Thanks for the link! Yeah, those are basically the numbers I also found, although the number of instructions decoded per clock cycle is a different metric from the number of µops that can be issued, so that feels a bit like moving the goalposts.

But, fair enough, for practical applications the latter may matter more. For an apples-to-apples comparison (pun not intended) it'd be interesting to know what the corresponding number for the M1 is; while it is ARM and thus RISC, one might still expect that there can be more than one µop per instruction, at least in some cases?

Of course then we might also want to talk about how certain complex instructions on x86 can actually require more than one cycle to decode (at least that was the case for Zen 1) ;-). But I think those are not that common.

Ah well, this is just intellectual curiosity, at the end of the day most of us don't really care, we just want our computers to be as fast as possible ;-).


I have usually heard the top-line number as the issue width, not the decode width (so Zen 2 is a 6-wide issue machine). Most instructions run in loops, so the uop cache actually gives you full benefit on most instructions.

On the Apple chip: I believe the entire M1 decode path is 8-wide, including the dispatch unit, to get the performance it gets. ARM instructions are 4 bytes wide, and don't generally need the same type of micro-op splitting that x86 instructions need, so the frontend on the M1 is probably significantly simpler than the Zen 2 frontend.

Some of the more complex ops may have separate micro-ops, but I don't think they publish that. One thing to note is that ARM cores often do op fusion (x86 cores do too), but with a fixed issue width there are very few places where this would move the needle. The textbook example is fusing DIV and MOD into one two-input, two-output instruction (the x86 DIV instruction computes both, but the ARM DIV instruction does not).
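
For instance, the usual candidate is the quotient/remainder pair below (a sketch): x86 gets both results from a single DIV, while on AArch64 compilers typically emit an SDIV followed by an MSUB to recover the remainder, which a fused two-output op could collapse.

    /* Quotient and remainder of the same operands: one DIV on x86,
       typically SDIV + MSUB (r = n - q * d) on AArch64. */
    void divmod(long n, long d, long *q, long *r) {
        *q = n / d;
        *r = n % d;
    }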


x86 doesn't have fixed-width instructions. Depending on the mix, you may be able to decode more instructions. And if you target common instructions, you can get a lot of benefit in real-world programs.

Arm is different but probably easier to decode. So you can widen the decoder.


This I think is the real answer; for a long time people were saying that "CISC is just compression for RISC, making virtue of necessity", but it seems like M1 serves as a good counterexample where a simpler ISA is scaled up to modern transistor counts (and given exclusive access to the world's best manufacturing, TSMC 5nm).


Considering that x86 is less dense than any RISC ISA, the "compression" argument behind CISC falls apart. No surprise a denser, trivial to decode ISA does better.


You have a source for that? The first google result I found for research on that shows it as denser than almost every RISC ISA [1]. It’s just one study and it predates ARM64 fwiw though.

[1] https://www.researchgate.net/profile/Sally_McKee/publication...


That paper uses no actual benchmarks, but rather grabbed a single system utility and then hand-optimized it; SPEC and geekbench show x86-64 comes in well over 4 bytes on average.


Sure, I never claimed it to be the be-all-end-all, just the only real source I could find. Adding "SPEC" or "geekbench" didn't really help.

Doing a little more digging, I have also found this [1], which claims "the results show that the average instruction length is about 2 to 3 bytes". On the other hand, this [2] finds that the average instruction length is 4.25 bytes.

Bytes per instruction doesn't really say anything useful for code density when talking about RISC vs. CISC though, since (arguably) the whole idea is that individual CISC instructions are supposed to do more than individual RISC instructions. A three instruction CISC routine at five bytes each is still a win over a four instruction RISC routine at four bytes each. Overall code size is what actually matters.

[1] https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.40...

[2] http://www.cs.unc.edu/~porter/pubs/instrpop-systor19.pdf


But there's more work being done per average x86_64 instruction due to RMW ops. Hence why they just look at an entire binary.


OK, I could see how one could implement a variable width instruction decoder (e.g. "if there are 8 one-byte instructions in a row, handle them, otherwise fallback to 4-way decoding" -- of course much more sophisticated approach could be made).

But is this actually done? I honestly would be interested in a source for that; I just searched again and could find no source supporting this (but of course I may simply not have used the right search terms, I would not be surprised by that in the least). E.g. https://www.agner.org/optimize/microarchitecture.pdf#page216 makes no mention of this and describes AMD Zen (version 1; it doesn't say anything about Zen 2/3) as 4-wide.

I did find various sources which talk about how many instructions / µops can be scheduled at a time, and there it may be 8-way, but that's a completely different metric, isn't it?


As a historical note, the Pentium P6 uses an interesting approach. It has three decoders but only one of them can handle "complex macroinstructions" that require micro-operations from the ROM. If a limited-functionality decoder got the complex instruction, the instruction gets redirected to another decoder the next cycle.

As far as variable-length instructions, a separate Instruction Length Decoder sorts that out before decoding.

Ref: "Modern Processor Design" p268.


> As far as variable-length instructions, a separate Instruction Length Decoder sorts that out before decoding.

And how fast is that able to run on x86? How many instructions can that process at once, compared to an alternate universe where that circuit has the same transistor and time budget but only has to look at the first four bits of an instruction?


Probably should check Agner's guide, but the P6 is still the rough blueprint for everything (except the P4) that Intel has done since.


They were still doing this in the Nehalem timeframe (possibly influence from Hinton?)


I guess this requires extending the architecture to 8-wide instructions when it makes sense.


What do you mean with "8-wide instructions", and what does that have to do with multiple decoders?


So Intel and AMD are capable of building a chip like this, but the ambitious size meant it was more economically feasible for Apple to build it themselves?

(Not a hardware guy)


Neither Intel nor AMD is capable of doing this, for a very basic reason: there is no market for it. You can't just release a CPU for which there is no operating system.

Apple can pull it off because they already own the entire stack, from hardware to operating system to cloud services, and they can swap out a component like the CPU for a different architecture and release a new version of the OS that supports it.

By creating its own CPU, Apple replaces a part of the stack that was owned by Intel with its own, which strengthens its position even if it didn't improve performance at all.

Apple is invulnerable to other companies copying the CPU and creating their own, because they are not really competition here. Apple sells an integrated product of which the CPU is just one component.


That's not entirely true. Windows ARM64 can execute natively on the M1 (through QEMU for hardware emulation, but the instructions execute natively). Intel/AMD could produce an ARM processor that could find a market. They also have a close partnership with Microsoft and I have to believe there would be a path forward there. They could also target Linux.

I haven't seen enough evidence yet though that ARM is the reason the M1 performs so efficiently. It may just be the fact that it is on a cutting-edge 5nm process with a SoC design. I'm not even sure the PC/Windows market would adopt such a chip since it lacks upgradability. It's really nice to be able to swap out RAM and/or GPU. Heck, even AMD has been retaining backwards compatibility with older motherboards since it's been on one socket for a while.

I think for laptops/mobile this makes a lot of sense. For a desktop though I honestly prefer AMD's latest Ryzen Zen 3 chips.


> It may just be the fact that it is on a cutting edge 5nm process with a SoC design.

Yup. It's fast because it's got short distances to memory and everything else. Shorten the wire to memory cells and not only can you make signaling faster and run the memory at faster clock speed but you can do it with less accessory hardware for signal conditioning and error correction, which saves complexity and power. Using shorter paths to memory also lets you use lower voltages, which means less waste heat and less need to spend effort on cooling and overall power savings for the chip.

Shortening the wire also lowers latency between all the various on board devices, so communicating everywhere is faster.

There's a reason that manufacturers used to be able to "speed" up a chip by just doing a die shrink - photographically reducing chip masks to make them smaller, which also made them faster with relatively small amounts of work.

As the late Adm. Grace Hopper put it, there are ever so many picoseconds between the two ends of a wire.


> Shortening the wire also lowers latency between all the various on board devices, so communicating everywhere is faster.

A maximum of a few nanoseconds. Not much in comparison to overall memory system latency.

> Shorten the wire to memory cells and not only can you make signaling faster and run the memory at faster clock speed but you can do it with less accessory hardware for signal conditioning and error correction, which saves complexity and power.

You cannot run away from that with just shorter PCB distance. The circuitry for link training is mandated by the standard.

You will need a redesigned memory standard for that.


Until the late 90s, on-chip wire delays were something we just didn't care much about; speed was limited by gate capacitance, and we got speedups when we shrunk the gate sizes on transistors. After the mid 90s, RC delays in wires started to matter (not speed-of-light delays, but how fast you can shuffle electrons in there to fill up the C). Soon after it got worse, because wire RC delays don't scale perfectly with shrinks due to edge effects. This was addressed in a number of ways: high speed systems reduced the R by switching from Al wires to Cu, and tools got better at modeling those delays and doing synthesis and layout at (almost) the same time.


Intel/AMD could produce an ARM processor that could find a market.

Intel did have an ARM processor line, and it did have a market. They acquired the StrongARM line from Digital Equipment and evolved it into the XScale line. What Intel didn't want was for something to eat into its x86 market, and Windows/ARM didn't exist. So they evolved ARM in a different direction than Apple later did. It was very successful in the high-performance embedded market.


"It was very successful in the high-performance embedded market."

As long as you don't define that market as "billions of mobile smartphones".

I remember StrongArms in PDAs back in the early 2000s.

They should have had processors ready for the smartphone, but IIRC they kept pushing x86 on phones.


Fair point. I forgot about the PXA line. I suspect, however, more of the IOP & IXP embedded processors were sold.


>AMD could produce an ARM processor that could find a market.

They did and it couldn't find a market.


IIRC the chip they sort-of released was originally meant for Amazon, but it missed targets wildly, leading Amazon to do one on their own.

Lisa Su put the kibosh on K12 for focus reasons, given how well Zen turned out it was the right call at least for now.


> it lacks upgradability. It's really nice to be able to swap out RAM and/or GPU.

Honestly, it doesn't. You just swap it for a new one and sell the old one.

When you have 8GB, you pay for another 8GB and end up with 16.

In this case, you just sell your SoC with 8GB and buy another SoC with 16GB. You'll only pay the difference.

This is pretty much how upgrading a relatively recent phone works, too.


Depreciation means you won't pay out only the difference, in most cases.


This is kinda where Apple products thrive. They drop in price very little, and have, ultimately, long lives.

iPhone 7s and iPhone 8s are still great low-end devices, and they reach the right market by being resold by people getting a newer one.

I don't see why M1 laptops would be an exception.


> "Apple can pull it off because they already own entire stack from hardware to operating system to cloud services and the can swap out a component like CPU for a different architecture and release new version of OS that supports it."

Note that this is the same model that Sun Microsystems, DEC, HP, etc. had and it didn't work out for them.

I'd venture to say that it currently only works out for Apple because Intel has stumbled very, very badly and TSMC has pulled ahead in fabbing process technology. If Intel manages to get back on its feet with both process enhancements and processor architecture (and there's no doubt they've had a wake up call), this strategic move could come back to bite Apple.


Only because of Linux, without it in the picture they would still be selling UNIX workstations.


Without Linux, they would've lasted longer but still would've lost out on price/performance against x86 and Intel's tick-tock cadence well before Intel's current stumble. We might all have wound up running Windows Server in our datacenters.


I doubt places like Hollywood would have migrated to Windows, given their dependency on UNIX variants like Irix.


I don't understand, how do these low level changes impact the OS exactly assuming that the ISA remains the same? It doesn't seem much more impactful than SSE/AVX and friends, i.e. code not specifically optimized for these features won't benefit but it'll still work and people can incrementally recompile/reoptimize for the new features.

After all that's pretty much how Intel has operated all the way since the 8086.

It's not like Itanium where everything had to be redone from scratch basically.


Are you referring to Apple's laptop x86 -> ARM change? Entertaining the idea that the ISA is significant here: Surely there would be a big market for ARM chips in the Android and server sides too, so this shouldn't be the only reason why other vendors aren't making competitive ARM chips. Apple's laptop volumes aren't that big compared to those markets.

And of course you have to factor in the large amount of pain that Apple is imposing on its user and ISV base in addition to the inhouse cost of switching out the OS and supporting two architectures for a long time in parallel. A vendor making chips for Android or servers wouldn't have to bear that.


> You can't just release a CPU for which there is no operating system

sure you can. That's what compilers are for.


Intel had such an attitude once before.


Donald Knuth said "The Itanium approach...was supposed to be so terrific—until it turned out that the wished-for compilers were basically impossible to write."[82]

https://en.wikipedia.org/wiki/Itanic

So they didn't have the needed compiler


Surely there were compilers, they just weren't as good (as optimizing) as Intel wished.


Of course there were itanic-targetting compilers, they worked, just not well enough to deliver on marketing promise (edit: and what the hardware was theoretically capable of).


I wonder how HP and Microsoft managed to port HP-UX and Windows without a compiler.


That's kind of the point.

Compilers existed just fine to do the porting, and solved that problem.

Intel's failure is that they were unable to solve a different problem because that compiler didn't exist, one that went well beyond merely porting.

In other words, "That's what compilers are for." is a perfectly fine attitude when those compilers exist, and a bad attitude when they don't exist. Porting is the former, making VLIW efficient is the latter.


GP probably means that you won't be able to sell it, even if there is a compiler. (Not true in the super embedded space, sure.)


It's not that it was more economical, but that at least some of this AMD and Intel would not benefit from due to the ISA: x64 instructions can be up to 15 bytes, so just finding 8 instructions to decode would be costly, and I assume Intel and AMD think more costly than the gains from more decoders (you couldn't keep them fed enough to be worth it, basically).


I can't comment on the economics of it but I can comment on the technical difficulties. The issue for x86 cores is keeping the ROB fed with instructions - no point in building a huge OoO if you can't keep it fed with instructions.

Keeping the ROB full falls on the engineering of the front-end, and here is where CISC v RISC plays a role. The variable length of x86 has implications beyond decode. The BTB design becomes simpler with a RISC ISA since a branch can only lie in certain chunks in a fetched instruction cache line in a RISC design (not so in CISC). RISC also makes other aspects of BPU design simpler - but I digress. Bottom line, Intel and AMD might not have a large ROB due to inherent differences in the front-end which prevent larger size ROBs from being fed with instructions.

(Note that CISC definitely does have its advantages - especially in large code-footprint server workloads where the dense packing of instructions helps - but it might be hindered in typical desktop workloads)

Source: I've worked in front-end CPU micro-architecture research for ~5 years


How do you feel about RISC-V compact instructions? The resulting code seems to be 10-15% smaller than x86 in practice (25-30% smaller than aarch64) while not requiring the weirdness and mode-switching associated with thumb or MIPS16e.

Has there actually been much research into increasing instruction density without significantly complicating decode?

Given the move toward wide decoders, has there been any work on the idea of using fixed-size instruction blocks and huffman encoding?


I can't really comment on the tradeoffs between specific ISAs since I've mainly worked on micro-arch research (which is ISA agnostic for most of the pipeline).

As for the questions on research into looking at decode complexity v instruction density tradeoff - I'm not aware of any recent work but you've got me excited to go dig up some papers now. I suspect any work done would be fairly old - back in the days when ISA research was active. Similar to compiler front-end work (think lex, yacc, grammar etc..) ISA research is not an active area currently. But maybe it's time to revisit it?

Also, I'm not sure if Huffman encoding is applicable to a fixed-size ISA. Wouldn't it be applicable only in a variable size ISA where you devote smaller size encoding to more frequent instructions?


Fixed instruction block was referring to the Huffman encoding. Something like 8 or 16kb per instruction block (perhaps set by a flag?). Compilers would have to optimize to stay within the block, but they optimize for sticking in L1 cache anyway.

Since we're going all-in on code density, let's go with a semi-accumulator 16-bit ISA. 8 bits for instructions, 8 bits for registers (with 32 total registers). We'll split into 5 bits and 3 bits. 5-bits gives access to all registers since quite a few are either read-only (zero register, flag register) or write occasionally (stack pointer, instruction counter). The remaining 3 bits specify 8 registers that can be the write target. There will be slightly more moves, but that just means that moves compress better and seems like it should enforce certain register patterns being used more frequently which is also better for compression.

We can take advantage of having 2 separate domains (one for each byte) to create 2 separate Huffman trees. In the worst case, it seems like we increase our code size, but in more typical cases where we're using just a few instructions a lot and using a few registers a lot, the output size should be smaller. While our worst-case lookup would be 8 deep, more domain-specific lookup would probably be more likely to keep the depth lower. In addition, two trees means we can process each instruction in parallel.

As a final optimization tradeoff, I believe you could do a modified Huffman that always encodes a fixed number of bits (e.g. 2, 4, 6, or 8), which would halve the theoretical decode time at the expense of an extra bit on some encodings. It would be +25% for a 3-bit encoding, but only 16% for a 5-bit encoding (perhaps steps of 2, 3, 4, 6, 8). For even wider decode, we could trade off a little more by forcing the compiler to ensure that each Huffman encoding breaks evenly every N bytes so we can set up multiple decoders in advance. This would probably add quite a bit to compile time, but would be a huge performance and scaling advantage.

Immediates are where things get a little strange. The biggest problem is that the immediate value is basically random so it messes up encoding, but otherwise it messes with data fetching. The best solution seems to be replacing the 5-bit register address with either 5 bits of data or 6 bits (one implied) of jump immediate.

Never gave it too much thought before now, but it's an interesting exercise.
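
A sketch of how the fixed-step variant above could be decoded (everything here is hypothetical, just to make the idea concrete): restricting code lengths to 2/4/6/8 bits means a single 256-entry table, indexed by the next 8 bits of the stream, resolves both the decoded field and how many bits to consume, and keeping the opcode and register fields in separate streams with separate tables makes the two lookups independent of each other.

    #include <stdint.h>
    #include <stddef.h>

    typedef struct {
        uint8_t symbol;  /* decoded opcode or register field    */
        uint8_t nbits;   /* 2, 4, 6 or 8 bits actually consumed */
    } DecodeEntry;

    /* Generated per program (or per block) from the Huffman trees: every
       8-bit value sharing the same short prefix maps to the same entry. */
    static DecodeEntry opcode_table[256];
    static DecodeEntry reg_table[256];

    static uint8_t decode_field(const uint8_t *stream, size_t *bitpos,
                                const DecodeEntry *table) {
        size_t byte = *bitpos / 8, shift = *bitpos % 8;
        /* Peek the next 8 bits, which may straddle a byte boundary. */
        uint16_t window = (uint16_t)stream[byte] |
                          ((uint16_t)stream[byte + 1] << 8);
        DecodeEntry e = table[(uint8_t)(window >> shift)];
        *bitpos += e.nbits;
        return e.symbol;
    }

    /* One "instruction": an opcode field and a register field, each drawn
       from its own stream, so the two lookups can happen in parallel. */
    static void decode_insn(const uint8_t *ops, const uint8_t *regs,
                            size_t *op_bits, size_t *reg_bits,
                            uint8_t *opcode, uint8_t *reg) {
        *opcode = decode_field(ops, op_bits, opcode_table);
        *reg = decode_field(regs, reg_bits, reg_table);
    }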


Not necessarily. Samsung used to make custom cores that were just as large if not larger than Apple’s (amusingly the first of these was called M1).

Unfortunately, Samsung’s cores always performed worse and used significantly more power than the contemporary Apple cores.

Apple’s chip team has proven capable of making the most of their transistor budget, and there’s reason to believe neither Intel nor AMD could achieve Apple’s efficiency even if they had the same process, ISA, and area to work with.


> there’s reason to believe neither Intel nor AMD could achieve Apple’s efficiency even if they had the same process, ISA, and area to work with.

From what I have seen the only difference in efficiency is the manufacturing process. M1 consumes about as much power per core as a Ryzen core. AMD also has a mobile chip with 8 non heterogeneous cores that has around the same TDP as the M1.


> AMD also has a mobile chip with 8 non heterogeneous cores that has around the same TDP as the M1.

TDP is nowhere near actual load power use.


> and there’s reason to believe neither Intel nor AMD could achieve Apple’s efficiency even if they had the same process, ISA, and area to work with.

What's that reason?


> What's that reason?

Apple’s efficiency is based on a very wide and deep core that operates at a somewhat lower clock speed. Frequent branches and latency for memory operations can make it difficult to keep a wide core fully utilized. Moreover, wider cores generally cannot clock as high. That’s why Intel and AMD have chosen to pursue narrower cores that can clock near 5 GHz.

The maximum ILP that can be extracted from code can be increased with better branch prediction accuracy, larger out of order window size, and more accurate memory disambiguation: http://www.cse.uaa.alaska.edu/~afkjm/cs448/handouts/ILP-limi.... The M1 appears to have made significant advances in all three areas, in order for it to be able to keep such a wide core utilized.


What you write makes sense but it does not address why AMD and Intel could not do the same "even if they had the same process, ISA, and area to work with."


They wouldn’t have Apple’s IP relating to branch prediction, memory disambiguation, etc.


I think it's "faith".


Faith implies no data.

Why will Apple always out compete Intel and other non-vertically-integrated systems? Margins, future potential, customer relationship and compounding growth/virtuous cycle.

The margins logic is simple, iPhones and MacBooks make tons more money per unit compared with a CPU. Imagine if improving the performance of a CPU by 15% makes the demand increase by 1%. For Apple improving the performance of a CPU by 15% makes the demand increase by 1% for the whole iPhone or MacBook. For this reason alone, Apple can invest 2-5x more R&D into their chips than everyone else.

The future potential logic is more nuanced:

1. Intel's/whoever's 10-year vision is to build a better CPU/GPU/RAM/screen/camera, because their customers are the companies buying CPUs/GPUs/screens/cameras/RAM. They are focused on the metrics the market has previously used to measure success and build to optimize for those metrics, e.g. performance per dollar. Intel doesn't pay for the electricity in the datacenter, nor does it pay through its customers' complaints about battery life. RAM manufacturers aren't looking at Apple's products and asking, "do consumers even still replace RAM?" In other words, they are focused on "micro"-trends.

2. Apple's vision is to build the best product for customers. They look at "macro"-trends into the future and apply their personal preferences at scale. For example, do people even still need replaceable RAM? Will they want 5G in the future, or can we improve the technology to replace it with direct connections to a LEO satellite cluster?

The customer relationship logic:

Let's take one such example of a macro-trend: VR and other wearables. Apple is tracking these trends and can "bet on" them because it's in full control, whereas Nvidia, Intel, etc. typically don't want to "bet on" them because even if they are fully invested, their partners (which sell to consumers) might back out. Apple also isn't really "betting", because it has a healthy group of early adopters that trust Apple and will buy and try it even when a "better" product in the same market segment isn't purchased. Creating/retaining that customer relationship lets Apple over-invest in keeping heat (i.e. power) low, because it's thinking about the whole market segment that Apple's VR headset can start to compete in and collect more revenue from.

Compounding growth/virtuous cycle logic is also relatively simple:

Improving the metrics in any of these three pillars in turn improves the other pillars, i.e. a better customer relationship increases cash flow, which increases R&D funding, which 1. improves the product, improving the customer relationship; or 2. reduces costs, increasing margins, and loops back to increasing cash flow.


Read the article linked, it explains why Intel and AMD are unable to throw more decoders at the problem.


The problem is the market.

Windows only supports a single architecture, so they can't really deviate from that. Sure, Windows can switch (or, apparently, run on ARM), but because Windows applications are generally distributed as binaries, lots of apps wouldn't work.

Linux users would have far fewer issues, and would be a great clientele for a chip like this, but probably too niche a market, sadly.


Windows only a single architecture

People forget that at launch Windows NT ran on MIPS & DEC Alpha in addition to x86. The binary app issue was a killer for the alternative archs.


Pragmatically, windows runs on a single Architecture.

Sure, there have been editions for other architectures, but they're more anecdotal experiments than something usable.

I can go out and buy several weird ARM or PPC devices and run Linux or OpenBSD on them, and run the same stuff I use on my desktop regularly (except Steam).

The fact that Windows relies on a stable ABI is its major anchor (while Linux only guarantees a stable API).


they're more anecdotal experiments than something usable

Wrong. Microsoft explicitly set out multi-architecture support as a design goal for NT. MIPS was the original reference architecture for NT. Microsoft even designed their own MIPS based systems to do it (called 'Jazz'). There was a significant market for the Alpha port, especially in the database arena, and it was officially supported through Windows 2000. They were completely usable, production systems sold in large numbers.

In the end, the market didn't buy into the vision. The ISVs didn't want to support 3+ different archs. Intel didn't like competition. The history is all pretty well documented should one take the time to learn it.


> Microsoft explicitly set out multi-architecture support as a design goal for NT

They set it as a design goal, but that doesn't mean they achieved it.


Except they did, though apparently you missed it. MIPS was the original port. Alpha was supported from NT 3.1 through Windows 2000, and only died because DEC abandoned the Alpha, not because Microsoft abandoned Alpha (it was important to their 64-bit strategy). Itanium was supported from Windows 2003 to 2008 R2. Support for Itanium only ended at the beginning of this year, once again because the manufacturer abandoned the chip.

I'm sure you can redefine "achieve" to exclude almost 17 years of support (for Itanium), if you're that committed to being right. Heck, x86-64 support has "only" been around for 20 years or so. Doesn't make it right.


Dec Alpha servers running NT in production used to be a thing.


Linux has ABI guarantees.


DEC Alpha NT could run X86 code thanks to FX!32, and faster than a core you could buy from Intel at the time.


Well, for some things. FX!32 was deficient for the apps people wanted, though. The NT 3.1-era Alphas didn't have byte-level memory operations, so things like Excel, Word, etc. all ran terribly, as did Emacs and X. I supported a lab of Alphas running Ultrix and they were dogs for anything interactive and fantastic for anything that was a floating-point application.


Yeah...anyone who thinks fx32 was faster in the real world than a native Intel core never actually ran it.


Indeed, but it didn't have anything that justified actually paying big bucks for an Alpha.


x86 instructions are variable length with byte granularity, and the length isn’t known until you’ve mostly decoded an instruction. So, to decode 4 instructions in parallel, AIUI you end up simultaneously decoding at maybe 20 byte offsets and then discarding all the offsets that turn out to be in the middle of an instruction.

So the Intel and AMD decoders may well be bigger and more power hungry than Apple’s.


Maybe they are, assuming there's sufficient area in the die for this.

They would likely still be massive power hogs.


But in one x86 instruction you often have more complex operations. Isn't that part of the reason why Sunny Cove has only 4 wide decode but still the decoders can yield 6 micro-ops per cycle? That single stat makes it look worse than it is in reality, I think.


The whole principle of CISC (v RISC) is that you have more information density in your instruction stream. This means that each register, cache, decode unit, etc. is more effective per unit area & time. Presumably, this is how the x86 chips have been keeping up with fewer elements in terms of absolute # of instructions optimized for. The obvious trade-off being the decode complexity and all the extra area that requires. One may argue that this is a worthwhile trade-off, considering the aggregate die layout (i.e. one big complicated area vs thousands of distributed & semi-complicated areas) and economics of semiconductor manufacturing (defect density wrt aggregate die size).


Except that the RISC-V ISA manages to reach information density on par with x86 via a simple, backwards-compatible instruction compression scheme. It eats up a lot of coding space, but they've managed to make it work quite nicely. ARM64 has nothing like that; even the old Thumb mode is dead.


You mention most of the big changes, except one. Assuming a random (but TLB friendly) pattern the M1 manages a latency of around 30-33ns to main memory, about half of what I've seen anywhere else. Impressive.

Maybe motherboards should stop coming with DIMMs and use the Apple approach to get great bandwidth and latency, and come in 16, 32, and 64GB varieties by soldering LPDDR4X onto the motherboard.


> Assuming a random (but TLB friendly) pattern the M1 manages a latency of around 30-33ns to main memory, about half of what I've seen anywhere else. Impressive.

Cite your number please. Anandtech measured M1's memory latency at 96ns, worse than either a modern Intel or AMD CPU: "In terms of memory latency, we’re seeing a (rather expected) reduction compared to the A14, measuring 96ns at 128MB full random test depth, compared to 102ns on the A14."

https://www.anandtech.com/show/16252/mac-mini-apple-m1-teste...

vs. "In the DRAM region, we’re measuring 78.8ns on the 5950X"

https://www.anandtech.com/show/16214/amd-zen-3-ryzen-deep-di...


Well my comment mentioned "random (but TLB friendly)", which I define as visiting each cache line exactly once, but only with a few (32-64) pages active.

The reason for this is that I like to separate out the cache latency to main memory from the TLB-related latencies. Certainly there are workloads that are completely random (thus the term cache thrashing), but there are also many workloads that only have a few tens of pages active. Doubly so under Linux, where if needed you can switch to huge pages if your workload is TLB thrashing.

For a description of the Anandtech graph you posted see: https://www.anandtech.com/show/14072/the-samsung-galaxy-s10p...

So the cache-friendly line is the "R per RV prange" one: for the 5950X that latency is on the order of 65ns, while the similar line for the M1 is dead on 30ns at around 128KB and goes up slightly in the 256-512KB range. Sadly they don't publish the raw numbers and pixel counting on log/log graphs is a pain. However, I wrote my own code that produces similar numbers.

My numbers are pretty much a perfect match, if my sliding window is 64 pages (average swap distance = 32 pages) I get around 34ns. If I drop it to 32 pages I get 32ns.

So the M1, assuming a relatively TLB friendly access pattern only keeping 32-64 pages active is about half the latency of the AMD 5950.

So compare the graphs yourself and I can provide more details on my numbers if still curious.
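
For the curious, here is a minimal sketch of the kind of measurement described above (simplified and hypothetical, not the exact code): build a pointer chain that visits every cache line exactly once but only jumps within a sliding window of pages, then time the dependent loads.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define LINE 128                     /* assumed cache-line size (bytes) */
    #define PAGE 16384                   /* assumed page size (bytes)       */
    #define SIZE (1 << 24)               /* 16 MB working set               */

    int main(void) {
        char *buf = malloc(SIZE);
        size_t lines = SIZE / LINE, window = (PAGE / LINE) * 64, i;
        size_t *order = malloc(lines * sizeof *order);

        for (i = 0; i < lines; i++) order[i] = i;
        /* Shuffle only within a sliding window (~64 pages) so the walk is
           random at line granularity but stays TLB friendly. */
        for (i = 0; i + 1 < lines; i++) {
            size_t span = window < lines - i ? window : lines - i;
            size_t j = i + (size_t)rand() % span;
            size_t t = order[i]; order[i] = order[j]; order[j] = t;
        }
        /* Each visited line stores the address of the next one. */
        for (i = 0; i + 1 < lines; i++)
            *(void **)(buf + order[i] * LINE) = buf + order[i + 1] * LINE;
        *(void **)(buf + order[lines - 1] * LINE) = buf + order[0] * LINE;

        void *p = buf + order[0] * LINE;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < lines; i++) p = *(void **)p;   /* dependent loads */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%.1f ns per load (%p)\n", ns / lines, p);
        free(order);
        free(buf);
        return 0;
    }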


This reminds me of the Amiga which had FastRAM and ChipRAM. It was all main memory, but the ChipRAM could be directly addressed by all the co-processor HW in the Amiga and the FastRAM could not.

It would be sort of interesting for Intel/AMD to do something like this where they have 16GB on the CPU and the OS has the knowledge to see it differently from external RAM.

Apple is going to have to do this for their real "Pro" side devices as getting 1TB on the Mx will be a non-starter. I would expect to see the M2 (or whatever) with a large amount of the basic on chip RAM and then an external RAM bus also.


Dunno, rumors claim 8 fast cores and 4 slow cores for the follow up. With some package tweaks I think they could double the ram to 32GB inside the package and leave the motherboard interface untouched.

I do wonder how many use cases actually need more than 32GB when you have low-latency NVMe flash with 5+ GB/sec of bandwidth. Especially with the special magic sauce I've seen mentioned related to hardware acceleration for compressing memory, or maybe it was compressing swap.

In any case, I'm not expecting the top of the line in the next releases. Step 1 is the low end (MBA, MBP 13", and mini). Step 2 is the mid range, rumored to be 14.1" and 16" MBPs in the first half of 2021. After that presumably the Mac Pro desktop and iMacs "within 2 years". Maybe step 3 will be a dual-socket version of step 2 with, of course, double the RAM.


So I am not one of those who screamed about the 16GB limit, which drew a huge number of comments here on HN. That being said, I do know people in the creative industry that have Mac Pros with 1.5TB of RAM and use all of it and hit swap. For a higher-end Pro laptop I would be happy with the 32GB range. However, in something like an 8K-display iMac (8K displays will be coming!) I would like to see 128GB, which will not fit on chip. They are going to have to go to a 2-level memory design at some point.


Maybe, or just move the memory chips from the CPU package to the motherboard.


Oh, that is very much something they could do, but given that they control the OS completely it would be very interesting to keep both on-chip and off-chip RAM and enable the software to understand which RAM is where, letting application developers tweak things. For example, let's say you are editing a very large 8K stream and you tell the app: hey, load this video into RAM. You could put the part that is in the current edit window in the on-chip RAM and feed the rest of the video in from the 2nd-level RAM as the editor moves forward. There are some interesting possibilities here.

Also, from the ASIC yield view it allows for some interesting binning of chips. Let's say the M2 has 32GB on chip plus an off-chip memory controller. They could use the ones that pass in the high end, then bin the ones that fail a memory test as 16GB parts for a laptop, etc. Part of keeping ASIC cost down is building the exact same thing and binning the chips into devices based on yield.


Unless you are doing some crazy synthesis or simulation, 32GB is plenty.

Maybe editing 4K (or in the future, 8K) video might need more?

My brother does a lot of complex thermal airflow simulation, and his workstation has 192GB of RAM, but that is an extreme use case.


8GB MacBook Air can easily handle 4K.

And it can handle 8K for 1-2 streams and starts to lag at 4+.


The Amiga was never multi-core. It has Vampire accelerators to replace the 68K chips, and PowerPC upgrade cards.

Apple, in making the M1 chip, is using some of the Amiga IP from circa 1985 that sped up the system by having the CPU and GPU etc. share memory. Amiga is shattered into different companies, but if they hadn't gone out of business they would have made an M1-type chip for their Amiga brand.


> Assuming a random (but TLB friendly) pattern the M1 manages a latency of around 30-33ns to main memory.

This, right here. It also helps that the L1D is a whopping 128 kB and only 3 cycles of load-use latency.


This is the first I've heard of this. This alone, plus unified memory in general, I bet explains 60% of the performance difference.


I wonder how they managed that.


Huge block size (128 bytes). Probably they are using POWER7-like scheduling (i.e. scheduling works on packs of instructions; that might explain the humongous 600+ entry ROB, since the wake-up logic certainly can't deal with that one-by-one at such low power). If you combine that with JIT and/or good compilers, you get this. I guess only Apple can pull this trick: they control the whole stack (and some key POWER architects are working there).


Big cache lines and big pages together. 16 kB pages combined with 128-byte lines means it can be 8-way set associative and still take advantage of a VIPT structure.
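If you want to check that arithmetic, here is a quick back-of-the-envelope (the cache parameters are the ones reported by Anandtech, not confirmed by Apple):

  /* VIPT sanity check: the set index + line offset must fit inside the
     page offset. Assumed parameters: 128 KB L1D, 128 B lines, 8 ways,
     16 kB pages. */
  #include <stdio.h>

  static int ilog2(unsigned x) { int n = 0; while (x >>= 1) n++; return n; }

  int main(void) {
      unsigned cache = 128 * 1024, line = 128, ways = 8, page = 16 * 1024;
      unsigned sets = cache / (line * ways);               /* 128 sets   */
      int index_bits  = ilog2(sets) + ilog2(line);         /* 7 + 7 = 14 */
      int offset_bits = ilog2(page);                       /* 14         */
      printf("index bits %d vs page offset bits %d -> %s\n",
             index_bits, offset_bits,
             index_bits <= offset_bits ? "VIPT works" : "aliasing problem");
      return 0;
  }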

Larger pages mean that performance on memory-mapped small files will suffer... which is a use-case that Apple doesn't normally care about in its client computers.

Larger cache lines mean that highly mulththreaded server loads could suffer from false sharing more often. Again, its a client computer so who cares?
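A tiny illustration of the false-sharing point (a generic sketch, nothing M1-specific; the 128 is an assumption about the target's line size):

  /* Two counters 64 bytes apart would live on different 64-byte lines,
     but share a single 128-byte line; aligning each one to the line
     size avoids the cache-line ping-pong between writer threads. */
  #include <stdalign.h>
  #include <stdio.h>

  struct counters_bad {
      long a;                 /* written by thread 1 */
      char pad[64];
      long b;                 /* written by thread 2, same 128 B line as 'a' */
  };

  struct counters_good {
      alignas(128) long a;    /* each counter gets its own 128 B line */
      alignas(128) long b;
  };

  int main(void) {
      printf("bad: %zu bytes, good: %zu bytes\n",
             sizeof(struct counters_bad), sizeof(struct counters_good));
      return 0;
  }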

Regarding the definition of "huge": A64FX uses 256B cache lines. Granted its a numerical computing vector machine, but still. Huge covers a lot of ground.


The NVIDIA Carmel cores on 12nm had a 64KB L1D cache with a 2 cycles latency.


Means nothing without saying what the clock goes at.


2.26GHz, on a quite old process.


Latencies like this are doable with a lot of tuning on Intel CPUs; out of the box you'll get to the 40s with fast memory. And those CPUs have three cache levels instead of two...

A good old-fashioned 2010-era gaming PC would already get down to around 50 ns levels.

It's definitely really good, but considering it's rather fast RAM (DDR4 4266 CL16) and doesn't have L3 it's not that surprising.


Apple M1 has three cache levels:

- for big cores, private: 128KB L1D

- for big cores, shared within a cluster: 12MB L2

- system-level cache (shared between everything, CPU clusters, GPU, neural engine...): 16MB

and then you reach RAM.


I've written a benchmark to measure such things, and from what I can tell:

Each fast core has a L1D of 128KB.

The fast cores share a 12MB L2 within their cluster; cache misses go to main memory.

The slow cores have a 4MB L2.

The cache misses from the fast cores' L2 can't quite saturate the main memory system (I believe it's 8 channels of 16 bits). So when all cores are busy you keep 12MB of L2 for the fast cores and 4MB of L2 for the slow cores, and end up getting better throughput from the memory system since you are keeping all 8 channels busy.


Wonder if the SLC is then mostly used for coherency purposes and for the other blocks...

And yeah, it's 128-bit wide LPDDR4X-4266, pretty quick imo.


Not just 128 bits wide (standard on high end laptops and most desktops), but 8 channels. The latency is halved and over the last decades I've only been seeing very modest improvements in latency to main memory on the order of 3-5% a year.
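The headline bandwidth is easy to sanity-check from those figures (assuming the 128-bit / 4266 MT/s numbers are right):

  /* Peak bandwidth = transfer rate x bus width.
     8 channels x 16 bits = 128 bits = 16 bytes per transfer. */
  #include <stdio.h>
  int main(void) {
      double transfers_per_sec  = 4266e6;   /* LPDDR4X-4266 */
      double bytes_per_transfer = 16.0;     /* 128-bit bus  */
      printf("peak ~%.1f GB/s\n",
             transfers_per_sec * bytes_per_transfer / 1e9);  /* ~68.3 */
      return 0;
  }

Which lines up with the 60-68 GB/s figures quoted elsewhere in the thread.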


Or maybe use the best of both worlds, with soldered-in ultra-fast RAM plus a large amount of DIMM RAM.

Same as you can have a big storage drive and a smaller NVMe.


> Or maybe use the best of both worlds, with soldered-in ultra fast ram

That's basically what L3 cache is on Intel & AMD's existing CPUs. You could add an L4, but at some point the amount of caches you go through is also itself a bottleneck, along with being a bit ridiculous.


The way I see it, you could have a Mac Pro with (let’s say) 32GB of super-fast on-package RAM and arbitrarily upgradable DIMM slots. The consequence would be that some RAM would be faster and some would be a bit slower.

They would be contiguous, not layered.


The non-uniform memory performance of such a solution would be a software nightmare.


Doesn't seem much different than various multichip or multisocket solutions where different parts of memory have different latencies, called NUMA. Basically the OS keeps track of how busy pages are and rebalances things for heavily used pages that are placed poorly.

Similarly, Optane (in DIMM form) is basically slow memory, and OSes seem to handle it fine. NUMA support seems pretty mature today and handles common use cases well.
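For what it's worth, the plumbing for this kind of placement already exists on Linux. A minimal sketch with libnuma (the node numbers are made up, and this assumes firmware that exposes the fast and slow pools as separate NUMA nodes):

  /* Ask for memory from specific NUMA nodes, e.g. a hypothetical fast
     on-package node 0 vs. a slow off-package node 1.
     Build with: cc numa_demo.c -lnuma */
  #include <numa.h>
  #include <stdio.h>
  #include <string.h>

  int main(void) {
      if (numa_available() < 0) {
          puts("no NUMA support on this system");
          return 1;
      }
      size_t sz = 64 << 20;                       /* 64 MB                    */
      void *fast = numa_alloc_onnode(sz, 0);      /* request node 0           */
      void *slow = numa_alloc_onnode(sz, 1);      /* request node 1           */
      if (fast) memset(fast, 0, sz);              /* touch to place the pages */
      if (slow) memset(slow, 0, sz);
      printf("highest node: %d, fast=%p slow=%p\n",
             numa_max_node(), fast, slow);
      if (fast) numa_free(fast, sz);
      if (slow) numa_free(slow, sz);
      return 0;
  }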

With all that said, Apple could just add a second CPU to double the RAM and cores; seems like a great fit for a Mac Pro.


It doesn't seem any worse than existing NUMA systems today, where memory latency depends on what core you're running on. In contrast, the proposed system would have the same performance for on-board vs plugged DIMM regardless of which CPU is accessing it, which simplifies scheduling — from a scheduling perspective, it's all the same. I think that's easier to work with than e.g. Zen1 NUMA systems.


OSes have had this problem solved for decades; the solution is called "swap files". You could naively get any current OS working in a system with fast and slow RAM by simply creating a ramdisk on the slow memory address block and telling the OS to create a swap file there.


> OSes have had this problem solved for decades; the solution is called "swap files".

What operating systems handle NUMA memory through swapping? The only one I'm familiar with doesn't use a swapping design for NUMA systems, so I'm curious to learn more.


Not really the best idea for the kind of speed baselines and differences discussed here. You can use better ideas like putting GPU memory first in the fast part then the rest in the slow area. You know, like XBox Series does.


Yet Apple is managing excellent performance with just an L1+L2.


But the context of this thread is that it is being done with soldered RAM. I don't know how much that matters, just pointing out that you are taking the conversation in a circle.


>"Maybe motherboards should stop coming with dimms and use the apple approach to get great bandwidth and latency and come in 16, 32, and 64GB varieties by soldering LPDDR4x on the motherboard."

I prefer the upgradeability.


> I prefer the upgradeability.

The market doesn't care about such niche concerns, but it'll not flip completely overnight.

I like the idea of upgradeability too, but when the trade-off is such great performance, I'm not going to give that up. It would be different if the performance numbers were not as stark.

The open question is how long this performance advantage will be sustained. If it drops off, then concerns like upgradeability may become higher priority (and an opportunity for hardware vendors).


Yes, this is a niche concern... called environmental protection. The new stuff cannot be upgraded, so when the amount of RAM stops being sufficient, the old computer needs to be recycled (a modern word for throwing something into the waste bin, together with all the CO2 that was emitted when the computer was produced, not to mention the environmental costs of digging up rare earths, etc.).

I am still able to use my Lenovo ThinkPad 510T only because I could easily replace the HDD with a cheap, stock Samsung EVO SSD and throw in more RAM.

The absurdity of Apple's approach is that the Mac Mini with a 512 GB SSD is $200 more expensive than the one with 256 GB. 256 GB for $200 is a crazy price, so Apple basically says: hey, pay us a lot, lot more, so maybe you can use our stuff a bit longer; but in fact we actively discourage you from doing this, since we want you to buy the cheaper model, and in two-three years you will need to buy a new, fancier model.

But Tim Cook will tell you a lot about how much he cares about humanity, the environment and CO2 emissions. Maybe he will even fly his private jet to some conference to tell people how awful all that oil & gas & coal industry is.


On-package RAM and upgradability are not at odds with each other. If upgradability were desired, we could see socketed SoCs. This is one of the things modular phones (Project Ara) were about.

Just because RAM was one of the last holdouts of upgradability does not make it inherently more suitable for upgradability.

The problem is a lack of interest in manufacturing repairable and upgradable hardware. It is simply less profitable.


>"The market doesn't care about such niche concerns, but it'll not flip completely overnight."

I do not care all that much about the market either. When I need something it always seems to be there, and I do not mind if it is not produced by the biggest guy on the block. If at some point I can't find what I need I'll deal with it, but so far that has not happened.


Bandwidth is comparable with other high-end laptop chipsets (I've seen 60-68 GB/s quoted, and recent Ryzens are 68 GB/s). Is the on-chip latency a big factor in the single core performance?


Depends on the workload. Compilers are famous for being branch heavy with random lookups... something for which people have reported excellent performance on the M1. Parsing is hard as well (say, JavaScript).

Of course for any CPU workload it's going to be harder on the memory system when you have video going on. Doubly so for a video conference (compression, decompression, updating the video frames, streaming the results, network activity, reading from the camera, etc).

Seems like the Apple memory system wouldn't have received as much R&D as it did unless Apple thought it was justified. Clearly the speed, performance, and price show that Apple made quite a few good decisions.


Memory usage has definitely stalled over the last decade as more applications move to the web or mobile devices.

There's just nothing driving most people to have more than 8/16GB of RAM and even photo/video editing has been shown to be a breeze on the 8GB MacBook Air.

I wouldn't be surprised to see laptops move to soldered RAM and SSDs.


> Memory usage has definitely stalled over the last decade as more applications move to the web or mobile devices.

I beg you to look at memory consumption of those "lightweight" web apps


We should definitely not continue Apple's approach of soldering every damn component, even if it comes at the cost of performance


> even if it comes at the cost of performance

Why? What's the purpose of artificially limiting performance when one doesn't need the upgradability?

I've, personally, never upgraded the RAM on any system I've built or carried it to a new motherboard with a new socket. I'm absolutely the target audience for this. I would love this increased performance, as long as it wasn't some surprise. Having the extra plastic on the motherboard is literally e-waste for me. Don't touch my PCI-e slots though.


Used to be I'd upgrade my MBP memory and hard drive to eke out one more year between upgrades. The drive could always come back and be reused as a portable drive, and the best memory for an old machine was typically cheap enough by then that it wasn't that big of a deal.


The best present is receiving something you never knew you needed until you get it, so I love giving RAM (and SSD) for birthdays! That you can keep the same computer but that it simply becomes faster is a nice surprise for many.


Components have flaws, or they break down over time, and soldering components hampers repair and reuse.


I suggest you look up "integrated circuits" and "system on a chip", which is where all of our performance/power improvements have come from. You're in for a shock when it comes to repairability!


Not sure why you're being downvoted, it's completely true! If the SSD in my computer dies, I can just buy another one for cheap (500GB for what, 80 dollars?).

If the SSD in my Macbook/Mac Mini dies, either I can buy a new motherboard, or more likely, a new device. It is not economical nor ecological.

Also, paying 200 dollars for additional 256GB of storage? WTF.


Dunno, increasingly with machine learning, more cores, GPUs, etc the bottleneck is the memory system. How much are you willing to pay for a dimm slot?

Personally I'd rather have half the latency, more bandwidth, and 4x the memory channels instead of being able to expand ram mid life.

However I would want the option to buy 16, 32, and 64GB up front, unlike the current M1 systems that are 8 or 16GB.


Then, make desktops/laptops with 4 or 8 channels. We'd need more dimms, of a smaller size.


Only if you use dimms. If you use the LPDDR4x-4266 each chip has 2 channels x 16 bits. So the M1 has 4 chips and a total of eight 16 bit wide channels.


Does the 8GB variant have all 8 channels or just 4?


My guess is it's the same and using the half density chip in the same family, but I'm just guessing.


That extra memory will cost you an arm and a leg.


My understanding is that the LPDDR4x chips cost less per GB than the random chips you find in common DIMMs. There are also costs (board space, part cost, motherboard layers, and layout complexity) for DIMM slots.

Sure, manufacturers might try to charge significantly more than market price for on-the-motherboard RAM, but it's an opportunity to increase their profit margin and ASP. Random 2x16GB DIMMs on Newegg cost $150 per 32GB. Apparently LPDDR is easier to route, requires less power, and costs less for the same amount of RAM. I'd happily pay $500 for a motherboard with 64GB of LPDDR4x-4266. Seems like Asus, Gigabyte, Tyan, Supermicro and friends would MUCH rather sell a $500 motherboard with RAM than a $150 motherboard without.


The normal rate (not contract price) for LPDDR4 / LPDDR4X and LPDDR5 is roughly double the cost of DRAM per GB. Depending on channels and package, the one used in the M1 is likely even more expensive as they fit 4 channels per chip. DIMMs and board space add very little to the total BOM.


Ah, I had heard differently, for the same clock rate?

In any case the Apple parts, from what I can tell, are:

https://www.skhynix.com/products.do?lang=eng&ct1=36&ct2=40&c...

In particular this one:

https://www.skhynix.com/products.view.do?vseq=2271&cseq=77


If you don’t want it, just don’t buy it, but please don’t tell other people what they should or should not like or need.


Apple's (and everyone else's) anti-repair stance (both in terms of design and in policy) is harming the environment and generating tons of e-waste. What's wrong with expressing a view that helps the planet?


Because it’s just virtue signalling, not actual environmentalism. What matters environmentally is aggregate device lifetime, so you get the most use out of the materials. Apple devices use a minimum of materials and have industry leading usable lifetimes. They are also designed to be highly recyclable.

Greenpeace rated Apple the number 1 most environmentally friendly of the big technology companies.

https://www.techrepublic.com/article/the-5-greenest-tech-com...


  Apple devices use a minimum of materials and have industry leading usable lifetimes.
Their phones have far longer lifetimes for sure, but their laptops? I would like to see evidence of that. Outside of the mostly cheaply made laptops, most laptop/desktop computers can have very long secondary lives. Linux/Windows can run on some very old (multiple decades) machines.


> Because it’s just virtue signalling

Not buying a new MBP and instead passing the old one on to children in third-world countries qualifies as "virtue signaling" now?


Promoting reuse and repair is environmentalism. Preventing repair (as Apple does) generates more e-waste. There really is no way around that fact.

>What matters environmentally is aggregate device lifetime, so you get the most use out of the materials. Apple devices use a minimum of materials and have industry leading usable lifetimes. They are also designed to be highly recyclable.

Reuse and repair are FAR superior to recycling, which actually wastes a lot of energy, in addition to generating e-waste for the parts which are not recycled.

>Greenpeace rated Apple the number 1 most environmentally friendly of the big technology companies.

What good does it do? They are still harming the environment.


> Preventing repair (as Apple does) generates more e-waste. There really is no way around that fact.

There are plenty of ways around that fact.

Preventing repair while changing nothing else generates more e-waste. But that's not what Apple does.

If preventing repair lets you do any or all of the following things to a sufficient degree, the result is less e-waste than if you hadn't prevented repair:

- Use less environmentally harmful materials (e.g. on-board sockets, larger PCBs etc)

- Make the device last longer before it needs repair (reliability, longevity)

- Make the device easier to recycle

> Reuse and repair is FAR superior to recycle

It's a good goal, but it's only superior for sure if everything else is able to be kept the same to make it possible.

Some things really are better for the environment melted down and ground down and then rebuilt from scratch. I'm guessing big old servers running 24x7 are in this category: Recycling the materials into new computers takes a lot of energy, but just running the old server takes a huge amount of energy over its life compared with the newer, faster, more efficient ones you could make from the same materials. I would be surprised if not recycling was less harmful than recycling.

> What good does it do? They are still harming the environment.

When saying Apple should change the way they manufacture to be more like other manufacturers for environmental benefit, Apple being rated number 1 tells you that the advice is probably incorrect, as following it would probably cause more environmental harm, not less.


>- Make the device last longer before it needs repair (reliability, longevity)

If Apple makes devices that last so long, then how come Apple's own extended warranty program generates billions of dollars of revenue? Note that this doesn't include third-party repair shops. To me, this indicates a large industry dedicated to repairing Apple products - hardly a niche - and that a large number of Apple devices need repair, something Apple is hostile to.

Also, while AppleCare is easy and convenient for the customer, Apple's "geniuses" do not do board repair; they simply replace and throw away broken logic boards (which sometimes need nothing more than a 10-cent capacitor). As if that weren't bad enough, they actively prevent other businesses from performing component-level repair by blocking access to spare parts.

> I'm guessing big old servers running 24x7 are in this category: Recycling the materials into new computers takes a lot of energy, but just running the old server takes a huge amount of energy over its life compared with the newer, faster, more efficient ones you could make from the same materials. I would be surprised if not recycling was less harmful than recycling.

If that was the case, then of course, we should recycle. Maybe we should have a case-by-case approach depending on specific products? I'm totally willing to go wherever the evidence leads us. As of now pretty much every single environmental organization promotes reuse over recycling for electronics.

>When saying Apple should change they way they manufacture to be more like other manufacturers for environmental benefit, Apple being rated number 1 tells you that the advice is probably incorrect,

I merely accepted the "number one" in good faith at face value. Digging further with a cursory Google search, things seem a lot more nuanced. That being said, I have no idea what "number one" even means without context.

https://www.fastcompany.com/40561811/greenpeace-to-apple-for...


> If Apple makes devices that last so long,

I don't want to back the idea that Apple does make reliable or long-lived devices, although I'm very happy with my 2013 MBP still. I honestly don't know how reliable they are in practice, although they do seem to keep market value for longer than similar non-Apple devices, and they have supported them with software for a long time (my 2013 is still getting updates).

And I would love to be able to add more RAM to my 2013 MBP, which has soldered-in RAM; and I would love if it were easier to replace the battery, if the SSD were a standard fast kind that was cheap to get replacements for, and if I could have replaced the screen, which has a stuck pixel due to a screen coating flaw. So I'm not uncritical of the limitations that come with the device.

I'm only disputing your assertion that preventing repair and reuse of parts inevitably generates more e-waste. It's more nuanced than that.

Of course wherever and in whatever ways we can find to repair, reuse and recycle we should.

But there will always be some situations, especially with high-end technology, where repair and reuse needs more extra materials, components, embodied energy and complexity (and subtle consequences like extra weight adding to shipping costs) resulting in a net loss for the environment.

An extreme example but one that's so small we don't think of it is silicon chips. There is no benefit at all in trying to make "reusable" parts of silicon chips. The whole slab is printed in one complex, dense process. As things like dense 3D printing and processes similar to silicon manufacture but for larger object components come online, we're going to find the same factors apply to those larger objects: It's cheaper (environmentally) to grind down the old object and re-print a new one, than to print a more complex but "repairable" version of the object in the first place.


> - Make the device last longer before it needs repair (reliability, longevity)

2016 MBP owners will appreciate the joke!


Very good point!

I don't want to back the idea that Apple does make long-lived laptops, only that it's hypothetically possible they do sometimes :-)

My 2013 MBP is still going strong thankfully, I'm very happy with it still after all these years.


My last 2013 MBP is still alive only because I was able to source a third-party battery / power connector...

Though, somehow, if I trust some argument made here, it would be better for the environment to buy a whole new laptop rather than fix the existing one... Though, I'm not doing it for the environment, I'm just cheap as f*ck.


> if I trust some argument made here, it would be better for the environment to buy a whole new laptop rather than fix the existing one

No, I don't think that argument is being made by anyone.

The argument being made is that to make the laptop more able to have replaceable components could potentially require more environmental costs up front in making that laptop.

I doubt that argument works for the power connector. I suspect that's more to do with making sure Magsafe is really solid, but it might for the battery due to the pouch design instead of extra battery casing, I'm not sure.

There's no question that if you can repair it afterwards you probably should.

By the way, literally all my other laptops either died due to the power connector failing, or I repaired the failing power connector. Sometimes I had to replace the motherboard to sort out the power connector properly, which seems like poor design.

The Apple has been the only one that hasn't failed in that way, which from my anecdote of about 5 laptops says Apple's approach has worked best from that point of view so far. Of course Apple power supply cables keep fraying and needing to be replaced, so it balances out :-)


For a manufacturer on the whole it’s a negligible issue. It’s simply a fact that Apple devices have longer average lifetimes and lower overall environmental impact than any of the other manufacturers. Hence the Greenpeace rating. If you actually care about the environment, as you claim, the choice is clear.

What you are doing is picking a single marginal factor that can make a difference in rare cases, but is next to irrelevant in practice, and raising that above the total environmental impact of the whole range of devices. That’s just absurd.


> It’s simply a fact that Apple devices have longer average lifetimes

I've used the same desktop for the last 6 to 8 years, upgrading with STANDARDIZED components over the years, and my laptop from that era still works. Heck, I've got an 18-year-old ThinkPad still working fine.

In the mean time, two MBP died on me. Try again...


Do you care about the overall ecological footprint of Apple, as Greenpeace does, or only a few specific devices in particular? How do you evaluate likely future device lifetimes and ecological footprint: cherry-picked statistics or manufacturer track record?

Should I take your evaluation on this, or trust a detailed whole-enterprise evaluation by Greenpeace?


Apple's approach uses less material overall. For the vast majority of the machines, which A) don't fail and B) are never upgraded in any case, the Apple method of getting rid of sockets reduces the e-waste burden.


Yet it’s Apple’s devices that last the longest and have the highest resale values.


Hermes handbags also have high resale value. That tells us nothing. Apple's anti-repair approach absolutely harms the environment. Certainly they are not alone in this, many/most electronics these days are irreparable. But Apple is actively hostile to the repair industry, which makes them more deserving of criticism.


The repair industry in this case is hostile to the environment. They are incentivized to want computers to break so that they can sell repair services.

It turns out that soldering parts in place makes them less likely to break than a socket whose connections can oxidize or come loose.

The tiny number of devices that can’t be repaired because of soldered components, is dwarfed by the number of devices that never broke in the first place because of soldered components.


> It turns out that soldering parts in place makes them less likely

You've obviously never heard of MBP BGA chip solder balls cracking and rendering the whole device useless...


I didn’t say they never failed. Just that they fail less frequently.

In any case, solder ball cracking results from process issues and is a solved problem: https://www.pcbcart.com/article/content/reasons-for-cracks-i...

Certainly not something that would be improved with sockets.


Funny, the only refurbished computer, phone & tablet chain in NL has a pure Apple offering.


Unfortunately, even Intel's white-label laptop specs soldered RAM, so I expect the trend to continue in low/mid range PC laptops.

https://www.theverge.com/2020/11/19/21573577/intel-nuc-m15-l...


Maybe for some consumer devices we should try it? Clearly the results are excellent. Most people don't open up and modify their laptops.


> Just nobody believed those Geekbench benchmarks showing that in short benchmarks your phone could be faster than your laptop.

Except for a lot of the Apple-centric journalists and podcasters, who have been imagining for years how fast a desktop built on these already-very-fast-when-passively-cooled chips could be.

Not that benchmarks matter very much if experienced, real-world workload performance suffers, but as far as I can tell, the M1 is no slouch in that respect either.


I heard 10 years ago, or whatever, that the iPad 2 had the most power-efficient CPU available, period. This was said at a keynote by an HPC scientist who cannot be called an «Apple journalist». Apple have been doing well for a very long time and I've expected this moment since that keynote, basically.


OK, I misspoke. I heard the sentiment from the Apple-focused voices in my information bubble, doesn't mean that nobody else said it. It's just that "nobody believed those Geekbench benchmarks" is not completely true.


Yeah, Apple’s CPUs have been doing really well for a while now: https://www.cs.utexas.edu/~bornholt/post/z3-iphone.html


> Just nobody believed those Geekbench benchmarks showing that in short benchmarks your phone could be faster than your laptop.

I saw a paper on (I think?) SMT solvers on the iPhone. They turned out to be faster than laptops; I kind of brushed it off as irrelevant at the time.



I did believe those benchmarks. I also knew that a sustained load at those speeds, if not throttled down, would just melt the thing, so yes, they were irrelevant.


Maybe in practice at the time, but in hindsight they were actually a valid indication of the true performance capabilities of the architecture. It’s just that a phone has too little thermal capacity to sustain the workload.


>"Maybe in practice at the time"

Exactly this. If I am shopping now I do not care how a particular CPU/architecture evolves in the future. I only care about what it can practically do now and at what price. As it is now, the M1 has 4 fast cores and a non-upgradeable maximum of 16GB. For many people that is more than they will ever need. Let them be happy. For my purposes I am way happier with a 16-core new AMD coupled with 128GB of RAM (my main desktop at the moment). It runs at a sustained 4.2 GHz without any signs of thermal throttling. The cooler keeps it at 60C.


As well as too little battery capacity to sustain the workload.

Applying those cell phone methods to a laptop with better cooling and a larger battery was a win.


Another detail that came out today is just how beefy Apple's "little" cores are.

>The performance showcased here roughly matches a 2.2GHz Cortex-A76 which is essentially 4x faster than the performance of any other mobile SoC today which relies on Cortex-A55 cores, all while using roughly the same amount of system power and having 3x the power efficiency.

https://www.anandtech.com/show/16192/the-iphone-12-review/2


Incredible. Now it should be MiDdLe core.


Spot on. Exactly this. It’s like pre-iPhone when people just assumed you had a laptop and a cellphone. Then Apple said “phone computer!” and changed the game. Same with iPad just less innovation shock. Meanwhile we continued to have this delineation of computer / phone while under the hood - to a hardware engineer - it’s all the same. Naturally the chips they produced for iOS-land are beasts. My phone is faster than the computer I had 5 years ago. My M1 air is just a freak of nature. On par with high end machines but passively cooled and cheaper. I’m still kinda in awe. Not a big fan of the hush hush on Apple Silicon causing us all to play catch-up for support, but that’s Apple’s track record I guess.

The M1 is all the things they learned from the A1-A12 chips (or whatever the ordering) which is over a decade of tweaking the design for efficiency (phone) while giving it power (iPad).


> the M1 is fast isn’t due to technical tricks, but due to Apple throwing a lot of hardware at the problem.

Apple threw more hardware at the problem and they lowered the frequency.

By lowering the frequency relative to AMD/Intel parts, they get two great advantages. 1) they use significantly less power and 2) they can do more work per cycle, making use of all of that extra hardware.


> Unlike what has been said on Twitter the answer to why the M1 is fast isn’t due to technical tricks, but due to Apple throwing a lot of hardware at the problem.

Of course Apple’s advantage is not solely due to technical tricks, but neither is it entirely, or even mostly due to an area advantage. If it were so easy, Samsung’s M1 would have been a good core.


Yeah, the article is interesting. I just like knowing that Apple will keep iterating and the performance gap between Apple Silicon and x86 will continue to grow. I keep spec'ing out an Apple M1 Mac mini only to not pull the trigger, because I am curious what an M2 will hold.


The nice thing about Apple hardware resale value is that you can just buy now and upgrade later. I just unloaded a five year old MBP for a lot more than I'd ever have gotten for a five year old PC laptop. Ordered myself a fancy new M1 laptop, and if M1X/M2/whatever-it-is-called is substantially better, then I'll just trade it in on a new model. There's always a new hotness coming next year.


The rumors I've read so far is that the next chip will be the M1-X, and will be a 12-core CPU, with 8 "high performance" cores and 4 "efficiency" cores, and will be released in a 16" MBP. The current M1, by contrast, has an 8-core CPU (4 HP cores and 4 efficiency cores).


Is this with the 16" rumored first half of next year? I hadn't seen the one you're referring to. I'm contemplating options with my current 2018 15" which is about to receive its 3rd top case replacement (besides other issues), and whether I want to press for a full replacement and how long to wait, as AC expires next June.


I saw that rumor as well. Looks compelling. What I’m holding out for is what they bring to the rumored redesigned 16 inch MacBook Pro and the rumored 14 inch MacBook Pro. I think I’ll choose between one of those.


I think the old adage, "don't buy first-generation Apple products", applies here. Seems sensible to wait and see where this goes.


Hardly worth not doing it. The stuff depreciates far slower.

I’ve been using an M1 mini as my desktop for over a week now and am impressed.


Personally I'm just waiting for the software to catch up and maybe the landscape to stabilise. It's hardly irrational to wait and get the 2nd gen.


Yes true but as the respondent said above I’m just waiting for the software to catch up and maybe a redesign on the laptop side. In the mean time I’m saving for one.


Hah, won't this perpetuate? Whenever M(N) is released M(N+1) will be on the horizon with even greater promise.


Yes, been doing that for about 30 years. Now and then you just have to take a leap. I usually buy what was the best last year at a bargain price instead.


This is a good year for getting last year's stuff, given how low the supply is on most of the high-profile hardware (new Ryzens, current-generation dedicated graphics, game consoles).

I did get something launched this year that reviewed well - a gaming laptop (Legion 5) - but it was not hard to find and I even got it used, like-new. Perhaps because nobody's travelling now.


The problem is that a phone with a lightning fast CPU is rather useless with current ecosystems.

I do think there are "technical tricks" though, especially the x86-compatible (total store ordering) memory mode that makes x86 emulation faster than on comparable ARM chips. Whether you call it a trick or finesse is probably a matter of perspective.


Not mentioned in the article is the downside of having really wide decoders (and why they're not likely to get much larger). Essentially the big issue in all modern CPUs is branch prediction, because the cost of misprediction on a big CPU is so high. There's a rule of thumb that in real-world instruction streams there's a branch every 5 instructions or so. That means that if you're decoding 8 instructions, each bundle has 1 or 2 branches in it; if any are predicted taken you have to throw away the subsequent instructions. If you're decoding 16 instructions you've got 3 or 4 branches to predict, and the chance of having to throw something away gets higher as you go... there's a law of diminishing returns that kicks in, and it has probably kicked in at 8.
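You can put rough numbers on that with a toy model. Assume (my assumptions, not measured data) a branch every 5 instructions and roughly half of them predicted taken, so each decode slot ends the useful part of a fetch group with probability about 0.1:

  /* Expected useful instructions per fetch group of width W when each
     slot is a taken branch with probability q (toy model, not real data):
     E[useful] = sum_{i=0}^{W-1} (1-q)^i                                  */
  #include <stdio.h>
  int main(void) {
      double q = 0.2 * 0.5;              /* 1-in-5 branches, ~50% taken */
      int widths[] = {4, 8, 16};
      for (int w = 0; w < 3; w++) {
          double useful = 0, p = 1;
          for (int i = 0; i < widths[w]; i++) { useful += p; p *= 1 - q; }
          printf("decode %2d-wide -> ~%.1f useful slots\n", widths[w], useful);
      }
      return 0;
  }

In this model, going from 8-wide to 16-wide roughly doubles the decode hardware but only adds about 2.5 useful slots per group, which is the diminishing-returns curve described above.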


The 5nm process is a large factor as well.


...and Apple was able to throw hardware at the problem because they got TSMC's manufacturing process. When everyone else is using 5nm, let's see if any of this other stuff actually matters.


The M1 is also the only TSMC 5nm chip that is widely available, and there is nothing remotely comparable from a process standpoint.


Are they really throwing more hardware at the problem?

The die size for the whole M1 SoC is comparable to or even smaller than Intel processors, and the vast majority of that SoC is non-CPU related stuff; the CPU cores/cache/etc seem to be at most 20% of that die - though a denser die because of the 5nm process. This also seems to imply that the transistor 'budget' for the CPU part of the SoC is comparable to previous Intel processors, not a significant increase. (Assuming 20% of the 16B transistors in the M1 is the CPU part, it would be 3-ish billion transistors; Intel does not seem to publish transistor counts but I believe it's more than that for the Intel i9 chips in last year's MacBook Pros.)

Perhaps my estimates are wrong, but it seems that they aren't throwing more hardware, but managing to achieve much more with the same "amount of hardware" because it is substantially different.


Well you don't have any of the AVX512 nonsense, but probably the OOO of the M1 uses more transistors than on an Intel chip. And Google tells me that a quad core i7-7700K has 2.16 B.


The next question: What prevents Intel or AMD from doing this on their processors?


The article specifically answers this:

- x86 instruction set can't be queued up as easily because instructions have different lengths - 4 decoders max, while Apple has 8 and could go higher.

- Business model does not allow this kind of integration.


> instructions have different lengths

also allows extremely long instructions; the ISA allows up to 15 bytes and faults at 16 (without that artificial limit you could create arbitrarily long x86 instructions).
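To make that concrete, redundant prefixes count toward the limit, so you can pad a one-byte instruction right up to it (my own illustration; the byte values are standard x86 encodings):

  /* A NOP (0x90) preceded by 14 redundant operand-size prefixes (0x66)
     is a legal 15-byte x86 instruction; prepend one more 0x66 and the
     CPU faults because the instruction would exceed 15 bytes. */
  #include <stdio.h>

  static const unsigned char padded_nop[15] = {
      0x66, 0x66, 0x66, 0x66, 0x66, 0x66, 0x66,
      0x66, 0x66, 0x66, 0x66, 0x66, 0x66, 0x66,
      0x90                                        /* nop */
  };

  int main(void) {
      for (int i = 0; i < 15; i++) printf("%02x ", padded_nop[i]);
      printf("(%zu bytes)\n", sizeof padded_nop);
      return 0;
  }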


What a nightmare, but it makes me wonder: rather than decoding into micro-ops at runtime, could Intel or AMD "JIT" code up-front, in hardware, into a better bytecode?

I'm sure it wouldn't work for everything, but why wouldn't it be feasible to keep a cache of decoding results by page or something?


This is exactly how the hardware works and what micro-ops are: on any system with a µop cache or trace cache, those decoded instructions are cached and used instead of decoding again. Unfortunately you still have to decode the instructions at least once first, and that bottleneck is the one being discussed here. This is all transparent to the OS and not visible outside a low-level instruction cache, which means you don't need a major OS change, but arguably if you were willing to take that hit you could go further here.


So what stops x86 from adding the micro-ops as a dedicated alternate instruction set, Thumb style? Maybe with the implication that Intel will not hold the instruction set stable between chips, pushing vendors to compile to it on the fly?


Micro-ops are usually much wider than instructions of the ISA. They are usually not a multiple of 8 bits wide either.

A dedicated alternative instruction set would be possible, but that would take die space and make x86_64 plus the new set even harder to decode.


From what I understand this is exactly what the instruction decoder does.


They do something similar for 'loops'. The CPU doesn't decode the same instructions over and over again; it just uses them from the 'decoded instruction cache', which has a capacity of around 1,500 decoded micro-ops.


Hmm, this reminds me of Transmeta https://en.wikipedia.org/wiki/Transmeta


They do this in a lot of designs. It's called a micro op cache, or sometimes an L0I cache.


I think the latter is the biggest challenge.

I imagine that Apple's M1 designers are using what they know about macOS, what they know about applications in the store, and what user telemetry macOS customers have opted into, all to build a picture of which problems are most important to solve for the type of customer who will be buying an M1-equipped macOS device. They have no requirement to provide something that will work equally well for server, desktop, etc. roles, or for Windows and Linux, and they have a lot more information about what's actually running on a day-to-day basis.


They say in the article that AMD "can't" build more than 4 decoders. Is that really true? It could mean:

* we can't get a budget to sort it out

* 5 would violate information theory

* nobody wants to wrestle that pig, period

* there are 20 other options we'd rather exhaust before trying

When they've done 12 of those things and the other 8 turn out to be infeasible, will they call it quits, or will someone figure out how to either get more decoders or use them more effectively?


Their business model allowed them to integrate a GPU and a video decoder. Of course it allows for this kind of integration. The author is not even in that industry, so a lot of his claims are fishy. Moore's law is not about frequency, for example.


I think what they mean is the lack of coordination between software and hardware manufacturers + the unwillingness of Intel/AMD etc. to license their IP to Dell etc. What is untrue about that?

On Moore's law, yes it's about transistors on a chip, but I think the point they're making is that Moore's law created a situation where you couldn't ramp up frequency anymore because the transistors are so small.


> What is untrue about that?

The fact that they don't need to license technology. They can bring more functionality into the package that they sell to Dell, etc. like they have already done.

> but I think the point they're making is that Moore's law created a situation where you couldn't ramp up frequency anymore because the transistors are so small.

That is not the point they are making. Clock frequencies have not changed since the deep-pipelined P4, but transistor count has continued to climb. Here is what the author, who clearly does not know what he is talking about, said about that:

"increasing the clock frequency is next to impossible. That is the whole 'End of Moore’s Law' that people have been harping on for over a decade now."


Seems like backward compatibility completely rules this out. Apple can provide tools and clear migration paths because the chips are only used in their systems. Intel chips are used everywhere, and Intel has no visibility.


I wonder if in 3 or 4 years there will be a new chipmaker startup which offers CPUs to the market similar to the M1. If Intel or AMD won't do it that is.


The problem is competing with the size of Apple (and Intel and AMD). The reason there are so few competing high-performance chips on the market is that it is extremely expensive to design one. And you need a fab with a current process. For many years Intel beat all other companies because they had the best fabs and could afford the most R&D. Now Apple has the best fab - TSMC's 5nm - and certainly gigantic developer resources they can afford to spend, as the chip design is pretty similar between the iPhones, iPads and the Macs.

And of course, as mentioned in the article, any feature in the Apple Silicon will be supported by MacOS. Getting OS support for your new SOC features in other OSes is not a given.


Qualcomm could come out with an M1 class chip, they have the engineering capability, but if Microsoft or Google don’t adopt it with a custom tuned OS and dev tooling customised for that specific architecture ready from day one, they’d lose a fortune.

The same goes in the other direction. If MS does all the work on ARM Windows optimised for one vendor's experimental chip and the chip vendor messes up or pulls out, they'd be shivved. It's too much risk for a company on either side of the table to wear on their own.


Sadly Qualcomm hasn't developed its own CPU cores since the ARMv8 transition, and Samsung stopped developing its own cores. So only ARM's core designs are available for mobile.


The article answers this: x86 has variable-sized instructions, which makes decoding much more complex and hard to parallelize since you don't know the starting point of each instruction until you read it.
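A toy way to see the difference (not a real length decoder, just the shape of the dependency):

  /* With fixed 4-byte instructions, the start of instruction i is just
     4*i, so 8 decoders can all start immediately. With variable lengths,
     start i+1 isn't known until instruction i has been (at least partly)
     decoded. insn_length() is a placeholder; real x86 length decode has
     to chew through prefixes/opcode/modrm first. */
  #include <stdio.h>

  static size_t insn_length(const unsigned char *p) {
      return 1 + (p[0] % 15);           /* pretend lengths of 1..15 bytes */
  }

  int main(void) {
      unsigned char code[64] = {3, 7, 1, 12, 5, 9};   /* dummy bytes */
      size_t fixed[8], variable[8], off = 0;

      for (size_t i = 0; i < 8; i++)
          fixed[i] = 4 * i;             /* all 8 offsets known in parallel */

      for (size_t i = 0; i < 8; i++) {  /* a serial chain of dependencies */
          variable[i] = off;
          off += insn_length(code + off);
      }
      printf("8th start: fixed=%zu, variable=%zu\n", fixed[7], variable[7]);
      return 0;
  }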


AMD until very recently (and Intel almost 25 years ago) used to mark instruction boundaries in the L1 I$ to speed up decode. They stopped recently for some reason though.


For those unaware, ARM (as with many RISC-like architectures) uses 4 bytes for each instruction. No more. No less. (THUMB uses 2, but it's a separate mode.) x86, OTOH, being CISC-based in origin, has instructions ranging from a single byte all the way up to 15 bytes[a].

[a]: It is possible to write instructions that would be 16 or more bytes, but the micro architecture will “throw” an illegal instruction exception if it encounters one. Intel doesn’t say why there’s a limit, but they do mention (in the SDM) that they may raise or remove the limit in the future.

Why 15 bytes? My guess is so the instruction decoder only needs 4 bits to encode the instruction length. A nice round 16 would need a 5th bit.


> It is possible to write instructions that would be 16 or more bytes, but the micro architecture will “throw” an illegal instruction exception if it encounters one. Intel doesn’t say why there’s a limit

I remember noticing on the 80286 that you could in theory have an arbitrarily long instruction, and that with the right prefixes or instructions interrupts would be disabled while the instruction was read.

I wondered what would happen if you filled an entire segment with a single repeated prefix, but never got a chance to try it. Would it wrap during decoding, treating it as an infinite length instruction and thereby lock up the system?

My guess is that implementations impose a limit to preclude any such shenanigans.


You could encode 16 bytes–there's no need to save a slot for zero.


I honestly don’t know how the processor counts the instruction length, so it was only pure speculation on my part as to why the limit is 15. Maybe they naively check for the 4-bit counter overflowing to determine if they’ve reached a 16th byte? Maybe they do offset by 1 (b0000 is 1 and b1111 is 16) and check for b1111? I honestly have no idea, and I don’t think we’ll get an answer unless either (1) someone from Intel during x86’s earlier years chimes in, or (2) someone reverses the gates from die shots of older processors.


x86 instruction-length is not encoded in a single field. You have to examine it byte by byte to determine the total length.

There may be internal fields in the decoder that store this data, I suppose.


Yes I am aware. My wording could’ve been better. I was referring to the (possible) internal fields in the decoder.


How does 8 wide decode on ARM RISC compare to 4 wide decode on x64 CISC? If, say, you'd need two RISC ops per CISC op on average, that should be the same, right?


Most instructions map 1:1. x86 instructions can encode memory operands, potentially doubling the practical width, but a) not all CPUs can decode them at full width and b) in practice compilers generate mem+op instructions only for a (not insignificant) minority of instructions.

So the apples-to-apples (pun intended) practical width difference is closer than it appears, though still not as wide.

X86 machines usually target 5-6 wide rename, so that would become a bottleneck (not all instructions require rename of course). I expect that M1 has an 8-wide rename.

Edit: another limitation is that most x86 decoders can only decode 16 bytes at a time and many instructions can be very long, further limiting actual decode throughput.

Conversely, the expectation is that most hot code will skip decode completely and be fed from the uop cache. This also saves power.
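As a concrete instance of the memory-operand point (typical compiler output; exact registers vary, so treat the assembly in the comment as illustrative):

  /* The same source line becomes one x86-64 instruction but a two-
     instruction load+add pair on AArch64 (typical codegen):
         x86-64 :  add  rax, qword ptr [rdi]
         AArch64:  ldr  x8, [x0]
                   add  x0, x1, x8
     So each x86 decode slot can carry a bit more work, which narrows
     (but doesn't close) the 4-wide vs 8-wide gap. */
  #include <stdio.h>

  long add_from_memory(const long *p, long total) { return total + *p; }

  int main(void) {
      long x = 40;
      printf("%ld\n", add_from_memory(&x, 2));   /* 42 */
      return 0;
  }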


Thank you for a better synopsis.

Bugs me so much when people don't look at the logical side of things. Tons of mac'n'knights going around downvoting and stating people are wrong that it has something to do with being a RISC processor.

While fundamentally pre-2000s things were more RISC-this and CISC-that, the designs are more similar than ever on x86 and ARM. It's just that the components are designed differently to handle the different base instruction sets.

Also, the article is entirely wrong about SoCs: Ryzen chips have been SoCs since their inception. In fact AMD went SoC with the first APUs, which carried north bridge components onto the CPU die.


It also has a shared 12MB L2 cache for the performance cores, which is huge.


> But Apple has a crazy 8 decoders. Not only that but the ROB is something like 3x larger. You can basically hold 3x as many instructions. No other mainstream chip maker has that many decoders in their CPUs.

The author completely misses "the baby in the water"

Yes, X86 cores are HUGE; the whole CPU is there for them only.

They can afford wider decode, even at a giant area cost (which itself would be dwarfed by the area cost of the cache system).

The thing is, more decode and bigger buffers will still not improve X86 perf by much.

Modern X86 has good internal register and pipeline utilisation; it's simply that there isn't enough to keep all of those registers busy most of the time!

What it lacks is memory and cache I/O. All X86 chips today are I/O starved at every level. And that starvation also comes as a result of decades-old X86 idiosyncrasies about how I/O should be done.


How does I/O works differently in a M1 chip compared to x86?


I think X86 is the only modern ISA family that still has a separate address space for I/O. It is not used anymore today, but it exists somewhere deep in the chip, and its legacy kind of messed up how the entire wider memory and cache systems on X86 were designed.

X86 has memory-mapped I/O for modern hardware, but on the way there, X86 memory access got tangled with bus access. X86 still treats the wider memory system as a kind of "peripheral" with a mind of its own.

The intricacies of how X86 memory access evolved to keep accommodating decades-old drivers and hardware apparently made a grand mess of what you can and cannot memory-map or cache, and of many things deeper in the chip.

One of many casualties of that design decision is the X86 cache miss penalty, and overall expensive memory operations.


I don't really get what you are talking about. Everybody has been doing MMIO for a while now (and by for a while I mean multiple decades), and IO is usually not an issue in personal computers anyway; OOO, compute and the memory hierarchy is. We are not discussing about some mainframes...


> I think X86 is the only modern ISA family that still have a separate address space for I/O. It is not used today anymore, but it exists somewhere deep in the chip, and its legacy kind of messed up how the entire wider memory, and cache systems on X86 were designed.

Internally it's just a bank of memory these days. You can publicly see how HyperTransport has treated it as a weird MMIO range for decades (just like Message Signaled Interrupts), and QPI takes the same approach.


Why can't they just give the IO bus a slower clock and devote the resources to the memory bus? Or, memory-map everything and make the IO area yet another reserved area the BIOS tells the OS about?


I believe that if anybody knew the answers to the many questions along the lines of yours, that person would be rich, and Intel would not be in the ditch.

To be clear, the I/O bus with its own address space is no longer in use, but the design considerations that came from keeping it around for so long are still there.


I find it interesting you use this kind of disparaging tone when discussing Apple Silicon. I also find it interesting that you consider having a wide decoder not as a technical trick but as "throwing hardware at the problem."

However you try and spin it, what it comes down to is this, Apple is somehow designing more performant processors than every other company in the world and we should acknowledge they are beating the "traditional" chip designing companies handily while being new at the game.

If it's as easy as "throwing hardware" at the problem, then Intel and AMD and Samsung etc should have no problem beating Apple right?


I interpreted that differently: they used good engineering to make a speed demon of a chip, not some magic trick that's only good in benchmarks and not real world usage. I don't think it was disparaging at all.


It’s never considered “magical” after the feat has been accomplished. But a year ago if you claimed this is where Apple would be today, a lot of people would say that would require waving a magic wand.


Yeah, it's like saying "well all Apple is doing is throwing good engineering at the problem".

All they've really done is taken an advanced process, coupled it with a powerful architecture they've been iteratively improving on for years, and thrown it at software that has been codesigned to work really well on those chips.

Yeesh!


They didn't just throw hardware at the problem, but also talent. Something that is way more scarce.


Apple is throwing money at the problem. IC size is directly proportional to cost.

They couldn't sell a chip like this at a cost-competitive price on the open market against AMD/Intel products.


Hmm. The M1 is about 120 sq mm, substantially smaller than many Intel designs. The current estimate for 5 nm wafers is about $17K (almost certainly high). A single 300 mm wafer is about 70,700 sq mm. If we get die from 75% of that area, that gives us about a $38 raw die cost. Even with packaging and additional testing, I suspect they would be competitive.


Nice back of the envelope calculation. I think I'd add yield to it though.

TSMC had a 90% yield in 2019 for an 18mm² test chip[1]. Assuming a 120mm² chip would have more defects, and assuming process improvements since 2019, maybe 80% would be an accurate-ish estimate.

Found an even better number: [2] lists the defect density as 0.11 per 100mm², which works out to about 87% yield for a 120mm² die.

$38 / 0.87 ≈ $43.7

[1] https://www.anandtech.com/show/15219/early-tsmc-5nm-test-chi... [2] https://www.anandtech.com/show/16028/better-yield-on-5nm-tha...
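Putting the whole estimate in one place (all inputs are the rough public figures quoted above, so the output is order-of-magnitude at best):

  /* Back-of-the-envelope M1 die cost: wafer price / good dies per wafer,
     with Poisson yield from the quoted defect density. All inputs are
     rough public estimates from the posts above.
     Build with: cc die_cost.c -lm */
  #include <math.h>
  #include <stdio.h>

  int main(void) {
      double wafer_cost  = 17000.0;        /* $ per 5nm wafer (high estimate) */
      double wafer_area  = 70700.0;        /* mm^2, 300 mm wafer              */
      double usable      = 0.75;           /* edge loss, test structures, ... */
      double die_area    = 120.0;          /* mm^2, M1                        */
      double defects     = 0.11 / 100.0;   /* 0.11 defects per 100 mm^2       */

      double dies  = wafer_area * usable / die_area;
      double yield = exp(-defects * die_area);          /* ~0.88 */
      printf("dies/wafer ~%.0f, yield ~%.0f%%, cost/good die ~$%.0f\n",
             dies, yield * 100, wafer_cost / (dies * yield));
      return 0;
  }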


That's a fair point, yield has to be included. I lumped yield in with the pessimistic 75% factor for area utilization of the wafer. I should have been more clear. The area loss for square die on a round wafer should be much less than 25% of the total wafer area.

If you look at process tech cost trends, the $17K is also very pessimistic. I think a customer the size of Apple is probably getting a much better rate than that. Remember, they sell well over 200 million TSMC fabbed chips a year. Hard to know for sure of course, but I imagine these chips are ultimately costing Apple well under $40. We'll never know of course...


The big skew in availability between 8GB and 16GB models implies to me that yields of perfect chips are lower than Apple expected, with too many ending up in the 8GB bin.


The DRAM is on separate chips from the M1 processor. The availability skew is probably just a production forecasting error.


I came to the opposite conclusion. I think more users than expected paid for the 16GB models, leaving extra inventory of the 8GB models. During Black Friday/Cyber Monday I saw several discounts on 8GB M1 systems, but none on the 16GB systems.

Hopefully that sends a message to Apple to build systems with more memory. Seems insane to invest in an expensive M1 system (once you add storage, a 3-year warranty, etc.) and get only 8GB. Even if it works well today, with a useful life of 3-6 years it seems likely the extra 8GB will have significant value over the life of the system... even if it's just to reduce wear on the storage system.


That would imply that the M1 has the DRAM rather than just the controllers on the chip but all of the coverage I've seen says that they are separate chips in the same package.


This is interesting. What sources do you visit to learn about CPU manufacturing trends?


Well, there are a number of industry sites, but here are a few good starting points:

https://en.wikichip.org/wiki/WikiChip https://semiwiki.com https://www.tomshardware.com https://semiaccurate.com (paywall for some articles, very opinionated...)


As the sister comments have noted they almost certainly could given the size of the die.

But in another sense you are right - Apple is throwing money at the problem: their scale and buying power means that they have forced their way to the front of the TSMC queue and bought all the 5nm capacity.


Is that true? Intel chips are known to be overpriced.


Intel probably isn't the best example here; a better comparison would be AMD. Their R&D budget was 20-25% of Intel's, yet they were able to produce a better-performing part with Zen 3.


Intel fabs their own chips. AMD outsources that. Intel is still on 14nm vs AMD's 7nm. It has been a really long time since AMD has even come close to Intel. The question is whether Intel can recover from their slump before AMD can get enough chips out.


AMDs on a big upswing, design wins in multiple markets from servers down to laptops. The PS5 and Xbox S/X also will help will volumes for the next few years.


>> Intel is still on 14nm

Why do people keep saying that? There's a variety of 10nm processors from Intel on the market.


Are there any chips like this being sold at all by AMD or Intel? Can you get this performance and power consumption anywhere else?


The AMD 4900U has similar power consumption and higher multi-threaded performance, but lower single-threaded performance.

I expect the AMD part to have quite a bit lower GPU performance, since it uses 1/3 of the transistors of the M1: 4.9 billion transistors vs 16 billion.

https://wccftech.com/intel-and-amd-x86-mobility-cpus-destroy...


AMD CPUs are very hard to fully utilize with a single thread, and Intel has always held the single-thread perf crown. Most high-performance use cases are multi-threaded now, so the single-threaded performance delta isn't that significant. The Apple chip is really built for running a snappy GUI: in most of those cases, you need to run ~1M instructions super fast on one thread for a short time. Intel has historically had the crown on this metric, but not any more with all of their problems.


AMD Zen3 outperforms everything Intel has in single threaded performance.

https://wccftech.com/amd-ryzen-9-5950x-16-core-zen-3-cpu-obl...

Or are you saying that some operations like FMAs for example are hard to keep a high utilization for?


Historically yes, but not so with the Zen3.


The closest in the next few months is the new Zen 3 based APUs that AMD is announcing at CES which is in Mid January. Zen 3 is reasonably competitive with the M1 on a per core basis. Not quite the single thread perf, but pretty competitive on multicore throughput.

As a rough estimate I'd expect the AMD chips to be within 10-20%, and you'll be able to run Windows or Linux on them.

