However, it’s very much not true that Apple is “cheating” their way to performance with a bunch of these tricks for their own software (as much as you can call this cheating). Their processors are optimized for general-purpose workloads, and blow everything else out of the water even if you’re doing something profoundly “un-specialized” like running Photoshop or a Linux VM.
It’s easy to come up with takes like these that point at buzzwords that you’ve seen being thrown around in the last couple weeks in an attempt to explain these chips. The reality is that the reasons these chips are fast are either unknown or boring. I suspect these will come out as we play around with them more, but we don’t have the details right now.
That "competitor" btw is not Intel - they already fail at the basics - but AMD.
Regarding a comparison with AMD: I think we’ll have to see what Apple’s professional chips can do. My guess is they’re pulling ahead already.
Yeah, you can of course say that each of these is a "specific use case". But which
use cases other than running native MacOS apps built on Apple's frameworks, x86 MacOS apps and web applications do you have in mind? I see those covering a lot of what people tend to do with their MacBooks.
I disagree, I think we have plenty of information. This is what happens when a huge proportion of your die isn't doing instruction decoding. x86/amd64 are old and crufty. Lessons have been learned, and had been learned for some time, but people (AMD/Intel) were not brave enough to fix them because they didn't have adequate control of their software ecosystem to say, "Your software from your old computer will not run/will not run as well on your new computer." (For emphasis, let me just put it clearly what I'm claiming: x86 was designed in the 1970s. This is what happens when you use an ISA that was not designed in the 1970s.)
Apple, however, does have control of their ecosystem, and they're allowed to be brave, provided they can support their ecosystem.
Here's my hot take: Intel tried to be brave with Itanium, but failed because compilers weren't up to the job at the time. That's changed. First things first, we're going to see mainstream desktop-class/laptop-class ARM being extended to linux and Windows. But I also believe that sometime in the next ten years, after all this newfangled high-performance ARM stuff is sorted out, we're also going to see someone give VLIW another shot, and they're going to succeed this time.
That's a misconception. Yes, decoding x86 is somewhat more complicated; as far as I'm aware that's mostly because instruction length varies at byte granularity. Still, the area dedicated to it simply isn't that large on these huge out-of-order designs.
I'm sure the instruction encoding plays some role, but I suspect what we're seeing is rather down to consistently good micro-architecture execution over the years and Apple being ahead of everybody else on manufacturing process.
Compared to the x86 processors that exist today, M1 also benefits from having the memory in-package.
> Intel tried to be brave with Itanium, but failed because compilers weren't up to the job at the time.
Compilers "failed" because VLIW is fundamentally not a particularly useful idea. The main problem for general purpose single-thread performance is memory latency, and VLIW just doesn't help there.
I think the problem isn’t just that it’s not particularly useful, but that it’s an actively bad idea. It statically encodes ILP that we get dynamically in chip designs today, which means the ILP can no longer react to changing conditions: core architecture changes as the underlying hardware evolves (or even on the same die, say finding itself running on an efficiency core rather than a performance core or vice versa), or running in an SMT environment in which it is not the only instruction stream going through the pipeline.
I don't recall if they had an idea for SMT as well or not. It's possible they wrote it off entirely, particularly after Spectre and all that.
Not sure why anyone thinks that’s an argument for doing it.
Because it didn't take very much effort and it doesn't require changing distribution models or abandoning those things.
... and it still may not make a difference.
I can tell you some companies were not too pleased about that.
Without AMD in the picture, Itanium issues would have eventually been sorted out.
Cyrix, VIA and friends were drops in the ocean, hardly available outside the US.
"However, IBM required that Intel find a second-source supplier because production had to be guaranteed and it was too risky to rely on a single company as the sole source of its chips."
Feeling a bit nostalgic now, as my first job was with IBM selling the original PC to small businesses. Not the hardest job!
Also, I don't get the point of the cheering for AMD, because apparently many don't realize that without Intel's licenses, which depend on Intel's existence, AMD won't have much to play with anyway.
Intel licensed some stuff to Cyrix, VIA, AMD, Harris and possibly others at the time. But if AMD didn’t license AMD64 back (by way of cross licensing) things might have been grim for Intel back in 2003-2008 when AMD was killing it previously.
Intel did nothing out of the goodness of their heart, and neither did AMD - so I’m not sure why it is bothering you that people are cheering for AMD for providing (at this point) a technically and economically better product, and hating on Intel for being a fat cat. The pendulum may swing back sooner or later, both technically/economically and public opinion wise.
Dunno why people would keep thinking this is a good idea, but if someone wants to try eating it with a new VLIW design yet again, I’m always down for some popcorn.
Dynamically finding ILP at instruction dispatch time is always going to be smarter than a compiler trying to guess at compile time. IMO, it’s worth the area. Even if you assume the compiler gets it right and finds tons of ILP, it’ll get it wrong for the next hardware rev when your core inevitably changes and you have to break apart your VLIW instructions and get ILP back dynamically anyway. To say nothing of shared pipelines in an SMT processor, where VLIW basically runs counter to SMT and the only way to know which units are going to be available on a specific cycle is during runtime.
> and they're going to succeed this time.
I don’t see any reason why VLIW would succeed more in future attempts than in past attempts.
For example, compilers already model the CPU's pipeline and ILP capacity to inform instruction selection. VLIW designs are an attempt to make that cooperation more explicit, similar in spirit to branch/prefetch hints. Maybe there's still a way to do that well?
And re: microarchitectural changes in the next hardware rev: https://news.ycombinator.com/item?id=25237056
These ideas might ultimately fail, but they seem worth looking into, at least.
I don’t think we’re really learning much by pretending building a new one is a good idea. Or that the last efforts had much value.
No one is saying don’t make progress on compilers. Look around you and you’ll see the things you just referenced already exist in practice in production today. (Among others: compiler/architecture co-design is a well established and fertile area of work. Compiling to intermediate formats which are specialized for a device later is in use to some degree in the App Store, which keeps programs in an intermediate format and thins binaries down to specific slices on demand for iOS devices, etc.) None of these technologies justify building a new VLIW chip or frankly have much to do with those efforts. The best thing we could learn from the huge amount of effort that’s gone into VLIW architectures is to actually learn the lesson that’s in front of us, demonstrated repeatedly: unless something substantial changes, VLIW does not produce a better architecture for processor cores, and the assumption that it would was wrong.
Intel tried to be brave with the 432.
The 8086 was rushed as a stopgap. It was never intended as a serious design.
Intel tried to be a little brave with the 286.
Intel tried to be brave with the i860.
Intel tried to be brave with Itanium.
In the meantime Intel have made countless billions with souped-up 8086s. And we're stuck with them. Or were.
But yes, Apple spends a lot on CPU design and it shows; ARM's Cortex performance for things like NEON is way, way worse.
From my impressions of seeing how these get implemented, it seems like some hardware features end up getting added to the chip by the hardware team, and then the software engineers get told that they now exist. So that’s another reason why I doubt that they are “cheating” on benchmarks.
Knowing the tricks of a trade, like software development, means the difference between starving and success.
In the same way there are "tricks" that you know about relationships and money and work that can change your life radically.
"Trick" in English has different meanings. It can be something intended for deception or illusion, but it can also be a habit or mannerism.
In Apple's case, those tricks are not really tricks, but strategic, informed decisions.
Apple has lots of knowledge about what the market needs and is willing to pay for, they have access to the purchase data of tens of millions of people.
If you have no access to real information, some decisions of the companies that do will look comical or nonsensical.
Have you maybe also heard which details could have been misrepresented?
(FWIW, Rosetta itself does “cheat”: the translated code and the runtime share an address space, and memory accesses are not checked, so it’s possible to write an “Intel” binary that is aware of the runtime and can read emulator state.)
And again: ARM already has the X in the pipeline which is going to close part of the ARM-Apple A-series gap.
Also go pick a bone somewhere else man, how sad do you have to be to multi-downvote on HN? This isn’t Reddit.
Uh, no. Implementing TSO in a highly performant way is not "cheating", it's a difficult engineering feat. And no, TSO is not some fancy Intel thingy. It's a standard memory model.
The POWER7 through POWER9 processors from IBM implement an equivalent to this called "SAO", or "Strong Access Ordering Mode". To quote an IBM engineer:
> Currently, power has a weaker memory model than x86. Implementing a stronger memory model allows an emulator to more efficiently translate x86 code into power code, resulting in faster code execution.
You'll see just how deep support for this mode goes. Implementing this sort of strong memory ordering requires modifications throughout, e.g. SAO mode introduces a new pipeline hazard to the L2 cache and requires deep support in the MMU, including corner cases like virtual memory for VMs and nested VMs.
It's not a casual addition to the uarch.
It just occurred to me Apple has taken a very different approach to solving this problem.
From what I can tell, TSO on apple's CPUs is just process state implemented in their "high performance" cores that you get from flipping a bit in the MSR register (?).
IBM's version is implemented in the memory system, and is applied to pages of memory (on Linux you call mprotect to mark a page SAO). TSO-semantics are then applied regardless of which core on a potentially multiprocessor, SMP system accesses it.
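For anyone curious what the POWER/Linux side looks like in practice, here's a minimal sketch, assuming a powerpc64 kernel and headers that still expose PROT_SAO (it's been removed and partially restored over the years, so treat this purely as illustration, and the fallback #define of its value is an assumption from the powerpc headers):

```cpp
#include <sys/mman.h>
#include <cstdio>
#include <cstdlib>

// powerpc-specific flag; not in generic headers (assumed value from arch/powerpc).
#ifndef PROT_SAO
#define PROT_SAO 0x10
#endif

int main() {
    const size_t len = 4096;
    // Ordinary anonymous mapping first...
    void *p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return EXIT_FAILURE; }

    // ...then ask the kernel for strong (x86-like) access ordering on those pages.
    // Per the SAO design, any core touching this memory then sees TSO-like ordering.
    if (mprotect(p, len, PROT_READ | PROT_WRITE | PROT_SAO) != 0)
        perror("mprotect(PROT_SAO)");   // expected to fail on non-POWER hardware
    return 0;
}
```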
The Apple implementation seems to be more limited -- but in a totally inconsequential way (can't share data between TSO and non-TSO processes). But it does beg the question, what happens when you have shared memory between a Rosetta 2 process and native ARM code? If the ARM code is running on a core without TSO, what gremlins come out to eat your data?
As for sharing data, a non-TSO thread can basically assume that a thread using TSO is equivalent to a non-TSO thread using only ldapr/stlr for memory access. And those interactions are well-defined.
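To make that concrete, here's a tiny C++ sketch (my own illustration, not Apple's or IBM's code) of the acquire/release handshake a native, non-TSO thread can rely on; on AArch64 the release store lowers to stlr and the acquire load to ldar/ldapr:

```cpp
#include <atomic>

int payload = 0;                   // plain data written by the "TSO"/Rosetta side
std::atomic<bool> ready{false};

void tso_side() {                  // behaves "as if" its stores were stlr
    payload = 42;
    ready.store(true, std::memory_order_release);      // -> stlr
}

void native_side() {               // ordinary weakly ordered ARM thread
    while (!ready.load(std::memory_order_acquire)) {}  // -> ldar/ldapr
    // Observing ready == true guarantees payload == 42 here; the interaction
    // is well-defined with no extra barriers needed on the non-TSO side.
}
```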
It's a playful use of the term.
Edit: Lol I know, downvote away, icgaf I haven’t played a game besides minesweeper in months.
I'm curious: while any app is running in Rosetta 2, does that mean the entire M1 CPU is in "Intel TSO mode", and therefore all processes run slower as a result, or can it do that for just the Rosetta 2 process(es)?
As a side note, there is often a bit of terminology confusion about whether a "cpu" is the die or the core. When discussing microarchitecture, core == cpu in most cases.
For some values of large?
> Apple now sells more watches than the entire Swiss watch industry
Apple's watch was their usual polished refinement of existing, mainly Chinese, SIM-equipped smartwatch designs. They didn't invent a market segment out of a vacuum.
Of course the expectation was that the Apple Watch should be just as popular as the iPhone and transcend the watch industry completely. Wearables would be the next utterly indispensable technology, maybe even threatening the phone itself. In comparison, 'merely' dominating the existing watch industry and gobbling large chunks of the fitness tracker market seems underwhelming.
I was and still am. As I said, most people I know don't wear a watch, myself included. It was the same in 2014 and 2015.
The most significant product produced since Jobs' passing is the continuous focus on developing technology in the open that can ultimately be applied to groundbreaking products.
The M1 is the natural evolution of the A series processors and the software development tools Apple has been working on in the open for years.
The next major mobile product from Apple is the natural evolution of the micronization of hardware systems seen on the Apple Watch, tied together with secure and private storage of personal data.
Like the iPhone and iPad, the Apple Watch is a great product in and of itself. But the true product of Apple has been focused iterative delivery.
... Except that this is not a product, by any definition ever.
Having fun with writing CLI apps and daemons for the watch?
Plus POSIX support in IoT-like devices has existed for ages.
“ On Friday, September 1, 2017, after a round of layoffs that started in Oracle Labs in November 2016, Oracle terminated SPARC design after the completion of the M8. Much of the processor core development group in Austin, Texas, was dismissed, as were the teams in Santa Clara, California, and Burlington, Massachusetts. SPARC development continues with Fujitsu returning to the role of leading provider of SPARC servers, with a new CPU due in the 2020 time frame.”
It’s not like System76 is going to be able to buy M1s (dang, sounds like a weapon when you put it that way), and Ryzen might not compete with near-term successors to the M1.
Also, AFAIK MS's translation of x86 -> ARM works differently than Rosetta 2, so the results aren't just because "Apple added Intel's memory ordering to their CPU" (whatever that means).
The memory ordering stuff is cool, calling it cheating might be a bit harsh although it would be cool if other ARM cores would support something like this.
I am tired of people bringing up that instruction as though it's magically making JS faster compared to Intel. At best it's a level-the-playing-field instruction.
Implementing TSO isn't "cheating", it's hard work that other companies aren't willing to do.
Ref counting isn't magically faster than other GC methods - there are arguments that it is slower - but it does use much less memory than pretty much any other GC scheme, hence the lower memory needs on iOS vs Android.
I think the better argument here is that iOS and MacOS use RC in the underlying objc libs. Having a CPU that works better around that makes sense to increase performance for those particular OSes.
As far as perf the general argument is that dropping the need for refcounting saves time, and that removal also helps caching due to reduced per-object size.
That said I’m not sure if those comparisons are comparing to generational or moving collectors (which are the low latency collectors) because those start needing write barriers.
That does make it pretty novel on Apple's end. I agree with Heinlein that "a good idea is worth one bottle of scotch"; that the execution is where the vast majority of the work is and where the vast majority of praise should lie.
And that's coming from someone who's not as bullish as some on Apple's execution. I think AMD's cores are better and compared 1/1 on the same process node Apple would look a little behind overall. But Apple for sure gets praise for their execution here (optionality in store order), which is remarkably good.
What really defines Apple’s Firestorm CPU core from other designs in the industry is just the sheer width of the microarchitecture. Featuring an 8-wide decode block, Apple’s Firestorm is by far the current widest commercialized design in the industry.
A ~630-entry-deep ROB is an immensely huge out-of-order window for Apple’s new core, as it vastly outclasses any other design in the industry. Intel’s Sunny Cove and Willow Cove cores are the second-most “deep” OOO designs out there with a 352-entry ROB structure, while AMD’s newest Zen3 core makes do with 256 entries, and recent Arm designs such as the Cortex-X1 feature a 224-entry structure.
On the Integer side, we find at least 7 execution ports for actual arithmetic operations. These include 4 simple ALUs capable of ADD instructions, 2 complex units which feature also MUL (multiply) capabilities, and what appears to be a dedicated integer division unit.
On the floating point and vector execution side of things, the new Firestorm cores are actually more impressive, as they see a 33% increase in capabilities, enabled by Apple’s addition of a fourth execution pipeline. The FP rename registers here seem to land at 384 entries, which is again comparatively massive. The four 128-bit NEON pipelines thus on paper match the current throughput capabilities of desktop cores from AMD and Intel, albeit with smaller vectors.
The huge caches also appear to be extremely fast – the L1D lands in at a 3-cycle load-use latency. AMD has a 32KB 4-cycle cache, whilst Intel’s latest Sunny Cove saw a regression to 5 cycles when they grew the size to 48KB.
There are plenty of other details in the piece, although bear in mind that the M1 is more beefy than the A14 detailed here (more i/o bandwidth, larger system cache, etc.).
And the memory model isn't even cheating or specific to Intel x86.
And worst of all, it is getting freaking hyped up all over the place, even on HN. It is like watching misinformation spread like wildfire in real time.
You've got to wonder what 3nm will bring, not just for Apple, but for the CPU industry in general.
Comparing the 7nm AMD Zen 3 to the 5nm A14 leads to a simple extrapolation to 3nm: we could see 256KB+ L1 caches, 1K deep ROBs, etc...
Say you have the Ampere Altra Max, a 128-core ARM processor. Can Apple do that theoretically, or is Ampere only able to do that because they make significantly smaller cores, or is it perhaps because they don't have an SoC?
I am just trying to understand better the tradeoffs between smaller, weaker cores vs larger but fewer cores. Like if Apple wants to compete at the top end where e.g. AMD has 64 cores. Will that even be possible for Apple?
The interesting thing is the very deep ROB. Conventional wisdom is that because of the cumulative branch misprediction probability, ROBs hit diminishing returns relatively quickly. Apple apparently has a best-in-class predictor, but still that ROB is huge.
Maybe they do tricks like automatically converting some branches to predication.
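For anyone wondering what "converting a branch to predication" looks like, here's a toy C++ example (purely illustrative, nothing specific to Apple's hardware): the second version trades a control-flow branch for a data dependency, which compilers typically lower to csel/cmov, so there is nothing for the predictor to get wrong.

```cpp
#include <cstddef>
#include <cstdint>

// Branchy: the core must predict which way `v[i] < threshold` goes each iteration.
int64_t sum_small_branchy(const int32_t *v, size_t n, int32_t threshold) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; ++i) {
        if (v[i] < threshold)
            sum += v[i];
    }
    return sum;
}

// Predicated: always add either v[i] or 0, selected by the condition.
// Compilers usually emit csel (ARM) / cmov (x86) here, so a mispredict can't happen.
int64_t sum_small_predicated(const int32_t *v, size_t n, int32_t threshold) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; ++i) {
        sum += (v[i] < threshold) ? v[i] : 0;
    }
    return sum;
}
```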
I wish we'd get that on x86_64 too. Especially since it seems like an obvious and easy win.
Here's a snipped quote.
"this further improvement is because uncontended acquire-release atomics are about the same speed as regular load/store on A14"
"Weaker memory model makes acquire-release atomics possible to implement much more efficiently, in exchange for not hiding some classes of multithreading bugs"
Another thing is apparently their branch predictor is specially tuned to work with how Objective-c dispatch works (which iirc Swift inherits in places too, and not just when interfacing with Obj-C APIs)
What the hell does that even mean?
Disregarding the nonsensical attribution of magic to Swift vs Java: if "reference counting" on translated x86 code is faster, then I suppose the M1 does "lock add / cmpxchg" or whatever faster than x86.
I would love to know how. Other than just as a side-effect of a SoC, UMA architecture.
fun fact: retaining and releasing an NSObject takes ~30 nanoseconds on current gen Intel, and ~6.5 nanoseconds on an M1
> Weaker memory model makes acquire-release atomics possible to implement much more efficiently, in exchange for not hiding some classes of multithreading bugs
That's the part I'd love someone knowledgeable to explain.
> double the speed of reference counting
The simplest one: low-latency system memory making atomics faster in some contexts (faster cache write-back/fetch).
Then atomic operations (using Acquire/Release) are theoretically faster on ARM due to the weak memory ordering (I'm not sure if this helps for translated x86 code).
Potentially other reasons can include:
- maybe optimized fetch add/sub instructions (or more precisely, the ARM instructions used by fetch add/sub functions in higher level languages)
- maybe tweaks to the coherence protocol
- maybe specialized optimizations in the pipelining
All in all it means Apple focused on making sure fetch add/sub are fast (most likely because that's what is normally used for Rc); the usual acquire/release refcount idiom is sketched below.
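Here's the bare-bones C++ shape of that idiom (my sketch, not Apple's runtime or any particular Rc implementation): the increment can be fully relaxed, and the decrement only needs release plus one acquire fence on the last owner, which on a weakly ordered core maps to cheaper instructions than x86's always-fully-ordered lock-prefixed RMWs.

```cpp
#include <atomic>

// Bare-bones intrusive refcount, just to show the memory orderings involved.
struct RefCounted {
    std::atomic<long> refs{1};

    void retain() {
        // Taking a new reference only needs the counter to be atomic, not ordered.
        refs.fetch_add(1, std::memory_order_relaxed);
    }

    void release() {
        // The decrement publishes all prior writes to the object (release)...
        if (refs.fetch_sub(1, std::memory_order_release) == 1) {
            // ...and the last owner acquires them before tearing the object down.
            std::atomic_thread_fence(std::memory_order_acquire);
            delete this;
        }
    }

    virtual ~RefCounted() = default;
};
```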
It is possible because if you don’t have TSO (total store order), you can employ atomic operations without waiting for/flushing previous items in the CPU’s store buffer.
Maybe they aggressively speculate around them.
The other thing is that the M1 processor seems to be a lot wider than contemporary processors. This means it can (potentially) do more operations per clock. Wider processors are harder to clock faster, but it works for Apple. Apple has no desire to put in the cooling you need to get chips running at 4+ GHz, so it's not a big deal if their chips are clock limited. On the other hand, Intel and AMD like their cores to approach 5 GHz at the top end.
If I have to get the clock to reach 2 cores/units at the same time, that is easier than, say, 4 cores/units. As you increase the clock speed, the time available to reach all the cores decreases. So if you are wider you need to reach more cores/units in the same (shorter) period of time, hence "harder".
EDIT (To make it a little more clear)
To make it more clear: the voltage change is typically represented in books as a vertical line, but that is not the case; it's diagonal and fuzzy. By fuzzy I mean not a straight line, but one with some tiny dips on the way up.
Different parts of the circuits are going to respond to the up or the down. They can also vary based on the exact voltage. For example, if I have a 5V "up", one circuit might consider 4.8V to be up and another could be 4.9 or 4.7.
Silicon has improved, but there are still limits of scale based on size, voltage and timing.
So Apple has twice as many decoders, eight, and may actually be able to keep adding to that number while AMD and Intel may get stuck on 4, thanks to the x86 CISC legacy.
I'm not sure why everyone focuses on the decoders as the primary determinant of width. There are other bottlenecks which may be narrower than the decoders, and the decoders may not even be used when a uop cache or something like a "loop buffer" (LSD on Intel) is present.
So AMD doesn't believe that wider chips aren't useful or that 4 is a limit, because they went to 5-wide in Zen (or 6, depending on how you count it) and I expect them to go wider in the future.
Intel went from 4 wide (narrowest bottleneck) to 5 wide in Ice Lake.
Wider chips are the future, and there is no "hard wall" at 4 for x86, just like any earlier width increase: just constantly diminishing returns.
The set of x86 instructions people actually use is sufficiently small that (ignoring the amortized cost of the rest of the ISA - it should be considered, but we can't really know how much space it actually takes up on the die) the two architectures have, to first order at least, converged, such that the tricks are now all in details like memory ordering and scheduling rather than in the actual instructions per se.
This is why you don't see Intel being competitive in low power applications where ARM excels.
is the WebKit commit that added support for FJCVTZS; it was between a 0.5% and 2% speedup - that's a big win for a mature optimizer.
Can anyone explain what they did exactly?
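Roughly: ARMv8.3 added FJCVTZS ("Floating-point Javascript Convert to Signed fixed-point, rounding toward Zero"), which does in one instruction what a JS engine otherwise has to patch together for ToInt32: truncate and wrap modulo 2^32 (with NaN/Inf going to 0) rather than the saturating conversion you normally get. A rough C++ sketch of that semantic, based on my reading of the ECMAScript ToInt32 algorithm (js_to_int32 is just an illustrative name, not WebKit's code):

```cpp
#include <cmath>
#include <cstdint>

// What JS ToInt32 (and hence FJCVTZS) computes: truncate toward zero,
// wrap modulo 2^32, reinterpret as signed. NaN and +/-Inf become 0.
int32_t js_to_int32(double x) {
    if (!std::isfinite(x)) return 0;
    double t = std::trunc(x);                 // round toward zero
    double m = std::fmod(t, 4294967296.0);    // reduce modulo 2^32
    if (m < 0) m += 4294967296.0;             // bring into [0, 2^32)
    uint32_t u = static_cast<uint32_t>(m);
    return static_cast<int32_t>(u);           // two's-complement wrap, not saturate
}
```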
It probably would be legal to enable TSO everywhere as you said, but weaker ordering lets the CPU do more aggressive optimizations
> 19/ ...even when translating x86 code, all that reference counting overhead (already more efficient than garbage collection) gets dropped in half. Yet another weird performance enhance to add to all the others.
Not at all, as proven by known benchmarks.
Putting refcounting in the CPU was the only way Apple managed to make it fast enough.
Objective-C only went with refcounting because Apple failed to make their tracing GC work flawlessly across frameworks with mixed compiler flags alongside C semantics, thus having the compiler automate retain/release was the saner option.
Likewise, Swift had a requirement to have flawless interoperability with Objective-C runtime and libraries, so the natural option was to adopt reference counting instead of a translation layer across both worlds.
.NET / COM interoperability is a good example of how integrating a tracing GC with reference counting can turn into an engineering feat.
Speaking of which, UWP, which is 100% COM based, is somehow noticeably slower than pure Win32 applications; besides the sandboxing, the main reason is naturally AddRef/Release everywhere.
Hence why C++/WinRT, as the C++/CX replacement, is full of tricks: moving destruction to background threads, wrapping COM handles in stack allocations, taking advantage of constexpr.
So yeah, it is tricks, and marketing in how those tricks are sold, especially to crowds that care more about the next SPA framework release than about how compilers and CPUs work.
Yeah, then your code is shit and was relying on undefined behaviour.
Now that you just scroll down like any other website, it just sounds like tired complaining.
Not having to read the text backwards is an extremely low bar; I expect more from a text.
Twitter is the wrong tool for the task, a very bad one.
lol, not taking the bait
- modern architecture having fast atomic fetch_add/sub
- weak memory ordering making atomic fetch_add/sub even faster (if Acquire/Release ordering is used)
- optimize the usage of reference counting by eliminating pointless reference counting, e.g. as done by Objective-C/Swift automatic reference counting (ARC), or e.g. by using borrows in Rust for everything except the places where you know you really need to clone the Atomic Reference Counter (also ARC, but a different one ;=) ). A tiny C++ analogue of this is sketched after this comment.
Reference counting was for a long time basically guaranteed to be slower than garbage collection because of slower atomics, no weak memory ordering, and "naive" usage of reference counting introducing a lot of unnecessary reference counter increases and decreases.
But with modern hardware and compilers it's no longer as clear cut, at least in practice on modern hardware.
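Here's that "eliminate pointless reference counting" point as a trivial C++ analogue (my illustration, not ARC's optimizer or Rust's borrow checker): passing a shared_ptr by value bumps the atomic count on every call, while "borrowing" it by const reference generates no refcount traffic at all.

```cpp
#include <cstddef>
#include <memory>
#include <string>

// Every call copies the shared_ptr: one atomic increment on entry,
// one atomic decrement on exit, for no semantic benefit.
std::size_t len_by_copy(std::shared_ptr<std::string> s) {
    return s->size();
}

// "Borrowing" the pointer instead: same behaviour, zero refcount traffic.
// ARC's retain/release elision and Rust borrows get you this automatically.
std::size_t len_by_borrow(const std::shared_ptr<std::string> &s) {
    return s->size();
}
```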
and hey, here we are
>>> We use this framework to compare the time-space performance of a range of garbage collectors to explicit memory management with the Lea memory allocator. Comparing runtime, space consumption, and virtual memory footprints over a range of benchmarks, we show that the runtime performance of the best-performing garbage collector is competitive with explicit memory management when given enough memory. In particular, when garbage collection has five times as much memory as required, its runtime performance matches or slightly exceeds that of explicit memory management. However, garbage collection’s performance degrades substantially when it must use smaller heaps. With three times as much memory, it runs 17% slower on average, and with twice as much memory, it runs 70% slower. Garbage collection also is more susceptible to paging when physical memory is scarce. In such conditions, all of the garbage collectors we examine here suffer order-of-magnitude performance penalties relative to explicit memory management. <<<
GC is "competitive" when it has 5x as much memory, according to the authors. Well, probably this is because GCs are fast when they have low to no memory pressure: they simply don't need to garbage collect anything. In fact, a non-collecting, non-managed memory approach can be faster than any memory management. That is, you have a short-lived application that just allocates what it needs, never frees anything, then dies when done. But this is not apples to apples.
The paper also says that with 2x more memory than manual management, GC is 70% slower, and under paging scenarios GC can suffer "order-of-magnitude performance penalties relative to explicit memory management."
Except under unusual conditions, there is no case where GC is faster than manual.
> Except under unusual conditions, there is no case where GC is faster than manual.
Their simple generational GC outperforms manual memory management in every single workload in that paper given enough memory, so that statement is clearly false. A production quality GC does so with less memory.
> Well, probably this is because GCs are fast when they have low to no memory pressure. They simply don't need to garbage collect anything.
This is wrong. It's because generational GCs start to look like arena allocators when they have enough memory, with cheap bump allocation and bulk deallocation. On most of their workloads, their simple generational GC already outperforms naive explicit memory management with less than 3x the memory, and this would certainly be better with explicit tuning or a production GC.