Apple CPU tricks: memory reordering, JavaScript support, ref counting (twitter.com/erratarob)
277 points by simonpure on Nov 27, 2020 | 191 comments

It’s important to note that most of the things mentioned here are just “tricks”: they’re fun to discover and talk about, but they really only end up being minor wins in practice. TSO is great…if you are trying to make a simpler Rosetta (it’s not even necessary on the M1, although I think Apple is still using it for convenience; I’m still trying to find out where). The JavaScript instruction speeds up…one specific rounding in JavaScript. Fast atomics help…if you’re that one specific part of Apple’s runtime that has taken advantage of it yet. They are all cool, important little optimizations, and I’m sure that few other companies could do what Apple is doing in this space.

However, it’s very much not true that Apple is “cheating” their way to performance with a bunch of these tricks for their own software (as much as you can call this cheating). Their processors are optimized for general-purpose workloads, and blow everything else out of the water even if you’re doing something profoundly “un-specialized” like run Photoshop or a Linux VM.

It’s easy to come up with takes like these that point at buzzwords that you’ve seen being thrown around in the last couple weeks in an attempt to explain these chips. The reality is that the reasons these chips are fast are either unknown or boring. I suspect that these will come out as we play around with them more, but we don’t have the details right now.

I disagree regarding the importance of these kinds of "tricks". You are right in pointing to all the "boring" stuff that is considered state of the art in chip design being the main reason why these chips are impressive. Apple seems to have pulled all the right strings in that regard. But there is a point at which you can't execute the known optimizations any more perfectly than you already do - when you're already using the best available lithography, implementing all the well-known architectural benefits like big/little, big caches, wide decoders etc., you eventually hit a brick wall and the only thing that will get you further are new and creative tricks like the ones discussed in the linked tweets. Each may just add a little, but together they add up to a noticeable improvement with regard to a competitor who also executed all the standard stuff well, but maybe did not think about these particular new optimizations, or might not even be able to implement them due to not being in control of the entire vertical ecosystem.

That "competitor" btw is not Intel - they already fail at the basics - but AMD.

The point is that in benchmarks these things don’t really show up at all and Apple still leads. The things I mentioned are only useful because of specific usecases, and saying that Apple is ahead because of those is inaccurate. The point is that the tricks are extremely important for Apple, but not in an argument of “why is Apple making fast processors”.

Regarding a comparison with AMD: I think we’ll have to see what Apple’s professional chips can do. My guess is they’re pulling ahead already.

So JavaScript is not benchmarked, fast ref counting does not improve native application benchmark results, and the memory model trick does nothing to improve benchmark results of x86 applications?

Yeah, you can of course say that each of these is a "specific use case". But which use cases other than running native macOS apps built on Apple's frameworks, x86 macOS apps and web applications do you have in mind? I see those covering a lot of what people tend to do with their MacBooks.

> The reality is that the reasons these chips are fast are either unknown or boring. I suspect that these will come out as we play around with them more, but we don’t have the details right now.

I disagree, I think we have plenty of information. This is what happens when a huge proportion of your die isn't doing instruction decoding. x86/amd64 are old and crufty. Lessons have been learned, and had been learned for some time, but people (AMD/Intel) were not brave enough to fix them because they didn't have adequate control of their software ecosystem to say, "Your software from your old computer will not run/will not run as well on your new computer." (For emphasis, let me just put it clearly what I'm claiming: x86 was designed in the 1970s. This is what happens when you use an ISA that was not designed in the 1970s.)

Apple, however, does have control of their ecosystem, and they're allowed to be brave, provided they can support their ecosystem.

Here's my hot take: Intel tried to be brave with Itanium, but failed because compilers weren't up to the job at the time. That's changed. First things first, we're going to see mainstream desktop-class/laptop-class ARM being extended to linux and Windows. But I also believe that sometime in the next ten years, after all this newfangled high-performance ARM stuff is sorted out, we're also going to see someone give VLIW another shot, and they're going to succeed this time.

This is what happens when a huge proportion of your die isn't doing instruction decoding.

That's a misconception. Yes, decoding x86 is somewhat more complicated, as far as I'm aware that's mostly because instruction length differs at a byte granularity. Still, the area dedicated to it simply isn't that large on those huge out of order designs.
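To make the "byte granularity" point concrete, here is a minimal sketch of why finding instruction boundaries differs between the two styles. The length rule in `toy_length` is a made-up toy encoding (an assumption for illustration, not real x86 or ARM): the point is only that with a fixed width every start offset is known up front, while with variable lengths each start depends on having looked at the previous instruction.

```rust
/// Fixed-width ISA: every instruction start is known without looking at
/// the bytes, so many decoders can work on different offsets in parallel.
/// `width` must be non-zero (`step_by` panics on 0).
fn fixed_width_starts(stream_len: usize, width: usize) -> Vec<usize> {
    (0..stream_len).step_by(width).collect()
}

/// Hypothetical toy rule: the low two bits of the first byte give the
/// instruction length (1 to 4 bytes). Real x86 length decoding is far
/// more involved; this is only a stand-in.
fn toy_length(first_byte: u8) -> usize {
    (first_byte & 0x03) as usize + 1
}

/// Variable-length ISA: the start of instruction N+1 is only known once
/// the length of instruction N has been determined, so this loop is
/// inherently sequential.
fn variable_length_starts(stream: &[u8]) -> Vec<usize> {
    let mut starts = Vec::new();
    let mut pc = 0;
    while pc < stream.len() {
        starts.push(pc);
        pc += toy_length(stream[pc]);
    }
    starts
}
```

For a 12-byte fixed-width stream the starts are simply 0, 4, 8. For the toy variable-length stream `[0x02, 0, 0, 0x00, 0x01, 0, 0x03, 0, 0, 0]` the starts are 0, 3, 4, 6, and none of them after the first can be known without walking the earlier bytes. Real hardware works around this with extra logic, which is the "somewhat more complicated" part; whether that costs much area is exactly what this subthread is arguing about.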

I'm sure the instruction encoding plays some role, but I suspect what we're seeing is rather down to consistently good micro-architecture execution over the years and Apple being ahead of everybody else on manufacturing process.

Compared to the x86 processors that exist today, M1 also benefits from having the memory in-package.

Intel tried to be brave with Itanium, but failed because compilers weren't up to the job at the time.

Compilers "failed" because VLIW is fundamentally not a particularly useful idea. The main problem for general purpose single-thread performance is memory latency, and VLIW just doesn't help there.

> Compilers "failed" because VLIW is fundamentally not a particularly useful idea.

I think the problem isn’t just that it’s not particularly useful, but that it’s an actively bad idea. It statically encodes ILP that we get dynamically in chip designs today, which means that ILP can no longer react to changing conditions, like literally any core architecture changes as the underlying hardware evolves (or even on the same die, say finding itself running on an efficiency core rather than a performance core or vice versa) or running in an SMT environment in which it is not the only instruction stream going through the pipeline.
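A rough sketch of what "statically encoding ILP" means, under heavy simplifying assumptions: every instruction takes one cycle, every issue slot can run any instruction, and the scheduler is a naive greedy one. The names `pack_bundles`, `fits`, and `example_deps` are made up for this illustration and do not correspond to any real compiler.

```rust
/// Greedy static scheduler: `deps[i]` lists the instructions whose results
/// instruction `i` needs. Each bundle holds at most `width` instructions,
/// all of which only depend on instructions from earlier bundles.
fn pack_bundles(deps: &[Vec<usize>], width: usize) -> Vec<Vec<usize>> {
    assert!(width > 0, "a machine needs at least one issue slot");
    let n = deps.len();
    let mut done = vec![false; n];
    let mut remaining = n;
    let mut bundles = Vec::new();
    while remaining > 0 {
        let mut bundle = Vec::new();
        for i in 0..n {
            if bundle.len() == width {
                break;
            }
            // Ready means: not yet scheduled, and all inputs were produced
            // by earlier bundles (`done` is only updated after the bundle
            // is complete, so nothing depends on a sibling in the bundle).
            if !done[i] && deps[i].iter().all(|&d| done[d]) {
                bundle.push(i);
            }
        }
        assert!(!bundle.is_empty(), "dependency cycle in input");
        for &i in &bundle {
            done[i] = true;
        }
        remaining -= bundle.len();
        bundles.push(bundle);
    }
    bundles
}

/// Can a schedule be issued unchanged on a core with `width` slots?
fn fits(bundles: &[Vec<usize>], width: usize) -> bool {
    bundles.iter().all(|b| b.len() <= width)
}

/// Small example DAG: 0..=3 are independent, 4 needs 0 and 1,
/// 5 needs 2 and 3, and 6 needs 4 and 5.
fn example_deps() -> Vec<Vec<usize>> {
    vec![vec![], vec![], vec![], vec![], vec![0, 1], vec![2, 3], vec![4, 5]]
}
```

Scheduling `example_deps()` for a 4-wide machine gives three bundles, the first being `[0, 1, 2, 3]`. That binary no longer fits a 2-wide core (`fits` returns false), so it either needs recompiling (four bundles at width 2) or hardware that splits bundles at runtime, which is the dynamic scheduling VLIW was meant to avoid.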

Adapting to advancing hardware isn't impossible. It's been awhile since I've looked so details are a little hazy, but I know the Mill architecture had an answer to this. I believe they were using effectively two-pass compilation. The first compiled against an abstract version of the architecture, and this is what was distributed. Then this could be further specialized on the user's machine as at that point the limits of the hardware would be known.

I don't recall if they had an idea for SMT as well or not. It's possible they wrote it off entirely, particularly after Spectre and all that.

You are correct that with a lot of effort you can get a compiler to emit VLIW instructions that can claw back some of these problems and have some, but not all, of the information you have available to the processor core at runtime, which can make things slightly smarter. And with all of that effort, which involves changing software distribution models, jettisoning energy-efficient flexible heterogeneous architectures, and abandoning SMT, you might be able to get something close to performing as well as… what we already have today without VLIW.

Not sure why anyone thinks that’s an argument for doing it.

> Not sure why anyone thinks that’s an argument for doing it.

Because it didn't take very much effort and it doesn't require changing distribution models or abandoning those things.

Not disputing what you're saying on instruction decoding but there is a lot of other stuff on an x86 processor that Apple hasn't needed or chosen to implement: 32bit, AVX, legacy SIMD, legacy modes, Intel Management Engine plus probably a more complex instruction set. Individually these may not make a difference but add them all up ....

> Individually these may not make a difference but add them all up ....

... and it still may not make a difference.

I have AVX512 on my laptop. Is it used by any software I use? No. Does it take up a lot of die area? Definitely.

Doesn't Apple use TSMC manufacturing?

Yes, and they are the only ones on TSMC 5nm currently.

That's because Apple bought the entire 5nm production capacity/runs for basically the entire last quarter of 2020.

I can tell you some companies were not too pleased about that.

Ah yeah, Apple has f-you money and it knows how to use it.

They did this with FedEx planes out of China years back. I forget which iPhone version, but they basically did the same thing: they booked everything. All FedEx planes only had iPhones and everybody complained. I don't think they have tried that stunt again.

Is there anything preventing those companies from "ganging" up on Apple and buying capacity jointly? Are they too many and too small? Is it possible to share wafer-space with other companies or does having multiple designs add too much complexity/cost?

I think Apple essentially outbid them, which it was able to do because it has both large volume and large profit margins.

Which companies?


Intel failed because AMD exists and has a cross license agreement that allowed them to be clever.

Without AMD in the picture, Itanium issues would have eventually been sorted out.

There would have been no Intel on IBM compatible PCs without x86 cross licensing, so you could make the point that AMD allowed Intel to thrive.

What? I was there in the early days, that just makes no sense.

Cyrix, VIA and friends were ocean drops, hardly available outside US.

Very unlikely that IBM would have used the 8088 without a second source - so literally Intel needed firms like AMD.

Not unlikely, it was a condition.

Nope, it was a condition from government to buy PCs from IBM, most of their computers have always been single source.

So what were the second sources for all major IBM computers throughout its history?

Sorry, don't know what point you're making here.

"However, IBM required that Intel find a second-source supplier because production had to be guaranteed and it was too risky to rely on a single company as the sole source of its chips."


That was a first for IBM, and although not mentioned there I seem to recall it had something to do with selling PCs to the government as well.

I seem to recall that IBM had second source rights to the 8088 too which presumably made the choice easier.

Fair enough, then I am most likely wrong here.

I have to say I'm confused now!

Feeling a bit nostalgic now as my first job was with IBM selling the original PC to small businesses. Not the hardest job!

Well, yes, if Intel had no competition at all they might have been able to force whatever garbage on the market.

I actually liked Itanium and don't see it as garbage, and see it as an unfortunate turn of events that I am not running one on this laptop.

Also don't get the point of the cheering for AMD, because apparently many don't get that without Intel licenses, which depend on Intel's existence, AMD won't have much to play with anyway.

20 or so years later, it is my impression that there are no compilers that can do enough ILP (on software not specifically written for the Itanium) to justify the Itanium’s VLIW.

Intel licensed some stuff to Cyrix, VIA, AMD, Harris and possibly others at the time. But if AMD didn’t license AMD64 back (by way of cross licensing) things might have been grim for Intel back in 2003-2008 when AMD was killing it previously.

Intel did nothing out of the goodness of their heart, and neither did AMD - so I’m not sure why it is bothering you that people are cheering for AMD for providing (at this point) a technically and economically better product, and hating on Intel for being a fat cat. The pendulum may swing back sooner or later, both technically/economically and public opinion wise.

I don't know much about Itanium, but that was a server architecture, right? That target platform would have made it difficult to achieve the same performance per watt as the mobile-focused ARM chips.

> we're also going to see someone give VLIW another shot

Dunno why people would keep thinking this is a good idea, but if someone wants to try eating it with a new VLIW design yet again, I’m always down for some popcorn.

Dynamically finding ILP at instruction dispatch time is always going to be smarter than a compiler trying to guess at compile time. IMO, it’s worth the area. Even if you assume the compiler gets it right and finds tons of ILP, it’ll get it wrong for the next hardware rev when your core inevitably changes and you have to break apart your VLIW instructions and get ILP back dynamically anyway. To say nothing of shared pipelines in an SMT processor, where VLIW basically runs counter to SMT and the only way to know which units are going to be available on a specific cycle is during runtime.

> and they're going to succeed this time.

I don’t see any reason why VLIW would succeed more in future attempts than in past attempts.

Even if compiler-only ILP is doomed to fail, maybe VLIW optimists are just hoping something useful might come out of everything we've learned?

For example, compilers already model the CPU's pipeline and ILP capacity to inform instruction selection. VLIW designs are an attempt to make that cooperation more explicit, similar in spirit to branch/prefetch hints. Maybe there's still a way to do that well?

And re: microarchitectural changes in the next hardware rev: https://news.ycombinator.com/item?id=25237056

These ideas might ultimately fail, but they seem worth looking in to, at least.

I think you are correct that there is a section of the field enchanted by the idea of VLIW and desperately hoping to justify any possible reason why building these chips was less of a bad notion than they’ve turned out to be each and every time.

I don’t think we’re really learning much by pretending building a new one is a good idea. Or that the last efforts had much value.

No one is saying don’t make progress on compilers. Look around you and you’ll see the things you just referenced already exist in practice in production today. (Among others: compiler/architecture co-design is a well established and fertile area of work. Compiling to intermediate formats which specialize for a device later is in use to some degree in App Store, which keeps programs in an intermediate format and thins binaries down to specific slices on demand for iOS devices, etc.) None of these technologies justify building a new VLIW chip or have frankly much to do with those efforts. The best thing we could learn from the huge amount of effort that’s gone into VLIW architectures is to actually learn the lesson that is in front of us, demonstrated repeatedly: that unless something substantial changes, VLIW does not produce a better architecture for processor cores and the assumption that it would was wrong.

> Intel tried to be brave with Itanium

Intel tried to be brave with the 432.

And failed.

The 8086 was rushed as a stopgap. It was never intended as a serious design.

Intel tried to be a little brave with the 286.

And failed.

Intel tried to be brave with the i860.

And failed.

Intel tried to be brave with Itanium.

And failed.

In the meantime Intel have made countless billions with souped-up 8086s. And we're stuck with them. Or were.

The JS rounding mode instruction wasn't even Apple's invention, it's standard ARMv8.

But yes, Apple spends a lot on CPU design and it shows; ARM's Cortex performance for things like NEON is way, way worse.

Yes, that’s true. But Apple heavily influences the standard, and they’re usually the first to actually implement new revisions; plus they pick and choose which extensions they want too. That being said, the “Apple added JavaScript instructions and that’s why their processors are fast at web browser benchmarks” meme needs to die.

From my impressions of seeing how these get implemented, it seems like some hardware features end up getting added to the chip by the hardware team and then the software engineers get told that it now exists. So that’s another reason why I doubt that they are “cheating” on benchmarks.

"Tricks" make a great difference in life. I can jump over a meter high with my bicycle because someone who knew did teach me the "bunnyhop trick". That and other "tricks" give you the ability to do things safely on a bicycle that most people consider impossible or too risky.

Knowing the tricks of a trade, like software development, means the difference between starving or success.

In the same way there are "tricks" that you know about relationships and money and work that can change your life radically.

"Trick" in English has different meanings. It can be something intended for deception or illusion, but it could also be a habit or mannerism.

In Apple's case, those tricks are not really tricks, but strategic informed decisions.

Apple has lots of knowledge about what the market needs and is willing to pay for, they have access to the purchase data of tens of millions of people.

If you have no access to real information, some decisions of the companies that do will look comical or nonsensical.

I know exactly why these things exist: it’s because they serve specific strategic interests that Apple has. They aren’t how the processor is getting great single-core scores on Geekbench.

The most informative post at the moment for me, with more detailed technical information for those who are interested in what's actually making Apple's M1 CPU fast, based on the article by Anandtech, is much lower on this page, by GeekyBear:


The AnandTech article is not bad, but I have heard that their tests misrepresent some microarchitectural details that they didn’t manage to guess the underlying design of correctly.

> misrepresent some microarchitectural details

Have you maybe also heard which details could have been misrepresented?

I heard that the 630 number for the ROB isn’t quite right.

I believe that TSO is actually a huge performance benefit. Without it, you would effectively need a barrier after every memory operation on ARM, reducing performance drastically.

Implementing the memory ordering in hardware is a huge performance benefit, but they didn’t have to do it with a switch like they have right now. ARMv8.4-RCpc would be enough, and I believe M1 supports these too.

(FWIW, Rosetta itself does “cheat”: the translated code and the runtime shares an address space, and memory accesses are not checked, so it’s possible to write an “Intel” binary that is aware of the runtime and can read emulator state.)

Isn't it a switch because the weak memory model, whilst providing fewer guarantees, is more performant? So they are trying to have both: the x86 mode, which isn't as performant but works a lot faster in hardware than a software implementation, and the faster, weaker ARM model.

The thing I mentioned should be better in that case since it’d enforce the right memory ordering on a per-instruction basis.

The ld/st instructions take a large amount of encoding space for all the addrmodes. I don't think the rcpc instructions and other acq/rel instructions were given the same encoding space.

People won’t believe it’s not Apple’s magic sauce and that ARM is really just that good until ARM processors start being a mainstream thing in Windows laptops. ARM’s ARM-X already looks pretty juicy and the follow-ups are sure to get better and better.

If Apple is free riding on how great ARM is, how come everybody else’s mobile ARM chips are such pants in comparison?

You (and apparently most of HN) are misunderstanding me: I’m not saying Apple isn’t doing great stuff or that they’re just coasting on ARM’s tech, but it is the ARM ISA that enables them to make the M1 so ridiculously good. But people won’t believe that and will attribute it to Apple’s vertical integration until they are holding a Windows laptop with an ARM CPU and crazy battery life, thermals etc.

And again: ARM already has the X in the pipeline which is going to close part of the ARM-Apple A-series gap.

No, we understand fine; we disagree with the claim that M1 is good because of the ARM ISA. Had Apple picked something else a decade ago we’d probably see them pulling ahead with that, too.

Then you don’t really understand, because Apple couldn’t have pushed x86 this far, even if they had gotten a license.

Also go pick a bone somewhere else man, how sad do you have to be to multi-downvote on HN? This isn’t Reddit.

I fail to see why Apple could not have pushed x86 "this far". I'd be happy to hear why you think so, though. I don't really understand the second part of your comment, however: I have no bone to pick with you, nor do I regularly downvote people. Least of all you, since the only interaction I recall having with you is your replies to my comments, which automatically disables downvoting.

It’s Apple, not ARM.

> 4/ So Apple simply cheated. They added Intel's memory-ordering to their CPU. When running translated x86 code, they switch the mode of the CPU to conform to Intel's memory ordering.

Uh, no. Implementing TSO in a highly performant way is not "cheating", it's a difficult engineering feat. And no, TSO is not some fancy Intel thingy. It's a standard memory model.
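For anyone wondering what TSO actually permits, here is a toy model, not a description of any real core: each thread gets a FIFO store buffer that drains to memory at arbitrary times, and loads check the thread's own buffer first. That is the usual textbook picture of x86-TSO. All names below (`Op`, `State`, `explore`, `outcomes`, `store_buffering`) are invented for this sketch. It enumerates every interleaving of the classic "store buffering" litmus test, with and without the buffers.

```rust
use std::collections::BTreeSet;

#[derive(Clone, Copy)]
enum Op {
    Store(usize, u32), // memory[addr] = value
    Load(usize),       // this thread's register = memory[addr]
}

#[derive(Clone)]
struct State {
    memory: [u32; 2],
    pc: [usize; 2],
    buffers: [Vec<(usize, u32)>; 2], // per-thread FIFO of pending stores
    regs: [u32; 2],                  // one result register per thread
}

/// Depth-first search over every possible next step of either thread.
fn explore(prog: &[Vec<Op>; 2], s: State, buffered: bool, out: &mut BTreeSet<(u32, u32)>) {
    let mut progressed = false;
    for t in 0..2 {
        // Step kind 1: drain the oldest pending store to shared memory.
        if !s.buffers[t].is_empty() {
            let mut n = s.clone();
            let (addr, val) = n.buffers[t].remove(0);
            n.memory[addr] = val;
            explore(prog, n, buffered, out);
            progressed = true;
        }
        // Step kind 2: execute the thread's next instruction.
        if s.pc[t] < prog[t].len() {
            let mut n = s.clone();
            match prog[t][n.pc[t]] {
                Op::Store(addr, val) => {
                    if buffered {
                        n.buffers[t].push((addr, val));
                    } else {
                        n.memory[addr] = val; // sequential consistency
                    }
                }
                Op::Load(addr) => {
                    // Forward from the newest matching entry in our own buffer.
                    let fwd = n.buffers[t].iter().rev().find(|e| e.0 == addr).map(|e| e.1);
                    n.regs[t] = fwd.unwrap_or(n.memory[addr]);
                }
            }
            n.pc[t] += 1;
            explore(prog, n, buffered, out);
            progressed = true;
        }
    }
    if !progressed {
        out.insert((s.regs[0], s.regs[1]));
    }
}

/// All reachable (r0, r1) results for a two-thread program.
fn outcomes(prog: &[Vec<Op>; 2], buffered: bool) -> BTreeSet<(u32, u32)> {
    let start = State { memory: [0; 2], pc: [0; 2], buffers: [Vec::new(), Vec::new()], regs: [0; 2] };
    let mut out = BTreeSet::new();
    explore(prog, start, buffered, &mut out);
    out
}

/// Thread 0: x = 1; r0 = y.   Thread 1: y = 1; r1 = x.
fn store_buffering() -> [Vec<Op>; 2] {
    [vec![Op::Store(0, 1), Op::Load(1)], vec![Op::Store(1, 1), Op::Load(0)]]
}
```

Without buffers (sequential consistency) the model yields three outcomes and never `(0, 0)`. With buffers it yields four, including `(0, 0)`: a store followed by a load to a different address is the one reordering TSO lets you observe. The weaker ARM model allows more than that, which is why translated x86 code needs either extra barriers or a hardware TSO mode.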

To add to this:

The POWER7 through POWER9 processors from IBM implement an equivalent to this called "SAO", or "Strong Access Ordering Mode". To quote an IBM engineer:

  > Currently, power has a weaker memory model than x86. Implementing a stronger memory model allows an emulator to more efficiently translate x86 code into power code, resulting in faster code execution.

What's interesting is IBM publishes pretty detailed manuals about their processors. You can take a look at the POWER9 CPU manual [0] and search for "SAO".

You'll see just how deep down support for this mode goes. To implement this sort of strong memory ordering requires modifications e.g. SAO mode introduces a new pipeline hazard to the L2 cache and requires deep support in the MMU, including corner cases like the virtual memory for VMs and nested VMs.

It's not a casual addition to the uarch.


It just occurred to me Apple has taken a very different approach to solving this problem.

From what I can tell, TSO on apple's CPUs is just process state implemented in their "high performance" cores that you get from flipping a bit in the MSR register (?).

IBM's version is implemented in the memory system, and is applied to pages of memory (on Linux you call mprotect to mark a page SAO). TSO-semantics are then applied regardless of which core on a potentially multiprocessor, SMP system accesses it.

The Apple implementation seems to be more limited -- but in a totally inconsequential way (can't share data between TSO and non-TSO processes). But it does beg the question, what happens when you have shared memory between a Rosetta 2 process and native ARM code? If the ARM code is running on a core without TSO, what gremlins come out to eat your data?

[0] https://www.setphaserstostun.org/power9/POWER9_um_OpenPOWER_...

Apple implemented this easier memory model without data sharing because, unlike IBM, their plan is to get rid of amd64. Why support strange amd64/ARM combinations when your plan is to fully migrate to ARM? You can do the easier work and use more die space for ARM optimizations, effectively showing developers that they’ll get a speed improvement when porting their code.

According to TSOEnabler, threads with the TSO property set simply aren’t scheduled onto cores that don’t support TSO.

As for sharing data, a non-TSO thread can basically assume that a thread using TSO is equivalent to a non-TSO thread using only ldapr/stlr for memory access. And those interactions are well-defined.

I don't think sharing is a big issue: TSO has some implicit acquire and release barriers on reads and writes, so the weaker side just needs to issue the relevant barriers explicitly.
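A minimal sketch of what "issue the relevant barriers explicitly" looks like from the weakly ordered side, using Rust's standard atomics. The function name `message_passing` is invented here. On AArch64 a `Release` store and an `Acquire` load are typically compiled to store-release and load-acquire instructions, but the exact instruction choice depends on the compiler and target features, so treat that mapping as an assumption.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Producer publishes a value, consumer waits for the flag and reads it.
/// Under TSO the ordering between the two stores (and between the two
/// loads) comes for free; on a weaker model it has to be requested.
fn message_passing() -> u64 {
    let data = Arc::new(AtomicU64::new(0));
    let ready = Arc::new(AtomicBool::new(false));

    let producer = {
        let (data, ready) = (Arc::clone(&data), Arc::clone(&ready));
        thread::spawn(move || {
            data.store(42, Ordering::Relaxed);
            // Release: the data store above must be visible before this one.
            ready.store(true, Ordering::Release);
        })
    };

    // Acquire: once we see `true`, everything the producer wrote before
    // its release store is guaranteed to be visible to us.
    while !ready.load(Ordering::Acquire) {
        std::hint::spin_loop();
    }
    let seen = data.load(Ordering::Relaxed);

    producer.join().unwrap();
    seen
}
```

With the acquire/release pair in place the function always returns 42. If both accesses to `ready` were `Relaxed`, the language would allow the consumer to observe the flag but still read the old value of `data`, which is the kind of reordering an x86 program never had to think about.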

On the M1 this should work on any core.

I believe in this case "cheating" is being used to mean that they were clever, working around the problem, not doing anything wrong.

Yeah. "How do we make the Wii run Game Cube games? We cheat by putting a Game Cube in the Wii!"

It's a playful use of the term.

Or for us old [feeling] folks: it’s “cheating” by getting a world map and playing through the puzzles and minibosses we would still encounter... not getting a Game Genie and sliding under Mother Brain.

Edit: Lol I know, downvote away, icgaf I haven’t played a game besides minesweeper in months.

I honestly just don't understand what you're trying to say.

I honestly was just kind of flow of consciousness riffing so I don’t blame you. Just overextending the metaphor of gaming applied to cheating with applied knowledge versus cheating with deus ex machina

> When running translated x86 code, they switch the mode of the CPU to conform to Intel's memory ordering.

I'm curious whether this means that while any app is running in Rosetta2, the entire M1 CPU is in "Intel TSO mode" and therefore all processes run slower as a result, or whether it can do that for just the Rosetta2 process(es).

I don't have any specific info, but I would be very surprised if this MSR setting wasn't on a per-core basis. So only cores that are currently running code that requires x86-TSO would be running in that mode.

Per core, yes. It gets set on entry and exit from the kernel for threads that have been marked as needing it.

A cpu (core) executes one process at a time. There is a lot of cpu (core) global state that gets saved / restored upon context switch (either between processes or just between user mode and kernel mode).

As a side note, there is often a bit of terminology confusion on whether a cpu is the "die" or the "core". When discussing microarchitecture, core == cpu in most cases.

Finally a company has the guts to move us from the straitjacket of x86. I hope this is just the beginning of much needed innovation in CPU architecture. Due to the MS/Intel duopoly we had to suffer x86 for decades with few real options.

I think the M1 Macs are the first really significant product that Apple has introduced since Steve Jobs's death.

Apple Watch put a Unix computer with a 16-hour battery life, an LTE modem & multiple health sensors on my wrist. I’d count that as significant.

Well, I respectfully disagree. The Apple Watch is a successful product, but it really didn't completely disrupt a large existing industry like the iPhone did. I think it's possible that the M1 will trigger the death of x86 personal computers. If that happens, it won't have been an iPhone-level disruption, but it will be a lot more significant than the watch.

> The Apple Watch is a successful product, but it really didn't completely disrupt a large existing industry

For some values of large?

>Apple now sells more watches than the entire Swiss watch industry


Casio sells more F-91W digital watches per year than the entire Swiss watch industry, too...

Apple's watch was their usual polished refinement of existing, mainly Chinese, SIM-equipped smartwatch designs. They didn't invent a market segment out of a vacuum.

The Swiss watch industry isn't what it was. Most people I know don't wear a watch.

Wait, where did the goal posts go. Has anyone seen them? I swear they were here a minute ago.

To be fair to the person you're responding to, the goalposts look to have been "disrupt a large existing industry." "Large" is obviously subjective, but if the example of the disrupted industry is the watch industry, "the watch industry isn't large so it doesn't count" is a reasonable response within the constraints of the original goalposts.

Except it did by creating a larger market. Some of that was flashy at the start like the gold watches but ultimately created a new market. Personally I quit wearing a watch sometime around 2010 when I found I was checking my phone and didn't need a watch. Now I am checking my watch for time and weather and not pulling my phone out.

I suppose so, but who was saying this in 2014 and 2015 when the other smart watches were out and Apple's offering was only a rumour? Back then the consensus was the watch would be a watershed test of whether the team at Apple had it in them. Now that they've conclusively nailed it, we discover it didn't count after all.

Of course the expectation was that the Apple Watch should be just as popular as the iPhone and transcend the watch industry completely. Wearables would be the next utterly indispensable technology, maybe even threatening the phone itself. In comparison, 'merely' dominating the existing watch industry and gobbling large chunks of the fitness tracker market seems underwhelming.

> who was saying this in 2014 and 2015 when the other smart watches were out and Apple's offering was only a rumour?

I was and still am. As I said, most people I know don't wear a watch, myself included. It was the same in 2014 and 2015.

But it clearly isn’t a small industry, and wearables are now responsible for c$40bn annual revenue at Apple (includes AirPods now though, also invented after jobs), so dismissing it does seem to be changing the goalposts.

The goalposts were with widely used computing devices this entire time, but some people want to move them next to typewriters and watches.

I hadn’t worn a watch for decades. Then Apple Watch came along and I’ve worn one every day for the last five years. I call that disruption.

"it really didn't completely disrupt a large existing industry" - Have you checked market share of Apple Watch ?

Both of these comments miss the point.

The most significant product produced since Jobs passing is the continuous focus on developing technology in the open that can ultimately be applied to groundbreaking products.

The M1 is the natural evolution of the A series processors and the software development tools Apple has been working on in the open for years.

The next major mobile product from Apple is the natural evolution of the micronization of hardware systems seen on the Apple Watch, tied together with secure and private storage of personal data.

Like iPhone and iPad, Apple Watch is a great product in of itself. But the true product of Apple has been focused iterative delivery.

> “But the true product of Apple has been focused iterative delivery.”

... Except that this is not a product, by any definition ever.

Where are the UNIX APIs on watchOS?

Having fun with writing CLI apps and daemons for the watch?

Plus POSIX support in IoT-like devices has existed for ages.

watchOS lets you call many of the UNIX APIs, you (as a third-party developer) just aren’t allowed to make command-line tools and daemons.

Try to make a watchOS app only with UNIX APIs.

Try to make an app on any platform only with UNIX APIs? GUI toolkits and other platform frameworks always exist.

It works perfectly alright on UNIX clones, with a terminal and keyboard, both are missing from Apple watch.

2020 is finally the year of RISC on the desktop

it was a joke. apple has literally made RISC desktops before (which you failed to link)

Indeed, but my list wasn't going to be exhaustive anyway.

uh, no.

Well, maybe... “On Friday, September 1, 2017, after a round of layoffs that started in Oracle Labs in November 2016, Oracle terminated SPARC design after the completion of the M8. Much of the processor core development group in Austin, Texas, was dismissed, as were the teams in Santa Clara, California, and Burlington, Massachusetts. SPARC development continues with Fujitsu returning to the role of leading provider of SPARC servers, with a new CPU due in the 2020 time frame.”

How long before 2020 Fujitsu rebinned SPARC chips are sold to OEMs for SPARCstations runnin’ popOS?

It’s not like System76 is going to be able to buy M1s (dang sounds like a weapon when you put it that way ) and Ryzen might not compete with near term successors to M1.

Beware what you wish for.

I see future regrets. ARM-based chips tend to be a closed ecosystem.

Is there any reference/confirmation for all of this? The fact that the author puts "Swift" in quotes and doesn't know that refcounting is inherited from Objective-C, and that the whole OS relies on it, makes the rest of the claims a bit more gossipy.

Also, AFAIK MS's translation of x86 -> ARM works differently than Rosetta 2, so the results aren't just because "Apple added Intel's memory ordering to their CPU" (whatever that means)

The author did type "Swift" but it applies to Obj-C also, obviously. I wouldn't discount the whole post because of that slip-up. The optimizations made to reference counting are more than "gossip". David Smith posted about this weeks ago also, if you need more reference/confirmation than that I'm not sure what to tell you.

may also apply to Rust where we often make use of reference counting when the borrow checker isn’t enough

ARC probably yes (assuming the standard library annotates the atomic operations properly). RC no. That being said, it sounds like in general uncontended atomic operations are faster so this might benefit a lot of atomic code in any language.

While I was also thrown off by the lack of reference to Objective-C, in context I think the author used quotes around “Swift” to make it clear it’s a proper name and that he’s not talking about a “swift (ie fast) programming language”.

Re: “JavaScript” instructions. I think it means FJCVTZS which is not Apple specific and only yields a fairly minor improvement. It’s not even really that JS specific, it’s definitely designed to help with JS numbers but it’s not like a large chunk of JS functionality is in silicon.

The memory ordering stuff is cool, calling it cheating might be a bit harsh although it would be cool if other ARM cores would support something like this.

Quick note: the magic "help JS function" is literally just a double->int conversion that uses the x86 sentinel values and rounding modes. The effect on JS is that it removes a set of branches following ((int)someDouble) that are needed to match the x86 semantics that JS mandates. The instruction is not magically faster than x86. In fact more or less by definition it's just doing what x86 does.
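For anyone wondering what that conversion looks like when done in software, here is a minimal Rust sketch of ECMAScript's ToInt32 (truncate toward zero, wrap modulo 2^32, reinterpret as signed), which is the operation FJCVTZS is documented as performing in a single instruction. The function name `to_int32` is made up for illustration, and this is not how any particular JS engine implements it.

```rust
/// ECMAScript ToInt32: NaN and infinities map to 0; everything else is
/// truncated toward zero, wrapped modulo 2^32, and reinterpreted as signed.
fn to_int32(x: f64) -> i32 {
    if !x.is_finite() {
        return 0;
    }
    // rem_euclid returns an integral value in [0, 2^32), so the u32 cast is exact.
    let wrapped = x.trunc().rem_euclid(4294967296.0);
    (wrapped as u32) as i32
}

fn main() {
    for x in [3.7, -3.7, 2147483648.0, 4294967301.0, f64::NAN] {
        // Rust's plain `as i32` cast saturates instead of wrapping, so the two
        // columns differ for out-of-range inputs.
        println!("{x}: to_int32 = {}, `as i32` = {}", to_int32(x), x as i32);
    }
}
```

The out-of-range cases are where a generic convert instruction and the JS rule disagree, which is why a JIT without FJCVTZS needs extra checks after the conversion.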

I am tired of people bringing up that instruction as though it's magically making JS faster compared to intel. At best it's a leveling the playing field instruction.

Implementing TSO isn't "cheating", it's hard work that other companies aren't willing to do.

Ref counting isn't magically faster than other GC methods - there's arguments that it is slower - it does however use much less memory than pretty much any other GC scheme, hence the lower memory needs on iOS vs Android.

RC vs GC really depends on workload/memory and the volatility of the allocate to deallocate timeframe.

I think the better argument here is that iOS and MacOS use RC in the underlying objc libs. Having a CPU that works better around that makes sense to increase performance for those particular OSes.

Oh I didn’t mean to distract from the improvements to refcounting - it sounds very much like they’ve significantly improved the perf of uncontended atomic increment vs Intel which is obviously a win on iOS and OS X as objc and Swift both inline refcounting - I think on windows/COM it is through a virtual call? In which case improving the increment itself seems like it would not be a huge win.

As far as perf the general argument is that dropping the need for refcounting saves time, and that removal also helps caching due to reduced per-object size.

That said I’m not sure if those comparisons are comparing to generational or moving collectors (which are the low latency collectors) because those start needing write barriers.

Finally I see somebody mentioning the memory ordering. I've been wondering for a while how they've managed to get multithreaded x86 emulation working fast. I had heard mentions of "kernel support", but hadn't really looked into it much. It turns out that what I had guessed was correct, and that they simply added intel's memory ordering as an optional mode to their processor; "When running translated x86 code, they switch the mode of the CPU to conform to Intel's memory ordering."

RISC-V also has a weak memory ordering by default, with TSO (like x86) as an optional feature. So this is not a novel approach in any way.

Well sure, POWER7 implemented a strong-ordering mode specifically for x86 emulation a decade ago [1]

[1] https://marc.info/?l=linux-mm&m=121382852909406&w=2

Thanks, I have been looking for this for ages and I couldn't find a reference. I was starting to think I imagined it.

Still implemented up to POWER9 as well, due to problem-state backwards compatibility by two generations I presume.

There's a spec for that, but I don't know of any synthesizable RV cores with optional TSO that have actually been made, whether ASICs _or_ configurable logic.

That does make it pretty novel on Apple's end. I agree with Heinlein that "a good idea is worth one bottle of scotch"; that the execution is where the vast majority of the work is and where the vast majority of praise should lie.

And that's coming from someone who's not as bullish as some on Apple's execution. I think AMD's cores are better and compared 1/1 on the same process node Apple would look a little behind overall. But Apple for sure gets praise for their execution here (optionality in store order), which is remarkably good.

Thank you! Posting anything long-form on Twitter is madness. Whether it's posting an image of long-form text, or tweeting. Individual. Sentences. Like. You're. Captain. Kirk.

Anandtech has already posted a list of ways that Apple's big ARM core implementation (in the iPhone version of the chip) differs from industry norms, ARM and x86.

Some examples


What really defines Apple’s Firestorm CPU core from other designs in the industry is just the sheer width of the microarchitecture. Featuring an 8-wide decode block, Apple’s Firestorm is by far the current widest commercialized design in the industry.

Re-order Buffer:

A +-630 deep ROB is an immensely huge out-of-order window for Apple’s new core, as it vastly outclasses any other design in the industry. Intel’s Sunny Cove and Willow Cove cores are the second-most “deep” OOO designs out there with a 352 ROB structure, while AMD’s newest Zen3 core makes do with 256 entries, and recent Arm designs such as the Cortex-X1 feature a 224 structure.

Execution Units:

On the Integer side, we find at least 7 execution ports for actual arithmetic operations. These include 4 simple ALUs capable of ADD instructions, 2 complex units which feature also MUL (multiply) capabilities, and what appears to be a dedicated integer division unit.

On the floating point and vector execution side of things, the new Firestorm cores are actually more impressive as they see a 33% increase in capabilities, enabled by Apple’s addition of a fourth execution pipeline. The FP rename registers here seem to land at 384 entries, which is again comparatively massive. The four 128-bit NEON pipelines thus on paper match the current throughput capabilities of desktop cores from AMD and Intel, albeit with smaller vectors.


Apple’s designs are monstrous, and the A14 Firestorm cores continue this trend. Last year we had speculated that the A13 had 128KB L1 Instruction cache, similar to the 128KB L1 Data cache for which we can test for, however following Darwin kernel source dumps Apple has confirmed that it’s actually a massive 192KB instruction cache. That’s absolutely enormous and is 3x larger than the competing Arm designs, and 6x larger than current x86 designs, which yet again might explain why Apple does extremely well in very high instruction pressure workloads, such as the popular JavaScript benchmarks.

The huge caches also appear to be extremely fast – the L1D lands in at a 3-cycle load-use latency. AMD has a 32KB 4-cycle cache, whilst Intel’s latest Sunny Cove saw a regression to 5 cycles when they grew the size to 48KB.


There are plenty of other details in the piece, although bear in mind that the M1 is more beefy than the A14 detailed here (more i/o bandwidth, larger system cache, etc.).

Thanks. These articles focusing on stuff like JavaScript instructions as being the reason the M1 is so fast are off the mark. Apple implemented an 8-wide out-of-order core. That’s not only a huge accomplishment, it also defies what was until recently conventional wisdom: that going much beyond a 4-wide core would be well past the point of diminishing returns in an OOO core.

I wish more people would just read Anandtech (which is a well-established site, no idea why so many are not reading it) instead of making crap up like this Twitter thread.

The JavaScript instruction has nothing to do with it, and even the JSC guy from Apple confirms it. (The instruction wasn't even used at the time the comparison was written.)

And the memory model isn't even cheating or specific to Intel x86.

And worst of all, it is getting freaking hyped up all over the place, even on HN. It is like watching misinformation spread like wildfire in real time.

This is the kind of thing I expected to see Apple throw at the CPU design, given the availability of high transistor density of the 5nm TSMC process.

You've got to wonder what 3nm will bring, not just for Apple, but for the CPU industry in general.

Comparing the 7nm AMD Zen 3 to the 5nm A14 leads to a simple extrapolation to 3nm: we could see 256KB+ L1 caches, 1K deep ROBs, etc...

What kind of limits does Apple face in adding more cores? I don't have a good sense of how big these Firestorm cores are compared to, say, Zen3. Or how big the overall chip is and what the size limitations are.

Take, say, the Ampere Altra Max, an ARM processor with 128 cores. Can Apple do that theoretically, or is Ampere only able to do that because they make significantly smaller cores, or is it perhaps because they don't have an SoC?

I am just trying to understand better the tradeoffs between smaller, weaker cores vs. larger but fewer cores. Like if Apple wants to compete at the top end where e.g. AMD has 64 cores, will that even be possible for Apple?

I believe power also has an 8-wide decoder.

The interesting thing is the very deep ROB. Conventional wisdom is that because of the cumulative branch misprediction probability, ROBs hit diminishing returns relatively quickly. Apple apparently has a best-in-class predictor, but still that ROB is huge.
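A back-of-the-envelope way to see the diminishing returns: if each branch is predicted correctly with probability p, independently, then the chance that all k unresolved branches in the window are right is p^k. The sketch below assumes one branch every 5 instructions, which is a rough rule of thumb plugged in for illustration and not a measured figure for any of these cores; the function names are hypothetical.

```rust
/// Probability that every one of `branches` in-flight branches was predicted
/// correctly, assuming independent predictions with the given accuracy.
fn all_correct_probability(accuracy: f64, branches: u32) -> f64 {
    accuracy.powi(branches as i32)
}

/// Number of branches in a full window, ASSUMING one branch per
/// `instrs_per_branch` instructions (an illustrative rule of thumb).
fn branches_in_window(rob_entries: u32, instrs_per_branch: u32) -> u32 {
    rob_entries / instrs_per_branch
}

fn main() {
    // ROB sizes as quoted from the Anandtech piece in this thread.
    for rob in [224, 256, 352, 630] {
        let k = branches_in_window(rob, 5);
        for acc in [0.95, 0.99, 0.999] {
            let p = all_correct_probability(acc, k);
            println!("ROB {rob}, {k} branches, accuracy {acc}: p = {p:.3}");
        }
    }
}
```

Under these assumptions the far end of a 630-entry window only holds useful work a minority of the time unless prediction accuracy is well above 99%, which fits the guess that the predictor has to be very good for such a window to pay off.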

Maybe they do tricks like automatically converting some branches to predication.

"They did something in their CPU to double the speed of reference counting."

I wish we'd get that on x86_64 too. Especially since it seems like an obvious and easy win.

For all the references to this, I haven't yet seen a description of what the optimization is. Anyone know?

I think I found the answer in this tweet.


Here's a snipped quote.

"this further improvement is because uncontended acquire-release atomics are about the same speed as regular load/store on A14"

x86 has total store order: all stores are queued (in the store buffer) in the CPU and then pushed to the cache subsystem, and the last item has to wait for the previous ones. Arm doesn’t have TSO, so the CPU can reorder stores in the queue, issue stores out of order, etc., depending on memory barrier use.
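To make the ordering point concrete, here is a small Rust sketch of the classic message-passing pattern. On a weakly ordered CPU, the Release store and Acquire load are what prevent the flag from being observed before the data; on a TSO machine such as x86 the hardware already keeps stores in order, so these orderings cost nothing extra at the instruction level. The names `publish_and_read`, `data`, and `ready` are made up for illustration.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// The producer writes a payload and then raises a flag; the consumer waits
/// for the flag and then reads the payload.
fn publish_and_read() -> u64 {
    let data = Arc::new(AtomicU64::new(0));
    let ready = Arc::new(AtomicBool::new(false));

    let producer = {
        let data = Arc::clone(&data);
        let ready = Arc::clone(&ready);
        thread::spawn(move || {
            data.store(42, Ordering::Relaxed);
            // Release: the store to `data` above may not be reordered after this.
            ready.store(true, Ordering::Release);
        })
    };

    // Acquire: once the flag is seen, the load from `data` below may not be
    // reordered before this load.
    while !ready.load(Ordering::Acquire) {
        std::hint::spin_loop();
    }
    let value = data.load(Ordering::Relaxed);
    producer.join().unwrap();
    value
}

fn main() {
    println!("consumer read {}", publish_and_read());
}
```

If both flag accesses were `Relaxed` instead, a weakly ordered machine would be allowed to let the consumer see the flag but still read the old payload.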

If you read the Catfish_Man tweet, he says that even under rosetta refcounting is faster, and under rosetta M1 has TSO.

What's the actual optimization though?

The tweet author in the linked thread later writes:

"Weaker memory model makes acquire-release atomics possible to implement much more efficiently, in exchange for not hiding some classes of multithreading bugs"

Have you read the tweet and thread? If so, I don’t understand what you’re asking.

Twitter wasn't making the relevant responses really visible. _kbh_ pointed it out

Refcounting in a single-threaded context is extremely quick and efficient. What's expensive is atomic reference counting, as seen e.g. in C++ shared_ptr<> and Swift ARC.
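Rust makes this distinction visible in the type system, so here is a short sketch: `Rc` bumps a plain integer and cannot cross threads, while `Arc` uses atomic increments and decrements. The helper function names are hypothetical.

```rust
use std::rc::Rc;
use std::sync::Arc;
use std::thread;

/// Non-atomic counting: each clone is an ordinary integer increment.
fn rc_count_after_clones(n: usize) -> usize {
    let shared = Rc::new(0u32);
    // Keep the clones alive until the count has been read.
    let _clones: Vec<Rc<u32>> = (0..n).map(|_| Rc::clone(&shared)).collect();
    Rc::strong_count(&shared)
}

/// Atomic counting: clones and drops may race across threads, so every
/// increment and decrement is an atomic read-modify-write.
fn arc_count_after_threads(threads: usize) -> usize {
    let shared = Arc::new(0u32);
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let local = Arc::clone(&shared); // atomic increment
            thread::spawn(move || {
                let extra = Arc::clone(&local); // atomic increment
                drop(extra); // atomic decrement
                // `local` is dropped when the closure returns: atomic decrement
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    Arc::strong_count(&shared)
}

fn main() {
    println!("Rc count with 3 clones alive: {}", rc_count_after_clones(3));
    println!("Arc count after 4 threads finished: {}", arc_count_after_threads(4));
}
```

Trying to move an `Rc` into `thread::spawn` is a compile error, which is the language forcing you onto the more expensive atomic version only where sharing across threads actually happens.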

Interestingly the Twitter thread is saying uncontended atomic operations are generally as fast as non-atomic, so I don’t think the weaker memory model is the totality of the reason.

Another thing is apparently their branch predictor is specially tuned to work with how Objective-C dispatch works (which iirc Swift inherits in places too, and not just when interfacing with Obj-C APIs)

>8/ Another "magic" trick is how their "Swift" programming language uses "reference counting" instead of the "garbage collection" in Android. They did something in their CPU to double the speed of reference counting.

What the hell does that even mean ?

Disregarding the nonsensical attribution of magic to Swift vs Java, if "reference counting" on translated x86 code is faster, then I suppose M1 does "lock add / cmpxchg" or whatever faster than x86.

I would love to know how. Other than just as a side-effect of a SoC, UMA architecture.

Maybe this: https://twitter.com/Catfish_Man/status/1326238434235568128

fun fact: retaining and releasing an NSObject takes ~30 nanoseconds on current gen Intel, and ~6.5 nanoseconds on an M1


Weaker memory model makes acquire-release atomics possible to implement much more efficiently, in exchange for not hiding some classes of multithreading bugs

I guess I understand the ARM behavior, but both this thread and the original one talk about x86 retain/release being faster on M1 as well.

That's the part I'd love someone knowledgeable to explain.

> What the hell does that even mean ?

> double the speed of reference counting

The simplest explanation is low-latency system memory making atomics faster in some contexts (faster cache write-back/fetch).

Then atomic operations (using Acquire/Release) are theoretically faster on ARM due to the weak memory ordering (I'm not sure if this helps for translated x86 code).

Potentially other reasons can include:

- maybe optimized fetch add/sub instructions (or, more precisely, the ARM instructions used by fetch add/sub functions in higher-level languages)

- maybe tweaks to the coherence protocol

- maybe specialized optimizations in the pipelining

All in all it means Apple focused on making sure fetch add/sub are fast (most likely as this is what is normally used for Rc).
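For reference, a hand-rolled atomic reference count typically boils down to a fetch-add on retain and a fetch-sub on release. This is a minimal sketch using the orderings that Rust's `Arc` uses (Relaxed increment, Release decrement, Acquire fence before destruction). It is not Apple's retain/release implementation, and `RefCount` is a made-up type.

```rust
use std::sync::atomic::{fence, AtomicUsize, Ordering};

/// A bare reference counter, starting at one owner.
struct RefCount {
    count: AtomicUsize,
}

impl RefCount {
    fn new() -> Self {
        RefCount { count: AtomicUsize::new(1) }
    }

    /// Retain: a single atomic fetch-add. Relaxed is enough because the
    /// caller already holds a reference.
    fn retain(&self) {
        self.count.fetch_add(1, Ordering::Relaxed);
    }

    /// Release: a single atomic fetch-sub. Returns true when the caller just
    /// dropped the last reference and may destroy the object.
    fn release(&self) -> bool {
        if self.count.fetch_sub(1, Ordering::Release) == 1 {
            // Make every other thread's earlier writes visible before teardown.
            fence(Ordering::Acquire);
            true
        } else {
            false
        }
    }

    fn current(&self) -> usize {
        self.count.load(Ordering::Relaxed)
    }
}

fn main() {
    let rc = RefCount::new();
    rc.retain();
    println!("count after retain: {}", rc.current());
    println!("first release was last? {}", rc.release());
    println!("second release was last? {}", rc.release());
}
```

Both hot paths are one atomic read-modify-write each, so how fast the core executes an uncontended atomic add is most of the cost of a retain/release pair.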

I think it says refcounting is twice as fast for Swift on M1 compared to Swift on x86.

It is possible because if you don’t have TSO (total store order), you can employ atomic operations without waiting for or flushing previous items in the CPU’s store buffer.

But apparently they are faster even in TSO mode.

Maybe they aggressively speculate around them.

They also seem to have added video codec decoders, because it can run 8k timelines pretty well in final cut. 4k without dropping frames even on the macbook air.

Every GPU, including those from Nvidia, AMD, and Intel, has a video decoder.

Yes, but I think the latest codec that canon uses in their newest cameras for 10 bit 8k raw are now supported. I'm not sure if it's hvac265 or something else

FWIW, the name is HEVC or H.265. AVC is H.264.

This yes, but it still gave me a chuckle to read HVAC, because I often see HEVC and think HVAC.

I do need a significant amount of cooling when doing GPU-heavy things.

When I read this on Twitter yesterday, it seemed like it was just a summary of what has already been reported by others elsewhere. Not sure why it’s worth posting here, where all this info has already been covered in more detail in earlier threads.

Is the ARM ISA fundamentally faster than x86? If the underlying architectures are converging as described, it kinda seems that the differences between the two instruction sets are more legal than technical.

I think there's a couple things here. x86 defines a pretty strict ordering of memory operations, where ARM is more relaxed; as discussed elsewhere in the thread, this means atomic operations, used for synchronization, will need to wait for pending stores to finish on x86, but not on ARM.

The other thing is that the M1 processor seems to be a lot wider than contemporary processors. This means it can (potentially) do more operations per clock. Wider processors are harder to clock faster, but it works for Apple. Apple has no desire to put in the cooling you need to get chips running at 4+ GHz, so it's not a big deal if their chips are clock limited. On the other hand, Intel and AMD like their cores to approach 5 GHz at the top end.

Can you elaborate on the "Wider processors are harder to clock faster" part?

A clock is an up volt and a down volt. It takes a period of time to reach either up or down.

If I have to get the clock to reach 2 cores/units at the same time, that is easier than, say, 4 cores/units. As you increase the clock speed, the time available to reach all the cores decreases. So if you are wider, you need to reach more cores/units in the same period of time, hence "harder".

EDIT (to make it a little more clear): the voltage change is typically represented in books as a vertical line, but that is not the case; it's diagonal and fuzzy. By fuzzy I mean it's not a straight line but will have some tiny mini downs on the way up.

Different parts of the circuit are going to respond to the up or the down. They can also vary based on the exact voltage. For example, if I have a 5V up, one circuit might consider 4.8V to be up and another could be 4.9 or 4.7.

Silicon has improved, but there still are limits of scale based on size, volts and timing.

As far as I understand, AMD has come out and said it does not make sense for them to make decoders wider than 4 instructions. It seems the CISC architecture creates an upper limit for decoders, as complexity rapidly grows when you have no idea where the next instruction begins in a variable-length ISA.

So Apple has twice as many decoders, eight, and may actually be able to keep adding to that number while AMD and Intel may get stuck on 4, thanks to the x86 CISC legacy.
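The instruction-boundary argument can be illustrated with a toy model. With fixed 4-byte instructions every start offset is known up front, so many decoders can work independently; with a variable-length encoding the start of instruction N+1 depends on having at least length-decoded instruction N. The encoding below (first byte holds the instruction's total length) is entirely hypothetical and is not x86, and real x86 front ends have ways to mitigate this, so treat it as intuition only.

```rust
/// Fixed-width ISA: the i-th instruction starts at i * 4, so all start
/// offsets can be computed independently of the instruction bytes.
fn fixed_width_starts(code_len: usize) -> Vec<usize> {
    (0..code_len / 4).map(|i| i * 4).collect()
}

/// Toy variable-length encoding (hypothetical): the first byte of each
/// instruction is its total length in bytes. Each start offset depends on
/// the previous instruction, so the walk is inherently sequential.
fn variable_width_starts(code: &[u8]) -> Vec<usize> {
    let mut starts = Vec::new();
    let mut pc = 0;
    while pc < code.len() {
        starts.push(pc);
        let len = code[pc] as usize;
        if len == 0 {
            break; // malformed stream: stop instead of looping forever
        }
        pc += len;
    }
    starts
}

fn main() {
    println!("fixed:    {:?}", fixed_width_starts(16));
    println!("variable: {:?}", variable_width_starts(&[1, 3, 0, 0, 2, 0, 1]));
}
```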

Intel has been at 5 decoders since Skylake (~2015).

I'm not sure why everyone focuses on the decoders as the primary determinant of width. There are other bottlenecks which may be narrower than the decoders, and the decoders may not even be used when a uop cache or something like a "loop buffer" (LSD on Intel) is present.

So AMD doesn't believe that wider chips aren't useful or that 4 is a limit, because they went to 5-wide in Zen (or 6, depending on how you count it) and I expect them to go wider in the future.

Intel went from 4 wide (narrowest bottleneck) to 5 wide in Ice Lake.

Wider chips are the future, and there is no "hard wall" at 4 for x86, just like any earlier width increase: just constantly diminishing returns.

Isn't the extra 2 on Zen not a decoder unit, but rather a 6-way dispatch[1], with the op-cache being used to keep the execution units busy? Or was that the reason behind calling it 5- or 6-wide, depending on how one counts it, considering cache misses and all?

[1]: https://images.anandtech.com/doci/16214/Zen3_arch_5.jpg

Little yes, little no.

The set of x86 instructions people actually use is sufficiently small that (ignoring the amortized cost of the rest of the ISA, which should be considered, but we can't really know how much space it actually takes up on the die) the two architectures have, to first order at least, converged, such that the tricks are now all in the details like memory ordering and scheduling rather than the actual instructions per se.

No, just a lot of brute force and focused optimization going on in the M1. There are a few advantages (it’s a newer architecture after all), but nothing that justifies the current difference.

x86, as a legacy architecture, has a lot of baggage and complexity and takes more space and energy to decode and translate instructions to micro-ops. Also because it was not designed as a RISC arch, but rather "backported" as one, it misses out on some RISC advantages.

This is why you don't see Intel being competitive in low power applications where ARM excels.

About TSO, I do wonder whether Apple will stop implementing it in its future chips when the transition to Apple silicon is over and Rosetta 2 is dropped.

Can someone say more about the JavaScript optimization?

They might be referring to the ARM FJCVTZS instruction[0].

It speeds up converting JavaScript doubles to 32-bit integers.

[1] is the WebKit commit that added support for FJCVTZS; it was a 0.5 to 2% speedup, which is a big win for a mature optimizer.

[0] https://stackoverflow.com/a/50966903/

[1] https://bugs.webkit.org/show_bug.cgi?id=184023#c24

In addition to the FJCVTZS instruction, they added more floating-point execution units and registers. They also lowered the latency of pipelined FP instructions.

> They did something in their CPU to double the speed of reference counting.

Can anyone explain what they did exactly?

Also, what are their JavaScript specific instructions?

I wonder if TSO "violates" ARM standard conformance in some way? Perhaps not since it makes the memory ordering stronger, and is therefore upwards compatible. Or perhaps Apple doesn't care any more about strictly conforming to ARM's specs, which would be interesting news.

M1 does follow standard ARM memory ordering rules in general, TSO is only enabled for processes running under Rosetta2 emulation

It probably would be legal to enable TSO everywhere as you said, but weaker ordering lets the CPU do more aggressive optimizations

> 18/ Another "magic" trick is how their "Swift" programming language uses "reference counting" instead of the "garbage collection" in Android. They did something in their CPU to double the speed of reference counting.

> 19/ ...even when translating x86 code, all that reference counting overhead (already more efficient than garbage collection) gets dropped in half. Yet another weird performance enhance to add to all the others.

Not at all, as proven by known benchmarks.


Putting refcounting in the CPU was the only way Apple managed to make it fast enough.

Objective-C only went with refcounting because Apple failed to make their tracing GC work flawlessly across frameworks with mixed compiler flags alongside C semantics, thus having the compiler automate retain/release was a saner option.

Likewise, Swift had a requirement to have flawless interoperability with Objective-C runtime and libraries, so the natural option was to adopt reference counting instead of a translation layer across both worlds.

.NET / COM interoperability is a good example of how integrating a tracing GC with reference counting can turn into an engineering feat.

Speaking of which, UWP, which is 100% COM based, is somehow noticeably slower than pure Win32 applications; besides the sandboxing, the main reason is naturally AddRef/Release everywhere.

Hence why C++/WinRT as C++/CX replacement is full of tricks, moving destruction to background threads, wrapping COM handles in stack allocations, taking advantage of constexpr.

So yeah, it is tricks and marketing in how those tricks are sold, especially to crowds that care more about the next SPA framework release than how compilers and CPUs work.

It would be really interesting to see the performance difference of TSO on and off for the arm assembly.

This is about as accurate as Rob Graham's other hot takes. He has no expertise in CPU microarchitecture, and he's way off the mark again.

>When you recompile code for a new architecture, it usually breaks.

Yeah, then your code is shit and was relying on undefined behaviour.

Twitter makes no sense as platform for long reads. The format is atrocious and difficult to follow. (When all that you have is a hammer all problems seem a nail)

>Please don't complain about website formatting, back-button breakage, and similar annoyances. They're too common to be interesting.


This was a legitimate criticism back when there was no way to thread tweets and you had to read them in reverse.

Now that you just scroll down like any other website, it just sounds like tired complaining.

This complaint is about more than mere website formatting (per HN guidelines). There's now 11 distinct reply threads of fragmented discussion to dig into, for anyone that might be interested. Hokusai is right to complain about this. And people savvy enough to write the article should know better. Post it on a blog and tweet the link. How hard is that?

It is hard, some people are plainly incapable of expressing themselves in article form. Or refuse to. There was plenty of shaming already and it did not help.

I wasn’t commenting on his essay writing skills. I appreciate many people aren’t wordsmiths. Literally the same 11 points in a single block of contiguous text, posted anywhere (GitHub, LinkedIn, Medium etc etc).

It is still atrocious. Designers have developed through the ages great fonts, layouts and a myriad of tools to create text that is easy to read.

Not having to read the text backwards is an extremely low bar; I expect more from a text.

Twitter is the wrong tool for the task, a very bad one.

> reference counting is faster than garbage collection

lol, not taking the bait

But reference counting is (potentially) faster than garbage collection. At least under some circumstances:

- modern architecture having fast atomic fetch_add/sub

- weak memory ordering making atomic fetch_add/sub even faster (if Acquire/Release ordering is used)

- optimize the usage of reference counting by eliminating pointless reference counting, e.g. as done by Objective-C/Swift automatic reference counting (ARC). Or e.g. done when using borrows in rust for everything except the places you know you really need to clone the Atomic Reference Counter (also ARC but a different one ;=) ).
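The last point in the list above is easy to show in Rust: passing a borrow costs no reference count traffic, while handing over an owned handle costs an atomic increment and, later, a decrement. The function names here are hypothetical, and each simply reports the strong count it observes.

```rust
use std::sync::Arc;

/// Takes an owned handle: the caller has to clone, so the count is bumped
/// for the duration of the call and dropped again when `s` goes away.
fn observe_owned(s: Arc<String>) -> usize {
    Arc::strong_count(&s)
}

/// Takes a borrow: no clone, so the count is untouched.
fn observe_borrowed(s: &Arc<String>) -> usize {
    Arc::strong_count(s)
}

fn main() {
    let shared = Arc::new(String::from("hello"));

    // Atomic increment on clone, atomic decrement when the callee returns.
    let seen_owned = observe_owned(Arc::clone(&shared));
    // No atomic operations on the count at all.
    let seen_borrowed = observe_borrowed(&shared);

    println!("owned call saw {seen_owned}, borrowed call saw {seen_borrowed}");
    println!("count afterwards: {}", Arc::strong_count(&shared));
}
```

Swift's ARC optimizer tries to remove redundant retain/release pairs automatically; in Rust the programmer makes the same choice explicitly by picking `&T` over a clone.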

Reference counting was for a long time basically guaranteed to be slower than garbage collection because of slower atomics, no weak memory ordering and "naive" usage of reference counting introducing a lot of unnecessary reference counter increments and decrements.

But with modern hardware and compilers it's no longer as clear-cut, at least in practice.

the bait is that reference counting is (often considered to be) a type of garbage collection, and that correction is frequently made in comment threads

and hey, here we are

Reference counting cannot be faster than naive manual memory management because it does the same thing with the additional overhead of reference counting. Even a very simple generational GC beats naive explicit memory management on most nontrivial workloads given enough memory. https://people.cs.umass.edu/~emery/pubs/gcvsmalloc.pdf

You are misunderstanding this paper deeply. Quoted summary from the paper itself:

>>>We use this framework to compare the time-space performance of a range of garbage collectors to explicit memory management with the Lea memory allocator. Comparing runtime, space consumption, and virtual memory footprints over a range of benchmarks, we show that the runtime performance of the best-performing garbage collector is competitive with explicit memory management when given enough memory. In particular, when garbage collection has five times as much memory as required, its runtime performance matches or slightly exceeds that of explicit memory management. However, garbage collection’s performance degrades substantially when it must use smaller heaps. With three times as much memory, it runs 17% slower on average, and with twice as much memory, it runs 70% slower. Garbage collection also is more susceptible to paging when physical memory is scarce. In such conditions, all of the garbage collectors we examine here suffer order-of-magnitude performance penalties relative to explicit memory management.<<<

GC is "competitive" when it has 5x as much memory, according to the authors. Well, probably this is because GCs are fast when they have low to no memory pressure. They simply don't need to garbage collect anything. In fact, a non-collecting, non-managed memory approach can be faster than any memory management. That is, you have a short-lived application that just allocates what it needs, never frees anything, then dies when done. But this is not apples to apples.

The paper also says that with 2x more memory than manual management, GC is 70% slower, and under paging scenarios GC can suffer "order-of-magnitude performance penalties relative to explicit memory management."

Except under unusual conditions, there is no case where GC is faster than manual.

I understood that part completely. You misunderstood my point, which is that with enough memory, generational GC outperforms naive explicit memory management, which is impossible with reference counting.

> Except under unusual conditions, there is no case where GC is faster than manual.

Their simple generational GC outperforms manual memory management in every single workload in that paper given enough memory, so that statement is clearly false. A production quality GC does so with less memory.

> Well, probably this is because GCs are fast when they have low to no memory pressure. They simply don't need to garbage collect anything.

This is wrong. It's because generational GCs start to look like arena allocators when they have enough memory, with cheap bump allocation and bulk deallocation. On most of their workloads, their simple generational GC already outperforms naive explicit memory management with less than 3x the memory, and this would certainly be better with explicit tuning or a production GC.
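A toy illustration of why that is cheap: in an arena (or a GC nursery) an allocation is just a pointer bump, and freeing everything is a single reset. This sketch hands out offsets into a byte buffer. `Arena`, `alloc`, and `reset` are hypothetical names, and a real generational collector additionally has to trace and evacuate survivors before it can reuse the space.

```rust
/// A toy bump allocator that hands out offsets into a fixed buffer.
struct Arena {
    buf: Vec<u8>,
    next: usize, // offset of the first free byte
}

impl Arena {
    fn with_capacity(capacity: usize) -> Self {
        Arena { buf: vec![0; capacity], next: 0 }
    }

    /// Allocation is an add and a bounds check. `align` must be a power of two.
    fn alloc(&mut self, size: usize, align: usize) -> Option<usize> {
        assert!(align.is_power_of_two());
        // Round the bump pointer up to the requested alignment.
        let start = self.next.checked_add(align - 1)? & !(align - 1);
        let end = start.checked_add(size)?;
        if end > self.buf.len() {
            return None; // out of space: this is where a GC would collect
        }
        self.next = end;
        Some(start)
    }

    /// Bulk deallocation: everything is freed at once in O(1).
    fn reset(&mut self) {
        self.next = 0;
    }
}

fn main() {
    let mut arena = Arena::with_capacity(64);
    let a = arena.alloc(3, 1);
    let b = arena.alloc(8, 8);
    println!("a = {a:?}, b = {b:?}");
    arena.reset();
    println!("after reset: {:?}", arena.alloc(4, 4));
}
```

Compare that with a general-purpose malloc/free, which has to search free lists and coalesce blocks per object; with plenty of headroom the collector rarely has to do anything more expensive than the reset.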

Isn't it amazing how your well-informed comments get downvoted?
