The rest are all techniques in reasonably common use, but without hardware support for x86's strong memory ordering you can't get very good x86-on-ARM performance: inspecting existing code, it's by no means clear when strong memory ordering matters and when it doesn't, so you have to sprinkle memory barriers around liberally, which really kills performance.
The huge and fast L1I/L1D caches don't hurt either... emulation tends to be cache-intensive.
It's surprising that (AFAIK) Qualcomm didn't implement TSO in the chips they made for the recent-ish Windows ARM machines. If anything they need fast x86 emulation even more than Apple does, since Windows has a much longer tail of software support than macOS: there are going to be important Windows apps that stubbornly refuse to support native ARM basically forever.
It's definitely surprising that Qualcomm didn't. Not only does Windows have a longer tail of software to support, but given that the vast majority of Windows machines will continue to be x86-64, there's little incentive to do work to support ARM.
With the Mac, Apple told everyone "we're moving to ARM and that's final." With Windows, Microsoft is saying, "these ARM chips could be cool, what do you think?" On the Mac, you either got on board or were left behind. Users knew that the future was ARM and bought machines even if there might be some short-term growing pains. Developers knew that the future was ARM and worked hard to support it.
But with Windows, there isn't a huge incentive for users to switch to ARM and there isn't an incentive for developers to encourage it. You can say there's some incentive if the ARM chips are better. While Qualcomm's chips are good, the benchmarks aren't really ahead of Intel/AMD and they aren't the power-sipping processors that Apple is putting out.
If Apple hadn't implemented TSO, Mac users/developers would still switch because Apple told them to. Qualcomm has to convince users that their chips are worth the short-term pain - and that users shouldn't wait a few years to make the switch when the ecosystem is more mature. That's a much bigger hill to climb.
Still, for Qualcomm, they might not even care about losing a little money for 5-10 years if it means they become one of the largest desktop processor vendors for the following 20+ years. As long as they can keep Microsoft's interest in ARM as a platform, they can bide their time.
> With the Mac, Apple told everyone "we're moving to ARM and that's final."
In ~mid 2020, when macs were all-but-confirmed to be moving to Apple-designed chips, but before we had any software details, some commentators speculated that they thought Apple wouldn't build a compatibility layer at all this time around.
I wonder if it's possible Qualcomm doesn't much care about the long tail of software? Maybe MS has stats indicating that a very large percentage of the software they expect to be used on these devices is first party, or stuff that can reasonably be expected to be compiled for ARM.
How does the Windows app store work, anyway? Can they guarantee that everything there gets compiled for ARM?
Anyway, it is Windows not MacOS. The users expect some rough edges and poor craftsmanship, right?
The Qualcomm chips come from their acquisition of Nuvia, who were originally designing the chips as server chips, where the workload would presumably be Linux stuff compiled for the right arch. They probably didn't have time to redesign the chip from the original server-oriented part to add TSO.
> I wonder if possible Qualcomm doesn’t super care about the long tail of software?
Qualcomm's success is based more on its patent portfolio and how well it uses it than on any other single factor. It doesn't really have to compete on quality, and its support has long been terrible - they're one of the main drivers of Android's poor reputation for hardware end-of-life. It doesn't matter, though, because they have no meaningful competition in many areas.
They are also the main reason ARM Chromebooks are a relatively recent development. Google wanted 5-10 years of support, Qualcomm preferred 5-10 minutes.
Qualcomm has been phoning it in in various forms for over a decade, including forcing MS to ship machines that don't really pass Windows requirements (like broken firmware support). Maybe it got fixed with the recent Snapdragon X, but I won't hold my breath.
We're talking about a company that, if certain personal sources are to be believed, started the Snapdragon brand by deciding to cheapen out on memory bandwidth despite feedback that increasing it was critical and leaving the client to find out too late in the integration stage.
Deciding that they make better money by not spending on implementing TSO, or not spending transistors on bigger caches, and getting more volume at lower cost, is perfectly normal.
Last time I checked, the default behavior for Microsoft's translation was to pretend that the hardware is doing TSO, and hope it works out. So that should obviously be fast, but occasionally wrong.
My guess is that the sort of "legacy x86-forever" apps for Windows don't really need much in the way of performance. Think your classic Visual Basic 6 sort of thing that a business relies on for decades.
I'm also fairly certain that the TSO changes to the memory system are non-trivial, and it's possible that Qualcomm doesn't see it as a value-add in their chips - and they're probably right. Windows machines are such a hot mess that outside a relatively small group of users (who probably run Linux anyway, so aren't anyone's target market), nobody would know or care what TSO is. If it adds cost and power and doesn't matter, why bother?
> My guess is that the sort of "legacy x86-forever" apps for Windows don't really need much in the way of performance.
Games are a pretty notable exception that demand high performance and for the most part will be stuck on x86 forever. Brand new games might start shipping native ARM Windows binaries if the platform gets enough momentum, but games have very limited support lifecycles so it's unlikely that many released before that point will ever be updated to ARM native.
> Brand new games might start shipping native ARM Windows binaries if the platform gets enough momentum, but games have very limited support lifecycles so it's unlikely that many released before that point will ever be updated to ARM native.
Unity supports Windows ARM. Unreal: probably never. IMO, the PC gaming market is so fragmented that, short of Microsoft developing games for the platform or funding pre-sales at the multi-million scale that EGS did, games on ARM will only happen by complete accident, not because it makes sense.
> My guess is that the sort of "legacy x86-forever" apps for Windows don't really need much in the way of performance. Think your classic Visual Basic 6 sort of thing that a business relies on for decades.
In my experience, there's a lot of that kind of software around that was initially designed for a much simpler use-case, and has decades of badly coded features bolted on, with questionable algorithmic choices. It can be unreasonably slow on modern hardware.
Old government database sites are the worst examples in my experience. Clearly tested with a few hundred records, but 15 years later there's a few million and nobody bothered to create a bunch of indexes so searches take a couple minutes. I guess this way they can just charge to upgrade the hardware once in a while instead.
TSO only matters for programs that are internally multithreaded or which run multiple processes that have shared memory segments.
Most legacy programs like Visual Basic 6 are not of this kind.
For any other kinds of applications, the operating system handles the concurrency and it does this in the correct way for the native platform.
Nevertheless, the few programs for which TSO matters are also those where performance must have mattered, if the developers bothered to implement concurrent code. Therefore the low performance of the emulated application would be noticeable.
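To make the distinction concrete, here is a minimal sketch (hypothetical names, not from any particular application) of the kind of producer/consumer handshake where TSO matters. Written portably with acquire/release atomics, the marked operations compile to plain MOVs on x86 - TSO gives the ordering for free - but to STLR/LDAR instructions on AArch64. Legacy x86 code that used plain stores here still worked, which is exactly what an emulator without TSO has to reproduce with barriers:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> payload{0};
std::atomic<bool> ready{false};

void producer() {
    payload.store(42, std::memory_order_relaxed);
    // On x86 a plain MOV is already ordered after the payload store (TSO);
    // on AArch64 this release store becomes an STLR instruction.
    ready.store(true, std::memory_order_release);
}

int consumer() {
    // On AArch64 this acquire load becomes an LDAR; on x86, a plain MOV.
    while (!ready.load(std::memory_order_acquire)) { /* spin */ }
    return payload.load(std::memory_order_relaxed);
}
```

A single-threaded VB6-style program never executes a pattern like this, so it runs correctly under emulation regardless of the host's memory model.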
> Qualcomm didn't implement TSO in the chips they made
I’m not sure they can do that.
Under a Technology License Agreement, Qualcomm can build chips using ARM-designed CPU cores. Specifically, the Qualcomm SQ1 uses ARM Cortex A76/A55 for the fast/slow CPU cores.
I don't think using ARM-designed cores is enough to implement TSO; you need custom ARM cores instead of the stock ones. To design custom ARM cores, Qualcomm needs an architecture license from ARM, which ARM recently moved to cancel.
SQ1/SQ2 was their previous attempt, the recently released Snapdragon Elite chips use the fully custom Oryon ARM cores they got from the Nuvia acquisition. That acquisition is what caused the current licensing drama with ARM, but for the time being they do have custom cores.
> custom Oryon ARM cores they got from the Nuvia acquisition
Nuvia was developing server CPUs. By now, I believe backward compatibility with x86 and AMD64 is rather unimportant for servers. Hosting providers and public clouds have been offering ARM64 Linux servers for quite a few years now, all important server-running software already has native ARM64 builds.
On a first order analysis, Qualcomm doesn't want good x64 support, because good x64 support furthers the lifetime of x64, and delays the "transition" to ARM. In the final analysis, I doubt that is an economically rational strategy, because even if there is to be a transition away from x64, you need a good legacy and migration story. And I doubt such a transition will happen in the next 10 years, and certainly not spurred by anything in Microsoft land.
So maybe it's rational after all, because they know these Windows ARM products will never succeed, so they're just saving themselves the cost/effort of good support.
> On a first order analysis, Qualcomm doesn't want good x64 support, because good x64 support furthers the lifetime of x64, and delays the "transition" to ARM.
The logical thing for Qualcomm, at their current market share, is to implement TSO now, then after they get momentum, create high-end/low-end tiers and disable TSO for the low-end tier to force vendors to target both ARM and x86.
What Qualcomm is doing now makes them look like they just don't care.
> create a high-end/low-end tier, and disable TSO for the low-end tier
Wouldn’t that make the low-end tier run faster than the high-end tier, or force them to leave some performance on the table there?
Also, would a per-process flag that controls TSO be possible? Ignoring whether it’s easy to do in the hardware, the only problem I can think of with that is that the OS would have to set that on processes when they start using shared memory, or forbid using shared memory by processes that do not have it set.
I would disagree that this is first order. First order is making the transition as smooth as possible, which obviously means having a very good translation layer. Only then should you even think about competing on compatibility.
Does this not depend on how one sees the Arm transition matter playing out?
It's at least conceivable, and IMHO plausible, that Qualcomm sees Apple, phones on ARM, and aging demographics all pointing to a certain ARM transition?
I wouldn't be so sure. Windows on ARM has existed for more than a decade with almost zero adoption. Phones, both Apple and Android, have been ARM since forever. The only additional factor is that Apple has moved their Macs to ARM. This to me means it would be pretty stupid for them to just throw up their hands and say "they will come", because it didn't happen in the prior decade.
Maybe. Just trying to see it from other points of view.
A decade ago, Apple was on Intel and Microsoft had not advanced many plans in play today. Depending on the smoke they are blowing people's way, one could get an impression ARM is a sure thing.
Frankly, I have no desire to run Windows on ARM.
Linux? Yep.
And I am already on a Mac M1.
I sort of hope it fails personally. I want to see the Intel PC continue in some basic form.
Does Windows's translation take advantage of those where they exist? E.g. if I launch an aarch64 Windows VM on my M2, does it use the M2's support for TSO when running x86_64 .exes or does it insert these memory barriers?
If not, it makes sense that Qualcomm didn't bother adding them.
I would expect it to not use TSO, because the toggle for it isn't, to the best of my knowledge, a general userspace toggle. It's something the kernel has to toggle, and so a VM may or may not (probably does not) even have access to the SCRs (system control registers) to change it.
There may be some kernel interface to allow userspace to toggle that, but that's not the same as being a userspace-accessible SCR (and I also wouldn't expect it to be passed through to a VM - you'd likely need a hypercall to toggle it, unless the hypervisor emulated that, though admittedly I'm not quite as deep in the weeds on ARMv8 virtualization as I would prefer at the moment).
Hmm, you're right - maybe my memory serves me incorrectly, but yeah, it seems the bit itself is privileged, while the interface to toggle it is open to all processes.
Without that kernel support, all processes in the VM (not just Rosetta-translated ones) are opted-in to TSO:
> Without selective enablement, the system opts all processes into this memory mode [TSO], which degrades performance for native ARM processes that don’t need it.
Before Sequoia, a Linux VM using Rosetta would have TSO enabled all the time.
With Sequoia, TSO is not enabled for Linux VMs, and that kernel patch (posted in the last few weeks) is required for Rosetta to be able to enable TSO for itself. If the kernel patch isn't present, Rosetta has a non-TSO fallback mode.
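For reference, the Linux kernel patch series in question proposes a prctl-based opt-in. The sketch below is a guess at how a translator like Rosetta would probe for it; the constant names and values are copied from the proposed (not necessarily merged) series "prctl: Introduce PR_{SET,GET}_MEM_MODEL" and may not match whatever lands upstream, so treat them as assumptions:

```cpp
#include <cerrno>
#include <cstdio>
#include <sys/prctl.h>

// Assumed constants from the proposed patch series; not in stock headers,
// and the values may differ in the final upstream interface.
#ifndef PR_SET_MEM_MODEL
#define PR_GET_MEM_MODEL 0x6d4d444c
#define PR_SET_MEM_MODEL 0x6d4d444d
#define PR_SET_MEM_MODEL_DEFAULT 0
#define PR_SET_MEM_MODEL_TSO 1
#endif

// Try to opt this process into TSO. On kernels without the patch (or on
// hardware without a TSO mode) the prctl fails, and the translator must
// fall back to inserting barriers / acquire-release instructions instead.
bool try_enable_tso() {
    if (prctl(PR_SET_MEM_MODEL, PR_SET_MEM_MODEL_TSO, 0, 0, 0) == 0)
        return true;
    std::fprintf(stderr, "TSO unavailable (errno=%d), using fallback\n", errno);
    return false;
}
```

This matches the behavior described above: with the patch present, Rosetta enables TSO for itself only; without it, everything runs in the non-TSO fallback mode.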
The OS can use what the hardware supports. macOS does because SEG is a tightly integrated group at Apple, whereas Microsoft keeps hardware vendors at arm's length (pun unintended). There is roadmap sharing and there are planning events through leadership, but it's not as cohesive as at Apple.
> As far as I know this is not part of the ARM standard, but it also isn’t Apple specific: Nvidia Denver/Carmel and Fujitsu A64fx are other 64-bit ARM processors that also implement TSO (thanks to marcan for these details).
I'm not sure how to interpret that - do these other processors have distinct/proprietary TSO extensions? Or are they referring to a single published (optional) extension that all three implement? The linked tweet has been deleted so no clues there, and I stopped digging.
TSO is nice to have but it's definitely not necessary. Rosetta doesn't even require TSO on Linux anymore by default. It performs fine for typical applications.
Barrier injection isn't the issue as much as the barriers becoming expensive. There's no reason a CPU can't use the techniques of TSO support to support lesser barriers just as cheaply.
I get pretty close to native performance with Rosetta 2 for Linux and I don't believe TSO is being used or taken advantage of. I wonder how important it really is.
True, and it is a little more relaxed than sequential consistency.
For simple loads and stores, the x86 CPUs do not reorder the loads between themselves or the stores between themselves. Also the stores are not done before previous loads.
Only some special kinds of stores can be reordered, i.e. those caused by string instructions or the stores of vector registers that are marked as NT (non-temporal).
So x86 does not need release stores, any simple store is suitable for this. Also store barriers are not normally needed. Acquire fences a.k.a. acquire barriers are sometimes needed, but much less often than on CPUs with weaker ordering for the memory accesses (for acquire fences both x86 and Arm Aarch64 have confusing mnemonics, i.e. LFENCE on x86 and DMB/DSB of the LD kind on Aarch64; in both cases these instructions are not load fences as suggested by the mnemonics, but acquire fences).
When converting x86 code to Aarch64 code, there are many cases where simple stores must be replaced with release stores (a.k.a. Store-Release instructions in the Arm documentation), and there are many places where acquire barriers must be inserted, or, less frequently, store barriers (for non-optimally written concurrent code it may also be necessary to replace some simple loads with the Load-Acquire instructions of Aarch64).
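The conservative mapping described above can be sketched as follows (the helper names are hypothetical, purely for illustration). A translator that cannot rely on hardware TSO can't tell which guest accesses participate in synchronization, so it must give every guest store release semantics and every guest load acquire semantics - which is why the AArch64 Store-Release/Load-Acquire instructions, rather than full barriers after every access, are the standard way to preserve x86-visible ordering:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical translator helpers: the conservative x86-to-AArch64 mapping
// when the host has no TSO mode.
inline void guest_store(std::atomic<std::uint64_t>& slot, std::uint64_t v) {
    // x86 guest: a plain MOV store is already ordered under TSO.
    // AArch64 host: this compiles to STLR (Store-Release), preventing the
    // store from being reordered before earlier loads and stores.
    slot.store(v, std::memory_order_release);
}

inline std::uint64_t guest_load(const std::atomic<std::uint64_t>& slot) {
    // AArch64 host: this compiles to LDAR (Load-Acquire), preserving the
    // load-load and load-store ordering that x86 code implicitly assumed.
    return slot.load(std::memory_order_acquire);
}
```

Applying this to every memory access is exactly the "sprinkle barriers everywhere" cost discussed earlier; hardware TSO lets the translator emit plain loads and stores instead.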