
Apple has already done this to an extent. M1 has undocumented instructions and a semi-closed toolchain (assuming Apple even has them, M1 tuning models for LLVM and GCC are nowhere to be seen, afaik)



Were the tuning models for Intel CPUs in open-source toolchains released by Intel, or figured out over time by open-source developers?


Intel publish a very thick optimization manual, which is a good help.

Compilers aren't great at using the real parameters of the chip (e.g. LLVM knows how wide the reorder buffer is, but I'm not sure if it can actually use that information), but knowing latencies for ISel and things like that is very helpful. To get those details you do need to rely on people like (the god amongst men) Agner Fog.


Intel contributes optimizations to gcc and llvm.


Apple also contributes plenty to LLVM, more than Intel actually, naively based on commit counts of @apple.com and @intel.com git committer email addresses. This isn't very surprising given that Chris Lattner worked at Apple for over a decade.
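For what it's worth, the naive count amounts to nothing fancier than this (a rough sketch; it assumes you're inside an llvm-project checkout on a POSIX system with git on PATH):

    // Count commits by committer email domain in the current repository.
    // Uses POSIX popen() to read `git log` output; purely illustrative.
    #include <cstdio>
    #include <iostream>
    #include <map>
    #include <string>

    int main() {
        std::map<std::string, long> counts;
        // %ce prints the committer email, one commit per line.
        FILE* pipe = popen("git log --pretty=format:%ce", "r");
        if (!pipe) return 1;
        char buf[512];
        while (std::fgets(buf, sizeof buf, pipe)) {
            std::string email(buf);
            auto at = email.find('@');
            if (at == std::string::npos) continue;
            std::string domain = email.substr(at + 1);
            while (!domain.empty() && (domain.back() == '\n' || domain.back() == '\r'))
                domain.pop_back();              // strip trailing newline
            if (domain == "apple.com" || domain == "intel.com")
                ++counts[domain];
        }
        pclose(pipe);
        for (const auto& [domain, n] : counts)
            std::cout << domain << ": " << n << "\n";
    }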


LLVM is Apple's baby, beyond its genesis under Lattner. They hate the GPL, that's it.

The thing about their contributions is that they upstream stuff they want other people to standardize on, but they aren't doing it out of love, as far as I can tell. E.g. Valve has a direct business interest in Linux kicking ass; Apple actively loses out (psychologically at least) if a non-Apple toolchain is as good as theirs.


Apple used to contribute plenty to LLVM.

Nowadays less so; for them, C++ support is good enough for what they make of it (MSL is based on C++14, and IOKit/DriverKit use an embedded subset).

The main focus is how well it churns out Objective-C, Swift and their own special flavour of bitcode.

None of which ends up upstream as one might wish.


They do now; I think I remember the time when they didn't.


The Apple M1 also has a TSO mode, where memory accesses have the same ordering semantics as on x86. It's probably one of the main things giving Rosetta almost 1:1 performance with x86.

https://mobile.twitter.com/ErrataRob/status/1331735383193903...


TSO support at the hardware level is a cool feature, but it's a bit oversold here. Most emulated x86 code doesn't require it, and when it does, it usually isn't at every memory instruction. For instance, the default settings in Windows' translation implementation do not do anything to guarantee TSO.

Rosetta is also a long way from 1:1 performance; even your own link says ~70% of the speed. That's closer to half speed than it is to full speed.

The M1's main trick to being so good at running x86 code is that it's just so god damn fast for the power budget that it doesn't matter if there is overhead for emulated code; it's still going to be fast. This is why running Windows for ARM in Parallels is fast too: it knows basically none of the "tricks" available but the emulation speed isn't much slower than the Rosetta 2 emulation ratio even though it's all happening in a VM.

In a fun twist of fate, 32-bit x86 apps also work under Windows on the M1, even though the M1 doesn't support 32-bit ARM code.


> The M1's main trick to being so good at running x86 code is that it's just so god damn fast for the power budget that it doesn't matter if there is overhead for emulated code; it's still going to be fast.

M1 is fast and efficient, but Rosetta 2 does not emulate x64 in real time. Rosetta 2 is a static binary translation layer: the x86 code is analysed, translated and stashed away in a disk cache (for future invocations) before the application starts up. Static code analysis allows multiple heuristics to be applied at binary translation time, when time is plentiful. The translated code then runs at near-native ARM speed. There is no need to appeal to varying deities or invoke black magic and tricks – it is that straightforward and relatively simple. There have been mentions of the translated code being further JIT'd at runtime, but I have not seen proof of that claim.
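Purely as a conceptual sketch of that flow (none of this is Rosetta's actual code; translate_block and the on-disk cache path are made up), the "translate once, cache on disk, reuse on later launches" idea looks roughly like:

    // Conceptual sketch of ahead-of-time translation with a disk cache.
    #include <cstdio>
    #include <functional>
    #include <string>
    #include <vector>

    using Code = std::vector<unsigned char>;

    // Key the disk cache by a hash of the input binary.
    static std::string cache_path(const Code& x86) {
        size_t h = std::hash<std::string>{}(std::string(x86.begin(), x86.end()));
        return "/tmp/aot-cache-" + std::to_string(h) + ".bin";
    }

    // Made-up stand-in for the expensive whole-binary analysis + codegen step.
    static Code translate_block(const Code& x86) { return x86; }

    // Translate once, before the program launches; later launches reuse the
    // cached result, so the translation cost is paid only the first time.
    Code load_or_translate(const Code& x86) {
        const std::string path = cache_path(x86);
        if (FILE* f = std::fopen(path.c_str(), "rb")) {          // cache hit
            Code arm;
            for (int c; (c = std::fgetc(f)) != EOF; ) arm.push_back((unsigned char)c);
            std::fclose(f);
            return arm;
        }
        Code arm = translate_block(x86);                          // cache miss
        if (FILE* f = std::fopen(path.c_str(), "wb")) {
            std::fwrite(arm.data(), 1, arm.size(), f);
            std::fclose(f);
        }
        return arm;
    }

    int main() {
        Code fake_x86 = {0x90, 0x90, 0xc3};                       // placeholder bytes
        Code arm = load_or_translate(fake_x86);
        std::printf("translated %zu bytes\n", arm.size());
    }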

Achieving even 70% of native CPU speed whilst emulating a foreign ISA _dynamically (in real time)_ is impossible on von Neumann architectures due to the unpredictability of memory access paths, even if the host ISA provides hardware assistance. This is further compounded by the complexity of the x86 instruction encoding, which is where most of the benefits of hardware-assisted emulation would be lost (it was already true for 32-bit x86, and is more complex for amd64 and SIMD extensions).

> This is why running Windows for ARM in Parallels is fast too: it knows basically none of the "tricks" available but the emulation speed isn't much slower than the Rosetta 2 emulation ratio even though it's all happening in a VM.

Windows for ARM is compiled for the ARM memory model, is executed natively and runs at near-native M1 speed. There is [some] hypervisor overhead, but there is no emulation involved.


> but I have not seen proof of that claim.

x86 apps with JITs can run [1]. For instance I remember initially Chrome didn't have a native version, and the performance was poor because the JITted javascript had to be translated at runtime.

[1]: https://developer.apple.com/documentation/apple-silicon/abou...?


> There have been mentions of the translated code being further JIT'd at runtime, but I have not seen proof of that claim.

I've seen mentions of a JIT path, but only if the AOT path doesn't cover the use case (e.g. an x86 app with dynamic code generation), not as an optimization pass. https://support.apple.com/guide/security/rosetta-2-on-a-mac-...

Windows decided to go the "always JIT and just cache frequent code blocks" method though. In the end whichever you choose it doesn't seem to make a big difference.

> Windows for ARM is compiled for the ARM memory model, is executed natively and runs at near-native M1 speed. There is [some] hypervisor overhead, but there is no emulation involved.

This section was referring to the emulation performance, not native code performance:

"it knows basically none of the "tricks" available but the _emulation speed_ isn't much slower than the Rosetta 2 emulation ratio "

Though I'll take native apps any day I can find them :).


> Windows decided to go the "always JIT and just cache frequent code blocks" method though. In the end whichever you choose it doesn't seem to make a big difference.

AOT (or static binary translation before the application launches) vs JIT does make a big difference. JIT always carries the «time spent JIT'ting vs performance» tradeoff, which AOT does not. The AOT translation layer has to be fast, but it is a one-off step, so it can afford to spend more time analysing the incoming x86 binary and applying more heuristics and optimisations, yielding a faster-performing native binary, as opposed to a JIT engine that has to do the same on the fly, under tight time constraints and under a constant threat of unnecessarily screwing up CPU cache lines and TLB lookups (the worst-case scenario being a freshly JIT'd instruction sequence spilling over into a new memory page).

> "it knows basically none of the "tricks" available but the _emulation speed_ isn't much slower than the Rosetta 2 emulation ratio "

I still fail to comprehend which tricks you are referring to, and I would also be very keen to see actual figures substantiating the AOT vs JIT emulation speed claim.


Rosetta 2 also emulates 32-bit x86 code; you can try it out with CrossOver (CodeWeavers’ commercial version of Wine)


No it does not. It is CrossOver letting you run 32-bit x86 (Windows apps only, I think), not Rosetta 2


Yes it does. Rosetta includes full emulation for 32-bit instructions, allowing CrossOver to continue to not have to be an emulator.

(The parent commenter also works at CodeWeavers, so he would know :P)


Breaking memory ordering will break software - if a program requires it (which is already hard to know), how would you know which memory is accessed by multiple threads?


It's not just a question of "is this memory accessed by multiple threads", with full TSO mandated if so; it's a question of "is the way this memory is accessed by multiple threads actually dependent on memory barriers for correctness, and if so, how tight do those barriers need to be". For most apps the answer is actually "it doesn't matter at all". For the ones where it does matter, heuristics and loose barriers are usually good enough. Only in the worst-case scenario, where strict barriers are needed, does the performance impact show up, and even then it's still not the end of the world in terms of emulation performance.

As far as applying it goes, the default assumption is that apps don't need it, and heuristics try to catch the ones that do. For well-known apps that do need TSO, it's part of the compatibility profile to increase the barriers to the level needed for reliable operation. For unknown apps that do need TSO you'll get a crash and a recommendation to try a stricter emulation compatibility setting, but this is exceedingly rare given the above two things have to fail first.

Details here https://docs.microsoft.com/en-us/windows/uwp/porting/apps-on...


> For unknown apps that do need TSO you'll get a crash

Sure about that? Couldn't it lead to silent data corruption?


Yes, it absolutely can. Shameless but super relevant plug: I'm (slowly) writing a series of blog posts where I simulate the implications of memory models by fuzzing timing and ordering: https://www.reitzen.com/

I think the main reason why it hasn't been disastrous is that most programs rely on locks, and they're going to be translating that to the equivalent ARM instructions with a full memory barrier.

Not too many consumer apps are going to be doing lockless algorithms, but where they are used, all bets are off. You can easily imagine a queue where two threads grab the same item, for instance.
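To make that concrete, here's a minimal, self-contained sketch (not taken from any of the emulators discussed) of the classic message-passing pattern; the comments mark where x86's TSO gives you the ordering for free and where a weakly ordered core needs it added back:

    #include <atomic>
    #include <cassert>
    #include <thread>

    std::atomic<int>  payload{0};
    std::atomic<bool> ready{false};

    void producer() {
        payload.store(42, std::memory_order_relaxed);
        // An x86 binary would just use two plain stores here: TSO hardware
        // never lets the second store become visible before the first.
        // A weakly ordered ARM core can, so the translator (or we) must ask
        // for release ordering, or run the core in its TSO mode.
        ready.store(true, std::memory_order_release);
    }

    void consumer() {
        while (!ready.load(std::memory_order_acquire))   // pairs with the release
            ;
        // With acquire/release (or TSO) this is guaranteed to see 42.
        assert(payload.load(std::memory_order_relaxed) == 42);
    }

    int main() {
        std::thread t1(producer), t2(consumer);
        t1.join();
        t2.join();
    }

Under a hardware TSO mode the plain-store version of this would still behave like the x86 original; without it, a translator has to strengthen the accesses or convince itself it doesn't need to.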


Heuristics are used. For example, memory accesses relative to the stack pointer will be assumed to be thread-local, as the stack isn’t shared between threads. And that’s just one of the tricks in the toolbox. :-)

The result is that the expensive atomics aren't applied to every access on hardware that doesn't expose a TSO memory model.
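As a toy illustration of what such a heuristic might look like inside a translator (entirely hypothetical; MemAccess, Reg and needsBarrier are made-up names, not from any real emulator):

    #include <cstdint>
    #include <cstdio>

    enum class Reg { SP, FP, Other };

    struct MemAccess {
        Reg     base;     // register the address is computed from
        int64_t offset;   // displacement
        bool    isWrite;
    };

    // Decide whether the translated load/store needs acquire/release
    // semantics (e.g. ldar/stlr) or can be emitted as a plain access.
    bool needsBarrier(const MemAccess& m) {
        // Stack slots are assumed thread-local: the stack isn't normally
        // shared between threads, so plain loads/stores are fine.
        if (m.base == Reg::SP || m.base == Reg::FP)
            return false;
        // Everything else is conservatively treated as potentially shared.
        return true;
    }

    int main() {
        MemAccess stackSlot{Reg::SP, 16, true};
        MemAccess heapWord{Reg::Other, 0, false};
        std::printf("%d %d\n", needsBarrier(stackSlot), needsBarrier(heapWord)); // 0 1
    }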


Nitpick: relative speed differences do not add up; they multiply. A speed of 70 is 40% faster than a speed of 50, and a speed of 100 is 42.8571…% faster than a speed of 70 (corrected from 50. Thanks, mkl!). Conversely, a speed of 70 is 30% slower than a speed of 100, and a speed of 50 is 28.57142…% slower than one of 70.

=> when comparing speed, 70% is almost exactly halfway between 50% and 100% (the midpoint being 100%/√2 ≈ 70.7%)
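Spelled out (just the arithmetic behind that √2): the multiplicative midpoint m between speeds 50 and 100 is the one with equal ratios on both sides, m/50 = 100/m, so

    m = \sqrt{50 \times 100} = \sqrt{5000} \approx 70.7,
    \qquad \frac{70.7}{50} = \frac{100}{70.7} \approx 1.414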


Not sure what the nit is supposed to be; 70% is indeed less than √(1/2), hence me mentioning it. And yes, it's closer to 3/4 than to half or full, but the thing being pointed out wasn't "what's the closest fraction", rather "it's really not that close to 1:1".


> a speed of 100 is 42.8571…% faster than a speed of 50

I think you mean either "100% faster" or "faster than a speed of 70" there.


Meh, in the context of emulation, which ran at 5% before JITs, 70% is pretty close to 1:1 performance. Given the M1 is also a faster CPU than any x86 MacBook, it's really a wash (yes, recompiling for ARM is faster...)


No, it's not. Switching to TSO allows for fast, accurate emulation, but if they didn't have that they would just go the Windows route and drop barriers based on heuristics, which would work for almost all software. The primary driver behind Rosetta's speed is extremely good single-threaded performance and a good emulator design.


That’s a single ACTLR bit present in the publicly released XNU kernel source code.


I'm asking out of naiveté here -- how were they (kernel maintainers) able to get the Linux kernel to support M1 with undocumented instructions?


My guess: if the situation is similar to Windows laptops, they just use a subset of OEM features and provide a sub-par experience (like lack of battery usage optimizations, flaky suspend/hibernate, second-tier graphics support, etc)

Now, I'm typing this on a GNU/Linux machine, but let's face it, all of the nuisances I mentioned are legit and a constant source of problems in tech support forums.


A kernel doesn't need to use all the instructions a CPU offers -- only the ones it needs.


If the extra instructions also operate on extra state (e.g. extra registers), the kernel needs to know about their existence so it can correctly save and restore that state on context switches.
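A rough sketch of the concern (hypothetical structures, not Linux or XNU code): if a hidden register bank exists, the context-switch path has to include something like the save/restore below, and a kernel that doesn't know the state exists simply can't emit it.

    #include <cstdint>

    struct ExtendedState {            // hidden register bank added by an
        uint64_t regs[8];             // undocumented extension, if one exists
    };

    struct ThreadContext {
        uint64_t      gprs[31];       // architectural general-purpose registers
        uint64_t      sp, pc, pstate;
        ExtendedState ext;            // must be saved/restored too, or threads
                                      // silently corrupt each other's values
    };

    // Stand-ins for whatever privileged save/restore sequence the extension
    // would need; a kernel unaware of the extension never emits them.
    static void save_extended(ExtendedState* s)          { (void)s; /* inline asm */ }
    static void restore_extended(const ExtendedState* s) { (void)s; /* inline asm */ }

    void context_switch(ThreadContext* prev, ThreadContext* next) {
        save_extended(&prev->ext);     // an unaware kernel skips these two calls
        restore_extended(&next->ext);
        // ... then switch stacks, page tables, etc.
    }

    int main() {
        ThreadContext a{}, b{};
        context_switch(&a, &b);
    }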


Not necessarily/really; the custom extensions still need to be enabled by the kernel before they can be used.

As such, it isn’t actually an issue.


You're confusing MSRs (which don't have to be saved/restored on context switch) with GPRs (which do).


Well, if there is a truly undocumented extension, how do I know it doesn't come with its own registers (e.g. like a floating point unit does)?


Apple reserves certain MSRs that apply on a per-process basis, and thus must be applied on context switch.


It doesn’t need to use them, but it must be aware of them, insofar as they may introduce security problems.

As an example, if the kernel doesn't know about DMA channels that require setup code to prevent user-level code from using them to copy across process boundaries, the kernel will run fine, but have glaring security problems.


What DMA channel doesn't require mapping registers into the user-space process to work? There aren't usually magic instructions you have to opt into disabling, as far as I know.


Those are not security problems, they are insecurity features. *readjusts tinfoil*


I'm assuming they aren't using them or they've reverse engineered the ones they need to use.


They don't use them. The instructions necessary to run Linux are likely just inherited from the normal ARM set.


The undocumented instructions aren't required in order to use the hardware


The instructions are (all?) for acceleration I think.


Not all of them, but the ones that aren't (GXF, memory compression, etc.) are opt-in.


And I would like to emphasize that even when Apple used Intel, it was not commercially viable to use their platform. Bringing in ARM changed less than one would think at first.



That's for the instruction selection patterns, no? I couldn't see a pipeline model in there last time I checked.

P.S. GCC machine descriptions are a goldmine for old architectures.


That is true, but I think this is the case for all the ARM cores. I didn't spot a scheduler for Cortex-A73, for example.



