
The Apple M1 also has a mode for x86 emulation where memory accesses have the same ordering semantics as on x86. It's probably one of the main things giving Rosetta almost 1:1 performance with x86.

https://mobile.twitter.com/ErrataRob/status/1331735383193903...




TSO support at the hardware level is a cool feature, but it's a bit oversold here. Most emulated x86 code doesn't require it, and when it does, it usually isn't needed at every memory instruction. For instance, the default settings in Windows' translation implementation do not do anything to guarantee TSO.

Rosetta is also a long way from 1:1 performance; even your own link says ~70% of the speed. That's closer to half speed than it is to full speed.

The M1's main trick to being so good at running x86 code is that it's just so god damn fast for the power budget that it doesn't matter if there is overhead for emulated code; it's still going to be fast. This is why running Windows for ARM in Parallels is fast too: it knows basically none of the "tricks" available, but its emulation speed isn't much slower than the Rosetta 2 emulation ratio even though it's all happening in a VM.

In a fun twist of fate, 32-bit x86 apps also work under Windows on the M1 even though the M1 doesn't support 32-bit code.


> The M1's main trick to being so good at running x86 code is that it's just so god damn fast for the power budget that it doesn't matter if there is overhead for emulated code; it's still going to be fast.

M1 is fast and efficient, but Rosetta 2 does not emulate x64 in real time. Rosetta 2 is a static binary translation layer where the x86 code is analysed, translated and stashed away in a disk cache (for future invocations) before the application starts up. Static code analysis allows multiple heuristics to be applied at binary translation time, when the time to do so is plentiful. The translated code then runs at near-native ARM speed. There is no need to appeal to varying deities or invoke black magic and tricks – it is that straightforward and relatively simple. There have been mentions of the translated code being further JIT'd at runtime, but I have not seen proof of that claim.

Achieving even 70% of native CPU speed whilst emulating a foreign ISA _dynamically (in real time)_ is impossible on von Neumann architectures due to the unpredictability of memory access paths, even if the host ISA provides hardware assistance. This is further compounded by the complexity of the x86 instruction encoding, which is where most of the benefits of hardware-assisted emulation would be lost (this was already true for 32-bit x86, and it is more complex for amd64 and SIMD extensions).

> This is why running Windows for ARM in Parallels is fast too: it knows basically none of the "tricks" available, but its emulation speed isn't much slower than the Rosetta 2 emulation ratio even though it's all happening in a VM.

Windows for ARM is compiled for the ARM memory model, is executed natively and runs at near-native M1 speed. There is [some] hypervisor overhead, but there is no emulation involved.


> but I have not seen proof of that claim.

x86 apps with JITs can run [1]. For instance, I remember that initially Chrome didn't have a native version, and the performance was poor because the JITted JavaScript had to be translated at runtime.

[1]: https://developer.apple.com/documentation/apple-silicon/abou...?


> There have been mentions of the translated code being further JIT'd at runtime, but I have not seen proof of that claim.

I've seen mentions of a JIT path, but only if the AOT path doesn't cover the use case (e.g. an x86 app with dynamic code generation), not as an optimization pass. https://support.apple.com/guide/security/rosetta-2-on-a-mac-...

Windows decided to go with the "always JIT and just cache frequent code blocks" method, though. In the end, whichever you choose doesn't seem to make a big difference.

> Windows for ARM is compiled for the ARM memory model, is executed natively and runs at near-native M1 speed. There is [some] hypervisor overhead, but there is no emulation involved.

This section was referring to the emulation performance, not native code performance:

"it knows basically none of the "tricks" available but the _emulation speed_ isn't much slower than the Rosetta 2 emulation ratio "

Though I'll take native apps any day I can find them :).


> Windows decided to go with the "always JIT and just cache frequent code blocks" method, though. In the end, whichever you choose doesn't seem to make a big difference.

AOT (or static binary translation before the application launches) vs JIT does make a big difference. JIT always carries the pressure of the «time spent JIT'ting vs performance» tradeoff, which AOT does not. The AOT translation layer has to be fast, but it is a one-off step, so it can invariably afford to spend more time analysing the incoming x86 binary and applying more heuristics and optimisations, yielding a faster-performing native binary, as opposed to a JIT engine that has to do the same on the fly, under tight time constraints and under a constant threat of unnecessarily screwing up CPU cache lines and TLB lookups (the worst-case scenario being a freshly JIT'd instruction sequence spilling over into a new memory page).

> "it knows basically none of the "tricks" available but the _emulation speed_ isn't much slower than the Rosetta 2 emulation ratio "

I still fail to comprehend which tricks you are referring to, and I would also be very keen to see actual figures substantiating the AOT vs JIT emulation speed statement.


Rosetta 2 also emulates 32-bit x86 code; you can try it out with CrossOver (CodeWeavers’ commercial version of Wine).


No it does not. It is CrossOver letting you run 32-bit x86 (Windows apps only, I think), not Rosetta 2.


Yes it does. Rosetta includes full emulation for 32-bit instructions, allowing CrossOver to continue to not have to be an emulator.

(The parent commenter also works at CodeWeavers, so he would know :P)


Breaking memory ordering will break software - if a program requires it (which is already hard to know), how would you know which memory is accessed by multiple threads?


It's not just a question of "is this memory accessed by multiple threads" and call it a day for whether full TSO support is mandated; it's a question of "is the way this memory is accessed by multiple threads actually dependent on memory barriers for accuracy, and if so, how tight do those memory barriers need to be". For most apps the answer is actually "it doesn't matter at all". For the ones where it does matter, heuristics and loose barriers are usually good enough. Only in the worst-case scenario, where strict barriers are needed, does the performance impact show up, and even then it's still not the end of the world in terms of emulation performance.
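
To make that distinction concrete, here's a minimal sketch (my own illustration, all names mine, not from the linked docs) of the kind of pattern that genuinely depends on store ordering - safe under x86's TSO even with plain stores, but needing release/acquire on ARM if the translator can't assume TSO:

    #include <atomic>
    #include <thread>

    // Message passing that is correct under x86 TSO even with relaxed
    // operations, because x86 keeps the two stores in program order. On ARM
    // without barriers the reader may observe `ready` before `data`, so a
    // translator has to decide whether this pattern needs release/acquire.
    std::atomic<int> data{0};
    std::atomic<bool> ready{false};

    void writer() {
        data.store(42, std::memory_order_relaxed);    // payload first
        ready.store(true, std::memory_order_relaxed); // then the flag (needs release on ARM)
    }

    void reader() {
        while (!ready.load(std::memory_order_relaxed)) { } // needs acquire on ARM
        int v = data.load(std::memory_order_relaxed);      // may read 0 under weak ordering
        (void)v;
    }

    int main() {
        std::thread a(writer), b(reader);
        a.join();
        b.join();
    }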

As far as applying it, the default assumption for apps is that they don't need it, and heuristics can try to catch the ones that do. For well-known apps that do need TSO, it's part of the compatibility profile to increase the barriers to the level needed for reliable operation. For unknown apps that do need TSO, you'll get a crash and a recommendation to try running in a stricter emulation compatibility mode, but this is exceedingly rare given the above two things have to fail first.

Details here https://docs.microsoft.com/en-us/windows/uwp/porting/apps-on...


> For unknown apps that do need TSO you'll get a crash

Sure about that? Couldn't it lead to silent data corruption?


Yes, it absolutely can. Shameless but super relevant plug. I'm (slowly) writing a series of blog posts where I simulate the implications of memory models by fuzzing timing and ordering: https://www.reitzen.com/

I think the main reason why it hasn't been disastrous is that most programs rely on locks, and those are going to be translated to the equivalent ARM instructions with a full memory barrier.

Not too many consumer apps are going to be using lockless algorithms, but where they are used, all bets are off. You can easily imagine a queue that two threads grab the same item from, for instance.
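
As a simplified sketch of how that can bite (my own example, not anything from an actual translator): a single-producer/single-consumer ring buffer works on x86 with plain stores, because the element write becomes visible before the index bump, but under ARM's weaker ordering the consumer can observe the bump first and read a half-written slot:

    #include <atomic>

    constexpr int N = 64;
    int slots[N];
    std::atomic<int> tail{0};  // next slot the producer will fill
    int head = 0;              // consumer-local index

    void produce(int value) {  // overflow check omitted for brevity
        int t = tail.load(std::memory_order_relaxed);
        slots[t % N] = value;                          // write the element...
        tail.store(t + 1, std::memory_order_relaxed);  // ...then publish; needs release on ARM
    }

    bool consume(int* out) {
        if (head == tail.load(std::memory_order_relaxed)) // needs acquire on ARM
            return false;
        *out = slots[head % N];  // may see a stale slot under weak ordering
        ++head;
        return true;
    }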


Heuristics are used. For example, memory accesses relative to the stack pointer will be assumed to be thread-local, as the stack isn’t shared between threads. And that’s just one of the tricks in the toolbox. :-)

The result of those is that the expensive atomics aren't applied to every access on hardware that doesn't expose a TSO memory model.
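
A rough sketch of what that kind of decision might look like (this is just my guess at the shape of such a heuristic, with made-up names - not the actual Windows or Rosetta implementation):

    // Decide how to translate one x86 memory access when the target hardware
    // does not provide TSO: stack-relative accesses are assumed thread-local
    // and get plain loads/stores; everything else is emitted conservatively
    // as release/acquire. Types and field names here are hypothetical.
    enum class Ordering { Plain, Acquire, Release };

    struct X86MemOp {
        bool rsp_relative;  // addressed off the stack pointer
        bool is_store;
    };

    Ordering choose_ordering(const X86MemOp& op, bool strict_mode) {
        if (op.rsp_relative && !strict_mode)
            return Ordering::Plain;  // stack assumed not shared across threads
        return op.is_store ? Ordering::Release : Ordering::Acquire;
    }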


Nitpick: relative speed differences do not add up; they multiply. A speed of 70 is 40% faster than a speed of 50, and a speed of 100 is 42.8571…% faster than a speed of 70 (corrected from 50. Thanks, mkl!). Conversely, a speed of 70 is 30% slower than a speed of 100, and a speed of 50 is 28.57142…% slower than one of 70.

=> when comparing speed, 70% is almost exactly halfway between 50% and 100% (the midpoint being 100%/√2 ≈ 70.7%)
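
(Worked out multiplicatively: 50 × √2 ≈ 70.7 and 70.7 × √2 ≈ 100, so both steps are the same ≈1.414× speed-up, which is what makes ≈70.7% the midpoint.)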


Not sure what the nit is supposed to be; 70% is indeed less than sqrt(1/2), hence me mentioning it. And yes, it's closer to 3/4 than to half or full, but the thing being pointed out wasn't "what's the closest fraction" but rather "it's really not that close to 1:1".


> a speed of 100 is 42.8571…% faster than a speed of 50

I think you mean either "100% faster" or "faster than a speed of 70" there.


Meh, in the context of emulation, which ran at 5% before JITs, 70% is pretty close to 1:1 performance. Given that the M1 is also a faster CPU than any x86 MacBook, it's really a wash (yes, recompiling under ARM is faster...)


No, it's not. Switching to TSO allows for fast, accurate emulation, but if they didn't have that they would just go the Windows route and drop barriers based on heuristics, which would work for almost all software. The primary driver behind Rosetta's speed is extremely good single-threaded performance and a good emulator design.


That’s a single ACTLR bit present in the publicly released XNU kernel source code.




