"There is no hardware in the M1's for x86 emulation. Rosetta 2 does on the fly translation for JIT and caches translation for x86_64 binaries on first launch."

This is not quite correct.

First, as I understand it anyways, Rosetta 2 mostly tries to run whole-binary translation on first launch, rather than starting up in JIT and caching results. It does have a JIT engine, but it's a fallback for cases which a static translation can't handle, such as self-modifying code (which is to say, any x86 JIT engine).

Second, there is some hardware! The CPU cores are always running Arm instructions, but Apple put in a couple nonstandard extensions to alter CPU behavior away from standard Arm in ways which make Rosetta 2 far more efficient.

The first is for loads and stores. x86 has a strong memory ordering model which permits very little program-visible write reordering. Arm has a weaker memory model which allows more write reordering. If you were writing something like Rosetta 2, and naively converted every memory access to a plain Arm load or store, the reorderings permitted by Arm rules would cause nasty race conditions in multithreaded code that was perfectly correct on x86.

So, Apple put in a mode bit which causes a CPU core to obey x86 memory ordering rules. The macOS scheduler sets this bit when it's about to pass control to a Rosetta-translated thread, and clears it when it takes control back.
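To make the ordering problem concrete, here's a minimal C++ sketch of the classic "write the data, then set a flag" pattern. It is not Rosetta 2 code, and every name in it (`payload`, `ready`, `producer`, `consumer`, `run_message_passing`) is made up for illustration. On x86 the two stores in `producer` could be plain stores and the hardware would still make them visible in order. On Arm without the mode bit, a translator would have to attach release/acquire semantics (or barriers) to essentially every access to get the same guarantee.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names throughout; this only illustrates the ordering problem.
int payload = 0;                 // plain data, written before the flag
std::atomic<bool> ready{false};  // flag that publishes the payload

// Producer: on x86 two ordinary stores already become visible in program
// order. On Arm, the flag store needs release semantics (or a barrier)
// so it cannot become visible before the payload store.
void producer(int value) {
    payload = value;
    ready.store(true, std::memory_order_release);
}

// Consumer: spin until the flag is set, then read the payload.
// The acquire load pairs with the release store above.
int consumer() {
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return payload;
}

// Run one producer and one consumer and return what the consumer saw.
int run_message_passing(int value) {
    ready.store(false, std::memory_order_relaxed);  // reset before threads start
    int seen = 0;
    std::thread reader([&seen] { seen = consumer(); });
    std::thread writer(producer, value);
    writer.join();
    reader.join();
    assert(seen == value);  // holds because of the release/acquire pairing
    return seen;
}
```

Calling `run_message_passing(42)` returns 42. If both accesses to `ready` were downgraded to `std::memory_order_relaxed`, Arm rules would allow the consumer to see the flag set and still read a stale `payload`. That is the kind of bug the mode bit lets translated code avoid without paying for a barrier on every access.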

The second mode bit concerns FPU behavior. IEEE 754 leaves enough wiggle room that two compliant implementations can produce different results from the same sequence of mathematical operations. You can probably see where this is going, right? Turns out that Arm and x86 don't always produce bit-exact results.

Since Apple wanted to guarantee very high levels of emulation fidelity, they provided another CPU mode bit which makes the FPU behave exactly like an x86 SSE FPU.
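If you want to see this kind of difference for yourself, here's a small C++ sketch that dumps the raw bits of a float. The helper names `bits_of` and `runtime_nan` are mine, not from any library. One commonly cited divergence, as I understand it, is the sign bit of the "default" NaN the hardware generates for an invalid operation like 0/0: x86 SSE sets it, standard Arm clears it. Both are valid NaNs under IEEE 754, but they are not the same bits, so I'm deliberately not claiming a specific output here.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Return the raw IEEE 754 bit pattern of a 32-bit float.
std::uint32_t bits_of(float x) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects 32-bit float");
    std::uint32_t u;
    std::memcpy(&u, &x, sizeof u);  // well-defined way to reinterpret the bits
    return u;
}

// Produce a NaN at run time via 0/0. The volatile keeps the compiler from
// folding the division at compile time, so the hardware FPU picks the bits.
float runtime_nan() {
    volatile float zero = 0.0f;
    float r = zero / zero;
    assert(std::isnan(r));  // always a NaN; which NaN depends on the FPU
    return r;
}
```

Printing `bits_of(runtime_nan())` in hex on a native x86 build and a native Arm build is a quick way to compare the two. A program that hashes or compares raw float bits would notice the difference, which is why an emulator that cares about fidelity can't just ignore it.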