About the Rosetta Translation Environment (developer.apple.com)
221 points by loun on June 23, 2020 | 251 comments



> Rosetta can translate most Intel-based apps, including apps that contain just-in-time (JIT) compilers.

How on Earth does it do that? If executable code is being generated at runtime, it's going to be x86_64 binary machine code still (there are too many ways to generate valid machine code, and it won't know right away whether you're JITting, or cross compiling and actually want x86_64), so Rosetta would need to detect when the code's about to be run, or when it's marked as executable, and translate the machine code in that bit of memory just in time. The length of the ARM code might be longer, so it would have to be in a different block of memory, with the original x86_64 code replaced with a jump to the new code or something.

It's late at night here, so maybe I'm missing a simpler approach, but I'm a bit surprised they have it working reliably enough to make such a general statement (there being a great variety of JIT systems). From a quick search I can't tell if Microsoft's x86-on-ARM translation in Windows 10 ARM supports JITs in the program being run.


They might be using something like an NX bit on the generated x86_64 page, so that whenever the code attempts to jump into it, a page fault is generated, and the kernel is able to handle that, kicking in the JIT compilation and translating the code / address. This is essentially a "double JIT" so there will likely be a performance hit.

Since they control the silicon, Apple might also be leveraging a specialized instruction / feature on the CPUs (e.g. a "foreign" bit that's able to mark memory pages as being from another architecture, and some addressing scheme that links them to one or more native pages behind the scenes)

Maybe the A series chips even have "extra" registers / program counters / interrupts that aid in accelerating this emulation process.


Think of binary translation as just another kind of compiler. It parses machine code, generates an IR, does things to the IR, and then codegens machine code in a different ISA. (Heck, Rosetta 2 is probably built on LLVM. Why wouldn't it be? Apple already put so much work into it. They could even lean on similar work like https://github.com/avast/retdec .)

During the "do things to IR" phase of compilation, you can do static analysis, and use this to inform the rest of the compilation process.

The unique pattern of machine code that must occur in any implementation of a JIT is a jump to a memory address that was computed entirely at runtime, i.e. with a https://en.wikipedia.org/wiki/Use-define_chain for that address value that leads back to a call to mmap(2) or malloc(2). Static analysis of the IR can find and label such instructions; and you can use this to replace them in the IR with more-abstract "enter JITed code" intrinsic ops.

Then, in the codegen phase of this compiler, you can have such intrinsics generate a shim of ARM instructions. Conveniently, since this intrinsic appears at the end of the JIT process, the memory at the address passed to the intrinsic will almost certainly contain finalized x86 code, ready to be re-compiled into ARM code. So the shim can just do a (hopefully memoized) call to the Rosetta JIT translator, passing the passed-in address, getting back the address of some ARM code, and jumping to that.


> The unique pattern of machine code that must occur in any implementation of a JIT is a jump to a memory address that was computed entirely at runtime, i.e. with a https://en.wikipedia.org/wiki/Use-define_chain for that address value that leads back to a call to mmap(2) or malloc(2).

This is probably mostly true. But actually reliably finding such cases, in the most hostile environment imaginable (namely, on x86-64 binary code), would require heroic program analysis that I as a compiler engineer would judge unrealistic.

I would find it much more plausible for the system to watch for something like "jump to a target in a memory page that at some point had the W flag set, has since had W removed and X set, and has not been jumped to since that", and then trigger a compilation of that page starting at the given jump address, and a rewrite of the jump instruction. As proposed in https://news.ycombinator.com/item?id=23614894 this could be done by watching for page faults, though presumably it would mean intercepting places where the application tries to set the X bit on a page and setting NX instead so that the page fault can be caught.
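
Concretely, the interception could look something like this sketch (my speculation, not Apple's code; record_pending_translation is a hypothetical hook into the translator's bookkeeping):

  #include <sys/mman.h>
  #include <stddef.h>

  // Hypothetical hook: note [addr, addr+len) as untranslated x86 code.
  static void record_pending_translation(void *addr, size_t len) {
      (void)addr; (void)len;
  }

  static int guest_mprotect(void *addr, size_t len, int prot) {
      if (prot & PROT_EXEC) {
          record_pending_translation(addr, len);
          prot &= ~PROT_EXEC;            // the page stays NX for the real CPU
      }
      return mprotect(addr, len, prot);
  }

The page then never becomes executable as far as the ARM MMU is concerned, and the first attempted jump into it faults into the translator.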


> Rosetta 2 is probably built on LLVM. Why wouldn't it be?

LLVM is kind of slow; JavaScriptCore abandoned it years ago, replacing it with the custom B3 backend in their FTL tier.

> The unique pattern of machine code that must occur in any implementation of a JIT is a jump to a memory address that was computed entirely at runtime, i.e. with a https://en.wikipedia.org/wiki/Use-define_chain for that address value that leads back to a call to mmap(2) or malloc(2).

  // (Assumes <sys/mman.h> and <cstring>; "code" is a buffer of x86-64 machine code.)
  void *region = mmap((void *)0x100000000, 0x1000, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
  memcpy(region, code, 0x1000);
  mprotect((void *)0x100000000, 0x1000, PROT_READ | PROT_EXEC);
  reinterpret_cast<void (*)()>(region)();


Re: your rebuttal code — that’s still a use-define chain. You don’t need the same literal pointer; you just need to know that the value of the pointer ultimately depended on the output of mmap(2). Since the mmap(2) region address is passed into memcpy(2)—and memcpy(2) can fail, producing NULL—the output of memcpy(2) does then depend on the input. (Even if it didn’t, you could just lie to the compiler and tell it to pretend that memcpy(2) always depends on its input.)


> and memcpy(2) can fail, producing NULL

Huh, this is news to me. The memcpy(3) man page on my Linux box (there is no memcpy(2) here) doesn't mention this either. Is this some special MacOS or BSD feature of memcpy? Under what circumstances would it determine that it should fail?

Your reasoning is strange anyway, since the memcpy has nothing to do with anything; the implicit information flow from mmap to mprotect would exist even if the memcpy and the region variable were removed:

  mmap((void *)0x100000000, 0x1000, PROT_READ | PROT_WRITE,
       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
  mprotect((void *)0x100000000, 0x1000, PROT_READ | PROT_EXEC);


The original PowerPC on Intel Rosetta was pretty amazing.

First, most programs do much of their work inside the OS - rendering, network, interaction, whatever - so that's not emulated; Rosetta just calls the native OS functions after doing whatever input translation is necessary. So, nothing below a certain set of APIs is translated.

You have to keep a separate translated binary in memory, and be able to compile missing bits as you encounter them, while remembering all your offset adjustments. It worked amazingly well during the PowerPC transition. Due to so many things running natively on x86, the translated apps were frequently faster than running native on PowerPC macs!


> the translated apps were frequently faster than running native on PowerPC macs!

This gets repeated a lot and it's generally false. In fact, most of the time they were slower, and realistically you would expect this. On a clock-for-clock basis, an OG Mac Pro 2.66GHz was about 10-20% slower than a Quad G5 2.5GHz running the same PowerPC software. In some benchmarks, the Quad G5 was still faster at running PowerPC software than the 3.0GHz Mac Pro (see https://barefeats.com/quad06.html ). When the Core Duo had to pay the Rosetta tax, even upper-spec G4s could get past it (https://barefeats.com/rosetta.html) and it stood no chance against the G5.

Where I think this misconception comes from is that on native apps (and Universal apps have native code), these first Intel Macs cleaned the floor with the G4 and most of the time nudged past even the mighty Quad, and by the second generation it wasn't a contest anymore. But for the existing PowerPC software that was available during the early part of the Intel transition, Power Macs still ran PowerPC software best overall. It wasn't Rosetta that made the Intel Macs powerful, it was just bridging things long enough to buy time for Universal apps to emerge. Rosetta/QuickTransit was marvelous technology, but it wasn't that marvelous.


I had a quad 2.5GHz water-cooled Power Mac G5 on my desk, and the OG Intel Mac Pro, as well as Apple's dev system. I can tell you with 100% certainty that Google Earth ran faster on the Intel Mac Pro under Rosetta than natively on the G5, as I’m the person who ported it. When I had the native version working on Intel, that was the fastest of all, by far.


History repeats itself. Even though Apple didn’t make the claim this time, they did claim that PPCs would run 68K apps faster because they would spend much of their time running native code. This wasn’t true then either - at least not for the first gen PPC Macs.


Typically the way systems do this is by translating small sections of straight-line code, and patching the exits as they are translated. So you start by saying "translate the block at address 0x1234". That code may run until a jump to address 0x4567. When translating that jump, they instead make a call to the runtime system which asks "where is the translated code starting at address 0x4567?" If the code doesn't exist, it goes ahead and translates that block and patches the original jump to skip the runtime system next time around.

This means early on in the program's run you spend a lot of time translating code, but it pretty quickly stabilizes and you spend most of your time in already translated code.

Of course, if your program is self modifying then the system needs to do some more work to invalidate the translation cache when the underlying code is modified.
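
A toy version of the runtime lookup described above (purely illustrative, nothing to do with Rosetta's actual code; translate_block() stands in for the real x86-to-ARM translator):

  #include <stdint.h>
  #include <stdio.h>

  typedef struct { uint64_t guest_pc; void *host_code; } tc_entry;
  static tc_entry tcache[4096];          // direct-mapped translation cache

  static void *translate_block(uint64_t guest_pc) {
      // Placeholder: the real thing would decode x86 starting at guest_pc,
      // emit equivalent ARM code into executable memory, and return it.
      printf("translating block at %#llx\n", (unsigned long long)guest_pc);
      return (void *)(uintptr_t)guest_pc;
  }

  // Called from a block's exit stub when the jump target isn't translated yet.
  // The caller then patches that exit to jump straight to the returned code.
  void *runtime_lookup(uint64_t guest_pc) {
      tc_entry *e = &tcache[(guest_pc >> 2) & 4095];
      if (e->guest_pc != guest_pc || !e->host_code) {
          e->guest_pc = guest_pc;         // evict on collision, retranslate later
          e->host_code = translate_block(guest_pc);
      }
      return e->host_code;
  }

  int main(void) {
      runtime_lookup(0x1234);   // miss: translates the block
      runtime_lookup(0x1234);   // hit: no retranslation
      runtime_lookup(0x4567);   // new target, translated lazily
      return 0;
  }

Invalidating for self-modifying code then just means clearing the cache entries (and any patched exits) that cover the modified range.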


It probably just hooks mmap() (or mprotect()) and looks for when the JIT sets PROT_EXEC on memory it hasn't translated yet.

Here's a tutorial for a simple JIT that uses mmap() with PROT_EXEC to get memory to write machine code into before executing it:

https://github.com/spencertipping/jit-tutorial
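
The core of such a JIT is tiny; something along these lines (x86-64 only, error handling mostly elided, not taken from the tutorial):

  #include <sys/mman.h>
  #include <string.h>
  #include <stdio.h>

  int main(void) {
      // x86-64 for "return 42":  mov eax, 42 ; ret
      unsigned char code[] = { 0xb8, 0x2a, 0x00, 0x00, 0x00, 0xc3 };

      void *mem = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (mem == MAP_FAILED) return 1;
      memcpy(mem, code, sizeof code);
      mprotect(mem, 4096, PROT_READ | PROT_EXEC);   // the W -> X flip a translator can hook
      int (*fn)(void) = (int (*)(void))mem;
      printf("%d\n", fn());                         // prints 42
      return 0;
  }

That mprotect() (or an initial mmap() with PROT_EXEC) is exactly the point where an emulation layer can notice that a page is about to be treated as code.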


Since they don't enforce W^X, you need to go a little deeper than that: some JITs reuse the same memory later for new traces.

So what you'll do is silently enforce W^X to trap code modifications, but not propagate those traps to the emulated code.


Right, if you look at the example JIT project I posted, they start by creating the memory with mmap() and PROT_EXEC before generating any code. So yes, you'd need to trap subsequent writes to allow for retranslation [I assume they have hooks in the Darwin kernel for this].


Or they just enforce W^X in the Rosetta runtime by intercepting client mmap calls, and fix it up as far as the client code is concerned by catching SIGBUS first.

I don't see anything here that requires extra kernel hooks.


Pages containing x86 code are never actually marked executable in the (ARM) page tables, because that would be nonsensical: the CPU itself does not know how to run x86 code. mmap/mprotect from x86 with the exec bit therefore does something different than it does for native code.


Right, but the x86 code _thinks_ it does, and those semantics have to be maintained.

What that practically means is that x86 executable pages get mapped as RO by default as seen by the ARM side, even if they would normally be RWX. Modifications trap; the handler mprotects the pages as writable, flushes the translation cache for that region, then returns from the trap. Then on an x86 jump to that region, the JIT cache sees no traces for that region, marks the page as RO again so that modifications will trap, and starts recompiling (or maybe just interpreting; they may be using a tiered approach in the JIT).
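
In pseudo-POSIX terms, the write-trap half of that dance might look roughly like this (a guess at the mechanism, not Apple's code; invalidate_translations() is a stand-in for the JIT-cache hook, and checking that the fault really is a write to a tracked code page is elided):

  #include <signal.h>
  #include <string.h>
  #include <stdint.h>
  #include <sys/mman.h>
  #include <unistd.h>

  static void invalidate_translations(void *page, size_t len) {
      (void)page; (void)len;   // would drop the cached ARM code for this range
  }

  static void write_trap(int sig, siginfo_t *info, void *ctx) {
      (void)sig; (void)ctx;
      size_t pagesz = (size_t)getpagesize();
      void *page = (void *)((uintptr_t)info->si_addr & ~(pagesz - 1));
      invalidate_translations(page, pagesz);
      // Let the guest JIT's store proceed; the page goes back to RO (and gets
      // retranslated) the next time x86 execution jumps into it.
      mprotect(page, pagesz, PROT_READ | PROT_WRITE);
  }

  static void install_write_trap(void) {
      struct sigaction sa;
      memset(&sa, 0, sizeof sa);
      sa.sa_sigaction = write_trap;
      sa.sa_flags = SA_SIGINFO;
      sigemptyset(&sa.sa_mask);
      sigaction(SIGSEGV, &sa, NULL);   // macOS tends to deliver SIGBUS for this
  }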


Yeah, I specifically said they don't get marked executable in the page tables. From the x86 code's perspective, they of course look executable.


"Just"

This is some serious black magic, especially if it actually works. Even if it only works some of the time I give it an A for effort, since they could easily have just said "JITs will not work with Rosetta." Most applications do not contain JITs.

My guess is there are enough important Mac apps that contain JITs of some form to merit this work, probably high performance math and graphics kernels that JIT processing pipelines or something like that. I wonder if some pro studio editing applications do this.


> Most applications do not contain JITs.

Java apps and JavaScript apps (Electron) come to mind.


True, but those are trivial to port as there are already mature JDKs and JavaScript VMs for ARM64. Porting an electron app should be a matter of checking it out and building it with ARM support. Applications with internal JITs custom-built for X64 are going to be a lot tougher.


Electron doesn't support aarch64 builds compiled from aarch64 hosts yet (so you have to cross compile), and doesn't support cross compiling if you have any native code addons.

Some apps out there would be between a rock and a hard place if it weren't for Rosetta.


I suspect this will change fairly soon, now that there is going to be a significant non-mobile aarch64 market.


There's still a non-trivial amount of time to get that propagated: first merged into Chromium, then merged into Electron, then actually used by the client programs.


I'd say it would still be less than the first ARM Mac shipping time


>This is some serious black magic

If there were ever a time and a place for black magic, it would be this use case: swapping out an enormous layer (the hardware) at the bottom of the stack, while keeping everything at the top of the stack working seamlessly.


There might be insights from the way Dolphin's PPC JIT works:

https://www.reddit.com/r/emulation/comments/2xq5ar/how_close...


I'm not sure why they're making a big deal about this, couldn't the original Rosetta do this too? QEMU has been doing this since (I think) even before the original Rosetta; they call it user mode emulation. You run as if it was a normal emulator, but also trap syscalls and forward them to the native kernel instead of emulating a kernel too.

I'm more interested in how they're doing the AOT conversion and (presumably) patching the result to still be able to emulate for JITs. That'd be (comparatively) simple if it was just for things from the iOS and Mac App Stores, since Apple still has the IR for them, but they made it sound like it was more generic than that.

https://www.qemu.org/docs/master/user/main.html


> I'm not sure why they're making a big deal about this

Don’t they need to communicate to developers and users what software will (and will not) run on their upcoming computers?


The original Rosetta couldn’t do it, or couldn’t do some part of it. I remember because Sixtyforce couldn’t run. I believe that it could have to do with self-modifying code.


> I'm not sure why they're making a big deal about this

Because they're trying to get the message across to regular people (not HN types) that their software will continue working with the new chips, and they don't have to flee the Apple ecosystem.


> I'm not sure why they're making a big deal about this, couldn't the

That's par for course with Apple. They never acknowledge competitors, even when the competitor is themselves. Everything they do or describe is awesome and magical, right now! It could be an incremental improvement, it could be a half-decade-old established technology, it could be something completely unexpected and science fiction turned into reality. That's just how Apple works.


>That's par for course with Apple. They never acknowledge competitors

Unlike which company that does?



That's not acknowledging competitor technologies.

That's referencing them to say how much better you are: "Optional Clear View™ mast offers up to 40% greater visibility than leading competitors."

Apple does that too.


My gut feeling is that it will be about the same as Itanium/HP Envy x2 emulation. Emulating code that was generated by highly optimized compilers for highly optimized hardware, without an order-of-magnitude slowdown, is just too good to be true.


Valgrind does something similar (x86->intermediate language->x86), and is only about 4x slower than native with all the analyses disabled. I’d guess they left some optimizations out to make it easier to implement the dynamic checks it supports.


Valgrind leaves most instructions as they were, doesn't it? If you're not touching the dynamic memory it should be as fast. You wouldn't be able to do that with complex MMX or SSE2 instructions when translating to Arm.


It lifts them to a simplified version of x86 so they can be instrumented / transformed more easily. I think that implies you get different instructions when it lowers back to x86, but I could be wrong. (I’ve written a specialized valgrind instrumentation tool or two, but didn’t look too carefully at the execution half of their codebase.)


> If executable code is being generated at runtime, it's going to be x86_64 binary machine code still

JITs have to explicitly make the memory they write the generated code to executable. The OS "just" needs to fail to actually make the page executable, and then handle the subsequent page fault by transpiling or interpreting the x86 code therein.


And it wouldn't make sense to mark any pages originating from an x86 process as actually executable at the page table level anyway, as the CPU can never actually "execute" them as machine code.


I wonder how it handles the stricter memory ordering from x86? For example, this code:

  void store_value_and_unlock(int *p, int *l, int v)
  {
    *p = v;
    __sync_lock_release(l, 0);
  }

on x86_64 compiles to:

   mov     dword ptr [rdi], edx
   mov     dword ptr [rsi], 0
   ret

vs on arm64:

   str     w2, [x0]
   stlr    wzr, [x1]
   ret

Notice how the second store on ARM is a store with release semantics to ensure correct memory ordering as it was intended in the original C code. This information is lost as it's not needed on x86 which guarantees (among other things) that stores are visible in-order from all other threads.


That's the big piece I've been wondering about too.

Three options as I see it (none of them great):

1) Pin all threads in an x86 process to a single core. You don't have memory model concerns on a single core.

2) Don't do anything? Just rely on apps to use the system-provided mutex libraries, and they just break if they try to roll their own concurrency? Seems like exactly the applications you care about (games, pro apps) would be the ones most likely to break.

3) Some stricter memory model in hardware? Seems like that'd go against most of the stated reason for switching to ARM in the first place.


There's another option:

4) Translate x86 loads and stores to acquire loads and release stores, to align with the x86 semantics. These already exist in the ARM ISA, so it's not much of a stretch at all.

This is the one I'm betting on.


Except you'd need to do that with every single store, since the x86 instruction stream doesn't have those semantics embedded in it.

That'd kill your memory perf by at least an order of magnitude, and kill perf for other cores as well. It'd be cheaper to just say "you only get one core in x86 mode". Essentially you'd be marking every store as an L1 and store buffer flush and only operating out of L2.


ARMv8.1 adds a bunch of improved atomic instructions that basically implement the same functionality as x86. x86 has atomic read-modify-write instructions by default, so you need to emulate those too.

The ARMv8.1 extensions look like they have been explicitly designed to allow emulation of x86.

The implication is that implementations of ARMv8.1 can (and perhaps should) implement cache/coherency subsystems with high-performance atomic operations.

And I'm willing to bet Apple has made sure their implementation is good at atomics.
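
To make that concrete, the same C11 atomic RMW lowers to a single locked instruction on x86 and, when the compiler is allowed to target ARMv8.1's LSE extension, to a single instruction on ARM as well (codegen in the comments is approximate and compiler-dependent):

  #include <stdatomic.h>

  int fetch_add_one(_Atomic int *counter) {
      return atomic_fetch_add_explicit(counter, 1, memory_order_seq_cst);
      // x86-64:           lock xadd [rdi], eax      (one instruction)
      // ARMv8.1 (+LSE):   ldaddal w8, w0, [x0]      (one instruction)
      // ARMv8.0 fallback: an ldaxr/stlxr retry loop
  }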


So the ARMv8.1 extensions aren't for emulating x86; they're a reflection of how concurrency hardware has changed over the years.

It used to be that (and you can see this in the RISC ISAs from the 80s/90s, but AIUI this is what happened in x86 microcode as well):

* the CPU would read a line and lock it for modifications in cache.

* the CPU core would do the modification

* the CPU would write the new value down to L2 with an unlock, and L2 would now allow this cache line to be accessed

So you're looking at ~15 cycles of contention, from L2 through the CPU core and back down to L2, with a lock held on that line.

If this all looks really close to a subset of hardware transactional memory once you realize that the store can fail and the CPU has to do it again if some resource limit is exceeded, you're not alone.

Then somebody figured out that you can just stick an ALU directly in L2, and send the ALU ops down to it in the memory requests instead of locking lines. That reduces the contention time to around a couple cycles. These ALU ops can also be easily included in the coherency protocol, allowing remote NUMA RMW atomics without thrashing cache lines like you'd normally need to.

This is why you see both in the RISC-V A extension as well. The underlying hardware has implementations in both models. I've heard rumors that earlier ARM tried to do macro op fusion to build atomic RMWs out of common sequences, but there were enough versions of that in the wild that it didn't give the benefits they were expecting.

However, all that being said, atomics are orthogonal to what I'm talking about. The problem here is x86's TSO model, and how it _lacks_ barriers in places where ARM requires them, with no context in the instruction stream about where they're necessary - not emulating the explicit atomic sequences.


On the other hand, the ARMv8.5 flag manipulation instructions almost certainly were added specifically for x86 emulation.


Yes, you need it for every store.

It absolutely does not imply that it would kill your memory perf by an order of magnitude! Remember that x86 does this as their normal store behavior and they don't take an order of magnitude hit.

The easiest approach that leads to fast release stores is just to have a stronger store pipeline that doesn't allow store-store or load-store reordering. The latter basically comes for free and most weak pipelines already preserve it.

The store-store ordering, on the other hand, does have a cost: primarily in requiring stores to drain in order, and handling misses in order. Nothing close to an order of magnitude.

A higher performance design would allow store-store reordering, but not around release stores, giving most ARM code the full benefit while still allowing fast release stores for x86.

I think you are mixing up release stores with more expensive atomics like seq-cst stores or atomic RMWs. There is no need for a store buffer drain, ever, for release stores.


What I can say is that they aren't pinning to only a single core, so the answer is elsewhere.


That would be too expensive, unless the emulation overhead in general is so high that it doesn't matter.

If store releases and load acquires were very cheap already, why make them distinct from normal ones anyway?

I suspect that either the CPU is already TSO in practice or has some TSO mode.


Then why not expose that to ARM applications running on it?


Apple might not want to promote this to an architectural guarantee. It is only needed for the few years required to transition away from x86, but if applications start relying on it it will need to maintain it forever.


If they run with it for long enough it will all but become an architectural guarantee as people unintentionally write incorrect programs that happen to still work right.


> If store releases and load acquires were very cheap already, why make them distinct from normal ones anyway?

Perhaps if the ARM ISA was designed today, they would, I'm not sure.

My impression is that ARM added them when (a) it became obvious which way the wind was blowing with respect to modern memory models, like Java (sort of), C and C++, where acquire and release are the dominant paradigm (and I doubt any other language will stray very far), and (b) they could see that these operations can be implemented at a relatively low complexity cost and reasonable performance, compared to the existing barrier approach.

That said, there is a lot of room between "very cheap" and "too expensive", and it is probably hardware dependent. On some simple microcontroller-like design that doesn't want to do any extra work to support these, they might be very expensive, just relying on something like a full barrier which they have to support anyways.

However, on bigger designs, it is not necessarily very expensive to implement efficient release stores. They will still be slower than plain stores, but may not be all that much slower. It's nothing like sequentially consistent stores, or barrier types that require a full store buffer drain.

Mostly all you need is to ensure that the stores drain in order. Actually the requirement is even weaker: you can't let release stores pass older stores. So it depends on your store buffer design and prevents some optimizations like merging and out-of-order commit to L1D, but those are far from critical optimizations (and you can still do them for plain stores at the cost of a bit more complexity).

If you are designing a core where release stores only occur in concurrent code, e.g,. as a result of a release store in C++, or as a lock release or whatever, you don't need to make them that fast. If they are 10x slower than regular stores, it's probably OK.

However, if you are designing a chip where you know you are going to be running a lot of emulated x86 code where every store is a store release, then yeah, you are going to do a bit more work to make these fast.

What are the other options? (1) and (2) in the GP's list seem very unlikely to me. (1) would almost certainly be much worse than release stores for multithreaded apps and destroy performance in popular creator apps.

(3) is certainly plausible and one variant of what I'm suggesting here: it means making plain stores and release stores the same thing (and I guess for loads). Definitely possible but seems less likely to me than (4).

Another possibility is a TSO mode, as you suggest. Perhaps this is somewhat easier than dynamically handling release stores, I'm not sure.


Sorry for the confusion; I'm aware why ARM added those instructions (a very good decision compared to the bizarre barriers available on classic RISCs). What I meant is: if store release can be implemented to be as fast as a normal store, why wouldn't Apple just give release semantics to normal stores? I guess this allows them to be forward compatible with future, more relaxed architectures, but I don't think they care about forward compat of translated code. Still, it is certainly possible that you are right.

Re TSO mode, I thought I remembered that Power8/9 had a TSO mode (explicitly for compatibility with x86 code), but I can't find any reference to it right now.


I think understood the first time, but my answer is:

"I never said release stores can be implemented as fast as regular stores, but they might well be fast enough. Maybe half the throughout (1/cycle) with some other ordering related slowdowns.

In particular, maybe you can implement them as fast an acq/rel all the time CPU mode would be".

Or something like that.


Yes, I could see a slightly slower but fast-enough store and load being good enough (and in fact it would be great in general, not just for translation).

The reason I'm thinking the CPU might actually be TSO is that I haven't seen significant evidence that, in a high performance CPU, TSO is a significant performance bottleneck. For the last 15 years or so Intel has had the best performing memory subsystem and didn't seem significantly hampered by reordering constraints compared to, say, POWER.


Yes, it's not a devastating impact, but my thinking on this has shifted a bit lately to "somewhat significant" impact. For example, I believe the strong store-store ordering requirement significantly hurts Intel chips when cache misses and hits are mixed and an ABA scenario occurs as described at [1].

Also, it seems that Apple ARM chips exhibit essentially unlimited memory level parallelism, while until very recently Intel chips had a hard limit of 10 or 12 outstanding requests, and in a very hand-wavy way some have claimed that this may be related to the difficulty of maintaining ordering.

More recently, Ice Lake can execute two stores per cycle, but can only commit one store per cycle to the L1D, unless two consecutive stores are to the same cache line. The "consecutive" part of that requirement comes directly from the store-store ordering requirement, and is a significant blow for some high store throughput workloads.

Similarly, I believe the whole "memory ordering mis-speculation on exiting a spin lock spin" thing, which was half the reason for the pause instruction, also comes from the strong memory model.

None of these are terrible restrictions on performance, but they aren't trivial either. Beyond that it is hard to estimate the cost of the memory ordering buffer in terms of power use, etc.

I agree that Intel has made chips with powerful memory subsystems despite this, but it is really hard to compare across vendors anyway: R&D and process advantage can go a long way to papering over many faults.

[1] https://www.realworldtech.com/forum/?threadid=173441&curpost...


Thanks for the RWT link, I had missed that discussion back then. I normally assume that store-store reordering is not a huge deal as the store buffer hides any latency and blocking, but I failed to appreciate that the store buffer filling up is an issue.

But which architectures do not actually drain the buffer in order in practice? Even very relaxed RISCs (including ARM) normally respect causality (i.e. memory_order_consume) and it seems to me that if, say, an object pointer is made visible before the pointed-to object, that can violate this guarantee, right?

You say that Apple CPUs show near unlimited MLP, do you have any pointers?


Never mind, of course memory order consume still requires a store-store fence between the two stores so reordering stores is still possible


> 3) Some stricter memory model in hardware? Seems like that'd go against most of the stated reason for switching to ARM in the first place.

I would assume that a more strict memory model would be enabled only for processes that need it (i.e., Rosetta-translated ones). So a cpu-flag is set/cleared when entering/exiting user mode for those processes. Does this require a separate/special cache coherency protocol? A complete L1d flush when entering/leaving these processes (across all CPUs)? Not an expert in this field and it feels complicated for sure. Is it worth it for just emulating "legacy" applications during a transitional period? Perhaps Apple can pull it off though.


> I would assume that a more strict memory model would be enabled only for processes that need it (i.e., Rosetta-translated ones). So a cpu-flag is set/cleared when entering/exiting user mode for those processes. Does this require a separate/special cache coherency protocol?

The benefits you'd get from going to a weaker memory model are by not having that extra coherency in the critical path in the first place. Adding extra muxes in front of it to make it optional would be worse than just having it on all the time.

> A complete L1d flush when entering/leaving these processes (across all CPUs)?

That wouldn't help because two threads could be running at the same time on different cores against their respective L1 and store buffers.


> The benefits you'd get from going to a weaker memory model are by not having that extra coherency in the critical path in the first place. Adding extra muxes in front of it to make it optional would be worse than just having it on all the time.

Indeed true, good point.

> That wouldn't help because two threads could be running at the same time on different cores against their respective L1 and store buffers.

Of course, this was related to the cost of switching coherency protocol during context switch. But as you say the overhead of just making it switchable is prohibitive in itself.


It can be enabled per instruction.

Atomic instructions (and ARMv8.1 added a bunch of new atomic read-modify-write instructions that line up nicely with x86) use the new cache coherency protocol, while the older non-atomic instructions keep the relaxed memory model.

Though, I'm not sure if it's worth it to keep two concurrency protocols around. I wouldn't be surprised if the non-atomic instructions get an undocumented improvement to their memory model.


Elsewhere in the documentation[0] Apple explicitly calls out that code that relies on x86 memory ordering will need to be modified to contain explicit barriers. All sensible code will do this already.

[0] https://developer.apple.com/documentation/apple_silicon/addr...
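
For anyone wondering what that looks like in practice: a made-up example of the kind of code the doc is warning about (mine, not from Apple's page), and the portable fix:

  #include <stdatomic.h>

  int data;

  // Only appears to work thanks to x86's TSO (and is a data race even there):
  int ready_plain;
  void publish_tso_only(void) { data = 42; ready_plain = 1; }

  // Portable version: the ordering is stated explicitly, so a translator or
  // an ARM rebuild has something to map to stlr/ldar.
  atomic_int ready;
  void publish(void) {
      data = 42;
      atomic_store_explicit(&ready, 1, memory_order_release);
  }
  int consume(void) {
      while (!atomic_load_explicit(&ready, memory_order_acquire)) { /* spin */ }
      return data;
  }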


Sure, for source translation that's sorta fine (although I wouldn't want to be the one debugging it). The issue is binary translation: there is really no such thing as an acquire or release barrier on x86.


Does anyone have any insight into the business logistics involved in a transition like this?

I presume Apple has done maintenance on transitioning desktop OS X to ARM as an option for a very long time. How many years ago would they have had to decide that was the direction they were going to make it reality? 2015? How many people would have been working on it? How many billions of dollars? How does the cost of developing the software compare to the cost of developing a new processor, or compared to tooling a production line for an entirely new platform?

I'm really curious about the share of institutional resources and attention this likely involved. I wonder how important it is as context to Apple over the past few years.

I also wonder if it heralds a new direction in personal computers. Every time I've considered that my Mac isn't that great, the alternatives aren't that great either. Would Apple ever decide to seriously compete for a dominant share of the personal computer market?

And finally, I am also curious about the accounting and financial disclosures required in such decisions. How much are institutional investors informed of long range plans? There's naturally a lot more pressure to distribute cash to shareholders if you don't know about some five-year-plan, yet that pressure waxes and wanes from activist investors, and institutional investors like CalPERS always seemed to be on board with Apple's direction anyway. Do unofficial leaks prevent violations of insider trading rules?


While I have talked about the financials of an all-in ARM Mac platform, that was from the angle of each individual component. But you could also look at it from a company R&D perspective.

Example: Apple ships ~10M MacBook Airs a year. Current spending on the MacBook's Intel chip is roughly $200 per unit. As long as Apple can come in below $200, they make more profit or can lower the price.

That is a per-unit, per-model angle. I.e., Apple could easily have fitted an A12Z inside the MacBook Air and saved at least $150 per unit.

Or you could look at it from a company total perspective.

Apple currently spends $2B annually in R&D (a made-up number) on its chip design. How much more would be needed to design a Mac CPU as well? $500M / $1B? Great, go do it. (If you follow Apple closely, Apple's R&D has stayed at a fairly fixed percentage of its net profits for a long time.)

>And finally, I am also curious about the accounting and financial disclosures required in such decisions.

There is nothing that needs to be disclosed until these decisions have a material effect on its stock performance.

It is worth pointing out Apple receives $10B annually from Google for having it as the default search engine in Safari alone. While it is expensive to design its own chips, comparatively speaking it is still peanuts for Apple.

At the rate things are going, Apple could be spending more money on Apple TV+ originals than on R&D for the Mac CPU alone.


Apple already supports ARM so they have the whole toolchain, most of their operating system, and many frameworks ported. Most of the heavy lifting was done in Project Purple (aka the first iPhone).

Porting from PPC to x86 was arguably more difficult as a lot more of the framework code had endianness assumptions baked in.


> Porting from PPC to x86 was arguably more difficult as a lot more of the framework code had endianness assumptions baked in.

Did it? The code they inherited from NeXT didn't assume endianness, and until Rhapsody Developer Release 2 they shipped x86 builds. Some applications like iTunes ran on Windows, with a partial port of the Cocoa libraries underlying them (this, too, was a descendant of old NeXT support for Windows).

It's entirely plausible that Apple didn't have to do much; as they said at the time, they never stopped supporting x86 after Rhapsody Developer Release 2, and they had the OS running on x86 the entire time. I'd assume from the point of view of most developers working on OS X they simply had requirements to avoid endian assumptions, etc., and occasionally the team maintaining x86 support would send a nondescript patch to "fix the endianness assumption".


As Steve Jobs himself said in the famous Mac Intel keynote, Mac OS X ran on Intel from day one, "just in case". After all, Darwin has always supported x86, and NEXTSTEP ran fine on x86 too, so migrating to x86 was probably not that difficult from a core-system standpoint.


> How many billions of dollars?

Rounding-error-sized marginal costs.

Even if you don't plan on pivoting architectures, it's good practice to write as much code as possible without assembly/intrinsics. If you're writing standards-compliant C/C++/Objective-C/Swift/* it's not a huge burden to change architectures. They also have past-experience with such transitions, so hopefully they kept the "keep the code portable" mantra close over the past 10+ years.

The choice to do this seems relatively obvious for Apple. They have spent years building up their silicon game with portables. Their frustration with Intel is also palpable, so it was probably "just" a waiting game (not "how" or "if") in deciding when to pull the trigger and start using those chips outside of iPhones/iPads.


If you have to maintain any highly optimized assembly/intrinsics code, it’s a very good idea to also maintain an “as portable as feasible” C version for comparison testing.


Probably a big opportunity cost though. It surely takes a lot of time from senior engineers to port the software, figure out this whole translation system, etc...


The story was that it took two engineers to keep OS X compatible with Intel when they were on PPC.

It took only a small company - Connectix - to make a 68K emulator that worked perfectly with PPC/Classic MacOS and was better than Apple's own.


> I am also really curious about the accounting and financial disclosures required in such decisions.

There are none, beyond the dollar figures set aside for R&D.

> How much are institutional investors informed of long range plans?

In the case of Apple, very little to not at all. Other companies have different policies - most share roadmaps and long range plans rather freely compared to Apple.


Institutional investors can’t know more than retail outside investors. That would be illegal. Why would they need to make any public disclosures? They don’t publish margins per segment.


I've been using Macs at work for a long time, but sadly this will mark the end of that era.

In particular, this limitation on Rosetta rules out an ARM-based Mac for work:

> Virtual Machine apps that virtualize x86_64 computer platforms

My job requires me to use a piece of proprietary Windows-only software for a large portion of my work. If I can't use this software I can't do my job. Currently I run it in VMware Fusion on an Intel Mac, which is a perfect solution for me - I get the great features of MacOS, plus I can run the proprietary toolset that my job requires.

There is a very remote possibility that Windows for ARM could be virtualized by some future version of VMware and the proprietary toolset could run under that, but I'm not holding my breath.

Due to budget constraints, I don't think there's any way that my work would spring for a MacBook Pro plus a Windows machine for me.

On the flip side, Windows 10 seems to be getting really good, so I expect that I'll be just as happy and productive with Windows 10+WSL2.


My understanding is that that line only refers to VM applications written and compiled for x86, and doesn't stop a VirtualBox (or any other VM provider) from compiling an ARM binary that doesn't require Rosetta.

I don't think this is a policy restriction so much as there not being support for the Intel virtualisation technologies that such apps might rely on.


Think this is accurate. Any modern x86 virtualization software depends on hypervisor assistance from Intel/AMD to run at reasonable speeds, and that's just virtualizing x86 on x86--more like a context shift between sandboxes. Full virtualization through a CPU emulator to boot isn't very likely to be practical anytime soon.

I suspect they'll treat it as a policy restriction, though, so they wouldn't have to deal with emulator compatibility going forward. Any solution that would be even minimally able to support virtualizing x86 would probably be ugly to keep running. If VMW or Parallels gets something going, my guess is they'll have to work through Apple to get an exception.


That line just means you can’t run an emulator on top of an emulator, which makes sense. Performance would suck.

What will happen instead is that companies like VMware will release emulators that run natively on Apple chips. (Rather than trying to run on top of Rosetta.) However, Intel Windows performance might still suck in those. Remains to be seen.


> companies like VMware will release emulators that run natively on Apple chips

Exactly. This has happened before e.g. Microsoft's VirtualPC [0].

> However, Intel Windows performance might still suck in those

Bingo. I remember how agonisingly slow VirtualPC was for anything remotely useful (forget gaming).

[0] https://web.archive.org/web/20071031005603/http://www.micros...


Maybe with this growing interest in ARM desktops someone could consider continuing that project from a few years ago of integrating QEMU and WINE. Some old x86 Win32 apps would probably run fine under it, like they do on Microsoft's emulator.


Tim Cook says new Intel Macs are coming out this year. If you can convince your company to buy you one, you should be good until 2030.

Not being able to do your job would be enough for even my craptastic company to buy me a new machine.


2030 is a stretch. Apple has generally supported new Mac hardware for 5-7 years, and my guess is the timeline is only going to get shorter for Intel-based Macs going forward, considering their resources are going to be disproportionately allocated to the newer versions.


First Intel Macs were released in January 2006, and OS X dropped support for PowerPC in August 2009 with 10.6 Snow Leopard. If the same timeline is followed here, Intel support will be dropped around 2024. I wouldn't expect to get 10 years out of an Intel Mac.


It will be at least several years before you will not be able to purchase a new Intel-based Mac.


From another article, Apple claims that the transition will be complete in two years. I have 2.5 years left before I will be eligible to have my work computer replaced.

Depending on the exact timing of Apple's product cycles and the procurement cycle, I'll either be getting one of the last Intel Macs or a Windows machine of some description.

If an Intel Mac is still available at this time, that brings other concerns - in particular, Macs are on a five-year replacement cycle at my work, so even if they are available it may not be prudent to buy one. Will it get new MacOS versions for its 5-year lifetime? Will it still get security updates five years down the road? Will the developers of the applications I use still ship x86_64 versions of their Mac applications?

Because of all of that uncertainty, it may end up being a better option to just get a Windows machine when I'm eligible. Fortunately, a lot of that uncertainty should be cleared up by the time that happens.

EDIT: Another remote possibility is that VMware figures out some way to fill in the gaps that Rosetta doesn't. After all, they did figure out how to do virtualization on a platform that didn't even support it.


I think this may largely be a non-issue by the time it rolls around, as while software emulation of x86_64 might prove tricky, there's one thing that runs x86_64 really well. Actual x86_64 chips.

This is what they used way back in the day. Apple (and others) actually made x86 (non-64) add-on cards for PPC Macs that were effectively PCs on a card to slap in a Mac.

Would be relatively trivial to do the same now. PC on a PCIe card, maybe expose itself to the host as just a high speed network card.

Use a Ryzen embedded chip, SO-DIMMS, job's a good-un.

All the major software stuff would already work. Headless diskless iSCSI PC.

Bonus feature that it also acts as a hardware dongle if you want some magic sauce MacOS software to make the process easier.


> There is a very remote possibility that Windows for ARM could be virtualized by some future version of VMware and the proprietary toolset could run under that, but I'm not holding my breath.

How is that remote? We've already seen Parallels Desktop virtualize Linux for ARM.


As far as I can tell, you can't buy Windows for ARM, you can only get a license by buying specific hardware.

Aside from that, most people who want Windows support on a Mac want support for x86 Windows, not the ARM version. So even if it does run, it's likely to be a hobbyist/ hacker thing, not a product.


Windows for ARM has a built-in emulator for x86 binaries, so I think most users trying to run random legacy programs would be satisfied with it. (Gamers, not so much…)


Sounds like there will be Intel Macs for at least a few years coming so it's likely you'll have some time before you have to worry about it. By the time Apple quits shipping Intel Macs it may be a non-issue.


> Currently I run it in VMware Fusion

That’s how you’d do it on an ARM Mac too.

Rosetta is about executing Mac apps on ARM Macs. For Windows apps you’re still going to go to a third party, most likely the same third-parties as before.


It will be very slow to emulate a Windows x86 kernel on ARM.


I have a very similar situation: I loved using MacBook Pros as my work computer for over 10 years now, but while my job consists of 80% of cross-platform Java development stuff, it also includes working on and compiling various custom binaries (from big fat Chromium builds down to little system-level libraries and Linux kernel modules). All three major OSes are targets: MacOS, but most importantly Windows and Linux - especially Linux, because that's the main production environment on which our end products usually run, while the others are mostly needed to enable our devs to develop on their preferred platforms.

Windows and Linux MUST be x86_64, because that stuff needs to run on that arch eventually - it's of no use if it "works on my (ARM) machine". The Mac platform has, until now, been the ideal platform for my work, because it comes with by far the best OS on which I can be as productive as possible, while at the same time being able to execute all three targets within performant VM environments, greatly simplifying all the low-level development tasks as well as testing and experimentation. Even USB hardware could transparently be routed into the VMs - something that's also very important for me, as I often need to deal with weird and rather unusual peripherals, and which is highly problematic when working on remotely hosted VMs, which I did try for a while back then when MacBooks had that RAM cap at 16GB in order to be able to utilize more memory.

I'm still good with my current MacBook Pro for about two years or so, which gets me well into the transition period. I know that I'll definitely be tempted to privately buy a replacement for an old 2011 MacBook Air, which I still use for light browsing tasks and stuff at home and which will be totally fine if it's ARM-based. If anyone happens to come around with some kind of VM emulation solution for actually virtualizing x86_64 OSes on the ARM-based Macs - there at least seems to be a demand, so I haven't lost hope that one of the likes of Parallels or VMware might take up that task - I'm going to give that a try and see whether it's a solution to my remaining 20%.

If that doesn't work out, I'll have to consider either switching to ARM Mac regardless and keeping the old machine (or another PC laptop or something) around as a second machine for the low-level Windows/Linux work, or ditching the Mac platform and moving to a Linux-based machine as my main computer, with Windows in a VM, and maybe getting a small ARM Mac as a "sidekick" for that bit of MacOS compilation and testing work. Either of those will be worse than my current, nearly-perfect setup, which was originally enabled only by the beauty of ISA compatibility between the Mac and PC platforms.


Rosetta's goal is to support legacy Mac apps. It's quite finely scoped, but I hope Apple will go beyond this and make it available as part of their virtualization framework. That way e.g. Parallels and Docker could use it to tap into what is probably the fastest way to run x86 on ARM Macs.

This would mean Rosetta would go beyond its stated scope, and certainly go beyond what the relatively short-lived Rosetta 1 did, but would make it easier to integrate ARM Macs in what is still an x86 world.

I'm worried they're not going to do this because x86 emulation is probably too slow (people will attempt to run their AAA games in a VM). It would also mean that Apple will need to support Rosetta 2 forever. If they're not going to do this, everybody will have to rely on qemu. Qemu is great, but I hope it will perform adequately for at least basic Docker stuff...


I think the longer Rosetta exists and macOS thereby supports a wider variety of binary executables, the more people will rely on it to deliver them day-to-day functionality in increasingly complex configurations, and the more Apple will expose itself to those users' criticisms.

If Apple's rationale for the transition is to gain further control over their product design and manufacture, the prospect of having to appease folks who won't give up their old software (like me!) but who come to expect indefinite Rosetta-type backwards compatibility doesn't sound all that appealing from Apple's perspective. We're talking about a company who denied Carbon support in the 64-bit transition while some of their largest software developers still clung to it.

Furthermore, the old software they're temporarily supporting through Rosetta represents the agreements and policies Apple followed with third party developers during earlier periods in their history, which I'm sure they want to get away from. In their eyes, the sooner they can ditch those policies and impose iOS-type restrictions on any and all macOS software development, the better.

This is probably my last Mac, after thirty years. It's been fun, watching them rise in power, but they've had to become a very different kind of company to get to where they are.


Your first two paragraphs track, but I think the second two are quite a leap.

Sure it’s possible, but I think it’s exceptionally unlikely that Apple will try to lock down the Mac the way the iPhone is.

The reason is they think of them as fundamentally different products with fundamentally different purposes, and for the most part the trend line has been moving to more opening of the iOS ecosystem rather than more lockdown of the Mac.

(For instance, in the last few iOS releases they’ve added file system support and the ability to remove default apps, and in this new release they’re allowing users to set alternative default apps, etc.)

Apple views the iPhone as a pure consumer device where the most important thing is that everything “just works.” It’s locked down in the same way a video game console is locked down, and for the same reasons: they gain a lot of “just works” benefits from keeping things very standardized. And most consumers LIKE that. Most consumers hate that their PC (or even Mac) often fails them in “mysterious” ways that require tech support or just living with a broken experience.

Apple talks about iOS as a “car” and Mac as a “truck.” They see Mac as their development platform, and literally called that out in their keynote.

There is certainly a desire on Apple’s part to make the Mac also an excellent consumer product, hence things like the existence of the App Store and app signing, but on the Mac there have always been “escape hatches” for those who know what they’re doing.

Now, the part that Apple does want to lock down and is able to lock down even more with this move is the Hardware. They’ve never made Hackintosh easy and have never done anything to indicate support for tinkering with their hardware, to the point that when the pro community began to bail they brought back a highly “configurable” Mac Pro (configurable with official parts but not really tinkerable).

But anyway, I don’t think Mac OS is going to get locked down substantially more than it already is. It’s a nice OS with some trade offs, if those trade offs have worked for you in the past they’re likely to keep being fine, and if not then you probably wouldn’t be a Mac user anyway.


It’s already happening. macOS Bug Sur (Apple’s wording) enforces a read only system partition, which cannot be easily disabled or mounted as rw.


Can you not go into recovery mode and turn off all the system integrity features like you always could?


There is no way to boot with an R/W system partition anymore, nor can you mount it as such. You must reboot, make changes while in recovery, commit them, and then restart again.


Yes.


No.


FWIW they showed the latest Tomb Raider game running on Rosetta on a dev Mac said to be using the ARM chip from the high-end iPad. It ran pretty well. Not sure if I would quite count this as AAA.


IDK, what I saw was a several year old game running on the lowest settings and still stuttering a bit.


The more interesting point...

>The demo wasn't perfect: the game ran at 1080p with fairly middling settings, but did so at what appeared to be at the very least a steady 30FPS, all while being run as an emulated x86 version of the game. That is: the game wasn't even natively compiled for ARM.

But consider that Intel's most powerful laptop chipset GPU, found in the 10th generation Ice Lake series, is not capable of breaking single digit frame rates on this game at 1080p.

Apple received some snark about this demo being lame, but it's only lame if you don't understand at all just how terrible modern integrated laptop GPUs are.

https://www.androidpolice.com/2020/06/22/apples-chipset-adva...


We all know that Intel iGPUs are terrible (at least until the new Xe based GPUs start to come out).

The comparison I don't see enough people making is against the one company that regularly ships semi-custom, secure x86 SoCs with GPUs: AMD. An Xbox One SoC is literally half of the transistor count of an A12Z (5B and 10B respectively).

And the frame rate was anything but steady there. And calling it "middling settings" is pushing it as well; the env lighting wasn't even on Lara. It looks like they dropped the settings as far down as they would go.


Yet the marketing push for Ice Lake was that even if the CPU cores lost most of their IPC gains to lower clock speeds, the GPU itself was to be a massive improvement.

I'm looking forward to seeing if the demos were on the Dev Kit hardware, because if that was the performance on a specially binned two year old iPad Pro chip, it's pretty darn impressive.


It's Tiger Lake that's supposed to see the passable GPU, when they use the same cores as their new discrete GPU.

Not that I'm really defending Intel here, they fell behind the curve.


Jam tomorrow, but never jam today?

With Intel, I've gotten to the point where I'll only believe their marketing spiel after the hardware ships and the claims are confirmed by third party testing.

However, they did indeed market Ice Lake as a huge GPU improvement.


Lol, like I said, Intel is obviously out of the picture. I'm not really defending them, just being as charitable to their side as I can for the sake of argument. But yes, their GPUs are perennially terrible, and their fab side has too many issues these days to rely on being able to ship a leading edge process node at scale.

I'm just on the side that a semicustom AMD APU would have made way more sense for the stated reasons of the arch switch, as you would have been able to have all the perf/watt and raw GPU perf improvements (if not more) without switching arch from under your users. And the Xbox One is an incredibly secure chip that'll be the first major game console I know of not to be hacked to allow unsigned code during its lifetime, and that's while running games in kernel mode.

I think the real reason for the switch is the fact that Catalyst runs iOS apps unmodified on macOS/aarch64. When you look at their 10-K, that's where they make their real money, and I suspect they had a hard conversation about the Mac App Store not meeting expectations; this is how they intend to get Macs on that same gravy train.

That has a lot of implications for their incentive structure going forward that don't look great for people who want a general-purpose computer, à la Cory Doctorow's "The Coming War on General Computation".


I think the real reason for the switch is the same as the switch away from PPC to Intel.

Higher performance per watt and less waste heat.

AMD going from an also-ran to highly competitive is certainly impressive. However, it's not "more than double the performance of the competition for less than half the power budget" level impressive.

>In the face-off against a Cortex-A55 implementation such as on the Snapdragon 855, the new Thunder cores represent a 2.5-3x performance lead while at the same time using less than half the energy.

https://www.anandtech.com/show/14892/the-apple-iphone-11-pro...

However, that's in the mobile space. Time will tell how well Apple does competing in the laptop/desktop arena.


> Higher performance per watt and less waste heat.

The thing is, I don't really see that in their designs. You're not going to get "more than double the performance of the competition for less than half the power budget" even when you compare to Intel, and I doubt that you're going to beat AMD.

And yes, when you compare to other ARM cores the Thunder does well, but that's not its competition in the laptop space. Intel and AMD are, and a Thunder core does not do well against a Zen 2 core.


Thunder is an ARM "little" core implementation. It's not meant to compete with a full performance core from anyone. It competes against other "little" cores.

I'm sure when Intel starts shipping their chips that mix Core and Atom, someone will start comparison testing.

As far as performance per watt and less waste heat, you don't even have to wait to see what their 5nm desktop/laptop chips look like, their iPhone chips with a tiny power budget and passive cooling are already competitive with Intel and AMD.

>Last year I’ve noted that the A12 was margins off the best desktop CPU cores. This year, the A13 has essentially matched best that AMD and Intel have to offer – in SPECint2006 at least. In SPECfp2006 the A13 is still roughly 15% behind.

https://www.anandtech.com/show/14892/the-apple-iphone-11-pro...


> Apple received some snark about this demo being lame, but it's only lame if you don't understand at all just how terrible modern integrated laptop GPUs are.

We do know that. But we also know that the high-end laptop market is not all about iGPUs. Entry- and mid-level consumer and business laptops, maybe, but if you buy a $1,500+ Dell laptop, you're not getting an iGPU only.

So not exactly the best comparison. MBPs _start_ at $1,299.


https://youtu.be/GEZhD3J89ZE?t=6070

It doesn't stutter whatsoever (you've claimed this in multiple posts, and I think you need to look at your playback device because the game is buttery smooth). And for integrated graphics of a binary translated x86-64 game running on ARM @ 1080p, that is pretty amazing.

No one is going to claim that the GPU on the A12Z beats a dedicated GPU, even a low-end one. But as someone whose kid has an iMac with Iris Pro integrated graphics, this looks a world better than what Intel integrates, not even accounting for the whole binary translation thing.

And clearly the higher end Macs with Apple silicon will still have dedicated GPUs as well. Why wouldn't they?


An A12Z's GPU is around double an Xbox One's in gate count. On a GPU limited title like that it should be doing way better than this.

Also it still appears to be stuttering, and opening other videos doesn't show the same stutter.


You're really veering all over the place to dismiss this (which is funny given that it was on the A12Z, and we know production machines will be much more powerful).

Now this tiny (less than 1/3rd the die size of the Xbox One SoC...hell, 1/4) SoC needs to game better than an Xbox One whose SoC draws 4X+ the power. Also note that the A12Z has a 5TFlop+ neural engine (in addition to everything else), dedicated video encoders and decoders (separate from the GPUs), and loads of other hardware that consumes transistor counts.

And for that matter, the gameplay looks easily comparable to walkthroughs of this same game on an Xbox One.


Die size doesn't matter here, because they're on radically different process nodes; it's gate count that matters to compare apples to apples (or Apple to AMD, I guess, lol).

Xbox One has 5B, an A12Z has 10B.

And when you look at the layout, you can see that in each it's about a third of the space for the GPU cores, with each having very large L3-like banks of SRAM before the memory controller.


I don't think that was running on Rosetta 2's dynamic recompiler, but instead used Rosetta 2's other stated feature of being able to translate existing apps at install time. In the demo they state that they downloaded the game from the Mac App store.

Certainly static recompilation yields somewhat-close-to-native performance when it is available, but that is not possible for x86 "VMs".


> It ran pretty well

It was a postproduced, recorded promotional video, not a live demo. They could have shown laptops reaching Mars and you’d have thought “that landing gear looks pretty solid”.


I think Rosetta is mostly about userspace, and a hypervisor is definitely not something that sticks to that - modern VMs run on the bare metal thanks to CPU acceleration and dedicated instructions (i.e. VT-x and AMD-V). Getting an app to run on a foreign architecture is different from emulating a whole machine in order to get an unmodified OS kernel to run. You can do it this very day with QEMU; yesterday, just for fun, I tried to boot Windows XP on my Raspberry Pi 4 (it boots, but it's horrendously slow, if you're interested). Maybe with time and effort an emulator can get to a satisfactory level of speed (probably never enough to run Windows games on emulated Windows, though, but who knows).


AAA games spend lots of time inside system libraries. Porting those to native could be enough to get acceptable performance.

So, chess could be more of a challenge for emulation than AAA games (but probably less of an issue, as it would be easily ported, and have fewer users, anyway).


The page we're discussing here actually says you can't mix arm64 and amd64 code in a single process, if your binary is amd64 all your libraries will be too. I'd be surprised if they put in the work to generate thunks for all their system libraries and then didn't make it possible to use any others. More likely macOS will just have two copies of all the (non-iOS) libraries and the only point they work together is the kernel since that's a nice clear barrier.


If I read it correctly, it says it prevents you from doing that, “including all code modules that the process loads dynamically”, but I don’t think that means Rosetta won’t call arm64 libraries for libraries linked at program startup.

Think about it: Rosetta translates your code to arm64, and your code calls a function in a system library. Why would Rosetta use the x64 version of the library and translate its code to arm64 if an arm64 version is available? there may be ABI differences that aren’t trivial to correct for, but for expensive calls, I would think that’s dwarfed by the gains of running native code.

For that matter, how is this going to run the Accelerate framework API at decent speed, given the emulator doesn’t emulate AVX, AVX2, or AVX512?


I imagine the answer is going to be that it doesn't. Stuff using Accelerate will probably see the biggest performance hit because it _doesn't_ use the arm64 version and the emulator translates those instructions poorly. However, since those are the applications hit the hardest by this switch they'll (hopefully) be some of the first recompiled for the new CPUs as well.


It would be extremely bizarre if x86 programs are linking against arm libraries.


Apple has said that a Rosetta process is entirely x86_64 client code, including shared libraries, but I suppose that might be a "what's good for the goose isn't good for the gander" style white lie.


They would link against, say, OpenGL. The emulator wouldn’t emulate the x86 OpenGL, but call ARM OpenGL.

If Apple wants to, they could even support selected third-party libraries that way.


I thought this was the most interesting paragraph.

> What Can't Be Translated?

> Rosetta can translate most Intel-based apps, including apps that contain just-in-time (JIT) compilers.

I guess translation of JIT-compiled stuff implies this isn't a one-off translation. I guess translating plugins implies that too.

It sounds like very clever stuff to me!

> However, Rosetta doesn’t translate the following executables:
>
> Kernel extensions

Fair enough

> Virtual Machine apps that virtualize x86_64 computer platforms

I guess most VMs rely on hardware virtualization which would be tricky to translate well.

> Rosetta translates all x86_64 instructions, but it doesn’t support the execution of some newer instruction sets and processor features, such as AVX, AVX2, and AVX512 vector instructions. If you include these newer instructions in your code, execute them only after verifying that they are available. For example, to determine if AVX512 vector instructions are available, use the sysctlbyname function to check the hw.optional.avx512f attribute.

These sound like they should be relatively straightforward to translate. I wonder why they didn't? Lack of time? Or perhaps because translating them means they don't run fast enough to be useful, and the fallback paths are likely to run quicker.
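For anyone curious, the runtime check Apple describes would look roughly like this in C (a minimal sketch; treating a missing sysctl key as "unsupported" is my own assumption about how to handle the error case):

    #include <stdio.h>
    #include <sys/sysctl.h>

    /* Returns 1 if hw.optional.avx512f reports AVX-512F support, 0 otherwise
       (including when the key doesn't exist at all, as under Rosetta). */
    static int has_avx512f(void) {
        int value = 0;
        size_t size = sizeof(value);
        if (sysctlbyname("hw.optional.avx512f", &value, &size, NULL, 0) != 0)
            return 0;  /* key missing: treat as unsupported */
        return value == 1;
    }

    int main(void) {
        printf("AVX-512F available: %s\n", has_avx512f() ? "yes" : "no");
        return 0;
    }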


> These sound like they should be relatively straightforward to translate. I wonder why they didn't? Lack of time? Or perhaps because translating them means they don't run fast enough to be useful, and the fallback paths are likely to run quicker.

Probably because ARM's NEON doesn't have 256-bit registers, greatly complicating the implementation.


I would guess that's right - AVX only came onto the scene in 2011, so most code will be able to fall back on an SSE implementation if AVX is unavailable, and presumably an AVX emulation that is a bad fit for the hardware is actually slower than the SSE emulation.

A quick search suggests a couple of people already running into this issue with the older generation Mac Pros that also don't support AVX but it doesn't seem that widespread.


NEON doesn't, but SVE/SVE2 does: https://developer.arm.com/tools-and-software/server-and-hpc/...

Apple might not be using this in their first ARM Mac chips, but I wouldn't be surprised to see them use it in future chips.


RE AVX and above: My guess is that they are licensing the Transmeta IP which doesn't include those licenses IIRC. So unless they wanted to get sued they'd have to be careful. I suspect some SSE instructions may also not be supported as I don't recall if Transmeta had a license up to SSE4.


The original Rosetta was based on technology from a company called Transitive which was acquired by IBM a couple years after Apple's PowerPC to Intel transition. [1] It's not immediately obvious what technology licenses are involved with this new incarnation of Rosetta.

[1] https://en.wikipedia.org/wiki/QuickTransit


I doubt they're licencing Transmeta. Transmeta never had an x86_64 processor.


True, but they did have the license IIRC. Otherwise they'd be licensing from VIA or whomever owns that license now. That license AFAIK would include up to SSE4. It's not like Apple is going to tell us. But the JIT nature makes me think it's Transmeta as the company currently holding that IP is a "Patent enforcement entity".


Or they're just giving Intel the finger, banking on having a warchest of defensive patents from acquiring nearly every promising fabless CPU startup of the past 15 years.

Edit: Also, I don't think Transmeta's license would include SSE4. They were defunct at that point.


Possible, I kinda doubt they would as Intel could just cut them off at the knees as they threatened MS with for the Surface Pro X. There is the remote possibility they are licensing from Intel too. At the end of the day I doubt they are going to tell us. Once people start getting the dev machines they can find the edges of the implementation. I suspect it has some quirks we'll discover over time. Regardless this is speculation.


> I kinda doubt they would as Intel could just cut them off at the knees

And then Apple would cut Intel off at the knees in return. Microsoft doesn't really have any CPU design patents, Apple has thousands, so they'd be treated differently.


Transmeta died what, 20 years ago? All their patents have now expired.


Wikipedia says defunct in 2009, though the decline started earlier.


> I wonder why they didn't?

It could come down to licensing issues. The ISA extensions for AVX/AVX2/AVX512 are under different licensing and to my knowledge, Intel/AMD are the only legal license holders (and Intel would be the only holder were it not for the arrangement between Intel and AMD formed due to x86_64)


Having big flashbacks to the switch from PPC to x86 here. Rosetta worked relatively smoothly during that transition so fingers crossed it will be ok here too.

Though with Docker support on the Mac already being a second-class citizen compared to running on Linux, I wonder if a lot of devs will stop using Macs for dev.


Highly unlikely Apple would cede the software engineering market for a competitor to step in. Almost everyone in my company would prefer to have a MacBook vs a Thinkpad/Dell/etc if given the choice.


> Highly unlikely Apple would cede the software engineering market for a competitor to step in.

If you're not specifically developing for iOS or macOS, Apple doesn't care about you. They'll take your money if you want to write Unix or web software, but they'll drop support in a heartbeat if that makes it easier to support their mobile and consumer segments.


Speaking solely for myself, I do not intend to go with this experiment after using Macs for 16 years and change. I have found that with a few rough edges, Windows 10 and WSL fulfills my dev needs, and except for the Apple apps Logic and Final Cut Pro, virtually all my software is cross platform. Besides, I can use Ableton and Adobe Premiere which I also already "own."


> I do not intend to go with this experiment

I’m sure they lost people when they switched off of PPC too, but just like last time, Apple is playing the long game. Intel just can’t keep up anymore.


Intel can't (maybe). AMD can. In fact right now, AMD doesn't even have to keep up, because it's substantially out in front.


AMD has worse thermals than Intel does, which is likely the reason Apple is switching in the first place.


What % of your devs are running Windows vs Linux on those?


People weren't running PowerPC VMs (besides Classic which was dropped in 10.5) so they didn't notice the loss of that functionality. Many people are running x86 VMs today which won't work on ARM.


>However, Rosetta doesn’t translate the following executables: Virtual Machine apps that virtualize x86_64 computer platforms

How will this impact docker? Does this mean you can't run x86 docker containers on new Apple laptops?


Docker on ARM will work, Docker for x86 will not. The State of the Union showed a demo of Hypervisor.framework with Parallels, and they made it clear that Debian AArch64 was running (uname -a). Since Docker runs inside a VM on the Mac, it'll have to be an ARM VM with ARM containers.

(Presumably, running docker build with your Dockerfile will make it work just fine, unless you need x86 specific libraries).


Could you run the Linux Docker machine on aarch64 and just the Docker container/process on JIT-ed x86? The majority of syscalls don't care about architecture, do they? And the OS should theoretically be agnostic to the machine code the application is written in. It would probably require some cooperation on the part of the Linux Docker host. Did I get the idea across?

You would go to hell for implementing that either way


You should be able to, since everything Docker runs inside of a Linux VM. You don't get to use Rosetta though (since it's a macOS thing), and the performance may be poor.


Apple mentioned they are specifically "working with Docker to support 'these things' in coming months". Confirmed with some Docker folks they are working on "something". All very nondescript, but they said they can't talk about it yet.


So I had to look this up: Docker does not rely on VT-x. If the kernel has cgroups, it runs.

Hypothetically it could run natively on Windows NT, Linux, Darwin, Solaris, etc. if it were ported to the appropriate APIs. However, most Docker containers contain x86_64 Linux binaries, so the runtime environment has to be binary-compatible with x86_64 Linux.

Thus, instead of going through pain similar to WSL1's, the Docker people just spin up a midsize Linux VM, run a remote-controlled Docker daemon in it, and call that "Docker for [PLATFORM]". This suffices because the use case of Docker is to debug on a laptop and deploy to GNU/Linux servers.

So if ARM macOS runs an x86_64 Ubuntu VM it also means Docker runs.


Docker on Mac does rely on virtualizing a full Linux kernel, and xnu doesn't have containerization APIs to do anything different.

> So if ARM macOS runs an x86_64 Ubuntu VM it also means Docker runs.

So far we haven't seen anything like that happening. The State of the Union talk yesterday went as far as running a uname -a just to show that it was an aarch64 debian they were running.


Possibly they'll do something like WSL/WSL2 to provide an ARM based Linux kernel interface and then use Rosetta2 to translate the x64 code in user space to ARM?

Having ARM only containers would be mostly useless (it's technically possible, but I bet there are very few ARM container images around).


There are heaps and heaps of ARM containers around thanks to the Raspberry Pi.

I've never had a problem running Docker on my Pi, and I don't expect anyone will have a problem running Docker on a Mac given Apple's much much larger developer market share.


All the base images are multi arch, which I suspect covers the majority of uses in dev envs.


My understanding is that the virtualization Apple provides is only for the same architecture as the host OS. In the demos given running Debian, they run uname -a and it reports aarch64


Docker has had an experimental feature for some time to build containers cross architecture [1]. I'm guessing this transition is a good excuse to finish that up. Running cross architecture containers with only Docker is not possible as far as I know.

I'm guessing that we're going to have to see a lot better adoption of cross-platform container builds because of this.

[1]: https://docs.docker.com/buildx/working-with-buildx/#build-mu...


Not sure about the details for how this works, but they specifically mentioned Docker to be supported at the keynote.


Docker already runs fine on ARM64, like on Raspberry Pis, so there would be no Rosetta translation needed for ARM64 images. If Rosetta could translate binaries in x86/AMD64 images, that would be very helpful indeed.

Likewise, will it support CLI programs at all?


x86 docker or just docker? There's ARM docker containers as well.


They haven’t clarified if Docker is ARM all the way down or if they are using QEMU to emulate x86 somewhere on the stack.


They wouldn't need qemu anywhere either way; they have their own binary translator.


They've specifically called out that OS level virtualization isn't available in Rosetta 2. So if there is an x86 dependency QEMU becomes necessary.


There's qemu-user for user mode only emulation in addition to qemu-system.


Which for all we know could be qemu ;)


I mean, they have a better emulator that they bought from the QuickTransit folk.


I would assume you’d only be able to run ARM Docker containers.


Which is less than ideal if you're using x86 servers since you no longer have the same dev containers as production containers.


I don't think people should consider the vm Docker For Mac uses anything close to resembling production as is.


Yeah. I predict a sudden increase of ARM servers as well.


And maybe even some decent ARM Linux laptops. One may dream...


Do Amazon have an ARM server product?



... and it's actually kinda kick-ass - the original A1 instances were fairly disappointing but the M6g is at an extremely aggressive price/performance point.


ಠ_ಠ when they refer to x86_64 as "Intel instructions": https://en.wikipedia.org/wiki/X86-64#AMD64


I don't think Apple has sold any AMD Macs, correct?

I understand your consternation, though. Dev docs shouldn't require dumbing things down.


From the page:

What Can't Be Translated?

Rosetta can translate most Intel-based apps, including apps that contain just-in-time (JIT) compilers. However, Rosetta doesn’t translate the following executables:

* Kernel extensions

* Virtual Machine apps that virtualize x86_64 computer platforms

Some people are not going to be happy about this.

Edit: But I personally am okay with that.


Apple wants kexts gone. They announced the deprecation last year, and this is another chance to force the change. (And for good reasons, if my layman's understanding of the security is at all correct.)

https://developer.apple.com/support/kernel-extensions/



Well color me surprised. I figured that’d be on the chopping block.


I was surprised too. I wonder which crazy peripheral (or maybe it was antivirus/intrusion detection?) made them not go all in on "only apple gets to be in kernel space".


Judging by this horrible Bitdefender Endpoint Security that IT has foisted upon me: device management and anti-malware stuff.


Maybe Dropbox?


Like... it's not ideal, maybe, but kernel extensions are already a little touchy so I'm not surprised that they at least need to recompile, and VMs likewise are unfortunate but they can just rebuild and be good (i.e. I'm betting less than a month before people are running around using qemu-system-x86_64 natively built for ARM again).


Meh, kernel extensions are on life support at this point.


The ability to switch a universal binary between running x86 and ARM versions so as to support non-updated x86 plugins is really cool.


The Apple transition to ARM for me is sad.

There was so much positivity around the Intel transition. It opened up the Mac platform. Now it’s going back into a closed black box.


I'm optimistic, and while the constant references to ARM as "Apple Silicon" make it sound proprietary, I see it as an example of Apple following Jobs' goal of "skating to where the puck will be".

Microsoft is toying with the same transition with Windows 10 ARM builds and the Surface Pro X, and yesterday's news about an ARM-based supercomputer taking the TOP500 crown is another sign that ARM is the future. It will take years, but I foresee most "general computing" devices making the transition eventually.


Can you install something other than Windows 10 on a Surface Pro X? I think not. There may be hackers who can crack these things, but that is not an ideal solution. There will also be problems with drivers, I'm sure of that. The same fate awaits the Mac.


You can install Linux on an SPX [1]

It sucks, but that's less on Microsoft and more on a combination of Linux having bad AArch64 support, and Linux having bad tablet support.

[1] https://twitter.com/Sonicadvance1/status/1192289419572563968


I think I'm missing something. How is ARM closing things up? Does Apple have a bunch of processor extensions that make it incompatible with other ARM processors? Or do you think they'll take the chance to kill boot-camp or something?


The “ARM world” is almost inconceivably varied by x86 standards. There’s not even an agreed-upon boot procedure (the Raspberry Pi’s CPU famously gets booted by the GPU). Efforts to standardise this in the server space are only just getting off the ground, with the Arm Server Base Boot Requirements version 1.2 being ratified last year. I doubt you’ll find Apple subscribing to anything they didn’t develop in-house.

This is, after all, the company that developed the Lightning connector and sticks to it on its iPhones despite the whole industry shifting to USB-C.

Basically the OS situation on Macs post-ARM is probably going to look like the Linux situation on iDevices currently: occasionally somebody manages to text-mode-boot an iPhone 7 into a profoundly unstable hacked-up kernel and there is much merriment and rejoicing, but really... macs are going to become rather exotic (on the inside).

Have you ever wondered about your iPad’s boot sequence? Because you’ve never had to think about it. Notice how you can’t install versions of i(Pad)OS that are no longer signed? You think Apple will allow booting unsigned images? And all the Secure Enclave infrastructure they’ve built up?

Nah.


> This is, after all, the company that developed the Lightning connector and sticks to it on its iPhones despite the whole industry shifting to USB-C.

There is a rumor that Apple developed USB-C and donated it to the USB-IF [1]. Apple was also one of the first users of USB-C when they introduced the 12" MacBook in 2015.

Since then they've continuously grown their adoption of USB-C, first in the MacBook Pro in 2016, then in the iMac and iMac Pro in 2017, the iPad Pro, MacBook Air and Mac Mini in 2018 and most recently in the Mac Pro in 2019. In fact nowadays the only Apple devices still not featuring USB-C are the iPhone and the (non-Pro) iPads.

There are probably good reasons why Apple hasn't replaced Lightning with USB-C for the iPhone and iPad yet. If they replace Lightning with USB-C now, the whole 3rd-party accessory ecosystem has to adapt to that. While that's probably fine, if the rumors are true that Apple plans to remove all physical ports from iPhones [2] next year, that would result in USB-C-equipped iPhones/iPads only being sold for 2-3 years before there is another interface again. So Apple's thinking is probably that it's more customer-friendly to stick with Lightning until they replace it with a wireless solution, and I'd agree with such an assessment.

[1]: https://www.macrumors.com/2015/03/13/apple-invents-usb-c/

[2]: https://www.macrumors.com/2019/12/05/kuo-iphone-without-ligh...


Regarding USB-C I agree with you entirely and think you entirely missed the point I was trying to make.

I am well aware Apple participated in the USB-C connector specification processes and only introduced the Lightning port when it became clear to it that USB-C would take longer to define than their urgency to ditch the old iPod 30-pin connector would allow, forcing them to ‘defect’. They’re probably rather sour about this deep down (since it forces them to have different “one size fits all” interfaces on different lines of products, which kind of undermines its own point).

At any rate, the thought I was trying to express is that I would be very surprised if Apple adopted any kind of ‘standard’ ARM architecture upon which to begin its new era of ARM-based Macs. They’re not even calling them ARM-based; they’re using the term “Apple Silicon” (which is only a marketing gimmick, but it’s pretty indicative).

So... they might, during the course of their preliminary analyses and feasibility studies, have identified a pre-existing ARM system architecture that they are comfortable with, but most likely they’ll go for something unashamedly homegrown, such as whatever architecture they developed for the iPhone and iPad, and consequently whatever boot procedure those follow, which as far as I know is undocumented and subject to change (though it is the regular target of reverse engineering, because of jailbreaking... and that game of cat-and-mouse has probably greatly enhanced code quality and reduced exposed surface area).

The end result is that “from the inside”, at the very low hardware level that matters when booting an OS (and which the OS subsequently screens userland and users from), these details matter, and I don’t expect them to be standard, I don’t expect them to be clear, and I expect that unsigned OS images won’t be bootable.

That said I read that Apple has had a session at the WWDC 2020 about booting other OSes from external drives on the ARM Developer Transition system, so who knows, I could be wrong.


The CPU booted by GPU is actually very similar to the way that the T2 boots the Intel CPU in existing Macs, providing signature assurance on the microcode, ME firmware, and boot image before the CPU even loads.


In all fairness, the Lightning connector has been around a lot longer than USB-C.


Beyond that, you have to acknowledge that Apple contributed heavily to USB-C. They are invested, but had to weigh the plusses and minuses of transitioning the iPhone. My guess is that it was something like this:

Plusses:

- everyone shares the same connectors

Minuses:

- all the old peripherals out there now don't work on the new phones

- there are lots of debug systems for phones built into the Lightning ports that Apple uses internally; those are all gone

- there are lots of junk chargers out there, and no real inspection scheme for USB-C, whereas Apple runs an inspection scheme already for Lightning

Since there is a large enough market for Apple products, Apple is not worried about the small bonus of everyone sharing the same connectors. Given that, it seems pretty obvious that until there is something to push them off Lightning (e.g. power requirements, speed of connection, etc.), they will stay there.


They'll almost certainly only allow their signed kernels to run.

And that's in addition to not documenting any of the peripherals.


Apple already has a program for select third party companies to get their own signing keys for kernel extensions. They have been tightening that circle over the last few years as they provide more and more generic interfaces so that companies can replace KEXTs with user space code. From what they said in the State of the Union keynote they want to get that down to nothing, but realize they can't do that yet. No details were provided, but there might be more in the KEXT sessions if you are interested.


I think Apple has more room to maneuver on ARM, since the PC architecture is in general open to modification, unlike ARM. They will do it the same way as with iOS devices - a closed bootloader and hardware protection.


Everything will now be digitally signed; look at the iPhone jailbreak scene to see how hard it is to customize your own hardware to run the software you want.


What's old is new again.

This reminds me of Digital's VEST technology. VEST would convert VAX programs to run on Alpha. From 32-bit CISC to 64-bit RISC. Nearly 30 years ago.

https://web.stanford.edu/class/cs343/resources/binary-transl...


Or the PET emulator from 38 years ago that Commodore released because it wasn't sure that the C-64 would have any software.

Ditto for the C-64 CP/M cartridge of the same vintage.


The PET emulator was an interesting case. It relied on the fact that BASIC 2.0 on the C64 was similar enough to that on the PET (and IEC serial devices similar enough to IEEE-488 ones, from the BASIC side, at least) that many programs ran unmodified. The interesting bits were adding some goop so that many common POKEs and PEEKs from BASIC would "just work" as well (like CB audio, screen memory, etc.). This was even enough for some machine language programs, though many choked on the different memory map.


>Rosetta translates all x86_64 instructions, but it doesn’t support the execution of some newer instruction sets and processor features, such as AVX, AVX2, and AVX512 vector instructions.

No AVX will probably mean that the vast majority of pro/graphics intensive apps won't work out of the box with Rosetta.


Do those programs blindly try to execute AVX instructions without checking whether they’re supported? What happens if you're just running on an older intel processor?


All well-behaving software will of course check if AVX is supported before trying to use it, but at least in some cases that just means they'll refuse to install or run on systems that don't have it: https://www.pro-tools-expert.com/production-expert-1/2019/9/... (Developing parallel code paths and keeping them working indefinitely is a lot of extra work, and AVX has been around for close to a decade at this point.)


Most heavy math libraries that use AVX do function multiversioning: when the function is called, it does a CPUID check and then picks a code path appropriate for the CPU. Intel's MKL gets flak for this because non-Intel CPUs always fall back to the least optimal path even if they support a more efficient one.

I think you'd be hard pressed to find programs using AVX that didn't do some sort of feature check. While AVX debuted with Sandy Bridge in 2011, Intel's Celeron/Pentium branded CPUs don't support those instructions (or didn't for a long time).
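To make that dispatch pattern concrete, here's a minimal C sketch using GCC/Clang's __builtin_cpu_supports (x86-only; the kernel names and bodies are made up for illustration - a real library would use intrinsics or separately compiled units for the fast path):

    #include <stddef.h>
    #include <stdio.h>

    /* Baseline implementation; always safe to run. */
    static float sum_scalar(const float *a, size_t n) {
        float s = 0;
        for (size_t i = 0; i < n; i++) s += a[i];
        return s;
    }

    /* Stand-in for an AVX2-optimized kernel. */
    static float sum_avx2(const float *a, size_t n) {
        return sum_scalar(a, n);
    }

    typedef float (*sum_fn)(const float *, size_t);

    /* Pick an implementation once, based on what the CPU actually reports. */
    static sum_fn resolve_sum(void) {
        if (__builtin_cpu_supports("avx2"))  /* wraps a CPUID check */
            return sum_avx2;
        return sum_scalar;
    }

    int main(void) {
        float data[4] = {1, 2, 3, 4};
        printf("%f\n", (double)resolve_sum()(data, 4));
        return 0;
    }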


Usually you'd want to check cpuid or similar and fall back to something reasonable.


Apple seems to suggest that all well behaving programs:

>use the sysctlbyname function to check the hw.optional.avx512f attribute

So my guess is they probably just replace the AVX instructions with no-ops and let it blow up if the correct checks/fallbacks aren't in place.


It would be a lot more normal to trap the illegal instruction - the same thing that happens if you give a CPU a nonsense, nonexistent instruction.


Or just SIGILL rather than noop.


The “cheese grater” 2010/2012 Mac Pro didn’t support AVX either, but supported 10.14. Apps have only been able to depend on Macs having AVX for a year, but of course they should still check at runtime.


Are the limitations on x86_64 virtualization likely to be for technical reasons, or patent reasons? I read a comment on here alluding to some patents on x86_64 virtualization expiring later this year: https://news.ycombinator.com/item?id=23612256 - could that mean that there is a chance this might happen and they are keeping it quiet for now, or are patents likely unrelated?


Those are two different concerns. Rosetta isn't emulating x86_64 kernel mode. Microsoft isn't emulating x86_64 at all.

I imagine Apple has enough patents that Intel infringes on to be able to cross licence.


Microsoft is working on it.


Apple spent $1B buying Intel's modem business last year. https://www.apple.com/newsroom/2019/07/apple-to-acquire-the-... Apple knew this transition was coming. They could have easily slipped in other terms to deal with any IP licensing issues around implementing the x86_64 instruction set.


Except AMD owns x86_64 https://en.wikipedia.org/wiki/X86-64#Licensing

Rosetta doesn't translate i386 btw.


x86_64 is obviously derived from the original x86 and is a rat's nest of who owns what. The SSE3/4 extensions (which are supported by Rosetta 2) are obviously in Intel's camp.


If I had to guess it's because Rosetta isn't fully emulating an x86_64 instruction set but rather dynamically translating API calls for the most part.


I think it's neither of those things. The x86_64 machine code instructions are translated into ARM instructions, so there's no emulation and the API calls don't need to change.


> The x86_64 machine code instructions are translated into ARM instructions, so there's no emulation

That's what emulation is.


I think emulation would be interpreting the instructions one by one and doing what they say in software (a bit like Python's bytecode interpreter). Rosetta translates the binary to another (now native) binary, and then runs it like normal, and much of the time this will be a one-off, ahead-of-time translation.


Recompilation is a valid emulation strategy. You're just amortizing client instruction decode by doing it ahead of time.

Last time I checked, that's how RPCS3 worked for instance, since PS3's GameOS didn't generally allow JITs anyway.


is the translation 100% static? for example if I do something like:

    function_pointer = blackbox_function_translator_does_not_understand();
    function_pointer();
then how do we end up calling the correct address in arm64 land? there are no type tags in assembly to distinguish between integers and addresses. in QEMU my understanding is translation is done dynamically so addresses would float around in memory as x86_64 addresses and then when you tried to `call` them it would look up a mapping table. In QEMU I suspect they also try and optimise this case similar to a JIT using inline caching so most of the time you wouldn't actually hit the mapping table.

But if you are not dynamically converting x86_64 addresses to arm64 addresses then you need to understand what all the addresses in the program are and understand all the manipulations that might be performed on those addresses. now, you shouldn't actually be doing weird manipulations of addresses to functions in memory but if you are running obfuscated code this often happens.

I think in QEMU this would work assuming (myfun+4)() does something intelligent:

    #include <stdint.h>
    #include <stdio.h>

    uint64_t add(uint64_t v) {
      return v + 4;
    }

    /* defined so the example links; the call below deliberately jumps
       4 bytes past its entry point, which is the point being made */
    void myfun() {}

    int main() {
      void (*x)() = &myfun;
      x = (void (*)())add((uint64_t)x);  /* cast the integer back to a function pointer */
      x();
      printf("%llu\n", (unsigned long long)add(4));
    }
if you are holding function addresses as arm64 addresses in memory then you need to dispatch add() based on whether the argument is an integer or an address.


I wonder how/if it supports unaligned jumps which x86 supports IIRC. The consequence of unaligned jumps is that it can effectively make it impossible to know the set of instructions a binary might use.


So how fast will this be compared to native? As I understand, it's not "emulation" but something more low level?


Did you ever use PPC apps on Intel through Rosetta? It was a substantial slowdown, like at least 2x. But the app was usable.


Translating JITted code is a pretty cool trick.


You'll probably also enjoy Nvidia's Denver architecture[1] (used in the Tegra processors), which JITs ARM code into its own internal instruction set inside the processor.

[1] https://en.wikipedia.org/wiki/Project_Denver


Historically that work is similar to Transmeta's cores, and IBM's DAISY before that for anyone wanting to dig in further.

Someone reverse engineered large parts of the Transmeta crusoe here:

https://www.realworldtech.com/crusoe-intro/

https://www.realworldtech.com/crusoe-exposed/

DAISY was open sourced at one point, but I haven't been able to find a mirror of it. I wish companies would strive to keep their URLs stable. : \


At runtime it doesn't really matter where your code is coming from, as long as you do flushes at the right spots.


But doesn't this mean that they are translating and running the JIT and the JIT is creating new executable x86 code which then also needs to be translated? I'm a bit baffled as to how any of this can work properly, for example if my JIT is sampling instructions to determine when to optimize, does Rosetta need to reverse-translate the actual instruction pointer back to the original x86 code offset?


> But doesn't this mean that they are translating and running the JIT and the JIT is creating new executable x86 code which then also needs to be translated?

Yup.

> I'm a bit baffled as to how any of this can work properly, for example if my JIT is sampling instructions to determine when to optimize, does Rosetta need to reverse-translate the actual instruction pointer back to the original x86 code offset?

Well, not exactly. Let's take JS as an example. First of all, you're only sampling calls that are still in the native language, so your instruction pointer is in the emulated virtual machine itself anyway, which is agnostic to the CPU architecture. See [1] for where SpiderMonkey samples.

What happens then is that the engine sees a function being called many times, so the JIT compiler decides to compile a function; it compiles the function and writes the compiled function to memory, then overwrites that function's address with a native function pointer.

The easiest hook the OS has to this process is mprotect(2). In modern OSes, memory is generally either writable or executable, but not both; so if you want to compile and write, and then execute, you need to call mprotect() between those steps to set permissions.

Knowing that your binary is Rosetta'd, MacOS can assume your JIT code is also AMD64 and translate that page on the fly. All they have to do is make sure the virtual memory addresses are correct - to not mess up those function pointers.

They can either do it lazily, through a pagefault handler, or greedily when mprotect is called to add executable permissions to the JIT-allocated memory.

[1] https://developer.mozilla.org/en-US/docs/Mozilla/Projects/Sp...
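For a concrete picture of that hook, a JIT's emit-then-execute sequence looks roughly like this (a minimal POSIX-style sketch with hand-assembled x86-64 bytes; macOS has additional requirements for JIT pages, and this is not Apple's actual mechanism, just the shape of the opportunity):

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void) {
        /* x86-64 for "mov eax, 42; ret" - the JIT's freshly generated code. */
        const unsigned char code[] = {0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3};

        unsigned char *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (page == MAP_FAILED) return 1;

        memcpy(page, code, sizeof(code));   /* "compile" into the RW page */

        /* The visible moment the page becomes executable: a translation layer
           could recompile its contents here (or lazily, on the first fault). */
        if (mprotect(page, 4096, PROT_READ | PROT_EXEC) != 0) return 1;

        int (*fn)(void) = (int (*)(void))page;
        printf("%d\n", fn());               /* prints 42 */
        return 0;
    }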


> In modern OSes, memory is generally either writable or executable, but not both; so if you want to compile and write, and then execute, you need to call mprotect() between those steps to set permissions.

mprotect is slow. High-performance JIT compilers on macOS usually map memory as RWX or mirror-map the same page as RW- and R-X from two virtual addresses.


The hard part seems to be metadata, or information about the program rather than the program itself. Rosetta will translate the program but it doesn't know about the metadata, so you have a referential integrity problem, seemingly. For example what is going to happen if a program running under Rosetta uses setitimer to deliver SIGPROF? What's in si_addr?


You'll still have the address of the fault in si_addr. Addresses are addresses, both in AMD64 and ARM64. What's the difference?


Maybe I'm just imagining things, but if I have a map from PC to function or line and Rosetta rewrites all my functions then I have a problem unless the Rosetta output is by some miracle exactly the same length as the input.


They can keep a translation table around for fixing up what were the original instruction addresses. I've done that for writing emulators before. It ends up looking like DWARF.
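Purely as an illustration, such a table could look like this (the struct and names are made up; a real translator keeps finer-grained, per-instruction mappings, which is where the DWARF-like feel comes from):

    #include <stddef.h>
    #include <stdint.h>

    /* One entry per translated block of code. */
    typedef struct {
        uint64_t arm_start, arm_end;  /* range holding the translated code */
        uint64_t x86_start;           /* original x86_64 address of the block */
    } block_map_t;

    /* Given a PC inside the translated code, recover the original block's
       x86_64 address. Entries must be sorted by arm_start. */
    uint64_t original_block_for(const block_map_t *map, size_t n, uint64_t pc) {
        size_t lo = 0, hi = n;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (pc < map[mid].arm_start)      hi = mid;
            else if (pc >= map[mid].arm_end)  lo = mid + 1;
            else return map[mid].x86_start;
        }
        return 0;  /* pc is not in translated code */
    }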


So does this mean no windows VMs?


Yes, which is an interesting problem for Parallels and VMWare. I would imagine the key requirement for most of their Apple customers is to run Windows.

On the other hand it might be an opportunity to offer Windows in the cloud to Apple users.



