I remember Apple had a totally different but equally clever solution back in the days of the 68K-to-PowerPC migration. The 68K had 16-bit instruction words, usually with some 16-bit arguments. The emulator’s core loop would read the next instruction and branch directly into a big block of 64K x 8 bytes of PPC code. So each 68K instruction got 2 dedicated PPC instructions, typically one to set up a register and one to branch to common code.
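A rough sketch in C of that dispatch structure, assuming a function-pointer table standing in for the real 64K-entry block of 8-byte PPC stubs; the names here (m68k_state, dispatch, and so on) are illustrative, not Apple's:

```c
#include <stdint.h>

/* Illustrative only: the real emulator branched straight into a 64K-entry
 * table of 8-byte PPC stubs; a portable C analogue indexes a 64K-entry
 * table of handlers by the 16-bit opcode word. */
typedef struct {
    uint32_t d[8], a[8];     /* 68K data and address registers */
    uint32_t pc;             /* guest program counter (byte address) */
    const uint16_t *code;    /* guest instruction stream */
} m68k_state;

typedef void (*op_stub)(m68k_state *s, uint16_t opcode);

/* One entry per possible 16-bit opcode word; filled in at startup. */
static op_stub dispatch[65536];

static void run(m68k_state *s)
{
    for (;;) {
        uint16_t op = s->code[s->pc >> 1];  /* fetch the next instruction word */
        s->pc += 2;
        dispatch[op](s, op);                /* jump straight to its dedicated stub */
    }
}
```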
What that solution and Rosetta 2 have in common is that they’re super pragmatic - fast to start up, with fairly regular and predictable performance across most workloads, even if the theoretical peak speed is much lower than a cutting-edge JIT.
Anyone know how they implemented PPC-to-x86 translation?
A lesser-known bit of trivia about this is that IBM would go on to use Transitive's technology for the exact opposite of Rosetta -- x86-to-PowerPC translation, in the form of "PowerVM Lx86", released that year (2008).
It's very fascinating to me, since IBM appears to have extended the PowerPC spec with this application specifically in mind. Up until POWER10, the Power/PowerPC ISA specified an optional feature called "SAO" (Strong Access Ordering), allowing individual pages of memory to be forced to use an x86-style strong memory model, comparable to the proprietary extension in Apple's CPUs, but much more granular (page level and enforced in L1/L2 cache, as opposed to the entire core).
As far as I can tell, Transitive's technology was the only application to ever use this feature, though it's mainlined in the Linux kernel and documented in the mprotect(2) man page. IBM ditched the extension for POWER10, which makes sense, since Lx86 only ever worked on big-endian releases of RHEL and SLES, which are long out of support now.
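For reference, marking a page SAO from userspace is just an extra protection bit passed to mprotect(2). A minimal sketch, assuming a pre-POWER10 Linux/powerpc system; the fallback PROT_SAO definition mirrors the powerpc uapi headers and is not portable:

```c
#include <sys/mman.h>
#include <stdio.h>
#include <unistd.h>

#ifndef PROT_SAO
#define PROT_SAO 0x10   /* powerpc-only; value taken from the arch/powerpc uapi headers */
#endif

int main(void)
{
    long pagesize = sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, pagesize, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* Ask the kernel to enforce x86-style strong ordering on just this page. */
    if (mprotect(p, pagesize, PROT_READ | PROT_WRITE | PROT_SAO) != 0) {
        perror("mprotect(PROT_SAO)");   /* fails on hardware/kernels without SAO support */
        return 1;
    }
    puts("page marked SAO");
    munmap(p, pagesize);
    return 0;
}
```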
One mystery to me, though, is that IBM added support for this page marker to the new radix-style MMU in POWER9. It's documented in the CPU manual, but Linux has no code to use it -- unless I've missed it, Linux only has code to set the appropriate bits in HPT mode, and no reference to the new method the manual describes for marking radix pages SAO. I can't imagine there was any application on AIX which used this mode (it only decreases performance), and unless you backported a modern kernel to a RHEL 5 userland, you couldn't use Lx86 with the new radix mode. Much strangeness...
In theory, PROT_SAO should be useful for qemu, and it would be trivial to write patches implementing it there. That's assuming the kernel actually sets it, though. The problem I encountered when I set out to do it a year or so ago was that I couldn't find a good test case that would fail without it...
I used box64 as a test case: I had a game that would run in emulation, but only if I pinned it to a single core. On ARM64 it also worked, because box64's JIT translator inserts memory fences by hand to force strongly ordered accesses.
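Not box64's actual code, but a minimal sketch of that fence technique: wrap every guest memory access in explicit barriers so a weakly ordered host preserves x86-style load/store ordering (on ARM64 these fences compile down to dmb instructions):

```c
#include <stdint.h>

/* Sketch only: an x86 emulator on a weakly ordered host can recover x86's
 * stronger ordering by fencing each guest access. */
static inline uint32_t guest_load32(const uint32_t *addr)
{
    uint32_t v = __atomic_load_n(addr, __ATOMIC_RELAXED);
    __atomic_thread_fence(__ATOMIC_ACQUIRE);  /* later accesses cannot move before this load */
    return v;
}

static inline void guest_store32(uint32_t *addr, uint32_t v)
{
    __atomic_thread_fence(__ATOMIC_RELEASE);  /* earlier accesses cannot move after this store */
    __atomic_store_n(addr, v, __ATOMIC_RELAXED);
}
```

Store-to-load reordering is still permitted here, which matches x86's TSO model, so no full barrier is needed after each store.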
The game never worked correctly, even after I patched the kernel to mark every page on the system as SAO, and confirmed this worked by checking the set memory flags. This might be a mistake in my understanding of what SAO should do, though. (or another failure in box64 on ppc64le)
One thought I've had recently is perhaps it's like the recently discovered tagged memory extension and only worked in big endian? There's nothing in the docs to suggest this, but since the only test case was BE-only, maybe?
Their customers are not enterprise, and consequently they are probably the best company in the world at dictating well-managed, reasonable shifts in customer behavior at scale.
So they likely had no need for Rosetta as of 2009.
Right, thanks for correcting my faulty memory on the timing.
It is possible that IBM tried to squeeze Apple, but given that IBM's interest in Transitive was for enterprise server migration, I suspect it is more likely that Apple got tired of paying whatever small royalty they'd contracted for with Transitive, and decided enough people had fully migrated to native x86 apps that they wouldn't alienate too many customers.
>"The last version of OS X that supported Rosetta shipped in 2009."
Interesting, so was Rosetta 2 written from the ground up then? Did Apple manage to hire any of the former Transitive engineers after IBM acquired them? It seems like this would be a niche group of engineers that worked in this area, no?
>"The last version of OS X that supported Rosetta shipped in 2009."
Interesting, so was Rosetta 2 written from the ground up then? Did Apple manage to hire any of the former Transitive engineers after IBM acquired them? It seems like this would be a niche group of engineers that worked in this area no?
I agree it was a bit worryingly short-lived. However, the first version of Mac OS X that shipped without Rosetta 1 support was 10.7 Lion in summer 2011 (and many people avoided it since it was problematic). So nearly-modern Mac OS X with Rosetta support was realistic for a while longer.
> However the first version of Mac OS X that shipped without Rosetta 1 support was 10.7 Lion
Yes, but I was pointing out when the last version of OS X that did support Rosetta shipped.
I have no concrete evidence that Apple dropped Rosetta because IBM wanted to alter the terms of the deal after they bought Transitive, but I've always found that timing interesting.
In comparison, the emulator used during the 68k to PPC transition was never removed from Classic MacOS, so the change stood out.
I agree. And I suppose since it was so intrinsic to the operating system, if a 68k app worked in Mac OS 9 (some would, some might not), you could continue to run it in the Classic Environment (on a PPC Mac, not an Intel Mac) under Mac OS X 10.4 Tiger in the mid-2000s!
I guess that's perjury, because it can't quite be true! Snow Leopard didn't even include Rosetta 1 by default. But if it was deemed necessary, it would be downloaded and installed on demand, similar to how the Java runtime worked.
>Open Firmware Forth Code may be compiled into FCode, a bytecode which is independent of instruction set architecture. A PCI card may include a program, compiled to FCode, which runs on any Open Firmware system. In this way, it can provide boot-time diagnostics, configuration code, and device drivers. FCode is also very compact, so that a disk driver may require only one or two kilobytes. Therefore, many of the same I/O cards can be used on Sun systems and Macintoshes that used Open Firmware. FCode implements ANS Forth and a subset of the Open Firmware library.
From what I understand, they purchased a piece of software that already existed to translate PPC to x86 in some form or another and iterated on it. I believe the software may already have been called ‘Rosetta’.
My memory is very hazy, though. While I experienced this transition firsthand and was an early Intel adopter, that’s about all I can remember about Rosetta or where it came from.
I remember before Adobe had released the Universal Binary CS3 that running Photoshop on my Intel Mac was a total nightmare. :( I learned to not be an early adopter from that whole debacle.
Assuming you're talking about PPC-to-x86, it was certainly usable, though noticeably slower. Heck, I used to play Tron 2.0 that way; the frame rate suffered, but it was still quite playable.
Interactive 68K programs were usually fast, since the 68K programs would still call native PPC QuickDraw code. It was processor-intensive code that was slow, especially with the first-generation 68K emulator.
Most of the Toolbox was still running emulated 68k code in early Power Mac systems. A few bits of performance-critical code (like QuickDraw, iirc) were translated, but most things weren't.