A secret Apple Silicon extension to accommodate an Intel 8080 artifact (bytecellar.com)
269 points by ecliptik on Nov 17, 2022 | 140 comments



The article describes how Apple included support for the x86 parity flag which comes from the 8080. Parity is relatively expensive to compute, requiring XOR of all the bits, so it's not an obvious thing to include in a processor. So why did early Intel processors have it? The reason is older than the 8080.

The Datapoint 2200 was a programmable computer terminal announced in 1970 with an 8-bit serial processor implemented in TTL chips. Because it was used as a terminal, they included parity for ASCII communication. Because it was a serial processor, it was little-endian, starting with the lowest bit. The makers talked to Intel and Texas Instruments to see if the board of TTL chips could be replaced with a single-chip processor. Both manufacturers cloned the existing Datapoint architecture. Texas Instruments produced the TMX 1795 microprocessor chip and slightly later, Intel produced the 8008 chip. Datapoint rejected both chips and stayed with TTL, which was considerably faster. (A good decision in the short term but very bad in the long term.) Texas Instruments couldn't find another buyer for the TMX 1795 so it vanished into obscurity. Intel, however, decided to sell the 8008 as a general-purpose processor, changing computing forever.

Intel improved the 8008 to create the 8080. Intel planned to change the world with the 32-bit iAPX 432 processor which implemented object-oriented programming and garbage collection in hardware. However, the 432 was delayed, so they introduced the 8086 as a temporary stop-gap, a 16-bit chip that supported translated 8080 assembly code. Necessarily, the 8086 included the parity flag and little endian order for compatibility. Of course, the 8086 was hugely popular and the iAPX 432 was a failure. The 8086 led to the x86 architecture that is so popular today.

So that's the history of why x86 has a parity bit and little-endian order, features that don't make a lot of sense now but completely made sense for the Datapoint 2200. Essentially, Apple is putting features into their processor for compatibility with a terminal from 1971.


Thanks, I wasn't aware of the iAPX 432! Funny that history repeated itself almost exactly 20 years later, when Intel introduced the 64 bit Itanium architecture, but eventually had to copy AMD's 64 bit extension of the x86 architecture because x86 just refused to die. And today, 40 years later, it's still around - although maybe Apple's efforts will finally manage to put a stake through its shriveled heart...


Heh, refused to die seems like a weird way to state it.

Intel tried to force a split in desktop/server design with a significant price premium and shipped a very complicated CPU that moved significant complexity/work into the compiler. It managed to scare Alpha, MIPS, and PA-RISC out of the market, and generally shipped years late and hit performance milestones even later.

AMD on the other hand doubled the number of x86 registers, added 64-bit support, improved performance and did so with basically zero increase in cost. So suddenly a random x86-64 machine, even a desktop, could *gasp* make full use of over 4 GB of RAM at zero cost.

So Itanium retreated from servers to HPC, then to enterprise and HA enterprise, and then of course died.


Compatibility with a terminal from 1971, yet I can't run a copy of Logic Pro 9 I bought 10 years ago - and haven't been able to for years - because Apple.


Do you find Apple's contempt for backward compatibility surprising?

I'm still displeased with macOS Catalina and iOS 11 dropping support for 32-bit apps (including many games) but Apple has a long history of old apps not running on newer OS versions and newer hardware.

Over time macOS dropped support for 68K, PowerPC, Classic, Carbon, and 32-bit apps. We'll undoubtedly get a macOS that won't run on x86, and I will be unhappy, but not surprised, if/when Apple removes Rosetta 2 and x86 app support. I can even imagine an unpleasant time when they remove the Objective-C runtime and frameworks.

I'm sure there were some Apple I users who were unhappy that their software didn't work well on an Apple II, even if the Apple II had much better hardware and software.


Parity is one of the cheapest things you can compute in hardware


Compared to what? The sign flag has zero cost since it's just the top bit of the result. The zero flag is one NOR gate to compute. You get the carry flag for free with addition. (Although processors usually throw a lot of other functionality into carry (such as shifting) so the carry flag often ends up complicated.) A signed overflow flag is confusing but one gate to implement. In comparison, parity is difficult. You'll note that other early microprocessors such as the 6800, 6502, and 68000 did not have a parity flag. Same with early computers such as the System/360 and PDP-11.

Parity is expensive to compute with 1970s hardware. If you look at the die of the TMX 1795, the parity flag computation is a substantial part of the die, about half the size of the register file or the ALU. The first problem is that XOR is an inconvenient gate to implement, especially if you use standard MOS logic. Processors usually have special pass-transistor tricks to make XOR more compact. The second problem is that parity needs to XOR all the bits together, so you don't get parallelism.


While I know you already know this, technically all of the flags have large prop. delay and nothing is for free. :D

Computing any of them costs the same mux/select delay if you want to route any of them to the output, and then on top of that you have to NOR-chain the result (sequentially or in some hierarchical manner) for zero, and wait on the adder carry chain for the carry flag. If you were XOR-chaining for parity in parallel (sequentially or in some hierarchical manner), is it really that much more delay? Especially if the XOR uses domino logic etc.

I see parity in the same realm as the adder carry chain as far as prop. delay goes (for computing sign/carry at the end). Of course carry chains can be made more efficient than a direct sequential chain, but you could be hierarchical about XOR chains for parity too.


Yes, addition is annoyingly slow due to carry propagation. But a problem with the parity flag is that it is computed on the result of your arithmetic operation, so it adds another big delay to every operation. The other problem is that it adds a lot of circuitry (by 1970s standards).


Yeah agreed -- though that's what I meant about the zero flag: NOR instead of XOR combinatorial complexity, but the same prop. delay critical path, only with simpler circuits per bit if purely NMOS or PMOS! :) I think Intel was using MOSFETs in the 1970s?

Technically parity could become a stable value before the zero flag could! ;)

Oh, but I did forget that a single multiple-input NOR could be large, though without exponential amounts of gates.


Obviously compared to any other ALU operation, except the most basic binary ops. It is just the parity of 8 bits, so even in a straightforward implementation you only have three levels of XOR; it is not completely serial, you do get to parallelize some. I do agree that it is not free on an 8080, but it is still a lot cheaper than add or sub.
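
To make the "three levels" concrete, here's a minimal C sketch of the same fold done in software (x86 convention: PF = 1 when the byte has an even number of set bits):

    #include <stdint.h>

    /* x86-style parity flag of an 8-bit result.
       Three XOR levels: 8 bits -> 4 -> 2 -> 1. */
    static inline int parity_flag(uint8_t r) {
        r ^= r >> 4;      /* fold high nibble into low nibble */
        r ^= r >> 2;      /* fold 4 bits into 2 */
        r ^= r >> 1;      /* fold 2 bits into 1 */
        return (~r) & 1;  /* bit 0 is the XOR of all 8 bits; PF is its inverse */
    }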


How is the zero flag one NOR gate? Don't you have to NOR all the bits?


One very large NOR gate. But unlike XOR, an n-bit NOR or NAND is just 2n transistors in CMOS (at least in theory).


One biiiig NOR gate.


Yes, one 8-bit wide NOR gate. This is easy to implement in NMOS, basically you have a wire the width of the ALU and attach a transistor driven by each bit to pull it low.


By itself, sure, but fidelity to x86 requires the PF calculation be made alongside a wide range of common instructions, which is a significant cost either for the microarchitecture or for an emulator.


Yes, the problem is that instructions such as CMP, which are often at the end of a basic block, will leave PF around for later use in other basic blocks (which will likely never happen, but you cannot know!).


It's interesting what this tells us about Apple's approach to the Intel to Arm transition.

They must have performed deep analysis of the requirements of Rosetta down to the level of individual instruction emulation before they finalised the design of the M1.

Also the Rosetta team had enough influence to get the hardware team to build it. Not necessarily a given in all organisations.

What would be interesting - and perhaps not surprising - is if Arm added this to the standard Arm ISA in order to facilitate x86-to-Arm translation elsewhere. After all, they added the infamous FJCVTZS instruction, which speeds up x86-derived floating-point conversions in JS.


It's useful to think of the transition as starting close to 15 years ago with Apple buying PA Semi and Intrinsity, shaping ARMv8, doing test ports of macOS, etc. Of course deep analysis too, but some pretty empirical evaluation of multiple generations of running prototypes to find the gaps that mattered (product and performance). They telegraphed the plan pretty plainly when they announced the A7 as "desktop class."


It started even further back almost 34 years ago. The first PowerPC Macs shipped with Motorola 68000 emulation in 1994, but development was being investigated back in 1988 or so. Apple has successfully made this kind of transition before, and it's entirely possible they will do so again in the future!


The first version of 68K emulation they shipped was not great. At least one third party (Connectix) sold a faster version that replaced the OS provided routines with their own.


In defense of the original emulation strategy, it shipped on time, fit in the ROM, and worked reliably with good compatibility. The first round of PowerPC Macs were so much faster than the 68k Macs that even emulated apps worked better. I don't remember the history any more but Apple did eventually do a new emulator, while Connectix made some other emulation products (PlayStation, PC) and eventually sold to Microsoft. I've read that Connectix people helped seed Microsoft's virtualization efforts.


I think that highlights the different incentives for an OS vendor and a third-party: Apple needed to be compatible but they didn’t want to be in that business long-term and focused more on getting developers to port their apps. Connectix arrived a year later and they had to deliver speed since nobody would buy it unless they did.


Absolutely. I suspect that someone has had 'develop Rosetta 2' on their job description since c2011 at the latest.


Very possible. I think of their approach as "experiment and remix," that is they test a lot of the technical and supply chain ingredients for new products years in advance, then combine at the last moment to create a widget for the public to buy. I've been watching the pieces of their VR/AR effort come together and my guess is they will do surprisingly large volume for how expensive the headsets will be. Collaboration, but done well.


> my guess is they will do surprisingly large volume for how expensive the headsets will be.

Bloomberg estimates that Apple intends to ship less than half a million headsets in its debut year. By comparison, the Meta Quest 2 sold 10 million units in its first year on sale.


Bloomberg also estimates that Apple's headset is going to cost $2000 or more, as opposed to the Quest 2 which debuted at $299. Your definition may vary, but I'd consider moving a half-million headsets at that price to be surprisingly large volume. (Also, surely anything at that price point is going to be positioned against the Quest Pro, right?)


It might turn a good hardware profit (Apple is known for sizable margins in that respect) but Apple really cares about having a wide audience. The only thing more profitable than hardware is taxing the transactions on that hardware. If Apple doesn't have a large install base, it will be hard to justify keeping the platform alive.

> Also, surely anything at that price point is going to be positioned against the Quest Pro, right?

At what, $2,000? At that point, you're competing against everything. You're competing against the price of a new-in-box Valve Index with a VR-capable computer to boot. You're competing against the price of buying everyone in your family a $300-400 headset. You're competing against the price of a PlayStation 5 with PSVR2.

This product is a suicide mission for Apple. It will come out, and it might even be great, but its value proposition is non-existent in a market that already lived through Beat Saber, VRChat and Half-Life: Alyx.


I mean, the consensus I hear among the tech talking heads is that this is basically a product for (a) developers, so they can start seriously building V/A/XR apps and (b) Apple congregants who have both the disposable income and desire to buy literally any product Apple brings to market, no matter how pricey.

(a) for apps and (b) for hype, while Apple works on bringing an actually viable consumer headset to market in 2024.

This is a pretty bold departure from Apple's normal strategy of "wait until several others have rushed to market and then blow them out of the water with a 'premium' product you claim is better than all the competition, at a higher price point than all the competition."

In part it's because companies like Pimax are already sort of doing that, and because Meta got Apple scared that Zuckerberg was going to beat them to monopolizing the VR market if they waited too long.

(We'll have to wait and see the reception to the Quest 3 to see how concerned they really should have been, because the reception for the Quest Pro has been... not great. Turns out, making a quality headset is just expensive)


Apple is more about high margins than market share though.


Less about high margins than about combined margins - they don't sell at the cheapest price points, but since they aren't splitting the money across multiple companies, their products tend to cost the same as or even less than equivalent-quality PC/Android devices, because you don't have the duplicated overhead spread across multiple other companies.


That's really interesting. Anything interesting on the processor / SoC side on their VR/AR work?


Not that I've seen, though I expect it'll be like phones where they err on the side of more compute not less. If you look around on the software side there are pieces of a headset collaboration experience all over though, even beyond the obvious stuff in ARKit. Memojis, FaceTime spatial audio, AR Spaces in the Clips app, Freeform, App Clips, Metal variable rate shading, etc. It'll be interesting to see what actually ships.


You can even see Apple developing their XR marketing in public: https://www.apple.com/augmented-reality/


Previously people pointed out that a lot of the extra M1 stuff (like TSO, x87 FP) is already standardized and/or has shipped in other ARM chips.


x87 FP hasn't been standardized or added to any ARM CPU. What was added is a mode where underflow is detected "after rounding" (as x86 SSE/AVX instructions do) rather than "before rounding" (what all ARM FP instructions default to).


This adds to my ongoing theory that "Notarization" (the process where you have to send your binaries to Apple) really has nothing to do with security.

Instead, they really just wanted a deeper selection of binaries to analyze for other various reasons.


Why is it important to keep some particular quirk of floating-point conversion in JS? JS breaks stuff all the time. Nobody would notice if some floating-point conversion changed its lowest bit.


> JS breaks stuff all the time.

JS libraries break all the time. APIs and experimental language features break sometimes.

But the low-level core fundamentals of the language, such as the way math works, have never changed in its entire existence. Even the fairly minor changes in strict mode require explicitly opting in.


The extension puts Intel-compatible parity and adjust/aux-carry flags in (architecturally zero) bits 26 and 27 of the ARM flags (NZCV) register, to avoid the ahead-of-time translator that is Rosetta 2 having to defensively compute them when it can't tell they're unused. No information on how to enable that extension.
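
For reference, these are the two flags the translator would otherwise have to derive in software. A rough C sketch of the x86 definitions for an 8-bit add (my own illustration, not Rosetta 2's actual generated code):

    #include <stdint.h>

    /* x86 definitions for r = a + b (low 8 bits):
       PF = 1 if the low byte of r has an even number of 1 bits,
       AF = carry out of bit 3 into bit 4 (the BCD half-carry). */
    static void pf_af_for_add(uint8_t a, uint8_t b, int *pf, int *af) {
        uint8_t r = a + b;
        uint8_t p = r;
        p ^= p >> 4;                   /* fold the byte down to one bit */
        p ^= p >> 2;
        p ^= p >> 1;
        *pf = (~p) & 1;                /* even number of ones -> PF = 1 */
        *af = ((a ^ b ^ r) >> 4) & 1;  /* half-carry from bit 3 to bit 4 */
    }

PF only ever looks at the low 8 bits of the result, and AF is just one carry bit, which is why they're cheap in hardware but annoying to recompute after every translated instruction.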


I've only messed around in user-space, so I'm not sure how it's enabled. But if I were to guess, it might be bit 4 ("Enable APFLG") of ACTLR_EL1 (ARM standard-not-standard): https://github.com/AsahiLinux/docs/wiki/HW:ARM-System-Regist...


It's probably a bad idea to enable this for your own use.

I would bet real money that this goes away with the generation that no longer supports Rosetta 2


Implicit phrasing in the article suggests guests cannot enable the flag at all, as it's a CPU mode change. It might be important to bare-metal OS writers, though; hopefully those teams have already reverse-engineered the method or have been kindly answered by the manufacturer.


Wow, until now I thought Microsoft were the kings of absurd dedication to backwards compatibility.

Of course credit goes to Intel here, because they're the ones that retained that 8080 behavior even in their latest x86 chips.


Microsoft is the king in another way.

Apple has made Rosetta 2, one of the most impressive software/hardware mixtures ever, and they can’t wait to get rid of it.

Microsoft will make a crappy conversion shim and keep it in their code forever. For. Ever.


> they can’t wait to get rid of it

Apple developed the 68k emulator used in PowerPC Macs in-house, and never removed it.

Apple licensed Rosetta from Transitive at a time when they didn't have tens of billions of dollars in cash lying around.

Rosetta 2 was developed in-house, but even if they were still paying a licensing fee every time they shipped a new OS version, it wouldn't even be a rounding error to their bottom line today.

Why would they be in any rush to remove Rosetta 2 at all?


Apple could not remove the 68K emulator from the Classic Mac OS, because the Classic Mac OS was never fully ported to PowerPC.


Indeed. I recall reading, in the end, that some 90+% of the code had been converted shortly before OS X became the mainstay.


10.5 dropped Classic but continued to support PowerPC. 32-bit was also dropped at some point. Rosetta 2 will be deprecated one day.


Rosetta 2 could serve as a crutch for developers that don't want to expend the time and effort to update their apps to native ARM builds. As such, those apps will perform more poorly than they would otherwise. Of course, that needs to be balanced against the potential lack of that app entirely if Rosetta 2 is removed. Looking back, Apple seems to lean towards the harsher scenario. I am guessing it will be removed at some point in the distant future.


The claim being made is that Apple will remove Rosetta 2 in the near future and that "they can't wait to get rid of it", which has no basis in reality.


> Why would they be in any rush to remove Rosetta 2 at all?

Due to the belief that any backwards compatibility comes at the cost of a higher difficulty moving forwards, I suppose.

There will probably come a time in which a new feature will collide with this compatibility layer and I have little doubt that the compatibility layer will lose.


Why is Apple cramming advertising into every first party application they have when they are insanely profitable? They'll drop Rosetta 2 as soon as they are off Intel to not have to pay for staff to maintain it.


You’re going to love the ARM64EC ABI target on Windows.


+1 to this.


I know, it's almost like the job of an operating system is to run the programs its users have, rather than just churning the innovation wheel at the expense of the users who have to replace perfectly working solutions.

This is such a BS take. The nice thing with open source is that you can go look at the commits that say "remove Windows 7 support" from Python. And pretty much anyone who isn't a child can see it's less a technical move and more a political one. Because out of a million-line codebase, those half dozen lines were causing so much grief.

Frankly, all this "we have to remove legacy ports" and "legacy code" is some kind of OCD levels of mental illness.


There is such a thing as technical debt. Apple couldn't have gotten so much performance with Rosetta 2 if it had to support 32-bit code. By dropping 32-bit support a year before the transition, they forced everyone to drop their last vestiges of old code. When the ARM transition then happened, it was smoother for everyone.

That you have to call this approach a mental illness is poor criticism.


Rosetta 2 does support 32-bit code - it’s used for running WINE.

macOS doesn’t ship a 32-bit version, because the 64-bit version on every platform is much faster and more secure, and shipping both would be twice the disk space.

(More secure because you can do so many tricks like PAC with those spare bits in every pointer.)


wine32on64 is used to translate x86 Windows APIs to x86_64, WINE translates these to x86_64 POSIX, and then Rosetta 2 translates these to aarch64 POSIX.


or maybe they could have just said, 32-bit programs will be slower...

Technical debt is when people make a mess; it's perfectly possible to clean that mess without breaking ABIs. Particularly in OSes, where the ABIs tend to be decoupled from the underlying code by abstraction layers. For example, you can swap the filesystem in use and still maintain the behavior. Linux's syscall ABI has been basically static for decades; it's only the userspace layers that don't try to adhere to that level of compatibility.

PS: We like to give Apple all this credit for having a "fast" machine, but the real question should be: if it can't solve the problem I have, does it matter how fast it is? Macs, for the vast majority of the people I see using them, are basically trendy Chromebooks, where the users spend 99% of their time in Safari. So it's probably a good choice for Apple if that is the user base they are interested in.


Can’t solve which problem?


Running old apps.


You run them in a virtual machine running an older version of the OS, the same way you do with legacy Windows apps that won't work on modern versions of Windows.


Which is overwhelmingly terrible compared with native. It seems there is a never-ending set of copy/paste, window-resizing-with-multiple-monitors, etc. problems.

And it only works when your application doesn't have HW requirements that can no longer be met. Or the company in question just doesn't provide drivers anymore.

Frankly, the driver thing is somewhat understandable, but as you point out it's easy to bolt software compatibility layers on; why the OS vendor can't manage to make it work well/transparently is a mystery.


Microsoft and Intel have been sharing that throne for a long time. Look up the A20 Gate for another detail that illustrates how dedicated Intel is (or at least was) to backwards compatibility.



Apple must know of some not-too-obscure software that did rely on those flags; curious if anyone's found what it might be.


My money (but not a lot of it) is on Microsoft Excel. Its number arithmetic is, let’s say, special.

Microsoft claims they use IEEE 754 floats (https://learn.microsoft.com/en-us/office/troubleshoot/excel/...), but they don’t completely do that.

For example, do

  A1: 0.3
  A2: 0.2
  A3: 0.1
  A4: =A1-A2-A3
  A5: =(A1-A2-A3)
  B4: =A4=0
  B5: =A5=0
A4 will show as “0”, A5 as “-2.77556E-17”, and B4 and B5 show these aren’t purely display issues. B4 shows “TRUE”, B5 shows “FALSE”.

My hunch is that they sometimes use BCD arithmetic.

More info at http://people.eecs.berkeley.edu/~wkahan/ARITH_17.pdf.
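
For comparison, plain IEEE 754 doubles give exactly the A5/B5 behaviour; it's A4/B4 that is the odd one out. A quick check in C (assuming IEEE 754 doubles, as on any current Mac or PC):

    #include <stdio.h>

    int main(void) {
        double a1 = 0.3, a2 = 0.2, a3 = 0.1;
        double r = a1 - a2 - a3;
        /* prints: r = -2.77556e-17, zero = 0  (with IEEE 754 doubles) */
        printf("r = %g, zero = %d\n", r, r == 0.0);
        return 0;
    }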


You are shaking my foundations here. When writing complicated formulas, I frequently leave in an extra set of parentheses, just because when writing the formula I wasn't sure yet whether I'd need them.

Bill - j'accuse!


Is `B5: =_A6_=0` intentional or should it be A5?


Typo. Sorry. Fixed. Thanks.


iWork/Numbers does this properly, and reports 0 / 0 / TRUE / TRUE in each case.


define "properly".

It should be false/false for the IEEE floats...


Apple has clearly said that they don't use regular floats.

https://web.archive.org/web/20190629181619/https://support.a...


Spreadsheets should probably use decimal not IEEE 754.


I don't know anything about the topic, but there may be decimal standards that are part of IEEE 754?

https://en.wikipedia.org/wiki/Decimal64_floating-point_forma...

"formally introduced in the 2008 version of IEEE 754 as well as with ISO/IEC/IEEE 60559:2011."


AF is more commonly known as the half-carry flag, used for BCD. It's uncommon for anything other than the BCD opcodes to use it, in which case a correct implementation of DAA/AAA (6 ops in total) would suffice (though the state of the previous calculation has to go somewhere).

Since flags can be loaded/stored or pushed/popped, I'd speculate that avoiding the hit of correctly handling every flags access was deemed worth it. Based on experience from a long time ago (8-bit micros, the Z80 in particular, and implementing a 6800 emulator, where DAA was more code than any other opcode): unusual opcodes and edge-case behaviours are common in optimisations and anti-debug/disassembly tricks. Games perhaps?

In any case, I find the idea of a correctly behaving, deterministic binary translation more appealing than a bunch of beartr^W on-the-fly software fixups.
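
To show why AF has to survive from the previous instruction: here's roughly what DAA does, as a simplified C sketch from my recollection of the Intel manual's pseudocode (not authoritative):

    #include <stdint.h>

    /* Decimal-adjust AL after adding two packed-BCD bytes.
       al, cf, af hold the register and flags left by the preceding ADD. */
    static void daa(uint8_t *al, int *cf, int *af) {
        uint8_t old_al = *al;
        int old_cf = *cf;
        if ((*al & 0x0F) > 9 || *af) {   /* low digit invalid, or half-carry happened */
            *al += 6;
            *af = 1;
        } else {
            *af = 0;
        }
        if (old_al > 0x99 || old_cf) {   /* high digit needs adjusting */
            *al += 0x60;
            *cf = 1;
        } else {
            *cf = 0;
        }
    }

The low-nibble fixup is driven by AF from the preceding add, so an emulator that silently drops AF gets packed-BCD arithmetic wrong.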


> Games perhaps?

It's very hard to conclude Apple cares much about breaking a bunch of games on the Mac when they cut off 80% of the Mac's gaming library a year earlier by removing 32-bit execution, presumably mostly to ease the use of Rosetta 2.

And, y'know, literally every other action Apple has taken for twenty years.


I think the issue is probably that game devs don't want to use Xcode, so they don't get the simple upgrades. If you use Xcode, there is a supported path for almost every change Apple makes.


I think it's a little bit more complicated than that. Game devs don't actually update their old games very often, if at all. 32 -> 64 is not a trivial upgrade if you have external dependencies; you might even need to pay new license fees.


It’s clear Apple dropped 32-bit to make the transition to ARM go smoothly. It also highlighted the issue with game development on Mac. Most of the time, it’s done by a subcontractor who’s paid to deliver it once. No one is there to keep the lights on and maintain development, with a few exceptions like Civ 5.


And why should they? Did the games suddenly have new bugs? Are the game companies going to get more money for basically porting to a new OS?

People here complain about software subscription models, but then also like to complain about OSes that maintain backwards compatibility. I mean, someone has to pay for engineers to spend their days keeping up with all that frequently pointless churn. Why would a game that works perfectly well with a 32-bit address space need 64 bits?

There isn't any reason for old programs to suddenly stop working because someone in the OS/library chain was too lazy to provide backwards-compatible ABIs.


I've learned more than once in my education and career to never underestimate how common (and sometimes weirdly useful) BCD is. Lots of random hardware still uses BCD for all sorts of strange timing tasks. The classic 7-segment display - one of the still somewhat ubiquitous types of hardware display (though slowly disappearing as other screen formats have gotten cheaper) - is easiest to program against in BCD.

There's a reason that BCD is in the middle of EBCDIC: IBM used BCD math so heavily across its decades of history that even its text encoding and punch cards were built around it. I've seen enterprise software that still needs to read/write and sometimes even do math in BCD for compatibility with old mainframe apps and software still written in COBOL.


It could very well be that they cannot reliably prove those flags are unused in enough cases to optimise out their (expensive) calculation.


The flags get saved and restored on task switches and interrupts. I would think deciding whether those bits actually get used across that is infeasible, and yet https://dougallj.wordpress.com/2022/11/09/why-is-rosetta-2-f... says:

“This almost entirely prevents inter-instruction optimisations. There are two known exceptions. The first is an “unused-flags” optimisation, which avoids calculating x86 flags value if they are not used before being overwritten on every path from a flag-setting instruction.”

Does this mean the emulator is buggy (if I write a loop that sets and clears flags, but doesn’t do anything with them, an interrupt handler that samples the flag values would need to see both values), but in a way that no sane code would notice?


It does technically sound like there is a gap in the emulation, yes. Though, since interrupts are normally transparent to user space software, you'd have to use something like ptrace or be messing with the signal delivery mechanism to even have a chance at noticing the gap. And you'd have to do quite obscure and absurd things even by those APIs' standards.


Rosetta 2 doesn't emulate interrupts (they're not visible to userspace components)

It does have to emulate signals, but it's moderately easy to handle those by deferring them to "safe" points where all architectural state is known.


They might know of some, but I'm of the opinion that it's worth it to support the specification. It only takes one previously undiscovered application to rely on that specified behaviour, and then you need to fix it. If it were solely a game emulator I wouldn't expect this - performance probably comes first... But for running professional software, I'd choose correctness any day.

I don't know of any applications that use AF, or the PF result in question, but I haven't looked into it. I'd maybe check some of the big professional things with a lot of legacy history, like Photoshop or Excel. As I understand it, Rosetta 2 also supports 32-bit x86 - not for native macOS applications (where 32-bit isn't supported), but to allow Wine/CrossOver to work - so a whole bunch of ancient 32-bit Windows games might be in scope too.


Yeah that seems odd.

The article says: "While almost no modern applications read these AF and PF bits"

It can't really be legacy software, as most AMD64 software on the Mac is fairly recent (2006 onward), and I suspect we're talking about software older than that for AF and PF to be in regular use. So it would also have to be something fairly important, like virtualization, or something used in image processing or compression. It seems like overkill for something that would only be a problem for an end-user application that would transition to Apple Silicon anyway.


Isn't it more likely that computing these flags is basically free in hardware? So they might as well add support for it.


Sure, but designing those flags into the hardware has to come at a cost. Maybe it's simple and cheap enough that it doesn't matter.


Could it have been a test case, "to see if we can" do these kinds of things? And then it got left in with the more important stuff.


There's this somewhat well-known sequence that uses AF - it converts a nibble (0-15) in AL into its ASCII hex digit:

    cmp al, 10      ; CF = 1 if AL < 10 (a decimal digit)
    sbb al, 69h     ; AL = AL - 69h - CF
    das             ; decimal adjust (uses AF and CF) -> '0'-'9' or 'A'-'F'


I know PF is used in the inner loop of at least one lossy compression algorithm implementation.



Couldn't Rosetta (or another x86 emulator) do some static analysis and emit the instructions for computing PF and AF only in cases where there is a chance that they will be inspected - and not emit the extra instructions if nothing reads the problematic flags before the next instruction that overwrites them? That way, in most cases it could use the regular ARM add instruction, and only in the rare cases where the analysis fails would you need the extra instructions.

Or are there some common patterns where making such analysis is difficult? Something like copying the whole state of the status register for later use (I don't remember if x86 even allows that) instead of checking flags directly. Although in that case you would need extra code anyway to reorganize the value so that it matches the structure of the x86 flags register; I assume the bit order on x86 and ARM is not identical.


Every time there's an indirect branch (including a return) there's a chance that they will be inspected. The "unused flags" optimisation does remove most of them, in a way that gets it right 100% of the time (excluding signals/interrupts inspecting state at random points, and possible bugs), but the pattern "an add or subtract or compare followed by an indirect branch or return" is still very common.

(I haven't looked at Linux Rosetta 2, other than to note the parity-flag computation coincidentally showing up in a screenshot posted to Twitter, so I'm not sure exactly how often they do the manual computation, but I'm guessing it's anywhere flags are used, in which case the unused-flags optimisation could be extended further by tracking AF and PF separately to the usual flags, which is roughly your suggestion.)

You could still remove the vast majority of the remaining computations with some heuristics that work well, but then you've gone from 100% correct to 99.9% correct, which is nice to avoid when you have the option.
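
For anyone curious what that kind of "unused flags" scan looks like in principle, here's a toy C sketch of the idea (my own illustration, ignoring partial flag writers like INC, and definitely not Rosetta 2's actual code):

    #include <stdbool.h>
    #include <stddef.h>

    /* Toy model: each decoded x86 instruction reports whether it reads
       flags, overwrites (all of) them, or can leave the translated code. */
    typedef struct {
        bool reads_flags;   /* jcc, setcc, pushf, lahf, adc, ... */
        bool writes_flags;  /* add, sub, cmp, logic ops, ... */
        bool escapes;       /* ret, indirect jmp/call */
    } Insn;

    /* Can the flags produced by insns[i] be observed later?
       If not, the translator can skip emulating PF/AF (and friends). */
    static bool flags_live_after(const Insn *insns, size_t n, size_t i) {
        for (size_t j = i + 1; j < n; j++) {
            if (insns[j].reads_flags)  return true;   /* definitely read      */
            if (insns[j].escapes)      return true;   /* unknown code follows */
            if (insns[j].writes_flags) return false;  /* overwritten: dead    */
        }
        return true;  /* fell off the end of the translated region: assume live */
    }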


I assume there is no extra cost to the extra subs/adds instructions, so there is no reason to perform this extra analysis when those are available. The author does say:

> In a VM, it isn’t able to configure the host CPU, so it can’t use this functionality. There are two other options. Either, you can skip computing the flags, because they’re mostly useless and most software won’t care. Or you can compute them the long way shown above. Rosetta 2 chooses the second option, and this mostly works out fine, because they have an “unused flags” optimisation that avoids the computation a lot of the time.

So there's an optimization pass that does what you mentioned, it's just not needed when you have the right instruction available, and the thing is probably a wee bit faster if you don't do it.


If you've got a JIT in between, you are doomed.


I understand there's probably licensing and crap and all the things that are the reason we can't have nice things, but with the very very ... very ... large amount of silicon real estate these days and the "we have so many cores we don't know what to do with them", ... why not have an x86 core or two on the chip? I bet it can share most of the caches, which are most of the silicon real estate these days anyway.

At least Intel could make some more revenue for a few years off of the Apple arm switch with some licensing.


Power, of course. The defining characteristic of these ARM chips is low power.

Throw on fully functioning x86 cores and you've defeated the entire point.

Remember, running x86 code in emulation on ARM uses less power than running the native code on Intel.


That sounds like a nightmare of complexity to get right, and I don't see why Apple would want to Frankenstein together cores when they could build something like Rosetta2 instead.


Rosetta 2 was/is far from perfect. It seemed like a coin flip, since most of the things I needed it for were "hard" apps that weren't going to be ported anytime soon, but were "harder" for Rosetta to run properly.

An actual hardware core would have the advantage of better compatibility. And you only spin it up when needed.

But it would probably slow adoption too....


> A generic Snapdragon ARM SoC, say, would deliver notably less performance in this specific scenario that is critically important to Mac users.

In some senses it's even more important for Snapdragon - Google doesn't yet ship a native Windows ARM build of Chrome. So for MacBook competitors using Windows on Snapdragon, you either have to use Edge or the emulated x86_64 Chrome.


I wonder why Chrome doesn't ship one. They can demonstrably target that CPU, since they ship a mac/arm64 build, android builds, and of course ChromeOS for arm both 32 and 64. Perhaps the market share of Windows on arm is just negligible?


Yes that's probably the most likely factor. For a while, clang didn't have all the support necessary to build chrome on win-aarch64 but that's no longer the case. Chromium builds exist already so there's probably not much except a business commitment to build/test/support the platform. And yes maybe the share just isn't enough to justify that yet.


Do they license codecs for Windows or something?


I didn't understand the author's point here... Do they think Qualcomm couldn't make these hardware changes as well? Of course they can, though they may not have considered it until now.


> While almost no modern applications read these AF and PF bits

> A generic Snapdragon ARM SoC, say, would deliver notably less performance in this specific scenario that is critically important to Mac users.

Sorry, I'm not following.

If almost no modern applications read these AF and PF bits, why is it critically important to Mac users?


My takeaway is that this is why RISC-V is so important: everyone should be able to enjoy the same ability to provide for themselves.


Doesn't Rosetta infringe on Intel's copyrighted instruction set and CPU architecture? Why can Apple use this instruction set while others cannot?

UPD: and ARM seems to be infringing as well:

> There’s a standard ARM alternate floating-point behaviour extension (FEAT_AFP) from ARMv8.7, but the M1 design predates the v8.7 standard, so Rosetta 2 uses a non-standard implementation.

> (What a coincidence – the “alternative” happens to exactly match x86.

Good luck persuading a judge that this was "a coincidence".


You can copyright code / implementation but you can't copyright the behaviour of an individual instruction!

It's very unlikely but there could conceivably be a patent, but it would have long since expired.

No case to answer.

Edit: Just to clarify - referring to treatment of one instruction here. As peer comment has said Rosetta translates ISA rather than implements so it's even further removed from being a copyright issue.


> It's very unlikely but there could conceivably be a patent, but it would have long since expired.

For those who weren't following the 32-bit to 64-bit transition on the x86 world back then, the x86-64 ISA is from the year 2000 (https://web.archive.org/web/20000817014037/http://www.x86-64...), so any patent which applies to that ISA (without the ISA being prior art for the patent) is now over 20 years old.


The x86-64 extension was made by AMD, though, not by Intel. So if at all the license would have to be obtained from AMD...

Interestingly, https://en.wikipedia.org/wiki/X86-64 says this:

> x86-64/AMD64 was solely developed by AMD. AMD holds patents on techniques used in AMD64; those patents must be licensed from AMD in order to implement AMD64


Thanks. I guess I was thinking that any patents on the instruction in the original article and of FP behaviour would predate 64-bit, but you're right that there could be relevant patents on x86-64.

Thinking aloud, I wonder if AVX 512 is translated?


AVX instructions are not supported. See here: https://medium.com/macoclock/m1-rosetta-2-limitation-illegal...


Thanks! Presumably most retail x86-64 applications will have an execution path that avoids AVX?


Intel still sells x86_64 CPUs without AVX. See Jasper Lake and Comet Lake Celeron/Pentium.


The x86 instructions are translated, not directly executed; There is no implementation of the x86 instruction set in the circuitry.


In this case AMD could also claim that they translate instructions in their CPU before execution.


Back in the 386 days Intel sued AMD for copying their microcode (which is copyrighted) not for reimplementing instructions.


And as I recall Intel lost that suit due to an old cross-licensing agreement.

In this case there would be no copying of microcode since the underlying hardware is completely different.


I think the point was that the hardware implementation is what could be infringing; it seems you are thinking the instruction set itself is?

AMD has the rights to x86, and some, like Transmeta, did x86 translation in the chip.


The edit OP made now includes that Apple Silicon may infringe, but original wording was just Rosetta. Rosetta isn't infringing, because it's not actually implementing anything other than translation. It is possible AS is infringing on a patent, but that wasn't there originally in the post. New context changes the discussion.


These quirks appear to be over 20 years old so the patents should be expired.


Maybe they license it? Maybe they cross license things to each other?


Factoid: the 8080 was used in the Space Invaders arcade game.




I think the conclusions are a bit far fetched: "A generic Snapdragon ARM SoC, say, would deliver notably less performance in this specific scenario that is critically important to Mac users."

The usage of these flags is NOT common.

And as observed above, if they're not used, then you can "have an “unused flags” optimisation that avoids the computation a lot of the time".

So I don't see how it follows that this is a "critically important" scenario. Reality is that adding the logic to compute these flags is almost free, so it's a cheap (in terms of hardware) optimization to do to squeeze some marginal extra performance. But to say it's a deal breaker? If it is, I don't think you can conclude that based on the evidence presented.


Broadly agree but how easy is it for Rosetta to prove to itself that these flags aren't used somewhere in the code path?


I'd think it would be fairly easy considering it transpiles the entire binary in a single shot. I don't see a lot of cases where you'd branch to read either of these flags.

I suppose it would be much harder for JITs and other dynamically generated code.


x86 (and 8080, Z80) can push/pop the flags register, allowing the flags bit mask to be used as regular data, not just with instructions which explicitly check those flags. So proving that specific flags are not used by the code might actually not be that easy.
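
For example, on x86-64 any code can turn the flags into ordinary data with PUSHF, at which point PF and AF are just bits 2 and 4 of the popped value - which is why "prove these flags are never observed" is harder than it sounds. A small GCC/Clang-style inline-asm demo (x86-64 only):

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint64_t flags;
        /* 0xFF + 1 = 0x00: sets AF (carry out of bit 3) and PF (0x00 has an
           even number of set bits), then PUSHFQ captures RFLAGS as data. */
        __asm__ volatile(
            "movb $0xff, %%al\n\t"
            "addb $1, %%al\n\t"
            "pushfq\n\t"
            "popq %0"
            : "=r"(flags)
            :
            : "rax", "cc");
        printf("PF=%d AF=%d\n", (int)((flags >> 2) & 1), (int)((flags >> 4) & 1));
        return 0;
    }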


Also, Rosetta surprisingly does not do much for optimization, opting for correctness and leaving speed to the silicon.


> I suppose it would be much harder for JITs and other dynamically generated code.

Which is... just about anything these days, I suppose? Half of the desktop apps are Electron, half of new CLI apps are in Node.JS, half of old CLI apps, including near-ubiquitous ones like git, are a random assortment of half a dozen scripting languages... I haven't actually counted it properly, but ad-hoc random sampling gives me an impression that a third of typical Linux distro userspace is in Python, and most of it not even compiled AOT.


But how many of those need binary translation? For two of those, all you need is a good V8 build for ARM to completely displace Rosetta 2. For git, it's mostly bash, which is interpreted and therefore can be one-shot transpiled - or just compiled natively for ARM, since it's written in C.

The overlap of "is a JIT" and "doesn't have a runtime for ARM" is overwhelmingly small. That's probably mostly because runtimes and JITs are open source, which means you can just recompile them for the target. Rosetta 2 is more focused on the proprietary space, where people don't often develop proprietary JIT runtimes.


This is about old apps that haven't been recompiled for ARM by the maker, no? If such a legacy app was built using Electron, it'll be Electron built for x64 and targeting x64 for its JIT. It doesn't really help if Electron ships an ARM runtime today, unless the OS uses it to replace the one shipped with the app - but then you may be breaking the app, if it relies on some old Electron behavior.


In the case of Python (specifically CPython, I'm ignoring PyPy), there is no dynamically generated machine code - CPython interprets without JIT. That's fairly common across most older scripting languages (Perl, Tcl, etc.).

The more common instances of JITs are, like you said, Electron/Node.js programs, Java programs, and the odd Ruby program. Anecdotally, I think Rust and Go are very common languages for new CLI apps, and I definitely have significantly more Rust and Go CLI apps than Node.js ones installed at the moment.


Fair comment. If these flags are hardly used then I guess it is a bit surprising that this is necessary.


I have no real information on the topic, but to me it looks like a "why not" optimization. It was probably pretty cheap to put into the hardware (seeing as it was already an ARM extension), the engineers probably figured there was some risk that it would be really important for some workload, and once they had included it they might as well use it.

In other words, it's probably not necessary but was included early on out of an abundance of caution. They knew that if it turned out to be used in some binary somewhere, it would be a major performance killer.

I'd be interested in seeing a benchmark with the hardware flag turned off and the translation/optimization setup used for linux enabled. I bet the difference would be negligible.


The flags are hardly read, but computed often.

Computing them while doing operations is nearly free in hardware, so it makes sense to add them in hardware if you can. It's not nearly free in software, but it's important to do it where not doing it might be observable in normal flow (tricky things with interrupts are out of luck: even if you always do the software calculation, you could be interrupted in the middle of it). In many cases, it's easy to determine the flags aren't observable and you can skip the software computation.



