Wow! This might actually make it possible for Actually Portable Executable to support running on Windows ARM. I'm already putting the ARM code inside all my binaries. There's just never been a way to encode that in the PE headers. But if my emulated WinMain() function for x86-64 could detect that it's being emulated and then simply ask a WIN32 API to jump to the ARM entrypoint instead, it'd be the perfect solution to my problems. I actually think I'm going to rush out and buy a Windows ARM computer right now.
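For the detection half, a minimal sketch (assuming Windows 10 1607+ and an x86-64 build; the "ask a WIN32 API to jump to the ARM entrypoint" half has no single documented call that I know of, so it's only a placeholder comment here):

    #include <windows.h>
    #include <stdio.h>

    int main(void) {
        USHORT process_machine = 0, native_machine = 0;
        /* IsWow64Process2 reports the machine the OS itself runs on, even when
           this x86-64 code is being emulated. */
        if (!IsWow64Process2(GetCurrentProcess(), &process_machine, &native_machine)) {
            printf("IsWow64Process2 failed: %lu\n", GetLastError());
            return 1;
        }
        if (native_machine == IMAGE_FILE_MACHINE_ARM64) {
            printf("x86-64 code running under emulation on ARM64 Windows\n");
            /* ...hand control to the embedded ARM64 entrypoint here... */
        } else {
            printf("running natively\n");
        }
        return 0;
    }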
And this kind of beautiful insanity is why you're one of my favorite developers of this era.
Also,
> I'm already putting the ARM code inside all my binaries.
Wait, I thought CPU architecture was the one limitation that did affect APE - you mean on unix-likes APE binaries are already compatible across amd64 and aarch64?
Edit: rereading https://justine.lol/cosmo3/ it does say that, doesn't it - and ARM64 listing "Windows (non-native)" just means that one platform uses (for the next few hours-days, at least...) emulation. That's amazing:)
I would recommend getting an official consumer build to test all of the latest consumer features like Copilot
Parallels has a Microsoft partnership and has an official ARM64 image which I was able to grab (and run in anything). I’m sure there are a lot more now, though!
Even more: clone3, __clone2 (only exists on Itanium), fchmodat2, preadv2, pwritev2, pipe2, sync_file_range2, mmap2 (only certain architectures; for x86, only 32-bit), renameat2, mlock2, faccessat2, epoll_pwait2
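Many of these exist for the same boring reason: the original syscall couldn't grow a flags argument without breaking its ABI. pipe2() is the minimal example (a sketch, nothing project-specific):

    #define _GNU_SOURCE
    #include <fcntl.h>      /* O_CLOEXEC, O_NONBLOCK */
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        int fds[2];
        /* pipe(fds) has no flags parameter; pipe2() applies them atomically,
           with no window where the descriptors could leak across an exec. */
        if (pipe2(fds, O_CLOEXEC | O_NONBLOCK) < 0) {
            perror("pipe2");
            return 1;
        }
        printf("read end %d, write end %d\n", fds[0], fds[1]);
        close(fds[0]);
        close(fds[1]);
        return 0;
    }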
My personal prediction is sooner or later we'll see execveat2, to permit setting /proc/PID/comm when using execveat [0].
I doubt we'll ever see clone4, because clone3 is passed a structure argument along with the structure's size, so new fields can be supported just by increasing the structure size. If other syscalls had done that from the start, much of the 2/3/etc. would have been avoided. It has actually been a very common practice on Windows (since NT); it has only much more recently been adopted in the Linux kernel.
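A minimal sketch of that extensible-struct convention (Linux 5.3+; raw syscall() because glibc has no clone3 wrapper; it behaves like fork() since no CLONE_* flags are set):

    #define _GNU_SOURCE
    #include <linux/sched.h>   /* struct clone_args */
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        struct clone_args args;
        memset(&args, 0, sizeof(args));
        args.exit_signal = SIGCHLD;   /* plain fork()-like child */

        /* The second argument tells the kernel how big a struct this caller
           knows about; newer kernels can add trailing fields without ever
           needing a clone4. */
        long pid = syscall(SYS_clone3, &args, sizeof(args));
        if (pid < 0) { perror("clone3"); return 1; }
        if (pid == 0) { printf("child\n"); _exit(0); }
        waitpid((pid_t)pid, NULL, 0);
        return 0;
    }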
I work on a team that supports some equipment related to airplanes. An acronym for one piece of equipment that is decades old is "RCSU". When I got a support call talking about "RSCU", I assumed the person meant "RCSU".
Nope. It turns out, when they made their next-generation piece of equipment, the vendor differentiated it by swapping the inner two letters in an already easy-to-say-wrong acronym.
My reaction was, "WTF didn't they just call it the RCSU2?!"
The vague convention, as far as I understand, is that Ex denotes an extended variant of a function for more advanced use cases (e.g. with additional parameters/options) which is functionally a superset of the original function, whereas numbered versions are intended to be complete replacements and/or may change the semantics compared to the prior version.
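The canonical example of the Ex half of that convention (Win32; "STATIC" is a stock window class, so this compiles and runs without registering anything, and the window names are only there to show the parameter difference):

    #include <windows.h>
    #include <stdio.h>

    int main(void) {
        /* Original function: no way to pass an extended style. In current SDK
           headers CreateWindowA is in fact a macro forwarding to CreateWindowExA
           with dwExStyle = 0, i.e. the Ex version is a strict superset. */
        HWND plain = CreateWindowA("STATIC", "plain", WS_OVERLAPPEDWINDOW,
                                   0, 0, 200, 100, NULL, NULL,
                                   GetModuleHandleA(NULL), NULL);

        /* Ex variant: identical parameter list plus dwExStyle up front. */
        HWND fancy = CreateWindowExA(WS_EX_TOOLWINDOW | WS_EX_TOPMOST,
                                     "STATIC", "extended", WS_OVERLAPPEDWINDOW,
                                     0, 0, 200, 100, NULL, NULL,
                                     GetModuleHandleA(NULL), NULL);

        printf("plain=%p fancy=%p\n", (void *)plain, (void *)fancy);
        if (plain) DestroyWindow(plain);
        if (fancy) DestroyWindow(fancy);
        return 0;
    }

CreateFile2, by contrast, folds several of CreateFile's parameters into a single struct instead of extending the argument list, which is the "numbered version as replacement" side of the convention.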
Windows 9x can run 16-bit realmode (V86), 16-bit protected mode, and 32-bit protected mode code in the same process by using different segment descriptors. Too bad neither amd64 nor the virtualisation features that came afterwards were compatible with that model, or Intel could've made ARM32/64-mode segments a reality if they'd decided to add an ARM decoder to their microarchitecture.
> ... 16-bit realmode (V86), 16-bit protected mode, and 32-bit protected mode code in the same process by using different segment descriptors...
> ...Intel could've made ARM32/64-mode segments a reality...
While I myself admire this particular breed of masochism, the direction that Intel currently wants to take is apparently quite the opposite.
In May last year, they proposed X86S[1][2][3] which tosses out 16-bit support completely, along with 32 bit kernel mode (i.e. the CPU boots directly into 64 bit mode, 32 bit code is only supported in ring 3).
The proposal trims a lot of historical baggage, including fancy segmentation/TSS shenanigans, privilege rings 1 & 2, I/O port access from ring 3, non-flat memory models, etc... limiting the CPU to 64 bit kernel mode, and 64 or 32 bit x86 user mode. With the requirement for 64 bit kernel mode, it effectively also removes un-paged memory access.
The TSS was always one of the most obnoxious aspects of the 80286 that stuck around much longer than it should have. On 386 or anything newer, using it was _slower_ than implementing it in software, yet you still needed them to implement task gates necessary for things like exceptions and interrupts.
If anyone actually has a serious need to use ancient 16 bit software, emulators like 86Box work very well. Software that old doesn’t really need performance faster than, say, a Pentium 90, which 86Box has no trouble achieving on my M1 (ARM) MacBook.
You can also use winevdm[1] on modern 64 bit Windows operating systems. I have this in production use for a niche case where someone can’t give up a particular 16 bit app, and I didn’t want to tangle with a VM for them.
The technical details of making sure a modern CPU still functions exactly like an 80386, which in turn made sure it functioned like an 80286, when you fire up a 16 bit task on, say, 32-bit Windows 10 (or 64-bit with something like winevdm[1]) sound like a nightmare for a microcode engineer or QA tester.
Oh, it doesn't; AMD and Intel gave up on that a while back. v8086 mode might... but I'd guess it has quite a bit of errata. Everything else has most certainly changed. CPUs don't support the A20 gate, for example. Nor do they truly support real mode (they boot in 'unreal mode' now). If you want a 386 compatible, you're looking at ALi or DM&P CPUs that are basically Pentium/486/386 clones.
I'd argue the break started with the Pentium Pro, at that point things shifted architecturally.
The 80286 and 80386 never had special support for the "A20 gate". That was provided by (often slow) external circuitry.
Some CPUs (I cannot remember which) built the A20 gate into the CPU itself to improve performance.
The P6 was a complete implementation of the 80286 and 80386, Virtual 8086 mode, TSS, and all - you could boot DOS or an 80286 operating system on a P6 without any problems, although the design was not optimised to improve performance of 16-bit software. This was enough of a problem that they rolled back that design by the Celeron era because there were still a lot of people using 16-bit apps.
Actually it did get used. Linux and Windows used the x86 TSS for process context-switching for years.
During that time, Linux had a limit on the number of processes, which was due to the maximum number of TSS entries that fit in the x86 GDT.
Eventually the Linux kernel was changed to the more versatile context-switch method it uses today. Among other things, this change was important for thread performance, as thread context switches can skip the TLB flush. Same for kernel mode tasks. Software task switching also greatly increased the number of processes and threads that can be launched, from about 8000 (across all CPU cores) to millions.
> the direction that Intel currently wants to take is apparently quite the opposite.
It's not just Intel. It's clear that ARM is also going in the same direction, by allowing newer cores to be 64-bit (AArch64) only, dropping compatibility with the older 32-bit ARM ISA (actually three ISAs: traditional 32-bit ARM, Thumb, and Thumb2), and IIRC some manufacturers of ARM-based chips are already doing that.
Allegedly there are already off-list SKUs from both AMD and Intel that don't support 16/32-bit code and boot up without the legacy bits. How far they went with that, I don't know. I'd hope they removed the LDT etc. and reduced the GDT to just FS and GS (or just used the fsbase and gsbase MSRs).
A tiny amount of die area, a huge amount of engineering and validation effort. If segmentation issues can cause the register renamer to lose track of who owns a physical register, that's the sort of issue that's terrible to find and debug but which also can't be allowed in a real device. Intel has traditionally been able to just throw more engineers at the problem than their competitors, but I'm not sure that'll be the case going forwards.
Mainline OSes have been 64-bit for about 15-20 years at this point; the point is to trim the parts of x86 that aren't used when running a 64-bit OS.
Notice that only 32-bit kernel mode (ring 0) is removed, not user mode (ring 3), so even with this trimming your 64-bit Windows will still run clean 32-bit software built for Win95 in the 90s.
Even today you need to run a virtualized 32-bit OS to run old 16-bit software (the downside being that, with 32-bit kernel mode gone, such a virtualized 32-bit OS would have to be emulated rather than hardware-virtualized, assuming the virtualization solution allowed it at all).
> Intel apparently forgot what made them worth choosing over competitors like ARM
People (myself and others I know) choose ARM chips because they don't absolutely mandate purchasing sanctioned chipsets and other supporting components you don't have access to, dealing with impossible-to-obtain specs, etc.
Sounds similar to what NVidia was doing with their Project Denver cores, using a mix of emulated ARM and native VLIW instructions with gradual compilation from one to another.
1. Allows incremental porting of large codebases to ARM. (It's not always feasible to port everything at once-- I have a few projects with lots of hand-optimized SSE code, for example.)
2. Allows usage of third-party x64 DLLs in ARM apps without recompilation. (Source isn't always available or might be too much of a headache to port on your own.)
3. Improve x64 emulation performance for everybody. Windows 11 on ARM ships system DLLs compiled as Arm64EC, which lets x64 binaries run native ARM code at least within the system libraries (a minimal compile-time sketch follows).
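To make those build flavors concrete, a small compile-time dispatch sketch (MSVC predefined macros; nothing here is specific to any particular project):

    #include <stdio.h>

    /* /arm64EC defines _M_AMD64/_M_X64 *and* _M_ARM64EC (for x64 source
       compatibility) but not _M_ARM64, so the EC check has to come first. */
    #if defined(_M_ARM64EC)
    #   define BUILD_FLAVOR "Arm64EC: native ARM64 code with an x64-compatible ABI"
    #elif defined(_M_ARM64)
    #   define BUILD_FLAVOR "classic ARM64"
    #elif defined(_M_X64) || defined(_M_AMD64)
    #   define BUILD_FLAVOR "x64: runs under emulation on Windows on ARM"
    #else
    #   define BUILD_FLAVOR "something else entirely"
    #endif

    int main(void) {
        /* In an incremental port this is where you'd pick, say, a rewritten
           hotspot for the EC/ARM64 flavors and keep the original SSE-heavy
           path for plain x64 until it's been ported. */
        printf("built as: %s\n", BUILD_FLAVOR);
        return 0;
    }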
It's not worth using ARM64EC just for incremental porting -- it's an unusual mode with even less build/project support than Windows ARM64, and there are EC-specific issues like missing x64 intrinsic emulations and slower indirect calls. I wouldn't recommend it except for the second case with external x64 DLLs.
At that point why trust the emulator over the port? Either you have sufficient tests for your workload or you don’t, anything else is voodoo/tarot/tea leaves/SWAG.
"Why trust the emulator?" sounds a lot like asking "why trust the compiler?". It's going to be much more widely-used and broadly-tested than your own code, and probably more thoroughly optimized.
> Allows incremental porting of large codebases to ARM. (It's not always feasible to port everything at once-- I have a few projects with lots of hand-optimized SSE code, for example.)
Wouldn't it make more sense to have a translator that translates the assembly, instead of an emulator that runs the machine code?
The SIMD part will be emulated as normal, as far as I understand. So you can ship a first version with all-emulated code, and then incrementally port hotspots to native code, while letting the emulator handle the non-critical parts.
At least in theory, we'll see how it actually pans out in practice.
I feel like binary translation is a better approach. It’s a temporary workaround that allows users to run non-native programs while they are ported properly. ARM64EC seems like it will incentivize “eh, that’s good enough” partial porting efforts that never result in a full port, while making the whole system more complicated, with a larger attack surface (binary translation also makes the system more complicated, but it seems more isolated/less integrated with the rest of the OS).
The use-case is huge apps that have a native plugin ecosystem, think Photoshop and friends. Regular apps will typically just compile separate x64 and ARM64 versions.
Yes, bite the bullet and port. Of course it makes no sense.
These sorts of things are only conceived in conversations between two huge corporations.
Like Microsoft needs game developers to build for ARM. There’s no market there. So their “people” author GPT-like content at each other, with a ratio of like 10 middlemen hours per 1 engineer hour, to agree to something that narratively fulfills a desire to build games for ARM. I can speculate endlessly how a conversation between MS and EA led to this exact standard but it’s meaningless, I mean both MS and EA do a ton of things that make no sense, and I can’t come up with nonsense answers.
Anyway, so this thing gets published many, many months after it got on some MS PM’s boss’s partner’s radar. Like the fucking devices are out! It’s too late for any of this to matter.
Is it really even a big enough concern to think about them? Windows 10 on ARM lacks x64 emulation support and the devices never sold well. I can't imagine there's too too many Windows 10 on ARM devices hanging around still running Windows 10.
Rosetta 2 operates on the process level -- on an Apple Silicon system, a process can run an ARM executable and run all ARM code, or can run an x86_64 executable and run all x86_64 code. ARM64EC allows processes to run a mixture of native and emulated code. Whether this is actually useful is debatable, but the option exists.