Wow! This might actually make it possible for Actually Portable Executable to support running on Windows ARM. I'm already putting the ARM code inside all my binaries. There's just never been a way to encode that in the PE headers. But if my emulated WinMain() function for x86-64 could detect that it's being emulated and then simply ask a WIN32 API to jump to the ARM entrypoint instead, it'd be the perfect solution to my problems. I actually think I'm going to rush out and buy a Windows ARM computer right now.
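For the detection half, a minimal sketch (assuming Windows 10 1607+ and an x86-64 build; the "ask a WIN32 API to jump to the ARM entrypoint" half has no single documented call that I know of, so it's only a placeholder comment here):

    #include <windows.h>
    #include <stdio.h>

    int main(void) {
        USHORT process_machine = 0, native_machine = 0;
        /* IsWow64Process2 reports the machine the OS itself runs on, even when
           this x86-64 code is being emulated. */
        if (!IsWow64Process2(GetCurrentProcess(), &process_machine, &native_machine)) {
            printf("IsWow64Process2 failed: %lu\n", GetLastError());
            return 1;
        }
        if (native_machine == IMAGE_FILE_MACHINE_ARM64) {
            printf("x86-64 code running under emulation on ARM64 Windows\n");
            /* ...hand control to the embedded ARM64 entrypoint here... */
        } else {
            printf("running natively\n");
        }
        return 0;
    }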
And this kind of beautiful insanity is why you're one of my favorite developers of this era.
Also,
> I'm already putting the ARM code inside all my binaries.
Wait, I thought CPU architecture was the one limitation that did affect APE - you mean on unix-likes APE binaries are already compatible across amd64 and aarch64?
Edit: rereading https://justine.lol/cosmo3/ it does say that, doesn't it - and ARM64 listing "Windows (non-native)" just means that one platform uses (for the next few hours-days, at least...) emulation. That's amazing:)
I would recommend getting an official consumer build to test all of the latest consumer features like Copilot
Parallels has a Microsoft partnership and has an official ARM64 image which I was able to grab (and run in anything). I’m sure there are a lot more now, though!
Even more: clone3, __clone2 (only exists on Itanium), fchmodat2, preadv2, pwritev2, pipe2, sync_file_range2, mmap2 (only certain architectures; for x86, only 32-bit), renameat2, mlock2, faccessat2, epoll_pwait2
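Many of these exist for the same boring reason: the original syscall couldn't grow a flags argument without breaking its ABI. pipe2() is the minimal example (a sketch, nothing project-specific):

    #define _GNU_SOURCE
    #include <fcntl.h>      /* O_CLOEXEC, O_NONBLOCK */
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        int fds[2];
        /* pipe(fds) has no flags parameter; pipe2() applies them atomically,
           with no window where the descriptors could leak across an exec. */
        if (pipe2(fds, O_CLOEXEC | O_NONBLOCK) < 0) {
            perror("pipe2");
            return 1;
        }
        printf("read end %d, write end %d\n", fds[0], fds[1]);
        close(fds[0]);
        close(fds[1]);
        return 0;
    }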
My personal prediction is sooner or later we'll see execveat2, to permit setting /proc/PID/comm when using execveat [0].
I doubt we'll ever see clone4, because clone3 is passed a structure argument along with the structure's size, so new fields can be supported just by increasing the structure size. If other syscalls had done that from the start, much of the 2/3/etc. would have been avoided. It has actually been a very common practice on Windows (since NT); it has only much more recently been adopted in the Linux kernel.
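A minimal sketch of that extensible-struct convention (Linux 5.3+; raw syscall() because glibc has no clone3 wrapper; it behaves like fork() since no CLONE_* flags are set):

    #define _GNU_SOURCE
    #include <linux/sched.h>   /* struct clone_args */
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        struct clone_args args;
        memset(&args, 0, sizeof(args));
        args.exit_signal = SIGCHLD;   /* plain fork()-like child */

        /* The second argument tells the kernel how big a struct this caller
           knows about; newer kernels can add trailing fields without ever
           needing a clone4. */
        long pid = syscall(SYS_clone3, &args, sizeof(args));
        if (pid < 0) { perror("clone3"); return 1; }
        if (pid == 0) { printf("child\n"); _exit(0); }
        waitpid((pid_t)pid, NULL, 0);
        return 0;
    }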
I work on a team that supports some equipment related to airplanes. An acronym for one piece of equipment that is decades old is "RCSU". When I got a support call talking about "RSCU", I assumed the person meant "RCSU".
Nope. It turns out, when they made their next-generation piece of equipment, the vendor differentiated it by swapping the inner two letters in an already easy-to-say-wrong acronym.
My reaction was, "WTF didn't they just call it the RCSU2?!"
The vague convention, as far as I understand, is that Ex denotes an extended variant of a function for more advanced use cases (e.g. with additional parameters/options) which is functionally a superset of the original function, whereas numbered versions are intended to be complete replacements and/or may change the semantics compared to the prior version.
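The canonical example of the Ex half of that convention (Win32; "STATIC" is a stock window class, so this compiles and runs without registering anything, and the window names are only there to show the parameter difference):

    #include <windows.h>
    #include <stdio.h>

    int main(void) {
        /* Original function: no way to pass an extended style. In current SDK
           headers CreateWindowA is in fact a macro forwarding to CreateWindowExA
           with dwExStyle = 0, i.e. the Ex version is a strict superset. */
        HWND plain = CreateWindowA("STATIC", "plain", WS_OVERLAPPEDWINDOW,
                                   0, 0, 200, 100, NULL, NULL,
                                   GetModuleHandleA(NULL), NULL);

        /* Ex variant: identical parameter list plus dwExStyle up front. */
        HWND fancy = CreateWindowExA(WS_EX_TOOLWINDOW | WS_EX_TOPMOST,
                                     "STATIC", "extended", WS_OVERLAPPEDWINDOW,
                                     0, 0, 200, 100, NULL, NULL,
                                     GetModuleHandleA(NULL), NULL);

        printf("plain=%p fancy=%p\n", (void *)plain, (void *)fancy);
        if (plain) DestroyWindow(plain);
        if (fancy) DestroyWindow(fancy);
        return 0;
    }

CreateFile2, by contrast, folds several of CreateFile's parameters into a single struct instead of extending the argument list, which is the "numbered version as replacement" side of the convention.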
Windows 9x can run 16-bit realmode (V86), 16-bit protected mode, and 32-bit protected mode code in the same process by using different segment descriptors. Too bad neither amd64 nor the virtualisation features that came afterwards were compatible with that model, or Intel could've made ARM32/64-mode segments a reality if they'd decided to add an ARM decoder to their microarchitecture.
> ... 16-bit realmode (V86), 16-bit protected mode, and 32-bit protected mode code in the same process by using different segment descriptors...
> ...Intel could've made ARM32/64-mode segments a reality...
While I myself admire this particular breed of masochism, the direction that Intel currently wants to take is apparently quite the opposite.
In May last year, they proposed X86S[1][2][3] which tosses out 16-bit support completely, along with 32 bit kernel mode (i.e. the CPU boots directly into 64 bit mode, 32 bit code is only supported in ring 3).
The proposal trims a lot of historical baggage, including fancy segmentation/TSS shenanigans, privilege rings 1 & 2, I/O port access from ring 3, non-flat memory models, etc... limiting the CPU to 64 bit kernel mode, and 64 or 32 bit x86 user mode. With the requirement for 64 bit kernel mode, it effectively also removes un-paged memory access.
The TSS was always one of the most obnoxious aspects of the 80286 that stuck around much longer than it should have. On 386 or anything newer, using it was _slower_ than implementing it in software, yet you still needed them to implement task gates necessary for things like exceptions and interrupts.
If anyone actually has a serious need to use ancient 16 bit software, emulators like 86Box work very well. Software that old doesn’t really need performance faster than, say, a Pentium 90, which 86Box has no trouble achieving on my M1 (ARM) MacBook.
You can also use winevdm[1] on modern 64 bit Windows operating systems. I have this in production use for a niche case where someone can’t give up a particular 16 bit app, and I didn’t want to tangle with a VM for them.
The technical details of making sure a modern CPU still functions exactly like an 80386, which in turn made sure it functioned like an 80286, when you fire up a 16 bit task on, say, 32-bit Windows 10 (or 64-bit with something like winevdm[1]) sound like a nightmare for a microcode engineer or QA tester.
Oh, it doesn't; AMD and Intel gave up on that a while back. v8086 mode might... but I'd guess it has quite a bit of errata. Everything else has most certainly changed. CPUs don't support the A20 gate, for example. Nor do they truly support real mode (they boot in 'unreal mode' now). If you want a 386 compatible, you're looking at ALi or DM&P CPUs that are basically Pentium/486/386 clones.
I'd argue the break started with the Pentium Pro, at that point things shifted architecturally.
The 80286 and 80386 never had special support for the "A20 gate". That was provided by (often slow) external circuitry.
Some CPUs (I cannot remember which) built the A20 gate into the CPU itself to improve performance.
The P6 was a complete implementation of the 80286 and 80386, Virtual 8086 mode, TSS, and all - you could boot DOS or an 80286 operating system on a P6 without any problems, although the design was not optimised to improve performance of 16-bit software. This was enough of a problem that they rolled back that design by the Celeron era because there were still a lot of people using 16-bit apps.
Actually it did get used. Linux and Windows used the x86 TSS for process context-switching for years.
During that time, Linux had a limit on the number of processes, which was due to the maximum number of TSS entries that fit in the x86 GDT.
Eventually the Linux kernel was changed to the more versatile context-switch method it uses today. Among other things, this change was important for thread performance, as thread context switches can skip the TLB flush. Same for kernel mode tasks. Software task switching also greatly increased the number of processes and threads that can be launched, from about 8000 (across all CPU cores) to millions.
> the direction that Intel currently wants to take is apparently quite the opposite.
It's not just Intel. It's clear that ARM is also going in the same direction, by allowing newer cores to be 64-bit (AArch64) only, dropping compatibility with the older 32-bit ARM ISA (actually three ISAs: traditional 32-bit ARM, Thumb, and Thumb2), and IIRC some manufacturers of ARM-based chips are already doing that.
Allegedly there are already off-list SKUs from both AMD and Intel that don't support 16/32-bit code and boot up without the legacy bits. How far they went with that, I don't know. I'd hope they removed the LDT etc. and reduced the GDT to just FS and GS (or just used the fsbase and gsbase MSRs).
A tiny amount of die area, a huge amount of engineering and validation effort. If segmentation issues can cause the register renamer to lose track of who owns a physical register, that's the sort of issue that's terrible to find and debug but which also can't be allowed in a real device. Intel has traditionally been able to just throw more engineers at the problem than their competitors, but I'm not sure that'll be the case going forwards.
Mainline OSes have been 64-bit for about 15-20 years at this point; the point is to trim the parts of x86 that aren't used when running a 64-bit OS.
Notice that only 32-bit kernel mode (ring 0) is removed, not user mode (ring 3), so even with this trimming your 64-bit Windows will still run clean 32-bit software built for Win95 in the 90s.
Even today you need to run a virtualized 32-bit OS to run old 16-bit software (the downside being that, with 32-bit kernel mode gone, such a virtualized 32-bit OS would have to be emulated rather than hardware-virtualized, assuming the virtualization solution allowed it at all).
> Intel apparently forgot what made them worth choosing over competitors like ARM
People (myself and others I know) choose ARM chips because they don't absolutely mandate purchasing sanctioned chipsets and other supporting components you don't have access to, dealing with impossible-to-obtain specs, etc.
Sounds similar to what NVidia was doing with their Project Denver cores, using a mix of emulated ARM and native VLIW instructions with gradual compilation from one to another.
1. Allows incremental porting of large codebases to ARM. (It's not always feasible to port everything at once-- I have a few projects with lots of hand-optimized SSE code, for example.)
2. Allows usage of third-party x64 DLLs in ARM apps without recompilation. (Source isn't always available or might be too much of a headache to port on your own.)
3. Improve x64 emulation performance for everybody. Windows 11 on ARM ships system DLLs compiled as Arm64EC, which lets x64 binaries run native ARM code at least within the system libraries (a minimal compile-time sketch follows).
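To make those build flavors concrete, a small compile-time dispatch sketch (MSVC predefined macros; nothing here is specific to any particular project):

    #include <stdio.h>

    /* /arm64EC defines _M_AMD64/_M_X64 *and* _M_ARM64EC (for x64 source
       compatibility) but not _M_ARM64, so the EC check has to come first. */
    #if defined(_M_ARM64EC)
    #   define BUILD_FLAVOR "Arm64EC: native ARM64 code with an x64-compatible ABI"
    #elif defined(_M_ARM64)
    #   define BUILD_FLAVOR "classic ARM64"
    #elif defined(_M_X64) || defined(_M_AMD64)
    #   define BUILD_FLAVOR "x64: runs under emulation on Windows on ARM"
    #else
    #   define BUILD_FLAVOR "something else entirely"
    #endif

    int main(void) {
        /* In an incremental port this is where you'd pick, say, a rewritten
           hotspot for the EC/ARM64 flavors and keep the original SSE-heavy
           path for plain x64 until it's been ported. */
        printf("built as: %s\n", BUILD_FLAVOR);
        return 0;
    }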
It's not worth using ARM64EC just for incremental porting -- it's an unusual mode with even less build/project support than Windows ARM64, and there are EC-specific issues like missing x64 intrinsic emulations and slower indirect calls. I wouldn't recommend it except for the second case with external x64 DLLs.
At that point why trust the emulator over the port? Either you have sufficient tests for your workload or you don’t, anything else is voodoo/tarot/tea leaves/SWAG.
"Why trust the emulator?" sounds a lot like asking "why trust the compiler?". It's going to be much more widely-used and broadly-tested than your own code, and probably more thoroughly optimized.
> Allows incremental porting of large codebases to ARM. (It's not always feasible to port everything at once-- I have a few projects with lots of hand-optimized SSE code, for example.)
Wouldn't it make more sense to have a translator that translates the assembly, instead of an emulator that runs the machine code?
The SIMD part will be emulated as normal, as far as I understand. So you can ship a first version with all-emulated code, and then incrementally port hotspots to native code, while letting the emulator handle the non-critical parts.
At least in theory, we'll see how it actually pans out in practice.
I feel like binary translation is a better approach. It’s a temporary workaround that allows users to run non-native programs while they are ported properly. ARM64EC seems like it will incentivize “eh, that’s good enough” partial porting efforts that never result in a full port, while making the whole system more complicated, with a larger attack surface (binary translation also makes the system more complicated, but it seems more isolated/less integrated with the rest of the OS).
The use-case is huge apps that have a native plugin ecosystem, think Photoshop and friends. Regular apps will typically just compile separate x64 and ARM64 versions.
Yes, bite the bullet and port. Of course it makes no sense.
These sorts of things are only conceived in conversations between two huge corporations.
Like Microsoft needs game developers to build for ARM. There’s no market there. So their “people” author GPT-like content at each other, with a ratio of like 10 middlemen hours per 1 engineer hour, to agree to something that narratively fulfills a desire to build games for ARM. I can speculate endlessly how a conversation between MS and EA led to this exact standard but it’s meaningless, I mean both MS and EA do a ton of things that make no sense, and I can’t come up with nonsense answers.
Anyway, so this thing gets published many, many months after it got on some MS PM’s boss’s partner’s radar. Like the fucking devices are out! It’s too late for any of this to matter.
Is it really even a big enough concern to think about them? Windows 10 on ARM lacks x64 emulation support and the devices never sold well. I can't imagine there's too too many Windows 10 on ARM devices hanging around still running Windows 10.
Rosetta 2 operates on the process level -- on an Apple Silicon system, a process can run an ARM executable and run all ARM code, or can run an x86_64 executable and run all x86_64 code. ARM64EC allows processes to run a mixture of native and emulated code. Whether this is actually useful is debatable, but the option exists.