While Apple M* chips seem to have incredibly fast unified memory, the available learning resources seem quite restricted and often convoluted. Has anyone been able to get past this barrier?
I have some familiarity with general-purpose software development in CUDA and C++. I want to figure out how to work with and use Apple's developer resources for general-purpose programming.
If you're looking for a high-level introduction to GPU development on Apple Silicon, I would recommend learning Metal. It's Apple's GPU programming framework, similar to CUDA for Nvidia hardware. I ported a set of CUDA puzzles called GPU-Puzzles (a collection of exercises designed to teach GPU programming fundamentals) [1] to Metal [2]. I think it's a very accessible introduction to Metal and to writing GPU kernels.
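To give a flavor of what the puzzles look like, here is a minimal "map"-style kernel of my own (not taken from either repo): Metal Shading Language source embedded as a Swift string, the way you can hand it to makeLibrary(source:options:). The names and the try!/force-unwraps are just for brevity.

    import Metal

    // Illustrative kernel in Metal Shading Language: add 10 to every element,
    // one GPU thread per element (the Metal analogue of a 1D CUDA grid).
    let kernelSource = """
    #include <metal_stdlib>
    using namespace metal;

    kernel void add_ten(device const float *input  [[buffer(0)]],
                        device float       *output [[buffer(1)]],
                        uint index [[thread_position_in_grid]])
    {
        output[index] = input[index] + 10.0;
    }
    """

    // Compile at runtime; real code would handle errors instead of try!/!.
    let device  = MTLCreateSystemDefaultDevice()!
    let library = try! device.makeLibrary(source: kernelSource, options: nil)
    let addTen  = library.makeFunction(name: "add_ten")!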
You can help with the reverse engineering of Apple Silicon being done by a dozen people worldwide; that is how we find out the GPU and NPU instructions [1-4]. There are over 43 trillion float operations per second to unlock, at 8 terabits per second of 'unified' memory bandwidth and 270 gigabits per second of networking (less on the smaller chips)....
You can use high-level APIs like MLX, Metal or CoreML to compute other things on the GPU and NPU.
Shadama [5] is an example of a programming language that translates (with Ometa) matrix calculations into the WebGPU or WebGL APIs (I forget which). You can do exactly the same with the MLX, Metal or CoreML APIs and pay only around 3% overhead going through the translation stages.
I estimate it will cost around $22K at my hourly rate to completely reverse engineer the latest A16 and M4 CPU (ARMv9), GPU and NPU instruction sets. I think I am about halfway through the reverse engineering; the debugging part is the hardest problem. You would, however, not be able to sell software built on it in the App Store, as Apple forbids undocumented APIs and bare-metal instructions.
Any place you have your current progress written up? Any methodology I could help contribute to? I've read each of the four links you've given over the years, and it seems vague how far people have currently gotten and what the exact issues are.
Several people have already contacted me today with this request. This is how I give out details and share current progress with you.
Yes, you can help; most people on HN could. It is not that difficult work, and it is not just low-level debugging, coding and FPGA hardware. It is also organizing and even simple sales, talking to funders. With patience, you could even get paid to help.
>Any place you have your current progress written up?
Not any place in public, because of its value for zero-day exploits. This knowledge is worth millions.
I'm in the process of rewriting my three scientific papers on reverse engineering Apple Silicon low level instructions.
>it seems vague how far people have currently gotten and what the exact issues are.
Yes, I'm afraid you're right, my apologies. It's very much detailed and technical stuff, some of it under patent and NDA, some of it even sensitive for winning economic wars and ongoing wars (you can guess those are exciting stories). It even plays a role in the $52.7 billion US, €43 billion EU and $150 billion (unconfirmed) Chinese Chips Acts. Apple Silicon is the main reason TSMC opened a US factory [1], and keeping its instruction set details secret is deemed important.
If you want more information, you should join our offline video discussions. Maybe sometimes sign an NDA for the juicy bits.
You are right. The zero-day exploits might be worth roughly a million each, but not the family tree of native GPU, ANE and CPU instruction sets and the microarchitecture on which they would be based.
My apologies for writing unclearly; English is not my native language. I'm surprised it is yours.
Knowledge of the M4 instruction sets and microarchitecture would also save millions in energy, programming effort and the purchase cost of a supercomputer.
Apple wants total freedom to rework lower levels of the stack down to the hardware, without worrying about application compatibility, hence their answer will continue to be Metal.
I agree that it allows Apple to redefine Apple Silicon instruction sets without having to explain it to 3rd party software developers, but it is certainly not the main reason they hide the technical documentation of the chips.
Metal is the answer. Everything else is just implementation detail as GP said.
Apple doesn’t provide developer support to other OSes. The only OS they do anything for* is macOS. So to them there’s no point.
All they’d get is people relying on implementation details they shouldn’t, other companies stealing what they consider their trade secrets, or more surface area for patent trolls to scan.
* Someone on the Asahi team, I think Hector Martin, has commented before that Apple is doing things that clearly seem designed to allow others to make and securely boot other OSes on their Apple Silicon hardware. They clearly could be clamping it down far more but are choosing not to. However, that's exactly as far as the support appears to go.
> Metal is the answer. Everything else is just implementation detail as GP said.
You can say this as much as you want, but Nvidia makes money hand over fist supporting CUDA alongside OpenCL and DirectX. It's all just business to them; they don't have to play the same game as Apple because they're just not quite so petty with the ecosystem politics.
Look at macOS, for example. Plenty of legacy software was never supported in Metal; its "implementation detail" never manifested. Metal wasn't really used in AI either until Apple upstreamed their own MPS hacks into PyTorch and people got BERT et al. working, and even that was a pint-sized party trick that you could do on a Raspberry Pi. Apple themselves aren't even using their own servers for serious inference, because you can't; it has to be offloaded to a lower-latency platform.
It's not just that Metal as a platform has failed its users, although it's certainly contributed to developers giving up on Mac hardware for serious compute. Apple's GPU design is stuck in iPhone mode and they refuse to change their approach with Apple Silicon desktop hardware. It was Apple's big bet on NPUs that hamstrung them, not an implementation detail, and if you don't believe me then wait and see. Xserve didn't tear down the 1U market, Asahi didn't upend Linux HPC, and Metal isn't going to upend AI compute any more than DirectX will. This is the same "Apple will get 'em next year" quote we always hear when they fuck up, and they never actually seem to swallow their pride and take notes.
Apple is using their own servers for inference; that's the whole Private Cloud Compute thing. Siri and other things use models that probably aren't running on it (though it's not announced), but those are older.
> Apple's GPU design is stuck in iPhone mode and they refuse to change their approach with Apple Silicon desktop hardware.
I can't guess what the main reason is. There might not even be a main reason, as many groups of people at Apple and among its shareholders have decided this over the years.
(Also see my speculations below in this thread).
So not in any order of importance to Apple:
1) Create the same moat as NVIDIA has with CUDA.
2) Ability to re-define the microcode instruction set of all the dozens of different Apple Silicon chips now and in the future without having to worry about backwards compatibility. Each Apple Silicon chip simply recompiles code at runtime (similar to my adaptive compiler).
3) Zero hardware documentation needed, much cheaper PR and faster time to market, also making it harder to reverse engineer or repair.
4) Security, or at least security by obscurity.
5) Keeping the walled garden up longer.
6) Frustrating reverse engineering of Apple software. You must realize Apple competes with their own third-party developers. Apple can optimize code on the GPU and ANE; third-party developers cannot and are forbidden to by Apple.
7) Frustrating reverse engineering of Apple hardware.
8) It won't make Apple more sales if 3rd party developers can write faster and more energy efficient GPU and NPU software.
There certainly is a reason and indeed you don't see it because Apple downplays these things in their PR.
It might be the same reason that is behind NVIDIA's CUDA moat. CUDA lock-in prevented competitors like AMD and Intel from convincing programmers and their customers to switch away from CUDA. So no software was ported to their competing GPUs. So you get antitrust lawsuits [1].
I think you should put yourself in Apple management's mindset and then reason from there. I suspect they think they will not sell more iPhones or Macs if they let third-party developers access the low-level APIs and write faster software.
They might reason that if no one knows the instruction sets, hackers will write less code to break security. Security by obscurity.
They certainly think that blocking competitors from reverse engineering the low power Apple Silicon and blocking them from using TSMC manufacturing capacity will keep them the most profitable company for another decade.
CUDA didn't prevent anything, at least not in the way you believe.
Intel and AMD had no competitive offer, period. They still don't.
NVIDIA is simply offering an ecosystem that is battle-tested and ready out of the box. Look at the recent SemiAnalysis test to see how unready AMD is, and they would be the only company with a real shot at this. Their hardware on paper is better or equal, yet their software ecosystem is nowhere near ready.
> Look at the recent SemiAnalysis test to see how unready AMD is, and they would be the only company with a real shot at this. Their hardware on paper is better or equal, yet their software ecosystem is nowhere near ready.
Reading that was kind of odd. It seems like their conclusion was that on paper AMD should be significantly less expensive and significantly faster, whereas in practice they're significantly less expensive and slightly slower because of unoptimized software, which actually seems like it'd still be a pretty good deal. Especially if the problem is the software, because then the hardware could get better with a software update after you buy it.
They also spend a lot of time complaining about how much trouble it is to install the experimental releases with some improvements that aren't in the stable branch yet, but then the performance difference was only big in a few cases and in general the experimental version was only a couple of percent faster, which either way should end up in the stable release in the near future.
And they do a lot of benchmarks on interconnect bandwidth which, fair enough, Nvidia currently has some hardware advantage. But that also mainly matters to the small handful of companies doing training for huge frontier models and not to the far larger number of people doing inference or training smaller models.
It feels like they were more frustrated because they were using the hardware as the problems were being solved rather than after, even though the software is making progress and many of the issues have already been resolved or are about to be.
Where does the 270 Gbit/s networking figure come from? Is it the aggregate bandwidth of the PCIe slots on the Mac Pro, which could support NICs at those speeds (and above, according to my quick maths)? There isn't really any driver support for modern Intel or Mellanox/Nvidia NICs as far as I can tell.
My use case would be hooking up a device which spews out sensor data at 100 Gbit/s over QSFP28 Ethernet as directly to a GPU as possible. The new Mac mini has the GPU power, but there's no way to get the data into it.
> Where does the 270 Gbit/s networking figure come from? Is it the aggregate bandwidth of the PCIe slots on the Mac Pro
We should both restate and specify the calculation for each different Apple Silicon chip and the PCB/machine model it is wired onto.
The $599 M4 Mac mini base model networking (aggregated WiFi, USB-C, 10G Ethernet, Thunderbolt PCIe) is almost 270 Gbps. Your 720 Gbps is for a >$8000 Mac Pro M2 Ultra, but that number is too high because the 2x Gen4 x16 is shared/oversubscribed with the other PCIe lanes for the x8 PCIe slots, SSD and Thunderbolt. You need to measure/benchmark it, not read the marketing PR.
I estimate the $1400 M4 Pro Mac mini networking bandwidth by adding the external WiFi, 10 Gbps Ethernet, two USB-C ports (2 x 10 Gbps) and three Thunderbolt 5 ports (3 x 80/120 Gbps), subtracting the PCIe 64 Gbps limit and not counting the internal SSD. Two $599 M4 Mac mini base models are faster and cheaper than one M4 Pro Mac mini.
The point of the precise actual measurements I did of the trillions of operations per second and the billions of bits per second of networking/interconnect of the M4 Mac mini against all the other Apple Silicon machines is to find which package (chip plus PCB plus case) has the best price/performance/watt balance when they are networked together. As of January 2025 you can build the cheapest, fastest supercomputer in the world from just off-the-shelf M4 16GB Mac mini base models with 10G Ethernet, MikroTik 100G switches and a few FPGAs. It would outperform all Nvidia, Cerebras, Tenstorrent and datacenter clusters I know of, mainly because of the low-power Apple Silicon.
Note that the M4 has only 1.2 Tbps of unified memory bandwidth and the M4 Pro has double that. The 8 Tbps unified memory bandwidth is on the M1 and M2 Studio Ultra with 64/128/192GB DRAM. Without it you can't reach 50 trillion operations per second. A Mac Studio has only around 190 Gbps of external networking bandwidth but does not reach the 43 TOPS, nor does the 720 Gbps of your Mac Pro estimate. By reverse engineering the instruction set you could squeeze a few percent extra performance out of this M4 cluster.
The 43 TOPS of the M4 itself is an estimate. The ANE does 34 TOPS, the CPU less than 5 TOPS depending on float type, and we have no reliable benchmarks for the CPU floating point.
It's very weird to add together all kinds of very different networking solutions (WiFi, wired ethernet, TB) and talk about their aggregate potential bandwidth as a single number.
You use SerDes high-speed serial links (up to 224 Gbps in 2025) to communicate between chips. A PCIe lane is just a SerDes with a 30% packet protocol overhead that uses DMA to copy bytes between two SRAM or DRAM buffers.
You aggregate PCIe lanes (x16, x8, x4/Thunderbolt, x1).
You could also build mesh networks from SerDes, but then instead of PCIe switches you would need SerDes switches or routers (Ethernet, NVLink, InfiniBand).
You need those high-speed links between chips for much more than SSD/NVMe cards: other NAS, processors, Ethernet/internet, cameras, WiFi, optics, DRAM, SRAM, power etc. For inter-core communication (between processors or between chiplets), between networked PCBs, between DRAM chips (DDR5 is just another SerDes protocol), flash chips, camera chips, etc.
Any other chip running faster than 250 Mbps.
I aggregate all the M4 Mac mini ports into an M4 cluster by mesh networking all its SerDes/PCIe with FPGAs into a very cheap, low-power supercomputer with exaflop performance. Cheaper than NVIDIA. I'm sure Apple does the same in their data centers.
My talk [1] on wafer-scale integration and free-space optics goes deeper into how and why SerDes and PCIe will be replaced by fiber optics and free-space optics for power reasons. I'm sure several parallel 2 GHz optical lambdas per fiber (but no SerDes!) will be the next step in Apple Silicon as well: the M4 power budget already mostly goes to the off-chip SerDes/Thunderbolt networking links.
> I aggregate all the M4 Mac mini ports into an M4 cluster by mesh networking all its SerDes/PCIe with FPGAs into a very cheap, low-power supercomputer with exaflop performance. Cheaper than NVIDIA. I'm sure Apple does the same in their data centers.
That sounds super interesting, do you happen to have some further information on that? Is it just a bunch of FPGAs issuing DMA TLPs?
It is not the first time supercomputers have been built from off-the-shelf Apple machines [1].
M4 supercomputers are cheaper, and they will also have lower Capex and Opex than most datacenter hardware.
>do you happen to have some further information on that?
Yes, the information is in my highly detailed custom documentation for the programmers and buyers of 'my' Apple Silicon supercomputer, the Squeak and Ometa DSL programming languages and the adaptive compiler. You can contact me for this highly technical report and several scientific papers (email in my profile).
Do you know of people who might buy a supercomputer based on better specifications? Or even just buyers who will go for 'the lowest Capex and the lowest Opex supercomputer in 2025-2027'?
Because the problem with HPC is that almost all funders and managers buy supercomputers with a safe brand name (Nvidia, AMD, Intel) at triple the cost, and seldom from a supercomputer researcher such as myself. But some do, if they understand why. I have been designing, selling, programming and operating supercomputers since 1984 (I was 20 years old then); this M4 Apple Silicon cluster will be my ninth supercomputer. I prefer to build them from the ground up with our own chip and wafer-scale integration designs, but when an off-the-shelf chip is good enough I'll sell that instead. Price/performance/watt is what counts; ease of programming is a secondary consideration relative to the performance you achieve. Alan Kay argues you should rewrite your software from scratch [2] and do your own hardware [3], so that is what I've done since I learned from him.
>Is it just a bunch of FPGAs issuing DMA TLPs?
No. The FPGAs are optional, for when you want to flatten the inter-core (i.e. inter-SRAM-cache) networking with switches or routers to a shorter-hop topology for the message passing, like a Slim Fly diameter-two topology [4].
DMA (Direct Memory Access) TLPs (Transaction Layer Packets) are one of the worst ways of doing inter-core and inter-SRAM communication, and on PCIe they carry a huge 30% protocol overhead at triple the cost. Intel (and most other chip companies like NVIDIA, Altera, AMD/Xilinx) can't design proper chips because they don't want to learn about software [2]. Apple Silicon is marginally better.
You should use pure message passing between all processes, preferably in a programming language and a VM that use pure message passing at the lowest level (Squeak, Erlang). Even better if you then map those software messages directly to message-passing hardware as in my custom chips [3].
The reasons to reverse engineer the Apple Silicon instructions for the CPU, GPU and ANE are to be able to adapt my adaptive compiler to M4 chips, but also to repurpose PCIe for low-level message passing with much better performance and latency than DMA TLPs.
To conclude, if you want the cheapest Capex and Opex M4 Mac mini supercomputer, you need to rewrite your supercomputing software in a high-level language and message-passing system like the parallel Squeak Smalltalk VM [3] with adaptive load-balancing compilation. C, C++, Swift, MPI or CUDA would result in sub-optimal software performance and orders of magnitude more lines of code when optimal performance of parallel software is the goal.
I forgot to add links: the talk [5] by IBM Research on massively parallel Squeak Smalltalk and why it might be relevant for Apple Silicon reverse engineering and M4 clusters.
And the talk [6] on free-space optical interconnects without SerDes, which may some day show up in low-power Apple Silicon (around the M6-M8 models).
Yes, knowing the exact CPU and ANE assembly instructions (or the underlying microcode!!) allows general-purpose software to be adaptively compiled onto all the core types, not just the CPU ones. It won't always be faster: you get more cache misses (some cores don't have a cache), different DMA and thread scheduling, some registers can't fit the floats or large integers, etc. etc.
But yes, it will be possible to use all 140 cores of the M2 Ultra or the 36 cores of the M4. There will be an M6 Extreme some day, maybe 500 cores?
Actually, the GPU and ANE cores themselves are built from teams of smaller cores, maybe a few dozen, hundreds or thousands in all, same as in most NVIDIA chips.
>A steal for $22k but I guess very niche for now...
A single iPhone or Mac app (a game, an LLM, pattern recognition, a security app, a VPN, de/encryption, a video en/decoder) that can be sped up by 80%-200% can afford my faster assembly-level API.
A whole series of hardware-level zero-day exploits for iPhone and Mac would become possible; now that won't be very niche at all. Reversing the Apple Silicon instruction sets is worth millions.
What would an "LLVM-compilable" hello world look like that matches the libc GPU example for "AGX" (Apple Graphics)? It's not possible from macOS, right? It'd have to be done from Linux?
No, I don't think it is impossible from macOS. I might be missing a detail here, not sure. I have to think it over.
I have seen [1] that you can patch ANECompilerService, so you can even speed up existing code, because Apple compiles your code just in time (at runtime) on each machine. We could do that for the macOS libc too.
You (or your compiler) write the instructions and data into unified memory (up to 192 GB) and jump to the first instruction (usually of a loop) on each core. GPU and ANE processor cores are not fundamentally different from CPU cores; they just have fewer transistors (gates) and therefore more limitations in what a register can address and what data types or instructions they can execute. Some cores can only execute the same instruction as their neighbor cores in a team, but on different data. Or at a different time, synchronized with their neighbors. But they still are Turing-complete processors, so in essence they are the same as their cousins the CPU cores. Sometimes a core's input or output addresses are in a pipeline between cores (which limits its address offsets).
macOS only plays a role in allocating and protecting the instruction or data memory regions for the GPU and ANE processors.
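For comparison, the documented way on macOS to put data where the GPU sees it is a shared MTLBuffer in unified memory. A minimal Swift sketch using only the public Metal API (nothing reverse engineered; the 1 MiB size is arbitrary):

    import Metal

    // Allocate 1 MiB in unified memory; .storageModeShared means the CPU and
    // GPU see the same bytes with no copies on Apple Silicon.
    guard let device = MTLCreateSystemDefaultDevice(),
          let buffer = device.makeBuffer(length: 1 << 20, options: .storageModeShared)
    else { fatalError("Metal is not available") }

    // The CPU writes through a plain pointer...
    let floats = buffer.contents().bindMemory(to: Float.self, capacity: 1 << 18)
    for i in 0..<(1 << 18) { floats[i] = Float(i) }

    // ...and any compute kernel bound to this buffer (setBuffer(_:offset:index:))
    // reads and writes the same memory. What the public API does not let you do
    // is place your own GPU/ANE machine code there and jump to it; execution
    // always goes through a Metal (or Core ML) pipeline.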
It's hard to answer without knowing exactly what your aim is, your experience level with CUDA and how easily the concepts you know will map to Metal, and what you find "restricted and convoluted" about the documentation.
<Insert your favorite LLM> helped me write some simple Metal-accelerated code by scaffolding the compute pipeline, which took most of the nuisance out of learning the API and let me focus on writing the kernel code.
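For anyone curious what that scaffolding amounts to, here is a rough sketch of the whole round trip in Swift (my own minimal version, not the code the LLM produced): compile a kernel from MSL source, fill a buffer in unified memory, dispatch, and read the result back. Error handling is elided with try!/force-unwraps for brevity.

    import Metal

    let source = """
    #include <metal_stdlib>
    using namespace metal;
    kernel void double_it(device float *data [[buffer(0)]],
                          uint i [[thread_position_in_grid]])
    {
        data[i] *= 2.0;
    }
    """

    let device   = MTLCreateSystemDefaultDevice()!
    let queue    = device.makeCommandQueue()!
    let library  = try! device.makeLibrary(source: source, options: nil)
    let pipeline = try! device.makeComputePipelineState(
        function: library.makeFunction(name: "double_it")!)

    // Input data lives in unified memory, visible to both CPU and GPU.
    var values = [Float](repeating: 1.5, count: 4096)
    let buffer = device.makeBuffer(bytes: &values,
                                   length: values.count * MemoryLayout<Float>.stride,
                                   options: .storageModeShared)!

    // Encode a single compute dispatch: one GPU thread per element.
    let commandBuffer = queue.makeCommandBuffer()!
    let encoder = commandBuffer.makeComputeCommandEncoder()!
    encoder.setComputePipelineState(pipeline)
    encoder.setBuffer(buffer, offset: 0, index: 0)
    encoder.dispatchThreads(MTLSize(width: values.count, height: 1, depth: 1),
                            threadsPerThreadgroup: MTLSize(width: pipeline.threadExecutionWidth,
                                                           height: 1, depth: 1))
    encoder.endEncoding()
    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()

    // Read the result straight back out of the shared buffer.
    let result = buffer.contents().bindMemory(to: Float.self, capacity: values.count)
    print(result[0])  // 3.0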
People have already mentioned Metal, but if you want cross-platform, https://github.com/gfx-rs/wgpu has a Vulkan-like API and cross-compiles to all the various GPU frameworks. I believe it uses https://github.com/KhronosGroup/MoltenVK to run on Macs. You can also see the Metal shader transpilation results for debugging.
Given what the OP asked for, I don't think wgpu is the right choice. They want to push the limits of Apple Silicon, or do Apple-platform-specific work, so an abstraction layer like wgpu is going in the opposite direction, in my opinion.
Indeed. I'm curious how much overhead there is in practice, given that the hardware wasn't designed to provide Vulkan support. I honestly have no clue what to expect.
If you know CUDA, then I assume you already know a bit about GPUs and the major concepts. There are just minor differences and different terminology for things like "warps" etc.
I'd note that a lot of Apple's material is still written in Objective-C, which I'm not that familiar with. But most of that is boilerplate, and the rest is largely C/C++-based (including the Metal shading language).
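A rough sketch of how the terminology lines up, written as an illustrative MSL kernel in a Swift string (the attribute names are from the Metal Shading Language spec; the CUDA equivalents in the comments are my own mapping):

    // CUDA               ->  Metal / MSL
    // thread             ->  thread
    // warp (32 threads)  ->  simdgroup (32 threads on Apple GPUs)
    // block              ->  threadgroup
    // grid               ->  grid
    // __shared__         ->  threadgroup address space
    // __syncthreads()    ->  threadgroup_barrier(mem_flags::mem_threadgroup)
    let mappingKernel = """
    #include <metal_stdlib>
    using namespace metal;

    kernel void copy(device const float *in  [[buffer(0)]],
                     device float       *out [[buffer(1)]],
                     uint gid  [[thread_position_in_grid]],        // blockIdx.x * blockDim.x + threadIdx.x
                     uint tid  [[thread_position_in_threadgroup]], // threadIdx.x
                     uint bid  [[threadgroup_position_in_grid]],   // blockIdx.x
                     uint lane [[thread_index_in_simdgroup]])      // lane id within a warp
    {
        out[gid] = in[gid];
    }
    """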
I just ported some CPU/SIMD number crunching (complex matrices) to Metal, and the speed up has been staggering. What used to take days now takes minutes. It is the hottest my M3 MacBook has ever been though! (See https://x.com/billticehurst/status/1871375773413876089 :-)
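Not their code (the port is behind the link), but to give a sense of what such a kernel can look like: a hypothetical element-wise complex multiply, packing each complex value as a float2 and assuming the matrices are flattened to 1D buffers.

    // Illustrative MSL kernel (as a Swift string): element-wise product of two
    // complex arrays, with each complex number stored as float2(re, im).
    let complexKernel = """
    #include <metal_stdlib>
    using namespace metal;

    kernel void complex_mul(device const float2 *a   [[buffer(0)]],
                            device const float2 *b   [[buffer(1)]],
                            device float2       *out [[buffer(2)]],
                            uint i [[thread_position_in_grid]])
    {
        float2 x = a[i];
        float2 y = b[i];
        // (xr + i*xi)(yr + i*yi) = (xr*yr - xi*yi) + i*(xr*yi + xi*yr)
        out[i] = float2(x.x * y.x - x.y * y.y,
                        x.x * y.y + x.y * y.x);
    }
    """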
[1] https://github.com/srush/GPU-Puzzles
[2] https://github.com/abeleinin/Metal-Puzzles