How to get 1.5 TFlops of FP32 performance on a single M1 CPU core (jott.live)
359 points by signa11 on Jan 5, 2023 | 119 comments



It's amazing to me that there are four separate pieces of hardware in M1 devices that can do matrix multiplies.

In addition to running on the CPU, M1 Max devices have three separate kinds of hardware-accelerated `gemm`: the GPU, the ANE (Apple Neural Engine), and this special matrix coprocessor. Here's a fairly detailed post that benchmarks each:

https://tlkh.dev/benchmarking-the-apple-m1-max

And here's a great post about the justification for having so much special-purpose hardware:

https://medium.com/swlh/apples-m1-secret-coprocessor-6599492...

As for the matrix coprocessor, Apple's built-in BLAS implementation (Accelerate.framework) uses this chip. You can link Numpy against this to benefit in your Python programs, for example. Here are some old instructions: https://gist.github.com/MarkDana/a9481b8134cf38a556cf23e1e81...
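For the curious, using that path from C is just an ordinary CBLAS call; the dispatch to the coprocessor happens inside Accelerate. A minimal sketch (the matrix size is an arbitrary choice here):

    // Build with: clang matmul.c -framework Accelerate
    #include <Accelerate/Accelerate.h>
    #include <stdlib.h>

    int main(void) {
        const int n = 1024;
        float *A = calloc((size_t)n * n, sizeof(float));
        float *B = calloc((size_t)n * n, sizeof(float));
        float *C = calloc((size_t)n * n, sizeof(float));

        // C = 1.0 * A * B + 0.0 * C; per this thread, Accelerate routes
        // this sgemm to the AMX unit(s) on Apple Silicon.
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0f, A, n, B, n, 0.0f, C, n);

        free(A); free(B); free(C);
        return 0;
    }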

All this represents yet another cycle on the Wheel of Reincarnation... (http://catb.org/jargon/html/W/wheel-of-reincarnation.html)


Isn't this wheel of reincarnation simply a result of a shifting bottleneck? A computation can be CPU-bound or memory-bound, and this can change over hardware generations.


Makes sense... We're also seeing energy efficiency and model size and latency becoming significant constraints these days, and the more unique constraints an application has, perhaps the more beneficial it is to have many different implementations with different tradeoffs.


> energy efficiency (...) many different implementations

Yep, thermal throttling is a thing, and sometimes what you need is either useless silicon padding or some specialized, mostly dark silicon, both to make the chip feasible to cool and to keep it from melting.


I suspect Apple was more worried about battery use in this case.


It is, but the fact that the bottleneck has shifted multiple times (as opposed to just this one recent time) is nonobvious (to someone unfamiliar with computing history) and worthy of pointing out.


Since there is no summary, these are the benchmark findings:

    AMX co-processor 2 TFLOPS FP32
    GPU 8 TFLOPS FP32
    Neural Engine 5.5 TFLOPS FP16


Note that AMX can achieve roughly double the FLOPS with FP16, and 8 TFLOPS for the GPU is only about 77% of peak. You can do better than that; with FP16 in particular, 90+% of peak is possible (which is >9.4 TFLOPS).


Is there any easy way to use all of these at the same time? I.e., some library you can ask to do a big matrix multiply that will load-balance between the different pieces of hardware?

Or do you have to manually split the computation between them?


I’m by no means an expert in any of this. I mainly work on video processing using the GPU. That said, I would think if any library would do load balancing between them, it would likely be the Accelerate.framework that ships with the system.

However, I do have some experience with having the same code run on the GPU and the CPU. In my work, we have tried breaking images (usually frames of video) into various sized chunks and processing on both the CPU and GPU at the same time. Our conclusion is that the overhead of using both outweighs any benefit you’d get. The GPU is so much faster than the CPU, there’s no point in involving the CPU at all. These experiments were done several years ago, so perhaps the landscape has changed since then, but that was what we found.


You might find David Wright's presentations about Unreal 5 interesting:

https://highperformancegraphics.org/slides22/Journey_to_Nani...

https://advances.realtimerendering.com/s2022/index.html#Lume...

They're great presentations with a lot of depth in the notes. I think videos are around somewhere if you prefer that.

Two specifics I'd mention:

It seems a lot of games now use feedback between frames as a way to tolerate the latency of moving data between CPU and GPU. Eg the CPU will use GPU crunched data from the previous frame as a source for CPU crunching that optimizes what data gets passed to the GPU next.

The other is that fixed functionality is moving into shaders. Unreal 5 uses a mix of hardware rasterization and software rasterization in a shader (and path tracing now as well). There the tradeoff between the two is triangle size in pixels.


Oh wow! Thanks! That looks really cool.


They're great. I dunno if you find 3d interesting vs video, but the section in that nanite presentation where he goes through how he arrived at the LoD clustering design is some of the smartest stuff I've ever seen any developer say, ever. Like John Carmack probably saw this and went "dang, wish I'd thought of that" levels of smart.


So why would you choose to use the Neural Engine rather than the GPU?

Just power efficiency?


That and if you want to use the GPU at the same time.


> And here's a great post about the justification for having so much special-purpose hardware:

Ok but it doesn't actually justify why AMX & ANE both exist. It makes kind of a vague handwavy "well AMX latency is better[1] and that's useful[2]"

1: but not measured, so not actually known, and with a note that it's been called out that the AMX understanding is incorrect, so is this point even still accurate?

2: but not elaborated on in the slightest or any comparison of test workloads

So why do both AMX & ANE exist? CPU team did AMX before the ANE team showed up with something bigger & better? Are they actually used for differing workloads or simultaneously?


> All this represents yet another cycle on the Wheel of Reincarnation...

Isn't this adding new cores directly onto the main chip? That doesn't sound like it fits to me.

And at this point GPUs have been straddling both sides of the divide for decades, depending on the particular device form factor and the necessary power.

The only thing I would actually say has gone through a cycle lately is the crypto accelerator for mac SSDs.


> Isn't this adding new cores directly onto the main chip? That doesn't sound like it fits to me.

These are coprocessors, which are a very different thing from just another CPU core. For one, they use a different architecture (instruction set, registers/memory, etc.).

The "wheel of reincarnation" refers to features/capabilities on coprocessors eventually being folded into the main CPU. While CPUs have adopted insights from GPU implementations, GPU functionality has never been fully folded into CPUs (software rasterizers don't count).


> These are coprocessors, which are a very different thing from just another CPU core.

Well that's why I didn't say "just another CPU core". But fine, I don't want to argue semantics.

> The "wheel of reincarnation" refers to features/capabilities on coprocessors eventually being folded into the main CPU.

Then that's definitely not happening here, and hasn't happened to x86/arm since they gained floating point, right?


There's also the media encoder hardware accelerator, which isn't quite `gemm`, but certainly contains hardware that performs `mm`s.


The AMX units are really nice, especially because you can use them simply with standard sgemm through the Accelerate framework. However, in most applications where latency is not an issue, you'll probably want to use Metal Performance Shaders instead: not only are they much faster for most applications, they can also be more energy efficient.

For instance, we did benchmarks of spaCy (natural language processing) transformer models across various Apple Silicon SoCs and MPS was 1.9x (M1) to 5.5x faster (M1 Ultra) while providing far more performance per Watt. E.g. using MPS on an M2 MacBook Air used 4W less energy while being 2.7x faster than AMX.

Full benchmarks are at the end of this post:

https://explosion.ai/blog/metal-performance-shaders


This is a fairly ridiculous amount of performance, all things considered.

It always seemed to me like SIMD/AVX/etc would eventually come for the GPU's lunch money... How many more product generations of "SIMD on steroids" before this is practically true?

The latency factor is the biggest thing for me. The GPU is a turtle compared to CPU-bound techniques. I can see emerging applications for this in real-time/streaming where every millisecond counts.


That's how we felt when we were writing the software rasterizer for Larrabee! The issue is that that 1.5 TFLOP probably costs way more power than the M1 GPU's ~2.5 TFLOP. The second issue is that a SW rasterizer is going to spend ~50% of its budget emulating fixed function. So now you're burning way more power for 1/4 the perf (best case). Also, you can't run any other apps, and you're probably going to have bandwidth issues to the display controller.

GPUs are an optimization to try to use the excess Moore's law we have to get to the ghost of Dennard's law.


There’s nothing to preclude a CPU from adding fixed-function units it can call for certain things. Larrabee 2 (one of the Knights series) actually included these on the chip (though they were disabled in the final compute-only product).

You also forgot to mention the usability bit. Branching is bad for GPUs. They either take both branches or in the optimized version, they stall the pipeline until they get a result then attempt to optimize which units need to execute on which branch.

The first is simply inefficient. The last is efficient, but impractically slow on branchy code.

On the power side, bloated x86 is hardly a good showcase for what such an architecture could do. I also suspect that an automatic thread manager like most modern GPUs use could also add a lot of value.


> Branching is bad for GPUs.

Branching is bad for wide SIMD workloads of which GPUs are the most common. But just shoving that SIMD under the control of a CPU instead won't fix your branching issues. You're still probably going to choose the path of just executing both branches & masking off the results.
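To make that concrete, here is a tiny NEON sketch (my own illustration, not from the thread) of the usual "compute both sides and blend" pattern for a per-element branch like `x > 0 ? a*x : b*x`:

    #include <arm_neon.h>

    // Per-lane equivalent of: out[i] = x[i] > 0 ? a*x[i] : b*x[i]
    // Both "branches" are evaluated; vbslq_f32 blends them lane by lane.
    float32x4_t branchless(float32x4_t x, float a, float b) {
        uint32x4_t  mask     = vcgtq_f32(x, vdupq_n_f32(0.0f)); // x > 0 ?
        float32x4_t if_true  = vmulq_n_f32(x, a);               // taken path
        float32x4_t if_false = vmulq_n_f32(x, b);               // not-taken path
        return vbslq_f32(mask, if_true, if_false);              // select per lane
    }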


I worked on the LRB through LRB3. I wrote the non-polygon pipeline for omatic (lines, points, billboards); later I switched to the shader compiler (masher) under mattomatic. Larrabee 1 (all of omatic) had fixed function on the turns: texture units, and pointer chasing. The later parts did not have GPU FF; they didn’t even have some of the rasterization opcodes (faddsets & fmad233).

I seem to remember that LRB could test & jmp in 4 cycles, if you were careful. Since it was 4x barrel processed, those jmps were “free”.

I moved to integrated GPUs, later. x86 is bloated, but LRB was not. Also, the decoder — even a big one like x86 — isn’t a major HW problem. I’d say x86’s memory hierarchy is more of an issue.


I think a power/latency/perf tradeoff could be agreeable for certain applications. GPUs in the cloud are not exactly cheap. Many gaming experiences do not require nanite-level graphics.

Building something that can reliably output reasonable-quality 3d graphics without relying on specific GPU technologies will give you a much broader realm to operate with.

I believe something along this path is the solution for streaming gaming. I perceive the failure of Stadia et al. as a consequence of trying to bolt streaming onto existing GPU-based, local gaming solutions. Build something from scratch with streaming/latency as a #1 priority, and you can dramatically expand the operational radius of each datacenter (e.g. ~100km per millisecond saved).


I feel like that's a somewhat out-of-touch interpretation, as Stadia failed largely because of Google's terrible reputation and the (completely valid) concerns from gamers about companies intending to turn even single player games into fragmented streaming platforms where the content is entirely dependent on the whims of the company (a fitting example being Google doing its thing and killing Stadia). They had no shortage of GPUs.

NVIDIA's streaming service is doing relatively fine in comparison. They simply share a GPU between several users for anything that isn't demanding enough. They also get around some of the concerns about gaming being turned into another streaming style fragmented mess by not actually selling the games. You simply log into your account on Steam/GOG/whatever and play the games you already own as you might on a local PC.

Additionally, "building something that can reliably output reasonable-quality 3d graphics without relying on specific GPU technologies" doesn't make much sense to me. If it's an accelerator designed to handle relatively modern 3d graphics, due to the programmability of a modern graphics pipeline it's effectively just a GPU. There aren't any underlying technologies that are required to be used as long as they can produce a similar output (mobile GPUs tend to have a different approach to how they implement the graphics pipeline compared to desktop GPUs for instance).


Well, except for the fact that the 1.5TFLOP quoted in the article is because of the AMX part. The actually useful throughput of the big core is probably more like 35GFLOP peak. This compares to the 1–2TFLOP throughput of the GPU. The CPU is easily going to be 50–100x slower than the GPU.

If you're talking full-screen Angry Birds with, say, a 2x average compositing, you're going to be fine on the CPU; but, energy- and jitter- wise you'll still be happier with the GPU, overall.


well 2 things:

1) 1.5 TFLOPS is already less than the GPUs in most current phones. Like you're not exactly talking a big number here. You're talking 14 year old graphics (when desktop GPUs crossed the 1TFLOP mark)

2) And, this is the bigger issue, this AMX unit only does matrix multiplies. I'd be fascinated to see someone create a 3d renderer with only matrix multiplication. I'm sure it's possible but this 1.5 tflops ain't remotely comparable to a GPU's 1.5 tflops. Like if you tried to do "traditional" pixel shader rendering code on this unit you'd instantly be cutting that 1.5 tflops to 1/16th the performance - you're now at a theoretical peak of a mere 93 gflops of GPU-comparable performance.


Well, AMX does asynchronous outer products from byte-addressable buffers with accumulation; it's not like tensor cores or Intel's AMX, where the silicon literally only does a matrix multiplication. You can easily do efficient sliding-window math from the buffers (convolutions) or sparse data loads (since the instructions run asynchronously from the CPU).
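For readers who haven't seen the formulation: a matrix multiply is a sum over k of rank-1 outer products, which is the unit of work an outer-product engine accumulates. A plain-C sketch of the idea (illustrative only, not actual AMX code):

    // C (M x N) += sum over k of outer(column k of A, row k of B)
    // A is stored with columns contiguous (K x M), B with rows contiguous (K x N).
    void matmul_outer(int M, int N, int K,
                      const float *A, const float *B, float *C) {
        for (int k = 0; k < K; k++) {            // one rank-1 update per k
            for (int m = 0; m < M; m++) {
                float a = A[k * M + m];          // column k of A
                for (int n = 0; n < N; n++) {
                    C[m * N + n] += a * B[k * N + n];  // row k of B
                }
            }
        }
    }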


Light is 300km/ms in a vacuum. Is it that much slower through switched fiber?


Signal speed in fiber is about ⅔ of that in vacuum (https://en.wikipedia.org/wiki/Optical_fiber#Refractive_index), but fiber won’t be a straight line between sender and receiver, light doesn’t move in a straight line inside the fiber, and the switches add delays.

https://www.pingdom.com/blog/theoretical-vs-real-world-speed...: “you should probably double the “ideal” response times shown above for a more realistic target to aim at“

So yes, ⅓ of light speed in vacuum seems a decent heuristic.
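A back-of-the-envelope check of those factors (the constants are textbook values; the halving for routing and switching is the pingdom heuristic quoted above):

    #include <stdio.h>

    int main(void) {
        const double c_km_per_ms  = 299792.458 / 1000.0; // ~300 km per ms in vacuum
        const double fiber_factor = 2.0 / 3.0;           // refractive index ~1.5
        const double route_factor = 0.5;                 // non-straight paths, switching

        double one_way    = c_km_per_ms * fiber_factor * route_factor; // ~100 km per ms
        double round_trip = one_way / 2.0;                             // ~50 km per ms

        printf("one-way reach:    ~%.0f km per ms\n", one_way);
        printf("round-trip reach: ~%.0f km per ms\n", round_trip);
        return 0;
    }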


Speed of light in glass is about 2/3 of the speed of light in a vacuum (refractive index of glass is around 1.5).


Round trip latency matters here, which would get you down to 150km without any slowdown through fiber.


You'd need a fairly drastic shift in the memory architecture of CPUs for that. Not something unheard of, such as Intel's new Xeon Max beast with HBM 2e on the CPU module. But it's definitely not an issue of just throwing some big SIMD blocks onto the die & calling it a day. That is, after all, basically what AVX-512 is. And while it has its place, it's also not eating anyone's lunch money.

And also, as weird as it is, 1.5TFlops isn't actually that ridiculous. We had that performance 14 years ago at 150w with desktop GPUs. 14 years to reduce from 150w to what, 5w?, is cool but also honestly pretty par for the course is it not? Especially for a fixed-function block?


"You'd need a fairly drastic shift in the memory architecture of CPU". You mean like selling laptops (at 400GB/sec) and desktops (at 800GB/sec) with much improved memory systems.

I don't want to give up SO-DIMMs for a few mm thinner laptop, but going from the intel/amd standard 70GB/sec to 400GB/sec is a pretty big incentive.


M1's CPUs can't achieve that rate fwiw. It's also generic LPDDR5, not actually anything particularly new or exciting. The 800GB/s isn't to a single compute unit, either, it's 2x 400GB/s controllers.

Regardless, no, that's not the drastic shift necessary. It's not even a shift at all. 4, 6, and 8 channel memory architectures have all existed for a while now in data center SKUs. It hasn't changed the desire for GPU compute in the data center.


A comment here on benchmarking the M1 managed to get 360GB/sec observed out of a 400GB/sec memory system, so I'd consider that pretty good and certainly a better percentage of peak than I see out of the x86-64 systems I benchmarked.

Well, the 4, 6, and 8 channel memory architectures have been dramatically slower than Apple's 800GB/sec, with dramatically fewer memory transactions in flight (1 per channel) than Apple's 64 channels. The push for Apple-like memory systems with HBM from Intel, AMD, Fujitsu and others does show that there's a need for more compute bandwidth, and Fujitsu in particular has shown that a healthy CPU with a GPU-like memory system can get great performance on real-world codes without the hassle of rewriting for CUDA.

With all that said, yes there's a healthy demand for GPUs and codes that run best on GPUs. However I find it refreshing that at last one company is providing significantly improved memory systems in laptops and small desktops.


> A comment here on benchmarking the M1 managed to get 360GB/sec observed out of a 400GB/sec memory system, so I'd consider that pretty good and certainly a better percentage of peak than I see out of the x86-64 systems I benchmarked.

"the M1 Max isn’t able to fully saturate the SoC bandwidth from just the CPU side; From a single core perspective [..] it’s able to stress the memory fabric to up to 102GB/s. [..] Adding a third thread there’s a bit of an imbalance across the clusters, DRAM bandwidth goes to 204GB/s, but a fourth thread lands us at 224GB/s and this appears to be the limit on the SoC fabric that the CPUs are able to achieve"

https://www.anandtech.com/show/17024/apple-m1-max-performanc...

So peak CPU bandwidth on the M1 Max is 224GB/s. Which is really good for a CPU to be sure, but we're still talking bandwidth numbers far below what you'd need to eliminate the GPU.

> Well the 4, 6, and 8 channel memory architectures have been dramatically slower than the Apple 800GB/sec with dramatically less memory transactions in flight (1 per channel)

Huh? I guarantee you Xeons and Epycs are not limited to 1 transaction in flight, nor is Graviton2.

The AMD APUs in things like the steam deck, PS4, and PS5 (448GB/s fwiw) are also definitely not limited like that, either.

> than Apple's 800GB/sec with 64 channels

M1 ultra is 32 channels. Well, it's really 2x 16 channels since it's a NUMA configuration, like a dual-socket system.

> The push for Apple like memory systems with HBM from Intel, AMD, Fujitsu

lol? Apple's memory system isn't like HBM. It's more like a traditional GPU's but with GDDR swapped out for LPDDR. M1 Max is 512-bit memory bus, just like many GPUs have done over the years. 512-bit bus and it can only hit 400GB/s shows you just how much less bandwidth LPDDR has vs. GDDR, too. AMD & NVidia are hitting 1TB/s with 384-bit buses these days. Which is also what the HBM2e Xeon's are claiming.


> "the M1 Max isn’t able to fully saturate the SoC bandwidth from just the CPU side

Sure, wouldn't want to starve the GPU when the CPU is busy. Not sure if the AMX has its own connection or shares the CPU complex bandwidth.

> I guarantee you Xeons and Epycs are not limited to 1 transaction in flight, nor is Graviton2

Right, I said 1 per channel, not 1. I've tested this on Xeons and Epycs, but not genoa yet. Generally maximum random throughput tends to be with 16 or so cache misses per socket with 8 channels. From what I can tell about half the latency is cache misses in L1, L2, and L3. Then you queue in the memory controller and wait for whatever memory channel (or two depending on config). By keeping 16 in the queue as soon as one of the 8 channels free you have another request pending for that channel, at least most of the time.

The Genoa looks particularly promising on this front: they upgraded from 8 channels to 12, but the DDR5 DIMMs actually provide two channels per DIMM, so you end up with 24 narrow (32 bit) DDR5 channels. I've already seen results showing that Genoa scales better to high core counts than the Zen 3 Epycs, which makes sense with 24 memory requests in flight instead of 8.

> M1 ultra is 32 channels. Well, it's really 2x 16 channels since it's a NUMA configuration, like a dual-socket system.

Heh, well monolithic chips are on the way out. Even the Zen 2 Epycs used multiple pieces of silicon (chiplets) and the new Intel Xeon/Sapphire Rapids do the same. Don't see a particular difference. Even the desktop Ryzens have multiple pieces of silicon (one IOD and one CPU chiplet at the minimum). In any case the Apple M1 Max memory system matches the Genoa and beats the newest Intel Sapphire Rapids with a single piece of silicon, and the Ultra handily doubles that with two pieces of silicon, unless you wait for the unreleased HBM flavors.

Compared to 100% of Intel/AMD laptops and probably 98% of desktops that apple memory system is pretty amazing. Sure there are threadrippers (4 channel), threadripper pro (8 channel), and some similar Intel options, but all are expensive, low volume, and desktop/workstation only.


> Sure, wouldn't want to starve the GPU when the CPU is busy.

Ok, then surely you won't take issue with me describing my x86 desktop as having 1.2TB/s of memory bandwidth? It just happens to be in a NUMA configuration and after all, wouldn't want to starve the GPU when the CPU is busy...

> Don't see a particular difference.

You don't see a difference between 1 memory controller and 2 memory controllers? Compare Zen1 TR/epyc vs. zen2 TR/epyc.

Hint: m1 ultra is like Zen1, and probably ain't the future

Oh yeah and Intel did this, too, with the 56 core xeons a couple years ago. And, like AMD, also quickly moved away from it.


> Ok, then surely you won't take issue with me describing my x86 desktop as having 1.2TB/s of memory bandwidth?

Unless it's a rather odd config that's counting the GPU memory bandwidth which is decidedly not NUMA since the GPU mem isn't cache coherent with the main memory.

> You don't see a difference between 1 memory controller and 2 memory controllers?

From what I can tell the multiple memory controllers per chip AMDs performed well on most codes, but certain codes, OSs, and benchmarks didn't handle the different latencies within the socket well. As a result AMD moved to one memory controller per socket in the next generation.

The M1 Ultra moved to two chips and 2 memory controllers, but as a result manages to hit 800GB/sec peak, which is a pretty big win compared to any normal socket; even the 96 core AMD Genoa or Intel SPR are sitting in the half to 1/3rd range, despite coming out a year or so later.


Apple Silicon chips share memory between the CPU and GPU, would that play into any calculation of the relative benefits? Presumably the GPU isn't getting the full benefits of a GPU optimised memory set up so the difference would be smaller?


That's been doable on other products for over a decade. AMD's APUs and Intel's CPUs, for example, along with nearly every mobile phone SoC from the last 10 years

It makes some GPGPU workloads viable that otherwise wouldn't be, but it doesn't reduce the bandwidth needs of the GPU for traditional GPU workloads so net-net you're worse off overall. You either use DDR and penalize your GPU performance, or you use GDDR (like consoles do) and sacrifice CPU memory latency. Also power efficiency of GDDR is worse, especially compared to Apple Silicon's choice of using LPDDR

HBM2e gets you everything except it's expensive and limited capacity. But it'll be interesting to see how that plays out with the Xeon Max, which can also still supplement the 64GB HBM2e with some absurd channel count DDR5


Aside from Apple's processors is 1.5TFlops in 5w possible with other archs?


It sounds pretty similar to Qualcomm's hexagon ( https://www.anandtech.com/show/13680/snapdragon-855-going-in... ) and probably whatever Google is doing with their Pixel SoC

But otherwise that's less than what mobile GPUs can do; although those take more power, they're not in entirely different ballparks either.


A typical goal is 60Hz, which is about 17k microseconds per frame. My cursory research says as of 10 years ago a typical write/receive latency for an nvidia card, i7, pcie2.0 was 20 microseconds. That gives you a large budget, despite the fact that SIMD on chip is measured in cycles, not microseconds. Inside the GPU you have a huge amount of space and resources to do highly specialized operations in vast concurrency, i.e., bandwidth for compute is huge and specialized. I don’t see how CPUs or SoCs will solve this without vastly increasing die sizes, heat, and power consumption to be close to that of a GPU with all its cooling requirements and heavy power needs.
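Putting those two numbers side by side (a trivial back-of-the-envelope using the figures quoted above):

    #include <stdio.h>

    int main(void) {
        const double frame_budget_us = 1e6 / 60.0;  // ~16,667 us per 60 Hz frame
        const double transfer_lat_us = 20.0;        // CPU<->GPU latency quoted above

        printf("frame budget:     %.0f us\n", frame_budget_us);
        printf("transfer latency: %.0f us (%.2f%% of the frame)\n",
               transfer_lat_us, 100.0 * transfer_lat_us / frame_budget_us);
        return 0;
    }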

That said I think the “good enough” metric is already there and unless you’re doing hardware ray tracing or extreme details at high resolutions you won’t need or care about a GPU any more.

Latency though isn’t the issue. The times involved for human perception are long and not getting shorter.


Things have been "good enough" since 2012. But then VSCode and bigger webpages came along and suddenly a Core2Duo just doesn't cut it anymore. ML models need somewhere to run, locally, and both Apple and Google have dedicated hardware on smartphones for that. Support for bigger and bigger models (read GPU performance) in smaller and smaller packages is just the latest iteration of progress.


Yes I agree. Except I think real time ray tracing really is that much better and shifts the goal posts again.


One interesting data point here is the Fugaku supercomputer is based around ARM's scalable vector stuff (basically Cray style variable length vectors vs short vector SIMD like AVX) and no gpu. Using HBM is a key enabler here.

I'm not sure GPUs will be displaced, looking at the difficulties Larrabee had on the driver side, but I do think we'll see more flexible alternatives becoming popular.


The GPU people are also reaching for simd and fixed matmul hw to increase perf. Tensor Cores (int, fp16, tf32 and even fp64 on A100) and the new DPX instructions. RT cores are a different kind of horse but still specialized for BVH traversal and ray-triangle intersection.


We're reaching a point where CPUs are increasingly getting more specialized, and GPUs are becoming increasingly generalized. Going at improvements from both sides of the sandwich.


The GPU is more like a slow U-Haul truck, whereas the CPU is a super fast race car. Both have merit in their own domain. And GPU training is pretty solidly in the "slow and steady" camp.


Training in production, yes. Developing locally is still a thing for many reasons. More importantly, inference is more «sports car» - you want the app to stay interactive!


I think the author downplays the significance of his work because it only applies to "small neural networks". There are a lot of use-cases that can benefit from this type of optimization. Discovering how to use an undocumented fast accelerator available on millions of devices is very valuable.


Apple has done a wonderful job making CoreML smoothly integrated with iOS, watchOS, iPadOS, and macOS development.


I get the value of the common APIs, but as a developer how do you deal with the wide range of performance in different form factors and product generations? Is there some way to gracefully adapt the same models to a specific device’s capabilities?


There are a bunch of easy ways to scale neural nets. Quantization and distillation being the main approaches (or some combination of the two). Both typically require more training time, but not much more human-effort.

You can normally expect to get way more than half the 'outcome' from a neural net with half the ram/compute/time/power budget. So neural nets scale 'down' pretty well.
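As a concrete (if simplified) example of the quantization half of that: post-training int8 quantization of a weight tensor boils down to picking a scale and rounding. A hedged sketch that ignores per-channel scales, zero points, and calibration:

    #include <math.h>
    #include <stdint.h>
    #include <stdlib.h>

    // Symmetric per-tensor int8 quantization: w_q = round(w / scale),
    // with scale chosen so the largest |w| maps to 127. Roughly 4x smaller
    // than fp32 and amenable to int8 matmul hardware.
    float quantize_int8(const float *w, int8_t *w_q, size_t n) {
        float max_abs = 0.0f;
        for (size_t i = 0; i < n; i++) {
            float a = fabsf(w[i]);
            if (a > max_abs) max_abs = a;
        }
        float scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
        for (size_t i = 0; i < n; i++) {
            w_q[i] = (int8_t)lrintf(w[i] / scale);
        }
        return scale; // needed at inference time to dequantize
    }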


> Apple has done a wonderful job making CoreML

Apple has done a wonderful job of further locking their user into the golden cage they call a platform.


I think you're right and you're wrong, it's a bit more complicated.

ML is one of the few applications that benefit from platform-specific optimizations, so if you need every ounce of performance, you have your choice of which walled garden you want to tether your application to. The "lock-in" comes from the specific capabilities of your special-purpose hardware, and for serious applications, you're already thinking hard about whether to design your entire implementation around Apple, NVidia, Google/TPU, or even Android devices. For big models, platform-specific needs influence every aspect of model design, including data/model sharding, quantization, training loops...

For non-scientific applications, it's usual practice to train your model in platform-agnostic ways using PyTorch or Tensorflow or whatever and then deploy it to devices in platform-specific ways, whether that's XLA, CoreML, Edge TPU, Android NNAPI, TensorflowJS, or hell, custom-written GLSL shaders or whatever.

We're just starting to see cross-platform frameworks that abstract model inference: TFLite, PyTorch Mobile, ONNX. To their credit, CoreML can act as a backend for any of these, so you don't even need to worry about your platform.


Every platform is a golden cage in some respect. Ask any business who is stuck on ancient Win32 and even DOS applications, source code long gone. (Looking at you my local McDonalds, Menards, Tractor Supply)…


The worst thing is, their users don’t even seem to be totally happy with the state of affairs! It’s like they don’t even realize their preferences are wrong. :(


This was intended to be obvious sarcasm, but I somehow accidentally added “don’t” which… really just makes it confusing. Oops, haha.


Not up to date on a lot of "AI"/"ML" things, why isn't this significant for medium/large neural networks as well?


RTX 3090 theoretical matmul is 142 TFlops, i.e. about 100x this.


The RTX 3090 has 35.58 TFlops FP32 performance, or 285.48 FP16 according to https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_proces...

EDIT: I fell for NVIDIA's marketing. The dense FP16 performance is only half of 285.48, which is about 142. Thanks to adgjlsfhk1 for the correction.


That 285 is listed as (2:1 sparse) which means it's only valid for matrices where 2 out of every 4 numbers are zero. For dense matrices it's half that.


Are 2:1 sparse matrices a common thing? It seems weird, like clearly that’s not sparse enough to want to use, like, sparse matrix “CSR” style storage or something, haha. I would just treat it as dense I guess.


They aren't. As far as I can tell, Nvidia does this to be able to double the number of TFlops they put on their website. (this might be a little unfair, the real reason is that in ML it might be possible to train a NN such that your matrices have this structure, but I haven't seen anyone other than Nvidia use it)


What you might do is train using dense matrices, then sparsify those (pick the 2 out of each set of 4 weights that are closest to zero, mask them out), then do a few more training iterations with the mask in place.

It turns out that even without the extra training iterations you often lose surprisingly little in terms of quality of output. In reality you can sparsify a lot more, but 2 out of 4 is so simple and easy to implement in hardware, more complex schemes are much harder to justify.

However, small matmuls (say, <2048 bytes in the K dimension) won't get anywhere near 2x performance.
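A small sketch of the 2-out-of-4 masking step described above (my own illustration of the idea; real frameworks apply it as a mask on the training graph and keep it for the fine-tuning passes):

    #include <math.h>
    #include <stdbool.h>

    // For each group of 4 weights, zero the 2 with the smallest magnitude,
    // producing the 2:4 structured-sparse pattern the hardware expects.
    void sparsify_2of4(float *w, int n /* multiple of 4 */) {
        for (int g = 0; g < n; g += 4) {
            bool dropped[4] = {false, false, false, false};
            for (int d = 0; d < 2; d++) {          // pick two victims per group
                int victim = -1;
                float best = INFINITY;
                for (int i = 0; i < 4; i++) {
                    float a = fabsf(w[g + i]);
                    if (!dropped[i] && a < best) { best = a; victim = i; }
                }
                dropped[victim] = true;
                w[g + victim] = 0.0f;
            }
        }
    }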


I’m trying to think of cases where it might accidentally come up, and all I can think of is something like “oops I used complex but my values are actually real.”


There has been some work in that direction but it hasn't really caught on as fast as NVIDIA may have expected it to.


Yeah, still waiting for this feature to be available in PyTorch natively.


The 1.5 here is for a single core, though. So if we assume that the performance core on an M1 is around 7.5 watts (I’m not actually sure, seems like a reasonable upper bound though if a whole M1 mini is around 39 watts), we’d be looking at around 750 watts to match. Which seems like a surprisingly non-crazy amount of power given these are 32 bit flops, unlike the 16 in the RTX 3090, and they come from a CPU.


This code runs on AMX co-processor. From the article:

> An important distinction is that the AMX:CPU ratio is not 1:1; not every core has its own AMX co-processor.

My understanding is there's only 1 of those per regular M1 CPU, maybe 4 on the largest one (Ultra).


I tried gemm-benchmark on my M1 Max, and it took 22W to hit 2.2 Tflops with AMX (accelerate) or 36W to hit 270 GFlops with NEON (OpenBLAS)

So that's actually just about as power-efficient for fp32 as a 3090, which according to wikipedia is 35 Tflops in 350W. Supposedly AMX can do 2x rate for fp16 as opposed to the 3090's 4x rate, so maybe 2x less efficient than a 3090 for fp16.

Interestingly, fp64 hits 370 Gflops at 15W...
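That "about as power-efficient" observation works out to roughly the same flops per watt; a trivial check using the numbers above:

    #include <stdio.h>

    int main(void) {
        // Numbers from this comment (M1 Max AMX) and Wikipedia (RTX 3090).
        printf("AMX fp32:  %.0f GFlops/W\n", 2200.0 / 22.0);   // ~100
        printf("3090 fp32: %.0f GFlops/W\n", 35000.0 / 350.0); // ~100
        printf("AMX fp64:  %.0f GFlops/W\n", 370.0 / 15.0);    // ~25
        return 0;
    }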


Apple did prefer to expose it through their own Accelerate.framework API however...


Has it been verified that they actually use these instructions in Accelerate.framework? I just benchmarked this on my 2019 intel i9 mbp, and got the following speeds for 128x128 matrices, 32 repeats:

  cblas_sgemm: 36 GFLOP/s
  vDSP_mmul: 41 GFLOP/s
That's a pretty big deal if these functions are >30x faster on the M1...!

edit: that seems to be verified in the tlkh.dev blog post above. Interestingly, I ran the same code on my bargain-basement 2020 iphone SE, and got 259GFLOP/s! These apple devices are pretty mindblowing.
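For reference, a minimal sketch of this kind of measurement (not the commenter's actual code; the 128x128 size and 32 repeats mirror the comment, and the GFLOP/s formula is the standard 2*M*N*K ops per multiply):

    // clang -O2 bench.c -framework Accelerate
    #include <Accelerate/Accelerate.h>
    #include <mach/mach_time.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        const int n = 128, reps = 32;
        float *A = calloc((size_t)n * n, sizeof(float));
        float *B = calloc((size_t)n * n, sizeof(float));
        float *C = calloc((size_t)n * n, sizeof(float));

        mach_timebase_info_data_t tb; mach_timebase_info(&tb);
        uint64_t t0 = mach_absolute_time();
        for (int r = 0; r < reps; r++) {
            cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        n, n, n, 1.0f, A, n, B, n, 0.0f, C, n);
        }
        uint64_t t1 = mach_absolute_time();

        double secs = (double)(t1 - t0) * tb.numer / tb.denom / 1e9;
        double flop = 2.0 * n * n * n * reps;   // multiply-adds count as 2 ops
        printf("%.1f GFLOP/s\n", flop / secs / 1e9);

        free(A); free(B); free(C);
        return 0;
    }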


> Has it been verified that they actually use these instructions in Accelerate.framework?

Yes. Aside from benchmarks, you can easily verify this by profiling an application with Instruments and then inspecting the disassembly.

However, it should be said that AMX does not scale linearly with the number of cores, but with the number of core clusters. So, on the M1 if you use Accelerate in two threads (rather than one), performance will barely improve, because the first thread can keep the AMX unit busy enough.

However, e.g. the M1 Pro and M1 Max have two performance core clusters with AMX units in them, so matrix multiplication performance roughly doubles compared to the M1. Similarly, the M1 Ultra has four performance core clusters, so matrix multiplication performance is roughly twice that of the M1 Pro/Max and four times that of the M1.

Benchmarks:

https://github.com/danieldk/gemm-benchmark#1-to-16-threads


Hey not related but you mentioned using kvm to run arm64 macOS on linux aarch64. I would like to give this a shot, but can't find a project for it. Would you mind sharing the deets?


Of course they do, Apple likes to remain as much in control as possible. If it suddenly becomes more efficient/faster to run ML/AI stuff on Asahi Linux on Mac hardware than with macOS, I'm sure they'd be embarrassed enough to take some sort of action. And I'm pretty sure that action would be towards the side of "closing things down" rather than "opening stuff up", as is tradition.


Wrong answer.

AMX is an unstable ISA that changes between product generations. That's why it's not publicly documented.

Arm SME is the standardisation of the concept, but it is not in the market yet.

https://community.arm.com/arm-community-blogs/b/architecture...


For comparison...

A single Google TPUv4 'pod' (entire row of datacenter racks) gives 1,126,400 TFlops.

That's why your pet ML projects will always be behind those done at big companies.


I have always been under the impression that there will eventually be a way to distribute ML projects across many personal computers (like the Folding@home or SETI@home) that could give even Google a run for their money! A few hundred million personal computers is a lot of processing!


One issue with this is that many ML tasks are more memory bandwidth bound than compute bound. A forward pass through a neural network is a bunch of simple operations on gigabytes of data, whereas with protein folding there's a much higher ratio of compute to data.


SETI@home had 90,000 users at its peak. Even assuming all these users have the most powerful Macbook, and are happy to run it 24/7, the whole fleet still doesn't equal just one of Google's pods.

And Google's pods have microsecond-latency terabit interconnects, while a fleet of macbooks would have hundreds of milliseconds of latency and low bandwidth...

And Google has many pods...

I'm afraid even a massive team of home users will never beat companies with dedicated hardware.


>SETI@home had 90,000 users at its peak. Even assuming all these users have the most powerful Macbook, and are happy to run it 24/7, the whole fleet still doesn't equal just one of Googles pods.

Well ... SETI@home wasn't trying to create ML boobs for anime waifus! ... I suspect the army of young guys that want to do that would dwarf any other distributed service including bitcoin!


Some folks may be interested in the Armv9 Scalable Matrix Extensions which appear to do something very very similar. https://community.arm.com/arm-community-blogs/b/architecture...


Anyone with a comparison with Intel's deep learning boost or VNNI, which is available on avx-512 processors such as https://ark.intel.com/content/www/us/en/ark/products/213805/...


I don't see throughput information on Intel's AMX, but their VNNI has information: https://www.intel.com/content/www/us/en/docs/intrinsics-guid...

0.5 cycles per instruction max, 3.7GHz clock, that's 7.4e9 instructions per second. If I'm reading it right, that instruction does 16 4-wide dot products, which is ~128 ops. So ~950Gops peak in int8 precision on a server class Xeon assuming no clock throttling.

(edit: flops -> ops)
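A minimal sketch of the instruction being described, assuming AVX-512 VNNI: each call performs 16 independent 4-wide u8 x s8 dot products accumulated into 32-bit lanes, which is where the ~128 ops per instruction figure comes from.

    #include <immintrin.h>

    // acc[i] += dot(a[4i..4i+3] as u8, b[4i..4i+3] as s8), for i = 0..15.
    // Compile with -mavx512vnni and run on a CPU that supports it.
    __m512i vnni_dot_step(__m512i acc, __m512i a_u8, __m512i b_s8) {
        return _mm512_dpbusd_epi32(acc, a_u8, b_s8);
    }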


Int8 operations are not "flops" =)


I remember driving to college nearly 20 years ago and one of the headlines on the radio (NPR morning show) was that the DOE had unveiled the world's fastest supercomputer at the White House that day. It was going to do a whopping 6 Teraflops, and they explained what that meant. And I remember thinking about all the possibilities with that kind of compute.

I understand that this 1.5 TFlops may not be an exact comparison (or maybe it's the same), but if it's even within an order of magnitude, it is beyond mind-blowing, and we've just crossed over into Exaflops at the supercomputer level.


So, this is about Apple's undocumented AMX instructions, issued from CPU, executed on a special accelerator execution unit.

Is there one such unit per CPU core?


M1 has one AMX unit per cluster AFAIK. This however can and does change between different chips.


Yes, there is one per core cluster. The title is a bit misleading, because it suggests that going to two or three cores would scale linearly, when in fact it won't be much faster. See here for sgemm benchmarks for everything from the M1 to M1 Ultra and 1 to 16 threads:

https://github.com/danieldk/gemm-benchmark#1-to-16-threads


> So, this is about Apple's undocumented AMX instructions, issued from CPU, executed on a special accelerator execution unit.

CPU instruction -> AMX instruction -> AMX result -> CPU?

How are these kinds of things usually kept in sync/in a manageable state? Like does the CPU block until the AMX returns?


No.

So the title is misleading, even if it is true that you get this performance with a program that uses a single CPU core.


I love all the posts by Bram. Please keep writing them!


Posts like these are always awesome to look at how much we can push consumer hardware.

It's hard not to really appreciate some of the devices we have today. For instance, an RTX 4090 is capable of 660 TFlops of FP8 (MSRP 1600). Would not be surprised if we soon have laptops that can do petaflops of computation!


Nitpick... This paragraph is somewhat confusing. I think it is worded incorrectly:

> Let's simplify the problem and implicitly transpose the matrix multiplication. Both A and B (our inputs) will have K (our reduction dimension) as the leading dimension. This doesn't really matter much in practice, but it simplifies our code a lot.

The code is

  C[n * 16 + m] += A[k * 16 + m] * B[k * 16 + n];
Which means that actually *m* is the leading dimension of A with stride 16, and for B it is *n* with stride 16.
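Spelling the indexing out as full loops may make the point easier to see: consecutive m is contiguous in A and C, and consecutive n is contiguous in B (a neutral restatement of the quoted line, not a ruling on the terminology):

    // 16x16x16 tile, exactly the indexing from the quoted line:
    //   A is laid out as A[k][m]  (m contiguous)
    //   B is laid out as B[k][n]  (n contiguous)
    //   C is laid out as C[n][m]  (m contiguous)
    void tile_matmul(const float *A, const float *B, float *C) {
        for (int k = 0; k < 16; k++)
            for (int n = 0; n < 16; n++)
                for (int m = 0; m < 16; m++)
                    C[n * 16 + m] += A[k * 16 + m] * B[k * 16 + n];
    }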


Why would Apple hide such optimization from public APIs?


It is available via public APIs, but the hardware instructions themselves are not documented. This lets the instructions change in future CPUs, vs having the ISA be baked in stone forever.

Example: AMX predated the standard ARM matrix multiply instructions. Perhaps Apple will add the ARM versions someday and now can remove AMX without breaking compatibility. Or maybe there will be a non-additive AMXv2.


It's not hidden from the public APIs. It's exposed through Accelerate.framework, etc. The instructions themselves aren't stable so there's no guarantee they'd exist in other generations - as others have said these instructions predate ARM actually adding the standardized scalable vector extension, presumably Apple could switch to the standard ISA in future, but if the AMX instructions were exposed directly to developers they'd have to continue to support them in future and thus continue to burn silicon space for them.


It's only really used for their internal applications and the OS-level stuff, so I assume they want to prevent performance issues from it having to deal with 3rd party stuff.


How is fp64? Nvidia crippled the 4090 to just 1.3T fp64 flops so if a mac mini with m1 could match that it'd be a solid win


You know you can see your upvoted stories right?


Someone should run this on an M2 model.


Does Apple use the AMX in their own code? Is anything like the AMX present in their mobile CPUs?


It's a fast console, there's no doubt of that. Kind of like the Playstation 3 when it came out. Fast, not much software support without lots of special considerations, non-upgradable hardware, limited peripheral support. All in all, a fast CPU embedded in a marginal console-like "computer". People out there who were tricked into buying the M1 8GB ram version can confirm.


I love how Apple hater rhetoric hasn't changed in 30 years.


Tricked how? I’ve got an M2 8GB and loving it


Even with swap, 8gb is pretty paltry on a memory-hungry system like MacOS, let alone a system that shares GPU and system memory. 16gb is the minimum for me even though I really only edit/compile code, and even then it can be pretty easy to max out your system memory after a couple docker containers...

It might not be a 'trick' per se, but anyone who intends to use a Mac for work should consider upgraded memory (IMO).


I will agree with your last point. If I had bought this machine for doing serious development I would've gone for 16GB. Saying that, I've been pleasantly surprised with its power. I've been playing with Metal, throwing together a native version of ShaderToy, and it hasn't felt underpowered once. Even when running the iPad emulator.

I did feel a little duped when I learned that some M1/M2 machines can only support one external monitor. Now I have to replace my two monitors with a widescreen.


IMO, the 'problem' is that MacOS will use 4-5gb OOB, and using an Electron app with a browser open will easily push that into swap space. For most daily drivers, even light users, they'll be happy to have upgraded memory.


Right now with just safari and a few background things I'm hovering at 6gb in use, so you're not wrong about how much memory is being used. Regardless I don't think it's a problem for light users. A light user imo would be just browsing and email. 8GB will give you plenty of headroom in that case.

I'm going to keep an eye on ram usage for the next few days. I'm curious what it will look like on a more full workload because if things have been swapping out, I haven't noticed.


I was "compelled" to get MacOS as I develop an iOS app using flutter. I needed a lightweight power-efficient laptop so got a second-hand 8GB M1 and can fairly comfortably develop (VSCode + dotnet + nodejs + many tabs in firefox) inside a Debian VM running i3 with zram & UTM... Perhaps it depends on how memory-hungry one's development actually is - a massive Java Eclipse project might not work nearly as well.


I help my dad when he has computer problems. His M1 8GB isn't even enough ram to edit 20-50MB photos from an SLR camera. At first I thought it was because he had a bunch of stuff running but it turns out all the rosetta translation and such stuffs the cache and combined with adobe creative suite background programs and all the mac os stuff leaves only ~2 GB of ram left for the photo editing program. And that's after the browser is closed.

You can say that it was my dad's mistake to buy an M1 8GB but I say it was pretty lame of Apple to sell a computer that expensive that can't do basic tasks. And don't even get me started on peripherals like external monitors.


I would blame Adobe. The Adobe suite is massive, and it sounds like it hasn't been ported to M1. So I'm not surprised that it doesn't perform well on the base model. In my experience, it has more than enough ram for web browsing, streaming, and developing, all at the same time.

Peripherals have been fine, except for monitors as you mentioned. I do think it's ridiculous that I can only have one external monitor when it's clearly able to support more than that. I can add a third monitor through my iPad or DisplayLink, but both of those methods break DRM video.



