
You can blame ARM for the popularity of CUDA. At least x86 had a few passable vector ISA extensions like SSE and AVX; the ARM spec only offers the piss-slow NEON in its stead. Since you're not going to unify vectors and mobile hardware anytime soon, the majority of people are overjoyed to pay for CUDA hardware, where GPGPU compute is taken seriously.

There were also attempts like OpenCL, which the industry rejected early on because it thought it would never need a CUDA alternative. Nvidia's success is mostly built on the ignorance of their competition: if Nvidia had been allowed to buy ARM, they could have guaranteed the two specs never overlap.




> Since you're not going to unify vectors and mobile hardware anytime soon

Apple's M4, as described in its iPad debut, devotes about the same chip area to the CPU as to GPU/vector functions. It's as much a vector machine as not.

https://www.apple.com/newsroom/2024/05/apple-introduces-m4-c...


CUDA clobbered x86, not ARM. Maybe if x86's vector ops had been better and more usable, ARM would have been motivated to do better.


The whole concept sounds like groping in the dark for a Take to me: GPUs (CUDA) are orthogonal to consumer processors (ARM / x86). Maybe we could assume a platonic ideal merged chip, a CPU that acts like a GPU, but there are more differences between those two things than an instruction set for vector ops.


> GPUs (CUDA) are orthogonal to consumer processors (ARM / X86).

We're talking about vector operations. CUDA is not a GPU but a library of hardware-accelerated functions, not necessarily different from OpenCL or even NEON for ARM. You can reimplement everything CUDA does on a CPU, and on a modern CPU you can vectorize it too. x86 handles this well, because it still has dedicated logic that keeps pace with the SIMD throughput an integrated GPU might offer. ARM leaves those operations out entirely (which is smart for efficiency), and therefore relies on someone either porting CUDA code to an ARM GPU shader (fat chance) or offloading to a discrete GPU. It's why ARM is excellent for sustained simple ops but collapses when you benchmark it brute-forcing AI or translating AVX to NEON. SIMD is too much for a base-spec ARM core.
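
To make the width gap concrete, here's a rough sketch (my own function names, nothing lifted from CUDA or any real library) of the same y[i] += a * x[i] loop written with AVX2+FMA intrinsics and with NEON intrinsics. AVX2 chews through 8 floats per instruction, NEON only 4, which is roughly the mismatch you hit when translating AVX code:

    #include <stddef.h>

    #if defined(__AVX2__) && defined(__FMA__)
    #include <immintrin.h>
    /* y[i] += a * x[i]: 8 floats per iteration with 256-bit AVX registers */
    void saxpy(float a, const float *x, float *y, size_t n) {
        __m256 va = _mm256_set1_ps(a);
        size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 vx = _mm256_loadu_ps(x + i);
            __m256 vy = _mm256_loadu_ps(y + i);
            _mm256_storeu_ps(y + i, _mm256_fmadd_ps(va, vx, vy));
        }
        for (; i < n; i++) y[i] += a * x[i];   /* scalar tail */
    }
    #elif defined(__ARM_NEON)
    #include <arm_neon.h>
    /* same loop, but NEON registers are 128-bit, so 4 floats per iteration */
    void saxpy(float a, const float *x, float *y, size_t n) {
        float32x4_t va = vdupq_n_f32(a);
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            float32x4_t vx = vld1q_f32(x + i);
            float32x4_t vy = vld1q_f32(y + i);
            vst1q_f32(y + i, vmlaq_f32(vy, va, vx));
        }
        for (; i < n; i++) y[i] += a * x[i];   /* scalar tail */
    }
    #endif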

> Maybe we could assume a platonic ideal merged chip, a CPU that acts like a GPU, but there's more differences between those two things than an instruction set for vector ops.

Xeon Phi or Itanium flavored?


I've read this 10x and get more out of it each time.

I certainly don't grok it yet, so I might be wrong when I say it's still crystal clear there's a little motte-and-bailey going on with "blame ARM for CUDA" vs. "ARM is shitty at SIMD vs. x86".

That aside, I'm building something that relies on llama.cpp for inference on every platform.

In this scenario, Android is de facto "ARM" to me.

The Vulkan backend either doesn't support Android, or it does and the 1-2 people who got it running see absurdly worse performance (something something shaders, as far as I understand it).

iOS is de facto "not ARM" to me because it runs on the GPU.

I think llama.cpp isn't a great scenario for me to learn this at the level you understand it, since it's tied to running one very particular kind of workload.

That aside, it was remarkable to me that my 13th-gen Intel i5 Framework laptop gets 2 tokens/sec on both the iGPU and the CPU. And IIUC, your comment explains that, in that "x86...[has] dedicated logic that keeps pace with SIMD...on [an integrated GPU]".

That aside, my Pixel Fold (read: 2022 mid-range Android CPU, which should certainly be slower than a 2023 mid-to-upper-range Intel) kicks it around the block: 7 tokens/sec on CPU, 14 tokens/sec with the NEON layout.

Now, that aside, SVE was shown to double that again, indicating there's significant headroom beyond NEON (https://github.com/ggerganov/llama.cpp/pull/9290). (I have ~0 idea what this is other than 'moar SIMD for ARM'; for all I know it's Amazon Graviton specific.)
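
FWIW, my loose understanding (not from that PR) is that SVE is "moar SIMD for ARM" in the sense that it's vector-length-agnostic: the same binary uses whatever vector width the core implements (128 up to 2048 bits), with predicates covering the loop tail, whereas NEON is fixed at 128 bits. A hedged sketch of the same kind of loop, function name mine:

    #include <stddef.h>
    #include <arm_sve.h>   /* needs an SVE-capable core and e.g. -march=armv8-a+sve */

    /* y[i] += a * x[i]; svcntw() is however many 32-bit lanes the hardware provides */
    void saxpy_sve(float a, const float *x, float *y, size_t n) {
        for (size_t i = 0; i < n; i += svcntw()) {
            svbool_t pg = svwhilelt_b32_u64(i, n);     /* predicate masks the tail */
            svfloat32_t vx = svld1_f32(pg, x + i);
            svfloat32_t vy = svld1_f32(pg, y + i);
            svst1_f32(pg, y + i, svmla_f32_x(pg, vy, vx, svdup_n_f32(a)));
        }
    }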


Yeah, that’s true. CUDA is in large part for big HPC servers, where ARM historically wasn’t a player and still isn’t dominant. x86 got clobbered for HPC by CUDA.


ARM has SVE these days. This comment makes no sense, anyway: people don’t do numerical computing on phones.


I bet the majority of AI inference FLOPS will be executed on phones before long.

Our phone camera pipelines are doing lots of numerical compute already.



