
You can blame ARM for the popularity of CUDA. At least x86 had a few passable vector ISA extensions like SSE and AVX; the ARM spec only offers the piss-slow NEON in its stead. Since you're not going to unify vectors and mobile hardware anytime soon, the majority of people are overjoyed to pay for CUDA hardware, where GPGPU compute is taken seriously.

There were also attempts like OpenCL, which the industry rejected early on because it thought it would never need a CUDA alternative. Nvidia's success is mostly built on the ignorance of their competition: if Nvidia had been allowed to buy ARM, they could have guaranteed the two specs never overlap.




> Since you're not going to unify vectors and mobile hardware anytime soon

Apple's M4, as described in its iPad debut, devotes about the same chip area to the CPU as to GPU/vector functions. It's as much a vector machine as not.

https://www.apple.com/newsroom/2024/05/apple-introduces-m4-c...


CUDA clobbered x86, not ARM. Maybe if x86's vector ops had been better and more usable, ARM would have been motivated to do better.


The whole concept sounds like groping in the dark for a Take to me: GPUs (CUDA) are orthogonal to consumer processors (ARM / x86). Maybe we could assume a platonic ideal merged chip, a CPU that acts like a GPU, but there are more differences between those two things than an instruction set for vector ops.


> GPUs (CUDA) are orthogonal to consumer processors (ARM / X86).

We're talking about vector operations. CUDA is not a GPU but a library of hardware-accelerated functions, not necessarily different from OpenCL or even NEON for ARM. You can reimplement everything CUDA does on a CPU, and on a modern CPU you can vectorize it too. x86 handles this well, because it still has dedicated logic that keeps pace with the SIMD throughput an integrated GPU might offer. ARM leaves those operations out entirely (which is smart for efficiency), and therefore relies on someone either porting CUDA code to an ARM GPU shader (fat chance) or offloading to a discrete GPU. It's why ARM is excellent for sustained simple ops but collapses when you benchmark it brute-forcing AI or translating AVX to NEON. SIMD is too much for a base-spec ARM core.
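
To make the width gap concrete, here's a rough sketch (my own function names, nothing lifted from CUDA or any real library) of the same y[i] += a * x[i] loop written with AVX2+FMA intrinsics and with NEON intrinsics. AVX2 chews through 8 floats per instruction, NEON only 4, which is roughly the mismatch you hit when translating AVX code:

    #include <stddef.h>

    #if defined(__AVX2__) && defined(__FMA__)
    #include <immintrin.h>
    /* y[i] += a * x[i]: 8 floats per iteration with 256-bit AVX registers */
    void saxpy(float a, const float *x, float *y, size_t n) {
        __m256 va = _mm256_set1_ps(a);
        size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 vx = _mm256_loadu_ps(x + i);
            __m256 vy = _mm256_loadu_ps(y + i);
            _mm256_storeu_ps(y + i, _mm256_fmadd_ps(va, vx, vy));
        }
        for (; i < n; i++) y[i] += a * x[i];   /* scalar tail */
    }
    #elif defined(__ARM_NEON)
    #include <arm_neon.h>
    /* same loop, but NEON registers are 128-bit, so 4 floats per iteration */
    void saxpy(float a, const float *x, float *y, size_t n) {
        float32x4_t va = vdupq_n_f32(a);
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            float32x4_t vx = vld1q_f32(x + i);
            float32x4_t vy = vld1q_f32(y + i);
            vst1q_f32(y + i, vmlaq_f32(vy, va, vx));
        }
        for (; i < n; i++) y[i] += a * x[i];   /* scalar tail */
    }
    #endif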

> Maybe we could assume a platonic ideal merged chip, a CPU that acts like a GPU, but there's more differences between those two things than an instruction set for vector ops.

Xeon Phi or Itanium flavored?


I've read this 10x and get more out of it each time.

I certainly don't grok it yet, so I might be wrong when I say it's still crystal clear there's a little motte-and-bailey going on with "blame ARM for CUDA" vs. "ARM is shitty at SIMD vs. x86".

That aside, I'm building something that relies on llama.cpp for inference on every platform.

In this scenario, Android is de facto "ARM" to me.

The Vulkan backend either doesn't support Android, or it does and the 1-2 people who got it running see absurdly worse performance (something something shaders, as far as I understand it).

iOS is de facto "not ARM" to me because it runs on the GPU.

I think llama.cpp isn't a great scenario for me to learn this at the level you understand it, since it's tied to running one very particular kind of workload.

That aside, it was remarkable to me that my 13th-gen Intel i5 Framework laptop gets 2 tokens/sec on both the iGPU and the CPU. And IIUC, your comment explains that, in that "x86...[has] dedicated logic that keeps pace with SIMD...on [an integrated GPU]".

That aside, my Pixel Fold (read: 2022 mid-range Android CPU, which should certainly be slower than a 2023 mid-to-upper-range Intel) kicks it around the block: 7 tokens/sec on CPU, 14 tokens/sec with the NEON layout.

Now, that aside, SVE was shown to double that again, indicating there's significant headroom beyond NEON (https://github.com/ggerganov/llama.cpp/pull/9290). (I have ~0 idea what this is other than 'moar SIMD for ARM'; for all I know it's Amazon Graviton specific.)
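
FWIW, my loose understanding (not from that PR) is that SVE is "moar SIMD for ARM" in the sense that it's vector-length-agnostic: the same binary uses whatever vector width the core implements (128 up to 2048 bits), with predicates covering the loop tail, whereas NEON is fixed at 128 bits. A hedged sketch of the same kind of loop, function name mine:

    #include <stddef.h>
    #include <arm_sve.h>   /* needs an SVE-capable core and e.g. -march=armv8-a+sve */

    /* y[i] += a * x[i]; svcntw() is however many 32-bit lanes the hardware provides */
    void saxpy_sve(float a, const float *x, float *y, size_t n) {
        for (size_t i = 0; i < n; i += svcntw()) {
            svbool_t pg = svwhilelt_b32_u64(i, n);     /* predicate masks the tail */
            svfloat32_t vx = svld1_f32(pg, x + i);
            svfloat32_t vy = svld1_f32(pg, y + i);
            svst1_f32(pg, y + i, svmla_f32_x(pg, vy, vx, svdup_n_f32(a)));
        }
    }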


Yeah, that’s true. CUDA is in large part for big HPC servers, where ARM historically wasn’t a player and still isn’t dominant. x86 got clobbered for HPC by CUDA.


ARM has SVE these days. This comment makes no sense, anyway: people don’t do numerical computing on phones.


I bet the majority of AI inference FLOPS will be executed on phones before long.

Our phone camera pipelines are doing lots of numerical compute already.



