I really hope that there will be some real alternative in terms of hardware soon. This kind of comment makes me laugh and makes me sad:
> any GPU could be used to run the 4bit quantization as long as you have CUDA>=11.2 installed
Any GPU (as long as it's NVIDIA).
Also, it sounds like those models won't get a .cpp variant if the inference isn't done with quantized weights, right? (The list says they're unpacked and still run on floats.)
Julia has some pretty swell cross-GPU packages. I was really hoping that it would catch on in the ML community, but I think we're past that point: the inferior solution has more momentum.
We recently showed DiffEqGPU.jl generating customized ODE solver kernels for NVIDIA CUDA, AMD GPUs, Intel oneAPI, and Apple Metal. On CUDA it matches the state of the art (MPGOS), which is roughly 10x-100x faster than something like JAX or PyTorch (the performance gap comes from the inefficiency of using vmap versus actually writing and calling a kernel). It's all in https://arxiv.org/abs/2304.06835. So this stuff exists and people are using it. The caveat is that this is in the context of engineering applications, so someone would need to do something similar for LLMs to fully relate it back to the article, but it shows the tools are largely ready for someone to step up in the ML space.
You wouldn't build nanoGPT out of an ODE solver, though; you'd need to go back to KernelAbstractions.jl and write a nanoGPT on that same abstraction layer. Again, this is a demonstration of the cross-GPU tooling for ODEs; for LLMs you'd need to take these tools and implement an LLM with them.
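To make the abstraction layer concrete, here's a minimal sketch of what a KernelAbstractions.jl kernel looks like — a trivial SAXPY, not anything LLM-specific. The same `@kernel` definition compiles for the CPU backend here, and for the CUDA/AMDGPU/oneAPI/Metal backends when the corresponding array package is loaded:

```julia
using KernelAbstractions

# One kernel definition, portable across backends.
@kernel function saxpy_kernel!(y, a, @Const(x))
    i = @index(Global)
    @inbounds y[i] = a * x[i] + y[i]
end

function saxpy!(y, a, x)
    # get_backend picks CPU(), CUDABackend(), etc. from the array type.
    backend = get_backend(y)
    kernel! = saxpy_kernel!(backend)
    kernel!(y, a, x; ndrange = length(y))
    KernelAbstractions.synchronize(backend)
    return y
end

# Plain Arrays run on the CPU; swap in CuArray / ROCArray / oneArray /
# MtlArray to target the other vendors without touching the kernel.
y = ones(Float32, 1024)
x = fill(2.0f0, 1024)
saxpy!(y, 3.0f0, x)   # y is now all 7.0f0
```

Writing a transformer this way means writing its hot loops (attention, layernorm, etc.) as kernels like this instead of relying on vmap-style batching — that's the gap someone would need to fill.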
The way I see it, people need to develop a CUDA translation layer or build a more attractive alternative. DirectX is being killed in exactly this fashion, and Microsoft is helpless to stop it. The problem is that both AMD and Apple gave up on OpenCL, the collaborative cross-vendor effort, years ago. Someone has to step up to the plate for things to change.