I really hope that there will be some real alternative in terms of hardware soon. This kind of comment makes me laugh and makes me sad:
> any GPU could be used to run the 4bit quantization as long as you have CUDA>=11.2 installed
Any GPU (as long as it's NVIDIA).
Also, it sounds like those models won't get a .cpp variant if they don't do the processing with quantized weights, right? (The list claims they're unpacked and still running on floats.)
Julia has some pretty swell cross-GPU packages. I was really hoping that it would catch on in the ML community, but I think we're past that point: the inferior solution has more momentum.
We recently showed DiffEqGPU.jl generating customized ODE solver kernels for NVIDIA CUDA, AMD GPUs, Intel oneAPI, and Apple Metal. For CUDA it matches the state of the art (MPGOS), which is roughly 10x-100x faster than something like JAX/PyTorch (the performance difference comes from the inefficiencies of vmap versus actually writing and calling a kernel). It's all in https://arxiv.org/abs/2304.06835. So this stuff exists and people are using it. The caveat is that this is in the context of engineering applications, so someone would need to do something similar for LLMs to fully relate it back to the article, but it shows the tools are largely ready for someone to step up in the ML space.
Being an ODE solver library, you wouldn't build nanoGPT with it directly; you'd need to go back to KernelAbstractions.jl and write nanoGPT against that same abstraction layer. Again, this is a demonstration of the cross-GPU tooling for ODEs; for LLMs someone would need to take these tools and implement an LLM.
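For what it's worth, here's a minimal sketch of what that abstraction layer looks like: one kernel definition that runs unchanged on the CPU or on CUDA/ROCm/oneAPI/Metal arrays, with the backend picked from wherever the data lives. The saxpy example is mine, not from the paper, and assumes the KernelAbstractions.jl v0.9-style API (`@kernel`, `@index`, `get_backend`):

```julia
using KernelAbstractions

# One kernel definition; the backend is selected from the array type at call time.
@kernel function saxpy_kernel!(y, a, @Const(x))
    i = @index(Global)
    @inbounds y[i] = a * x[i] + y[i]
end

function saxpy!(y, a, x)
    backend = get_backend(y)   # CPU(), CUDABackend(), ROCBackend(), etc.
    saxpy_kernel!(backend)(y, a, x; ndrange = length(y))
    KernelAbstractions.synchronize(backend)
    return y
end

# Works on plain Arrays; swap in CuArray/ROCArray/oneArray/MtlArray for GPU runs.
y = rand(Float32, 1024); x = rand(Float32, 1024)
saxpy!(y, 2.0f0, x)
```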
The way I see it, people need to either develop a CUDA translation layer or build a more attractive alternative. DirectX is being displaced in exactly this fashion, and Microsoft is helpless to stop it. The problem is that both AMD and Apple gave up on OpenCL, their collaborative alternative, years ago. Someone has to step up to the plate for things to change.
This entry very casually goes from describing FP8, E4M3, and E5M2 to suddenly talking about FP4. It describes a demonstration with the mantissa bits "1101", but that by itself is already four bits. Unless it's counting the implicit leading 1, but even then that leaves just one bit for the sign and no exponent bits at all? Then later it gives an example with zero mantissa bits. ?
Are there any decent but easily digestible summaries of FP4? The best I can find is a giant paper. I don't understand why the linked entry gave great summaries of the larger FP types but then hand-waves about FP4.
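My understanding (which may not be the layout that entry had in mind) is that the FP4 used in bitsandbytes-style 4-bit quantization is E2M1: 1 sign bit, 2 exponent bits, 1 mantissa bit, with a per-block scale stored separately in higher precision. A rough decode sketch, assuming an exponent bias of 1 and a subnormal encoding when the exponent field is zero (the zero/subnormal handling is my guess at the convention, not something from the article):

```julia
# Hypothetical decoder for a 4-bit E2M1 float: 1 sign, 2 exponent, 1 mantissa bit.
# With bias = 1 this yields the magnitude set {0, 0.5, 1, 1.5, 2, 3, 4, 6}.
function decode_fp4_e2m1(code::UInt8)
    s = (code >> 3) & 0x1   # sign bit
    e = (code >> 1) & 0x3   # 2 exponent bits
    m = code & 0x1          # 1 mantissa bit
    mag = e == 0 ? 0.5f0 * m :                        # subnormal: no implicit 1
          2.0f0^(Int(e) - 1) * (1 + 0.5f0 * m)        # normal: implicit leading 1
    return s == 1 ? -mag : mag
end

# All 16 code points; the actual weight is this value times a per-block scale.
[decode_fp4_e2m1(UInt8(c)) for c in 0:15]
```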
Yeah, that section isn't very clear. Also, the article doesn't seem to explain what NF4/NormalFloat is at all. I would guess it has something to do with where the value falls on a Gaussian, but that's just a guess.
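That guess matches my reading of the QLoRA paper: NF4 is a 4-bit code whose 16 values are (roughly) equally-probable quantiles of a standard normal, rescaled to [-1, 1], so each code point is used about equally often when weights are approximately Gaussian. A back-of-the-envelope construction of such a codebook (the paper's actual table handles zero and the tail quantiles more carefully, so these numbers won't match it exactly):

```julia
using Distributions

# Approximate NF4-style codebook: quantiles of N(0,1) at roughly evenly spaced
# probabilities, normalized so the extreme codes land on -1 and +1.
# An illustration of the idea, not the exact table from the QLoRA paper.
normal = Normal(0, 1)
probs = range(0.01, 0.99; length = 16)     # clip the tails so the quantiles stay finite
codebook = quantile.(normal, probs)
codebook ./= maximum(abs.(codebook))        # rescale into [-1, 1]

# Dequantization is then just a table lookup times a per-block absmax scale.
dequant(idx, scale) = scale * codebook[idx + 1]   # idx in 0:15
```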
I wonder if some sort of logarithmic-with-pedestal encoding could be used; that way you could have better precision around where you expect most of your values to be, yet still have a large range. Bonus: multiplies are just adds.
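To make the "multiplies are just adds" point concrete, here's a toy sketch of a sign plus log-magnitude representation (the pedestal/offset and any bit widths are left out, so this is just the arithmetic idea, not a proposed format):

```julia
# Toy log-domain encoding: store the sign and log2 of the magnitude.
# A real format would quantize logmag and add an offset ("pedestal").
struct LogVal
    sign::Int8
    logmag::Float32
end

encode(x) = LogVal(Int8(sign(x)), Float32(log2(abs(x))))   # x must be nonzero here
decode(v::LogVal) = v.sign * exp2(v.logmag)

# A multiply in the linear domain becomes an add in the log domain.
logmul(a::LogVal, b::LogVal) = LogVal(a.sign * b.sign, a.logmag + b.logmag)

decode(logmul(encode(3.0), encode(4.0)))   # ≈ 12.0
```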