One of these days I will find time to write more about model inference optimizations that don't go through distillation / quantization. Case in point: switching llama.cpp from its custom kernels to cuBLAS's GEMM implementation drops throughput from 70 tok/s to 49 tok/s (RTX 6000 Ada, Mistral-7B, FP16).
Wait, cuBLAS is slower?? Shouldn't it be faster, since it's tuned for the specific hardware?
I know that llama.cpp has custom kernels for quantized matrices, which are fast because going through cuBLAS would require an extra memory roundtrip (read -> dequantize -> write, then read -> GEMM -> write, versus a fused read -> dequant -> GEMM -> write). But if you're using FP16 weights the dequantization step shouldn't be necessary, so how is the custom kernel still faster?
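For concreteness, here is a rough PyTorch sketch of that roundtrip argument, assuming a made-up int8-weight-plus-per-channel-scale layout rather than llama.cpp's actual block-wise quant formats; the fused path only exists as a custom kernel, so it is described in a comment.

```python
import torch

def unfused_path(x_fp16, w_int8, w_scale):
    # cuBLAS-style path: dequantize the whole weight matrix and write it back
    # to memory (read -> dequant -> write), then read it again for the GEMM
    # (read -> GEMM -> write).
    w_fp16 = w_int8.to(torch.float16) * w_scale   # w_scale: per-output-channel
    return x_fp16 @ w_fp16.t()

# Fused path (what a custom kernel does; not expressible in plain PyTorch):
# load a tile of w_int8, dequantize it in registers, and multiply-accumulate
# immediately, so the full fp16 weight never hits global memory
# (read -> dequant -> GEMM -> write).
```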
I was experimenting with getting a few models to output Rhai scripts and found that the unquantized and 6-bit models could do what I asked with a few hints, but the 4- and 5-bit ones got confused.
The 4- and 5-bit models could handle equivalent requests in Python, though.
My conclusion was that I should fine-tune a 4- or 5-bit model on Rhai scripting question/output pairs, and that if I made enough good ones, performance on my task would improve (a rough sketch of what that might look like is below).
Or maybe if I just switch to Exllama2 or something, the 6-bit model will run fast enough.
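For what it's worth, a minimal sketch of the fine-tuning idea, assuming a QLoRA-style setup in PyTorch; the base model name, LoRA hyperparameters, and training step are placeholders, and you'd still have to re-quantize to your runtime format afterwards.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-v0.1"  # placeholder base model

# Load the base model in 4-bit and attach LoRA adapters on top of it.
model = AutoModelForCausalLM.from_pretrained(
    base,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    ),
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(base)

model = get_peft_model(
    model,
    LoraConfig(r=16, lora_alpha=32,
               target_modules=["q_proj", "v_proj"],
               task_type="CAUSAL_LM"),
)

# ...then train with transformers.Trainer (or trl's SFTTrainer) on the
# Rhai question/output pairs and export or merge the adapters afterwards.
```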
I think distillation in the original sense isn't being done anymore, but finetuning on outputs from larger models like GPT-4 is a form of distillation (the top-1 token instead of all the logits, and curated synthetic data instead of the original dataset).
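A toy sketch of that contrast (tensor names are made up): classic distillation matches the teacher's full distribution over the vocabulary, while finetuning on a larger model's outputs only sees the sampled token at each position.

```python
import torch
import torch.nn.functional as F

def full_distillation_loss(student_logits, teacher_logits, T=2.0):
    # Classic distillation: KL divergence between softened distributions
    # over the entire vocabulary.
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

def finetune_on_outputs_loss(student_logits, teacher_sampled_tokens):
    # Finetuning on generated text: ordinary cross-entropy against the
    # teacher's sampled tokens, so only the chosen token carries signal.
    return F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        teacher_sampled_tokens.view(-1),
    )
```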
On quantization, though, it's still weird how just the weights are quantized in methods like GPTQ / int8, while there are other methods that quantize the activations as well. There's also the matter of the KV cache still being in the original 16-bit precision regardless, which also seems unsolved here. Do you have any thoughts or insights into this?
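To make the question concrete, a simple round-to-nearest sketch of the two regimes; GPTQ itself does error-compensated quantization, so this is only the shape of the difference, not the actual algorithm.

```python
import torch

def quantize_weights_only(w_fp16):
    # Offline, weight-only (GPTQ / int8-style): int8 weights with a
    # per-output-channel scale; activations stay fp16 at inference time.
    scale = w_fp16.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 127.0
    w_q = torch.clamp((w_fp16 / scale).round(), -128, 127).to(torch.int8)
    return w_q, scale

def quantize_activations(x_fp16):
    # Online ("dynamic") activation quantization: the scale has to be computed
    # per token at runtime, and activation outliers make this harder to do
    # without accuracy loss.
    scale = x_fp16.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8) / 127.0
    x_q = torch.clamp((x_fp16 / scale).round(), -128, 127).to(torch.int8)
    return x_q, scale
```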
It’s not clear to me what’s happening on the distillation front. I agree no one is doing it externally, but I suspect the foundation model companies are doing it internally; the performance is just too good.
There’s a bunch of recent work that quantizes the activations as well, like FP8-LM. I think this will come. Quantization support in PyTorch is pretty experimental right now, so I think we’ll see a lot of improvements as it gets better support.
The KV cache piece is tied to the activations, imo: once those start getting quantized effectively, the KV cache will follow.
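A quick sketch of what "following" could look like: the K/V tensors are just cached activations, so the same per-token abs-max scheme you'd use for activations applies directly. Shapes and the int8 choice here are illustrative, not any particular paper's method.

```python
import torch

def quantize_kv(kv_fp16):
    # kv_fp16: (batch, heads, seq_len, head_dim), i.e. cached K or V activations.
    scale = kv_fp16.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8) / 127.0
    kv_q = torch.clamp((kv_fp16 / scale).round(), -128, 127).to(torch.int8)
    return kv_q, scale            # store int8 values + per-token scales

def dequantize_kv(kv_q, scale):
    # Dequantize on read, right before the attention matmuls.
    return kv_q.to(torch.float16) * scale
```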
1) I actually think that’s too high; I bet it’s more like 30%. My logic is that they have to have _some_ margin, but LLMs are too expensive to serve to have typical software margins. Total speculation though.
2) It generally tracks pretty well unless the model is gaming the metric (training on the test set, overfitting to the specific source of data, etc.). The relative rankings will typically match in both.
3) Alas, not with the mild winter North America’s having. They only stop below -5C or so. I am lucky though: the woodpecker stopped attacking my house and started attacking my neighbor’s. Even worse, it used to be a downy woodpecker, and it’s now been replaced by a pileated one (think: Woody).