One of these days I will find time to write more about model inference optimizations that don't go through distillation / quantization. Case in point: switching llama.cpp from its custom kernels to cuBLAS's GEMM implementation drops throughput from 70 tok/s to 49 tok/s (RTX 6000 Ada, Mistral-7B, FP16).
Wait, cuBLAS is slower?? Shouldn't it be faster, since it's tuned for the specific hardware?
I know that llama.cpp has custom kernels for quantized matrices, which are fast because going through cuBLAS would require an extra memory roundtrip (read -> dequantize -> write, then read -> GEMM -> write, versus a fused read -> dequant -> GEMM -> write). But if you're using FP16 weights the dequantization step shouldn't be necessary, so how is the custom kernel still faster?
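For concreteness, here is a rough PyTorch sketch of that roundtrip argument, assuming a made-up int8-weight-plus-per-channel-scale layout rather than llama.cpp's actual block-wise quant formats; the fused path only exists as a custom kernel, so it is described in a comment.

```python
import torch

def unfused_path(x_fp16, w_int8, w_scale):
    # cuBLAS-style path: dequantize the whole weight matrix and write it back
    # to memory (read -> dequant -> write), then read it again for the GEMM
    # (read -> GEMM -> write).
    w_fp16 = w_int8.to(torch.float16) * w_scale   # w_scale: per-output-channel
    return x_fp16 @ w_fp16.t()

# Fused path (what a custom kernel does; not expressible in plain PyTorch):
# load a tile of w_int8, dequantize it in registers, and multiply-accumulate
# immediately, so the full fp16 weight never hits global memory
# (read -> dequant -> GEMM -> write).
```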
I was experimenting with getting a few models to output Rhai scripts and found that the unquantized and 6-bit models could do what I asked with a few hints, but the 4- and 5-bit ones got confused.
The 4- and 5-bit models could handle equivalent requests in Python, though.
My conclusion was that I should fine-tune a 4- or 5-bit model on Rhai scripting question/output pairs, and that if I made enough good ones, performance on my task would improve (a rough sketch of what that might look like is below).
Or maybe if I just switch to Exllama2 or something, the 6-bit model will run fast enough.
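For what it's worth, a minimal sketch of the fine-tuning idea, assuming a QLoRA-style setup in PyTorch; the base model name, LoRA hyperparameters, and training step are placeholders, and you'd still have to re-quantize to your runtime format afterwards.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-v0.1"  # placeholder base model

# Load the base model in 4-bit and attach LoRA adapters on top of it.
model = AutoModelForCausalLM.from_pretrained(
    base,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    ),
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(base)

model = get_peft_model(
    model,
    LoraConfig(r=16, lora_alpha=32,
               target_modules=["q_proj", "v_proj"],
               task_type="CAUSAL_LM"),
)

# ...then train with transformers.Trainer (or trl's SFTTrainer) on the
# Rhai question/output pairs and export or merge the adapters afterwards.
```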
I think distillation in the original sense isn't being done anymore, but finetuning on outputs from larger models like GPT-4 is a form of distillation (the top-1 token instead of all the logits, and curated synthetic data instead of the original dataset).
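A toy sketch of that contrast (tensor names are made up): classic distillation matches the teacher's full distribution over the vocabulary, while finetuning on a larger model's outputs only sees the sampled token at each position.

```python
import torch
import torch.nn.functional as F

def full_distillation_loss(student_logits, teacher_logits, T=2.0):
    # Classic distillation: KL divergence between softened distributions
    # over the entire vocabulary.
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

def finetune_on_outputs_loss(student_logits, teacher_sampled_tokens):
    # Finetuning on generated text: ordinary cross-entropy against the
    # teacher's sampled tokens, so only the chosen token carries signal.
    return F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        teacher_sampled_tokens.view(-1),
    )
```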
On quantization, though, it's still weird how just the weights are quantized in methods like GPTQ / int8, while there are other methods that quantize the activations as well. There's also the matter of the KV cache still being in the original 16-bit precision regardless, which also seems unsolved here. Do you have any thoughts or insights into this?
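To make the question concrete, a simple round-to-nearest sketch of the two regimes; GPTQ itself does error-compensated quantization, so this is only the shape of the difference, not the actual algorithm.

```python
import torch

def quantize_weights_only(w_fp16):
    # Offline, weight-only (GPTQ / int8-style): int8 weights with a
    # per-output-channel scale; activations stay fp16 at inference time.
    scale = w_fp16.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 127.0
    w_q = torch.clamp((w_fp16 / scale).round(), -128, 127).to(torch.int8)
    return w_q, scale

def quantize_activations(x_fp16):
    # Online ("dynamic") activation quantization: the scale has to be computed
    # per token at runtime, and activation outliers make this harder to do
    # without accuracy loss.
    scale = x_fp16.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8) / 127.0
    x_q = torch.clamp((x_fp16 / scale).round(), -128, 127).to(torch.int8)
    return x_q, scale
```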
It’s not clear to me what’s happening on the distillation front. I agree no one is doing it externally, but I suspect the foundation model companies are doing it internally; the performance is just too good.
There’s a bunch of recent work that quantizes the activations as well, like FP8-LM. I think this will come. Quantization support in PyTorch is pretty experimental right now, so I think we’ll see a lot of improvements as it gets better support.
The KV cache piece is tied to the activations, imo: once those start getting quantized effectively, the KV cache will follow.
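A quick sketch of what "following" could look like: the K/V tensors are just cached activations, so the same per-token abs-max scheme you'd use for activations applies directly. Shapes and the int8 choice here are illustrative, not any particular paper's method.

```python
import torch

def quantize_kv(kv_fp16):
    # kv_fp16: (batch, heads, seq_len, head_dim), i.e. cached K or V activations.
    scale = kv_fp16.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8) / 127.0
    kv_q = torch.clamp((kv_fp16 / scale).round(), -128, 127).to(torch.int8)
    return kv_q, scale            # store int8 values + per-token scales

def dequantize_kv(kv_q, scale):
    # Dequantize on read, right before the attention matmuls.
    return kv_q.to(torch.float16) * scale
```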
1) I actually think that’s too high; I bet it’s more like 30%. My logic is that they have to have _some_ margin, but LLMs are too expensive to serve to have typical software margins. Total speculation though.
2) It generally tracks pretty well unless the model is gaming the metric (training on the test set, overfitting to the specific source of data, etc.). The relative rankings will typically match in both.
3) Alas, not with the mild winter North America’s having. They only stop below -5C or so. I am lucky though: the woodpecker stopped attacking my house and started attacking my neighbor’s. Even worse, it used to be a downy woodpecker, and it’s now been replaced by a pileated one (think: Woody).