- Hybrid offloading with llama.cpp, but with slow inference (see the rough sketch after this list).
- Squeezing it in with extreme quantization (exllamav2 at ~2.6bpw, or llama.cpp IQ3_XS), but with reduced quality and a relatively short context.
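
If you want to see what the hybrid offloading option looks like in practice, here is a minimal sketch using llama-cpp-python (the Python bindings for llama.cpp, built with CUDA). The model path and layer count are placeholders, not recommendations; you'd tune `n_gpu_layers` until VRAM is nearly full without overflowing.

```python
# Minimal hybrid-offload sketch with llama-cpp-python (pip install llama-cpp-python,
# compiled with CUDA support). Path and layer count below are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="models/70b-model-IQ3_XS.gguf",  # hypothetical GGUF file
    n_gpu_layers=40,  # layers kept on the dGPU; the rest run from CPU RAM (the slow part)
    n_ctx=4096,       # a shorter context keeps the KV cache from spilling out of VRAM
)

out = llm("### Instruction: Write a Python quicksort.\n### Response:", max_tokens=256)
print(out["choices"][0]["text"])
```

Every layer that stays in system RAM drags token generation down, which is why this route works but feels slow.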
30B-34B is more of a sweet spot for 24GB of VRAM.
If you do opt for the extreme quantization, make sure your laptop dGPU's VRAM is completely free of other processes, and that the weights then fill it entirely. And I'd recommend doing your own code-focused exl2/imatrix quantization, so not a megabyte of your VRAM is wasted.
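
A rough sketch of the llama.cpp side of that, driven from Python: build an importance matrix from your own code-heavy calibration text, then quantize with it. The binary names assume a recent llama.cpp build (older builds ship them as `imatrix` and `quantize`), and all paths and the calibration file are placeholders.

```python
# Custom code-focused imatrix quant with llama.cpp tools, driven via subprocess.
# Binaries, paths, and calibration data below are assumptions; swap in your own.
import subprocess

MODEL_F16 = "models/34b-code-model-f16.gguf"   # hypothetical full-precision GGUF
CALIB_TXT = "calibration/code_samples.txt"     # your own code-heavy calibration text
IMATRIX   = "imatrix-code.dat"
OUT_GGUF  = "models/34b-code-model-IQ3_XS.gguf"

# 1. Compute the importance matrix from the code-focused calibration data.
subprocess.run(
    ["./llama-imatrix", "-m", MODEL_F16, "-f", CALIB_TXT, "-o", IMATRIX],
    check=True,
)

# 2. Quantize to IQ3_XS using that importance matrix.
subprocess.run(
    ["./llama-quantize", "--imatrix", IMATRIX, MODEL_F16, OUT_GGUF, "IQ3_XS"],
    check=True,
)
```

The exl2 equivalent is exllamav2's convert.py with a target around 2.6 bits and a code-heavy calibration dataset, but check that repo for the exact flags, since they've changed over time.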
What's the best model that can run fast on a 4090 laptop?