- Hybrid offloading with llama.cpp, but with slow inference (see the rough sketch after this list).
- Squeezing it in with extreme quantization (exllamav2 at ~2.6bpw, or llama.cpp IQ3_XS), but with reduced quality and a relatively short context.
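
If you want to see what the hybrid offloading option looks like in practice, here is a minimal sketch using llama-cpp-python (the Python bindings for llama.cpp, built with CUDA). The model path and layer count are placeholders, not recommendations; you'd tune `n_gpu_layers` until VRAM is nearly full without overflowing.

```python
# Minimal hybrid-offload sketch with llama-cpp-python (pip install llama-cpp-python,
# compiled with CUDA support). Path and layer count below are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="models/70b-model-IQ3_XS.gguf",  # hypothetical GGUF file
    n_gpu_layers=40,  # layers kept on the dGPU; the rest run from CPU RAM (the slow part)
    n_ctx=4096,       # a shorter context keeps the KV cache from spilling out of VRAM
)

out = llm("### Instruction: Write a Python quicksort.\n### Response:", max_tokens=256)
print(out["choices"][0]["text"])
```

Every layer that stays in system RAM drags token generation down, which is why this route works but feels slow.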
30B-34B is more of a sweet spot for 24GB of VRAM.
If you do opt for the extreme quantization, make sure your laptop dGPU's VRAM is completely free of other processes, and that the weights then fill it entirely. And I'd recommend doing your own code-focused exl2/imatrix quantization, so not a megabyte of your VRAM is wasted.
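
A rough sketch of the llama.cpp side of that, driven from Python: build an importance matrix from your own code-heavy calibration text, then quantize with it. The binary names assume a recent llama.cpp build (older builds ship them as `imatrix` and `quantize`), and all paths and the calibration file are placeholders.

```python
# Custom code-focused imatrix quant with llama.cpp tools, driven via subprocess.
# Binaries, paths, and calibration data below are assumptions; swap in your own.
import subprocess

MODEL_F16 = "models/34b-code-model-f16.gguf"   # hypothetical full-precision GGUF
CALIB_TXT = "calibration/code_samples.txt"     # your own code-heavy calibration text
IMATRIX   = "imatrix-code.dat"
OUT_GGUF  = "models/34b-code-model-IQ3_XS.gguf"

# 1. Compute the importance matrix from the code-focused calibration data.
subprocess.run(
    ["./llama-imatrix", "-m", MODEL_F16, "-f", CALIB_TXT, "-o", IMATRIX],
    check=True,
)

# 2. Quantize to IQ3_XS using that importance matrix.
subprocess.run(
    ["./llama-quantize", "--imatrix", IMATRIX, MODEL_F16, OUT_GGUF, "IQ3_XS"],
    check=True,
)
```

The exl2 equivalent is exllamav2's convert.py with a target around 2.6 bits and a code-heavy calibration dataset, but check that repo for the exact flags, since they've changed over time.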
What's the best model that can run fast on a 4090 laptop?