A data point for you: 7B models at 5-bit quantization run quite comfortably under llama.cpp on the AMD Radeon RX 6700 XT, which has 12GB of VRAM and showed up in a lot of gaming PC builds around 2021-22.
I can't call this an outright recommendation - the tooling around Nvidia GPUs is far more mature - but from what I can see, AMD GPUs offer more VRAM at lower prices.
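For anyone curious, here's a rough sketch of what that setup looked like for me. The model path is illustrative, and llama.cpp's build flags and binary names have changed across versions (the `main` binary was later renamed `llama-cli`), so check the current README before copying:

```
# Build with ROCm/HIP support (Makefile-era invocation; newer versions
# use CMake instead):
make LLAMA_HIPBLAS=1

# The RX 6700 XT (gfx1031) isn't on ROCm's official support list; a
# common workaround is to present it to ROCm as gfx1030:
export HSA_OVERRIDE_GFX_VERSION=10.3.0

# Run a 7B model at Q5_K_M quantization (roughly a 5GB file, so it fits
# entirely in 12GB of VRAM). -ngl offloads all layers to the GPU,
# -c sets the context size. Model filename is hypothetical.
./main -m models/llama-2-7b.Q5_K_M.gguf -ngl 99 -c 2048 \
       -p "Explain quantization in one paragraph."
```

With all layers offloaded, a 7B Q5 model leaves a few GB of VRAM to spare for the KV cache at modest context sizes, which matches the "quite comfortably" above.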