Yes, my mistake: I read your answer to mean that you thought the model could fit into memory with the help of efficiency gains.
I would be sceptical about efficiency gains. I'm not that familiar with the subject, but as far as I know, LLMs serving a single user (i.e. with batch size 1) are practically always limited by memory bandwidth: the whole model (if it is monolithic) has to be read from memory once for each new token (a token is about 4 characters). At 400 GB/s of memory bandwidth and 4-bit quantisation, which puts a model of this size at roughly 200 GB of weights, you are limited to about 2 tokens per second, no matter how efficient the software is. That is not unusable, but it is still quite slow compared to online services.
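As a sanity check, here is that arithmetic as a tiny Python sketch. The 400B-parameter figure is my assumption for "a model of this size"; this is back-of-the-envelope only:

    # Upper bound on batch-size-1 decode speed for a bandwidth-bound,
    # monolithic LLM: every weight is read from memory once per token.
    def max_tokens_per_second(params, bits_per_weight, bandwidth_gb_per_s):
        weight_gb = params * bits_per_weight / 8 / 1e9  # total weight size in GB
        return bandwidth_gb_per_s / weight_gb           # tokens per second

    # Assumption: 400B parameters at 4-bit quantisation -> ~200 GB of weights.
    print(max_tokens_per_second(400e9, 4, 400))  # -> 2.0 tokens/s

The same formula also shows why quantisation helps linearly: halving the bits per weight doubles the bandwidth-limited token rate.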
Got it, thanks, that makes sense. I was aware that memory was the primary bottleneck, but wasn't clear on the specifics of how model sizes mapped to memory requirements or the exact implications of quantization in practice. It sounds like we're pretty far from a model of this size running on any halfway common consumer hardware in a useful way, even if some high-end hardware might technically be able to initialize it in one form or another.
GPU memory costs about $2.5/GB on the spot market, so that is $500 for 200 GB. I would speculate that it might be possible to build such an LLM card for $1-2k, but I suspect that the market for running larger LLMs locally is just too small to be worth it, especially now that the datacentre business is so lucrative.
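For concreteness, the cost arithmetic (the $2.5/GB spot price and 200 GB capacity are just the guesses above):

    # Illustrative only: memory cost of a hypothetical 200 GB "LLM card".
    spot_price_per_gb = 2.5               # USD/GB, assumed spot-market price
    memory_gb = 200                       # assumed capacity (400B params at 4-bit)
    print(spot_price_per_gb * memory_gb)  # -> 500.0 USD for the memory alone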
Maybe we'll get really good LLMs on local hardware when the hype has died down a bit, memory is cheaper and the models are more efficient.