
Well, llama.cpp running on CPUs at decent speed, and improving quickly, hints towards CPUs. There the size of the model matters less, since RAM is the limit rather than VRAM. At least for inference this is now a viable alternative.
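To make the "RAM is the limit" point concrete, a rough back-of-envelope sketch; the parameter counts, quantization widths and overhead are assumptions for illustration, not measurements:

    # Rough RAM estimate for CPU inference on a quantized model:
    # the weights dominate, plus some overhead for KV cache and activations.
    # All figures are illustrative assumptions.
    def est_ram_gb(params_billion, bits_per_weight, overhead_gb=2.0):
        weights_gb = params_billion * 1e9 * bits_per_weight / 8 / 1e9
        return weights_gb + overhead_gb

    for params, bits in [(7, 4), (13, 4), (70, 4)]:
        print(f"{params}B @ {bits}-bit: roughly {est_ram_gb(params, bits):.0f} GB of RAM")
    # 7B and 13B models at 4-bit fit comfortably in ordinary desktop RAM,
    # and even a 4-bit 70B fits in a 64 GB machine, whereas consumer GPUs
    # top out around 24 GB of VRAM.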



Outside of Macs, llama.cpp running fully on the CPU is more than 10x slower than a GPU.
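That gap is mostly a memory-bandwidth story: at batch size 1, generating a token streams essentially all of the weights once, so tokens/s is roughly bandwidth divided by model size. A back-of-envelope sketch, with ballpark bandwidth figures assumed for illustration:

    # Token generation at batch size 1 is roughly memory-bandwidth bound:
    #   tokens/s ~ memory bandwidth / model size in bytes.
    # Bandwidth numbers below are ballpark assumptions, not benchmarks.
    MODEL_GB = 3.5  # 7B model at 4-bit quantization

    bandwidth_gb_s = {
        "desktop CPU, dual-channel DDR5": 80,
        "Apple Silicon unified memory": 400,
        "high-end discrete GPU (GDDR6X/HBM)": 1000,
    }

    for name, bw in bandwidth_gb_s.items():
        print(f"{name}: ~{bw / MODEL_GB:.0f} tokens/s upper bound")
    # The ~80 vs ~1000 GB/s gap is where the "more than 10x slower" figure
    # comes from, and why Macs with fast unified memory sit in between.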


But having 32 real cores in a CPU is so much cheaper than having multiple GPUs. RAM is also much cheaper than VRAM.
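A very rough price-per-gigabyte comparison of the memory that has to hold the model; all prices are loose ballpark assumptions and swing with the market:

    # Rough $/GB of model-holding memory. Prices are loose assumptions
    # for illustration only.
    options = {
        "128 GB DDR5 kit":           (128, 400),   # (GB, approx USD)
        "24 GB consumer GPU (VRAM)": (24, 1600),   # whole card, not just memory
        "2x 24 GB consumer GPUs":    (48, 3200),
    }

    for name, (gb, usd) in options.items():
        print(f"{name}: ~${usd / gb:.0f} per GB")
    # Plain RAM comes out roughly an order of magnitude cheaper per GB;
    # the trade-off is capacity and price versus raw bandwidth.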


For local use, yes, but at the data center level the parallelism of GPUs is still often worth the cost.



